We seem to have been mostly stable for the last few weeks, even with a few new testers joining. A huge thanks to David Cantrell for providing a new API server to start load balancing, which has alleviated most of the 503 errors. I've been working on the Fastly configuration to fix the rest of them (if the health checks get the 503 before one of you does, then we can redirect traffic to the server that's still up!).
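
As a rough illustration of the health-check idea above (this is not the actual Fastly configuration; the backend names, the `pick_backend` helper, and the simulated status codes are all hypothetical), the failover logic amounts to probing each API server and sending traffic only to one that does not answer with a 5xx:

```python
from typing import Callable, Optional, Sequence


def is_healthy(status_code: int) -> bool:
    """Treat any 5xx response from a health probe as unhealthy."""
    return not (500 <= status_code <= 599)


def pick_backend(
    backends: Sequence[str],
    probe: Callable[[str], int],
) -> Optional[str]:
    """Return the first backend whose probe is not a 5xx, or None if all fail.

    `probe` is a hypothetical callable that returns the HTTP status code of a
    health-check request to the given backend.
    """
    for backend in backends:
        try:
            status = probe(backend)
        except OSError:
            # A connection failure counts the same as an unhealthy response.
            continue
        if is_healthy(status):
            return backend
    return None


# Simulated probe results (assumed values, for illustration only):
# the first server is returning 503, the second is up.
statuses = {"api-1.example": 503, "api-2.example": 200}
chosen = pick_backend(list(statuses), statuses.__getitem__)
print(chosen)
```

With these simulated results, traffic goes to the second server, so a tester never sees the 503 from the first one.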
To keep track of the server's current status, I've created some Grafana dashboards at http://status.cpantesters.org/grafana. These dashboards also power the alerts that I've set up for the things I can solve. Anyone who is in the CPAN Testers organization on GitHub has read access to these dashboards if they click the "Log in with Github" button.

I would love, in the future, to display some statistics and status messages on the main page of http://status.cpantesters.org/ (much like how status.github.com works), but that's a project in need of developer time. I think it'd be an interesting application, though, and one that could be useful for other orgs who use Grafana.

Also, anyone who's interested in improving the internal CPAN Testers statistics (http://stats.cpantesters.org) with Grafana and InfluxDB, let me know. We could be offloading a lot of processing that is currently being done by the main server to the monitoring server, which would reduce load on the main server.

With that, I think it's time to destroy the AWS instances and databases. Over the next month I will be verifying that all the AWS data is stored in our MySQL database and then shutting them down. This will not disrupt any current processes and should be completely painless. I will let y'all know when this is completed.

Thanks again to David Cantrell for donating some hardware. If we had more hardware, we might be able to start measuring API uptime with more than 1 or 2 9's! ;) There are still some minor issues that additional hardware could solve: processing reports faster and more dependably, and increasing the number of load-balanced API nodes to further reduce the number of 503s received by users.

Doug Bell
d...@preaction.me