We seem to have been mostly stable for the last few weeks, even with a few new
testers joining. A huge thanks to David Cantrell for providing a new API server
so we could start load balancing, which has alleviated most of the 503 errors.
I've been working on the Fastly configuration to fix the rest of them: if the
health checks get the 503 before one of you does, then we can redirect traffic
to the server that's still up!
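
As a rough illustration of that failover idea (this is not Fastly's
configuration language or API, and the backend names and status codes are made
up), routing around a failed health check might look something like this:

```python
# Illustrative only: not Fastly's configuration language or API.
from typing import Callable, Optional, Sequence


def pick_backend(
    backends: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first backend whose health check passes, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Hypothetical backend names and the status each one last returned to a health check.
status = {"api-1": 503, "api-2": 200}
chosen = pick_backend(["api-1", "api-2"], lambda name: status[name] == 200)
print(chosen)
```

The point is only that the health check sees the 503 first, so a user's request
is sent to a node that is still answering.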

To keep track of the server's current status, I've created some Grafana 
dashboards at <http://status.cpantesters.org/grafana>. These Grafana
dashboards also power
the alerts that I've set up for the things I can solve. Anyone who is in the 
CPAN Testers organization on GitHub has read access to these dashboards if they
click the "Log in with Github" button.

In the future, I would love to display some statistics and status messages on
the main page of status.cpantesters.org <http://status.cpantesters.org/> (much
like how status.github.com <http://status.github.com/> works), but that project
needs developer time. I think it would be an interesting application, though,
and it could be useful to other organizations that use Grafana.

Also, if anyone is interested in improving the internal CPAN Testers statistics
(<http://stats.cpantesters.org/>) with Grafana and InfluxDB, let me know. A lot
of the processing currently done by the main server could be offloaded to the
monitoring server, which would reduce load on the main server.

With that, I think it's time to destroy the AWS instances and databases. Over
the next month I will verify that all the AWS data is stored in our MySQL
database and then shut the AWS resources down. This will not disrupt any
current processes and should be completely painless. I will let y'all know when
this is completed.

Thanks again to David Cantrell for donating some hardware. If we had more
hardware, we might be able to start measuring API uptime with more than one or
two nines! ;) There are still some minor issues that additional hardware could
solve: processing reports faster and more dependably, and increasing the number
of load-balanced API nodes to further reduce the number of 503s received by
users.
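
For anyone unfamiliar with counting uptime in nines, here is a small sketch of
the arithmetic. The function name is made up for illustration, and it assumes a
365-day year:

```python
def allowed_downtime_hours(nines: int, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # 1 nine -> 0.1 (90% up), 2 nines -> 0.01 (99% up)
    return period_hours * unavailability


for n in (1, 2, 3):
    print(f"{n} nine(s): {allowed_downtime_hours(n):.1f} hours/year")
```

One nine allows roughly 876 hours of downtime a year, two nines roughly 87.6,
and each additional nine cuts that by a factor of ten.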

Doug Bell
d...@preaction.me


