Re: Metabase Changeover Plan

2017-08-17 Thread Doug Bell
Things have been fairly stable for the last 48 hours, so here's a report on the 
current status:

Work done:

* Created a new API to write incoming test reports to a MySQL database
    * This removes Amazon SimpleDB and saves us $250/mo
    * In the future, we will be able to reduce our total disk usage by
      removing duplicate data
* Translated the Metabase reports to the new test report format
    * The new test report format has more fields for future expansion,
      including places for testers to report on all the dependencies of the
      distribution they tested
    * The API is doing this translation transparently
    * Another script is migrating all the existing data to the new format
* Started processing incoming test reports in parallel using the Minion job
  runner
    * Incoming reports generate a job on the queue
    * Worker processes handle individual reports
    * This work can be spread to multiple machines if we can get access to
      hardware
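The enqueue-then-work pattern above can be sketched in a few lines. This is a minimal Python illustration using the standard library's `queue` and `threading`, not Minion itself (a Perl job queue); `process_report` and the report dicts are hypothetical stand-ins for the real processing step.

```python
import queue
import threading


def process_report(report):
    # Hypothetical stand-in for the real per-report processing step.
    return {"id": report["id"], "status": "processed"}


def run_workers(reports, num_workers=4):
    """Put one job per report on a queue and let worker threads drain it."""
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            report = jobs.get()
            if report is None:  # sentinel: no more jobs for this worker
                break
            result = process_report(report)
            with results_lock:
                results.append(result)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()

    for report in reports:  # each incoming report becomes one job
        jobs.put(report)
    for _ in workers:  # one sentinel per worker so every thread exits
        jobs.put(None)

    for t in workers:
        t.join()
    return sorted(results, key=lambda r: r["id"])


if __name__ == "__main__":
    done = run_workers([{"id": i} for i in range(10)])
    print(len(done), "reports processed")
```

Here the queue lives in one process; replacing it with a shared database-backed queue is what would let workers run on several machines, which is the motivation for moving Minion's backend to MySQL.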

Outstanding issues around this migration:

* The test_report table is latin1, and some reports are submitted with UTF-8
  characters.
    * Mitigation: `ascii => 1` in the serializer_options for the JSON column
    * Future Change: Make this table UTF-8 safe
* The Amazon Metabase instances are still up
    * After the next week or two of stable operation I will be shutting these
      down
* The original CPAN Testers generate process is still running
    * Once the Metabase instance is shut down, these processes will be removed
      from cron
* The Minion task runner must be moved to MySQL
    * Presently it is using SQLite, and lock timeouts are an occasional
      annoyance
    * Moving to MySQL will let us have multiple machines running Minion
      workers.
    * MySQL allows for greater concurrency when accessing the database (to
      insert new jobs and update job status)
    * The queue can grow to >10,000 unprocessed reports during the day, and
      I'd like to keep that from happening.
* The Minion task runner needs monitoring
    * Queue size will be a good indicator of how the system is doing
    * This will be a lot easier when Minion is using MySQL
* The test report and processed test report tables need monitoring
    * Right now the only monitor is manual: Andreas e-mails me once every
      couple of weeks to tell me that report processing has stopped
    * Counting the rows in each table and comparing the two should be a good
      indicator
    * InnoDB tables are slow to answer `SELECT COUNT(*) FROM ` though, since
      InnoDB keeps no stored row count and has to scan to count rows...
    * These tables could be altered to add auto-increment ID fields, which
      could be a fast indicator of table size
* Some existing processes are still using the MySQL Metabase cache:
    * These processes are:
        * The original generate process
        * view-report.cgi, which displays the full text of a report
    * Moving these processes to the new test report format will improve
      performance
    * Then we can delete the Metabase cache to free up disk space
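The `ascii => 1` mitigation works because an ASCII-only JSON serializer writes every non-ASCII character as a `\uXXXX` escape, so the stored text is plain ASCII and fits a latin1 column unchanged. Python's `json.dumps` has an equivalent `ensure_ascii` flag; the sketch below uses it as a stand-in for the Perl serializer option, and the function names and report field are made up for illustration.

```python
import json


def encode_for_latin1_column(data):
    """Serialize to JSON using only ASCII, so the text fits a latin1 column."""
    # ensure_ascii=True writes every non-ASCII character as a \uXXXX escape.
    text = json.dumps(data, ensure_ascii=True)
    return text.encode("latin-1")  # cannot fail: the text is pure ASCII


def decode_from_latin1_column(raw):
    """Parse the stored bytes back into the original data."""
    return json.loads(raw.decode("latin-1"))


if __name__ == "__main__":
    report = {"tester": "Slaven Rezić"}  # U+0107 is not a latin1 character

    # Without escaping, the same JSON cannot be encoded as latin1 at all.
    try:
        json.dumps(report, ensure_ascii=False).encode("latin-1")
    except UnicodeEncodeError:
        print("unescaped JSON does not fit in latin1")

    stored = encode_for_latin1_column(report)
    print(stored.decode("ascii"))
    assert decode_from_latin1_column(stored) == report
```

The escapes are only decoded when the JSON is parsed, so anything reading the column as raw text (for example a SQL `LIKE` search) sees `\u0107` rather than the original character; making the table UTF-8 safe removes that wrinkle.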

Thanks to:

* Joel Berger for writing the CPAN::Testers::Backend::ProcessReports module at 
the 2017 Perl QA Summit
* Andreas König and Slaven Rezić for their help troubleshooting backwards 
compatibility issues and report processing issues
* Barbie for finding some missing parts from the new report processor and 
quickly writing new scripts to fix them
* Everyone who helped test the new API code before this migration (Chris 
Williams, Ioan Rogers)

Next Steps:

* The machine is occasionally overloaded by view-report.cgi, which causes
  report submissions to time out for 5-10 minutes at a time, so changing it
  into a daemon that uses the new report format is a high priority.

Doug Bell
d...@preaction.me



> On Aug 12, 2017, at 2:40 PM, Doug Bell wrote:
> 
> [...]

Re: Metabase Changeover Plan

2017-08-14 Thread Nigel Horne

On 8/13/17 7:36 PM, Doug Bell wrote:
> On Aug 13, 2017, at 2:18 PM, Nigel Horne wrote:
>
>> I doubt this is a co-incidence – I’ve been seeing 503 errors on some of my
>> reports:
>>
>> [...]
>>
>> -Nigel
>
> I've upped the cache's first byte timeout to 4 minutes, so I hope that fixes
> this problem. If it doesn't, let me know.

Looking better - thanks.

-Nigel




--
Nigel Horne
Conductor: Rockville Brass Band, Washington Metropolitan GSO
@nigelhorne | fb/nigel.horne | bandsman.co.uk | concert-bands.co.uk | 
www.nigelhorne.com

Unless it's for my eyes only, please use "reply all"


Re: Metabase Changeover Plan

2017-08-13 Thread Doug Bell

> On Aug 13, 2017, at 2:18 PM, Nigel Horne wrote:
> 
> I doubt this is a co-incidence – I’ve been seeing 503 errors on some of my 
> reports:
> 
> [...]
> 
> -Nigel


I've upped the cache's first byte timeout to 4 minutes, so I hope that fixes 
this problem. If it doesn't, let me know.

Doug Bell
d...@preaction.me




signature.asc
Description: Message signed with OpenPGP


Re: Metabase Changeover Plan

2017-08-13 Thread Doug Bell

> On Aug 12, 2017, at 3:09 PM, Slaven Rezic wrote:
> 
> What I currently see: http://metabase.cpantesters.org/tail/log.txt
> quite frequently returns 503 in the last two hours.
> 

This one should be fixed: I've increased Fastly's first byte timeout, and am
also regenerating log.txt in an offline process. I've also started sending
proper Cache-Control headers to keep the latency between the backend and the
cache low. I'll be keeping an eye on this one over the next week or so to make
sure all the parts are working correctly.

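
Serving a log.txt that is rebuilt offline, with an explicit Cache-Control header, lets a caching proxy such as Fastly answer repeat requests itself while the backend only hands over a static file. A minimal sketch with Python's standard `http.server` follows; the file location, the 60-second `max-age`, the atomic swap via `os.replace`, and all function names are assumptions for illustration, not the actual CPAN Testers setup.

```python
import http.server
import os
import tempfile
import threading
import urllib.request

LOG_PATH = os.path.join(tempfile.gettempdir(), "log.txt")  # hypothetical location
MAX_AGE = 60  # hypothetical cache lifetime in seconds


def regenerate_log(lines, path=LOG_PATH):
    """Offline step: rebuild log.txt, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("\n".join(lines) + "\n")
    os.replace(tmp, path)  # readers never see a half-written file


class LogHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/tail/log.txt":
            self.send_error(404)
            return
        with open(LOG_PATH, "rb") as fh:
            body = fh.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Tells a shared cache it may reuse this response for MAX_AGE seconds.
        self.send_header("Cache-Control", f"public, max-age={MAX_AGE}")
        self.end_headers()
        self.wfile.write(body)


def start_server(port=0):
    """Serve in a background thread; port 0 picks any free port."""
    server = http.server.HTTPServer(("127.0.0.1", port), LogHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    regenerate_log(["example line 1", "example line 2"])
    server = start_server()
    url = f"http://127.0.0.1:{server.server_address[1]}/tail/log.txt"
    with urllib.request.urlopen(url) as resp:
        print(resp.headers["Cache-Control"])
        print(resp.read().decode("utf-8"), end="")
    server.shutdown()
    server.server_close()
```

The request handler never computes the log itself; it only reads whatever file the offline job last produced, so a slow regeneration cannot hold up the cache's first byte.
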
> As for report sending: every now and then I get an error "fact submission 
> failed: Internal Exception at 
> /usr/perl5.24.0p/lib/site_perl/5.24.0/Metabase/Client/Simple.pm line 129." 
> Retrying usually works.
> 

This is weird, and is going to require me to look into the Metabase 
client/server a bit more to see what's up. I might not be emulating something 
correctly... If retrying works, I can get to this over the week.
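
Since retrying usually works, a tester-side stopgap is to retry a failed submission a few times with a growing delay. The sketch below is generic Python, not part of Metabase::Client::Simple (a Perl module); `submit` stands in for whatever call sends the report, `make_flaky_submit` is a hypothetical test double, and the attempt count and delays are arbitrary.

```python
import time


def submit_with_retry(submit, report, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call submit(report), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return submit(report)
        except Exception as err:  # e.g. a transient "Internal Exception"
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)


def make_flaky_submit(failures):
    """Hypothetical submitter that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def submit(report):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("fact submission failed: Internal Exception")
        return "accepted"

    return submit


if __name__ == "__main__":
    result = submit_with_retry(make_flaky_submit(1), {"dist": "Foo-Bar-1.0"},
                               sleep=lambda s: None)
    print(result)
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting; in normal use the default `time.sleep` applies.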




Re: Metabase Changeover Plan

2017-08-13 Thread Nigel Horne
I doubt this is a co-incidence – I’ve been seeing 503 errors on some of my 
reports:


    503 first byte timeout

    Error 503 first byte timeout
    first byte timeout
    Guru Mediation:
    Details: cache-dca17720-DCA 1502649587 654081008

    Varnish cache server


-Nigel


Re: Metabase Changeover Plan

2017-08-13 Thread Doug Bell
The Metabase API has been changed over. Some bugs were fixed and everything 
appears stable. If there are any problems, I will revert to the old Metabase 
API. If I am unresponsive, testers can point their machines to 
`metabase-old.cpantesters.org` to reach 
the old Metabase API. All the old infrastructure is still ticking over and will 
continue to do so until the new things have been stable for at least a week.

Moving away from EC2 is a fairly major cost saving: something we were paying
$250/mo for now costs us $0/mo. That said, the Metabase section of the site
has had its own server for a long time, and now we're adding its work to our
one existing server, which already does everything else except the database.

To help spread the work out, if anyone still has one or two older, outdated 
servers sitting in a rack somewhere doing nothing and could make them available 
to the CPAN Testers project, let me know. I know people have volunteered 
hardware before, but I wasn't in a place where I could easily make use of it. 
Now that the new Metabase API exists, and now that I am very close to a Minion 
job-queue-based backend processing system, I can more easily spread work across 
multiple machines.

In return for donating, you'll get a place on the sponsors list
(http://iheart.cpantesters.org), and you'll help keep Perl and CPAN a
community of people collaborating on stable, useful software projects.

If there are any other questions, problems, or concerns, please let me know.

Thanks,



Doug Bell
d...@preaction.me



> On Aug 9, 2017, at 3:03 PM, Doug Bell wrote:
> 
> [...]





Metabase Changeover Plan

2017-08-09 Thread Doug Bell
[Cross-posting from the CPAN Testers blog: 
http://blog.cpantesters.org/diary/209 ]

Summary: I will be doing work on the Metabase API on 2017-08-12. Writing test 
reports may be unresponsive for a few minutes, and there may be bugs. Please 
let me know if there are any problems submitting test reports.

I have completed the processing script for the new test report format. This was
the last step in moving the Metabase API away from Amazon and onto our MySQL
cluster for cost and stability reasons: Amazon SimpleDB is too expensive, and
its limitations for our purposes mean it isn't worth what it costs. We have
always maintained a copy of the Metabase data in our MySQL database, and
there's no real need to continue having two live copies of the same data
(especially when one of the copies costs money every time you ask for a piece
of data).

This Saturday, 2017-08-12, around 1:00 PM US/Central (18:00 UTC), I will be 
switching DNS over to the new, backwards-compatible Metabase API which writes 
to our MySQL database. A few months ago, I asked for testers to try this new 
API out, and everything went well (thanks to everyone who helped with that). 
The new API works the same as the old API: No changes are needed for your 
testers or anyone consuming the minimal data feeds out of the Metabase API (the 
log.txt view).

Since this is only a DNS change, the downtime for the change should be zero as 
DNS propagates and your testers are pointed at the new IP address. Since it's 
possible for me to mess up this change, there may be some downtime. Since all 
software has bugs, there may be some downtime if any bugs are revealed by all 
the testers being migrated to the new API.

This change (and all the work around this change) sets up the project for new 
changes down the road:

* Speeding up report processing by triggering individual report processing jobs 
as reports are submitted
* Distributing those processing jobs over multiple machines to improve 
performance
* Making the test report text available immediately after submission instead of 
having to wait for backend processing jobs

If you have any questions, feel free to reply to this thread or to me directly.

Doug Bell
d...@preaction.me




