[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #15 from Antoine hashar Musso (WMF) has...@free.fr ---
(In reply to Yuvi Panda from comment #14)
> Can someone with projectadmin on integration project edit
> https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:
>
> puppetmaster::scripts::keep_report_minutes: 360
>
> This will keep reports only for 6 hours.

I have copy pasted on:
https://wikitech.wikimedia.org/wiki/Hiera:Integration

Updated the git repo on integration-puppetmaster.eqiad.wmflabs to include the above Gerrit change and ran puppet. The puppet crontab still has the old entry:

# crontab -l -u puppet |egrep -v ^#
27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete

:-/

--
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

Antoine hashar Musso (WMF) has...@free.fr changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|PATCH_TO_REVIEW                |RESOLVED
         Resolution|---                            |FIXED
           Assignee|wikibugs-l@lists.wikimedia.org |yuvipa...@gmail.com

--- Comment #16 from Antoine hashar Musso (WMF) has...@free.fr ---
<yuvipanda> hashar: bah, typo on my end. it's 'keep_reports_minutes' (s after report)

I have reedited the wiki page, ran puppet again:

Notice: /Stage[main]/Puppetmaster::Scripts/Cron[removeoldreports]/command: command changed 'find /var/lib/puppet/reports -type f -mmin +2160 -delete' to 'find /var/lib/puppet/reports -type f -mmin +360 -delete'

# crontab -l -u puppet |egrep -v ^#
27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +360 -delete

That solves the issue for the 'integration' project. I did the same for 'deployment-prep' ( https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=135116&oldid=134263 ) and it is all happy as well.

Thank you Yuvi!
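For readers who want to see what the changed cron command selects before it deletes anything, here is a minimal sketch. It runs the same find test (-type f -mmin +N) with -print instead of -delete, against a scratch directory rather than /var/lib/puppet/reports. The function name list_expired, the variable demo_dir and the file names are made up for illustration and are not part of the puppetmaster module. The touch -d '7 hours ago' syntax assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch only: preview which files a retention window would match.
# list_expired, demo_dir and the file names below are hypothetical.

# list_expired DIR MINUTES
# Print regular files under DIR last modified more than MINUTES minutes ago.
# Same test as the cron entry, but with -print instead of -delete.
list_expired() {
    find "$1" -type f -mmin "+$2" -print
}

# Work in a scratch directory so no real report is touched.
demo_dir=$(mktemp -d)
touch -d '7 hours ago' "$demo_dir/old-report.yaml"   # GNU touch syntax (assumption)
touch "$demo_dir/fresh-report.yaml"

echo "Older than 360 minutes (6 hours):"
list_expired "$demo_dir" 360

echo "Older than 2160 minutes (36 hours):"
list_expired "$demo_dir" 2160
```

With the 360 minute window only the seven hour old file is listed. With the previous 2160 minute window neither file is, which is why reports accumulated for 36 hours between cleanups.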
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #11 from Greg Grossmeier g...@wikimedia.org ---
Just keeping the heat on this bug, we had an outage this morning (times in Eastern US):

07:49 <icinga-wm> PROBLEM - BetaLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki01.diskspace._var.byte_avail.value (33.33%)

That probably caused the outage (the only other thing around that time is bug 73567, which hasn't been fixed/reverted yet, and beta is back up).

I *really really really* want to just throw hardware at the problem. It's a pain given how OpenStack/Beta work, but I'm getting annoyed by all the warnings that we can't do anything else about.

Our (Release Engineering's) job is not to rework prod logging policies on a case-by-case basis to make it work in Beta. Continued diff creation for reasons like that only complexify (it's a word) things.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #12 from Antoine hashar Musso (WMF) has...@free.fr ---
I know about two reasons for the HHVM application servers on beta cluster to fill /var/:

Bug 73262 - hhvm apache fills /var/log/apache2 with access logs
They need to send their logs to syslog (which would thus end up on the logstash instance) instead of writing debug / access logs to disk.

Some bug I can't find, which is that the HHVM coredumps end up under /var/ as well, when they should be saved to /data/project (since we care about them) and garbage collected automatically (Bryan wrote a cron to handle that).

Finally, there is this bug with puppet filling the puppet master disk, which is being worked on by Yuvi.

Sorry for hijacking this bug. I can't firefight all the issues nor triage / set priority on bugs flagged hhvm.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #13 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 174132 merged by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #14 from Yuvi Panda yuvipa...@gmail.com ---
Can someone with projectadmin on integration project edit https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:

puppetmaster::scripts::keep_report_minutes: 360

This will keep reports only for 6 hours.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #2 from Marc A. Pelletier m...@uberbox.org ---
(In reply to Antoine hashar Musso (WMF) from comment #1)
> /var is /dev/vda2 , I am wondering whether it can be extended somehow.
> CCing Andrew B and Marc-André.

The latest images, through some rather ugly trickery, have /var on a logical volume and thus are expandable at will. No such luck for the older images which have physical partitions.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #3 from Antoine hashar Musso (WMF) has...@free.fr ---
I looked at the state of the beta cluster puppet master (deployment-salt). There, /var/lib is a symlink to /srv/var-lib/ which gives more free space.

The puppet.master has the reports.logstash which explains why nothing is written on disk.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #4 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 174132 had a related patch set uploaded by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

Gerrit Notification Bot gerritad...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |PATCH_TO_REVIEW
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #5 from Antoine hashar Musso (WMF) has...@free.fr ---
(In reply to Antoine hashar Musso (WMF) from comment #3)
> I looked at the state of the beta cluster puppet master (deployment-salt).
> There, /var/lib is a symlink to /srv/var-lib/ which gives more free space.
>
> The puppet.master has the reports.logstash which explains why nothing is
> written on disk.

On beta we have a patch to send reports to logstash which discards reporting on disk:
https://gerrit.wikimedia.org/r/#/c/143788/10/modules/puppetmaster/templates/30-logstash.conf.erb,unified
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #6 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Antoine hashar Musso (WMF) from comment #1)
> The /var on labs is indeed only 2GB. puppetmaster reports takes 600MB of
> disk right now.

Can we not just increase the size of the beta cluster instances' diskspace? We've run into this issue many many many many times and playing whack-a-mole with symlinks and cronjobs to move data around is not sustainable.
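The figures quoted above (a 2GB /var, about 600MB of puppetmaster reports) can be re-checked on any instance with standard tools. This is a minimal sketch, assuming the reports live under /var/lib/puppet/reports as in the cron entry quoted elsewhere in this thread. The helper name usage_kb and the variable reports_dir are made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare the space used by puppet reports with the size of /var.
# usage_kb and reports_dir are hypothetical names.

# usage_kb DIR
# Print the number of kilobytes used by DIR, as reported by du.
usage_kb() {
    du -sk "$1" | cut -f1
}

# Path taken from the cron entry in this thread; pass another path as $1 to override.
reports_dir=${1:-/var/lib/puppet/reports}

if [ -d "$reports_dir" ]; then
    echo "reports: $(usage_kb "$reports_dir") KB in $reports_dir"
else
    echo "no such directory: $reports_dir"
fi

# Total size and free space of the filesystem holding /var, in kilobytes.
df -k /var
```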
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #7 from Andrew Bogott abog...@wikimedia.org ---
Greg -- For new instances /var/log is somewhat resizeable.

For existing instances you can remount /var/log but that's very messy since every service expects to already have an open file and a directory in /var/log.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #8 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Andrew Bogott from comment #7)
> Greg -- For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances to be $large-enough-to-not-matter?

> For existing instances you can remount /var/log but that's very messy since
> every service expects to already have an open file and a directory in
> /var/log.

Worst case scenario is creating a second instance of whatever with a larger disk, moving traffic to it, then shutting down the old one, right?

Not saying we should do that soon, but... continued hacks like this are hurting the stability of Beta Cluster (as opposed to addressing the real underlying issue of too little space on the VMs we use for our integration environment which everyone depends on daily).
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #9 from Andrew Bogott abog...@wikimedia.org ---
(In reply to Greg Grossmeier from comment #8)
> (In reply to Andrew Bogott from comment #7)
> > Greg -- For new instances /var/log is somewhat resizeable.
>
> How much? Can we just change the default for new deployment-prep instances
> to be $large-enough-to-not-matter?

Resizeable up to the available space selected when the instance was originally created. It should be possible to set up sizing of /var/log based on project. I'll have a look at that if that's the direction you want to go.

> Worst case scenario is creating a second instance of whatever with a larger
> disk, moving traffic to it, then shutting down the old one, right?

That's correct. In perfect-puppet-land, doing that should be trivial, but I've been led to understand that in the real world it's a big pain.

> (as opposed to addressing the real underlying issue of too little space on
> the VMs we use for our integration environment which everyone depends on
> daily).

One might argue that the 'real problem' is unbounded log growth, and that beta just displays the symptoms sooner than production. But I don't know if the issue really is unbounded growth or if growth is bounded properly but just bounded outside the capacity of existing instances.
[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #10 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Andrew Bogott from comment #9)
> (In reply to Greg Grossmeier from comment #8)
> > (In reply to Andrew Bogott from comment #7)
> > > Greg -- For new instances /var/log is somewhat resizeable.
> >
> > How much? Can we just change the default for new deployment-prep instances
> > to be $large-enough-to-not-matter?
>
> Resizeable up to the available space selected when the instance was
> originally created. It should be possible to set up sizing of /var/log
> based on project. I'll have a look at that if that's the direction you want
> to go.

I guess we should weigh this ^ and the unbounded growth concern below.

> > Worst case scenario is creating a second instance of whatever with a larger
> > disk, moving traffic to it, then shutting down the old one, right?
>
> That's correct. In perfect-puppet-land, doing that should be trivial, but
> I've been led to understand that in the real world it's a big pain.

Sadly, but that also points out other legitimate bugs :)

> > (as opposed to addressing the real underlying issue of too little space on
> > the VMs we use for our integration environment which everyone depends on
> > daily).
>
> One might argue that the 'real problem' is unbounded log growth, and that
> beta just displays the symptoms sooner than production. But I don't know if
> the issue really is unbounded growth or if growth is bounded properly but
> just bounded outside the capacity of existing instances.

Touche. But I'm still worried about all the differences between prod and beta that cause surprises :/