[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #15 from Antoine hashar Musso (WMF) has...@free.fr ---
(In reply to Yuvi Panda from comment #14)
> Can someone with projectadmin on integration project edit
> https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:
>
> puppetmaster::scripts::keep_report_minutes: 360
>
> This will keep reports only for 6 hours.

I have copy-pasted that line to:
https://wikitech.wikimedia.org/wiki/Hiera:Integration

Updated the git repo on integration-puppetmaster.eqiad.wmflabs to include the
above Gerrit change and ran puppet.  The puppet crontab still has the old
entry:

 # crontab -l -u puppet |egrep -v ^#
 27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete

:-/
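
A quick check like the following makes the mismatch easy to spot. It is only a sketch: the helper name `mmin_from_cron_line` is made up for illustration, and the sample line is the crontab entry quoted above.

```shell
# Hypothetical helper (not part of the puppetmaster module): extract the
# -mmin threshold, in minutes, from a crontab line.
mmin_from_cron_line() {
    printf '%s\n' "$1" | sed -n 's/.*-mmin +\([0-9][0-9]*\).*/\1/p'
}

line='27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete'
minutes=$(mmin_from_cron_line "$line")
echo "retention: $minutes minutes ($(( minutes / 60 )) hours)"
```

With the entry above this reports 2160 minutes (36 hours), not the 360 minutes (6 hours) requested in Hiera.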

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

Antoine hashar Musso (WMF) has...@free.fr changed:

   What|Removed |Added

 Status|PATCH_TO_REVIEW |RESOLVED
 Resolution|--- |FIXED
   Assignee|wikibugs-l@lists.wikimedia. |yuvipa...@gmail.com
   |org |

--- Comment #16 from Antoine hashar Musso (WMF) has...@free.fr ---
<yuvipanda> hashar: bah, typo on my end. it's 'keep_reports_minutes' (s
after report)

I have re-edited the wiki page and ran puppet again:

 Notice: /Stage[main]/Puppetmaster::Scripts/Cron[removeoldreports]/command:
 command changed
 'find /var/lib/puppet/reports -type f -mmin +2160 -delete'
  to 'find /var/lib/puppet/reports -type f -mmin +360 -delete'



  # crontab -l -u puppet |egrep -v ^#
  27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +360 -delete

That solves the issue for the 'integration' project.
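
Before trusting a shorter retention, one could preview what the cron entry would remove. The sketch below is an assumption for illustration, not something taken from the puppetmaster module: the function name `list_old_reports` is hypothetical, and it uses the same `find` expression as the crontab entry with `-print` in place of `-delete`.

```shell
# Hypothetical helper: list report files older than a given number of minutes.
# Same expression as the cron entry above, but it only prints the matches.
list_old_reports() {
    dir=$1
    minutes=$2
    find "$dir" -type f -mmin "+$minutes" -print
}

# Example (not run here): list_old_reports /var/lib/puppet/reports 360
```

Anything this prints is what the next cron run would delete.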


I did the same for 'deployment-prep' (
https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=135116&oldid=134263
) and it is all happy as well.


Thank you Yuvi!



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #11 from Greg Grossmeier g...@wikimedia.org ---
Just keeping the heat on this bug, we had an outage this morning (times in
Eastern US):
07:49  icinga-wm PROBLEM - BetaLabs: Low disk space on /var on labmon1001 is
CRITICAL: CRITICAL:
deployment-prep.deployment-mediawiki01.diskspace._var.byte_avail.value (33.33%)

That probably caused the outage (the only other thing around that time is bug
73567, which hasn't been fixed/reverted, yet beta is back up).

I *really really really* want to just throw hardware at the problem. That is a
pain given how OpenStack/Beta work, but I'm getting annoyed by all the warnings
that we can't do anything else about. Our (Release Engineering's) job is not to
rework prod logging policies on a case-by-case basis to make them work in Beta.
Continuing to create diffs for reasons like that only complexifies (it's a
word) things.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #12 from Antoine hashar Musso (WMF) has...@free.fr ---
I know of two reasons why the HHVM application servers on the beta cluster fill
/var/:

Bug 73262 - hhvm apache fills /var/log/apache2 with access logs

They need to send their logs to syslog (which would then end up on the logstash
instance) instead of writing debug / access logs to disk.


Some bug I can't find: the HHVM core dumps end up under /var/ as well, when
they should be saved to /data/project (since we care about them) and garbage
collected automatically (Bryan wrote a cron to handle that).

Finally, there is this bug, with puppet filling the puppet master's disk, which
is being worked on by Yuvi.



Sorry for hijacking this bug. I can't firefight all the issues nor triage / set
priority on bugs flagged hhvm.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #13 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 174132 merged by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #14 from Yuvi Panda yuvipa...@gmail.com ---
Can someone with projectadmin on integration project edit
https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:

puppetmaster::scripts::keep_report_minutes: 360

This will keep reports only for 6 hours.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #2 from Marc A. Pelletier m...@uberbox.org ---
(In reply to Antoine hashar Musso (WMF) from comment #1)
> /var is /dev/vda2, I am wondering whether it can be extended somehow.
> CCing Andrew B and Marc-André.

The latest images, through some rather ugly trickery, have /var on a logical
volume and thus are expandable at will.  No such luck for the older images
which have physical partitions.
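
To tell which case a given instance falls into, a check along these lines might help. It is a sketch under assumptions: the function name `describe_var_backing` is hypothetical, and it relies on the device type strings `lvm` and `part` as printed by `lsblk -o TYPE`.

```shell
# Hypothetical helper: describe whether a block-device type (as printed by
# `lsblk -n -d -o TYPE`) is a logical volume or a plain partition.
describe_var_backing() {
    case $1 in
        lvm)  echo "logical volume: can be extended" ;;
        part) echo "physical partition: fixed size" ;;
        *)    echo "unknown backing: $1" ;;
    esac
}

# On a real instance one might feed it like this (not run here):
#   dev=$(findmnt -n -o SOURCE --target /var)
#   describe_var_backing "$(lsblk -n -d -o TYPE "$dev")"
```

On an older image where /var is /dev/vda2, the type would be expected to be `part`.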



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #3 from Antoine hashar Musso (WMF) has...@free.fr ---
I looked at the state of the beta cluster puppet master (deployment-salt).

There, /var/lib is a symlink to /srv/var-lib/ which gives more free space.  The
puppet.master has the reports.logstash which explains why nothing is written on
disk.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #4 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 174132 had a related patch set uploaded by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

Gerrit Notification Bot gerritad...@wikimedia.org changed:

   What|Removed |Added

 Status|NEW |PATCH_TO_REVIEW



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #5 from Antoine hashar Musso (WMF) has...@free.fr ---
(In reply to Antoine hashar Musso (WMF) from comment #3)
> I looked at the state of the beta cluster puppet master (deployment-salt).
>
> There, /var/lib is a symlink to /srv/var-lib/ which gives more free space.
> The puppet.master has the reports.logstash which explains why nothing is
> written on disk.

On beta we have a patch that sends reports to logstash and drops reporting to
disk:
https://gerrit.wikimedia.org/r/#/c/143788/10/modules/puppetmaster/templates/30-logstash.conf.erb,unified



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #6 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Antoine hashar Musso (WMF) from comment #1)
> The /var on labs is indeed only 2GB.  puppetmaster reports takes 600MB of
> disk right now.

Can we not just increase the size of the beta cluster instances' diskspace?
We've run into this issue many many many many times and playing whack-a-mole
with symlinks and cronjobs to move data around is not sustainable.
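
For scale, the figures quoted above can be turned into a share of the partition. This is a rough sketch only: both function names are hypothetical, and it assumes 2GB means 2048 MB.

```shell
# Integer percentage of `total` taken by `used` (same unit for both).
share_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Hypothetical wrapper: share of a filesystem occupied by one directory.
# du -sk and df -Pk both report sizes in 1 KiB blocks.
dir_share_percent() {
    used_kb=$(du -sk "$1" | awk '{print $1}')
    total_kb=$(df -Pk "$2" | awk 'NR==2 {print $2}')
    share_percent "$used_kb" "$total_kb"
}

# Figures quoted above, in MB: 600 MB of reports on a 2 GB (2048 MB) /var.
share_percent 600 2048
```

Under that assumption the reports alone take roughly 29% of /var. On an instance, `dir_share_percent /var/lib/puppet/reports /var` would give the live figure.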



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #7 from Andrew Bogott abog...@wikimedia.org ---
Greg --

For new instances /var/log is somewhat resizeable.  For existing instances you
can remount /var/log but that's very messy since every service expects to
already have an open file and a directory in /var/log.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #8 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Andrew Bogott from comment #7)
> Greg --
>
> For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances to
be $large-enough-to-not-matter?

> For existing instances
> you can remount /var/log but that's very messy since every service expects
> to already have an open file and a directory in /var/log.

Worst case scenario is creating a second instance of whatever with a larger
disk, moving traffic to it, then shutting down the old one, right? Not saying
we should do that soon, but... continued hacks like this are hurting the
stability of Beta Cluster (as opposed to addressing the real underlying issue
of too little space on the VMs we use for our integration environment which
everyone depends on daily).



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #9 from Andrew Bogott abog...@wikimedia.org ---
(In reply to Greg Grossmeier from comment #8)
> (In reply to Andrew Bogott from comment #7)
> > Greg --
> >
> > For new instances /var/log is somewhat resizeable.
>
> How much? Can we just change the default for new deployment-prep instances
> to be $large-enough-to-not-matter?

Resizeable up to the available space selected when the instance was originally
created.

It should be possible to set up sizing of /var/log based on project.  I'll have
a look at that if that's the direction you want to go.

> Worst case scenario is creating a second instance of whatever with a larger
> disk, moving traffic to it, then shutting down the old one, right?

That's correct.  In perfect-puppet-land, doing that should be trivial, but I've
been led to understand that in the real world it's a big pain.

> (as opposed to addressing the real underlying
> issue of too little space on the VMs we use for our integration environment
> which everyone depends on daily).

One might argue that the 'real problem' is unbounded log growth, and that beta
just displays the symptoms sooner than production.  But I don't know if the
issue really is unbounded growth or if growth is bounded properly but just
bounded outside the capacity of existing instances.



[Bug 73472] [OPS] Jenkins: puppet master fills /var on labs with yaml reports

2014-11-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73472

--- Comment #10 from Greg Grossmeier g...@wikimedia.org ---
(In reply to Andrew Bogott from comment #9)
> (In reply to Greg Grossmeier from comment #8)
> > (In reply to Andrew Bogott from comment #7)
> > > Greg --
> > >
> > > For new instances /var/log is somewhat resizeable.
> >
> > How much? Can we just change the default for new deployment-prep instances
> > to be $large-enough-to-not-matter?
>
> Resizeable up to the available space selected when the instance was
> originally created.
>
> It should be possible to set up sizing of /var/log based on project.  I'll
> have a look at that if that's the direction you want to go.

I guess we should weigh this ^ and the unbounded growth concern below.

> > Worst case scenario is creating a second instance of whatever with a larger
> > disk, moving traffic to it, then shutting down the old one, right?
>
> That's correct.  In perfect-puppet-land, doing that should be trivial, but
> I've been led to understand that in the real world it's a big pain.

Sadly, but that also points out other legitimate bugs :)

> > (as opposed to addressing the real underlying
> > issue of too little space on the VMs we use for our integration environment
> > which everyone depends on daily).
>
> One might argue that the 'real problem' is unbounded log growth, and that
> beta just displays the symptoms sooner than production.  But I don't know if
> the issue really is unbounded growth or if growth is bounded properly but
> just bounded outside the capacity of existing instances.

Touche. But I'm still worried about all the differences between prod and beta
that cause surprises :/
