[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-10-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #7 from Antoine hashar Musso has...@free.fr ---
Thanks Bryan for the detailed explanation :-)

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-10-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Antoine hashar Musso has...@free.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bda...@wikimedia.org

--- Comment #4 from Antoine hashar Musso has...@free.fr ---
It seems the root cause of the issue was LDAP being upgraded and intermittently
unreachable over the past few days.  As a result, when Puppet runs it
considers that the mwdeploy/l10nupdate users (among others) do not exist and
thus creates local copies of them.  Whenever LDAP comes back, we end up with
files having conflicting UIDs.  That most probably confuses rsync.
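
As an illustration of the conflict described above, here is a minimal sketch
that compares passwd-format entries against the UIDs a directory service
reports. The function name, the sample entries, and every UID below are
hypothetical; a plain dict stands in for LDAP, and nothing here is taken from
the beta cluster itself.

```python
def find_uid_conflicts(passwd_text, directory_uids):
    """Return {user: (local_uid, directory_uid)} for accounts whose local UID
    differs from the UID reported by the directory service."""
    conflicts = {}
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(":")
        if len(fields) < 3:
            continue  # not a well-formed passwd entry
        name, local_uid = fields[0], int(fields[2])
        expected = directory_uids.get(name)
        if expected is not None and expected != local_uid:
            conflicts[name] = (local_uid, expected)
    return conflicts


# Hypothetical sample data: all UIDs are made up for illustration.
local_passwd = """\
root:x:0:0:root:/root:/bin/bash
mwdeploy:x:998:998::/home/mwdeploy:/bin/bash
l10nupdate:x:997:997::/home/l10nupdate:/bin/bash
"""
ldap_uids = {"mwdeploy": 603, "l10nupdate": 10002}

for user, (local, remote) in sorted(find_uid_conflicts(local_passwd, ldap_uids).items()):
    print(f"{user}: local UID {local} != LDAP UID {remote}")
```

Any user reported by such a check would own files under one UID while LDAP
resolves the name to another.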

Bryan deleted the local users yesterday.  He also cleaned up all the 'common'
directories which were left around, thus reclaiming a huge amount of disk space.

So it is all fixed for now.

Puppet creating local users when LDAP is unreachable is documented at
https://bugzilla.wikimedia.org/show_bug.cgi?id=71480 .



[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-10-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #5 from Greg Grossmeier g...@wikimedia.org ---
Created attachment 16640
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=16640&action=edit
disk space percent free graph

So it appears that things are stable again, disk-space-free-wise.

Also, does the drop in available disk space around Sept 11th correlate with
anything we should worry about?

I'm inclined to close this bug for now if we aren't realistically going to hit
the limit any time soon (and since we hit the limit this time due to an
unrelated breakage we needed to catch anyways).



[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-10-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Bryan Davis bda...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
           Assignee|wikibugs-l@lists.wikimedia. |bda...@wikimedia.org
                   |org                         |

--- Comment #6 from Bryan Davis bda...@wikimedia.org ---
(In reply to Antoine hashar Musso from comment #4)
> It seems the root cause of the issue was the LDAP being
> upgraded/unreachable intermittently over the past few days.  As a result,
> when Puppet runs it considers that the mwdeploy/l10nupdate (among others)
> users do not exist and thus creates a local copy of them.  Whenever LDAP
> comes back, we end up with files having conflicting UIDs.  That most probably
> confuses rsync.

This was an issue across several hosts in the beta cluster, but it turned out
to be unrelated to the disk space issues on rsync01.

> Bryan deleted the local users yesterday.  He also cleaned up all the
> 'common' directories which were left around, thus reclaiming a huge amount of
> disk space.

This was the real problem. When I originally added scap deployment to beta I
found that the primary disks for all of the hosts that needed copies of
MediaWiki were too small to comfortably contain a full sync. I added secondary
LVM mounts to all of these hosts on /srv (or made /srv a symlink to /mnt/srv if
LVM was already attached on /mnt). Then I created a symlink from
/usr/local/apache/common-local to /srv/common-local where the synced tree from
deployment-bastion would be stored.

Recently Ori dove into operations/puppet and started working on cleaning up the
legacy file paths (/a/common, /usr/local/apache) and replacing them with more
modern locations. /usr/local/apache/common and /usr/local/apache/common-local
(former was a symlink to the latter) were replaced with /srv/mediawiki. When
these changes hit beta, things mostly just worked because puppet and scap
worked together to create the right content in the right place.

A side effect of this change finally bit us on rsync01. There was no puppet
code added to clean up the old /srv/common-local sync target. This left ~3G of
files on each scap target host. For the deployment-mediawiki* hosts this was
not a big deal. The secondary disk on those hosts is 68G leaving lots of space
for the new copy of everything. On deployment-rsync01 however, /srv is an 8.5G
partition, so 3G is a significant chunk of the available drive space.
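
To put those numbers side by side, a quick calculation of how much of /srv the
leftover tree takes up on each kind of host. The sizes are the approximate
figures quoted above; the function name is just for illustration.

```python
def share_of_partition(leftover_gb, partition_gb):
    """Percentage of a partition taken up by a leftover tree."""
    return 100.0 * leftover_gb / partition_gb


# Approximate figures from this comment: ~3G leftover, 68G and 8.5G partitions.
for host, size_gb in (("deployment-mediawiki*", 68.0), ("deployment-rsync01", 8.5)):
    print(f"{host}: {share_of_partition(3.0, size_gb):.1f}% of /srv")
```

That works out to roughly 4% on the deployment-mediawiki* hosts versus roughly
35% on deployment-rsync01.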

I have deleted /srv/common-local from all of the hosts in beta.
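
For reference, a minimal sketch of the kind of cleanup that was missing: measure
a stale sync target and optionally remove it. This is not the Puppet change
itself, only a standalone illustration; the function names are hypothetical and
the demo runs against a throwaway temporary directory rather than
/srv/common-local.

```python
import os
import shutil
import tempfile


def tree_size_bytes(root):
    """Sum the sizes of regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def remove_stale_tree(path, dry_run=True):
    """Return the bytes that were (or, in a dry run, would be) reclaimed."""
    if os.path.islink(path) or not os.path.isdir(path):
        return 0  # nothing to do: missing, or a symlink we should not follow
    size = tree_size_bytes(path)
    if not dry_run:
        shutil.rmtree(path)
    return size


# Demonstration against a throwaway directory, not a real host path.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "stale.txt"), "w") as handle:
    handle.write("x" * 1024)
print(remove_stale_tree(demo))                 # dry run: reports size only
print(remove_stale_tree(demo, dry_run=False))  # reports size and deletes
print(os.path.exists(demo))
```

Defaulting to a dry run keeps the destructive step explicit.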



[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-09-30 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #1 from Antoine hashar Musso has...@free.fr ---
deployment-rsync01.eqiad.wmflabs (
https://wikitech.wikimedia.org/wiki/Nova_Resource:I-02f4.eqiad.wmflabs ) is
an m1 small instance with a 20GB disk allocation, partitioned as:

hashar@deployment-rsync01:~$ df -h -x nfs
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                           7.6G  2.1G  5.2G  29% /
udev                                998M   12K  998M   1% /dev
tmpfs                               401M  316K  401M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                               1002M     0 1002M   0% /run/shm
/dev/vda2                           1.9G  647M  1.2G  36% /var
cgroups                            1002M     0 1002M   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  8.5G  6.1G  1.9G  77% /srv

The scap process has most probably filled /srv/ because of the l10n cache but
it has been cleaned up.
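
A check along these lines could flag a filling partition before scap runs out
of room. This is only a sketch using Python's standard shutil.disk_usage; the
function names and the 75% threshold are arbitrary assumptions, not anything
scap or the Jenkins job currently does.

```python
import shutil


def percent_used(path):
    """Used space as a percentage of used + available, similar to df's Use%."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0  # pseudo-filesystems can report zero sizes
    return 100.0 * usage.used / denominator


def mounts_over_threshold(paths, threshold=75.0):
    """Return (path, percent) pairs for paths at or above the threshold."""
    results = []
    for path in paths:
        percent = percent_used(path)
        if percent >= threshold:
            results.append((path, percent))
    return results


# "/" is used here only so the example runs anywhere; on the host in question
# the interesting mount point would be /srv.
for path, percent in mounts_over_threshold(["/"], threshold=75.0):
    print(f"{path} is {percent:.0f}% full")
```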


I am not sure which files need to be cleaned up. We can do that either in scap
itself or in the Jenkins job beta-scap-eqiad.



[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-09-30 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Greg Grossmeier g...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Unprioritized               |Normal

--- Comment #2 from Greg Grossmeier g...@wikimedia.org ---
Let's not make the Jenkins beta-scap-eqiad job very divergent from prod (at
all).
<voice=ori>Let's make the Beta Cluster like prod, not make more hacks that
are different.</voice>



[Bug 71431] deployment-rsync01 20GB hard drive is too small

2014-09-30 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #3 from Sam Reed (reedy) s...@reedyboy.net ---
How long did it take to break?

I deleted a weird tmp dir, killed the whole cache dir, and re-ran sync-common,
which gave ~2G free space.

I'm wondering whether it's a one-off or whether this will break again quickly.
If it does, we should reinstall it on a larger machine.
