[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431 --- Comment #7 from Antoine hashar Musso has...@free.fr --- Thanks Bryan for the detailed explanation :-) -- You are receiving this mail because: You are on the CC list for the bug. ___ Wikibugs-l mailing list Wikibugs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Antoine hashar Musso has...@free.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bda...@wikimedia.org

--- Comment #4 from Antoine hashar Musso has...@free.fr ---
It seems the root cause of the issue was LDAP being upgraded and intermittently unreachable over the past few days. As a result, when Puppet runs it considers that the mwdeploy/l10nupdate users (among others) do not exist and thus creates local copies of them. Whenever LDAP comes back, we end up with files having conflicting UIDs. That most probably confuses rsync.

Bryan deleted the local users yesterday. He also cleaned up some old 'common' directories which were left around, thus reclaiming a huge amount of disk space.

So it is all fixed for now.

Puppet creating local users when LDAP is unreachable is documented at https://bugzilla.wikimedia.org/show_bug.cgi?id=71480 .
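The UID conflict described in comment #4 can be checked for mechanically. The following is only a sketch, assuming two passwd-format inputs are available (for example the contents of /etc/passwd and a passwd-format dump of the LDAP accounts). The function names and the UIDs in the usage example are hypothetical and do not come from the bug report.

```python
def parse_passwd(text):
    """Map user name -> UID from passwd-format text (name:pw:uid:gid:...)."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed entries instead of failing
        users[fields[0]] = int(fields[2])
    return users


def shadowed_users(local_text, directory_text):
    """Return (name, local_uid, directory_uid) for every account that
    exists in both sources but with different UIDs."""
    local = parse_passwd(local_text)
    directory = parse_passwd(directory_text)
    return sorted(
        (name, local[name], directory[name])
        for name in local.keys() & directory.keys()
        if local[name] != directory[name]
    )


if __name__ == "__main__":
    # Hypothetical data: a locally created mwdeploy shadowing the LDAP one.
    local = "mwdeploy:x:998:998::/home/mwdeploy:/bin/sh\n"
    ldap = "mwdeploy:x:603:603::/home/mwdeploy:/bin/bash\n"
    for name, local_uid, ldap_uid in shadowed_users(local, ldap):
        print(f"{name}: local UID {local_uid} != directory UID {ldap_uid}")
```

Any account the function reports is a candidate for the kind of cleanup Bryan did by hand. Files already written under the local UID would still need their ownership fixed separately.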
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #5 from Greg Grossmeier g...@wikimedia.org ---
Created attachment 16640
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=16640&action=edit
disk space percent free graph

So it appears that things are stable again, disk-space-free-wise.

Also, does the drop in available disk space around Sept 11th correlate with anything we should worry about?

I'm inclined to close this bug for now if we aren't realistically going to hit the limit any time soon (and since we hit the limit this time due to an unrelated breakage we needed to catch anyways).
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Bryan Davis bda...@wikimedia.org changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |RESOLVED
         Resolution|---                            |FIXED
           Assignee|wikibugs-l@lists.wikimedia.org |bda...@wikimedia.org

--- Comment #6 from Bryan Davis bda...@wikimedia.org ---
(In reply to Antoine hashar Musso from comment #4)
> It seems the root cause of the issue was LDAP being upgraded and
> intermittently unreachable over the past few days. As a result, when Puppet
> runs it considers that the mwdeploy/l10nupdate users (among others) do not
> exist and thus creates local copies of them. Whenever LDAP comes back, we end
> up with files having conflicting UIDs. That most probably confuses rsync.

This was an issue across several hosts in the beta cluster, but it turned out to be unrelated to the disk space issues on rsync01.

> Bryan deleted the local users yesterday. He also cleaned up some old 'common'
> directories which were left around, thus reclaiming a huge amount of disk
> space.

This was the real problem. When I originally added scap deployment to beta, I found that the primary disks for all of the hosts that needed copies of MediaWiki were too small to comfortably contain a full sync. I added secondary LVM mounts to all of these hosts on /srv (or made /srv a symlink to /mnt/srv if LVM was already attached on /mnt). Then I created a symlink from /usr/local/apache/common-local to /srv/common-local, where the synced tree from deployment-bastion would be stored.

Recently Ori dove into operations/puppet and started working on cleaning up the legacy file paths (/a/common, /usr/local/apache) and replacing them with more modern locations. /usr/local/apache/common and /usr/local/apache/common-local (the former was a symlink to the latter) were replaced with /srv/mediawiki. When these changes hit beta, things mostly just worked because puppet and scap worked together to create the right content in the right place.

A side effect of this change finally bit us on rsync01. There was no puppet code added to clean up the old /srv/common-local sync target. This left ~3G of files on each scap target host. For the deployment-mediawiki* hosts this was not a big deal. The secondary disk on those hosts is 68G, leaving lots of space for the new copy of everything. On deployment-rsync01, however, /srv is an 8.5G partition, so 3G is a significant chunk (roughly a third) of the available drive space.

I have deleted /srv/common-local from all of the hosts in beta.
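The leftover sync target described in comment #6 is the kind of thing a small audit script can surface before a partition fills up. The sketch below is only an illustration, not part of scap or operations/puppet: it reports how many bytes remain under each legacy path that still exists as a real directory. The path list is taken from the comment. The function names are hypothetical, and sizes are apparent file sizes rather than allocated disk blocks.

```python
import os

# Legacy locations named in comment #6; adjust for the host being audited.
LEGACY_PATHS = [
    "/srv/common-local",
    "/usr/local/apache/common-local",
    "/a/common",
]


def tree_size(path):
    """Sum the apparent size in bytes of regular files under path.

    Symlinked directories are not followed (os.walk default) and
    symlinked files are skipped, so nothing is counted twice.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def stale_targets(paths):
    """Return {path: bytes} for each path that is a real directory.

    Paths that are missing, or that are only symlinks to another
    location, are left out of the report.
    """
    report = {}
    for path in paths:
        if os.path.isdir(path) and not os.path.islink(path):
            report[path] = tree_size(path)
    return report


if __name__ == "__main__":
    for path, size in stale_targets(LEGACY_PATHS).items():
        print(f"{path}: {size / 2**30:.2f} GiB still on disk")
```

Skipping top-level symlinks matters here because /usr/local/apache/common-local pointed at /srv/common-local, so following it would count the same tree twice.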
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #1 from Antoine hashar Musso has...@free.fr ---
deployment-rsync01.eqiad.wmflabs ( https://wikitech.wikimedia.org/wiki/Nova_Resource:I-02f4.eqiad.wmflabs ) is an m1.small with a 20GB disk allocation, partitioned as:

hashar@deployment-rsync01:~$ df -h -x nfs
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                           7.6G  2.1G  5.2G  29% /
udev                                998M   12K  998M   1% /dev
tmpfs                               401M  316K  401M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                               1002M     0 1002M   0% /run/shm
/dev/vda2                           1.9G  647M  1.2G  36% /var
cgroups                            1002M     0 1002M   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  8.5G  6.1G  1.9G  77% /srv

The scap process has most probably filled /srv/ because of the l10n cache, but it has been cleaned up.

I am not sure which files need to be cleaned up. We can do that either in scap itself or in the Jenkins job beta-scap-eqiad.
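A df listing like the one in comment #1 can be scanned automatically for partitions that are close to full. The following is a rough sketch under stated assumptions: the input is the text output of `df -h`, the 75% threshold is an arbitrary choice, and the function names are hypothetical.

```python
def parse_df(output):
    """Parse `df -h` text output into a list of dicts, skipping the header."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) < 6:
            continue  # wrapped or malformed line
        pct = parts[4].rstrip("%")
        if not pct.isdigit():
            continue  # some pseudo filesystems report "-" instead of a number
        rows.append({
            "filesystem": parts[0],
            "size": parts[1],
            "used": parts[2],
            "avail": parts[3],
            "use_pct": int(pct),
            "mount": " ".join(parts[5:]),
        })
    return rows


def over_threshold(output, threshold=75):
    """Return (mount point, use%) pairs at or above the threshold."""
    return [
        (row["mount"], row["use_pct"])
        for row in parse_df(output)
        if row["use_pct"] >= threshold
    ]


if __name__ == "__main__":
    # Sample taken from the df output quoted in comment #1.
    sample = """Filesystem Size Used Avail Use% Mounted on
/dev/vda1 7.6G 2.1G 5.2G 29% /
/dev/vda2 1.9G 647M 1.2G 36% /var
/dev/mapper/vd-second--local--disk 8.5G 6.1G 1.9G 77% /srv
"""
    print(over_threshold(sample))
```

With the sample above, only /srv (77% used) crosses the assumed threshold.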
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

Greg Grossmeier g...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Unprioritized               |Normal

--- Comment #2 from Greg Grossmeier g...@wikimedia.org ---
Let's not make the Jenkins beta-scap-eqiad job very divergent from prod (at all).

<voice=ori>Let's make the Beta Cluster like prod, not make more hacks that are different.</voice>
[Bug 71431] deployment-rsync01 20GB hard drive is too small
https://bugzilla.wikimedia.org/show_bug.cgi?id=71431

--- Comment #3 from Sam Reed (reedy) s...@reedyboy.net ---
How long did it take to break?

I deleted a weird tmp dir, killed the whole cache dir, and re-ran sync-common, which gave ~2G of free space.

I'm wondering if it's a one-off, or if this will break again quickly. In the latter case, we should reinstall it on a larger machine.