https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

Antoine "hashar" Musso <has...@free.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Labs cluster dies daily at  |dumps project overload
                   |roughly 6:30 UTC            |GlusterFS and cause cluster
                   |                            |failure
           Severity|normal                      |major

--- Comment #9 from Antoine "hashar" Musso <has...@free.fr> 2012-05-22 14:05:34 
UTC ---
We just had some kind of outage for the whole cluster. The virtualization
cluster showed load gradually increasing starting at 13:20 UTC:

http://ganglia.wikimedia.org/latest/?r=hour&cs=05%2F22%2F2012+13%3A00+&ce=05%2F22%2F2012+14%3A00+&m=load_report&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

At the same time, the dumps project on labs started showing network activity
which corresponds to I/O activity over NFS:
http://ganglia.wmflabs.org/latest/graph.php?c=dumps&m=network_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F22%2F2012%2011%3A00%20&ce=05%2F22%2F2012%2014%3A00%20&st=1337694997&g=network_report&z=medium&c=dumps

I saw the exact same behavior earlier this morning, when 30 MBytes/s were
output from a datadump host in eqiad and 30 MBytes/s were input into the dumps
project. At the same time, instances were unresponsive.


We need to find a workaround. Some possible solutions:
- get the `dumps` project to use an NFS share on real storage, thus bypassing
GlusterFS
- rate limit network bandwidth between dataset1001 in eqiad and labs
- find a parameter in GlusterFS that will throttle the connection
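
To illustrate the rate-limiting idea (second item above), here is a minimal
Python sketch of a copy loop that caps its average throughput. This is only an
assumption about how a throttle could look on the client side; it is not how
the dumps are currently transferred. The name `throttled_copy` and the
10 MBytes/s cap in the usage note are hypothetical.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, sleeping so the average rate stays at or below the cap.

    Hypothetical helper for illustration; src and dst are binary file-like
    objects. clock and sleep are injectable so the pacing can be tested.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes may have been transferred.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```

For example, `throttled_copy(src, dst, max_bytes_per_sec=10 * 1024 * 1024)`
would keep a single copy at roughly a third of the 30 MBytes/s observed during
the outage. Whether that is low enough to keep GlusterFS responsive would have
to be measured.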

Other ideas?


Changing summary from: "Labs cluster dies daily at roughly 6:30 UTC"
To: "dumps project overload GlusterFS and cause cluster failure"

Raising severity since that makes the cluster unusable from time to time.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
