[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-11-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

--- Comment #14 from Nemo federicol...@tiscali.it 2012-11-10 14:44:21 UTC ---
*** Bug 36997 has been marked as a duplicate of this bug. ***

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-11-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

Nemo federicol...@tiscali.it changed:

What    |Removed |Added
--------+--------+------
Blocks  |        |41967



[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-05-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

Antoine hashar Musso has...@free.fr changed:

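What     |Removed                                      |Added
---------+---------------------------------------------+-----------------------------------------------------------
Summary  |Labs cluster dies daily at roughly 6:30 UTC  |dumps project overload GlusterFS and cause cluster failure
Severity |normal                                       |major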

--- Comment #9 from Antoine hashar Musso has...@free.fr 2012-05-22 14:05:34 UTC ---
We just had an outage affecting the whole cluster. The virtualization
cluster showed load gradually increasing from 13:20 UTC:

http://ganglia.wikimedia.org/latest/?r=hourcs=05%2F22%2F2012+13%3A00+ce=05%2F22%2F2012+14%3A00+m=load_reports=by+namec=Virtualization+cluster+pmtpah=host_regex=max_graphs=0tab=mvn=sh=1z=smallhc=4

At the same time, the dumps project on labs started showing network activity
that corresponds to I/O activity over NFS:
http://ganglia.wmflabs.org/latest/graph.php?c=dumpsm=network_reportr=customs=by%20namehc=4mc=2cs=05%2F22%2F2012%2011%3A00%20ce=05%2F22%2F2012%2014%3A00%20st=1337694997g=network_reportz=mediumc=dumps

I saw the exact same behavior earlier, when 30 MB/s were output from a
datadump host in eqiad and 30 MB/s were input into the dumps project. At the
same time, instances were unresponsive.


We need to find a workaround. Some possible solutions:
- get the `dump` project to use an NFS share on real storage, thus bypassing
GlusterFS
- rate limit network bandwidth between dataset1001 in eqiad and labs
- find a parameter in GlusterFS that will throttle the connection
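
As a rough illustration of the rate-limiting idea in the list above, here is a minimal Python sketch of a copy loop that caps its own throughput. It only shows how a client-side throttle could look and is not something the dumps project is known to run; `throttled_copy`, the chunk size and the bandwidth cap are hypothetical names and values chosen for the example.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   sleep=time.sleep, clock=time.monotonic):
    """Copy src to dst, pausing so the average rate stays under the cap.

    src and dst are binary file-like objects. Returns the bytes copied.
    sleep and clock are injectable so the pacing can be tested without waiting.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes may have been transferred.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied


if __name__ == "__main__":
    # Hypothetical demo: 1 MiB of data capped at 8 MiB/s.
    source = io.BytesIO(b"\0" * (1024 * 1024))
    target = io.BytesIO()
    total = throttled_copy(source, target, max_bytes_per_sec=8 * 1024 * 1024)
    print("copied", total, "bytes")
```

A throttle like this only limits one well-behaved client; it does not protect the shared storage from other instances doing heavy I/O.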

Other ideas?


Changing summary from: Labs cluster dies daily at roughly 6:30 UTC
To: dumps project overload GlusterFS and cause cluster failure

Raising severity since this makes the cluster unusable from time to time.



[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-05-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

Ariel T. Glenn ar...@wikimedia.org changed:

What    |Removed |Added
--------+--------+--------------------
CC      |        |ar...@wikimedia.org

--- Comment #10 from Ariel T. Glenn ar...@wikimedia.org 2012-05-22 14:12:38 UTC ---
There is a Gluster share holding the last 5 good dumps, which is supposed to
be available across all labs instances.  I don't know if it's been made
accessible to the instances yet.  It updates every day at around 4 am UTC.

The point of that is so that no one has to download their own copies of the
dumps to work on them in a labs project (wasting space and bandwidth).



[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-05-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

--- Comment #11 from Antoine hashar Musso has...@free.fr 2012-05-22 14:16:08 UTC ---
Following a discussion with Hydriz, here is what he does:

- rsync the dumps to his instance in /data/project/dumps (which hits GlusterFS)
- upload the dumps to the Internet Archive using curl and their S3 interface

So we are copying the data into GlusterFS just to move it out again afterwards.
I guess the share Ariel describes in the comment above could be a good solution.
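
To make the "skip the intermediate copy" idea concrete, here is a minimal Python sketch that builds an upload request straight from an open file, so a dump could be streamed from a read-only share instead of being rsynced into /data/project/dumps first. It is an illustration under assumptions, not Hydriz's actual script: the endpoint `s3.us.archive.org`, the `LOW key:secret` authorization header and the `x-amz-auto-make-bucket` header are taken from my understanding of the Internet Archive's S3-like interface and should be checked against its documentation; the function name, item name and credentials are hypothetical.

```python
import io
import urllib.parse
import urllib.request

# Assumed endpoint of the Internet Archive's S3-like API.
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(fileobj, size, item, filename, access_key, secret):
    """Return a PUT request that streams fileobj to <item>/<filename>.

    fileobj is read lazily when the request is sent, so the dump never has
    to be staged on another filesystem. size is the file length in bytes and
    is sent as Content-Length so urllib does not fall back to chunked encoding.
    """
    url = "{}/{}/{}".format(
        IA_S3_ENDPOINT,
        urllib.parse.quote(item),
        urllib.parse.quote(filename),
    )
    headers = {
        "Authorization": "LOW {}:{}".format(access_key, secret),  # assumed scheme
        "Content-Length": str(size),
        "x-amz-auto-make-bucket": "1",  # assumed: create the item if missing
    }
    return urllib.request.Request(url, data=fileobj, headers=headers,
                                  method="PUT")


if __name__ == "__main__":
    # Hypothetical demo with in-memory data; nothing is sent over the network.
    payload = io.BytesIO(b"example dump contents")
    req = build_upload_request(payload, len(payload.getvalue()),
                               "example-item", "example.xml.bz2",
                               "ACCESS_KEY", "SECRET")
    print(req.get_method(), req.full_url)
```

In real use the file object would come from `open(path, "rb")` on the shared dump directory, and the request would be passed to `urllib.request.urlopen`.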



[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-05-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

--- Comment #12 from Antoine hashar Musso has...@free.fr 2012-05-22 14:41:11 UTC ---
Hydriz is going to upload to S3 from the copy Ariel is referring to in comment
10.



[Bug 36993] dumps project overload GlusterFS and cause cluster failure

2012-05-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=36993

Antoine hashar Musso has...@free.fr changed:

What       |Removed |Added
-----------+--------+---------
Status     |NEW     |RESOLVED
Resolution |        |FIXED

--- Comment #13 from Antoine hashar Musso has...@free.fr 2012-05-22 14:43:09 UTC ---
Since we have found a workaround for the recent problems, I am closing
this bug.

The root cause is that GlusterFS can be killed by a single instance doing
heavy I/O. That should be tracked in a separate bug.
