Hi all,
we have a 300-core cluster with a ~150 TB shared directory (GPFS). Our
users run genomic analyses that use huge files, which usually cannot
fit on the 500 GB internal HDD of the nodes. As you can imagine, things
sometimes get pretty intense and all the Nagios disk alarms start going
off (the disk "works", but we get 10+ second timeouts).
Knowing that I cannot trust our users to request any "disk_intensive"
parameter/flag, I was considering setting a suspend_threshold on the
queues, watching the shared disk status (e.g. timing an ls on the shared
disk) and suspending jobs when the disk shows, say, a 3-second delay.
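To make that concrete, here is a minimal sketch of a custom load sensor
that times an ls on the shared directory and reports it as a host-level
complex. The complex name (gpfs_lat), the /gpfs/shared path, and
millisecond units are all my own assumptions, not anything SGE provides
out of the box; the complex would first have to be created with
`qconf -mc` and the script registered as a load_sensor in the execd
configuration.

```shell
#!/bin/sh
# Hypothetical SGE load sensor sketch: on each request from sge_execd,
# time an `ls` of the shared GPFS directory and report the latency (ms)
# as an assumed custom complex "gpfs_lat".
# Assumes GNU date (for %N nanoseconds) and a Linux node.

# Time one `ls` of the given directory, print elapsed milliseconds.
measure_latency() {
  start=$(date +%s%N)
  ls "$1" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

SHARED_DIR=${SHARED_DIR:-/gpfs/shared}   # assumption: adjust to your mount
HOST=$(hostname)

# Standard load-sensor loop: execd sends a line per polling interval,
# and "quit" on shutdown; we answer each request with a begin/end report.
while read -r line; do
  [ "$line" = "quit" ] && break
  echo "begin"
  echo "$HOST:gpfs_lat:$(measure_latency "$SHARED_DIR")"
  echo "end"
done
```

The idea would be to let every execution host report its own view of the
shared filesystem's responsiveness, rather than timing ls from a single
monitoring box.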
This would be a nice fix for our issue, but it has some problems: when
there are both "IO-intensive" and "normal" jobs running and the
suspend_threshold kicks in, SGE will start suspending jobs without any
particular criteria, as far as I can tell (I don't know this part), and
lots of innocent "normal" jobs will be suspended across all the nodes
before the disk load stabilizes.
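If it helps the discussion, my current plan (untested) would be queue
settings along these lines, where gpfs_lat is an assumed custom complex
fed by a load sensor; nsuspend and suspend_interval should at least
bound how many jobs get suspended per interval, even if the choice of
victim is still arbitrary:

```
# Hypothetical fragment of `qconf -mq all.q`
suspend_thresholds    gpfs_lat=3000   # assumed complex, milliseconds
nsuspend              1               # suspend at most 1 job per interval
suspend_interval      00:05:00        # re-check every 5 minutes
```

That would throttle the suspensions, but it still would not prefer the
IO-intensive jobs over the innocent ones.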
Does anyone have an idea or workaround for this? Or should I just
ignore/relax all the disk alarms?
Thanks in advance,
Txema
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users