Hi SGE users,
I have been testing the USE_CGROUPS option that is available to execd. With 
USE_CGROUPS enabled, submitting jobs one at a time works fine. But when I 
submitted 70 serial jobs, all queues on all hosts were set to error state. It 
happens after 2 or more jobs have started on a host. The error message says 
that the shepherd exited with return code 7; the shepherd's trace is pasted 
below. Jobs that start successfully have job spool directories owned by the 
gridadmin administrative user (the user SGE runs as), while the spool 
directories of the failed jobs are still owned by root.
If I turn off USE_CGROUPS, everything works OK. There seems to be a race 
condition that is triggered when jobs are started rapidly, but I have not 
been able to figure out what is really happening.
Mikael
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users