Mikael Brandström Durling <[email protected]> writes:

> Hi sge users,
> I have been testing the USE_CGROUPS option that is available to
> execd. When USE_CGROUPS is enabled it works fine to submit jobs one by
> one. But when I submitted 70 serial jobs, all queues on all hosts were
> set to error state. It happens after 2 or more jobs have started on
> the host, and the error message is that the shepherd exited with
> return code 7, and the shepherds trace pasted below. Jobs that
> successfully start have job spool directories owned by the gridadmin
> administrative user (the user SGE runs as), while the spool
> directories of the failed jobs are still owned by root.

I'll look into it, but I don't see how it could be affected by the job
submission rate. Do you really mean that, or is it just from multiple
jobs on one host?

> If I turn off USE_CGROUPS everything works ok. It seems as there is
> some race condition which can be triggered when jobs are started
> rapidly, but I have not been able to figure out really what's
> happening.

It's useful to report bugs on the tracker. (Email is fine.)

--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
