Mikael Brandström Durling <[email protected]> writes:

> Hi sge users,
> I have been testing the USE_CGROUPS option that is available to
> execd. When USE_CGROUPS is enabled it works fine to submit jobs one by
> one. But when I submitted 70 serial jobs, all queues on all hosts were
> set to error state. It happens after 2 or more jobs have started on
> the host, and the error message is that the shepherd exited with
> return code 7; the shepherd's trace is pasted below. Jobs that
> successfully start have job spool directories owned by the gridadmin
> administrative user (the user SGE runs as), while the spool
> directories of the failed jobs are still owned by root.

I'll look into it, but I don't see how it could be affected by job
submission rate.  Do you really mean that, or is it just from multiple
jobs on one host?

> If I turn off USE_CGROUPS everything works ok. It seems as if there is
> some race condition which can be triggered when jobs are started
> rapidly, but I have not been able to figure out what is really
> happening.

It's useful to report bugs on the tracker.  (Email is fine.)

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/

