Mark Dixon <[email protected]> writes:

> I've found that restarting the qmaster makes things worse (with 6.2u5): 
> after a qmaster restart, anything already submitted does not reserve any 
> resources... unless they are qalter'd to something very slightly 
> different (at least with our config).

I guess we need to compare configs to look for clues, unless anyone has
specific debugging suggestions.  I don't know if the developers are
currently listening here.

> It's probably not your problem but, looking at your "qconf -ssconf", I 
> would suggest adding a "DURATION_OFFSET=300" to your "params" section: if 
> you're using tight integration of your parallel jobs, you've probably 
> noticed that jobs linger in the queue for around 5 minutes after they've 
> finished.

Actually, no; you'd hope problems would at least appear consistently.  I
don't know if it's the issue addressed by
https://arc.liv.ac.uk/trac/SGE/changeset/3532/sge (which will be in the
Univa repo too, but harder for me to find).
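
For anyone following along, the suggested change would be made with
"qconf -msconf". As a rough sketch (other scheduler settings omitted, and
assuming nothing else is already set in params), "qconf -ssconf" would
then show something like:

    params                            DURATION_OFFSET=300

The value 300 is just the one suggested above; if params already holds
other settings, they would need to be kept alongside it.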

The only significant patch I'm currently running on top of 6.2u5 is
https://arc.liv.ac.uk/trac/SGE/changeset/3511/sge/source/daemons/qmaster/sge_sched_process_events.c
I don't understand how the bug doesn't bite most sites with
vanilla 6.2u5 (qmaster crashes), but maybe it causes other symptoms in
some circumstances.
