Mark Dixon <[email protected]> writes:

> I've found that restarting the qmaster makes things worse (with 6.2u5):
> after a qmaster restart, anything already submitted does not reserve any
> resources... unless they are qalter'd to something very slightly
> different (at least with our config).

I guess we need to compare configs to look for clues, unless anyone has
specific debugging suggestions. I don't know if the developers are
currently listening here.

> It's probably not your problem but, looking at your "qconf -ssconf", I
> would suggest adding a "DURATION_OFFSET=300" to your "params" section: if
> you're using tight integration of your parallel jobs, you've probably
> noticed that jobs linger in the queue for around 5 minutes after they've
> finished.

Actually, no; you'd hope problems would at least appear consistently.

I don't know if it's the issue addressed by
https://arc.liv.ac.uk/trac/SGE/changeset/3532/sge
(which will be in the Univa repo too, but harder for me to find). The
only significant patch I'm currently running on top of 6.2u5 is
https://arc.liv.ac.uk/trac/SGE/changeset/3511/sge/source/daemons/qmaster/sge_sched_process_events.c

I don't understand how that bug doesn't bite most sites with vanilla
6.2u5 (qmaster crashes), but maybe it causes other symptoms in some
circumstances.
