Jim Phillips <[email protected]> writes:

> That error means that the process launched by qrsh on node09 exited
> before the rest of the slots so qmaster killed everything for you.

It's not just that it exited, but that it failed for some reason.

I'd set the log level to info and look first in the execd messages
file on node09 and in syslog.  If there's nothing useful there, set
KEEP_ACTIVE in execd_params and check the shepherd trace file for the
failed task.
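
For what it's worth, a rough sketch of how those two settings look in
the global configuration (edited with "qconf -mconf").  I'm assuming a
stock setup here: merge KEEP_ACTIVE with whatever is already in your
execd_params rather than replacing the line, and check sge_conf(5) for
your version.

```
loglevel                     log_info
execd_params                 KEEP_ACTIVE=true
```

With KEEP_ACTIVE set, the job's directory under active_jobs in the
execd spool directory on the node isn't removed when the job ends, so
the shepherd's trace file should still be there to look at.  Turn it
off again afterwards, since those directories otherwise accumulate.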

> I see these occasionally even when the parallel run finishes normally
> and exits because the first process to exit may be noticed by qmaster
> before the others.
>
> -Jim
>
> On Mon, 25 Feb 2013, Reuti wrote:
>
>> On 25.02.2013 at 08:03, Britto, Rajesh wrote:
>>
>>> I can see the following error message in the messages file.
>>>
>>> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 
>>> failed - killing job

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/