On 3 Mar 2011, at 12:07, Reuti wrote:
>> Okay, thanks for the info.  I understand how to set it up.  IMHO, we should 
>> probably have this issue addressed in GE, as it can save a lot of debugging 
>> time to quickly know why your job was killed when it was -- if this 
>> information is available to write to the messages file, then I don't see why 
>> it shouldn't be possible for the shepherd to append it to the job stderr.
> 
> It's already an RFE to spot job abortions more easily. But personally I 
> wouldn't like it in the error file (as it's not an error of the started 
> application itself); I'd rather have it go to the email by default. What 
> about a third output to &3, which could be set to any file?

I quite like the &3 idea.  One reason I slightly shy away from emails is that 
they're not a default option (though easy enough for an administrator to 
configure that way), and it may often be more useful to have job context output 
kept with the stdout and stderr files (for example, I often use the 'env' 
command to dump the job environment into my stderr files).
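
For illustration only -- the shepherd-side &3 output is still hypothetical and 
the file name below is made up -- here's roughly how that might look from the 
job-script side, alongside what I do today:

    #!/bin/bash
    #$ -cwd

    # What I do today: dump the job environment into the job's stderr file.
    env >&2

    # Sketch of the proposed idea, done by hand here: open fd 3 on a
    # per-job file and send "context" messages there instead of stderr.
    # With the RFE, the shepherd itself would open &3 and pick the target.
    exec 3>>job_context.$JOB_ID
    echo "job context message" >&3
    exec 3>&-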

> Also a proper entry in the accounting file would help. This could then be 
> done by the qmaster, as in the case of a parallel job across several nodes 
> it could be any of them where the memory limit is exceeded. Then it should 
> be noted in the accounting entry of the job (with the node being mentioned 
> there) and in the one for the `qrsh -inherit ...` too (in case 
> accounting_summary is set to FALSE).

Agreed.  The standard 137 exit code (just 128 + SIGKILL) is not really enough 
on its own.  For the administrator, it could be *very* useful to know how many 
jobs are being killed as a result of exceeding their resource requests, and 
for the user it would make parallel job imbalances much quicker to spot.  In 
my academic environment, this would be particularly useful for educating new 
PhD students! :-)
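
For what it's worth, something along these lines already pulls the bare facts 
out of accounting after the event (the job id is just an example), though it 
still doesn't tell the user *why*:

    # exit_status 137 is 128 + 9 (SIGKILL); maxvmem and hostname help
    # narrow down which node and which limit were involved.
    qacct -j 12345 | grep -E 'exit_status|failed|maxvmem|hostname'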

Chris



