Hi,

On 20.01.2014 at 13:04, Ian Johnson wrote:

> Reuti,
> 
> The directory /opt/capitati/smartmate_data/test is now writable by the 
> smartmate user. Sorry, this was causing the 26 exit status. I'm back to the 
> exit status 11 again. Now, both the o and e files opened in the 
> /opt/capitati/smartmate_data/test directory but are of zero length.

Is the directory in question mounted by autofs/Automounter, or is it a hard NFS 
mount of the user smartmate_data?

- Any prolog on a queue or global level?
- Which user:group owns the created (empty) file?
- How are the users/IDs distributed to the nodes?
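
The ownership and ID questions above can be checked with a small sketch like the following. It is only an illustration: the helper names `owner_ids` and `user_ids` are made up for this example, and it assumes a POSIX shell with `ls`, `awk` and `id` available on each node.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Grid Engine.

# Print the numeric uid:gid that owns a file or directory.
# Numeric IDs are what counts on NFS, since names are resolved per host.
owner_ids() {
    ls -ldn -- "$1" | awk '{ print $3 ":" $4 }'
}

# Print the numeric uid:gid that this host resolves for a user name.
user_ids() {
    printf '%s:%s\n' "$(id -u "$1")" "$(id -g "$1")"
}

# Example calls (path and user name taken from the thread):
#   owner_ids /opt/capitati/smartmate_data/test/job_sm_out.log
#   user_ids smartmate
```

Running both calls on the submit host and on the execution node and comparing the numbers would show whether the smartmate user maps to the same IDs everywhere. For the prolog question, `qconf -sconf` (global configuration) and `qconf -sq <queue>` (queue configuration) each list a `prolog` entry.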

-- Reuti


> The spool directory is in /opt/capitati/ge2011.11/smartmate/spool which is 
> owned by root:root.
> 
> Could you guess as to where the shepherd code is failing using the trace logs 
> I sent last week? I've been looking through the shepherd code but I can't see 
> anything obvious.
> 
> Thanks,
> 
> Ian
> 
> On Mon, 20 Jan 2014 11:52:46 -0000, Reuti <[email protected]> wrote:
> 
>> On 20.01.2014 at 12:11, Ian Johnson wrote:
>> 
>>> Reuti,
>>> 
>>> I have changed the qsub options to write stdout and stderr to an NFS 
>>> mounted directory, and the job script is still not being executed. Now the 
>>> job is exiting, according to the shepherd trace, with exit status 26. This 
>>> time no o and e files are created.
>> 
>> The path /opt/capitati/smartmate_data/test/job_sm_out.log is writable (for 
>> the user) on the node and all directories in the path exist?
>> 
>> BTW: Is the spool directory local on each host (preferable) or in a shared 
>> /opt/capitati/?
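
The path check asked about above could be sketched roughly as follows. The function name `check_output_path` is hypothetical. The sketch assumes it is run on the execution node as the job's user (for example via `su`), because root passes the write test regardless of permissions.

```shell
#!/bin/sh
# Hypothetical helper: verify that the directory holding an output file
# exists and is writable and searchable by the current user.
check_output_path() {
    dir=$(dirname -- "$1")
    probe=$dir
    # Climb towards the root until an existing directory is found,
    # so the report can name the deepest level that is present.
    while [ ! -d "$probe" ] && [ "$probe" != "/" ] && [ "$probe" != "." ]; do
        probe=$(dirname -- "$probe")
    done
    if [ "$probe" != "$dir" ]; then
        echo "missing: $dir (deepest existing directory: $probe)"
        return 1
    fi
    if [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}

# Example call (path taken from the thread):
#   check_output_path /opt/capitati/smartmate_data/test/job_sm_out.log
```

This only tests the permissions seen by the calling process. It does not cover NFS export options such as root squashing, which would need to be checked on the server.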
>> 
>> -- Reuti
>> 
>> 
>>> What does exit status 26 mean? And given the previous behaviour on a local 
>>> disk (job exit status 11), can you think of anything that is preventing the 
>>> non-superuser from executing jobs on execution nodes? This is turning into 
>>> a critical bug for us.
>>> 
>>> Thanks for your continued help,
>>> 
>>> Ian
>>> 
>>> <shepherd_trace>
>>> 01/20/2014 11:02:12 [0:1486]: shepherd called with uid = 0, euid = 0
>>> 01/20/2014 11:02:12 [0:1486]: starting up 2011.11
>>> 01/20/2014 11:02:12 [0:1486]: setpgid(1486, 1486) returned 0
>>> 01/20/2014 11:02:12 [0:1486]: do_core_binding: "binding" parameter not 
>>> found in config file
>>> 01/20/2014 11:02:12 [0:1486]: no prolog script to start
>>> 01/20/2014 11:02:12 [0:1486]: parent: forked "job" with pid 1487
>>> 01/20/2014 11:02:12 [0:1487]: child: starting son(job, 
>>> /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/34, 0);
>>> 01/20/2014 11:02:12 [0:1487]: pid=1487 pgrp=1487 sid=1487 old pgrp=1486 
>>> getlogin()=<no login set>
>>> 01/20/2014 11:02:12 [0:1486]: parent: job-pid: 1487
>>> 01/20/2014 11:02:12 [0:1487]: reading passwd information for user 
>>> 'smartmate'
>>> 01/20/2014 11:02:12 [0:1487]: setosjobid: uid = 0, euid = 0
>>> 01/20/2014 11:02:12 [0:1487]: setting limits
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CPU setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_FSIZE setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_DATA setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_STACK setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CORE setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_RSS setting: (soft 
>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: setting environment
>>> 01/20/2014 11:02:12 [0:1487]: Initializing error file
>>> 01/20/2014 11:02:12 [0:1487]: switching to intermediate/target user
>>> 01/20/2014 11:02:12 [0:1486]: wait3 returned 1487 (status: 6656; 
>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
>>> 01/20/2014 11:02:12 [0:1486]: job exited with exit status 26
>>> 01/20/2014 11:02:12 [0:1486]: reaped "job" with pid 1487
>>> 01/20/2014 11:02:12 [0:1486]: job exited not due to signal
>>> 01/20/2014 11:02:12 [0:1486]: job exited with status 26
>>> 01/20/2014 11:02:12 [0:1486]: now sending signal KILL to pid -1487
>>> 01/20/2014 11:02:12 [0:1486]: writing usage file to "usage"
>>> 01/20/2014 11:02:12 [0:1486]: no tasker to notify
>>> 01/20/2014 11:02:12 [0:1486]: no epilog script to start
>>> </shepherd_trace>
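
As a side note on reading the trace: the raw `status` value printed by `wait3` and the `WEXITSTATUS` value are the same information in different encodings. A minimal sketch, assuming the usual Linux layout (exit code in bits 8 to 15, terminating signal in the low 7 bits); `decode_wait_status` is a made-up name for this example:

```shell
#!/bin/sh
# Hypothetical helper: decode a raw wait status as printed in the trace.
# Simplification: stopped/continued states are not handled.
decode_wait_status() {
    status=$1
    signal=$(( status & 127 ))          # low 7 bits: terminating signal
    code=$(( (status >> 8) & 255 ))     # bits 8-15: exit code
    if [ "$signal" -eq 0 ]; then
        echo "exited with status $code"
    else
        echo "killed by signal $signal"
    fi
}

decode_wait_status 6656   # value from the 01/20/2014 trace
decode_wait_status 2816   # value from the 01/14/2014 trace
```

This prints "exited with status 26" and "exited with status 11", matching the `WEXITSTATUS` fields in the two traces. It confirms that the child exited on its own rather than being killed by a signal.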
>>> 
>>> <job_script>
>>> #!/bin/bash
>>> #
>>> #$ -j y
>>> #$ -o /opt/capitati/smartmate_data/test/job_sm_out.log
>>> #$ -e /opt/capitati/smartmate_data/test/job_sm_err.log
>>> #$ -S /bin/bash
>>> 
>>> echo "Hello World"
>>> echo `date`
>>> </job_script>
>>> 
>>> Ian Johnson
>>> Software Engineer
>>> 
>>> 
>>> Capita Translation and Interpreting
>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 
>>> 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> | [email protected] | Skype ID: ian.johnson_als
>>> www.capitatranslationinterpreting.com
>>> 
>>> 
>>> On 14 January 2014 18:34, Reuti <[email protected]> wrote:
>>> On 14.01.2014 at 18:27, Ian Johnson wrote:
>>> 
>>> > Reuti,
>>> >
>>> > There's no file staging installed. The job script is being copied to the 
>>> > execution host.
>>> 
>>> Correct (for the job script itself).
>>> 
>>> 
>>> > The output file *is* being opened in ~smartmate but it is of zero length.
>>> 
>>> I would assume that it is not created at all in this location, only on 
>>> the nodes. Or do you mean the home directory on the nodes?
>>> 
>>> NB: In Torque there is a file staging for the .o/.e files, but not in SGE.
>>> 
>>> -- Reuti
>>> 
>>> 
>>> > Thanks,
>>> >
>>> > Ian
>>> >
>>> > On Tue, 14 Jan 2014 17:18:06 -0000, Reuti <[email protected]> 
>>> > wrote:
>>> >
>>> >> On 14.01.2014 at 18:04, Ian Johnson wrote:
>>> >>
>>> >>> Reuti,
>>> >>>
>>> >>> There is no output from the script at all in the 
>>> >>> ~smartmate/job.sh.o[0-9]+ files. The home directory of the smartmate 
>>> >>> user is local disk. However, grid engine is installed on an NFS share.
>>> >>
>>> >> Do you have any file staging installed? Otherwise the output will not be 
>>> >> sent to the real home directory of the user. Also the input files could 
>>> >> be missing on the execution host.
>>> >>
>>> >> -- Reuti
>>> >>
>>> >>
>>> >>
>>> >>> Is there other information you require? Is there any way to get the 
>>> >>> function call that is failing in shepherd, e.g. more verbose tracing?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Ian
>>> >>>
>>> >>> On Tue, 14 Jan 2014 15:19:34 -0000, Reuti <[email protected]> 
>>> >>> wrote:
>>> >>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> On 14.01.2014 at 15:19, Ian Johnson wrote:
>>> >>>>
>>> >>>>> I have a simple job, which echoes `date` to stdout, that I'm using to 
>>> >>>>> test an Open Grid Engine installation. Running qsub as root the job 
>>> >>>>> is run successfully. However, using another non-superuser, in this 
>>> >>>>> case smartmate user, the output from qacct -j says that the job has 
>>> >>>>> exited with exit status 11. The shepherd trace confirms this (see 
>>> >>>>> below).
>>> >>>>
>>> >>>> Do you have any output? 11 means "Resource temporarily unavailable", 
>>> >>>> which could mean it can't write to the (mounted?) home directory of 
>>> >>>> the user. How is the mount configured?
>>> >>>>
>>> >>>> AFAICS the user is known, as otherwise you would face a different 
>>> >>>> error.
>>> >>>>
>>> >>>> -- Reuti
>>> >>>>
>>> >>>>
>>> >>>>> Would anyone have an idea as to what is going on? Thank you.
>>> >>>>>
>>> >>>>> <shepherd_trace>
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: shepherd called with uid = 0, euid = 0
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: starting up 2011.11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: setpgid(2723, 2723) returned 0
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: do_core_binding: "binding" parameter 
>>> >>>>> not found in config file
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no prolog script to start
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: forked "job" with pid 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: child: starting son(job, 
>>> >>>>> /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/32, 0);
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: pid=2724 pgrp=2724 sid=2724 old 
>>> >>>>> pgrp=2723 getlogin()=root
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: job-pid: 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: reading passwd information for user 
>>> >>>>> 'smartmate'
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setosjobid: uid = 0, euid = 0
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting limits
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CPU setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_FSIZE setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_DATA setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_STACK setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CORE setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_RSS setting: (soft 
>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>> >>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>> >>>>> 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting environment
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: Initializing error file
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: switching to intermediate/target user
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: wait3 returned 2724 (status: 2816; 
>>> >>>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with exit status 11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: reaped "job" with pid 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited not due to signal
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with status 11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: now sending signal KILL to pid -2724
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: writing usage file to "usage"
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no tasker to notify
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no epilog script to start
>>> >>>>> </shepherd_trace>
>>> >>>>>
>>> >>>>> <job_script>
>>> >>>>> #!/bin/bash
>>> >>>>> #
>>> >>>>> #$ -j y
>>> >>>>> #
>>> >>>>> #$ -S /bin/bash
>>> >>>>>
>>> >>>>> echo "Hello World"
>>> >>>>> echo `date`
>>> >>>>> </job_script>
>>> >>>>>
>>> >>>>> Ian Johnson
>>> >>>>> Software Engineer
>>> >>>>>
>>> >>>>>
>>> >>>>> Capita Translation and Interpreting
>>> >>>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel 
>>> >>>>> (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> >>>>> | [email protected] | Skype ID: ian.johnson_als
>>> >>>>> www.capitatranslationinterpreting.com
>>> >>>>> _______________________________________________
>>> >>>>> users mailing list
>>> >>>>> [email protected]
>>> >>>>> https://gridengine.org/mailman/listinfo/users
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Kind regards,
>>> >>>
>>> >>> Ian Johnson
>>> >>> Software Engineer
>>> >>>
>>> >>> Capita Translation and Interpreting
>>> >>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): 
>>> >>> +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> >>> | [email protected] | Skype ID: ian.johnson_als
>>> >>> www.capitatranslationinterpreting.com
>>> >>
>>> >
>>> >
>>> > --
>>> > Kind regards,
>>> >
>>> > Ian Johnson
>>> > Software Engineer
>>> >
>>> > Capita Translation and Interpreting
>>> > Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): 
>>> > +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> > | [email protected] | Skype ID: ian.johnson_als
>>> > www.capitatranslationinterpreting.com
>>> 
>>> 
>> 
> 
> 
> -- 
> Kind regards,
> 
> Ian Johnson
> Software Engineer
> 
> Capita Translation and Interpreting
> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 
> 845 367 7000 | Tel (US): +1 (800) 579-5010
> | [email protected] | Skype ID: ian.johnson_als
> www.capitatranslationinterpreting.com


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
