Hi,

On 20.01.2014 at 13:04, Ian Johnson wrote:

> Reuti,
>
> The directory /opt/capitati/smartmate_data/test is now writable by the
> smartmate user. Sorry, this was causing the 26 exit status. I'm back to the
> exit status 11 again. Now, both the o and e files are opened in the
> /opt/capitati/smartmate_data/test directory but are of zero length.

Is the directory in question mounted by autofs/Automounter or a hard NFS mount of the user smartmate_data?

- Any prolog on a queue or global level?

- What user:group has the created (empty) file?

- How are the users/IDs distributed to the nodes?

-- Reuti


> The spool directory is in /opt/capitati/ge2011.11/smartmate/spool which is
> owned by root:root.
>
> Could you guess as to where the shepherd code is failing using the trace logs
> I sent last week? I've been looking through the shepherd code but I can't see
> anything obvious.
>
> Thanks,
>
> Ian
>
> On Mon, 20 Jan 2014 11:52:46 -0000, Reuti <[email protected]> wrote:
>
>> On 20.01.2014 at 12:11, Ian Johnson wrote:
>>
>>> Reuti,
>>>
>>> I have changed the qsub options to write stdout and stderr to an NFS
>>> mounted directory, and the job script is still not being executed. Now the
>>> job is exiting, according to the shepherd trace, with exit status 26. This
>>> time no o and e files are created.
>>
>> The path /opt/capitati/smartmate_data/test/job_sm_out.log is writable (for
>> the user) on the node and all directories in the path exist?
>>
>> BTW: Is the spool directory local on each host (preferable) or in a shared
>> /opt/capitati/?
>>
>> -- Reuti
>>
>>
>>> What does exit status 26 mean? And given the previous behaviour on a local
>>> disk (job exit status 11), can you think of anything that is preventing the
>>> non-superuser from executing jobs on execution nodes? This is turning into
>>> a critical bug for us.
>>>
>>> Thanks for your continued help,
>>>
>>> Ian
>>>
>>> <shepherd_trace>
>>> 01/20/2014 11:02:12 [0:1486]: shepherd called with uid = 0, euid = 0
>>> 01/20/2014 11:02:12 [0:1486]: starting up 2011.11
>>> 01/20/2014 11:02:12 [0:1486]: setpgid(1486, 1486) returned 0
>>> 01/20/2014 11:02:12 [0:1486]: do_core_binding: "binding" parameter not found in config file
>>> 01/20/2014 11:02:12 [0:1486]: no prolog script to start
>>> 01/20/2014 11:02:12 [0:1486]: parent: forked "job" with pid 1487
>>> 01/20/2014 11:02:12 [0:1487]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/34, 0);
>>> 01/20/2014 11:02:12 [0:1487]: pid=1487 pgrp=1487 sid=1487 old pgrp=1486 getlogin()=<no login set>
>>> 01/20/2014 11:02:12 [0:1486]: parent: job-pid: 1487
>>> 01/20/2014 11:02:12 [0:1487]: reading passwd information for user 'smartmate'
>>> 01/20/2014 11:02:12 [0:1487]: setosjobid: uid = 0, euid = 0
>>> 01/20/2014 11:02:12 [0:1487]: setting limits
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> 01/20/2014 11:02:12 [0:1487]: setting environment
>>> 01/20/2014 11:02:12 [0:1487]: Initializing error file
>>> 01/20/2014 11:02:12 [0:1487]: switching to intermediate/target user
>>> 01/20/2014 11:02:12 [0:1486]: wait3 returned 1487 (status: 6656; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
>>> 01/20/2014 11:02:12 [0:1486]: job exited with exit status 26
>>> 01/20/2014 11:02:12 [0:1486]: reaped "job" with pid 1487
>>> 01/20/2014 11:02:12 [0:1486]: job exited not due to signal
>>> 01/20/2014 11:02:12 [0:1486]: job exited with status 26
>>> 01/20/2014 11:02:12 [0:1486]: now sending signal KILL to pid -1487
>>> 01/20/2014 11:02:12 [0:1486]: writing usage file to "usage"
>>> 01/20/2014 11:02:12 [0:1486]: no tasker to notify
>>> 01/20/2014 11:02:12 [0:1486]: no epilog script to start
>>> </shepherd_trace>
>>>
>>> <job_script>
>>> #!/bin/bash
>>> #
>>> #$ -j y
>>> #$ -o /opt/capitati/smartmate_data/test/job_sm_out.log
>>> #$ -e /opt/capitati/smartmate_data/test/job_sm_err.log
>>> #$ -S /bin/bash
>>>
>>> echo "Hello World"
>>> echo `date`
>>> </job_script>
>>>
>>> Ian Johnson
>>> Software Engineer
>>>
>>>
>>> Capita Translation and Interpreting
>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44
>>> 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> | [email protected] | Skype ID: ian.johnson_als
>>> www.capitatranslationinterpreting.com
>>>
>>>
>>> On 14 January 2014 18:34, Reuti <[email protected]> wrote:
>>> On 14.01.2014 at 18:27, Ian Johnson wrote:
>>>
>>> > Reuti,
>>> >
>>> > There's no file staging installed. The job script is being copied to the
>>> > execution host.
>>>
>>> Correct (for the job script itself).
>>>
>>>
>>> > The output file *is* being opened in ~smartmate but it is of zero length.
>>>
>>> I would assume that it is not created at all in this location, only on
>>> the nodes. Or do you mean the home directory on the nodes?
>>>
>>> NB: In Torque there is a file staging for the .o/.e files, but not in SGE.
>>>
>>> -- Reuti
>>>
>>>
>>> > Thanks,
>>> >
>>> > Ian
>>> >
>>> > On Tue, 14 Jan 2014 17:18:06 -0000, Reuti <[email protected]>
>>> > wrote:
>>> >
>>> >> On 14.01.2014 at 18:04, Ian Johnson wrote:
>>> >>
>>> >>> Reuti,
>>> >>>
>>> >>> There is no output from the script at all in the
>>> >>> ~smartmate/job.sh.o[0-9]+ files. The home directory of the smartmate
>>> >>> user is local disk. However, grid engine is installed on an NFS share.
>>> >>
>>> >> Do you have any file staging installed? Otherwise the output will not be
>>> >> sent to the real home directory of the user. Also the input files could
>>> >> be missing on the execution host.
>>> >>
>>> >> -- Reuti
>>> >>
>>> >>
>>> >>
>>> >>> Is there other information you require? Is there any way to get the
>>> >>> function call that is failing in shepherd, e.g. more verbose tracing?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Ian
>>> >>>
>>> >>> On Tue, 14 Jan 2014 15:19:34 -0000, Reuti <[email protected]>
>>> >>> wrote:
>>> >>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> On 14.01.2014 at 15:19, Ian Johnson wrote:
>>> >>>>
>>> >>>>> I have a simple job, which echoes `date` to stdout, that I'm using to
>>> >>>>> test an Open Grid Engine installation. Running qsub as root the job
>>> >>>>> is run successfully.
>>> >>>>> However, using another non-superuser, in this
>>> >>>>> case smartmate user, the output from qacct -j says that the job has
>>> >>>>> exited with exit status 11. The shepherd trace confirms this (see
>>> >>>>> below).
>>> >>>>
>>> >>>> Do you have any output? 11 means "Resource temporarily unavailable",
>>> >>>> which could mean it can't write to the (mounted?) home directory of
>>> >>>> the user. How is the mount configured?
>>> >>>>
>>> >>>> AFAICS the user is known, as otherwise you would face a different
>>> >>>> error.
>>> >>>>
>>> >>>> -- Reuti
>>> >>>>
>>> >>>>
>>> >>>>> Would anyone have an idea as to what is going on? Thank you.
>>> >>>>>
>>> >>>>> <shepherd_trace>
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: shepherd called with uid = 0, euid = 0
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: starting up 2011.11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: setpgid(2723, 2723) returned 0
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: do_core_binding: "binding" parameter not found in config file
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no prolog script to start
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: forked "job" with pid 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/32, 0);
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: pid=2724 pgrp=2724 sid=2724 old pgrp=2723 getlogin()=root
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: job-pid: 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: reading passwd information for user 'smartmate'
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setosjobid: uid = 0, euid = 0
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting limits
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting environment
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: Initializing error file
>>> >>>>> 01/14/2014 14:08:56 [0:2724]: switching to intermediate/target user
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: wait3 returned 2724 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with exit status 11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: reaped "job" with pid 2724
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited not due to signal
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with status 11
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: now sending signal KILL to pid -2724
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: writing usage file to "usage"
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no tasker to notify
>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no epilog script to start
>>> >>>>> </shepherd_trace>
>>> >>>>>
>>> >>>>> <job_script>
>>> >>>>> #!/bin/bash
>>> >>>>> #
>>> >>>>> #$ -j y
>>> >>>>> #
>>> >>>>> #$ -S /bin/bash
>>> >>>>>
>>> >>>>> echo "Hello World"
>>> >>>>> echo `date`
>>> >>>>> </job_script>
>>> >>>>>
>>> >>>>> Ian Johnson
>>> >>>>> Software Engineer
>>> >>>>> _______________________________________________
>>> >>>>> users mailing list
>>> >>>>> [email protected]
>>> >>>>> https://gridengine.org/mailman/listinfo/users
>>> >>>>
>>> >>>
>>> >>> --
>>> >>> Kind regards,
>>> >>>
>>> >>> Ian Johnson
>>> >>> Software Engineer
>>> >>
>>> >
>>> > --
>>> > Kind regards,
>>> >
>>> > Ian Johnson
>>> > Software Engineer
>>>
>>
>
> --
> Kind regards,
>
> Ian Johnson
> Software Engineer
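
The checks asked for in this thread (which user:group owns the empty output file, and whether the output directory is writable for the job user) could be sketched in shell roughly as follows. This is only an illustration and was not posted by either participant: describe_path is a made-up helper name, and stat -c assumes GNU coreutils on a Linux execution host.

```shell
# Sketch only: describe_path is a hypothetical helper, not part of Grid Engine.
# Assumes GNU coreutils (stat -c), as found on typical Linux nodes.
describe_path() {
    p="$1"
    if [ ! -e "$p" ]; then
        echo "missing $p"
        return 1
    fi
    # %U:%G = owner:group, %a = octal mode, %s = size in bytes, %n = name
    stat -c '%U:%G %a %s %n' "$p"
    # -w tests writability for the user running this function
    if [ -w "$p" ]; then
        echo "writable $p"
    else
        echo "not-writable $p"
    fi
}
```

It would have to be run on the execution host as the job user (smartmate here), for example against /opt/capitati/smartmate_data/test and the zero-length output file. Comparing the output of `id smartmate` and `getent passwd smartmate` on each node would show whether the UID and GID match across hosts.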

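
Both traces print a raw wait status next to the decoded exit status (6656 with WEXITSTATUS 26, 2816 with WEXITSTATUS 11). As a small sketch of how the two numbers relate, assuming the conventional Linux encoding (exit status in bits 8 to 15, terminating signal in the low 7 bits) and ignoring stopped processes, the raw value can be decoded in shell. decode_wait_status is a made-up name for illustration, not a Grid Engine tool.

```shell
# Sketch only: decode a raw wait status as printed in the shepherd trace.
# Assumes the conventional Linux layout; stopped/continued states are ignored.
decode_wait_status() {
    status="$1"
    sig=$(( status & 127 ))          # low 7 bits: terminating signal, 0 if none
    code=$(( (status >> 8) & 255 ))  # next 8 bits: exit status
    if [ "$sig" -eq 0 ]; then
        echo "exited $code"
    else
        echo "signaled $sig"
    fi
}

decode_wait_status 6656   # trace from 01/20/2014, prints "exited 26"
decode_wait_status 2816   # trace from 01/14/2014, prints "exited 11"
```

This only confirms that the child process called exit with 26 or 11 and was not killed by a signal; what those codes mean inside the shepherd is a separate question that the decoding does not answer.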