On 20.01.2014 at 17:08, Ian Johnson wrote:

> Reuti,
>
> Inline...
>
> On Mon, 20 Jan 2014 14:43:51 -0000, Reuti <[email protected]> wrote:
>
>> Hi,
>>
>> On 20.01.2014 at 13:04, Ian Johnson wrote:
>>
>>> Reuti,
>>>
>>> The directory /opt/capitati/smartmate_data/test is now writable by the
>>> smartmate user. Sorry, this was causing the exit status 26. I'm back to
>>> the exit status 11 again. Now, both the .o and .e files are created in
>>> the /opt/capitati/smartmate_data/test directory, but they are of zero
>>> length.
>>
>> Is the directory in question mounted by autofs/the automounter, or is it
>> a hard NFS mount of smartmate_data?
>
> This is mounted at boot time from the fstab file.
>
>>
>> - Any prolog on a queue or global level?
>
> No prolog on any queue or at the global level.
>
>> - What user:group has the created (empty) file?
>
> smartmate:smartmate
>
>> - How are the users/IDs distributed to the nodes?
>
> Users and groups are created locally on each exec node and the master
> node. User and group names have identical IDs.
>
>
> This problem seems to stem from the fact that the smartmate user is *not*
> a superuser. I think it's a problem when the UID and GID are changed in
> the shepherd in order to run the job script.

But in principle "smartmate" can write at this location? To investigate
further, you could define a small prolog running as a) root and then
b) smartmate and write something. Does it show the same behavior? (Minimal
sketches of such a test prolog and of a direct write check are appended at
the end of this message.)

-- Reuti

> Thanks,
>
> Ian
>
>>
>> -- Reuti
>>
>>
>>> The spool directory is in /opt/capitati/ge2011.11/smartmate/spool, which
>>> is owned by root:root.
>>>
>>> Could you guess where the shepherd code is failing, using the trace logs
>>> I sent last week? I've been looking through the shepherd code but I
>>> can't see anything obvious.
>>>
>>> Thanks,
>>>
>>> Ian
>>>
>>> On Mon, 20 Jan 2014 11:52:46 -0000, Reuti <[email protected]>
>>> wrote:
>>>
>>>> On 20.01.2014 at 12:11, Ian Johnson wrote:
>>>>
>>>>> Reuti,
>>>>>
>>>>> I have changed the qsub options to write stdout and stderr to an
>>>>> NFS-mounted directory, and the job script is still not being executed.
>>>>> Now the job is exiting, according to the shepherd trace, with exit
>>>>> status 26. This time no .o and .e files are created.
>>>>
>>>> Is the path /opt/capitati/smartmate_data/test/job_sm_out.log writable
>>>> (for the user) on the node, and do all directories in the path exist?
>>>>
>>>> BTW: Is the spool directory local on each host (preferable) or in a
>>>> shared /opt/capitati/?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> What does exit status 26 mean? And given the previous behaviour on a
>>>>> local disk (job exit status 11), can you think of anything that is
>>>>> preventing the non-superuser from executing jobs on the execution
>>>>> nodes? This is turning into a critical bug for us.
>>>>>
>>>>> Thanks for your continued help,
>>>>>
>>>>> Ian
>>>>>
>>>>> <shepherd_trace>
>>>>> 01/20/2014 11:02:12 [0:1486]: shepherd called with uid = 0, euid = 0
>>>>> 01/20/2014 11:02:12 [0:1486]: starting up 2011.11
>>>>> 01/20/2014 11:02:12 [0:1486]: setpgid(1486, 1486) returned 0
>>>>> 01/20/2014 11:02:12 [0:1486]: do_core_binding: "binding" parameter not found in config file
>>>>> 01/20/2014 11:02:12 [0:1486]: no prolog script to start
>>>>> 01/20/2014 11:02:12 [0:1486]: parent: forked "job" with pid 1487
>>>>> 01/20/2014 11:02:12 [0:1487]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/34, 0);
>>>>> 01/20/2014 11:02:12 [0:1487]: pid=1487 pgrp=1487 sid=1487 old pgrp=1486 getlogin()=<no login set>
>>>>> 01/20/2014 11:02:12 [0:1486]: parent: job-pid: 1487
>>>>> 01/20/2014 11:02:12 [0:1487]: reading passwd information for user 'smartmate'
>>>>> 01/20/2014 11:02:12 [0:1487]: setosjobid: uid = 0, euid = 0
>>>>> 01/20/2014 11:02:12 [0:1487]: setting limits
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: setting environment
>>>>> 01/20/2014 11:02:12 [0:1487]: Initializing error file
>>>>> 01/20/2014 11:02:12 [0:1487]: switching to intermediate/target user
>>>>> 01/20/2014 11:02:12 [0:1486]: wait3 returned 1487 (status: 6656; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited with exit status 26
>>>>> 01/20/2014 11:02:12 [0:1486]: reaped "job" with pid 1487
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited not due to signal
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited with status 26
>>>>> 01/20/2014 11:02:12 [0:1486]: now sending signal KILL to pid -1487
>>>>> 01/20/2014 11:02:12 [0:1486]: writing usage file to "usage"
>>>>> 01/20/2014 11:02:12 [0:1486]: no tasker to notify
>>>>> 01/20/2014 11:02:12 [0:1486]: no epilog script to start
>>>>> </shepherd_trace>
>>>>>
>>>>> <job_script>
>>>>> #!/bin/bash
>>>>> #
>>>>> #$ -j y
>>>>> #$ -o /opt/capitati/smartmate_data/test/job_sm_out.log
>>>>> #$ -e /opt/capitati/smartmate_data/test/job_sm_err.log
>>>>> #$ -S /bin/bash
>>>>>
>>>>> echo "Hello World"
>>>>> echo `date`
>>>>> </job_script>
>>>>>
>>>>> Ian Johnson
>>>>> Software Engineer
>>>>>
>>>>>
>>>>> Capita Translation and Interpreting
>>>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> | [email protected] | Skype ID: ian.johnson_als
>>>>> www.capitatranslationinterpreting.com
>>>>>
>>>>>
>>>>> On 14 January 2014 18:34, Reuti <[email protected]> wrote:
>>>>> On 14.01.2014 at 18:27, Ian Johnson wrote:
>>>>>
>>>>> > Reuti,
>>>>> >
>>>>> > There's no file staging installed. The job script is being copied to
>>>>> > the execution host.
>>>>>
>>>>> Correct (for the job script itself).
>>>>>
>>>>>
>>>>> > The output file *is* being opened in ~smartmate but it is of zero
>>>>> > length.
>>>>>
>>>>> I would assume that it is not created at all in this location, only on
>>>>> the nodes. Or do you mean the home directory on the nodes?
>>>>>
>>>>> NB: In Torque there is file staging for the .o/.e files, but not in SGE.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>> > Thanks,
>>>>> >
>>>>> > Ian
>>>>> >
>>>>> > On Tue, 14 Jan 2014 17:18:06 -0000, Reuti <[email protected]>
>>>>> > wrote:
>>>>> >
>>>>> >> On 14.01.2014 at 18:04, Ian Johnson wrote:
>>>>> >>
>>>>> >>> Reuti,
>>>>> >>>
>>>>> >>> There is no output from the script at all in the
>>>>> >>> ~smartmate/job.sh.o[0-9]+ files. The home directory of the smartmate
>>>>> >>> user is on local disk. However, Grid Engine is installed on an NFS
>>>>> >>> share.
>>>>> >>
>>>>> >> Do you have any file staging installed? Otherwise the output will not
>>>>> >> be sent to the real home directory of the user. Also the input files
>>>>> >> could be missing on the execution host.
>>>>> >>
>>>>> >> -- Reuti
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>> Is there other information you require? Is there any way to get the
>>>>> >>> function call that is failing in the shepherd, e.g. with more
>>>>> >>> verbose tracing?
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>>
>>>>> >>> Ian
>>>>> >>>
>>>>> >>> On Tue, 14 Jan 2014 15:19:34 -0000, Reuti
>>>>> >>> <[email protected]> wrote:
>>>>> >>>
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> On 14.01.2014 at 15:19, Ian Johnson wrote:
>>>>> >>>>
>>>>> >>>>> I have a simple job, which echoes `date` to stdout, that I'm using
>>>>> >>>>> to test an Open Grid Engine installation. Running qsub as root,
>>>>> >>>>> the job runs successfully. However, when using a non-superuser, in
>>>>> >>>>> this case the smartmate user, the output from qacct -j says that
>>>>> >>>>> the job has exited with exit status 11. The shepherd trace
>>>>> >>>>> confirms this (see below).
>>>>> >>>>
>>>>> >>>> Do you have any output? 11 means "Resource temporarily unavailable",
>>>>> >>>> which could mean it can't write to the (mounted?) home directory of
>>>>> >>>> the user. How is the mount configured?
>>>>> >>>>
>>>>> >>>> AFAICS the user is known, as otherwise you would face a different
>>>>> >>>> error.
>>>>> >>>>
>>>>> >>>> -- Reuti
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>> Would anyone have an idea as to what is going on? Thank you.
>>>>> >>>>>
>>>>> >>>>> <shepherd_trace>
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: shepherd called with uid = 0, euid = 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: starting up 2011.11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: setpgid(2723, 2723) returned 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: do_core_binding: "binding" parameter not found in config file
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no prolog script to start
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: forked "job" with pid 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/32, 0);
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: pid=2724 pgrp=2724 sid=2724 old pgrp=2723 getlogin()=root
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: job-pid: 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: reading passwd information for user 'smartmate'
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setosjobid: uid = 0, euid = 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting limits
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting environment
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: Initializing error file
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: switching to intermediate/target user
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: wait3 returned 2724 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with exit status 11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: reaped "job" with pid 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited not due to signal
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with status 11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: now sending signal KILL to pid -2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: writing usage file to "usage"
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no tasker to notify
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no epilog script to start
>>>>> >>>>> </shepherd_trace>
>>>>> >>>>>
>>>>> >>>>> <job_script>
>>>>> >>>>> #!/bin/bash
>>>>> >>>>> #
>>>>> >>>>> #$ -j y
>>>>> >>>>> #
>>>>> >>>>> #$ -S /bin/bash
>>>>> >>>>>
>>>>> >>>>> echo "Hello World"
>>>>> >>>>> echo `date`
>>>>> >>>>> </job_script>
>>>>> >>>>>
>>>>> >>>>> Ian Johnson
>>>>> >>>>> Software Engineer
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Capita Translation and Interpreting
>>>>> >>>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> >>>>> | [email protected] | Skype ID: ian.johnson_als
>>>>> >>>>> www.capitatranslationinterpreting.com
>>>>> >>>>> _______________________________________________
>>>>> >>>>> users mailing list
>>>>> >>>>> [email protected]
>>>>> >>>>> https://gridengine.org/mailman/listinfo/users
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Kind regards,
>>>>> >>>
>>>>> >>> Ian Johnson
>>>>> >>> Software Engineer
>>>>> >>>
>>>>> >>> Capita Translation and Interpreting
>>>>> >>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> >>> | [email protected] | Skype ID: ian.johnson_als
>>>>> >>> www.capitatranslationinterpreting.com
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Kind regards,
>>>>> >
>>>>> > Ian Johnson
>>>>> > Software Engineer
>>>>> >
>>>>> > Capita Translation and Interpreting
>>>>> > Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> > | [email protected] | Skype ID: ian.johnson_als
>>>>> > www.capitatranslationinterpreting.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Ian Johnson
>>> Software Engineer
>>>
>>> Capita Translation and Interpreting
>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> | [email protected] | Skype ID: ian.johnson_als
>>> www.capitatranslationinterpreting.com
>>
>
>
> --
> Kind regards,
>
> Ian Johnson
> Software Engineer
>
> Capita Translation and Interpreting
> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
> | [email protected] | Skype ID: ian.johnson_als
> www.capitatranslationinterpreting.com

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
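
A minimal sketch of the kind of test prolog suggested above. The queue name
(all.q) and script location (/opt/capitati/test_prolog.sh) are assumptions
chosen for illustration, not taken from this thread:

<test_prolog>
#!/bin/bash
# Probe: record which user the prolog runs as and whether it can write
# into the NFS-mounted test directory discussed above.
DIR=/opt/capitati/smartmate_data/test        # directory under investigation
OUT="$DIR/prolog_probe.$(hostname -s).$$"    # one probe file per host/run

{
  echo "prolog ran on $(hostname) at $(date)"
  echo "user: $(id -un)  uid: $(id -u)  gid: $(id -g)  groups: $(id -Gn)"
} > "$OUT" 2> "/tmp/prolog_probe.$$.err"

# Always exit 0 so the probe itself never puts the job into an error state.
exit 0
</test_prolog>

Assuming the usual [user@]path syntax for the prolog attribute in
queue_conf(5), it could be attached to the queue first as root and then as
smartmate:

qconf -mattr queue prolog "root@/opt/capitati/test_prolog.sh" all.q
# submit a test job, inspect the probe file, then switch the user:
qconf -mattr queue prolog "smartmate@/opt/capitati/test_prolog.sh" all.q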
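
Likewise, whether smartmate can really write at this location can be checked
directly on an execution node, independently of Grid Engine. A sketch,
assuming root access on the node (the probe file name is arbitrary):

# on an execution node, as root:
id smartmate    # compare the numeric uid/gid with the other nodes and the NFS server
su - smartmate -c 'touch /opt/capitati/smartmate_data/test/write_probe && echo OK'
ls -ln /opt/capitati/smartmate_data/test/write_probe    # owner should show smartmate's numeric uid/gid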
