On 20.01.2014 at 17:08, Ian Johnson wrote:

> Reuti,
> 
> Inline...
> 
> On Mon, 20 Jan 2014 14:43:51 -0000, Reuti <[email protected]> wrote:
> 
>> Hi,
>> 
>> On 20.01.2014 at 13:04, Ian Johnson wrote:
>> 
>>> Reuti,
>>> 
>>> The directory /opt/capitati/smartmate_data/test is now writable by the 
>>> smartmate user. Sorry, that was causing the exit status 26. I'm back to 
>>> exit status 11 again. Now both the o and e files are created in the 
>>> /opt/capitati/smartmate_data/test directory, but they are of zero length.
>> 
>> Is the directory in question mounted by autofs/automounter, or is 
>> smartmate_data a hard NFS mount?
> 
> This is mounted at boot-time from the fstab file.
> 
>> 
>> - Any prolog at the queue or global level?
> 
> No prolog on any queue or at the global level.
> 
>> - What user:group owns the created (empty) file?
> 
> smartmate:smartmate
> 
>> - How are the users/IDs distributed to the nodes?
> 
> Users and groups are created locally on each exec node and on the master 
> node. The user and group IDs are identical on every node.
> 
> 
> This problem seems to stem from the fact that the smartmate user is *not* a 
> superuser. I think it goes wrong when the shepherd changes the UID and GID 
> in order to run the job script.

But in principle "smartmate" can write to this location?
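
A quick sanity check on the exec node, assuming you have root access there, 
could be something like the following (the probe file name is just an example):

  su - smartmate -c "date > /opt/capitati/smartmate_data/test/write_probe"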

To investigate further, you could define a small prolog that runs a) as root 
and then b) as smartmate and writes something, e.g. like the sketch below. 
Does it show the same behavior?
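
A minimal probe prolog could look roughly like this (the script path and probe 
file name are only examples; as far as I remember JOB_ID is available in the 
prolog's environment):

<prolog_script>
#!/bin/sh
# record which user/host the prolog runs as and try to write into the target dir
{ id; hostname; date; } > /opt/capitati/smartmate_data/test/prolog_probe.$JOB_ID 2>&1
exit 0
</prolog_script>

It can be attached per queue with "qconf -mq <queue>": for a) set 
"prolog root@/path/to/prolog_probe.sh", for b) just 
"prolog /path/to/prolog_probe.sh", which should then run as the job owner.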

-- Reuti


> Thanks,
> 
> Ian
> 
>> 
>> -- Reuti
>> 
>> 
>>> The spool directory is in /opt/capitati/ge2011.11/smartmate/spool which is 
>>> owned by root:root.
>>> 
>>> Could you guess where the shepherd code is failing from the trace logs I 
>>> sent last week? I've been looking through the shepherd code but I can't 
>>> see anything obvious.
>>> 
>>> Thanks,
>>> 
>>> Ian
>>> 
>>> On Mon, 20 Jan 2014 11:52:46 -0000, Reuti <[email protected]> 
>>> wrote:
>>> 
>>>> On 20.01.2014 at 12:11, Ian Johnson wrote:
>>>> 
>>>>> Reuti,
>>>>> 
>>>>> I have changed the qsub options to write stdout and stderr to an 
>>>>> NFS-mounted directory, and the job script is still not being executed. 
>>>>> Now the job is exiting, according to the shepherd trace, with exit 
>>>>> status 26. This time no o and e files are created at all.
>>>> 
>>>> Is the path /opt/capitati/smartmate_data/test/job_sm_out.log writable (for 
>>>> the user) on the node, and do all directories in the path exist?
>>>> 
>>>> BTW: Is the spool directory local on each host (preferable) or in a shared 
>>>> /opt/capitati/?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> What does exit status 26 mean? And given the previous behaviour on a 
>>>>> local disk (job exit status 11), can you think of anything that is 
>>>>> preventing the non-superuser from executing jobs on execution nodes? This 
>>>>> is turning into a critical bug for us.
>>>>> 
>>>>> Thanks for your continued help,
>>>>> 
>>>>> Ian
>>>>> 
>>>>> <shepherd_trace>
>>>>> 01/20/2014 11:02:12 [0:1486]: shepherd called with uid = 0, euid = 0
>>>>> 01/20/2014 11:02:12 [0:1486]: starting up 2011.11
>>>>> 01/20/2014 11:02:12 [0:1486]: setpgid(1486, 1486) returned 0
>>>>> 01/20/2014 11:02:12 [0:1486]: do_core_binding: "binding" parameter not 
>>>>> found in config file
>>>>> 01/20/2014 11:02:12 [0:1486]: no prolog script to start
>>>>> 01/20/2014 11:02:12 [0:1486]: parent: forked "job" with pid 1487
>>>>> 01/20/2014 11:02:12 [0:1487]: child: starting son(job, 
>>>>> /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/34, 0);
>>>>> 01/20/2014 11:02:12 [0:1487]: pid=1487 pgrp=1487 sid=1487 old pgrp=1486 
>>>>> getlogin()=<no login set>
>>>>> 01/20/2014 11:02:12 [0:1486]: parent: job-pid: 1487
>>>>> 01/20/2014 11:02:12 [0:1487]: reading passwd information for user 
>>>>> 'smartmate'
>>>>> 01/20/2014 11:02:12 [0:1487]: setosjobid: uid = 0, euid = 0
>>>>> 01/20/2014 11:02:12 [0:1487]: setting limits
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CPU setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_FSIZE setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_DATA setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_STACK setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_CORE setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: RLIMIT_RSS setting: (soft 
>>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) 
>>>>> resulting: (soft 18446744073709551615(INFINITY), hard 
>>>>> 18446744073709551615(INFINITY))
>>>>> 01/20/2014 11:02:12 [0:1487]: setting environment
>>>>> 01/20/2014 11:02:12 [0:1487]: Initializing error file
>>>>> 01/20/2014 11:02:12 [0:1487]: switching to intermediate/target user
>>>>> 01/20/2014 11:02:12 [0:1486]: wait3 returned 1487 (status: 6656; 
>>>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited with exit status 26
>>>>> 01/20/2014 11:02:12 [0:1486]: reaped "job" with pid 1487
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited not due to signal
>>>>> 01/20/2014 11:02:12 [0:1486]: job exited with status 26
>>>>> 01/20/2014 11:02:12 [0:1486]: now sending signal KILL to pid -1487
>>>>> 01/20/2014 11:02:12 [0:1486]: writing usage file to "usage"
>>>>> 01/20/2014 11:02:12 [0:1486]: no tasker to notify
>>>>> 01/20/2014 11:02:12 [0:1486]: no epilog script to start
>>>>> </shepherd_trace>
>>>>> 
>>>>> <job_script>
>>>>> #!/bin/bash
>>>>> #
>>>>> #$ -j y
>>>>> #$ -o /opt/capitati/smartmate_data/test/job_sm_out.log
>>>>> #$ -e /opt/capitati/smartmate_data/test/job_sm_err.log
>>>>> #$ -S /bin/bash
>>>>> 
>>>>> echo "Hello World"
>>>>> echo `date`
>>>>> </job_script>
>>>>> 
>>>>> Ian Johnson
>>>>> Software Engineer
>>>>> 
>>>>> 
>>>>> Capita Translation and Interpreting
>>>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): 
>>>>> +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> | [email protected] | Skype ID: ian.johnson_als
>>>>> www.capitatranslationinterpreting.com
>>>>> 
>>>>> 
>>>>> On 14 January 2014 18:34, Reuti <[email protected]> wrote:
>>>>> On 14.01.2014 at 18:27, Ian Johnson wrote:
>>>>> 
>>>>> > Reuti,
>>>>> >
>>>>> > There's no file staging installed. The job script is being copied to 
>>>>> > the execution host.
>>>>> 
>>>>> Correct (for the job script itself).
>>>>> 
>>>>> 
>>>>> > The output file *is* being opened in ~smartmate but it is of zero 
>>>>> > length.
>>>>> 
>>>>> I would assume that it is not created at all in this location, only on 
>>>>> the nodes. Or do you mean the home directory on the nodes?
>>>>> 
>>>>> NB: In Torque there is a file staging for the .o/.e files, but not in SGE.
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>> > Thanks,
>>>>> >
>>>>> > Ian
>>>>> >
>>>>> > On Tue, 14 Jan 2014 17:18:06 -0000, Reuti <[email protected]> 
>>>>> > wrote:
>>>>> >
>>>>> >> Am 14.01.2014 um 18:04 schrieb Ian Johnson:
>>>>> >>
>>>>> >>> Reuti,
>>>>> >>>
>>>>> >>> There is no output from the script at all in the 
>>>>> >>> ~smartmate/job.sh.o[0-9]+ files. The home directory of the smartmate 
>>>>> >>> user is on local disk. However, Grid Engine is installed on an NFS 
>>>>> >>> share.
>>>>> >>
>>>>> >> Do you have any file staging installed? Otherwise the output will not 
>>>>> >> be sent to the real home directory of the user. The input files could 
>>>>> >> also be missing on the execution host.
>>>>> >>
>>>>> >> -- Reuti
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>> Is there any other information you require? Is there any way to see 
>>>>> >>> which function call is failing in the shepherd, e.g. more verbose 
>>>>> >>> tracing?
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>>
>>>>> >>> Ian
>>>>> >>>
>>>>> >>> On Tue, 14 Jan 2014 15:19:34 -0000, Reuti 
>>>>> >>> <[email protected]> wrote:
>>>>> >>>
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Am 14.01.2014 um 15:19 schrieb Ian Johnson:
>>>>> >>>>
>>>>> >>>>> I have a simple job, which echoes `date` to stdout, that I'm using 
>>>>> >>>>> to test an Open Grid Engine installation. When qsub is run as root, 
>>>>> >>>>> the job runs successfully. However, when submitting as a 
>>>>> >>>>> non-superuser, in this case the smartmate user, the output from 
>>>>> >>>>> qacct -j says that the job has exited with exit status 11. The 
>>>>> >>>>> shepherd trace confirms this (see below).
>>>>> >>>>
>>>>> >>>> Do you have any output? 11 means "Resource temporarily unavailable", 
>>>>> >>>> which could mean it can't write to the (mounted?) home directory of 
>>>>> >>>> the user. How is the mount configured?
>>>>> >>>>
>>>>> >>>> AFAICS the user is known, as otherwise you would face a different 
>>>>> >>>> error.
>>>>> >>>>
>>>>> >>>> -- Reuti
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>> Would anyone have an idea as to what is going on? Thank you.
>>>>> >>>>>
>>>>> >>>>> <shepherd_trace>
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: shepherd called with uid = 0, euid = 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: starting up 2011.11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: setpgid(2723, 2723) returned 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: do_core_binding: "binding" parameter 
>>>>> >>>>> not found in config file
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no prolog script to start
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: forked "job" with pid 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: child: starting son(job, 
>>>>> >>>>> /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/32, 0);
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: pid=2724 pgrp=2724 sid=2724 old 
>>>>> >>>>> pgrp=2723 getlogin()=root
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: parent: job-pid: 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: reading passwd information for user 
>>>>> >>>>> 'smartmate'
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setosjobid: uid = 0, euid = 0
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting limits
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CPU setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_FSIZE setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_DATA setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_STACK setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CORE setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_RSS setting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 
>>>>> >>>>> 18446744073709551615(INFINITY)) resulting: (soft 
>>>>> >>>>> 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: setting environment
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: Initializing error file
>>>>> >>>>> 01/14/2014 14:08:56 [0:2724]: switching to intermediate/target user
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: wait3 returned 2724 (status: 2816; 
>>>>> >>>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with exit status 11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: reaped "job" with pid 2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited not due to signal
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: job exited with status 11
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: now sending signal KILL to pid -2724
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: writing usage file to "usage"
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no tasker to notify
>>>>> >>>>> 01/14/2014 14:08:56 [0:2723]: no epilog script to start
>>>>> >>>>> </shepherd_trace>
>>>>> >>>>>
>>>>> >>>>> <job_script>
>>>>> >>>>> #!/bin/bash
>>>>> >>>>> #
>>>>> >>>>> #$ -j y
>>>>> >>>>> #
>>>>> >>>>> #$ -S /bin/bash
>>>>> >>>>>
>>>>> >>>>> echo "Hello World"
>>>>> >>>>> echo `date`
>>>>> >>>>> </job_script>
>>>>> >>>>>
>>>>> >>>>> Ian Johnson
>>>>> >>>>> Software Engineer
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Capita Translation and Interpreting
>>>>> >>>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel 
>>>>> >>>>> (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> >>>>> | [email protected] | Skype ID: ian.johnson_als
>>>>> >>>>> www.capitatranslationinterpreting.com
>>>>> >>>>> _______________________________________________
>>>>> >>>>> users mailing list
>>>>> >>>>> [email protected]
>>>>> >>>>> https://gridengine.org/mailman/listinfo/users
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Kind regards,
>>>>> >>>
>>>>> >>> Ian Johnson
>>>>> >>> Software Engineer
>>>>> >>>
>>>>> >>> Capita Translation and Interpreting
>>>>> >>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel 
>>>>> >>> (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> >>> | [email protected] | Skype ID: ian.johnson_als
>>>>> >>> www.capitatranslationinterpreting.com
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Kind regards,
>>>>> >
>>>>> > Ian Johnson
>>>>> > Software Engineer
>>>>> >
>>>>> > Capita Translation and Interpreting
>>>>> > Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): 
>>>>> > +44 845 367 7000 | Tel (US): +1 (800) 579-5010
>>>>> > | [email protected] | Skype ID: ian.johnson_als
>>>>> > www.capitatranslationinterpreting.com
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Kind regards,
>>> 
>>> Ian Johnson
>>> Software Engineer
>>> 
>>> Capita Translation and Interpreting
>>> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 
>>> 845 367 7000 | Tel (US): +1 (800) 579-5010
>>> | [email protected] | Skype ID: ian.johnson_als
>>> www.capitatranslationinterpreting.com
>> 
> 
> 
> -- 
> Kind regards,
> 
> Ian Johnson
> Software Engineer
> 
> Capita Translation and Interpreting
> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 
> 845 367 7000 | Tel (US): +1 (800) 579-5010
> | [email protected] | Skype ID: ian.johnson_als
> www.capitatranslationinterpreting.com


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
