Reuti,

The directory /opt/capitati/smartmate_data/test is now writable by the smartmate user. Sorry, that was what caused exit status 26. I'm now back to exit status 11. Both the o and e files are now created in the /opt/capitati/smartmate_data/test directory, but they are of zero length.

The spool directory is /opt/capitati/ge2011.11/smartmate/spool, which is owned by root:root.

Could you guess as to where the shepherd code is failing using the trace logs I sent last week? I've been looking through the shepherd code but I can't see anything obvious.
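To narrow down the failing step from a trace, it can help to pull out the last message the forked "job" child wrote before the parent reaped it. A minimal sketch, assuming trace lines of the form "MM/DD/YYYY HH:MM:SS [uid:pid]: message" as in the traces in this thread; the function name last_child_message and the trace file argument are made up for this example:

```shell
#!/bin/bash
# Print the last message logged by the forked "job" child in a shepherd trace.
# Assumption: every line carries a "[uid:pid]:" tag, as in the quoted traces.
last_child_message() {
    awk '
        # The parent announces the child, e.g.: parent: forked "job" with pid 1487
        /forked "job" with pid/ { child = $NF }
        # Remember the most recent line whose [uid:pid] tag names that child.
        child != "" && index($0, ":" child "]:") { last = $0 }
        END { if (last != "") print last }
    ' "$1"
}

# Usage (the path to the trace file is whatever your spool layout uses):
#   last_child_message /path/to/trace
```

For both traces in this thread the result would be the "switching to intermediate/target user" line, so whatever fails happens after that message is written and before the job script starts.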

Thanks,

Ian

On Mon, 20 Jan 2014 11:52:46 -0000, Reuti <[email protected]> wrote:

Am 20.01.2014 um 12:11 schrieb Ian Johnson:

Reuti,

I have changed the qsub options to write stdout and stderr to an NFS-mounted directory, and the job script is still not being executed. Now the job is exiting, according to the shepherd trace, with exit status 26. This time no o and e files are created.

Is the path /opt/capitati/smartmate_data/test/job_sm_out.log writable (for the user) on the node, and do all directories in the path exist?

BTW: Is the spool directory local on each host (preferable) or in a shared /opt/capitati/?
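One way to answer that question on the execution node is to walk the path from the top and report the first directory that is missing, not searchable, or not writable. A minimal sketch, assuming it is run as the job owner (smartmate here) on the node itself; the function name check_output_path is made up for this example:

```shell
#!/bin/bash
# Check every directory leading to an output file, top-down.
# The body runs in a subshell "( ... )" so the IFS and globbing changes stay local.
check_output_path() (
    dir=$(dirname "$1")
    case $dir in
        /*) ;;
        *) echo "need an absolute path: $1"; return 2 ;;
    esac
    current=""
    IFS=/      # split the directory on slashes
    set -f     # do not glob-expand path components
    for part in $dir; do
        [ -n "$part" ] || continue
        current="$current/$part"
        if [ ! -d "$current" ]; then echo "missing directory: $current"; return 1; fi
        if [ ! -x "$current" ]; then echo "no search permission: $current"; return 1; fi
    done
    if [ ! -w "$dir" ]; then echo "not writable: $dir"; return 1; fi
    echo "ok: $dir"
)

# Usage, as the job owner on the execution node:
#   check_output_path /opt/capitati/smartmate_data/test/job_sm_out.log
```

Note that root passes the -x and -w tests regardless of the permission bits, so the result only means something when the check runs as the unprivileged user.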

-- Reuti


What does exit status 26 mean? And given the previous behaviour on a local disk (job exit status 11), can you think of anything that is preventing the non-superuser from executing jobs on execution nodes? This is turning into a critical bug for us.

Thanks for your continued help,

Ian

<shepherd_trace>
01/20/2014 11:02:12 [0:1486]: shepherd called with uid = 0, euid = 0
01/20/2014 11:02:12 [0:1486]: starting up 2011.11
01/20/2014 11:02:12 [0:1486]: setpgid(1486, 1486) returned 0
01/20/2014 11:02:12 [0:1486]: do_core_binding: "binding" parameter not found in config file
01/20/2014 11:02:12 [0:1486]: no prolog script to start
01/20/2014 11:02:12 [0:1486]: parent: forked "job" with pid 1487
01/20/2014 11:02:12 [0:1487]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/34, 0);
01/20/2014 11:02:12 [0:1487]: pid=1487 pgrp=1487 sid=1487 old pgrp=1486 getlogin()=<no login set>
01/20/2014 11:02:12 [0:1486]: parent: job-pid: 1487
01/20/2014 11:02:12 [0:1487]: reading passwd information for user 'smartmate'
01/20/2014 11:02:12 [0:1487]: setosjobid: uid = 0, euid = 0
01/20/2014 11:02:12 [0:1487]: setting limits
01/20/2014 11:02:12 [0:1487]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
01/20/2014 11:02:12 [0:1487]: setting environment
01/20/2014 11:02:12 [0:1487]: Initializing error file
01/20/2014 11:02:12 [0:1487]: switching to intermediate/target user
01/20/2014 11:02:12 [0:1486]: wait3 returned 1487 (status: 6656; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
01/20/2014 11:02:12 [0:1486]: job exited with exit status 26
01/20/2014 11:02:12 [0:1486]: reaped "job" with pid 1487
01/20/2014 11:02:12 [0:1486]: job exited not due to signal
01/20/2014 11:02:12 [0:1486]: job exited with status 26
01/20/2014 11:02:12 [0:1486]: now sending signal KILL to pid -1487
01/20/2014 11:02:12 [0:1486]: writing usage file to "usage"
01/20/2014 11:02:12 [0:1486]: no tasker to notify
01/20/2014 11:02:12 [0:1486]: no epilog script to start
</shepherd_trace>
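
The wait3 line in the trace reports both the raw status (6656) and the decoded exit code (26). On Linux the raw status carries the exit code in its second byte, so 6656 is 26 * 256. A minimal sketch of that arithmetic, assuming the usual Linux status layout (exit code in bits 8-15, terminating signal in bits 0-6); the function name decode_wait_status is made up for this example:

```shell
#!/bin/bash
# Decode a raw wait status the way WIFSIGNALED/WEXITSTATUS do on Linux.
# Assumption: exit code in bits 8-15, terminating signal in bits 0-6.
# Stopped/continued states are not handled in this sketch.
decode_wait_status() {
    status=$1
    sig=$(( status & 0x7f ))
    if [ "$sig" -eq 0 ]; then
        echo "exited with status $(( (status >> 8) & 0xff ))"
    else
        echo "killed by signal $sig"
    fi
}

decode_wait_status 6656   # prints: exited with status 26
decode_wait_status 2816   # prints: exited with status 11
```

The second call uses the status from the trace of 14 January, which decodes to the exit status 11 seen there.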

<job_script>
#!/bin/bash
#
#$ -j y
#$ -o /opt/capitati/smartmate_data/test/job_sm_out.log
#$ -e /opt/capitati/smartmate_data/test/job_sm_err.log
#$ -S /bin/bash

echo "Hello World"
echo `date`
</job_script>

Ian Johnson
Software Engineer


Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com


On 14 January 2014 18:34, Reuti <[email protected]> wrote:
Am 14.01.2014 um 18:27 schrieb Ian Johnson:

> Reuti,
>
> There's no file staging installed. The job script is being copied to the execution host.

Correct (for the job script itself).


> The output file *is* being opened in ~smartmate but it is of zero length.

I would assume that it is not created at all in this location, only on the nodes. Or do you mean the home directory on the nodes?

NB: In Torque there is a file staging for the .o/.e files, but not in SGE.

-- Reuti


> Thanks,
>
> Ian
>
> On Tue, 14 Jan 2014 17:18:06 -0000, Reuti <[email protected]> wrote:
>
>> Am 14.01.2014 um 18:04 schrieb Ian Johnson:
>>
>>> Reuti,
>>>
>>> There is no output from the script at all in the ~smartmate/job.sh.o[0-9]+ files. The home directory of the smartmate user is local disk. However, grid engine is installed on an NFS share.
>>
>> Do you have any file staging installed? Otherwise the output will not be sent to the real home directory of the user. Also the input files could be missing on the execution host.
>>
>> -- Reuti
>>
>>
>>
>>> Is there other information you require? Is there any way to get the function call that is failing in shepherd, e.g. more verbose tracing?
>>>
>>> Thanks,
>>>
>>> Ian
>>>
>>> On Tue, 14 Jan 2014 15:19:34 -0000, Reuti <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Am 14.01.2014 um 15:19 schrieb Ian Johnson:
>>>>
>>>>> I have a simple job, which echoes `date` to stdout, that I'm using to test an Open Grid Engine installation. Running qsub as root the job is run successfully. However, using another non-superuser, in this case smartmate user, the output from qacct -j says that the job has exited with exit status 11. The shepherd trace confirms this (see below).
>>>>
>>>> Do you have any output? 11 means "Resource temporarily unavailable", which could mean it can't write to the (mounted?) home directory of the user. How is the mount configured?
>>>>
>>>> AFAICS the user is known, as otherwise you would face a different error.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Would anyone have an idea as to what is going on? Thank you.
>>>>>
>>>>> <shepherd_trace>
>>>>> 01/14/2014 14:08:56 [0:2723]: shepherd called with uid = 0, euid = 0
>>>>> 01/14/2014 14:08:56 [0:2723]: starting up 2011.11
>>>>> 01/14/2014 14:08:56 [0:2723]: setpgid(2723, 2723) returned 0
>>>>> 01/14/2014 14:08:56 [0:2723]: do_core_binding: "binding" parameter not found in config file
>>>>> 01/14/2014 14:08:56 [0:2723]: no prolog script to start
>>>>> 01/14/2014 14:08:56 [0:2723]: parent: forked "job" with pid 2724
>>>>> 01/14/2014 14:08:56 [0:2724]: child: starting son(job, /opt/capitati/ge2011.11/smartmate/spool/exec-1/job_scripts/32, 0);
>>>>> 01/14/2014 14:08:56 [0:2724]: pid=2724 pgrp=2724 sid=2724 old pgrp=2723 getlogin()=root
>>>>> 01/14/2014 14:08:56 [0:2723]: parent: job-pid: 2724
>>>>> 01/14/2014 14:08:56 [0:2724]: reading passwd information for user 'smartmate'
>>>>> 01/14/2014 14:08:56 [0:2724]: setosjobid: uid = 0, euid = 0
>>>>> 01/14/2014 14:08:56 [0:2724]: setting limits
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_STACK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_CORE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
>>>>> 01/14/2014 14:08:56 [0:2724]: setting environment
>>>>> 01/14/2014 14:08:56 [0:2724]: Initializing error file
>>>>> 01/14/2014 14:08:56 [0:2724]: switching to intermediate/target user
>>>>> 01/14/2014 14:08:56 [0:2723]: wait3 returned 2724 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
>>>>> 01/14/2014 14:08:56 [0:2723]: job exited with exit status 11
>>>>> 01/14/2014 14:08:56 [0:2723]: reaped "job" with pid 2724
>>>>> 01/14/2014 14:08:56 [0:2723]: job exited not due to signal
>>>>> 01/14/2014 14:08:56 [0:2723]: job exited with status 11
>>>>> 01/14/2014 14:08:56 [0:2723]: now sending signal KILL to pid -2724
>>>>> 01/14/2014 14:08:56 [0:2723]: writing usage file to "usage"
>>>>> 01/14/2014 14:08:56 [0:2723]: no tasker to notify
>>>>> 01/14/2014 14:08:56 [0:2723]: no epilog script to start
>>>>> </shepherd_trace>
>>>>>
>>>>> <job_script>
>>>>> #!/bin/bash
>>>>> #
>>>>> #$ -j y
>>>>> #
>>>>> #$ -S /bin/bash
>>>>>
>>>>> echo "Hello World"
>>>>> echo `date`
>>>>> </job_script>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>>>
>>>
>>>
>>
>
>




