On Wed, Jun 27, 2012 at 1:26 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> The job scripts need to be readable by each individual user who is running
> the job at execution time. What permissions did you put on the execd's spool
> directory? For binary jobs it will work (unless you want to use $PE_HOSTFILE
> or the like).

Yes, good point. I did not test it with a full job; I only ran a quick &
dirty "qrsh -b y sleep 10" (to see whether it complains or not).

Actually, binary jobs do not work either (no error message is displayed,
but the job does not run properly). I believe that some of the job setup
in the shepherd runs as the user (no time to sit down right now and
verify that...).

Rayson



>
> -- Reuti
>
>
>> Rayson
>>
>>
>>
>>
>>
>> On Wed, Jun 27, 2012 at 12:51 PM, CB <cbalw...@gmail.com> wrote:
>>> I was able to figure out how to fix the errors shown below. By
>>> implementing Rayson's and Dave's recommendations, I was able to harden the
>>> file permissions on the job owner's spooled files as well as on $TMP.  The
>>> one remaining to-do is the trace file, which is owned by the job owner and
>>> is still world-readable.
>>>
>>> For those who are interested in how I fixed the errors below, here is the
>>> diff of reaper_execd.c:
>>>
>>> Fixed issue: abnormal termination of shepherd for job 6.1: no "exit_status"
>>> file
>>>
>>> Fixed issue: can't open file active_jobs/6.1/error: Permission denied
>>>
>>> Index: daemons/execd/reaper_execd.c
>>> ===================================================================
>>> --- daemons/execd/reaper_execd.c      (revision 5)
>>> +++ daemons/execd/reaper_execd.c      (working copy)
>>> @@ -498,6 +498,7 @@
>>>      */
>>>     sge_get_active_job_file_path(&fname,
>>>                                  job_id, ja_task_id, pe_task_id, "exit_status");
>>> +   sge_switch2start_user();
>>>     if (!(fp = fopen(sge_dstring_get_string(&fname), "r"))) {
>>>        /*
>>>         * we trust the exit status of the shepherd if it exited regularly
>>> @@ -509,6 +510,7 @@
>>>        else
>>>           failed = SSTATE_BEFORE_PROLOG;
>>>
>>> +      sge_switch2admin_user();
>>>        sprintf(error, MSG_STATUS_ABNORMALTERMINATIONOFSHEPHERDFORJOBXY_S,
>>>                job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring));
>>>        ERROR((SGE_EVENT, error));
>>> @@ -521,6 +523,7 @@
>>>        int fscanf_count, shepherd_exit_status_file;
>>>
>>>        fscanf_count = fscanf(fp, "%d", &shepherd_exit_status_file);
>>> +      sge_switch2admin_user();
>>>        FCLOSE_IGNORE_ERROR(fp);
>>>        if (fscanf_count != 1) {
>>>           sprintf(error, MSG_STATUS_ABNORMALTERMINATIONFOSHEPHERDFORJOBXYEXITSTATEFILEISEMPTY_S,
>>> @@ -564,6 +567,7 @@
>>>     /* look for error file this overrules errors found yet */
>>>     sge_get_active_job_file_path(&fname,
>>>                                  job_id, ja_task_id, pe_task_id, "error");
>>> +   sge_switch2start_user();
>>>     if ((fp = fopen(sge_dstring_get_string(&fname), "r"))) {
>>>        int n;
>>>        char *new_line;
>>> @@ -575,17 +579,21 @@
>>>           /* ensure only first line of error file is in 'error' */
>>>           if ((new_line=strchr(error, '\n')))
>>>              *new_line = '\0';
>>> +         sge_switch2admin_user();
>>>           DPRINTF(("ERRORFILE: %256s\n", error));
>>>        }
>>>        else if (feof(fp)) {
>>> +         sge_switch2admin_user();
>>>           DPRINTF(("empty error file\n"));
>>>        } else {
>>> +         sge_switch2admin_user();
>>>           ERROR((SGE_EVENT, MSG_JOB_CANTREADERRORFILEFORJOBXY_S,
>>>              job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring)));
>>>        }
>>>        FCLOSE_IGNORE_ERROR(fp);
>>>     }
>>>     else {
>>> +      sge_switch2admin_user();
>>>        ERROR((SGE_EVENT, MSG_FILE_NOOPEN_SS, sge_dstring_get_string(&fname), strerror(errno)));
>>>        /* There is no error file. */
>>>     }
>>>
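The pattern used in the patch (switch the effective user before opening a file, then switch back on every exit path) can be sketched in isolation. This is only a minimal illustration built on plain POSIX seteuid(), not Grid Engine's implementation: read_int_as_user is a hypothetical helper, and it assumes the process is allowed to switch to the requested UID, which in practice means it was started as root (as execd typically is).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Grid Engine: read one integer from `path`
 * while running with effective UID `uid`, then restore the previous
 * effective UID. Returns 0 on success, -1 on any failure. */
static int read_int_as_user(const char *path, uid_t uid, int *value)
{
    uid_t saved = geteuid();   /* the identity to return to afterwards */
    FILE *fp;
    int n = 0;                 /* stays 0 if the file cannot be opened */

    assert(path != NULL && value != NULL);

    if (seteuid(uid) != 0)
        return -1;

    fp = fopen(path, "r");
    if (fp != NULL) {
        n = fscanf(fp, "%d", value);
        fclose(fp);
    }

    /* Restore the original identity whether or not the read succeeded,
     * mirroring the switch-back calls on every branch of the patch. */
    if (seteuid(saved) != 0)
        return -1;

    return (n == 1) ? 0 : -1;
}
```

Doing the restore in one place after the read keeps the number of switch-back calls down; the patch instead adds one call per branch because it has to fit into the existing control flow of reaper_execd.c.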
>>>
>>> Regards,
>>> - Chansup
>>>
>>> On Fri, Jun 22, 2012 at 10:59 AM, CB <cbalw...@gmail.com> wrote:
>>>>
>>>> I tried the workaround suggested in the ticket, but it failed when a job
>>>> exited: the error state file in the spool directory could not be updated.
>>>> Using umask(027) instead of umask(022) changes the file permissions on
>>>> some of the files in the execd spool directory that are owned by the job
>>>> owner.
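
The effect of the two umask values described above can be checked with a small standalone function. This is a sketch, not Grid Engine code: create_with_umask is a hypothetical helper, and it assumes the file is created with a requested mode of 0666, as fopen() does for new files.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with a requested mode of 0666 under the
 * given umask and return the permission bits the file actually received,
 * or (mode_t)-1 on error. The file is removed again before returning. */
static mode_t create_with_umask(const char *path, mode_t mask)
{
    struct stat st;
    mode_t old, result;
    int fd;

    assert(path != NULL);

    unlink(path);                 /* the mode only applies to new files */
    old = umask(mask);
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0666);
    umask(old);                   /* restore the caller's umask */
    if (fd < 0)
        return (mode_t)-1;

    result = (fstat(fd, &st) == 0) ? (st.st_mode & 0777) : (mode_t)-1;
    close(fd);
    unlink(path);
    return result;
}
```

With a mask of 022 the new file ends up as 0644 (rw-r--r--); with 027 it ends up as 0640 (rw-r-----), so a process running as a different user outside the file's group can no longer read it.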
>>>>
>>>> Interestingly, not all of them are affected by umask(027), as shown below:
>>>>
>>>> [CH21778@d-7-55 d-7-55]$ pwd
>>>> /opt/llogs/default/spool/d-7-55
>>>> [CH21778@d-7-55 d-7-55]$ find -ls
>>>> 1627040    4 drwxr-xr-x   5 sge      sge          4096 Jun 22 09:34 .
>>>> 1627042    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./jobs
>>>> 1627043    4 drwxr-xr-x   7 sge      sge          4096 Jun 22 10:48 ./active_jobs
>>>> 1627050    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./active_jobs/6.1
>>>> 1627056    4 -rw-r--r--   1 sge      sge          2063 Jun 22 10:48 ./active_jobs/6.1/environment
>>>> 1627074    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/pid
>>>> 1627053    4 -rw-r--r--   1 CH21778  CH21778      3498 Jun 22 10:49 ./active_jobs/6.1/trace
>>>> 1627078    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/job_pid
>>>> 1627086    4 -rw-r-----   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/addgrpid
>>>> 1627105    0 -rw-r-----   1 CH21778  CH21778         0 Jun 22 10:48 ./active_jobs/6.1/error
>>>> 1627048    4 -rw-r--r--   1 sge      sge           305 Jun 22 10:49 ./active_jobs/6.1/usage
>>>> 1627055    4 -rw-r--r--   1 sge      sge            32 Jun 22 10:48 ./active_jobs/6.1/pe_hostfile
>>>> 1627061    4 -rw-r--r--   1 sge      sge          1902 Jun 22 10:48 ./active_jobs/6.1/config
>>>> 1627106    4 -rw-r-----   1 CH21778  CH21778         2 Jun 22 10:49 ./active_jobs/6.1/exit_status
>>>>
>>>> Then, at the end of job execution, it tried to update the error file but
>>>> failed because of the file permissions, as recorded in the execd messages
>>>> file:
>>>>
>>>> 06/22/2012 10:49:26|  main|d-7-55|E|abnormal termination of shepherd for job 6.1: no "exit_status" file
>>>> 06/22/2012 10:49:26|  main|d-7-55|E|can't open file active_jobs/6.1/error: Permission denied
>>>>
>>>> So it appears that the error and exit_status files are updated later by
>>>> the GE admin user (sge), and that this fails because of the file
>>>> permissions.
>>>> Any suggestions?
>>>>
>>>> Regards,
>>>> - Chansup
>>>>
>>>> On Thu, Jun 21, 2012 at 6:15 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>>>>>
>>>>> CB <cbalw...@gmail.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using the GE2011.11 release.
>>>>>>
>>>>>> When a job is dispatched to a node, it creates a $TMP directory, which
>>>>>> is usually located under /tmp on the execution host. The current file
>>>>>> permission on $TMP is 755.  I would like to change it to 750.  Can
>>>>>> anyone point me to the file I should modify?  I thought asking might be
>>>>>> quicker than searching through the source code.
>>>>>
>>>>> https://arc.liv.ac.uk/trac/SGE/ticket/109
>>>>>
>>>>> The relevant code is actually in sge_exec_job (in recent versions?).  I
>>>>> haven't got round to seeing if configuring the various umasks will break
>>>>> anything, particularly if it's controlled by a single parameter.  (The
>>>>> permission on the job spool is actually the most interesting.)
>>>>>
>>>>> --
>>>>> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>>
>
