On 27.06.2012 at 19:14, Rayson Ho wrote:

> I didn't know that you wanted to harden the spool files as well (that
> requirement was not in your original email), but can't you just
> change the permissions of the execd spool directory? I just did a quick
> experiment and jobs seem to run fine.
> 
> With the permissions set on the execd local spool dir, you don't
> even need to worry about the trace file.

The job scripts need to be readable at execution time by each individual user
who is running a job. What permissions did you put on the execd's spool
directory? For binary jobs it will work (unless you want to use $PE_HOSTFILE or
the like, which point into the spool directory as well).

-- Reuti


> Rayson
> 
> On Wed, Jun 27, 2012 at 12:51 PM, CB <cbalw...@gmail.com> wrote:
>> I was able to figure out how to fix the errors shown below. By
>> implementing Rayson's and Dave's recommendations, I was able to harden the
>> file permissions on the job owner's spooled files as well as $TMP. The one
>> remaining to-do is the trace file, which is owned by the job owner and is
>> still world-readable.
>> 
>> For those who are interested in how I fixed the errors below, here is the
>> diff of reaper_execd.c:
>> 
>> Fixed issue: abnormal termination of shepherd for job 6.1: no "exit_status" file
>> 
>> Fixed issue: can't open file active_jobs/6.1/error: Permission denied
>> 
>> Index: daemons/execd/reaper_execd.c
>> ===================================================================
>> --- daemons/execd/reaper_execd.c      (revision 5)
>> +++ daemons/execd/reaper_execd.c      (working copy)
>> @@ -498,6 +498,7 @@
>>      */
>>     sge_get_active_job_file_path(&fname,
>>                                  job_id, ja_task_id, pe_task_id, "exit_status");
>> +   sge_switch2start_user();
>>     if (!(fp = fopen(sge_dstring_get_string(&fname), "r"))) {
>>        /*
>>         * we trust the exit status of the shepherd if it exited regularly
>> @@ -509,6 +510,7 @@
>>        else
>>           failed = SSTATE_BEFORE_PROLOG;
>> 
>> +      sge_switch2admin_user();
>>        sprintf(error, MSG_STATUS_ABNORMALTERMINATIONOFSHEPHERDFORJOBXY_S,
>>                job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring));
>>        ERROR((SGE_EVENT, error));
>> @@ -521,6 +523,7 @@
>>        int fscanf_count, shepherd_exit_status_file;
>> 
>>        fscanf_count = fscanf(fp, "%d", &shepherd_exit_status_file);
>> +      sge_switch2admin_user();
>>        FCLOSE_IGNORE_ERROR(fp);
>>        if (fscanf_count != 1) {
>>           sprintf(error, MSG_STATUS_ABNORMALTERMINATIONFOSHEPHERDFORJOBXYEXITSTATEFILEISEMPTY_S,
>> @@ -564,6 +567,7 @@
>>     /* look for error file this overrules errors found yet */
>>     sge_get_active_job_file_path(&fname,
>>                                  job_id, ja_task_id, pe_task_id, "error");
>> +   sge_switch2start_user();
>>     if ((fp = fopen(sge_dstring_get_string(&fname), "r"))) {
>>        int n;
>>        char *new_line;
>> @@ -575,17 +579,21 @@
>>           /* ensure only first line of error file is in 'error' */
>>           if ((new_line=strchr(error, '\n')))
>>              *new_line = '\0';
>> +         sge_switch2admin_user();
>>           DPRINTF(("ERRORFILE: %256s\n", error));
>>        }
>>        else if (feof(fp)) {
>> +         sge_switch2admin_user();
>>           DPRINTF(("empty error file\n"));
>>        } else {
>> +         sge_switch2admin_user();
>>           ERROR((SGE_EVENT, MSG_JOB_CANTREADERRORFILEFORJOBXY_S,
>>              job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring)));
>>        }
>>        FCLOSE_IGNORE_ERROR(fp);
>>     }
>>     else {
>> +      sge_switch2admin_user();
>>        ERROR((SGE_EVENT, MSG_FILE_NOOPEN_SS, sge_dstring_get_string(&fname), strerror(errno)));
>>        /* There is no error file. */
>>     }
>> 
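
Stripped of the Grid Engine specifics, the pattern in the patch looks roughly like the sketch below. It is an illustration only: `read_status_as` is a hypothetical helper, and it uses plain seteuid() where the patch calls sge_switch2start_user() and sge_switch2admin_user(), whose internals are not shown in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a single integer from `path` while running with effective UID
 * `as_uid`, then restore the previous effective UID.
 * Returns 0 on success, -1 if the switch, the open, or the parse failed.
 * Hypothetical helper: not part of Grid Engine.
 */
static int read_status_as(uid_t as_uid, const char *path, int *status)
{
    uid_t saved = geteuid();
    FILE *fp;
    int rc = -1;

    assert(path != NULL && status != NULL);

    /* Raise privileges only if we are not already running as as_uid. */
    if (as_uid != saved && seteuid(as_uid) != 0)
        return -1;

    fp = fopen(path, "r");
    if (fp != NULL) {
        /* Same check as in the patch: exactly one integer must be read. */
        if (fscanf(fp, "%d", status) == 1)
            rc = 0;
        fclose(fp);
    }

    /* Drop back to the original effective UID on every path. */
    if (as_uid != saved && seteuid(saved) != 0)
        return -1;

    return rc;
}
```

Restoring the effective UID in one place after the read, rather than on each branch as in the diff, makes it harder to leave a path on which the process keeps elevated privileges. Switching back and forth like this only works when the process was started with sufficient privileges (for example as root), which is assumed here.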
>> 
>> Regards,
>> - Chansup
>> 
>> On Fri, Jun 22, 2012 at 10:59 AM, CB <cbalw...@gmail.com> wrote:
>>> 
>>> I tried the workaround suggested in the ticket, but it failed when a job
>>> exited: the error state file in the spool directory could not be updated.
>>> Using umask(027) instead of umask(022) changes the permissions on
>>> some of the files in the execd spool directory that are owned by the job
>>> owner.
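
For reference, the effect of the two masks on a freshly created file can be checked in isolation. The sketch below is only an illustration and not code from Grid Engine: `mode_under_umask` is a hypothetical helper that creates a scratch file under /tmp with a requested mode and reports the permission bits that result.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a scratch file with mode `requested` while `mask` is the process
 * umask, and return the permission bits that were actually applied.
 * Returns (mode_t)-1 on failure. The previous umask is restored.
 * Hypothetical helper: not part of Grid Engine.
 */
static mode_t mode_under_umask(mode_t mask, mode_t requested)
{
    char path[] = "/tmp/umask_demo_XXXXXX";
    struct stat st;
    mode_t result = (mode_t)-1;
    mode_t old_mask;
    int fd;

    assert((requested & ~(mode_t)07777) == 0);

    /* mkstemp() is used only to obtain a unique name; the file is removed
     * again so that the open() below is the call that creates it. */
    fd = mkstemp(path);
    if (fd < 0)
        return result;
    close(fd);
    unlink(path);

    old_mask = umask(mask);
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, requested);
    umask(old_mask);
    if (fd < 0)
        return result;

    if (fstat(fd, &st) == 0)
        result = st.st_mode & 07777;
    close(fd);
    unlink(path);
    return result;
}
```

With a requested mode of 0666, umask(022) yields 0644 and umask(027) yields 0640. A file created under a different umask() call, or chmod'ed afterwards, would not follow the new mask; whether that explains the mixed result here is an assumption, not something this thread confirms.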
>>> 
>>> Interestingly, not all of them are affected by umask(027), as shown below:
>>> 
>>> [CH21778@d-7-55 d-7-55]$ pwd
>>> /opt/llogs/default/spool/d-7-55
>>> [CH21778@d-7-55 d-7-55]$ find -ls
>>> 1627040    4 drwxr-xr-x   5 sge      sge          4096 Jun 22 09:34 .
>>> 1627042    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./jobs
>>> 1627043    4 drwxr-xr-x   7 sge      sge          4096 Jun 22 10:48 ./active_jobs
>>> 1627050    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./active_jobs/6.1
>>> 1627056    4 -rw-r--r--   1 sge      sge          2063 Jun 22 10:48 ./active_jobs/6.1/environment
>>> 1627074    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/pid
>>> 1627053    4 -rw-r--r--   1 CH21778  CH21778      3498 Jun 22 10:49 ./active_jobs/6.1/trace
>>> 1627078    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/job_pid
>>> 1627086    4 -rw-r-----   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/addgrpid
>>> 1627105    0 -rw-r-----   1 CH21778  CH21778         0 Jun 22 10:48 ./active_jobs/6.1/error
>>> 1627048    4 -rw-r--r--   1 sge      sge           305 Jun 22 10:49 ./active_jobs/6.1/usage
>>> 1627055    4 -rw-r--r--   1 sge      sge            32 Jun 22 10:48 ./active_jobs/6.1/pe_hostfile
>>> 1627061    4 -rw-r--r--   1 sge      sge          1902 Jun 22 10:48 ./active_jobs/6.1/config
>>> 1627106    4 -rw-r-----   1 CH21778  CH21778         2 Jun 22 10:49 ./active_jobs/6.1/exit_status
>>> 
>>> Then, at the end of job execution, execd tried to open the error file
>>> but failed because of the file permissions, as recorded in the execd
>>> messages file:
>>> 
>>> 06/22/2012 10:49:26|  main|d-7-55|E|abnormal termination of shepherd for job 6.1: no "exit_status" file
>>> 06/22/2012 10:49:26|  main|d-7-55|E|can't open file active_jobs/6.1/error: Permission denied
>>> 
>>> So it appears that the error and exit_status files are opened later by
>>> the GE admin user (sge), which fails because of the file permissions.
>>> Any suggestions?
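
The failure can be reproduced on paper with a simplified model of the classic owner/group/other check. The sketch below is only an illustration: `may_read` is a hypothetical helper, the numeric IDs are made up (the listing shows names only), and ACLs and supplementary groups are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Simplified model of the Unix read-permission check.
 * Assumptions: no ACLs, no supplementary groups, and a local file system
 * on which UID 0 bypasses the mode bits.
 */
bool may_read(mode_t mode, uid_t owner, gid_t group, uid_t uid, gid_t gid)
{
    if (uid == 0)
        return true;                      /* root is not restricted */
    if (uid == owner)
        return (mode & S_IRUSR) != 0;     /* only the owner bits apply */
    if (gid == group)
        return (mode & S_IRGRP) != 0;     /* only the group bits apply */
    return (mode & S_IROTH) != 0;         /* everyone else */
}

/* Hypothetical numeric IDs standing in for sge and CH21778. */
enum { SGE_ID = 500, JOB_OWNER_ID = 21778 };

void check_listing_cases(void)
{
    /* error and exit_status: -rw-r----- CH21778 CH21778, read by sge */
    assert(!may_read(0640, JOB_OWNER_ID, JOB_OWNER_ID, SGE_ID, SGE_ID));
    /* trace: -rw-r--r-- CH21778 CH21778, read by sge */
    assert(may_read(0644, JOB_OWNER_ID, JOB_OWNER_ID, SGE_ID, SGE_ID));
    /* the same 0640 file read as root */
    assert(may_read(0640, JOB_OWNER_ID, JOB_OWNER_ID, 0, 0));
}
```

Under this model the admin user falls into the "other" class for the 0640 files owned by the job owner, which is consistent with the "Permission denied" entry in the messages file.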
>>> 
>>> Regards,
>>> - Chansup
>>> 
>>> On Thu, Jun 21, 2012 at 6:15 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>>>> 
>>>> CB <cbalw...@gmail.com> writes:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am using the GE2011.11 release.
>>>>> 
>>>>> When a job is dispatched to a node, it creates a $TMP directory, which is
>>>>> usually located under /tmp on the execution host. The current
>>>>> permission on $TMP is 755. I would like to change it to 750. Can anyone
>>>>> point me to the file I should modify? I thought this might be quicker
>>>>> than searching through the source code myself.
>>>> 
>>>> https://arc.liv.ac.uk/trac/SGE/ticket/109
>>>> 
>>>> The relevant code is actually in sge_exec_job (in recent versions?).  I
>>>> haven't got round to seeing if configuring the various umasks will break
>>>> anything, particularly if it's controlled by a single parameter.  (The
>>>> permission on the job spool is actually the most interesting.)
>>>> 
>>>> --
>>>> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>> 
> 

