I figured out how to fix the errors shown below. By implementing Rason
and Dave's recommendation, I was able to harden the file permissions on
the job owner's spooled files as well as on $TMP. The one remaining to-do
item is the trace file, which is owned by the job owner and is still
world-readable.

For those interested in how I fixed the errors below, here is the diff
of reaper_execd.c:

Fixed issue: abnormal termination of shepherd for job 6.1: no "exit_status" file

Fixed issue: can't open file active_jobs/6.1/error: Permission denied

Index: daemons/execd/reaper_execd.c
===================================================================
--- daemons/execd/reaper_execd.c      (revision 5)
+++ daemons/execd/reaper_execd.c      (working copy)
@@ -498,6 +498,7 @@
     */
    sge_get_active_job_file_path(&fname,
                                 job_id, ja_task_id, pe_task_id, "exit_status");
+   sge_switch2start_user();
    if (!(fp = fopen(sge_dstring_get_string(&fname), "r"))) {
       /*
        * we trust the exit status of the shepherd if it exited regularly
@@ -509,6 +510,7 @@
       else
          failed = SSTATE_BEFORE_PROLOG;
 
+      sge_switch2admin_user();
       sprintf(error, MSG_STATUS_ABNORMALTERMINATIONOFSHEPHERDFORJOBXY_S,
               job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring));
       ERROR((SGE_EVENT, error));
@@ -521,6 +523,7 @@
       int fscanf_count, shepherd_exit_status_file;
 
       fscanf_count = fscanf(fp, "%d", &shepherd_exit_status_file);
+      sge_switch2admin_user();
       FCLOSE_IGNORE_ERROR(fp);
       if (fscanf_count != 1) {
          sprintf(error, MSG_STATUS_ABNORMALTERMINATIONFOSHEPHERDFORJOBXYEXITSTATEFILEISEMPTY_S,
@@ -564,6 +567,7 @@
    /* look for error file this overrules errors found yet */
    sge_get_active_job_file_path(&fname,
                                 job_id, ja_task_id, pe_task_id, "error");
+   sge_switch2start_user();
    if ((fp = fopen(sge_dstring_get_string(&fname), "r"))) {
       int n;
       char *new_line;
@@ -575,17 +579,21 @@
          /* ensure only first line of error file is in 'error' */
          if ((new_line=strchr(error, '\n')))
             *new_line = '\0';
+         sge_switch2admin_user();
          DPRINTF(("ERRORFILE: %256s\n", error));
       }
       else if (feof(fp)) {
+         sge_switch2admin_user();
          DPRINTF(("empty error file\n"));
       } else {
+         sge_switch2admin_user();
          ERROR((SGE_EVENT, MSG_JOB_CANTREADERRORFILEFORJOBXY_S,
             job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring)));
       }
       FCLOSE_IGNORE_ERROR(fp);
    }
    else {
+      sge_switch2admin_user();
       ERROR((SGE_EVENT, MSG_FILE_NOOPEN_SS, sge_dstring_get_string(&fname), strerror(errno)));
       /* There is no error file. */
    }

Regards,
- Chansup

On Fri, Jun 22, 2012 at 10:59 AM, CB <cbalw...@gmail.com> wrote:

> I tried the workaround suggested in the ticket, but it failed when a job
> exited: the error state file in the spool directory could not be updated.
> Using umask(027) instead of umask(022) changes the permissions on some of
> the files in the execd spool directory that are owned by the job owner.
>
> Interestingly, not all of them are affected by umask(027), as shown below:
>
> [CH21778@d-7-55 d-7-55]$ pwd
> /opt/llogs/default/spool/d-7-55
> [CH21778@d-7-55 d-7-55]$ find -ls
> 1627040    4 drwxr-xr-x   5 sge      sge          4096 Jun 22 09:34 .
> 1627042    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./jobs
> 1627043    4 drwxr-xr-x   7 sge      sge          4096 Jun 22 10:48 ./active_jobs
> 1627050    4 drwxr-xr-x   2 sge      sge          4096 Jun 22 10:49 ./active_jobs/6.1
> 1627056    4 -rw-r--r--   1 sge      sge          2063 Jun 22 10:48 ./active_jobs/6.1/environment
> 1627074    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/pid
> 1627053    4 -rw-r--r--   1 CH21778  CH21778      3498 Jun 22 10:49 ./active_jobs/6.1/trace
> 1627078    4 -rw-r--r--   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/job_pid
> 1627086    4 -rw-r-----   1 sge      sge             6 Jun 22 10:48 ./active_jobs/6.1/addgrpid
> 1627105    0 -rw-r-----   1 CH21778  CH21778         0 Jun 22 10:48 ./active_jobs/6.1/error
> 1627048    4 -rw-r--r--   1 sge      sge           305 Jun 22 10:49 ./active_jobs/6.1/usage
> 1627055    4 -rw-r--r--   1 sge      sge            32 Jun 22 10:48 ./active_jobs/6.1/pe_hostfile
> 1627061    4 -rw-r--r--   1 sge      sge          1902 Jun 22 10:48 ./active_jobs/6.1/config
> 1627106    4 -rw-r-----   1 CH21778  CH21778         2 Jun 22 10:49 ./active_jobs/6.1/exit_status
>
> Then, at the end of job execution, it tried to update the error file but
> failed because of the file permissions, as recorded in the execd messages
> file:
>
> 06/22/2012 10:49:26|  main|d-7-55|E|abnormal termination of shepherd for job 6.1: no "exit_status" file
> 06/22/2012 10:49:26|  main|d-7-55|E|can't open file active_jobs/6.1/error: Permission denied
>
> So it appears that the error and exit_status files are accessed later by
> the GE admin user (sge), and that this fails because of the file
> permissions. Any suggestions?
>
> Regards,
> - Chansup
>
> On Thu, Jun 21, 2012 at 6:15 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>
>> CB <cbalw...@gmail.com> writes:
>>
>> > Hi,
>> >
>> > I am using the GE2011.11 release.
>> >
>> > When a job is dispatched to a node, it creates the $TMP directory, which
>> > is usually located under /tmp on the execution host. The current
>> > permission on $TMP is 755. I would like to change it to 750. Can anyone
>> > point me to the file I should modify? I thought asking might be quicker
>> > than searching through the source code.
>>
>> https://arc.liv.ac.uk/trac/SGE/ticket/109
>>
>> The relevant code is actually in sge_exec_job (in recent versions?).  I
>> haven't got round to seeing if configuring the various umasks will break
>> anything, particularly if it's controlled by a single parameter.  (The
>> permission on the job spool is actually the most interesting.)
>>
>> --
>> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
>>
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users