I figured out how to fix the errors shown below. By implementing Rason's and Dave's recommendations, I was able to harden the file permissions on the job owner's spooled files as well as on $TMP. The one remaining to-do is the trace file, which is owned by the job owner and is still world-readable.
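A note on the approach, since the diff that follows is terse: the "Permission denied" errors appear to come from execd opening the job owner's exit_status and error files as the admin user, and the patch brackets each fopen() with sge_switch2start_user() and sge_switch2admin_user(). Below is a minimal sketch of that same open-then-switch-back bracket in plain POSIX C. It is only an illustration: fopen_as_euid() is a hypothetical helper, not a Grid Engine function, and it assumes the process is allowed to change its effective UID (for example, because it was started as root).

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Grid Engine): open `path` while the
 * effective UID is temporarily `open_uid`, then restore the previous
 * effective UID before returning, whether or not fopen() succeeded.
 */
static FILE *fopen_as_euid(const char *path, const char *mode, uid_t open_uid)
{
    uid_t saved_euid = geteuid();
    FILE *fp;
    int saved_errno;

    if (seteuid(open_uid) != 0)
        return NULL;            /* could not switch; errno set by seteuid() */

    fp = fopen(path, mode);
    saved_errno = errno;        /* keep fopen()'s errno across the switch back */

    if (seteuid(saved_euid) != 0) {
        /* failing to switch back is serious: do not hand out the stream */
        if (fp != NULL)
            fclose(fp);
        return NULL;
    }

    errno = saved_errno;
    return fp;
}
```

Restoring the effective UID in one place means the caller does not have to remember the switch-back in every branch, which the patch has to do by hand in each if/else arm.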
For those who are interested in how I fixed the errors below, I added the diff result of reaper_execd.c file:

Fixed issue: abnormal termination of shepherd for job 6.1: no "exit_status" file
Fixed issue: can't open file active_jobs/6.1/error: Permission denied

Index: daemons/execd/reaper_execd.c
===================================================================
--- daemons/execd/reaper_execd.c (revision 5)
+++ daemons/execd/reaper_execd.c (working copy)
@@ -498,6 +498,7 @@
     */
    sge_get_active_job_file_path(&fname, job_id, ja_task_id, pe_task_id, "exit_status");
+   sge_switch2start_user();
    if (!(fp = fopen(sge_dstring_get_string(&fname), "r"))) {
       /*
        * we trust the exit status of the shepherd if it exited regularly
@@ -509,6 +510,7 @@
       else
          failed = SSTATE_BEFORE_PROLOG;

+      sge_switch2admin_user();
       sprintf(error, MSG_STATUS_ABNORMALTERMINATIONOFSHEPHERDFORJOBXY_S,
               job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring));
       ERROR((SGE_EVENT, error));
@@ -521,6 +523,7 @@
       int fscanf_count, shepherd_exit_status_file;

       fscanf_count = fscanf(fp, "%d", &shepherd_exit_status_file);
+      sge_switch2admin_user();
       FCLOSE_IGNORE_ERROR(fp);
       if (fscanf_count != 1) {
          sprintf(error, MSG_STATUS_ABNORMALTERMINATIONFOSHEPHERDFORJOBXYEXITSTATEFILEISEMPTY_S,
@@ -564,6 +567,7 @@
    /* look for error file this overrules errors found yet */
    sge_get_active_job_file_path(&fname, job_id, ja_task_id, pe_task_id, "error");
+   sge_switch2start_user();
    if ((fp = fopen(sge_dstring_get_string(&fname), "r"))) {
       int n;
       char *new_line;
@@ -575,17 +579,21 @@
          /* ensure only first line of error file is in 'error' */
          if ((new_line=strchr(error, '\n')))
             *new_line = '\0';
+         sge_switch2admin_user();
          DPRINTF(("ERRORFILE: %256s\n", error));
       } else if (feof(fp)) {
+         sge_switch2admin_user();
          DPRINTF(("empty error file\n"));
       } else {
+         sge_switch2admin_user();
          ERROR((SGE_EVENT, MSG_JOB_CANTREADERRORFILEFORJOBXY_S,
                 job_get_id_string(job_id, ja_task_id, pe_task_id, &id_dstring)));
       }
       FCLOSE_IGNORE_ERROR(fp);
    }
    else
    {
+      sge_switch2admin_user();
       ERROR((SGE_EVENT, MSG_FILE_NOOPEN_SS, sge_dstring_get_string(&fname), strerror(errno)));
       /* There is no error file. */
    }

Regards,
- Chansup

On Fri, Jun 22, 2012 at 10:59 AM, CB <cbalw...@gmail.com> wrote:

> I tried the workaround suggestion in the ticket but it failed when a job
> exited, which failed to update the error state file in the spool directory.
> By using umask(027) instead of umask(022), it changes file permission on
> some of the files in the execd spool directory, which are owned by the job
> owner.
>
> Interestingly not all of them are affected by umask(027) as shown below:
>
> [CH21778@d-7-55 d-7-55]$ pwd
> /opt/llogs/default/spool/d-7-55
> [CH21778@d-7-55 d-7-55]$ find -ls
> 1627040 4 drwxr-xr-x 5 sge     sge     4096 Jun 22 09:34 .
> 1627042 4 drwxr-xr-x 2 sge     sge     4096 Jun 22 10:49 ./jobs
> 1627043 4 drwxr-xr-x 7 sge     sge     4096 Jun 22 10:48 ./active_jobs
> 1627050 4 drwxr-xr-x 2 sge     sge     4096 Jun 22 10:49 ./active_jobs/6.1
> 1627056 4 -rw-r--r-- 1 sge     sge     2063 Jun 22 10:48 ./active_jobs/6.1/environment
> 1627074 4 -rw-r--r-- 1 sge     sge        6 Jun 22 10:48 ./active_jobs/6.1/pid
> 1627053 4 -rw-r--r-- 1 CH21778 CH21778 3498 Jun 22 10:49 ./active_jobs/6.1/trace
> 1627078 4 -rw-r--r-- 1 sge     sge        6 Jun 22 10:48 ./active_jobs/6.1/job_pid
> 1627086 4 -rw-r----- 1 sge     sge        6 Jun 22 10:48 ./active_jobs/6.1/addgrpid
> 1627105 0 -rw-r----- 1 CH21778 CH21778    0 Jun 22 10:48 ./active_jobs/6.1/error
> 1627048 4 -rw-r--r-- 1 sge     sge      305 Jun 22 10:49 ./active_jobs/6.1/usage
> 1627055 4 -rw-r--r-- 1 sge     sge       32 Jun 22 10:48 ./active_jobs/6.1/pe_hostfile
> 1627061 4 -rw-r--r-- 1 sge     sge     1902 Jun 22 10:48 ./active_jobs/6.1/config
> 1627106 4 -rw-r----- 1 CH21778 CH21778    2 Jun 22 10:49 ./active_jobs/6.1/exit_status
>
> And then, at the end of job execution, it tried to update the error file
> but failed due to file permission as recorded in the execd messages file:
>
> 06/22/2012 10:49:26| main|d-7-55|E|abnormal termination of shepherd for job 6.1: no "exit_status" file
> 06/22/2012 10:49:26| main|d-7-55|E|can't open file active_jobs/6.1/error: Permission denied
>
> So it appears that the error and exit_status files are updated later by
> the GE admin user (sge) and failed because of the file permission.
> Any suggestions?
>
> Regards,
> - Chansup
>
> On Thu, Jun 21, 2012 at 6:15 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>
>> CB <cbalw...@gmail.com> writes:
>>
>> > Hi,
>> >
>> > I am using the GE2011.11 release.
>> >
>> > When a job dispatched to a node, it creates $TMP directory, which is
>> > usually located at /tmp on the execution host. The current file permission
>> > on $TMP is 755. I would like to modify it to 750. Can anyone point me
>> > which file should I modify? I thought this might be quicker than me to
>> > searching through the source code.
>>
>> https://arc.liv.ac.uk/trac/SGE/ticket/109
>>
>> The relevant code is actually in sge_exec_job (in recent versions?). I
>> haven't got round to seeing if configuring the various umasks will break
>> anything, particularly if it's controlled by a single parameter. (The
>> permission on the job spool is actually the most interesting.)
>>
>> --
>> Community Grid Engine: http://arc.liv.ac.uk/SGE/
>>
>
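The umask(022) versus umask(027) difference discussed in the quoted messages can be checked with a small standalone program. This is a hedged sketch, not Grid Engine code: created_mode_under_umask() is a hypothetical helper that creates a scratch file with a requested mode of 0666 under a given umask and reports the permission bits the file actually received.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: create `path` with a requested mode of 0666
 * under the given umask and return the permission bits the new file
 * actually received, or (mode_t)-1 on error. The previous umask is
 * restored and the scratch file is removed before returning.
 */
static mode_t created_mode_under_umask(const char *path, mode_t mask)
{
    struct stat st;
    mode_t old_mask = umask(mask);      /* umask() returns the previous mask */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0666);
    int ok;

    umask(old_mask);
    if (fd < 0)
        return (mode_t)-1;

    ok = (fstat(fd, &st) == 0);
    close(fd);
    unlink(path);
    return ok ? (mode_t)(st.st_mode & 0777) : (mode_t)-1;
}
```

The effective mode is the requested mode with the umask bits cleared: 0666 & ~027 = 0640 (the -rw-r----- entries in the listing above), while 0666 & ~022 = 0644 (-rw-r--r--). For a directory requested with 0777, a umask of 027 gives 0750, which is the permission asked for on $TMP.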
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users