Hello all,

With the current gridengine sources from github.com (Univa?) as well as the Son of Grid Engine fork, the classic spooling method is completely broken. If you have more than one slot configured, or several hosts, this should result in error messages showing up in the logs, even though jobs may run OK.

The patch below replaces the calls to sge_rmdir(), which also recursively deletes all subdirectories, with standard rmdir() calls. This keeps the spool dir clean without deleting other jobs' data.

The bug was introduced by the following commit:
https://github.com/gridengine/gridengine/commit/8c6b462a4d85e1b0713b445fe91347eec60188ff

I've put new rpm packages for RHEL5 and RHEL6, based on current SoGE v8.0.0c and including the patch below, at
http://jur-linux.org/download/el-updates/
in case someone wants to try them out with real workloads.

Big thanks to all the gridengine developers.

best regards,

Florian La Roche

--- a/source/libs/spool/classic/read_write_job.c
+++ b/source/libs/spool/classic/read_write_job.c
@@ -688,14 +688,12 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
          }
 
          /*
-          * Following sge_rmdir call may fail. We can ignore this error.
+          * Following rmdir call may fail. We can ignore this error.
           * This is only an indicator that another task is running which has
           * been spooled in the directory.
           */
          DPRINTF(("try to remove "SFN"\n", task_spool_dir));
-         if (sge_rmdir(task_spool_dir, &error_msg)) {
-            ERROR((SGE_EVENT, MSG_JOB_CANNOT_REMOVE_SS, MSG_JOB_TASK_SPOOL_FILE, error_msg_buffer));
-         }
+         rmdir(task_spool_dir);
 
          /*
           * a task spool directory has been removed: reinit
@@ -735,16 +733,15 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
          try_to_remove_sub_dirs = 1;
       }
       /*
-       * Following sge_rmdir calls may fail. We can ignore these errors.
+       * Following rmdir calls may fail. We can ignore these errors.
        * This is only an indicator that another job is running which has been
        * spooled in the same directory.
        */
       if (try_to_remove_sub_dirs) {
          DPRINTF(("try to remove "SFN"\n", spool_dir_third));
-
-         if (!sge_rmdir(spool_dir_third, NULL)) {
+         if (!rmdir(spool_dir_third)) {
             DPRINTF(("try to remove "SFN"\n", spool_dir_second));
-            sge_rmdir(spool_dir_second, NULL);
+            rmdir(spool_dir_second);
          }
       }
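
To illustrate why a plain rmdir() is safe in a shared spool directory, here is a small standalone sketch. It is not part of the patch and does not use any gridengine code: the function names and the /tmp path are made up for the demo. It only relies on the POSIX behaviour that rmdir() refuses to remove a non-empty directory (errno ENOTEMPTY or EEXIST), so another job's data survives the failed call.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Try to remove a directory with plain rmdir().
 * Returns 0 on success, otherwise the errno value set by rmdir(). */
static int try_remove_dir(const char *path)
{
    return rmdir(path) == 0 ? 0 : errno;
}

/* Demo with a hypothetical shared spool dir under /tmp.
 * Returns 0 if rmdir() behaved as described, 1 otherwise. */
static int demo_shared_spool_dir(void)
{
    char dir[] = "/tmp/spool_demo_XXXXXX";  /* made-up path, filled in by mkdtemp() */
    char other_job[64];
    FILE *fp;
    int rc;

    if (mkdtemp(dir) == NULL)
        return 1;

    /* A file standing in for another job's spooled data. */
    snprintf(other_job, sizeof(other_job), "%s/other_job", dir);
    fp = fopen(other_job, "w");
    if (fp == NULL) {
        rmdir(dir);
        return 1;
    }
    fputs("spooled data\n", fp);
    fclose(fp);

    /* The directory is still in use: rmdir() must refuse to remove it. */
    rc = try_remove_dir(dir);
    if (rc != ENOTEMPTY && rc != EEXIST)
        return 1;
    if (access(other_job, F_OK) != 0)   /* the other job's data must survive */
        return 1;

    /* Once the last file is gone, the same call cleans up the directory. */
    if (unlink(other_job) != 0)
        return 1;
    if (try_remove_dir(dir) != 0)
        return 1;
    return access(dir, F_OK) == 0 ? 1 : 0;
}
```

Calling demo_shared_spool_dir() from a main() should return 0 on a POSIX system. The failed rmdir() in the middle corresponds to the "may fail, we can ignore this error" case in the patched comments.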
