On Mon, Oct 31, 2011 at 11:12 AM, Reuti <[email protected]> wrote:
> Looks like I completely misunderstand the above issue:
>
> Why should it be an error to have more than one slot or more than one host?

He did not say that it is. But from the diff he sent, it looks like the v8
code checks the return value of rmdir, which can fail when the directory is
not empty. That can happen, for example, on a node with more than one slot,
where more than one task can be spooled in the local spool directory.

I don't have time to dig into the code to check this, as it is working for
Open Grid Scheduler. But if you have a multi-slot node configured, you
should be able to look into the local spooling directory and see whether
multiple tasks of the same job share the same third-level directory. (I
believe it is written this way, based on my understanding of the classic
spooling code, but again I don't have time to check.)

Rayson

>
> -- Reuti
>
>
>> The patch changes from calls to sge_rmdir() that also delete
>> recursively all subdirs to standard rmdir() calls to keep the
>> spool dir clean, but not delete other jobs' data.
>>
>> This bug was introduced by the following commit:
>> https://github.com/gridengine/gridengine/commit/8c6b462a4d85e1b0713b445fe91347eec60188ff
>>
>> I've put new rpm packages for RHEL5 and RHEL6 with the below patch
>> and based on current SoGE v8.0.0c to
>> http://jur-linux.org/download/el-updates/
>> if someone wants to try out real workloads.
>>
>> big thanks to all the gridengine developers,
>> best regards,
>>
>> Florian La Roche
>>
>>
>> --- a/source/libs/spool/classic/read_write_job.c
>> +++ b/source/libs/spool/classic/read_write_job.c
>> @@ -688,14 +688,12 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
>>     }
>>
>>     /*
>> -    * Following sge_rmdir call may fail. We can ignore this error.
>> +    * Following rmdir call may fail. We can ignore this error.
>>      * This is only an indicator that another task is running which has
>>      * been spooled in the directory.
>>      */
>>     DPRINTF(("try to remove "SFN"\n", task_spool_dir));
>> -   if (sge_rmdir(task_spool_dir, &error_msg)) {
>> -      ERROR((SGE_EVENT, MSG_JOB_CANNOT_REMOVE_SS, MSG_JOB_TASK_SPOOL_FILE, error_msg_buffer));
>> -   }
>> +   rmdir(task_spool_dir);
>>
>>     /*
>>      * a task spool directory has been removed: reinit
>> @@ -735,16 +733,15 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
>>        try_to_remove_sub_dirs = 1;
>>     }
>>     /*
>> -    * Following sge_rmdir calls may fail. We can ignore these errors.
>> +    * Following rmdir calls may fail. We can ignore these errors.
>>      * This is only an indicator that another job is running which has been
>>      * spooled in the same directory.
>>      */
>>     if (try_to_remove_sub_dirs) {
>>        DPRINTF(("try to remove "SFN"\n", spool_dir_third));
>> -
>> -      if (!sge_rmdir(spool_dir_third, NULL)) {
>> +      if (!rmdir(spool_dir_third)) {
>>           DPRINTF(("try to remove "SFN"\n", spool_dir_second));
>> -         sge_rmdir(spool_dir_second, NULL);
>> +         rmdir(spool_dir_second);
>>        }
>>     }
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
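
For anyone who wants to see the rmdir() behaviour discussed in this thread in isolation, here is a minimal standalone sketch, assuming a POSIX system. It is not Grid Engine code: remove_if_empty(), spool_demo() and the /tmp/spool_demo_XXXXXX path are hypothetical names made up for the illustration. It only shows that plain rmdir() refuses to remove a directory that still holds another task's subdirectory, and that this failure can be told apart from a real error by looking at errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a Grid Engine function.
 * Returns 0 if dir was removed, 1 if it was left in place because it is
 * not empty (someone else still uses it), and -1 on any other error.
 */
int remove_if_empty(const char *dir)
{
    assert(dir != NULL);
    if (rmdir(dir) == 0) {
        return 0;
    }
    /* POSIX permits either errno value for a non-empty directory. */
    if (errno == ENOTEMPTY || errno == EEXIST) {
        return 1;
    }
    return -1;
}

/*
 * Simulates two tasks of one job that share a spool directory.
 * Returns 0 if rmdir() behaved as described, -1 otherwise.
 */
int spool_demo(void)
{
    char shared[] = "/tmp/spool_demo_XXXXXX"; /* illustrative path only */
    char task1[64];
    char task2[64];

    if (mkdtemp(shared) == NULL) {
        return -1;
    }
    snprintf(task1, sizeof task1, "%s/1", shared);
    snprintf(task2, sizeof task2, "%s/2", shared);
    if (mkdir(task1, 0700) != 0 || mkdir(task2, 0700) != 0) {
        return -1;
    }

    /* Task 1 finishes: its own directory goes, the shared one must stay. */
    if (remove_if_empty(task1) != 0) return -1;
    if (remove_if_empty(shared) != 1) return -1;

    /* Task 2 finishes: now the shared directory is empty and is removed. */
    if (remove_if_empty(task2) != 0) return -1;
    if (remove_if_empty(shared) != 0) return -1;

    return 0;
}
```

Calling spool_demo() from a small main() and checking for a 0 return is enough to try it out. Whether the real spooling code should log the "not empty" case or stay silent is a separate question that this sketch does not answer.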
