Re: [gridengine users] classical spooling method broken in v8 release

Rayson Ho Mon, 31 Oct 2011 08:26:28 -0700

On Mon, Oct 31, 2011 at 11:12 AM, Reuti <[email protected]> wrote:
> Looks like I completely misunderstand the above issue:
>
> Why should it be an error to have more than one slot or more than one host?


He did not say that it is, but from the diff he sent, it looks like
the v8 code checks the return value of rmdir, which can fail if
the directory is not empty under some conditions, e.g. on nodes with
more than 1 slot, which can have more than 1 task spooled in the local
spool directory. I don't have time to dig into the code to check this,
as it is working for Open Grid Scheduler, but if you have a multi-slot
node configured, you should be able to look into the local
spooling directory and see whether multiple tasks of the same job share
the same third-level directory. (I believe it is written this way, based
on my understanding of the classic spooling code, but again I don't have
time to check.)
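
To illustrate the rmdir behaviour described above, here is a minimal sketch,
not taken from the Grid Engine source. The helper names
remove_dir_if_empty() and demo_shared_spool_dir(), and the /tmp path, are
made up for this example. It only shows that plain rmdir() refuses to remove
a directory while another entry is still inside it, which is the case the
patch treats as harmless:

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Grid Engine): try to remove a
 * directory with plain rmdir().
 * Returns 1 if it was removed, 0 if it was left in place because it is
 * not empty, and -1 on any other error.
 */
static int remove_dir_if_empty(const char *path)
{
    if (rmdir(path) == 0) {
        return 1;
    }
    /* POSIX allows either errno value for a non-empty directory. */
    if (errno == ENOTEMPTY || errno == EEXIST) {
        return 0;
    }
    return -1;
}

/*
 * Simulate two tasks spooled under one shared directory.
 * Returns 0 if rmdir() behaved as described, non-zero otherwise.
 */
static int demo_shared_spool_dir(void)
{
    char base[] = "/tmp/spool_demo_XXXXXX";   /* assumed scratch location */
    char task1[64], task2[64];

    if (mkdtemp(base) == NULL) return 1;
    snprintf(task1, sizeof(task1), "%s/1", base);
    snprintf(task2, sizeof(task2), "%s/2", base);
    if (mkdir(task1, 0700) != 0 || mkdir(task2, 0700) != 0) return 2;

    /* Task 1 finishes; task 2 is still spooled, so the parent must stay. */
    if (remove_dir_if_empty(task1) != 1) return 3;
    if (remove_dir_if_empty(base) != 0) return 4;

    /* Task 2 finishes; now the parent is empty and can be removed. */
    if (remove_dir_if_empty(task2) != 1) return 5;
    if (remove_dir_if_empty(base) != 1) return 6;
    return 0;
}
```

Calling demo_shared_spool_dir() from a small main() should return 0 on a
POSIX system. The point is that a "directory not empty" failure is an
expected outcome here rather than something worth logging as an error.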

Rayson




>
> -- Reuti
>
>
>> The patch replaces the calls to sge_rmdir(), which also recursively
>> delete all subdirs, with standard rmdir() calls, which keep the
>> spool dir clean but do not delete other jobs' data.
>>
>> This bug was introduced by the following commit:
>> https://github.com/gridengine/gridengine/commit/8c6b462a4d85e1b0713b445fe91347eec60188ff
>>
>> I've put new rpm packages for RHEL5 and RHEL6, with the patch below
>> and based on the current SoGE v8.0.0c, at
>> http://jur-linux.org/download/el-updates/
>> in case someone wants to try them out on real workloads.
>>
>> big thanks to all the gridengine developers,
>> best regards,
>>
>> Florian La Roche
>>
>>
>> --- a/source/libs/spool/classic/read_write_job.c
>> +++ b/source/libs/spool/classic/read_write_job.c
>> @@ -688,14 +688,12 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
>>          }
>>
>>          /*
>> -          * Following sge_rmdir call may fail. We can ignore this error.
>> +          * Following rmdir call may fail. We can ignore this error.
>>           * This is only an indicator that another task is running which has
>>           * been spooled in the directory.
>>           */
>>          DPRINTF(("try to remove "SFN"\n", task_spool_dir));
>> -         if (sge_rmdir(task_spool_dir, &error_msg)) {
>> -            ERROR((SGE_EVENT, MSG_JOB_CANNOT_REMOVE_SS, MSG_JOB_TASK_SPOOL_FILE, error_msg_buffer));
>> -         }
>> +      rmdir(task_spool_dir);
>>
>>          /*
>>           * a task spool directory has been removed: reinit
>> @@ -735,16 +733,15 @@ int job_remove_spool_file(u_long32 jobid, u_long32 ja_taskid,
>>       try_to_remove_sub_dirs = 1;
>>    }
>>    /*
>> -    * Following sge_rmdir calls may fail. We can ignore these errors.
>> +    * Following rmdir calls may fail. We can ignore these errors.
>>     * This is only an indicator that another job is running which has been
>>     * spooled in the same directory.
>>     */
>>    if (try_to_remove_sub_dirs) {
>>       DPRINTF(("try to remove "SFN"\n", spool_dir_third));
>> -
>> -      if (!sge_rmdir(spool_dir_third, NULL)) {
>> +      if (!rmdir(spool_dir_third)) {
>>          DPRINTF(("try to remove "SFN"\n", spool_dir_second));
>> -         sge_rmdir(spool_dir_second, NULL);
>> +         rmdir(spool_dir_second);
>>       }
>>    }
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
