On 01.03.2012 at 17:58, Txema Heredia Genestar wrote:

> Hi, thanks for your answers.
> 
> I didn't know about that SGE gid, thanks for pointing it out.
> 
> On second thought, I don't think the quota approach is the best answer. The 
> usual workflow of a job using $TMPDIR would be:
> 1- cp shared_and_slow_disk/whatever $TMPDIR
> 2- run process (I/O in either $TMPDIR or shared disk)
> 3- (if needed) copy output back to the shared disk.
> 
> That means that if the quota fills up during step 1, the process will start 
> with partial or corrupt data and no warning.

A well-written job script checks the return code of the copy before 
continuing and aborts if the copy failed.
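
A minimal sketch of such a check, assuming a POSIX shell. The helper name `stage_in` and the paths in the usage comment are hypothetical, not part of Grid Engine:

```shell
#!/bin/sh
# stage_in SRC DEST: copy SRC into DEST and fail loudly if cp reports an error.
# (hypothetical helper name, used here only for illustration)
stage_in() {
    if ! cp -r -- "$1" "$2"; then
        echo "stage-in of '$1' to '$2' failed (disk or quota full?)" >&2
        return 1
    fi
}

# Usage inside a job script (paths are placeholders):
#   stage_in /shared/slow/input.dat "$TMPDIR" || exit 1
#   ./my_program "$TMPDIR/input.dat"
```

With the `|| exit 1` in place, the job stops before step 2 instead of running on a truncated copy.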

-- Reuti


> Thus we should add a protective layer (via cron) that monitors the disk usage 
> (via quota or du), kills the job and emails the user if needed.
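
A rough sketch of the `du` side of such a watchdog, assuming a POSIX shell. The function names, `TMP_ROOT`, the limit, and the assumption that the scratch directories are named `<jobid>.<taskid>.all.q` are all illustrative; only `du -sk` and `qdel` are real commands here:

```shell
#!/bin/sh
# usage_kb DIR: print the disk usage of DIR in KiB (du -sk is POSIX).
usage_kb() {
    du -sk -- "$1" 2>/dev/null | awk '{print $1}'
}

# over_limit DIR LIMIT_KB: succeed (exit 0) only if DIR uses more than LIMIT_KB.
over_limit() {
    used=$(usage_kb "$1")
    [ -n "$used" ] && [ "$used" -gt "$2" ]
}

# Hypothetical cron loop; TMP_ROOT, the 10 GiB limit and the directory
# naming scheme are assumptions, adjust them to the local setup:
#   for dir in "$TMP_ROOT"/*.all.q; do
#       jobid=${dir##*/}; jobid=${jobid%%.*}
#       over_limit "$dir" 10485760 && qdel "$jobid"
#   done
```

A per-job limit would have to come from the job's own resource request rather than a fixed number, and the email notification is left out of this sketch.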
> 
> On 29/02/12 19:47, Rayson Ho wrote:
>> On Wed, Feb 29, 2012 at 1:15 PM, Reuti <[email protected]> wrote:
>>> Aha, I found this:
>>> 
>>> http://arc.liv.ac.uk/pipermail/gridengine-users/2006-November/012125.html
>>> 
>>> As the group is already there, as Rayson mentions, creating the quota is the 
>>> easiest option.
>> 
>> I was thinking about suggesting that when I first read Txema's
>> email... but turns out that it's what I suggested 5.5 years ago. (I
>> guess most of the questions here are similar. One day we can hire IBM
>> Watson and feed it the list archive, the manpages, and the admin guide
>> and it will answer all questions on this list for us!)
>> 
>> The prolog & epilog should be just a few lines of shell script -
>> configure a 1-node test cluster and you should be able to implement &
>> test it in less than a few hours. Another minor improvement: if the
>> node crashes, then the startup process needs to clean up the job
>> $TMPDIR directories when it comes back up.
>> 
>> BTW, if you use a modern filesystem, then it takes almost no time to
>> format a disk. Oracle's Btrfs takes a few seconds to format a disk,
>> and can easily apply quotas, and even snapshots (which is useful when
>> checkpointing a job - the data is consistent with the job progress):
>> 
>> "I Can't Believe This is Butter! A tour of btrfs. - Avi Miller"
>> 
>> https://www.youtube.com/watch?v=hxWuaozpe2I
>> 
>> (Note: I exchanged a few emails with Avi so he is my "e-friend", but I
>> am suggesting his presentation not because I know him but only because
>> it is a great talk)
>> 
>> Rayson
>> 
>> 
>> 
>>> -- Reuti
>>> 
>>> 
>>>> And then terminates the job?
>>>> 
>>>> 
>>>>> But that would be much more complicated and could add some unwanted 
>>>>> complexity to the whole system.
>>>> Do your users stay in $TMPDIR? Then I think it would be easier to run a 
>>>> `du -s *.all.q` and check whether any is above the request.
>>>> 
>>>> NB: There is a suspend_threshold for queues, but unfortunately not for 
>>>> each individual job on its own.
>>>> 
>>>> ===
>>>> 
>>>> Another approach, if the jobs stay in one node:
>>>> 
>>>> - in the job prolog create a file with the requested space
>>>> - format and mount it on $TMPDIR as loop device
>>>> - in the epilog it can be removed again
>>>> 
>>>> Well, creating and formatting will take some time, but the jobs can never 
>>>> exceed the requested space and it's guaranteed to be available.
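
The three steps above could be sketched roughly as follows, assuming Linux, root privileges for the prolog/epilog, and an ext2 image. The function names, the `DRY_RUN` switch and the image path are hypothetical; `dd`, `mkfs.ext2 -q -F`, `mount -o loop` and `umount` are standard tools:

```shell
#!/bin/sh
# Sketch only: must run as root on Linux.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# scratch_create IMG SIZE_MB MOUNTPOINT  (prolog side)
scratch_create() {
    # Write real zeros so the space is actually allocated, not sparse.
    run dd if=/dev/zero of="$1" bs=1M count="$2" &&
    # -F lets mkfs.ext2 format a regular file, -q keeps it quiet.
    run mkfs.ext2 -q -F "$1" &&
    run mount -o loop "$1" "$3"
}

# scratch_destroy IMG MOUNTPOINT  (epilog side)
scratch_destroy() {
    run umount "$2" &&
    run rm -f "$1"
}
```

A freshly formatted filesystem is owned by root, so the prolog would probably also need to `chown` the mounted directory to the job owner before the job starts. Setting `DRY_RUN=1` is a cheap way to review the command sequence on a test node first.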
>>>> 
>>>> -- Reuti
>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>>> 
> 


