On 02.11.2012 at 16:32, Dave Love wrote:

> Reuti <[email protected]> writes:
> 
>> Note: there is also issue https://arc.liv.ac.uk/trac/SGE/ticket/813 where 
>> two `qrsh -inherit` to the same exechost end up in wrong queues. This would 
>> also be solved then, as the desired queue can't be selected right now.
> 
> Looking at the code, I don't actually understand how you get
> inconsistent TMPDIRs, as the name seems to be derived from the master
> queue name in the calls of sge_make_tmpdir.

In the 5.3 days it was different: the queue name differed on each exechost 
for a parallel job, and back then we had to create symbolic links on the 
nodes, because some software broadcasts the name of the granted $TMPDIR on 
its own and hence expects the same path to be present everywhere. But I just 
reran the test with 6.2u5:

reuti@pc15370:~> cat test.sh.o5314
1: /scratch/5314.1.extra.q
2: /scratch/5314.1.extra.q
5: /scratch/5314.1.all.q
3: /scratch/5314.1.extra.q
6: /scratch/5314.1.all.q
7: /scratch/5314.1.all.q
4: /scratch/5314.1.all.q
Jobscript: /scratch/5314.1.extra.q

This is the output of the following job script:

#!/bin/sh
. /usr/sge/default/common/settings.sh
qrsh -inherit pc15370 echo '1: $TMPDIR' &
qrsh -inherit pc15370 echo '2: $TMPDIR' &
qrsh -inherit pc15370 echo '3: $TMPDIR' &
qrsh -inherit pc15370 echo '4: $TMPDIR' &
qrsh -inherit pc15370 echo '5: $TMPDIR' &
qrsh -inherit pc15370 echo '6: $TMPDIR' &
qrsh -inherit pc15370 echo '7: $TMPDIR' &
wait
echo "Jobscript: $TMPDIR"
sleep 30
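
To see at a glance how many distinct $TMPDIRs were handed out, the output file can be tallied. This is only a small sketch: the function name `summarize_tmpdirs` is made up for illustration, and it assumes every line has the form "<label>: <path>" as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): count how often each $TMPDIR
# path occurs in a job output file with lines like
#   1: /scratch/5314.1.extra.q
summarize_tmpdirs() {
    # Field 2 is the path; print "<count> <path>", sorted by path.
    awk '{ count[$2]++ } END { for (dir in count) print count[dir], dir }' "$1" |
        sort -k2
}

# Usage: summarize_tmpdirs test.sh.o5314
```

More than one line of output for a single exechost means the tasks on that host did not share one $TMPDIR.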


>> (Only if you would like to get exactly one unique $TMPDIR per `qrsh 
>> -inherit` with a slot count of 1 in each queue you would be out of luck. But 
>> for now this can't be guaranteed anyway. OTOH: it could be a feature to 
>> limit some kind of disk quota inside $TMPDIR and you want to get a correct 
>> one for each `qrsh -inherit` call and the -q option should be implemented.)
> 
> Maybe, though that seems quite obscure and less important than problems
> caused by the current implementation, even if I'm now confused how they
> arise...
> 
>> Before changing this: I wonder what was the intention >12 years ago to
>> include the name of the queue, as the job/task-id is already unique?
> 
> Yes, that's what I mean.  I'm inclined to change it anyway if there's no
> obvious reason.  (The id is only unique in a given cell, and you could
> currently have trouble from multiple cells with job ids of similar
> sizes, though I doubt that's at all common.)

If the queue names were the same too, that problem already exists right now.
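
To illustrate, here is a rough sketch of the naming scheme as it appears in the test output above, i.e. <tmpdir>/<jobid>.<taskid>.<queue>. The function `make_tmpdir_name` is a made-up helper for this mail, not anything in SGE, and the values are just the ones from my test.

```shell
#!/bin/sh
# Hypothetical helper: compose a directory name in the pattern seen in
# the test output, <tmpdir>/<jobid>.<taskid>.<queue>.
make_tmpdir_name() {
    printf '%s/%s.%s.%s\n' "$1" "$2" "$3" "$4"
}

# Two cells issuing the same job id with the same queue name and the
# same scratch base end up with an identical path ...
cell_a=$(make_tmpdir_name /scratch 5314 1 all.q)
cell_b=$(make_tmpdir_name /scratch 5314 1 all.q)
if [ "$cell_a" = "$cell_b" ]; then
    echo "collision: $cell_a"
fi

# ... while two queues of one cell on the same host get different
# paths, which is what the test output shows.
make_tmpdir_name /scratch 5314 1 all.q
make_tmpdir_name /scratch 5314 1 extra.q
```

So the queue name only separates cells from each other if the queues happen to be named differently.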


>> I'm not sure, whether it was already in DQS. In SGE 5.3 there were no
>> cluster queues (i.e. one queue definition per exechost...) and often
>> the number of the exechost was included in the name of the queue
>> because of this, like 1234.1.serial01.q for a serial queue on node01.
> 
> I'm not sure it helps, but dqs_make_tmpdir:
> 
>  /* Note could have multiple instantiations of same job, */
>  /* on same machine, under same queue */
>  sprintf(str,"%s/%d.%s.%d",qconf->tmpdir,job->job_number,qconf->qname,me.pid);
> 
> c.f. sge_make_tmpdir:
> 
>   /* Note could have multiple instantiations of same job, */
>   /* on same machine, under same queue */
>   snprintf(tmpdir, ltmpdir, "%s/"sge_u32"."sge_u32".%s", t, jobid,
>            jataskid, lGetString(qep, QU_qname));
> 
> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


