On 02.11.2012 at 16:32, Dave Love wrote:

> Reuti <[email protected]> writes:
>
>> Note: there is also issue https://arc.liv.ac.uk/trac/SGE/ticket/813 where
>> two `qrsh -inherit` to the same exechost end up in wrong queues. This would
>> also be solved then, as the desired queue can't be selected right now.
>
> Looking at the code, I don't actually understand how you get
> inconsistent TMPDIRs, as the name seems to be derived from the master
> queue name in the calls of sge_make_tmpdir.

In the days of 5.3 it was different: the queue name differed on each
exechost for a parallel job, and back then we had to create symbolic links
on the nodes, because some software broadcasts the name of the granted
$TMPDIR on its own and hence expects the same directory to be present
everywhere.

But I just reran the test with 6.2u5:

reuti@pc15370:~> cat test.sh.o5314
1: /scratch/5314.1.extra.q
2: /scratch/5314.1.extra.q
5: /scratch/5314.1.all.q
3: /scratch/5314.1.extra.q
6: /scratch/5314.1.all.q
7: /scratch/5314.1.all.q
4: /scratch/5314.1.all.q
Jobscript: /scratch/5314.1.extra.q

For a job:

#!/bin/sh
. /usr/sge/default/common/settings.sh
qrsh -inherit pc15370 echo '1: $TMPDIR' &
qrsh -inherit pc15370 echo '2: $TMPDIR' &
qrsh -inherit pc15370 echo '3: $TMPDIR' &
qrsh -inherit pc15370 echo '4: $TMPDIR' &
qrsh -inherit pc15370 echo '5: $TMPDIR' &
qrsh -inherit pc15370 echo '6: $TMPDIR' &
qrsh -inherit pc15370 echo '7: $TMPDIR' &
wait
echo "Jobscript: $TMPDIR"
sleep 30

>> (Only if you would like to get exactly one unique $TMPDIR per `qrsh
>> -inherit` with a slot count of 1 in each queue you would be out of luck. But
>> for now this can't be guaranteed anyway. OTOH: it could be a feature to
>> limit some kind of disk quota inside $TMPDIR and you want to get a correct
>> one for each `qrsh -inherit` call and the -q option should be implemented.)
>
> Maybe, though that seems quite obscure and less important than problems
> caused by the current implementation, even if I'm now confused how they
> arise...
>
>> Before changing this: I wonder what was the intention >12 years ago to
>> include the name of the queue, as the job/task-id is already unique?
>
> Yes, that's what I mean. I'm inclined to change it anyway if there's no
> obvious reason. (The id is only unique in a given cell, and you could
> currently have trouble from multiple cells with job ids of similar
> sizes, though I doubt that's at all common.)

If the queue names were the same too, this problem already exists right
now.

>> I'm not sure, whether it was already in DQS. In SGE 5.3 there were no
>> cluster queues (i.e. one queue definition per exechost...) and often
>> the number of the exechost was included in the name of the queue
>> because of this, like 1234.1.serial01.q for a serial queue on node01.
>
> I'm not sure it helps, but dqs_make_tmpdir:
>
>   /* Note could have multiple instantiations of same job, */
>   /* on same machine, under same queue */
>   sprintf(str,"%s/%d.%s.%d",qconf->tmpdir,job->job_number,qconf->qname,me.pid);
>
> c.f. sge_make_tmpdir:
>
>   /* Note could have multiple instantiations of same job, */
>   /* on same machine, under same queue */
>   snprintf(tmpdir, ltmpdir, "%s/"sge_u32"."sge_u32".%s", t, jobid,
>            jataskid, lGetString(qep, QU_qname));
>
> --
> Community Grid Engine: http://arc.liv.ac.uk/SGE/
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
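
To make the naming question concrete, here is a minimal sketch of the two schemes being compared. It assumes the directory name is simply <base>/<jobid>.<taskid>.<qname>, as the quoted sge_make_tmpdir call suggests. Both function names are hypothetical and not part of SGE, and the variant without the queue name is only a guess at what the proposed change might look like. The values are taken from the test output above.

```shell
#!/bin/sh
# Hypothetical helpers, NOT part of SGE: they only mimic the
# "<base>/<jobid>.<taskid>.<qname>" layout seen in the test output.

# Current scheme: the queue name is part of the directory name.
tmpdir_with_queue() {
    # $1 = base dir, $2 = job id, $3 = task id, $4 = queue name
    printf '%s/%s.%s.%s\n' "$1" "$2" "$3" "$4"
}

# Discussed alternative (an assumption): drop the queue name entirely.
tmpdir_without_queue() {
    # $1 = base dir, $2 = job id, $3 = task id
    printf '%s/%s.%s\n' "$1" "$2" "$3"
}

# Job 5314, task 1, granted slots in two queues on the same exechost.
for q in extra.q all.q; do
    echo "with queue:    $(tmpdir_with_queue /scratch 5314 1 "$q")"
    echo "without queue: $(tmpdir_without_queue /scratch 5314 1)"
done
```

With the queue name included, the two queues give two different directories on one host, which matches what the `qrsh -inherit` calls reported. Without it, every task of the job on that host would share a single path.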
