Re: [gridengine users] slot quota dynamically ?

2011-08-22 Thread Reuti
On 22.08.2011 at 13:13, Schmidt U. wrote:

 Dear all,
 is there a way to define slot quotas dynamically? By default I set the RQS to
 {
   name slot_per_us
   description  slot limitation
   enabled  TRUE
   limit users {*} to slots=200
 }
 But sometimes users have a huge number of short jobs, so I am thinking
 about a limit along the lines of slots * walltime.
 Can I use some kind of variable instead of a fixed number?

Dynamic limits only work on a per-host basis (man sge_resource_quota),
although I think it was a long-term goal to implement them for any combination.
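
For reference, a dynamic per-host limit can look roughly like the following.
This is only a sketch modelled on the example in the man page; the rule set
name is made up and the factor of 2 is purely illustrative:

```
{
   name         dynamic_slots_per_host
   description  "slots on each host limited relative to its processor count"
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc*2
}
```

Here $num_proc is the load value reporting the number of processors of each
host, so the limit follows the hardware instead of being a fixed number.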

But even then walltime (unit TIME) and slot (unit INT) can't be multiplied in a 
meaningful way.

Do you want to allow, for example, either 1 job with a 24-hour run time or
24 jobs with a 1-hour run time?

Would it help to define an RQS on a combination of user and queue (i.e. some
kind of short queue whose h_rt limit ensures that the jobs in it are short)?
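
As a rough sketch of that idea (the queue name short.q and the limit of 400
are made up for illustration, and the short queue is assumed to have an h_rt
limit set in its queue configuration, e.g. 1:00:00):

```
{
   name         slots_per_user_by_queue
   description  "higher per-user slot limit in the short queue"
   enabled      TRUE
   limit        users {*} queues short.q to slots=400
   limit        users {*} to slots=200
}
```

Within one rule set only the first matching rule is applied, so the more
specific short.q rule has to come before the general one.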

--- Reuti


 buudo
 
 ___
 users mailing list
 users@gridengine.org
 https://gridengine.org/mailman/listinfo/users


[gridengine users] cluster IP change, now qlogin timeout (4 s) expired while waiting on socket fd 4

2011-08-22 Thread bergman
Last week, our cluster (SGE 6.2u5) was working fine. We've got 4
machines designated for both interactive use and batch jobs, in a queue that
can be subordinated when needed, and many batch-only nodes.

We're using SSH for qlogin, with the qlogin command set to:

-
#!/bin/sh
HOST=$1
PORT=$2
exec /usr/bin/ssh -t -t -X -Y -p $PORT $HOST
-


On Friday, we changed datacenters and IP numbers. All hostnames (local and
FQDN) stayed the same.

As of today:

qlogin from the headnode to node interactive1 is fine

qlogin from the headnode to nodes interactive[2-4] fails with
a timeout

qsub jobs from the headnode to all nodes (including
interactive2-4) work fine

All IP changes were scripted and seem to have been complete. A simple
check (grep -lr old.IP.subnet /etc /opt/gridengine) reveals no files
that were not updated on interactive[2-4].
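
A further check along the same lines, sketched here as an assumption rather
than something that was run on this cluster: compare forward and reverse name
resolution for each node, as seen from the headnode and from the nodes
themselves. The helper name check_host is made up, and getent is assumed to be
available.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print the forward and reverse
# lookup for one host name, as resolved on the machine running the script.
# Assumes getent(1) is available.
check_host() {
    host=$1
    # First address the resolver returns for the name.
    ip=$(getent hosts "$host" 2>/dev/null | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "$host: no forward lookup"
        return 1
    fi
    # First name the resolver returns for that address.
    rev=$(getent hosts "$ip" 2>/dev/null | awk '{print $2; exit}')
    echo "$host -> $ip -> ${rev:-no reverse lookup}"
}

# Example call (host names as used in this post):
#   for n in 1 2 3 4; do check_host "interactive$n"; done
```

A name that maps to an old address, or an address that maps back to a
different name, would not show up in the grep above if the stale entry lives
in DNS or another name service rather than in a local file.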

The $SGE_ROOT/$SGE_CELL/spool/qmaster/messages file contains entries like:

---
08/22/2011 19:00:35|worker|headnode|W|job 2148448.1 failed on host interactive2.fqdn assumedly after job because: job 2148448.1 died through signal KILL (9)
---

I've seen many discussions about debugging qlogin timeouts, but no common
threads or solutions.

Are there any suggestions about debugging this instance?

Thanks,

Mark
