On Fri, Mar 11, 2011 at 8:30 AM, Reuti <[email protected]> wrote:
> On 10.03.2011 at 22:18, Lane Schwartz wrote:
>
>> Some of the jobs are array jobs, others are individually submitted. We
>> don't use resource reservations at all.
>
> As the resource consumption is changing, maybe the load sensor takes over
> and the jobs consume more memory than granted. For these default complexes
> the tighter restriction, whether from the consumable or the load sensor,
> will be used.
>
> Can you check with:
>
> $ qhost -F mem_free
Reuti,

At the moment I have a couple dozen jobs queued and waiting, each of which
requested mem_free=3072M.

There's one machine with no jobs running on it. When I run qhost -F mem_free,
that machine reports:

    hc:mem_free=29.000G
    LOAD: 0.33  MEM_TOT: 31.4G  MEM_USE: 971.7M
    SWAP_TOT: 59.6G  SWAP_USE: 506.5M

Lane

> what's really left on the machines.
>
>> On Thu, Mar 10, 2011 at 4:03 PM, Reuti <[email protected]> wrote:
>>> Hi,
>>>
>>> On 10.03.2011 at 20:04, Lane Schwartz wrote:
>>>
>>>> Lately I've noticed that many of my jobs take much longer than
>>>> expected (sometimes up to half an hour) to go from pending to
>>>> running, even when there are numerous nodes with sufficient resources
>>>> available. Right now, for example, I've got a couple dozen jobs in
>>>> pending, and 38 nodes where no jobs are running.
>>>>
>>>> I was wondering if anyone might be able to shed some light on why this
>>>> might be. As I said, there are plenty of nodes with sufficient
>>>> resources available to run the pending jobs, but they sometimes take a
>>>> long time to go from pending to running.
>>>>
>>>> For reference, mem_free is set to consumable, and my jobs use the
>>>> default value of 4GB for their requested mem_free. There are some
>>>> other users' jobs which request more memory than that.
>>>>
>>>> The only clue I've been able to find is from examining the qmaster
>>>> messages log file. It has lots of lines that look like the errors
>>>> below:
>>>>
>>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>>> capacity is 66765959168.262146, job 495795 requests additional
>>>> 68719476736.000000
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>
>>> - are these serial or parallel jobs?
>>>
>>> - do you use resource reservation for the mem_free request, as otherwise
>>> smaller ones with a lower request may slip in all the time?
>>>
>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
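[Editor's note: Reuti's question refers to Grid Engine's resource reservation feature, requested per job with `qsub -R y`. A minimal sketch of what that might look like — the script name and the 3072M figure are illustrative, taken from Lane's stated request; note that reservations take effect only if max_reservation is non-zero in the scheduler configuration:]

```
# Request a reservation so a large mem_free job is not perpetually
# starved by smaller jobs backfilling ahead of it.
# "my_job.sh" is a hypothetical job script.
qsub -R y -l mem_free=3072M my_job.sh

# Reservations are honored only when the scheduler allows them;
# inspect max_reservation (0 disables reservations):
qconf -ssconf | grep max_reservation
```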
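[Editor's note: converting the raw byte counts from the qmaster log excerpt above into GiB makes the mismatch concrete. This is a quick sketch in plain Python (nothing Grid Engine-specific); the two figures are copied verbatim from the log.]

```python
# Convert the byte values from the qmaster "mem_free exceeded" log line
# into GiB to see why the dispatch was rejected.
GIB = 2 ** 30  # bytes per GiB

capacity_bytes = 66765959168.262146  # "capacity is ..." (host load value for mem_free)
request_bytes = 68719476736.0        # "job 495795 requests additional ..."

capacity_gib = capacity_bytes / GIB
request_gib = request_bytes / GIB

print(f"capacity:  {capacity_gib:.2f} GiB")            # ~62.18 GiB
print(f"request:   {request_gib:.2f} GiB")             # exactly 64 GiB
print(f"shortfall: {request_gib - capacity_gib:.2f} GiB")
```

So the job's request (64 GiB) exceeds the load-sensor value (~62.2 GiB) by roughly 1.8 GiB, which matches Reuti's point that the tighter of the consumable and the load sensor wins.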
