I agree with Chris.  You will never find a technical solution to 
"policing" your cluster.  I have found the number one most effective tool 
to be shame.  I publish a list of the largest disk hogs and rogue jobs to 
the entire community under the guise of a "How is the cluster doing" report. 
Nothing like peer pressure to put a stop to this kind of activity.

        --Gary



From:   Chris Dagdigian <[email protected]>
To:     Reuti <[email protected]>, Gowtham <[email protected]>, 
Sun Grid Engine Discussion List <[email protected]>, NPACI Rocks 
Discussion List <[email protected]>
Date:   08/19/2011 12:52 PM
Subject:        Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs 
on Compute Nodes by Normal Users
Sent by:        [email protected]



I think I learned this trick from Reuti:

  - Any legit job running under Grid Engine will be a child process of 
an sge_execd daemon.

A nice little trick is a cronjob that does a "kill -9" on any user 
process that is not a child of sge_execd -- that will quickly send a 
message to the people bypassing the resource scheduling layer.
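
The parent-process check described above could be sketched roughly as 
follows in Python. This is only an illustration, not the script Chris or 
Reuti actually used: the Proc tuple, find_rogue_pids, read_process_table, 
and the min_uid cutoff of 500 are hypothetical names and values, and it 
assumes a Linux "ps" that accepts "-eo pid=,ppid=,uid=,comm=". It only 
reports candidate PIDs; it does not kill anything.

```python
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    uid: int
    name: str


def read_process_table() -> List[Proc]:
    """Snapshot the process table via ps (assumes Linux procps output)."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,uid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    procs = []
    for line in out.splitlines():
        fields = line.split(None, 3)
        if len(fields) == 4:
            pid, ppid, uid, name = fields
            procs.append(Proc(int(pid), int(ppid), int(uid), name))
    return procs


def find_rogue_pids(procs: List[Proc],
                    scheduler_name: str = "sge_execd",
                    min_uid: int = 500) -> List[int]:
    """Return PIDs of normal-user processes with no sge_execd ancestor.

    min_uid is an assumed cutoff separating system accounts from normal
    users; adjust it (or use an explicit user list) for your site.
    """
    by_pid = {p.pid: p for p in procs}

    def under_scheduler(proc: Proc) -> bool:
        # Walk up the parent chain until we hit the scheduler daemon,
        # run out of known parents, or detect a loop.
        seen = set()
        cur = proc
        while cur is not None and cur.pid not in seen:
            if cur.name == scheduler_name:
                return True
            seen.add(cur.pid)
            cur = by_pid.get(cur.ppid)
        return False

    return sorted(p.pid for p in procs
                  if p.uid >= min_uid and not under_scheduler(p))


if __name__ == "__main__":
    # Dry run: print candidates instead of sending any signal.
    for pid in find_rogue_pids(read_process_table()):
        print(pid)
```

Note that a check like this would also flag an administrator's own 
non-root login sessions on the node, so any real cronjob would need an 
exemption list and a dry-run period before it sends signals.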

That said, however, I've been in this position in a number of 
environments and I can tell you that you will NEVER win the battle with 
users trying to game the system. The motivated user will always have 
more time and more incentive than an overworked cluster administrator.

While simple technical measures like that "kill -9" trick or Reuti's 
more sensible suggestion of blocking interactive SSH access to nodes 
outside of SGE should be pursued, I'd suggest that you don't spend much 
more time than that developing technical countermeasures.

The real way this gets solved in a multi-user cluster environment is by 
treating acceptable cluster usage as a human resources policy. You'll 
never win a technical battle with a motivated power user.

Acceptable cluster use should be governed by a published policy and when 
the policy is avoided or gamed then the response should involve mentors, 
managers or the HR department, not technology or scripts.

In a corporate setting this comes down to:

1. First time you bypass SGE the admins send you a warning

2. Second time you get caught your manager gets notified

3. Third time? Account is disabled and you are reported to the HR 
department for violating company policy repeatedly

Sorry for being long-winded but most long-time cluster admins might 
share my opinion that cluster use policies can't be treated as a 
technical war between admins and users -- it's far easier and better to 
treat this as a workplace behavior thing.

-Chris






Reuti wrote:
> Hi,
>
> Am 19.08.2011 um 18:30 schrieb Gowtham:
>
>> In some of the computing clusters across our campus, we have noticed
>> many users running their jobs outside of the SGE queuing system. While
>> we have plans to continue tutoring them about the benefits of using a
>> queuing system, not everyone seems to be getting the message - as such,
>> these violating-users' jobs are hampering those who have been
>> using SGE.
>>
>> On all our Rocks based clusters, we do keep the list of
>> cluster's users in a flat text file, one user per line.
>>
>> Is there a way by which I (as root) can kill all those
>> jobs submitted outside of SGE on compute nodes by these
>> normal users?
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


