I think I learned this trick from Reuti:

- Any legit job running under Grid Engine will be a descendant of an sge_execd daemon (execd starts an sge_shepherd for each job, and the job runs under that shepherd).

A nice little trick is a cron job that does a "kill -9" on any user process that is not a descendant of sge_execd -- that will quickly send a message to the people bypassing the resource scheduling layer.
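A minimal sketch of such a cron job is below. It is an illustration under stated assumptions, not a tested production tool: it assumes Linux compute nodes (it reads /proc), that the daemon appears under the command name sge_execd, and a hypothetical user-list path /etc/cluster-users.txt standing in for the "one user per line" file mentioned further down the thread. Because each job runs under an sge_shepherd rather than directly under execd, it walks the whole parent chain instead of checking only the immediate parent. It only reports by default; the --kill flag (defined by this sketch, not by SGE) actually sends SIGKILL. As a safety guard, it does nothing if no sge_execd is running on the node.

```python
#!/usr/bin/env python3
"""Find (and optionally SIGKILL) user processes that are not descendants
of sge_execd. Sketch only: Linux /proc layout assumed, dry run by default."""
import os
import pwd
import signal
import sys
from typing import Dict, List, Set, Tuple

# Assumption: the daemon's command name as it appears in /proc/<pid>/stat.
EXECD_NAME = "sge_execd"
# Hypothetical location of the "one user per line" list from the thread.
USERS_FILE = "/etc/cluster-users.txt"

# pid -> (ppid, uid, comm)
ProcTable = Dict[int, Tuple[int, int, str]]


def read_proc_table() -> ProcTable:
    """Snapshot every process visible in /proc."""
    table: ProcTable = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            uid = os.stat(f"/proc/{pid}").st_uid  # owner of the /proc entry
        except OSError:
            continue  # process exited while we were scanning
        # Format is "pid (comm) state ppid ..."; comm may contain spaces or ')'.
        close = stat.rindex(")")
        comm = stat[stat.index("(") + 1 : close]
        ppid = int(stat[close + 2 :].split()[1])
        table[pid] = (ppid, uid, comm)
    return table


def is_sge_descendant(pid: int, table: ProcTable) -> bool:
    """Walk up the parent chain; True if any ancestor is sge_execd."""
    seen: Set[int] = set()
    while pid in table and pid not in seen:
        seen.add(pid)
        ppid, _uid, comm = table[pid]
        if comm == EXECD_NAME:
            return True
        pid = ppid
    return False


def find_offenders(table: ProcTable, user_uids: Set[int]) -> List[int]:
    """PIDs owned by listed users that did not come from sge_execd."""
    if not any(comm == EXECD_NAME for _ppid, _uid, comm in table.values()):
        return []  # safety: execd is down, so don't treat every process as rogue
    return sorted(
        pid
        for pid, (_ppid, uid, _comm) in table.items()
        if uid in user_uids and not is_sge_descendant(pid, table)
    )


def load_user_uids(path: str) -> Set[int]:
    """Map the flat user list (one name per line) to numeric UIDs."""
    uids: Set[int] = set()
    with open(path) as f:
        for line in f:
            name = line.strip()
            if not name:
                continue
            try:
                uids.add(pwd.getpwnam(name).pw_uid)
            except KeyError:
                pass  # listed user has no account on this node
    return uids


def main(argv: List[str]) -> None:
    really_kill = "--kill" in argv
    table = read_proc_table()
    for pid in find_offenders(table, load_user_uids(USERS_FILE)):
        _ppid, uid, comm = table[pid]
        action = "killing" if really_kill else "would kill"
        print(f"{action} pid={pid} uid={uid} comm={comm}")
        if really_kill:
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # already gone


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it as root on each compute node, first without --kill to see what it would hit. A system crontab entry such as `*/15 * * * * root /usr/local/sbin/kill_rogue_jobs.py --kill` would then enforce it every 15 minutes (the script path here is hypothetical).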

That said, however, I've been in this position in a number of environments and I can tell you that you will NEVER win the battle with users trying to game the system. The motivated user will always have more time and more incentive than an overworked cluster administrator.

While simple technical measures like that "kill -9" trick, or Reuti's more sensible suggestion of blocking interactive SSH access to nodes outside of SGE, should be pursued, I'd suggest that you don't spend much more time than that developing technical countermeasures.

The real way this gets solved in a multi-user cluster environment is by treating acceptable cluster usage as a human resources policy. You'll never win a technical battle with a motivated power user.

Acceptable cluster use should be governed by a published policy, and when the policy is ignored or gamed, the response should involve mentors, managers or the HR department, not technology or scripts.

In a corporate setting this comes down to:

1. First time you bypass SGE, the admins send you a warning.

2. Second time you get caught, your manager gets notified.

3. Third time? Your account is disabled and you are reported to the HR department for repeatedly violating company policy.

Sorry for being long-winded, but most long-time cluster admins might share my opinion that cluster use policies can't be treated as a technical war between admins and users -- it's far easier and better to treat this as a workplace behavior thing.

-Chris

Reuti wrote:
Hi,

On 19.08.2011 at 18:30, Gowtham wrote:

In some of the computing clusters across our campus, we have noticed many users
running their jobs outside of the SGE queuing system. While we plan to
continue teaching them about the benefits of using a queuing system, not
everyone seems to be getting the message, and as a result these
users' jobs are hampering those who have been
using SGE.

On all our Rocks-based clusters, we keep the list of
each cluster's users in a flat text file, one user per line.

Is there a way by which I (as root) can kill all those
jobs submitted outside of SGE on compute nodes by these
normal users?
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
