I have a question that relates to this thread. I have created a process that is started by the prolog script and watches the job's memory use via /proc/<pid>/smaps (to include shared memory). The process watches all processes that contain the $SGE_JOB_SPOOL_DIR/addgrpid ID in their Groups status line. The script is supposed to run until all of those processes have terminated. I have found a few machines where the grid job has completed, but my 'gwatch' script is still running. It turns out that on these systems there is an nfsd that has the addgrpid as part of its groups. Example:
[root@cs246 ~]# grep 22024 /proc/*/status
/proc/3195/status:Groups: 90 22024 27903
[root@cs246 ~]# ypmatch 22024 group.bygid
sge22024:*:22024:
[root@cs246 ~]# ps -ef | grep 3195
root      3195      1  0  2011 ?        00:00:00 [nfsd]
root     15921  15809  0 10:18 pts/0    00:00:00 grep 3195

This may be more appropriate for an NFS mailing list at this point, but any
clues as to how and why this group ID gets added to nfsd? Thanks.

On Fri, Jan 13, 2012 at 2:57 PM, Reuti <[email protected]> wrote:
> On 13.01.2012, at 19:40, Michael Coffman wrote:
>
>>>> <snip>
>>>> It currently determines the pid of the shepherd process and then
>>>> watches all the child processes.
>>>
>>> I think it's easier to use the additional group ID, which is attached
>>> to all kids by SGE, whether they jump out of the process tree or not.
>>> This one is recorded in $SGE_JOB_SPOOL_DIR in the file "addgrpid".
>>
>> Had not thought of this. Sounds like a good idea. At first glance I am
>> not seeing how to list via ps the jobs that are identified by the gid
>> in the addgrpid file. I tried `ps -G $(cat addgrpid) -o vsz,rss,args`
>> but it returns nothing. I'll have to dig into this a bit more.
>
> Yes, it's most likely only in /proc:
>
> $ qrsh
> Running inside SGE
> Job 3696
> $ id
> uid=1000(reuti) gid=100(users)
> groups=10(wheel),16(dialout),33(video),100(users),20007
> $ grep -l -r "^Groups.* 20007" /proc/*/status 2>/dev/null | sed -n "s|/proc/\([0-9]*\)/status|\1|p"
> 13306
> 13628
> 13629
>
>>>> Initially it will be watching memory usage, and if a job begins using
>>>> more physical memory than requested, the user will be notified.
>>>> That's where my question comes from.
>>>
>>> What about setting a soft limit for h_vmem and preparing the job
>>> script to handle the signal to send an email? How will they request
>>> memory - by virtual_free?
>>
>> Memory is requested via a consumable complex that we define as the
>> amount of physical memory.
>> The way most of the jobs are run currently, we could not do this. Job
>> scripts typically call a commercial vendor's binary, so there is
>> nothing listening for the signals.
>
> Ok. Depending on the application and whether it resets the traps, you
> can try to use a subshell to ignore the signal for the application,
> since the signal is sent to the complete process group:
>
> #!/bin/bash
> trap 'echo USR1' usr1
> (trap '' usr1; exec your_binary) &
> PID=$!
> wait $PID
> RET=$?
> while [ $RET -eq 138 ]; do wait $PID; RET=$?; done
>
> '' = two single quotation marks
> After the first signal, `wait` must be called again.
>
>>>> Is there any way in the prolog to get access to the hard_request
>>>> options besides using qstat?
>>>>
>>>> What I'm currently doing:
>>>>
>>>> cmd = "bash -c '. #{@sge_root}/default/common/settings.sh && qstat -xml -j #{@number}'"
>>>>
>>>> I have thought of possibly setting an environment variable via a JSV
>>>> script that can be queried by the prolog script. Is this a good idea?
>>>> How much impact on submission time does jsv_send_env() add?
>>>
>>> You can use either a JSV or a `qsub` wrapper for it.
>>>
>>>> Anyone else doing anything like this have any suggestions?
>>>>
>>>> The end goal is to have a utility that users can also interact with
>>>> to monitor their jobs, by either setting environment variables or
>>>> grid complexes
>>>
>>> Complexes are only handled internally by SGE. There is no user command
>>> to change them for a non-admin.
>>
>> My thoughts on the complex were that there would be a complex flag that
>> would indicate that the user wanted to monitor memory, or cpu, etc.
>> Not that it would be changeable by the user, just an indicator for the
>> JSV script.
>
> Ok.
>
> -- Reuti
>
>>>> to affect the behavior of what is being watched and how they are
>>>> notified.
>>>
>>> AFAIK you can't change the content of an already inherited variable,
>>> as the process got a copy of the value.
>>> Also, /proc/12345/environ is read-only. And your "observation daemon"
>>> will run on all nodes - one for each job from the prolog, if I get you
>>> right?
>>
>> Correct.
>>
>>> But a nice solution could be the usage of the job context. This can be
>>> set by the user on the command line, and your job can access it by
>>> issuing a command similar to the one you use already. If the exec
>>> hosts are submit hosts, the job can also change it by using `qalter`,
>>> like the user has to on the command line. We use the job context only
>>> for documentation purposes, to record the issued command and append it
>>> to the email which is sent after the job.
>>>
>>> http://gridengine.org/pipermail/users/2011-September/001629.html
>>>
>>> $ qstat -j 12345
>>> ...
>>> context: COMMAND=subturbo -v 631 -g -m 3500 -p 8 -t infinity -s aoforce,OUTPUT=/home/foobar/carbene/gecl4_2carb228/trans_tzvp_3.out
>>>
>>> It's only one long line, and I split it later on into individual
>>> entries. In your case you have to watch out for commas, as they are
>>> already used to separate entries.
>>
>> The context sounds very interesting. Not something we have really
>> played around with.
>>
>> Again, thanks for the input.
>>
>>> -- Reuti
>>>
>>>> Thanks.
>>>>
>>>> --
>>>> -MichaelC
>>
>> --
>> -MichaelC

--
-MichaelC
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
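For reference, the /proc scan described at the top of this message can be sketched roughly like this (the function names `pids_with_addgrpid` and `sum_rss_kb` are made up for illustration, not taken from the actual gwatch script). One detail relevant to the stuck-watcher symptom: kernel threads such as the nfsd above have an empty /proc/<pid>/cmdline, which gives a way to skip them:

```shell
#!/bin/sh
# Sketch only: list processes carrying the job's additional group ID and
# sum resident memory from their smaps.  Function names are illustrative.

pids_with_addgrpid() {   # usage: pids_with_addgrpid <gid>
    for status in /proc/[0-9]*/status; do
        pid=${status#/proc/}; pid=${pid%/status}
        # Kernel threads (e.g. [nfsd]) have an empty cmdline - skip them
        # so the watcher is not kept alive after the job is gone.
        [ -n "$(tr -d '\0' < "/proc/$pid/cmdline" 2>/dev/null)" ] || continue
        grep '^Groups:' "$status" 2>/dev/null | grep -qw "$1" && echo "$pid"
    done
}

sum_rss_kb() {           # sum all "Rss:" lines of an smaps stream on stdin
    awk '/^Rss:/ { kb += $2 } END { print kb + 0 }'
}
```

The watcher would then loop over `pids_with_addgrpid "$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")"`, feed each `/proc/<pid>/smaps` through `sum_rss_kb`, and exit once the list comes back empty.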

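Reuti's subshell/trap pattern quoted above can also be tried without SGE. In this toy version, `sleep` stands in for the vendor binary and a backgrounded `kill` for the soft-limit notification; it shows why `wait` has to be called again, since bash returns 128+signal (138 for USR1 on Linux) when a trapped signal interrupts it:

```shell
#!/bin/sh
# Toy demo of the trap/wait pattern - no SGE needed.
run_with_usr1_retry() {
    got_usr1=0
    trap 'got_usr1=$((got_usr1 + 1))' USR1
    ( sleep 1; kill -USR1 $$ ) &        # stand-in for the s_vmem signal
    ( trap '' USR1; exec sleep 2 ) &    # child ignores USR1, like the binary
    pid=$!
    wait "$pid"; ret=$?
    # Any status > 128 means the wait was interrupted by a trapped signal
    # (138 = 128 + SIGUSR1 in bash), so keep waiting for the real exit.
    while [ "$ret" -gt 128 ]; do wait "$pid"; ret=$?; done
    echo "signals seen: $got_usr1, child exit: $ret"
}
```

In the real job script, the trap body would send the notification email instead of counting; the looped `wait` matches the "after the first signal `wait` must be called again" remark above.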