On Tue, Jan 31, 2012 at 11:19 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 31.01.2012 at 18:20, Michael Coffman wrote:
>
>> I have a question that relates to this thread.  I have created a
>> process that is started by the prolog script and watches the job's
>> memory use via /proc/pid/smaps (to include shared memory).  The
>> process watches all processes that contain the
>> $SGE_JOB_SPOOL_DIR/addgrpid id in their Groups status line.  The
>> script is supposed to run until all processes have terminated.  I have
>> found a few machines where the grid job has completed, but my 'gwatch'
>> script is still running.  It turns out that on these systems, there is
>> an nfsd that has the addgrpid as part of its groups.  Example:
>>
>> [root@cs246 ~]# grep 22024 /proc/*/status
>> /proc/3195/status:Groups: 90 22024 27903
>> [root@cs246 ~]# ypmatch 22024 group.bygid
>> sge22024:*:22024:
>> [root@cs246 ~]# ps -ef | grep 3195
>> root      3195     1  0  2011 ?        00:00:00 [nfsd]
>> root     15921 15809  0 10:18 pts/0    00:00:00 grep 3195
>
> It was by design to run for a short time under the gid of the user, and
> that is the reason why deletion by additional group ID was disabled by
> default in former times - killing by group ID could also kill the nfsd.
> But I thought that it was only seen on servers and not clients, and newer
> kernels shouldn't show it.  Do you run an NFS server on all exec hosts?
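For readers wondering what the smaps-based accounting described above might look like: the sketch below sums proportional set size (Pss) for every process whose supplementary groups contain the job's additional group ID. It is not the actual 'gwatch' script - the function name and the PROC_ROOT override are invented for illustration, and it assumes a kernel whose /proc/pid/smaps exposes the Pss field (2.6.25 or later).

```shell
#!/bin/bash
# Sketch only - not the real 'gwatch' script.  Sums Pss (in kB) over all
# processes whose supplementary groups include the given GID.
# PROC_ROOT defaults to /proc; it is a parameter so the logic can be
# exercised against a fake tree.
job_pss_kb() {
    local gid=$1 root=${PROC_ROOT:-/proc} total=0 pid kb status
    for status in "$root"/[0-9]*/status; do
        # Match the GID as a whole word on the "Groups:" line.
        grep -q "^Groups:.*\b$gid\b" "$status" 2>/dev/null || continue
        pid=${status%/status}; pid=${pid##*/}
        # Sum every "Pss:" entry in the process's smaps (values are in kB).
        kb=$(awk '/^Pss:/ {sum += $2} END {print sum + 0}' \
                 "$root/$pid/smaps" 2>/dev/null)
        total=$((total + kb))
    done
    echo "$total"
}

# Possible use in a watcher, with the GID from the job spool directory:
# job_pss_kb "$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")"
```

Pss rather than Rss is used here so that shared pages are divided among the processes mapping them, which matches the "include shared memory" goal without double-counting.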
Thanks for the reply.  I don't really understand what this means.  Do you
mean that nfsd is designed to run under the group ID of the user for a short
time?  Yes, we run NFS servers on all grid exec hosts.

> The additional group id is for sure 22024, and/or 27903 is a second one
> from another process?

Yes, the 27903 and 90 are non-SGE group IDs.  Sounds like the right thing to
do is just to have the script exit once all non-nfsd processes with the
group ID have exited.

> -- Reuti
>
>
>> This may be more appropriate for an NFS mailing list at this point,
>> but any clues as to how and why this group ID gets added to nfsd?
>>
>> Thanks.
>>
>> On Fri, Jan 13, 2012 at 2:57 PM, Reuti <[email protected]> wrote:
>>> On 13.01.2012 at 19:40, Michael Coffman wrote:
>>>
>>>>>> <snip>
>>>>>> It currently determines the pid of the shepherd process and then
>>>>>> watches all the child processes.
>>>>>
>>>>> I think it's easier to use the additional group ID, which is attached
>>>>> to all kids by SGE, whether they jump out of the process tree or not.
>>>>> This one is recorded in $SGE_JOB_SPOOL_DIR in the file "addgrpid".
>>>>
>>>> Had not thought of this.  Sounds like a good idea.  At first glance I
>>>> am not seeing how to list the jobs via ps that are identified by the
>>>> gid in the addgrpid file.  I tried ps -G `cat addgrpid` -o vsz,rss,args
>>>> but it returns nothing.  I'll have to dig into this a bit more.
>>>
>>> Yes, it's most likely only in /proc:
>>>
>>> $ qrsh
>>> Running inside SGE
>>> Job 3696
>>> $ id
>>> uid=1000(reuti) gid=100(users)
>>> groups=10(wheel),16(dialout),33(video),100(users),20007
>>> $ grep -l -r "^Groups.* 20007" /proc/*/status 2>/dev/null | sed -n "s|/proc/\([0-9]*\)/status|\1|p"
>>> 13306
>>> 13628
>>> 13629
>>>
>>>>>> Initially it will be watching memory usage, and if a job begins
>>>>>> using more physical memory than requested, the user will be
>>>>>> notified.  That's where my question comes from.
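The fix proposed above - exit once only nfsd still carries the additional group ID - could be sketched roughly as below. Again this is an illustration, not the real script; the empty-cmdline check is an assumption that happens to cover kernel threads such as [nfsd], whose /proc/pid/cmdline is empty.

```shell
#!/bin/bash
# Sketch: list user-space PIDs whose supplementary groups include the
# given GID, skipping kernel threads (empty cmdline, e.g. [nfsd]).
# PROC_ROOT defaults to /proc and exists only so the logic is testable.
procs_with_gid() {
    local gid=$1 root=${PROC_ROOT:-/proc} pid status
    for status in "$root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        pid=${status%/status}; pid=${pid##*/}
        # Match the GID as a whole word on the "Groups:" line.
        grep -q "^Groups:.*\b$gid\b" "$status" 2>/dev/null || continue
        # Kernel threads have a zero-length cmdline - ignore them.
        [ -s "$root/$pid/cmdline" ] && echo "$pid"
    done
}

# The watcher could then loop until no user-space process holds the GID:
# gid=$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")
# while [ -n "$(procs_with_gid "$gid")" ]; do sleep 5; done
```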
>>>>>
>>>>> What about setting a soft limit for h_vmem and preparing the job
>>>>> script to handle the signal and send an email?  How will they request
>>>>> memory - by virtual_free?
>>>>
>>>> Memory is requested via a consumable complex that we define as the
>>>> amount of physical memory.  The way most of the jobs are run
>>>> currently, we could not do this.  Job scripts typically call a
>>>> commercial vendor's binary, so there is nothing listening for the
>>>> signals.
>>>
>>> Ok.  Depending on the application and whether it resets the traps, you
>>> can try to use a subshell to ignore the signal for the application, as
>>> the signal is sent to the complete process group:
>>>
>>> #!/bin/bash
>>> trap 'echo USR1' usr1
>>> (trap '' usr1; exec your_binary) &
>>> PID=$!
>>> wait $PID
>>> RET=$?
>>> while [ $RET -eq 138 ]; do wait $PID; RET=$?; done
>>>
>>> '' = two single quotation marks.
>>> After the first signal, `wait` must be called again.
>>>
>>>>>> Is there any way in the prolog to get access to the hard_request
>>>>>> options besides using qstat?
>>>>>>
>>>>>> What I'm currently doing:
>>>>>>
>>>>>> cmd = "bash -c '. #{@sge_root}/default/common/settings.sh && qstat -xml -j #{@number}'"
>>>>>>
>>>>>> I have thought of possibly setting an environment variable via a JSV
>>>>>> script that can be queried by the prolog script.  Is this a good
>>>>>> idea?  How much impact on submission time does jsv_send_env() add?
>>>>>
>>>>> You can use either a JSV or a `qsub` wrapper for it.
>>>>>
>>>>>> Anyone else doing anything like this have any suggestions?
>>>>>>
>>>>>> The end goal is to have a utility that users can also interact with
>>>>>> to monitor their jobs, by either setting environment variables or
>>>>>> grid complexes
>>>>>
>>>>> Complexes are only handled internally by SGE.  There is no user
>>>>> command for a non-admin to change them.
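On getting at the hard requests from the prolog: besides the XML route above, the plain `qstat -j` output can be scraped with standard tools. The sketch below is an assumption-laden illustration - the "hard resource_list:" label matches classic Grid Engine output but should be verified against your qstat version, and `your_mem_complex` is a hypothetical name for the site's consumable.

```shell
#!/bin/bash
# Sketch: extract one hard-requested resource from plain `qstat -j`
# output read on stdin.  The "hard resource_list:" label and the
# comma-separated value format are assumptions about classic SGE output.
hard_request() {
    local resource=$1
    sed -n 's/^hard resource_list:[[:space:]]*//p' |
        tr ',' '\n' |
        sed -n "s/^${resource}=//p"
}

# Hypothetical use in a prolog ($JOB_ID is set by SGE for prolog scripts):
# mem=$(qstat -j "$JOB_ID" | hard_request your_mem_complex)
```

Compared with setting an environment variable from a JSV, this keeps the submission path untouched, at the cost of one qstat call per job start.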
>>>>
>>>> My thoughts on the complex were that there would be a complex flag
>>>> indicating that the user wanted to monitor memory, or CPU, etc.  Not
>>>> that it would be changeable by the user - just an indicator for the
>>>> JSV script.
>>>
>>> Ok.
>>>
>>> -- Reuti
>>>
>>>>>> to affect the behavior of what is being watched and how they are
>>>>>> notified.
>>>>>
>>>>> AFAIK you can't change the content of an already inherited variable,
>>>>> as the process got a copy of the value.  Also, /proc/12345/environ is
>>>>> only readable.  And your "observation daemon" will run on all nodes -
>>>>> one for each job, started from the prolog, if I get you right?
>>>>
>>>> Correct.
>>>>
>>>>> But a nice solution could be the usage of the job context.  This can
>>>>> be set by the user on the command line, and your job can access it by
>>>>> issuing a command similar to the one you already use.  If the exec
>>>>> hosts are submit hosts, the job can also change it by using `qalter`,
>>>>> like the user has to on the command line.  We use the job context
>>>>> only for documentation purposes, to record the issued command and
>>>>> append it to the email which is sent after the job.
>>>>>
>>>>> http://gridengine.org/pipermail/users/2011-September/001629.html
>>>>>
>>>>> $ qstat -j 12345
>>>>> ...
>>>>> context: COMMAND=subturbo -v 631 -g -m 3500 -p 8 -t infinity -s aoforce,OUTPUT=/home/foobar/carbene/gecl4_2carb228/trans_tzvp_3.out
>>>>>
>>>>> It's only one long line, and I split it later into individual
>>>>> entries.  In your case you have to watch out for commas, as they are
>>>>> already used to separate entries.
>>>>
>>>> The context sounds very interesting.  Not something we have really
>>>> played around with.
>>>>
>>>> Again, thanks for the input.
>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Thanks.
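The job-context approach above could be wired up roughly as follows. The key name MONITOR is hypothetical, and the "context:" label and comma-separated format are taken from the qstat output shown in the quoted mail; as noted there, this simple split breaks if a value itself contains commas.

```shell
#!/bin/bash
# Sketch: set a context entry at submission time, then read it back by
# splitting the single "context:" line of `qstat -j` output (on stdin).
# MONITOR=mem is a made-up key; commas inside values are NOT handled.
context_value() {
    local key=$1
    sed -n 's/^context:[[:space:]]*//p' |
        tr ',' '\n' |
        sed -n "s/^${key}=//p"
}

# Hypothetical usage:
#   qsub -ac MONITOR=mem job.sh          # user sets the preference
#   qstat -j "$JOB_ID" | context_value MONITOR   # daemon reads it back
#   qalter -ac MONITOR=off "$JOB_ID"     # changeable later, if exec
#                                        # hosts are also submit hosts
```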
>>>>>>
>>>>>> --
>>>>>> -MichaelC
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users

--
-MichaelC
