On Tue, Jan 31, 2012 at 11:19 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 31.01.2012 at 18:20, Michael Coffman wrote:
>
>> I have a question that relates to this thread.  I have created a
>> process that is started by the prolog script and watches the job's
>> memory use via /proc/pid/smaps (to include shared memory).  The
>> process watches all processes that contain the
>> $SGE_JOB_SPOOL_DIR/addgrpid id in their Groups status line.  The
>> script is supposed to run until all processes have terminated.  I have
>> found a few machines where the grid job has completed, but my 'gwatch'
>> script is still running.  It turns out that on these systems, there is
>> an nfsd that has the addgrpid as part of its groups.  Example:
>>
>> [root@cs246 ~]# grep 22024 /proc/*/status
>> /proc/3195/status:Groups: 90 22024 27903
>> [root@cs246 ~]# ypmatch 22024 group.bygid
>> sge22024:*:22024:
>> [root@cs246 ~]# ps -ef | grep 3195
>> root      3195     1  0  2011 ?        00:00:00 [nfsd]
>> root     15921 15809  0 10:18 pts/0    00:00:00 grep 3195
>
> It was by design to run for a short time under the gid of the user, and
> that is the reason why deletion by additional group ID was disabled by
> default in former times - killing by group ID could also kill the nfsd.
> But I thought that it was only seen on servers and not clients, and newer
> kernels shouldn't show it.  Do you run an NFS server on all exec hosts?
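For readers wondering what the smaps-based accounting described above might look like: the sketch below sums proportional set size (Pss) for every process whose supplementary groups contain the job's additional group ID. It is not the actual 'gwatch' script - the function name and the PROC_ROOT override are invented for illustration, and it assumes a kernel whose /proc/pid/smaps exposes the Pss field (2.6.25 or later).

```shell
#!/bin/bash
# Sketch only - not the real 'gwatch' script.  Sums Pss (in kB) over all
# processes whose supplementary groups include the given GID.
# PROC_ROOT defaults to /proc; it is a parameter so the logic can be
# exercised against a fake tree.
job_pss_kb() {
    local gid=$1 root=${PROC_ROOT:-/proc} total=0 pid kb status
    for status in "$root"/[0-9]*/status; do
        # Match the GID as a whole word on the "Groups:" line.
        grep -q "^Groups:.*\b$gid\b" "$status" 2>/dev/null || continue
        pid=${status%/status}; pid=${pid##*/}
        # Sum every "Pss:" entry in the process's smaps (values are in kB).
        kb=$(awk '/^Pss:/ {sum += $2} END {print sum + 0}' \
                 "$root/$pid/smaps" 2>/dev/null)
        total=$((total + kb))
    done
    echo "$total"
}

# Possible use in a watcher, with the GID from the job spool directory:
# job_pss_kb "$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")"
```

Pss rather than Rss is used here so that shared pages are divided among the processes mapping them, which matches the "include shared memory" goal without double-counting.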
Thanks for the reply.  I don't really understand what this means.  Do you
mean that nfsd is designed to run under the group ID of the user for a short
time?  Yes, we run NFS servers on all grid exec hosts.

> The additional group id is for sure 22024, and/or 27903 is a second one
> from another process?

Yes, the 27903 and 90 are non-SGE group IDs.  Sounds like the right thing to
do is just to have the script exit once all non-nfsd processes with the
group ID have exited.

> -- Reuti
>
>
>> This may be more appropriate for an NFS mailing list at this point,
>> but any clues as to how and why this group ID gets added to nfsd?
>>
>> Thanks.
>>
>> On Fri, Jan 13, 2012 at 2:57 PM, Reuti <[email protected]> wrote:
>>> On 13.01.2012 at 19:40, Michael Coffman wrote:
>>>
>>>>>> <snip>
>>>>>> It currently determines the pid of the shepherd process and then
>>>>>> watches all the child processes.
>>>>>
>>>>> I think it's easier to use the additional group ID, which is attached
>>>>> to all kids by SGE, whether they jump out of the process tree or not.
>>>>> This one is recorded in $SGE_JOB_SPOOL_DIR in the file "addgrpid".
>>>>
>>>> Had not thought of this.  Sounds like a good idea.  At first glance I
>>>> am not seeing how to list the jobs via ps that are identified by the
>>>> gid in the addgrpid file.  I tried ps -G `cat addgrpid` -o vsz,rss,args
>>>> but it returns nothing.  I'll have to dig into this a bit more.
>>>
>>> Yes, it's most likely only in /proc:
>>>
>>> $ qrsh
>>> Running inside SGE
>>> Job 3696
>>> $ id
>>> uid=1000(reuti) gid=100(users)
>>> groups=10(wheel),16(dialout),33(video),100(users),20007
>>> $ grep -l -r "^Groups.* 20007" /proc/*/status 2>/dev/null | sed -n "s|/proc/\([0-9]*\)/status|\1|p"
>>> 13306
>>> 13628
>>> 13629
>>>
>>>>>> Initially it will be watching memory usage, and if a job begins
>>>>>> using more physical memory than requested, the user will be
>>>>>> notified.  That's where my question comes from.
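The fix proposed above - exit once only nfsd still carries the additional group ID - could be sketched roughly as below. Again this is an illustration, not the real script; the empty-cmdline check is an assumption that happens to cover kernel threads such as [nfsd], whose /proc/pid/cmdline is empty.

```shell
#!/bin/bash
# Sketch: list user-space PIDs whose supplementary groups include the
# given GID, skipping kernel threads (empty cmdline, e.g. [nfsd]).
# PROC_ROOT defaults to /proc and exists only so the logic is testable.
procs_with_gid() {
    local gid=$1 root=${PROC_ROOT:-/proc} pid status
    for status in "$root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        pid=${status%/status}; pid=${pid##*/}
        # Match the GID as a whole word on the "Groups:" line.
        grep -q "^Groups:.*\b$gid\b" "$status" 2>/dev/null || continue
        # Kernel threads have a zero-length cmdline - ignore them.
        [ -s "$root/$pid/cmdline" ] && echo "$pid"
    done
}

# The watcher could then loop until no user-space process holds the GID:
# gid=$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")
# while [ -n "$(procs_with_gid "$gid")" ]; do sleep 5; done
```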
>>>>>
>>>>> What about setting a soft limit for h_vmem and preparing the job
>>>>> script to handle the signal and send an email?  How will they request
>>>>> memory - by virtual_free?
>>>>
>>>> Memory is requested via a consumable complex that we define as the
>>>> amount of physical memory.  The way most of the jobs are run
>>>> currently, we could not do this.  Job scripts typically call a
>>>> commercial vendor's binary, so there is nothing listening for the
>>>> signals.
>>>
>>> Ok.  Depending on the application and whether it resets the traps, you
>>> can try to use a subshell to ignore the signal for the application, as
>>> the signal is sent to the complete process group:
>>>
>>> #!/bin/bash
>>> trap 'echo USR1' usr1
>>> (trap '' usr1; exec your_binary) &
>>> PID=$!
>>> wait $PID
>>> RET=$?
>>> while [ $RET -eq 138 ]; do wait $PID; RET=$?; done
>>>
>>> '' = two single quotation marks.
>>> After the first signal, `wait` must be called again.
>>>
>>>>>> Is there any way in the prolog to get access to the hard_request
>>>>>> options besides using qstat?
>>>>>>
>>>>>> What I'm currently doing:
>>>>>>
>>>>>> cmd = "bash -c '. #{@sge_root}/default/common/settings.sh && qstat -xml -j #{@number}'"
>>>>>>
>>>>>> I have thought of possibly setting an environment variable via a JSV
>>>>>> script that can be queried by the prolog script.  Is this a good
>>>>>> idea?  How much impact on submission time does jsv_send_env() add?
>>>>>
>>>>> You can use either a JSV or a `qsub` wrapper for it.
>>>>>
>>>>>> Anyone else doing anything like this have any suggestions?
>>>>>>
>>>>>> The end goal is to have a utility that users can also interact with
>>>>>> to monitor their jobs, by either setting environment variables or
>>>>>> grid complexes
>>>>>
>>>>> Complexes are only handled internally by SGE.  There is no user
>>>>> command for a non-admin to change them.
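On getting at the hard requests from the prolog: besides the XML route above, the plain `qstat -j` output can be scraped with standard tools. The sketch below is an assumption-laden illustration - the "hard resource_list:" label matches classic Grid Engine output but should be verified against your qstat version, and `your_mem_complex` is a hypothetical name for the site's consumable.

```shell
#!/bin/bash
# Sketch: extract one hard-requested resource from plain `qstat -j`
# output read on stdin.  The "hard resource_list:" label and the
# comma-separated value format are assumptions about classic SGE output.
hard_request() {
    local resource=$1
    sed -n 's/^hard resource_list:[[:space:]]*//p' |
        tr ',' '\n' |
        sed -n "s/^${resource}=//p"
}

# Hypothetical use in a prolog ($JOB_ID is set by SGE for prolog scripts):
# mem=$(qstat -j "$JOB_ID" | hard_request your_mem_complex)
```

Compared with setting an environment variable from a JSV, this keeps the submission path untouched, at the cost of one qstat call per job start.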
>>>>
>>>> My thoughts on the complex were that there would be a complex flag
>>>> indicating that the user wanted to monitor memory, or CPU, etc.  Not
>>>> that it would be changeable by the user - just an indicator for the
>>>> JSV script.
>>>
>>> Ok.
>>>
>>> -- Reuti
>>>
>>>>>> to affect the behavior of what is being watched and how they are
>>>>>> notified.
>>>>>
>>>>> AFAIK you can't change the content of an already inherited variable,
>>>>> as the process got a copy of the value.  Also, /proc/12345/environ is
>>>>> only readable.  And your "observation daemon" will run on all nodes -
>>>>> one for each job, started from the prolog, if I get you right?
>>>>
>>>> Correct.
>>>>
>>>>> But a nice solution could be the usage of the job context.  This can
>>>>> be set by the user on the command line, and your job can access it by
>>>>> issuing a command similar to the one you already use.  If the exec
>>>>> hosts are submit hosts, the job can also change it by using `qalter`,
>>>>> like the user has to on the command line.  We use the job context
>>>>> only for documentation purposes, to record the issued command and
>>>>> append it to the email which is sent after the job.
>>>>>
>>>>> http://gridengine.org/pipermail/users/2011-September/001629.html
>>>>>
>>>>> $ qstat -j 12345
>>>>> ...
>>>>> context: COMMAND=subturbo -v 631 -g -m 3500 -p 8 -t infinity -s aoforce,OUTPUT=/home/foobar/carbene/gecl4_2carb228/trans_tzvp_3.out
>>>>>
>>>>> It's only one long line, and I split it later into individual
>>>>> entries.  In your case you have to watch out for commas, as they are
>>>>> already used to separate entries.
>>>>
>>>> The context sounds very interesting.  Not something we have really
>>>> played around with.
>>>>
>>>> Again, thanks for the input.
>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Thanks.
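The job-context approach above could be wired up roughly as follows. The key name MONITOR is hypothetical, and the "context:" label and comma-separated format are taken from the qstat output shown in the quoted mail; as noted there, this simple split breaks if a value itself contains commas.

```shell
#!/bin/bash
# Sketch: set a context entry at submission time, then read it back by
# splitting the single "context:" line of `qstat -j` output (on stdin).
# MONITOR=mem is a made-up key; commas inside values are NOT handled.
context_value() {
    local key=$1
    sed -n 's/^context:[[:space:]]*//p' |
        tr ',' '\n' |
        sed -n "s/^${key}=//p"
}

# Hypothetical usage:
#   qsub -ac MONITOR=mem job.sh          # user sets the preference
#   qstat -j "$JOB_ID" | context_value MONITOR   # daemon reads it back
#   qalter -ac MONITOR=off "$JOB_ID"     # changeable later, if exec
#                                        # hosts are also submit hosts
```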
>>>>>>
>>>>>> --
>>>>>> -MichaelC
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users

--
-MichaelC
