[smf-discuss] Re: Re: SMF Alerter

Shawn Ferry Mon, 09 Oct 2006 14:11:04 -0400

On Oct 9, 2006, at 10:25 AM, Derek Crudgington wrote:
...
>> What about taking the output of a svcs -a, and
>> stashing the STATE and
>> STIME columns?
>>
...
>>
>> Although a better data source would be to use svcprop
>> so you don't
>> have to worry about the STIME representation
>> changing over time.
...
> Shawn,
>
> Thanks for the ideas.  Monitoring all services would be good.. and  
> the exclude.  One thing I've noticed is not all services have a log  
> file, so I guess it would just have to use the /var/svc/log to  
> check if they are available.  I don't understand why smfalert now  
> takes up that many resources.. is it because of the tail process?   
> So with your idea using the STIME, are you saying have it run svcs -a
> every so often and check the STIME? This would seem more resource  
> hungry to me but not sure..


Derek,

There would be a difference in resource consumption.
Yes, it is all of the processes; each one has overhead.

The way I am looking at it, there are tradeoffs...
 From a configuration standpoint, I would rather just monitor them  
all (easier to configure)
        Less time, better coverage, probably more noise

 From a footprint standpoint, polling would use significantly less
memory in exchange for more CPU
        
I could use less memory by using the existing smfalert and only  
monitoring selected services
        More configuration time, ongoing updates

With large zone installations, if I had 100 zones I would have some  
problems with the current implementation
(really, even at 16 zones I would be a bit tight; the overhead of the  
monitoring would make it an untenable solution)

One other thing...you might want to make the from address a variable  
as well. I don't think anyone has the address
smfalert at sun.com, but they might get annoyed if they start getting a  
bunch of bounces and auto responders.
Possibly smfalert at hostname
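
As a rough sketch of the from-address idea, something like the following
would default to smfalert at the local hostname and still allow an override.
The SMFALERT_FROM variable and the from_address name are made up for
illustration; they are not existing smfalert settings.

```perl
use strict;
use warnings;
use Sys::Hostname;

# Return the override when one is given, otherwise smfalert@<hostname>.
# Both the function name and the override are hypothetical.
sub from_address
{
        my ($override) = @_;
        return $override if ( defined $override && length $override );
        return 'smfalert@' . hostname();
}

# SMFALERT_FROM is an assumed environment variable, not a real option.
my $from = from_address( $ENV{SMFALERT_FROM} );
print "From: $from\n";
```

Keeping the default tied to the hostname also tells the recipient which
machine sent the alert.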


Running smfalert in both the global and a local zone for two days  
(243 monitored services),
you can see that the running scripts are not using a horrendous  
amount of CPU time,
but they are using a relatively large amount of memory.
(from a 16GB system)

PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
    100      484 1314M  815M   4.4%   0:15:35 0.0% smfalert
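
To put those numbers in per-service terms, here is a quick calculation
using only the figures from the prstat line and the 243 monitored
services. Treat it as a rough estimate: RSS counts pages shared between
processes more than once, so the real cost per service is likely lower.

```perl
use strict;
use warnings;

# Figures taken from the prstat line above and the service count.
my ( $nproc, $rss_mb, $services ) = ( 484, 815, 243 );

# Average of a total over a count.
sub per_unit
{
        my ( $total, $count ) = @_;
        return $total / $count;
}

printf "processes per service: %.1f\n",    per_unit( $nproc,  $services );
printf "RSS per process:       %.1f MB\n", per_unit( $rss_mb, $nproc );
printf "RSS per service:       %.1f MB\n", per_unit( $rss_mb, $services );
```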

Other interesting output can be seen from
ptree `pgrep smfalert`
prstat -J -j smfalert (if you happen to have made a project)

What you have is a tradeoff in utilization; you also have a tradeoff in
terms of what you can get information about.

Do you want to monitor only services a, b, and d but not c?
Or do you want to monitor all services except c?
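
The two choices could be expressed with one small filter, sketched here.
The include and exclude keys and the svc:/example/... FMRIs are
placeholders for illustration, not anything smfalert defines.

```perl
use strict;
use warnings;

# Decide whether an FMRI should be monitored.
# An 'include' list means monitor only those services; otherwise
# monitor everything that does not match the 'exclude' pattern.
sub monitored
{
        my ( $fmri, $cfg ) = @_;
        if ( $cfg->{include} ) {
                return scalar( grep { $_ eq $fmri } @{ $cfg->{include} } );
        }
        return 0 if ( $cfg->{exclude} && $fmri =~ $cfg->{exclude} );
        return 1;
}

# Example FMRIs are invented for the demonstration.
my @services = map { "svc:/example/$_:default" } qw(a b c d);
my $only     = { include => [ map { "svc:/example/$_:default" } qw(a b d) ] };
my $all_but  = { exclude => qr{^svc:/example/c:} };

print join( ' ', grep { monitored( $_, $only ) } @services ), "\n";
print join( ' ', grep { monitored( $_, $all_but ) } @services ), "\n";
```

Both calls select a, b, and d here; the difference shows up when a new
service appears, which only the exclude form picks up automatically.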

If you were to use svcs -a you would have to worry about timestamps
changing from times to dates:
online         Oct_07   svc:/network/nfs/mapid:default
online         12:51:45 svc:/system/system-log:default

If you were to use svcprop (which I would recommend over svcs for
programmatic purposes) you could use logic similar to the following
(this is off the cuff and not complete)



use strict;
use warnings;

# placeholder pattern for the services you do not want to monitor
my $exclude = qr/^svc:\/example\/c:/;

# run svcprop for the given properties on every instance and return
# { fmri => { '/restarter/state' => value, ... } }
sub read_props
{
        my @args = map { ( '-p', $_ ) } @_;
        my %props;
        open( my $svcprop, '-|', 'svcprop', @args, '*' )
                or die "cannot run svcprop: $!";
        while ( my $line = <$svcprop> )
        {
                chomp $line;
                # expected line: fmri/:properties/restarter/state astring online
                my ($fmri,$prop,$value) = ( split( /\/:properties| /, $line ) )[0,1,-1];
                next unless ( defined $value );
                next if ( $fmri =~ /$exclude/ );
                $props{$fmri}->{$prop} = $value;
        }
        close($svcprop);
        return \%props;
}

# setup
my $state = read_props( 'restarter/state', 'restarter/state_timestamp',
                        'restarter/logfile' );

# update
my $now = read_props( 'restarter/state', 'restarter/state_timestamp' );

# check
my @bad;
foreach my $fmri ( keys %{ $state } )
{
        if ( defined($now->{$fmri})
                &&      $now->{$fmri}->{'/restarter/state_timestamp'}
                        >
                        $state->{$fmri}->{'/restarter/state_timestamp'}
        ) {
                push @bad, $fmri;
        }
}


# alert

foreach my $fmri ( @bad )
{
        # check for defined logfile, if so gather data, if not maybe
        #       "No logfile for service"
        #       or check the restarter log for interesting messages
        #       (maybe do this anyway)
        # assemble error message, translate time into human readable form
        # send message(s)
        # possibly update state, for restarted services.
        # alert_code here

        # you could/should break your alert code out of the detection logic,
        # it should make it easier to tweak the messages or delivery methods
}

# you could run more frequently than this...but how much will you gain?
sleep 60;
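
For the "translate time into human readable form" step, a minimal sketch
follows. It assumes the restarter/state_timestamp value arrives as
seconds since the epoch followed by a fractional part; the readable_time
name and the sample value are only for illustration.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Convert a "seconds.fraction" timestamp (the form assumed here for
# restarter/state_timestamp) into a readable UTC string.
sub readable_time
{
        my ($stamp) = @_;
        my ($seconds) = $stamp =~ /^(\d+)/ or return 'unknown';
        return strftime( '%Y-%m-%d %H:%M:%S UTC', gmtime($seconds) );
}

# Sample value chosen for the demonstration, not real svcprop output.
print readable_time('1160417464.000000000'), "\n";
```

Using gmtime keeps the output independent of the zone's time zone
setting; swap in localtime if local time reads better in the alert.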



--
Shawn Ferry                    shawn.ferry at sun.com
Senior Principal Systems Engineer
Sun Managed Operations Delivery
703.579.1948
