> > On Oct 9, 2006, at 10:25 AM, Derek Crudgington wrote:
> ...
> >> What about taking the output of a svcs -a, and stashing the STATE and
> >> STIME columns?
> >>
> ...
> >>
> >> Although a better data source would be to use svcprop so you don't
> >> have to worry about the STIME representation changing over time.
> ...
> > Shawn,
> >
> > Thanks for the ideas. Monitoring all services would be good.. and
> > the exclude. One thing I've noticed is not all services have a log
> > file, so I guess it would just have to use the /var/svc/log to
> > check if they are available. I don't understand why smfalert now
> > takes up that many resources.. is it because of the tail process?
> > So with your idea using the STIME, are you saying have it run svcs -a
> > every so often and check the STIME? This would seem more resource
> > hungry to me but not sure..
>
> Derek,
>
> There would be a difference in resource consumption.
> Yes all of the processes, each one has overhead.
>
> The way I am looking at it there are tradeoffs...
> From a configuration standpoint, I would rather just monitor them
> all (easier to configure)
> Less time, better coverage, probably more noise
>
> From a footprint standpoint, significantly less memory for more CPU
>
> I could use less memory by using the existing smfalert and only
> monitoring selected services
> More configuration time, ongoing updates
>
> Large zone installations...If I had 100 zones I would have some
> problems with the current implementation
> (or really 16 and I would be a bit tight, the overhead of the
> monitoring would make it an untenable solution)
>
> One other thing...you might want to make the from address a variable
> as well. I don't think anyone has the address
> smfalert at sun.com, but they might get annoyed if they start getting a
> bunch of bounces and auto responders.
> Possibly smfalert at hostname
>
>
> Running smfalert in both the global and a local zone for two days
> (243 monitored services)
> you can see that the running scripts are not using a horrendous
> amount of cpu time,
> but they are using a relatively large amount of memory.
> (from a 16GB system)
>
> PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
>    100      484 1314M  815M   4.4%   0:15:35 0.0% smfalert
>
> Other interesting output can be seen from
> ptree `pgrep smfalert`
> prstat -J -j smfalert (if you happen to have made a project)
>
> What you have is a tradeoff in utilization; you also have a tradeoff in
> terms of what you can get information about.
>
> Do you want to only monitor service a, b, and d but not c?
> Or do you want to monitor all services except c?
>
> If you were to use svcprop (which I would recommend over svcs for
> programmatic purposes) you could use logic similar to the following
> (this is off the cuff and not complete) e.g. the while won't work as
> shown
>
> If you were to use svcs -a you would have to worry about timestamps
> changing from times to dates
> online         Oct_07   svc:/network/nfs/mapid:default
> online         12:51:45 svc:/system/system-log:default
>
>
>
> # setup
> while ( my $line = ( svcprop -p restarter/state
>                              -p restarter/state_timestamp
>                              -p restarter/logfile '*' ) )
> {
>     my ($fmri,$prop,$value) = (split(/\/:properties| /, $line))[0,1,-1];
>     next if ( $fmri =~ /$exclude/ );
>     $state->{$fmri}->{$prop} = $value;
> }
>
> # update
> $now = {};
> while ( my $line = ( svcprop -p restarter/state
>                              -p restarter/state_timestamp '*' ) )
> {
>     my ($fmri,$prop,$value) = (split(/\/:properties| /, $line))[0,1,-1];
>     next if ( $fmri =~ /$exclude/ );
>     $now->{$fmri}->{$prop} = $value;
> }
>
> # check
> foreach my $fmri ( keys %{ $state } )
> {
>     if ( defined($now->{$fmri})
>          && $now->{$fmri}->{'/restarter/state_timestamp'} >
>             $state->{$fmri}->{'/restarter/state_timestamp'} ) {
>         push @bad, $fmri;
>     }
> }
>
>
> # alert
>
> foreach my $fmri ( @bad )
> {
>     # check for defined logfile, if so gather data, if not maybe "No
>     # logfile for service"
>     # or check the restarter log for interesting messages (maybe do
>     # this anyway)
>     # assemble error message, translate time into human readable form
>     # send message(s)
>     # possibly update state, for restarted services.
>     #alert_code here
>
>     # you could/should break your alert code out of the detection logic,
>     # it should make it easier to tweak the messages or delivery methods
> }
>
> # you could run more frequently than this...but how much will you gain?
> sleep 60
>
>
> > This message posted from opensolaris.org
> > _______________________________________________
> > smf-discuss mailing list
> > smf-discuss at opensolaris.org
>
> --
> Shawn Ferry shawn.ferry at sun.com
> Senior Principal Systems Engineer
> Sun Managed Operations Delivery
> 703.579.1948
>

One thing I am seeing with the restarter/logfile property is that some
services use alt_logfile instead. For instance:

# svcs -l svc:/system/identity:node
fmri         svc:/system/identity:node
name         system identity (nodename)
enabled      true
state        online
next_state   none
state_time   Sun Oct 08 10:21:35 2006
alt_logfile  /etc/svc/volatile/system-identity:node.log
restarter    svc:/system/svc/restarter:default
dependency   require_any/none svc:/network/loopback (online)
dependency   optional_all/none svc:/network/physical (online)

This service also has a log file at /var/svc/log/system-identity\:node.log,
so I am not sure which one to use here. Can you provide any info on this?
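
Until someone confirms which location is authoritative, one option is simply to look at every candidate and use whichever files are actually there. The following is only a rough shell sketch, not something taken from smfalert: the function names default_log_for and existing_logs are made up for illustration, the /var/svc/log naming rule is inferred from the single system-identity:node example above, and the parsing assumes the two-column "key value" layout that svcs -l prints.

```shell
# Hypothetical helper: guess the /var/svc/log path for an FMRI.
# Assumption: "svc:/" is dropped and every remaining "/" becomes "-",
# which matches the system-identity:node.log example in this thread.
default_log_for() {
    name=$(printf '%s\n' "$1" | sed -e 's|^svc:/||' -e 's|/|-|g')
    printf '/var/svc/log/%s.log\n' "$name"
}

# Hypothetical helper: read `svcs -l` output on stdin and print each
# logfile / alt_logfile path that exists as a readable file.
existing_logs() {
    awk '$1 == "logfile" || $1 == "alt_logfile" { print $2 }' |
    while read -r path; do
        [ -r "$path" ] && printf '%s\n' "$path"
    done
    return 0
}

# Intended use on a Solaris host (not run here):
#   svcs -l svc:/system/identity:node | existing_logs
#   default_log_for svc:/system/identity:node
```

The caller would still have to decide what to do when both files exist, which is exactly the open question above.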