Re: system monitoring (was Re: ZFS error logging)

2016-09-23 Thread Russell Coker via luv-main
On Friday, 23 September 2016 7:14:15 PM AEST Craig Sanders via luv-main wrote:
> and for logging and graphing all sorts of info about systems (disk space,
> memory utilisation, cpu load, network traffic etc) and the services they're
> running (e.g. postgres/mysql query load, VMs/containers running), munin
> isn't bad.
> 
> some prefer cricket or cacti or still use the ancient mrtg, but I find
> munin easier to set up and write plugins for (e.g. a simple plugin I
> wrote was a small sh + awk script to query slurm to graph the list of
> running, cancelled, failed, queued, etc. jobs for an HPC cluster)

I've been happily using MRTG since the days when it wasn't regarded as 
ancient. ;)

I can't imagine Munin being easier than MRTG for writing plugins; MRTG just 
runs a script that outputs 2 numbers.  I might give Munin a go though and see 
if it does things better.
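
For anyone who hasn't written one, an MRTG target script really is trivial.
From memory MRTG reads four lines back: the two numbers, then an uptime
string and a target name which are only used for display.  A minimal
untested sketch that graphs the 1 and 5 minute load averages:

#!/bin/sh
# Untested sketch of an MRTG target script.  Load averages are
# scaled by 100 because MRTG expects integer values.
awk '{ printf "%d\n%d\n", $1 * 100, $2 * 100 }' /proc/loadavg
# Lines 3 and 4 (uptime and target name) are display-only:
uptime
hostname

That gets wired into mrtg.cfg with a backtick target, something like
Target[load]: `/usr/local/bin/mrtg-load` (the script name is just an
example).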

I really should get graphing going on the LUV server.

Also, is there interest in a Beginners' SIG event on running Munin?  That would 
probably go well with one on Mon.  We could do Nagios on the same day if 
someone wants to teach that (I won't).

-- 
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/



system monitoring (was Re: ZFS error logging)

2016-09-23 Thread Craig Sanders via luv-main
On Fri, Sep 23, 2016 at 04:06:42PM +1000, russ...@coker.com.au wrote:
> The Nagios model is to have a single very complex monitoring system while
> the mon model tends towards multiple simple installations.  Nagios has an
> NRPE daemon on each monitored server while with Mon you have Mon on each
> server and a master Mon monitoring them all.

and for logging and graphing all sorts of info about systems (disk space,
memory utilisation, cpu load, network traffic etc) and the services they're
running (e.g. postgres/mysql query load, VMs/containers running), munin
isn't bad.

some prefer cricket or cacti or still use the ancient mrtg, but I find
munin easier to set up and write plugins for (e.g. a simple plugin I
wrote was a small sh + awk script to query slurm to graph the list of
running, cancelled, failed, queued, etc. jobs for an HPC cluster)
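
something like this, to give the idea (a hypothetical reconstruction, not
the actual script; munin runs a plugin with "config" for the graph metadata
and with no argument for the values, and counting cancelled/failed jobs
would really need sacct since they drop out of squeue):

#!/bin/sh
# hypothetical reconstruction, not the original plugin
case "$1" in
config)
    cat <<EOF
graph_title slurm jobs by state
graph_vlabel jobs
graph_category hpc
running.label running
pending.label pending
EOF
    exit 0
    ;;
esac

# squeue -h drops the header, %T prints each job's state
squeue -h -o '%T' | awk '
    /^RUNNING$/ { r++ }
    /^PENDING$/ { p++ }
    END { printf "running.value %d\npending.value %d\n", r, p }'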

craig

--
craig sanders 


Re: ZFS error logging

2016-09-23 Thread Russell Coker via luv-main
On Friday, 23 September 2016 11:24:41 AM AEST Peter Ross via luv-main wrote:
> For the messages: FreeBSD has a sysctl vfs.zfs.debug. My Google 'research'
> (e.g.
> http://askubuntu.com/questions/228386/how-do-you-apply-performance-tuning-settings-for-native-zfs)
> indicates that this sysctl approach was ported to Linux, so you may be able
> to use it there too.

# modinfo zfs|grep debug
parm:   zfs_dbgmsg_enable:Enable ZFS debug message log (int)
parm:   zfs_dbgmsg_maxsize:Maximum ZFS debug log size (int)
parm:   zfs_flags:Set additional debugging flags (uint)
parm:   metaslab_debug_load:load all metaslabs when pool is first opened (int)
parm:   metaslab_debug_unload:prevent metaslabs from being unloaded (int)

It seems that there are module parameters for this.

# find /sys/module/zfs|grep debug
/sys/module/zfs/parameters/metaslab_debug_load
/sys/module/zfs/parameters/metaslab_debug_unload

But the ones I want can only be set at boot time.
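
So it seems the way to turn the debug log on is to set the options at
module load time, something like this (untested; the maxsize value is
arbitrary):

# Untested: assumes zfs is built as a module rather than into the
# kernel.
echo 'options zfs zfs_dbgmsg_enable=1 zfs_dbgmsg_maxsize=4194304' \
    > /etc/modprobe.d/zfs-debug.conf
# After a reboot (or module reload) the messages should turn up in
# /proc/spl/kstat/zfs/dbgmsg on zfsonlinux.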

The Linux port of ZFS doesn't have all the features of the BSD ports.

> BTW: There is a Nagios/Icinga check_zfs plugin.

Thanks for the pointer.  I've attached a modified version of that which works 
with zfsonlinux.  I don't think it's a very useful plugin though: I tested it 
on a zpool that has multiple checksum errors and it reported no problems!
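
For comparison, the core of a check that would catch this case is small:
parse the per-device error counters out of the zpool status output and
flag anything non-zero.  A rough sketch (not the attached plugin; it
assumes the usual column layout):

#!/bin/sh
# Rough sketch, not the attached plugin: exit non-zero if any pool
# or vdev line in "zpool status" shows read, write or checksum
# errors.
zpool status | awk '
    # device lines look like: NAME STATE READ WRITE CKSUM
    $2 ~ /^(ONLINE|DEGRADED|FAULTED)$/ && NF >= 5 {
        if ($3 + 0 > 0 || $4 + 0 > 0 || $5 + 0 > 0) bad = 1
    }
    END { exit bad }'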

> I did not know about "mon" before... How does it compare to Nagios/Icinga?

Nagios has a web-based management interface that allows acknowledging error 
conditions.  This is a great feature if you have a large team of sysadmins: 
someone can acknowledge a problem before starting work on it so that no-one 
else duplicates their effort.

For a smaller network this sort of thing isn't necessary and the added 
complexity of Nagios just gets in the way.  Mon is much simpler and has a 
single config file.
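
To give an idea of the difference, a complete service definition in mon's
config file looks roughly like this (hostnames and the alert address are
placeholders):

hostgroup servers www1 www2 db1

watch servers
    service ping
        interval 5m
        monitor fping.monitor
        period wd {Sun-Sat}
            alert mail.alert sysadmin@example.com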

The Nagios model is to have a single very complex monitoring system while the 
mon model tends towards multiple simple installations.  Nagios has an NRPE 
daemon on each monitored server while with Mon you have Mon on each server and 
a master Mon monitoring them all.

I gave a LUV lecture about Mon earlier this year.  I could run a hands-on 
tutorial at the Beginner's SIG if there's interest.

-- 
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/


[Attachment: check_zfs.gz (application/gzip)]


Re: ZFS error logging

2016-09-22 Thread Peter Ross via luv-main
Hi Russell,

I would assume that the resilvering is related to the checksum errors. From
the zpool(8) manpage:

Scrubbing and resilvering are very similar operations. The difference
is that resilvering only examines data that ZFS knows to be out of
date (for example, when attaching a new device to a mirror or
replacing an existing device), whereas scrubbing examines all data to
discover silent errors due to hardware faults or disk failure.
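
So a scrub is the one you start by hand when you want to look for silent
corruption, e.g. (the pool name is a placeholder):

# zpool scrub tank
# zpool status tank

zpool status then shows the scrub's progress and any errors it found.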


For the messages: FreeBSD has a sysctl vfs.zfs.debug. My Google 'research'
(e.g.
http://askubuntu.com/questions/228386/how-do-you-apply-performance-tuning-settings-for-native-zfs)
indicates that this sysctl approach was ported to Linux, so you may be able
to use it there too.
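
On FreeBSD it is just a sysctl, e.g. (the value is illustrative):

# sysctl vfs.zfs.debug=1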

BTW: There is a Nagios/Icinga check_zfs plugin.

I did not know about "mon" before... How does it compare to Nagios/Icinga?

Regards
Peter


On Thu, Sep 22, 2016 at 10:54 PM, Russell Coker via luv-main
<luv-main@luv.asn.au> wrote:

> Below is part of the output of "zpool status".  It seems that sdr is
> defective; it has a steadily increasing number of checksum errors.
>
> Would the "resilvered 763M" part be about the 121 checksum errors?  If so
> does that mean each checksum error required resilvering on average 6M of
> data?
>
> The kernel message log has NOTHING about this.  I'm used to Ext* and BTRFS
> which give kernel message log entries about filesystem errors.  Can ZFS be
> configured to give similar logging?
>
> As an aside, I've written a mon module for monitoring such ZFS errors.
> I'll release it sometime soon.  But I'd be happy to give anyone who wants
> it a version that's quite usable although not ready for full release.
>
> status: One or more devices has experienced an unrecoverable error.  An
> attempt was made to correct the error.  Applications are
> unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
>see: http://zfsonlinux.org/msg/ZFS-8000-9P
>   scan: resilvered 763M in 0h0m with 0 errors on Thu Aug 18 14:48:53 2016
> config:
>
> NAME   STATE READ WRITE CKSUM
> server ONLINE   0 0 0
>   raidz1-0 ONLINE   0 0 0
> sdjONLINE   0 0 0
> sdkONLINE   0 0 0
> sdlONLINE   0 0 0
> sdmONLINE   0 0 0
> sdnONLINE   0 0 0
> sdoONLINE   0 0 0
> sdpONLINE   0 0 0
> sdqONLINE   0 0 0
> sdrONLINE   0 0   121
>
> --
> My Main Blog          http://etbe.coker.com.au/
> My Documents Blog     http://doc.coker.com.au/
>