Re: system monitoring (was Re: ZFS error logging)
On Friday, 23 September 2016 7:14:15 PM AEST Craig Sanders via luv-main wrote:
> and for logging and graphing all sorts of info about systems (disk space,
> memory utilisation, cpu load, network traffic etc) and the services they're
> running (e.g. postgres/mysql query load, VMs/containers running), munin
> isn't bad.
>
> some prefer cricket or cacti or still use the ancient mrtg, but I find
> munin's easier to set up and write plugins for (e.g. a simple plugin I
> wrote was a small sh + awk script to query slurm to graph the list of
> running, cancelled, failed, queued, etc jobs for a HPC cluster)

I've been happily using MRTG since times when it wasn't regarded as ancient. ;)

I can't imagine Munin being easier than MRTG for writing plugins: MRTG just
runs a script that outputs two numbers. I might give Munin a go though and see
if it does things better. I really should get graphing going on the LUV server.

Also, is there interest in a Beginners' SIG event on running Munin? That would
probably go well with one on Mon. We could do Nagios on the same day if someone
wants to teach that (I won't).

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/

___
luv-main mailing list
luv-main@luv.asn.au
https://lists.luv.asn.au/cgi-bin/mailman/listinfo/luv-main
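[For readers unfamiliar with the MRTG interface mentioned above: a target
script conventionally prints four lines — two values, an uptime string, and a
target name. The following is a hypothetical illustration using root
filesystem usage, not any script from this thread.]

```shell
#!/bin/sh
# Hypothetical MRTG external script graphing root filesystem usage.
# MRTG expects exactly four output lines: two values, then an uptime
# string, then a name for the target.
df -P / | awk 'NR == 2 { print $3; print $2 }'   # used KB, total KB
uptime | sed 's/.*up *\([^,]*\),.*/\1/'          # human-readable uptime
hostname                                         # target name
```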
system monitoring (was Re: ZFS error logging)
On Fri, Sep 23, 2016 at 04:06:42PM +1000, russ...@coker.com.au wrote:
> The Nagios model is to have a single very complex monitoring system while
> the mon model tends towards multiple simple installations. Nagios has a
> nrpe daemon on each monitored server while with Mon you have Mon on each
> server and a master Mon monitoring them all.

and for logging and graphing all sorts of info about systems (disk space,
memory utilisation, cpu load, network traffic etc) and the services they're
running (e.g. postgres/mysql query load, VMs/containers running), munin
isn't bad.

some prefer cricket or cacti or still use the ancient mrtg, but I find
munin's easier to set up and write plugins for (e.g. a simple plugin I
wrote was a small sh + awk script to query slurm to graph the list of
running, cancelled, failed, queued, etc jobs for a HPC cluster)

craig

--
craig sanders
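[A hedged reconstruction of what such a slurm plugin might look like — this is
not Craig's actual script. It assumes squeue(1) is on PATH, and the field
names and job states shown are illustrative; munin calls a plugin with
`config` for graph metadata and with no argument for values.]

```shell
#!/bin/sh
# Sketch of a munin plugin graphing slurm job counts by state.
case $1 in
config)
    # munin asks for graph metadata with the "config" argument
    cat <<'EOF'
graph_title Slurm jobs by state
graph_vlabel jobs
running.label running
pending.label pending
EOF
    exit 0 ;;
esac

# %T prints each job's full state name (RUNNING, PENDING, ...), one per line
squeue -h -o %T | awk '
    { count[tolower($1)]++ }
    END {
        printf "running.value %d\n", count["running"]
        printf "pending.value %d\n", count["pending"]
    }'
```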
Re: ZFS error logging
On Friday, 23 September 2016 11:24:41 AM AEST Peter Ross via luv-main wrote:
> For the messages: FreeBSD has a sysctl vfs.zfs.debug. This sysctl approach
> was ported to Linux, my Google 'research' (e.g.
> http://askubuntu.com/questions/228386/how-do-you-apply-performance-tuning-settings-for-native-zfs)
> indicates, so you may be able to use it under Linux too.

# modinfo zfs|grep debug
parm:           zfs_dbgmsg_enable:Enable ZFS debug message log (int)
parm:           zfs_dbgmsg_maxsize:Maximum ZFS debug log size (int)
parm:           zfs_flags:Set additional debugging flags (uint)
parm:           metaslab_debug_load:load all metaslabs when pool is first opened (int)
parm:           metaslab_debug_unload:prevent metaslabs from being unloaded (int)

It seems that there are module parameters for this.

# find /sys/module/zfs|grep debug
/sys/module/zfs/parameters/metaslab_debug_load
/sys/module/zfs/parameters/metaslab_debug_unload

But the ones I want can only be set at boot time. The Linux port of ZFS
doesn't have all the features of the BSD ports.

> BTW: There is a Nagios/Icinga check_zfs plugin.

Thanks for the pointer. I've attached a modified version of that which works
with zfsonlinux. I don't think it's a very useful plugin; I tested it on a
zpool that has multiple checksum errors and it reports no problems!

> I did not know about "mon" before... How does it compare to Nagios/Icinga?

Nagios has a web based interface to manage it that allows acknowledging error
conditions. This is a great feature if you have a large team of sysadmins:
someone can acknowledge a problem before starting work so no-one else
duplicates their effort. For a smaller network this sort of thing isn't
necessary and the added complexity of Nagios just gets in the way. Mon is much
simpler and has a single config file.

The Nagios model is to have a single very complex monitoring system while the
mon model tends towards multiple simple installations. Nagios has a nrpe
daemon on each monitored server while with Mon you have Mon on each server and
a master Mon monitoring them all.

I gave a LUV lecture about Mon earlier this year. I could run a hands-on
tutorial at the Beginner's SIG if there's interest.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/

check_zfs.gz
Description: application/gzip
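[A note on the boot-time parameters discussed in this message: the usual
mechanism for setting module parameters at load time is a modprobe options
file. This is a sketch only — which parameters exist, and which are
runtime-writable under /sys, varies between zfsonlinux versions, as the find
output above shows; the zfs_flags value is illustrative.]

```shell
# /etc/modprobe.d/zfs.conf -- applied the next time the zfs module loads
options zfs zfs_dbgmsg_enable=1 zfs_flags=1

# Parameters that are runtime-writable appear under
# /sys/module/zfs/parameters/, e.g.:
#   echo 1 > /sys/module/zfs/parameters/metaslab_debug_load
# On builds that expose it, enabled debug messages can be read from
# /proc/spl/kstat/zfs/dbgmsg
```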
Re: ZFS error logging
Hi Russell,

I would assume that the resilvering is related to the checksum errors. From
the zpool(8) manpage:

    Scrubbing and resilvering are very similar operations. The difference is
    that resilvering only examines data that ZFS knows to be out of date (for
    example, when attaching a new device to a mirror or replacing an existing
    device), whereas scrubbing examines all data to discover silent errors
    due to hardware faults or disk failure.

For the messages: FreeBSD has a sysctl vfs.zfs.debug. This sysctl approach
was ported to Linux, my Google 'research' (e.g.
http://askubuntu.com/questions/228386/how-do-you-apply-performance-tuning-settings-for-native-zfs)
indicates, so you may be able to use it under Linux too.

BTW: There is a Nagios/Icinga check_zfs plugin.

I did not know about "mon" before... How does it compare to Nagios/Icinga?

Regards
Peter

On Thu, Sep 22, 2016 at 10:54 PM, Russell Coker via luv-main
<luv-main@luv.asn.au> wrote:
> Below is part of the output of "zpool status". It seems that sdr is
> defective, it has a steadily increasing number of checksum errors.
>
> Would the "resilvered 763M" part be about the 121 checksum errors? If so
> does that mean each checksum error required resilvering on average 6M of
> data?
>
> The kernel message log has NOTHING about this. I'm used to Ext* and BTRFS
> which give kernel message log entries about filesystem errors. Can ZFS be
> configured to give similar logging?
>
> As an aside I've written a mon module for monitoring for such ZFS errors.
> I'll release it sometime soon. But I'd be happy to give a version that's
> quite usable although not ready for full release to anyone who wants it.
>
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://zfsonlinux.org/msg/ZFS-8000-9P
>   scan: resilvered 763M in 0h0m with 0 errors on Thu Aug 18 14:48:53 2016
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         server        ONLINE       0     0     0
>           raidz1-0    ONLINE       0     0     0
>             sdj       ONLINE       0     0     0
>             sdk       ONLINE       0     0     0
>             sdl       ONLINE       0     0     0
>             sdm       ONLINE       0     0     0
>             sdn       ONLINE       0     0     0
>             sdo       ONLINE       0     0     0
>             sdp       ONLINE       0     0     0
>             sdq       ONLINE       0     0     0
>             sdr       ONLINE       0     0   121
>
> --
> My Main Blog          http://etbe.coker.com.au/
> My Documents Blog     http://doc.coker.com.au/
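[A check for the kind of checksum errors quoted above can be sketched with
sh + awk — this is only an illustration, not Russell's mon module or the
check_zfs plugin. It assumes the usual NAME STATE READ WRITE CKSUM column
layout shown in the output above.]

```shell
#!/bin/sh
# Sketch: flag any vdev with nonzero READ/WRITE/CKSUM counters in
# `zpool status` output, exiting nonzero if anything is found.
zpool status | awk '
    /^errors:/ { intable = 0 }                  # end of the device table
    intable && NF == 5 && ($3 > 0 || $4 > 0 || $5 > 0) {
        printf "WARNING: %s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        bad = 1
    }
    /NAME[ \t]+STATE/ { intable = 1 }           # header line starts the table
    END { exit (bad ? 1 : 0) }'
```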