Re: [ceph-users] separate monitoring node

2018-06-22 Thread Stefan Kooman
Quoting Reed Dier (reed.d...@focusvq.com):
> 
> > On Jun 22, 2018, at 2:14 AM, Stefan Kooman  wrote:
> > 
> > Just checking here: Are you using the telegraf ceph plugin on the nodes?
> > In that case you _are_ duplicating data. But the good news is that you
> > don't need to. There is a Ceph mgr telegraf plugin now (mimic) which
> > also works on luminous: http://docs.ceph.com/docs/master/mgr/telegraf/
> 
> Hi Stefan,
> 
> I’m just curious what advantage you see in the telegraf plugin,
> then feeding into influxdb, rather than the already existing influxdb
> plugin in ceph-mgr. Just generally curious what the advantage is of
> outputting into telegraf and then into influx, unless you are
> outputting to a different TSDB from Telegraf.

We have Ceph running in a "storage vrf", which uses routable IPv6 but is
not reachable from the "internet vrf". Besides that, we have an
out-of-band management interface running in a network namespace that can
reach the Internet. We use that to send data to monitoring / influx. We
have two telegraf instances running on each host: 1) one in the default
(ceph) network namespace with a listener for ceph (mgr) data, which
pushes its telegraf data to 2) a telegraf-mgmt instance with a socket
listener. By using sockets we can "escape" the namespace barrier for
telegraf data only. TL;DR: namespaces give you isolation, but make you
jump through hoops. That was the main reason why we sponsored the
development of a telegraf mgr plugin.
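A rough sketch of how two chained telegraf instances like that could be configured. File names, socket paths, and the influx endpoint below are illustrative, not our actual config:

```toml
# telegraf instance in the default (ceph) namespace: receives ceph-mgr
# metrics on a local socket and forwards everything to the
# mgmt-namespace instance through another unix socket.
[[inputs.socket_listener]]
  service_address = "unixgram:///tmp/telegraf.sock"
  data_format = "influx"

[[outputs.socket_writer]]
  address = "unix:///var/run/telegraf-mgmt.sock"
  data_format = "influx"

# telegraf-mgmt instance (management namespace, separate config file):
# listens on that socket and ships the metrics to InfluxDB over the
# out-of-band management interface.
# [[inputs.socket_listener]]
#   service_address = "unix:///var/run/telegraf-mgmt.sock"
#   data_format = "influx"
# [[outputs.influxdb]]
#   urls = ["https://influx.example.net:8086"]
```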

> 
> Still have my OSDs reporting their own stats via collectd daemons on
> all of my OSD nodes, as a supplement to the direct ceph-mgr ->
> influxdb statistics. Almost moved everything to telegraf after
> Luminous broke some collectd data collection, but it all got sorted
> out.

Yeah, not all info is available in the manager yet :/. I hope this will
change. There are some PRs out from Wido that should fix this. The
telegraf mgr plugin is a drop-in replacement for the influx module (and
provides some extra metrics). It also gives you the possibility to
configure more advanced stuff (TLS handling) in a separate tool, instead
of the limited functionality in the ceph module.

This whole setup might be superseded pretty quickly, with the new
dashboard v2 from SUSE and the prometheus / proxy support for Grafana
... but options are still a good thing, I think ;).

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] separate monitoring node

2018-06-22 Thread Reed Dier

> On Jun 22, 2018, at 2:14 AM, Stefan Kooman  wrote:
> 
> Just checking here: Are you using the telegraf ceph plugin on the nodes?
> In that case you _are_ duplicating data. But the good news is that you
> don't need to. There is a Ceph mgr telegraf plugin now (mimic) which
> also works on luminous: http://docs.ceph.com/docs/master/mgr/telegraf/

Hi Stefan,

I’m just curious what advantage you see in the telegraf plugin, then 
feeding into influxdb, rather than the already existing influxdb plugin 
in ceph-mgr.
Just generally curious what the advantage is of outputting into telegraf 
and then into influx, unless you are outputting to a different TSDB from 
Telegraf.

Still have my OSDs reporting their own stats via collectd daemons on all 
of my OSD nodes, as a supplement to the direct ceph-mgr -> influxdb 
statistics.
Almost moved everything to telegraf after Luminous broke some collectd 
data collection, but it all got sorted out.

> 
> You configure a listener ([[inputs.socket_listener]]) on the nodes where
> you have ceph mgr running (probably mons) and have the mgr plugin send
> data to the socket. The telegraf daemon will pick it up and send it to
> influx (or whatever target you configured). As there is only one active
> mgr, you don't have the issue of duplicating data, and the solution is
> still HA.
> Gr. Stefan


And +1 for icinga2 alerting.

Thanks,

Reed


Re: [ceph-users] separate monitoring node

2018-06-22 Thread Stefan Kooman
Quoting Denny Fuchs (linuxm...@4lin.net):
> hi,
> 
> > Am 19.06.2018 um 17:17 schrieb Kevin Hrpcek :
> > 
> > # ceph auth get client.icinga
> > exported keyring for client.icinga
> > [client.icinga]
> > key = 
> > caps mgr = "allow r"
> > caps mon = "allow r"
> 
> That's the point: it's OK to check whether all processes are up and
> running, and maybe add some checks for the disks. But imagine you check
> the "health" state: the state is the same on all OSD nodes, because ...
> it's a cluster. So if you run "ceph osd set noout" on one node, you get
> a warning for every OSD node (check_ceph_health). The same goes for
> every check that monitors a cluster-wide setting, like df or a lost OSD
> (70 in out of 72 ...). Most checks also have performance data (which
> can be disabled), which is saved in a database. The same for
> Telegraf(*): every node transmits the same data (because the cluster
> data is the same on all nodes).

Just checking here: Are you using the telegraf ceph plugin on the nodes?
In that case you _are_ duplicating data. But the good news is that you
don't need to. There is a Ceph mgr telegraf plugin now (mimic) which
also works on luminous: http://docs.ceph.com/docs/master/mgr/telegraf/

You configure a listener ([[inputs.socket_listener]]) on the nodes where
you have ceph mgr running (probably mons) and have the mgr plugin send
data to the socket. The telegraf daemon will pick it up and send it to
influx (or whatever target you configured). As there is only one active
mgr, you don't have the issue of duplicating data, and the solution is
still HA.

We have this systemd override snippet for telegraf to make the socket
writable for ceph:

[Service]
ExecStartPre=/bin/sleep 5
ExecStartPost=/bin/chown ceph /tmp/telegraf.sock
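For reference, a minimal sketch of wiring this up, based on the telegraf mgr module docs (exact option names may differ between releases; the socket path matches the override above):

```shell
# On a mgr node: enable the module and point it at the telegraf socket
ceph mgr module enable telegraf
ceph telegraf config-set address unixgram:///tmp/telegraf.sock
ceph telegraf config-set interval 10

# Matching listener in telegraf.conf on the same node:
#   [[inputs.socket_listener]]
#     service_address = "unixgram:///tmp/telegraf.sock"
#     data_format = "influx"
```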

> 
> I've also taken a look at the Ceph mgr dashboard (for a few minutes),
> which I have to enable on all(?) OSD nodes, building a construct to
> get the dashboard from the active mgr.
> 
> I don't believe I'm the first person thinking about a dedicated VM
> that is only used for monitoring tools (Icinga / Zabbix / Nagios /
> Dashboard / ceph -s), getting the overall status (and performance
> data) from it. The only thing I need to keep on the OSD nodes directly
> is the OSD (I/O) disk and network monitoring, but thanks to InfluxDB
> ... I can put them all on one dashboard :-)

On the Icinga (or satellite) node you check only for cluster health. On
the nodes you check only for node-specific health. No overlap in health
checks this way.

Gr. Stefan



Re: [ceph-users] separate monitoring node

2018-06-20 Thread Konstantin Shalygin

Hi,

at the moment, we use Icinga2, check_ceph* and Telegraf with the Ceph
plugin. I'm wondering what I need in order to have a separate host that
knows all about the Ceph cluster health. The reason is that each OSD
node transmits mostly the exact same data into our database (like
InfluxDB or MySQL), wasting space. Also, if something is going on, we
get alerts for each OSD.

So my idea is to have a separate VM (on an external host) and use only
this host for monitoring the global cluster state and measurements.
Is it correct that I only need mon and mgr as services? Or should I do
monitoring in a different way?

cu denny



I use these two together:

1. mgr/prometheus module for your Grafana;

2. https://github.com/Crapworks/ceph-dash + 
https://github.com/Crapworks/check_ceph_dash for monitoring cluster events;



The first needs no cephx; the second works with python-rados and 
connects to the cluster directly.
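For completeness, the Prometheus side of the first option can be as small as this (a sketch: enable the module with `ceph mgr module enable prometheus`, which serves metrics on port 9283 by default; the hostnames below are illustrative):

```yaml
# prometheus.yml: scrape every host that may hold the active mgr; the
# active mgr serves the metrics, so there is no duplication.
scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets: ['mon1:9283', 'mon2:9283', 'mon3:9283']
```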





k



Re: [ceph-users] separate monitoring node

2018-06-19 Thread Denny Fuchs
hi,

> Am 19.06.2018 um 17:17 schrieb Kevin Hrpcek :
> 
> # ceph auth get client.icinga
> exported keyring for client.icinga
> [client.icinga]
> key = 
> caps mgr = "allow r"
> caps mon = "allow r"

That's the point: it's OK to check whether all processes are up and 
running, and maybe add some checks for the disks. But imagine you check 
the "health" state: the state is the same on all OSD nodes, because ... 
it's a cluster. So if you run "ceph osd set noout" on one node, you get 
a warning for every OSD node (check_ceph_health). The same goes for 
every check that monitors a cluster-wide setting, like df or a lost OSD 
(70 in out of 72 ...). Most checks also have performance data (which can 
be disabled), which is saved in a database.
The same for Telegraf(*): every node transmits the same data (because 
the cluster data is the same on all nodes).

I've also taken a look at the Ceph mgr dashboard (for a few minutes), 
which I have to enable on all(?) OSD nodes, building a construct to get 
the dashboard from the active mgr.

I don't believe I'm the first person thinking about a dedicated VM that 
is only used for monitoring tools (Icinga / Zabbix / Nagios / Dashboard 
/ ceph -s), getting the overall status (and performance data) from it. 
The only thing I need to keep on the OSD nodes directly is the OSD (I/O) 
disk and network monitoring, but thanks to InfluxDB ... I can put them 
all on one dashboard :-)

@Kevin nice work :-) Because of PCI DSS, the Icinga2 master can't reach 
the Ceph cluster directly, so we have a satellite / agent construct to 
get the checks executed.

cu denny

ps. One bad thing: Telegraf can't read the /var/run/ceph/*sock files, 
because of the permissions after the OSD services start 
(https://github.com/influxdata/telegraf/issues/1657). This was fixed, 
but I haven't checked whether the patch is also included in the Proxmox 
Ceph packages.


Re: [ceph-users] separate monitoring node

2018-06-19 Thread Kevin Hrpcek
I use icinga2 as well, with a check_ceph.py that I wrote a couple of 
years ago. The method I use is that icinga2 runs the check from the 
icinga2 host itself. ceph-common is installed on the icinga2 host, since 
the check_ceph script is a wrapper and parser for the ceph command 
output using python's subprocess. The script takes conf, id, and keyring 
arguments, so it acts like a ceph client, and only the conf and keyring 
need to be present. I added a cephx user for the icinga checks. I also 
use icinga2 + nrpe + check_procs to check that the correct number of 
osd/mon/mgr/mds daemons are running on a host.
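A minimal sketch of a check in that spirit (not Kevin's actual script): shell out to the ceph CLI as a plain client and map the reported health to Nagios exit codes. The conf/id/keyring values are whatever your icinga host has; everything here is illustrative.

```python
import json
import subprocess

# Standard Nagios/Icinga plugin exit codes
STATES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}

def health_to_state(status_json):
    """Map `ceph status --format json` output to (exit_code, message)."""
    doc = json.loads(status_json)
    health = doc.get("health", {})
    # Luminous+ reports health/status; older releases used overall_status.
    overall = health.get("status") or health.get("overall_status") or "UNKNOWN"
    return STATES.get(overall, 3), "ceph reports %s" % overall

def run_check(conf, client_id, keyring):
    """Invoke the ceph CLI like any other client (needs ceph-common)."""
    out = subprocess.check_output(
        ["ceph", "--conf", conf, "--id", client_id, "--keyring", keyring,
         "status", "--format", "json"])
    return health_to_state(out)
```

Icinga would then call something like `run_check("/etc/ceph/ceph.conf", "icinga", "/etc/ceph/ceph.client.icinga.keyring")` and exit with the returned code, using a read-only cephx user like the one shown below.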


# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
    key = 
    caps mgr = "allow r"
    caps mon = "allow r"


I just realized my script on github is the first or second result when 
googling for icinga2 ceph checks, so there is a chance you are trying to 
use the same thing as me.


Kevin


On 06/19/2018 07:17 AM, Denny Fuchs wrote:

Hi,

at the moment, we use Icinga2, check_ceph* and Telegraf with the Ceph 
plugin. I'm wondering what I need in order to have a separate host that 
knows all about the Ceph cluster health. The reason is that each OSD 
node transmits mostly the exact same data into our database (like 
InfluxDB or MySQL), wasting space. Also, if something is going on, we 
get alerts for each OSD.


So my idea is to have a separate VM (on an external host) and use only 
this host for monitoring the global cluster state and measurements.
Is it correct that I only need mon and mgr as services? Or should I do 
monitoring in a different way?


cu denny


Re: [ceph-users] separate monitoring node

2018-06-19 Thread Stefan Kooman
Quoting John Spray (jsp...@redhat.com):
> 
> The general idea with mgr plugins (Telegraf, etc) is that because
> there's only one active mgr daemon, you don't have to worry about
> duplicate feeds going in.
> 
> I haven't used the icinga2 check_ceph plugin, but it seems like it's
> intended to run on any node that has a client keyring, so you wouldn't
> need to run a mon/mgr locally where you were running the check plugin,
> just the ceph.conf and keyring file.

We use Icinga2 and telegraf too. We have Icinga check the global
(cluster) health, and besides that we have icinga monitoring plugins
running on the nodes checking node-specific (ceph) health. Every node
checks whether the local daemons are running like they should. The
global Ceph health could end up as "HEALTH_OK" even with a daemon down,
because the cluster might recover the data onto another OSD, for
example. That leaves your cluster in a less than optimal state (one OSD
short of what it should be). Local node monitoring will inform you in
this case (i.e. it expects x OSD daemons running locally).
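As an illustration of such a local check (a sketch, assuming the stock nagios-plugins check_procs and a node that is supposed to run 12 OSDs; adjust counts and paths for your setup):

```shell
# nrpe.cfg on an OSD node: go critical when the number of running
# ceph-osd processes is not exactly 12
command[check_ceph_osd_procs]=/usr/lib/nagios/plugins/check_procs -c 12:12 -C ceph-osd
```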

Gr. Stefan



Re: [ceph-users] separate monitoring node

2018-06-19 Thread John Spray
On Tue, Jun 19, 2018 at 1:17 PM Denny Fuchs  wrote:
>
> Hi,
>
> at the moment, we use Icinga2, check_ceph* and Telegraf with the Ceph
> plugin. I'm wondering what I need in order to have a separate host that
> knows all about the Ceph cluster health. The reason is that each OSD
> node transmits mostly the exact same data into our database (like
> InfluxDB or MySQL), wasting space. Also, if something is going on, we
> get alerts for each OSD.
>
> So my idea is to have a separate VM (on an external host) and use only
> this host for monitoring the global cluster state and measurements.
> Is it correct that I only need mon and mgr as services? Or should I do
> monitoring in a different way?

The general idea with mgr plugins (Telegraf, etc) is that because
there's only one active mgr daemon, you don't have to worry about
duplicate feeds going in.

I haven't used the icinga2 check_ceph plugin, but it seems like it's
intended to run on any node that has a client keyring, so you wouldn't
need to run a mon/mgr locally where you were running the check plugin,
just the ceph.conf and keyring file.

Cheers,
John

>
> cu denny