The "right" way to do this is not to run your metrics system on the cluster you 
want to monitor. Use the metrics provided via the exporter and ingest them 
into your own system (ours is Mimir/Loki/Grafana plus related alerting), so 
that if you lose nodes etc. you still have access to, at a minimum, your 
metrics/log data and alerting. The built-in services are a great stop-gap, 
but in my opinion they should not be relied on for production operation of 
Ceph clusters (or any software, for that matter). Spin up some VMs if that's 
what you have available to you and manage your LGTM stack (or other choice) 
externally.
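As a sketch, an external Prometheus could scrape the cluster's mgr prometheus module with a fragment like this (the hostnames are examples, and port 9283 is my understanding of the module's default; adjust both for your environment):

```yaml
# prometheus.yml fragment on the external monitoring host
scrape_configs:
  - job_name: 'ceph'
    # keep the labels Ceph attaches to its own metrics
    honor_labels: true
    static_configs:
      # example mgr endpoints; only the active mgr serves
      # metrics, the standbys return an empty/redirect response
      - targets:
          - 'ceph1.example.com:9283'
          - 'ceph2.example.com:9283'
          - 'ceph3.example.com:9283'
```

Listing all mgr hosts as targets means the scrape keeps working after a mgr failover without reconfiguration.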

Cheers,
David

On Fri, Jan 19, 2024, at 23:42, duluxoz wrote:
> Hi All,
>
> In regards to the monitoring services on a Ceph Cluster (ie Prometheus, 
> Grafana, Alertmanager, Loki, Node-Exporter, Promtail, etc) how many 
> instances should/can we run for fault tolerance purposes? I can't seem 
> to recall that advice being in the doco anywhere (but of course, I 
> probably missed it).
>
> I'm concerned about HA on those services - will they continue to run if 
> the Ceph Node they're on fails?
>
> At the moment we're running only 1 instance of each in the cluster, but 
> several Ceph Nodes are capable of running each - ie/eg 3 nodes 
> configured but only count:1.
>
> This is on the latest version of Reef using cephadm (if it makes a 
> huge difference :-) ).
>
> So any advice, etc, would be greatly appreciated, including if we should 
> be running any services not mentioned (not Mgr, Mon, OSD, or iSCSI, 
> obviously :-) )
>
> Cheers
>
> Dulux-Oz
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
