Re: [PVE-User] Enabling telemetry broke all my ceph managers

2020-06-18 Thread Brian :
Nice save. And thanks for the detailed info.

On Thursday, June 18, 2020, Lindsay Mathieson wrote:
> Clean Nautilus install I set up last week
>
>  * 5 Proxmox nodes
>      o All on latest updates via the no-subscription channel
>  * 18 OSDs
>  * 3 Managers
>  * 3 Monitors
>  * Cluster health good
>  * In a protracted rebalance phase
>  * All managed via Proxmox
>
> I thought I would enable telemetry for Ceph as per this article:
>
> https://docs.ceph.com/docs/master/mgr/telemetry/
>
>
>  * Enabled the module (command line)
>  * ceph telemetry on
>  * Tested getting the status
>  * Set the contact and description
>ceph config set mgr mgr/telemetry/contact 'John Doe '
>ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
>ceph config set mgr mgr/telemetry/channel_ident true
>  * Tried sending it
>ceph telemetry send
>
> I *think* this is when the managers died, but it could have been earlier.
But around then all Ceph IO stopped and I discovered all three managers
had crashed and would not restart. I was shitting myself because this was
remote and the router is a pfSense VM :) Fortunately it kept going without
its disk responding.
>
> systemctl start ceph-mgr@vni.service
> Job for ceph-mgr@vni.service failed because the control process exited
with error code.
> See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for
details.
>
> From journalctl -xe
>
>-- The unit ceph-mgr@vni.service has entered the 'failed' state with
>result 'exit-code'.
>Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
>daemon.
>-- Subject: A start job for unit ceph-mgr@vni.service has failed
>-- Defined-By: systemd
>-- Support: https://www.debian.org/support
>--
>-- A start job for unit ceph-mgr@vni.service has finished with a
>failure.
>--
>-- The job identifier is 91690 and the job result is failed.
>
>
> From systemctl status ceph-mgr@vni.service
>
> ceph-mgr@vni.service - Ceph cluster manager daemon
>Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor
preset: enabled)
>   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
>└─ceph-after-pve-cluster.conf
>Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST;
8min ago
>   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER}
--id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
>  Main PID: 415566 (code=exited, status=1/FAILURE)
>
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service
RestartSec=10s expired, scheduling restart.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart
job, restart counter is at 4.
> Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request
repeated too quickly.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result
'exit-code'.
> Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager
daemon.
>
> I created a new manager service on an unused node and fortunately that
worked. I deleted/recreated the old managers and they started working. It
was a sweaty few minutes :)
>
>
> Everything resumed without a hiccup after that, which was impressive. Not
game to try and reproduce it though.
>
>
>
> --
> Lindsay
>
___
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


[PVE-User] Enabling telemetry broke all my ceph managers

2020-06-18 Thread Lindsay Mathieson

Clean Nautilus install I set up last week

 * 5 Proxmox nodes
     o All on latest updates via the no-subscription channel
 * 18 OSDs
 * 3 Managers
 * 3 Monitors
 * Cluster health good
 * In a protracted rebalance phase
 * All managed via Proxmox

I thought I would enable telemetry for Ceph as per this article:

https://docs.ceph.com/docs/master/mgr/telemetry/


 * Enabled the module (command line; the full sequence is sketched below)
 * ceph telemetry on
 * Tested getting the status
 * Set the contact and description
   ceph config set mgr mgr/telemetry/contact 'John Doe '
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true
 * Tried sending it
   ceph telemetry send
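
For anyone retracing this, the whole thing boils down to roughly the sequence
below. The module-enable and status/show commands are how I read the docs page
above rather than a transcript of my shell, so treat it as a sketch:

   # enable and switch on the telemetry module
   ceph mgr module enable telemetry
   ceph telemetry on

   # check the module status and preview what would be reported
   ceph telemetry status
   ceph telemetry show

   # identification details (only sent once channel_ident is enabled)
   ceph config set mgr mgr/telemetry/contact 'John Doe '
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true

   # manually trigger a report
   ceph telemetry send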

I *think* this is when the managers died, but it could have been 
earlier. But around then all Ceph IO stopped and I discovered all 
three managers had crashed and would not restart. I was shitting myself 
because this was remote and the router is a pfSense VM :) Fortunately it 
kept going without its disk responding.


systemctl start ceph-mgr@vni.service
Job for ceph-mgr@vni.service failed because the control process exited with error code.
See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for details.


From journalctl -xe

   -- The unit ceph-mgr@vni.service has entered the 'failed' state with result 'exit-code'.
   Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager daemon.
   -- Subject: A start job for unit ceph-mgr@vni.service has failed
   -- Defined-By: systemd
   -- Support: https://www.debian.org/support
   --
   -- A start job for unit ceph-mgr@vni.service has finished with a failure.
   --
   -- The job identifier is 91690 and the job result is failed.


From systemctl status ceph-mgr@vni.service

ceph-mgr@vni.service - Ceph cluster manager daemon
   Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mgr@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
  Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 415566 (code=exited, status=1/FAILURE)

Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
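
In hindsight, journalctl -xe mostly just shows systemd giving up on the unit.
To see why the mgr itself was dying, the next step would presumably have been
something along these lines (same arguments as the ExecStart line above,
assuming the default cluster name 'ceph'):

   # full log for the failing unit rather than just the tail
   journalctl -u ceph-mgr@vni.service -n 200

   # run the mgr in the foreground so its traceback lands on the terminal
   /usr/bin/ceph-mgr -f --cluster ceph --id vni --setuser ceph --setgroup ceph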

I created a new manager service on an unused node and fortunately that 
worked. I deleted/recreated the old managers and they started working. 
It was a sweaty few minutes :)
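
Since everything is managed through Proxmox, the create/destroy steps map onto
the pveceph tooling; roughly something like this, with <nodename> standing in
for the actual node:

   # on the spare node: bring up a fresh manager
   pveceph mgr create

   # on each node whose manager is wedged: remove it and recreate it
   pveceph mgr destroy <nodename>
   pveceph mgr create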



Everything resumed without a hiccup after that, which was impressive. Not game
to try and reproduce it though.




--
Lindsay

___
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user