Public bug reported:

This bug originates from my post to the openstack-discuss ML - 
https://lists.openstack.org/archives/list/[email protected]/thread/E3VYY24HUGBNH7626ALOGZMJRVX5VOSZ/
which was discussed at a cinder-weekly 
(https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-43).

In short: There seem to be inconsistencies in the correct and required Ceph
(cephx) permissions for the RBD clients in Cinder, Glance and also Nova.
While it's nice to have the various deployment tools like openstack-ansible
([4]) or the charms ([5]) do it somewhat "properly",
first and foremost this needs to be properly documented in the source
documentation of Glance, Cinder and Nova.

Achieving this is what this bug report is intended to do.
The proposed steps are:

 * determine and discuss the correct caps (least privileges, caps via profiles
   where possible, ...)
 * update the documentation / install guides and the devstack code. Those
   should all serve as references for the correct way of doing things.
 * add an upgrade note to the release notes for Caracal, asking operators to
   check and align their caps
 * spread the word / open bugs for the deployment tools so that they update
   their config / code accordingly
 * send a PR to have Ceph update their docs


The long story about the various Ceph (RBD) clients and uses within
Glance, Cinder and Nova:


1) Glance

First there was a simple issue reported for Glance [3].

When Glance is asked to delete an image, it checks whether this image has
dependent children, see
https://opendev.org/openstack/glance_store/src/commit/6f5011d1f05c99894fb8b909d33ad23a20bf83a9/glance_store/_drivers/rbd.py#L459.
The children of Glance images usually are (Cinder) volumes, which live in a
different RBD pool, "volumes". If such children exist, the Glance API throws
a 500 error.

Manually using the RBD client shows the same error:

> # rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p images 
> children $IMAGE_ID
>
> 2023-12-13T16:51:48.131+0000 7f198cf4e640 -1 librbd::image::OpenRequest: 
> failed to retrieve name: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f198d74f640 -1 librbd::ImageState: 
> 0x5639fdd5af60 failed to open image: (1) Operation not permitted
> rbd: listing children failed: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f1990c474c0 -1 librbd::api::Image: 
> list_descendants: failed to open descendant b7078ed7ace50d from pool 
> instances:(1) Operation not permitted

So it's a permission error. Neither the Glance documentation [1] nor the Ceph
documentation [2] on configuring the Ceph auth caps mentions granting Glance
anything on the volumes pool.
So this is what I currently have configured:

> client.cinder
>         key: REDACTED
>         caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
>
> client.glance
>         key: REDACTED
>         caps: [mgr] profile rbd pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=images
>
> client.nova
>         key: REDACTED
>         caps: [mgr] profile rbd pool=instances, profile rbd pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=instances, profile rbd pool=images
>
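
To compare what is configured against whatever caps end up being documented, a listing like the one above can be loaded into a small structure. The following is only a sketch: the function names `parse_auth_listing` and `pools_granted` are made up for illustration, and the parser assumes the plain-text layout shown above (entity name on its own line, followed by `caps: [daemon] ...` lines).

```python
import re


def parse_auth_listing(text):
    """Parse an auth listing (layout as quoted above) into {entity: {daemon: caps}}."""
    entities = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"caps:\s*\[(\w+)\]\s*(.+)", line)
        if match and current is not None:
            entities[current][match.group(1)] = match.group(2).strip()
        elif line.startswith("client."):
            current = line
            entities[current] = {}
    return entities


def pools_granted(caps):
    """Return {pool: profile} for every 'profile X pool=Y' grant in a caps string."""
    grants = {}
    for part in caps.split(","):
        match = re.match(r"\s*profile\s+(\S+)\s+pool=(\S+)", part)
        if match:
            grants[match.group(2)] = match.group(1)
    return grants


# Sample input: two of the entities from the listing above.
SAMPLE = """\
client.cinder
        key: REDACTED
        caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
client.glance
        key: REDACTED
        caps: [mgr] profile rbd pool=images
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=images
"""

auth = parse_auth_listing(SAMPLE)
print(pools_granted(auth["client.glance"]["osd"]))
```

With such a mapping it becomes easy to spot, for example, that the glance client has no grant on the volumes pool at all.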

Granting the glance client e.g. "rbd-read-only" on the volumes pool via:
>
> # ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images, profile rbd-read-only pool=volumes' mgr 'profile rbd pool=images, profile rbd-read-only pool=volumes'
>
makes the error go away.
This is the wrong approach though, as was established during the discussion
on the ML:


a) Commit [10] introduced the method "_snapshot_has_external_reference" in the
Yoga release to fix [11]. The commit message also briefly states:
...

    NOTE: To check this dependency glance osd needs 'read' access to
    cinder and nova side RBD pool.

but there is zero mention of this requirement in the Glance release notes for
Yoga, only in those for glance_store [13]. This (temporary, Yoga-only)
requirement to grant Glance read-only access to the volumes pool was also
never revoked. So operators have likely missed this.

b) The mentioned method to check for snapshot references was removed again with
[12]; this change was also backported to the 2023.1 release.
Again, the release notes did not mention the change, so operators were never
told that they could now remove the Glance user's read access to the volumes
pool again.

c) Neither change a) nor change b) came with an update to the actual
documentation on how to configure the Ceph caps of the glance user.

d) The "_snapshot_has_external_reference" method is currently just
dangling and unused [14].

e) I am still wondering what the caps allowing reads of "rbd_children"-prefixed
RADOS objects are or were used for. With the managed profiles such as
"rbd" or "rbd-read-only",
things should be pretty well covered.


And finally: The Glance documentation at [18] is outdated.
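
For operators who added the extra pool grant for Yoga, resetting the caps could be scripted along the following lines. This is a hedged sketch, not a recommendation: the caps in `GLANCE_CAPS` are simply the ones listed above as the currently configured glance client, and whether they are the correct least-privilege set is exactly what this bug asks to be determined. The helper name `auth_caps_command` is made up for illustration. Note that `ceph auth caps` replaces all existing caps of the entity, so every daemon type has to be passed each time.

```python
import shlex


def auth_caps_command(entity, caps):
    """Build the argv for `ceph auth caps <entity> <daemon> <caps> ...`."""
    argv = ["ceph", "auth", "caps", entity]
    # Fixed order so the generated command is stable and easy to review.
    for daemon in ("mon", "osd", "mgr"):
        if daemon in caps:
            argv += [daemon, caps[daemon]]
    return argv


# Assumption: the caps of client.glance as listed in this report,
# without any grant on the volumes pool.
GLANCE_CAPS = {
    "mon": "profile rbd",
    "osd": "profile rbd pool=images",
    "mgr": "profile rbd pool=images",
}

cmd = auth_caps_command("client.glance", GLANCE_CAPS)
# Print a shell-quoted version for review instead of executing it.
print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch safe to try; the resulting list could be handed to `subprocess.run` once the caps have been reviewed.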


2) DevStack

I also wondered why there are no unit tests that fail in CI because of this [3].
Looking at what devstack does at [6], it appears that:

a) it actually applies "allow class-read object_prefix rbd_children",
which is not what is currently documented in the setup guide(s) (see [7]
and [2])

b) it unnecessarily grants read permissions to NOVA_CEPH_POOL ("vms")
and CINDER_CEPH_POOL ("volumes") also for the Glance user

c) it does NOT use the managed capabilities called "profiles", such as "rbd"
or "rbd-read-only", and instead uses raw ACLs such as "rwx", see [9].

This also differs in the Cinder / Glance documentation and makes a great
difference, as "such privileges include the ability to blocklist other
client users.", which is required so that locks held by stale RBD clients can
be removed from images, see
https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/#rbd-exclusive-locks.


This might not matter for CI / DevStack environments in themselves. But since
those are used for validation,
they should ideally use the default / documented settings where possible, so
that those settings are validated as well.
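
A check for point c) could be as simple as splitting a caps string into profile-based grants and raw "allow ..." grants. The sketch below is illustrative only: `classify_grants` is a made-up helper, and the `MIXED` string is a hypothetical example assembled from the fragments quoted in this report, not the exact string devstack applies.

```python
def classify_grants(caps):
    """Split a caps string into (profile grants, raw 'allow ...' grants)."""
    profiles, raw = [], []
    for part in (piece.strip() for piece in caps.split(",")):
        if not part:
            continue
        if part.startswith("profile "):
            profiles.append(part)
        else:
            raw.append(part)
    return profiles, raw


# Hypothetical caps string mixing both styles (not taken from devstack).
MIXED = (
    "allow class-read object_prefix rbd_children, "
    "allow rwx pool=images, "
    "profile rbd-read-only pool=volumes"
)

profiles, raw = classify_grants(MIXED)
for grant in raw:
    print("raw ACL, consider a profile instead:", grant)
```

Run against the caps of each OpenStack client, this would list every grant that bypasses the managed profiles.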


3) Cinder

There seem to be no documented caps for the ceph-rbd volume
driver [19].


4) Cinder-Backup

If cinder-backup is used with the Ceph driver [17], a keyring is required that
allows creating snapshots of volumes (RBD images), which then serve as the
source for backups.
Deleting those snapshots has to be allowed as well, because cinder-backup
removes them once they are no longer needed. While full "profile rbd" access to
the volumes pool works,
cinder-backup likely does not need to be able to modify or even delete
volumes. There may also be user snapshots, which cinder-backup does not need
to be able to delete either.
Then there are the caps to store and retrieve backups via rbd import / rbd
import-diff from another pool (potentially on a different cluster).

There currently seem to be no caps for cinder-backup documented in
e.g. [17].


5) Nova

There are lots of RBD-related options, e.g. for libvirt [8] and
more, covering ...

 * instance storage (if `images_type=rbd`)
 * volumes
 * interaction with Glance images ([glance] -> enable_rbd_download)


But there seems to be no list of the capabilities actually required, nor any
recommendations, for these various interactions with RBD.


6) OpenStack-Ansible

OpenStack-Ansible uses ceph-ansible, but actively overrides the keyrings
and their caps.
Overriding managed code should really just be a temporary fix (it was done for
Stein, if I read this correctly).
Once the proper caps are defined, those openstack_keys in [15] should be
converted into a PR towards ceph-ansible [16] to fix things globally there as
well.

There are likely other deployment tools applying their own home-grown set
of caps and Ceph users/keyrings, as there is no reference to rely on.



[1] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[2] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#setup-ceph-client-authentication
[3] https://bugs.launchpad.net/glance/+bug/2045158
[4] Openstack-Ansible: https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/all/ceph.yml#L53-L60
[5] Charm: https://review.opendev.org/q/topic:%22bug/1696073%22 // https://bugs.launchpad.net/charm-glance/+bug/1696073
[6] https://opendev.org/openstack/devstack-plugin-ceph/src/commit/4c22c3d0905589d676bf4865ca5cf57994eb426d/devstack/lib/ceph#L712
[7] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[8] https://docs.openstack.org/nova/queens/configuration/config.html#libvirt.rbd_user
[9] https://docs.ceph.com/en/latest/rados/operations/user-management/#authorization-capabilities
[10] https://github.com/openstack/glance_store/commit/3d221ec529862d43ab303644e74ee9ad6ce8cd40
[11] https://bugs.launchpad.net/glance-store/+bug/1954883
[12] https://review.opendev.org/q/I34dcd90a09d43127ff2e8b477750c70f3cc01113
[13] https://docs.openstack.org/releasenotes/glance_store/yoga.html#relnotes-3-0-0-stable-yoga
[14] https://opendev.org/openstack/glance_store/src/commit/054bd5ddf5d4d255076bd5f44296f2521e899394/glance_store/_drivers/rbd.py#L455
[15] https://opendev.org/openstack/openstack-ansible/commit/0f92985608c0f6ff941ea0445ae25eab20e94fb4
[16] https://github.com/ceph/ceph-ansible/blob/b6102975549d8f870b0c20a01edda59d6ceac422/group_vars/all.yml.sample#L642
[17] https://docs.openstack.org/cinder/latest/configuration/block-storage/backup/ceph-backup-driver.html
[18] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[19] https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/ceph-rbd-volume-driver.html

** Affects: cinder
     Importance: Undecided
         Status: New

** Affects: glance
     Importance: Undecided
         Status: New

** Affects: glance-store
     Importance: Undecided
         Status: New

** Affects: nova
     Importance: Undecided
         Status: New

** Also affects: glance
   Importance: Undecided
       Status: New

** Also affects: glance-store
   Importance: Undecided
       Status: New

** Also affects: nova
   Importance: Undecided
       Status: New

** Summary changed:

- Documentation of caps for Ceph auth of RBD clients used by Cinder / Glance / 
Nova is missing or inconsistent
+ Documentation of  Ceph auth caps for RBD clients used by Cinder / Glance / 
Nova is missing or inconsistent

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2051244

Title:
  Documentation of  Ceph auth caps for RBD clients used by Cinder /
  Glance / Nova is missing or inconsistent

Status in Cinder:
  New
Status in Glance:
  New
Status in glance_store:
  New
Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2051244/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
