In an effort to clean up stale bugs, I'm marking this as Invalid for
Nova because the issue is in Cinder.

** Changed in: nova
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019190

Title:
  [RBD] Retyping of in-use boot volumes renders instances unusable
  (possible data corruption)

Status in Cinder:
  New
Status in Cinder wallaby series:
  New
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  While trying out the volume retype feature in cinder, we noticed that after
  a retyped instance is rebooted it either does not come back online and is
  stuck in an error state, or, if it does come back online, its filesystem is
  corrupted.

  ## Observations

  Say there are two volume types, `fast` (stored in the ceph pool `volumes`)
  and `slow` (stored in the ceph pool `volumes.hdd`). Before the retype, the
  volume in this example is present in the `volumes.hdd` pool and has a
  watcher accessing it.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes.hdd
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers:
          watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
  ```
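
  The mapping of the two types to their ceph pools is typically expressed via
  volume type extra specs. A minimal sketch of such a setup (the backend names
  `ceph-ssd` and `ceph-hdd` are assumptions for illustration, not the actual
  names from this deployment):

  ```sh
  # Assumed: cinder.conf defines one RBD backend per pool, with
  # volume_backend_name set to ceph-ssd and ceph-hdd respectively.
  openstack volume type create fast
  openstack volume type set --property volume_backend_name=ceph-ssd fast
  openstack volume type create slow
  openstack volume type set --property volume_backend_name=ceph-hdd slow
  ```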

  Starting the retype with the migration policy `on-demand` for that volume,
  either via the horizon dashboard or the CLI, causes the volume to be
  correctly transferred to the `volumes` pool within the ceph cluster.
  However, the watcher does not get transferred, so nothing is accessing the
  volume after the transfer.
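
  For reference, from the CLI this retype can be triggered roughly as follows
  (using the cinder client's `retype` command; exact syntax may differ between
  client versions). The rbd output below shows the state after the retype
  completed:

  ```sh
  # Retype the in-use boot volume from `slow` to `fast`, allowing migration
  # between the backing ceph pools.
  cinder retype --migration-policy on-demand \
      81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9 fast
  ```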

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers: none
  ```

  Looking at the libvirt XML of the instance in question, one can see that the
  `rbd` volume path does not change after the retype completes. Therefore, if
  the instance is restarted, nova will not be able to find its volume, which
  prevents the instance from starting.

  #### Pre retype

  ```xml
  [...]
  <source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
      <host name='2001:XX:XX:XXX::a088' port='6789'/>
      <host name='2001:XX:XX:XXX::3af1' port='6789'/>
      <host name='2001:XX:XX:XXX::ce6f' port='6789'/>
  </source>
  [...]
  ```

  #### Post retype (no change)

  ```xml
  [...]
  <source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
      <host name='2001:XX:XX:XXX::a088' port='6789'/>
      <host name='2001:XX:XX:XXX::3af1' port='6789'/>
      <host name='2001:XX:XX:XXX::ce6f' port='6789'/>
  </source>
  [...]
  ```
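
  A quick way to check this on the compute node is to dump the instance's
  libvirt definition and look at the rbd source (the libvirt domain name below
  is a placeholder):

  ```sh
  # Run on the compute node hosting the instance; instance-00000042 is a
  # placeholder for the actual libvirt domain name.
  virsh dumpxml instance-00000042 | grep -A 4 "protocol='rbd'"
  ```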

  ### Possible cause

  While looking through the code responsible for the volume retype, we found a
  function `swap_volume` which, to our understanding, should be responsible
  for fixing the association above. As we understand it, cinder should use an
  internal API path to have nova perform this action. This does not seem to
  happen.

  (`_swap_volume`:
  https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)
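
  One way to check whether nova-compute ever received such a swap request is
  to grep its log for the swap routine; a sketch, assuming a containerized
  TripleO deployment with the default log location:

  ```sh
  # No matches here would suggest nova was never asked to swap the volume path.
  grep -i swap_volume /var/log/containers/nova/nova-compute.log
  ```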

  ## Further observations

  If one tries to regenerate the libvirt XML, e.g. by live migrating the
  instance and rebooting it afterwards, the filesystem gets corrupted.
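
  Roughly, that corresponds to a sequence like the following (exact openstack
  client syntax may vary; the instance UUID is a placeholder):

  ```sh
  # Live migrate the instance so its libvirt XML is regenerated on the target
  # host, then hard reboot it; afterwards the guest filesystem was corrupted.
  openstack server migrate --live-migration <instance-uuid>
  openstack server reboot --hard <instance-uuid>
  ```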

  ## Environmental Information and possibly related reports

  We are running the latest version of TripleO Wallaby using the hardened
  (whole disk) overcloud image for the nodes.

  Cinder Volume Version:
  `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

  ### Possibly related

  - https://bugzilla.redhat.com/show_bug.cgi?id=1293440

  
  (might want to paste the above to a markdown file for better readability)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
