[Openstack-operators] After power outage, nearly all vm volumes corrupted and unmountable

Gary Molenkamp Fri, 06 Jul 2018 04:59:16 -0700

Good morning all,

After losing all power to our DC last night due to a storm, nearly all of the volumes in our Pike cluster are unmountable. Of the 30 VMs in use at the time, only one has been able to successfully mount and boot from its rootfs. We are using Ceph as the backend storage to cinder and glance. Any help or pointers to bring this back online would be appreciated.


What most of the volumes are seeing is:

[    2.622252] SGI XFS with ACLs, security attributes, no debug enabled
[    2.629285] XFS (sda1): Mounting V5 Filesystem
[    2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[    2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
[    2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
[    2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
[    2.850146] blk_update_request: I/O error, dev sda, sector 8399897

or

[    2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
[    2.594319] EXT4-fs (vda1): write access will be enabled during recovery
[    2.957742] print_req_error: I/O error, dev vda, sector 227328
[    2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
[    2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
[    2.973076] print_req_error: I/O error, dev vda, sector 229384
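
As a cross-check on the first excerpt, the failed command can be decoded by hand. The sketch below assumes the standard SCSI Write(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in blocks in bytes 7-8); the function name is only illustrative. It suggests that the aborted write targets the same sector that blk_update_request reports.

```python
# Decode the Write(10) CDB quoted in the kernel log.
# Assumed layout (SCSI block commands): byte 0 = opcode 0x2A,
# bytes 2-5 = logical block address, bytes 7-8 = transfer length in blocks.
cdb = bytes.fromhex("2a 00 00 80 2c 19 00 04 00 00")


def decode_write10(cdb: bytes):
    """Return (lba, block_count) for a 10-byte Write(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a Write(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


lba, blocks = decode_write10(cdb)
print(lba, blocks)  # prints: 8399897 1024
```

With 512-byte logical blocks, 1024 blocks is a 512 KiB write starting at sector 8399897.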

As a test for one of the less critical VMs, I deleted the VM and mounted the volume on the one VM I managed to start. The results were not promising:



# dmesg |tail
[    5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[    7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[29374.967315] scsi 2:0:0:1: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
[29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
[29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
[29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
[29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[29375.005968]  sdb: sdb1
[29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk

# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sdb: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  42.9GB  42.9GB  primary  xfs          boot

# mount -t xfs /dev/sdb temp
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

# xfs_repair /dev/sdb
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...

Which eventually fails. The Ceph cluster looks healthy, and I can export the volumes from rbd. I can find no other errors in Ceph or OpenStack indicating a fault in either system.
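
One caveat on the two commands above: both were pointed at the whole disk (/dev/sdb), while parted shows the XFS filesystem on partition 1. An XFS superblock starts with the magic bytes "XFSB" at the beginning of the filesystem, so on a partitioned disk it would be expected at the partition offset, not at byte 0. A minimal sketch of that check follows. It assumes the partition starts at sector 2048 (1 MiB, consistent with the "1049kB" parted prints), and the function and constant names are hypothetical.

```python
XFS_MAGIC = b"XFSB"        # first four bytes of an XFS superblock
SECTOR_SIZE = 512          # logical sector size reported by parted
PART_START_SECTOR = 2048   # ASSUMPTION: 1 MiB alignment ("1049kB" in parted)


def has_xfs_magic(path, offset=0):
    """Return True if the four bytes at `offset` in `path` are the XFS magic."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(4) == XFS_MAGIC


def partition_offset(start_sector=PART_START_SECTOR, sector_size=SECTOR_SIZE):
    """Byte offset of the partition within the whole-disk device or image."""
    return start_sector * sector_size
```

For example, has_xfs_magic("/dev/sdb") would be expected to return False on a disk with an msdos partition table, while has_xfs_magic("/dev/sdb", partition_offset()) returns True only if the superblock magic is intact at the assumed offset. The same check can be run read-only against an image exported from rbd.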


    - Is this recoverable?

    - What happened to all of these volumes, and can this be prevented from occurring again? Note that any VM that was shut down at the time of the outage appears to be fine.



Relevant versions:

    Base OS:  all CentOS 7.5
    Ceph:  Luminous 12.2.5-0
    OpenStack:  Latest Pike releases in centos-release-openstack-pike-1-1
        nova 16.1.4-1
        cinder  11.1.1-1



--
Gary Molenkamp                  Computer Science/Science Technology Services
Systems Administrator           University of Western Ontario
molen...@uwo.ca                 http://www.csd.uwo.ca
(519) 661-2111 x86882           (519) 661-3566


_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
