Hi Youzhong,

I see that error fairly commonly whenever I upgrade between versions of SmartOS where the format of the dumps (which typically result from kernel panics) changes. The system can’t read them (as it’s now a different format), so it marks them as errored. You just need to remove the dump device and re-create it (a restart is involved). Once you’ve done that (which effectively clears the dump device), you shouldn’t see those errors anymore.

It isn’t a damaged/dying disk.

Thanks,
Dave
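A minimal sketch of what that remove-and-recreate sequence can look like, assuming the layout shown further down in this thread (a zones/dump zvol; the 4g size is only illustrative, taken from the initial creation in the zpool history). The zvol is normally busy while it is the active dump device, which is presumably why a restart is involved, so treat the ordering below as an assumption rather than a recipe:

    # Check which device dumpadm currently points at.
    dumpadm

    # Once zones/dump is no longer held open as the dump device (i.e. after the
    # restart Dave mentions), destroy and recreate the zvol.
    zfs destroy zones/dump
    zfs create -V 4g zones/dump
    zfs set checksum=noparity zones/dump

    # Point the dump subsystem back at the fresh zvol.
    dumpadm -d /dev/zvol/dsk/zones/dump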
On 29 Mar 2014, at 12:52 pm, Youzhong Yang <youzh...@gmail.com> wrote:

OK, this issue happened again. I don't know what it means; can you or someone help me figure out what is wrong underneath? Is it really a disk failure, or is it expected to report this kind of scary message because it is a dump area?

# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h3m with 2 errors on Fri Mar 28 22:32:50 2014
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       4     0     0
          c1t0d0    ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:

        zones/dump:<0x1>

[root@batfs9930 ~]# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/zones/dump (dedicated)
Savecore directory: /var/crash/volatile
  Savecore enabled: yes
   Save compressed: on

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zones              245G   219G   655K  /zones
zones/archive       31K   219G    31K  /zones/archive
zones/config        38K   219G    38K  legacy
zones/cores         31K  10.0G    31K  /zones/global/cores
zones/dump         192G   219G   192G  -
zones/export        35K   219G    35K  /export
zones/opt          962M   219G   962M  legacy
zones/swap        51.6G   271G    16K  -
zones/tmw-nas-3p   696M   219G   696M  /tmw-nas-3p
zones/usbkey       153K   219G   153K  legacy
zones/var         5.79M   219G  5.79M  legacy
zones/zpoolcache    31K   219G    31K  legacy

# zpool history
History for 'zones':
2014-03-28.20:32:31 zpool create -f zones c1t0d0
2014-03-28.20:32:36 zfs set atime=off zones
2014-03-28.20:32:36 zfs create -V 4096mb zones/dump
2014-03-28.20:32:36 zfs create zones/config
2014-03-28.20:32:36 zfs set mountpoint=legacy zones/config
2014-03-28.20:32:36 zfs create -o mountpoint=legacy zones/usbkey
2014-03-28.20:32:36 zfs create -o quota=10g -o mountpoint=/zones/global/cores -o compression=gzip zones/cores
2014-03-28.20:32:36 zfs create -o mountpoint=legacy zones/opt
2014-03-28.20:32:36 zfs create zones/var
2014-03-28.20:32:36 zfs set mountpoint=legacy zones/var
2014-03-28.20:32:41 zfs create -V 196599mb zones/swap
2014-03-28.20:37:59 zpool import -f zones
2014-03-28.20:37:59 zfs create -o mountpoint=legacy zones/zpoolcache
2014-03-28.20:37:59 zfs set checksum=noparity zones/dump
2014-03-28.20:37:59 zpool set feature@multi_vdev_crash_dump=enabled zones
2014-03-28.20:38:32 zfs create -o compression=lzjb -o mountpoint=/zones/archive zones/archive
2014-03-28.21:22:03 zfs destroy zones/swap
2014-03-28.21:22:06 zfs create -V51200M zones/swap
2014-03-28.21:22:06 zfs create zones/tmw-nas-3p
2014-03-28.21:22:06 zfs set mountpoint=/tmw-nas-3p zones/tmw-nas-3p
2014-03-28.21:22:06 zfs create zones/export
2014-03-28.21:22:06 zfs set mountpoint=/export zones/export
2014-03-28.21:24:17 zfs set volsize=196599M zones/dump
2014-03-28.21:27:13 zpool scrub zones
2014-03-28.21:46:41 zpool import -f zones
2014-03-28.21:46:42 zfs set checksum=noparity zones/dump
2014-03-28.21:46:47 zpool set feature@multi_vdev_crash_dump=enabled zones
2014-03-28.21:50:26 zpool scrub zones
2014-03-28.22:29:29 zpool scrub zones
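An aside on reading that error list: a zvol has no file path to print, so zpool status falls back to the dataset:<object> form; zones/dump:<0x1> names object 0x1 of the zones/dump dataset (for a zvol, that is its data object), not a filename. A hedged way to peek at that object out of curiosity; note that zdb reads the pool on-disk directly, and its output can be inconsistent while the pool is imported and active:

    # Dump the metadata for object 1 of the zones/dump dataset.
    zdb -dddd zones/dump 1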
On Mon, Mar 24, 2014 at 10:31 AM, Keith Wesolowski <keith.wesolow...@joyent.com> wrote:

On Sat, Mar 22, 2014 at 05:16:52PM -0400, Youzhong Yang wrote:

> I am having problems logging into the host, so I am unable to provide
> accurate information, but basically the 'zones' zpool is very simple, with
> two drives (Samsung SSD 840 Pro) in a mirrored configuration. When I had
> issues, MegaCli reported no errors on the drives. This issue happened on 3
> new hosts which have identical config/spec, so I hesitate to say it is a
> disk failing.
>
> I'll get back with more info later. Thanks.
>
> # zpool status
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zones       ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             c0t0d0  ONLINE       0     0     0
>             c0t1d0  ONLINE       0     0     0

tl;dr: nothing's wrong.

I don't see anything here that suggests data corruption. As Elijah pointed out, the FMA ereport that was generated occurred because a particular command (likely MODE SENSE or SELECT, though you can look it up yourself if you want) is not supported by either the underlying SSD or, more likely, by the MegaRAID controller. That has nothing to do with data corruption and, given the above zpool status, is completely harmless. In fact, it did not lead to a fault diagnosis at all, meaning I'd expect that fmadm faulty does not report any disks or ZFS vdevs as faulty at all.
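A hedged sketch of the checks Keith describes, using stock illumos FMA tooling (the ereport class is taken from the output quoted below; nothing here is specific to this system):

    # Confirm that no disk or ZFS vdev has actually been diagnosed as faulty.
    fmadm faulty

    # Pull up the raw ereports so the offending SCSI command can be identified.
    fmdump -eV -c ereport.io.scsi.cmd.disk.dev.rqs.derr

In the ereport quoted below, op-code 0x15 is MODE SELECT(6), and the sense data (key 0x5, asc 0x20, ascq 0x0) decodes to ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE, i.e. the target rejecting a command it does not implement rather than reporting a media error.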
> >> info for c0t0d0 and c0t1d0
>
> Device Name: ATA                        Product Id: Samsung SSD 840
> Rev: 5B0Q                               Vendor Specific: S1ATNSAF145313T
> Device Type: DISK                       Device ID: 10
> SAS Address 0: 0x4433221106000000       SAS Address 1: 0x0
> Media Error: 0                          Other Error: 0
> PredictiveFail: 0                       Firmware State: Online
> Speed: 6.0Gb/s                          DDF State: SATA
> Primary Defect: ---                     Grown Defect: ---
> Raw size: 244198 MB                     Non-coerced size: 243686 MB
> Coerced size: 243186 MB                 Enclosure index: 1
> Path Count: 1                           Slot Number 6
>
> Device Name: ATA                        Product Id: Samsung SSD 840
> Rev: 5B0Q                               Vendor Specific: S1ATNSAF145308A
> Device Type: DISK                       Device ID: 11
> SAS Address 0: 0x4433221107000000       SAS Address 1: 0x0
> Media Error: 0                          Other Error: 0
> PredictiveFail: 0                       Firmware State: Online
> Speed: 6.0Gb/s                          DDF State: SATA
> Primary Defect: ---                     Grown Defect: ---
> Raw size: 244198 MB                     Non-coerced size: 243686 MB
> Coerced size: 243186 MB                 Enclosure index: 1
> Path Count: 1                           Slot Number 7
>
> On Sat, Mar 22, 2014 at 1:22 PM, Elijah Wright <eli...@joyent.com> wrote:
>
> > What is the structure and current state of your zpool?
> >
> > This looks like a disk failing to me - does your pool have sufficient
> > redundancy to survive that?
> >
> > --e
> >
> > On Mar 22, 2014, at 10:44 AM, Youzhong Yang <youzh...@gmail.com> wrote:
> >
> > Thanks Keith.
> >
> > Here is what I saw on the console:
> >
> > <image.png>
> >
> > and this is from fmdump -eV:
> >
> > Mar 21 2014 22:45:25.998300087 ereport.io.scsi.cmd.disk.dev.rqs.derr
> > nvlist version: 0
> >         class = ereport.io.scsi.cmd.disk.dev.rqs.derr
> >         ena = 0x4eda9c21d805c01
> >         detector = (embedded nvlist)
> >         nvlist version: 0
> >                 version = 0x0
> >                 scheme = dev
> >                 device-path = /pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@1,0
> >                 devid = id1,sd@n6003048012d306001abea6270eb135f4
> >         (end detector)
> >
> >         devid = id1,sd@n6003048012d306001abea6270eb135f4
> >         driver-assessment = fail
> >         op-code = 0x15
> >         cdb = 0x15 0x10 0x0 0x0 0x20 0x0
> >         pkt-reason = 0x0
> >         pkt-state = 0x3f
> >         pkt-stats = 0x0
> >         stat-code = 0x2
> >         key = 0x5
> >         asc = 0x20
> >         ascq = 0x0
> >         sense-data = 0x70 0x0 0x5 0x0 0x0 0x0 0x0 0xb 0x0 0x0 0x0 0x0 0x20 0x0 0x0 0x0 0x0 0x0 0x0 0x0
> >         __ttl = 0x1
> >         __tod = 0x532cf945 0x3b80d9b7
> >
> > I have no idea if there is a relationship between dr_sas and the data
> > corruption (or should I say frustrating error messages?), but if you could
> > point me in the right direction for diagnosing/debugging this issue, I
> > would be more than happy to do so.
> >
> > Thanks,
> >
> > Youzhong
> >
> > On Sat, Mar 22, 2014 at 11:25 AM, Keith Wesolowski <keith.wesolow...@joyent.com> wrote:
> >
> >> On Sat, Mar 22, 2014 at 10:55:46AM -0400, Youzhong Yang wrote:
> >>
> >> > We had zones zpool data corruption when the 'dr_sas' driver is loaded
> >> > for the Supermicro SMC2108 (LSI MegaRAID) controller.
> >>
> >> What makes you believe there is a causal relationship between these two
> >> things?
> >>
> >> > After I rebuilt the smartos image with the change in /etc/driver_aliases,
> >> > it works fine. What I did was to roll back the following commit:
> >>
> >> Were you planning to give us even the slightest hint of what the
> >> "corruption" looked like, its history, your efforts to debug it, how you
> >> came to isolate its cause to this change, etc.? Or were you just trying
> >> to enrage and infuriate us by stirring up vague, impotent fear? If the
> >> latter, you're a winner.
> >>
> >> > https://github.com/joyent/smartos-live/commit/c0409b90008a6dd76afdf5d9aad0b5be8c0d6bec
> >> >
> >> > My questions are:
> >> > - what is this OS-1529 all about? does it fix any known issue by using
> >> >   the dr_sas driver? has this driver been fully tested?
> >>
> >> dr_sas was forked off mr_sas at the point in time when mr_sas took a
> >> huge amount of change to support various new devices. That change
> >> consisted of black-box code from LSI that we had limited ability to test
> >> or otherwise verify. Therefore, to reduce risk, we duplicated the
> >> driver and assigned the existing PCI IDs to dr_sas and the
> >> newly-supported ones only to mr_sas. Since we don't have, nor intend to
> >> have, any of the devices for which the change introduced support, this
> >> was a zero-risk way to add HW support for the benefit of third parties
> >> who want it.
> >>
> >> We have been running this particular driver (previously named mr_sas,
> >> now named dr_sas) in production for several years. I am not aware of a
> >> single incident in which the driver has been responsible for data
> >> corruption in that time. If anything here is poorly tested, it's
> >> mr_sas, not dr_sas.
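For anyone who wants to confirm which of the two drivers their controller is actually bound to after a change like the one above, a hedged sketch using standard illumos tooling (output formats vary by platform image):

    # Show which PCI aliases are assigned to each of the two drivers.
    grep -E 'dr_sas|mr_sas' /etc/driver_aliases

    # Show the driver attached to each node in the device tree.
    prtconf -D | grep -i sas

    # Check which of the two kernel modules is currently loaded.
    modinfo | grep -E 'dr_sas|mr_sas'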