Hi Youzhong,

I see that error fairly commonly whenever I upgrade between versions of SmartOS where the format of the dumps (which typically result from kernel panics) changes. The system can’t read them (as it’s now a different format), so it marks them as errored. You just need to remove the dump device and re-create it (a restart is involved). Once you’ve done that (which effectively clears the dump device), you shouldn’t see those errors anymore.

It isn’t a damaged/dying disk.

Thanks,
Dave
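A minimal sketch of what that remove-and-recreate sequence can look like, assuming the layout shown further down in this thread (a zones/dump zvol; the 4g size is only illustrative, taken from the initial creation in the zpool history). The zvol is normally busy while it is the active dump device, which is presumably why a restart is involved, so treat the ordering below as an assumption rather than a recipe:

    # Check which device dumpadm currently points at.
    dumpadm

    # Once zones/dump is no longer held open as the dump device (i.e. after the
    # restart Dave mentions), destroy and recreate the zvol.
    zfs destroy zones/dump
    zfs create -V 4g zones/dump
    zfs set checksum=noparity zones/dump

    # Point the dump subsystem back at the fresh zvol.
    dumpadm -d /dev/zvol/dsk/zones/dump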
On 29 Mar 2014, at 12:52 pm, Youzhong Yang <youzh...@gmail.com> wrote:

OK, this issue happened again. I don't know what it means; can you or someone help me figure out what is wrong underneath? Is it really a disk failure, or is it expected to report this kind of scary message because it is a dump area?

# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h3m with 2 errors on Fri Mar 28 22:32:50 2014
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       4     0     0
          c1t0d0    ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:

        zones/dump:<0x1>

[root@batfs9930 ~]# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/zones/dump (dedicated)
Savecore directory: /var/crash/volatile
  Savecore enabled: yes
   Save compressed: on

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zones              245G   219G   655K  /zones
zones/archive       31K   219G    31K  /zones/archive
zones/config        38K   219G    38K  legacy
zones/cores         31K  10.0G    31K  /zones/global/cores
zones/dump         192G   219G   192G  -
zones/export        35K   219G    35K  /export
zones/opt          962M   219G   962M  legacy
zones/swap        51.6G   271G    16K  -
zones/tmw-nas-3p   696M   219G   696M  /tmw-nas-3p
zones/usbkey       153K   219G   153K  legacy
zones/var         5.79M   219G  5.79M  legacy
zones/zpoolcache    31K   219G    31K  legacy

# zpool history
History for 'zones':
2014-03-28.20:32:31 zpool create -f zones c1t0d0
2014-03-28.20:32:36 zfs set atime=off zones
2014-03-28.20:32:36 zfs create -V 4096mb zones/dump
2014-03-28.20:32:36 zfs create zones/config
2014-03-28.20:32:36 zfs set mountpoint=legacy zones/config
2014-03-28.20:32:36 zfs create -o mountpoint=legacy zones/usbkey
2014-03-28.20:32:36 zfs create -o quota=10g -o mountpoint=/zones/global/cores -o compression=gzip zones/cores
2014-03-28.20:32:36 zfs create -o mountpoint=legacy zones/opt
2014-03-28.20:32:36 zfs create zones/var
2014-03-28.20:32:36 zfs set mountpoint=legacy zones/var
2014-03-28.20:32:41 zfs create -V 196599mb zones/swap
2014-03-28.20:37:59 zpool import -f zones
2014-03-28.20:37:59 zfs create -o mountpoint=legacy zones/zpoolcache
2014-03-28.20:37:59 zfs set checksum=noparity zones/dump
2014-03-28.20:37:59 zpool set feature@multi_vdev_crash_dump=enabled zones
2014-03-28.20:38:32 zfs create -o compression=lzjb -o mountpoint=/zones/archive zones/archive
2014-03-28.21:22:03 zfs destroy zones/swap
2014-03-28.21:22:06 zfs create -V51200M zones/swap
2014-03-28.21:22:06 zfs create zones/tmw-nas-3p
2014-03-28.21:22:06 zfs set mountpoint=/tmw-nas-3p zones/tmw-nas-3p
2014-03-28.21:22:06 zfs create zones/export
2014-03-28.21:22:06 zfs set mountpoint=/export zones/export
2014-03-28.21:24:17 zfs set volsize=196599M zones/dump
2014-03-28.21:27:13 zpool scrub zones
2014-03-28.21:46:41 zpool import -f zones
2014-03-28.21:46:42 zfs set checksum=noparity zones/dump
2014-03-28.21:46:47 zpool set feature@multi_vdev_crash_dump=enabled zones
2014-03-28.21:50:26 zpool scrub zones
2014-03-28.22:29:29 zpool scrub zones
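An aside on reading that error list: a zvol has no file path to print, so zpool status falls back to the dataset:<object> form; zones/dump:<0x1> names object 0x1 of the zones/dump dataset (for a zvol, that is its data object), not a filename. A hedged way to peek at that object out of curiosity; note that zdb reads the pool on-disk directly, and its output can be inconsistent while the pool is imported and active:

    # Dump the metadata for object 1 of the zones/dump dataset.
    zdb -dddd zones/dump 1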
On Mon, Mar 24, 2014 at 10:31 AM, Keith Wesolowski <keith.wesolow...@joyent.com> wrote:

On Sat, Mar 22, 2014 at 05:16:52PM -0400, Youzhong Yang wrote:

> I am having problems logging into the host, so I am unable to provide
> accurate information, but basically the 'zones' zpool is very simple, with
> two drives (Samsung SSD 840 Pro) in a mirrored configuration. When I had
> issues, MegaCli reported no errors on the drives. This issue happened on 3
> new hosts which have identical config/spec, so I hesitate to say it is a
> disk failing.
>
> I'll get back with more info later. Thanks.
>
> # zpool status
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zones       ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             c0t0d0  ONLINE       0     0     0
>             c0t1d0  ONLINE       0     0     0

tl;dr: nothing's wrong.

I don't see anything here that suggests data corruption. As Elijah pointed out, the FMA ereport that was generated occurred because a particular command (likely MODE SENSE or SELECT, though you can look it up yourself if you want) is not supported by either the underlying SSD or, more likely, by the MegaRAID controller. That has nothing to do with data corruption and, given the above zpool status, is completely harmless. In fact, it did not lead to a fault diagnosis at all, meaning I'd expect that fmadm faulty does not report any disks or ZFS vdevs as faulty at all.
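A hedged sketch of the checks Keith describes, using stock illumos FMA tooling (the ereport class is taken from the output quoted below; nothing here is specific to this system):

    # Confirm that no disk or ZFS vdev has actually been diagnosed as faulty.
    fmadm faulty

    # Pull up the raw ereports so the offending SCSI command can be identified.
    fmdump -eV -c ereport.io.scsi.cmd.disk.dev.rqs.derr

In the ereport quoted below, op-code 0x15 is MODE SELECT(6), and the sense data (key 0x5, asc 0x20, ascq 0x0) decodes to ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE, i.e. the target rejecting a command it does not implement rather than reporting a media error.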
> >> info for c0t0d0 and c0t1d0
>
> Device Name: ATA                        Product Id: Samsung SSD 840
> Rev: 5B0Q                               Vendor Specific: S1ATNSAF145313T
> Device Type: DISK                       Device ID: 10
> SAS Address 0: 0x4433221106000000       SAS Address 1: 0x0
> Media Error: 0                          Other Error: 0
> PredictiveFail: 0                       Firmware State: Online
> Speed: 6.0Gb/s                          DDF State: SATA
> Primary Defect: ---                     Grown Defect: ---
> Raw size: 244198 MB                     Non-coerced size: 243686 MB
> Coerced size: 243186 MB                 Enclosure index: 1
> Path Count: 1                           Slot Number 6
>
> Device Name: ATA                        Product Id: Samsung SSD 840
> Rev: 5B0Q                               Vendor Specific: S1ATNSAF145308A
> Device Type: DISK                       Device ID: 11
> SAS Address 0: 0x4433221107000000       SAS Address 1: 0x0
> Media Error: 0                          Other Error: 0
> PredictiveFail: 0                       Firmware State: Online
> Speed: 6.0Gb/s                          DDF State: SATA
> Primary Defect: ---                     Grown Defect: ---
> Raw size: 244198 MB                     Non-coerced size: 243686 MB
> Coerced size: 243186 MB                 Enclosure index: 1
> Path Count: 1                           Slot Number 7
>
> On Sat, Mar 22, 2014 at 1:22 PM, Elijah Wright <eli...@joyent.com> wrote:
>
> > What is the structure and current state of your zpool?
> >
> > This looks like a disk failing to me - does your pool have sufficient
> > redundancy to survive that?
> >
> > --e
> >
> > On Mar 22, 2014, at 10:44 AM, Youzhong Yang <youzh...@gmail.com> wrote:
> >
> > Thanks Keith.
> >
> > Here is what I saw on the console:
> >
> > <image.png>
> >
> > and this is from fmdump -eV:
> >
> > Mar 21 2014 22:45:25.998300087 ereport.io.scsi.cmd.disk.dev.rqs.derr
> > nvlist version: 0
> >         class = ereport.io.scsi.cmd.disk.dev.rqs.derr
> >         ena = 0x4eda9c21d805c01
> >         detector = (embedded nvlist)
> >         nvlist version: 0
> >                 version = 0x0
> >                 scheme = dev
> >                 device-path = /pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@1,0
> >                 devid = id1,sd@n6003048012d306001abea6270eb135f4
> >         (end detector)
> >
> >         devid = id1,sd@n6003048012d306001abea6270eb135f4
> >         driver-assessment = fail
> >         op-code = 0x15
> >         cdb = 0x15 0x10 0x0 0x0 0x20 0x0
> >         pkt-reason = 0x0
> >         pkt-state = 0x3f
> >         pkt-stats = 0x0
> >         stat-code = 0x2
> >         key = 0x5
> >         asc = 0x20
> >         ascq = 0x0
> >         sense-data = 0x70 0x0 0x5 0x0 0x0 0x0 0x0 0xb 0x0 0x0 0x0 0x0 0x20 0x0 0x0 0x0 0x0 0x0 0x0 0x0
> >         __ttl = 0x1
> >         __tod = 0x532cf945 0x3b80d9b7
> >
> > I have no idea if there is a relationship between dr_sas and the data
> > corruption (or should I say frustrating error messages?), but if you could
> > point me in the right direction for diagnosing/debugging this issue, I
> > would be more than happy to do so.
> >
> > Thanks,
> >
> > Youzhong
> >
> > On Sat, Mar 22, 2014 at 11:25 AM, Keith Wesolowski <keith.wesolow...@joyent.com> wrote:
> >
> >> On Sat, Mar 22, 2014 at 10:55:46AM -0400, Youzhong Yang wrote:
> >>
> >> > We had zones zpool data corruption when the 'dr_sas' driver is loaded
> >> > for the Supermicro SMC2108 (LSI MegaRAID) controller.
> >>
> >> What makes you believe there is a causal relationship between these two
> >> things?
> >>
> >> > After I rebuilt the smartos image with the change in /etc/driver_aliases,
> >> > it works fine. What I did was to roll back the following commit:
> >>
> >> Were you planning to give us even the slightest hint of what the
> >> "corruption" looked like, its history, your efforts to debug it, how you
> >> came to isolate its cause to this change, etc.? Or were you just trying
> >> to enrage and infuriate us by stirring up vague, impotent fear? If the
> >> latter, you're a winner.
> >>
> >> > https://github.com/joyent/smartos-live/commit/c0409b90008a6dd76afdf5d9aad0b5be8c0d6bec
> >> >
> >> > My questions are:
> >> > - what is this OS-1529 all about? does it fix any known issue by using
> >> >   the dr_sas driver? has this driver been fully tested?
> >>
> >> dr_sas was forked off mr_sas at the point in time when mr_sas took a
> >> huge amount of change to support various new devices. That change
> >> consisted of black-box code from LSI that we had limited ability to test
> >> or otherwise verify. Therefore, to reduce risk, we duplicated the
> >> driver and assigned the existing PCI IDs to dr_sas and the
> >> newly-supported ones only to mr_sas. Since we don't have, nor intend to
> >> have, any of the devices for which the change introduced support, this
> >> was a zero-risk way to add HW support for the benefit of third parties
> >> who want it.
> >>
> >> We have been running this particular driver (previously named mr_sas,
> >> now named dr_sas) in production for several years. I am not aware of a
> >> single incident in which the driver has been responsible for data
> >> corruption in that time. If anything here is poorly tested, it's
> >> mr_sas, not dr_sas.
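For anyone who wants to confirm which of the two drivers their controller is actually bound to after a change like the one above, a hedged sketch using standard illumos tooling (output formats vary by platform image):

    # Show which PCI aliases are assigned to each of the two drivers.
    grep -E 'dr_sas|mr_sas' /etc/driver_aliases

    # Show the driver attached to each node in the device tree.
    prtconf -D | grep -i sas

    # Check which of the two kernel modules is currently loaded.
    modinfo | grep -E 'dr_sas|mr_sas'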