Hi, I am having problems getting NVMe to work on SmartOS and hope that someone can enlighten me on what I need to do to get it flying.
I am running the newest SmartOS (joyent_20160330T234717Z) on a new Skylake-based Intel NUC6. My problem: the Samsung PM951 M.2 NVMe SSD is properly recognized and supported in Linux; I could even boot Ubuntu from that SSD. On SmartOS, the device is not recognized and is not listed in diskinfo, sysinfo or format.

The relevant device info (from Linux) is:

    02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)

    02:00.0 0108: 144d:a802 (rev 01) (prog-if 02 [NVM Express])
            Subsystem: 144d:a801
            Flags: bus master, fast devsel, latency 0
            Memory at df000000 (64-bit, non-prefetchable) [size=16K]
            I/O ports at e000 [size=256]
            Capabilities: [40] Power Management version 3
            Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
            Capabilities: [70] Express Endpoint, MSI 00
            Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
            Capabilities: [100] Advanced Error Reporting
            Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
            Capabilities: [158] Power Budgeting <?>
            Capabilities: [168] #19
            Capabilities: [188] Latency Tolerance Reporting
            Capabilities: [190] L1 PM Substates
            Kernel driver in use: nvme
            Kernel modules: nvme

    root@lubuntu:~# nvme list
    Node             Model                Version  Namepace Usage                      Format           FW Rev
    ---------------- -------------------- -------- -------- -------------------------- ---------------- --------
    /dev/nvme0n1     SAMSUNG MZVLV512HCJH 1.1      1        0.00 B / 512.11 GB         512 B + 0 B      BXV7000Q

    root@dws-desktop:~# nvme fw-log /dev/nvme0n1
    Firmware Log for device:nvme0n1
    afi  : 0x1
    frs1 : 0x5130303037565842 (BXV7000Q)

The PM951 is an NVMe 1.1 device, so I uncommented "strict-version=0;" in /kernel/drv/nvme.conf and ran "update_drv -vf nvme". That was not enough.

"grep nvme /etc/driver_aliases" only yields:

    nvme "pciex8086,953"

I tried:

    [root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme

That actually attached the driver and made the SSD visible. I could install the zones pool on that SSD.
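For anyone repeating the alias step above, here is a minimal sketch of how the check could be scripted before calling update_drv. The function name has_alias is hypothetical, and the sketch assumes that entries in /etc/driver_aliases have the form shown by the grep output above (driver name, whitespace, quoted alias).

```shell
# Hypothetical helper, not part of SmartOS: succeeds if alias $2
# (e.g. pci144d,a802) is bound to driver $1 in the driver_aliases-style
# file $3 (defaults to /etc/driver_aliases).
has_alias() {
    drv=$1
    id=$2
    file=${3:-/etc/driver_aliases}
    # Assumed entry format, as in the grep output: nvme "pciex8086,953"
    grep -q "^${drv}[[:space:]][[:space:]]*\"${id}\"" "$file"
}
```

It might be used as: has_alias nvme pci144d,a802 || update_drv -a -i '"pci144d,a802"' nvme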
To have SmartOS find the zones pool on boot I had to patch the boot_archive:

    mount /zones/smartos/platform/i86pc/amd64/boot_archive /mnt
    sed -i '' -e '/#strict-version=0;/{s:.*:strict-version=0;:;}' /mnt/kernel/drv/nvme.conf
    sed -i '' -e '/nvme/{p;s:.*:nvme "pci144d,a802":;}' /mnt/etc/driver_aliases
    umount /mnt

That did the trick. I could boot from USB straight into the zones pool on the NVMe.

To test things out, I copied some large files into the pool and ran a scrub afterwards. OMG!! This was the outcome:

    [root@nuc6 ~]# zpool status
      pool: zones
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://illumos.org/msg/ZFS-8000-8A
      scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
    config:

            NAME        STATE     READ WRITE CKSUM
            zones       ONLINE       0     0     5
              c0t1d0    ONLINE       0     0    14

    errors: 1 data errors, use '-v' for a list

Obviously something is very wrong here. I do NOT like to see this.

I looked at the file that was reported as corrupted. Since I had copied that file over from another machine, I could verify by means of the md5/sha256 checksums that the file was NOT actually corrupt. The checksums on both machines were identical!

I played around a little more, copied a few files and ran a few scrubs. I got many more checksum errors on scrubbing (how can there be fewer errors for the pool in total than for the one disk that makes up the pool?) and also a second file with permanent errors (let's call the first file "a" and the second "b"). I could verify that file "b" was also not corrupted.

Now I had both files a and b in the list of files with permanent errors. After some time, file a disappeared from that list and only b remained!!! Eventually that list of files was empty again! And that pool does not have redundancy, and I also did not use copies=2 or anything similar.
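The checksum comparison described above could be scripted roughly as follows. This is only a sketch: the function names sha256_of and same_digest are made up for illustration, and it assumes that either sha256sum (Linux) or digest(1) (illumos) is available on the machine.

```shell
# Hypothetical helper: print the SHA-256 digest of file $1.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        # sha256sum prints "<digest>  <file>"; keep only the digest
        sha256sum "$1" | awk '{print $1}'
    else
        # Assumption: illumos digest(1) is present on systems without sha256sum
        digest -a sha256 "$1"
    fi
}

# Hypothetical helper: succeeds if file $1 has the expected digest $2,
# e.g. the digest computed on the machine the file was copied from.
same_digest() {
    [ "$(sha256_of "$1")" = "$2" ]
}
```

Running sha256_of on the source machine and passing the result to same_digest on the SmartOS box reproduces the manual comparison.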
    [root@nuc6 ~]# zpool status -v
      pool: zones
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://illumos.org/msg/ZFS-8000-9P
      scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
    config:

            NAME        STATE     READ WRITE CKSUM
            zones       ONLINE       0     0     5
              c0t1d0    ONLINE       0     0    15

    errors: No known data errors
    [root@nuc6 ~]#

This output says "unrecoverable error" AND "No known data errors" at the same time! And the files with "Permanent errors" magically disappeared. How is that possible?

BTW, the memory is good. I ran a memory test overnight.

It appears that these errors only occur randomly in the read path of the scrubbing code. I have no explanation for this. Does anyone have any clues as to

a) what is going on?
b) what can be done about it?

Although I could prove that none of the permanent errors reported during scrubbing were real errors, the entire thing feels really bad and cannot be used like this IMHO.

Does anyone have NVMe running reliably? If so, with which devices?

Thanks for any help.

Cheers
Dirk