[zfs-discuss] How to enforce probing of all disks?
Hello all, I have a kind of lame question here: how can I force the system (OI) to probe all the HDD controllers and disks it can find, and be certain that it has searched everywhere for disks? My remotely supported home-NAS PC was unavailable for a while, and a friend rebooted it for me from a LiveUSB image with SSH (oi_148a). I can see my main pool disks, but not the old boot (rpool) drive - that is, it appears in neither the zpool import nor the format output. While it is possible that it has finally kicked the bucket (which would not really be unexpected), I'd like to try to confirm that. For example, it might fail to spin up or make contact with the SATA cable initially, but a subsequent probe of the same controller might just find it. That has happened before, too - though via a reboot and full POST... The friend won't be available for a few days, and there is no other remote management or inspection facility for this box, so I'd like to probe from within OI as much as I can. Should be an educational quest, too ;)

# cfgadm -al
Ap_Id                    Type       Receptacle   Occupant      Condition
Slot36                   sata/hp    connected    configured    ok
sata0/0::dsk/c5t0d0      disk       connected    configured    ok
sata0/1::dsk/c5t1d0      disk       connected    configured    ok
sata0/2::dsk/c5t2d0      disk       connected    configured    ok
sata0/3::dsk/c5t3d0      disk       connected    configured    ok
sata0/4::dsk/c5t4d0      disk       connected    configured    ok
sata0/5::dsk/c5t5d0      disk       connected    configured    ok
sata1/0                  sata-port  empty        unconfigured  ok
sata1/1                  sata-port  empty        unconfigured  ok
... (USB reports follow)

# devfsadm -Cv
-- nothing new found. Nothing of interest in dmesg either...

# scanpci -v | grep -i ata
 Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller
 JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller
 JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller

# prtconf -v | grep -i ata
  name='ata-dma-enabled' type=string items=1
  name='atapi-cd-dma-enabled' type=string items=1
  value='ADATA USB Flash Drive'
  value='ADATA'
  value='ADATA'
  name='sata' type=int items=1 dev=none
  value='SATA AHCI 1.0 Interface'
  dev_link=/dev/cfg/sata1/0
  dev_link=/dev/cfg/sata1/1
  name='ata-options' type=int items=1
  value='atapi'
  name='sata' type=int items=1 dev=none
  value='\_SB_.PCI0.SATA'
  value='SATA AHCI 1.0 Interface'
  dev_link=/dev/cfg/sata0/0
  dev_link=/dev/cfg/sata0/1
  dev_link=/dev/cfg/sata0/2
  dev_link=/dev/cfg/sata0/3
  dev_link=/dev/cfg/sata0/4
  dev_link=/dev/cfg/sata0/5
  value='id1,sd@SATA_ST2000DL003-9VT15YD217ZL'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'
  value='id1,sd@SATA_ST2000DL003-9VT15YD1XWWB'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'
  value='id1,sd@SATA_ST2000DL003-9VT15YD1VLKC'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'
  value='id1,sd@SATA_ST2000DL003-9VT15YD21QZL'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'
  value='id1,sd@SATA_ST2000DL003-9VT15YD24GCA'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'
  value='id1,sd@SATA_ST2000DL003-9VT15YD24GDG'
  name='sata-phy' type=int items=1
  value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'

This only sees the six ST2000DL003 drives of the main data pool, and the LiveUSB flash drive...
So - is it possible to try reinitializing and locating connections to the disk on a commodity motherboard (i.e. no lsiutil, IPMI and such) using only OI, without rebooting the box? The pools are not imported, so if I can detach and reload the SATA drivers I might try that, but I am stumped as to how.
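One avenue worth trying from within OI is cfgadm itself: its SATA plugin supports hardware-specific commands to reset and re-probe a port. This is a hedged sketch only - it assumes the missing disk hangs off one of the empty sata1/* ports shown in the cfgadm output above, and the exact `-x` subcommands accepted depend on the controller driver.

```shell
# Try to re-probe an empty SATA port without rebooting (port name assumed
# from the cfgadm -al output; adjust to the suspect port on your box).
cfgadm -x sata_reset_port sata1/0   # reset the PHY on the suspect port
cfgadm -c connect sata1/0           # try to (re)activate the receptacle
cfgadm -c configure sata1/0         # attach the device if one now responds
devfsadm -Cv                        # rebuild /dev links for anything found
cfgadm -al                          # check whether the port changed state
```

If the drive truly fails to spin up, none of this will help, but it at least exercises the same probe path a hotplug event would.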
Re: [zfs-discuss] SSD for L2arc
On 2013-03-21 16:24, Ram Chander wrote:
> Hi, Can I know how to configure an SSD to be used for L2ARC? Basically I want to improve read performance.

The zpool(1M) man page is quite informative on theory and concepts ;) If your pool already exists, you can prepare the SSD (partition/slice it) and:

# zpool add POOLNAME cache cXtYdZsS

Likewise, to add a ZIL device you can add a log device, either as a single disk (slice) or as a mirror of two or more:

# zpool add POOLNAME log cXtYdZsS
# zpool add POOLNAME log mirror cXtYdZsS1 cXtYdZsS2

> To increase write performance, will an SSD for the ZIL help? As I read on forums, the ZIL is only used for mysql/transaction-based writes. I have regular writes only.

It may increase performance in two ways. If you have any apps (including NFS, maybe VMs, iSCSI, etc. - not only databases) that regularly issue synchronous writes - those which must be stored on media (not just cached and queued) before the call returns success - then the ZIL catches these writes instead of the main pool devices. The ZIL is written as a ring buffer, so its size is proportional to your pool's throughput: about 3 full-size TXG syncs should fit into the designated ZIL space. That's usually max bandwidth (X MB/s) times 15 sec (3*5s), or a bit more for peace of mind.

1) If the ZIL device (SLOG) is an SSD, it is presumably quick, so writes should return quickly and sync IOs are blocked for less time.

2) If the SLOG is on HDD(s) separate from the main pool, then writes into the ZIL cause no mechanical seeks during normal pool IOs. Without a separate SLOG, the disk heads must travel to the reserved ZIL area and back - time stolen from both reads and writes in the pool. *Possibly*, fragmentation might also be reduced by having the ZIL outside of the main pool, though I may be technically wrong on that point.

3) As a *speculation*, it is likely that an HDD doing nothing but SLOG (i.e. a hot spare with a designated slice for ZIL, so it does something useful while waiting to replace a larger failed pool device) would also give a good boost to performance, since it won't have to seek much. The rotational latency will still be there, however, limiting reachable IOPS in comparison to an SSD SLOG.

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
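The sizing rule above ("about 3 full-size TXG syncs") reduces to simple arithmetic; here is a back-of-the-envelope sketch, where the bandwidth figure is an assumption you should replace with your pool's own peak sync-write rate.

```shell
# SLOG sizing per the rule of thumb above: ~3 TXG syncs of peak bandwidth.
BW_MBS=500    # assumed peak sync-write bandwidth, MB/s (substitute yours)
TXG_S=5       # default TXG sync interval, seconds
SLOG_MB=$((BW_MBS * TXG_S * 3))
echo "suggested SLOG size: ${SLOG_MB} MB"
```

So a pool that can absorb 500 MB/s of sync writes needs only on the order of 7-8 GB of SLOG - far less than the capacity of any modern SSD.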
Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade
On 2013-03-20 17:15, Peter Wood wrote:
> I'm going to need some help with the crash dumps. I'm not very familiar with Solaris. Do I have to enable something to get the crash dumps? Where should I look for them?

Typically, kernel crash dumps are created as a result of a kernel panic; they may also be forced by administrative actions like an NMI. They require you to configure a dump volume of sufficient size (see dumpadm(1M)) and a /var/crash directory, which may be a dataset on a large enough pool - after the reboot, the dump data will be migrated there.

To help with the hangs you can try the BIOS watchdog (which would require a bmc driver; the one known from OpenSolaris is alas not open-sourced and not redistributable), or a software deadman timer:
http://www.cuddletech.com/blog/pivot/entry.php?id=1044
http://wiki.illumos.org/display/illumos/System+Hangs

Also, if you configure crash-dump-on-NMI and set up your IPMI card, then you can likely gain remote access to both the server console (physical and/or serial) and may be able to trigger the NMI, too.

HTH,
//Jim

> Thanks for the help.
>
> On Wed, Mar 20, 2013 at 8:53 AM, Michael Schuster michaelspriv...@gmail.com wrote:
>> How about crash dumps? michael
>
> On Wed, Mar 20, 2013 at 4:50 PM, Peter Wood peterwood...@gmail.com wrote:
>> I'm sorry. I should have mentioned that I can't find any errors in the logs. The last entry in /var/adm/messages is that I removed the keyboard after the last reboot, and then it shows the new boot-up messages when I boot the system after the crash. The BIOS log is empty. I'm not sure how to check the IPMI, but IPMI is not configured and I'm not using it. Just another observation - the crashes are more intense the more data the system serves (NFS). I'm looking into firmware upgrades for the LSI now.
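For reference, the dump setup described above can be sketched roughly as follows. This is hedged: the zvol name and size are assumptions (a common rule of thumb is a dump device at least half the size of RAM, though compression usually makes smaller work), and must be adapted to your pool layout.

```shell
# Configure crash dumps on an illumos/OI box (names/sizes are assumptions).
zfs create -V 16G rpool/dump2          # dedicated dump zvol
dumpadm -d /dev/zvol/dsk/rpool/dump2   # point the dump subsystem at it
dumpadm -s /var/crash                  # where savecore migrates dumps post-reboot
dumpadm                                # review the resulting configuration
savecore -L                            # optional: take a live dump now, to test
```

After a panic, look in /var/crash for vmdump.N (or unix.N/vmcore.N) files.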
On Wed, Mar 20, 2013 at 8:40 AM, Will Murnane will.murn...@gmail.com wrote:
> Does the Supermicro IPMI show anything when it crashes? Does anything show up in event logs in the BIOS, or in system logs under OI?

On Wed, Mar 20, 2013 at 11:34 AM, Peter Wood peterwood...@gmail.com wrote:
> I have two identical Supermicro boxes with 32GB RAM. Hardware details at the end of the message. They were running OI 151.a.5 for months. The zpool configuration was one storage zpool with 3 vdevs of 8 disks in RAIDZ2. The OI installation is absolutely clean: just next-next-next until done. All I do is configure the network after install; I don't install or enable any other services. Then I added more disks and rebuilt the systems with OI 151.a.7, and this time configured the zpool with 6 vdevs of 5 disks in RAIDZ. The systems started crashing really badly. They just disappear from the network: black and unresponsive console, no error lights, but no activity indication either. The only way out is to power-cycle the system. There is no pattern to the crashes - it may crash in 2 days, it may crash in 2 hours. I upgraded the memory on both systems to 128GB to no avail (this is the max memory they can take). In summary, all I did was upgrade to OI 151.a.7 and reconfigure the zpool. Any idea what could be the problem?
Thank you

--
Peter

Supermicro X9DRH-iF
Xeon E5-2620 @ 2.0 GHz 6-Core
LSI SAS9211-8i HBA
32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K

--
Michael Schuster
http://recursiveramblings.wordpress.com/
Re: [zfs-discuss] partitioned cache devices
On 2013-03-19 20:38, Cindy Swearingen wrote:
> Hi Andrew, Your original syntax was incorrect. A p* device is a larger container for the d* device or s* devices. In the case of a cache device, you need to specify a d* or s* device. That you can add p* devices to a pool is a bug.

I disagree; at least, I've always thought differently: the d device is the whole-disk denomination, with a unique number for a particular controller link (c+t). The disk has some partitioning table, MBR or GPT/EFI. In these tables, partition p0 addresses the whole disk, and thus the table itself (i.e. it is used to manage partitioning), and the rest kind of depends.

In the case of MBR tables, one partition may be marked with the Solaris (or Solaris2) type; there it holds an SMI table of Solaris slices, and these slices can hold legacy filesystems or components of ZFS pools. In the case of GPT, the GPT partitions can be used directly by ZFS; however, they are also denominated as slices in ZFS and the format utility. I believe that Solaris-based OSes accessing a p-named partition and an s-named slice of the same number on a GPT disk should reach the same range of bytes on disk, but I am not really certain about this.

Also, if a whole disk is given to ZFS (and for OSes other than the latest Solaris 11 this means non-rpool disks), then ZFS labels the disk as GPT and defines a partition for itself, plus a small trailing partition (likely to level out discrepancies with replacement disks that might happen to be a few sectors too small). In this case ZFS reports that it uses cXtYdZ as a pool component, since it considers itself in charge of the partitioning table and its contents, and doesn't intend to share the disk with other usages (dual-booting and other OSes' partitions, or SLOG and L2ARC parts, etc.).
This also allows ZFS to influence hardware-related choices, like caching and throttling, and likely auto-expansion with changed LUN sizes (by fixing up the partition table along the way), since it assumes being 100% in charge of the disk.

I don't think there is a crime in trying to use the partitions (of either kind) as ZFS leaf vdevs; even the zpool(1M) manpage states that:

   ... The following virtual devices are supported:
   disk    A block device, typically located under /dev/dsk. ZFS can use
           individual slices or partitions, though the recommended mode of
           operation is to use whole disks. ...

This is orthogonal to the fact that there can only be one Solaris slice table, inside one partition, on MBR. AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

On my old home NAS with OpenSolaris I certainly did have MBR partitions on the rpool disk, intended initially for some dual-booted OSes but repurposed as L2ARC and ZIL devices for the storage pool on other disks when I played with that technology. Didn't gain much with a single spindle ;)

HTH,
//Jim Klimov
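To make the s*-versus-p* advice from earlier in the thread concrete, here is a hedged sketch; the pool and device names are assumptions, and the slice must already exist in the disk's label (set it up with format/fmthard first).

```shell
# Add a slice - not a p* device - as L2ARC, then verify (names assumed).
zpool add tank cache c5t6d0s0      # s0 slice of the SSD as cache device
zpool status tank                  # the cache vdev should now be listed
# Later, cache devices can be removed freely, unlike data vdevs:
zpool remove tank c5t6d0s0
```

Since L2ARC contents are disposable, experimenting with cache devices is one of the few low-risk pool reconfigurations.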
Re: [zfs-discuss] partitioned cache devices
On 2013-03-19 22:07, Andrew Gabriel wrote:
> The GPT partitioning spec requires the disk to be FDISK partitioned with just one single FDISK partition of type EFI, so that tools which predate GPT partitioning will still see such a GPT disk as fully assigned to FDISK partitions, and therefore be less likely to accidentally blow it away.

Okay, I guess I got entangled in terminology now ;) Anyhow, your words are not all news to me, though my write-up was likely misleading to unprepared readers... sigh... Thanks for the clarifications and deeper details that I did not know!

So, we can concur that GPT does indeed include the fake MBR header (the protective MBR) with one EFI partition, which addresses the smaller of 2TB (the MBR limit) or the disk size, minus a few sectors for the GPT housekeeping. Inside the EFI partition are defined the GPT, um, partitions (represented as slices in Solaris). This is after all a GUID *Partition* Table, and that's how parted refers to them too ;)

Notably, there are also unportable tricks to fool legacy OSes and bootloaders into addressing the same byte ranges via both MBR entries (forged manually, abusing the GPT/EFI spec) and proper GPT entries, as partitions in the sense of each table.

//Jim
Re: [zfs-discuss] [zfs] Petabyte pool?
On 2013-03-16 15:20, Bob Friesenhahn wrote:
> On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:
>> Well, off the top of my head: 2 x storage heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs; 8 x 60-bay JBODs with 60 x 4TB SAS drives; RAIDZ2 striped over the 8 JBODs. That should fit within 1 rack comfortably and provide 1 PB storage..
> What does one do for power? What are the power requirements when the system is first powered on? Can drive spin-up be staggered between JBOD chassis? Does the server need to be powered up last so that it does not time out on the zfs import?

I guess you can use managed PDUs like those from APC (many models for various socket types and counts): they can be scripted at an advanced level, and at a basic level I think per-socket delays can simply be configured to produce a staggered startup once power arrives from the wall (UPS), regardless of what the boxes' individual power supplies can do. Conveniently, they also allow a remote hard-reset of hung boxes without walking to the server room ;)

My 2c,
//Jim Klimov
Re: [zfs-discuss] [zfs] Petabyte pool?
On 2013-03-16 15:20, Bob Friesenhahn wrote:
> On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:
>> Well, off the top of my head: 2 x storage heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs; 8 x 60-bay JBODs with 60 x 4TB SAS drives; RAIDZ2 striped over the 8 JBODs. That should fit within 1 rack comfortably and provide 1 PB storage..
> What does one do for power? What are the power requirements when the system is first powered on? Can drive spin-up be staggered between JBOD chassis? Does the server need to be powered up last so that it does not time out on the zfs import?

Giving this question a second thought, I think the JBODs should spin up quickly (i.e. as soon as power is given), while the server head(s) take time to pass POST and initialize their HBAs and other stuff. Booting 8 JBODs, one every 15 seconds (enough to complete a typical spin-up power draw), would take a couple of minutes. It is likely that a server booted along with the first JBOD won't get to importing the pool that quickly ;)

Anyhow, with such a system attention should be given to redundant power and cooling, including redundant UPSes, preferably fed from different power lines coming into the room. This does not seem like a fantastic power sucker, however: 480 drives at 15W would consume 7200W; add a bit for the processor/RAM heads (perhaps a kW?) and this would still fit into 8-10kW, so a couple of 15kVA UPSes (or several smaller ones) should suffice, including redundancy. This might overall exceed a rack in size, though for power/cooling this seems like a standard figure for a 42U rack, or just a bit more.

//Jim
Re: [zfs-discuss] Sun X4200 Question...
On 2013-03-11 21:50, Bob Friesenhahn wrote:
> On Mon, 11 Mar 2013, Tiernan OToole wrote:
>> I know this might be the wrong place to ask, but hopefully someone can point me in the right direction... I got my hands on a Sun X4200. It's the original one, not the M2, and has 2 single-core Opterons, 4GB RAM and 4 73GB SAS disks... But I don't know what to install on it... I was thinking of SmartOS, but the site mentions Intel support for VT, but nothing for AMD... The Opterons don't have VT, so I won't be using Xen, but the Zones may be useful...
> OpenIndiana or OmniOS seem like the most likely candidates. You can run VirtualBox on OpenIndiana, and it should be able to work without VT extensions.

Also note that without the extensions VirtualBox has some quirks - most notably, lack of acceleration and of support for virtual SMP. But unlike some other virtualizers, it should work (it does work for us on a Thumper, also with pre-VT-x Opteron CPUs). However, recently the VMs' virtual hardware clocks became way slow. I am at a loss so far; the forum was moderately helpful - probably the load on the host and the induced latencies play their role. But the problem does happen on more modern hardware too, so (lack of) VT-x shouldn't be the reason in our case...

//Jim
Re: [zfs-discuss] Sun X4200 Question...
On 2013-03-15 01:58, Gary Driggs wrote:
> On Mar 14, 2013, at 5:55 PM, Jim Klimov jimkli...@cos.ru wrote:
>> However, recently the VM virtual hardware clocks became way slow.
> Does NTP help correct the guest's clock?

Unfortunately no: neither guest NTP, ntpdate or rdate in crontabs, nor the VirtualBox timesync settings - alone or even combined for a test (though they are known to conflict) - nothing has definitely helped so far. We also have some setups on rather unloaded hardware where, after a few days of uptime, the clock stalls to the point that it has a groundhog day - rotating over the same 2-3 second range for hours, until the VM is powered off and booted again. Conversely, we also have dozens of VMs (and a few hosts) where no such problems occur. Weird stuff...

//Jim
Re: [zfs-discuss] SVM ZFS
On 2013-02-26 21:30, Morris Hooten wrote:
> Besides copying data from /dev/md/dsk/x volume-manager filesystems to new zfs filesystems, does anyone know of any conversion tools to make the conversion/migration from SVM to ZFS easier?

Do you mean something like a tool that would change the metadata around your user data in place and turn an SVM volume into a ZFS pool, like Windows' built-in FAT-to-NTFS conversion? No, there's nothing like that.

However, depending on your old system's configuration, you might have to be careful about the choice of copying programs. Namely, if your setup used any ACLs (beyond the standard POSIX access bits), then you need ACL-aware copying tools. Sun tar and cpio are such tools (see their manpages for usage examples); rsync 3.0.10 was recently reported to support Solaris ACLs as well, but I didn't test that myself. GNU tar and cpio are known to do a poor job with intimate Solaris features, though they might be superior for some other tasks. Basic (Sun, not GNU) cp and mv should work correctly too. I most often use rsync -avPHK /src/ /dst/, especially if there are no ACLs to think about, or the target's inheritable ACLs are acceptable (and overriding them with the original's access rights might even be wrong).

Also, before you do the migration, think ahead about the storage and IO requirements for the datasets. For example, log files are often huge and compress into orders of magnitude less space, and the IOPS loss might be negligible (there may even be a boost, due to smaller hardware IOs and fewer seeks). Randomly accessed (written) data might not like the heavier compressions. Databases or VM images might benefit from smaller maximum block sizes, although often these are not matched 1:1 with the DB block size, but rather balanced at about 4 DB entries per FS block of 32KB or 64KB (from what I saw suggested on the list). Singly-written data, like OS images, might benefit from compression as well.
If you have local zones, you might benefit from carrying over (or installing from scratch) one typical example zone as a DUMMY into a dedicated dataset, then cloning it into as many actual zone roots as you need, and running rsync -cavPHK --delete-after from the originals into those datasets - this way only differing files (or parts thereof) are transferred, giving you the benefits of cloning (space savings) without the downsides of deduplication.

Also, for data in the zones (such as database files, tomcat/glassfish application server roots, etc.) you might like to use separate dataset hierarchies mounted via delegation of a root ZFS dataset into the zones. This way your zone roots live a life separate from the application data and non-packaged applications, which might simplify backups, etc., and you might be able to store these pieces in different pools (i.e. SSDs for some data and HDDs for other - though most list members would rightfully argue in favor of L2ARC on the SSDs).

HTH,
//Jim Klimov
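The clone-then-rsync zone migration described above can be sketched as follows. This is hedged: the dataset and path names are assumptions, and on a real system the zone must be halted and its configuration adjusted to the new zonepath.

```shell
# Seed one example zone root, snapshot it, then clone per real zone
# (dataset/path names are assumptions).
zfs snapshot rpool/zones/DUMMY@golden
zfs clone rpool/zones/DUMMY@golden rpool/zones/zone1
# Sync only the differences from the original zone root into the clone;
# -c compares checksums so unchanged files are left alone (stay shared).
rsync -cavPHK --delete-after /old/zones/zone1/root/ /zones/zone1/root/
```

Repeat the clone+rsync pair per zone; shared blocks inherited from @golden cost no extra space.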
Re: [zfs-discuss] SVM ZFS
Ah, I forgot to mention: ufsdump|ufsrestore was at some point also a recommended way of making such a transition ;) I think it should be aware of all the intimacies of the FS, including sparse files, which reportedly may puzzle some other archivers. Although with any sort of ZFS compression (including the lightweight zle), zero-filled blocks should translate into zero IOs anyway (maybe some metadata would appear to address the holes, however). With proper handling of sparse files you don't write any of that voidness into the FS, and you don't process anything on reads.

Have fun,
//Jim
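The ufsdump|ufsrestore route mentioned above is a one-pipeline affair; a hedged sketch follows, with device and mount-point names as assumptions (the source should be idle or mounted read-only for a consistent dump).

```shell
# Migrate a UFS filesystem into a mounted ZFS dataset in one pass
# (device/paths are assumptions). Level-0 dump to stdout, restore from stdin.
cd /tank/newhome || exit 1
ufsdump 0f - /dev/rdsk/c0t0d0s7 | ufsrestore -rf -
# ufsrestore leaves a restoresymtable file used for incremental restores;
# it can be removed after a full (non-incremental) migration.
rm -f restoresymtable
```

Sparse files and special files survive this path, which is exactly why it was the recommended method.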
Re: [zfs-discuss] ZFS Distro Advice
On 2013-02-27 05:36, Ian Collins wrote:
> Bob Friesenhahn wrote:
>> On Wed, 27 Feb 2013, Ian Collins wrote: [...]
>> I am finding that rsync with the right options (to directly block-overwrite) plus zfs snapshots is providing me with pretty amazing deduplication for backups, without even enabling deduplication in zfs. Now backup storage goes a very long way.
> We do the same for all of our legacy operating system backups. Take a snapshot, then do an rsync - an excellent way of maintaining incremental backups for those.
>> Magic rsync options used: -a --inplace --no-whole-file --delete-excluded
>> This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, zfs COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.
> Do these options impact performance or reduce the incremental stream sizes? I just use -a --delete, and the snapshots don't take up much space (compared with the incremental stream sizes).

Well, to be certain, you can create a dataset with a large file in it, snapshot it, rsync over a changed variant of the file, snapshot again and compare the referenced sizes. If the file was rewritten into a new temporary file and then renamed over the original, you'd likely end up with as much used storage as for the original file. If only the changes are written into it in place, then you'd use a lot less space (and you'd not see a .garbledfilename in the directory during the process).

If you use rsync over the network to back up stuff, here's an example of an SMF wrapper for rsyncd, and a config sample to make a snapshot after completion of the rsync session:
http://wiki.openindiana.org/oi/rsync+daemon+service+on+OpenIndiana

HTH,
//Jim Klimov
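The experiment suggested above can be sketched like this; it is hedged - the pool, dataset and source-path names are assumptions, and you need an existing pool with some free space.

```shell
# Measure how much new space an in-place rsync actually consumes
# (pool/path names are assumptions).
zfs create tank/rstest
dd if=/dev/urandom of=/tank/rstest/big bs=1M count=100
zfs snapshot tank/rstest@before
# Overwrite with a slightly-changed copy, in place: only the rewritten
# blocks become new COW blocks, so @before should hold little unique data.
rsync --inplace --no-whole-file /path/to/changed/big /tank/rstest/big
zfs snapshot tank/rstest@after
zfs list -t all -o name,used,refer -r tank/rstest
```

With --inplace, the USED column of @before should be far below the file size; repeating the test with a plain rsync (default temp-file behavior) should show @before pinning roughly the whole original file.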
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On 2013-02-20 23:49, Markus Grundmann wrote:
> add a pool / filesystem property as an additional security layer for administrators. Whenever I modify zfs pools or filesystems it's possible to destroy [on a bad day :-)] my data. A new property protected=on|off on the pool and/or filesystem can help the administrator against data loss (e.g. the zpool destroy tank or zfs destroy tank/filesystem command would be rejected when the protected=on property is set).

Hello all, I don't want to really hijack this thread, but this request seems like a nice complement to one I voiced a few times and recently posted into the bug tracker, lest it be forgotten:

Feature #3568: Add a ZFS dataset attribute to disallow creation of snapshots, ever
https://www.illumos.org/issues/3568

It is somewhat of an opposite desire - to disallow creation of datasets (snapshots) rather than forbid their destruction as requested here - but to a similar effect: to not let some scripted or thoughtless manual jobs abuse the storage by wasting space in some datasets in the form of snapshot creation.

//Jim
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On 2013-02-21 16:54, Markus Grundmann wrote:
> Is there anyone here on the list who has some tips for me on which files to modify? :-) In my current source tree a new property PROTECTED is now available both for pool and zfs objects. I have also added two functions to get and set the property above. The source code tree is very big and some files have the same name in different locations. grep seems to be my new friend.

You might also benefit from on-line grepping here:
http://src.illumos.org/source/search?q=zfs_do_hold&defs=&refs=&path=&hist=&project=freebsd-head

There is a project freebsd-head in the illumos OpenGrok codebase; I have no idea how current it is for the BSD users.

HTH,
//Jim
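As a hedged pointer for the "which files to modify" question: from my recollection of the illumos tree layout (paths and symbol prefixes are assumptions - verify them in your own checkout), the shared property tables and the destroy subcommands can be located with grep like this:

```shell
# Where pool/dataset properties are registered (paths are assumptions):
grep -n "ZPOOL_PROP_" usr/src/common/zfs/zpool_prop.c | head
grep -n "ZFS_PROP_"   usr/src/common/zfs/zfs_prop.c   | head
# Where the destroy subcommands are implemented in the CLI tools:
grep -n "zfs_do_destroy"   usr/src/cmd/zfs/zfs_main.c
grep -n "zpool_do_destroy" usr/src/cmd/zpool/zpool_main.c
```

A new property would likely need entries in the property tables, plus enforcement checks in the destroy paths of both the CLI and the kernel-side ioctl handlers.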
Re: [zfs-discuss] cannot destroy, volume is busy
On 2013-02-21 17:02, John D Groenveld wrote:
> # zfs list -t vol
> NAME          USED   AVAIL  REFER  MOUNTPOINT
> rpool/dump    4.00G  99.9G  4.00G  -
> rpool/foo128  66.2M   100G    16K  -
> rpool/swap    4.00G  99.9G  4.00G  -
> # zfs destroy rpool/foo128
> cannot destroy 'rpool/foo128': volume is busy

Can anything local be holding it (databases, VirtualBox, etc.)? Can there be any clones, held snapshots or an ongoing zfs send? (Perhaps an aborted send left a hold?)

Sometimes I have hit a bug where a filesystem dataset became so busy that I couldn't snapshot it; unmounting and mounting it back usually helped. That was back in the days of SXCE snv_117 and Solaris 10u8, and the bug often popped up in conjunction with LiveUpgrade. I believe that particular issue has since been fixed, but maybe something new like it has appeared?..

Hopefully some on-list gurus might walk you through the use of a debugger or dtrace to track which calls are being made by zfs destroy and lead it to conclude that the dataset is busy. I really only know to use truss -f -l progname params, which helps most of the time, and would love to learn the modern equivalents which give more insight into the code.

//Jim
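The checks suggested above can be run in a few commands; this is a hedged sketch - the snapshot name is hypothetical, and the zvol device path assumes the usual /dev/zvol layout.

```shell
# Hunt for whatever is holding the 'busy' zvol (names are assumptions):
zfs list -t all -r rpool/foo128       # snapshots/clones hanging off the volume
zfs holds rpool/foo128@somesnap       # user holds, e.g. from an aborted zfs send
fuser /dev/zvol/rdsk/rpool/foo128     # local processes with the device open
# And watch the failing syscalls, as mentioned above:
truss -f -l zfs destroy rpool/foo128
```

If zfs holds shows a hold, `zfs release <tag> rpool/foo128@somesnap` should clear it; if fuser reports PIDs, stop those consumers first.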
Re: [zfs-discuss] zfs raid1 error resilvering and mount
On 2013-02-19 12:39, Konstantin Kuklin wrote:
> i didn't replace the disk; after the reboot the system did not start (zfs is installed as the default root system), so i booted another system (from flash) and the resilvering auto-started and showed me warnings, with frozen progress (dead on checking zroot/var/crash)

Well, in this case try again with the zpool import options I've described earlier, and zpool scrub, to try to inspect and repair the pool state you have now. You might want to disconnect the broken disk for now, since resilvering would try to overwrite it anyway (the whole disk, or just the differences if it is found to have a valid label ending at an earlier TXG number).

> will replacing the dead disk heal var/crash with the 0x0 address?

Probably not, since your pool's only copy has an error in it. 0x0 is a metadata block (the dataset root, or close to it), so an error there is usually fatal (it is, for most dataset types). Possibly, an import with rollback can return your pool to a state where another blockpointer-tree version points to a different (older) block as this dataset's 0x0, and that one may be valid. But if you've already imported the pool and it ran for a while, chances are that your older, possibly better intact TXGs are no longer referencable (rolled out of the ring buffer forever).

Good luck,
//Jim
Re: [zfs-discuss] zfs raid1 error resilvering and mount
On 2013-02-19 14:24, Konstantin Kuklin wrote:
>> zfs set canmount=off zroot/var/crash
> i can't do this, because zfs list is empty

I'd argue that in your case it might be desirable to evacuate the data and reinstall the OS - just to be certain that the ZFS on-disk structures of the new installation have no defects. To evacuate the data, a read-only import would suffice:

# zpool import -f -N -R /a -o ro zroot

This should import the pool without mounting its datasets (-N). Using zfs mount zpool/ROOT/myrootfsname and so on, you can mount just the datasets which hold your valuable data individually (under '/a' in this example), and rsync them to some other storage.

After you've saved your data, you can try to repair the pool by rolling back:

# zpool export zroot
# zpool import -F -f -N -R /a zroot

This should try to roll back 10 transaction sets or so, possibly giving you an intact state of the ZFS data structures and a usable pool. Maybe not.

//Jim
Re: [zfs-discuss] zfs raid1 error resilvering and mount
On 2013-02-19 17:02, Victor Latushkin wrote: On 2/19/13 6:32 AM, Jim Klimov wrote: On 2013-02-19 14:24, Konstantin Kuklin wrote: zfs set canmount=off zroot/var/crash i can`t do this, because zfs list empty I'd argue that in your case it might be desirable to evacuate data and reinstall the OS - just to be certain that ZFS on-disk structures on new installation have no defects. To evacuate data, a read-only import would suffice: This is a good idea but .. # zpool import -f -N -R /a -o ro zroot This command will not achieve readonly import. For readonly import one needs to use 'zpool import -o readonly=on poolname' as 'zpool import -o ro poolname' will import in R/W mode and just mount filesystems readonly. Oops, my bad. Do what the guru says! Really, I was mistaken in this fast-typing ;) Feel free to add other options (-f, -N, etc) as needed. This should import the pool without mounting its datasets (-N). Using zfs mount zpool/ROOT/myrootfsname and so on you can mount just the datasets which hold your valuable data individually (under '/a' in this example), and rsync it to some other storage. After you've saved your data, you can try to repair the pool by roll back: # zpool export zpool # zpool import -F -f -N -R /a zroot This should try to roll back 10 transaction sets or so, possibly giving you an intact state of ZFS data structures and a usable pool. Maybe not. 
//Jim

--
Климов Евгений / Jim Klimov
технический директор / CTO
ЗАО ЦОС и ВТ / JSC COSHT
+7-903-7705859 (cellular)
mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@gmail.com
() ascii ribbon campaign - against html mail
/\ - against microsoft attachments
Re: [zfs-discuss] zfs raid1 error resilvering and mount
On 2013-02-17 15:46, Konstantin Kuklin wrote:
> hi, i have raid1 on zfs with 2 device on pool
> first device died and boot from second not working...

You didn't say which OS version created the pool (ultimately - which pool version is there), and I'm not sure about support of the zfs versions in that flash image you linked to. Possibly, the OI LiveCD might do a better job for you - but maybe your disks got too corrupted in some cataclysm :(

However, generally, recent implementations should have several useful "zpool import" flags:

* forcing an import with rollback to an older pool state (-F) - which may or may not be more intact (up to 32 or 128 transactions);
* import without automount (-N);
* read-only import (-o ro), which should panic in far fewer cases and allows you to evacuate readable data by at least cp/rsync;
* import without a cachefile and/or with a relocated pool root mountpoint (-R /a) so as to, in particular, not damage the namespace of your system with this pool (not really relevant in the case of livecds).

Hopefully, you can either import without mounts and issue a zfs destroy of your offending dataset, or roll back (irreversibly) to a working state.

However, it is also possible that the corruption is among the metadata. If you're lucky and just the latest transaction got broken during the crash (i.e. the disk firmware ignored queuing and caching hints, and wrote something out of order), then a rollback by one or a few TXGs may point you to an older root of the metadata tree which is not yet overwritten by newer transactions (note: this is not guaranteed by the OS, just probable) and does contain consistent metadata in at least one copy of each of the metadata blocks.

Breakage in /var/crash remotely suggests that your system tried to either create a dump (kernel panic) or, more likely, process one (via savecore in the case of Solaris), and failed mid-write during this procedure.
> i try to get http://mfsbsd.vx.sk/ flash and load from it with zpool
> import http://puu.sh/2402E
> when i load zfs.ko and opensolaris.ko i see this message:
> Solaris: WARNING: Can't open objset for zroot/var/crash
> Solaris: WARNING: Can't open objset for zroot/var/crash
> zpool status: http://puu.sh/2405f
> resilvering freeze with: zpool status -v
>   zroot/usr:0x28ff
>   zroot/usr:0x29ff
>   zroot/usr:0x2aff
>   zroot/var/crash:0x0
> root@Flash:/root #
> how i can delete or drop it fs zroot/var/crash (1m-10m size i didn`t
> remember) and mount other zfs points with my data
> --
> Best regards, Konstantin Kuklin.

Good luck,
//Jim Klimov
Re: [zfs-discuss] zfs raid1 error resilvering and mount
Also, adding to my recent post: instead of resilvering, try to run "zpool scrub" first - it should verify all checksums and repair whatever it can via redundancy (for metadata - extra copies). Resilver is similar to scrub, but it has its own goals and implementation, and might not be so forgiving about pool errors.

//Jim
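A minimal sketch of that suggestion, with the pool name from this thread (it obviously needs the live pool to run against):

```shell
# Verify all checksums and let ZFS repair what redundancy allows,
# then watch progress and per-device error counts.
zpool scrub zroot
zpool status -v zroot
```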
Re: [zfs-discuss] HELP! RPool problem
On 2013-02-16 21:49, John D Groenveld wrote:
>> By the way, whatever the error message is when booting, it disapears
>> so quickly I can't read it, so I am only guessing that this is the
>> reason.
>
> Boot with kernel debugger so you can see the panic.

And that would be done like so:

1) In the boot loader (GRUB) edit the boot options (press "e", select the kernel line, press "e" again), and add "-kd" to the kernel bootup. Maybe also "-v" to add verbosity.
2) Press Enter to save the change and "b" to boot.
3) The kmdb prompt should pop up; enter ":c" to continue execution.

The bootup should start, throw the kernel panic and pause. It is likely that there would be so much info that it doesn't fit on screen - I can only suggest a serial console in this case. However, the end of the dump info should point you in the right direction.

For example, an error in mount_vfs_root is popular, and usually means either corrupt media or simply an unexpected device name for the root pool (i.e. a disk plugged into a different port, or BIOS changes between SATA-IDE modes, etc.)

The device name changes should go away if you can boot from anything that can import your rpool (livecd, installer cd, failsafe boot image) and just "zpool import -f rpool; zpool export rpool" - this should clear the dependency on exact device names, and the next bootup should work.

And yes, I think it is a bug for such a fixable problem to behave so inconveniently - the official docs go as far as to suggest an OS reinstallation in this case.

//Jim
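Spelled out, the livecd fix-up described above is just the two commands below; the altroot is added as a hedge so the rescue environment's own namespace stays untouched:

```shell
# Import under an alternate root to rewrite the device paths recorded
# in the pool labels, then export cleanly and reboot from the rpool disk.
zpool import -f -R /a rpool
zpool export rpool
```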
Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL
Hello Cindy,

Are there any plans to preserve the official mailing lists' archives, or will they go the way of the Jive forums, with future digs for bits of knowledge relying on alternate mirrors and caches?

I understand that Oracle has some business priorities, but retiring hardware causes a site shutdown? They've gotta be kidding, with all the buzz about clouds and virtualization ;)

I'd guess you also are not authorized to say whether Oracle might permit re-use (re-hosting) of current OpenSolaris.Org materials, or even give away the site and domain for community steering and rid itself of more black PR for shooting down another public project of the Sun legacy (hint: if the site does wither and die in the community's hands - it is not Oracle's fault; and if it lives on - Oracle did something good for karma... win-win, at no price).

Thanks for your helpfulness in the past years,
//Jim
Re: [zfs-discuss] Slow zfs writes
On 2013-02-12 10:32, Ian Collins wrote:
> Ram Chander wrote:
>> Hi Roy,
>> You are right. So it looks like re-distribution issue. Initially
>> there were two vdevs with 24 disks (disk 0-23) for close to a year.
>> After which we added 24 more disks and created additional vdevs. The
>> initial vdevs are filled up and so write speed declined. Now how to
>> find files that are present in a vdev or a disk? That way I can
>> remove and re-copy back to distribute data. Any other way to solve
>> this?
>
> The only way is to avoid the problem in the first place by not mixing
> vdev sizes in a pool.

Well, that imbalance is there - in the "zpool status" printout we see raidz1 top-level vdevs of size 5, 5, 12, 7, 7, 7 disks and some 5 spares - which seems to sum up to 48 ;)

Depending on disk size, it might be possible that tlvdev sizes in gigabytes were kept the same (i.e. a raidz set with twice as many disks of half the size), but we have no info on this detail and it is unlikely. The disk sets being in one pool, this would still quite imbalance the load among spindles and IO buses.

Beside all that - with the older tlvdevs being more full than the newer ones, there is an imbalance which wouldn't be avoided just by not mixing vdev sizes: writes into newer ones are more likely to quickly find available holes, while writes into older ones are more fragmented, and a longer data inspection is needed to find a hole - if not even gang-block fragmentation. These two are, I believe, the basis for the performance drop on full pools, with the measure being rather the mix of IO patterns and fragmentation of data and holes.

I think there were developments in illumos ZFS to direct more writes onto devices with more available space; I am not sure if the average write latency to a tlvdev was monitored and taken into account during write-targeting decisions (which would also cover the case of failing devices which take longer to respond).
I am not sure which portions have been completed and integrated into common illumos-gate.

As was suggested, you can use "zpool iostat -v 5" to monitor IOs to the pool with a fanout per tlvdev and per disk, and witness possible patterns there. Do keep in mind, however, that for a non-failed raidz set you should see reads from only the data disks for a particular stripe, while parity disks are not used unless a checksum mismatch occurs. On average the data should be laid out on all disks in such a manner that there is no dedicated parity disk, but with small IOs you are likely to notice this.

If the budget permits, I'd suggest building (or leasing) another system with balanced disk sets and replicating all data onto it, then repurposing the older system - for example, to be a backup of the newer box (also after remaking the disk layout).

As for the question of which files are on the older disks - as a rule of thumb you can use the file creation/modification time in comparison with the date when you expanded the pool ;) Closer inspection could be done with a ZDB walk to print out the DVA block addresses for the blocks of a file (the DVA includes the number of the top-level vdev), but that would take some time - to determine which files you want to inspect (likely some band of sizes) and then to do these zdb walks.

Good luck,
//Jim
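The per-file zdb walk mentioned above can be sketched like this - a hedged example, not a polished procedure (pool, dataset and file names are hypothetical; on ZFS the object number of a plain file equals its inode number as reported by ls):

```shell
# Find the object number of the file, then dump its block pointers.
# The -ddddd listing shows each block's DVAs in vdev:offset:size form,
# so the first DVA field tells you which top-level vdev holds the block.
ls -i /tank/data/somefile        # e.g. prints: 12345 /tank/data/somefile
zdb -ddddd tank/data 12345
```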
Re: [zfs-discuss] ZFS monitoring
On 2013-02-11 17:14, Borja Marcos wrote:
> On Feb 11, 2013, at 4:56 PM, Tim Cook wrote:
>> The zpool iostat output has all sorts of statistics I think would be
>> useful/interesting to record over time.
>
> Yes, thanks :) I think I will add them, I just started with the
> esoteric ones. Anyway, still there's no better way to read it than
> running zpool iostat and parsing the output, right?

I believe in this case you'd have to run it as a continuous process and parse the outputs after the first one (which is the overall since-boot stat, IIRC). Also note that on problems with the ZFS engine itself, zpool may lock up and thus halt your program - so have it ready to abort an outstanding statistics read after a timeout and perhaps log an error. And if pools are imported or exported during this work, the "zpool iostat" output changes dynamically, so you basically need to parse its text structure every time.

The "zpool iostat -v" variant might be even more interesting, though, as it lets you see per-vdev statistics and perhaps notice imbalances, etc...

All that said, I don't know if this data isn't also available as some set of kstats - that would probably be a lot better for your cause. Inspect the zpool source to see where it gets its numbers from... and perhaps make and RTI the relevant kstats, if they aren't yet there ;) On the other hand, I am not certain how Solaris-based kstats interact or correspond to structures in FreeBSD (or Linux for that matter)?..

HTH,
//Jim Klimov
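To illustrate the "discard the first sample" caveat, here is a hedged sketch; the canned transcript stands in for a live "zpool iostat tank 5" pipe, and the pool name "tank" and the numbers are made up:

```shell
# Canned zpool iostat output: the first per-pool line reports averages
# since boot and should be skipped by a monitoring parser.
sample='tank  1.20T  1.52T      5     12   600K  1.4M
tank  1.20T  1.52T     80     40   9.8M  4.0M
tank  1.20T  1.52T     75     38   9.1M  3.9M'

# Keep only interval samples (read/write ops are fields 4 and 5).
ops=$(printf '%s\n' "$sample" | awk '$1 == "tank" { n++; if (n > 1) print $4, $5 }')
printf '%s\n' "$ops"
```

This prints only the two interval samples ("80 40" and "75 38"), dropping the since-boot line.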
Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
On 2013-02-10 10:57, Datnus wrote:
> I run dd if=/dev/zero of=testfile bs=1024k count=5 inside the iscsi
> vmfs from ESXi and rm testfile. However, the zpool list doesn't
> decrease at all. In fact, the used storage increases when I do dd.
> FreeNas 8.0.4 and ESXi 5.0. Help. Thanks.

Did you also enable compression (any non-off kind) for the ZVOL which houses your iSCSI volume?

The zero-writing procedure does allocate (logically) the blocks requested in the sparse volume. If this volume is stored on ZFS with compression (active at the moment when you write these blocks), then ZFS detects an all-zeroes block and uses no space to store it, only adding a block pointer entry to reference its emptiness. This way you get some growth in metadata, but none in userdata for the volume.

If by doing this trick you overwrite the non-empty but logically deleted blocks of the VM's filesystem housed inside the iSCSI ZVOL, then the backend storage should shrink by releasing those non-empty blocks. Ultimately, if you use snapshots - those released blocks would be reassigned into the snapshots of the ZVOL, and so in order to get usable free space on your pool you'd have to destroy all those older snapshots (between the creation and deletion times of those no-longer-useful blocks).

If you have reservations about compression for VMs (performance-wise or otherwise), take a look at the "zle" compression mode, which only reduces consecutive runs of zeroes. Also, I'd reiterate - the compression mode takes effect for blocks written after the mode was set. For example, if you prefer to store your datasets generally uncompressed for any reason, then you can enable a compression mode, zero-fill the VM disk's free space as you did, and re-disable compression on the volume for any further writes.
Also note that if you zfs send or otherwise copy the data off the dataset into another (backup) one, only the one compression method last defined on the target dataset would be applied to the new writes into it - regardless of the absence or presence (and type) of compression on the original dataset.

HTH,
//Jim Klimov
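The reclaim procedure described above, condensed into one hedged sketch (the pool, zvol and datastore names are hypothetical):

```shell
# 1) Turn on cheap run-length compression so all-zero blocks take no space:
zfs set compression=zle tank/esxi-vol

# 2) Inside the ESXi datastore backed by this zvol, zero out the free space:
#      dd if=/dev/zero of=/vmfs/volumes/datastore1/zerofile bs=1024k
#      rm /vmfs/volumes/datastore1/zerofile
#    (and destroy older zvol snapshots if any still reference freed blocks)

# 3) Optionally return to uncompressed storage for further writes:
zfs set compression=off tank/esxi-vol
```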
Re: [zfs-discuss] how to know available disk space
On 2013-02-08 22:47, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> Maybe this isn't exactly what you need, but maybe:
> for fs in `zfs list -H -o name` ; do echo $fs ; zfs get reservation,refreservation,usedbyrefreservation $fs ; done

What is the sacramental purpose of such a construct in comparison to:

# zfs list -H -t filesystem {-r pool/interesting/dataset} \
    -o reservation,refreservation,usedbyrefreservation,name

Just asking - or suggesting a simpler way to do stuff ;)

//Jim
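Either way, the interesting part is filtering the machine-parsable (-H) output; a hedged sketch with a canned sample standing in for a live "zfs list -H" pipe (the dataset names are hypothetical):

```shell
# Columns as in the one-liner above: reservation, refreservation,
# usedbyrefreservation, name. Real -H output is tab-separated; the
# canned sample uses spaces, which awk's default splitting handles too.
sample='none none 0 pool/a
10G 10G 2.5G pool/swapvol
none 1G 1G pool/dump'

# List only the datasets that actually carry a refreservation:
hits=$(printf '%s\n' "$sample" | awk '$2 != "none" { print $4 }')
printf '%s\n' "$hits"
```

Here only pool/swapvol and pool/dump come out.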
Re: [zfs-discuss] Scrub performance
On 2013-02-04 15:52, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> I noticed that sometimes I had terrible rates with 10MB/sec. Then
>> later it rose up to 70MB/sec.
>
> Are you talking about scrub rates for the complete scrub? Because if
> you sit there and watch it, from minute to minute, it's normal for it
> to bounce really low for a long time, and then really high for a long
> time, etc. The only measurement that has any real meaning is time to
> completion.

To paraphrase, random IOs on HDDs are slow - these are multiple reads of small blocks dispersed over the disk, be it small files or copies of metadata or seeks into the DDT. Fast reads are large sequentially stored files, i.e. when a scrub hits an ISO image or a movie on your disk, or a series of smaller files from the same directory that happened to be created and saved in the same TXG or so, so that their userdata was queued to disk as a large sequential blob in a coalesced write operation.

HTH,
//Jim
Re: [zfs-discuss] Scrub performance
On 2013-02-04 17:10, Karl Wagner wrote:
> OK then, I guess my next question would be what's the best way to
> undedupe the data I have? Would it work for me to zfs send/receive on
> the same pool (with dedup off), deleting the old datasets once they
> have been 'copied'? I think I remember reading somewhere that the DDT
> never shrinks, so this would not work, but it would be the simplest
> way. Otherwise, I would be left with creating another pool or
> destroying and restoring from a backup, neither of which is ideal.

If you have enough space, then copying with dedup=off should work (zfs send, rsync, whatever works best for you). I think the DDT should shrink, deleting entries as soon as their reference count goes to 0; however, this by itself can take quite a while and cause lots of random IO - in my case this might have been the reason for system hangs and/or panics due to memory starvation. However, after a series of reboots (and a couple of weeks of disk-thrashing) I was able to get rid of some more offending datasets in my tests a couple of years ago.

As for smarter undedup - I've asked recently, proposing a method to do it in a stone-age way; but overall there is no ready solution so far.

//Jim
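The send/receive route Karl asks about could be sketched as below - hedged, with hypothetical pool and dataset names, and assuming the pool has room for both copies to exist at once:

```shell
# New writes (including the received copy) will no longer be deduped:
zfs set dedup=off tank

zfs snapshot -r tank/data@undedup
zfs send -R tank/data@undedup | zfs recv tank/data-copy

# Only after verifying the copy:
zfs destroy -r tank/data
zfs rename tank/data-copy tank/data
```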
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-24 11:06, Darren J Moffat wrote:
> On 01/24/13 00:04, Matthew Ahrens wrote:
>> On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
>> <darr...@opensolaris.org> wrote:
>>
>>     Preallocated ZVOLs - for swap/dump.
>>
>> Darren, good to hear about the cool stuff in S11.

Yes, thanks, Darren :)

>> Just to clarify, is this preallocated ZVOL different than the
>> preallocated dump which has been there for quite some time (and is in
>> illumos)? Can you use it for other zvols besides swap and dump?
>
> It is the same but we are using it for swap now too. It isn't
> available for general use.

>> Some background: the zfs dump device has always been preallocated
>> (thick provisioned), so that we can reliably dump. By definition,
>> something has gone horribly wrong when we are dumping, so this code
>> path needs to be as small as possible to have any hope of getting a
>> dump. So we preallocate the space for dump, and store a simple linked
>> list of disk segments where it will be stored. The dump device is not
>> COW, checksummed, deduped, compressed, etc. by ZFS.

Comparing these two statements, can I say (and be correct) that the preallocated swap devices would lack COW (as I proposed too) and thus likely snapshots, but would also lack the checksums? (We might live without compression, though that was once touted as a bonus for swap over zfs, and we can certainly do without dedup.)

Basically, they are seemingly little different from preallocated disk slices - and for those an admin might have better control over the dedicated disk locations (i.e. faster tracks in a small-seek stroke range), except that ZFS datasets are easier to resize... right or wrong?

//Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-23 09:41, casper@oracle.com wrote:
> Yes and no: the system reserves a lot of additional memory (Solaris
> doesn't over-commit swap) and swap is needed to support those
> reservations. Also, some pages are dirtied early on and never touched
> again; those pages should not be kept in memory.

I believe, by the symptoms, that this is what happens often in particular to Java processes (app-servers and such) - I do regularly see these have large VM sizes and much (3x) smaller RSS sizes. One explanation I've seen is that the JVM nominally depends on a number of shared libraries which are loaded to fulfill the runtime requirements, but aren't actively used and thus go out into swap quickly. I chose to trust that statement ;)

//Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-22 14:29, Darren J Moffat wrote:
> Preallocated ZVOLs - for swap/dump.

Sounds like something I proposed on these lists, too ;)

Does this preallocation only mean filling an otherwise ordinary ZVOL with zeroes (or some other pattern) - and if so, to what effect? Or is it also supported to disable COW for such datasets, so that the preallocated swap/dump zvols might remain contiguous on the faster tracks of the drive (i.e. like a dedicated partition, but with the benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-22 23:03, Sašo Kiselkov wrote:
> On 01/22/2013 10:45 PM, Jim Klimov wrote:
>> On 2013-01-22 14:29, Darren J Moffat wrote:
>>> Preallocated ZVOLs - for swap/dump.
>>
>> Or is it also supported to disable COW for such datasets, so that the
>> preallocated swap/dump zvols might remain contiguous on the faster
>> tracks of the drive (i.e. like a dedicated partition, but with
>> benefits of ZFS checksums and maybe compression)?
>
> I highly doubt it, as it breaks one of the fundamental design
> principles behind ZFS (always maintain transactional consistency).
> Also, contiguousness and compression are fundamentally at odds
> (contiguousness requires each block to remain the same length
> regardless of contents, compression varies block length depending on
> the entropy of the contents).

Well, dump and swap devices are kind of special in that they need verifiable storage (i.e. detectable to have no bit-errors) but not really consistency as in sudden-power-off transaction protection. Both have a lifetime span of a single system uptime - like L2ARC, for example - and will be reused anew afterwards - after a reboot, a power surge, or a kernel panic.

So while the metadata used to address the swap ZVOL contents may and should be subject to common ZFS transactions and COW and so on, and jump around the disk along with rewrites of blocks, the ZVOL userdata itself may as well occupy the same positions on the disk, I think, rewriting the older stuff. With mirroring likely in place as well as checksums, there are other ways than COW to ensure that the swap (at least some component thereof) contains what it should, even with intermittent errors of some component devices.

Likewise, the swap/dump breed of zvols shouldn't really have snapshots, especially not automatic ones (and the installer should take care of this at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how should it be accounted?
As dynamic expansion and/or shrinking of available swap space (or just of the space needed to store it)? If the latter, and we still intend to preallocate and guarantee that the swap has its administratively predefined amount of gigabytes, compressed blocks can be aligned on those starting locations as if they were not compressed. In effect this would just decrease the bandwidth requirements, maybe. For dump this might be just a bulky compressed write from start to however much it needs, within the preallocated psize limits...

//Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-22 23:32, Nico Williams wrote:
> IIRC dump is special.
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems. Any swap space you have is to help you detect those
> problems and correct them before apps start getting ENOMEM. There
> *are* exceptions to this, such as Varnish. For Varnish and any other
> apps like it I'd dedicate an entire flash drive to it, no ZFS, no
> nothing.

I know of this stance, and in general you're right. But... ;)

Sometimes there are once-in-a-longtime tasks that might require enormous virtual memory that you wouldn't normally provision proper hardware for (RAM, SSD), and/or cases when you have to run similarly greedy tasks on hardware with limited specs (i.e. a home PC capped at 8GB RAM). As an example I might think of a ZDB walk taking about 35-40GB of VM on my box. This is not something I do every month, but when I do - I need it to complete, regardless of the fact that I have 5 times less RAM on that box (and the kernel's equivalent of that walk fails with scanrate hell because it can't swap, btw).

On the other hand, there are tasks like VirtualBox which require swap to be configured in amounts equivalent to VM RAM size, but don't really swap (most of the time). Setting aside SSDs for this task might be too expensive, if they are never to be used in real practice. But this point is more of a task for swap device tiering (like with Linux swap priorities), as I proposed earlier last year...

//Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
The discussion gets suddenly hot and interesting - albeit quite diverged from the original topic ;)

First of all, as a disclaimer: when I earlier proposed such changes to datasets for swap (and maybe dump) use, I explicitly proposed that this be a new dataset type - compared to the zvol and fs and snapshot types that we have today. Granted, this distinction was lost in today's exchange of words, but it is still an important one - especially since it means that while basic ZFS (or rather ZPOOL) rules are maintained, the dataset rules might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.

>> Well, dump and swap devices are kind of special in that they need
>> verifiable storage (i.e. detectable to have no bit-errors) but not
>> really consistency as in sudden-power-off transaction protection.
>
> I get your point, but I would argue that if you are willing to
> preallocate storage for these, then putting dump/swap on an iSCSI LUN
> as opposed to having it locally is kind of pointless anyway. Since
> they are used rarely, having them thin provisioned is probably better
> in an iSCSI environment than wasting valuable network-storage
> resources on something you rarely need.

I am not sure what in my post led you to think that I meant iSCSI or otherwise networked storage to keep swap and dump. Some servers have local disks, you know - and in networked storage environments the local disks are only used to keep the OS image, swap and dump ;)

> Besides, if you plan to shred your dump contents after reboot anyway,
> why fat-provision them? I can understand swap, but dump?

Guarantee that the space is there... Given the recent mischiefs with dumping (i.e. the context is quite stripped compared to the general kernel work, so multithreading broke somehow) I guess that pre-provisioned sequential areas might also reduce some risks... though likely not - random metadata would still have to get into the pool.
> You don't understand, the transactional integrity in ZFS isn't just to
> protect the data you put in, it's also meant to protect ZFS' internal
> structure (i.e. the metadata). This includes the layout of your zvols
> (which are also just another dataset). I understand that you want to
> view this kind of fat-provisioned zvol as a simple contiguous
> container block, but it is probably more hassle to implement than it's
> worth.

I'd argue that transactional integrity in ZFS primarily protects metadata, so that there is a tree of always-actual block pointers. There is this octopus of a block-pointer tree whose leaf nodes point to data blocks - but only as DVAs and checksums, basically. Nothing really requires data to be or not be COWed and stored at a different location than the previous version of the block at the same logical offset for the data consumers (FS users, zvol users), except that we want that data to be readable even after a catastrophic pool close (system crash, poweroff, etc.).

We don't (AFAIK) have such a requirement for swap. If the pool which contained swap kicked the bucket, we probably have a larger problem whose solution will likely involve a reboot and thus recycling of all swap data. And for single-device errors with (contiguous) preallocated unrelocatable swap, we can protect with mirrors and checksums (used upon read, within this same uptime that wrote the bits).

>> Likewise, the swap/dump breed of zvols shouldn't really have
>> snapshots, especially not automatic ones (and the installer should
>> take care of this at least for the two zvols it creates) ;)
>
> If you are talking about the standard opensolaris-style
> boot-environments, then yes, this is taken into account. Your BE lives
> under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
> respectively (both thin-provisioned, since they are rarely needed).
I meant the attribute for the zfs-auto-snapshots service, i.e.:

  rpool/swap  com.sun:auto-snapshot  false  local

As I wrote, I'd argue that for the new swap (and maybe dump) dataset types the snapshot action should not even be implemented.

>>> Compression for swap is an interesting matter... for example, how
>>> should it be accounted? As dynamic expansion and/or shrinking of
>>> available swap space (or just of space needed to store it)?
>
> Since compression occurs way below the dataset layer, your zvol
> capacity doesn't change with compression, even though how much space
> it actually uses in the pool can. A zvol's capacity pertains to its
> logical attributes, i.e. most importantly the maximum byte offset
> within it accessible to an application (in this case, swap). How the
> underlying blocks are actually stored and how much space they take up
> is up to the lower layers.
> ... But you forget that a compressed block's physical size
> fundamentally depends on its contents. That's why compressed zvols
> still appear the same size as before. What changes is how much space
> they occupy on the underlying pool.

I won't argue with this, as it is perfectly correct for zvols and undefined for the
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-21 07:06, Stephan Budach wrote:
>> Are there switch stats on whether it has seen media errors?
>
> Has anybody gotton QLogic's SanSurfer to work with anything newer than
> Java 1.4.2? ;) I checked the logs on my switches and they don't seem
> to indicate such issues, but I am lacking the real-time monitoring
> that the old SanSurfer provides.

I don't know what that is except by your message's context, but can't you install JDK 1.4.2 on your system or in a VM, and set up a script or batch file to launch SanSurfer with the specific JAVA_HOME and PATH values? ;) Or is the problem in finding the old Java version?

//Jim
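Such a launcher script could look like the sketch below - all the paths here are hypothetical and would need to point at wherever the old JRE and the SanSurfer binary actually live:

```shell
#!/bin/sh
# Pin the environment to the legacy 1.4.2 JRE just for this one app.
JAVA_HOME=/opt/java1.4.2            # hypothetical install location
PATH="$JAVA_HOME/bin:$PATH"
export JAVA_HOME PATH
exec /opt/QLogic/SANsurfer          # hypothetical launcher path
```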
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-20 19:55, Tomas Forsman wrote:
> On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:
>> Hello all,
>> While revising my home NAS which had dedup enabled before I gathered
>> that its RAM capacity was too puny for the task, I found that there
>> is some deduplication among the data bits I uploaded there (makes
>> sense, since it holds backups of many of the computers I've worked
>> on - some of my homedirs' contents were bound to intersect). However,
>> a lot of the blocks are in fact unique - have entries in the DDT with
>> count=1 and the blkptr_t bit set. In fact they are not deduped, and
>> with my pouring of backups complete - they are unlikely to ever
>> become deduped.
>
> Another RFE would be 'zfs dedup mypool/somefs', which would basically
> go through and do a one-shot dedup. Would be useful in various
> scenarios. Possibly go through the entire pool at once, to make dedup
> work intra-dataset (like the real thing).

Yes, but that was asked before =) Actually, the pool's metadata does contain all the needed bits (i.e. checksum and size of blocks), such that a scrub-like procedure could try and find identical blocks among the unique ones (perhaps with a filter of the block being referenced from a dataset that currently wants dedup), throw one out and add a DDT entry for the other.

On 2013-01-20 17:16, Edward Harvey wrote:
> So ... The way things presently are, ideally you would know in advance
> what stuff you were planning to write that has duplicate copies. You
> could enable dedup, then write all the stuff that's highly duplicated,
> then turn off dedup and write all the non-duplicate stuff. Obviously,
> however, this is a fairly implausible actual scenario.
Well, I guess I could script a solution that uses ZDB to dump the blockpointer tree (about 100Gb of text on my system), and some perl or sort/uniq/grep parsing over this huge text to find blocks that are the same but not deduped - as well as those single-copy deduped ones - and toggle the dedup property while rewriting the block inside its parent file with DD. This would all be within current ZFS's capabilities and would ultimately reach the goals of deduping pre-existing data as well as dropping unique blocks from the DDT. It would certainly not be a real-time solution (it might well take months on my box - just fetching the BP tree took a couple of days) and would require more resources than otherwise needed (rewrites of the same userdata, storing and parsing of addresses as text instead of binaries, etc.) But I do see how this is doable even today, even by a non-expert ;) (Not sure I'd ever get around to actually doing it this way, though - it is not a very clean solution, nor a performant one.) As a bonus, however, this ZDB dump would also provide an answer to a frequently-asked question: which files on my system intersect or are the same - and have some/all blocks in common via dedup? Knowledge of this answer might help admins with some policy decisions, be it a witch-hunt for hoarders of same files or some pattern-making to determine which datasets should keep dedup=on... My few cents, //Jim
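For illustration, the text-crunching core of that idea can be sketched as a small shell filter. The helper name and exact invocation are my own; it assumes a dump produced with something like `zdb -ddddd mypool/somefs > bp-dump.txt`, whose L0 lines carry `cksum=` fields as in the zdb outputs quoted elsewhere in this thread:

```shell
# Hypothetical post-processing of a zdb blockpointer dump to spot
# candidate identical-but-not-deduped blocks, as discussed above.
find_dup_cksums() {
    # reads a zdb dump on stdin; keeps only L0 (userdata) block lines,
    # pulls out the cksum value, then reports checksums seen more than once
    grep 'L0 ' |
        sed -n 's/.*cksum=\([0-9a-f:]*\).*/\1/p' |
        sort | uniq -d
}
```

Checksums reported by this filter mark blocks whose contents are likely identical yet stored separately - candidates only, since a real tool would also have to match block sizes and checksum algorithms before rewriting anything.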
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-20 16:56, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Jim Klimov And regarding the considerable activity - AFAIK there is little way for ZFS to reliably read and test TXGs newer than X My understanding is like this: When you make a snapshot, you're just creating a named copy of the present latest TXG. When you zfs send incremental from one snapshot to another, you're creating the delta between two TXG's, that happen to have names. So when you break a mirror and resilver, it's exactly the same operation as an incremental zfs send, it needs to calculate the delta between the latest (older) TXG on the previously UNAVAIL device, up to the latest TXG on the current pool. Yes this involves examining the meta tree structure, and yes the system will be very busy while that takes place. But the work load is very small relative to whatever else you're likely to do with your pool during normal operation, because that's the nature of the meta tree structure ... very small relative to the rest of your data. Hmmm... Given that many people use automatic snapshots, those do provide us many roots for branches of block-pointer tree after a certain TXG (creation of snapshot and the next live variant of the dataset). This might allow resilvering to quickly select only those branches of the metadata tree that are known or assumed to have changed after a disk was temporarily lost - and not go over datasets (snapshots) that are known to have been committed and closed (became read-only) while that disk was online. I have no idea if this optimization does take place in ZFS code, but it seems bound to be there... if not - a worthy RFE, IMHO ;) //Jim
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
Did you try replacing the patch-cables and/or SFPs on the path between servers and disks, or at least cleaning them? A speck of dust (or, God forbid, a pixel of body fat from a fingerprint) caught between the two optic-cable end faces might cause any kind of signal weirdness from time to time... and lead to improper packets of that optic protocol. Are there switch stats on whether it has seen media errors? //Jim
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 2013-01-20 17:16, Edward Harvey wrote: But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream... I beg to disagree. While most of my contribution was so far about learning stuff and sharing with others, as well as planting some new ideas and (hopefully, seen as constructively) doubting others - including the implementation we have now - and I do have yet to see someone pick up my ideas and turn them into code (or prove why they are rubbish) -- overall I can't say that development stagnated by some metric of stagnation or activity. Yes, maybe there were more cool new things per year popping up with Sun's concentrated engineering talent and financing, but now it seems that most players - wherever they work now - took a pause from the marathon, to refine what was done in the decade before. And this is just as important as churning out innovations faster than people can comprehend or audit or use them. As a loud example of present active development - take the LZ4 quests completed by Saso recently. From what I gather, this is a single man's job done on-line in the view of fellow list members over a few months, almost like a reality-show; and I guess anyone with enough concentration, time and devotion could do likewise. I suspect many of my proposals to the list might also take some half of a man-year to complete. Unfortunately for the community and for part of myself, I now have some higher daily priorities so that I likely won't sit down and code lots of stuff in the nearest years (until that Priority goes to school, or so). Maybe that's why I'm eager to suggest quests for brilliant coders here who can complete the job better and faster than I ever would ;) So I'm doing the next best things I can do to help the progress :) And I don't believe this is in vain, that the development ceased and my writings are only destined to be stuffed under the carpet. 
Be it these RFEs or some others, better and more useful, I believe they shall be coded and published in common ZFS code. Sometime... //Jim
[zfs-discuss] RFE: Un-dedup for unique blocks
Hello all, While revising my home NAS which had dedup enabled before I gathered that its RAM capacity was too puny for the task, I found that there is some deduplication among the data bits I uploaded there (makes sense, since it holds backups of many of the computers I've worked on - some of my homedirs' contents were bound to intersect). However, a lot of the blocks are in fact unique - have entries in the DDT with count=1 and the blkptr_t bit set. In fact they are not deduped, and with my pouring of backups complete - they are unlikely to ever become deduped. Thus these many unique deduped blocks are just a burden when my system writes into the datasets with dedup enabled, when it walks the superfluously large DDT, when it has to store this DDT on disk and in ARC, maybe during the scrubbing... These entries bring lots of headache (or performance degradation) for zero gain. So I thought it would be a nice feature to let ZFS go over the DDT (I won't care if it requires to offline/export the pool) and evict the entries with count==1 as well as locate the block-pointer tree entries on disk and clear the dedup bits, making such blocks into regular unique ones. This would require rewriting metadata (less DDT, new blockpointer) but should not touch or reallocate the already-saved userdata (blocks' contents) on the disk. The new BP without the dedup bit set would have the same contents of other fields (though its parents would of course have to be changed more - new DVAs, new checksums...) In the end my pool would only track as deduped those blocks which do already have two or more references - which, given the static nature of such backup box, should be enough (i.e. new full backups of the same source data would remain deduped and use no extra space, while unique data won't waste the resources being accounted as deduped). What do you think? //Jim
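To gauge whether such an eviction would pay off, `zdb -DD mypool` prints the DDT histogram, whose refcnt==1 bucket counts exactly the unique entries in question. A rough, assumption-laden estimate of the RAM they pin (the ~320 bytes per in-core entry is a commonly quoted ballpark, not a spec value; the helper name is mine):

```shell
# Rough estimate of the RAM wasted by unique (refcnt==1) DDT entries;
# get the entry count from the first bucket of "zdb -DD mypool".
unique_ddt_mb() {
    entries=$1          # number of count==1 DDT entries
    bytes_per_entry=320 # assumed in-core size of one DDT entry
    echo $(( entries * bytes_per_entry / 1024 / 1024 ))
}
```

E.g. ten million unique entries would pin on the order of 3 GB - memory that, per the RFE above, buys nothing.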
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 18:17, Bob Friesenhahn wrote: Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping. Correction: that (verification) would be scrubbing ;) The way I get it, resilvering is related to scrubbing but limited in impact such that it rebuilds a particular top-level vdev (i.e. one of the component mirrors) with an assigned-bad and new device. So they both should walk the block-pointer tree from the uberblock (current BP tree root) until they ultimately read all the BP entries and validate the userdata with checksums. But while scrub walks and verifies the whole pool and fixes discrepancies (logging checksum errors), the resilver verifies a particular TLVdev (and maybe has a cut-off earliest TXG for disks which fell out of the pool and later returned into it - with a known latest TXG that is assumed valid on this disk) and the process expects there to be errors - it is intent on (partially) rewriting one of the devices in it. Hmmm... Maybe that's why there are no errors logged? I don't know :) As for practice, I also have one Thumper that logs errors on a couple of drives upon every scrub. I think it was related to connectors, at least replugging the disks helped a lot (counts went from tens per scrub to 0-3). One of the original 250Gb disks was replaced with a 3Tb one and a 250Gb partition became part of the old pool (the remainder became a new test pool over a single device). Scrubbing the pools yields errors in those new 250Gb, but never on the 2.75Tb single-disk pool... so go figure :) Overall, intermittent errors might be attributed to non-ECC RAM/CPUs (not our case), temperature affecting the mechanics and electronics (conditioned server room - not our case), electric power variations and noise (other systems in the room on the same and other UPSes don't complain like this), and cable/connector/HBA degradation (oxidation, wear, etc.
- likely all that remains as a cause in our case). This example regards internal disks of the Thumper, so at least we can rule out problems with further components in the path - external cables, disk trays, etc... HTH, //Jim
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 20:08, Bob Friesenhahn wrote: On Sat, 19 Jan 2013, Jim Klimov wrote: On 2013-01-19 18:17, Bob Friesenhahn wrote: Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping. Correction: that (verification) would be scrubbing ;) I don't think that zfs would call it scrubbing unless the user requested scrubbing. Unplugging a USB drive which is part of a mirror for a short while results in considerable activity when it is plugged back in. It is as if zfs does not trust the device which was temporarily unplugged and does a full validation of it. Now, THAT would be resilvering - and by default it should be a limited one, with a cutoff at the last TXG known to the disk that went MIA/AWOL. The disk's copies of the pool label (4 copies, in fact) record the last TXG it knew to be safely committed. So the resilver should only try to validate and copy over the blocks whose BP entries' birth TXG number is above that. And since these blocks' components (mirror copies or raidz parity/data parts) are expected to be missing on this device, mismatches are likely not reported - I am not sure there's any attempt to even detect them. //Jim
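As a sketch of where that cutoff TXG could be read from: the vdev labels can be dumped with `zdb -l /dev/rdsk/cXtYdZs0`, and a tiny filter (the helper name is mine; it assumes the usual nvlist-style dump containing a `txg:` line) pulls the value out:

```shell
# Extract the last TXG recorded in a vdev label; feed it the output of
# "zdb -l /dev/rdsk/cXtYdZs0" on stdin.  Assumes the label dump contains
# an nvlist line of the form "    txg: 1349746".
label_txg() {
    awk '$1 == "txg:" { print $2; exit }'
}
```

Comparing this value across the mirror halves would show how far behind the returned disk is, i.e. the span a cutoff resilver has to cover.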
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 20:23, Jim Klimov wrote: On 2013-01-19 20:08, Bob Friesenhahn wrote: On Sat, 19 Jan 2013, Jim Klimov wrote: On 2013-01-19 18:17, Bob Friesenhahn wrote: Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping. Correction: that (verification) would be scrubbing ;) I don't think that zfs would call it scrubbing unless the user requested scrubbing. Unplugging a USB drive which is part of a mirror for a short while results in considerable activity when it is plugged back in. It is as if zfs does not trust the device which was temporarily unplugged and does a full validation of it. Now, THAT would be resilvering - and by default it should be a limited one, with a cutoff at the last TXG known to the disk that went MIA/AWOL. The disk's copy of the pool label (4 copies in fact) record the last TXG it knew safely. So the resilver should only try to validate and copy over the blocks whose BP entries' birth TXG number is above that. And since these blocks' components (mirror copies or raidz parity/data parts) are expected to be missing on this device, mismatches are likely not reported - I am not sure there's any attempt to even detect them. And regarding the considerable activity - AFAIK there is little way for ZFS to reliably read and test TXGs newer than X other than to walk the whole current tree of block pointers and go deeper into those that match the filter (TLVDEV number in DVA, and optionally TXG numbers in birth/physical fields). So likely the resilver does much of the same activity that a full scrub would - at least in terms of reading all of the pool's metadata (though maybe not all copies thereof). My 2c and my speculation, //Jim
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-19 23:39, Richard Elling wrote: This is not quite true for raidz. If there is a 4k write to a raidz comprised of 4k sector disks, then there will be one data and one parity block. There will not be 4 data + 1 parity with 75% space wastage. Rather, the space allocation more closely resembles a variant of mirroring, like some vendors call RAID-1E I agree with this exact reply, but as I posted sometime late last year, reporting on my digging in the bowels of ZFS and my problematic pool, for a 6-disk raidz2 set I only saw allocations (including two parity disks) divisible by 3 sectors, even if the amount of the (compressed) userdata was not so rounded. I.e. I had either miniature files or tails of files fitting into one sector plus two parities (overall a 3 sector allocation), or tails ranging 2-4 sectors and occupying 6 with parity (while 2 or 3 sectors could use just 4 or 5 w/parities, respectively). I am not sure what these numbers mean - 3 being a case for one userdata sector plus both parities or for half of 6-disk stripe - both such explanations fit in my case. But yes, with current raidz allocation there are many ways to waste space. And those small percentages (or not so small) do add up. Rectifying this example, i.e. allocating only as much as is used, does not seem like an incompatible on-disk format change, and should be doable within the write-queue logic. Maybe it would cause tradeoffs in efficiency; however, ZFS does explicitly rotate starting disks of allocations every few megabytes in order to even out the loads among spindles (normally parity disks don't have to be accessed - unless mismatches occur on data disks). Disabling such padding would only help achieve this goal and save space at the same time... My 2c, //Jim
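My reading of those on-disk numbers can be written down as a small model (an interpretation of the observed allocations, not a quote of the allocator source): parity sectors are added per stripe row, and the total is padded up to a multiple of nparity+1 sectors:

```shell
# Model of the allocation rounding described above, in 4K sectors.
# raidz_asize DATA NDISKS NPARITY -> sectors allocated (data+parity+pad)
raidz_asize() {
    data=$1 ndisks=$2 nparity=$3
    dcols=$(( ndisks - nparity ))
    rows=$(( (data + dcols - 1) / dcols ))     # stripe rows needed
    total=$(( data + rows * nparity ))         # data plus parity sectors
    pad=$(( nparity + 1 ))
    echo $(( (total + pad - 1) / pad * pad ))  # pad to multiple of nparity+1
}
```

For the 6-disk raidz2 above this yields 3 sectors allocated for a 1-sector tail and 6 for 2-4 sector tails, matching the observations - i.e. a 2-sector tail really does occupy 6 sectors where 4 would suffice without the padding.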
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-18 06:35, Thomas Nau wrote: If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement. 4k might be a little small. 8k will have less metadata overhead. In some cases we've seen good performance on these workloads up through 32k. Real pain is felt at 128k :-) My only pain so far is the time a send/receive takes without really loading the network at all. VM performance is nothing I worry about at all as it's pretty good. So the key question for me is whether going from 8k to 16k or even 32k would have some benefit for that problem? I would guess that increasing the block size would on one hand improve your reads - due to more userdata being stored contiguously as part of one ZFS block - and thus sending of the backup streams should be more about reading and sending the data and less about random seeking. On the other hand, this may likely be paid for with the need to do more read-modify-writes (when larger ZFS blocks are partially updated with the smaller clusters in the VM's filesystem) while the overall system is running and used for its primary purpose. However, since the guest FS is likely to store files of non-minimal size, it is likely that the whole larger backend block would be updated anyway... So, I think, this is something an experiment can show you - whether the gain during backup (and primary-job) reads vs. possible degradation during the primary-job writes would be worth it. As for the experiment, I guess you can always make a ZVOL with a different volblocksize, DD data into it from the production dataset's snapshot, and attach the VM or its clone to the newly created clone of its disk image. Good luck, and I hope I got Richard's logic right in that answer ;) //Jim
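A back-of-envelope way to look at the write side of that tradeoff (the experiment commands are only sketched as comments with illustrative names; note that `volblocksize` can only be set at zvol creation time):

```shell
# Worst-case write amplification when a guest filesystem updates one
# small cluster inside a larger zvol block: the whole block is read,
# modified and rewritten (ZFS metadata updates ignored here).
# The experiment itself might look like (illustrative pool/zvol names):
#   zfs create -V 20g -o volblocksize=16k tank/testvol
#   dd if=/dev/zvol/rdsk/tank/vmdisk@snap of=/dev/zvol/rdsk/tank/testvol bs=1M
amplification() {
    volblock_kb=$1 guest_io_kb=$2
    echo $(( volblock_kb / guest_io_kb ))
}
```

E.g. a 4K guest write into a 32K zvol block implies an 8x rewrite in the worst case, versus 2x at 8K - which is the degradation to weigh against faster sequential reads during send/receive.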
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-17 16:04, Bob Friesenhahn wrote: If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement. Matching the volume block size to what the clients are actually using (due to their filesystem configuration) should improve performance during normal operations and should reduce the number of blocks which need to be sent in the backup by reducing write amplification due to overlap blocks. Also, it would make sense while you are at it to verify that the clients (i.e. the VMs' filesystems) do their IOs 4KB-aligned, i.e. that their partitions start at a 512b-based sector offset divisible by 8 inside the virtual HDDs, and that the FS headers also align so that the first cluster is 4KB-aligned. The classic MSDOS MBR did not guarantee such partition alignment, since it used 63 sectors as the cylinder size and offset factor. Newer OSes don't use the classic layout, as any config is allowable; and GPT is well aligned as well. Overall, a single IO in the VM guest changing a 4KB cluster in its FS should translate to one 4KB IO in your backend storage changing the dataset's userdata (without reading a bigger block and modifying it with COW), plus some avalanche of metadata updates (likely with the COW) for ZFS's own bookkeeping. //Jim
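The alignment rule above boils down to a divisibility check (a trivial sketch; partition start LBAs can be read from `prtvtoc` or `fdisk` output, and the helper name is mine):

```shell
# A partition is 4KB-aligned on a 512b-sectored (or 512e) disk when its
# starting LBA is divisible by 8.  The classic MSDOS offset of 63 fails;
# the modern default of 2048 passes.
aligned_4k() {
    start_lba=$1
    [ $(( start_lba % 8 )) -eq 0 ] && echo aligned || echo misaligned
}
```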
Re: [zfs-discuss] Heavy write IO for no apparent reason
On 2013-01-18 00:42, Bob Friesenhahn wrote: You can install Brendan Gregg's DTraceToolkit and use it to find out who and what is doing all the writing. 1.2GB in an hour is quite a lot of writing. If this is going on continuously, then it may be causing more fragmentation in conjunction with your snapshots. As a moderately wild guess, since you're speaking of galleries, are these problematic filesystems often-read? By default ZFS updates the last access-time of files it reads, as do many other filesystems, and this causes avalanches of metadata updates - sync writes (likely) as well as fragmentation. This may also be a poorly traceable but considerable consumer of space in frequent snapshots. You can verify (and unset) this behaviour with the ZFS dataset property atime, i.e.: # zfs get atime pond/export/home NAME PROPERTY VALUE SOURCE pond/export/home atime off inherited from pond On another note, verify where your software keeps its temporary files (i.e. during uploads, as may be the case with galleries). Again, if this is a frequently snapshotted dataset (though 1 hour is not really that frequent) then needless temp files can be held by those older snapshots. Moving such temporary works to a different dataset with a different snapshot schedule and/or to a different pool (to keep related fragmentation constrained) may prove useful. HTH, //Jim Klimov
Re: [zfs-discuss] help zfs pool with duplicated and missing entry of hdd
On 2013-01-10 08:51, Jason wrote: Hi, One of my server's zfs faulted and it shows following: NAME STATE READ WRITE CKSUM backup UNAVAIL 0 0 0 insufficient replicas raidz2-0 UNAVAIL 0 0 0 insufficient replicas c4t0d0 ONLINE 0 0 0 c4t0d1 ONLINE 0 0 0 c4t0d0 FAULTED 0 0 0 corrupted data c4t0d3 FAULTED 0 0 0 too many errors c4t0d4 FAULTED 0 0 0 too many errors ...(omit the rest). My question is why c4t0d0 appeared twice, and c4t0d2 is missing. Have check the controller card and hard disk, they are all working fine. This renaming does seem like an error in detecting (and further naming) of the disks - i.e. if a connector got loose and one of the disks is not seen by the system, the numbering can shift in such a manner. It is indeed strange, however, that only d2 got shifted or went missing, and not all those numbers after it. So, did you verify that the controller sees all the disks in the format command (and perhaps, after a cold reboot, in the BIOS)? Just in case, try to unplug and replug all cables (power, data) in case their pins got oxidized over time. HTH, //Jim
Re: [zfs-discuss] Pool performance when nearly full
ctime   Fri Jun 8 00:22:17 2012
crtime  Fri Jun 8 00:22:17 2012
gen     1349746
mode    100755
size    649720
parent  25
links   1
pflags  4080104
Indirect blocks:
  0 L1 DVA[0]=0:940298000:400 DVA[1]=0:263234a00:400 [L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double size=4000L/400P birth=1349746L/1349746P fill=5 cksum=682d4fda0b:3cc1aa306094:13ebb22837cf14:4c5c67e522dbca8
  0 L0 DVA[0]=0:95f337000:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=1349746L/1349746P fill=1 cksum=23fce6aa160b:5ab11e5fcbc6c2e:5b38f230e01d508d:12cf92941e4b2487
  2 L0 DVA[0]=0:95f357000:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=1349746L/1349746P fill=1 cksum=3f0ac207affd:f8ed413113d6bdd:24e36c7682cfc297:2549c866ab61e464
  4 L0 DVA[0]=0:95f377000:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=1349746L/1349746P fill=1 cksum=3d40bf3329f0:f459bc876303dd7:2230ee348b7b08c5:3a65d1ebbf52c9dc
  6 L0 DVA[0]=0:95f397000:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=1349746L/1349746P fill=1 cksum=19e01b53eb67:956b52d1df6ecd4:38ff9bd1302bf879:e4661798dd1ae8a0
  8 L0 DVA[0]=0:95f3b7000:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous unique single size=2L/2P birth=1349746L/1349746P fill=1 cksum=361e6fd03d40:d0903e491fa09e9:7a2e453ed28baa92:28562c53af3c0495
segment [, 000a) size 640K
After several higher layers of the pointers (just L1 in the example above), you have L0 entries which point to actual data blocks with their DVA fields. The example file above fits in five 128K blocks at level L0. The first component of the DVA address is the top-level vdev ID, followed by offset and allocation size (including raidzN redundancy).
Depending on your pool's history, larger files may have been striped over several TLVDEVs however, and relocating them (copying over and deleting the original) might help or not help free up a particular TLVDEV (upon rewrite they will be striped again, albeit maybe ZFS will make different decisions upon a new write - and prefer the more free devices). Also, if the file's blocks are referenced via snapshots, clones, dedup or hardlinks, they won't actually be released when you delete a particular copy of the file. HTH, //Jim Klimov
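To see how a given file's blocks are actually spread across top-level vdevs, the DVAs in a `zdb -ddddd` dump of the file's object can be tallied. The helper below is my own sketch (it assumes GNU grep for -o, as shipped in /usr/gnu/bin on OI), keyed to the DVA format quoted above, where the first colon-separated field is the top-level vdev ID:

```shell
# Histogram of top-level vdev IDs referenced by DVAs in a zdb dump.
# Reads zdb output on stdin; prints "count vdev-id" pairs.
tlvdev_histogram() {
    grep -o 'DVA\[[0-9]\]=[0-9a-f]*:' |
        sed 's/.*=\([0-9a-f]*\):/\1/' |
        sort | uniq -c
}
```

A file whose blocks all land on one vdev ID is a candidate for the copy-and-delete relocation discussed above; a file striped across several is not.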
Re: [zfs-discuss] The format command crashes on 3TB disk but zpool create ok
On 2012-12-14 17:03, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: Suspicion and conjecture only: I think format uses a fdisk label, which has a 2T limit. Technically, fdisk is a program, and the labels (partitioning tables) are MBR and EFI/GPT :) And fdisk, at least in OpenIndiana, can explicitly label a disk as EFI, similarly to what ZFS does when given a whole disk for a pool. You might also have luck with GNU parted, though I've had older builds (i.e. in SXCE) crash on 3Tb disks too, including one that's labeled as EFI and used in a pool on the same SXCE. There were no such problems with the newer build of parted in OI, so that disk was in fact labeled for SXCE while the box was booted with the OI LiveCD. HTH, //Jim Klimov
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-02 05:42, Jim Klimov wrote: My plan is to dig out the needed sectors of the broken block from each of the 6 disks and try any and all reasonable recombinations of redundancy and data sectors to try and match the checksum - this should be my definite answer on whether ZFS (of that oi151.1.3-based build) does all I think it can to save data or not. Either I put the last nail into my itching question's coffin, or I'd nail a bug to yell about ;) Well, I've come to a number of conclusions, though did not yet close this matter to myself. One regards the definition of all reasonable recombinations - ZFS does not do *everything* possible to recover corrupt data, and in fact it can't, nobody can. When I took this to an extreme, assuming that the bytes at different offsets within a sector might fail on different disks that comprise a block, attempt to reconstruct and test single failed sector per-byte becomes computationally infeasible - for 4 data disks and 1 parity I got about 4^4096 combinations to test. The next Big Bang will happen sooner than I'd get a yes or no, or so they say (yes, I did a rough estimate - about 10^100 seconds if I used all computing horsepower on Earth today). If there are R known-broken rows of data (be it bits, bytes, sectors, whole columns, or whatever quantum of data we take) on D data disks and P parity disks (all readable without HW IO errors), where known brokenness is both a parity mismatch in this row and checksum mismatch for the whole userdata block, we do not know in advance how many errors there are in the row (only hope that not more than there are parity columns) nor where exactly the problem is. Thanks to checksum mismatch we do know that at least one error is in the data disks' on-disk data. 
We might hope to find a correct original data which matches the checksum by determining for each data disk the possible alternate byte values (computed from bytes at same offsets on other disks of data and parity), and checksumming the recombined userdata blocks with some of the on-disk bytes replaced by these calculated values. For each row we test 1..P alternate column values, and we must apply the alteration to all of the rows where known errors exist, in order to detect some neighboring but not overlapping errors in different components of the block's allocation. (This was the breakage scenario that was deemed possible for raidzN with disk heads hovering over similar locations all the time). This can yield a very large field of combinations with small height of rows (i.e. matching 1 byte per disk), or too few combinations with row height chosen too big (i.e. whole portion of one disk's part of the userdata - quarter in case of my 4-data-disk set). For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. To catch the less probable several errors per row (up to the amount of parities we have), we need to retry even more combinations afterwards. My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks in about 2-3 seconds, so it is reasonable to keep reconstruction loops and thus the smallness of a step and thus the amount of steps within a given arbitrarily chosen timeout (30 sec? 1 sec?) With a fixed amount of parity and data disks in a particular TLVDEV, we can determine the reasonable row heights. Also, this low-level recovery at higher amount of cycles might be a job for a separate tool - i.e. on-line recovery during ZFS IO and scrubs might be limited by a few sectors, and whatever is not fixed by that can be manually fed to programmatic number-cruncher and possibly get recovered overnight... 
I now know that it is cheap and fast to determine parity mismatches for each single-byte column offset in a userdata block (leading to D*R userdata bytes whose contents we are not certain of), so even if the quantum of data for reconstructions is a sector, it is quite reasonable to start with byte-by-byte mismatch detection. Locations of detected errors can help us determine whether the errors are colocated in a single row of sectors (so likely one or more sectors at the same offset on different disks got broken), or in several sectors (we might be lucky and have single errors per disk in neighboring sector numbers). It is, after all, not reasonable to go below 512b or even the larger HW sector size as the quantum of data for recovery attempts. But testing *only* whole columns (*if* this is done today) also avoids some chances of automated recovery - though, certainly, the recovery attempts should start with some of the most probable combinations, such as all errors being confined to a single disk, and then going down in step size and testing possible errors on several component disks. We can afford several thousand checksum tests, which might give a chance to recover more data that might be recoverable
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-11 16:44, Jim Klimov wrote: For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. A small maths correction: the formula above reflects that we change some one item from the on-disk value to a reconstructed hypothesis on some one data disk (column) in all rows, or on P disks if we try to recover from more than one failed item in a row. Reality is worse :) Our original info (parity errors and checksum mismatch) warranted only that we have at least one error in userdata. It is possible that other (R-1) errors are on the parity disk, so the recombination should also check all variants with (0..R-1) unchanged rows with their on-disk contents intact. This gives us something like P*(D + D^2 + ... + D^R) variants to test, which is roughly a 25% increase in recombinations in the range of computationally feasible amounts of error-matching. Heck, just counting from 1 to 2^64 in an i++ loop takes a lot of CPU time. By my estimate, even that would take until the next Big Bang, at least on my one computer ;) Just for fun: a count to 2^32 took 42 seconds, so my computer can do 10^8 trivial loops per second - but that's just a data point. What really matters is that 4^64 == (2^32)^4, which is a lot. Roughly, 2^3 = 8 ~= 10, so the plain count from 1 to 4^64 would take about 42*10^30 seconds, or roughly 10^24 years. If the astronomers' estimates are correct, this amounts to 10^13 lifetimes of our universe, or so ;) //Jim
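The corrected count P*(D + D^2 + ... + D^R) is small enough to verify directly in shell arithmetic (helper name mine):

```shell
# Number of userdata recombinations to checksum before declaring a block
# unrecoverable: P parity disks, D data disks, R broken rows.
recombinations() {
    P=$1 D=$2 R=$3
    sum=0 term=1 r=1
    while [ $r -le $R ]; do
        term=$(( term * D ))   # D^r
        sum=$(( sum + term ))  # accumulate D + D^2 + ... + D^R
        r=$(( r + 1 ))
    done
    echo $(( P * sum ))
}
```

With one parity, four data disks and three broken sector rows this is just 84 checksum tests - trivially feasible - which is what makes the sector-granularity approach workable where the per-byte hypothesis space (the 4^4096 figure above) is not.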
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-10 07:35, Timothy Coalson wrote:

The corrupted area looks like a series of 0xFC 0x42 bytes about half a kilobyte long, followed by zero bytes to the end of sector. Start of this area is not aligned to a multiple of 512 bytes.

Just a guess, but that might be how the sectors were when the drive came from the manufacturer, rather than filled with zeros (a test pattern while checking for bad sectors). As for why some other sectors did show zeros in your other results, perhaps those sectors got reallocated from the reserved sectors after whatever caused your problems, which may not have been written to during the manufacturer test.

Thanks for the idea. I also figured it might be some test pattern or maybe some sort of secure wipe, and the HDD's relocation to spare sectors might be a reasonable scenario for such an error creeping into an LBA which previously had valid data - i.e. the disk tried to salvage as much of a newly corrupted sector as it could... I dismissed it because several HDDs had the error at the same offsets, and some of them had the same contents in the corrupted sectors; however identical the disks might be, this is just too much of a coincidence for disk-internal hardware relocation to be The reason.

A controller going haywire - that is possible, given that this box was off until recently, being repaired due to broken cooling, and the controller is the nearest centralized SPOF location common to all disks (with the overheated CPU, non-ECC RAM and the software further along the road). I am not sure which one of these *couldn't* issue (or be interpreted to issue) a number of weird identical writes to different disks at the same offsets. Everyone is a suspect :(

Thanks, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
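The "too much of a coincidence" argument can be checked mechanically. A sketch (sector contents fabricated to mimic the description above) that measures the run of a repeating two-byte filler and compares same-offset dumps across disks:

```python
def longest_pair_run(sector, pair):
    """Length in bytes of the longest run of 'pair' repeated back-to-back."""
    best = cur = i = 0
    while i + 1 < len(sector):
        if sector[i:i + 2] == pair:
            cur += 2
            i += 2
        else:
            best = max(best, cur)
            cur = 0
            i += 1
    return max(best, cur)

# Two fabricated same-offset 4KB sector dumps from different disks:
# ~0.5KB of 0xFC 0x42 filler, then zero bytes to the end of the sector.
sector_a = bytes.fromhex('fc42') * 256 + b'\x00' * 3584
sector_b = bytes(sector_a)
print(longest_pair_run(sector_a, b'\xfc\x42'))   # 512
print(sector_a == sector_b)   # identical corruption across disks: True
```

A manufacturer test pattern would plausibly produce the same filler on identical drives, but byte-identical corruption at identical LBAs on several disks points away from independent per-disk relocation events.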
Re: [zfs-discuss] Digging in the bowels of ZFS
more below... On 2012-12-06 03:06, Jim Klimov wrote:

It also happens that on disks 1,2,3 the first row's sectors (d0, d2, d3) are botched - ranges from 0x9C0 to 0xFFF (end of the 4KB sector) are zeroes. The neighboring blocks, located a few sectors away from this one, also hold compressed data and show some regular-looking patterns of bytes, certainly no long stretches of zeroes. However, the byte-by-byte XOR matching complains about the whole sector: all bytes, except some 40 single-byte locations here and there, don't XOR up to produce the expected (known from disk) value. I did not yet try the second parity algorithm. At least in this case, it does not seem that I would find an incantation needed to recover this block - too many zeroes overlapping (at least 3 disks' data proven compromised), where I had hoped for some shortcoming in ZFS recombination exhaustiveness. In this case there is indeed too much failure to handle. Now waiting for scrub to find me more test subjects - broken files ;)

So, these findings from my first tested bad file remain valid. Now that I have a couple more error locations found again by scrub (which for the past week has progressed to just above 50% of the pool), there are some more results. So far only one location has random-looking different data in the sectors of the block on different disks, which I might at least try to salvage as described in the beginning of this thread. In two of three cases, some of the sectors (in the range which mismatches the parity data) are clearly invalid - filled with long stretches of zeroes while the other sectors hold uniform-looking binary data (results of compression). Moreover, several of these sectors (4096 bytes long, at the same offsets on different drives which hold data components of the same block) are literally identical, which is apparently some error upon write (perhaps some noise was interpreted by several disks at once as a command for them to write at that location).
The corrupted area looks like a series of 0xFC 0x42 bytes about half a kilobyte long, followed by zero bytes to the end of the sector. The start of this area is not aligned to a multiple of 512 bytes. These disks being of an identical model and firmware, I am ready to believe that they might misinterpret the same interference in the same way. However, I was under the impression that SATA involves CRCs on commands and data in the protocol - to counter such noise?..

Question: does such a conclusion sound like a potentially possible explanation for my data corruptions (on disks which passed dozens of scrubs successfully before developing these problems nearly at once in about ten locations)?

Thanks for your attention, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zpool error in metadata:0x0
I've had this error on my pool since over a year ago, when I posted and asked about it. The general consensus was that this is only fixable by recreation of the pool, and that if things don't die right away, the problem may be benign (i.e. in some first blocks of the MOS that are in practice written once and not really used nor relied upon). In detailed zpool status this error shows as:

metadata:0x0

By analogy to other errors in unnamed files, this was deemed to be the MOS dataset, object number 0. Anyway, now that I am digging deeper into ZFS bowels (as detailed in my other current thread), I've made a tool which can request the sectors which pertain to a given DVA and verify the XOR parity. With ZDB I've extracted what I believe to be the block-pointer tree for this object - despite ZDB trying to dump the whole pool instead of just the requested dataset (I saw recently on-list that someone picked up this ZDB bug as well) - using a bit of perl magic:

# time zdb -d -bb -e 1601233584937321596 0 | \
  perl -e '$a=0; while (<>) { chomp;
    if ( /^Dataset mos/ ) { $a=1; }
    elsif ( /^Dataset / ) { $a=2; exit 0; }
    if ( $a == 1 ) { print "$_\n"; } }' > mos.txt

This gives me everything ZDB thinks is part of the MOS, up to the start of the next Dataset dump:

Dataset mos [META], ID 0, cr_txg 4, 50.5G, 76355 objects, rootbp DVA[0]=<0:590df6a4000:3000> DVA[1]=<0:8e4c636000:3000> DVA[2]=<0:8107426b000:3000> [L0 DMU objset] fletcher4 lzjb LE contiguous unique triple size=800L/200P birth=326429440L/326429440P fill=76355 cksum=1042f7ae8a:63ab010a1de:138cbe92583cd:29e4cd03f544fe

    Object  lvl  iblk  dblk  dsize  lsize  %full  type
         0    3   16K   16K  84.1M  80.2M  46.49  DMU dnode
    dnode flags: USED_BYTES
    dnode maxblkid: 5132
    Indirect blocks:
    0 L2 DVA[0]=<0:590df6a1000:3000> DVA[1]=<0:8e4c63:3000> DVA[2]=<0:81074268000:3000> [L2 DMU dnode] fletcher4 lzjb LE contiguous unique triple size=4000L/e00P birth=326429440L/326429440P fill=76355 cksum=128bfcb12fe:237fe2ec55891:29135030da5c326:36973942bee30ba3
    0 L1 DVA[0]=<0:590df69b000:6000> DVA[1]=<0:8fd76b8000:6000>
DVA[2]=<0:81074262000:6000> [L1 DMU dnode] fletcher4 lzjb LE contiguous unique triple size=4000L/1200P birth=326429440L/326429440P fill=1155 cksum=18d8d8f3e6c:3ab2b45afba95:57ad6e7efb1cb00:216c4680d8cb9644
    0 L0 DVA[0]=<0:590df695000:3000> DVA[1]=<0:8e4c61e000:3000> DVA[2]=<0:8107425c000:3000> [L0 DMU dnode] fletcher4 lzjb LE contiguous unique triple size=4000L/c00P birth=326429440L/326429440P fill=31 cksum=da94d97873:15b87afcb5388:15ac58fbe7745d6:2e083d8ef9f3c90
... (for a total of 3572 block pointers)

I fed this list into my new verification tool, testing all DVA ditto copies, and it found no blocks with bad sectors - all the XOR parities and the checksums matched their sector or two worth of data. So, given that there are no on-disk errors in the "Dataset mos [META], ID 0" Object #0 - what does the zpool scrub find time after time and call an error in metadata:0x0?

Thanks, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
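For reference, the DVA triples in such zdb dumps can be pulled out with a few lines of Python - a sketch assuming the `DVA[n]=<vdev:offset:asize>` print format of zdb (the regex also tolerates stripped angle brackets, as in this digest):

```python
import re

DVA_RE = re.compile(
    r'DVA\[\d+\]=<?([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)>?')

def parse_dvas(line):
    """Return (vdev, byte offset, asize) integer tuples for each DVA."""
    return [(int(v, 16), int(off, 16), int(asz, 16))
            for v, off, asz in DVA_RE.findall(line)]

# Sample line built from the dump above (truncated to two DVAs):
line = ('0 L2 DVA[0]=<0:590df6a1000:3000> '
        'DVA[2]=<0:81074268000:3000> [L2 DMU dnode] fletcher4 lzjb')
print(parse_dvas(line))
```

Each tuple can then be handed to a raw-sector reader like the dd-based extraction described elsewhere in this thread.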
Re: [zfs-discuss] Userdata physical allocation in rows vs. columns WAS Digging in the bowels of ZFS
For those who have work to do and can't be bothered to read the detailed context, please do scroll down to the marked APPLIED QUESTION about a possible project to implement a better on-disk layout of blocks. The busy experts' opinions are highly regarded here. Thanks ;) //Jim

CONTEXT AND SPECULATION

Well, now that I've mostly completed building my tool to locate, extract from disk and verify the sectors related to any particular block, I can state with certainty: data sector numbering is columnar, as was depicted in my recent mails (quote below), not in rows as I had believed earlier - and which would be more compact to store. Columns do make certain sense, but they also lead to more wasted space than is otherwise possible - and I'm not sure that allocation in rows would really be slower to write or read, especially since the HDD caching would coalesce requests to neighboring sectors - be they a contiguous quarter of my block's physical data, or a series of every fourth sector from it. It would likely be more complex to code and comprehend, and might even require more CPU cycles to account sizes properly (IF today we just quickly allocate columns of the same size - I skimmed over vdev_raidz.c, but did not look into this detail).

Saving 1-2 sectors from allocations which are some 10-30 sectors long altogether is IMHO a worthy percentage of savings to worry and bother about, especially with the compression-related paradigm that "our CPUs are slackers with nothing to do". ZFS overhead on 4K-sectored disks is pretty expensive already, so I see little need to feed it extra desserts too ;)

APPLIED QUESTION: If one were to implement a different sector allocator (rows with a more precise cutoff vs. columns as they are today) and expose it as a zfs property that can be set by users (or testing developers), would it make sense to call it a compression mode (in current terms) and use a bit from that field? Or would the GRID bits be more properly used for this?
I am not sure if feature flags are a proper mechanism for this, except to protect from import and interpretation of such fixed datasets and pools by incompatible (older) implementations - the allocation layout is likely going to be an attribute applied to each block at write-time and noted in blkptr_t like the checksums and compression, but it would only apply to raidzN. AFAIK, the contents of userdata sectors and their ordering don't even matter to the ZFS layers until decompression - parities and checksums just apply to the prepared bulk data...

//Jim Klimov

On 2012-12-06 02:08, Jim Klimov wrote: On 2012-12-05 05:52, Jim Klimov wrote: For undersized allocations, i.e. of compressed data, it is possible to see P-sizes not divisible by 4 (disks) in 4KB sectors; however, some sectors do apparently get wasted, because the A-size in the DVA is divisible by 6*4KB. With columnar allocation of disks, it is easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
 .  ,  1  5  9 13
 .  ,  2  6 10 14
 .  ,  3  7 11  x
 .  ,  4  8 12  x

In this illustration a 14-sector-long block is saved, with X being the empty leftovers, on which we can't really save (as would be the case with the other allocation, which is likely less efficient for CPU and IOs). Getting more and more puzzled with this... I have seen DVA values matching both theories now... Interestingly, all the allocations I looked over involved a number of sectors divisible by 3... rounding to half of my 6-disk RAID set - is it merely a coincidence, or some means of balancing IOs? ... I did not yet research where exactly the unused sectors are allocated - vertically on the last strip, like in my yesterday's depiction quoted above, or horizontally across several disks - but now that I know about this, it really bothers me as wasted space with no apparent gain. I mean, the raidz code does tricks to ensure that parities are located on different disks, and in normal conditions the userdata sector reads land on all disks in a uniform manner.
Why forfeit the natural rotation thanks to P-sizes smaller than the multiple of number of data-disks? ... In short: can someone explain the rationale - why are allocations such as they are now, and can it be discussed as a bug or should this be rationalized as a feature? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
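The columnar numbering confirmed above can be captured in a tiny mapping function - a sketch of my own (0-based data sector index k; the illustration above uses 1-based numbers):

```python
import math

def columnar_position(k, n, D=4, P=2):
    """Map data sector k of an n-sector block to (disk column, row).
    Columns 0..P-1 hold parity; data fills each column top to bottom."""
    rows = math.ceil(n / D)          # rows actually started by this block
    return (P + k // rows, k % rows)

# Reproduce the 14-sector illustration (D=4 data disks -> 4 rows):
# 1-based sector 1 sits in column d1, row 0; sector 14 in d4, row 1.
print(columnar_position(0, 14))    # (2, 0)
print(columnar_position(13, 14))   # (5, 1)
```

A hypothetical row-wise allocator would instead use `(P + k % D, k // D)`, filling rows left to right - same sector count per disk, different ordering, and a tighter cutoff on the final partial row.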
Re: [zfs-discuss] Remove disk
On 2012-12-06 09:35, Albert Shih wrote:

1) add a 5th top-level vdev (eg. another set of 12 disks) That's not a problem.

That IS a problem if you're going to ultimately remove an enclosure - once added, you won't be able to remove the extra top-level VDEV from your ZFS pool.

2) replace the disks with larger ones one-by-one, waiting for a resilver in between This is the point I don't see how to do. I have 48 disks currently, from /dev/da0 - /dev/da47 (I'm under FreeBSD 9.0), let's say 3TB each. I have 4 raidz2, the first from /dev/da0 - /dev/da11, etc. So I physically add a new enclosure with 12 new disks, for example 4TB disks. I'm going to have new /dev/da48 - /dev/da59. Say I want to remove /dev/da0 - /dev/da11. First I pull out /dev/da0.

I believe FreeBSD should perform similarly to the Solaris-based OSes here. Since your pools are not yet broken, and since you have the luxury of all disks being present during migration, it is safer not to pull out a disk physically and put a new one in its place (physically or via hotsparing), but rather to try software replacement with zpool replace. This way your pool does not lose redundancy for the duration of the replacement.

The first raidz2 is going to be in a «degraded state». So I tell the pool the new disk is /dev/da48, and repeat this process until /dev/da11 is replaced by /dev/da59.

Roughly so. Other list members might chime in - but MAYBE it is even possible or advisable to do the software replacement on all 12 disks in parallel (since the originals are all present)?

But at the end, how much space am I going to use on those /dev/da48 - /dev/da59? Am I going to have 3TB or 4TB? Because each time before completion ZFS is going to use only 3TB - how, at the end, is it going to magically use 4TB?
While the migration is underway and some but not all disks have completed it, you can only address the old size (3TB); when your active disks are all big, you'd suddenly see the pool expand to use the available space (if the autoexpand property is on), or after a series of zpool online -e componentname commands.

When I would like to change the disks, I also would like to change the disk enclosure; I don't want to use the old one. Second question: when I pull out the first enclosure, meaning the old /dev/da0 - /dev/da11, and reboot the server, the kernel is going to renumber those disks, meaning old /dev/da12 -> new /dev/da0, old /dev/da13 -> new /dev/da1, etc., old /dev/da59 -> new /dev/da47. How is zfs going to manage that?

Supposedly, it should manage that well :) Once your old enclosure's disks are no longer used and you can remove the enclosure, you should zpool export your pool before turning off the hardware. This removes it from the OS's zfs cachefile, and upon the next import the pool undergoes a full search for components. It is slower than the cachefile approach when you have many devices at static locations, because it ensures that all storage devices are consulted and a new map of the pool components' locations is drawn. Thus, if the device numbering changes somehow due to HW changes and OS reconfiguration, the full zpool import will take note of this and import the old data from the new addresses (device names).

HTH, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
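A sketch of the software-replacement procedure being described (pool and device names are hypothetical, matching the example in the question; commands are echoed as a dry run - remove the echo to act, and wait for each resilver to finish before proceeding):

```shell
POOL=tank
# replace each old 3TB disk with its new 4TB counterpart in software,
# keeping the originals attached so redundancy is never lost
for i in $(seq 0 11); do
  echo zpool replace "$POOL" da$i da$((i + 48))
  # before the next iteration: check 'zpool status' until resilver is done
done
# then let the vdevs grow onto the larger disks
echo zpool set autoexpand=on "$POOL"
```

With autoexpand off, `zpool online -e` on each replaced device achieves the same expansion after the fact.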
Re: [zfs-discuss] ZFS QoS and priorities
On 2012-12-05 04:11, Richard Elling wrote: On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru mailto:jimkli...@cos.ru wrote:

I've heard a claim that ZFS relies too much on RAM caching, but implements no sort of priorities (indeed, I've seen no knobs to tune those) - so that if the storage box receives many different types of IO requests with different administrative weights in the view of admins, it can not really throttle some IOs to boost others, when such IOs have to hit the pool's spindles.

Caching has nothing to do with QoS in this context. *All* modern filesystems cache to RAM, otherwise they are unusable.

Yes, I get that. However, many systems get away with less RAM than recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as the starting option), and make their compromises elsewhere. They have to anyway, and they get different results, perhaps even better suited to certain narrow or big niches. Whatever the aggregate result, this difference does lead to some differing features that the others' marketing trumpets praise as the advantage :) - like the ability to mark some IO traffic as of higher priority than other traffic, in one case (which is now also an Oracle product line, apparently)... Actually, this question stems from a discussion at a seminar I recently attended - which praised ZFS but pointed out its weaknesses against some other players on the market, so we are not unaware of those.

For example, I might want corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower-priority zones and VMs, and at the bottom of the list - backups.

Please read the papers on the ARC and how it deals with MFU and MRU cache types. You can adjust these policies using the primarycache and secondarycache properties at the dataset level.

I've read on that, and don't exactly see how much these help if there is pressure on RAM so that cache entries expire...
Meaning, if I want certain datasets to remain cached as long as possible (i.e. serve a website or DB from RAM, not HDD), at the expense of other datasets that might see higher usage but have lower business priority - how do I do that? Or, perhaps, add (L2)ARC shares, reservations and/or quota concepts to certain datasets which I explicitly want to throttle up or down? At most, now I can mark the lower-priority datasets' data or even metadata as not cached in ARC or L2ARC. On-off. There seem to be no smaller steps, like QoS tags [0-7] or something like that.

BTW, as a short side question: is it a true or false statement that if I set primarycache=metadata, then the ZFS ARC won't cache any userdata, and thus it won't appear in (expire into) L2ARC? So the real setting is that I can cache data+meta in RAM, and only meta on SSD? Not the other way around (meta in RAM but both data+meta on SSD)?

AFAIK, now such requests would hit the ARC, then the disks if needed - in no particular order. Well, can the order be made particular with the current ZFS architecture, i.e. by setting some datasets to have a certain NICEness or another priority mechanism?

ZFS has a priority-based I/O scheduler that works at the DMU level. However, there is no system call interface in UNIX that transfers priority or QoS information (eg read() or write()) into the file system VFS interface. So the granularity of priority control is by zone or dataset.

I do not think I've seen mention of priority controls per dataset, at least not in generic ZFS. Actually, that was part of my question above. And while throttling or resource shares between higher-level software components (zones, VMs) might have a similar effect, this is not something really controlled and enforced by the storage layer.

-- richard

Thanks, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS QoS and priorities
On 2012-11-29 10:56, Jim Klimov wrote: For example, I might want to have corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower priority zones and VMs, and at the bottom of the list - backups. On a side note, I'm now revisiting old ZFS presentations collected over the years, and one suggested as TBD statements the ideas that metaslabs with varying speeds could be used for specific tasks, and not only to receive the allocations first so that a new pool would perform quickly. I.e. TBD: Workload specific freespace selection policies. Say, I create a new storage box and lay out some bulk file, backup and database datasets. Even as they are receiving their first bytes, I have some idea about the kind of performance I'd expect from them - with QoS per dataset I might destine the databases to the fast LBAs (and smaller seeks between tracks I expect to use frequently), and the bulk data onto slower tracks right from the start, and the rest of unspecified data would grow around the middle of the allocation range. These types of data would then only creep onto the less fitting metaslabs (faster for bulk, slower for DB) if the target ones run out of free space. Then the next-best-fitting would be used... This one idea is somewhat reminiscent of hierarchical storage management, except that it is about static allocation at the write-time and takes place within the single disk (or set of similar disks), in order to warrant different performance for different tasks. ///Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
On 2012-11-17 22:54, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

An easier event to trigger is the starting of the virtualbox guest. Upon vbox guest starting, check the service properties for that instance of vboxsvc, and chmod if necessary.

But vboxsvc runs as a non-root user...

I like the idea of using zfs properties, if someday the functionality is going to be built into ZFS, and we can simply scrap the SMF chown service. But these days, ZFS isn't seeing a lot of public development. I just built this into simplesmf, http://code.google.com/p/simplesmf/ - support to execute the zvol chown immediately prior to launching the guest VM. I know Jim is also building it into vboxsvc, but I haven't tried that yet.

Lest this point be lost - during the discussion of the thread, Edward and myself ultimately embarked on voyages to the solutions we each saw best, hacked together during that day or so. Edward tailored his to VM startup events, while I made a more generic script which can save POSIX and ACL info from devfs into user attributes of ZVOLs, and extract and apply those values to ZVOLs on demand. This script can register itself as an SMF service, apply such values from zfs to devfs at service startup, and save them from devfs to zfs at service shutdown. I guess this can be integrated into my main vbox.sh script to initiate such activities during VM startup, but I haven't yet explored or completed this variant (all the needed pieces should be there already). Perhaps I need to make such an integration before the next official release of vboxsvc.

This is rather a proof-of-concept so far (i.e. the script should be sure to run after zpool imports/before zpool exports), but brave souls can feel free to try it out and comment. Presence of the service didn't cause any noticeable troubles on my test boxen over the past couple of weeks.
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/zfs-zvolrights HTH, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VXFS to ZFS
On 2012-12-05 23:11, Morris Hooten wrote: Is there a documented way or suggestion on how to migrate data from VXFS to ZFS?

Off the top of my head, I think this would go like any other migration - create the new pool on new disks and use rsync for simplicity (if your VxFS setup does not utilize extended attributes or anything similarly special), or use Solaris tar or cpio if such attributes are used (IIRC VxFS was a prime citizen in Solaris, so native tools - unlike GNU ones and rsync - should support the intimate details).

Also note that if you have VxFS, then you likely come from a clustered setup, which may be quite native and safe with VxFS. ZFS does not support simultaneous pool imports by several hosts, so you'd have to set up the clusterware to make sure only one host controls the pool at any time.

HTH, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
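A sketch of the copy step (paths and pool names are hypothetical, and commands are echoed as a dry run; the -@ extended-attribute flag for Solaris cpio is from memory - verify against your cpio(1) man page before relying on it):

```shell
SRC=/vxfs/data
DST=/newpool/data
# simple case: no extended attributes or ACLs worth preserving
echo rsync -aH "$SRC/" "$DST/"
# attribute-preserving case with Solaris native tools
echo "cd $SRC && find . -depth -print | cpio -pdm@ $DST"
```

Whichever tool is used, a second pass after quiescing the applications catches files changed during the initial bulk copy.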
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-05 05:52, Jim Klimov wrote: For undersized allocations, i.e. of compressed data, it is possible to see P-sizes not divisible by 4 (disks) in 4KB sectors; however, some sectors do apparently get wasted, because the A-size in the DVA is divisible by 6*4KB. With columnar allocation of disks, it is easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
 .  ,  1  5  9 13
 .  ,  2  6 10 14
 .  ,  3  7 11  x
 .  ,  4  8 12  x

In this illustration a 14-sector-long block is saved, with X being the empty leftovers, on which we can't really save (as would be the case with the other allocation, which is likely less efficient for CPU and IOs).

Getting more and more puzzled with this... I have seen DVA values matching both theories now... Interestingly, all the allocations I looked over involved a number of sectors divisible by 3... rounding to half of my 6-disk RAID set - is it merely a coincidence, or some means of balancing IOs?

Anyhow, with 4KB sectors involved, I saw many 128KB logical blocks compressed into just half a dozen sectors of userdata payload, so wasting one or two sectors here is quite a large percentage of my storage overhead. Exposition of the found evidence follows. Say, this one from my original post:

DVA[0]=<0:594928b8000:9000> ... size=20000L/4800P

It has 5 data sectors (@4KB) over 4 data disks in my raidz2 set, so it spills over to a second row and requires additional parity sectors - overall 5d+4p = 9 sectors, which we see in the DVA A-size. This is normal, as expected. These ones however differ:

DVA[0]=<0:acef500e000:c000> ... size=20000L/6a00P
DVA[0]=<0:acef501a000:c000> ... size=20000L/7200P
DVA[0]=<0:acef5026000:c000> ... size=20000L/5c00P

These neighbors, with 7, 8 and 6 sectors' worth of data, all occupy 12 sectors on disk along with their parities.

DVA[0]=<0:59492a92000:6000> ... size=20000L/2800P

With 3*4KB sectors worth of data and 2 parity sectors, this block is allocated over 6, not 5, sectors.

DVA[0]=<0:5996bf7c000:12000> ...
size=20000L/a800P

Likewise, with 11 sectors of data and likely 6 sectors of parity, this one is given 18, not 17, sectors of storage allocation.

DVA[0]=<0:5996be32000:1e000> ... size=20000L/12c00P

Here, 19 sectors of data and 10 of parity occupy 30 sectors on disk.

I did not yet research where exactly the unused sectors are allocated - vertically on the last strip, like in my yesterday's depiction quoted above, or horizontally across several disks - but now that I know about this, it really bothers me as wasted space with no apparent gain. I mean, the raidz code does tricks to ensure that parities are located on different disks, and in normal conditions the userdata sector reads land on all disks in a uniform manner. Why forfeit the natural rotation just because P-sizes are smaller than a multiple of the number of data disks? Writes are anyway streamed and coalesced, so by not allocating these unused sectors we'd only reduce the needed write IOPS by some portion - and save disk space...

In short: can someone explain the rationale - why are allocations such as they are now, and can this be discussed as a bug, or should it be rationalized as a feature?

Thanks, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
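All the A-sizes listed above fit one rule: parity sectors per started row of D=4 data disks, with the whole allocation rounded up to a multiple of (P+1) sectors. A sketch reproducing the numbers (the rounding rule is inferred from these observations, not quoted from the allocator source):

```python
import math

def asize_sectors(data_sectors, D=4, P=2):
    rows = math.ceil(data_sectors / D)          # parity per started row
    raw = data_sectors + rows * P
    return math.ceil(raw / (P + 1)) * (P + 1)   # round up to (P+1) multiple

# P-size -> data sectors (4KB) -> allocated sectors, checked against the
# observed A-sizes quoted in this message
for psize, observed in [(0x4800, 0x9000), (0x6a00, 0xc000),
                        (0x7200, 0xc000), (0x5c00, 0xc000),
                        (0x2800, 0x6000), (0xa800, 0x12000),
                        (0x12c00, 0x1e000)]:
    d = math.ceil(psize / 0x1000)
    assert asize_sectors(d) * 0x1000 == observed
    print(hex(psize), d, asize_sectors(d))
```

Under this rule the "divisible by 3" observation falls out of the round-up to (P+1) = 3 sectors, rather than from padding to full 6-disk rows.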
Re: [zfs-discuss] Digging in the bowels of ZFS
more below... On 2012-12-05 23:16, Timothy Coalson wrote: On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov jimkli...@cos.ru mailto:jimkli...@cos.ru wrote: On 2012-12-03 18:23, Jim Klimov wrote: On 2012-12-02 05:42, Jim Klimov wrote:

4) Where are the redundancy algorithms specified? Is there any simple tool that would recombine a given algo-N redundancy sector with some other 4 sectors from a 6-sector stripe in order to try and recalculate the sixth sector's contents? (Perhaps part of some unit tests?)

I'm a bit late to the party, but from a previous list thread about redundancy algorithms, I had found this: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

Particularly the functions vdev_raidz_reconstruct_p, vdev_raidz_reconstruct_q, vdev_raidz_reconstruct_pq (and possibly vdev_raidz_reconstruct_general) seem like what you are looking for. As I understand it, the case where you have both redundancy blocks but are missing two data blocks is the hardest (if you are missing only one data block, you can either do a straight XOR with the first redundancy section, or some LFSR shifting, XOR, and then reverse LFSR shifting to use the second redundancy section). Wikipedia describes the math to restore from two missing data sections here, under computing parity: http://en.wikipedia.org/wiki/Raid6#RAID_6 I don't know any tools to do this for you from arbitrary input, sorry.

Thanks, you are not late and welcome to the party ;) I'm hacking together a simple program to look over the data sectors and XOR parity and determine how many discrepancies there are, if any, and at what offsets into the sector - byte by byte. Running it on raw ZFS block component sectors, extracted with DD in the ways I wrote of earlier in the thread, I did confirm some good sectors and the one erroneous block that I have.
The latter turns out to have 4.5 sectors' worth of userdata, overall laid out like this:

dsk0 dsk1 dsk2 dsk3 dsk4 dsk5
  _    _    _    _    _   p1
 q1   d0   d2   d3   d4*  p2
 q2   d1    _    _    _    _

Here the compressed userdata is contained in the order of my d-sector numbering, d0-d1-d2-d3-d4, and d4 (*) is only partially occupied (the P-size of the block is 0x4c00), so its final quarter is all zeroes.

It also happens that on disks 1,2,3 the first row's sectors (d0, d2, d3) are botched - ranges from 0x9C0 to 0xFFF (end of the 4KB sector) are zeroes. The neighboring blocks, located a few sectors away from this one, also hold compressed data and show some regular-looking patterns of bytes, certainly no long stretches of zeroes. However, the byte-by-byte XOR matching complains about the whole sector: all bytes, except some 40 single-byte locations here and there, don't XOR up to produce the expected (known from disk) value. I did not yet try the second parity algorithm.

At least in this case, it does not seem that I would find an incantation needed to recover this block - too many zeroes overlapping (at least 3 disks' data proven compromised), where I had hoped for some shortcoming in ZFS recombination exhaustiveness. In this case there is indeed too much failure to handle. Now waiting for scrub to find me more test subjects - broken files ;)

Thanks, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
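For the simple failure mode, the first (XOR) parity alone suffices; a sketch of recovering one lost data sector from p and the survivors (toy data, not the block discussed above):

```python
def xor_sectors(*sectors):
    """Bytewise XOR of equally sized sectors."""
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

# The P parity of a row is the XOR of its data sectors, so a single lost
# sector is recovered as P xor (all surviving data sectors).
d = [bytes([x] * 4096) for x in (0x11, 0x22, 0x33, 0x44)]
p = xor_sectors(*d)
recovered = xor_sectors(p, d[0], d[1], d[3])   # rebuild the missing d[2]
print(recovered == d[2])                        # True
```

With two unknown sectors in a row, the Galois-field Q parity from vdev_raidz.c is needed as well; the XOR check above then only reports that the row is inconsistent.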
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-03 18:23, Jim Klimov wrote: On 2012-12-02 05:42, Jim Klimov wrote: So... here are some applied questions:

Well, I am ready to answer a few of my own questions now :) Continuing the desecration of my deceased files' resting grounds...

2) Do I understand correctly that for the offset definition, sectors in a top-level VDEV (which is all of my pool) are numbered in rows per component disk? Like this:

0  1  2  3  4  5
6  7  8  9 10 11...

That is, offset % setsize = disknum? If true, does such a numbering scheme apply all over the TLVDEV, so that for my block on a 6-disk raidz2 set its sectors start at (roughly rounded) offset_from_DVA / 6 on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of the first disk holding anything from this block would contain the raid-algo1 permutations of the four data sectors, the sectors of the second disk contain the raid-algo2 for those 4 sectors, and the remaining 4 disks contain the data sectors?

My understanding was correct. For posterity: in the earlier set-up example I had an uncompressed 128KB block residing at the address DVA[0]=<0:590002c1000:30000>. Counting in my disks' 4KB sectors, this is 0x590002c1000/0x1000 = 0x590002C1, or 1493172929 - the logical sector offset into TLVDEV number 0 (the only one in this pool). Given that this TLVDEV is a 6-disk raidz2 set, my expected offset on each component drive is 1493172929/6 = 248862154.83 (.83 = 5/6), starting from after the ZFS header (2 labels and a reservation, amounting to 4MB = 1024 * 4KB sectors). So this block's allocation covers 8 4KB-sectors starting at 248862154+1024 on disk 5, and at 248862155+1024 on disks 0,1,2,3,4.
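The arithmetic above, as a reusable sketch (constants match this pool: 4KB sectors, 6-disk top-level vdev, 4MB front reservation):

```python
SECT = 0x1000              # 4KB disk sectors
NDISKS = 6                 # width of the raidz2 top-level vdev
FRONT = 0x400000 // SECT   # 2 labels + reservation = 4MB = 1024 sectors

def dva_to_disk(offset_bytes):
    """Map a DVA byte offset to (first component disk, its sector number)."""
    sector = offset_bytes // SECT    # logical sector within the TLVDEV
    return sector % NDISKS, sector // NDISKS + FRONT

print(dva_to_disk(0x590002c1000))   # (5, 248863178)
```

So the allocation begins on disk 5 at sector 248863178, and the remaining disks start one sector later - exactly the skip counts used in the dd commands below.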
As my further tests showed, the sector columns (not rows, as I had expected after doc-reading) from disks 1,2,3,4 do recombine into the original userdata (the sha256 checksum matches), so disks 5 and 0 should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
    if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done
# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
    dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
  done; done > /tmp/d

Note that the latter can be greatly simplified as cat, which works to the same effect and is faster:

# cat /pool/test3/b1d?.img > /tmp/d

However, I kept the verbose notation to use in experiments later on.

That is, the original 128KB block was cut into 4 pieces (my 4 data drives in the 6-disk raidz2 set), and each 32KB strip was stored on a separate drive. Nice descriptive pictures in some presentations had suggested to me that the original block is stored sector by sector, rotating onto the next disk - the set of 4 data sectors plus 2 parity sectors (in my case) being a single stripe for the RAID purposes. This directly suggested that incomplete stripes, such as the ends of files or whole small files, would still have the two parity sectors and a handful of data sectors. Reality differs. For undersized allocations, i.e. of compressed data, it is possible to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however, some sectors apparently do get wasted, because the A-size in the DVA is divisible by 6*4KB. With columnar allocation on the disks, it is easier to see why full stripes have to be used:

  p1 p2 d1 d2 d3 d4
   .  ,  1  5  9 13
   .  ,  2  6 10 14
   .  ,  3  7 11  x
   .  ,  4  8 12  x

In this illustration a 14-sector-long block is saved, with "x" being the empty leftovers, on which we can't really save (as would be the case with the other, rotational, allocation - which is likely less efficient for CPU and IOs anyway).
The metadata blocks do have A-sizes of 0x3000 (2 parity + 1 data), at least, which on 4KB-sectored disks is still quite a lot for these miniature data objects - but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of 2^M+N disks (i.e. 6-disk raidz2, 11-disk raidz3, etc.), which was discussed over and over on this list a couple of years ago, may still be valid with typical block sizes being powers of two... Even though gurus said that this should not matter much. For IOPS - maybe not. For wasted space - likely...

I'm almost ready to go and test Q2 and Q3; however, the questions which regard useable tools (and what data should be fed into such tools?) are still on the table. Some OLD questions remain raised, just in case anyone answers them:

3b) The redundancy algos should in fact cover the other redundancy disks too (in order to sustain the loss of any 2 disks), correct? (...)

4) Where are the redundancy algorithms specified? Is there any simple tool that would recombine a given algo-N redundancy sector with some other 4 sectors from a 6-sector stripe in order to try and recalculate the sixth sector's contents? (Perhaps part of some unit tests?)

7) Is there a command-line tool to do lzjb compressions and decompressions (in the same blocky manner as would be applicable to ZFS compression)?
Re: [zfs-discuss] zpool rpool rename offline
On 2012-12-03 01:15, Phillip Wagstrom wrote:
> You can't change the name of a zpool without importing it. For what
> you're attempting to do, why not attach a larger vdisk and mirror the
> existing disk in rpool? Then drop the smaller vdisk and you'll have a
> larger rpool.

In general, I'd do the renaming with a different bootable medium - a LiveCD/LiveUSB, another distro that can import and rename this pool version, etc. - as long as booting does not involve use of the old rpool.

Phillip however has a good point about mirroring onto a larger disk. This should also carry over your old pool's attributes (bootfs, name, etc.) - however, you will likely have to run installgrub on the new disk image. When you detach the old mirror half, you automatically get a larger pool on the remaining disk image.

dom0 # xm list $vm -l | egrep 'vbd|:disk|zvol'
(vbd (dev xvda:disk) (uname phy:/dev/zvol/dsk/rpool/zvol/domu-2-root)
(vbd (dev xvdb:disk) (uname phy:/dev/zvol/dsk/rpool/zvol/domu-21-root)

By far the easiest approach in your case would be to just increase the host's zfs volume which backs your old rpool and use autoexpansion (or manual expansion) to let your VM's rpool capture the whole increased virtual disk. If the automagic doesn't work, I posted about a month ago about the manual procedure on this list:
http://mail.opensolaris.org/pipermail/zfs-discuss/2012-November/052712.html

HTH, //Jim Klimov
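The mirror-and-detach route Phillip suggests could look roughly like this - a sketch only: c1t0d0s0/c1t1d0s0 are placeholder device names, and the installgrub paths may vary by distro:

```shell
# Attach the larger vdisk as a mirror of the old rpool disk
zpool attach rpool c1t0d0s0 c1t1d0s0
zpool status rpool          # wait for the resilver to finish first!
# Make the new disk bootable (x86 GRUB)
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
# Once resilvered, drop the smaller half; autoexpand grows the pool
zpool set autoexpand=on rpool
zpool detach rpool c1t0d0s0
```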
[zfs-discuss] Eradicating full-disk pool headers safely
Hello all,

When I started with my old test box (the 6-disk raidz2 pool), I had first created the pool on partitions (i.e. c7t1d0p0, or physical paths like /pci@0,0/pci1043,81ec@1f,2/disk@1,0:q), but I soon destroyed it and recreated the pool (with the same name) in slices (i.e. c7t0d0s0 or /pci@0,0/pci1043,81ec@1f,2/disk@1,0:a) with a trailing 8MB slice (the whole-disk ZFS layout). The disks currently carry EFI labels, and the zpool command finds the correct pool by name.

However, whenever I use zdb, it finds leftovers of my original test as labels number 2 and 3 (numbers 0 and 1 fail to unpack), so zdb refuses to use my pool by name and I have to provide the GUID.

Is it easy to find out at which locations zdb finds these labels, so I could zero them out and let zdb use the correct pool by name? Should I assume that p0 addresses the whole disk, and wipe the last 512KB of the disk (which is now within the reserved 8MB partition)?

BTW, what role does this 8MB piece play? I might guess it helps to replace disks with new ones of similar (not exact) sizes, and this slice on the new disk would shrink or expand to cover up the HDD size discrepancy. But I haven't done any replacements so far which would prove or disprove this ;)

Thanks, //Jim
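For reference, the on-disk format spec places four 256KB label copies on every vdev: two at the front and two in the last 512KB of the device - which is why the stale labels 2 and 3 of the old p0-wide pool sit at the tail of the whole disk, beyond the end of s0. A small sketch of the offsets, with a hypothetical helper:

```shell
# ZFS vdev labels: L0 at offset 0, L1 at 256KB, L2 at size-512KB,
# L3 at size-256KB (each label is 256KB long).
label_offsets() {   # arg: device size in bytes (hypothetical helper)
  sz=$1
  echo "L0 0"
  echo "L1 $(( 256 * 1024 ))"
  echo "L2 $(( sz - 512 * 1024 ))"
  echo "L3 $(( sz - 256 * 1024 ))"
}
label_offsets 2000398934016    # e.g. a 2TB (2000398934016-byte) disk
```

So zeroing the last 512KB of p0 (very carefully, with the pool exported) should make the stale labels disappear; the assumption that p0 covers the whole disk still needs confirming with prtvtoc/format first.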
Re: [zfs-discuss] Digging in the bowels of ZFS
On 2012-12-02 05:42, Jim Klimov wrote:
> So... here are some applied questions:

Well, I am ready to reply a few of my own questions now :)

I've staged an experiment by taking a 128KB block from that file and appending it to a new file in a test dataset, changing the compression settings between the appends. Thus I've got a ZDB dump of three blocks with identical logical userdata and different physical data.

# zdb -d -bb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
0 L1 DVA[0]=0:59492a98000:3000 DVA[1]=0:83e2f65000:3000 [L1 ZFS plain file] sha256 lzjb LE contiguous unique double size=4000L/400P birth=326381727L/326381727P fill=3 cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea
0 L0 DVA[0]=0:590002c1000:30000 [L0 ZFS plain file] sha256 uncompressed LE contiguous unique single size=20000L/20000P birth=326381721L/326381721P fill=1 cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9
2 L0 DVA[0]=0:594928b8000:9000 [L0 ZFS plain file] sha256 lzjb LE contiguous unique single size=20000L/4800P birth=326381724L/326381724P fill=1 cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f
4 L0 DVA[0]=0:59492a92000:6000 [L0 ZFS plain file] sha256 gzip-9 LE contiguous unique single size=20000L/2800P birth=326381727L/326381727P fill=1 cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530
segment [, 0008) size 512K

1) So... how DO I properly interpret this to select the sector ranges to DD into my test area from each of the 6 disks in the raidz2 set? On one hand, the DVA states the block length is 0x9000, and this matches the offsets of neighboring blocks. On the other hand, the compressed physical data size is 0x4c00 for this block, and ranges from 0x4800 to 0x5000 for other blocks of the file. Even multiplied by 1.5 (for raidz2) this is about 0x7000, way smaller than 0x9000.
For uncompressed files I think I saw entries like size=2L/3P, so I'm not sure even my multiplication by 1.5x above is valid, and the discrepancy between the DVA size/interval and the physical allocation size reaches about 2x.

Apparently, my memory failed me. The values in the size field regard the userdata (compressed, non-redundant). Also, I forgot to consider that this pool uses 4KB sectors (ashift=12). So my userdata, which takes up about 0x4800 bytes, requires 4.5 (rather, 5 whole) sectors, and this warrants 4 sectors of raidz2 redundancy on a 6-disk set - 2 sectors for the first 4 data sectors, and 2 sectors for the remaining half-sector's worth of data. This does sum up to 9*0x1000 bytes in whole-sector counting (as in the offsets).

However, the gzip-compressed block above, which only has 0x2800 bytes of userdata and requires 3 data sectors plus 2 redundancy sectors, still has a DVA size of six 4KB sectors (0x6000). This is strange to me - I'd expect 5 sectors for this block altogether... Does anyone have an explanation? Also, what should the extra userdata sector contain physically - zeroes?

> 5) Is there any magic to the checksum algorithms? I.e. if I pass some
> 128KB block's logical (userdata) contents to the command-line sha256 or
> openssl sha256 - should I get the same checksum as ZFS provides and uses?

The original 128KB file's sha256 checksum matches the uncompressed block's ZFS checksum, so in my further tests I can use the command-line tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9 /tmp/b128

No magic, as long as there are useable command-line implementations of the needed algos (sha256sum is there, fletcher[24] are not).

> 6) What exactly does a checksum apply to - the 128KB userdata block or
> a 15-20KB (lzjb-)compressed portion of data? I am sure it's the latter,
> but I ask just in case I'm not missing anything...
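If memory serves, the "extra" sector is allocation padding: raidz rounds every allocation up to a multiple of (nparity+1) sectors, so that any segment freed later can still hold a minimal 1-data+nparity allocation. A sketch of that accounting (my own helper, not a ZFS tool), which reproduces both A-sizes above:

```shell
# Estimate the raidz A-size for a block, under the assumption that
# allocations are padded to a multiple of (nparity+1) sectors.
raidz_asize() {  # args: psize_bytes ndisks nparity ashift
  psize=$1; ndisks=$2; np=$3; ashift=$4
  ssz=$(( 1 << ashift ))
  dsec=$(( (psize + ssz - 1) / ssz ))          # data sectors (rounded up)
  ndata=$(( ndisks - np ))
  rows=$(( (dsec + ndata - 1) / ndata ))       # parity rows needed
  tot=$(( dsec + rows * np ))
  tot=$(( (tot + np) / (np + 1) * (np + 1) ))  # pad to multiple of np+1
  echo $(( tot * ssz ))
}
raidz_asize $((0x4800)) 6 2 12   # lzjb block:   prints 36864 (0x9000)
raidz_asize $((0x2800)) 6 2 12   # gzip-9 block: prints 24576 (0x6000)
```

Under this assumption the gzip block's 3+2=5 sectors are padded to 6, matching the observed 0x6000; the padding sector is skipped space rather than written userdata.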
:) The ZFS parent block's checksum applies to the on-disk variant of the userdata payload (compression included, redundancy excluded).

NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compressions and decompressions (in the same blocky manner as would be applicable to ZFS compression)? I've also tried to gzip-compress the original 128KB file, but none of the compressed results (with varying gzip levels) yielded a checksum matching the ZFS block's one. Zero-padding to 10240 bytes (psize=0x2800) did not help.

8) When should the decompression stop - as soon as it has extracted the logical-size number of bytes (i.e. 0x20000)?

9) Physical sizes magically come in whole 512b units, so it seems... I doubt that the compressed data would always end at such a boundary. How many bytes should be covered by a checksum? Are the 512b blocks involved zero-padded at the ends (on disk and/or in RAM)?

Some OLD questions remain raised, just in case
Re: [zfs-discuss] zpool rpool rename offline
On 2012-12-03 20:35, Heiko L. wrote:
> I've already tested:
>   beadm create -p $dstpool $bename
>   beadm list
>   zpool set bootfs=$dstpool/ROOT/$bename $dstpool
>   beadm activate $bename
>   beadm list
>   init 6
> - result:
> root@opensolaris:~# init 6
> updating //platform/i86pc/boot_archive
> updating //platform/i86pc/amd64/boot_archive
> Hostname: opensolaris
> WARNING: pool 'rpool1' could not be loaded as it was last accessed by
> another system (host: opensolaris hostid: 0xc08358).
> See: http://www.sun.com/msg/ZFS-8000-EY
> ...hang...
> - seems to be a bug...

You wrote that you use OpenSolaris - if literally true, this is quite old, and few people could say definitively which bugs to expect in which version. I might guess (hope) that your ultimate goal in increasing the disk is to upgrade the VM to a more current build, like OI or Sol11?

Still, while booted from the old rpool, after activating the new one, you could also "zpool export rpool1" in order to mark it as cleanly exported and not potentially held by another OS instance. This should allow booting from it, unless some other bug steps in...

//Jim
Re: [zfs-discuss] zpool rpool rename offline
On 2012-12-03 20:51, Heiko L. wrote:
> jimklimov wrote:
>> In general, I'd do the renaming with a different bootable medium -
>> a LiveCD/LiveUSB, another distro that can import and rename this pool
>> version, etc. - as long as booting does not involve use of the old
>> rpool.
>
> Thank you. I will test it in the coming days.

Well then, hopefully this (booting from other media) will help with that (forcing the old rpool to expand)...

>> If the automagic doesn't work, I posted about a month ago about the
>> manual procedure on this list:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2012-November/052712.html
>
> The procedure worked on the second disk, but I cannot use zpool import
> on rpool...
Re: [zfs-discuss] 6Tb Database with ZFS
On 2012-12-01 15:05, Fung Zheng wrote:
> Hello, I'm about to migrate a 6TB database from Veritas Volume Manager
> to ZFS. I want to set the arc_max parameter so ZFS can't use all my
> system's memory, but I don't know how much I should set. Do you think
> 24GB will be enough for a 6TB database? Obviously the more the better,
> but I can't set aside too much memory. Has someone successfully
> implemented something similar?

Not claiming to be an expert fully ready to (mis)lead you (and I haven't done similar quests for databases), I might suggest that you set the ZFS dataset option primarycache=metadata on the dataset which holds the database. (PS: what OS version are you on?)

The general consensus is that serious apps like databases are better than generic OS/FS caches at caching what the DBMS deems fit (and the data blocks might otherwise get cached twice - in the ARC and in the app cache); however, having the ZFS *metadata* cached should speed up your HDD IO - the server can keep {much of} the needed block map in RAM and not have to start by fetching it from the disks every time.

Also make sure to set the recordsize attribute as appropriate for your DB software - to match the DB block size. Usually this ranges around 4, 8 or 16KB (with the zfs default being 128KB for filesystem datasets). You might also want to put non-tablespace files (logs, indexes, etc.) into separate datasets with their appropriate record sizes - this would let you play with different caching and compression settings where applicable (you might save some IOPS by reading and writing less mechanical data, at a small hit to CPU horsepower, by using LZJB).

Also, such systems tend to benefit from SSD L2ARC read caches and SSD SLOG (ZIL) write caches. These are different pieces of equipment with distinct characteristics (a SLOG is mirrored, small, write-mostly, and should endure write-wear and survive sudden poweroffs; an L2ARC is big, fast for small random reads, moderately reliable).
If you do use a big L2ARC, you might indeed want both ZFS caches enabled for frequently accessed datasets (e.g. indexes), holding both the userdata and metadata (as is the default), while the randomly accessed tablespaces might or might not be good candidates for such caching - you can test this setting change on the fly, though. I believe you must allow caching userdata for a dataset in RAM if you want to let it spill over onto the L2ARC.

HTH, //Jim Klimov
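A sketch of the dataset layout suggested above - the pool/dataset names are placeholders, and the 8KB recordsize is only an example that must be matched to the actual DB block size:

```shell
# Tablespaces: match the DB block size, cache only metadata in ARC
zfs create -o recordsize=8k   -o primarycache=metadata pool/db/tables
# Indexes: frequently re-read, so let both data and metadata be cached
zfs create -o recordsize=8k   -o primarycache=all      pool/db/indexes
# Logs: mostly sequential appends, compress cheaply with lzjb
zfs create -o recordsize=128k -o compression=lzjb      pool/db/logs
```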
[zfs-discuss] Digging in the bowels of ZFS
such combo and ZFS does what it should exhaustively and correctly, indeed ;) Thanks a lot in advance for any info, ideas, insights, and just for reading this long post to the end ;) //Jim Klimov
Re: [zfs-discuss] Remove disk
On 2012-11-30 15:52, Tomas Forsman wrote:
> On 30 November, 2012 - Albert Shih sent me these 0,8K bytes:
>> Hi all, I would like to know if with ZFS it's possible to do
>> something like that:
>> http://tldp.org/HOWTO/LVM-HOWTO/removeadisk.html

Removing a disk - no, one still cannot reduce the number of devices in a zfs pool nor change raidzN redundancy levels (you can change single disks to mirrors and back), nor reduce the disk size. As Tomas wrote, you can increase the disk size by replacing smaller disks with bigger ones.

With sufficiently small starting disks and big new disks (e.g. moving up from 1-2TB to 4TB) you can cheat by putting several partitions on one drive and giving them to different pool components - if your goal is to reduce the number of hardware disks in the pool. However, note that:

1) A single HDD becomes a SPOF, so you should spread the pieces of different raidz sets across particular disks, so that if an HDD dies it does not bring down a critical number of pool components and does not kill the pool.

2) The disk mechanics will be torn between many requests to your pool's top-level VDEVs, probably greatly reducing the achievable IOPS (since the TLVDEVs are accessed in parallel).

So while possible, this cheat is useful as a temporary measure - i.e. while you migrate data and don't have enough drive bays to hold both the old and new disks, and want to be on the safe side by not *removing* a good disk in order to replace it with a bigger one. With this cheat you have all data safely and redundantly stored on disks at all times during the migration. In the end this disk can be the last piece of the puzzle in your migration.

>> meaning: I have a zpool with 48 disks with 4 raidz2 (12 disks each).
>> Among those 48 disks I have 36x 3TB and 12x 2TB. Can I buy 12 new 4TB
>> disks, put them in the server, add them to the zpool, ask zpool to
>> migrate all data from those 12 old disks onto the new ones, and then
>> remove the old disks?
> You pull out one 2TB, put in a 4TB, and wait for the resilver
> (possibly telling it to replace, if you don't have autoreplace on).
> Repeat until done. If you have the physical space, you can instead
> first put in the new disk, tell it to replace the old one, and only
> then remove the old disk.
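Tomas's replace-one-at-a-time loop, sketched as commands (the pool and device names are placeholders; each replace must finish resilvering before the next disk is touched):

```shell
# Let the vdev grow once every member of a raidz2 has been upsized
zpool set autoexpand=on tank
# For each old 2TB disk in turn (placeholder device names):
zpool replace tank c0t3d0 c0t9d0   # old disk, new 4TB disk
zpool status tank                  # wait here until the resilver completes
```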
[zfs-discuss] ZFS QoS and priorities
I've heard a claim that ZFS relies too much on RAM caching but implements no sort of priorities (indeed, I've seen no knobs to tune those) - so that if the storage box receives many different types of IO requests with different administrative weights in the view of the admins, it cannot really throttle some IOs to boost others when those IOs have to hit the pool's spindles.

For example, I might want the corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower-priority zones and VMs, and at the bottom of the list - backups. AFAIK, right now such requests would hit the ARC, then the disks if needed - in no particular order.

Well, can the order be made particular with the current ZFS architecture, i.e. by setting some datasets to have a certain NICEness or another priority mechanism?

Thanks for info/ideas, //Jim
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> There are very few situations where the (gzip) option is better than
> the default lzjb.

Well, for the most part my question regarded the slowness (or lack thereof) of gzip DEcompression as compared to the lz* algorithms. If there are files and data like the OS (GZ/LZ) images and program binaries, which are written once but read many times, I don't really care how expensive it is to write less data (and for an OI installation the difference between lzjb and gzip-9 compression of /usr can be around or over 100MB) - as long as I keep less data on disk and have fewer IOs to read in the OS during boot and work. Especially so if - and this is the part I am not certain about - it is roughly as cheap to READ the gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).

//Jim
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
Performance-wise, I think you should go for mirrors/raid10, and separate the pools (i.e. an rpool mirror on SSD and a data mirror on HDDs). If you have 4 SSDs, you might mirror the other couple for zone roots or some databases in datasets delegated into zones, for example.

Don't use dedup. Carve out some space for L2ARC. As Ed noted, you might not want to dedicate much disk space, due to the remaining RAM pressure when using the cache; however, spreading the IO load between smaller cache partitions/slices on each SSD may help your IOPS on average.

Maybe go for compression. I really hope someone better versed in compression - like Saso - will chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in terms of read speeds from the pools. My general HDD-based assumption is that the less data you read (or write) on the platters - the better, and the spare CPU cycles can usually take the hit.

I'd spread the different data types (e.g. WORM programs, WORM-append logs, and random-IO application data) out into various datasets with different settings, backed by different storage - since you have the luxury.

Many best-practice documents (and the original Sol10/SXCE/LiveUpgrade requirements) place the zone roots on the same rpool so they can be upgraded seamlessly as part of the OS image. However, you can also delegate ZFS datasets into zones and/or have lofs mounts from the GZ to the LZs (maybe needed for shared datasets like distros and homes - and faster/more robust than NFS from GZ to LZ).

For OS images (zone roots) I'd use gzip-9 or better (likely lz4 when it gets integrated), same for logfile datasets; and lzjb, zle or none for the random-IO datasets. For structured things like databases I also research the block IO size and use that (at dataset creation time) to reduce extra work for the ZFS COW during writes - at the expense of more metadata.
You'll likely benefit from having the OS images on SSDs, logs on HDDs (including logs from the GZ and LZ OSes, to reduce needless writes to the SSDs), and databases on SSDs. For other data types it depends, and in general they'd be helped by an L2ARC on the SSDs.

Also note that much of the default OS image is not really used (e.g. X11 on headless boxes), so you might want to do weird things with the GZ or LZ rootfs data layouts - but note that these might puzzle your beadm/liveupgrade software, so you'd have to do any upgrades with lots of manual labor :)

On a somewhat orthogonal route, I'd start by setting up a generic dummy zone, perhaps with much unneeded software, and zfs-clone that to spawn application zones. This way you only pay the footprint price once - at least until you have to upgrade the LZ OSes; in that case it might be cheaper (in terms of storage, at least) to upgrade the dummy, clone it again, and port each LZ's customizations (installed software) by finding the differences between the old dummy and the current zone state (zfs diff, rsync -cn, etc.). In such upgrades you're really well served by storing volatile data in datasets separate from the zone OS root - you just reattach these datasets to the upgraded OS image and go on serving.

As a particular example of a thing often upgraded and taking considerable disk space per copy - I'd have the current JDK installed in the GZ: either simply lofs-mounted from the GZ into the LZs, or in a separate dataset, cloned and delegated into the LZs (if JDK customizations are further needed by some - but not all - local zones, e.g. timezone updates, trusted CA certs, etc.).

HTH, //Jim Klimov
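The dummy-zone cloning approach might look like this (a sketch; the dataset names are placeholders):

```shell
# Freeze the fully-patched dummy zone's root and spawn app zones from it
zfs snapshot zones/dummy@golden
zfs clone zones/dummy@golden zones/web01
zfs clone zones/dummy@golden zones/db01
# Clones share blocks with @golden, so N zones cost ~one OS footprint.
# Snapshot each clone right away to track its later customizations:
zfs snapshot zones/web01@pristine
zfs diff zones/web01@pristine zones/web01   # what this zone changed since
```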
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
Now that I've thought about it some more, a follow-up is due on my advice:

1) While the best practices do (did) dictate setting up zone roots in the rpool, this is certainly not required - and I maintain lots of systems which store zones in separate data pools. This minimizes the write impact on rpools and gives the fuzzy feeling of keeping the systems safer from unmountable or overfilled roots.

2) Whether your LZs and GZ are in the same rpool, or you stack tens of your LZ roots in a separate pool, they do in fact offer a nice target for dedup - with an expected large dedup ratio which would outweigh both the overheads and IO lags (especially on an SSD pool) and the inconveniences of my approach with cloned dummy zones - especially upgrades thereof. Just remember to use the same compression settings (or lack of compression) on all zone roots, so that the zfs blocks for the OS image files will be identical and dedupable.

HTH, //Jim Klimov
Re: [zfs-discuss] Directory is not accessible
On 2012-11-26 15:15, the OP wrote:
> How can one remove a directory containing corrupt files, or a corrupt
> file itself? For me rm just gives an input/output error.

I believe you can get rid of the corrupt files by overwriting them. In my case of corrupted files, I dd'ed the corrupt blocks from a backup source into the right spots of the files. Overall this released the corrupt blocks from the pool and allowed them to be freed (or perhaps leaked, in the case of that bug I stepped onto).

Trying to free the block can get your pool into trouble or panics, though, depending on the nature of the corruption (in my case, the DDT was trying to release a block that had not been entered into the DDT). If this happens, your next best bet would be to trace where the error happens, invent a patch (such as letting the block possibly leak away) and compile your own kernel to clean up the pool. Of course, it is also possible that the block will simply go away (if it is not also referenced by snapshots/clones/dedup), and such drastic measures won't be needed.

HTH, //Jim Klimov
[zfs-discuss] ZFS Appliance as a general-purpose server question
A customer is looking to replace or augment their Sun Thumper with a ZFS appliance like the 7320. However, the Thumper was used not only as a protocol storage server (home dirs, files, backups over NFS/CIFS/rsync), but also as a general-purpose server with unpredictably-big-data programs running directly on it (such as corporate databases, Alfresco for intellectual document storage, etc.), in order to avoid the networking transfer of such data between pure-storage and compute nodes - that networking was seen as both a bottleneck and a possible point of failure.

Is it possible to use the ZFS Storage appliances in a similar way, and fire up a Solaris zone (or a few) directly on the box for general-purpose software; or to shell-script administrative tasks such as backup archive management in the global zone (if that concept still applies), as is done on their current Solaris-based box? Is it possible to run VirtualBoxes in the ZFS-SA OS, dare I ask? ;)

Thanks, //Jim Klimov
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 2012-11-22 17:31, Darren J Moffat wrote:
>> Is it possible to use the ZFS Storage appliances in a similar way, and
>> fire up a Solaris zone (or a few) directly on the box for
>> general-purpose software; or to shell-script administrative tasks such
>> as backup archive management in the global zone (if that concept still
>> applies), as is done on their current Solaris-based box?
>
> No, it is a true appliance; it might look like it has Solaris
> underneath, but it is just based on Solaris. You can script
> administrative tasks, but not using bash/ksh-style scripting - you use
> the ZFSSA's own scripting language.

So, the only supported (or even possible) way is indeed to use it as a NAS for file or block IO from another head running the database or application servers?..

In the datasheet I read that Cloning and Remote replication are separately licensed features; does this mean that the capability for zfs send | zfs recv backups from remote Solaris systems must be purchased separately? :(

I wonder if it would make weird sense to get the boxes, forfeit the cool-looking Fishworks, and install Solaris/OI/Nexenta/whatever to get the most flexibility and bang for the buck from the owned hardware... Or, rather, to shop for equivalent non-appliance servers...

//Jim
Re: [zfs-discuss] mixing WD20EFRX and WD2002FYPS in one pool
On 2012-11-21 16:45, Eugen Leitl wrote:
> Thanks, this is great to know. The box will be headless, and run in
> text-only mode. I have an Intel NIC in there, and don't intend to use
> the Realtek port for anything serious.

My laptop, based on an AMD E2 VISION integrated CPU and a Realtek gigabit chip, had intermittent problems with the rge driver (the interrupt count went to about 100k/sec and X11 locked up until I disconnected the LAN), but these diminished or disappeared after I switched to the gani driver (source available on the internet). OI lacks support for the Radeon chips in my CPU (it works as vesavga). And USB3.

> I intend to boot off a USB flash stick, and run OI with napp-it. 8GB
> of RAM, unfortunately not ECC, but it will do for a secondary SOHO
> NAS, as the data is largely read-only.

Theoretically, if memory has a hiccup while a scrub verifies your disks, it can cause phantom checksum mismatches to be detected. I am not sure about the timing of reads and other events involved in the further reconstitution of the data - whether the recovery attempt will use the re-read (and possibly correct) sector data, or whether it will continue based on the invalid buffer contents. I'd guess ZFS, to be on the safe side, should double-check the found discrepancies and the sectors it's going to use to recover a block, at least if the kernel knows it is on non-ECC RAM (if it does), but I don't know whether it really does that. (A worthy RFE if not.)

HTH, //Jim Klimov
Re: [zfs-discuss] Intel DC S3700
On 2012-11-21 21:55, Ian Collins wrote:
> I can't help thinking these drives would be overkill for an ARC
> device. All of the expensive controller hardware is geared to boosting
> random write IOPS, which is somewhat wasted on a write-slowly,
> read-often device. The enhancements would be good for a ZIL, but the
> smallest drive is at least an order of magnitude too big...

I think, given the write endurance and power-loss protection, these devices might make for good pool devices - whether for an SSD-only pool, or for rpool+zil(s) mirrors, with the main pools (and likely the L2ARCs, yes) being on different types of devices.

//Jim
Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
On 2012-11-21 03:21, nathan wrote:
> Overall, the pain of the doubling of bandwidth requirements seems like
> a big downer for *my* configuration, as I have just the one SSD, but
> I'll persist and see what I can get out of it.

I might also speculate that for each rewritten block of userdata in the VM image, you get a series of metadata block updates in ZFS. If you keep the zvol blocks relatively small, you might see an effective doubling of writes just from the userdata updates.

As for the ZIL - even if it is used in the in-pool variant, I don't think your setup needs any extra steps to disable it (as Edward likes to suggest), and most other setups don't need to disable it either. It also shouldn't add much to your writes - the in-pool ZIL blocks are then referenced as userdata when the TXG commit happens (I think).

I also think that with a VM in a raw partition you don't get any snapshots - neither ZFS ones on the underlying storage ('cause it's not there), nor hypervisor snaps of the VM. So while faster, this is also something of a trade-off :)

//Jim
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-11-19 20:28, Peter Jeremy wrote: Yep - that's the fallback solution. With 1874 snapshots spread over 54 filesystems (including a couple of clones), that's a major undertaking. (And it loses timestamp information.)

Well, as long as you have and know the base snapshots for the clones, you can recreate them at the same branching point on the new copy too.

Remember to use something like "rsync -cavPHK --delete-after --inplace src/ dst/" to do the copy, so that files removed from the source snapshot are also removed on the target, changes are detected by file checksum verification (not only size and timestamp), and changes take place within the target's copy of the file (rather than rsync's default copy-and-rewrite), so that the retained snapshot history remains sensible and space-saving.

Also, while you are at it, you can use different settings on the new pool, based on what you have learned about your data - perhaps better compression (IMHO stale old data that has become mostly read-only is a good candidate for gzip-9), proper block sizes for database files and disk images, maybe better checksums, and, if your RAM vastness and data similarity permit, perhaps dedup (run "zdb -S" on the source pool to simulate dedup and see if you get better than 3x savings - then it may become worthwhile).

But, yes, this will take quite a while, effectively walking your pool several thousand times if you do the plain rsync from each snapdir. Perhaps, if "zfs diff" performs reasonably for you, you can feed its output to rsync as the list of objects to replicate and save many cycles this way.

Good luck, //Jim Klimov
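The per-snapshot walk described above could be sketched roughly like this - a hypothetical POSIX-shell helper, not a tool from the thread; the dataset names and mountpoints are examples, and you'd want to review it before running as root:

```shell
# Hypothetical sketch: replay every snapshot of a source dataset onto a
# target via rsync, snapshotting the target after each pass so the
# retained history stays space-efficient. Names are examples only.
replicate_snaps() {
    src_ds=$1   # e.g. oldpool/data
    dst_ds=$2   # e.g. newpool/data
    src_mnt=$3  # mountpoint of the source dataset
    dst_mnt=$4  # mountpoint of the target dataset

    # list the dataset's snapshots oldest-first, strip the "dataset@" prefix
    zfs list -H -t snapshot -o name -s creation -r "$src_ds" |
    sed -n "s|^$src_ds@||p" |
    while read -r snap; do
        # checksum-based compare, in-place rewrites, hardlink preservation
        rsync -cavPHK --delete-after --inplace \
            "$src_mnt/.zfs/snapshot/$snap/" "$dst_mnt/" || return 1
        # freeze the target at the same point in history
        zfs snapshot "$dst_ds@$snap"
    done
}
# Example: replicate_snaps oldpool/data newpool/data /oldpool/data /newpool/data
```

This preserves the snapshot sequence but, as Peter notes, not the original snapshot timestamps.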
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-11-19 20:58, Mark Shellenbaum wrote: There is probably nothing wrong with the snapshots. This is a bug in "zfs diff". The ZPL parent pointer is only guaranteed to be correct for directory objects. What you probably have is a file that was hard-linked multiple times, and the parent pointer (i.e. directory) was recycled and is now a file.

Interesting... so the ZPL files in ZFS keep pointers to their parents? Given the COW transactional semantics, how could the parent directory be removed, but not the pointer to it from the files inside it? Is this possible in current ZFS, or could this be a leftover in the pool from its history with older releases?

Thanks, //Jim
Re: [zfs-discuss] Repairing corrupted ZFS pool
Oh, and one more thing: rsync is only good if your filesystems don't really rely on ZFS/NFSv4-style ACLs. If you need those, you are stuck with Solaris tar or Solaris cpio to carry the files over, or you have to script up replication of the ACLs after rsync somehow.

You should also replicate the local zfs attributes of your datasets, "zfs allow" permissions, and ACLs on .zfs/shares/* (if any, for CIFS) - at least their currently relevant live values, which is also not fatally difficult to script. (I don't know whether it is possible to fetch the older attribute values - those in force at a past moment in time - from snapshots; if somebody knows anything about this, please write.)

On another note, to speed up the rsyncs you can try to save on encryption (if you do this within a trusted LAN) - use rsh, or ssh with the arcfour or none encryption algorithms, or perhaps rsync over NFS as if you were in the local filesystem.

HTH, //Jim
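Replicating the locally-set dataset properties could be scripted along these lines - a hypothetical sketch (dataset names are examples); it only prints the "zfs set" commands so you can review and edit them (mountpoint, sharenfs, etc.) before piping to sh:

```shell
# Hypothetical sketch: generate "zfs set" commands that copy the
# locally-set properties of one dataset onto another. Review the output
# before executing it -- some values may need edits for the new pool.
copy_local_props() {
    src=$1; dst=$2
    tab=$(printf '\t')
    # -s local limits the listing to properties set on this dataset itself
    zfs get -H -s local -o property,value all "$src" |
    while IFS=$tab read -r prop value; do
        printf "zfs set '%s=%s' %s\n" "$prop" "$value" "$dst"
    done
}
# Example: copy_local_props oldpool/data newpool/data | sh
```

The "zfs allow" delegations and share ACLs would still need separate handling, as noted above.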
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-11-19 22:38, Mark Shellenbaum wrote: The parent pointer is a single 64-bit quantity that can't track all the possible parents a hard-linked file could have.

I believe it is the inode number of the parent, or something similar - and an available inode number can get recycled and used by newer objects?

Now when the original dir.2 object number is recycled, you could have a situation where the parent pointer points to a non-directory. The ZPL never uses the parent pointer internally. It is only used by "zfs diff" and other utility code to translate object numbers to full pathnames. The ZPL has always set the parent pointer, but it is more for debugging purposes.

Thanks, very interesting! Now that this value is used and somewhat exposed to users, isn't it time to replace it with some nvlist or a different object type that would hold all such parent pointers for hardlinked files (perhaps moving from a single integer to an nvlist when there is more than one link to a file inode)? At least it would make "zfs diff" more consistent and reliable, though at the cost of some complexity... Inodes already track their reference counts; if we keep track of one referrer explicitly, why not track them all?

Thanks for the info, //Jim
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
On 2012-11-15 21:43, Geoff Nordli wrote: Instead of using vdi, I use comstar targets and then use vbox's built-in iscsi initiator.

Out of curiosity: in this case, are there any devices whose ownership might get similarly botched, or have you tested that this approach also works well for non-root VMs? Did you measure any overhead of initiator-target vs. zvol, both being on the local system? Is there any significant performance difference worth thinking and talking about?

Thanks, //Jim
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
On 2012-11-16 12:43, Robert Milkowski wrote: No, there isn't another way to do it currently. The SMF approach is probably the best option for the time being. I think that there should be a couple of other properties on a zvol where permissions could be stated.

+1 :) Well, when the subject was discussed a month ago, I posted a couple of RFEs, lest the problem be quietly forgotten:

https://www.illumos.org/issues/3283 (ZFS: correctly remember device node ownership and ACLs for ZVOLs)
https://www.illumos.org/issues/3284 (ACLs on a device node can become applied to wrong devices; UID/GID not retained)

While trying to find workarounds for Edward's problem, I discovered that NFSv4/ZFS-style ACLs can be applied to /devices/* and are even remembered across reboots, but in fact this causes more problems than it solves. //Jim
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
Well, as a simple stone-age solution (to simplify your SMF approach), you can define custom attributes on datasets, zvols included. A custom attribute must include a colon (:) in its name, and values can be multiline if needed. A simple example follows:

# zfs set owner:user=jim pool/rsvd
# zfs set owner:group=staff pool/rsvd
# zfs set owner:chmod=777 pool/rsvd
# zfs set owner:acl="`ls -vd .profile`" pool/rsvd
# zfs get all pool/rsvd
...
pool/rsvd  owner:chmod  777    local
pool/rsvd  owner:acl    -rw-r--r--   1 root  root  54 Nov 11 22:21 .profile
    0:owner@:read_data/write_data/append_data/read_xattr/write_xattr
        /read_attributes/write_attributes/read_acl/write_acl/write_owner
        /synchronize:allow
    1:group@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow
    2:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
        :allow                  local
pool/rsvd  owner:group  staff  local
pool/rsvd  owner:user   jim    local

Then you can query the zvols for such attribute values and use them in chmod, chown, ACL settings, etc. from your script. This way the main goal is reached: the ownership configuration data stays within the pool.

HTH, //Jim
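The consuming side of this scheme might look like the following - a hypothetical helper (the function name, attribute names, and device path are illustrative assumptions, to be adjusted for your layout), e.g. called from an SMF start method after the zvol devices appear:

```shell
# Hypothetical sketch: fetch the owner:* custom attributes of a zvol and
# apply them to its device node. "zfs get" prints "-" when unset.
apply_zvol_owner() {
    zvol=$1
    dev=/dev/zvol/rdsk/$zvol
    user=$(zfs get -H -o value owner:user "$zvol")
    group=$(zfs get -H -o value owner:group "$zvol")
    mode=$(zfs get -H -o value owner:chmod "$zvol")
    [ "$user"  != "-" ] && chown "$user"  "$dev"
    [ "$group" != "-" ] && chgrp "$group" "$dev"
    [ "$mode"  != "-" ] && chmod "$mode"  "$dev"
    return 0
}
# Example (as root): apply_zvol_owner pool/rsvd
```

Since the attributes live in the pool, the same script keeps working after the pool is exported and imported elsewhere.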
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
On 2012-11-16 14:45, Jim Klimov wrote: Well, as a simple stone-age solution (to simplify your SMF approach), you can define custom attributes on datasets, zvols included...

Forgot to mention: to clear these custom values, you can just "zfs inherit" them on this same dataset. As long as the parent does not define them, they should just get wiped out. //Jim
Re: [zfs-discuss] Intel DC S3700
On 2012-11-14 18:05, Eric D. Mudama wrote: On Wed, Nov 14 at 0:28, Jim Klimov wrote: All in all, I can't come up with anything offensive against it quickly ;) One possible nit regards the ratings being geared towards 4KB blocks (which is not unusual with SSDs), so it may be further from the announced performance with other block sizes - i.e. when caching ZFS metadata. Would an ashift of 12 conceivably address that issue?

Performance-wise (and wear-wise) - probably. Gotta test how bad it is at 512b IOs ;) Also, I am not sure whether ashift applies to (can be set for) L2ARC cache devices...

Actually, if read performance does not happen to suck at smaller block sizes, ashift is not needed - L2ARC writes seem to be streamed sequentially (as onto an infinite tape), so smaller writes would still coalesce into big hardware writes and not cause excessive wear by banging many random flash cells. IMHO :) //Jim
Re: [zfs-discuss] Intel DC S3700
On 2012-11-13 22:56, Mauricio Tavares wrote: Trying again: Intel just released those drives. Any thoughts on how nicely they will play in a zfs/hardware raid setup?

Seems interesting - fast, assumed reliable, and consistent in its IOPS (according to the marketing talk), and it addresses power-loss reliability (according to the datasheet):

* Endurance rating - 10 drive writes/day over 5 years while running the JESD218 standard
* The Intel SSD DC S3700 supports testing of the power-loss capacitor, which can be monitored using the following SMART attribute: (175, AFh).

Somewhat affordably priced (at least in the volume market, for shops that buy hardware by the cubic meter ;)

http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/05/intel-announces-intel-ssd-dc-s3700-series--next-generation-data-center-solid-state-drive-ssd
http://download.intel.com/newsroom/kits/ssd/pdfs/Intel_SSD_DC_S3700_Product_Specification.pdf

All in all, I can't come up with anything offensive against it quickly ;) One possible nit regards the ratings being geared towards 4KB blocks (which is not unusual with SSDs), so it may be further from the announced performance with other block sizes - i.e. when caching ZFS metadata.

Thanks for bringing it into the spotlight, and I hope the more savvy posters will review it better. //Jim
Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?
On 2012-11-14 03:20, Dan Swartzendruber wrote: Well, I think I give up for now. I spent quite a few hours over the last couple of days trying to get the gnome desktop working on bare-metal OI, followed by virtualbox. Supposedly that works in headless mode with RDP for management, but it was nothing but fail for me. Found quite a few posts on various forums from people complaining that RDP with external auth doesn't work (or not reliably), and that was my experience.

I can't say I've used VirtualBox RDP extensively, certainly not in the newer 4.x series, yet. For my tasks it sufficed to switch the VM between headless and GUI modes via savestate, as automated by my script from vboxsvc ("vbox.sh -s vmname startgui" for a VM already configured as a vboxsvc SMF service).

The final straw was when I rebooted the OI server as part of cleaning things up, and... it hung. The last line in the verbose boot log was 'ucode0 is /pseudo/ucode@0'. I power-cycled it to no avail. I even tried a backup BE from hours earlier, to no avail. Likely whatever got bunged up happened prior to that. If I could get something like xen or kvm running reliably for a headless setup, I'd be willing to give it a try, but for now, no...

I can't say much about the OI desktop problems either - it works for me (along with the VBox 4.2.0 release), suboptimally due to lack of drivers, but reliably. Try booting with the -k option to load the kmdb debugger as well - maybe the system will enter it upon getting stuck (it does so instead of rebooting when it panics) and you can find some more details there?.. //Jim
[zfs-discuss] Expanding a ZFS pool disk in Solaris 10 on VMWare (or other expandable storage technology)
 Partition   Status    Type        Start   End   Length    %
 =========   ======    ====        =====   ===   ======   ===
     1       Active    Solaris         1  2609     2609   100

If you apply this technique to rpools, note that the partition type is different (SOLARIS vs EFI), the start cylinders differ (1 for MBR, 0 for EFI), and the bootable partition is Active.

6) The scary part is that I need to remove the partition and slice tables and recreate them starting at the same positions. So in fdisk I press 3 to delete partition 1, then I press 1 to create a new partition. If I select EFI, it automatically fills the disk from 0 to the end. An MBR-based (Solaris2) partition started at 1 and asked me to enter the desired size. For a disk dedicated fully to a pool, I chose EFI, as it was originally. Now I press 5 to save the new partition table and return to format. Entering p, p I see that the slice sizes remain as they were... Returning to the disk-level menu, I entered "t" for Type:

format> t
AVAILABLE DRIVE TYPES:
        0. Auto configure
        1. other
Specify disk type (enter its number)[1]: 0
c1t1d0: configured with capacity of 60.00GB
VMware-Virtual disk-1.0-60.00GB
selecting c1t1d0
[disk formatted]
/dev/dsk/c1t1d0s0 is part of active ZFS pool pool. Please see zpool(1M).

I picked 0, et voila - the partition sizes are reassigned.
Too early to celebrate, however: the ZFS slice #0 now starts at a wrong position:

format> p
partition> p
Current partition table (default):
Total disk sectors available: 125812701 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               34       59.99GB        125812701
  1 unassigned    wm                0           0                0
  2 unassigned    wm                0           0                0
  3 unassigned    wm                0           0                0
  4 unassigned    wm                0           0                0
  5 unassigned    wm                0           0                0
  6 unassigned    wm                0           0                0
  8   reserved    wm        125812702        8.00MB         125829085

Remembering that the original table started this slice at 256, and remembering the new table's last-sector value, I mix the two:

partition> 0
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               34       59.99GB        125812701

Enter partition id tag[usr]:
Enter partition permission flags[wm]:
Enter new starting Sector[34]: 256
Enter partition size[125812668b, 125812923e, 61431mb, 59gb, 0tb]: 125812701e

partition> p
Current partition table (unnamed):
Total disk sectors available: 125812701 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm              256       59.99GB        125812701

Finally, I can save the changed tables and exit format:

partition> label
Ready to label disk, continue? y
partition> q
format> q

7) Inspecting the pool, and even exporting and importing it and inspecting again, I see that autoexpand did not take place and the pool is still 20Gb in size (dunno why - a sol10u10 bug?) :( So I do the manual step:

# zpool online -e pool c1t1d0

The -e flag marks the component as eligible for expansion. When all pieces of a top-level vdev become larger, the setting takes effect and the pool finally becomes larger:

# zpool list
NAME    SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
pool   59.9G   441M  59.4G   0%  ONLINE  -
rpool  19.9G  6.91G  13.0G  34%  ONLINE  -

Now I can finally get on with my primary quest and install that large piece of software into a zone that lives on "pool"! ;)

HTH, //Jim Klimov
Re: [zfs-discuss] cannot replace X with Y: devices have different sector alignment
On 2012-11-10 17:16, Jan Owoc wrote: Any other ideas, short of block-pointer rewrite?

A few... One is an idea of what could be the cause: AFAIK the ashift value is not so much per-pool as per-toplevel-vdev. If the pool started as a set of 512b drives and was then expanded to include sets of 4K drives, this mixed ashift could happen...

It might be possible to override the ashift value with sd.conf and fool the OS into using 512b sectors on a 4KB-native disk (this is mostly used the other way around, though - to enforce 4KB sectors on 4KB-native drives that emulate 512b sectors). This might work, and earlier posters on the list saw no evidence that 512b emulation is inherently evil and unreliable (modulo firmware/hardware errors, which can lurk anywhere anyway), but it would likely make the disk slower on random writes. Also, I am not sure how a 4KB-native HDD would process partial overwrites of a 4KB sector with 512b pieces of data - would the other bytes remain intact or not?.. Before trying to fool a production system this way, if at all, I believe some stress tests with small blocks are due on some other system.

My 2c, //Jim Klimov
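For illustration, such an sd.conf override might look like the fragment below - a purely hypothetical example: the property name and the exact vendor/product matching rules should be checked against your OS's sd driver documentation (the vendor field is conventionally space-padded to 8 characters), and the whole idea deserves testing on a scratch box first:

```
# /kernel/drv/sd.conf -- hypothetical entry; names and values are examples
sd-config-list =
    "ATA     ST2000DL003-9VT1", "physical-block-size:512";
```

A reboot (or driver reload) would be needed for the override to take effect, after which the disk should report the overridden sector size to ZFS.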
Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?
On 2012-11-09 16:14, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: Karl Wagner [mailto:k...@mouse-hole.com] If I was doing this now, I would probably use the ZFS-aware OS bare-metal, but I still think I would use iSCSI to export the ZVols (mainly due to the ability to use it across a real network, hence allowing guests to be migrated simply)

Yes, if your VM host is some system other than your ZFS bare-metal storage server, then exporting the zvol via iscsi is a good choice, or exporting your storage via NFS. Each one has its own pros/cons, and I would personally be biased in favor of iscsi. But if you're going to run the guest VM on the same machine that is the ZFS storage server, there's no need for iscsi.

Well, since the ease of re-attachment of VM hosts to iSCSI was mentioned a few times in this thread (and there are particular nuances with iSCSI to localhost), it is worth mentioning that NFS files can be re-attached just as easily - including on localhost. Cloning disks is just as easy whether they are zvols or files in dedicated datasets; note that disk image UUIDs must be re-forged anyway (see the docs). Also note that, in general, there may be a need for some fencing (i.e. ensuring that only one host tries to start up a VM from a particular backend image) - I am not sure iSCSI inherently does a better job at this than NFS... //Jim
Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?
On 2012-11-09 16:11, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: Dan Swartzendruber [mailto:dswa...@druber.com] I have to admit Ned's (what do I call you?) idea is interesting. I may give it a try...

Yup, officially Edward, most people call me Ned. I contributed to the OI VirtualBox instructions. See here: http://wiki.openindiana.org/oi/VirtualBox Jim's vboxsvc is super powerful.

Thanks for the kudos, and I'd also welcome some on the SourceForge project page :) http://sourceforge.net/projects/vboxsvc/

for now, if you find it confusing in any way, just ask for help here. (Right, Jim?)

I'd prefer the questions and discussion on vboxsvc to continue in the VirtualBox forum, so it's all in one place for other users too. It is certainly offtopic for a list about ZFS, so I won't take this podium for too long :) https://forums.virtualbox.org/viewtopic.php?f=11&t=33249

One of these days I'm planning to contribute a Quick Start guide to vboxsvc. I agree that the README might need cleaning up; so far it is like a snowball, growing with details and new features. Perhaps some part should be separated into a concise quick-start guide that would not scare people off by the sheer amount of letters ;) I don't think I can point to a chapter and say "Take this as the QuickStart" :(

But at first I found it overwhelming, mostly due to unfamiliarity with SMF.

The current README does, however, provide an overview of SMF, as was needed by some of the inquiring users, and an example of command-line creation of a service to wrap a VM. A feature to do this by the script itself is pending, somewhat indefinitely.

Also note that for OI desktop users in particular (and likely for other OpenSolaris-based OSes with X11 too), I'm now adding features to ease management of VMs that are not executed headless, but rather are interactive. These too can now be wrapped as SMF services to automate shutdown and/or backgrounding into headless mode and back.
I made and use this myself to enter other OSes on my laptop that are dual-bootable and can run in VBox as well as on hardware. There is also a new foregrounding "startgui" mode that can trap the signals which stop its terminal and properly savestate or shut down the VM; it also wraps taking ZFS snapshots of the VM disk resources, if applicable. There is also a mode where the script spawns a dedicated xterm for its execution; by closing the xterm you can properly stop the VM, with one click, using the preselected method of your choice, before you log out of the X11 session.

However, this part of my work was almost in vain - the end of an X11 session happens as a brute-force close of X connections, so the interactive GUIs just die before they can process any signals. This makes sense for networked X servers that can't really send signals to remote client OSes, but is rather stupid for a local OS. I hope the desktop environment gurus might come up with something. Or perhaps I'll come up with an SMF wrapper for X sessions that the vbox startgui feature could depend on, so that the close of a session would be an SMF disablement. Hopefully, spawned local X clients would also be under the SMF contract and would get a chance to stop properly :)

Anyway, if anybody else is interested in the new features described above - check out the code repository for the vboxsvc project (it is not yet so finished as to publish a new package version):

http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/var/svc/manifest/site/vbox-svc.xml
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/usr/share/doc/vboxsvc/README-vboxsvc.txt

See you in the VirtualBox forum thread if you do have questions :) //Jim Klimov
[zfs-discuss] Forcing ZFS options
There are times when ZFS options cannot be applied at the moment, e.g. changing the desired mountpoints of active filesystems (or setting a mountpoint over a filesystem location that is currently not empty). Such attempts now bail out with messages like:

cannot unmount '/var/adm': Device busy
cannot mount '/export': directory is not empty

and such. Is it possible to force the new values to be saved into the ZFS dataset properties, so that they take effect upon the next pool import?

I currently work around the harder of such situations with a reboot into a different boot environment, or even into a livecd/failsafe, just so that the needed datasets or paths won't be busy and I can set, verify and apply these mountpoint values. This is not a convenient way to do things :)

Thanks, //Jim Klimov
Re: [zfs-discuss] Forcing ZFS options
On 2012-11-09 18:06, Gregg Wonderly wrote: Do you move the pools between machines, or just on the same physical machine? Could you just use symlinks from the new root to the old root so that the names work until you can reboot? It might be more practical to always use symlinks if you do a lot of moving things around, and then you wouldn't have to figure out how to do the reboot shuffle. Instead, you could just shuffle the symlinks.

No, this concerns datasets within one machine. And symlinks often don't cut it. For example, I recently needed to switch '/var' from an automounted filesystem dataset to a legacy one mounted from /etc/vfstab. I can't set the different mountpoint value (legacy) while the OS is up and using '/var', and I don't seem to have a way to do this during reboot automatically (short of crafting an SMF script that would fire early in the boot sequence - but that's a workaround outside the ZFS technology, as is using the livecd, a failsafe boot or another BE).

A different example is that sometimes careless work with beadm leaves the root dataset with a mountpoint attribute other than '/'. While the proper rootfs is forced to mount at the root node, it is not clean to have the discrepancy. However, I cannot successfully "zfs set mountpoint=/ rpool/ROOT/bename" while booted into this BE.

Forcing the attribute to save the value I need, so that it takes effect after reboot - that's what I am asking for (if that was not clear from my first post).

Thanks, //Jim
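The '/var' case could be scripted roughly like this when run from an environment where the dataset is not busy (an alternate BE or livecd) - a hypothetical sketch; the dataset name and vfstab path are examples, and when booted from media the target BE's vfstab lives under its temporary mountpoint, not at /etc/vfstab:

```shell
# Hypothetical sketch: switch a dataset to a legacy mountpoint and add the
# matching vfstab line. Must run where the dataset is NOT currently in use.
set_legacy_var() {
    ds=$1        # e.g. rpool/ROOT/bename/var
    vfstab=$2    # e.g. /a/etc/vfstab when the target BE is mounted at /a
    zfs set mountpoint=legacy "$ds" || return 1
    # vfstab fields: device, fsck device, mountpoint, type, pass, boot, opts
    printf '%s\t-\t/var\tzfs\t-\tyes\t-\n' "$ds" >> "$vfstab"
}
# Example: set_legacy_var rpool/ROOT/bename/var /a/etc/vfstab
```

This is still the reboot shuffle the post complains about - the point of the RFE is precisely to let "zfs set" record the value without it.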
Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?
On 2012-11-08 05:43, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: you've got a Linux or Windows VM inside of ESX, which is writing to a virtual disk, which ESX is then wrapping up inside NFS and TCP, talking on the virtual LAN to the ZFS server, which unwraps the TCP and NFS, pushes it all through the ZFS/Zpool layer, writing back to the virtual disk that ESX gave it, which is itself a layer on top of Ext3

I think this is the part where you disagree. As I understand all-in-ones, the VM running a ZFS OS enjoys PCI pass-through, so it gets dedicated hardware access to the HBA(s) and hard disks at raw speeds, with no extra layers of lag in between. So there are a couple of OS disks where ESXi itself is installed (plus distros, logging and stuff), and the other disks are managed by ZFS in a VM and served back to ESXi to store the other VMs on the system.

Also, VMware does not (AFAIK) use ext3, but their own VMFS, which is, among other things, cluster-aware (the same storage can be shared by several VMware hosts).

That said, on older ESX (with its minimized RHEL userspace interface), which was picky about only using certified hardware with virt-enabled drivers, I did combine some disks served by the motherboard into a Linux mdadm array (within the RHEL-based management OS) and exported that to the vmkernel over NFS. Back then disk performance was abysmal whatever you did, so the NFS disks were not, after all, used to store virtual disks, but rather distros and backups.

HTH, //Jim Klimov