[zfs-discuss] Trick to keeping NFS file references in kernel memory for Dtrace?
Hey all,

So I have a couple of storage boxes (NexentaCore & Illumian) and have been playing with some DTrace scripts to monitor NFS usage. Initially I ran into the (seemingly common) problem of basically every file path showing up as '<unknown>', and after some searching online I found a workaround: do a 'find' on the file system from the remote end, which refreshes the kernel's knowledge of the files. This works... however it doesn't stay for good. It sometimes lasts a couple of hours (and sometimes much less), and then we are back to receiving '<unknown>'s.

Has anyone else come across something similar? Does anyone know what may be causing the kernel to lose the references? There is plenty of memory in the main system (72 GB, with ARC sitting at ~53 GB and 11 GB 'free'), so I don't think an OOM situation is causing it.

Otherwise, does anyone have any other tips for monitoring usage? I wonder how they have it all working in Fishworks gear, as some of the analytics demos show you being able to drill down through file activity in real time.

Any advice or suggestions greatly appreciated.

Cheers,
Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
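For reference, the kind of per-file drill-down I'm attempting looks roughly like this - a sketch against the illumos nfsv3 DTrace provider (probe and member names as I understand them from the provider docs, so verify on your build); args[1]->noi_curpath is exactly the field that comes back as '<unknown>' when the kernel has no cached path for the vnode:

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Count NFSv3 reads and writes per client and per file path.
 * Paths show up as '<unknown>' when the kernel's vnode path
 * cache has no entry for the file being accessed. */
nfsv3:::op-read-start,
nfsv3:::op-write-start
{
        @ops[args[0]->ci_remote, args[1]->noi_curpath] = count();
}

tick-10sec
{
        printa("%-20s %-50s %@10d\n", @ops);
        trunc(@ops);
}
```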
Re: [zfs-discuss] Making ZIL faster
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Schweiss, Chip
>
> How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.

Temporarily set sync=disabled. Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload.
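To be concrete, a quick and reversible way to run that test might look like this ('tank/data' is a made-up dataset name; sync and logbias are standard per-dataset ZFS properties). Keep in mind that with sync=disabled, acknowledged synchronous writes can be lost on a crash, so only leave it on if the workload tolerates that:

```sh
# Check the current setting, bypass the ZIL for this dataset,
# rerun the workload, then restore the default.
zfs get sync tank/data
zfs set sync=disabled tank/data
# ... rerun your write benchmark; if throughput jumps, the ZIL was the bottleneck ...
zfs set sync=standard tank/data
```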
Re: [zfs-discuss] Changing rpool device paths/drivers
2012-10-03 16:04, Fajar A. Nugraha wrote:
> On Ubuntu + zfsonlinux + root/boot on zfs, the boot script helper is "smart" enough to try all available device nodes, so it wouldn't matter if the dev path/id/name changed. But ONLY if there's no zpool.cache in the initramfs. Not sure how easy it would be to port that functionality to solaris.

Thanks, I thought of zpool.cache too, but it is only listed in /boot/solaris/filelist.safe, which ironically still exists - though proper failsafe archives are not generated anymore. Even bringing those back would be a huge step forward: a locally hosted, self-sufficient, interactive mini OS image in an archive, unpacked and booted by GRUB independently of Solaris's view of the hardware, is much simpler than external live media...

Unfortunately, so far I didn't see ways of fixing the boot procedure short of hacking the binaries by compiling new ones, i.e. I did not find any easily changeable scripted logic. I digress - I did not yet look much further than unpacking the boot archive file itself and inspecting the files there. There are not even any binaries in it, which I'm afraid means the logic is in the kernel monofile... :(

//Jim
Re: [zfs-discuss] Making ZIL faster
To answer your questions more directly, zilstat is what I used to check what the ZIL was doing: http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

While I have added a mirrored log device, I haven't tried adding multiple sets of mirrored log devices, but I think it should work. I believe that a failed unmirrored log device is only a problem if the pool is ungracefully closed before ZFS notices that the log device failed (i.e., simultaneous power failure and log device failure), so mirroring them may not be required.

Tim

On Wed, Oct 3, 2012 at 2:54 PM, Timothy Coalson wrote:
> I found something similar happening when writing over NFS (at significantly lower throughput than available on the system directly), specifically that effectively all data, even asynchronous writes, were being written to the ZIL, which I eventually traced (with help from Richard Elling and others on this list) at least partially to the linux NFS client issuing commit requests before ZFS wanted to write the asynchronous data to a txg. I tried fiddling with zfs_write_limit_override to get more data onto normal vdevs faster, but this reduced performance (perhaps setting a tunable to make ZFS not throttle writes while hitting the write limit could fix that), and didn't cause it to go significantly easier on the ZIL devices. I decided to live with the default behavior, since my main bottleneck is ethernet anyway, and the projected lifespan of the ZIL devices was fairly large due to our workload.
> I did find that setting logbias=throughput on a zfs filesystem caused it to act as though the ZIL devices weren't there, which actually reduced commit times under continuous streaming writes (mostly due to having more throughput for the same amount of data to commit, in large chunks, but the zilstat script also reported less writing to the ZIL blocks (which are allocated from normal vdevs without a ZIL device, or with logbias=throughput) under this condition, so perhaps there is more to the story), so if you have different workloads for different datasets, this could help (since it isn't a poolwide setting). Obviously, small synchronous writes to that zfs filesystem will take a large hit from this setting.
>
> It would be nice if there was a feature in ZFS that could direct small commits to ZIL blocks on log devices, but behave like logbias=throughput for large commits. It would probably need manual tuning, but it would treat SSD log devices more gently, and increase performance for large contiguous writes.
>
> If you can't configure ZFS to write less data to the ZIL, I think a RAM based ZIL device would be a good way to get throughput up higher (and less worries about flash endurance, etc).
>
> Tim
>
> On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip wrote:
>> I'm in the planing stages of a rather larger ZFS system to house approximately 1 PB of data.
>>
>> I have only one system with SSDs for L2ARC and ZIL, The ZIL seems to be the bottle neck for large bursts of data being written. I can't confirm this for sure, but the when throwing enough data at my storage pool and the write latency starts rising, the ZIL write speed hangs close the max sustained throughput I've measured on the SSD (~200 MB/s).
>>
>> The pool when empty w/o L2ARC or ZIL it was tested with Bonnie++ and showed ~1300MB/s serial read and ~800MB/s serial write speed.
>>
>> How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.
>>
>> Thanks for any input,
>> -Chip
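For anyone following along, running Richard Elling's zilstat script looks something like this (a sketch; check the script's own usage text for the exact options on your copy, and note it needs DTrace privileges):

```sh
# Sample ZIL activity once per second for 10 samples on pool 'tank'.
# N-Bytes is what applications asked to commit; B-Bytes is what
# actually landed in ZIL blocks, so a large gap between the two
# hints at logbias or block-allocation effects.
./zilstat.ksh -p tank 1 10
```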
Re: [zfs-discuss] Making ZIL faster
I found something similar happening when writing over NFS (at significantly lower throughput than available on the system directly), specifically that effectively all data, even asynchronous writes, were being written to the ZIL, which I eventually traced (with help from Richard Elling and others on this list) at least partially to the Linux NFS client issuing commit requests before ZFS wanted to write the asynchronous data to a txg. I tried fiddling with zfs_write_limit_override to get more data onto normal vdevs faster, but this reduced performance (perhaps setting a tunable to make ZFS not throttle writes while hitting the write limit could fix that), and it didn't cause it to go significantly easier on the ZIL devices. I decided to live with the default behavior, since my main bottleneck is ethernet anyway, and the projected lifespan of the ZIL devices was fairly large due to our workload.

I did find that setting logbias=throughput on a zfs filesystem caused it to act as though the ZIL devices weren't there, which actually reduced commit times under continuous streaming writes (mostly due to having more throughput for the same amount of data to commit, in large chunks; but the zilstat script also reported less writing to the ZIL blocks - which are allocated from normal vdevs without a ZIL device, or with logbias=throughput - under this condition, so perhaps there is more to the story). So if you have different workloads for different datasets, this could help (since it isn't a pool-wide setting). Obviously, small synchronous writes to that zfs filesystem will take a large hit from this setting.

It would be nice if there were a feature in ZFS that could direct small commits to ZIL blocks on log devices, but behave like logbias=throughput for large commits. It would probably need manual tuning, but it would treat SSD log devices more gently, and increase performance for large contiguous writes.
If you can't configure ZFS to write less data to the ZIL, I think a RAM based ZIL device would be a good way to get throughput up higher (and fewer worries about flash endurance, etc).

Tim

On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip wrote:
> I'm in the planing stages of a rather larger ZFS system to house approximately 1 PB of data.
>
> I have only one system with SSDs for L2ARC and ZIL, The ZIL seems to be the bottle neck for large bursts of data being written. I can't confirm this for sure, but the when throwing enough data at my storage pool and the write latency starts rising, the ZIL write speed hangs close the max sustained throughput I've measured on the SSD (~200 MB/s).
>
> The pool when empty w/o L2ARC or ZIL it was tested with Bonnie++ and showed ~1300MB/s serial read and ~800MB/s serial write speed.
>
> How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.
>
> Thanks for any input,
> -Chip
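The per-dataset tuning described above might look like this ('tank/scratch' and 'tank/db' are invented dataset names; logbias is a real per-dataset ZFS property, with 'latency' as the default):

```sh
# Streaming-write dataset: bypass the slog, commit straight to the pool.
zfs set logbias=throughput tank/scratch
# Latency-sensitive dataset: keep the default, which uses the slog.
zfs set logbias=latency tank/db
# Confirm what each dataset inherited or was set to:
zfs get -r logbias tank
```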
[zfs-discuss] Making ZIL faster
I'm in the planning stages of a rather large ZFS system to house approximately 1 PB of data.

I have only one system with SSDs for L2ARC and ZIL. The ZIL seems to be the bottleneck for large bursts of data being written. I can't confirm this for sure, but when I throw enough data at my storage pool and the write latency starts rising, the ZIL write speed hangs close to the max sustained throughput I've measured on the SSD (~200 MB/s).

The pool, when empty and w/o L2ARC or ZIL, was tested with Bonnie++ and showed ~1300 MB/s serial read and ~800 MB/s serial write speed.

How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.

Thanks for any input,
-Chip
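One way to check where the writes are landing (a sketch; 'tank' is a placeholder pool name) is to watch per-vdev write throughput while a burst is in flight - if the log device sits pinned near its ~200 MB/s ceiling while the data vdevs are mostly idle, the slog is the limiter:

```sh
# Per-vdev I/O statistics, refreshed every second; the 'logs'
# section shows how hard the slog SSDs are being driven
# relative to the data vdevs.
zpool iostat -v tank 1
```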
Re: [zfs-discuss] vm server storage mirror
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
> it doesn't work right - It turns out, iscsi devices (And I presume SAS devices) are not removable storage. That means, if the device goes offline and comes back online again, it doesn't just gracefully resilver and move on without any problems, it's in a perpetual state of IO error, device unreadable.

I am revisiting this issue today. I've tried everything I can think of to recreate this issue, and haven't been able to do it. I have certainly encountered some bad behaviors - which I'll expound upon momentarily - but they all seem to be addressable, fixable, logical problems, and none of them result in a supposedly good pool (as reported in zpool status) returning scsi IO errors or halting the system. The most likely explanation right now, for the bad behavior I saw before (perpetual IO error even after restoring the connection), is that I screwed something up in my iscsi config the first time.

Herein lie the new problems:

If I don't export the pool before rebooting, then either the iscsi target or initiator is shut down before the filesystems are unmounted. So the system spews all sorts of error messages while trying to go down, but it eventually succeeds. It's somewhat important to know whether it was the target or the initiator that went down first - if it was the target, then only the local disks became inaccessible, but if it was the initiator, then both the local and remote disks became inaccessible. I don't know yet.

Upon reboot, the pool fails to import, so the svc:/system/filesystem/local service fails and comes up in maintenance mode. The whole world is a mess; you have to log in at the physical text console to export the pool, and reboot. But it comes up cleanly the second time.

These sorts of problems seem like they should be solvable by introducing some service manifest dependencies...
But there's no way to make it a generalization for the distribution as a whole (illumos/openindiana/oracle). It's just something that should be solvable on a case-by-case basis.

If you are going to be an initiator only, then it makes sense for svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local. If you are going to be a target only, then it makes sense for svc:/system/filesystem/local to be required by svc:/network/iscsi/target.

If you are going to be a target & initiator, then you could get yourself into a deadlock situation. Make the filesystem depend on the initiator, make the initiator depend on the target, and make the target depend on the filesystem. Uh-oh. But we can break that cycle easily enough in a lot of situations. If you're doing as I'm doing, where the only targets are raw devices (not zvols), then it should be ok to make the filesystem depend on the initiator, which depends on the target, and the target doesn't depend on anything. If you're both a target and an initiator, but all of your targets are zvols that you export to other systems (you're not nesting a filesystem in a zvol of your own, are you?), then it's ok to let the target need the filesystem and the filesystem need the initiator, but the initiator doesn't need anything.

So in my case, I'm sharing raw disks, and I'm going to try to make the filesystem need the initiator, the initiator need the target, and the target need nothing. Haven't tried yet... Hopefully google will help accelerate me figuring out how to do that.
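In case it saves someone the googling: adding an SMF dependency by hand generally follows this pattern (a sketch only - the property-group name 'iscsi-init' is arbitrary, and the exact incantation should be checked against svccfg(1M) before relying on it):

```sh
# Make svc:/system/filesystem/local wait for the iscsi initiator.
svccfg -s svc:/system/filesystem/local:default
svc:> addpg iscsi-init dependency
svc:> setprop iscsi-init/entities = fmri: svc:/network/iscsi/initiator
svc:> setprop iscsi-init/grouping = astring: require_all
svc:> setprop iscsi-init/restart_on = astring: none
svc:> setprop iscsi-init/type = astring: service
svc:> exit
svcadm refresh svc:/system/filesystem/local:default
```

Repeat the same pattern for the initiator-needs-target edge, and leave the target depending on nothing, to avoid the cycle described above.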
Re: [zfs-discuss] Failure to zfs destroy - after interrupting zfs receive
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ariel T. Glenn
>
> I have the same issue as described by Ned in his email. I had a zfs recv going that deadlocked against a zfs list; after a day of leaving them hung I finally had to hard reset the box (shutdown wouldn't, since it couldn't terminate the processes). When it came back up, I wanted to zfs destroy that last snapshot but I got the dreaded

For what it's worth - that is precisely the behavior I saw. No "zfs" or "zpool" commands would return, and eventually the system hung badly enough that I had to power cycle. And afterward, I was unable to destroy the filesystem, the snapshot, or any clones. I posted here and didn't get any response... In the end, I had to "zfs send" my filesystem somewhere else, destroy & recreate the pool, and "zfs send" the filesystem back.

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-September/052412.html
Re: [zfs-discuss] Failure to zfs destroy - after interrupting zfs receive
I have the same issue as described by Ned in his email. I had a zfs recv going that deadlocked against a zfs list; after a day of leaving them hung I finally had to hard reset the box (shutdown wouldn't, since it couldn't terminate the processes). When it came back up, I wanted to zfs destroy that last snapshot but I got the dreaded

cannot destroy 'export/upload@partial-2012-10-01_08:00:00': snapshot is cloned

but there are no clones:

root@ms8 # zdb -d export/upload | grep '%'
root@ms8 #

and an attempt to remove what the clone ought to be fails:

zfs destroy export/upload/%partial-2012-10-01_08:00:00
cannot open 'export/upload/%partial-2012-10-01_08:00:00': dataset does not exist

This isn't opensolaris, it's SunOS 5.10 Generic_142901-06 from before Oracle took it over, but that's not going to make any difference as to the bug, I think.

Any ideas?
Re: [zfs-discuss] Changing rpool device paths/drivers
On Wed, Oct 3, 2012 at 5:43 PM, Jim Klimov wrote:
> 2012-10-03 14:40, Ray Arachelian wrote:
>> On 10/03/2012 05:54 AM, Jim Klimov wrote:
>>> Hello all,
>>>
>>> It was often asked and discussed on the list about "how to change rpool HDDs from AHCI to IDE mode" and back, with the modern routine involving reconfiguration of the BIOS, bootup from separate live media, simple import and export of the rpool, and bootup from the rpool.

IIRC when working with xen I had to boot with a live cd, import the pool, then power off (without exporting the pool). Then it can boot. Somewhat in line with what you described.

>>> The documented way is to reinstall the OS upon HW changes. Both are inconvenient to say the least.
>>
>> Any chance to touch /reconfigure, power off, then change the BIOS settings and reboot, like in the old days? Or maybe with passing -r and optionally -s and -v from grub like the old way we used to reconfigure Solaris?
>
> Tried that, does not help. Adding forceloads to /etc/system and remaking the boot archive - also no.

On Ubuntu + zfsonlinux + root/boot on zfs, the boot script helper is "smart" enough to try all available device nodes, so it wouldn't matter if the dev path/id/name changed. But ONLY if there's no zpool.cache in the initramfs. Not sure how easy it would be to port that functionality to solaris.

--
Fajar
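For the Ubuntu/zfsonlinux side of this, checking and clearing the baked-in cache is something like the following (a sketch; the paths are the usual ones on the packaging I've seen, so verify on your own system before deleting anything):

```sh
# See whether a zpool.cache got baked into the current initramfs:
lsinitramfs /boot/initrd.img-$(uname -r) | grep zpool.cache
# If so, remove the cache file and rebuild, so the boot scripts
# fall back to scanning all available device nodes:
rm /etc/zfs/zpool.cache
update-initramfs -u
```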
Re: [zfs-discuss] Changing rpool device paths/drivers
2012-10-03 14:40, Ray Arachelian wrote:
> On 10/03/2012 05:54 AM, Jim Klimov wrote:
>> Hello all,
>>
>> It was often asked and discussed on the list about "how to change rpool HDDs from AHCI to IDE mode" and back, with the modern routine involving reconfiguration of the BIOS, bootup from separate live media, simple import and export of the rpool, and bootup from the rpool. The documented way is to reinstall the OS upon HW changes. Both are inconvenient to say the least.
>
> Any chance to touch /reconfigure, power off, then change the BIOS settings and reboot, like in the old days? Or maybe with passing -r and optionally -s and -v from grub like the old way we used to reconfigure Solaris?

Tried that, does not help. Adding forceloads to /etc/system and remaking the boot archive - also no.

//Jim
Re: [zfs-discuss] Changing rpool device paths/drivers
On 10/03/2012 05:54 AM, Jim Klimov wrote:
> Hello all,
>
> It was often asked and discussed on the list about "how to change rpool HDDs from AHCI to IDE mode" and back, with the modern routine involving reconfiguration of the BIOS, bootup from separate live media, simple import and export of the rpool, and bootup from the rpool. The documented way is to reinstall the OS upon HW changes. Both are inconvenient to say the least.

Any chance to touch /reconfigure, power off, then change the BIOS settings and reboot, like in the old days? Or maybe with passing -r and optionally -s and -v from grub like the old way we used to reconfigure Solaris?
[zfs-discuss] Changing rpool device paths/drivers
Hello all,

It was often asked and discussed on the list "how to change rpool HDDs from AHCI to IDE mode" and back, with the modern routine involving reconfiguration of the BIOS, bootup from separate live media, simple import and export of the rpool, and bootup from the rpool. The documented way is to reinstall the OS upon HW changes. Both are inconvenient to say the least. Linux and recent Windows are much more forgiving of total changes of hardware underneath the OS image between boots; they just boot up and work. Why do we shoot ourselves in the foot with this boot-up problem?

Now that I'm trying to dual-boot my OI-based system, I hit the problem hard: I have either a HW SATA (AMD Hudson, often not recognized upon bootup, but that's another story) and a VirtualBox SATA on different pci dev/vendor IDs, or physical and virtual IDE which result in the same device path to cmdk and pci-ide - so I'm stuck with IDE mode, at least for these compatibility reasons.

So the basic question is: WHY does the OS want to use the device path (/pci... string) coded into the rpool's vdevs mid-way through bootup, during the vfs root-import routine, and fail with a panic if the device naming changed, when the loader (GRUB), for example, already had no problem reading the same rpool? Is there any rationale or historic baggage behind this situation? Is it a design error or an oversight?

Isn't it possible to use the same routine as for other pool imports, including import of this same rpool from a live-media boot - just find the component devices (starting with the one passed by the loader and/or matching by pool name and/or GUID) and import the resulting pool? Perhaps this could be attempted if the current method fails, before reverting to a kernel panic - try another method first.

Would this be a sane thing to change, or are there known beasts lurking in the dark?
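For anyone who hasn't suffered through the routine I'm complaining about, after booting the live media it boils down to something like this (a sketch; 'rpool' and the '-R /a' altroot are the usual conventions, but check zpool(1M) on your build):

```sh
# From the live environment: force-import the root pool under an
# altroot so its vdev labels get rewritten with the new device paths...
zpool import -f -R /a rpool
# ...then export cleanly and reboot from the rpool with the new paths.
zpool export rpool
reboot
```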
Thanks,
//Jim Klimov