Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:

This is not a new problem. It seems that I have been banging my head against this from the time I started using zfs.

I'd like to see mpstat 1 for each case, on an otherwise idle system, but then there's probably a whole lot of dtrace I'd like to do ... but I'm just off on vacation for a week, and this will probably have to be my last post on this thread until I'm back.

Shame on you for taking well-earned vacation in my time of need. :-)

'mpstat 1' output when I/O is good:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
00 00 1700 247 2187 11 214 110 102702 5 0 93
10 00 14785 2812 18 241 100 184242 4 0 94
20 01 12100 2392 60 185 190 3019275 28 0 67
30 00 3242 2320 2028 60 18190 2225003 24 0 73
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
00 00 1862 244 25549 23160 28802 3 0 95
10 00 11581 2055 17 22170 44791 3 0 96
20 00 10370 2051 65 186 140 2502114 24 0 73
30 00 3037 2167 2101 62 186 110 2513934 25 0 71

'mpstat 1' output when I/O is bad:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
00 00 859 243 10065 10600 207332 3 0 95
10 00 504 15 942 12 8460 740093 6 0 91
20 00 1920 3380 4800380 1 0 99
30 00 549 376 5221 3600 1350 2 0 98

Notice how intensely unbusy the CPU cores are when I/O is bad.

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:

However, it seems that memory mapping is not responsible for the problem I am seeing here. Memory mapping may make the problem seem worse, but it is clearly not the cause.

mmap(2) is what brings ZFS files into the page cache. I think you've shown us that once you've copied files with cp(1) - which does use mmap(2) - that anything that uses read(2) on the same files is impacted.

The problem is observed with cpio, which does not use mmap. This is immediately after a reboot or unmount/mount of the filesystem.

Sorry, I didn't get to your other post ...

Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 with latest firmware. I rebooted the system, used cpio to send the input files to /dev/null, and then immediately used cpio a second time to send the input files to /dev/null. Note that the amount of file data (243 GB) is plenty sufficient to purge any file data from the ARC (which has a cap of 10 GB).

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 1.573 total
cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt  0.00s user 0.00s system 0% cpu 0.198 total
cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total

zpool iostat averaged over 60 seconds reported that the first run through the files read the data at 251 MB/s and the second run only achieved 68 MB/s. It seems clear that there is something really bad about Solaris 10 zfs's file caching code which is causing it to go into the weeds.

I don't think that the results mean much, but I have attached output from 'hotkernel' while a subsequent cpio copy is taking place. It shows that the kernel is mostly sleeping.

This is not a new problem. It seems that I have been banging my head against this from the time I started using zfs.

I'd like to see mpstat 1 for each case, on an otherwise idle system, but then there's probably a whole lot of dtrace I'd like to do ... but I'm just off on vacation for a week, and this will probably have to be my last post on this thread until I'm back.

Cheers, Phil
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009 13:03:52 -0500 (CDT) Bob Friesenhahn wrote: > On Sat, 4 Jul 2009, Joerg Schilling wrote: > > Did you try to use highly performant software like star? > > No, because I don't want to tarnish your software's stellar > reputation. I am focusing on Solaris 10 bugs today. Blunt. -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | nevada / OpenSolaris 2009.06 release + All that's really worth doing is what we do for others (Lewis Carrol) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote: this is only going to help if you've got problems in zfetch .. you'd probably see this better by looking for high lock contention in zfetch with lockstat This is what lockstat says when performance is poor: Adaptive mutex spin: 477 events in 30.019 seconds (16 events/sec) Count indv cuml rcnt nsec Lock Caller --- 47 10% 10% 0.00 5813 0x80256000 untimeout+0x24 46 10% 19% 0.00 2223 0xb0a2f200 taskq_thread+0xe3 38 8% 27% 0.00 2252 0xb0a2f200 cv_wait+0x70 29 6% 34% 0.00 1115 0x80256000 callout_execute+0xeb 26 5% 39% 0.00 3006 0xb0a2f200 taskq_dispatch+0x1b8 22 5% 44% 0.00 1200 0xa06158c0 post_syscall+0x206 18 4% 47% 0.00 3858 arc_eviction_mtx arc_do_user_evicts+0x76 16 3% 51% 0.00 1352 arc_eviction_mtx arc_buf_add_ref+0x2d 15 3% 54% 0.00 5376 0xb1adac28 taskq_thread+0xe3 11 2% 56% 0.00 2520 0xb1adac28 taskq_dispatch+0x1b8 9 2% 58% 0.00 2158 0xbb909e20 pollwakeup+0x116 9 2% 60% 0.00 2431 0xb1adac28 cv_wait+0x70 8 2% 62% 0.00 3912 0x80259000 untimeout+0x24 7 1% 63% 0.00 3679 0xb10dfbc0 polllock+0x3f 7 1% 65% 0.00 2171 0xb0a2f2d8 cv_wait+0x70 6 1% 66% 0.00 771 0xb3f23708 pcache_delete_fd+0xac 6 1% 67% 0.00 4679 0xb0a2f2d8 taskq_dispatch+0x1b8 5 1% 68% 0.00 500 0xbe555040 fifo_read+0xf8 5 1% 69% 0.0015838 0x8025c000 untimeout+0x24 4 1% 70% 0.00 1213 0xac44b558 sd_initpkt_for_buf+0x110 4 1% 71% 0.00 638 0xa28722a0 polllock+0x3f 4 1% 72% 0.00 610 0x80259000 timeout_common+0x39 4 1% 73% 0.0010691 0x80256000 timeout_common+0x39 3 1% 73% 0.00 1559 htable_mutex+0x78 htable_release+0x8a 3 1% 74% 0.00 3610 0xbb909e20 cv_timedwait_sig+0x1c1 3 1% 74% 0.00 1636 0xa240d410 ohci_allocate_periodic_in_resource+0x71 2 0% 75% 0.00 5959 0xbe555040 fifo_read+0x5c 2 0% 75% 0.00 3744 0xbe555040 polllock+0x3f 2 0% 76% 0.00 635 0xb3f23708 pollwakeup+0x116 2 0% 76% 0.00 709 0xb3f23708 cv_timedwait_sig+0x1c1 2 0% 77% 0.00 831 0xb3dd2070 pcache_insert+0x13d 2 0% 77% 0.00 5976 0xb3dd2070 pollwakeup+0x116 2 0% 77% 0.00 1339 0xb1eb9b80 metaslab_group_alloc+0x136 2 0% 78% 0.00 1514 0xb0a2f2d8 taskq_thread+0xe3 2 0% 78% 0.00 4042 0xb0a22988 vdev_queue_io_done+0xc3 2 0% 79% 0.00 3428 0xb0a21f08 vdev_queue_io_done+0xc3 2 0% 79% 0.00 1002 0xac44b558 sd_core_iostart+0x37 2 0% 79% 0.00 1387 0xa8c56d80 xbuf_iostart+0x7d 2 0% 80% 0.00 698 0xa58a3318 sd_return_command+0x11b 2 0% 80% 0.00 385 0xa58a3318 sd_start_cmds+0x115 2 0% 81% 0.00 562 0xa5647800 ssfcp_scsi_start+0x30 2 0% 81% 0.00 1620 0xa4162d58 ssfcp_scsi_init_pkt+0x1be 2 0% 82% 0.00 897 0xa4162d58 ssfcp_scsi_start+0x42 2 0% 82% 0.00 475 0xa4162b78 ssfcp_scsi_start+0x42 2 0% 82% 0.00 697 0xa40fb158 sd_start_cmds+0x115 2 0% 83% 0.0010901 0xa28722a0 fifo_write+0x5b 2 0% 83% 0.00 4379 0xa28722a0 fifo_read+0xf8 2 0% 84% 0.00 1534 0xa2638390 emlxs_tx_get+0x38 2 0% 84% 0.00 1601 0xa2638350 emlxs_issue_iocb_cmd+0xc1 2 0% 84% 0.00 6697 0xa2503f08 vdev_queue_io_done+0x7b 2 0% 85% 0.00 4113 0xa24040b0 gcpu_ntv_mca_poll_wrapper+0x64 2 0% 85% 0.00 928 0xfe85dc140658 pollwakeup+0x116 1 0% 86% 0.00 404 iommulib_lock lookup_cache+0x2c 1 0% 86% 0.00 4867 pidlockthread_exit+0x6f 1 0% 86% 0.00 1245 plocks+0x3c0 pollhead_delete+0x23 1 0% 86% 0.00 2452 plocks+0x3c0 pollhead_insert+0x35 1 0% 86% 0.00 882 htable_mutex+0x3c0 htable_lookup+0x83 1 0% 87% 0.0028547 htable_mutex+0x3c0 htable_create+0xe3 1 0% 87% 0.0021173 htable_mutex+0x3c0 htable_release+0x8a 1 0% 87% 0.00 1235 htable_mutex+0x370 htable_lookup+0x83 1 0% 87% 0.00 3212 htable_mutex+0x370 htable_release+0x8a 1 0% 87% 0.00 793 htable_mutex+0x78 htable_lookup+0x83 1 0%
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote: However, it seems that memory mapping is not responsible for the problem I am seeing here. Memory mapping may make the problem seem worse, but it is clearly not the cause. mmap(2) is what brings ZFS files into the page cache. I think you've shown us that once you've copied files with cp(1) - which does use mmap(2) - that anything that uses read(2) on the same files is impacted. The problem is observed with cpio, which does not use mmap. This is immediately after a reboot or unmount/mount of the filesystem. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:

However, this is only part of the problem. The fundamental issue is that ZFS has its own ARC apart from the Solaris page cache, so whenever mmap() is used, all I/O to that file has to make sure that the two caches are in sync. Hence, a read(2) on a file which has at some time been mapped will be impacted, even if the file is no longer mapped.

However, it seems that memory mapping is not responsible for the problem I am seeing here. Memory mapping may make the problem seem worse, but it is clearly not the cause.

mmap(2) is what brings ZFS files into the page cache. I think you've shown us that once you've copied files with cp(1) - which does use mmap(2) - that anything that uses read(2) on the same files is impacted.

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman wrote:

> I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0)
> was the first UNIX to get a working implementation of mmap(2) for files
> (if I recall correctly, BSD 4.3 had a manpage but no implementation for
> files). From that we got a whole lot of cool stuff, not least dynamic
> linking with ld.so (which has made it just about everywhere).

Well, on BSD you could mmap() devices, but because there was no useful address space management you first had to malloc() the amount of space, forcing you to have the same amount of memory available as swap. Later, the device was mapped on top of the allocated memory, which made the underlying swap space inaccessible. We had to add expensive amounts of swap at the time in order to be able to mmap the 256 MB of RAM from our image processor at Berthold AG.

> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather than
> just to support dynamic linking).

When the new memory management architecture was introduced with SunOS-4.0, things became better, although the now unified and partially anonymized address space made it hard to implement "limit memoryuse" (rlimit with RLIMIT_RSS). I made a working implementation for SunOS-4.0 but this did not make it into SunOS.

There are still related performance issues. If you e.g. store a CD/DVD/BluRay image in /tmp that is bigger than the amount of RAM in the machine, you will observe a buffer underrun while writing with cdrecord unless you use driveropts=burnfree, because paging in is slow on tmpfs.

> Software engineering is always about prioritising resource. Nothing
> prioritises performance tuning attention quite like compelling
> competitive data. When Bart Smaalders and I wrote libMicro we generated
> a lot of very compelling data. I also coined the phrase "If Linux is
> faster, it's a Solaris bug". You will find quite a few (mostly fixed)
> bugs with the synopsis "linux is faster than solaris at ...".

Fortunately, Linux is slower with most tasks ;-)

In 1988, the effect of mmap() was much more visible than it is now. 20 years ago, CPU speed limited copy operations, making pipes, copyout() and the like slow. This changed with modern CPUs, and for this reason the demand for using mmap() is lower than it was 20 years ago.

> So, if mmap(2) playing nicely with ZFS is important to you, probably the
> best thing you can do to help that along is to provide data that will
> help build the business case for spending engineering resource on the issue.

I would be interested to see an open(2) flag that tells the system that I will read a file that I opened exactly once, in native order. This could tell the system to do read ahead and to later mark the pages as immediately reusable. This would make star even faster than it is now.

Jörg
-- 
EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
On Jul 4, 2009, at 14:30, Miles Nordin wrote: yes, which is why it's worth suspecting knfsd as well. However I don't think you can sell a Solaris system that performs 1/3 as well on better hardware without a real test case showing the fast system's broken. It should be noted that RAID-0 performs better than any other RAID level. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
> "rw" == Ross Walker writes: rw> Barriers are by default are disabled on ext3 mounts... http://lwn.net/Articles/283161/ https://bugzilla.redhat.com/show_bug.cgi?id=458936 enabled by default on SLES. to enable on other distro: mount -t ext3 -o barrier=1 (no help if using LVM2) rw> A lot of decisions in Linux are made in favor of performance rw> over data consistency. yes, which is why it's worth suspecting knfsd as well. However I don't think you can sell a Solaris system that performs 1/3 as well on better hardware without a real test case showing the fast system's broken. The fantastic thing to my view (and I'm NOT being sarcastic) is that, if the fast system really is broken, you've the option of breaking ZFS to match its performance (by disabling the ZIL). And after you've done this, ZFS is still ahead of the old system because your cheat hasn't put you at greater risk of corrupting the whole pool, while the other system's cheating _has_. ...ZFS may still be more fragile overall, but as a tradeoff, it's interesting. But having this choice puts you in the position of really wanting to know for sure if the other system's broken before you cripple your own system perhaps destroying your reputation... when you find out some ``unfair'' corner case was rescuing the old system: ex. suppose contiguous journal doesn't get reordered by the drive, and disk write buffers still get flushed if OS crashes which turns out to be the common failure mode, not cord-yanking. pgpw8otTlYKfG.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:

However, this is only part of the problem. The fundamental issue is that ZFS has its own ARC apart from the Solaris page cache, so whenever mmap() is used, all I/O to that file has to make sure that the two caches are in sync. Hence, a read(2) on a file which has at some time been mapped will be impacted, even if the file is no longer mapped.

However, it seems that memory mapping is not responsible for the problem I am seeing here. Memory mapping may make the problem seem worse, but it is clearly not the cause.

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 with latest firmware. I rebooted the system used cpio to send the input files to /dev/null, and then immediately used cpio a second time to send the input files to /dev/null. Note that the amount of file data (243 GB) is plenty sufficient to purge any file data from the ARC (which has a cap of 10 GB). % time cat dpx-files.txt | cpio -o > /dev/null 495713288 blocks cat dpx-files.txt 0.00s user 0.00s system 0% cpu 1.573 total cpio -o > /dev/null 78.92s user 360.55s system 43% cpu 16:59.48 total % time cat dpx-files.txt | cpio -o > /dev/null 495713288 blocks cat dpx-files.txt 0.00s user 0.00s system 0% cpu 0.198 total cpio -o > /dev/null 79.92s user 358.75s system 11% cpu 1:01:05.88 total zpool iostat averaged over 60 seconds reported that the first run through the files read the data at 251 MB/s and the second run only achieved 68 MB/s. It seems clear that there is something really bad about Solaris 10 zfs's file caching code which is causing it to go into the weeds. I don't think that the results mean much, but I have attached output from 'hotkernel' while a subsequent cpio copy is taking place. It shows that the kernel is mostly sleeping. This is not a new problem. It seems that I have been banging my head against this from the time I started using zfs. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ Sampling... Hit Ctrl-C to end. FUNCTIONCOUNT PCNT unix`SHA1Update 1 0.0% unix`page_unlock1 0.0% unix`lwp_segregs_save 1 0.0% rootnex`rootnex_dma_allochdl1 0.0% unix`mutex_delay_default1 0.0% emlxs`emlxs_initialize_pkt 1 0.0% genunix`pid_lookup 1 0.0% TS`ts_setrun1 0.0% fcp`ssfcp_adjust_cmd1 0.0% genunix`strrput 1 0.0% genunix`cyclic_softint 1 0.0% genunix`fop_poll1 0.0% sd`sd_xbuf_strategy 1 0.0% ohci`ohci_state_is_operational 1 0.0% zfs`SHA256Transform 1 0.0% unix`cpu_resched1 0.0% nvidia`_nv006110rm 1 0.0% genunix`lwp_timer_timeout 1 0.0% genunix`realtime_timeout1 0.0% fcp`ssfcp_scsi_destroy_pkt 1 0.0% nvidia`nvidia_pci_check_config_space1 0.0% genunix`closef 1 0.0% sd`sd_setup_rw_pkt 1 0.0% unix`vsnprintf 1 0.0% zfs`vdev_dtl_contains 1 0.0% genunix`siginfo_kto32 1 0.0% iommulib`iommulib_nex_open 1 0.0% genunix`vn_has_cached_data 1 0.0% ohci`ohci_sendup_td_message 1 0.0% scsi_vhci`vhci_scsi_destroy_pkt 1 0.0% genunix`avl_add 1 0.0% unix`page_create_va 1 0.0% genunix`savectx 1 0.0% ohci`ohci_root_hub_allocate_intr_pipe_resource 1 0.0% unix`page_add 1 0.0% zfs`zfs_unix_to_v4 1 0.0% genunix`set_qend1 0.0% zfs`vdev_queue_io_done 1 0.0% unix`set_idle_cpu 1 0.0% zfs`vdev_cache_read 1 0.0% nvidia`_nv002998rm 1 0.0% ohci`ohci_do_intrs_stats1 0.0% genunix`putq1 0.0% genunix`strput 1 0.0% zfs`zio_buf_alloc 1 0.0% sockfs`socktpi_poll 1 0.0% sockfs`so_update_attrs 1 0.0% sockfs`so_unlock_read
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Gary Mills wrote: On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage. Is there anything that we can be do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS? I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0) was the first UNIX to get a working implementation of mmap(2) for files (if I recall correctly, BSD 4.3 had a manpage but no implementation for files). From that we got a whole lot of cool stuff, not least dynamic linking with ld.so (which has made it just about everywhere). The Solaris implementation of mmap(2) is functionally correct, but the wait for a 64 bit address space rather moved the attention of performance tuning elsewhere. I must admit I was surprised to see so much code out there that still uses mmap(2) for general I/O (rather than just to support dynamic linking). Software engineering is always about prioritising resource. Nothing prioritises performance tuning attention quite like compelling competitive data. When Bart Smaalders and I wrote libMicro we generated a lot of very compelling data. I also coined the phrase "If Linux is faster, it's a Solaris bug". You will find quite a few (mostly fixed) bugs with the synopsis "linux is faster than solaris at ...". So, if mmap(2) playing nicely with ZFS is important to you, probably the best thing you can do to help that along is to provide data that will help build the business case for spending engineering resource on the issue. Cheers, Phil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

This brings me to the absurd conclusion that the system must be rebooted immediately prior to each use.

see Phil's later email .. an export/import of the pool or a remount of the filesystem should clear the page cache - with mmap'd files you've essentially got them both in the page cache and also in the ARC .. then invalidations in the page cache are going to have effects on dirty data in the cache

/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5

if you're on x86 - i'd also increase maxphys to 128K .. we still have a 56KB default value in there which is still a bad thing (IMO)

---
.je
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
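For anyone who wants to try the maxphys suggestion, a sketch of the /etc/system entry (128K shown as an example; it only takes effect after a reboot):

  * raise the maximum physical I/O size from the small x86 default
  set maxphys=0x20000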
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
> >> by more than half. Based on yesterday's experience, that may diminish
> >> to only 33 MB/s.
> >
> > "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
> >
> > is nearly 40% faster than
> >
> > "find . | cpio -pdum to-dir"
> >
> > Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar
> reputation. I am focusing on Solaris 10 bugs today.

I've seen more professional replies. In the end it is your decision to ignore helpful advice.

BTW: if star on ZFS were not faster than cpio, that would just be a hint of a problem in ZFS that needs to be fixed.

Jörg
-- 
EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

On Sat, 4 Jul 2009, Phil Harman wrote:

If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system.

Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows that the cpio read rate has increased from (the most recently observed) 33 MB/second to as much as 132 MB/second. To some this may not seem significant but to me it looks a whole lot different. ;-)

Thanks, that's really useful data. I wasn't near a machine at the time, so I couldn't do it for myself. I answered your initial question based on what I understood of the implementation, and it's very satisfying to have the data to back it up.

I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris.

Hopefully the fix is very very good. It is difficult to displace the many years of SunOS training that using mmap is the path to best performance. Mmap provides many tools to improve application performance which are just not available via traditional I/O.

The part of the problem I highlighted was ...

6699438 zfs induces crosscall storm under heavy mapped sequential read

This has been fixed in OpenSolaris, and should be fixed in Solaris 10 update 8.

However, this is only part of the problem. The fundamental issue is that ZFS has its own ARC apart from the Solaris page cache, so whenever mmap() is used, all I/O to that file has to make sure that the two caches are in sync. Hence, a read(2) on a file which has at some time been mapped will be impacted, even if the file is no longer mapped.

I'm sure the data and interest from this thread will be useful to the ZFS team in prioritising further performance enhancements. So thanks again. And if there's any more useful data you can add, please do so. If you have a support contract, you might also consider logging a call and even raising an escalation request.

Cheers, Phil

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
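For anyone wanting to check whether the crosscall storm from 6699438 is in play while a copy is running, a quick sketch using the stock sysinfo provider (run as root; this is only a way to watch the symptom, not a statement about the fix):

  # count cross-calls per second, broken down by the process triggering them
  dtrace -n 'sysinfo:::xcalls { @[execname] = count(); } tick-1s { printa(@); trunc(@); }'

The xcal column of mpstat 1 gives much the same picture with less typing.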
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Joerg Schilling wrote: by more than half. Based on yesterday's experience, that may diminish to only 33 MB/s. "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir" is nearly 40% faster than "find . | cpio -pdum to-dir" Did you try to use highly performant software like star? No, because I don't want to tarnish your software's stellar reputation. I am focusing on Solaris 10 bugs today. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Joerg Schilling wrote:

Phil Harman wrote:

ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync.

cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there ZFS will be slow for those files, even if you subsequently use read(2) to access them.

If you reboot, your cpio(1) tests will probably go fast again, until

Do you believe that reboot is the only way to reset this?

No, but from my iPod I didn't have the patience to write a fuller explanation :)

See ... http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_vnops.c#514

We take the long path if the vnode has any pages cached in the page cache. So instead of a reboot, you should also be able to export/import the pool or unmount/mount the filesystem. Also, if you didn't touch the file for a long time, and had lots of other page cache churn, the file might eventually get expunged from the page cache.

Phil
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
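Translated into commands, a minimal sketch of resetting the test without a reboot (pool and filesystem names are placeholders; nothing must be holding the filesystem busy):

  # re-mount just the one filesystem ...
  zfs unmount tank/data && zfs mount tank/data

  # ... or bounce the whole pool
  zpool export tank && zpool import tank

Either should empty the page cache for those vnodes, so the next cpio run takes the fast read(2) path again.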
Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?
Ian Collins wrote:

Ross wrote:

Is that accounting for ZFS overhead? I thought it was more than that (but of course, it's great news if not) :-)

A raidz2 pool with 8 500G drives showed 2.67GB free.

Same here. The ZFS overhead appears to be much smaller than similar UFS filesystems. E.g. on 500GB Hitachi drives:

Total disk sectors available: 976743646 + 16384 (reserved sectors)

Part        Tag   Flag   First Sector       Size   Last Sector
  0         usr    wm            256   465.75GB      976743646
  1  unassigned    wm              0          0              0
  2  unassigned    wm              0          0              0
  3  unassigned    wm              0          0              0
  4  unassigned    wm              0          0              0
  5  unassigned    wm              0          0              0
  6  unassigned    wm              0          0              0
  8    reserved    wm      976743647     8.00MB      976760030

This is with an EFI label, which reports almost EXACTLY the amount expected (500GB = 465GiB). I'm using them in a 4-disk RAID-Z, so I lose 1 disk to parity. The info is:

# zpool list
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
data  1.81T  1.75T  64.0G   96%  ONLINE  -

# zfs list data
NAME   USED  AVAIL  REFER  MOUNTPOINT
data  1.31T  26.2G  41.2G  /data

# df -k /data
Filesystem    kbytes      used      avail  capacity  Mounted on
data          1433069568  43178074  2747955962%      /data

Given the numbers, I would expect 3 x 465.75GB = 3 x 488374272kB = 1465122816 kB. So, 'df' reports my RAID-Z as being 2.18% smaller than the aggregate raw disk partition size.

If the same numbers hold up for you, with 8 x 1.5TB in a RAID-Z:

1.5TB ~ 1.364TiB
7 x 1.364TiB ~ 9.546TiB

Lose 2.2% for ZFS overhead:

9.546TiB x 0.978 ~ 9.34 TiB

That's today's math lesson! :-)

-- 
Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
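The same arithmetic as a one-liner, for anyone sizing their own pool (bc is only being used as a calculator here; 0.978 is the overhead factor measured above, and the result comes out around 9.3):

  # 8 x 1.5TB raidz: 7 data drives, vendor bytes -> TiB, minus ~2.2% ZFS overhead
  echo 'scale=2; 7 * 1.5 * 10^12 / 2^40 * 0.978' | bc -l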
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > A tar pipeline still provides terrible file copy performance. Read > bandwidth is only 26 MB. So I stopped the tar copy and re-tried the > cpio copy. > > A second copy with the cpio results in a read/write data rate of only > 54.9 MB/s (vs the just experienced 132 MB/s). Performance is reduced > by more than half. Based on yesterday's experience, that may diminish > to only 33 MB/s. "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir" is nearly 40% faster than "find . | cpio -pdum to-dir" Did you try to use highly performant software like star? Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
A tar pipeline still provides terrible file copy performance. Read bandwidth is only 26 MB/s. So I stopped the tar copy and re-tried the cpio copy.

A second copy with the cpio results in a read/write data rate of only 54.9 MB/s (vs the just experienced 132 MB/s). Performance is reduced by more than half. Based on yesterday's experience, that may diminish to only 33 MB/s.

The amount of data being copied is much larger than any cache, yet somehow reading a file a second time is less than 1/2 as fast. This brings me to the absurd conclusion that the system must be rebooted immediately prior to each use.

/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
On Sat, Jul 4, 2009 at 1:38 AM, James Lever wrote:
>
> On 04/07/2009, at 1:49 PM, Ross Walker wrote:
>
>> I ran some benchmarks back when verifying this, but didn't keep them
>> unfortunately.
>>
>> You can google: XFS Barrier LVM OR EVMS and see the threads about this.
>
> Interesting reading. Testing seems to show that either it's not relevant or
> there is something interesting going on with ext3 as a separate case.

Barriers are disabled by default on ext3 mounts... Google it and you'll see interesting threads in the LKML. Seems there was some serious performance degradation in using them. A lot of decisions in Linux are made in favor of performance over data consistency.

>> When you do send me a copy, try both on a straight partition then on a
>> LVM volume and always use NFS sync, but when exporting use the
>> no_wdelay option if you don't already that eliminates slow downs with
>> NFS sync on Linux.
>
> The numbers below seem to indicate that either there is no barrier issues
> here, or the BBWC in the raid controller makes them more-or-less invisible
> as the ext3fs volume below is directly onto the exposed LUN while the xfs
> partition is on top of LVM2.
>
> It does, however, show that xfs is much faster for deletes.

Actually it's LVM/EVMS that hides the barrier performance problems, because they act as a barrier filter (they don't support barriers), so running on LVM/EVMS shows great performance. But that is also the #1 reason people complain that XFS isn't reliable during a system failure, which is all because logging isn't done properly without barriers!

-Ross
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.

That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage.

Is there anything that we can do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS?

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
On Sat, 4 Jul 2009, James Lever wrote:

Any insightful observations?

Probably multiple slog devices are used to expand slog size and are not used in parallel, since that would require somehow knowing the order. The principal bottleneck is likely the update rate of the first device in the chain, followed by the update rate of the underlying disks. If you put the ramdisk first in the slog chain, the performance is likely to jump.

Note that using the non-volatile log device is just a way to defer the writes to the underlying device, and the writes need to occur eventually or else the slog will fill up. Ideally the writes to the underlying devices can be ordered more sequentially for better throughput, or else the gain will be short-lived since the slog will fill up.

If you do a search, you will find that others have reported less than hoped for performance with these Samsung SSDs.

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
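A rough way to see which log device is actually absorbing the synchronous writes, rather than guessing (nothing here is specific to the poster's pool; substitute your own pool name):

  # confirm how the log devices ended up arranged in the pool
  zpool status -v mypool

  # then watch per-device write activity while the NFS load is running;
  # whichever slog is the bottleneck shows high %b and asvc_t
  iostat -xnz 1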
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote: If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system. Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows that the cpio read rate has increased from (the most recently observed) 33 MB/second to as much as 132 MB/second. To some this may not seem significant but to me it looks a whole lot different. ;-) I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris. Hopefully the fix is very very good. It is difficult to displace the many years of SunOS training that using mmap is the path to best performance. Mmap provides many tools to improve application performance which are just not available via traditional I/O. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:

ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync.

cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there ZFS will be slow for those files, even if you subsequently use read(2) to access them.

This is very interesting information and certainly can explain a lot. My application has a choice of using mmap or traditional I/O. I often use mmap. From what you are saying, using mmap is poison to subsequent performance.

On June 29th I tested my application (which was set to use mmap) shortly after a reboot and got this overall initial runtime:

real 2:24:25.675
user 4:38:57.837
sys 14:30.823

By June 30th (with no intermediate reboot) the overall runtime had increased to

real 3:08:58.941
user 4:38:38.192
sys 15:44.197

which seems like quite a large change.

If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure.

I will test.

The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five.

Actually I found that the disks were writing flat out for five seconds at a time, which stalled all other pool I/O (and dependent CPU) for at least three seconds (see earlier discussion). So at the moment I have zfs_write_limit_override set to 2684354560 so that the write cycle is more on the order of one second in every five.

Bob
-- 
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
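For experimenting with that write-cycle length, the tunable can also be changed on a live system rather than through /etc/system. A sketch, with the usual caveats: mdb -kw pokes kernel memory, and the variable only exists on builds that have this form of the write throttle.

  # set zfs_write_limit_override to 2684354560 bytes (0xA0000000) on the fly
  echo 'zfs_write_limit_override/Z 0xa0000000' | mdb -kw

  # read the current value back
  echo 'zfs_write_limit_override/J' | mdb -k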
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote: somehow i don't think that reading the first 64MB off (presumably) off a raw disk device 3 times and picking the middle value is going to give you much useful information on the overall state of the disks .. i believe this was more of a quick hack to just validate that there's nothing too far out of the norm, but with that said - what's the c2 and c3 device above? you've got to be caching the heck out of that to get that unbelievable 13 GB/s - so you're really only seeing memory speeds there Agreed. It is just a quick sanity check. I think that the c2 and c3 devices are speedy USB drives. more useful information would be something more like the old taz or some of the disk IO latency tools when you're driving a workload. What I see from 'iostat -cx' is a low latency (<= 4 ms) and low workload while the data is being read, and then (periodically) a burst of write data with much higher latency (40-64ms svc_t). The write burst does not take long so it is clear that reading is the bottleneck. if you're using LUNs off an array - this might be another case of the zfs_vdev_max_pending being tuned more for direct attach drives .. you could be trying to queue up too much I/O against the RAID controller, particularly if the RAID controller is also trying to prefetch out of it's cache. I have played with zfs_vdev_max_pending before. It does dial down the latency pretty linearly during the write phase (e.g. 35 queued I/Os results in 64 ms svc_t). you might want to dtrace this to break down where the latency is occuring .. eg: is this a DNLC caching problem, ARC problem, or device level problem also - is this really coming off a 2540? if so - you should probably investigate the array throughput numbers and what's happening on the RAID controller .. i typically find it helpful to understand what the raw hardware is capable of (hence tools like vdbench to drive an anticipated load before i configure anything) - and then attempting to configure the various tunables to match after that Yes, this comes off of a 2540. I used iozone for testing and see that through zfs, the hardware is able to write a 64GB file at 380 MB/s and read at 551 MB/s. Unfortunately, this does not seem to translate well for the actual task. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
> copies into the Solaris page cache. As long as they remain there ZFS
> will be slow for those files, even if you subsequently use read(2) to
> access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until

Do you believe that reboot is the only way to reset this?

> someone uses mmap(2) on the files again. I think tar(1) uses read(2),
> but from my iPod I can't be sure. It would be interesting to see how
> tar(1) performs if you run that test before cp(1) on a freshly
> rebooted system.

There are many tar implementations. The oldest is the UNIX tar implementation from around 1978, the next was star from 1982, then there is GNU tar from 1987. Star forks into two processes that are connected via shared memory in order to speed things up.

If you compare the copy speed of star and cp on UFS, and if you tell star to be as unreliable as cp (by specifying the star option -no-fsync), star will do the job 30% faster than cp does, even though star does not use mmap. Copying with Sun's tar is a tick faster than using cp and it is a bit more accurate. GNU tar is not better than Sun's tar.

If you are looking for the best speed, use:

star -copy -no-fsync -C from-dir . to-dir

and set e.g. bs=1m fs=128m.

Jörg
-- 
EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 03:48, Phil Harman wrote: The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five. The value of 5 seconds is no longer a hard stop since SNV 87. Since snv_87 (and S10u6) it can be up to 30 seconds (but it does shoot for 5 seconds): http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205 See the 20-Mar-2008 change for txg.c for details. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 12:03 AM, Bob Friesenhahn wrote: % ./diskqual.sh c1t0d0 130 MB/sec c1t1d0 130 MB/sec c2t202400A0B83A8A0Bd31 13422 MB/sec c3t202500A0B83A8A0Bd31 13422 MB/sec c4t600A0B80003A8A0B096A47B4559Ed0 191 MB/sec c4t600A0B80003A8A0B096E47B456DAd0 192 MB/sec c4t600A0B80003A8A0B096147B451BEd0 192 MB/sec c4t600A0B80003A8A0B096647B453CEd0 192 MB/sec c4t600A0B80003A8A0B097347B457D4d0 212 MB/sec c4t600A0B800039C9B50A9C47B4522Dd0 191 MB/sec c4t600A0B800039C9B50AA047B4529Bd0 192 MB/sec c4t600A0B800039C9B50AA447B4544Fd0 192 MB/sec c4t600A0B800039C9B50AA847B45605d0 191 MB/sec c4t600A0B800039C9B50AAC47B45739d0 191 MB/sec c4t600A0B800039C9B50AB047B457ADd0 191 MB/sec c4t600A0B800039C9B50AB447B4595Fd0 191 MB/sec somehow i don't think that reading the first 64MB off (presumably) off a raw disk device 3 times and picking the middle value is going to give you much useful information on the overall state of the disks .. i believe this was more of a quick hack to just validate that there's nothing too far out of the norm, but with that said - what's the c2 and c3 device above? you've got to be caching the heck out of that to get that unbelievable 13 GB/s - so you're really only seeing memory speeds there more useful information would be something more like the old taz or some of the disk IO latency tools when you're driving a workload. % arc_summary.pl System Memory: Physical RAM: 20470 MB Free Memory : 2371 MB LotsFree: 312 MB ZFS Tunables (/etc/system): * set zfs:zfs_arc_max = 0x3 set zfs:zfs_arc_max = 0x28000 * set zfs:zfs_arc_max = 0x2 ARC Size: Current Size: 9383 MB (arcsize) Target Size (Adaptive): 10240 MB (c) Min Size (Hard Limit):1280 MB (zfs_arc_min) Max Size (Hard Limit):10240 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 6%644 MB (p) Most Frequently Used Cache Size:93%9595 MB (c-p) ARC Efficency: Cache Access Total: 674638362 Cache Hit Ratio: 91% 615586988 [Defined State for buffer] Cache Miss Ratio: 8% 59051374 [Undefined State for Buffer] REAL Hit Ratio: 87% 590314508 [MRU/MFU Hits Only] Data Demand Efficiency:96% Data Prefetch Efficiency: 7% CACHE HITS BY CACHE LIST: Anon:2% 13626529 [ New Customer, First Cache Hit ] Most Recently Used: 78% 480379752 (mru) [ Return Customer ] Most Frequently Used: 17% 109934756 (mfu) [ Frequent Customer ] Most Recently Used Ghost:0% 5180256 (mru_ghost) [ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 1% 6465695 (mfu_ghost) [ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:78%485431759 Prefetch Data: 0%3045442 Demand Metadata:16%103900170 Prefetch Metadata: 3%23209617 CACHE MISSES BY DATA TYPE: Demand Data:30%18109355 Prefetch Data: 60%35633374 Demand Metadata: 6%3806177 Prefetch Metadata: 2% 1502468 - Prefetch seems to be performing badly. The Ben Rockwood's blog entry at http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch. The sample Dtrace script on that page only shows cache misses: vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: MISS vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: MISS Unfortunately, the file-level prefetch DTrace sample script from the same page seems to have a syntax error. if you're using LUNs off an array - this might be another case of the zfs_vdev_max_pending being tuned more for direct attach drives .. you could be trying to queue up too much I/O against the RAID controller, particularly if the RAID controller is also trying to prefetch out of it's cache. 
I tried disabling file level prefetch (zfs_prefetch_disable=1) but did not observe any change in behavior. this is only going to help if you've got problems in zfetch .. you'd probably see this better by looking for high lock contention in zfetch with lockstat # kstat -p zfs:0:vdev_cache_stats zfs:0:vdev_cache_stats:classmisc zfs:0:vdev_cache_stats:crtime 130.61298275 zfs:0:vdev_cache_stats:delegations 754287 zfs:0:vdev_cache_stats:hits 3973496 zfs:0:vdev_cache_stats:misses 2154959 zfs:0:vdev_cache_stats:snaptime 451955.55419545 Performance when coping 236 GB of files (each file is 5537792 bytes, wit
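One way to experiment with that queue depth without a reboot is to poke the variable on a live system; a sketch only, with illustrative values (the default in this vintage of Solaris 10 was 35, and arrays with their own cache often behave better with a smaller queue):

  # drop the per-vdev queue depth from 35 to 10 on the running kernel
  echo 'zfs_vdev_max_pending/W 0t10' | mdb -kw

  # watch the effect on queue depth and service times (actv, asvc_t columns)
  iostat -xnz 1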
[zfs-discuss] cannot initialize user accounting information on pool/nfs: Unknown error
After upgrading to b117 and doing a zfs upgrade -a the machine hung. I pulled the power and booted it again. Everything seems to work, except zfs userspace on a single filesystem. Here is some info.

# uname -a
SunOS opensolaris 5.11 snv_117 i86pc i386 i86pc Solaris

# zfs upgrade
This system is currently running ZFS filesystem version 4.
All filesystems are formatted with the current version.

# zfs list pool/nfs
NAME       USED  AVAIL  REFER  MOUNTPOINT
pool/nfs  4,89T  1,49T  4,89T  /data/nfs

# zfs userspace pool/nfs
Initializing accounting information on old filesystem, please wait...
cannot initialize user accounting information on pool/nfs: Unknown error
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mattias Pantzare wrote:

> > Performance when coping 236 GB of files (each file is 5537792 bytes, with
> > 20001 files per directory) from one directory to another:
> >
> > Copy Method                              Data Rate
> > ==
> > cpio -pdum                               75 MB/s
> > cp -r                                    32 MB/s
> > tar -cf - . | (cd dest && tar -xf -)     26 MB/s
> >
> > I would expect data copy rates approaching 200 MB/s.
>
> What happens if you run two copy at the same time? (On different data)

Before you do things like this, you first should start using tests that may give you useful results. None of the programs above has been written for decent performance. I know that "cp" on Solaris is a partial exception for single file copies, but that does not help us if we want to compare _apparent_ performance.

Let me first introduce other programs:

sdd    A dd(1) replacement that was first written in 1984 and that includes built-in speed metering since July 1988.

star   A tar(1) replacement that was first written in 1982 and that supports much better performance by using a shared memory based FIFO.

Note that most speed tests that are run on Linux do not result in useful values, as you don't know what's happening during the observation time.

If you like to meter read performance, I recommend using a filesystem that was mounted directly before doing the test, or using files that are big enough not to fit into memory. Use e.g.:

sdd if=file-name bs=64k -onull -time

If you like to meter write performance, I recommend writing big enough files to avoid getting wrong numbers as a result of caching. Use e.g.:

sdd -inull bs=64k count=some-number of=file-name -time

Use an appropriate value for "some-number".

For copying files, I recommend:

star -copy bs=1m fs=128m -time -C from-dir . to-dir

It makes sense to run another test using the option -no-fsync in addition. On Solaris with UFS, using -no-fsync speeds up things by approx. 10%. On Linux with a local filesystem, using -no-fsync speeds up things by approx. 400%. This is why you get uselessly high numbers from using GNU tar for copy tests on Linux.

Jörg
-- 
EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
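Put together, a repeatable version of that test might look something like the sketch below (pool, filesystem and file names are placeholders; remount between runs so the page cache and ARC start cold, and size the write so it is larger than RAM):

  # start from a cold cache
  zfs unmount tank/test && zfs mount tank/test

  # raw sequential read of one large existing file
  sdd if=/tank/test/bigfile bs=64k -onull -time

  # sequential write of a file larger than RAM (~65 GB here: 1000000 x 64k records)
  sdd -inull bs=64k count=1000000 of=/tank/test/bigfile.out -time

  # directory-tree copy, with and without an fsync of every file
  star -copy bs=1m fs=128m -time -C /tank/test/src . /tank/test/dst
  star -copy -no-fsync bs=1m fs=128m -time -C /tank/test/src . /tank/test/dst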
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 4, 2009 at 06:03, Bob Friesenhahn wrote: > I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS > performs so terribly on my system. I blew a good bit of personal life > savings on this set-up but am not seeing performance anywhere near what is > expected. Testing with iozone shows that bulk I/O performance is good. > Testing with Jeff Bonwick's 'diskqual.sh' shows expected disk performance. > The problem is that actual observed application performance sucks, and > could often be satisified by portable USB drives rather than high-end SAS > drives. It could be satisified by just one SAS disk drive. Behavior is as > if zfs is very slow to read data since disks are read at only 2 or 3 > MB/second followed by an intermittent write on a long cycle. Drive lights > blink slowly. It is as if ZFS does no successful sequential read-ahead on > the files (see Prefetch Data hit rate of 0% and Prefetch Data cache miss of > 60% below), or there is a semaphore bottleneck somewhere (but CPU use is > very low). > > Observed behavior is very program dependent. > > # zpool status Sun_2540 > pool: Sun_2540 > state: ONLINE > status: The pool is formatted using an older on-disk format. The pool can > still be used, but some features are unavailable. > action: Upgrade the pool using 'zpool upgrade'. Once this is done, the > pool will no longer be accessible on older software versions. > scrub: scrub completed after 0h46m with 0 errors on Mon Jun 29 05:06:33 > 2009 > config: > > NAME STATE READ WRITE CKSUM > Sun_2540 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B80003A8A0B096A47B4559Ed0 ONLINE 0 0 0 > c4t600A0B800039C9B50AA047B4529Bd0 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B80003A8A0B096E47B456DAd0 ONLINE 0 0 0 > c4t600A0B800039C9B50AA447B4544Fd0 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B80003A8A0B096147B451BEd0 ONLINE 0 0 0 > c4t600A0B800039C9B50AA847B45605d0 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B80003A8A0B096647B453CEd0 ONLINE 0 0 0 > c4t600A0B800039C9B50AAC47B45739d0 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B80003A8A0B097347B457D4d0 ONLINE 0 0 0 > c4t600A0B800039C9B50AB047B457ADd0 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c4t600A0B800039C9B50A9C47B4522Dd0 ONLINE 0 0 0 > c4t600A0B800039C9B50AB447B4595Fd0 ONLINE 0 0 0 > > errors: No known data errors > > > Prefetch seems to be performing badly. The Ben Rockwood's blog entry at > http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch. > The sample Dtrace script on that page only shows cache misses: > > vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: > MISS > vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: > MISS > > Unfortunately, the file-level prefetch DTrace sample script from the same > page seems to have a syntax error. > > I tried disabling file level prefetch (zfs_prefetch_disable=1) but did not > observe any change in behavior. > > # kstat -p zfs:0:vdev_cache_stats > zfs:0:vdev_cache_stats:class misc > zfs:0:vdev_cache_stats:crtime 130.61298275 > zfs:0:vdev_cache_stats:delegations 754287 > zfs:0:vdev_cache_stats:hits 3973496 > zfs:0:vdev_cache_stats:misses 2154959 > zfs:0:vdev_cache_stats:snaptime 451955.55419545 > > Performance when coping 236 GB of files (each file is 5537792 bytes, with > 20001 files per directory) from one directory to another: > > Copy Method Data Rate > == > cpio -pdum 75 MB/s > cp -r 32 MB/s > tar -cf - . | (cd dest && tar -xf -) 26 MB/s > > I would expect data copy rates approaching 200 MB/s. 
> What happens if you run two copy at the same time? (On different data)

Your test is very bad at using striping, as the reads are done sequentially. Prefetch can only help within a file, and your files are only 5 MB.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there ZFS will be slow for those files, even if you subsequently use read(2) to access them. If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system. I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris. The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five. Hope this helps, Phil blogs.sun.com/pgdh Sent from my iPod On 4 Jul 2009, at 05:26, Bob Friesenhahn wrote: On Fri, 3 Jul 2009, Bob Friesenhahn wrote: Copy MethodData Rate == cpio -pdum75 MB/s cp -r32 MB/s tar -cf - . | (cd dest && tar -xf -)26 MB/s It seems that the above should be ammended. Running the cpio based copy again results in zpool iostat only reporting a read bandwidth of 33 MB/second. The system seems to get slower and slower as it runs. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss