Re: [zfs-discuss] ZFS write bursts cause short app stalls
The application I'm working on is a kind of large-scale network-PVR system for our IPTV services. It records all running TV channels in an X-hour carousel (typically 24 or 48 hours), retaining only those bits which users have marked as being interesting to them. The current setup I'm doing development on is a small 12TB array; future deployment is planned on several 96TB X4540 machines.

I agree that I kind of misused the term "sequential" - it really is 77 concurrent sequential writes. However, as I explained, I/O is not the bottleneck here, as the array is capable of writes around 600 MByte/s and the write load I'm putting on it is around 55 MByte/s (430 Mbit/s). The problem is, as Brent explained, that as soon as the OS decides it wants to write the transaction group to disk, it totally ignores all other time-critical activity in the system and focuses on just that, causing an input poll() stall on all network sockets. What I'd need to do is force it to commit transactions to disk more often, so as to even the load out over a longer period of time and bring the CPU usage spikes down to a more manageable and predictable level.

Regards,
--
Saso

Tim Cook wrote:

On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones <br...@servuhome.net> wrote:

Hang on... if you've got 77 concurrent threads going, I don't see how that's a sequential I/O load. To the backend storage it's going to look like the equivalent of random I/O. I'd also be surprised to see 12 1TB disks supporting 600MB/sec throughput and would be interested in hearing where you got those numbers from. Is your video capture doing 430MB or 430Mbit?
--
--Tim

Think he said 430Mbit/sec, which if these are security cameras, would be a good sized installation (30+ cameras). We have a similar system, albeit running on Windows. Writing about 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and working quite well on our system without any frame loss or much latency.

Once again, Mb or MB? They're two completely different numbers. As for getting 400Mbit out of 6 SATA drives, that's not really impressive at all. If you're saying you got 400MB, that's a different story entirely, and while possible with sequential I/O and a proper raid setup, it isn't happening with random.

The write lag is noticeable with ZFS, however, as is the behavior of the transaction group writes. If you have a big write that needs to land on disk, it seems all other I/O, CPU and niceness is thrown out the window in favor of getting all that data on disk. I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris support; I'll try to find that bug number, but I believe some improvements were done in 129 and 130.
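As an aside, the burst pattern described above is easy to observe from the pool side while the recorder is running; a minimal sketch (the pool name "content" is taken from a later message in this thread - substitute your own):

  # zpool iostat -v content 1

With the stock txg timing you would expect write bandwidth to sit near zero for most intervals and then spike hard each time a transaction group is synced, which should line up with the poll() stalls seen by the application.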
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Would an upgrade to the development repository of 2010.02 do the same? I'd like to avoid having to do a complete reinstall, since I've got quite a bit of custom software in the system already in various places, and recompiling and fine-tuning would take me another 1-2 days.

Regards,
--
Saso

Leonid Kogan wrote:

Try b130. http://genunix.org/
Cheers,
LK

On 12/26/2009 12:59 AM, Saso Kiselkov wrote:

Hi, I tried it and I got the following error message:

# zfs set logbias=throughput content
cannot set property for 'content': invalid property 'logbias'

Is it because I'm running some older version which does not have this feature? (2009.06)

Regards,
--
Saso

Leonid Kogan wrote:

Hi there,
Try to:
zfs set logbias=throughput yourdataset
Good luck,
LK
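For reference, moving an existing 2009.06 install to the development repository is normally an in-place update rather than a reinstall; a minimal sketch, using the dev repository URL and publisher name that were current at the time of this thread:

  # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  # pkg image-update
  # init 6

Depending on the starting build you may first be told to update the packaging system itself (pkg install SUNWipkg) before image-update will proceed. image-update installs into a new boot environment, so the existing install and any custom software on the root filesystem stay untouched, and you can boot back into the old environment with beadm if the new build misbehaves.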
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook <t...@cook.ms> wrote:

[...] Once again, Mb or MB? They're two completely different numbers. As for getting 400Mbit out of 6 SATA drives, that's not really impressive at all. If you're saying you got 400MB, that's a different story entirely, and while possible with sequential I/O and a proper raid setup, it isn't happening with random.

Mb, megabit. 400 megabit is not terribly high; a single SATA drive could write that 24/7 without a sweat, which is why he is reporting his issue. Sequential or random, any modern system should be able to perform that task without causing disruption to other processes running on the system (if Windows can, Solaris/ZFS most definitely should be able to).

I have a similar workload on my X4540's, streaming backups from multiple systems at a time. These are very high end machines: dual quad-core Opterons and 64GB RAM, 48x 1TB drives in 5-6 disk RAIDZ vdevs. The write stalls have been a significant problem since ZFS came out, and haven't really been addressed in an acceptable fashion yet, though work has been done to improve it.

I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group.

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Brent Jones wrote:

[...] I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group.

Wow, if there were a production-release solution to the problem, that would be great! Reading the mailing list I almost gave up hope that I'd be able to work around this issue without upgrading to the latest bleeding-edge development version.

Regards,
--
Saso
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov <skisel...@gmail.com> wrote:

I'm still trying to find the case number I have open with Sunsolve or whatever, it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group.

Wow, if there were a production-release solution to the problem, that would be great!

Have you checked this thread?
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28704.html

Reading the mailing list I almost gave up hope that I'd be able to work around this issue without upgrading to the latest bleeding-edge development version.

Isn't opensolaris already bleeding edge?

--
Fajar
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Thank you, the post you mentioned helped me move a bit forward. I tried putting:

set zfs:zfs_txg_timeout = 1

in /etc/system and now I'm getting a much more even write load (a burst every 5 seconds), which no longer causes any significant poll() stalling. So far I have failed to find the timer in the ZFS source code which causes the 5-second timeout instead of what I want (1 second).

Another thing that's left on my mind is why I'm still getting a very slight burst every 60 seconds (causing a poll() delay of around 20-30ms, instead of the usual 0-2ms). It's not that big a problem; it's just that I'm curious as to where it's being created. I assume some 60-second timer is firing, but I don't know where.

Regards,
--
Saso

Fajar A. Nugraha wrote:

Have you checked this thread?
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28704.html
[...]
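One way to see whether those intervals really come from txg syncs is to timestamp the sync itself with DTrace; a minimal sketch, assuming the fbt probes for spa_sync (the function that writes out each transaction group) are available on your build:

  # dtrace -qn '
      fbt::spa_sync:entry  { self->ts = timestamp; printf("%Y txg sync start\n", walltimestamp); }
      fbt::spa_sync:return /self->ts/ {
          printf("%Y txg sync done after %d ms\n", walltimestamp, (timestamp - self->ts) / 1000000);
          self->ts = 0;
      }'

If both the 5-second cadence and the 60-second blip line up with spa_sync entries, the remaining delay is txg-related; if they do not, something else on a 60-second period (a cron job, a periodic flusher, etc.) is the more likely suspect.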
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Richard Elling wrote:

On Dec 25, 2009, at 4:15 PM, Erik Trimble wrote:

I haven't seen this mentioned before, but the OCZ Vertex Turbo is still an MLC-based SSD, and is /substantially/ inferior to an Intel X25-E in terms of random write performance, which is what a ZIL device does almost exclusively in the case of NFS traffic.

ZIL traffic tends to be sequential on separate logs. But may be of different sizes.
-- richard

Really? Now that I think about it, that seems to make sense - I was assuming that each NFS write would be relatively small, but that's likely not a valid general assumption. I'd still think that the MLC nature of the Vertex isn't optimal for write-heavy applications like here, even with a modest SDRAM cache on the SSD. In fact, I think that the Vertex's sustained random write IOPS performance is actually inferior to a 15k SAS drive.

I read a benchmark report yesterday that might be interesting. It seems that there is a market for modest sized SSDs, which would be perfect for separate logs + OS for servers.
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=392&Itemid=60
-- richard

I'm still hoping that vendors realize that there definitely is a (very large) market for ~20GB high write IOPS SSDs. I like my 18GB Zeus SSD, but it sure would be nice to be able to pay $2/GB for it, instead of 10x that now...

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Fri, 25 Dec 2009, Saso Kiselkov wrote:

sometimes even longer. I figured that I might be able to resolve this by lowering the txg timeout to something like 1-2 seconds (I need ZFS to write as soon as data arrives, since it will likely never be overwritten), but I couldn't find any tunable parameter for it anywhere on the net. On FreeBSD, I think this can be done via the

While there are some useful tunable parameters, another approach is to consider requesting a synchronous write using fdatasync(3RT) or fsync(3C) immediately after the final write() request in one of your poll() time quantums. This will cause the data to be written immediately. System behavior will then seem totally different. Unfortunately, it will also be less efficient.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On 12/26/09 09:53, Brent Jones wrote:

[...] I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group.

That would be the new System Duty Cycle Scheduling Class that was putback in build 129:

Author: Jonathan Adams <jonathan.ad...@sun.com>
Repository: /export/onnv-gate
Total changesets: 1
Changeset: 87f3734e64df
Comments:
6881015 ZFS write activity prevents other threads from running in a timely manner
6899867 mstate_thread_onproc_time() doesn't account for runnable time correctly
PSARC/2009/615 System Duty Cycle Scheduling Class and ZFS IO Observability

See http://arc.opensolaris.org/caselog/PSARC/2009/615/ for more information.

If you're using the dev repository, you can pkg image-update to get this new functionality.

Cheers,
Menno
--
Menno Lageman - Sun Microsystems - http://blogs.sun.com/menno
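After updating, a quick sanity check that you are actually running a build with this putback might look like the following sketch; the build string from uname is definitive, while the zpool-<poolname> process and its SDC class column are my reading of the observability part of the case and should be treated as an assumption:

  # uname -v
  # ps -efc | grep zpool-

On builds with PSARC/2009/615, each imported pool is expected to show up as a zpool-<poolname> system process whose scheduling class column reads SDC.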
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:

Brent Jones wrote:

[...] The write stalls have been a significant problem since ZFS came out, and haven't really been addressed in an acceptable fashion yet, though work has been done to improve it.

PSARC case 2009/615: System Duty Cycle Scheduling Class and ZFS IO Observability was integrated into b129. This creates a scheduling class for ZFS IO and automatically places the zio threads into that class. This is not really an earth-shattering change; Solaris has had a very flexible scheduler for almost 20 years now. Another example is that on a desktop, the application which has mouse focus runs in the interactive scheduling class. This is completely transparent to most folks and there is no tweaking required.

Also fixed in b129 is BUG/RFE 6881015 "ZFS write activity prevents other threads from running in a timely manner", which is related to the above.

I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group.

Wow, if there were a production-release solution to the problem, that would be great! Reading the mailing list I almost gave up hope that I'd be able to work around this issue without upgrading to the latest bleeding-edge development version.

Changes have to occur someplace first. In the OpenSolaris world, the changes occur first in the dev train and then are back ported to Solaris 10 (sometimes, not always). You should try the latest build first -- be sure to follow the release notes. Then, if the problem persists, you might consider tuning zfs_txg_timeout, which can be done on a live system.
-- richard
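For the archives, here is a minimal sketch of what tuning zfs_txg_timeout on a live system can look like; this assumes the kernel variable is still named zfs_txg_timeout on your build and that its value is in seconds (the 0t prefix makes mdb treat the number as decimal):

  # echo "zfs_txg_timeout/D" | mdb -k        (read the current value)
  # echo "zfs_txg_timeout/W 0t1" | mdb -kw   (set it to 1 second on the running kernel)

To make the change persistent across reboots, add the equivalent line to /etc/system:

  set zfs:zfs_txg_timeout = 1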
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
On Dec 25, 2009, at 3:01 PM, Jeroen Roodhart wrote:

Hi Freddie, list,

Option 4 is to re-do your pool, using fewer disks per raidz2 vdev, giving more vdevs to the pool, and thus increasing the IOPS for the whole pool. 14 disks in a single raidz2 vdev is going to give horrible IO, regardless of how fast the individual disks are. Redoing it with 6-disk raidz2 vdevs, or even 8-drive raidz2 vdevs, will give you much better throughput.

We are aware of the configuration being possibly suboptimal. However, before we had the SSDs, we did test earlier with 6x7 Z2 and even 2-way mirror setups. These gave better IOPS, but not significantly enough improvement (I would expect roughly a bit more than double the performance in 14x3 vs 6x7). In the end it is indeed a choice between performance, space and security. Our hope is that the SSD slogs serialise the data flow enough to make this work. But you have a fair point and we will also look into the combination of SSDs and pool configurations.

For your benchmark, there will not be a significant difference for any combination of HDDs. They all have at least 4 ms of write latency. Going from 10 ms down to 4 ms will not be nearly as noticeable as going from 10 ms to 0.01 ms :-)

-- richard
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Thanks for the advice. I did an in-place upgrade to the latest development b130 release, and it seems that the change in scheduling classes for the kernel writer threads worked (without even having to fiddle around with logbias) - now I'm just getting small delays every 60 seconds (on the order of 20-30ms). I'm not sure these have anything to do with ZFS, though; they happen outside of the write bursts.

Thank you all for the valuable advice!

Regards,
--
Saso

Richard Elling wrote:

[...] You should try the latest build first -- be sure to follow the release notes. Then, if the problem persists, you might consider tuning zfs_txg_timeout, which can be done on a live system.
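To pin down the residual 60-second 20-30 ms delays from the application side, one option is to histogram the poll latency per minute with DTrace; a minimal sketch, where "myrecorder" is a stand-in for the capture process name and the assumption is that libc's poll() enters the kernel via the pollsys syscall on this build:

  # dtrace -n '
      syscall::pollsys:entry /execname == "myrecorder"/ { self->ts = timestamp; }
      syscall::pollsys:return /self->ts/ {
          @["poll latency (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
      }
      tick-60s { printa(@); trunc(@); }'

If one bucket in the 20-30 ms range shows up exactly once per minute, correlating its wall-clock time with cron, snapshot services, or txg activity (see the spa_sync sketch earlier in this thread) should identify the source.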
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
I am having the exact same problem after destroying a dataset with a few gigabytes of data and dedup. I typed 'zfs destroy vault/virtualmachines', which was a zvol with dedup turned on, and the server hung - couldn't ping it, couldn't get on the console. On the next bootup, the same thing: it just hangs when importing the filesystems.

I removed one of the pool disks and also all the mirrors so that I can experiment without losing the original data. All of my symptoms are the exact same as in this thread, although I can't lose this particular data - it's been a week since the last backup, so I am freaking out a little. I am on build 130 and was going to do another backup, but was troubleshooting a different issue regarding poor iscsi performance before I did the backup. I deleted that zvol with dedup on and now I'm in the same boat as the parent: same hangs.

An interesting thing that happens is that when I hit the power button, which usually tells the system to start a shutdown, during these hangs it says "not enough kernel memory". Perhaps a memory leak during the failed destroy is causing the hangups. Literally every description in here matches my symptoms, and during an import I can see the disks get hit pretty hard for about 3 minutes and then stop cold turkey; the system is unresponsive, just a blinking cursor at the console. I can hit enter to generate a newline but everything else is blank, so the console is locked pretty hard.

The pool is made up of 3 mirrored vdevs, a cache and a log device. Everything was running great until I destroyed that one little deduped zvol. I was able to destroy other zvols right before that one.

Has anybody had a chance to look at the dump the poster sent up?

Thanks
[zfs-discuss] repost - high read iops
repost - Sorry for ccing the other forums.

I'm running into an issue where there seems to be a high number of read iops hitting the disks, and physical free memory is fluctuating between 200MB - 450MB out of 16GB total. We have the L2ARC configured on a 32GB Intel X25-E SSD and the slog on another 32GB X25-E SSD. According to our tester, Oracle writes are extremely slow (high latency). Below is a snippet of iostat:

     r/s    w/s  Mr/s  Mw/s wait  actv wsvc_t asvc_t  %w   %b device
     0.0    0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0 c0
     0.0    0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0 c0t0d0
  4898.3   34.2  23.2   1.4  0.1 385.3    0.0   78.1   0 1246 c1
     0.0    0.8   0.0   0.0  0.0   0.0    0.0   16.0   0    1 c1t0d0
   401.7    0.0   1.9   0.0  0.0  31.5    0.0   78.5   1  100 c1t1d0
   421.2    0.0   2.0   0.0  0.0  30.4    0.0   72.3   1   98 c1t2d0
   403.9    0.0   1.9   0.0  0.0  32.0    0.0   79.2   1  100 c1t3d0
   406.7    0.0   2.0   0.0  0.0  33.0    0.0   81.3   1  100 c1t4d0
   414.2    0.0   1.9   0.0  0.0  28.6    0.0   69.1   1   98 c1t5d0
   406.3    0.0   1.8   0.0  0.0  32.1    0.0   79.0   1  100 c1t6d0
   404.3    0.0   1.9   0.0  0.0  31.9    0.0   78.8   1  100 c1t7d0
   404.1    0.0   1.9   0.0  0.0  34.0    0.0   84.1   1  100 c1t8d0
   407.1    0.0   1.9   0.0  0.0  31.2    0.0   76.6   1  100 c1t9d0
   407.5    0.0   2.0   0.0  0.0  33.2    0.0   81.4   1  100 c1t10d0
   402.8    0.0   2.0   0.0  0.0  33.5    0.0   83.2   1  100 c1t11d0
   408.9    0.0   2.0   0.0  0.0  32.8    0.0   80.3   1  100 c1t12d0
     9.6   10.8   0.1   0.9  0.0   0.4    0.0   20.1   0   17 c1t13d0
     0.0   22.7   0.0   0.5  0.0   0.5    0.0   22.8   0   33 c1t14d0

Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order that a read request is satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.

CACHE HITS BY DATA TYPE:
  Demand Data:       22%  158853174
  Prefetch Data:     17%  123009991   --- not helping???
  Demand Metadata:   60%  437439104
  Prefetch Metadata:  0%    2446824

The write iops then started to kick in more and latency dropped on the spinning disks:

     0.0    0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0 c0
     0.0    0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0 c0t0d0
  1629.0  968.0  17.4   7.3  0.0  35.9    0.0   13.8   0 1088 c1
     0.0    1.9   0.0   0.0  0.0   0.0    0.0    1.7   0    0 c1t0d0
   126.7   67.3   1.4   0.2  0.0   2.9    0.0   14.8   0   90 c1t1d0
   129.7   76.1   1.4   0.2  0.0   2.8    0.0   13.7   0   90 c1t2d0
   128.0   73.9   1.4   0.2  0.0   3.2    0.0   16.0   0   91 c1t3d0
   128.3   79.1   1.3   0.2  0.0   3.6    0.0   17.2   0   92 c1t4d0
   125.8   69.7   1.3   0.2  0.0   2.9    0.0   14.9   0   89 c1t5d0
   128.3   81.9   1.4   0.2  0.0   2.8    0.0   13.1   0   89 c1t6d0
   128.1   69.2   1.4   0.2  0.0   3.1    0.0   15.7   0   93 c1t7d0
   128.3   80.3   1.4   0.2  0.0   3.1    0.0   14.7   0   91 c1t8d0
   129.2   69.3   1.4   0.2  0.0   3.0    0.0   15.2   0   90 c1t9d0
   130.1   80.0   1.4   0.2  0.0   2.9    0.0   13.6   0   89 c1t10d0
   126.2   72.6   1.3   0.2  0.0   2.8    0.0   14.2   0   89 c1t11d0
   129.7   81.0   1.4   0.2  0.0   2.7    0.0   12.9   0   88 c1t12d0
    90.4   41.3   1.0   4.0  0.0   0.2    0.0    1.2   0    6 c1t13d0
     0.0   24.3   0.0   1.2  0.0   0.0    0.0    0.2   0    0 c1t14d0

Is it true that if your MFU stats start to go over 50%, then more memory is needed?

CACHE HITS BY CACHE LIST:
  Anon:                        10%   74845266  [ New Customer, First Cache Hit ]
  Most Recently Used:          19%  140478087  (mru)        [ Return Customer ]
  Most Frequently Used:        65%  475719362  (mfu)        [ Frequent Customer ]
  Most Recently Used Ghost:     2%   20785604  (mru_ghost)  [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:   1%    9920089  (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]

CACHE HITS BY DATA TYPE:
  Demand Data:       22%  158852935
  Prefetch Data:     17%  123009991
  Demand Metadata:   60%  437438658
  Prefetch Metadata:  0%    2446824

My theory is that since there's not enough memory for the ARC to cache data, it hits the L2ARC, where it can't find the data and has to query the disks for the request.
This causes contention between reads and writes, causing the service times to inflate.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc: Intel X25-E
slog: Intel X25-E

Thoughts?
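One way to check whether the ARC really is memory-starved, rather than inferring it from free memory, is to read the arcstats kstats directly; a minimal sketch, assuming the usual zfs:0:arcstats kstat names:

  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
  # kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses zfs:0:arcstats:l2_size

If size sits well below c_max and the target c keeps shrinking, the ARC is being squeezed by other memory consumers (the Oracle SGA being the obvious candidate here) and more RAM would help; if l2_misses dominates l2_hits, the working set is not fitting on the 32GB L2ARC either.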
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
I still haven't given up :) I moved my Virtual Machines to my main rig (which gets rebooted often, so this is 'not optimal' to say the least) :)

I have since upgraded to 129. I noticed that even if timeslider/autosnaps are disabled, a zpool command still gets generated every 15 minutes. Since all zpool/zfs commands freeze during the import, I'd have hundreds of hung zpool processes. I stopped this by commenting out all jobs on the zfssnap crontab as well as the auto-snap cleanup job on root's crontab. This did nothing to resolve my issue, but I figured I should note it. I'd copy and paste the exact jobs, but my server is once again hung.

I'm going to upgrade my server (new motherboard that supports more than 4GB of RAM). I'll have double the RAM; perhaps there is some sort of RAM issue going on. I really wanted to get 16GB of RAM, but my own personal budget will not allow it :)
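If the hang really is the deferred dedup destroy exhausting kernel memory, it can help to size the problem before the next import attempt; a rough sketch, using the pool name 'vault' from the earlier post as a placeholder, and with the caveat that zdb walks the dedup tables itself and can take a long time and a lot of memory:

  # prtconf | grep Memory          (total physical memory)
  # echo ::memstat | mdb -k        (kernel / anon / free memory breakdown)
  # zdb -DD vault                  (dedup table histogram and entry counts)

Each DDT entry costs on the order of a few hundred bytes of RAM while it is being processed, so the entry count from zdb gives a feel for how much memory the destroy needs. If the pool cannot be kept imported at all, zdb -e -DD vault reads the on-disk state directly without going through a kernel import, though whether that succeeds on a pool stuck mid-destroy is not guaranteed.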
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
Just wondering, how much RAM is in your system?