Re: [zfs-discuss] Periodic flush
ZFS has always done a certain amount of write throttling. In the past (or the present, for those of you running S10 or pre-build-87 bits) this throttling was controlled by a timer and the size of the ARC: we would cut a transaction group every 5 seconds based off of our timer, and we would also cut a transaction group if we had more than 1/4 of the ARC size worth of dirty data in the transaction group. So, for example, if you have a machine with 16GB of physical memory, it wouldn't be unusual to see an ARC size of around 12GB. This means we would allow up to 3GB of dirty data into a single transaction group (if the writes complete in less than 5 seconds). Now, we can have up to three transaction groups in progress at any time: open context, quiesce context, and sync context. As a final wrinkle, we also don't allow more than 1/2 of the ARC to be composed of dirty write data. All taken together, this means that there can be up to 6GB of writes in the pipe (using the 12GB ARC example from above).

Problems with this design start to show up when the write-to-disk bandwidth can't keep up with the application: if the application is writing at a rate of, say, 1GB/sec, it will fill the pipe within 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, it's going to take 12 seconds to get this data onto the disk. This impedance mismatch is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is experiencing. Note there are a few other subtleties here that I have glossed over, but the general picture is accurate.

The new write throttle code put back into build 87 attempts to smooth out the process. We now measure the amount of time it takes to sync each transaction group, and the amount of data in that group.
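The arithmetic in the old model can be sketched as follows (illustrative only; the constants come from the 16GB example in this post, not from the ZFS source):

```python
# Old (pre-snv_87) write-throttle limits, using the example numbers above.
arc_gb = 12.0                 # typical ARC size on a 16GB machine
txg_limit_gb = arc_gb / 4     # cut a txg at 1/4 of the ARC in dirty data
pipe_gb = arc_gb / 2          # at most 1/2 of the ARC may be dirty overall

# Impedance mismatch: the app writes faster than the disks can drain.
app_rate_gb_s = 1.0
disk_rate_gb_s = 0.5
fill_sec = pipe_gb / app_rate_gb_s     # app fills the pipe in ~6 seconds
drain_sec = pipe_gb / disk_rate_gb_s   # disks need ~12 seconds to drain it

print(txg_limit_gb, pipe_gb, fill_sec, drain_sec)   # 3.0 6.0 6.0 12.0
```

The 6-second fill versus 12-second drain is exactly the gap that shows up as application pauses.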
We dynamically resize our write throttle to try to keep the sync time constant (at 5 secs) under write load. We also introduce fairness delays on writers when we near pipeline capacity: each write is delayed 1/100 sec when we are about to fill up. This prevents a single heavy writer from starving out occasional writers. So instead of coming to an abrupt halt when the pipeline fills, we slow down our write pace. The result should be a constant, even IO load.

There is one downside to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30 secs of idle, the new code may be less efficient than the old. In the old code, all of this IO would be let in at memory speed and then more slowly make its way out to disk. In the new code, the writes may be slowed down. The data makes its way to the disk in the same amount of time, but the application takes longer. Conceptually: we are sizing the write buffer to the pool bandwidth, rather than to the memory size.

Robert Milkowski wrote:
Hello eric,
Thursday, March 27, 2008, 9:36:42 PM, you wrote:
ek On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:
On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 (each zpool needs to monitor its throughput and throttle heavy writers) will help.
I hope that this feature is implemented soon, and works well. :-)
ek Actually, this has gone back into snv_87 (and no, we don't know which
ek s10uX it will go into yet).
Could you share more details how it works right now after the change?
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
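A rough sketch of the behavior described above (the class and method names are hypothetical, for illustration; this is not the actual ZFS implementation):

```python
import time

class WriteThrottle:
    """Sketch of the snv_87-style write throttle described in the post."""
    TARGET_SYNC_SEC = 5.0     # keep txg sync time roughly constant
    FAIRNESS_DELAY = 0.01     # 1/100 sec delay per write near capacity

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes   # current dirty-data limit
        self.dirty = 0

    def txg_synced(self, synced_bytes, sync_sec):
        # Resize the dirty-data limit to the pool's measured bandwidth,
        # so the next sync should again take ~5 seconds.
        bandwidth = synced_bytes / sync_sec
        self.capacity = bandwidth * self.TARGET_SYNC_SEC
        self.dirty = 0

    def write(self, nbytes):
        # Near pipeline capacity, slow every writer a little instead of
        # letting one heavy writer fill the pipe and stall everyone.
        if self.dirty + nbytes >= 0.9 * self.capacity:
            time.sleep(self.FAIRNESS_DELAY)
        self.dirty += nbytes
```

Note how `txg_synced` sizes the write buffer to pool bandwidth rather than to memory size, which is the conceptual shift the post describes.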
Re: [zfs-discuss] RAID-Z resilver broken
ugh, thanks for exploring this and isolating the problem. We will look into what is going on (wrong) here. I have filed bug: 6545015 RAID-Z resilver broken to track this problem. -Mark

Marco van Lienen wrote:
On Sat, Apr 07, 2007 at 05:05:18PM -0500, in a galaxy far far away, Chris Csanady said:
In a recent message, I detailed the excessive checksum errors that occurred after replacing a disk. It seems that after a resilver completes, it leaves a large number of blocks in the pool which fail to checksum properly. Afterward, it is necessary to scrub the pool in order to correct these errors. After some testing, it seems that this only occurs with RAID-Z. The same behavior can be observed on both snv_59 and snv_60, though I do not have any other installs to test at the moment.
A colleague at work and I have followed the same steps, including running a digest on /test/file, on an SXCE 61 build today and can confirm the exact same (and disturbing) result. My colleague mentioned to me he has witnessed the same 'resilver' behavior on builds 57 and 60. The box these steps were performed on was 'luupgraded' from SXCE 60 to 61 using the SUNWlu* packages from 61!
# cat /etc/release
Solaris Nevada snv_61 X86
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 26 March 2007
# mkdir /tmp/test
# mkfile 64m /tmp/test/0 /tmp/test/1
# zpool create test raidz /tmp/test/0 /tmp/test/1
# mkfile 16m /test/file
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
#
# zpool export test
# rm /tmp/test/0
# zpool import -d /tmp/test test
# mkfile 64m /tmp/test/0
# zpool replace test /tmp/test/0
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# zpool status test
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Wed Apr 11 15:19:15 2007
config:
        NAME            STATE  READ WRITE CKSUM
        test            ONLINE    0     0     0
          raidz1        ONLINE    0     0     0
            /tmp/test/0 ONLINE    0     0     0
            /tmp/test/1 ONLINE    0     0     0
errors: No known data errors
# zpool scrub test
#
# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Wed Apr 11 15:22:30 2007
config:
        NAME            STATE  READ WRITE CKSUM
        test            ONLINE    0     0     0
          raidz1        ONLINE    0     0     0
            /tmp/test/0 ONLINE    0     0    17
            /tmp/test/1 ONLINE    0     0     0
errors: No known data errors
I don't think these checksum errors are a good sign. The sha1 digest on the file *does* show to be the same, so the question arises: is the resilver process truly broken (even though in this test case the test file does appear to be unchanged, based on the sha1 digest)?
Marco
Re: [zfs-discuss] Re: Something like spare sectors...
Anton B. Rang wrote: This sounds a lot like: 6417779 ZFS: I/O failure (write on ...) -- need to reallocate writes Which would allow us to retry write failures on alternate vdevs. Of course, if there's only one vdev, the write should be retried to a different block on the original vdev ... right? Yes, although it depends on the nature of the write failure. If the write failed because the device is no longer available, ZFS will not continue to try different blocks. -Mark
Re: [zfs-discuss] zfs snapshot issues.
Joseph Barbey wrote:
Matthew Ahrens wrote:
Joseph Barbey wrote:
Robert Milkowski wrote:
JB So, normally, when the script runs, all snapshots finish in maybe a minute
JB total. However, on Sundays, it continues to take longer and longer. On
JB 2/25 it took 30 minutes, and this last Sunday, it took 2:11. The only
JB special thing about Sunday's snapshots is that they are the first
JB ones created since the full backup (using NetBackup) on Saturday. All
JB other backups are incrementals.
hm, do you have the atime property set to off? Maybe you spend most of the time in destroying snapshots due to a much larger delta caused by atime updates? You can possibly also gain some performance by setting atime to off.
Yep, atime is set to off for all pools and filesystems. I looked through the other possible properties, and nothing really looked like it would affect things. One additional weird thing: my script hits each filesystem (email-pool/A..Z) individually, so I can run zfs list -t snapshot and find out how long each snapshot actually takes. Everything runs fine until I get to around V or (normally) W. Then it can take a couple of hours on the one FS. After that, the rest go quickly.
So, what operation exactly is taking a couple of hours on the one FS? The only one I can imagine taking more than a minute would be 'zfs destroy', but even that should be very rare on a snapshot. Is it always the same FS that takes longer than the rest? Is the pool busy when you do the slow operation?
I've now determined that renaming the previous snapshot seems to be the problem in certain instances. What we are currently doing through the script is to keep 2 weeks of daily snapshots of the various pool/filesystems. These snapshots are named {fs}.$Day-1, {fs}.$Day-2, and {fs}.snap.
Specifically, for our 'V' filesystem, which is created under the email-pool, I will have the following snapshots: email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED]
So, my script does the following for each FS:
Check for FS.$Day-2. If it exists, destroy it.
Check if there is a FS.$Day-1. If so, rename it to FS.$Day-2.
Check for FS.snap. If so, rename it to FS.$Yesterday-1 (the day it was created).
Create FS.snap
I added logging to a file, along with the action just run and the time that it completed:
Destroy email-pool/[EMAIL PROTECTED]   Sun Apr 8 00:01:04 CDT 2007
Rename email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED]   Sun Apr 8 00:01:05 CDT 2007
Rename email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED]   Sun Apr 8 00:54:52 CDT 2007
Create email-pool/[EMAIL PROTECTED]   Sun Apr 8 00:54:53 CDT 2007
Looking at the above, the rename took from 00:01:05 until 00:54:52, so almost 54 minutes. So, any ideas on why a rename should take so long? And again, why is this only happening on Sunday? Any other information I can provide that might help diagnose this?
This could be an instance of:
6509628 unmount of a snapshot (from 'zfs destroy') is slow
The fact that this bug comes from a destroy op is not relevant; what is relevant is the required unmount (also required in a rename op). Has there been recent activity in the Sunday-1 snapshot (like a backup or 'find', perhaps)? This will cause the unmount to proceed very slowly.
-Mark
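The rotation steps described above can be sketched as pure planning logic (a hypothetical helper for illustration; it produces the ordered zfs operations for one filesystem rather than running them):

```python
# Sketch of the snapshot-rotation plan from the post: destroy the oldest,
# shift yesterday's names down, and cut a fresh .snap.
def rotation_plan(fs, day, yesterday):
    return [
        ("destroy", f"{fs}@{day}-2"),
        ("rename", f"{fs}@{day}-1", f"{fs}@{day}-2"),
        ("rename", f"{fs}@snap", f"{fs}@{yesterday}-1"),
        ("snapshot", f"{fs}@snap"),
    ]

for op in rotation_plan("email-pool/V", "Sun", "Sat"):
    print(op)
```

Each tuple maps onto one `zfs destroy`, `zfs rename`, or `zfs snapshot` invocation; as the thread concludes, it is the rename's implicit unmount that can be slow.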
[zfs-discuss] Re: [zfs-code] Contents of transaction group?
Atul Vidwansa wrote:
Hi, I have a few questions about the way a transaction group is created.
1. Is it possible to group transactions related to multiple operations in the same group? For example, an rmdir foo followed by mkdir bar: can these end up in the same transaction group?
Yes.
2. Is it possible for an operation (say write()) to occupy multiple transaction groups?
Yes. Writes are broken into transactions at block boundaries. So it is possible for a large write to span multiple transaction groups.
3. Is it possible to know the thread id(s) for every committed txg_id?
No.
-Mark
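The answer to question 2 can be illustrated with a small sketch (an assumption for illustration: a 128K recordsize, and one transaction per block touched; the real DMU code is more involved):

```python
# Split a (offset, length) write into per-block transactions at block
# boundaries, as described above; each chunk may land in a different txg.
def split_write(offset, length, blocksize=128 * 1024):
    txs = []
    end = offset + length
    while offset < end:
        block_end = (offset // blocksize + 1) * blocksize
        chunk = min(block_end, end) - offset
        txs.append((offset, chunk))
        offset += chunk
    return txs

# A 300K write starting at offset 0 touches three 128K blocks.
print(split_write(0, 300 * 1024))
```

A write that straddles a txg cut therefore commits partly in one group and partly in the next.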
[zfs-discuss] Re: zfs blocks numbers for small files
Frederic Payet - Availability Services wrote:
Hi gurus,
When creating some small files in a ZFS directory, the used block count is not what you might expect:
hinano# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
pool2                 702K  16.5G  26.5K  /pool2
pool2/new             604K  16.5G    34K  /pool2/new
pool2/new/fs2         570K  16.5G   286K  /pool2/new/fs2
pool2/new/fs2/subfs2  284K  16.5G   284K  /pool2/new/fs2/subfs2
hinano# pwd
/pool2/new/fred
hinano# zfs get all pool2/new
NAME       PROPERTY       VALUE                  SOURCE
pool2/new  type           filesystem             -
pool2/new  creation       Tue Mar 20 13:27 2007  -
pool2/new  used           603K                   -
pool2/new  available      16.5G                  -
pool2/new  referenced     33.5K                  -
pool2/new  compressratio  1.00x                  -
pool2/new  mounted        yes                    -
pool2/new  quota          none                   default
pool2/new  reservation    none                   default
pool2/new  recordsize     128K                   default
pool2/new  mountpoint     /pool2/new             default
pool2/new  sharenfs       off                    default
pool2/new  checksum       on                     default
pool2/new  compression    off                    default
pool2/new  atime          on                     default
pool2/new  devices        on                     default
pool2/new  exec           on                     default
pool2/new  setuid         on                     default
pool2/new  readonly       off                    default
pool2/new  zoned          off                    default
pool2/new  snapdir        hidden                 default
pool2/new  aclmode        groupmask              default
pool2/new  aclinherit     secure                 default
hinano# mkfile 9 file9bytes
hinano# mkfile 520 file520bytes
hinano# mkfile 1025 file1025bytes
hinano# mkfile 1023 file1023bytes
hinano# mkfile 10 file10bytes
hinano# ls -ls
total 14
   3 -rw------T   1 root  root  1023 Apr  4 13:34 file1023bytes
   4 -rw------T   1 root  root  1025 Apr  4 13:34 file1025bytes
   1 -rw------T   1 root  root    10 Apr  4 13:38 file10bytes
   3 -rw------T   1 root  root   520 Apr  4 13:33 file520bytes
   2 -rw------T   1 root  root     9 Apr  4 13:33 file9bytes
After 2 seconds:
hinano# ls -ls
total 13
   3 -rw------T   1 root  root  1023 Apr  4 13:34 file1023bytes
   4 -rw------T   1 root  root  1025 Apr  4 13:34 file1025bytes
   2 -rw------T   1 root  root    10 Apr  4 13:38 file10bytes
   3 -rw------T   1 root  root   520 Apr  4 13:33 file520bytes
   2 -rw------T   1 root  root     9 Apr  4 13:33 file9bytes
2 questions:
- Could somebody explain why a file of 9 bytes takes 2 512b blocks?
One block for the znode (the metadata), one block for the data.
- Why has the block count of file10bytes changed after a while, doing nothing more than 'ls -ls'?
The block count reflects the actual allocated storage on disk. The first time you did an 'ls', the data block had not yet been allocated (i.e., the data was still in transit to the disk).
Please reply to me directly, as I'm not on this alias.
Best,
fred
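The block counts in the settled listing are consistent with a simple model (an assumption for illustration only; it ignores compression, larger recordsizes, and indirect blocks): one 512-byte block for the znode plus the data rounded up to 512-byte sectors.

```python
import math

def small_file_blocks(size_bytes):
    # Assumed model: 1 znode (metadata) block + data rounded up to
    # 512-byte sectors. Matches the ls -ls output quoted above.
    data_sectors = math.ceil(size_bytes / 512)
    return 1 + data_sectors

for size in (9, 10, 520, 1023, 1025):
    print(size, small_file_blocks(size))
```

So 9- and 10-byte files report 2 blocks, 520 and 1023 bytes report 3, and 1025 bytes (just over two sectors of data) reports 4.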
Re: [zfs-discuss] HELIOS and ZFS cache
This issue has been discussed a number of times in this forum. To summarize: ZFS (specifically, the ARC) will try to use *most* of the system's available memory to cache file system data. The default is to max out at physmem-1GB (i.e., use all of physical memory except for 1GB). In the face of memory pressure, the ARC will give up memory; however, there are some situations where we are unable to free up memory fast enough for an application that needs it (see the example in the HELIOS note below). In these situations, it may be necessary to lower the ARC's maximum memory footprint, so that there is a larger amount of memory immediately available for applications. This is particularly relevant in situations where there is a known amount of memory that will always be required for use by some application (databases often fall into this category). The tradeoff here is that the ARC will not be able to cache as much file system data, and that could impact performance. For example, if you know that an application will need 5GB on a 36GB machine, you could set the arc maximum to 30GB (0x780000000).
In ZFS on s10 prior to update 4, you can only change the arc max size via explicit actions with mdb(1):
# mdb -kw
> arc::print -a c_max
<address> c_max = <current-max>
> <address>/Z <new-max>
In the current opensolaris nevada bits, and in s10u4, you can use the system variable 'zfs_arc_max' to set the maximum arc size. Just set this in /etc/system.
-Mark
Erik Vanden Meersch wrote: Could someone please provide comments or a solution for this?
Subject: Solaris 10 ZFS problems with database applications
HELIOS TechInfo #106
Tue, 20 Feb 2007
Solaris 10 ZFS problems with database applications
--
We have tested Solaris 10 release 11/06 with ZFS without any problems using all HELIOS UB based products, including very high load tests.
However we learned from customers that some database solutions (known are Sybase and Oracle), when allocating a large amount of memory, may slow down or even freeze the system for up to a minute. This can result in RPC timeout messages and service interrupts for HELIOS processes. ZFS is basically using most memory for file caching. Freeing this ZFS memory for the database memory allocation can result in serious delays. This does not occur when using HELIOS products only. The HELIOS test system was using 4GB memory. The customer production machine was using 16GB memory. Contact your SUN representative on how to limit the ZFS cache and what else to consider when using ZFS in your workflow. Check also with your application vendor for recommendations on using ZFS with their applications. Best regards, HELIOS Support HELIOS Software GmbH Steinriede 3 30827 Garbsen (Hannover) Germany Phone: +49 5131 709320 FAX: +49 5131 709325 http://www.helios.de -- http://www.sun.com/solaris * Erik Vanden Meersch * Solution Architect * Sun Microsystems, Inc. * Phone x48835/+32-2-704 8835 Mobile 0479/95 05 98 Email [EMAIL PROTECTED]
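The ARC sizing arithmetic from the reply above, checked numerically (the 36GB machine and 30GB cap are the example from the post; the default max is physmem minus 1GB):

```python
GB = 1 << 30

physmem = 36 * GB
default_arc_max = physmem - 1 * GB   # default cap: all but 1GB of physmem
new_arc_max = 30 * GB                # lowered cap, leaving headroom for
                                     # the ~5GB the application needs

# The hex form is what you would poke into c_max with mdb -kw,
# or assign to zfs_arc_max in /etc/system.
print(hex(new_arc_max))              # 0x780000000
```

This also shows why the truncated "0x78000" sometimes seen in archived copies of this advice cannot be right: 30GB is 0x780000000 bytes.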
Re: [zfs-discuss] zpool export consumes whole CPU and takes more than 30 minutes to complete
Robert,
This doesn't look like cache flushing; rather, it looks like we are trying to finish up some writes... but are having a hard time allocating space for them. Is this pool almost 100% full? There are lots of instances of zio_write_allocate_gang_members(), which indicates a very high degree of fragmentation in the pool.
-Mark
Robert Milkowski wrote:
Hi. T2000 1.2GHz 8-core, 32GB RAM, S10U3, zil_disable=1. The command 'zpool export f3-2' has been hung for 30 minutes now and is still going. Nothing else is running on the server. I can see one CPU being 100% in SYS like:
bash-3.00# mpstat 1
[mpstat columns garbled in the archive; the output showed one CPU spending ~100% of its time in sys while the remaining 31 CPUs sat essentially idle]
Re: [zfs-discuss] file not persistent after node bounce when there is a bad disk?
Peter Buckingham wrote:
Hi Eric,
eric kustarz wrote:
The first thing I would do is see if any I/O is happening ('zpool iostat 1'). If there's none, then perhaps the machine is hung (in which case you would want to grab a couple of '::threadlist -v 10's from mdb to figure out if there are hung threads).
there seems to be no IO after the initial IO according to zpool iostat. When we run zpool status it hangs:
HON hcb116 ~ $ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
hang
I'll send you the mdb output privately since it's quite big.
60 seconds should be plenty of time for the async write(s) to complete. We try to push out txgs (transaction groups) every 5 seconds. However, if the system is overloaded, then the txgs could take longer.
That's what I would have thought. The 'sync' hanging is intriguing. Perhaps the system is just overloaded and the sync command is making it worse. Seeing what 'fsync' would do would be interesting. I've not tried this yet.
What else is the machine doing?
we are running the honeycomb environment (you can see when I send you the mdb output). is there some issue for the zpool mirrors if one of the slices disappears or is unresponsive after the pool has been brought online?
This can be a problem if an IO issued to the device never completes (i.e., hangs). This can hang up the pool. A well-behaved device/driver should eventually time out the IO, but we have seen instances where this never seems to happen.
-Mark
Re: [zfs-discuss] Limit ZFS Memory Utilization
Al Hopper wrote:
On Wed, 10 Jan 2007, Mark Maybee wrote:
Jason J. W. Williams wrote:
Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a certain degree: If you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable; it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation.
snip.
Understood (and many thanks). In the meantime, is there a rule of thumb that you could share that would allow mere humans (like me) to calculate the best values of zfs:zfs_arc_max and ncsize, given that the machine has nGB of RAM and is used in the following broad workload scenarios:
a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS filesystem
It would seem, from reading between the lines of previous emails, particularly the ones you've (Mark M) written, that there is a rule of thumb that would apply given a standard or modified ncsize tunable??
I'm primarily interested in a calculation that would allow settings that would reduce the possibility of the machine descending into swap hell.
Ideally, there would be no need for any tunables; ZFS would always do the right thing. This is our grail. In the meantime, I can give some recommendations, but there is no rule of thumb that is going to work in all circumstances.
ncsize: As I have mentioned previously, there are overheads associated with caching vnode data in ZFS. While the physical on-disk data for a znode is only 512 bytes, the related in-core cost is significantly higher. Roughly, you can expect that each ZFS vnode held in the DNLC will cost about 3K of kernel memory. So, you need to set ncsize appropriately for how much memory you are willing to devote to it. 500,000 entries is going to cost you 1.5GB of memory.
zfs_arc_max: This is the maximum amount of memory you want the ARC to be able to use. Note that the ARC won't necessarily use this much memory: if other applications need memory, the ARC will shrink to accommodate. Although, also note that the ARC *can't* shrink if all of its memory is held. For example, data in the DNLC cannot be evicted from the ARC, so this data must first be evicted from the DNLC before the ARC can free up space (this is why it is dangerous to turn off the ARC's ability to evict vnodes from the DNLC). Also keep in mind that the ARC size does not account for many in-core data structures used by ZFS (znodes/dnodes/dbufs/etc). Roughly, for every 1MB of cached file pointers, you can expect another 3MB of memory used outside of the ARC. So, in the example above, where ncsize is 500,000, the ARC is only seeing about 400MB of the 1.5GB consumed. As I have stated previously, we consider this a bug in the current ARC accounting that we will soon fix. This is only an issue in environments where many files are being accessed.
If the number of files accessed is relatively low, then the ARC size will be much closer to the actual memory consumed by ZFS. So, in general, you should not really need to tune zfs_arc_max. However, in environments where you have specific applications that consume known quantities of memory (e.g. a database), it will likely help to set the ARC max size down, to guarantee that the necessary kernel memory is available. There may be other times when it will be beneficial to explicitly set the ARC's maximum
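The rule-of-thumb arithmetic above can be checked directly (illustrative only; the ~3K-per-entry cost and the 1:3 in-ARC/out-of-ARC ratio are the figures quoted in this thread, not measured values):

```python
KB = 1024
ncsize = 500_000
per_entry_kb = 3                        # ~3K of kernel memory per DNLC vnode
total_mb = ncsize * per_entry_kb / KB   # ~1465 MB, i.e. roughly 1.5GB

# Only the 512-byte znode_phys block is ARC-accounted; for every 1MB the
# ARC sees, about 3MB sits outside it, so the ARC "sees" roughly a quarter
# of the real footprint.
arc_visible_mb = total_mb / 4           # ~366 MB, "about 400MB" in the post

print(round(total_mb), round(arc_visible_mb))
```

This is exactly why a large ncsize can exhaust memory while the reported ARC size still looks modest.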
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason J. W. Williams wrote: Hi Mark, That does help tremendously. How does ZFS decide which zio cache to use? I apologize if this has already been addressed somewhere. The ARC caches data blocks in the zio_buf_xxx() cache that matches the block size. For example, dnode data is stored on disk in 16K blocks (32 dnodes/block), so zio_buf_16384() is used for those blocks. Most file data blocks (in large files) are stored in 128K blocks, so zio_buf_131072() is used, etc. -Mark
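The cache selection described above is just a size-to-name mapping (a sketch for illustration; the cache names are as quoted in the post, the helper function is assumed):

```python
# The ARC picks the zio buffer cache whose size matches the block.
def zio_cache_for(block_size_bytes):
    return f"zio_buf_{block_size_bytes}"

DNODE_SIZE = 512
dnode_block = 16 * 1024
dnodes_per_block = dnode_block // DNODE_SIZE   # 32 dnodes per 16K block

print(zio_cache_for(dnode_block))   # zio_buf_16384
print(zio_cache_for(128 * 1024))    # zio_buf_131072
```

So dnode blocks land in zio_buf_16384 and large-file data blocks in zio_buf_131072, matching Mark's examples.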
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason J. W. Williams wrote:
Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a certain degree: If you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable; it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation.
That kind of memory loss makes ZFS almost unusable for a database system.
Note that you are not going to experience these sorts of overheads unless you are accessing *many* files. In a database system, there are only going to be a few files = no significant overhead.
I agree that a page cache similar to UFS would be much better. Linux works similarly to free pages, and it has been effective enough in the past. Though I'm equally unhappy about Linux's tendency to grab every bit of free RAM available for filesystem caching, and then cause massive memory thrashing as it frees it for applications.
The page cache is much better in the respect that it is more tightly integrated with the VM system, so you get more efficient response to memory pressure.
It is *much worse* than the ARC at caching data for a file system. In the long term we plan to integrate the ARC into the Solaris VM system.
Best Regards, Jason
On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:
Hello Jason,
Wednesday, January 10, 2007, 9:45:05 PM, you wrote:
JJWW Sanjeev Robert,
JJWW Thanks guys. We put that in place last night and it seems to be doing
JJWW a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW of 4GB on the Thumper is enough. I would be interested in what the
JJWW other ZFS modules memory behaviors are. I'll take a perusal through
JJWW the archives. In general it seems to me that a max cap for ZFS whether
JJWW set through a series of individual tunables or a single root tunable
JJWW would be very helpful.
Yes it would. Better yet would be if memory consumed by ZFS for caching (dnodes, vnodes, data, ...) would behave similarly to the page cache as with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed. I guess (and it's really a guess, only based on some emails here) that in the worst-case scenario ZFS caches would consume about: arc_max + 3*arc_max + memory lost to fragmentation. So I guess with arc_max set to 1GB you can lose even 5GB (or more), and currently only that first 1GB can be gotten back automatically.
--
Best regards, Robert   mailto:[EMAIL PROTECTED]   http://milek.blogspot.com
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
Thomas,
This could be fragmentation in the meta-data caches. Could you print out the results of ::kmastat?
-Mark
Tomas Ögren wrote:
On 05 January, 2007 - Robert Milkowski sent me these 3,8K bytes:
Hello Tomas, I saw the same behavior here when ncsize was increased from the default. Try with the default and let's see what happens - if it works, then it's better than hanging every hour or so.
That's still not the point. It was fine with ncsize=500k (and all of it used) for a while... then all of a sudden it just went haywire... and when it freed up the dnlc, I got back 250MB... where's the rest, ~1750MB, tied up?
/Tomas
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
So it looks like this data does not include ::kmastat info from *after* you reset arc_reduce_dnlc_percent. Can I get that?
What I suspect is happening:
1. with your large ncsize, you eventually ran the machine out of memory because (currently) the arc is not accounting for the space consumed by auxiliary caches (dnode_t, etc.).
2. the arc could not reduce at this point since almost all of its memory was tied up by the dnlc refs.
3. when you eventually allowed the arc to reduce the dnlc size, it managed to free up some space, but much of this did not appear because it was tied up in slabs in the auxiliary caches (fragmentation).
We are working on a fix for number 1 above. You should *not* be setting arc_reduce_dnlc_percent to zero, even if you want a large number of dnlc entries. You are tying the arc's hands here, so it has no ability to reduce its size. Number 3 is the most difficult issue. We are looking into that at the moment as well.
-Mark
Tomas Ögren wrote:
On 05 January, 2007 - Mark Maybee sent me these 0,8K bytes:
Thomas, This could be fragmentation in the meta-data caches. Could you print out the results of ::kmastat?
http://www.acc.umu.se/~stric/tmp/zfs-dumps.tar.bz2
memstat, kmastat and dnlc_nentries from 10 minutes after boot up until the near-death experience. I've got the vmcore as well if needed.
/Tomas
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
Tomas Ögren wrote: On 05 January, 2007 - Mark Maybee sent me these 1,5K bytes: So it looks like this data does not include ::kmastat info from *after* you reset arc_reduce_dnlc_percent. Can I get that? Yeah, attached. (although about 18 hours after the others) Excellent, this confirms #3 below. What I suspect is happening: 1) with your large ncsize, you eventually ran the machine out of memory because (currently) the ARC is not accounting for the space consumed by auxiliary caches (dnode_t, etc.). 2) the ARC could not reduce at this point since almost all of its memory was tied up by the DNLC refs. 3) when you eventually allowed the ARC to reduce the DNLC size, it managed to free up some space, but much of this did not appear because it was tied up in slabs in the auxiliary caches (fragmentation). We are working on a fix for number 1 above. Great! You should *not* be setting arc_reduce_dnlc_percent to zero, even if you want a large number of DNLC entries. You are tying the ARC's hands here, so it has no ability to reduce its size. Number 3 is the most difficult issue. We are looking into that at the moment as well. Any idea where all the memory is going? I sure hope that 500k dnlc entries (+dnode_t's etc belonging to that) isn't using up about 2GB ram..? Actually, that's pretty much what is happening: 500k dnlc = 170MB in the vnodes (vn_cache) + 320MB in znode_phys data (zio_buf_512) + 382MB in dnode_phys data (zio_buf_16384) + 208MB in dmu bufs (dmu_buf_impl_t) + 400MB in dnodes (dnode_t) + 120MB in znodes (zfs_znode_cache) - total 1600MB. These numbers come from the last ::kmastat you ran before reducing the DNLC size. Note below that much of this space is still consumed by these caches, even after the DNLC has dropped its references. This is largely due to fragmentation in the caches. /Tomas

cache name        buf size  buf in use  buf total  memory in use  alloc succeed  alloc fail
---------------   --------  ----------  ---------  -------------  -------------  ----------
vn_cache               240      405388     657696      173801472         948191           0 ...
zio_buf_512            512      137801     294975      161095680       43660052           0 ...
zio_buf_16384        16384        6692       6697      109723648        5877279           0 ...
dmu_buf_impl_t         328      145260     622392      212443136       65461261           0
dnode_t                640      137799     512508      349872128       37995548           0 ...
zfs_znode_cache        200      137763     568040      116334592       35683478           0
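Mark's 1600MB figure can be sanity-checked with quick arithmetic (an illustrative sketch in Python, not ZFS code; the per-cache MB values are the ones quoted in the message above):

```python
# Memory attributed to the 500k DNLC entries, per cache, in MB
# (figures quoted from the ::kmastat discussion above).
mb = {
    "vn_cache": 170,         # vnodes
    "zio_buf_512": 320,      # znode_phys data
    "zio_buf_16384": 382,    # dnode_phys data
    "dmu_buf_impl_t": 208,   # dmu bufs
    "dnode_t": 400,          # dnodes
    "zfs_znode_cache": 120,  # znodes
}
total_mb = sum(mb.values())
per_entry_kb = total_mb * 1024 / 500_000
print(total_mb, round(per_entry_kb, 1))  # 1600 MB total, ~3.3 KB per entry
```

So each cached name costs roughly 3.3 KB of kernel memory once all the supporting metadata is counted, which is why a 500k-entry DNLC balloons to well over a gigabyte.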
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
Tomas, There are a couple of things going on here: 1. There is a lot of fragmentation in your meta-data caches (znode, dnode, dbuf, etc). This is burning up about 300MB of space in your hung kernel. This is a known problem that we are currently working on. 2. While the ARC has set its desired size down to c_min (64MB), it's actually still consuming ~800MB in the hung kernel. This is odd. The bulk of this space is in the 32K and 64K data caches. Could you print out the contents of ARC_anon, ARC_mru, ARC_mfu, ARC_mru_ghost, and ARC_mfu_ghost? -Mark Tomas Ögren wrote: Hello. Having some hangs on a snv53 machine which are quite probably ZFS+NFS related, since that's all the machine does ;) The machine is a 2x750MHz Blade1000 with 2GB ram, using a SysKonnect 9821 GigE card (with their 8.19.1.3 skge driver) and two HP branded MPT SCSI cards. Normal load is pretty much read all you can with misc tarballs and isos since it's an NFS backend to our caching http/ftp cluster delivering free software to the world. What happens is that the machine just stops responding.. it can respond to ping for a while (while userland is non-responsive, including console) but after a while, that stops too.. Produced a panic to get a dump and tried ::memstat:

unterweser:/scratch/070103# mdb unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs scsi_vhci pcisch ssd fcp fctl qlc md ip hook neti sctp arp usba s1394 nca lofs zfs random sd nfs ptm cpc ]
> ::memstat
Page Summary        Pages      MB   %Tot
Kernel             250919    1960    98%
Anon                  888       6     0%
Exec and libs         247       1     0%
Page cache             38       0     0%
Free (cachelist)      405       3     0%
Free (freelist)      4370      34     2%
Total              256867    2006
Physical           253028    1976

That doesn't seem too healthy to me.. probably something kernely eating up everything and the machine is just swapping to death or something..
A dump from a live kernel with mdb -k after 1.5h uptime:

Page Summary        Pages      MB   %Tot
Kernel             212310    1658    83%
Anon                11307      88     4%
Exec and libs        2418      18     1%
Page cache          18400     143     7%
Free (cachelist)     4383      34     2%
Free (freelist)      8049      62     3%

The tweaks I have are:

set ncsize = 50
set nfs:nrnode = 50
set zfs:zil_disable=1
set zfs:zfs_vdev_cache_bshift=14
set zfs:zfs_vdev_cache_size=0

Which according to ::kmem_cache results in about:

030002e30008  dmu_buf_impl_t    0  0   328  487728
030002e30288  dnode_t           0  0   640  453204
030002e30508  arc_buf_hdr_t     0  0   144  103544
030002e30788  arc_buf_t         0  0    40   36743
030002e30a08  zil_lwb_cache     0  0   200       0
030002e30c88  zfs_znode_cache   0  0   200  453200

but those buffers add up to about 550MB.. dnlc_nentries on the hung kernel has gone down to 15000.. (where are the rest of the ~450k-15k dnode/znodes hanging out?)

Hung kernel:
> arc::print
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x358a0600
    p = 0x400
    c = 0x400
    c_min = 0x400
    c_max = 0x5e114800
    hits = 0xbc860fd
    misses = 0x2f296e1
    deleted = 0x1d88739
    recycle_miss = 0xf7f30c
    mutex_miss = 0x24b13d
    evict_skip = 0x21501d02
    hash_elements = 0x27f97
    hash_elements_max = 0x27f97
    hash_collisions = 0x1651b43
    hash_chains = 0x7ac3
    hash_chain_max = 0x12
    no_grow = 0x1
}

Live kernel:
> arc::print
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x1b279400
    p = 0x1a1dcaa4
    c = 0x1a1dcaa4
    c_min = 0x400
    c_max = 0x5e114800
    hits = 0xef7c96
    misses = 0x25efa8
    deleted = 0x1db537
    recycle_miss = 0xa6221
    mutex_miss = 0x12b59
    evict_skip = 0x70d62b
    hash_elements = 0xcda1
    hash_elements_max = 0x1b589
    hash_collisions = 0x18e58a
    hash_chains = 0x3d16
    hash_chain_max = 0xf
    no_grow = 0x1
}

Should I post ::kmem_cache and/or ::kmastat somewhere? It's about 2*(20+30)kB.. /Tomas
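The ::memstat page counts convert to MB assuming the Blade 1000's 8 KB base page size (a quick arithmetic sketch; the page size is an assumption about this UltraSPARC machine, though it reproduces the reported MB values exactly):

```python
PAGESIZE = 8192  # assumed: 8 KB base pages on UltraSPARC

def pages_to_mb(pages):
    # ::memstat truncates to whole MB, so use integer division
    return pages * PAGESIZE // (1 << 20)

print(pages_to_mb(250919))  # 1960 -- the Kernel line from the hung-kernel dump
print(pages_to_mb(256867))  # 2006 -- the Total line
```

Plugging in the other rows (e.g. 4370 freelist pages -> 34 MB) reproduces the table as well, which confirms the 8 KB assumption.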
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
Hmmm, so there is lots of evictable cache here (mostly in the MFU part of the cache)... could you make your core file available? I would like to take a look at it. -Mark Tomas Ögren wrote: On 03 January, 2007 - Mark Maybee sent me these 5,0K bytes: Tomas, There are a couple of things going on here: 1. There is a lot of fragmentation in your meta-data caches (znode, dnode, dbuf, etc). This is burning up about 300MB of space in your hung kernel. This is a known problem that we are currently working on. Great! 2. While the ARC has set its desired size down to c_min (64MB), its actually still consuming ~800MB in the hung kernel. This is odd. The bulk of this space is in the 32K and 64K data caches. Could you print out the contents of ARC_anon, ARC_mru, ARC_mfu, ARC_mru_ghost, and ARC_mfu_ghost? Like this? ARC_anon::print { list = { list_size = 0 list_offset = 0 list_head = { list_next = 0 list_prev = 0 } } lsize = 0 size = 0x19c000 hits = 0 mtx = { _opaque = [ 0 ] } } ARC_mru::print { list = { list_size = 0x90 list_offset = 0x70 list_head = { list_next = 0x30072a5b5f8 list_prev = 0x300758b6c70 } } lsize = 0x1f88200 size = 0x3e5c200 hits = 0x44c48ad mtx = { _opaque = [ 0 ] } } ARC_mfu::print { list = { list_size = 0x90 list_offset = 0x70 list_head = { list_next = 0x30099c7a730 list_prev = 0x300dc11fee0 } } lsize = 0x2f2e4400 size = 0x318a8400 hits = 0x466bbec mtx = { _opaque = [ 0 ] } } ARC_mru_ghost::print { list = { list_size = 0x90 list_offset = 0x70 list_head = { list_next = 0x300758b6eb0 list_prev = 0x300d65faa10 } } lsize = 0x97a3cc00 size = 0x97a3cc00 hits = 0xfa4a49 mtx = { _opaque = [ 0 ] } } ARC_mfu_ghost::print { list = { list_size = 0x90 list_offset = 0x70 list_head = { list_next = 0x3006c7c8ce0 list_prev = 0x300d65fa2c0 } } lsize = 0x879ddc00 size = 0x879ddc00 hits = 0x3b37c8 mtx = { _opaque = [ 0 ] } } /Tomas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
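Summing the size fields from the ::print dumps above (plain hex arithmetic, not ZFS code) shows where the space sits and why Mark calls most of it evictable:

```python
# "size" fields from the ARC state ::print output above.
anon = 0x19C000         # ~1.6 MB
mru = 0x3E5C200         # ~62 MB
mfu = 0x318A8400        # ~793 MB -- the bulk of the cache
mfu_lsize = 0x2F2E4400  # evictable (lsize) portion of the MFU list

total_mb = (anon + mru + mfu) / (1 << 20)
print(round(total_mb))            # ~857 MB resident, in line with the ~800MB figure
print(round(mfu_lsize / mfu, 2))  # ~0.95 -- about 95% of the MFU data is evictable
```

The three list sizes sum to exactly the arc.size = 0x358a0600 reported earlier for the hung kernel, so the accounting is internally consistent; the puzzle is why none of that evictable MFU data was actually evicted.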
Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53
Ah yes! Thank you Casper. I knew this looked familiar! :-) Yes, this is almost certainly what is happening here. The bug was introduced in build 51 and fixed in build 54. [EMAIL PROTECTED] wrote: Hmmm, so there is lots of evictable cache here (mostly in the MFU part of the cache)... could you make your core file available? I would like to take a look at it. Isn't this just like: 6493923 nfsfind on ZFS filesystem quickly depletes memory in a 1GB system Which was introduced in b51(or 52) and fixed in snv_54. Casper
Re: [zfs-discuss] Uber block corruption?
[EMAIL PROTECTED] wrote: Hello Casper, Tuesday, December 12, 2006, 10:54:27 AM, you wrote: So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will become corrupt through something that doesn't also make all the data corrupt or inaccessible. CDSC So how does this work for data which is freed and overwritten; does CDSC the system make sure that none of the data referenced by any of the CDSC old ueberblocks is ever overwritten? Why should it? If blocks are not used by the current UB, I guess you can safely assume they are free. What if a newer UB is corrupted and you fall back to an older one? Casper A block freed in transaction group N cannot be reused until transaction group N+3; so there is no possibility of referencing an overwritten block unless you have to back off more than two uberblocks. At this point, blocks that have been overwritten will show up as corrupted (bad checksums). -Mark
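Mark's reuse rule can be sketched as a toy model (illustrative Python, not ZFS code): a block freed in transaction group N stays off-limits until group N+3, which is exactly what makes rolling back up to two uberblocks safe.

```python
def reusable(freed_txg, current_txg):
    # A block freed in txg N may not be reallocated before txg N+3.
    return current_txg >= freed_txg + 3

# Within the next two txgs the freed block is still intact, so the two
# previous uberblocks can still be trusted; after that it may be rewritten.
assert not reusable(100, 101)
assert not reusable(100, 102)
assert reusable(100, 103)
```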
Re: [zfs-discuss] ZFS compression / ARC interaction
Andrew Miller wrote: Quick question about the interaction of ZFS filesystem compression and the filesystem cache. We have an Opensolaris (actually Nexenta alpha-6) box running RRD collection. These files seem to be quite compressible. A test filesystem containing about 3,000 of these files shows a compressratio of 12.5x. My question is about how the filesystem cache works with compressed files. Does the fscache keep a copy of the compressed data, or the uncompressed blocks? To update one of these RRD files, I believe the whole contents are read into memory, modified, and then written back out. If the filesystem cache maintained a copy of the compressed data, a lot more, maybe more than 10x more, of these files could be maintained in the cache. That would mean we could have a lot more data files without ever needing to do a physical read. Looking at the source code overview, it looks like the compression happens underneath the ARC layer, so by that I am assuming the uncompressed blocks are cached, but I wanted to ask to be sure. Thanks! -Andy Yup, your assumption is correct. We currently do compression below the ARC. We have contemplated caching data in compressed form, but have not really explored the idea fully yet. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replacing a drive in a raidz vdev
Jeremy Teo wrote: On 12/5/06, Bill Sommerfeld [EMAIL PROTECTED] wrote: On Mon, 2006-12-04 at 13:56 -0500, Krzys wrote: mypool2/[EMAIL PROTECTED] 34.4M - 151G - mypool2/[EMAIL PROTECTED] 141K - 189G - mypool2/d3 492G 254G 11.5G legacy I am so confused with all of this... Why is it taking so long to replace that one bad disk? To work around a bug where a pool traverse gets lost when the snapshot configuration of a pool changes, both scrubs and resilvers will start over again any time you create or delete a snapshot. Unfortunately, this workaround has problems of its own -- if your inter-snapshot interval is less than the time required to complete a scrub, the resilver will never complete. The open bug is: 6343667 scrub/resilver has to start over when a snapshot is taken If it's not going to be fixed any time soon, perhaps we need a better workaround. Anyone internal working on this? Yes. But it's going to be a few months. -Mark
Re: [zfs-discuss] df -e in ZFS
Robert Milkowski wrote: Hello John, Thursday, November 9, 2006, 12:03:58 PM, you wrote: JC Hi all, JC When testing our programs, I got a problem. On UFS, we get the number of JC free inodes via 'df -e', then do some things based on this value, such as JC create an empty file, the value will decrease by 1. But on ZFS, it does JC not work. I still can get a number via 'df -e', and create a same empty JC file, the value is not my expectation. So I use a loop to produce empty JC files and watch the output of 'df -e'. After some long time, the number JC is 671, then 639, 641, 603, 605, 609, 397, 607... JC I check the number of files, yes, it increases steadily. JC Could you explain it? UFS has a static number of inodes in a given file system, so it's easy to say how many free inodes are left. ZFS creates inodes on demand, so you can't say how many inodes you can create - however I guess one could calculate the maximum possible number of inodes to be created given the free space in a pool/fs. Yup, this is what ZFS does. It makes a *very rough* estimate of how many empty files could be created given the amount of available space. This number may be useful as some sort of upper bound, but no more than that. -Mark
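A sketch of the kind of estimate Mark describes (the 512-byte per-file cost below is an assumed illustrative constant, not the exact ZFS formula):

```python
def est_free_files(avail_bytes, per_file_cost=512):
    # "Free inodes" reported as available space divided by a nominal
    # per-file metadata cost (512 bytes is an assumption for illustration).
    return avail_bytes // per_file_cost

print(est_free_files(10 * 2**30))  # 20971520 "free inodes" for 10 GB free
```

Because the estimate is tied to free space rather than a fixed inode table, creating a file (which itself consumes some space) can move the reported number by more or less than one, which explains the erratic 671, 639, 641, ... sequence John observed.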
Re: [zfs-discuss] Re: reproducible zfs panic on Solaris 10 06/06
Matthew Flanagan wrote: Matt, Matthew Flanagan wrote: mkfile 100m /data zpool create tank /data ... rm /data ... panic[cpu0]/thread=2a1011d3cc0: ZFS: I/O failure (write on unknown off 0: zio 60007432bc0 [L0 unallocated] 4000L/400P DVA[0]=0:b000:400 DVA[1]=0:120a000:400 fletcher4 lzjb BE contiguous birth=6 fill=0 cksum=672165b9e7:328e78ae25fd:ed007c9008f5f:34c05b1090 0b668): error 6 ... is there a fix for this? Um, don't do that? This is a known bug that we're working on. What is the bugid for this, and an ETA for the fix? 6417779 ZFS: I/O failure (write on ...) -- need to reallocate writes and 6322646 ZFS should gracefully handle all devices failing (when writing) These bugs are actively being worked on, but it will probably be a while before fixes appear. -Mark I'm extremely surprised that this kind of bug can make it into a Solaris release. This is the second zfs related panic that I've found while testing it in our labs. The first caused the system to panic when the ZFS volume got close to 100% full (Sun case id #10914593). I've just replicated this panic with a USB flash drive as well by creating the zpool and then yanking the drive out. This is probably a common situation for desktop/laptop users, who would not be impressed that their otherwise robust Solaris system crashed. regards matthew
Re: [zfs-discuss] Automounting ? (idea ?)
Patrick wrote: Hi, So recently, I decided to test out some of the ideas I've been toying with, and decided to create 50 000 and 100 000 filesystems. The test machine was a nice V20Z with dual 1.8GHz Opterons, 4GB ram, connecting to a SCSI 3310 raid array via two SCSI controllers. Now creating the mass of filesystems, and the mass of properties I randomly assigned them, was pretty easy, and I must say, I LOVE zfs, I really do LOVE zfs! The script I created basically created /data/clients/clientID, and then randomly set a quota, as well as randomly decided if compression was to be on - basically just to set properties for it, and such. clientID is a numeric value which starts at 1 and continues upwards. Now, creating, I was quite surprised to see the amount of IO generated on the array's management console, but nevertheless it created them without a hitch, although it took a little while. In the real world one wouldn't create 100 000 filesystems overnight, and even if one did, one could wait an hour, or two... The problem came in when I had to reboot the machine, and well... yes, a few hours later, it came up :) So this got me thinking: ZFS makes a perfect solution for massive user directory type setups, giving you the ability to have quotas and such stored on the filesystem, and then export the root filesystem. Alas, some systems have thousands, if not hundreds of thousands, of users, where that would be an awesome solution, but mounting ALL of those filesystems on boot becomes a pain. So ... how about an automounter? Is this even possible? Does it exist? Helll!! Patrick *sigh*, one of the issues we recognized, when we introduced the new cheap/fast file system creation, was that this new model would stress the scalability (or lack thereof) of other parts of the operating system. This is a prime example. I think the notion of an automount option for zfs directories is an excellent one.
Solaris does support automount, and it should be possible, by setting the mountpoint property to legacy, to set up automount tables to achieve what you want now; but it would be nice if zfs had a property to do this for you automatically. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: zfs panic installing a brandz zone
Yup, its almost certain that this is the bug you are hitting. -Mark Alan Hargreaves wrote: I know, bad form replying to myself, but I am wondering if it might be related to 6438702 error handling in zfs_getpage() can trigger page not locked Which is marked fix in progress with a target of the current build. alan. Alan Hargreaves wrote: Folks, before I start delving too deeply into this crashdump, has anyone seen anything like it? The background is that I'm running a non-debug open build of b49 and was in the process of running the zoneadm -z redlx install After a bit, the machine panics, initially looking at the crashdump, I'm down to 88mb free (out of a gig) and see the following stack. fe8000de7800 page_unlock+0x3b(180218720) fe8000de78d0 zfs_getpage+0x236(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe808180a000, 1, 80826dc8) fe8000de7950 fop_getpage+0x52(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe8081818000, 1, 80826dc8) fe8000de7a50 segmap_fault+0x1d6(801a6f38, fbc29b20, fe8081818000, 2000, 0, 1) fe8000de7b30 segmap_getmapflt+0x67a(fbc29b20, 89b84d80, 12000, 2000, 1, 1) fe8000de7bd0 lofi_strategy_task+0x14b(959d2400) fe8000de7c60 taskq_thread+0x1a7(84453da8) fe8000de7c70 thread_start+8() %rax = 0x %r9 = 0x0300430e %rbx = 0x000e %r10 = 0x1000 %rcx = 0xfe8081819000 %r11 = 0x113709b0 %rdx = 0xfe8000de7c80 %r12 = 0x000180218720 %rsi = 0x00013000 %r13 = 0xfbc52160 pse_mutex+0x200 %rdi = 0xfbc52160 pse_mutex+0x200 %r14 = 0x4000 %r8 = 0x0200 %r15 = 0xfe8000de79d8 %rip = 0xfb8474fb page_unlock+0x3b %rbp = 0xfe8000de7800 %rsp = 0xfe8000de77e0 %rflags = 0x00010246 id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0 status=of,df,IF,tf,sf,ZF,af,PF,cf %cs = 0x0028%ds = 0x0043%es = 0x0043 %trapno = 0xe %fs = 0xfsbase = 0x8000 %err = 0x0 %gs = 0x01c3gsbase = 0xfbc27b70 While the panic string says NULL pointer dereference, it appears that 0x180218720 is not mapped. 
The dereference looks like the first dereference in page_unlock(), which looks at pp-p_selock. I can spend a little time looking at it, but was wondering if anyone had seen this kind of panic previously? I have two identical crashdumps created in exactly the same way. alan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is
Robert Milkowski wrote: Hello Philippe, It was recommended to lower ncsize and I did (to default ~128K). So far it works ok for last days and staying at about 1GB free ram (fluctuating between 900MB-1,4GB). Do you think it's a long term solution or with more load and more data the problem can surface again even with current ncsize value? Robert, I don't think this should be impacted too much by load/data, as long as the DNLC is able to evict, you should be in good shape. We are still working on a fix for the root cause of this issue however. -Mark
[zfs-discuss] Re: when zfs enabled java
Jill Manfield wrote: My customer is running java on a ZFS file system. His platform is Solaris 10 x86 SF X4200. When he enabled ZFS his memory of 18 gigs drops to 2 gigs rather quickly. I had him do a # ps -e -o pid,vsz,comm | sort -n +1 and it came back: The culprit application you see is java: 507 89464 /usr/bin/postmaster 515 89944 /usr/bin/postmaster 517 91136 /usr/bin/postmaster 508 96444 /usr/bin/postmaster 516 98088 /usr/bin/postmaster 503 3449580 /usr/jre1.5.0_07/bin/amd64/java 512 3732468 /usr/jre1.5.0_07/bin/amd64/java Here is what the customer responded: Well, Java is a memory hog, but it's not the leak -- it's the application. Even after it fails due to lack of memory, the memory is not reclaimed and we can no longer restart it. Is there a bug on zfs? I did not find one in sunsolve, but then again I might have been searching for the wrong thing. We have done some sleuth work and are starting to think our problem might be ZFS -- the new file system Sun supports. The documentation for ZFS states that it tries to cache as much as it can, and it uses kernel memory for the cache. That would explain memory gradually disappearing. ZFS can give memory back, but it does not do so quickly. Yup, this is likely your problem. ZFS takes a little time to give back memory, and the app may fail with ENOMEM before this happens. So, is there any way to check that? If it turns out to be the problem... 1) Is there a way to limit the size of ZFS's caches? Well... sort of. You can set the size of arc.c_max and this will put an upper bound on the cache. But this is a bit of a hack. If not, then 2) Is there a way to clear ZFS's cache? Try unmounting/mounting the file system; if that does not work, try export/import of the pool. -Mark
Re: [zfs-discuss] Memory Usage
Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. As to your questions, I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. It's currently initialized to be phys-mem-size - 1GB. 2) First try unmounting/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. -Mark
Re: [zfs-discuss] Memory Usage
Thomas Burns wrote: On Sep 12, 2006, at 2:04 PM, Mark Maybee wrote: Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. We are running the jun 06 version of solaris (10/6?). I don't have a core dump now -- but can probably get one in the next week or so. Where should I send it? You can drop cores via ftp to: sunsolve.sun.com login as anonymous or ftp deposit into /cores Also, where do I set arc.c_max? In etc/system? Out of curiosity, why isn't limiting arc.c_max considered best practice (I just want to make sure I am not missing something about the effect limiting it will have)? My guess is that in our case (lots of small groups -- 50 people or less -- sharing files over the web) that file system caches are not that useful. The small groups mean that no one file gets used that often and, since access is over the web, their response time will be largely limited by their internet connection. We don't want users to need to tune a bunch of knobs to get performance out of ZFS. We want it to work well out of the box. 
So we are trying to discourage using these tunables, and instead figure out what the root problem is and fix it. There is really no reason why zfs shouldn't be able to adapt itself appropriately to the available memory. Thanks a lot for the response! As to your questions; I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. Its currently initialized to be phys-mem-size - 1GB. 2) First try unmount/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. -Mark Tom Burns ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: ZFS forces system to paging to the point it is
Jürgen Keil wrote: We are trying to obtain a mutex that is currently held by another thread trying to get memory. Hmm, reminds me a bit of the zvol swap hang I got some time ago: http://www.opensolaris.org/jive/thread.jspa?threadID=11956&tstart=150 I guess if the other thread is stuck trying to get memory, then it is allocating the memory with KM_SLEEP, while holding a mutex? Yup, this is essentially another instance of this problem. -Mark
Re: [zfs-discuss] Performance problem of ZFS ( Sol 10U2 )
Ivan, What mail clients use your mail server? You may be seeing the effects of: 6440499 zil should avoid txg_wait_synced() and use dmu_sync() to issue parallel IOs when fsyncing This bug was fixed in nevada build 43, and I don't think made it into s10 update 2. It will, of course, be in update 3 and be available in a patch at some point. Ivan Debnár wrote: Hi, I deployed ZFS on our mailserver recently, hoping for eternal peace after running on UFS and moving files witch each TB added. It is mailserver - it's mdirs are on ZFS pool: capacity operationsbandwidth poolused avail read write read write - - - - - - - mailstore 3.54T 2.08T280295 7.10M 5.24M mirror590G 106G 34 31 676K 786K c6t3d0 - - 14 16 960K 773K c8t22260001552EFE2Cd0 - - 16 18 1.06M 786K mirror613G 82.9G 51 37 1.44M 838K c6t3d1 - - 20 19 1.57M 824K c5t1d1 - - 20 24 1.40M 838K c8t227C0001559A761Bd0 - - 5101 403K 4.63M mirror618G 78.3G133 60 6.23M 361K c6t3d2 - - 40 27 3.21M 903K c4t2d0 - - 23 81 1.91M 2.98M c8t221200015599F2CFd0 - - 6108 442K 4.71M mirror613G 83.2G110 51 3.66M 337K c6t3d3 - - 36 25 2.72M 906K c5t2d1 - - 29 65 1.80M 2.92M mirror415G 29.0G 30 28 460K 278K c6t3d4 - - 11 19 804K 268K c4t1d2 - - 15 22 987K 278K mirror255G 441G 26 49 536K 1.02M c8t22110001552F3C46d0 - - 12 27 835K 1.02M c8t224B0001559BB471d0 - - 12 29 835K 1.02M mirror257G 439G 32 52 571K 1.04M c8t22480001552D7AF8d0 - - 14 28 1003K 1.04M c4t1d0 - - 14 32 1002K 1.04M mirror251G 445G 28 53 543K 1.02M c8t227F0001552CB892d0 - - 13 28 897K 1.02M c8t22250001559830A5d0 - - 13 30 897K 1.02M mirror 17.4G 427G 22 38 339K 393K c8t22FA00015529F784d0 - - 9 19 648K 393K c5t2d2 - - 9 23 647K 393K It is 3x dual-iSCSI + 2x dual SCSI DAS arrays (RAID0, 13x250). I have problem however: The 2 SCSI arrays were able to handle the mail-traffic fine with UFS on them. The new config with 3 additional arrays seem to have problem using ZFS. The writes are waiting for 10-15 seconds to get to disk - so queue fills ver quickly, reads are quite ok. 
I assume this is the problem with ZFS preferring reads to writes. I also see in 'zpool iostat -v 1' that writes are issued to disk only once in 10 secs, and then it's 2000 rq in one sec. Reads are sustained at cca 800 rq/s. Is there a way to tune this read/write ratio? Is this a known problem? I tried to change vq_max_pending as suggested by Eric in http://blogs.sun.com/erickustarz/entry/vq_max_pending But no change in this write behaviour. Iostat shows cca 20-30ms asvc_t, 0%w, and cca 30% busy on all drives, so these are not saturated it seems. (before, with UFS, they had 90% busy, 1% wait). System is Sol 10 U2, Sun X4200, 4GB RAM. Please if you could give me some hint to really make this work, as the way back to UFS is almost impossible on a live system.
Re: [zfs-discuss] Re: ZFS forces system to paging to the point it is
Hmmm, interesting data. See comments in-line:

Robert Milkowski wrote:

Yes, server has 8GB of RAM. Most of the time there's about 1GB of free RAM.

bash-3.00# mdb 0
Loading modules: [ unix krtld genunix dtrace specfs ufs sd md ip sctp usba fcp fctl qlc ssd lofs zfs random logindmux ptm cpc nfs ipc ]
> arc::print
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x8b72ae00

We are referencing about 2.2GB of data from the ARC.

    p = 0xfe41b00
    c = 0xfe51b00

We are trying to get down to our minimum target size of 16MB. So we are obviously feeling memory pressure and trying to react.

    c_min = 0xfe51b00
    c_max = 0x1bca36000
    ...

> ::kmastat
cache                     buf      buf      buf      memory      alloc  alloc
name                     size   in use    total      in use    succeed   fail
----------------------  -----  -------  -------  ----------  ---------  -----
...
vn_cache                  240  2400324  2507745   662691840    6307891      0

This is very interesting: 2.4 million vnodes are active.

...
zio_buf_512               512  2388292  2388330  1304346624  176134688      0
zio_buf_1024             1024       18       96       98304    17058709     0
zio_buf_1536             1536        0       30       49152     2791254     0
zio_buf_2048             2048        0       20       40960     1051435     0
zio_buf_2560             2560        0       33       90112     1716360     0
zio_buf_3072             3072        0       40      122880     1902497     0
zio_buf_3584             3584        0      225      819200     3918593     0
zio_buf_4096             4096        3       34      139264    20336550     0
zio_buf_5120             5120        0      144      737280     8932632     0
zio_buf_6144             6144        0       36      221184     5274922     0
zio_buf_7168             7168        0       16      114688     3350804     0
zio_buf_8192             8192        0       11       90112     9131264     0
zio_buf_10240           10240        0       12      122880     2268700     0
zio_buf_12288           12288        0        8       98304     3258896     0
zio_buf_14336           14336        0       60      860160    15853089     0
zio_buf_16384           16384   142762   142793  2339520512    74889652     0
zio_buf_20480           20480        0        6      122880     1299564     0
zio_buf_24576           24576        0        5      122880     1063597     0
zio_buf_28672           28672        0        6      172032      712545     0
zio_buf_32768           32768        0        4      131072     1339604     0
zio_buf_40960           40960        0        6      245760     1736172     0
zio_buf_49152           49152        0        4      196608      609853     0
zio_buf_57344           57344        0        5      286720      428139     0
zio_buf_65536           65536      520      522    34209792     8839788     0
zio_buf_73728           73728        0        5      368640      284979     0
zio_buf_81920           81920        0        5      409600      133392     0
zio_buf_90112           90112        0        6      540672       96787     0
zio_buf_98304           98304        0        4      393216      133942     0
zio_buf_106496         106496        0        5      532480       91769     0
zio_buf_114688         114688        0        5      573440       72130     0
zio_buf_122880         122880        0        5      614400       52151     0
zio_buf_131072         131072      100      107    14024704     7326248     0
dmu_buf_impl_t            328  2531066  2531232   863993856   237052643     0
dnode_t                   648  2395209  2395212  1635131392    83304588     0
arc_buf_hdr_t             128   142786   390852    50823168   155745359     0
arc_buf_t                  40   142786   347333    14016512   160502001     0
zil_lwb_cache             208       28      468       98304    30507668     0
zfs_znode_cache           192  2388224  2388246   465821696    83149771     0
...

Because of all of those vnodes, we are seeing a lot of extra memory being used by ZFS:
- about 1.5GB for the dnodes
- another 800MB for dbufs
- plus 1.3GB for the bonus buffers (not accounted for in the arc)
- plus about 400MB for znodes

This totals to another 4GB + .6GB held in vnodes. The question is who is holding these vnodes in memory... Could you do a ::dnlc!wc and let me know what it comes back with?

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
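Mark's accounting can be checked directly against the "memory in use" column of the ::kmastat output above. A quick sketch (the mapping of zio_buf_512 to the per-dnode bonus buffers is an inference from the matching allocation counts, not something the post states outright):

```python
# Figures are the "memory in use" column from the ::kmastat output above.
kmastat_bytes = {
    "dnode_t":         1635131392,  # ~1.5 GB of dnodes
    "dmu_buf_impl_t":   863993856,  # ~0.8 GB of dbufs
    "zio_buf_512":     1304346624,  # ~1.3 GB; one 512-byte bonus buffer per dnode
    "zfs_znode_cache":  465821696,  # ~0.4 GB of znodes
}
vn_cache = 662691840                # ~0.6 GB held in the vnode cache itself

gb = lambda b: b / 2**30
zfs_overhead = sum(kmastat_bytes.values())
print(f"ZFS per-file overhead: {gb(zfs_overhead):.1f} GB "
      f"+ {gb(vn_cache):.1f} GB of vnodes")
```

The sum lands right on the "4GB + .6GB" Mark quotes, which is why 2.4 million cached vnodes can starve an 8GB machine.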
Re: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is unresponsive
Robert Milkowski wrote:

On Wed, 6 Sep 2006, Mark Maybee wrote:

Robert Milkowski wrote:
> ::dnlc!wc
1048545 3145811 76522461

Well, that explains half your problem... and maybe all of it.

After I reduced vdev prefetch from 64K to 8K, for the last few hours the system has been working properly without the workaround and free memory stays at about 1GB. Reducing vdev prefetch to 8K also reduced read throughput 10x. I believe this is somehow related - maybe the vdev cache was so aggressive (I got 40-100MB/s of reads) and was consuming memory so fast that the thread which is supposed to regain some memory couldn't keep up?

I suppose, although the data volume doesn't seem that high... maybe you are just operating at the hairy edge here. Anyway, I have filed a bug to track this issue:

6467963 do_dnlc_reduce_cache() can be blocked by ZFS_OBJ_HOLD_ENTER()

-Mark
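Robert's ~10x throughput drop is roughly what simple read-amplification arithmetic predicts: the vdev cache rounds every small device read up to its configured block size, so shrinking that block size cuts device-read bandwidth proportionally. A hedged sketch (the 2KB "typical metadata read" is an assumed figure, not from the post):

```python
# Read amplification from the vdev cache: every small read is inflated to
# the cache block size, so 64K -> 8K should cut device reads by about 8x,
# in the same ballpark as the ~10x drop Robert measured.
small_read = 2048                          # assumed typical metadata read size
for vdev_cache_bs in (64 * 1024, 8 * 1024):
    amplification = vdev_cache_bs / small_read
    print(f"{vdev_cache_bs // 1024:2d} KB vdev cache: "
          f"{amplification:.0f}x amplification")
```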
Re: [zfs-discuss] ZFS forces system to paging to the point it is unresponsive
Robert, I would be interested in seeing your crash dump. ZFS will consume much of your memory *in the absence of memory pressure*, but it should be responsive to memory pressure and give up memory when this happens. It looks like you have 8GB of memory on your system? ZFS should never consume more than 7GB of this under any circumstances. Note there are a few outstanding bugs that could be coming into play here:

6456888 zpool scrubbing leads to memory exhaustion and system hang
6416757 zfs could still use less memory
6447701 ZFS hangs when iSCSI Target attempts to initialize its backing store

-Mark

P.S. It would be useful to see the output of: arc::print and ::kmastat

Robert Milkowski wrote:

Hi.

v440, S10U2 + patches
OS and Kernel Version: SunOS X 5.10 Generic_118833-20 sun4u sparc SUNW,Sun-Fire-V440

NFS server with ZFS as local storage. We were rsyncing a UFS filesystem to a ZFS filesystem exported over NFS. After some time the server which exports ZFS over NFS was unresponsive. The operator decided to force a panic and reboot the server. Further examination showed that the system was heavily paging, probably due to ZFS, as no other services are running there.

I just had another problem - it looks similar to the last one. I decided to put nfsd into the RT class. I guess ZFS is using all memory for its caches and after some time it fails to free it and forces the system to page. This is BAD, really BAD. More details on the previous problem below.
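Mark's "never more than 7GB of 8GB" ceiling matches the c_max value visible in the earlier arc::print output (0x1bca36000, just under 7GB). A hedged sketch of the sizing heuristic behind it (the "physical memory minus 1GB" rule is an assumption about this vintage of the ARC code; the exact formula varied by release):

```python
# Assumed ARC sizing heuristic for large-memory machines of this era:
# c_max is roughly physical memory minus 1 GB, which is why at most
# ~7 GB of an 8 GB box should ever be ARC data.
GB = 2**30
physmem = 8 * GB
c_max = physmem - 1 * GB        # assumed heuristic, not exact source code
observed_c_max = 0x1bca36000    # from the arc::print output above
print(f"expected ceiling: {c_max / GB:.0f} GB, "
      f"observed c_max: {observed_c_max / GB:.2f} GB")
```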
bash-3.00# savecore /f3-1/
System dump time: Sat Sep 2 03:31:18 2006
Constructing namelist /f3-1//unix.0
Constructing corefile /f3-1//vmcore.0
100% done: 1043993 of 1043993 pages saved
bash-3.00# cd /f3-1/
bash-3.00# mdb 0
Loading modules: [ unix krtld genunix dtrace specfs ufs sd md ip sctp usba fcp fctl qlc ssd lofs zfs random logindmux ptm cpc nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from XX
operating system: 5.10 Generic_118833-20 (sun4u)
panic message: sync initiated
dump content: kernel pages only
> ::spa
ADDR          STATE   NAME
060001271680  ACTIVE  f3-1
060003bd4dc0  ACTIVE  f3-2
> ::memstat
Page Summary        Pages      MB    %Tot
Kernel            1016199    7939     98%
Anon                 4420      34      0%
Exec and libs         736       5      0%
Page cache             36       0      0%
Free (cachelist)     1962      15      0%
Free (freelist)     18338     143      2%
Total             1041691    8138
Physical          1024836    8006
> ::swapinfo
ADDR          VNODE        PAGES    FREE     NAME
0600034ab5a0  600012ff8c0  1048763  1028489  /dev/md/dsk/d15

We were synchronizing a lot of small files over NFS, writing to f3-1/d611. I would say that with ZFS it's expected to be on low memory most of the time, but not to the point where the host starts paging.

bash-3.00# sar -g
SunOS X 5.10 Generic_118833-20 sun4u 09/02/2006
00:00:00  pgout/s  ppgout/s  pgfree/s  pgscan/s  %ufs_ipf
[...]
02:15:01     0.03      0.04      0.02      0.00      0.00
02:20:00     0.04      0.04      0.02      0.00      0.00
02:25:00     0.02      0.03      0.01      0.00      0.00
02:30:00     0.02      0.03      0.01      0.00      0.00
02:35:00     0.03      0.03      0.01      0.00      0.00
02:40:01     0.03      0.04      0.03      0.00      0.00
02:45:02     5.98     82.77     93.20  65115.59      0.00
03:39:28  unix restarts
03:40:00     0.35      0.61      0.61      0.00     60.00
03:45:00     0.03      0.06      0.06      0.00      0.00
03:50:00     0.02      0.03      0.02      0.00      0.00
03:55:00     0.02      0.02      0.02      0.00      0.00
bash-3.00# sar -u
SunOS 5.10 Generic_118833-20 sun4u 09/02/2006
00:00:00  %usr  %sys  %wio  %idle
[...]
02:00:00     0     1     0    99
02:05:00     0     1     0    99
02:10:00     0     1     0    99
02:15:01     0     1     0    99
02:20:00     0    15     0    85
02:25:00     0    34     0    66
02:30:00     0    20     0    80
02:35:00     0    22     0    78
02:40:01     0    45     0    55
02:45:02     0    61     0    38
03:39:28  unix restarts
03:40:00     5    10     0    84
03:45:00     1     1     0    98
03:50:00     0     0     0   100
bash-3.00# sar -q
SunOS xxx 5.10 Generic_118833-20 sun4u 09/02/2006
00:00:00  runq-sz  %runocc  swpq-sz  %swpocc
[...]
02:00:00      0.0        0      0.0        0
02:05:00      1.0        0      0.0        0
02:10:00      0.0        0      0.0        0
02:15:01      0.0        0      0.0        0
02:20:00      1.1        5      0.0        0
02:25:00      1.4       12      0.0        0
02:30:00      2.1        6      0.0        0
02:35:00      3.4        9      0.0        0
02:40:01      2.8       25      0.0        0
02:45:02      4.0       44    116.6       12
03:39:28  unix restarts
03:40:00      1.0        3      0.0        0
03:45:00      0.0        0      0.0        0
03:50:00      0.0        0      0.0        0

A crash dump can be provided off-list, not for public eyes.

This message posted from opensolaris.org
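The sar -g spike at 02:45 can be put in perspective with a little arithmetic: sun4u uses an 8KB base page, so the pgscan/s figure translates into the page scanner sweeping roughly half a gigabyte per second right before the hang — a clear sign of a severe memory crunch:

```python
# Convert the 02:45:02 sar -g pgscan/s sample into a scan bandwidth.
PAGESIZE = 8192            # sun4u base page size
pgscan_per_s = 65115.59    # from the sar -g output above
mb_per_s = pgscan_per_s * PAGESIZE / 2**20
print(f"page scanner sweeping ~{mb_per_s:.0f} MB/s")
```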
Re: [zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD.
Michael Schuster - Sun Microsystems wrote:

Pawel Jakub Dawidek wrote:

On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote:

I don't know much about ZFS, but Sun states this is a 128-bit filesystem. How will you handle this with regard to the FreeBSD kernel interface, which is already struggling to be 64-bit compliant? (I'm stating this based on this URL [1], but maybe it's not fully up-to-date.)

128 bits is not my goal, but I do want all the other goodies :)

are you going to attempt on-disk compatibility? Michael

Amazing work Pawel! Please do try to maintain on-disk compatibility! Let us know if you run into anything that might prevent that (or any other issues that you run across). -Mark
Re: [zfs-discuss] zpool import - cannot mount [...] directory is not empty
Robert,

Are you sure that nfs-s5-p0/d5110 and nfs-s5-p0/d5111 are mounted following the import? These messages imply that the d5110 and d5111 directories in the top-level filesystem of pool nfs-s5-p0 are not empty. Could you verify that 'df /nfs-s5-p0/d5110' displays nfs-s5-p0/d5110 as the Filesystem (and not just nfs-s5-p0)?

-Mark

Robert Milkowski wrote:

All pools were exported, then I tried to import them one-by-one and got this with only the first pool.

bash-3.00# zpool export nfs-s5-p4 nfs-s5-s5 nfs-s5-s6 nfs-s5-s7 nfs-s5-s8
bash-3.00# zpool import nfs-s5-p4
cannot mount '/nfs-s5-p4/d5139': directory is not empty
cannot mount '/nfs-s5-p4/d5141': directory is not empty
cannot mount '/nfs-s5-p4/d5138': directory is not empty
cannot mount '/nfs-s5-p4/d5142': directory is not empty
bash-3.00# df -h /nfs-s5-p4/d5139
Filesystem        size  used  avail  capacity  Mounted on
nfs-s5-p4/d5139   600G  556G    44G       93%  /nfs-s5-p4/d5139
bash-3.00# zpool export nfs-s5-p4
bash-3.00# ls -l /nfs-s5-p4/d5139
/nfs-s5-p4/d5139: No such file or directory
bash-3.00# ls -l /nfs-s5-p4/
total 0
bash-3.00# zpool import nfs-s5-p4
bash-3.00# uname -a
SunOS XXX 5.11 snv_43 sun4u sparc SUNW,Sun-Fire-V240
bash-3.00#

No problem with other pools - all other pools imported without any warnings.
The same on another server (all pools were exported first):

bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount 'nfs-s5-p0/d5112': mountpoint or dataset is busy
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00# zpool export nfs-s5-p0
bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00# zpool export nfs-s5-p0
bash-3.00# ls -la /nfs-s5-p0/
total 4
drwxr-xr-x   2 root  other   512 Jun 14 14:37 .
drwxr-xr-x  40 root  root   1024 Aug  8 11:00 ..
bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount 'nfs-s5-p0/d5112': mountpoint or dataset is busy
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00#
bash-3.00# uname -a
SunOS X 5.11 snv_39 sun4v sparc SUNW,Sun-Fire-T200
bash-3.00#

All filesystems from that pool were nevertheless mounted. No problem with other pools - all other pools imported without any warnings. All filesystems in a pool have sharenfs set (actually sharenfs is set on the pool and then inherited by the filesystems). Additionally, nfs/server was disabled just before I exported the pools and was automatically started when the first pool was imported. I believe there's already an open bug for this.
Re: [zfs-discuss] Can't remove corrupt file
Eric Lowe wrote:

Eric Schrock wrote:

Well, the fact that it's a level 2 indirect block indicates why it can't simply be removed. We don't know what data it refers to, so we can't free the associated blocks. The panic on move is quite interesting - after BFU give it another shot and file a bug if it still happens.

I'm still seeing the panic (build 42) when trying to 'mv' the file with corrupt indirect blocks. The problem looks like 6424466 and 6440780; the panic string is "data after EOF". Email me offline if you would like to collect the core from my system. - Eric

Yup, this is a duplicate of 6424466 (6440780 is also probably a dup of 6424466). You are seeing this panic on a 'mv' because of some old debug code in dnode_sync() scanning the dnode contents. The "data after EOF" message is bogus; the real problem is your data corruption. Anyway, this is not going to go away until I put back a fix for 6424466. Sorry about that. -Mark
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Luke Lonergan wrote:

Robert,

On 8/8/06 9:11 AM, Robert Milkowski [EMAIL PROTECTED] wrote:

1. UFS, noatime, HW RAID5 6 disks, S10U2: 70MB/s
2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same LUN as in #1): 87MB/s
3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2: 130MB/s
4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44: 133MB/s

Well, the UFS results are miserable, but the ZFS results aren't good either - I'd expect between 250-350MB/s from a 6-disk RAID5 with read() blocksizes from 8kb to 32kb. Most of my ZFS experiments have been with RAID10, but there were some massive improvements to sequential I/O with the fixes I mentioned - I'd expect that this shows they aren't in snv_44.

Those fixes went into snv_45.

-Mark
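Luke's 250-350MB/s expectation is a straightforward stripe-scaling estimate: a 6-disk RAID-5/RAID-Z group has five disks' worth of data per stripe, so sequential reads should approach 5x a single disk. A hedged sketch (the 50-70MB/s per-disk streaming rate is an assumed figure for drives of that era, not from the post):

```python
# Back-of-the-envelope sequential read bandwidth for a 6-disk RAID-5/RAID-Z:
# one disk's worth of each stripe is parity, so data scales across 5 disks.
disks = 6
data_disks = disks - 1                 # one disk equivalent goes to parity
per_disk_mb_s = (50, 70)               # assumed per-disk streaming rate
low, high = (rate * data_disks for rate in per_disk_mb_s)
print(f"expected sequential rate: {low}-{high} MB/s")
```

Against that estimate, the measured 130MB/s for software RAID-Z is well under half of what the spindles should deliver, which is Luke's point.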
Re: [zfs-discuss] snv_46: hangs when using zvol swap and the system is low on free memory?
Jürgen Keil wrote:

I've tried to run "dmake lint" on on-src-20060731, and was running out of swap on my Tecra S1 laptop (32-bit x86, 768MB main memory, with a 512MB swap slice). The "FULL KERNEL: global crosschecks" lint run consumes lots (~800MB) of space in /tmp, so the system was running out of swap space. To fix this I tried to add a 512MB (compressed) zvol device as additional swap space. Now "dmake lint" hangs the OS, sooner or later. In a crash dump, I found:

> ::pgrep lint
S    PID   PPID  PGID  SID  UID       FLAGS      ADDR  NAME
R  10806  10805  7477   12  109  0x42004000  e2d158c8  lint
R  10802  10801  7477   12  109  0x42004000  e2d1b990  lint
> e2d158c8::walk thread | ::findstack -v
stack pointer for thread e3806800: dc39adf8
  dc39ae2c swtch+0x168()
  dc39ae64 turnstile_block+0x6a5(da709288, 0, e39d1110, fec045e0, 0, 0)
  dc39aea4 rw_enter_sleep+0x13b(e39d1110, 0)
  dc39aec8 tmp_write+0x2d(e9fc39c0, dc39af3c, 0, daaa3978, 0)
  dc39af04 fop_write+0x2e(e9fc39c0, dc39af3c, 0, daaa3978, 0)
  dc39af84 write+0x2ac()
  dc39afac sys_sysenter+0x104()
> e2d1b990::walk thread | ::findstack -v
stack pointer for thread e0f11400: d32c47b8
  d32c47e4 swtch+0x168()
  d32c47f4 cv_wait+0x4e(fec1ef42, fec1cf20)
  d32c4820 page_create_throttle+0x123(20, 3)
  d32c488c page_create_va+0x9f(fec20990, da8d1000, 0, 2, 3, d32c48b4)
  d32c48ec segkmem_page_create+0x67(da8d1000, 2, 0, 0)
  d32c4924 segkmem_xalloc+0xa3(da00f690, 0, 2, 0, 0, fe840f08)
  d32c4950 segkmem_alloc+0xa0(da00f690, 2, 0)
  d32c49ec vmem_xalloc+0x405(da01, 2, 1000, 0, 0, 0)
  d32c4a3c vmem_alloc+0x126(da01, 2, 0)
  d32c4a94 kmem_slab_create+0x6e(d3e21030, 0)
  d32c4ac0 kmem_slab_alloc+0x59(d3e21030, 0)
  d32c4af0 kmem_cache_alloc+0x119(d3e21030, 0)
  d32c4b04 zio_buf_alloc+0x1b(2)
  d32c4b40 arc_read+0x332(0, da1f6740, e31f5100, f9a1c2d0, 0, 0)
  d32c4bbc dbuf_prefetch+0x124(d64e1640, 22, 0)
  d32c4bf4 dmu_zfetch_fetch+0x48(d64e1640, 20, 0, 6, 0)
  d32c4c54 dmu_zfetch_dofetch+0x183(d64e179c, ee0ecdb0)
  d32c4ca0 dmu_zfetch_find+0x530(d64e179c, d32c4cc8, 20)
  d32c4d24 dmu_zfetch+0xbf(d64e179c, 18, 0, 2, 0, 20)
  d32c4d5c dbuf_read+0xc9(d89ce4a0, e7d6e500, 32)
  d32c4db4 dmu_buf_hold_array_by_dnode+0x1fe(d64e1640, 18, 0, 2, 0, 1)
  d32c4de4 dmu_buf_hold_array_by_bonus+0x2a(e28686d8, 18, 0, 2, 0, 1)
  d32c4e68 zfs_read+0x17e(ddd27900, d32c4f3c, 0, daaa3978, 0)
  d32c4ea4 fop_read+0x2e(ddd27900, d32c4f3c, 0, daaa3978, 0)
  d32c4f84 read+0x2a1()
  d32c4fac sys_sysenter+0x104()
> freemem/D
freemem:
freemem:  0

arc_read() needs a new buffer and tries to allocate kernel memory with KM_SLEEP. But there is no more free memory, so the allocation sleeps until resources become available. It seems that arc_read() is trying to restore a buffer from the arc ghost cache, and has the arc_buf_hdr_t locked while trying to allocate memory. At the same time, the pageout daemon seems to be stuck in the zfs code, like this:

> ::pgrep pageout | ::walk thread | ::findstack -v
stack pointer for thread d386dc00: d38988a8
  d38988dc swtch+0x168()
  d3898914 turnstile_block+0x6a5(d3da3e90, 0, d3dce0cc, fec03b38, 0, 0)
  d3898974 mutex_vector_enter+0x2dc(d3dce0cc)
  d38989b4 buf_hash_find+0x4d(da1f6740, e5476900, bb305, 0, d38989fc)
  d3898a00 arc_read+0x24(0, da1f6740, e5476900, f9a1c2d0, 0, 0)
  d3898a7c dbuf_prefetch+0x124(e0c9e318, 3832, 0)
  d3898ab4 dmu_zfetch_fetch+0x48(e0c9e318, 3832, 0, 1, 0)
  d3898b14 dmu_zfetch_dofetch+0x183(e0c9e474, d8b46c60)
  d3898b60 dmu_zfetch_find+0x530(e0c9e474, d3898b88, 20)
  d3898be4 dmu_zfetch+0xbf(e0c9e474, 6972000, 0, 2000, 0, 20)
  d3898c10 dbuf_read+0xc9(df5679d8, df43eb00, 22)
  d3898c34 dmu_tx_check_ioerr+0x49(df43eb00, e0c9e318, 0, 34b9, 0)
  d3898c94 dmu_tx_count_write+0x114(d3875c98, 6965000, 0, e000, 0)
  d3898cdc dmu_tx_hold_write+0x52(eef5d5a8, 1, 0, 6965000, 0, e000)
  d3898d5c zvol_strategy+0x184(d972e1e8)
  d3898d78 bdev_strategy+0x4d(d972e1e8)
  d3898d94 spec_startio+0x6e(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898dc0 spec_pageio+0x2a(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898e0c fop_pageio+0x2d(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898e80 tmp_putapage+0x177(e9fc39c0, fdbd63b0, d3898eb8, d3898ef0, 8400, da710e68)
  d3898ef4 tmp_putpage+0x1c6(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
  d3898f3c fop_putpage+0x27(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
  d3898f94 pageout+0x205(0, 0)
  d3898fa4 thread_start+8()

It seems the problem is that arc_read() has part of the buf hash table locked, then goes to sleep inside some kmem_*alloc(...KM_SLEEP) call. When the pageout daemon tries to access some ZFS-backed page that happens to use the same hash chain that is locked by the previous arc_read() call, the system is stuck and I have to power cycle it.

I made more tests with uncompressed zvol devices, too, but the problem basically remains the same. The pageout daemon becomes stuck,
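The hang Jürgen describes is a classic lock-held-across-blocking-allocation deadlock. A toy model in plain Python threads (not ZFS code; the names are just labels for the roles in the stacks above) shows why neither side can make progress:

```python
# Toy model of the hang: "arc_read" takes a hash-chain lock and then blocks
# waiting for memory; the "pageout" thread, the only one that could ever
# free memory, needs that same lock first.
import threading
import time

hash_chain_lock = threading.Lock()
memory_available = threading.Event()     # never set: models KM_SLEEP with freemem == 0

def arc_read():
    # buf_hash_find() keeps the chain locked while zio_buf_alloc() sleeps.
    with hash_chain_lock:
        memory_available.wait(timeout=0.2)   # timeout only so the demo terminates

def pageout():
    # pageout -> arc_read -> buf_hash_find blocks on the same chain lock,
    # so it can never push pages out and replenish freemem.
    got_lock = hash_chain_lock.acquire(timeout=0.1)
    print("pageout got the hash-chain lock:", got_lock)
    if got_lock:
        hash_chain_lock.release()

reader = threading.Thread(target=arc_read)
reader.start()
time.sleep(0.05)                 # let arc_read grab the lock first
writer = threading.Thread(target=pageout)
writer.start()
reader.join()
writer.join()
```

In the real system there is no timeout, so the two threads wait on each other forever and the box has to be power-cycled; the usual kernel fix is to allocate with a no-sleep flag (or drop the lock) before blocking for memory.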
Re: [zfs-discuss] ZFS on 32bit x86
Yup, you're probably running up against the limitations of 32-bit kernel addressability. We are currently very conservative in this environment, and so tend to end up with a small cache as a result. It may be possible to tweak things to get larger cache sizes, but you run the risk of starving out other processes trying to get memory. -Mark

Robert Milkowski wrote:

Hello zfs-discuss,

Simple test: 'ptime find /zfs/filesystem /dev/null' with 2GB RAM. The second, third, etc. time, it still reads a lot from disks while find is running (atime is off). On x64 (Opteron) it doesn't. I guess it's due to the 512MB heap limit in the kernel for its cache. ::memstat shows 469MB for kernel and 1524MB on the freelist. Is there anything that could be done? I guess not, but perhaps...

ps. of course there are a lot of files, ~150K.
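The numbers behind Mark's answer are worth spelling out: a 32-bit kernel shares a 4GB virtual address space with everything else, and the kernel heap gets only a fixed slice of it (Robert's ~512MB figure; the exact size is platform- and release-dependent), so the cache is capped far below physical memory even on a 2GB machine. A quick sketch of the mismatch:

```python
# Why a 32-bit kernel caches so little: the heap is a small fixed slice of
# the 4 GB virtual address space, regardless of how much RAM is installed.
GB, MB = 2**30, 2**20
vaddr_space = 4 * GB
kernel_heap = 512 * MB       # figure from the post; platform-dependent
physmem = 2 * GB             # Robert's machine
print(f"heap is {kernel_heap / vaddr_space:.1%} of the address space, "
      f"{kernel_heap / physmem:.0%} of physical memory")
```

On a 64-bit (Opteron) kernel the heap limit effectively disappears, which is why the same find workload stays cached there.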