Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:

On 08/01/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.

This idea has occurred to me too - I think there are definite advantages to 'block re-use'. When you start talking about multiple similar zones, I suspect substantial space savings could be made - and if you can re-use that saved storage to provide additional redundancy, everyone would be happy.

Very true; even on normal fileserver usage I have historically found 15-30% file-level duplication. Added to the cheap snapshots and the already existing compression, I think this is a big, big win.

Assumptions: SHA256 hash used (Fletcher2/4 have too many collisions; SHA256 is 2^128 if I remember correctly). The SHA256 hash is taken on the data portion of the block as it exists on disk; the metadata structure is hashed separately. In the current metadata structure, there is a reserved bit portion to be used in the future.

Description of change: Creates: The filesystem goes through its normal process of writing a block and creating the checksum. Before the step where the metadata tree is pushed, the checksum is checked against a global checksum tree to see if there is any match. If a match exists, insert a metadata placeholder for the block that references the already existing block on disk, increment a number_of_links counter in the metadata blocks to keep track of the pointers pointing to this block, and free up the new block that was written and checksummed so it can be used in the future. Else, if there is no match, update the checksum tree with the new checksum and continue as normal.

Unless I'm reading this wrong, this sounds a lot like Plan 9's 'Venti' architecture ( http://cm.bell-labs.com/sys/doc/venti.html ). But using a hash 'label' seems the wrong approach. 
ZFS is supposed to scale to terrifying levels, and the chances of a collision, however small, work against that. I wouldn't want to trade reliability for some extra space.

That issue has already come up in the thread. SHA256 is 2^128 for random collisions, 2^80 for targeted ones. That is pretty darn good, but it would also make sense to perform an rsync-like secondary check on a match using a dissimilar crypto hash. If we hit the very unlikely chance that 2 blocks match both sha256 and whatever other secondary hash, I think that block should be lost (act of god). =) Even with this dual-check approach, the index (and the only hash stored) can still be just the sha256, as the chance of collision is close to nil in this context.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
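The create path described above (hash the block, probe a global checksum tree, bump a reference count on a match, otherwise store the block) can be sketched in a few lines. This is only a toy in-memory model under the thread's assumptions - `DedupStore`, the `number_of_links`-style refcounts, and using MD5 as the dissimilar secondary hash are illustrative choices, not actual ZFS structures:

```python
import hashlib

class DedupStore:
    """Toy model of the proposed block-level dedup write path."""

    def __init__(self):
        self.blocks = {}     # sha256 digest -> block data (the "on disk" blocks)
        self.refcounts = {}  # sha256 digest -> number_of_links

    def write_block(self, data: bytes) -> str:
        primary = hashlib.sha256(data).hexdigest()
        if primary in self.blocks:
            # Paranoia: verify with a dissimilar hash before trusting the
            # match (the rsync-like secondary check discussed above).
            existing = self.blocks[primary]
            if hashlib.md5(existing).digest() == hashlib.md5(data).digest():
                self.refcounts[primary] += 1  # re-use the existing block
                return primary
            # sha256 collision with differing data: the "act of god" case
        # No match: record the checksum and store the block as normal.
        self.blocks[primary] = data
        self.refcounts[primary] = 1
        return primary
```

A real implementation would persist the checksum tree on disk and free the freshly written block on a match; here the duplicate block is simply never stored.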
[zfs-discuss] Re: Re: Adding disk to a RAID-Z?
[i]I think the original poster was thinking that non-enterprise users would be most interested in only having to *purchase* one drive at a time. Enterprise users aren't likely to balk at purchasing 6-10 drives at a time, so for them adding an additional *new* RaidZ to stripe across is easier.[/i]

Yes. I have $xxx to spend on disks and can afford 3. As my needs increase, I'll have saved enough to buy another disk. Traditionally, you RAID your disks together then use a volume manager to divvy the space up into partitions that can grow/shrink as needed. The total size of the RAID isn't important until you've filled it. Then you want to increase the RAID. You could just add new RAID chunks and have a volume on each chunk, but you'd be wasting some of your space. The incremental cost of the added space is the same as the original RAID: 3*n*R5 = 2n usable, 4*n*R5 = 3n. Or, doubling the disks: 6*n*R5 = 5n, vs. 3*n*R5 + 3*n*R5 = 2n + 2n = 4n (6 disks), or 3*n*R5 + 4*n*R5 = 2n + 3n = 5n (7 disks). The cost of scaling/loss of space is balanced against the cost of backup/wipe/re-raid/restore.

This message posted from opensolaris.org
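The `3*n*R5 = 2n` shorthand above is just single-parity arithmetic: a group of D disks of size n yields (D-1)*n usable space. A small sketch of the trade-off (illustrative only):

```python
def raid5_usable(disks: int, disk_size: float) -> float:
    """Usable capacity of a single-parity group (RAID-5 / raidz1):
    one disk's worth of capacity goes to parity."""
    return (disks - 1) * disk_size

n = 1.0  # one unit of disk capacity

one_big   = raid5_usable(6, n)                       # 6*n*R5       -> 5n
two_small = raid5_usable(3, n) + raid5_usable(3, n)  # 2n + 2n      -> 4n (6 disks)
grown     = raid5_usable(3, n) + raid5_usable(4, n)  # 2n + 3n      -> 5n (7 disks)
```

So growing by adding a second small group costs an extra parity disk compared with one wide group of the same spindle count.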
Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hello Kyle,

Wednesday, January 10, 2007, 5:33:12 PM, you wrote:

KM Remember though that it's been mathematically figured that the
KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's

Well, nothing like this was proved, and definitely not mathematically. It's just common-sense advice - for many users, keeping raidz groups below 9 disks should give good enough performance. However, if someone creates a raidz group of 48 disks, he/she probably also expects performance, and in general raid-z won't offer it.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hi Guys,

After reading through the discussion on this regarding ZFS memory fragmentation on snv_53 (and forward) and going through our ::kmastat... it looks like ZFS is sucking down about 544 MB of RAM in the various caches. About 360MB of that is in the zio_buf_65536 cache. Next most notable is 55MB in zio_buf_32768, and 36MB in zio_buf_16384. I don't think that's too bad, but worth keeping track of. At this point our kernel memory growth seems to have slowed, with it hovering around 5GB, and the anon column is mostly what's growing now (as expected... MySQL). Most of the problem in the discussion thread on this seemed to be related to a lot of DNLC entries due to the workload of a file server. How would this affect a database server with operations in only a couple of very large files? Thank you in advance.

Best Regards, Jason

On 1/10/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Sanjeev Robert, Thanks guys. We put that in place last night and it seems to be doing a lot better job of consuming less RAM. We set it to 4GB and each of our 2 MySQL instances on the box to a max of 4GB. So hopefully a slush of 4GB on the Thumper is enough. I would be interested in what the other ZFS modules' memory behaviors are. I'll take a perusal through the archives. In general it seems to me that a max cap for ZFS, whether set through a series of individual tunables or a single root tunable, would be very helpful. Best Regards, Jason

On 1/10/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, Robert is right... The point is ARC is the caching module of ZFS, and the majority of the memory is consumed through ARC. Hence by limiting the c_max of ARC we are limiting the amount ARC consumes. However, other modules of ZFS would consume more, but that may not be as significant as ARC. Experts, please correct me if I am wrong here. Thanks and regards, Sanjeev.

Robert Milkowski wrote: Hello Jason, Tuesday, January 9, 2007, 10:28:12 PM, you wrote: JJWW Hi Sanjeev, JJWW Thank you! 
I was not able to find anything as useful on the subject as JJWW that! We are running build 54 on an X4500. Would I be correct in my JJWW reading of that article that if I put set zfs:zfs_arc_max = JJWW 0x100000000 # 4GB in my /etc/system, ZFS will consume no more than JJWW 4GB? Thank you in advance.

That's the idea; however, it's not working that way now - under some circumstances ZFS could still consume much more memory - see other posts here lately.

-- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel: x27521 +91 80 669 27521
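For reference, the tuning being discussed would look something like this in /etc/system, with the full hex value for 4 GB spelled out. Treat this as a sketch - as noted in the replies, ZFS can still exceed this cap under some circumstances:

```
* /etc/system -- cap the ZFS ARC at 4 GB (Solaris 10 U3 / snv_5x era)
* 4 GB = 0x100000000 bytes; adjust to taste.
set zfs:zfs_arc_max = 0x100000000
```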
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
Robert Milkowski wrote: Hello Kyle, Wednesday, January 10, 2007, 5:33:12 PM, you wrote: KM Remember though that it's been mathematically figured that the KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's Well, nothing like this was proved, and definitely not mathematically. It's just common-sense advice - for many users, keeping raidz groups below 9 disks should give good enough performance. However, if someone creates a raidz group of 48 disks, he/she probably also expects performance, and in general raid-z won't offer it.

It's very possible I misstated something. :) I thought I had read, though, something like: over 9 or so disks would mean that each FS block would be written to less than a single disk block on each disk? Or maybe it was that waiting to read from all drives for files less than an FS block would suffer? Ahhh... I can't remember what the effects were thought to be. I thought there was some theoretical math involved though. I do remember people advising against it - not just on a performance basis, but also on an increased risk of failure basis. I think it was just seen as a good balancing point.

-Kyle
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Kyle,

I think there was a lot of talk about this behavior on the RAIDZ2 vs. RAID-10 thread. My understanding from that discussion was that every write stripes the block across all disks in a RAIDZ/Z2 group, thereby making writing to the group no faster than writing to a single disk. However, reads are much faster, as all the disks are active in the read process. The default config on the X4500 we received recently was RAIDZ groups of 6 disks (across the 6 controllers) striped together into one large zpool.

Best Regards, Jason

On 1/10/07, Kyle McDonald [EMAIL PROTECTED] wrote: It's very possible I misstated something. :) [...]
[zfs-discuss] Re: Why is + not allowed in a ZFS file system name ?
# zpool create 500megpool /home/roland/tmp/500meg.dat
cannot create '500megpool': name must begin with a letter
pool name may have been omitted

huh? ok - no problem if special characters aren't allowed, but why _this_ weird-looking limitation?

This message posted from opensolaris.org
Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hello Jason,

Wednesday, January 10, 2007, 10:54:29 PM, you wrote:

JJWW Hi Kyle, JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs. JJWW RAID-10 thread. My understanding from that discussion was that every JJWW write stripes the block across all disks on a RAIDZ/Z2 group, thereby JJWW making writing the group no faster than writing to a single disk. JJWW However reads are much faster, as all the disk are activated in the JJWW read process.

The opposite, actually. Because of COW, writing (modifying as well) will give you up to N-1 disks' performance for raid-z1 and N-2 disks' performance for raid-z2. However, reading can be slow in the case of many small random reads, as to read each fs block you've got to wait for all data disks in a group.

JJWW The default config on the X4500 we received recently was RAIDZ-groups JJWW of 6 disks (across the 6 controllers) striped together into one large JJWW zpool.

However, the problem with that config is the lack of a hot spare. Of course it depends what you want (and there was no hot-spare support in U2, which is the OS installed at the factory so far).

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re[2]: [zfs-discuss] Limit ZFS Memory Utilization
Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW Sanjeev Robert, JJWW Thanks guys. We put that in place last night and it seems to be doing JJWW a lot better job of consuming less RAM. We set it to 4GB and each of JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush JJWW of 4GB on the Thumper is enough. I would be interested in what the JJWW other ZFS modules memory behaviors are. I'll take a perusal through JJWW the archives. In general it seems to me that a max cap for ZFS whether JJWW set through a series of individual tunables or a single root tunable JJWW would be very helpful.

Yes, it would. Better yet would be if memory consumed by ZFS for caching (dnodes, vnodes, data, ...) behaved similarly to the page cache as with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed. I guess (and it's really a guess only, based on some emails here) that in a worst-case scenario ZFS caches would consume about: arc_max + 3*arc_max + memory lost to fragmentation. So I guess with arc_max set to 1GB you can lose even 5GB (or more), and currently only that first 1GB can be reclaimed automatically.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
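Robert's guessed worst case can be written out as a tiny helper. The 3x factor and the fragmentation term are his estimates from this thread, not documented constants:

```python
def zfs_cache_worst_case(arc_max_gb: float, frag_gb: float = 1.0) -> float:
    """Back-of-envelope worst case per the thread (a guess, not a spec):
    the ARC data itself, plus roughly 3x that in related in-core
    structures (dnodes/vnodes/znodes), plus memory lost to kmem slab
    fragmentation."""
    return arc_max_gb + 3 * arc_max_gb + frag_gb
```

With arc_max at 1GB and ~1GB lost to fragmentation this gives the ~5GB figure mentioned above; of that, only the first 1GB (the ARC itself) is reclaimed automatically today.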
Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
[EMAIL PROTECTED] wrote on 01/10/2007 05:16:33 PM: However, the problem with that config is the lack of a hot spare. Of course it depends what you want (and there was no hot-spare support in U2, which is the OS installed at the factory so far). [...]

Yeah, this kinda ticked me off. First thing I noticed is that the Thumper that was on back order for 3 months waiting for U3 fixes was shipped with U2 + patches. Called support to try to track down whether the U3 base was installable with/without patches, and spent 3 days of off-and-on calling to get to someone who could find the info (Sun's internal documentation was locked down and unpublished to support at the time). 5 out of 6 support engineers I talked to did not even realize that U3 was released (three weeks after the fact). It also took 4 (long) calls to clarify that it did in fact need 220V power (at the time I ordered, it was listed as 110V, and it shipped with 110V-rated cables). Long story short, I wiped and reinstalled with U3 and raidz2 with hot spares like it should have had in the first place. 
-Wade
Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Robert,

I read the following section from http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating random writes to a RAID-Z had the performance of a single disk regardless of the group size: Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will globally act as a 200-IOPS capable RAID-Z group.

Best Regards, Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: The opposite, actually. Because of COW, writing (modifying as well) will give you up to N-1 disks' performance for raid-z1 and N-2 disks' performance for raid-z2. [...]
Re[4]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hello Jason,

Thursday, January 11, 2007, 12:46:32 AM, you wrote:

JJWW Hi Robert, JJWW I read the following section from JJWW http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating JJWW random writes to a RAID-Z had the performance of a single disk JJWW regardless of the group size: Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will globally act as a 200-IOPS capable RAID-Z group.

Random *input* IOPS means random reads, not writes.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re[4]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hello Wade,

Thursday, January 11, 2007, 12:30:40 AM, you wrote:

WSfc Long story short, I wiped and reinstalled with U3 and raidz2 with
WSfc hot spares like it should have had in the first place.

The same here. Besides, I always install my own system rather than using the preinstalled one - except that when x4500s arrive I run a small script (dd + scrubbing) for 2-3 days to see if everything works fine before putting them into production. Then I re-install.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
It's just common-sense advice - for many users, keeping raidz groups below 9 disks should give good enough performance. However, if someone creates a raidz group of 48 disks, he/she probably also expects performance, and in general raid-z won't offer it.

There is at least one reason for wanting more drives in the same raidz/raid5/etc: redundancy. Suppose you have 18 drives. Having two raidz:s consisting of 9 drives each means you are more likely to fail than having a single raidz2 consisting of 18 drives, since in the former case, yes - two drives can go down, but only if they are the *right* two drives. In the latter case any two drives can go down.

The ZFS administration guide mentions this recommendation, but does not give any hint as to why. A reader may assume/believe it's just general advice, based on someone's opinion that with more than 9 drives the statistical probability of failure is too high for raidz (or raid5). It's a shame the statement in the guide is not further qualified to actually explain that there is a concrete issue at play. (I haven't looked into the archives to find the previously mentioned discussion.)

-- / Peter Schuller, InfiDyne Technologies HB PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
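Peter's point can be made concrete with a little combinatorics: given 18 drives and exactly two simultaneous failures, two 9-disk raidz1 groups lose data whenever both failures land in the same group, while a single 18-disk raidz2 survives any pair. A sketch (an illustrative model only - it ignores rebuild windows and correlated failures):

```python
from math import comb

TOTAL = 18  # drives in the pool

# Two raidz1 groups of 9: the pool is lost only if both failed drives
# land in the *same* 9-disk group (either group may be the unlucky one).
p_two_raidz1 = 2 * comb(9, 2) / comb(TOTAL, 2)   # = 72/153, about 47%

# One raidz2 group of 18: any pair of failed drives is survivable.
p_one_raidz2 = 0.0
```

So for the same 2-disks-of-parity budget, the wide raidz2 tolerates any double failure, whereas the split raidz1 layout loses data for roughly 47% of the possible failure pairs.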
Re: Re[4]: [zfs-discuss] Limit ZFS Memory Utilization
Hi Robert,

We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving... which this server is not. For speed, the X4500 is showing to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. If MySQL Cluster had full disk-based support, scale-out with X4500s a la Greenplum would be a terrific solution. At this point, the ZFS memory gobbling is the main roadblock to its being a good database platform. Regarding the paging activity, we too saw tremendous paging, with up to 24% of the X4500's CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%.

Best Regards, Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Thursday, January 11, 2007, 12:36:46 AM, you wrote: JJWW Hi Robert, JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a JJWW calculation my 4GB arc_max setting is still in the danger zone on a JJWW Thumper. I wonder if any of the ZFS developers could shed some light JJWW on the calculation? JJWW That kind of memory loss makes ZFS almost unusable for a database system. If you leave ncsize at the default value then I believe it won't consume that much memory. JJWW I agree that a page cache similar to UFS would be much better. Linux JJWW works similarly to free pages, and it has been effective enough in the JJWW past. Though I'm equally unhappy about Linux's tendency to grab every JJWW bit of free RAM available for filesystem caching, and then cause JJWW massive memory thrashing as it frees it for applications. A page cache won't be better - just better memory control for ZFS caches is strongly desired. Unfortunately, from time to time ZFS makes servers page enormously :(

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Re: Why is + not allowed in a ZFS file system name ?
On 10-Jan-07, at 5:29 PM, roland wrote: # zpool create 500megpool /home/roland/tmp/500meg.dat cannot create '500megpool': name must begin with a letter pool name may have been omitted huh? ok - no problem if special characters aren't allowed, but why _this_ weird-looking limitation?

Potential for confusion with numbers (especially since alphabetic units are often suffixed).

--T
Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hello Peter,

Thursday, January 11, 2007, 1:08:38 AM, you wrote:

PS There is at least one reason for wanting more drives in the same
PS raidz/raid5/etc: redundancy.
PS The ZFS administration guide mentions this recommendation, but does not give
PS any hint as to why. [...]

I don't know if ZFS man pages should teach people about RAID. If somebody doesn't understand RAID basics, then perhaps what's needed is some kind of tool where you just specify a pool of disks and choose from: space-efficient, performance, or non-redundant - and that's it; all the rest is hidden.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re[6]: [zfs-discuss] Limit ZFS Memory Utilization
Hello Jason,

Thursday, January 11, 2007, 1:10:10 AM, you wrote:

JJWW Hi Robert, JJWW We've got the default ncsize. I didn't see any advantage to increasing JJWW it outside of NFS serving... which this server is not. For speed the JJWW X4500 is showing to be a killer MySQL platform. Between the blazing JJWW fast procs and the sheer number of spindles, its performance is

Have you got any numbers you can share?

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Limit ZFS Memory Utilization
Jason J. W. Williams wrote: Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation?

In a worst-case scenario, Robert's calculations are accurate to a certain degree: If you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable; it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation.

That kind of memory loss makes ZFS almost unusable for a database system.

Note that you are not going to experience these sorts of overheads unless you are accessing *many* files. In a database system, there are only going to be a few files, so no significant overhead.

I agree that a page cache similar to UFS would be much better. Linux works similarly to free pages, and it has been effective enough in the past. Though I'm equally unhappy about Linux's tendency to grab every bit of free RAM available for filesystem caching, and then cause massive memory thrashing as it frees it for applications.

The page cache is much better in the respect that it is more tightly integrated with the VM system, so you get more efficient response to memory pressure. 
It is *much worse* than the ARC at caching data for a file system. In the long term, we plan to integrate the ARC into the Solaris VM system.

Best Regards, Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Yes, it would. Better yet would be if memory consumed by ZFS for caching (dnodes, vnodes, data, ...) behaved similarly to the page cache as with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed. [...]
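The dnode arithmetic in the reply above (1GB of dnode_phys data in the ARC implying roughly another 3GB of in-core structures) works out as follows; the file count and the 3x factor are the thread's estimates, not fixed constants:

```python
GB = 1 << 30

arc_dnode_data = 1 * GB       # dnode_phys data held in the ARC
files_referenced = 1_200_000  # roughly, per the post above

# In-core vnodes/znodes/dnode_t structures associated with those files;
# evictable, but only reclaimed after ARC eviction plus a kmem_reap.
related = 3 * arc_dnode_data

total = arc_dnode_data + related  # ~4 GB tied up for metadata alone
```

This is why the overhead matters for file servers touching millions of files but is negligible for a database touching a handful of large files.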
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hey guys,

Due to long URL lookups, the DNLC was pushed to variable-sized entries. The hit rate was dropping because of name-too-long misses. This was done long ago while I was at Sun, under a bug reported by me. I don't know your usage, but you should attempt to estimate the amount of mem used with the default size. Yes, this is after you start tracking your DNLC hit rate and make sure it doesn't significantly drop if ncsize is decreased. You also may wish to increase the size and again check the hit rate. Yes, it is possible that your access is random enough that no changes will affect the hit rate.

2nd item: Bonwick's mem allocators, I think, still have the ability to limit the size of each slab. The issue is that some parts of the code expect no mem failures with SLEEPs. This can result in extended SLEEPs, but it can be done. If your company generates changes to your local source and then rebuilds, it is possible to pre-allocate a fixed number of objects per cache and then use NOSLEEPs with return values that indicate retry or failure.

3rd, and could be the most important: the mem cache allocators are lazy in freeing memory when it is not needed by anyone else. Thus, unfreed memory is effectively used as a cache to remove the latencies of on-demand memory allocations. This artificially keeps memory usage high, but should have minimal latencies to realloc when necessary. Also, it is possible to make mods to increase the level of mem garbage collection after some watermark code is added, to minimize repeated allocs and frees.

Mitchell Erblich

Jason J. W. Williams wrote: Hi Robert, We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving... which this server is not. For speed the X4500 is showing to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. 
If MySQL Cluster had full disk-based support, scale-out with X4500s a la Greenplum would be a terrific solution. At this point, the ZFS memory gobbling is the main roadblock to being a good database platform. [...]
Re: [zfs-discuss] Adding disk to a RAID-Z?
Hello Kyle, Wednesday, January 10, 2007, 5:33:12 PM, you wrote:

KM Remember though that it's been mathematically figured that the
KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's

Well, nothing like this was proved, and definitely not mathematically. It's just common-sense advice - for many users, keeping raidz groups below 9 disks should give good enough performance. However, if someone creates a raidz group of 48 disks, he/she probably also expects performance, and in general raid-z won't offer it.

Wow, lots of good discussion here. I started the idea of allowing a RAID-Z group to grow to an arbitrary number of drives because I was unaware of the downsides to massive pools. From my RAID5 experience, a perfect world would be large numbers of data spindles and a sufficient number of parity spindles, e.g. 99+17 (99 data drives and 17 parity drives). In RAID5 this would give massive IOPS and redundancy.

After studying the code and reading the blogs, a few things have jumped out, with some interesting (and sometimes goofy) implications. Since I am still learning, I could be wrong on any of the following.

RAID-Z pools operate with a storage granularity of one stripe. If you request a read of a block within the stripe, you get the whole stripe. If you modify a block within the stripe, the whole stripe is written to a different location (a la COW). This implies that ANY read requires the whole stripe, and therefore all spindles must seek and read a sector. All drives return their sectors (mostly) simultaneously. For performance purposes, a RAID-Z group seeks like a single drive would and has the throughput of multiple drives. Unlike traditional RAID5, adding more spindles does NOT increase read IOPS.

Another implication is that ZFS checksums the stripe, not the component sectors. If a drive silently returns a bad sector, all ZFS knows is that the whole stripe is bad (which could probably also be inferred from a bogus parity sector).
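The IOPS point above - a raid-z group delivering roughly one disk's worth of random-read IOPS regardless of width - is why narrow groups are advised. A back-of-the-envelope sketch; the 150 IOPS/disk figure is an assumed example, not a measured number:

```python
def pool_random_read_iops(total_disks: int, group_width: int,
                          per_disk_iops: int) -> int:
    """Rough random-read IOPS for a pool built from raid-z groups.

    Every block read in a raid-z group touches all of its spindles,
    so each group contributes roughly one disk's worth of random-read
    IOPS, no matter how wide it is.
    """
    groups = total_disks // group_width
    return groups * per_disk_iops

# 48 disks, assuming ~150 random-read IOPS per spindle:
one_wide_group = pool_random_read_iops(48, 48, 150)  # one group: ~150 IOPS
six_groups = pool_random_read_iops(48, 8, 150)       # six groups: ~900 IOPS
```

The same 48 disks arranged as six 8-wide groups give roughly six times the random-read IOPS of a single 48-wide group, at the cost of more disks spent on parity.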
ZFS has no clue which drive produced bad data, only that the whole stripe failed the checksum. ZFS finds the offending sector by process of elimination: going through the sectors one at a time, throwing away the data actually read, reconstructing that data from parity, and then checking whether the stripe passes the checksum. Two parity drives make this a bigger problem still, almost squaring the number of combinations to try. In general, if a stripe has enough parity drives, the cost of locating N bad data sectors in a stripe is roughly O(k^N), where k is some constant.

Another implication is that there is no RAID5 write penalty. More accurately, the write penalty is incurred during the read operation, where an entire stripe is read.

Finally, there is no need to rotate parity. Rotating parity was introduced in RAID5 because every write of a single sector in a stripe also necessitated the read and subsequent write of the parity sector. Since there are no partial stripe writes in ZFS, there is no need to read and then rewrite the parity sector.

For those in the know, where am I off base here? Thanks! Marty
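The combinatorial blow-up Marty describes can be made concrete by counting candidate "assumed bad" sets. A sketch under the stated model (each candidate set is reconstructed from parity and the stripe re-checksummed); the numbers are illustrative, not from the ZFS source:

```python
from math import comb

def reconstruction_trials(stripe_width: int, max_bad: int) -> int:
    """Upper bound on rebuild-and-rechecksum trials when locating up
    to max_bad silently-corrupt sectors by elimination: each subset of
    suspect drives (size 1 .. max_bad) is one trial."""
    return sum(comb(stripe_width, n) for n in range(1, max_bad + 1))

# A 10-wide stripe: single parity vs. double parity.
reconstruction_trials(10, 1)  # 10 trials
reconstruction_trials(10, 2)  # 55 trials
```

Going from one to two tolerated bad sectors takes the trial count from C(10,1) = 10 to C(10,1) + C(10,2) = 55, which matches the "almost squaring" intuition above.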
Re: [zfs-discuss] Limit ZFS Memory Utilization
On Wed, 10 Jan 2007, Mark Maybee wrote: Jason J. W. Williams wrote:

Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation?

In a worst-case scenario, Robert's calculations are accurate to a certain degree: if you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable; it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation.

snip.

Understood (and many thanks). In the meantime, is there a rule of thumb that you could share that would allow mere humans (like me) to calculate the best values of zfs:zfs_arc_max and ncsize, given that the machine has n GB of RAM and is used in the following broad workload scenarios:

a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single-user Gnome desktop running U3 with home dirs on a ZFS filesystem

It would seem, from reading between the lines of previous emails, particularly the ones you've (Mark M) written, that there is a rule of thumb that would apply given a standard or modified ncsize tunable?
I'm primarily interested in a calculation that would allow settings that reduce the possibility of the machine descending into swap hell.

PS: Interestingly, no one has mentioned the tunable maxpgio. I've often found that increasing maxpgio is the only way to improve the odds of a machine remaining usable when lots of swapping is taking place.

Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
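Mark's worst-case figure (1 GB of dnode_phys in the ARC dragging along another ~3 GB of vnode/znode/dnode structures) can be turned into a rough estimator. A sketch only; the 1:3 ratio comes from that single example, not from a documented formula:

```python
def zfs_metadata_memory(dnode_phys_bytes: int) -> float:
    """Worst-case memory tied down by cached file metadata: the
    dnode_phys data held in the ARC plus the related in-core
    structures, assumed here to be ~3x that amount."""
    RELATED_RATIO = 3.0  # assumed from the 1 GB -> +3 GB figure above
    return dnode_phys_bytes * (1 + RELATED_RATIO)

gib = 1 << 30
zfs_metadata_memory(gib) / gib  # ~4.0 GB total in the worst case
```

By this reading, an arc_max of 4 GB could, in the metadata-heavy worst case, pin roughly 4x the dnode_phys portion of the cache, which is why Jason's setting may still be "in the danger zone" on a RAM-constrained box.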
[zfs-discuss] ZFS entry in /etc/vfstab
Hi, why would I ever need to specify ZFS mount(s) in /etc/vfstab at all? I see in some documents that ZFS can be defined in /etc/vfstab with fstype zfs. Thanks.
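For what it's worth, a vfstab entry only applies to a ZFS filesystem whose mountpoint property is set to legacy; otherwise ZFS mounts its datasets itself at pool import. A sketch of such an entry (the dataset name tank/home and mount point are made-up examples):

```
#device     device to fsck  mount point   FS type  fsck pass  mount at boot  mount options
tank/home   -               /export/home  zfs      -          yes            -
```

The fsck-related fields are "-" because ZFS has no fsck pass.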