Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:

> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
> > Unfortunately the answer is no. Neither L1 nor L2 cache is dedup-aware. The only vendor I know that can do this is NetApp.
>
> And you really work at Oracle? :)
>
> The answer is definitely yes. The ARC caches on-disk blocks, and dedup merely references those blocks; the dedup code is not involved at all when you read. Let me show you with a simple test.
>
> Create a file (dedup is on):
>
> # dd if=/dev/random of=/foo/a bs=1m count=1024
>
> Copy this file so that it is deduped:
>
> # dd if=/foo/a of=/foo/b bs=1m
>
> Export the pool so all cache is dropped, then reimport it:
>
> # zpool export foo
> # zpool import foo
>
> Now let's read one file:
>
> # dd if=/foo/a of=/dev/null bs=1m
> 1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
>
> We read file 'a' and all its blocks are now in cache. The 'b' file shares all the same blocks, so if the ARC caches blocks only once, reading 'b' should be much faster:
>
> # dd if=/foo/b of=/dev/null bs=1m
> 1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>
> Now look at that: 'b' was read 12.5 times faster than 'a', with no disk activity. Magic? :)

Hey all,

That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially) - I'd have thought it should be a lot faster than 12x. Can we really pull data from the cache at only a little over one gigabyte per second when it's dedup data?

Cheers!

Nathan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
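Pawel's point - that the ARC caches a physical block once, no matter how many files reference it - can be sketched with a toy model. Everything below (the class, field names, block sizes) is illustrative, not the real ARC code:

```python
import hashlib

class ToyARC:
    """Toy cache keyed by block checksum, so deduped files that
    point at the same physical blocks share one cached copy."""
    def __init__(self):
        self.cache = {}        # checksum -> block data
        self.disk_reads = 0
        self.cache_hits = 0

    def read_block(self, checksum, fetch_from_disk):
        if checksum in self.cache:
            self.cache_hits += 1
            return self.cache[checksum]
        self.disk_reads += 1
        data = fetch_from_disk()
        self.cache[checksum] = data
        return data

# Two "files" made of the same blocks (file b is a dedup copy of a).
blocks = [bytes([i]) * 128 for i in range(8)]
checksums = [hashlib.sha256(b).hexdigest() for b in blocks]

arc = ToyARC()
# Read file a: every block misses and goes to "disk".
for ck, blk in zip(checksums, blocks):
    arc.read_block(ck, lambda b=blk: b)
# Read file b: identical block pointers, so every read hits the cache.
for ck, blk in zip(checksums, blocks):
    arc.read_block(ck, lambda b=blk: b)

print(arc.disk_reads, arc.cache_hits)  # prints: 8 8
```

This is exactly the behavior the dd test demonstrates: reading 'b' touches the disk zero times because every one of its blocks is already resident from reading 'a'.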
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
2011-12-11 15:10, Nathan Kroenert wrote:

> That reminds me of something I have been wondering about... Why only 12x faster? Can we really only pull stuff from cache at a little over one gigabyte per second if it's dedup data?

I believe there are a couple of things in play. One is that you'd rarely get 100MB/s from a single HDD due to fragmentation, especially the fragmentation inherent to ZFS. But you do mention sequential reading, so that's covered. Besides, from Pawel's dd examples we see that he first read at 98MB/s on average, and then at 1233MB/s.

Another aspect is RAM bandwidth, and we don't know the specs of Pawel's test rig. For example, DDR2-400 (100MHz memory clock, PC2-3200) peaks out at 3200MB/s. That budget would include walking the (cached) DDT tree for each block involved, determining which (cached) data blocks correspond to it, and fetching them from RAM or disk.

I would not be surprised to see some disk IO adding delays in the second case (the read of a deduped file clone), because you still have to determine the references to this second file's blocks, and a separate path of on-disk block pointers might lead to it from a separate inode in a separate dataset (or I might be wrong). Reading this second path of pointers to the same cached data blocks might decrease speed a little.

It would be interesting to see Pawel's test updated with second reads of both files (now that data and metadata are all cached in RAM). It's possible that NOW the reads would be closer to RAM speeds with no disk IO. And I would be very surprised if the two speeds were noticeably different ;)

//Jim
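Jim's back-of-envelope reasoning can be checked against the figures already in the thread. The halving of peak bandwidth to account for a read-plus-copy is an assumption here, not a measurement:

```python
# Byte counts and timings are Pawel's dd figures from the thread.
file_bytes = 1073741824        # 1 GiB test file
cold_secs  = 10.855750         # first read of /foo/a (from disk)
warm_secs  = 0.870501          # read of deduped /foo/b (from ARC)

cold_rate = file_bytes / cold_secs / 1e6    # MB/s from disk
warm_rate = file_bytes / warm_secs / 1e6    # MB/s from cache
speedup   = cold_secs / warm_secs

# DDR2-400 (PC2-3200) peaks at 3200 MB/s; if each cached block is
# read once and copied once on its way out, roughly half of that
# peak is usable (assumption for illustration).
usable = 3200.0 / 2

print(round(cold_rate), round(warm_rate), round(speedup, 1), usable)
# prints: 99 1233 12.5 1600.0
```

On that arithmetic, 1233MB/s observed against ~1600MB/s usable is not mysterious at all: the "only 12x" result is roughly what a DDR2-class memory system plus per-block metadata walks would deliver.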
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Nathan Kroenert

> Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s - I'd have thought it should be a lot faster than 12x. Can we really only pull stuff from cache at only a little over one gigabyte per second if it's dedup data?

Actually, CPUs and memory aren't as fast as you might think. In a system with 12 disks, I've had to write my own dd replacement, because dd if=/dev/zero bs=1024k wasn't fast enough to keep the disks busy. Later, I wanted to do something similar using unique data, and it was simply impossible to generate random data fast enough. I had to tweak my dd replacement to write serial numbers, which still wasn't fast enough, so I tweaked it again to write a big block of static data, followed by a serial number, followed by another big block (always smaller than the disk block, so it would be treated as unique when hitting the pool...).

One typical disk sustains 1Gbit/sec, so in theory 12 of them should be able to sustain 12Gbit/sec. Going by Nathan's email, the memory bandwidth might be 25Gbit/sec, of which you probably need both a read and a write, effectively making it 12.5Gbit/sec. I'm sure the actual bandwidth available varies by system and memory type.
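The static-block-plus-serial-number trick Edward describes can be sketched like this. The function name, sizes, and fill byte are illustrative, not his actual tool:

```python
def unique_chunks(chunk_size=1 << 20, static_size=65536):
    """Generate pseudo-unique data cheaply: mostly a fixed static
    block, with an 8-byte serial number spliced in often enough
    that no two pool-sized records are ever identical."""
    static = b"\xaa" * static_size   # cheap, precomputed filler
    serial = 0
    while True:
        parts = []
        size = 0
        while size < chunk_size:
            parts.append(static)
            parts.append(serial.to_bytes(8, "big"))
            serial += 1
            size += static_size + 8
        yield b"".join(parts)[:chunk_size]

gen = unique_chunks()
a, b = next(gen), next(gen)
print(len(a), a != b)  # prints: 1048576 True
```

The point is that generating a chunk costs only a handful of integer increments and memory copies, versus reading /dev/random (slow) or writing pure /dev/zero (dedups/compresses away); as long as the static run stays smaller than the pool's record size, every record lands with a serial number in it and is treated as unique.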
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
What kind of drives are we talking about? Even SATA drives come in different application classes (desktop, enterprise server, home PVR, surveillance PVR, etc.). Then there are drives with SAS or Fibre Channel interfaces. Then you've got Winchester platters vs. SSDs vs. hybrids. But even before considering that and all the other system factors, throughput for direct-attached storage can vary greatly: not only with the interface type and storage technology, but even small differences in on-drive controller firmware can introduce variances. That's why server manufacturers like HP, Dell, et al. prefer that you replace failed drives with one of theirs instead of something off the shelf - theirs usually carry firmware that's been fine-tuned in house or in conjunction with the drive manufacturer.

On Dec 11, 2011, at 8:25 AM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

> Actually, CPUs and memory aren't as fast as you might think. [...]
[zfs-discuss] zdb leaks checking
Does the zdb leak-checking mechanism also check for the opposite situation? That is, used/referenced blocks lying in free regions of the space maps. Thank you.

--
Andriy Gapon
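The two directions of checking Andriy is asking about can be illustrated with a toy model. This uses plain sets of block offsets; zdb's real traversal walks block pointers and space-map range trees, so this only shows the shape of the check, not the implementation:

```python
def check_consistency(allocated, referenced):
    """Two-way consistency check between a space map and the set of
    blocks reachable from the pool's block pointers (toy model)."""
    leaked  = allocated - referenced   # allocated but never referenced
    phantom = referenced - allocated   # referenced but marked free
    return leaked, phantom

allocated  = {0, 1, 2, 3, 5}   # offsets the space map says are in use
referenced = {0, 1, 2, 4}      # offsets reachable via block pointers

leaked, phantom = check_consistency(allocated, referenced)
print(sorted(leaked), sorted(phantom))  # prints: [3, 5] [4]
```

A classic leak check only reports the first set (space charged but unreachable); Andriy's question is whether zdb also reports the second, far more dangerous set, since a "free" region that is still referenced can be reallocated and overwritten.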
[zfs-discuss] does log device (ZIL) require a mirror setup?
Dear all,

We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool. As log devices are supposed to be read only after a crash or when booting, and those nice things are pretty expensive, I'm wondering whether mirroring the log devices is a must or just highly recommended.

Thomas
Re: [zfs-discuss] does log device (ZIL) require a mirror setup?
I would say it's highly recommended. If you have a pool that needs to be imported and it has a faulted, unmirrored log device, you risk data corruption.

-Matt Breitbach

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
Sent: Sunday, December 11, 2011 1:28 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] does log device (ZIL) require a mirror setup?

> We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool. [...] I'm wondering if mirroring the log devices is a must / highly recommended
Re: [zfs-discuss] does log device (ZIL) require a mirror setup?
Corruption? Or just loss?

On Sun, Dec 11, 2011 at 1:27 PM, Matt Breitbach matth...@flash.shanje.com wrote:

> I would say it's highly recommended. If you have a pool that needs to be imported and it has a faulted, unmirrored log device, you risk data corruption.
Re: [zfs-discuss] does log device (ZIL) require a mirror setup?
Loss of bits, yes - but depending upon how the system is used, corruption _could_ be a possibility. I can envision a scenario where you map an iSCSI LUN to a system that has its own filesystem on top of it (think VMFS or NTFS): when the pool comes back online, parts of the last write commands didn't get written, leaving that filesystem corrupted. Obviously this is an edge case, but I could see it as a possibility. The zpool itself would likely be fine and importable, but the underlying data could be corrupt where other filesystems are layered on top of it.

_____

From: Garrett D'Amore [mailto:garrett.dam...@nexenta.com]
Sent: Sunday, December 11, 2011 10:35 PM
To: Frank Cusack
Cc: Matt Breitbach; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] does log device (ZIL) require a mirror setup?

Loss only.

Sent from my iPhone

On Dec 12, 2011, at 4:00 AM, Frank Cusack fr...@linetwo.net wrote:

> Corruption? Or just loss?