Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-20 Thread Robert Milkowski

On 20/07/2010 04:41, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Richard L. Hamilton

I would imagine that if it's read-mostly, it's a win, but
otherwise it costs more than it saves.  Even more conventional
compression tends to be more resource intensive than decompression...
 

I would imagine it's *easier* to have a win when it's read-mostly, but the
expense of computing checksums is going to be paid either way, with or
without dedup.  The only extra cost dedup adds is maintaining a hash tree of
some kind, to see if some block has already been stored on disk.  So ... of
course I'm speaking hypothetically and haven't proven it ... I think dedup
will accelerate the system in nearly all use cases.

The main exception is whenever you have highly non-duplicated data.  I think
the CPU cost of dedup is tiny, but in the case of highly non-duplicated
data, even that little expense is a waste.

   


Please note that by default ZFS uses fletcher4 checksums, but dedup
currently allows only sha256, which is more CPU intensive. Also, from a
performance point of view, there will be a sudden drop in write performance
the moment the DDT no longer fits entirely in memory. L2ARC could mitigate
the impact, though.
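
As a rough way to see that checksum-cost difference, here is a toy
micro-benchmark (not a measurement of ZFS itself; fletcher4 has no Python
equivalent, so zlib.adler32 stands in as a comparably cheap checksum):

import hashlib
import os
import time
import zlib

BLOCK = os.urandom(128 * 1024)   # one default-sized 128K block
ITERATIONS = 2_000

def bench(label, fn):
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        fn(BLOCK)
    elapsed = time.perf_counter() - start
    mb = ITERATIONS * len(BLOCK) / (1024 * 1024)
    print("%-8s %8.0f MB/s" % (label, mb / elapsed))

bench("adler32", zlib.adler32)                         # cheap checksum, fletcher4 stand-in
bench("sha256", lambda b: hashlib.sha256(b).digest())  # the checksum dedup requires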


Then there will be less memory available for data caching due to the extra
memory requirements of the DDT.
(However, please note that IIRC the DDT is treated as metadata, and by
default the metadata cache is limited to no more than 20% of the ARC - there
is a bug open for that; I haven't checked whether it's been fixed yet or not.)



What I'm wondering is when dedup is a better value than compression.
 

Whenever files have internal repetition, compression will be better.
Whenever the repetition crosses file boundaries, dedup will be better.

   


Not necessarily. Compression in ZFS works only within the scope of a single
fs block. So, for example, if you have a large file with most of its blocks
identical, dedup should shrink the file much better than compression would.
Also please note that you can use both compression and dedup at the same
time.
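
A toy illustration of that scope difference (assumed block size and data;
nothing to do with how ZFS actually lays data out): per-block compression
cannot see repetition across blocks, while a dedup-style pass keyed on
whole-block hashes can.

import hashlib
import os
import zlib

BLOCK_SIZE = 128 * 1024
block = os.urandom(BLOCK_SIZE)    # incompressible within the block...
blocks = [block] * 64             # ...but repeated 64 times across the "file"

# Per-block compression: each block is squeezed in isolation, so the
# cross-block repetition is invisible to it.
compressed = sum(len(zlib.compress(b)) for b in blocks)

# Dedup-style pass: keep each distinct block once, keyed by its hash.
unique = {hashlib.sha256(b).digest(): b for b in blocks}
deduped = sum(len(b) for b in unique.values())

logical = sum(len(b) for b in blocks)
print("logical    : %9d bytes" % logical)
print("compressed : %9d bytes  (incompressible data doesn't shrink)" % compressed)
print("deduped    : %9d bytes  (64 identical blocks -> 1)" % deduped)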



--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-19 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Richard L. Hamilton
 
 I would imagine that if it's read-mostly, it's a win, but
 otherwise it costs more than it saves.  Even more conventional
 compression tends to be more resource intensive than decompression...

I would imagine it's *easier* to have a win when it's read-mostly, but the
expense of computing checksums is going to be paid either way, with or
without dedup.  The only extra cost dedup adds is maintaining a hash tree of
some kind, to see if some block has already been stored on disk.  So ... of
course I'm speaking hypothetically and haven't proven it ... I think dedup
will accelerate the system in nearly all use cases.

The main exception is whenever you have highly non-duplicated data.  I think
the CPU cost of dedup is tiny, but in the case of highly non-duplicated
data, even that little expense is a waste.


 What I'm wondering is when dedup is a better value than compression.

Whenever files have internal repetition, compression will be better.
Whenever the repetition crosses file boundaries, dedup will be better.


 Most obviously, when there are a lot of identical blocks across
 different
 files; but I'm not sure how often that happens, aside from maybe
 blocks of zeros (which may well be sparse anyway).

I think the main value here is when there are more than one copy of some
files in the filesystem.  For example:

In subversion, there are two copies of every file in your working directory.
Every file has a corresponding base copy located in the .svn directory.

If you have a lot of developers ... software or whatever ... who have all
checked out the same project, and they're all working on it in their home
directories ...  All of those copies get essentially cut down to 1.

Combine the developers with subversion, and you have 2x copies of every
file in every person's home dir ... a lot of copies of the same files, all
cut down to 1.

You build some package from source code.  somefile.c becomes somefile.o, and
then the linker takes somefile.o and a bunch of other .o files and mashes
them all together to make the finalproduct executable file.  Well, that
executable is just copies of all these .o files mashed together.  So again
... cut it all down to 1.  And multiply by the number of developers who are
all doing the same thing in their home dirs.

Others have mentioned VMs, when VMs are duplicated ... I don't personally
duplicate many VMs, so it doesn't matter to me ... but I can see the value
for others ...





Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-18 Thread Richard L. Hamilton
 Even the most expensive decompression algorithms generally run
 significantly faster than I/O to disk -- at least when real disks are
 involved.  So, as long as you don't run out of CPU and have to wait for CPU
 to be available for decompression, the decompression will win.  The same
 concept is true for dedup, although I don't necessarily think of dedup as a
 form of compression (others might reasonably do so though.)

Effectively, dedup is a form of compression of the filesystem rather than of
any single file, but one oriented toward not interfering with access to
whatever may be sharing blocks.

I would imagine that if it's read-mostly, it's a win, but
otherwise it costs more than it saves.  Even more conventional
compression tends to be more resource intensive than decompression...

What I'm wondering is when dedup is a better value than compression.
Most obviously, when there are a lot of identical blocks across different
files; but I'm not sure how often that happens, aside from maybe
blocks of zeros (which may well be sparse anyway).


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-18 Thread Erik Trimble

On 7/18/2010 4:18 PM, Richard L. Hamilton wrote:

Even the most expensive decompression algorithms generally run
significantly faster than I/O to disk -- at least when real disks are
involved.  So, as long as you don't run out of CPU and have to wait for CPU
to be available for decompression, the decompression will win.  The same
concept is true for dedup, although I don't necessarily think of dedup as a
form of compression (others might reasonably do so though.)
 

Effectively, dedup is a form of compression of the filesystem rather than of
any single file, but one oriented toward not interfering with access to
whatever may be sharing blocks.

I would imagine that if it's read-mostly, it's a win, but
otherwise it costs more than it saves.  Even more conventional
compression tends to be more resource intensive than decompression...

What I'm wondering is when dedup is a better value than compression.
Most obviously, when there are a lot of identical blocks across different
files; but I'm not sure how often that happens, aside from maybe
blocks of zeros (which may well be sparse anyway).
   


From my own experience, a dedup win is much more dependent on the data in
use than a compression win is.


Compression seems to be of general use across the vast majority of data 
I've encountered - with the sole big exception of media file servers 
(where the data is already compressed pictures, audio, or video).  It 
seems to be of general utility, since I've always got spare CPU cycles, 
and it's really not very expensive in terms of CPU in most cases. Of 
course, the *value* of compression varies according to the data (i.e. 
how much it will compress), but that doesn't matter for *utility* for 
the most part.


Dedup, on the other hand, currently has a very steep price in terms of
needed ARC/L2ARC/RAM, so it's much harder to justify in those cases
where it only provides modest benefits. Additionally, we're still in the
development stage of dedup (IMHO), so I can't really make a full
evaluation of the dedup concept, as many of its issues today are
implementation-related, not concept-related.  All that said, dedup has
a showcase use case where it is of *massive* benefit: hosting virtual
machines.  For a machine hosting only 100 VM data stores, I can see 99%
space savings.  And I see a significant performance boost, since I can
cache that one VM image in RAM easily.  There are other places where
dedup seems modestly useful these days (one is the afore-mentioned
media-file server, where you'd be surprised how much duplication there
is), but it's *much* harder to pre-determine dedup's utility for a given
dataset, unless you have highly detailed knowledge of that dataset's
composition.


I'll admit to not being a big fan of the Dedup concept originally (go 
back a couple of years here on this list), but, given that the world is 
marching straight to Virtualization as fast as we can go, I'm a convert 
now.



From my perspective, here are a couple of things that I think would help
improve dedup's utility for me:


(a) fix the outstanding issues in the current implementation (duh!).

(b) add the ability to store the entire DDT in the backing store, and 
not have to construct it in ARC from disk-resident info (this would be 
of great help where backing store = SSD or RAM-based things)


(c) be able to test-dedup a given filesystem.  I'd like ZFS to be able 
to look at a filesystem and tell me how much dedup I'd get out of it, 
WITHOUT having to actually create a dedup-enabled filesystem and copy 
the data to it.  While it would be nice to be able to simply turn on 
dedup for a filesystem, and have ZFS dedup the existing data there 
(in-place, without copying), I realize the implementation is hard given 
how things currently work, and frankly, that's of much lower priority 
for me than being able to test-dedup a dataset.


(d) increase the slab (record) size significantly, to at least 1MB or
more. I daresay the primary way VM images are stored these days is as
single, large files (though iSCSI volumes are coming up fast), and as
such, I've got 20G files which would really, really benefit from having
a much larger slab size.


(e) and, of course, seeing if there's some way we can cut down on 
dedup's piggy DDT size.  :-)



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-18 Thread Garrett D'Amore
On Sun, 2010-07-18 at 16:18 -0700, Richard L. Hamilton wrote:

 
 I would imagine that if it's read-mostly, it's a win, but
 otherwise it costs more than it saves.  Even more conventional
 compression tends to be more resource intensive than decompression...
 
 What I'm wondering is when dedup is a better value than compression.
 Most obviously, when there are a lot of identical blocks across different
 files; but I'm not sure how often that happens, aside from maybe
 blocks of zeros (which may well be sparse anyway).

Shared/identical blocks come into play in several specific scenarios:

1) Multiple VMs, cloud.  If you have multiple guest OSes installed,
they're going to benefit heavily from dedup.  Even Zones can benefit
here.

2) Situations with lots of copies of large amounts of data where only
some of the data is different between each copy.  The classic example is
a Solaris build server, hosting dozens, or even hundreds, of copies of
the Solaris tree, each being worked on by different developers.
Typically the developer is working on something less than 1% of the
total source code, so the other 99% can be shared via dedup.

For general purpose usage, e.g. hosting your music or movie collection,
I doubt that dedup offers any real advantage.  If I were talking about
deploying dedup, I'd only use it in situations like the two I mentioned,
and not for just a general purpose storage server.  For general purpose
applications I think compression is better.  (Though I think dedup will
have higher savings -- significantly so -- in the particular situation
where you know you have lots and lots of duplicate/redundant data.)

Note also that with dedup your duplicated data may actually gain an
effective increase in redundancy/security, because ZFS makes sure that the
data that is deduped has higher redundancy than non-deduped data.  (This
sounds counterintuitive, but as long as you have at least 3 copies of the
duplicated data, it's a net win.)

Btw, compression on top of dedup may actually kill your benefit from
dedup.  My hypothesis (unproven, admittedly) is that because many
compression algos cause small permutations of the data to significantly
change the bit values in the overall compressed object (even just by
changing their offset in the binary), they can seriously defeat dedup's
efficacy.
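
One cheap way to poke at that hypothesis, as a toy sketch with made-up sizes
and zlib standing in for whatever compressor produced the files: change a
single byte early in a file and compare block-level overlap before and after
compression.

import hashlib
import os
import zlib

BLOCK_SIZE = 4 * 1024

def block_hashes(data):
    return {hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)}

def shared(a, b):
    ha, hb = block_hashes(a), block_hashes(b)
    return len(ha & hb), len(ha)

# ~800 KB of varied but compressible data, then the same data with one byte
# flipped near the start.
original = b"".join(b"record %08d lorem ipsum dolor sit amet\n" % i
                    for i in range(20000))
modified = bytearray(original)
modified[100] ^= 0xFF
modified = bytes(modified)

# Uncompressed: only the block containing the change differs.
print("uncompressed, shared blocks: %d of %d" % shared(original, modified))
# Compressed: the one-byte change ripples through the rest of the stream,
# so typically almost no compressed blocks line up any more.
print("compressed,   shared blocks: %d of %d" % shared(zlib.compress(original),
                                                       zlib.compress(modified)))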

- Garrett



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Brandon High
On Fri, Jul 9, 2010 at 5:18 PM, Brandon High bh...@freaks.com wrote:

 I think that DDT entries are a little bigger than what you're using. The
 size seems to range between 150 and 250 bytes depending on how it's
 calculated, call it 200b each. Your 128G dataset would require closer to
 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of unique
 data would require 600M - 1000M for the DDT.


Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB
for 1TB of unique data.

A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the
DDT. Ouch.
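
In back-of-the-envelope form (the entry sizes are just the estimates used in
this thread, roughly 200 bytes as the earlier SWAG and 376 bytes per the
sizeof(struct ddt_entry) figure reported further down, with the data assumed
fully unique):

GiB = 1024 ** 3
TiB = 1024 ** 4

def ddt_bytes(unique_data, block_size, entry_size):
    # One DDT entry per unique block.
    return (unique_data // block_size) * entry_size

cases = [("128 GiB, 128K blocks", 128 * GiB, 128 * 1024),
         ("  1 TiB, 128K blocks",   1 * TiB, 128 * 1024),
         ("  1 TiB,   8K zvol  ",   1 * TiB,   8 * 1024)]

for desc, data, bs in cases:
    low  = ddt_bytes(data, bs, 200) / GiB
    high = ddt_bytes(data, bs, 376) / GiB
    print("%s : %5.2f to %5.2f GiB of DDT" % (desc, low, high))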

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Richard Elling
On Jul 9, 2010, at 11:10 PM, Brandon High wrote:

 On Fri, Jul 9, 2010 at 5:18 PM, Brandon High bh...@freaks.com wrote:
 I think that DDT entries are a little bigger than what you're using. The size 
 seems to range between 150 and 250 bytes depending on how it's calculated, 
 call it 200b each. Your 128G dataset would require closer to 200M (+/- 25%) 
 for the DDT if your data was completely unique. 1TB of unique data would 
 require 600M - 1000M for the DDT.
 
 Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB 
 for 1TB of unique data.

4% seems to be a pretty good SWAG.

 A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the 
 DDT. Ouch.

... or more than 300GB for 512-byte records.

The performance issue is that DDT access tends to be random. This implies that
if you don't have a lot of RAM and your pool has poor random read I/O
performance, then you will not be impressed with dedup performance. In other
words, trying to dedup lots of data on a small DRAM machine using big, slow
pool HDDs will not set any benchmark records. By contrast, using SSDs for the
pool can demonstrate good random read performance. As the price per bit of
HDDs continues to drop, the value of deduping pools using HDDs also drops.
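
The arithmetic behind that, in very crude form (the latency figures below are
assumed ballpark numbers, not measurements, and the model pretends each DDT
miss costs one serialized random read from the pool):

def dedup_write_ceiling(ddt_hit_ratio, random_read_ms):
    # Writes/sec sustainable if each DDT miss costs one serialized random read.
    miss_cost_s = (1.0 - ddt_hit_ratio) * random_read_ms / 1000.0
    return float("inf") if miss_cost_s == 0 else 1.0 / miss_cost_s

for media, latency_ms in [("7200rpm HDD", 8.0), ("SSD", 0.1)]:
    for hit in (0.99, 0.50, 0.0):
        print("%-12s DDT hit rate %3.0f%% : %10.0f writes/s ceiling"
              % (media, hit * 100, dedup_write_ceiling(hit, latency_ms)))
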
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Erik Trimble

On 7/10/2010 5:24 AM, Richard Elling wrote:

On Jul 9, 2010, at 11:10 PM, Brandon High wrote:

   

On Fri, Jul 9, 2010 at 5:18 PM, Brandon Highbh...@freaks.com  wrote:
I think that DDT entries are a little bigger than what you're using. The size 
seems to range between 150 and 250 bytes depending on how it's calculated, call 
it 200b each. Your 128G dataset would require closer to 200M (+/- 25%) for the 
DDT if your data was completely unique. 1TB of unique data would require 600M - 
1000M for the DDT.

Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB for 
1TB of unique data.
 

4% seems to be a pretty good SWAG.

   

A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the DDT. 
Ouch.
 

... or more than 300GB for 512-byte records.

The performance issue is that DDT access tends to be random. This implies that
if you don't have a lot of RAM and your pool has poor random read I/O
performance, then you will not be impressed with dedup performance. In other
words, trying to dedup lots of data on a small DRAM machine using big, slow
pool HDDs will not set any benchmark records. By contrast, using SSDs for the
pool can demonstrate good random read performance. As the price per bit of
HDDs continues to drop, the value of deduping pools using HDDs also drops.
  -- richard

   


Which brings up an interesting idea:   if I have a pool with good random 
I/O  (perhaps made from SSDs, or even one of those nifty Oracle F5100 
things),  I would probably not want to have a DDT created, or at least 
have one that was very significantly abbreviated.   What capability does 
ZFS have for recognizing that we won't need a full DDT created for 
high-I/O-speed pools?  Particularly with the fact that such pools would 
almost certainly be heavy candidates for dedup (the $/GB being 
significantly higher than other mediums, and thus space being at a 
premium) ?


I'm not up on exactly how the DDT gets built and referenced to 
understand how this might happen.  But, I can certainly see it as being 
useful to tell ZFS (perhaps through a pool property?) that building an 
in-ARC DDT isn't really needed.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Richard Elling
On Jul 10, 2010, at 5:33 AM, Erik Trimble wrote:

 On 7/10/2010 5:24 AM, Richard Elling wrote:
 On Jul 9, 2010, at 11:10 PM, Brandon High wrote:
 
   
 On Fri, Jul 9, 2010 at 5:18 PM, Brandon Highbh...@freaks.com  wrote:
 I think that DDT entries are a little bigger than what you're using. The 
 size seems to range between 150 and 250 bytes depending on how it's 
 calculated, call it 200b each. Your 128G dataset would require closer to 
 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of 
 unique data would require 600M - 1000M for the DDT.
 
 Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB 
 for 1TB of unique data.
 
 4% seems to be a pretty good SWAG.
 
   
 A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the 
 DDT. Ouch.
 
 ... or more than 300GB for 512-byte records.
 
 The performance issue is that DDT access tends to be random. This implies that
 if you don't have a lot of RAM and your pool has poor random read I/O
 performance, then you will not be impressed with dedup performance. In other
 words, trying to dedup lots of data on a small DRAM machine using big, slow
 pool HDDs will not set any benchmark records. By contrast, using SSDs for the
 pool can demonstrate good random read performance. As the price per bit of
 HDDs continues to drop, the value of deduping pools using HDDs also drops.
  -- richard
 
   
 
 Which brings up an interesting idea:   if I have a pool with good random I/O  
 (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),  I 
 would probably not want to have a DDT created, or at least have one that was 
 very significantly abbreviated.   What capability does ZFS have for 
 recognizing that we won't need a full DDT created for high-I/O-speed pools?  
 Particularly with the fact that such pools would almost certainly be heavy 
 candidates for dedup (the $/GB being significantly higher than other mediums, 
 and thus space being at a premium) ?

Methinks it is impossible to build a complete DDT; we'll run out of atoms...
maybe if we can use strings?  :-)  Think of it as a very, very sparse array.
Otherwise it is managed just like other metadata.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Brandon High
On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble erik.trim...@oracle.com wrote:

 Which brings up an interesting idea:   if I have a pool with good random
 I/O  (perhaps made from SSDs, or even one of those nifty Oracle F5100
 things),  I would probably not want to have a DDT created, or at least have
 one that was very significantly abbreviated.   What capability does ZFS have
 for recognizing that we won't need a full DDT created for high-I/O-speed
 pools?  Particularly with the fact that such pools would almost certainly be
 heavy candidates for dedup (the $/GB being significantly higher than other
 mediums, and thus space being at a premium) ?


I'm not exactly sure what problem you're trying to solve. Dedup is to save
space, not accelerate i/o. While the DDT is pool-wide, only data that's
added to datasets with dedup enabled will create entries in the DDT. If
there's data that you don't want to dedup, then don't add it to a pool with
dedup enabled.

I'm not up on exactly how the DDT gets built and referenced to understand
 how this might happen.  But, I can certainly see it as being useful to tell
 ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't
 really needed.


The DDT is in the pool, not in the ARC. Because it's frequently accessed,
some / most of it will reside in the ARC.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
 4% seems to be a pretty good SWAG.

Is the above 4% wrong, or am I wrong?

Suppose 200bytes to 400bytes, per 128Kbyte block ... 
200/131072 = 0.0015 = 0.15%
400/131072 = 0.003 = 0.3%
which would mean for 100G unique data = 153M to 312M ram.

Around 3G ram for 1Tb unique data, assuming default 128K block

Next question:

Correct me if I'm wrong, if you have a lot of duplicated data, then dedup
increases the probability of arc/ram cache hit.  So dedup allows you to
stretch your disk, and also stretch your ram cache.  Which also benefits
performance.




Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Brandon High
 
 Dedup is to
 save space, not accelerate i/o. 

I'm going to have to disagree with you there.  Dedup is a type of
compression.  Compression can be used for storage savings, and/or
acceleration.  Fast and lightweight compression algorithms (lzop, v.42bis,
v.44) are usually used in-line for acceleration, while compute-expensive
algorithms (bzip2, lzma, gzip) are usually used for space savings and rarely
for acceleration (except when transmitting data across a slow channel).

Most general-purpose lossless compression algorithms (and certainly most of
the ones I just mentioned) achieve compression by reducing duplicated data.
There are special-purpose lossless formats (FLAC, etc.) and lossy ones (JPEG,
MP3, etc.) which use other techniques.  But general-purpose compression might
possibly even be exclusively algorithms for reduction of repeated data.

Unless I'm somehow mistaken, the performance benefit of dedup comes from the
fact that it increases cache hits.  Instead of having to read a thousand
duplicate blocks from different sectors of disks, you read the block once;
the other 999 have all been stored the same as the original block, so you get
999 cache hits and never need to read the disk again for them.
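
A toy model of that argument, with made-up counts and a deliberately crude
cache, just to make the bookkeeping concrete: reads are cached by physical
block, so 1000 logical blocks that dedup down to one physical block miss only
once.

def simulate_reads(physical_addresses, cache_capacity=100):
    cache, hits, misses = set(), 0, 0
    for addr in physical_addresses:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            if len(cache) >= cache_capacity:
                cache.pop()        # arbitrary eviction, good enough for a toy
            cache.add(addr)
    return hits, misses

# Without dedup: 1000 identical logical blocks live at 1000 distinct addresses.
print("no dedup: hits/misses =", simulate_reads(range(1000)))
# With dedup: the same 1000 logical blocks all resolve to one physical block.
print("dedup   : hits/misses =", simulate_reads([0] * 1000))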



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Roy Sigurd Karlsbakk
  4% seems to be a pretty good SWAG.
 
 Is the above 4% wrong, or am I wrong?
 
 Suppose 200bytes to 400bytes, per 128Kbyte block ...
 200/131072 = 0.0015 = 0.15%
 400/131072 = 0.003 = 0.3%
 which would mean for 100G unique data = 153M to 312M ram.
 
 Around 3G ram for 1Tb unique data, assuming default 128K block

Recordsize means maximum block size. Smaller files will be stored in smaller
blocks. With lots of files of different sizes, the block size will generally be
smaller than the recordsize set for ZFS.

 Next question:
 
 Correct me if I'm wrong, if you have a lot of duplicated data, then
 dedup
 increases the probability of arc/ram cache hit. So dedup allows you to
 stretch your disk, and also stretch your ram cache. Which also
 benefits performance.

Theoretically, yes, but there will be an overhead in cpu/memory that can reduce 
this benefit to a penalty.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases, adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
 From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
  increases the probability of arc/ram cache hit. So dedup allows you
 to
  stretch your disk, and also stretch your ram cache. Which also
  benefits performance.
 
 Theoretically, yes, but there will be an overhead in cpu/memory that
 can reduce this benefit to a penalty.

That's why a really fast compression algorithm is used in-line, in hopes that
the time cost of compression is smaller than the performance gain from
compression.  Take, for example, v.42bis and v.44, which were used to accelerate
56K modems.  (Probably still are, if you actually have a modem somewhere.  ;-)

Nowadays we have faster communication channels; in fact when talking about 
dedup we're talking about local disk speed, which is really fast.  But we also 
have fast processors, and the algorithm in question can be really fast.

I recently benchmarked lzop, gzip, bzip2, and lzma for some important data on
our fileserver that I would call typical.  No matter what I did, lzop was so
ridiculously lightweight that I could never get lzop up to 100% CPU.  Even
reading data 100% from cache and filtering through lzop to /dev/null, the
kernel overhead of reading the RAM cache was higher than the CPU overhead to
compress.

For the data in question, lzop compressed to 70%, gzip compressed to 42%, bzip2
to 32%, and lzma to something like 16%.  bzip2 was the slowest (by a factor of
4).  lzma -1 and gzip --fast were closely matched in speed but not in
compression.  So the compression of lzop was really weak for the data in
question, but it contributed no significant CPU overhead.  The point is: it's
absolutely possible to compress quickly, if you have a fast algorithm, and gain
performance.  I'm boldly assuming dedup performs this fast.  It would be nice
to actually measure and prove it.
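
For anyone who wants to repeat that kind of comparison on their own data, here
is a rough sketch using Python's built-in codecs as stand-ins for the
command-line tools (zlib for gzip, plus bz2 and lzma; there is no lzop
equivalent in the standard library, and the file path below is only a
placeholder):

import bz2
import lzma
import time
import zlib

SAMPLE_FILE = "/var/tmp/sample.dat"     # placeholder: point at your own data

with open(SAMPLE_FILE, "rb") as f:
    data = f.read()

codecs = [("zlib (gzip)", lambda d: zlib.compress(d, 6)),
          ("bzip2",       lambda d: bz2.compress(d, 9)),
          ("lzma",        lambda d: lzma.compress(d))]

for name, compress in codecs:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    ratio = 100.0 * len(out) / len(data)
    speed = len(data) / (1024.0 * 1024.0) / max(elapsed, 1e-9)
    print("%-12s %5.1f%% of original size, %7.1f MB/s" % (name, ratio, speed))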



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Erik Trimble

On 7/10/2010 10:14 AM, Brandon High wrote:
On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble erik.trim...@oracle.com wrote:


Which brings up an interesting idea:   if I have a pool with good
random I/O  (perhaps made from SSDs, or even one of those nifty
Oracle F5100 things),  I would probably not want to have a DDT
created, or at least have one that was very significantly
abbreviated.   What capability does ZFS have for recognizing that
we won't need a full DDT created for high-I/O-speed pools?
 Particularly with the fact that such pools would almost certainly
be heavy candidates for dedup (the $/GB being significantly higher
than other mediums, and thus space being at a premium) ?


I'm not exactly sure what problem you're trying to solve. Dedup is to 
save space, not accelerate i/o. While the DDT is pool-wide, only data 
that's added to datasets with dedup enabled will create entries in the 
DDT. If there's data that you don't want to dedup, then don't add it 
to a pool with dedup enabled.




What I'm talking about here is that caching the DDT in the ARC takes a 
non-trivial amount of space (as we've discovered). For a pool consisting 
of backing store with access times very close to that of main memory, 
there's no real benefit from caching it in the ARC/L2ARC, so it would be 
useful if the DDT was simply kept somewhere on the actual backing store, 
and there was some way to tell ZFS to look there exclusively, and not 
try to build/store a DDT in ARC.





I'm not up on exactly how the DDT gets built and referenced to
understand how this might happen.  But, I can certainly see it as
being useful to tell ZFS (perhaps through a pool property?) that
building an in-ARC DDT isn't really needed.


The DDT is in the pool, not in the ARC. Because it's frequently 
accessed, some / most of it will reside in the ARC.


-B

--
Brandon High : bh...@freaks.com


Are you sure? I was under the impression that the DDT had to be built 
from info in the pool, but that what we call the DDT only exists in the 
ARC.  That's my understanding from reading the ddt.h and ddt.c files - 
that the 'ddt_entry' and 'ddt' structures exist in RAM/ARC/L2ARC, but not 
on disk. Those two are built using the 'ddt_key' and 'ddt_bookmark' 
structures on disk.


Am I missing something?

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



[zfs-discuss] Debunking the dedup memory myth

2010-07-09 Thread Edward Ned Harvey
Whenever somebody asks the question, "How much memory do I need to dedup an
X-terabyte filesystem?", the standard answer is "as much as you can afford to
buy."  This is true and correct, but I don't believe it's the best we can
do, because "as much as you can buy" is a true assessment for memory in
*any* situation.

 

To improve knowledge in this area, I think the question just needs to be
asked differently.  How much *extra* memory is required for X terabytes,
with dedup enabled versus disabled?

 

I hope somebody knows more about this than me.  I expect the answer will be
something like ...

 

The default ZFS block size is 128K.  If you have a filesystem with 128G
used, that means you are consuming 1,048,576 blocks, each of which must be
checksummed.  ZFS uses adler32 and sha256, which means 4bytes and 32bytes
...  36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by
enabling dedup.

 

I suspect my numbers are off, because 36Mbytes seems impossibly small.  But
I hope some sort of similar (and more correct) logic will apply.  ;-)



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-09 Thread Brandon High
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey solar...@nedharvey.com wrote:

  The default ZFS block size is 128K.  If you have a filesystem with 128G
 used, that means you are consuming 1,048,576 blocks, each of which must be
 checksummed.  ZFS uses adler32 and sha256, which means 4bytes and 32bytes
 ...  36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by
 enabling dedup.



 I suspect my numbers are off, because 36Mbytes seems impossibly small.  But
 I hope some sort of similar (and more correct) logic will apply.  ;-)


I think that DDT entries are a little bigger than what you're using. The
size seems to range between 150 and 250 bytes depending on how it's
calculated, call it 200b each. Your 128G dataset would require closer to
200M (+/- 25%) for the DDT if your data was completely unique. 1TB of unique
data would require 600M - 1000M for the DDT.

The numbers are fuzzy of course, and assume only 128k blocks. Lots of small
files will increase the memory cost of dedupe, and using it on a zvol that
has the default block size (8k) would require 16 times the memory.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-09 Thread Erik Trimble

On 7/9/2010 5:18 PM, Brandon High wrote:
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey 
solar...@nedharvey.com wrote:


The default ZFS block size is 128K.  If you have a filesystem with
128G used, that means you are consuming 1,048,576 blocks, each of
which must be checksummed.  ZFS uses adler32 and sha256, which
means 4bytes and 32bytes ...  36 bytes * 1M blocks = an extra 36
Mbytes and some fluff consumed by enabling dedup.

I suspect my numbers are off, because 36Mbytes seems impossibly
small.  But I hope some sort of similar (and more correct) logic
will apply.  ;-)


I think that DDT entries are a little bigger than what you're using. 
The size seems to range between 150 and 250 bytes depending on how 
it's calculated, call it 200b each. Your 128G dataset would require 
closer to 200M (+/- 25%) for the DDT if your data was completely 
unique. 1TB of unique data would require 600M - 1000M for the DDT.


The numbers are fuzzy of course, and assume only 128k blocks. Lots of 
small files will increase the memory cost of dedupe, and using it on a 
zvol that has the default block size (8k) would require 16 times the 
memory.


-B




Go back and read several threads last month about ZFS/L2ARC memory usage 
for dedup. In particular, I've been quite specific about how to 
calculate estimated DDT size.  Richard has also been quite good at 
giving size estimates (as well as explaining how to see current block 
size usage in a filesystem).



The structure in question is this one:

ddt_entry

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I'd have to fire up an IDE to track down all the sizes of the ddt_entry 
structure's members, but I feel comfortable using Richard's 270 
bytes-per-entry estimate.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-09 Thread Neil Perrin

On 07/09/10 19:40, Erik Trimble wrote:

On 7/9/2010 5:18 PM, Brandon High wrote:
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey 
solar...@nedharvey.com wrote:


The default ZFS block size is 128K.  If you have a filesystem
with 128G used, that means you are consuming 1,048,576 blocks,
each of which must be checksummed.  ZFS uses adler32 and sha256,
which means 4bytes and 32bytes ...  36 bytes * 1M blocks = an
extra 36 Mbytes and some fluff consumed by enabling dedup.

 


I suspect my numbers are off, because 36Mbytes seems impossibly
small.  But I hope some sort of similar (and more correct) logic
will apply.  ;-)


I think that DDT entries are a little bigger than what you're using. 
The size seems to range between 150 and 250 bytes depending on how 
it's calculated, call it 200b each. Your 128G dataset would require 
closer to 200M (+/- 25%) for the DDT if your data was completely 
unique. 1TB of unique data would require 600M - 1000M for the DDT.


The numbers are fuzzy of course, and assume only 128k blocks. Lots of 
small files will increase the memory cost of dedupe, and using it on 
a zvol that has the default block size (8k) would require 16 times 
the memory.


-B




Go back and read several threads last month about ZFS/L2ARC memory 
usage for dedup. In particular, I've been quite specific about how to 
calculate estimated DDT size.  Richard has also been quite good at 
giving size estimates (as well as explaining how to see current block 
size usage in a filesystem).



The structure in question is this one:

ddt_entry

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I'd have to fire up an IDE to track down all the sizes of the 
ddt_entry structure's members, but I feel comfortable using Richard's 
270 bytes-per-entry estimate.




It must have grown a bit, because on 64-bit x86 a ddt_entry is currently
0x178 = 376 bytes:


# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic 
cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs sata sd ip hook neti 
sockfs arp usba fctl random cpc fcip nfs lofs ufs logindmux ptm sppp ipc ]

> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178

