Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread Richard Elling
Erik is right, more below...

On Jun 13, 2010, at 10:17 PM, Erik Trimble wrote:

 Hernan F wrote:
 Hello, I tried enabling dedup on a filesystem, and moved files into it to 
 take advantage of it. I had about 700GB of files and left it for some hours. 
 When I returned, only 70GB were moved.
 
 I checked zpool iostat, and it showed about 8MB/s R/W performance (the old 
 and new zfs filesystems are in the same pool). So I disabled dedup for a few 
 seconds and instantly the performance jumped to 80MB/s.
 
 It's an Athlon64 x2 machine with 4GB RAM, it's only a fileserver (4x1TB SATA 
 for ZFS). arcstat.pl shows 2G for arcsz, top shows 13% CPU during the 8MB/s 
 transfers. 
 Is this normal behavior? Should I always expect such low performance, or is 
 there anything wrong with my setup? 
 Thanks in advance,
 Hernan
  
 You are severely RAM limited.  In order to do dedup, ZFS has to maintain a 
 catalog of every single block it writes and the checksum for that block. This 
 is called the Dedup Table (DDT for short).  
 So, during the copy, ZFS has to (a) read a block from the old filesystem, (b) 
 check the current DDT to see if that block exists and (c) either write the 
 block to the new filesystem (and add an appropriate DDT entry for it), or 
 write a metadata update with a reference to the existing deduplicated block.
 
 Likely, you have two problems:
 
 (1) I suspect your source filesystem has lots of blocks (that is, it's likely 
 made up of smaller-sized files).  Lots of blocks means lots of seeking back and 
 forth to read all those blocks.
 
 (2) Lots of blocks also means lots of entries in the DDT.  It's trivial to 
 overwhelm a 4GB system with a large DDT.  If the DDT can't fit in RAM, then 
 it has to get partially refreshed from disk.
 
 Thus, here's what's likely going on:
 
 (1)  ZFS reads a block and its checksum from the old filesystem
 (2)  it checks the DDT to see if that checksum exists
 (3) finding that the entire DDT isn't resident in RAM, it starts a cycle to 
 read the rest of the (potential) entries from the new filesystem's metadata.  
 That is, it tries to reconstruct the DDT from disk.  Which involves a HUGE 
 amount of random seek reads on the new filesystem.
 
 In essence, since you likely can't fit the DDT in RAM, each block read from 
 the old filesystem forces a flurry of reads from the new filesystem. Which 
 eats up the IOPS that your single pool can provide.  It thrashes the disks.  
 Your solution is to either buy more RAM, or find something you can use as an 
 L2ARC cache device for your pool.  Ideally, it would be an SSD.  However, in 
 this case, a plain hard drive would do OK (NOT one already in a pool).  To 
 add such a device, you would do:  'zpool add tank mycachedevice'


A typical next question is how large will the DDT become?
Without measuring, I use 3% as a SWAG.  So if you have 700GB
of data then, as a SWAG, you need about 21GB for the DDT.
Since you also want file data in the ARC, and the ARC can use up
to 7/8ths of RAM, a 32GB machine should work fine.  Or perhaps
invest in a 32+GB SSD for a cache device.  Note that every entry in
the cache SSD also consumes space in the ARC, so your small machine
will find its limits.
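
As a back-of-the-envelope sketch (the 3% figure is a SWAG, not a measurement,
and the numbers are only illustrative):

  # 3% of 700GB of data:
  echo '700 * 0.03' | bc    # => 21.00  (GB needed to hold the DDT)
  # plus headroom in the ARC for file data and other metadata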

If you want to know more precisely how big the DDT is, use
the zdb -D command to get a summary of how many objects
are in the DDT and their size.  Simple arithmetic solves the
equation.
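
For example, a hedged sketch ('tank' is a placeholder pool name; run it as
root and trust the sizes zdb itself reports rather than my comments):

  zdb -D tank     # summary: number of DDT entries plus their on-disk and in-core sizes
  zdb -DD tank    # same, plus a histogram of reference counts
  # in-core DDT footprint ~= (total entries) x (in-core bytes per entry reported above)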
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread Dennis Clarke


 You are severely RAM limited.  In order to do dedup, ZFS has to maintain
 a catalog of every single block it writes and the checksum for that
 block. This is called the Dedup Table (DDT for short).

 So, during the copy, ZFS has to (a) read a block from the old
 filesystem, (b) check the current DDT to see if that block exists and
 (c) either write the block to the new filesystem (and add an appropriate
 DDT entry for it), or write a metadata update with a reference to the
 existing deduplicated block.

 Likely, you have two problems:

 (1) I suspect your source filesystem has lots of blocks (that is, it's
 likely made up of smaller-sized files).  Lots of blocks means lots of
 seeking back and forth to read all those blocks.

 (2) Lots of blocks also means lots of entries in the DDT.  It's trivial
 to overwhelm a 4GB system with a large DDT.  If the DDT can't fit in
 RAM, then it has to get partially refreshed from disk.

 Thus, here's what's likely going on:

 (1)  ZFS reads a block and its checksum from the old filesystem
 (2)  it checks the DDT to see if that checksum exists
 (3) finding that the entire DDT isn't resident in RAM, it starts a cycle
 to read the rest of the (potential) entries from the new filesystem's
 metadata.  That is, it tries to reconstruct the DDT from disk.  Which
 involves a HUGE amount of random seek reads on the new filesystem.

 In essence, since you likely can't fit the DDT in RAM, each block read
 from the old filesystem forces a flurry of reads from the new
 filesystem. Which eats up the IOPS that your single pool can provide.
 It thrashes the disks.  Your solution is to either buy more RAM, or find
 something you can use as an L2ARC cache device for your pool.  Ideally,
 it would be an SSD.  However, in this case, a plain hard drive would do
 OK (NOT one already in a pool).  To add such a device, you would do:
 'zpool add tank mycachedevice'



That was an awesome response!  Thank you for that :-)
I tend to configure my servers with 16GB of RAM minimum these days, and now I
know why.


-- 
Dennis Clarke
dcla...@opensolaris.ca  - Email related to the open source Solaris
dcla...@blastwave.org   - Email related to open source for Solaris




Re: [zfs-discuss] Dedup performance hit

2010-06-14 Thread remi.urbillac
 

 To add such a device, you would do:
 'zpool add tank mycachedevice'



Hi

Correct me if I'm wrong, but I believe the correct command should be:
'zpool add tank cache mycachedevice'

If you don't use the cache keyword, the device would be added as a regular 
top-level vdev.
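
For reference, a hedged sketch of the difference ('tank' and c2t0d0 are
placeholders for the pool and the spare device):

  zpool add tank cache c2t0d0    # shows up under a separate "cache" section in 'zpool status'
  zpool remove tank c2t0d0       # a cache device can also be removed again later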

Remi




[zfs-discuss] Dedup performance hit

2010-06-13 Thread Hernan F
Hello, I tried enabling dedup on a filesystem, and moved files into it to take 
advantage of it. I had about 700GB of files and left it for some hours. When I 
returned, only 70GB were moved.

I checked zpool iostat, and it showed about 8MB/s R/W performance (the old and 
new zfs filesystems are in the same pool). So I disabled dedup for a few 
seconds and instantly the performance jumped to 80MB/s.

It's an Athlon64 x2 machine with 4GB RAM, it's only a fileserver (4x1TB SATA for 
ZFS). arcstat.pl shows 2G for arcsz, top shows 13% CPU during the 8MB/s 
transfers. 

Is this normal behavior? Should I always expect such low performance, or is 
there anything wrong with my setup? 

Thanks in advance,
Hernan
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Dedup performance hit

2010-06-13 Thread Erik Trimble

Hernan F wrote:

Hello, I tried enabling dedup on a filesystem, and moved files into it to take 
advantage of it. I had about 700GB of files and left it for some hours. When I 
returned, only 70GB were moved.

I checked zpool iostat, and it showed about 8MB/s R/W performance (the old and 
new zfs filesystems are in the same pool). So I disabled dedup for a few 
seconds and instantly the performance jumped to 80MB/s.

It's an Athlon64 x2 machine with 4GB RAM, it's only a fileserver (4x1TB SATA for ZFS). arcstat.pl shows 2G for arcsz, top shows 13% CPU during the 8MB/s transfers. 

Is this normal behavior? Should I always expect such low performance, or is there anything wrong with my setup? 


Thanks in advance,
Hernan
  
You are severely RAM limited.  In order to do dedup, ZFS has to maintain 
a catalog of every single block it writes and the checksum for that 
block. This is called the Dedup Table (DDT for short).  

So, during the copy, ZFS has to (a) read a block from the old 
filesystem, (b) check the current DDT to see if that block exists and 
(c) either write the block to the new filesystem (and add an appropriate 
DDT entry for it), or write a metadata update with a reference to the 
existing deduplicated block.


Likely, you have two problems:

(1) I suspect your source filesystem has lots of blocks (that is, it's 
likely made up of smaller-sized files).  Lots of blocks means lots of 
seeking back and forth to read all those blocks.


(2) Lots of blocks also means lots of entries in the DDT.  It's trivial 
to overwhelm a 4GB system with a large DDT.  If the DDT can't fit in 
RAM, then it has to get partially refreshed from disk.


Thus, here's what's likely going on:

(1)  ZFS reads a block and its checksum from the old filesystem
(2)  it checks the DDT to see if that checksum exists
(3) finding that the entire DDT isn't resident in RAM, it starts a cycle 
to read the rest of the (potential) entries from the new filesystem's 
metadata.  That is, it tries to reconstruct the DDT from disk.  Which 
involves a HUGE amount of random seek reads on the new filesystem.


In essence, since you likely can't fit the DDT in RAM, each block read 
from the old filesystem forces a flurry of reads from the new 
filesystem. Which eats up the IOPS that your single pool can provide.  
It thrashes the disks.  Your solution is to either buy more RAM, or find 
something you can use as an L2ARC cache device for your pool.  Ideally, 
it would be an SSD.  However, in this case, a plain hard drive would do 
OK (NOT one already in a pool).  To add such a device, you would do:  
'zpool add tank mycachedevice'
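
To see whether memory really is the constraint while the copy runs, here is a
hedged sketch using the standard Solaris ARC kstats (plus the arcstat.pl
script the original poster already has):

  kstat -p zfs:0:arcstat:size zfs:0:arcstat:c_max   # current ARC size vs. its ceiling
  kstat -p zfs:0:arcstat:misses                     # steadily climbing misses suggest the DDT is not staying resident
  arcstat.pl 5                                      # sample the same counters every 5 seconds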





--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Dedup performance hit

2010-06-13 Thread John J Balestrini
Howdy all,

I too dabbled with dedup and found the performance poor with only 4GB of RAM. I've 
since disabled dedup and find the performance better, but zpool list still 
shows a 1.15x dedup ratio. Is this still a hit on disk I/O performance? Aside 
from copying the data off and back onto the filesystem, is there another way to 
de-dedup the pool?
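
A hedged aside on watching that leftover ratio ('tank' and 'tank/fs' are
placeholder names):

  zpool get dedupratio tank    # stays above 1.00x as long as previously deduped blocks remain
  zfs get dedup tank/fs        # confirms dedup is now off for new writes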

Thanks,

John



On Jun 13, 2010, at 10:17 PM, Erik Trimble wrote:

 Hernan F wrote:
 Hello, I tried enabling dedup on a filesystem, and moved files into it to 
 take advantage of it. I had about 700GB of files and left it for some hours. 
 When I returned, only 70GB were moved.
 
 I checked zpool iostat, and it showed about 8MB/s R/W performance (the old 
 and new zfs filesystems are in the same pool). So I disabled dedup for a few 
 seconds and instantly the performance jumped to 80MB/s.
 
 It's an Athlon64 x2 machine with 4GB RAM, it's only a fileserver (4x1TB SATA 
 for ZFS). arcstat.pl shows 2G for arcsz, top shows 13% CPU during the 8MB/s 
 transfers. 
 Is this normal behavior? Should I always expect such low performance, or is 
 there anything wrong with my setup? 
 Thanks in advance,
 Hernan
  
 You are severely RAM limited.  In order to do dedup, ZFS has to maintain a 
 catalog of every single block it writes and the checksum for that block. This 
 is called the Dedup Table (DDT for short).  
 So, during the copy, ZFS has to (a) read a block from the old filesystem, (b) 
 check the current DDT to see if that block exists and (c) either write the 
 block to the new filesystem (and add an appropriate DDT entry for it), or 
 write a metadata update with a reference to the existing deduplicated block.
 
 Likely, you have two problems:
 
 (1) I suspect your source filesystem has lots of blocks (that is, it's likely 
 made up of smaller-sized files).  Lots of blocks means lots of seeking back and 
 forth to read all those blocks.
 
 (2) Lots of blocks also means lots of entries in the DDT.  It's trivial to 
 overwhelm a 4GB system with a large DDT.  If the DDT can't fit in RAM, then 
 it has to get partially refreshed from disk.
 
 Thus, here's what's likely going on:
 
 (1)  ZFS reads a block and its checksum from the old filesystem
 (2)  it checks the DDT to see if that checksum exists
 (3) finding that the entire DDT isn't resident in RAM, it starts a cycle to 
 read the rest of the (potential) entries from the new filesystem's metadata.  
 That is, it tries to reconstruct the DDT from disk.  Which involves a HUGE 
 amount of random seek reads on the new filesystem.
 
 In essence, since you likely can't fit the DDT in RAM, each block read from 
 the old filesystem forces a flurry of reads from the new filesystem. Which 
 eats up the IOPS that your single pool can provide.  It thrashes the disks.  
 Your solution is to either buy more RAM, or find something you can use as an 
 L2ARC cache device for your pool.  Ideally, it would be an SSD.  However, in 
 this case, a plain hard drive would do OK (NOT one already in a pool).  To 
 add such a device, you would do:  'zpool add tank mycachedevice'
 
 
 
 
 -- 
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss