Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Nathan Kroenert

 On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:

On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup-aware.

The only vendor I know that can do this is NetApp

And you really work at Oracle?:)

The answer is definitely yes. The ARC caches on-disk blocks, and dedup just
references those blocks. When you read, the dedup code is not involved at all.
Let me show it to you with a simple test:

Create a file (dedup is on):

# dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

# dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cached data is dropped, then reimport it:

# zpool export foo
# zpool import foo

Now let's read one file:

# dd if=/foo/a of=/dev/null bs=1m
1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file
shares all the same blocks, so if ARC caches blocks only once, reading
'b' should be much faster:

# dd if=/foo/b of=/dev/null bs=1m
1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
activity. Magic?:)
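
For anyone reproducing this, one way to confirm the lack of disk
activity is to watch the pool from another terminal while the second
dd runs; the per-device columns should stay at zero:

# zpool iostat -v foo 1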



Hey all,

That reminds me of something I have been wondering about... Why only 12x 
faster? If we are effectively reading from memory, as compared to a disk 
reading sequentially at approximately 100MB/s (about average for a PC HDD), 
I'd have thought it should be a lot faster than 12x.


Can we really pull stuff from cache at only a little over one 
gigabyte per second if it's dedup data?


Cheers!

Nathan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Jim Klimov

2011-12-11 15:10, Nathan Kroenert wrote:


Hey all,

That reminds me of something I have been wondering about... Why only 12x
faster? If we are effectively reading from memory, as compared to a disk
reading sequentially at approximately 100MB/s (about average for a PC HDD),
I'd have thought it should be a lot faster than 12x.

Can we really pull stuff from cache at only a little over one
gigabyte per second if it's dedup data?



I believe there are a couple of things in play.

One is that you'd rarely get 100MB/s from a single HDD
due to fragmentation, to which ZFS is especially prone. But you do
mention sequential reading, so that's covered.

Besides, from Pawel's dd examples we see that he first read
at 98 MByte/sec average, and then at 1233 MByte/sec.

Another aspect is the RAM bandwidth, and we don't know the
specs of Pawel's test rig. For example, DDR2-400 (PC2-3200)
peaks out at 3200 MByte/sec. That would include walking the
(cached) DDT tree for each block involved, determining which
(cached) data blocks correspond to it, and fetching them
from RAM or disk.
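(That peak figure is just the bus clock times two transfers per clock
times 8 bytes per 64-bit transfer: 200 MHz x 2 x 8 bytes =
3200 MByte/sec, i.e. roughly 25 Gbit/sec.)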

I would not be surprised to see that there is some disk IO
adding delays in the second case (the read of a deduped file
clone), because you still have to determine the references
to this second file's blocks, and another path of on-disk
blocks might lead to it from a separate inode in a separate
dataset (or I might be wrong). Reading this second path of
pointers to the same cached data blocks might decrease speed
a little.

It would be interesting to see Pawel's test updated with
second reads of both files (now that data and metadata are
all cached in RAM). It's possible that NOW the reads would be
closer to RAM speeds with no disk IO. And I would be very
surprised if the two speeds were noticeably different ;)
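
A minimal sketch of that follow-up, reusing Pawel's file names:

# dd if=/foo/a of=/dev/null bs=1m
# dd if=/foo/b of=/dev/null bs=1m

If the theory holds, both runs should now clock in near the
1233 MByte/sec figure from the first test.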

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nathan Kroenert
 
 That reminds me of something I have been wondering about... Why only 12x
 faster? If we are effectively reading from memory, as compared to a disk
 reading sequentially at approximately 100MB/s (about average for a PC HDD),
 I'd have thought it should be a lot faster than 12x.

 Can we really pull stuff from cache at only a little over one
 gigabyte per second if it's dedup data?

Actually, CPUs and memory aren't as fast as you might think.  In a system
with 12 disks, I've had to write my own dd replacement, because dd
if=/dev/zero bs=1024k wasn't fast enough to keep the disks busy.  Later, I
wanted to do something similar using unique data, and it was simply
impossible to generate random data fast enough.  I had to tweak my dd
replacement to write serial numbers, which still wasn't fast enough, so I
had to tweak it again to write a big block of static data, followed by a
serial number, followed by another big block (each chunk always smaller
than the disk block, so it would be treated as unique when hitting the
pool...)
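
For illustration only, that data layout can be sketched in shell (the
path here is made up, and a loop like this is far too slow to keep
disks busy, which was the whole point of writing a dedicated tool):

i=0
while [ $i -lt 16384 ]; do
    printf '%020u' "$i"                            # unique serial number per chunk
    dd if=/dev/zero bs=65516 count=1 2>/dev/null | tr '\0' 'x'   # static filler
    i=$((i+1))
done > /tank/unique
# 16384 x 64K chunks = 1 GB; each chunk is smaller than a 128K record,
# so every record contains at least one serial and dedups as unique.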

One typical disk sustains about 1 Gbit/sec.  In theory, 12 should be able to
sustain 12 Gbit/sec.  According to Jim's email, the memory bandwidth might be
25 Gbit/sec (3200 MByte/sec), of which you probably need both read and
write, making it effectively 12.5 Gbit/sec...  I'm sure the actual bandwidth
available varies by system and memory type.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Gary Driggs
What kind of drives are we talking about? Even SATA drives are
available according to application type (desktop, enterprise server,
home PVR, surveillance PVR, etc). Then there are drives with SAS and
Fibre Channel interfaces. Then you've got Winchester platters vs SSDs
vs hybrids. But even before considering that and all the other system
factors, throughput for direct-attached storage can vary greatly: not
only do the interface type and storage tech matter, but even small
differences in on-drive controller firmware can introduce variances.
That's why server manufacturers like HP, Dell, et al. prefer that you
replace failed drives with one of theirs instead of something off the
shelf: theirs usually have firmware that's been fine-tuned in house
or in conjunction with the drive manufacturer.


On Dec 11, 2011, at 8:25 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nathan Kroenert

 That reminds me of something I have been wondering about... Why only 12x
 faster? If we are effectively reading from memory, as compared to a disk
 reading sequentially at approximately 100MB/s (about average for a PC HDD),
 I'd have thought it should be a lot faster than 12x.

 Can we really pull stuff from cache at only a little over one
 gigabyte per second if it's dedup data?

 Actually, CPUs and memory aren't as fast as you might think.  In a system
 with 12 disks, I've had to write my own dd replacement, because dd
 if=/dev/zero bs=1024k wasn't fast enough to keep the disks busy.  Later, I
 wanted to do something similar using unique data, and it was simply
 impossible to generate random data fast enough.  I had to tweak my dd
 replacement to write serial numbers, which still wasn't fast enough, so I
 had to tweak it again to write a big block of static data, followed by a
 serial number, followed by another big block (each chunk always smaller
 than the disk block, so it would be treated as unique when hitting the
 pool...)

 One typical disk sustains about 1 Gbit/sec.  In theory, 12 should be able to
 sustain 12 Gbit/sec.  According to Jim's email, the memory bandwidth might be
 25 Gbit/sec (3200 MByte/sec), of which you probably need both read and
 write, making it effectively 12.5 Gbit/sec...  I'm sure the actual bandwidth
 available varies by system and memory type.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zdb leaks checking

2011-12-11 Thread Andriy Gapon

Does zdb's leak checking mechanism also check for the opposite situation?
That is, used/referenced blocks falling within free regions of the space maps.
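For context, I mean the traversal-based check in something like

# zdb -b foo

(pool name just an example), which walks all referenced blocks and
compares the total against what the space maps claim is allocated.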
Thank you.
-- 
Andriy Gapon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] does log device (ZIL) require a mirror setup?

2011-12-11 Thread Thomas Nau
Dear all
We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool.
As they are only supposed to be read after a crash or when booting, and
those nice things are pretty expensive, I'm wondering if mirroring
the log devices is a must or just highly recommended.

Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] does log device (ZIL) require a mirror setup?

2011-12-11 Thread Matt Breitbach
I would say that it's highly recommended.  If you have a pool that needs
to be imported and it has a faulted, unmirrored log device, you risk data
corruption.
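
For example (pool and device names made up), a mirrored log pair can be
added with:

# zpool add tank log mirror c4t0d0 c4t1d0

and an existing single log device can be converted by attaching a second
device to it:

# zpool attach tank c4t0d0 c4t1d0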

-Matt Breitbach

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
Sent: Sunday, December 11, 2011 1:28 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] does log device (ZIL) require a mirror setup?

Dear all
We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool.
As they are only supposed to be read after a crash or when booting, and
those nice things are pretty expensive, I'm wondering if mirroring
the log devices is a must or just highly recommended.

Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] does log device (ZIL) require a mirror setup?

2011-12-11 Thread Frank Cusack
Corruption?  Or just loss?

On Sun, Dec 11, 2011 at 1:27 PM, Matt Breitbach
matth...@flash.shanje.com wrote:

 I would say that it's highly recommended.  If you have a pool that needs
 to be imported and it has a faulted, unmirrored log device, you risk data
 corruption.

 -Matt Breitbach

 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org
 [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
 Sent: Sunday, December 11, 2011 1:28 PM
 To: zfs-discuss@opensolaris.org
 Subject: [zfs-discuss] does log device (ZIL) require a mirror setup?

 Dear all
 We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool.
 As they are only supposed to be read after a crash or when booting, and
 those nice things are pretty expensive, I'm wondering if mirroring
 the log devices is a must or just highly recommended.

 Thomas

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] does log device (ZIL) require a mirror setup?

2011-12-11 Thread Matt Breitbach
Loss of bits, but depending upon the usage of the system, corruption _could_
be a possibility.  I could envision a scenario where you were mapping an
iSCSI LUN to a system, and that system had its own FS on top of it (think
VMFS or NTFS), and when it came back online, parts of the last write commands
didn't get written, causing that filesystem to be corrupted.  Obviously this
is likely an edge-case scenario, but I could see it as a possibility.

The actual zpool would likely be fine and importable, but the underlying
data could be corrupt if there are other filesystems layered on top of it.

From: Garrett D'Amore [mailto:garrett.dam...@nexenta.com] 
Sent: Sunday, December 11, 2011 10:35 PM
To: Frank Cusack
Cc: Matt Breitbach; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] does log device (ZIL) require a mirror setup?

 

Loss only. 

Sent from my iPhone


On Dec 12, 2011, at 4:00 AM, Frank Cusack fr...@linetwo.net wrote:

Corruption?  Or just loss?

On Sun, Dec 11, 2011 at 1:27 PM, Matt Breitbach matth...@flash.shanje.com
wrote:

I would say that it's highly recommended.  If you have a pool that needs
to be imported and it has a faulted, unmirrored log device, you risk data
corruption.

-Matt Breitbach

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
Sent: Sunday, December 11, 2011 1:28 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] does log device (ZIL) require a mirror setup?

Dear all
We use a STEC ZeusRAM as a log device for a 200TB RAID-Z2 pool.
As they are only supposed to be read after a crash or when booting, and
those nice things are pretty expensive, I'm wondering if mirroring
the log devices is a must or just highly recommended.

Thomas

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss