Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-09 Thread Tim Cook
On Mon, May 9, 2011 at 2:11 AM, Evaldas Auryla wrote:

>  On 05/ 6/11 07:21 PM, Brandon High wrote:
>
>> On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson
>> wrote:
>>
>>> We use dedupe on our VMware datastores and typically see 50% savings,
>>> often times more.  We do of course keep "like" VM's on the same volume
>>>
>> I think NetApp uses 4k blocks by default, so the block size and
>> alignment should match up for most filesystems and yield better
>> savings.
>>
> That assumes the VMware datastores are on NFS? Otherwise the VMware
> filesystem, VMFS, uses its own block sizes from 1M to 8M, so the important
> point is to align the guest OS partition to 1M, and Windows guests starting
> from Vista/2008 do that by default now.
>
> Regards,
>
>
The VMFS filesystem itself is aligned by NetApp at LUN creation time.  You
still align to a 4K block on a filer because there is no way to
automatically align an encapsulated guest, especially when you could have
different guest OS types on a LUN.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-09 Thread Evaldas Auryla

 On 05/ 6/11 07:21 PM, Brandon High wrote:

> On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson wrote:
>
>> We use dedupe on our VMware datastores and typically see 50% savings,
>> often times more.  We do of course keep "like" VM's on the same volume
>
> I think NetApp uses 4k blocks by default, so the block size and
> alignment should match up for most filesystems and yield better
> savings.
That assumes the VMware datastores are on NFS? Otherwise the VMware filesystem,
VMFS, uses its own block sizes from 1M to 8M, so the important point is to
align the guest OS partition to 1M, and Windows guests starting from
Vista/2008 do that by default now.


Regards,



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-06 Thread Brandon High
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson  wrote:

> We use dedupe on our VMware datastores and typically see 50% savings,
> often times more.  We do of course keep "like" VM's on the same volume

I think NetApp uses 4k blocks by default, so the block size and
alignment should match up for most filesystems and yield better
savings.

Your server's resource requirements for ZFS and dedup will be much
higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It
should allow for offline dedup, which may make dedup usable in more
situations.

> Apologies for devolving the conversation too much in the NetApp
> direction -- simply was a point of reference for me to get a better
> understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap
in functionality but sometimes very different implementations.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-06 Thread Ray Van Dolson
On Wed, May 04, 2011 at 08:49:03PM -0700, Edward Ned Harvey wrote:
> > From: Tim Cook [mailto:t...@cook.ms]
> > 
> > That's patently false.  VM images are the absolute best use-case for dedup
> > outside of backup workloads.  I'm not sure who told you/where you got the
> > idea that VM images are not ripe for dedup, but it's wrong.
> 
> Well, I got that idea from this list.  I said a little bit about why I
> believed it was true ... about dedup being ineffective for VM's ... Would
> you care to describe a use case where dedup would be effective for a VM?  Or
> perhaps cite something specific, instead of just wiping the whole thing and
> saying "patently false?"  I don't feel like this comment was productive...
> 

We use dedupe on our VMware datastores and typically see 50% savings,
often times more.  We do of course keep "like" VM's on the same volume
(at this point nothing more than groups of Windows VM's, Linux VM's and
so on).
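
For anyone translating between the two ways people state this, a quick
back-of-the-envelope sketch (the 50% figure is ours; the other numbers below
are made up for illustration): space saved and dedup ratio are just two views
of the same number.

    # Convert between "fraction of space saved" and "dedup ratio".
    # 0.50 (50% savings) is the figure from above; 4.0 is an arbitrary example.
    def ratio_from_savings(savings):
        return 1.0 / (1.0 - savings)

    def savings_from_ratio(ratio):
        return 1.0 - 1.0 / ratio

    print(ratio_from_savings(0.50))   # 2.0  -> a 2:1 dedup ratio
    print(savings_from_ratio(4.0))    # 0.75 -> 75% of the space saved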

Note that this isn't on ZFS (yet), but we hope to begin experimenting
with it soon (using NexentaStor).

Apologies for devolving the conversation too much in the NetApp
direction -- simply was a point of reference for me to get a better
understanding of things on the ZFS side. :)

Ray


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey
 wrote:
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS.  At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup.  But it's
> certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If
you're currently exporting volumes with iSCSI it's only a 2x
increase.

The tradeoff is that you should have more duplicate blocks, and reap
the rewards there. I'm fairly certain that it won't offset the large
increase in the size of the DDT however. Dedup with zvols is probably
never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k
blocks. Of course, your chance of getting many identical blocks gets
much, much smaller. You'll have to worry about the guests' block
alignment in the context of the image file, since two identical files
may not create identical blocks as seen from ZFS. This means you may
get only fractional savings and have an enormous DDT.
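
To make the alignment point concrete, here's a minimal sketch (purely
synthetic data, nothing VMware-specific; it just hashes fixed-size chunks the
way block-level dedup would see them): the same payload dedups completely when
its copy starts on a 128k boundary inside the image, and not at all when it is
shifted by a few KB.

    import hashlib

    BLOCK = 128 * 1024   # recordsize assumed for the .vmdk backing file

    # 1 MiB of deterministic, non-repeating "guest file" data (purely synthetic).
    payload = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                       for i in range(32768))

    def dup_fraction(image, block=BLOCK):
        # Hash fixed-size blocks the way block-level dedup sees the image file.
        hashes = [hashlib.sha256(image[i:i + block]).digest()
                  for i in range(0, len(image), block)]
        return 1 - len(set(hashes)) / len(hashes)

    aligned = payload + payload                  # copy starts on a 128k boundary
    shifted = payload + b"\0" * 3072 + payload   # copy starts 3k into a block

    print(dup_fraction(aligned))   # 0.5 -> every block of the copy is a duplicate
    print(dup_fraction(shifted))   # 0.0 -> the misaligned copy shares no blocks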

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> 
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS.  At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup.  But it's
> certainly possible.

Sorry, I didn't realize ... Richard Elling (RE) just said (and I take his word
for it) that the default block size (volblocksize) for a zvol is 8k.  While of
course the default recordsize for a ZFS filesystem is 128k.

The point is that the memory requirement is a constant multiplied by the
number of blocks, so smaller blocks ==> more blocks ==> more memory
consumption.
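
To put rough numbers on that (a sketch only: the ~376 bytes per DDT entry is
the in-RAM figure Erik quotes elsewhere in this thread, and the 1 TB of data
is just an example):

    # DDT memory ~= (bytes stored / block size) * bytes per DDT entry.
    # 376 bytes/entry is the in-RAM figure quoted elsewhere in this thread;
    # 1 TiB of unique data is purely illustrative.
    DDT_ENTRY_BYTES = 376
    data = 1 * 2**40

    for bs in (128 * 1024, 8 * 1024, 4 * 1024):
        entries = data // bs
        print(f"{bs // 1024:>3}k blocks: {entries:>11,} entries, "
              f"~{entries * DDT_ENTRY_BYTES / 2**30:6.1f} GiB of DDT")
    # 128k vs 4k differ by exactly 128/4 = 32x, the factor mentioned above.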

This could be a major difference in implementation ... If you are going to
use ZFS over NFS as your VM storage backend, that would default to the 128k
recordsize, while if you're going to use ZFS over iSCSI as your VM storage
backend, that would default to the 8k volblocksize.

In either case, you really want to be aware of, and tune your recordsize
appropriately for the guest(s) that you are running.



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: Brandon High [mailto:bh...@freaks.com]
> 
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
>  wrote:
> > Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
> > or netapp or anything else.)  Because the VM images are all going to have
> > their own filesystems internally with whatever blocksize is relevant to the
> > guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> > whatever FS) host blocks...  Then even when you write duplicated data inside
> > the guest, the host won't see it as a duplicated block.
> 
> A zvol with 4k blocks should give you decent results with Windows
> guests. Recent versions use 4k alignment by default and 4k blocks, so
> there should be lots of duplicates for a base OS image.


I agree with everything Brandon said.

The one thing I would add is:  The "correct" recordsize for each guest
machine would depend on the filesystem that the guest machine is using.
Without knowing a specific filesystem on a specific guest OS, the 4k
recordsize sounds like a reasonable general-purpose setting.  But if you
know more details of the guest, you could hopefully use a larger recordsize
and therefore consume less ram on the host.

If you have to use the 4k recordsize, it is likely to consume 32x more
memory than the default 128k recordsize of ZFS.  At this rate, it becomes
increasingly difficult to get a justification to enable the dedup.  But it's
certainly possible.



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Richard Elling
On May 5, 2011, at 6:02 AM, Edward Ned Harvey wrote:
> Is this a zfs discussion list, or a nexenta sales & promotion list?

Obviously, this is a Nexenta sales & promotion list. And Oracle. And OSX.
And BSD. And Linux. And anyone who needs help or can offer help with ZFS
technology :-) This list has never been more diverse. The only sad part is the
unnecessary assassination of the OpenSolaris brand. But life moves on, and so
does good technology.
 -- richard-who-is-proud-to-work-at-Nexenta



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Richard Elling
On May 5, 2011, at 2:58 PM, Brandon High wrote:
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
> 
>> Or if you're intimately familiar with both the guest & host filesystems, and
>> you choose blocksizes carefully to make them align.  But that seems
>> complicated and likely to fail.
> 
> Using a 4k block size is a safe bet, since most OSs use a block size
> that is a multiple of 4k. It's the same reason that the new "Advanced
> Format" drives use 4k sectors.

Yes, 4KB block sizes are replacing the 512B blocks of yesteryear. However,
the real reason the HDD manufacturers headed this way is because they can
get more usable bits per platter. The tradeoff is that your workload may consume
more real space on the platter than before. TANSTAAFL.

The trick for best performance and best opportunity for dedup (alignment
notwithstanding) is to have a block size that is smaller than your workload.
Or, don't bring a 128KB block to a 4KB block battle. For this reason, the
default 8KB block size for a zvol is a reasonable choice, but perhaps 4KB is
better for many workloads.
 -- richard



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
 wrote:
> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
> or netapp or anything else.)  Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks...  Then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows
guests. Recent versions use 4k alignment by default and 4k blocks, so
there should be lots of duplicates for a base OS image.

> There are some situations where dedup may help on VM images...  For example
> if you're not using sparse files and you have a zero-filled disk...  But in

compression=zle works even better for these cases, since it doesn't
require DDT resources.
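
For anyone who hasn't looked at zle: it only collapses runs of zeros, which is
exactly the zero-filled-disk case, and it needs no table of any kind. A toy
sketch of the idea (this is not the actual ZFS zle record format):

    # Toy zero-length encoding: collapse runs of zero bytes to a count,
    # pass everything else through. Illustrates why zle is cheap and why it
    # handles zero-filled images so well; NOT the real ZFS on-disk format.
    def zle_like(data, min_run=8):
        out, i = [], 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == 0:
                j += 1
            if j - i >= min_run:
                out.append(("zeros", j - i))      # long zero run -> just a count
                i = j
            else:
                out.append(("literal", data[i]))  # non-zero data passes through
                i += 1
        return out

    blank = bytes(64 * 1024)          # a 64 KiB zero-filled region
    print(len(zle_like(blank)))       # 1 token instead of 65,536 bytes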

> Or if you're intimately familiar with both the guest & host filesystems, and
> you choose blocksizes carefully to make them align.  But that seems
> complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size
that is a multiple of 4k. It's the same reason that the new "Advanced
Format" drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.
ext3/ext4 uses 1k, 2k, or 4k blocks. Filesystems over 512MB should use 4k
by default. The block alignment is determined by the partitioning, so
some care needs to be taken there.
zfs uses 'ashift'-sized blocks. I'm not sure what ashift works out to
be when using a zvol though, so it could be as small as 512B but may
be set to the same as the volblocksize property.
ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment
is determined by partitioning and slices.
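
Since the alignment part keeps coming down to where the partition starts,
here's a minimal check (the start sectors below are only examples; read the
real ones from your partition table): multiply the starting sector by the
sector size and see whether it's a multiple of the block size you care about.

    # Is a partition's data area aligned to a given block size?
    # Example start sectors only; a 512-byte sector size is assumed.
    SECTOR = 512

    def aligned(start_sector, block):
        return (start_sector * SECTOR) % block == 0

    print(aligned(2048, 4096))   # True : 1 MiB offset (Vista/2008 and newer tools)
    print(aligned(63, 4096))     # False: legacy 63-sector offset, misaligned for 4k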

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Garrett D'Amore
We have customers using dedup with lots of vm images... in one extreme case 
they are getting dedup ratios of over 200:1! 

You don't need dedup or sparse files to handle zero filling.  Simple zle
compression will eliminate the zeros for you far more efficiently and without
needing massive amounts of RAM.

Our customers have the ability to access our systems engineers to design the 
solution for their needs.  If you are serious about doing this stuff right, 
work with someone like Nexenta that can engineer a complete solution instead of 
trying to figure out which of us on this forum are quacks and which are cracks. 
 :)

Tim Cook  wrote:

>On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey <
>opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
>
>> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>> >
>> > Are any of you out there using dedupe ZFS file systems to store VMware
>> > VMDK (or any VM tech. really)?  Curious what recordsize you use and
>> > what your hardware specs / experiences have been.
>>
>> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
>> or netapp or anything else.)  Because the VM images are all going to have
>> their own filesystems internally with whatever blocksize is relevant to the
>> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
>> whatever FS) host blocks...  Then even when you write duplicated data
>> inside
>> the guest, the host won't see it as a duplicated block.
>>
>> There are some situations where dedup may help on VM images...  For example
>> if you're not using sparse files and you have a zero-filled disk...  But in
>> that case, you should probably just use a sparse file instead...  Or ...  If
>> you have a "golden" image that you're copying all over the place ... but in
>> that case, you should probably just use clones instead...
>>
>> Or if you're intimately familiar with both the guest & host filesystems,
>> and
>> you choose blocksizes carefully to make them align.  But that seems
>> complicated and likely to fail.
>>
>>
>>
>That's patently false.  VM images are the absolute best use-case for dedup
>outside of backup workloads.  I'm not sure who told you/where you got the
>idea that VM images are not ripe for dedup, but it's wrong.
>
>--Tim
>


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Joerg Moellenkamp



> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there
> is data inside the VM which is cloned...  For example if somebody logs into
> the guest OS and then does a "cp" operation...  Then dedup of the host is
> unlikely to be able to recognize that data as cloned data inside the virtual
> disk.
I have the same opinion. When talking with customers about the use of dedup
and cloning, the answer is simple: when you know that duplicates will occur
but don't know when, use dedup; when you know that duplicates occur and that
they are there from the beginning, use cloning.


Thus VM images cry out for cloning. I'm not a fan of dedup for VMs. I heard
the argument once, "but what about VM patching". Aside from the problem of
detecting the clones, I wouldn't patch each VM; I would patch the master image
and regenerate the clones, especially for a general patching session (just
saving a gig because there is a patch on 2 or 3 of 100 servers isn't worth the
effort of spending a lot of memory on dedup). The simple reason: patching each
VM on its own is likely to increase VM sprawl. So all I save is some iron, but
I'm not simplifying administration. However, this needs good administrative
processes.


You can use dedup for VMs, but I'm not sure anyone should ...


> Is this a zfs discussion list, or a nexenta sales & promotion list?

Well ... I have an opinion on how he sees that ... however it's just my own ;)

--
ORACLE
Joerg Moellenkamp | Sales Consultant
Phone: +49 40 251523-460 | Mobile: +49 172 8318433
Oracle Hardware Presales - Nord

ORACLE Deutschland B.V. & Co. KG | Nagelsweg 55 | 20097 Hamburg

ORACLE Deutschland B.V. & Co. KG
Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Rijnzathe 6, 3454PV De Meern, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Jürgen Kunz, Marcel van de Molen, Alexander van der Ven

Oracle is committed to developing practices and products that help protect the 
environment



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Garrett D'Amore
On Thu, 2011-05-05 at 09:02 -0400, Edward Ned Harvey wrote:
> > From: Garrett D'Amore [mailto:garr...@nexenta.com]
> > 
> > We have customers using dedup with lots of vm images... in one extreme
> > case they are getting dedup ratios of over 200:1!
> 
> I assume you're talking about a situation where there is an initial VM image, 
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?

No.  Obviously if you can clone, it's better.  But sometimes you can't do
this even with v12n, and we have this situation at customer sites today.
(I have always said, zfs clone is far easier, far more proven, and far
more efficient, *if* you can control the "ancestral" relationship to
take advantage of the clone.)  For example, one area where cloning can't
help is with patches and updates.  In some instances these can get quite
large, and across 1000's of VMs the space required can be considerable.

> 
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there 
> is data inside the VM which is cloned...  For example if somebody logs into 
> the guest OS and then does a "cp" operation...  Then dedup of the host is 
> unlikely to be able to recognize that data as cloned data inside the virtual 
> disk.

I disagree.  I believe that within the VMDKs data is aligned nicely,
since these are disk images.

At any rate, we are seeing real (and large) dedup ratios in the field
when used with v12n.  In fact, this is the killer app for dedup.
 
> 
> > Our customers have the ability to access our systems engineers to design the
> > solution for their needs.  If you are serious about doing this stuff right, 
> > work
> > with someone like Nexenta that can engineer a complete solution instead of
> > trying to figure out which of us on this forum are quacks and which are
> > cracks.  :)
> 
> Is this a zfs discussion list, or a nexenta sales & promotion list?

My point here was that there is a lot of half baked advice being
given... the idea that you should only use dedup if you have a bunch of
zeros on your disk images is absolutely and totally nuts for example.
It doesn't match real world experience, and it doesn't match the theory
either.

And sometimes real-world experience trumps the theory.  I've been shown
on numerous occasions that ideas that I thought were half-baked turned
out to be very effective in the field, and vice versa.  (I'm a
developer, not a systems engineer.  Fortunately I have a very close
working relationship with a couple of awesome systems engineers.)

Folks come here looking for advice.  I think the advice that, if you're
contemplating these kinds of solutions, you should get someone with real-world
experience solving these kinds of problems every day is very sound.  Trying to
pull out the truths from the myths I see stated here nearly every day is going
to be difficult for the average reader, I think.

- Garrett




Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Constantin Gonzalez

Hi,

On 05/ 5/11 03:02 PM, Edward Ned Harvey wrote:

> > From: Garrett D'Amore [mailto:garr...@nexenta.com]
> >
> > We have customers using dedup with lots of vm images... in one extreme
> > case they are getting dedup ratios of over 200:1!
>
> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there
> is data inside the VM which is cloned...  For example if somebody logs into
> the guest OS and then does a "cp" operation...  Then dedup of the host is
> unlikely to be able to recognize that data as cloned data inside the virtual
> disk.


ZFS cloning and ZFS dedup are solving two problems that are related, but
different:

- Through Cloning, a lot of space can be saved in situations where it is
  known beforehand that data is going to be used multiple times from multiple
  different "views". Virtualization is a perfect example of this.

- Through Dedup, space can be saved in situations where the duplicate nature
  of data is not known, or not known beforehand. Again, in virtualization
  scenarios, this could be common modifications to VM images that are
  performed multiple times, but not anticipated, such as extra software,
  OS patches, or simply many users saving the same files to their local
  desktops.

To go back to the "cp" example: If someone logs into a VM that is backed by
ZFS with dedup enabled, then copies a file, the extra space that the file will
take will be minimal. The act of copying the file will break down into a
series of blocks that will be recognized as duplicate blocks.

This is completely independent of the clone nature of the underlying VM's
backing store.

But I agree that the biggest savings are to be expected from cloning first,
as they typically translate into n GB (for the base image) x # of users,
which is a _lot_.

Dedup is still the icing on the cake for all those data blocks that were
unforeseen. And that can be a lot, too, as everyone who has seen cluttered
desktops full of downloaded files can probably confirm.


Cheers,
   Constantin


--

Constantin Gonzalez Schmitz, Sales Consultant,
Oracle Hardware Presales Germany
Phone: +49 89 460 08 25 91  | Mobile: +49 172 834 90 30
Blog: http://constantin.glez.de/| Twitter: zalez

ORACLE Deutschland B.V. & Co. KG, Sonnenallee 1, 85551 Kirchheim-Heimstetten

ORACLE Deutschland B.V. & Co. KG
Hauptverwaltung: Riesstraße 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Jürgen Kunz, Marcel van de Molen, Alexander van der Ven

Oracle is committed to developing practices and products that help protect the
environment


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: Garrett D'Amore [mailto:garr...@nexenta.com]
> 
> We have customers using dedup with lots of vm images... in one extreme
> case they are getting dedup ratios of over 200:1!

I assume you're talking about a situation where there is an initial VM image, 
and then to clone the machine, the customers copy the VM, correct?
If that is correct, have you considered ZFS cloning instead?

When I said dedup wasn't good for VM's, what I'm talking about is:  If there is 
data inside the VM which is cloned...  For example if somebody logs into the 
guest OS and then does a "cp" operation...  Then dedup of the host is unlikely 
to be able to recognize that data as cloned data inside the virtual disk.


> Our customers have the ability to access our systems engineers to design the
> solution for their needs.  If you are serious about doing this stuff right, 
> work
> with someone like Nexenta that can engineer a complete solution instead of
> trying to figure out which of us on this forum are quacks and which are
> cracks.  :)

Is this a zfs discussion list, or a nexenta sales & promotion list?



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: Tim Cook [mailto:t...@cook.ms]
> 
> That's patently false.  VM images are the absolute best use-case for dedup
> outside of backup workloads.  I'm not sure who told you/where you got the
> idea that VM images are not ripe for dedup, but it's wrong.

Well, I got that idea from this list.  I said a little bit about why I
believed it was true ... about dedup being ineffective for VM's ... Would
you care to describe a use case where dedup would be effective for a VM?  Or
perhaps cite something specific, instead of just wiping the whole thing and
saying "patently false?"  I don't feel like this comment was productive...



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: Tim Cook [mailto:t...@cook.ms]
> 
> > ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
> > time, and can't really share them well if it expects to keep performance
> > from tanking... (no pun intended)
> That's true, but on the flipside, if you don't have adequate resources
> dedicated all the time, it means performance is unsustainable.  Anything
> which is going to do post-write dedup will necessarily have degraded
> performance on a periodic basis.  This is in *addition* to all your scrubs
> and backups and so on.
> 
> 
> AGAIN, you're assuming that all system resources are used all the time and
> can't possibly go anywhere else.  This is absolutely false.  If someone is
> running a system at 99% capacity 24/7, perhaps that might be a factual
> statement.  I'd argue if someone is running the system 99% all of the time,
> the system is grossly undersized for the workload.

Well, here is my situation:  I do IT for a company whose workload is very
spiky.  For weeks at a time, the system will be 99% idle.  Then when the
engineers have a deadline to meet, they will expand and consume all
available resources, no matter how much you give them.  So they will keep
all systems 99% busy for a month at a time.  After the deadline passes, they
drop back down to 99% idle.

The work is IO intensive so it's not appropriate for something like the
cloud.


> I'm gathering that this list in general has a lack of understanding of how
> NetApp does things.  If you don't know for a fact how it works, stop jumping
> to conclusions on how you think it works.  I know for a fact that short of the

I'm a little confused by this rant.  Cuz I didn't say anything about netapp.




Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> >
> > Are any of you out there using dedupe ZFS file systems to store VMware
> > VMDK (or any VM tech. really)?  Curious what recordsize you use and
> > what your hardware specs / experiences have been.
>
> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
> or netapp or anything else.)  Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks...  Then even when you write duplicated data
> inside
> the guest, the host won't see it as a duplicated block.
>
> There are some situations where dedup may help on VM images...  For example
> if you're not using sparse files and you have a zero-filed disk...  But in
> that case, you should probably just use a sparse file instead...  Or ...
>  If
> you have a "golden" image that you're copying all over the place ... but in
> that case, you should probably just use clones instead...
>
> Or if you're intimately familiar with both the guest & host filesystems,
> and
> you choose blocksizes carefully to make them align.  But that seems
> complicated and likely to fail.
>
>
>
That's patently false.  VM images are the absolute best use-case for dedup
outside of backup workloads.  I'm not sure who told you/where you got the
idea that VM images are not ripe for dedup, but it's wrong.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 10:15 PM, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Erik Trimble
> >
> > ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
> > time, and can't really share them well if it expects to keep performance
> > from tanking... (no pun intended)
>
> That's true, but on the flipside, if you don't have adequate resources
> dedicated all the time, it means performance is unsustainable.  Anything
> which is going to do post-write dedup will necessarily have degraded
> performance on a periodic basis.  This is in *addition* to all your scrubs
> and backups and so on.
>
>
>
AGAIN, you're assuming that all system resources are used all the time and
can't possibly go anywhere else.  This is absolutely false.  If someone is
running a system at 99% capacity 24/7, perhaps that might be a factual
statement.  I'd argue if someone is running the system 99% all of the time,
the system is grossly undersized for the workload.  How can you EVER expect
a highly available system to run 99% on both nodes (all nodes in a vmax/vsp
scenario) and ever be able to fail over?  Either a home-brew Opensolaris
Cluster, Oracle 7000 cluster, or NetApp?

I'm gathering that this list in general has a lack of understanding of how
NetApp does things.  If you don't know for a fact how it works, stop jumping
to conclusions on how you think it works.  I know for a fact that short of
the guys currently/previously writing the code at NetApp, there's a handful
of people in the entire world who know (factually) how they're allocating
resources from soup to nuts.

As far as this discussion is concerned, there's only two points that matter:
They've got dedup on primary storage, it works in the field.  The rest is
just static that doesn't matter.  Let's focus on how to make ZFS better
instead of trying to guess how others are making it work, especially when
they've got a completely different implementation.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>  
> Are any of you out there using dedupe ZFS file systems to store VMware
> VMDK (or any VM tech. really)?  Curious what recordsize you use and
> what your hardware specs / experiences have been.

Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
or netapp or anything else.)  Because the VM images are all going to have
their own filesystems internally with whatever blocksize is relevant to the
guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
whatever FS) host blocks...  Then even when you write duplicated data inside
the guest, the host won't see it as a duplicated block.

There are some situations where dedup may help on VM images...  For example
if you're not using sparse files and you have a zero-filled disk...  But in
that case, you should probably just use a sparse file instead...  Or ...  If
you have a "golden" image that you're copying all over the place ... but in
that case, you should probably just use clones instead...

Or if you're intimately familiar with both the guest & host filesystems, and
you choose blocksizes carefully to make them align.  But that seems
complicated and likely to fail.



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Erik Trimble
> 
> ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
> time, and can't really share them well if it expects to keep performance
> from tanking... (no pun intended)

That's true, but on the flipside, if you don't have adequate resources
dedicated all the time, it means performance is unsustainable.  Anything
which is going to do post-write dedup will necessarily have degraded
performance on a periodic basis.  This is in *addition* to all your scrubs
and backups and so on.



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 5:11 PM, Brandon High wrote:

> On Wed, May 4, 2011 at 4:36 PM, Erik Trimble  wrote:
>> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
>> strictly controlled max FlexVol size helps with keeping the resource limits
>> down, as it will be able to round-robin the post-write dedup to each FlexVol
>> in turn.
>
> They are, it's in their docs. A volume is dedup'd when 20% of
> non-deduped data is added to it, or something similar. 8 volumes can
> be processed at once though, I believe, and it could be that weaker
> systems are not able to do as many in parallel.


Sounds rational.


>> block usage has a significant 4k presence.  One way I reduced this initially
>> was to have the VMdisk image stored on local disk, then copied the *entire*
>> image to the ZFS server, so the server saw a single large file, which meant
>> it tended to write full 128k blocks.  Do note, that my 30 images only takes
>
> Wouldn't you have been better off cloning datasets that contain an
> unconfigured install and customizing from there?
>
> -B
Given that my "OS" installs include a fair amount of 3rd-party add-ons 
(compilers, SDKs, et al), I generally find the best method for me is to 
fully configure a client (with the VMdisk on local storage), then copy 
that VMdisk to the ZFS server as a "golden image".  I can then clone 
that image for my other clients of that type, and only have to change 
the network information.


Initially, each new VM image consumes about 1MB of space. :-)

Overall, I've found that as I have to patch each image, it's worthwhile
to take a new "golden-image" snapshot every so often, and then
reconfigure each client machine again from that new golden image. I'm
sure I could do some optimization here, but the method works well enough.



What you want to avoid is having the OS image written to after it has been
placed on the ZFS server; waiting for any other configuration and
customization to happen after the copy is sub-optimal.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 6:51 PM, Erik Trimble wrote:

>  On 5/4/2011 4:44 PM, Tim Cook wrote:
>
>
>
> On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:
>
>> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
>>
>>> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
>>>
 On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
  wrote:

> I suspect that NetApp does the following to limit their resource
> usage:   they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present).  Thus, they can make their code
>
 AFAIK, NetApp has more restrictive requirements about how much data
 can be dedup'd on each type of hardware.

 See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
 pieces of hardware can only dedup 1TB volumes, and even the big-daddy
 filers will only dedup up to 16TB per volume, even if the volume size
 is 32TB (the largest volume available for dedup).

 NetApp solves the problem by putting rigid constraints around the
 problem, whereas ZFS lets you enable dedup for any size dataset. Both
 approaches have limitations, and it sucks when you hit them.

 -B

>>> That is very true, although worth mentioning you can have quite a few
>>> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
>>> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
>>>
>>>  Stupid question - can you hit all the various SIS volumes at once, and
>> not get horrid performance penalties?
>>
>> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
>> strictly controlled max FlexVol size helps with keeping the resource limits
>> down, as it will be able to round-robin the post-write dedup to each FlexVol
>> in turn.
>>
>> ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
>> time, and can't really share them well if it expects to keep performance
>> from tanking... (no pun intended)
>>
>>
>  On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
> 2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
> you do it on a model that isn't 4 years old without tanking performance?
>  Absolutely.
>
>  Outside of those two 2000 series, the reason there are dedup limits isn't
> performance.
>
>  --Tim
>
>  Indirectly, yes, it's performance, since NetApp has plainly chosen
> post-write dedup as a method to restrict the required hardware
> capabilities.  The dedup limits on Volsize are almost certainly driven by
> the local RAM requirements for post-write dedup.
>
> It also looks like NetApp isn't providing for a dedicated DDT cache, which
> means that when the NetApp is doing dedup, it's consuming the normal
> filesystem cache (i.e. chewing through RAM).  Frankly, I'd be very surprised
> if you didn't see a noticeable performance hit during the period that the
> NetApp appliance is performing the dedup scans.
>
>

Again, it depends on the model/load/etc.  The smallest models will see
performance hits for sure.  If the vol size limits are strictly a matter of
ram, why exactly would they jump from 4TB to 16TB on a 3140 by simply
upgrading ONTAP?  If the limits haven't gone up on, at the very least, every
one of the x2xx systems 12 months from now, feel free to dig up the thread
and give an I-told-you-so.  I'm quite confident that won't be the case.  The
16TB limit SCREAMS to me that it's a holdover from the same 32bit limit that
causes 32-bit volumes to have a 16TB limit.  I'm quite confident they're
just taking the cautious approach on moving to 64bit dedup code.
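
The arithmetic behind that hunch, for what it's worth (just the numbers; no
claim here about what NetApp's code actually does): a 32-bit block index over
4KB blocks tops out at exactly 16TB.

    # 2^32 block pointers x 4 KiB blocks = 16 TiB, the same ceiling that shows
    # up for 32-bit aggregates/volumes.
    blocks = 2 ** 32
    block_size = 4 * 1024
    print(blocks * block_size == 16 * 2 ** 40)   # True
    print((blocks * block_size) / 2 ** 40)       # 16.0 (TiB)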

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble  wrote:
> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.

They are, it's in their docs. A volume is dedup'd when 20% of
non-deduped data is added to it, or something similar. 8 volumes can
be processed at once though, I believe, and it could be that weaker
systems are not able to do as many in parallel.

> block usage has a significant 4k presence.  One way I reduced this initially
> was to have the VMdisk image stored on local disk, then copied the *entire*
> image to the ZFS server, so the server saw a single large file, which meant
> it tended to write full 128k blocks.  Do note, that my 30 images only takes

Wouldn't you have been better off cloning datasets that contain an
unconfigured install and customizing from there?

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 04:51:36PM -0700, Erik Trimble wrote:
> On 5/4/2011 4:44 PM, Tim Cook wrote:
> 
> 
> 
> On Wed, May 4, 2011 at 6:36 PM, Erik Trimble 
> wrote:
> 
> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
> 
> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
> 
> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
> <erik.trim...@oracle.com> wrote:
> 
> I suspect that NetApp does the following to limit their resource
> usage:   they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present).  Thus, they can make their code
> 
> AFAIK, NetApp has more restrictive requirements about how much data
> can be dedup'd on each type of hardware.
> 
> See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
> pieces of hardware can only dedup 1TB volumes, and even the big-daddy
> filers will only dedup up to 16TB per volume, even if the volume size
> is 32TB (the largest volume available for dedup).
> 
> NetApp solves the problem by putting rigid constraints around the
> problem, whereas ZFS lets you enable dedup for any size dataset. Both
> approaches have limitations, and it sucks when you hit them.
> 
> -B
> 
> That is very true, although worth mentioning you can have quite a few
> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
> 
> 
> Stupid question - can you hit all the various SIS volumes at once, and
> not get horrid performance penalties?
> 
> If so, I'm almost certain NetApp is doing post-write dedup.  That way,
> the strictly controlled max FlexVol size helps with keeping the
> resource limits down, as it will be able to round-robin the post-write
> dedup to each FlexVol in turn.
> 
> ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
> time, and can't really share them well if it expects to keep
> performance from tanking... (no pun intended)
> 
> 
> 
> On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
> 2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
> you do it on a model that isn't 4 years old without tanking performance?
>  Absolutely.
> 
> Outside of those two 2000 series, the reason there are dedup limits isn't
> performance. 
> 
> --Tim
> 
> 
> Indirectly, yes, it's performance, since NetApp has plainly chosen
> post-write dedup as a method to restrict the required hardware
> capabilities.  The dedup limits on Volsize are almost certainly
> driven by the local RAM requirements for post-write dedup.
> 
> It also looks like NetApp isn't providing for a dedicated DDT cache,
> which means that when the NetApp is doing dedup, it's consuming the
> normal filesystem cache (i.e. chewing through RAM).  Frankly, I'd be
> very surprised if you didn't see a noticeable performance hit during
> the period that the NetApp appliance is performing the dedup scans.

Yep, when the dedupe process runs, there is a drop in performance
(hence we usually schedule it to run during off-peak hours).  Obviously this
is a luxury that wouldn't be an option in every environment...

During normal operations outside of the dedupe period we haven't
noticed a performance hit.  I don't think we hit the filer too hard
however -- it's acting as a VMware datastore and only a few of the VM's
have higher I/O footprints.

It is a 2050C however so we spread the load across the two filer heads
(although we occasionally run everything on one head when performing
maintenance on the other).

Ray


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:44 PM, Tim Cook wrote:



On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:


On 5/4/2011 4:14 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:

On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
<erik.trim...@oracle.com> wrote:

I suspect that NetApp does the following to limit their resource
usage:   they presume the presence of some sort of cache that can be
dedicated to the DDT (and, since they also control the hardware, they can
make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

That is very true, although worth mentioning you can have quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once,
and not get horrid performance penalties?

If so, I'm almost certain NetApp is doing post-write dedup.  That
way, the strictly controlled max FlexVol size helps with keeping
the resource limits down, as it will be able to round-robin the
post-write dedup to each FlexVol in turn.

ZFS's problem is that it needs ALL the resouces for EACH pool ALL
the time, and can't really share them well if it expects to keep
performance from tanking... (no pun intended)


On a 2050?  Probably not.  It's got a single-core mobile celeron CPU 
and 2GB/ram.  You couldn't even run ZFS on that box, much less 
ZFS+dedup.  Can you do it on a model that isn't 4 years old without 
tanking performance?  Absolutely.


Outside of those two 2000 series, the reason there are dedup limits 
isn't performance.


--Tim

Indirectly, yes, it's performance, since NetApp has plainly chosen 
post-write dedup as a method to restrict the required hardware 
capabilities.  The dedup limits on Volsize are almost certainly driven 
by the local RAM requirements for post-write dedup.


It also looks like NetApp isn't providing for a dedicated DDT cache, 
which means that when the NetApp is doing dedup, it's consuming the 
normal filesystem cache (i.e. chewing through RAM).  Frankly, I'd be 
very surprised if you didn't see a noticeable performance hit during the 
period that the NetApp appliance is performing the dedup scans.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:17 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:

On 5/4/2011 2:54 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size:  a 4k block size will yield better dedup than a 128k
block size, presuming reasonable data turnover.  This is inherent, as
any single bit change in a block will make it non-duplicated.  With 32x
the block size, there is a much greater chance that a small change in
data will require a large loss of dedup ratio.  That is, 4k blocks
should almost always yield much better dedup ratios than larger ones.
Also, remember that the ZFS block size is a SUGGESTION for zfs
filesystems (i.e. it will use UP TO that block size, but not always that
size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table.
   ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
for each L2ARC (cache) entire, PLUS the actual L2ARC entry.  So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
in the DDT.  Also, the actual DDT entry itself is perhaps larger than
absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80GB's of SSD L2ARC and get away with
"only" 16GB of RAM for example.

It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC
cache device, the DDT must be stored in RAM.  That's about 376 bytes per
dedup block.

If you have an L2ARC cache device, then the ARC must contain a reference
to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT
entry reference.

So, adding a L2ARC reduces the ARC consumption by about 55%.

Of course, the other benefit from a L2ARC is the data/metadata caching,
which is likely worth it just by itself.

Great info.  Thanks Erik.

For dedupe workloads on larger file systems (8TB+), I wonder if makes
sense to use SLC / enterprise class SSD (or better) devices for L2ARC
instead of lower-end MLC stuff?  Seems like we'd be seeing more writes
to the device than in a non-dedupe scenario.

Thanks,
Ray
I'm using enterprise-class MLC drives (without a supercap), and they
work fine with dedup.  I'd have to test, but I don't think that the
increase in writes is that much, so I don't expect an SLC to really make
much of a difference. (The fill rate of the L2ARC is limited, so I can't
imagine we'd bump up against the MLC's limits.)
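
Putting the per-entry numbers quoted above in one place, here's a rough sizing
sketch (the 8 TB pool and the block sizes are examples only; ~376 bytes per
DDT entry held in ARC, ~176 bytes of ARC overhead per entry when the DDT lives
on L2ARC):

    # Rough DDT sizing with and without an L2ARC device, using the per-entry
    # figures from earlier in this thread. Pool size and block sizes are examples.
    RAM_ONLY, L2ARC_REF = 376, 176
    pool_used = 8 * 2**40   # 8 TiB of deduped data, illustrative

    for bs in (128 * 1024, 8 * 1024, 4 * 1024):
        entries = pool_used // bs
        ram_no_l2 = entries * RAM_ONLY
        ram_with_l2 = entries * L2ARC_REF   # plus the DDT itself out on the SSD
        print(f"{bs // 1024:>3}k: {ram_no_l2 / 2**30:7.1f} GiB RAM alone, "
              f"{ram_with_l2 / 2**30:7.1f} GiB RAM with L2ARC "
              f"(~{1 - L2ARC_REF / RAM_ONLY:.0%} less ARC)")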


--

Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:

> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
>
>> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
>>
>>> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
>>>  wrote:
>>>
I suspect that NetApp does the following to limit their resource
 usage:   they presume the presence of some sort of cache that can be
 dedicated to the DDT (and, since they also control the hardware, they
 can
 make sure there is always one present).  Thus, they can make their code

>>> AFAIK, NetApp has more restrictive requirements about how much data
>>> can be dedup'd on each type of hardware.
>>>
>>> See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
>>> pieces of hardware can only dedup 1TB volumes, and even the big-daddy
>>> filers will only dedup up to 16TB per volume, even if the volume size
>>> is 32TB (the largest volume available for dedup).
>>>
>>> NetApp solves the problem by putting rigid constraints around the
>>> problem, whereas ZFS lets you enable dedup for any size dataset. Both
>>> approaches have limitations, and it sucks when you hit them.
>>>
>>> -B
>>>
>> That is very true, although worth mentioning you can have quite a few
>> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
>> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
>>
>>  Stupid question - can you hit all the various SIS volumes at once, and
> not get horrid performance penalties?
>
> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.
>
> ZFS's problem is that it needs ALL the resouces for EACH pool ALL the time,
> and can't really share them well if it expects to keep performance from
> tanking... (no pun intended)
>
>
On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
you do it on a model that isn't 4 years old without tanking performance?
 Absolutely.

Outside of those two 2000 series, the reason there are dedup limits isn't
performance.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:14 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:

On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:

I suspect that NetApp does the following to limit their resource
usage:   they presume the presence of some sort of cache that can be
dedicated to the DDT (and, since they also control the hardware, they can
make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

That is very true, although worth mentioning you can have quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and 
not get horrid performance penalties?


If so, I'm almost certain NetApp is doing post-write dedup.  That way, 
the strictly controlled max FlexVol size helps with keeping the resource 
limits down, as it will be able to round-robin the post-write dedup to 
each FlexVol in turn.


ZFS's problem is that it needs ALL the resources for EACH pool ALL the
time, and can't really share them well if it expects to keep performance
from tanking... (no pun intended)



The FAS2050 of course has a fairly small memory footprint...

I do like the additional flexibility you have with ZFS, just trying to
get a handle on the memory requirements.

Are any of you out there using dedupe ZFS file systems to store VMware
VMDK (or any VM tech. really)?  Curious what recordsize you use and
what your hardware specs / experiences have been.

Ray


Right now, I use it for my Solaris 8 containers and VirtualBox images.
The VB images are mostly Windows (XP and Win2003).


I tend to put the OS image in one VMdisk, and my scratch disks in 
another. That is, I generally don't want my apps writing much to my OS 
images. My scratch/data disks aren't deduped.


Overall, I'm running about 30 deduped images served out over NFS. My 
recordsize is set to 128k, but, given that they're OS images, my actual 
disk block usage has a significant 4k presence.  One way I reduced this 
initially was to have the VMdisk image stored on local disk, then copied
the *entire* image to the ZFS server, so the server saw a single large
file, which meant it tended to write full 128k blocks.  Do note that my
30 images only take about 20GB of actual space, after dedup. I figure
about 5GB of dedup space per OS type (and, I have 4 different setups).
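
A tiny sketch of the arithmetic behind that (the per-image "drift" figure is an
assumption added purely for illustration, not something measured here):

# Back-of-the-envelope estimate of deduped space for a pile of OS images.
# Assumption: each OS type's base install dedups down to roughly one copy,
# plus a small amount of unique drift per image.

def deduped_footprint_gb(images_per_os, os_types, base_gb_per_os, unique_gb_per_image):
    base = os_types * base_gb_per_os                        # one copy of each base install
    drift = images_per_os * os_types * unique_gb_per_image  # per-image unique blocks
    return base + drift

# ~30 images across 4 OS setups, ~5GB of deduped base per OS type,
# and a guessed 0 to 0.2GB of unique data per image:
for drift in (0.0, 0.1, 0.2):
    total = deduped_footprint_gb(images_per_os=8, os_types=4,
                                 base_gb_per_os=5, unique_gb_per_image=drift)
    print(f"{drift:.1f} GB drift/image -> ~{total:.0f} GB deduped")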


My data VMdisks, however, chew through about 4TB of disk space, which is
nondeduped. I'm still trying to determine if I'm better off serving 
those data disks as NFS mounts to my clients, or as VMdisk images 
available over iSCSI or NFS.  Right now, I'm doing VMdisks over NFS.


The setup I'm using is an older X4200 (non-M2), with 3rd-party SSDs as 
L2ARC, hooked to an old 3500FC array. It has 8GB of RAM in total, and 
runs just fine with that.  I definitely am going to upgrade to something 
much larger in the near future, since I expect to up my number of VM 
images by at least a factor of 5.



That all said, if you're relatively careful about separating OS installs 
from active data, you can get really impressive dedup ratios using a 
relatively small amount of actual space.  In my case, I expect to
eventually be serving about 10 different configs out to a total of maybe 
100 clients, and probably never exceed 100GB max on the deduped end. 
Which means that I'll be able to get away with 16GB of RAM for the whole 
server, comfortably.
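
For a rough sanity check of that, using the ~376 bytes per in-core DDT entry
figure discussed elsewhere in this thread (a sketch only; it ignores everything
else competing for ARC, and real average block sizes vary):

# Rough DDT RAM estimate for ~100GB of unique, deduped data.
# 376 bytes/entry is the in-ARC DDT entry size mentioned in this thread;
# actual blocks are often smaller than the recordsize, so treat this as a floor.

DDT_ENTRY_BYTES = 376

def ddt_ram_gb(unique_data_bytes, avg_block_bytes):
    entries = unique_data_bytes / avg_block_bytes
    return entries * DDT_ENTRY_BYTES / 2**30

unique = 100 * 2**30                       # ~100GB of unique data after dedup
for block in (128 * 1024, 64 * 1024, 4 * 1024):
    print(f"{block // 1024:>3}k blocks: ~{ddt_ram_gb(unique, block):.2f} GB of ARC for the DDT")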


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
> On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
> > On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
> >> (2) Block size:  a 4k block size will yield better dedup than a 128k
> >> block size, presuming reasonable data turnover.  This is inherent, as
> >> any single bit change in a block will make it non-duplicated.  With 32x
> >> the block size, there is a much greater chance that a small change in
> >> data will require a large loss of dedup ratio.  That is, 4k blocks
> >> should almost always yield much better dedup ratios than larger ones.
> >> Also, remember that the ZFS block size is a SUGGESTION for zfs
> >> filesystems (i.e. it will use UP TO that block size, but not always that
> >> size), but is FIXED for zvols.
> >>
> >> (3) Method of storing (and data stored in) the dedup table.
> >>   ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
> >> lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
>> for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
>> boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
> >> in the DDT.  Also, the actual DDT entry itself is perhaps larger than
> >> absolutely necessary.
> > So the addition of L2ARC doesn't necessarily reduce the need for
> > memory (at least not much if you're talking about 500 bytes combined)?
>> I was hoping we could slap in 80GB of SSD L2ARC and get away with
> > "only" 16GB of RAM for example.
> 
> It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC 
> cache device, the DDT must be stored in RAM.  That's about 376 bytes per 
> dedup block.
> 
> If you have an L2ARC cache device, then the ARC must contain a reference 
> to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT 
> entry reference.
> 
> So, adding a L2ARC reduces the ARC consumption by about 55%.
> 
> Of course, the other benefit from a L2ARC is the data/metadata caching, 
> which is likely worth it just by itself.

Great info.  Thanks Erik.

For dedupe workloads on larger file systems (8TB+), I wonder if it makes
sense to use SLC / enterprise-class SSD (or better) devices for L2ARC
instead of lower-end MLC stuff?  Seems like we'd be seeing more writes
to the device than in a non-dedupe scenario.

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:
> >        I suspect that NetApp does the following to limit their resource
> > usage:   they presume the presence of some sort of cache that can be
> > dedicated to the DDT (and, since they also control the hardware, they can
> > make sure there is always one present).  Thus, they can make their code
> 
> AFAIK, NetApp has more restrictive requirements about how much data
> can be dedup'd on each type of hardware.
> 
> See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
> pieces of hardware can only dedup 1TB volumes, and even the big-daddy
> filers will only dedup up to 16TB per volume, even if the volume size
> is 32TB (the largest volume available for dedup).
> 
> NetApp solves the problem by putting rigid constraints around the
> problem, whereas ZFS lets you enable dedup for any size dataset. Both
> approaches have limitations, and it sucks when you hit them.
> 
> -B

That is very true, although it's worth mentioning that you can have quite a few
dedupe/SIS-enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS-enabled FlexVols).

The FAS2050 of course has a fairly small memory footprint... 

I do like the additional flexibility you have with ZFS, just trying to
get a handle on the memory requirements.

Are any of you out there using deduped ZFS file systems to store VMware
VMDKs (or any VM tech, really)?  Curious what recordsize you use and
what your hardware specs / experiences have been.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 2:54 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size:  a 4k block size will yield better dedup than a 128k
block size, presuming reasonable data turnover.  This is inherent, as
any single bit change in a block will make it non-duplicated.  With 32x
the block size, there is a much greater chance that a small change in
data will require a large loss of dedup ratio.  That is, 4k blocks
should almost always yield much better dedup ratios than larger ones.
Also, remember that the ZFS block size is a SUGGESTION for zfs
filesystems (i.e. it will use UP TO that block size, but not always that
size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table.
  ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
in the DDT.  Also, the actual DDT entry itself is perhaps larger than
absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80GB of SSD L2ARC and get away with
"only" 16GB of RAM for example.


It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC 
cache device, the DDT must be stored in RAM.  That's about 376 bytes per 
dedup block.


If you have an L2ARC cache device, then the ARC must contain a reference 
to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT 
entry reference.


So, adding a L2ARC reduces the ARC consumption by about 55%.

Of course, the other benefit from a L2ARC is the data/metadata caching, 
which is likely worth it just by itself.
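
To put those two per-entry figures in context, a minimal sizing sketch (the 8TB
pool and the block sizes below are hypothetical inputs chosen for illustration):

# Compare ARC consumption for the DDT with and without an L2ARC device,
# using ~376 bytes per in-ARC DDT entry vs ~176 bytes of ARC overhead per
# DDT entry that lives on the L2ARC (the figures quoted above).

IN_ARC_ENTRY = 376    # bytes per DDT entry held entirely in ARC
L2ARC_REF    = 176    # bytes of ARC overhead per DDT entry pushed to L2ARC

def ddt_arc_bytes(pool_bytes, avg_block_bytes, have_l2arc):
    entries = pool_bytes // avg_block_bytes
    return entries * (L2ARC_REF if have_l2arc else IN_ARC_ENTRY)

pool = 8 * 2**40                 # hypothetical 8TB of allocated, dedup-enabled data
for block in (128 * 1024, 4 * 1024):
    ram_only = ddt_arc_bytes(pool, block, have_l2arc=False) / 2**30
    with_l2 = ddt_arc_bytes(pool, block, have_l2arc=True) / 2**30
    print(f"{block // 1024:>3}k: {ram_only:6.1f} GB ARC alone, "
          f"{with_l2:6.1f} GB ARC + L2ARC ({1 - with_l2 / ram_only:.0%} less ARC)")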



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:
>        I suspect that NetApp does the following to limit their resource
> usage:   they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
> On 5/4/2011 9:57 AM, Ray Van Dolson wrote:
> > There are a number of threads (this one[1] for example) that describe
> > memory requirements for deduplication.  They're pretty high.
> >
> > I'm trying to get a better understanding... on our NetApps we use 4K
> > block sizes with their post-process deduplication and get pretty good
> > dedupe ratios for VM content.
> >
> > With ZFS we are using 128K record sizes by default, which nets us less
> > impressive savings... however, to drop to a 4K record size would
> > theoretically require that we have nearly 40GB of memory for only 1TB
> > of storage (based on 150 bytes per block for the DDT).
> >
> > This obviously becomes prohibitive for 10+ TB file systems.
> >
> > I will note that our NetApps are using only 2TB FlexVols, but would
> > like to better understand ZFS's (apparently) higher memory
> > requirements... or maybe I'm missing something entirely.
> >
> > Thanks,
> > Ray
> 
> I'm not familiar with NetApp's implementation, so I can't speak to
> why it might appear to use less resources.
> 
> However, there are a couple of possible issues here:
> 
> (1)  Pre-write vs Post-write Deduplication.
>  ZFS does pre-write dedup, where it looks for duplicates before 
> it writes anything to disk.  In order to do pre-write dedup, you really 
> have to store the ENTIRE deduplication block lookup table in some sort 
> of fast (random) access media, realistically Flash or RAM.  The win is 
> that you get significantly lower disk utilization (i.e. better I/O 
> performance), as (potentially) much less data is actually written to disk.
>  Post-write Dedup is done via batch processing - that is, such a 
> design has the system periodically scan the saved data, looking for 
> duplicates. While this method also greatly benefits from being able to 
> store the dedup table in fast random storage, it's not anywhere near as
> critical. The downside here is that you see much higher disk utilization 
> - the system must first write all new data to disk (without looking for 
> dedup), and then must also perform significant I/O later on to do the dedup.

Makes sense.

> (2) Block size:  a 4k block size will yield better dedup than a 128k 
> block size, presuming reasonable data turnover.  This is inherent, as 
> any single bit change in a block will make it non-duplicated.  With 32x 
> the block size, there is a much greater chance that a small change in 
> data will require a large loss of dedup ratio.  That is, 4k blocks 
> should almost always yield much better dedup ratios than larger ones. 
> Also, remember that the ZFS block size is a SUGGESTION for zfs 
> filesystems (i.e. it will use UP TO that block size, but not always that 
> size), but is FIXED for zvols.
> 
> (3) Method of storing (and data stored in) the dedup table.
>  ZFS's current design is (IMHO) rather piggy on DDT and L2ARC 
> lookup requirements. Right now, ZFS requires a record in the ARC (RAM) 
> for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
> boils down to 500+ bytes of combined L2ARC & RAM usage per block entry 
> in the DDT.  Also, the actual DDT entry itself is perhaps larger than 
> absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
> I was hoping we could slap in 80GB of SSD L2ARC and get away with
"only" 16GB of RAM for example.

>  I suspect that NetApp does the following to limit their 
> resource usage:   they presume the presence of some sort of cache that 
> can be dedicated to the DDT (and, since they also control the hardware, 
> they can make sure there is always one present).  Thus, they can make 
> their code completely avoid the need for an equivalent to the ARC-based 
> lookup.  In addition, I suspect they have a smaller DDT entry itself.  
> Which boils down to probably needing 50% of the total resource 
> consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.
> 
> Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The 
> big issue is the ARC requirements, which, until they can be seriously 
> reduced (or, best case, simply eliminated), really are a significant
> barrier to adoption of ZFS dedup.
> 
> Right now, ZFS treats DDT entries like any other data or metadata in how 
> they age from ARC to L2ARC to gone.  IMHO, the better way to do this is
> simply require the DDT to be entirely stored on the L2ARC (if present), 
> and not ever keep any DDT info in the ARC at all (that is, the ARC 
> should contain a pointer to the DDT in the L2ARC, and that's it, 
> regardless of the amount or frequency of access of the DDT).  Frankly, 
> at this point, I'd almost change the design to REQUIRE a L2ARC device in 
> order to turn on Dedup.

Thanks for your response, Erik.  Very helpful.

Ray
__

Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 9:57 AM, Ray Van Dolson wrote:

There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication.  They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.

With ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).

This obviously becomes prohibitive for 10+ TB file systems.

I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


I'm not familiar with NetApp's implementation, so I can't speak to why 
it might appear to use less resources.


However, there are a couple of possible issues here:

(1)  Pre-write vs Post-write Deduplication.
ZFS does pre-write dedup, where it looks for duplicates before 
it writes anything to disk.  In order to do pre-write dedup, you really 
have to store the ENTIRE deduplication block lookup table in some sort 
of fast (random) access media, realistically Flash or RAM.  The win is 
that you get significantly lower disk utilization (i.e. better I/O 
performance), as (potentially) much less data is actually written to disk.
Post-write Dedup is done via batch processing - that is, such a 
design has the system periodically scan the saved data, looking for 
duplicates. While this method also greatly benefits from being able to 
store the dedup table in fast random storage, it's not anywhere near as
critical. The downside here is that you see much higher disk utilization 
- the system must first write all new data to disk (without looking for 
dedup), and then must also perform significant I/O later on to do the dedup.
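
To make the distinction concrete, a toy sketch of the two approaches (a
dictionary keyed by block hash; it ignores hash collisions, reference counting
on free, and on-disk layout, and is not how either product implements it):

# Toy illustration of inline (pre-write) vs batch (post-write) dedup.
# The only point is *where* the duplicate check happens relative to the write.

import hashlib

def inline_write(store, ddt, block):
    """Pre-write: consult the dedup table before anything hits 'disk'."""
    key = hashlib.sha256(block).hexdigest()
    if key not in ddt:            # only genuinely new data gets written
        store[key] = block
        ddt[key] = 0
    ddt[key] += 1                 # bump the reference count either way
    return key

def batch_dedup(raw_blocks):
    """Post-write: everything was already written; scan later and collapse dupes."""
    store, ddt = {}, {}
    for block in raw_blocks:      # this scan is the extra I/O pass
        key = hashlib.sha256(block).hexdigest()
        if key not in store:
            store[key] = block
            ddt[key] = 0
        ddt[key] += 1
    return store, ddt

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
store, ddt = {}, {}
for b in blocks:
    inline_write(store, ddt, b)
print("inline: wrote", len(store), "of", len(blocks), "blocks")

store2, _ = batch_dedup(blocks)   # same end state, but all 3 blocks hit disk first
print("batch : kept ", len(store2), "of", len(blocks), "blocks after the scan")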


(2) Block size:  a 4k block size will yield better dedup than a 128k 
block size, presuming reasonable data turnover.  This is inherent, as 
any single bit change in a block will make it non-duplicated.  With 32x 
the block size, there is a much greater chance that a small change in 
data will require a large loss of dedup ratio.  That is, 4k blocks 
should almost always yield much better dedup ratios than larger ones. 
Also, remember that the ZFS block size is a SUGGESTION for zfs 
filesystems (i.e. it will use UP TO that block size, but not always that 
size), but is FIXED for zvols.
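
A toy model of that effect (the image size and the number of scattered 4k
changes are made-up inputs; worst case, every change lands in a different record):

# How record size interacts with small scattered changes:
# each changed 4k region "spoils" the whole record that contains it.

def dedupable_fraction(image_bytes, recordsize, changed_4k_regions):
    records = image_bytes // recordsize
    spoiled = min(changed_4k_regions, records)   # worst case: all in distinct records
    return (records - spoiled) / records

image = 10 * 2**30          # a hypothetical 10GB guest image
changes = 50000             # ~200MB of scattered 4k writes since cloning
for rs in (128 * 1024, 4 * 1024):
    frac = dedupable_fraction(image, rs, changes)
    print(f"{rs // 1024:>3}k records: ~{frac:.1%} of the image still dedups")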


(3) Method of storing (and data stored in) the dedup table.
ZFS's current design is (IMHO) rather piggy on DDT and L2ARC 
lookup requirements. Right now, ZFS requires a record in the ARC (RAM) 
for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry 
in the DDT.  Also, the actual DDT entry itself is perhaps larger than 
absolutely necessary.
I suspect that NetApp does the following to limit their 
resource usage:   they presume the presence of some sort of cache that 
can be dedicated to the DDT (and, since they also control the hardware, 
they can make sure there is always one present).  Thus, they can make 
their code completely avoid the need for an equivalent to the ARC-based 
lookup.  In addition, I suspect they have a smaller DDT entry itself.  
Which boils down to probably needing 50% of the total resource 
consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.



Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The 
big issue is the ARC requirements, which, until they can be seriously 
reduced (or, best case, simply eliminated), really are a significant
barrier to adoption of ZFS dedup.


Right now, ZFS treats DDT entries like any other data or metadata in how 
they age from ARC to L2ARC to gone.  IMHO, the better way to do this is
simply require the DDT to be entirely stored on the L2ARC (if present), 
and not ever keep any DDT info in the ARC at all (that is, the ARC 
should contain a pointer to the DDT in the L2ARC, and that's it, 
regardless of the amount or frequency of access of the DDT).  Frankly, 
at this point, I'd almost change the design to REQUIRE a L2ARC device in 
order to turn on Dedup.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication.  They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.

With ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).

This obviously becomes prohibitive for 10+ TB file systems.
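
For reference, the arithmetic behind those figures (a sketch; 150 bytes/entry is
the rule of thumb cited above, and the in-ARC cost discussed elsewhere in the
thread is higher):

# Bytes-per-DDT-entry times number of blocks, for a few pool/recordsize combos.

ENTRY_BYTES = 150

def ddt_gb(data_tb, block_kb):
    blocks = data_tb * 2**40 / (block_kb * 1024)
    return blocks * ENTRY_BYTES / 2**30

for tb in (1, 10):
    for block_kb in (4, 128):
        print(f"{tb:>2} TB @ {block_kb:>3}k: ~{ddt_gb(tb, block_kb):7.1f} GB for the DDT")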

I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss