Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: Tim Cook [mailto:t...@cook.ms]
> 
> That's patently false.  VM images are the absolute best use-case for dedup
> outside of backup workloads.  I'm not sure who told you/where you got the
> idea that VM images are not ripe for dedup, but it's wrong.

Well, I got that idea from this list.  I said a little bit about why I
believed it was true ... about dedup being ineffective for VMs ... Would
you care to describe a use case where dedup would be effective for a VM?  Or
perhaps cite something specific, instead of just wiping the whole thing and
saying "patently false"?  I don't feel like this comment was productive...



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: Tim Cook [mailto:t...@cook.ms]
> 
> > ZFS's problem is that it needs ALL the resources for EACH pool ALL the
> > time, and can't really share them well if it expects to keep performance
> > from tanking... (no pun intended)
> That's true, but on the flipside, if you don't have adequate resources
> dedicated all the time, it means performance is unsustainable.  Anything
> which is going to do post-write dedup will necessarily have degraded
> performance on a periodic basis.  This is in *addition* to all your scrubs
> and backups and so on.
> 
> 
> AGAIN, you're assuming that all system resources are used all the time and
> can't possibly go anywhere else.  This is absolutely false.  If someone is
> running a system at 99% capacity 24/7, perhaps that might be a factual
> statement.  I'd argue if someone is running the system 99% all of the time,
> the system is grossly undersized for the workload.

Well, here is my situation:  I do IT for a company whose workload is very
spiky.  For weeks at a time, the system will be 99% idle.  Then when the
engineers have a deadline to meet, they will expand and consume all
available resources, no matter how much you give them.  So they will keep
all systems 99% busy for a month at a time.  After the deadline passes, they
drop back down to 99% idle.

The work is IO intensive so it's not appropriate for something like the
cloud.


> I'm gathering that this list in general has a lack of understanding of how
> NetApp does things.  If you don't know for a fact how it works, stop jumping
> to conclusions on how you think it works.  I know for a fact that short of the

I'm a little confused by this rant, because I didn't say anything about NetApp.




Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-04 Thread Erik Trimble

Good summary, Ned.  A couple of minor corrections.

On 5/4/2011 7:56 PM, Edward Ned Harvey wrote:

This is a summary of a much longer discussion "Dedup and L2ARC memory
requirements (again)"
Sorry even this summary is long.  But the results vary enormously based on
individual usage, so any "rule of thumb" metric that has been bouncing
around on the internet is simply not sufficient.  You need to go into this
level of detail to get an estimate that's worth the napkin or bathroom
tissue it's scribbled on.

This is how to (reasonably) accurately estimate the hypothetical ram
requirements to hold the complete data deduplication tables (DDT) and L2ARC
references in ram.  Please note both the DDT and L2ARC references can be
evicted from memory according to system policy, whenever the system decides
some other data is more valuable to keep.  So following this guide does not
guarantee that the whole DDT will remain in ARC or L2ARC.  But it's a good
start.

I am using a solaris 11 express x86 test system for my example numbers
below.

--- To calculate size of DDT ---

Each entry in the DDT is a fixed size, which varies by platform.  You can
find it with the command:
echo ::sizeof ddt_entry_t | mdb -k
This will return a hex value, that you probably want to convert to decimal.
On my test system, it is 0x178 which is 376 bytes

There is one DDT entry per non-dedup'd (unique) block in the zpool.  Be
aware that you cannot reliably estimate #blocks by counting #files.  You can
find the number of total blocks including dedup'd blocks in your pool with
this command:
zdb -bb poolname | grep 'bp count'
Note:  This command will run a long time and is IO intensive.  On my systems
where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes.
On my test system, the result is 44145049 (44.1M) total blocks.

To estimate the number of non-dedup'd (unique) blocks (assuming average size
of dedup'd blocks = average size of blocks in the whole pool), use:
zpool list
Find the dedup ratio.  In my test system, it is 2.24x.  Divide the total
blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

In my test system:
44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx
non-dedup'd (unique) blocks

Then multiply by the size of a DDT entry.
19707611 * 376 = 7410061796 bytes = 7G total DDT size

--- To calculate size of ARC/L2ARC references ---

Each reference to a L2ARC entry requires an entry in ARC (ram).  This is
another fixed size, which varies by platform.  You can find it with the
command:
echo ::sizeof arc_buf_hdr_t | mdb -k
On my test system, it is 0xb0 which is 176 bytes

We need to know the average block size in the pool, to estimate the number
of blocks that will fit into L2ARC.  Find the amount of space ALLOC in the
pool:
zpool list
Divide by the number of non-dedup'd (unique) blocks in the pool, to find the
average block size.  In my test system:
790G / 19707611 = 42K average block size

Remember:  If your L2ARC were only caching average-size blocks, then the
payload ratio of L2ARC vs ARC would be excellent.  In my test system, every
42K block in L2ARC would require 176 bytes of ARC (a ratio of 244x).  This
would result in negligible ARC memory consumption.  But since your DDT can
be pushed out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC
memory consumption.  In my test system every 376-byte DDT entry in L2ARC
consumes 176 bytes of ARC (a ratio of 2.1x).  So effectively you can end up
with the complete DDT present in both ARC and L2ARC, consuming tons of ram.

Remember disk mfgrs use base-10.  So my 32G SSD is only 30G base-2.
(32,000,000,000 / 1024/1024/1024)

So I have 30G L2ARC, and the first 7G may be consumed by DDT.  This leaves
23G remaining to be used for average-sized blocks.
The ARC consumed to reference the DDT in L2ARC is 176/376 * DDT size. In my
test system this is 176/376 * 7G = 3.3G

Take the remaining size of your L2ARC, divide by average block size, to get
the number of average size blocks the L2ARC can hold.  In my test system:
23G / 42K = 574220 average-size blocks in L2ARC
Multiply by the ARC size of a L2ARC reference.  On my test system:
574220 * 176 = 101062753 bytes = 96MB ARC consumed to reference the
average-size blocks in L2ARC

So the total ARC consumption to hold L2ARC references in my test system is
3.3G + 96M ~= 3.4G

--- To calculate total ram needed ---

And finally: the max size the ARC is allowed to grow to is a constant that
varies by platform.  On my system, it is 80% of system ram.  You can find
this value using the command:
kstat -p zfs::arcstats:c_max
Divide by your total system memory to find the ratio.
Assuming the ratio is 4/5, it means you need to buy 5/4 the amount of
calculated ram to satisfy all your requirements.

Using the standard c_max value of 80%, remember that this is 80% of the 
TOTAL

Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> >
> > Are any of you out there using dedupe ZFS file systems to store VMware
> > VMDK (or any VM tech. really)?  Curious what recordsize you use and
> > what your hardware specs / experiences have been.
>
> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
> or netapp or anything else.)  Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks...  Then even when you write duplicated data
> inside
> the guest, the host won't see it as a duplicated block.
>
> There are some situations where dedup may help on VM images...  For example
> if you're not using sparse files and you have a zero-filled disk...  But in
> that case, you should probably just use a sparse file instead...  Or ...
>  If
> you have a "golden" image that you're copying all over the place ... but in
> that case, you should probably just use clones instead...
>
> Or if you're intimately familiar with both the guest & host filesystems,
> and
> you choose blocksizes carefully to make them align.  But that seems
> complicated and likely to fail.
>
>
>
That's patently false.  VM images are the absolute best use-case for dedup
outside of backup workloads.  I'm not sure who told you/where you got the
idea that VM images are not ripe for dedup, but it's wrong.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 10:15 PM, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Erik Trimble
> >
> > ZFS's problem is that it needs ALL the resources for EACH pool ALL the
> > time, and can't really share them well if it expects to keep performance
> > from tanking... (no pun intended)
>
> That's true, but on the flipside, if you don't have adequate resources
> dedicated all the time, it means performance is unsustainable.  Anything
> which is going to do post-write dedup will necessarily have degraded
> performance on a periodic basis.  This is in *addition* to all your scrubs
> and backups and so on.
>
>
>
AGAIN, you're assuming that all system resources are used all the time and
can't possibly go anywhere else.  This is absolutely false.  If someone is
running a system at 99% capacity 24/7, perhaps that might be a factual
statement.  I'd argue if someone is running the system 99% all of the time,
the system is grossly undersized for the workload.  How can you EVER expect
a highly available system to run at 99% on both nodes (all nodes in a VMAX/VSP
scenario) and ever be able to fail over?  Whether it's a home-brew OpenSolaris
cluster, an Oracle 7000 cluster, or a NetApp?

I'm gathering that this list in general has a lack of understanding of how
NetApp does things.  If you don't know for a fact how it works, stop jumping
to conclusions on how you think it works.  I know for a fact that short of
the guys currently/previously writing the code at NetApp, there are only a handful
of people in the entire world who know (factually) how they're allocating
resources from soup to nuts.

As far as this discussion is concerned, there are only two points that matter:
they've got dedup on primary storage, and it works in the field.  The rest is
just static that doesn't matter.  Let's focus on how to make ZFS better
instead of trying to guess how others are making it work, especially when
they've got a completely different implementation.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>  
> Are any of you out there using dedupe ZFS file systems to store VMware
> VMDK (or any VM tech. really)?  Curious what recordsize you use and
> what your hardware specs / experiences have been.

Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
or netapp or anything else.)  Because the VM images are all going to have
their own filesystems internally with whatever blocksize is relevant to the
guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
whatever FS) host blocks...  Then even when you write duplicated data inside
the guest, the host won't see it as a duplicated block.

There are some situations where dedup may help on VM images...  For example
if you're not using sparse files and you have a zero-filled disk...  But in
that case, you should probably just use a sparse file instead...  Or ...  If
you have a "golden" image that you're copying all over the place ... but in
that case, you should probably just use clones instead...

Or if you're intimately familiar with both the guest & host filesystems, and
you choose blocksizes carefully to make them align.  But that seems
complicated and likely to fail.
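
For what it's worth, if you did want to experiment with the alignment
approach, the knobs involved are just the dataset properties.  A minimal
sketch (pool/dataset names are made up, and 4k assumes an NTFS-style guest
cluster size):

zfs create -o recordsize=4k tank/vmimages       # VMDK files on a filesystem
zfs create -V 20G -o volblocksize=4k tank/vm01  # a zvol presented to the guest

Even then, you still have to make sure the guest partitions actually start
on an aligned offset, which is the part that's easy to get wrong.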



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Erik Trimble
> 
> ZFS's problem is that it needs ALL the resources for EACH pool ALL the
> time, and can't really share them well if it expects to keep performance
> from tanking... (no pun intended)

That's true, but on the flipside, if you don't have adequate resources
dedicated all the time, it means performance is unsustainable.  Anything
which is going to do post-write dedup will necessarily have degraded
performance on a periodic basis.  This is in *addition* to all your scrubs
and backups and so on.



[zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-04 Thread Edward Ned Harvey
This is a summary of a much longer discussion "Dedup and L2ARC memory
requirements (again)"
Sorry even this summary is long.  But the results vary enormously based on
individual usage, so any "rule of thumb" metric that has been bouncing
around on the internet is simply not sufficient.  You need to go into this
level of detail to get an estimate that's worth the napkin or bathroom
tissue it's scribbled on.

This is how to (reasonably) accurately estimate the hypothetical ram
requirements to hold the complete data deduplication tables (DDT) and L2ARC
references in ram.  Please note both the DDT and L2ARC references can be
evicted from memory according to system policy, whenever the system decides
some other data is more valuable to keep.  So following this guide does not
guarantee that the whole DDT will remain in ARC or L2ARC.  But it's a good
start.

I am using a solaris 11 express x86 test system for my example numbers
below.  

--- To calculate size of DDT ---

Each entry in the DDT is a fixed size, which varies by platform.  You can
find it with the command:
echo ::sizeof ddt_entry_t | mdb -k
This will return a hex value, that you probably want to convert to decimal.
On my test system, it is 0x178 which is 376 bytes

There is one DDT entry per non-dedup'd (unique) block in the zpool.  Be
aware that you cannot reliably estimate #blocks by counting #files.  You can
find the number of total blocks including dedup'd blocks in your pool with
this command:
zdb -bb poolname | grep 'bp count'
Note:  This command will run a long time and is IO intensive.  On my systems
where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes.
On my test system, the result is 44145049 (44.1M) total blocks.

To estimate the number of non-dedup'd (unique) blocks (assuming average size
of dedup'd blocks = average size of blocks in the whole pool), use:
zpool list
Find the dedup ratio.  In my test system, it is 2.24x.  Divide the total
blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

In my test system:
44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx
non-dedup'd (unique) blocks

Then multiply by the size of a DDT entry.
19707611 * 376 = 7410061796 bytes = 7G total DDT size
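
As a convenience, the steps above can be strung together in a small shell
sketch (the pool name "tank" is a placeholder, the parsing assumes the
output formats shown above, and exact option support varies a bit by
release):

entry=$(printf '%d' $(echo ::sizeof ddt_entry_t | mdb -k | awk '{print $NF}'))
blocks=$(zdb -bb tank | awk '/bp count/ {print $NF}')        # 44145049 here
ratio=$(zpool list -H -o dedupratio tank | sed 's/x$//')     # 2.24 here
echo "$entry * $blocks / $ratio" | bc                        # approx DDT size in bytes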

--- To calculate size of ARC/L2ARC references ---

Each reference to a L2ARC entry requires an entry in ARC (ram).  This is
another fixed size, which varies by platform.  You can find it with the
command:
echo ::sizeof arc_buf_hdr_t | mdb -k
On my test system, it is 0xb0 which is 176 bytes

We need to know the average block size in the pool, to estimate the number
of blocks that will fit into L2ARC.  Find the amount of space ALLOC in the
pool:
zpool list
Divide by the number of non-dedup'd (unique) blocks in the pool, to find the
average block size.  In my test system:
790G / 19707611 = 42K average block size

Remember:  If your L2ARC were only caching average-size blocks, then the
payload ratio of L2ARC vs ARC would be excellent.  In my test system, every
42K block in L2ARC would require 176 bytes of ARC (a ratio of 244x).  This
would result in negligible ARC memory consumption.  But since your DDT can
be pushed out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC
memory consumption.  In my test system every 376-byte DDT entry in L2ARC
consumes 176 bytes of ARC (a ratio of 2.1x).  So effectively you can end up
with the complete DDT present in both ARC and L2ARC, consuming tons of ram.

Remember disk mfgrs use base-10.  So my 32G SSD is only 30G base-2.
(32,000,000,000 / 1024/1024/1024)

So I have 30G L2ARC, and the first 7G may be consumed by DDT.  This leaves
23G remaining to be used for average-sized blocks.
The ARC consumed to reference the DDT in L2ARC is 176/376 * DDT size. In my
test system this is 176/376 * 7G = 3.3G

Take the remaining size of your L2ARC, divide by average block size, to get
the number of average size blocks the L2ARC can hold.  In my test system:
23G / 42K = 574220 average-size blocks in L2ARC
Multiply by the ARC size of a L2ARC reference.  On my test system:
574220 * 176 = 101062753 bytes = 96MB ARC consumed to reference the
average-size blocks in L2ARC

So the total ARC consumption to hold L2ARC references in my test system is
3.3G + 96M ~= 3.4G
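
Continuing the shell sketch from the DDT section (ALLOC and the raw L2ARC
size are typed in by hand here; everything else carries over from above):

ref=$(printf '%d' $(echo ::sizeof arc_buf_hdr_t | mdb -k | awk '{print $NF}'))
alloc=848256040960     # ALLOC from zpool list, in bytes (790G here)
l2arc=32000000000      # raw size of the cache device, in bytes
ddt=$(echo "$entry * $blocks / $ratio" | bc)
avgblk=$(echo "$alloc * $ratio / $blocks" | bc)
echo "$ddt * $ref / $entry + ($l2arc - $ddt) / $avgblk * $ref" | bc

The last line is the approximate ARC (ram) consumed by L2ARC references:
the DDT portion plus the average-size-block portion, about 3.4G here.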

--- To calculate total ram needed ---

And finally: the max size the ARC is allowed to grow to is a constant that
varies by platform.  On my system, it is 80% of system ram.  You can find
this value using the command:
kstat -p zfs::arcstats:c_max
Divide by your total system memory to find the ratio.
Assuming the ratio is 4/5, it means you need to buy 5/4 the amount of
calculated ram to satisfy all your requirements.
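
A quick way to get that ratio on the box itself (a sketch; prtconf's
"Memory size" line is assumed to be reported in megabytes):

c_max=$(kstat -p zfs::arcstats:c_max | awk '{print $2}')
phys=$(prtconf | awk '/^Memory size/ {print $3 * 1024 * 1024}')
echo "scale=2; $c_max / $phys" | bc          # 0.80 on my test system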

So the end result is:
On my test system I guess the OS and processes consume 1G.  (I'm making that
up without any reason.)
On my test system I guess I need 8G in the system to 

Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Randy Jones

On 05/03/11 22:45, Rich Teer wrote:



True, but the SB1000 only supports 2GB of RAM IIRC!  I'll soon be


Actually you can get up to 16GB of ram in an SB1000 (or SB2000). The 4GB
dimms are most likely not too common; however, the 1GB and 2GB dimms seem
to be common. At one time Dataram and maybe Kingston made 4GB dimms for
the SB1000 and SB2000. And don't forget you can also put the 1.2GHz processors
in it.

Even with all that it is still not even close to the speed of the U20 M2
you are mentioning below. At least as a workstation...


migrating this machine's duties to an Ultra 20 M2.  A faster CPU
and 4 GB should make a noticeable improvement (not to mention, on


You can also get up to 8GB ram in the U20 M2.


board USB 2.0 ports).

Thanks for your ideas!




--
--
Randy Jones
E-Mail: ra...@jones.tri.net
--


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 5:11 PM, Brandon High wrote:

On Wed, May 4, 2011 at 4:36 PM, Erik Trimble  wrote:

If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
strictly controlled max FlexVol size helps with keeping the resource limits
down, as it will be able to round-robin the post-write dedup to each FlexVol
in turn.

They are, it's in their docs. A volume is dedup'd when 20% of
non-deduped data is added to it, or something similar. 8 volumes can
be processed at once though, I believe, and it could be that weaker
systems are not able to do as many in parallel.


Sounds rational.


block usage has a significant 4k presence.  One way I reduced this initially
was to have the VMdisk image stored on local disk, then copy the *entire*
image to the ZFS server, so the server saw a single large file, which meant
it tended to write full 128k blocks.  Do note that my 30 images only take

Wouldn't you have been better off cloning datasets that contain an
unconfigured install and customizing from there?

-B
Given that my "OS" installs include a fair amount of 3rd-party add-ons 
(compilers, SDKs, et al), I generally find the best method for me is to 
fully configure a client (with the VMdisk on local storage), then copy 
that VMdisk to the ZFS server as a "golden image".  I can then clone 
that image for my other clients of that type, and only have to change 
the network information.
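
The clone step itself is just the usual snapshot/clone pair; a minimal
sketch with made-up dataset names:

zfs snapshot tank/images/win2003-golden@base
zfs clone tank/images/win2003-golden@base tank/images/client01

Each clone shares all of its blocks with the golden image until the client
starts writing its own changes.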


Initially, each new VM image consumes about 1MB of space. :-)

Overall, I've found that as I have to patch each image, it's worthwhile
to take a new "golden-image" snapshot every so often, and then 
reconfigure each client machine again from that new Golden image. I'm 
sure I could do some optimization here, but the method works well enough.



What you want to avoid is having the OS image written to after it has been
placed on the ZFS server; waiting for any other configuration and
customization to happen until then is sub-optimal.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 6:51 PM, Erik Trimble wrote:

>  On 5/4/2011 4:44 PM, Tim Cook wrote:
>
>
>
> On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:
>
>> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
>>
>>> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
>>>
 On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
  wrote:

>I suspect that NetApp does the following to limit their resource
> usage:   they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they
> can
> make sure there is always one present).  Thus, they can make their code
>
 AFAIK, NetApp has more restrictive requirements about how much data
 can be dedup'd on each type of hardware.

 See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
 pieces of hardware can only dedup 1TB volumes, and even the big-daddy
 filers will only dedup up to 16TB per volume, even if the volume size
 is 32TB (the largest volume available for dedup).

 NetApp solves the problem by putting rigid constraints around the
 problem, whereas ZFS lets you enable dedup for any size dataset. Both
 approaches have limitations, and it sucks when you hit them.

 -B

>>> That is very true, although worth mentioning you can have quite a few
>>> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
>>> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
>>>
>>>  Stupid question - can you hit all the various SIS volumes at once, and
>> not get horrid performance penalties?
>>
>> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
>> strictly controlled max FlexVol size helps with keeping the resource limits
>> down, as it will be able to round-robin the post-write dedup to each FlexVol
>> in turn.
>>
>> ZFS's problem is that it needs ALL the resources for EACH pool ALL the
>> time, and can't really share them well if it expects to keep performance
>> from tanking... (no pun intended)
>>
>>
>  On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
> 2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
> you do it on a model that isn't 4 years old without tanking performance?
>  Absolutely.
>
>  Outside of those two 2000 series, the reason there are dedup limits isn't
> performance.
>
>  --Tim
>
>  Indirectly, yes, it's performance, since NetApp has plainly chosen
> post-write dedup as a method to restrict the required hardware
> capabilities.  The dedup limits on Volsize are almost certainly driven by
> the local RAM requirements for post-write dedup.
>
> It also looks like NetApp isn't providing for a dedicated DDT cache, which
> means that when the NetApp is doing dedup, it's consuming the normal
> filesystem cache (i.e. chewing through RAM).  Frankly, I'd be very surprised
> if you didn't see a noticeable performance hit during the period that the
> NetApp appliance is performing the dedup scans.
>
>

Again, it depends on the model/load/etc.  The smallest models will see
performance hits for sure.  If the vol size limits are strictly a matter of
ram, why exactly would they jump from 4TB to 16TB on a 3140 by simply
upgrading ONTAP?  If the limits haven't gone up on, at the very least, every
one of the x2xx systems 12 months from now, feel free to dig up the thread
and give an I-told-you-so.  I'm quite confident that won't be the case.  The
16TB limit SCREAMS to me that it's a holdover from the same 32-bit limit that
causes 32-bit volumes to have a 16TB limit.  I'm quite confident they're
just taking the cautious approach on moving to 64-bit dedup code.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble  wrote:
> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.

They are, it's in their docs. A volume is dedup'd when 20% of
non-deduped data is added to it, or something similar. 8 volumes can
be processed at once though, I believe, and it could be that weaker
systems are not able to do as many in parallel.

> block usage has a significant 4k presence.  One way I reduced this initially
> was to have the VMdisk image stored on local disk, then copy the *entire*
> image to the ZFS server, so the server saw a single large file, which meant
> it tended to write full 128k blocks.  Do note that my 30 images only take

Wouldn't you have been better off cloning datasets that contain an
unconfigured install and customizing from there?

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni  wrote:
>   The problem we've started seeing is that a zfs send -i is taking hours to
> send a very small amount of data (e.g. 20GB in 6 hours) while a full zfs send
> transfers everything faster than the incremental (40-70MB/s). Sometimes we
> just give up on sending the incremental and send a full altogether.

Does the send complete faster if you just pipe to /dev/null? I've
observed that if recv stalls, it'll pause the send, and the two go
back and forth stepping on each other's toes. Unfortunately, send and
recv tend to pause with each individual snapshot they are working on.

Putting something like mbuffer
(http://www.maier-komor.de/mbuffer.html) in the middle can help smooth
it out and speed things up tremendously. It prevents the send from
pausing when the recv stalls, and allows the recv to continue working
when the send is stalled. You will have to fiddle with the buffer size
and other options to tune it for your use.
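
A typical invocation looks something like this (host and dataset names are
placeholders, and the -s/-m values are just a starting point to tune):

zfs send -i tank/fs@snap1 tank/fs@snap2 | mbuffer -s 128k -m 1G | \
    ssh backuphost 'mbuffer -s 128k -m 1G | zfs recv -d backup'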

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 04:51:36PM -0700, Erik Trimble wrote:
> On 5/4/2011 4:44 PM, Tim Cook wrote:
> 
> 
> 
> On Wed, May 4, 2011 at 6:36 PM, Erik Trimble 
> wrote:
> 
> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
> 
> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
> 
> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble<
> erik.trim...@oracle.com>  wrote:
> 
>I suspect that NetApp does the following to limit
> their resource
> usage:   they presume the presence of some sort of cache
> that can be
> dedicated to the DDT (and, since they also control the
> hardware, they can
> make sure there is always one present).  Thus, they can
> make their code
> 
> AFAIK, NetApp has more restrictive requirements about how much
> data
> can be dedup'd on each type of hardware.
> 
> See page 29 of http://media.netapp.com/documents/tr-3505.pdf -
> Smaller
> pieces of hardware can only dedup 1TB volumes, and even the
> big-daddy
> filers will only dedup up to 16TB per volume, even if the
> volume size
> is 32TB (the largest volume available for dedup).
> 
> NetApp solves the problem by putting rigid constraints around
> the
> problem, whereas ZFS lets you enable dedup for any size
> dataset. Both
> approaches have limitations, and it sucks when you hit them.
> 
> -B
> 
> That is very true, although worth mentioning you can have quite a
> few
> of the dedupe/SIS enabled FlexVols on even the lower-end filers
> (our
> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
> 
> 
> Stupid question - can you hit all the various SIS volumes at once, and
> not get horrid performance penalties?
> 
> If so, I'm almost certain NetApp is doing post-write dedup.  That way,
> the strictly controlled max FlexVol size helps with keeping the
> resource limits down, as it will be able to round-robin the post-write
> dedup to each FlexVol in turn.
> 
> ZFS's problem is that it needs ALL the resources for EACH pool ALL the
> time, and can't really share them well if it expects to keep
> performance from tanking... (no pun intended)
> 
> 
> 
> On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
> 2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
> you do it on a model that isn't 4 years old without tanking performance?
>  Absolutely.
> 
> Outside of those two 2000 series, the reason there are dedup limits isn't
> performance. 
> 
> --Tim
> 
> 
> Indirectly, yes, it's performance, since NetApp has plainly chosen
> post-write dedup as a method to restrict the required hardware
> capabilities.  The dedup limits on Volsize are almost certainly
> driven by the local RAM requirements for post-write dedup.
> 
> It also looks like NetApp isn't providing for a dedicated DDT cache,
> which means that when the NetApp is doing dedup, it's consuming the
> normal filesystem cache (i.e. chewing through RAM).  Frankly, I'd be
> very surprised if you didn't see a noticeable performance hit during
> the period that the NetApp appliance is performing the dedup scans.

Yep, when the dedupe process runs, there is a drop in performance
(hence we usually schedule it to run during off-peak hours).  Obviously this
is a luxury that wouldn't be an option in every environment...

During normal operations outside of the dedupe period we haven't
noticed a performance hit.  I don't think we hit the filer too hard
however -- it's acting as a VMware datastore and only a few of the VM's
have higher I/O footprints.

It is a 2050C, however, so we spread the load across the two filer heads
(although we occasionally run everything on one head when performing
maintenance on the other).

Ray


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:44 PM, Tim Cook wrote:



On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:


On 5/4/2011 4:14 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:

On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:

   I suspect that NetApp does the following to
limit their resource
usage:   they presume the presence of some sort of
cache that can be
dedicated to the DDT (and, since they also control the
hardware, they can
make sure there is always one present).  Thus, they
can make their code

AFAIK, NetApp has more restrictive requirements about how
much data
can be dedup'd on each type of hardware.

See page 29 of
http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even
the big-daddy
filers will only dedup up to 16TB per volume, even if the
volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints
around the
problem, whereas ZFS lets you enable dedup for any size
dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

That is very true, although worth mentioning you can have
quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end
filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once,
and not get horrid performance penalties?

If so, I'm almost certain NetApp is doing post-write dedup.  That
way, the strictly controlled max FlexVol size helps with keeping
the resource limits down, as it will be able to round-robin the
post-write dedup to each FlexVol in turn.

ZFS's problem is that it needs ALL the resources for EACH pool ALL
the time, and can't really share them well if it expects to keep
performance from tanking... (no pun intended)


On a 2050?  Probably not.  It's got a single-core mobile celeron CPU 
and 2GB/ram.  You couldn't even run ZFS on that box, much less 
ZFS+dedup.  Can you do it on a model that isn't 4 years old without 
tanking performance?  Absolutely.


Outside of those two 2000 series, the reason there are dedup limits 
isn't performance.


--Tim

Indirectly, yes, it's performance, since NetApp has plainly chosen 
post-write dedup as a method to restrict the required hardware 
capabilities.  The dedup limits on Volsize are almost certainly driven 
by the local RAM requirements for post-write dedup.


It also looks like NetApp isn't providing for a dedicated DDT cache, 
which means that when the NetApp is doing dedup, it's consuming the 
normal filesystem cache (i.e. chewing through RAM).  Frankly, I'd be 
very surprised if you didn't see a noticeable performance hit during the 
period that the NetApp appliance is performing the dedup scans.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:17 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:

On 5/4/2011 2:54 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size:  a 4k block size will yield better dedup than a 128k
block size, presuming reasonable data turnover.  This is inherent, as
any single bit change in a block will make it non-duplicated.  With 32x
the block size, there is a much greater chance that a small change in
data will require a large loss of dedup ratio.  That is, 4k blocks
should almost always yield much better dedup ratios than larger ones.
Also, remember that the ZFS block size is a SUGGESTION for zfs
filesystems (i.e. it will use UP TO that block size, but not always that
size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table.
   ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
in the DDT.  Also, the actual DDT entry itself is perhaps larger than
absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80GB's of SSD L2ARC and get away with
"only" 16GB of RAM for example.

It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC
cache device, the DDT must be stored in RAM.  That's about 376 bytes per
dedup block.

If you have an L2ARC cache device, then the ARC must contain a reference
to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT
entry reference.

So, adding an L2ARC reduces the ARC consumption by about 53%.

Of course, the other benefit from a L2ARC is the data/metadata caching,
which is likely worth it just by itself.

Great info.  Thanks Erik.

For dedupe workloads on larger file systems (8TB+), I wonder if it makes
sense to use SLC / enterprise class SSD (or better) devices for L2ARC
instead of lower-end MLC stuff?  Seems like we'd be seeing more writes
to the device than in a non-dedupe scenario.

Thanks,
Ray
I'm using Enterprise-class MLC drives (without a supercap), and they
work fine with dedup.  I'd have to test, but I don't think that the
increase in writes is that much, so I don't expect an SLC to really make
much of a difference. (The fill rate of the L2ARC is limited, so I can't
imagine we'd bump up against the MLC's limits.)


--

Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Tim Cook
On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:

> On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
>
>> On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
>>
>>> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
>>>  wrote:
>>>
I suspect that NetApp does the following to limit their resource
 usage:   they presume the presence of some sort of cache that can be
 dedicated to the DDT (and, since they also control the hardware, they
 can
 make sure there is always one present).  Thus, they can make their code

>>> AFAIK, NetApp has more restrictive requirements about how much data
>>> can be dedup'd on each type of hardware.
>>>
>>> See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
>>> pieces of hardware can only dedup 1TB volumes, and even the big-daddy
>>> filers will only dedup up to 16TB per volume, even if the volume size
>>> is 32TB (the largest volume available for dedup).
>>>
>>> NetApp solves the problem by putting rigid constraints around the
>>> problem, whereas ZFS lets you enable dedup for any size dataset. Both
>>> approaches have limitations, and it sucks when you hit them.
>>>
>>> -B
>>>
>> That is very true, although worth mentioning you can have quite a few
>> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
>> FAS2050 has a bunch of 2TB SIS enabled FlexVols).
>>
>>  Stupid question - can you hit all the various SIS volumes at once, and
> not get horrid performance penalties?
>
> If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.
>
> ZFS's problem is that it needs ALL the resources for EACH pool ALL the time,
> and can't really share them well if it expects to keep performance from
> tanking... (no pun intended)
>
>
On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
you do it on a model that isn't 4 years old without tanking performance?
 Absolutely.

Outside of those two 2000 series, the reason there are dedup limits isn't
performance.

--Tim


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 4:14 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:

On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:

I suspect that NetApp does the following to limit their resource
usage:   they presume the presence of some sort of cache that can be
dedicated to the DDT (and, since they also control the hardware, they can
make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

That is very true, although worth mentioning you can have quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and 
not get horrid performance penalties?


If so, I'm almost certain NetApp is doing post-write dedup.  That way, 
the strictly controlled max FlexVol size helps with keeping the resource 
limits down, as it will be able to round-robin the post-write dedup to 
each FlexVol in turn.


ZFS's problem is that it needs ALL the resources for EACH pool ALL the
time, and can't really share them well if it expects to keep performance 
from tanking... (no pun intended)



The FAS2050 of course has a fairly small memory footprint...

I do like the additional flexibility you have with ZFS, just trying to
get a handle on the memory requirements.

Are any of you out there using dedupe ZFS file systems to store VMware
VMDK (or any VM tech. really)?  Curious what recordsize you use and
what your hardware specs / experiences have been.

Ray


Right now, I use it for my Solaris 8 containers and VirtualBox images.  
The VB images are mostly Windows (XP and Win2003).


I tend to put the OS image in one VMdisk, and my scratch disks in 
another. That is, I generally don't want my apps writing much to my OS 
images. My scratch/data disks aren't deduped.


Overall, I'm running about 30 deduped images served out over NFS. My 
recordsize is set to 128k, but, given that they're OS images, my actual 
disk block usage has a significant 4k presence.  One way I reduced this
initially was to have the VMdisk image stored on local disk, then copy
the *entire* image to the ZFS server, so the server saw a single large
file, which meant it tended to write full 128k blocks.  Do note that my
30 images only take about 20GB of actual space, after dedup. I figure
about 5GB of dedup space per OS type (and I have 4 different setups).
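
If you want to see the same numbers on your own pool, something along these
lines works ("tank" is a placeholder, and zdb -DD can take a while on a big
pool):

zpool list -o name,allocated,dedupratio tank
zdb -DD tank          # prints DDT statistics and a histogram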


My data VMdisks, however, chew through about 4TB of disk space, which is
nondeduped. I'm still trying to determine if I'm better off serving 
those data disks as NFS mounts to my clients, or as VMdisk images 
available over iSCSI or NFS.  Right now, I'm doing VMdisks over NFS.


The setup I'm using is an older X4200 (non-M2), with 3rd-party SSDs as 
L2ARC, hooked to an old 3500FC array. It has 8GB of RAM in total, and 
runs just fine with that.  I definitely am going to upgrade to something 
much larger in the near future, since I expect to up my number of VM 
images by at least a factor of 5.



That all said, if you're relatively careful about separating OS installs 
from active data, you can get really impressive dedup ratios using a 
relatively small amount of actual space.  In my case, I expect to
eventually be serving about 10 different configs out to a total of maybe 
100 clients, and probably never exceed 100GB max on the deduped end. 
Which means that I'll be able to get away with 16GB of RAM for the whole 
server, comfortably.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Adam Serediuk
On May 4, 2011, at 4:16 PM, Victor Latushkin wrote:

> Try
> 
> echo metaslab_debug/W1 | mdb -kw
> 
> If it does not help, reset it back to zero 
> 
> echo metaslab_debug/W0 | mdb -kw

That appears to have resolved the issue! Within seconds of making the change 
performance has increased by an order of magnitude. I was typing the reply 
below when your message came in. Is this bug 7000208?
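
If this does turn out to be the fix, I assume it can be made to persist
across reboots with an /etc/system entry along these lines (untested, and
assuming metaslab_debug is exposed as a zfs module tunable on this build):

set zfs:metaslab_debug = 1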

On May 4, 2011, at 4:01 PM, Garrett D'Amore wrote:

> Sounds like a nasty bug, and not one I've seen in illumos or
> NexentaStor.  What build are you running?


running snv_151a

Running some synthetic tests right now and comparing the various stats, one 
thing that stands out as very different on this system compared to our others 
is that writes seem to be going to ~5 mirror sets at a time (of 22 configured).
The next batch of writes will move on to the next ~5 mirror sets, and so forth
cycling around. The other systems will write to many more mirror sets
simultaneously. This particular machine does not appear to be buffering writes
and appears to be writing everything synchronously to disk despite having
sync/zil disabled.
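
For anyone following along, the per-vdev write distribution is easy to
watch with something like the following (pool name is a placeholder):

zpool iostat -v tank 1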

I'm trying to do a little more introspection into the zpool thread that is 
using cpu but not having much luck finding anything meaningful. Occasionally 
the cpu usage for that thread will drop, and when it does performance of the 
filesystem increases.


> On Wed, 2011-05-04 at 15:40 -0700, Adam Serediuk wrote:
>> Dedup is disabled (confirmed to be.) Doing some digging it looks like
>> this is a very similar issue
>> to http://forums.oracle.com/forums/thread.jspa?threadID=2200577&tstart=0.
>> 
>> 
>> 
>> On May 4, 2011, at 2:26 PM, Garrett D'Amore wrote:
>> 
>>> My first thought is dedup... perhaps you've got dedup enabled and
>>> the DDT no longer fits in RAM?  That would create a huge performance
>>> cliff.
>>> 
>>> -Original Message-
>>> From: zfs-discuss-boun...@opensolaris.org on behalf of Eric D.
>>> Mudama
>>> Sent: Wed 5/4/2011 12:55 PM
>>> To: Adam Serediuk
>>> Cc: zfs-discuss@opensolaris.org
>>> Subject: Re: [zfs-discuss] Extremely Slow ZFS Performance
>>> 
>>> On Wed, May  4 at 12:21, Adam Serediuk wrote:
 Both iostat and zpool iostat show very little to zero load on the
>>> devices even while blocking.
 
 Any suggestions on avenues of approach for troubleshooting?
>>> 
>>> is 'iostat -en' error free?
>>> 
>>> 
>>> --
>>> Eric D. Mudama
>>> edmud...@bounceswoosh.org
>>> 
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
> On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
> > On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
> >> (2) Block size:  a 4k block size will yield better dedup than a 128k
> >> block size, presuming reasonable data turnover.  This is inherent, as
> >> any single bit change in a block will make it non-duplicated.  With 32x
> >> the block size, there is a much greater chance that a small change in
> >> data will require a large loss of dedup ratio.  That is, 4k blocks
> >> should almost always yield much better dedup ratios than larger ones.
> >> Also, remember that the ZFS block size is a SUGGESTION for zfs
> >> filesystems (i.e. it will use UP TO that block size, but not always that
> >> size), but is FIXED for zvols.
> >>
> >> (3) Method of storing (and data stored in) the dedup table.
> >>   ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
> >> lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
> >> for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
> >> boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
> >> in the DDT.  Also, the actual DDT entry itself is perhaps larger than
> >> absolutely necessary.
> > So the addition of L2ARC doesn't necessarily reduce the need for
> > memory (at least not much if you're talking about 500 bytes combined)?
> > I was hoping we could slap in 80GB's of SSD L2ARC and get away with
> > "only" 16GB of RAM for example.
> 
> It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC 
> cache device, the DDT must be stored in RAM.  That's about 376 bytes per 
> dedup block.
> 
> If you have an L2ARC cache device, then the ARC must contain a reference 
> to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT 
> entry reference.
> 
> So, adding an L2ARC reduces the ARC consumption by about 53%.
> 
> Of course, the other benefit from a L2ARC is the data/metadata caching, 
> which is likely worth it just by itself.

Great info.  Thanks Erik.

For dedupe workloads on larger file systems (8TB+), I wonder if it makes
sense to use SLC / enterprise class SSD (or better) devices for L2ARC
instead of lower-end MLC stuff?  Seems like we'd be seeing more writes
to the device than in a non-dedupe scenario.

Thanks,
Ray


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
> On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:
> >        I suspect that NetApp does the following to limit their resource
> > usage:   they presume the presence of some sort of cache that can be
> > dedicated to the DDT (and, since they also control the hardware, they can
> > make sure there is always one present).  Thus, they can make their code
> 
> AFAIK, NetApp has more restrictive requirements about how much data
> can be dedup'd on each type of hardware.
> 
> See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
> pieces of hardware can only dedup 1TB volumes, and even the big-daddy
> filers will only dedup up to 16TB per volume, even if the volume size
> is 32TB (the largest volume available for dedup).
> 
> NetApp solves the problem by putting rigid constraints around the
> problem, whereas ZFS lets you enable dedup for any size dataset. Both
> approaches have limitations, and it sucks when you hit them.
> 
> -B

That is very true, although worth mentioning you can have quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

The FAS2050 of course has a fairly small memory footprint... 

I do like the additional flexibility you have with ZFS, just trying to
get a handle on the memory requirements.

Are any of you out there using dedupe ZFS file systems to store VMware
VMDK (or any VM tech. really)?  Curious what recordsize you use and
what your hardware specs / experiences have been.

Ray


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 2:54 PM, Ray Van Dolson wrote:

On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size:  a 4k block size will yield better dedup than a 128k
block size, presuming reasonable data turnover.  This is inherent, as
any single bit change in a block will make it non-duplicated.  With 32x
the block size, there is a much greater chance that a small change in
data will require a large loss of dedup ratio.  That is, 4k blocks
should almost always yield much better dedup ratios than larger ones.
Also, remember that the ZFS block size is a SUGGESTION for zfs
filesystems (i.e. it will use UP TO that block size, but not always that
size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table.
  ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
in the DDT.  Also, the actual DDT entry itself is perhaps larger than
absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80GB's of SSD L2ARC and get away with
"only" 16GB of RAM for example.


It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC 
cache device, the DDT must be stored in RAM.  That's about 376 bytes per 
dedup block.


If you have an L2ARC cache device, then the ARC must contain a reference 
to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT 
entry reference.


So, adding an L2ARC reduces the ARC consumption by about 53%.

Of course, the other benefit from a L2ARC is the data/metadata caching, 
which is likely worth it just by itself.
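
As a quick back-of-the-envelope comparison, using the per-entry sizes above
and a hypothetical pool with ~19.7 million unique blocks:

echo '19707611 * 376' | bc   # DDT held entirely in ARC: ~7.4e9 bytes of ram
echo '19707611 * 176' | bc   # DDT in L2ARC, only the 176-byte references in ARC: ~3.5e9 bytes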



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Adam Serediuk
Dedup is disabled (confirmed to be.) Doing some digging it looks like this is a 
very similar issue to 
http://forums.oracle.com/forums/thread.jspa?threadID=2200577&tstart=0.


On May 4, 2011, at 2:26 PM, Garrett D'Amore wrote:

> My first thought is dedup... perhaps you've got dedup enabled and the DDT no 
> longer fits in RAM?  That would create a huge performance cliff.
> 
> -Original Message-
> From: zfs-discuss-boun...@opensolaris.org on behalf of Eric D. Mudama
> Sent: Wed 5/4/2011 12:55 PM
> To: Adam Serediuk
> Cc: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Extremely Slow ZFS Performance
> 
> On Wed, May  4 at 12:21, Adam Serediuk wrote:
> >Both iostat and zpool iostat show very little to zero load on the devices 
> >even while blocking.
> >
> >Any suggestions on avenues of approach for troubleshooting?
> 
> is 'iostat -en' error free?
> 
> 
> --
> Eric D. Mudama
> edmud...@bounceswoosh.org
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> 



Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-05-04 Thread Edward Ned Harvey
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> Sent: Friday, April 29, 2011 12:49 AM
> 
> The lower bound of ARC size is c_min
> 
> # kstat -p zfs::arcstats:c_min

I see there is another character in the plot:  c_max
c_max seems to be 80% of system ram (at least on my systems).

I assume this means the ARC will never grow larger than 80%, so if you're
trying to calculate the ram needed for your system, in order to hold DDT and
L2ARC references in ARC...  This better be factored into the equation.
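
If you want to see where your ARC currently sits relative to that cap, kstat
will print both in one shot:

kstat -p zfs::arcstats:size zfs::arcstats:c_max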



Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble  wrote:
>        I suspect that NetApp does the following to limit their resource
> usage:   they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
> On 5/4/2011 9:57 AM, Ray Van Dolson wrote:
> > There are a number of threads (this one[1] for example) that describe
> > memory requirements for deduplication.  They're pretty high.
> >
> > I'm trying to get a better understanding... on our NetApps we use 4K
> > block sizes with their post-process deduplication and get pretty good
> > dedupe ratios for VM content.
> >
> > Using ZFS we are using 128K record sizes by default, which nets us less
> > impressive savings... however, to drop to a 4K record size would
> > theoretically require that we have nearly 40GB of memory for only 1TB
> > of storage (based on 150 bytes per block for the DDT).
> >
> > This obviously becomes prohibitively higher for 10+ TB file systems.
> >
> > I will note that our NetApps are using only 2TB FlexVols, but would
> > like to better understand ZFS's (apparently) higher memory
> > requirements... or maybe I'm missing something entirely.
> >
> > Thanks,
> > Ray
> 
> I'm not familiar with NetApp's implementation, so I can't speak to
> why it might appear to use less resources.
> 
> However, there are a couple of possible issues here:
> 
> (1)  Pre-write vs Post-write Deduplication.
>  ZFS does pre-write dedup, where it looks for duplicates before 
> it writes anything to disk.  In order to do pre-write dedup, you really 
> have to store the ENTIRE deduplication block lookup table in some sort 
> of fast (random) access media, realistically Flash or RAM.  The win is 
> that you get significantly lower disk utilization (i.e. better I/O 
> performance), as (potentially) much less data is actually written to disk.
>  Post-write Dedup is done via batch processing - that is, such a 
> design has the system periodically scan the saved data, looking for 
> duplicates. While this method also greatly benefits from being able to 
> store the dedup table in fast random storage, it's not anywhere as 
> critical. The downside here is that you see much higher disk utilization 
> - the system must first write all new data to disk (without looking for 
> dedup), and then must also perform significant I/O later on to do the dedup.

Makes sense.

> (2) Block size:  a 4k block size will yield better dedup than a 128k 
> block size, presuming reasonable data turnover.  This is inherent, as 
> any single bit change in a block will make it non-duplicated.  With 32x 
> the block size, there is a much greater chance that a small change in 
> data will require a large loss of dedup ratio.  That is, 4k blocks 
> should almost always yield much better dedup ratios than larger ones. 
> Also, remember that the ZFS block size is a SUGGESTION for zfs 
> filesystems (i.e. it will use UP TO that block size, but not always that 
> size), but is FIXED for zvols.
> 
> (3) Method of storing (and data stored in) the dedup table.
>  ZFS's current design is (IMHO) rather piggy on DDT and L2ARC 
> lookup requirements. Right now, ZFS requires a record in the ARC (RAM) 
> for each L2ARC (cache) entire, PLUS the actual L2ARC entry.  So, it 
> boils down to 500+ bytes of combined L2ARC & RAM usage per block entry 
> in the DDT.  Also, the actual DDT entry itself is perhaps larger than 
> absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80 GB of SSD L2ARC and get away with
"only" 16 GB of RAM, for example.

>  I suspect that NetApp does the following to limit their 
> resource usage:   they presume the presence of some sort of cache that 
> can be dedicated to the DDT (and, since they also control the hardware, 
> they can make sure there is always one present).  Thus, they can make 
> their code completely avoid the need for an equivalent to the ARC-based 
> lookup.  In addition, I suspect they have a smaller DDT entry itself.  
> Which boils down to probably needing 50% of the total resource 
> consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.
> 
> Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The 
> big issue is the ARC requirements, which, until they can be seriously 
> reduced (or, best case, simply eliminated), really is a significant 
> barrier to adoption of ZFS dedup.
> 
> Right now, ZFS treats DDT entries like any other data or metadata in how 
> it ages from ARC to L2ARC to gone.  IMHO, the better way to do this is 
> simply require the DDT to be entirely stored on the L2ARC (if present), 
> and not ever keep any DDT info in the ARC at all (that is, the ARC 
> should contain a pointer to the DDT in the L2ARC, and that's it, 
> regardless of the amount or frequency of access of the DDT).  Frankly, 
> at this point, I'd almost change the design to REQUIRE a L2ARC device in 
> order to turn on Dedup.

Thanks for your response, Erik.  Very helpful.

Ray

Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Giovanni Tirloni
On Tue, May 3, 2011 at 11:42 PM, Peter Jeremy <
peter.jer...@alcatel-lucent.com> wrote:

> - Is the source pool heavily fragmented with lots of small files?
>

Peter,

  We have some servers holding Xen VMs, and the setup was created to have a
default VM from which the others are cloned, so the space savings are quite
good.

  The problem we've started seeing is that a zfs send -i takes hours to send
a very small amount of data (e.g. 20 GB in 6 hours), while a full zfs send
transfers everything faster than the incremental (40-70 MB/s). Sometimes we
just give up on the incremental and send a full stream altogether.

  I'm wondering if it has to do with fragmentation too. Has anyone
experienced this? This is OpenSolaris b111. As a data point, we also have
servers holding VMware VMs (not cloned) and there is no problem. Does anyone
know what's special about Xen's cloned VMs? Sparse files, maybe?
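
  A simple way to take the receive side and the network out of the picture
would be to time the bare send streams (dataset and snapshot names below are
just placeholders):

# time zfs send -i tank/xenvm@snap1 tank/xenvm@snap2 > /dev/null
# time zfs send tank/xenvm@snap2 > /dev/null

If the incremental is still slow on its own, the time is going into walking
the changed blocks on the sending pool rather than into the transport or the
zfs recv side.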

Thanks,

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Adam Serediuk
iostat doesn't show any high service times, and fsstat also shows low
throughput. Occasionally I can generate enough load that you do see some very
high asvc_t, but when that occurs the pool is performing as expected. As a
precaution I just added two extra drives to the zpool in case zfs was having
difficulty finding a location to allocate new blocks, and it has made no
difference. You can still see block allocation cycling evenly across the
available mirrors.



iostat -xnz 3 before, during, and after the blocking
extended device statistics  
r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0   33.30.0  184.1  0.0  0.10.01.5   0   1 c12t0d0
0.0   15.30.0   98.7  0.0  0.00.01.0   0   0 c12t1d0
0.0   15.30.0   98.7  0.0  0.00.01.1   0   0 c12t2d0
0.0   14.30.0   98.7  0.0  0.00.01.0   0   0 c12t6d0
0.0   20.70.0  256.6  0.0  0.00.01.4   0   0 c11t5d0
0.0   48.00.0  273.4  0.0  0.10.01.3   0   1 c11t6d0
0.0   34.00.0  199.9  0.0  0.00.01.5   0   1 c11t7d0
0.0   20.30.0  256.6  0.0  0.00.01.4   0   0 c10t5d0
0.0   47.70.0  273.4  0.0  0.10.01.2   0   1 c10t6d0
0.0   34.00.0  199.9  0.0  0.00.01.5   0   1 c10t7d0
0.09.70.0 1237.9  0.0  0.00.04.7   0   1 c14t0d0
0.0   33.30.0  184.1  0.0  0.10.01.8   0   1 c13t0d0
0.0   15.70.0   98.7  0.0  0.00.01.2   0   0 c13t1d0
0.0   15.70.0   98.7  0.0  0.00.01.0   0   0 c13t2d0
0.0   13.70.0   98.7  0.0  0.00.01.1   0   0 c13t6d0

extended device statistics  
r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t1d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t2d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t3d0
0.30.31.70.0  0.0  0.00.06.0   0   0 c8t4d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t5d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t6d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c8t7d0
0.02.00.01.3  0.0  0.00.00.2   0   0 c12t0d0
0.0   15.70.0  235.9  0.0  0.00.00.8   0   0 c12t1d0
0.3   16.31.8  235.9  0.0  0.00.01.2   0   1 c12t2d0
0.3   15.02.8  130.6  0.0  0.00.01.0   0   1 c12t3d0
0.0   11.70.0  127.9  0.0  0.00.00.8   0   0 c12t4d0
0.30.30.20.0  0.0  0.00.09.0   0   1 c12t5d0
0.0   40.30.0  599.9  0.0  0.00.00.8   0   1 c12t6d0
0.30.32.80.0  0.0  0.00.03.2   0   0 c11t0d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c11t1d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c11t2d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c11t3d0
0.70.33.00.0  0.0  0.00.06.5   0   1 c11t4d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c11t5d0
0.32.01.51.3  0.0  0.00.02.8   0   1 c11t6d0
0.02.00.01.3  0.0  0.00.00.2   0   0 c11t7d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c10t0d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c10t1d0
0.30.30.20.0  0.0  0.00.06.7   0   0 c10t2d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c10t3d0
0.30.31.50.0  0.0  0.00.09.1   0   1 c10t4d0
0.30.31.50.0  0.0  0.00.07.7   0   1 c10t5d0
0.32.01.51.3  0.0  0.00.02.2   0   0 c10t6d0
0.02.00.01.3  0.0  0.00.00.2   0   0 c10t7d0
0.00.30.0   42.6  0.0  0.00.00.7   0   0 c14t1d0
0.30.31.50.0  0.0  0.00.05.9   0   0 c9t1d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t2d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t3d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t4d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t5d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t6d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c9t7d0
0.02.00.01.3  0.0  0.00.00.4   0   0 c13t0d0
0.0   15.70.0  235.9  0.0  0.00.00.8   0   0 c13t1d0
0.0   16.30.0  235.9  0.0  0.00.00.9   0   0 c13t2d0
0.0   14.70.0  130.6  0.0  0.00.00.7   0   0 c13t3d0
0.0   12.00.0  127.9  0.0  0.00.00.9   0   0 c13t4d0
0.00.30.00.0  0.0  0.00.00.0   0   0 c13t5d0
0.0   39.60.0  599.9  0.0  0.00.00.8   0   1 c13t6d0

Re: [zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Eric D. Mudama

On Wed, May  4 at 12:21, Adam Serediuk wrote:

Both iostat and zpool iostat show very little to zero load on the devices even 
while blocking.

Any suggestions on avenues of approach for troubleshooting?


is 'iostat -en' error free?


--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Erik Trimble

On 5/4/2011 9:57 AM, Ray Van Dolson wrote:

There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication.  They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.

Using ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).

This obviously becomes prohibitively higher for 10+ TB file systems.

I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


I'm not familiar with NetApp's implementation, so I can't speak to why 
it might appear to use less resources.


However, there are a couple of possible issues here:

(1)  Pre-write vs Post-write Deduplication.
ZFS does pre-write dedup, where it looks for duplicates before 
it writes anything to disk.  In order to do pre-write dedup, you really 
have to store the ENTIRE deduplication block lookup table in some sort 
of fast (random) access media, realistically Flash or RAM.  The win is 
that you get significantly lower disk utilization (i.e. better I/O 
performance), as (potentially) much less data is actually written to disk.
Post-write Dedup is done via batch processing - that is, such a 
design has the system periodically scan the saved data, looking for 
duplicates. While this method also greatly benefits from being able to 
store the dedup table in fast random storage, it's not anywhere as 
critical. The downside here is that you see much higher disk utilization 
- the system must first write all new data to disk (without looking for 
dedup), and then must also perform significant I/O later on to do the dedup.
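
As a side note (a sketch, with a hypothetical pool/dataset name): because the
dedup decision is made at write time, only blocks written after the property
is turned on are deduplicated, and the pool-wide dedupratio only reflects
those blocks:

# zfs set dedup=on tank/vmstore
# zpool get dedupratio tank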


(2) Block size:  a 4k block size will yield better dedup than a 128k 
block size, presuming reasonable data turnover.  This is inherent, as 
any single bit change in a block will make it non-duplicated.  With 32x 
the block size, there is a much greater chance that a small change in 
data will require a large loss of dedup ratio.  That is, 4k blocks 
should almost always yield much better dedup ratios than larger ones. 
Also, remember that the ZFS block size is a SUGGESTION for zfs 
filesystems (i.e. it will use UP TO that block size, but not always that 
size), but is FIXED for zvols.
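
To make that last distinction concrete (pool and dataset names here are
hypothetical): recordsize on a filesystem is only an upper bound and affects
newly written files, while volblocksize is fixed when the zvol is created:

# zfs set recordsize=4k tank/vmstore
# zfs create -V 100G -o volblocksize=4k tank/vmvol
# zfs get recordsize,volblocksize tank/vmstore tank/vmvol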


(3) Method of storing (and data stored in) the dedup table.
ZFS's current design is (IMHO) rather piggy on DDT and L2ARC 
lookup requirements. Right now, ZFS requires a record in the ARC (RAM) 
for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it 
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry 
in the DDT.  Also, the actual DDT entry itself is perhaps larger than 
absolutely necessary.
I suspect that NetApp does the following to limit their 
resource usage:   they presume the presence of some sort of cache that 
can be dedicated to the DDT (and, since they also control the hardware, 
they can make sure there is always one present).  Thus, they can make 
their code completely avoid the need for an equivalent to the ARC-based 
lookup.  In addition, I suspect they have a smaller DDT entry itself.  
Which boils down to probably needing 50% of the total resource 
consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.
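
For what it's worth, the size of the table can at least be measured (or
estimated ahead of time) from the command line -- pool name hypothetical, and
the exact output varies by build:

# zdb -DD tank    (DDT histogram and per-entry in-core/on-disk sizes, for a pool already using dedup)
# zdb -S tank     (simulates dedup on an existing pool; useful for sizing RAM before enabling it)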



Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The 
big issue is the ARC requirements, which, until they can be seriously 
reduced (or, best case, simply eliminated), really is a significant 
barrier to adoption of ZFS dedup.


Right now, ZFS treats DDT entries like any other data or metadata in how 
it ages from ARC to L2ARC to gone.  IMHO, the better way to do this is 
simply require the DDT to be entirely stored on the L2ARC (if present), 
and not ever keep any DDT info in the ARC at all (that is, the ARC 
should contain a pointer to the DDT in the L2ARC, and that's it, 
regardless of the amount or frequency of access of the DDT).  Frankly, 
at this point, I'd almost change the design to REQUIRE a L2ARC device in 
order to turn on Dedup.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Adam Serediuk
We have an X4540 running Solaris 11 Express snv_151a that has developed an 
issue where its write performance is absolutely abysmal. Even touching a file 
takes over five seconds both locally and remotely.

/pool1/data# time touch foo

real0m5.305s
user0m0.001s
sys 0m0.004s
/pool1/data# time rm foo

real0m5.912s
user0m0.001s
sys 0m0.005s

The system exhibits this issue under the slightest load.  We have sync=disabled 
set on all filesystems in this pool. The pool is at 75% capacity and is 
healthy. This issue started suddenly several days ago and persists after 
reboot. prstat shows zpool-pool1/150 taking 10% CPU constantly whereas other 
similar systems in our infrastructure under the same load do not. Even doing a 
'zfs set' on a property takes up to 10 seconds and on other systems is 
instantaneous. Something appears to be blocking internally.

Both iostat and zpool iostat show very little to zero load on the devices even 
while blocking.

Any suggestions on avenues of approach for troubleshooting?

Thanks,

Adam


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Rich Teer
On Wed, 4 May 2011, Edward Ned Harvey wrote:

> 4G is also lightweight, unless you're not doing much of anything.  No dedup,
> no L2ARC, just simple pushing bits around.  No services running...  Just ssh

Yep, that's right.  This is a repurposed workstation for use in my home network.

> I don't understand why so many people are building systems with insufficient
> ram these days.  I don't put less than 8G into a personal laptop anymore...

The Ultra 20 only supports 4 GB of RAM, and I've installed that much.  It
can't hold any more!  I have to make do with the resources I have here,
with my next to $0 budget...

-- 
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Rich Teer
On Wed, 4 May 2011, Edward Ned Harvey wrote:

> I suspect you're using a junky 1G slow-as-dirt usb thumb drive.

Nope--unless an IOMega Prestige Desktop Hard Drive (containing a
Hitachi 7200 RPM hard drive with 32 MB of cache) counts as a
slow-as-dirt USB thumb drive!

-- 
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication.  They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.

Using ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).

This obviously becomes prohibitively higher for 10+ TB file systems.

I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Rich Teer
> 
> Not such a silly question.  :-)  The USB1 port was indeed the source of
> much of the bottleneck.  The same 50 MB file system took only 8 seconds
> to copy when I plugged the drive into a USB 2.0 card I had in the machine!

50 Mbit/sec


> An 80 GB file system took 2 hours with the USB 2 port in use, with

88 Mbit/sec.
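
(For the curious, the arithmetic behind those two figures, rounded loosely:
50 MB / 8 s ~= 6.25 MB/s ~= 50 Mbit/s, and 80 GB / 2 h ~= 80,000 MB / 7,200 s
~= 11 MB/s ~= 89 Mbit/s.)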


> True, but the SB1000 only supports 2GB of RAM IIRC!  I'll soon be
> migrating this machine's duties to an Ultra 20 M2.  A faster CPU
> and 4 GB should make an noticable improvement (not to mention, on
> board USB 2.0 ports).

4G is also lightweight, unless you're not doing much of anything.  No dedup,
no L2ARC, just simple pushing bits around.  No services running...  Just ssh

I don't understand why so many people are building systems with insufficient
ram these days.  I don't put less than 8G into a personal laptop anymore...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Rich Teer
> 
> Also related to this is a performance question.  My initial test involved
> copying a 50 MB zfs file system to a new disk, which took 2.5 minutes
> to complete.  That strikes me as being a bit high for a mere 50 MB;
> 
> The source pool is on a pair of 146 GB 10K RPM disks on separate
> busses in a D1000 (split bus arrangement) and the destination pool
> is on an IOMega 1 GB USB-attached disk.

Even the fastest USB3 thumb drive is slower than the slowest cheapest hard
drive.  Whatever specs are published on the supposed speed of the flash
drive, don't believe them.  I'm not saying they lie - just that they publish
the fastest conceivable speed under ideal situations, which are totally
unrealistic and meaningless in the real world.

I suspect you're using a junky 1G slow-as-dirt usb thumb drive.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gaining speed with l2arc

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Frank Van Damme
> 
> another dedup question. I just installed an ssd disk as l2arc.  This
> is a backup server with 6 GB RAM (ie I don't often read the same data
> again), basically it has a large number of old backups on it and they
> need to be deleted. Deletion speed seems to have improved although the
> majority of reads are still coming from disk.
> 
>  capacity operationsbandwidth
> pool  alloc   free   read  write   read  write
>   -  -  -  -  -  -
> backups   5.49T  1.58T  1.03K  6  3.13M  91.1K
>   raidz1  5.49T  1.58T  1.03K  6  3.13M  91.1K
> c0t0d0s1  -  -200  2  4.35M  20.8K
> c0t1d0s1  -  -202  1  4.28M  24.7K
> c0t2d0s1  -  -202  1  4.28M  24.9K
> c0t3d0s1  -  -197  1  4.27M  13.1K
> cache -  -  -  -  -  -
>   c1t5d0   112G  7.96M 63  2   337K  66.6K

You have a server with roughly 7T of pool capacity (about 5.5T allocated), a
112G L2ARC, dedup enabled, and 6G of RAM.
Ouch.  That is not nearly enough RAM.  I expect to summarize the thread
"Dedup and L2ARC memory requirements (again)", but until that time, I suggest
going to read that thread.  A more reasonable amount of RAM for your system
is likely in the 20G-30G range.


> first riddle: how to explain the low number of writes to l2arc
> compared to the reads from disk.

As you read things from disk, they go into ARC.  As things are about to
expire from ARC, they might or might not get into L2ARC.  If they expire too
quickly from ARC, they won't make their way into L2ARC.  I'm sure your
problem is lack of RAM.
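
If you want to see whether anything is actually landing in (and being served
from) the cache device, and how much ARC the L2ARC headers themselves are
eating, the arcstats kstats have l2 counters (names from memory, so
double-check on your build):

# kstat -p zfs::arcstats:l2_size
# kstat -p zfs::arcstats:l2_hits
# kstat -p zfs::arcstats:l2_misses
# kstat -p zfs::arcstats:l2_hdr_size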


> I don't need much process memory on this machine (I use rsync and not
> much else).

For rough numbers:  I have a machine that does absolutely nothing.  It has
2G of RAM.  If I create a process that runs away and consumes all RAM, it
starts to push things into swap when it gets to around 1300M, which means
the actual process & kernel memory consumption is around 700M.

If you want reasonable performance, you will need 1G + whatever the L2ARC
requires + whatever the DDT requires + the actual ARC used to cache files.
So a few G on top of the L2ARC + DDT requirements.
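
Rough numbers for your pool, to make that concrete (assuming the ~376 bytes
per DDT entry from that thread and an average block size somewhere between
64K and 128K; backup data with lots of small files will be worse):

   5.49T / 128K ~= 46M blocks  ->  46M * 376 B ~= 17 GB of DDT
   5.49T /  64K ~= 92M blocks  ->  92M * 376 B ~= 35 GB of DDT

which is roughly where the 20G-30G ballpark lands, plus a few G for
everything else.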

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread David Dyer-Bennet

On Tue, May 3, 2011 19:39, Rich Teer wrote:

> I'm playing around with nearline backups using zfs send | zfs recv.
> A full backup made this way takes quite a lot of time, so I was
> wondering: after the initial copy, would using an incremental send
> (zfs send -i) make the process much quicker, because only the stuff that
> had changed between the previous snapshot and the current one would be
> copied?  Is my understanding of incremental zfs send correct?

Yes, that works.  In my setup, a full backup takes 6 hours (about 800GB of
data to an external USB 2 drive), the incremental maybe 20 minutes even if
I've added several gigabytes of images.

> Also related to this is a performance question.  My initial test involved
> copying a 50 MB zfs file system to a new disk, which took 2.5 minutes
> to complete.  That strikes me as being a bit high for a mere 50 MB;
> are my expectations realistic, or is it just because of my very
> budget-conscious setup?  If so, where's the bottleneck?

In addition to the issues others have mentioned, incremental send follows
the order in which the blocks were written rather than disk order, so it can
sometimes be bad for performance.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multipl disk failures cause zpool hang

2011-05-04 Thread TianHong Zhao
Thanks for the reply.

This sounds like a serious issue if we have to reboot the machine in such a
case; I am wondering if anybody is working on this.
BTW, the zpool failmode property is set to continue in my test case.
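
(For reference, checking and setting the property looks like this; the pool
name is just an example:)

# zpool get failmode tank
# zpool set failmode=continue tank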

Tianhong Zhao

-Original Message-
From: Edward Ned Harvey 
[mailto:opensolarisisdeadlongliveopensola...@nedharvey.com] 
Sent: Wednesday, May 04, 2011 9:50 AM
To: TianHong Zhao; zfs-discuss@opensolaris.org
Subject: RE: multipl disk failures cause zpool hang

> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- 
> boun...@opensolaris.org] On Behalf Of TianHong Zhao
> 
> There seem to be a few threads about zpool hang; do we have a
> workaround to resolve the hang issue without rebooting?
> 
> In my case, I have a pool with disks from external LUNs via a fiber cable.
> When the cable is unplugged while there is IO in the pool, all zpool-related
> commands hang (zpool status, zpool list, etc.), and putting the cable back
> does not solve the problem.
> 
> Eventually, I cannot even open a new SSH session to the box; somehow
> the system goes into a half-locked state.

I've hit that one a lot.  I am not aware of any way to fix it without reboot.  
In fact, in all of my experience, you don't even have a choice.
You wait long enough (a few hours) and the system will become totally 
unresponsive, and you'll have no alternative but power cycle.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multipl disk failures cause zpool hang

2011-05-04 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of TianHong Zhao
> 
> There seem to be a few threads about zpool hang; do we have a
> workaround to resolve the hang issue without rebooting?
> 
> In my case, I have a pool with disks from external LUNs via a fiber cable.
> When the cable is unplugged while there is IO in the pool, all zpool-related
> commands hang (zpool status, zpool list, etc.), and putting the cable back
> does not solve the problem.
> 
> Eventually, I cannot even open a new SSH session to the box; somehow the
> system goes into a half-locked state.

I've hit that one a lot.  I am not aware of any way to fix it without
reboot.  In fact, in all of my experience, you don't even have a choice.
You wait long enough (a few hours) and the system will become totally
unresponsive, and you'll have no alternative but power cycle.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] pool dissapearing

2011-05-04 Thread Kristinn Soffanías Rúnarsson
Hi, I have a ZFS pool backed by iSCSI volumes, and the filesystems keep
disappearing lately; all that rectifies it is rebooting the machine.

Running zfs list, I don't get a list of the filesystems on the pool.

Running zpool status, I do get the pool and the disks behind it.

I'm running OpenIndiana 148a; has anybody seen this behaviour?

br,
soffi
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss