Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-03 Thread Richard Elling
On Aug 2, 2012, at 5:40 PM, Nigel W wrote:

 On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
 
 
 Yes. +1
 
 The L2ARC as it is currently implemented is not terribly useful for
 storing the DDT in anyway, because each DDT entry is 376 bytes while the
 L2ARC reference is 176 bytes, so in the best case you get just over
 double the DDT entries in the L2ARC compared to what you would fit
 in the ARC, but then you also have no ARC left for anything else :(.
 
 
 You are making the assumption that each DDT entry consumes one
 metadata update. This is not the case. The DDT is implemented as an AVL
 tree. As with other metadata in ZFS, the data is compressed. So you cannot
 draw a direct correlation between the DDT entry size and the effect on the
 metadata stored in disk sectors.
 -- richard
 
 It's compressed even when in the ARC?


That is a slightly odd question. The ARC contains ZFS blocks. DDT metadata is
manipulated in memory as an AVL tree, so what you can see in the ARC is the
metadata blocks that were read and uncompressed from the pool, or packaged
into blocks and written to the pool. Perhaps it is easier to think of them as
metadata in transition? :-)
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 2012-08-01 23:40, opensolarisisdeadlongliveopensolaris пишет:
 
  Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a
 snapshot destroy (happens every 15 minutes) you've got a lot of entries you
 need to write.  Those are all scattered about the pool...  Even if you can
 find them fast, it's still a bear.
 
 No, these entries you need to update are scattered around your
 SSD (be it ARC or a hypothetical SSD-based copy of metadata
 which I also campaigned for some time ago). 

If they were scattered around the hypothetical dedicated DDT SSD, I would say, 
no problem.  But in reality, they're scattered in your main pool.  DDT writes 
don't get coalesced.  Is this simply because they're sync writes?  Or is it 
because they're metadata, which is even lower level than sync writes?  I know, 
for example, that you can disable ZIL on your pool, but still the system is 
going to flush the buffer after certain operations, such as writing the 
uberblock.  I have not seen the code that flushes the buffer after DDT writes, 
but I have seen the performance evidence.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 In some of my cases I was lucky enough to get a corrupted /sbin/init
 or something like that once, and the box had no other BE's yet, so the
 OS could not do anything reasonable after boot. It is different from a
 corrupted zpool, but ended in a useless OS image due to one broken
 sector nonetheless.

That's very annoying, but if copies could have saved you, then pool 
redundancy could have also saved you.


 For a single-disk box, copies IS the redundancy. ;)

Ok, so the point is, in some cases, somebody might want redundancy on a device
that has no redundancy.  They're willing to pay for it by halving their
performance.  The only situation I'll acknowledge is the laptop situation, and
I'll say, these days very few people would be willing to pay *that* much for
this limited use-case redundancy.  The solution that I as an IT person would
recommend and deploy would be to run without copies and instead cover your bum
by doing backups.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Richard Elling
On Aug 1, 2012, at 2:41 PM, Peter Jeremy wrote:

 On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote:
 I think a fantastic idea for dealing with the DDT (and all other
 metadata for that matter) would be an option to put (a copy of)
 metadata exclusively on a SSD.
 
 This is on my wishlist as well.  I believe ZEVO supports it so possibly
 it'll be available in ZFS in the near future.

ZEVO does not. The only ZFS vendor I'm aware of with a separate top-level
vdev for metadata is Tegile, and it is available today. 
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Richard Elling
On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
 On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 08/01/2012 04:14 PM, Jim Klimov wrote:
 chances are that
 some blocks of userdata might be more popular than a DDT block and
 would push it out of L2ARC as well...
 
 Which is why I plan on investigating implementing some tunable policy
 module that would allow the administrator to get around this problem.
 E.g. administrator dedicates 50G of ARC space to metadata (which
 includes the DDT) or only the DDT specifically. My idea is still a bit
 fuzzy, but it revolves primarily around allocating and policing min and
 max quotas for a given ARC entry type. I'll start a separate discussion
 thread for this later on once I have everything organized in my mind
 about where I plan on taking this.
 
 
 Yes. +1
 
 The L2ARC as it is currently implemented is not terribly useful for
 storing the DDT in anyway, because each DDT entry is 376 bytes while the
 L2ARC reference is 176 bytes, so in the best case you get just over
 double the DDT entries in the L2ARC compared to what you would fit
 in the ARC, but then you also have no ARC left for anything else :(.

You are making the assumption that each DDT entry consumes one
metadata update. This is not the case. The DDT is implemented as an AVL
tree. As with other metadata in ZFS, the data is compressed. So you cannot
draw a direct correlation between the DDT entry size and the effect on the
metadata stored in disk sectors.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Peter Jeremy
On 2012-Aug-02 18:30:01 +0530, opensolarisisdeadlongliveopensolaris 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
Ok, so the point is, in some cases, somebody might want redundancy on
a device that has no redundancy.  They're willing to pay for it by
halving their performance.

This isn't quite true - write performance will be at least halved
(possibly worse due to additional seeking) but read performance
could potentially improve (more copies means, on average, there should
be less seeking to get a copy than if there was only one copy).
And non-IO performance is unaffected.

  The only situation I'll acknowledge is
the laptop situation, and I'll say, present day very few people would
be willing to pay *that* much for this limited use-case redundancy.

My guess is that, for most people, the overall performance impact
would be minimal because disk write performance isn't the limiting
factor for most laptop usage scenarios.

The solution that I as an IT person would recommend and deploy would
be to run without copies and instead cover you bum by doing backups.

You need backups in any case but backups won't help you if you can't
conveniently access them.  Before giving a blanket recommendation, you
need to consider how the person uses their laptop.  Consider the
following scenario:  You're in the middle of a week-long business trip
and your laptop develops a bad sector in an inconvenient spot.  Do you:
a) Let ZFS automagically repair the sector thanks to copies=2.
b) Attempt to rebuild your laptop and restore from backups (left securely
   at home) via the dodgy hotel wifi.

-- 
Peter Jeremy




Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Nigel W
On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling richard.ell...@gmail.com wrote:
 On Aug 1, 2012, at 8:30 AM, Nigel W wrote:


 Yes. +1

 The L2ARC as it is currently implemented is not terribly useful for
 storing the DDT in anyway, because each DDT entry is 376 bytes while the
 L2ARC reference is 176 bytes, so in the best case you get just over
 double the DDT entries in the L2ARC compared to what you would fit
 in the ARC, but then you also have no ARC left for anything else :(.


 You are making the assumption that each DDT entry consumes one
 metadata update. This is not the case. The DDT is implemented as an AVL
 tree. As with other metadata in ZFS, the data is compressed. So you cannot
 draw a direct correlation between the DDT entry size and the effect on the
 metadata stored in disk sectors.
  -- richard

It's compressed even when in the ARC?


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-07-31 17:55, opensolarisisdeadlongliveopensolaris пишет:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Nico Williams

The copies thing is really only for laptops, where the likelihood of
redundancy is very low


ZFS also stores multiple copies of things that it considers extra important.  
I'm not sure what exactly - uber block, or stuff like that...

When you set the copies property, you're just making it apply to other stuff
that otherwise would be only 1.


IIRC, the copies defaults are:
1 block for userdata
2 blocks for regular metadata (block-pointer tree)
3 blocks for higher-level metadata (metadata tree root, dataset
  definitions)

The uberblock I am not so sure about, off the top of my head.
There is a record in the ZFS labels, and that is stored 4 times
on each leaf VDEV, and points to a ZFS block with the tree root
for the current (newest consistent flushed-to-pool) TXG number.
Which one of these concepts is named "the 0x00bab10c" (the
uberblock magic number) - *that* I am a bit vague about ;)
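
(As far as I remember, "zdb -l /dev/rdsk/<device>" will dump a disk's ZFS
label contents if anyone wants to look for themselves - I am quoting that
flag from memory, so treat it as a pointer rather than gospel.)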

Probably DDT is also stored with 2 or 3 copies of each block,
since it is metadata. It was not in the last ZFS on-disk spec
from 2006 that I found, for some apparent reason ;)

Also, I am not sure whether bumping the copies attribute to,
say, 3 increases only the redundancy of userdata, or of
regular metadata as well.

//Jim



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 12:04 PM, Jim Klimov wrote:
 Probably DDT is also stored with 2 or 3 copies of each block,
 since it is metadata. It was not in the last ZFS on-disk spec
 from 2006 that I found, for some apparent reason ;)

That's probably because it's extremely big (dozens, hundreds or even
thousands of GB).

Cheers,
--
Saso


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 16:22, Sašo Kiselkov пишет:

On 08/01/2012 12:04 PM, Jim Klimov wrote:

Probably DDT is also stored with 2 or 3 copies of each block,
since it is metadata. It was not in the last ZFS on-disk spec
from 2006 that I found, for some apparent reason ;)



The idea of the pun was that the latest available full spec is
over half a decade old, alas. At least I failed to find anything
newer when I searched last winter. And back in 2006 there was
no dedup, nor any mention of it in the spec (surprising, huh? ;)

Hopefully, with all the upcoming changes - including the integration
of feature flags and new checksum and compression algorithms - a
consistent textual document of the current ZFS on-disk spec for
illumos(/FreeBSD/...) will appear and be maintained up to date.



That's probably because it's extremely big (dozens, hundreds or even
thousands of GB).


Availability of the DDT is IMHO crucial to a deduped pool, so
I won't be surprised to see it forced to triple copies. Not
that it is very difficult to check with ZDB, though finding
the DDT dataset for inspection (when I last tried) was not
an obvious task.
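
(For reference, I believe "zdb -D <pool>" prints summary DDT statistics and
"zdb -DD <pool>" a fuller histogram, which spares you from hunting down the
DDT object by hand - though I am citing those flags from memory.)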

//Jim


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
  
 Availability of the DDT is IMHO crucial to a deduped pool, so
 I won't be surprised to see it forced to triple copies. 

Agreed, although the DDT is also paramount to performance.  In theory, an
online dedup'd pool could be much faster than non-dedup'd pools, or offline
dedup'd pools.  So there's a lot of potential here - lost potential, at
present.

IMHO, the more important thing for dedup moving forward is to create an option 
to dedicate a fast device (SSD or whatever) to the DDT.  So all those little 
random IO operations never hit the rusty side of the pool.

Personally, I've never been supportive of the whole copies idea.  If you need 
more than one redundant copy of some data, that's why you have pool redundancy. 
 You're just hurting performance by using copies.  And protecting against 
failure conditions that are otherwise nearly nonexistent...  And just as easily 
solved (without performance penalty) via pool redundancy.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
  
 Availability of the DDT is IMHO crucial to a deduped pool, so
 I won't be surprised to see it forced to triple copies. 
 
 IMHO, the more important thing for dedup moving forward is to create an 
 option to dedicate a fast device (SSD or whatever) to the DDT.  So all those 
 little random IO operations never hit the rusty side of the pool.

That's something you can already do with an L2ARC. In the future I plan
on investigating implementing a set of more fine-grained ARC and L2ARC
policy tuning parameters that would give more control into the hands of
admins over how the ARC/L2ARC cache is used.
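
(For the archives: today that means adding a cache vdev, e.g. "zpool add
<pool> cache <ssd-device>", and optionally restricting what a dataset may
cache there with "zfs set secondarycache=metadata <dataset>" - primarycache
does the same for the RAM ARC. The finer-grained per-type policies mentioned
above would go beyond what those properties can express.)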

Cheers,
--
Saso


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 17:35, opensolarisisdeadlongliveopensolaris пишет:

Personally, I've never been supportive of the whole copies idea.  If you need more than 
one redundant copy of some data, that's why you have pool redundancy.  You're just hurting 
performance by using copies.  And protecting against failure conditions that are 
otherwise nearly nonexistent...  And just as easily solved (without performance penalty) via pool 
redundancy.


Well, there are at least a couple of failure scenarios where
copies>1 are good:

1) A single-disk pool, as in a laptop. Noise on the bus,
   media degradation, or any other reason to misread or
   miswrite a block can result in a failed pool. One of
   my older test boxes has an untrustworthy 80GB HDD for
   its rpool, and the system did crash into an unbootable
   image with just half a dozen CKSUM errors.
   I remade the rpool with copies=2 enforced from the
   start and rsynced the rootfs files back into the new
   pool - and the box has worked well since then, despite
   finding several errors upon each weekly scrub.

2) The data pool on the same box experienced some errors
   where raidz2 failed to recreate a userdata block, thus
   invalidating a file despite having a 2-disk redundancy.
   There was some discussion of that on the list, and my
   ultimate guess is that the six disks' heads were over
   similar locations of the same file - i.e. during scrub -
   and a power surge or some similar event caused them to
   scramble portions of the disk pertaining to the same
   ZFS block. At least, this could have induced enough
   errors to make raidz2 protection irrelevant.
   If the pool had copies=2, there would be another replica
   of the same block that would have been not corrupted
   by such assumed failure mechanism - because the disk
   heads were elsewhere.

Hmmm... now I wonder if ZFS checksum validation can try
permutations of should-be-identical sectors from different
copies of a block - in case both copies have received some
non-overlapping errors, and together contain enough data to
reconstruct a ZFS block (and rewrite both its copies now).
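
(A toy sketch of that idea in Python - purely illustrative, with sha256
standing in for whatever checksum the block pointer actually records; this
is not how ZFS is implemented:)

  import hashlib
  from itertools import product

  def try_reconstruct(copy_a, copy_b, sector_size, want_digest):
      # Split both damaged copies into sectors, then try every combination
      # of "take this sector from copy A or from copy B" until one matches
      # the block's recorded checksum.
      sa = [copy_a[i:i + sector_size] for i in range(0, len(copy_a), sector_size)]
      sb = [copy_b[i:i + sector_size] for i in range(0, len(copy_b), sector_size)]
      for picks in product((0, 1), repeat=len(sa)):
          candidate = b"".join(a if p == 0 else b
                               for p, a, b in zip(picks, sa, sb))
          if hashlib.sha256(candidate).digest() == want_digest:
              return candidate   # good data recovered; both copies can be rewritten
      return None                # the copies' errors overlap; the block is lost

A real implementation would only permute the sectors where the two copies
actually differ, so the search stays tiny even for large blocks.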

//Jim



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 17:55, Sašo Kiselkov пишет:

On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

Availability of the DDT is IMHO crucial to a deduped pool, so
I won't be surprised to see it forced to triple copies.


IMHO, the more important thing for dedup moving forward is to create an option 
to dedicate a fast device (SSD or whatever) to the DDT.  So all those little 
random IO operations never hit the rusty side of the pool.


That's something you can already do with an L2ARC. In the future I plan
on investigating implementing a set of more fine-grained ARC and L2ARC
policy tuning parameters that would give more control into the hands of
admins over how the ARC/L2ARC cache is used.



Unfortunately, as of current implementations, L2ARC starts up cold.
That is, upon every import of the pool the L2ARC is empty, and the
DDT (as in the example above) would have to migrate into the cache
by being read from rust into the RAM ARC and then expiring from the
ARC into the L2ARC. Getting it to be hot and fast again takes some
time, and chances are that some blocks of userdata might be more
popular than a DDT block and would push it out of L2ARC as well...

//Jim


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 04:14 PM, Jim Klimov wrote:
 2012-08-01 17:55, Sašo Kiselkov пишет:
 On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov

 Availability of the DDT is IMHO crucial to a deduped pool, so
 I won't be surprised to see it forced to triple copies.

 IMHO, the more important thing for dedup moving forward is to create
 an option to dedicate a fast device (SSD or whatever) to the DDT.  So
 all those little random IO operations never hit the rusty side of the
 pool.

 That's something you can already do with an L2ARC. In the future I plan
 on investigating implementing a set of more fine-grained ARC and L2ARC
 policy tuning parameters that would give more control into the hands of
 admins over how the ARC/L2ARC cache is used.
 
 
 Unfortunately, as of current implementations, L2ARC starts up cold.

Yes, that's by design, because the L2ARC is simply a secondary backing
store for ARC blocks. If the memory pointer isn't valid, chances are,
you'll still be able to find the block on the L2ARC devices. You can't
scan an L2ARC device and discover some usable structures, as there
aren't any. It's literally just a big pile of disk blocks, and the
associated ARC headers live only in RAM.

 chances are that
 some blocks of userdata might be more popular than a DDT block and
 would push it out of L2ARC as well...

Which is why I plan on investigating implementing some tunable policy
module that would allow the administrator to get around this problem.
E.g. administrator dedicates 50G of ARC space to metadata (which
includes the DDT) or only the DDT specifically. My idea is still a bit
fuzzy, but it revolves primarily around allocating and policing min and
max quotas for a given ARC entry type. I'll start a separate discussion
thread for this later on once I have everything organized in my mind
about where I plan on taking this.

Cheers,
--
Saso


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Nigel W
On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 08/01/2012 04:14 PM, Jim Klimov wrote:
 chances are that
 some blocks of userdata might be more popular than a DDT block and
 would push it out of L2ARC as well...

 Which is why I plan on investigating implementing some tunable policy
 module that would allow the administrator to get around this problem.
 E.g. administrator dedicates 50G of ARC space to metadata (which
 includes the DDT) or only the DDT specifically. My idea is still a bit
 fuzzy, but it revolves primarily around allocating and policing min and
 max quotas for a given ARC entry type. I'll start a separate discussion
 thread for this later on once I have everything organized in my mind
 about where I plan on taking this.


Yes. +1

The L2ARC as it is currently implemented is not terribly useful for
storing the DDT in anyway, because each DDT entry is 376 bytes while the
L2ARC reference is 176 bytes, so in the best case you get just over
double the DDT entries in the L2ARC compared to what you would fit
in the ARC, but then you also have no ARC left for anything else :(.
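
(A quick back-of-the-envelope sketch of that ratio, using the entry and
reference sizes above; the 8 GiB figure is just an assumption for
illustration:)

  # How many DDT entries can a given amount of ARC "cover", either held
  # directly in the ARC or merely referenced from the L2ARC (where each
  # L2ARC-resident block still costs an ARC header)?
  DDT_ENTRY_BYTES = 376   # per-entry size quoted in this thread
  L2ARC_REF_BYTES = 176   # ARC header per block kept in L2ARC

  arc_bytes = 8 * 2**30   # assume 8 GiB of ARC devoted to the DDT

  held_in_arc  = arc_bytes // DDT_ENTRY_BYTES
  reachable_l2 = arc_bytes // L2ARC_REF_BYTES
  print(held_in_arc, reachable_l2, reachable_l2 / held_in_arc)  # ratio ~2.1

So, under those assumptions, pushing the DDT out to L2ARC only buys you a
bit more than twice as many entries per byte of RAM - hence "just over
double" above.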

I think a fantastic idea for dealing with the DDT (and all other
metadata for that matter) would be an option to put (a copy of)
metadata exclusively on a SSD.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread opensolarisisdeadlongliveopensolaris
 From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
 Sent: Wednesday, August 01, 2012 9:56 AM
 
 On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Jim Klimov
 
  Availability of the DDT is IMHO crucial to a deduped pool, so
  I won't be surprised to see it forced to triple copies.
 
  IMHO, the more important thing for dedup moving forward is to create an
 option to dedicate a fast device (SSD or whatever) to the DDT.  So all those
 little random IO operations never hit the rusty side of the pool.
 
 That's something you can already do with an L2ARC. In the future I plan
 on investigating implementing a set of more fine-grained ARC and L2ARC
 policy tuning parameters that would give more control into the hands of
 admins over how the ARC/L2ARC cache is used.

L2ARC is a read cache.  Hence the R and C in L2ARC.
This means two major things:
#1  Writes don't benefit, 
and
#2  There's no way to load the whole DDT into the cache anyway.  So you're 
guaranteed to have performance degradation with the dedup.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread opensolarisisdeadlongliveopensolaris
 From: opensolarisisdeadlongliveopensolaris
 Sent: Wednesday, August 01, 2012 2:08 PM
  
 L2ARC is a read cache.  Hence the R and C in L2ARC.
 This means two major things:
 #1  Writes don't benefit,
 and
 #2  There's no way to load the whole DDT into the cache anyway.  So you're
 guaranteed to have performance degradation with the dedup.

In other words, the DDT is always written to rust (written to the main pool).
You gain some performance by adding ARC/L2ARC/log devices, but that can only
reduce the problem, not solve it.

The problem would be solved if you could choose to dedicate an SSD mirror for 
DDT, and either allow the pool size to be limited by the amount of DDT storage 
available, or overflow into the main pool if the DDT device got full.
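
(A toy sketch of those two policies - purely illustrative pseudologic, not
ZFS code:)

  # DDT blocks go to the dedicated device while it has room; when it fills,
  # either refuse further unique writes (capped pool) or spill to the pool.
  def place_ddt_block(ssd_free_bytes, block_bytes, allow_overflow):
      if ssd_free_bytes >= block_bytes:
          return "dedicated-ssd"
      return "main-pool" if allow_overflow else "no-space"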


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 22:07, opensolarisisdeadlongliveopensolaris пишет:

L2ARC is a read cache.  Hence the R and C in L2ARC.


R is replacement, but what the hell ;)


This means two major things:
#1  Writes don't benefit,
and
#2  There's no way to load the whole DDT into the cache anyway.  So you're 
guaranteed to have performance degradation with the dedup.


If the whole DDT does make it into the cache, or onto an SSD
storing an extra copy of all pool metadata, then searching
for a particular entry in DDT would be faster. When you write
(or delete) and need to update the counters in DDT, or even
ultimately remove an unreferenced entry, then you benefit on
writes as well - you don't take as long to find DDT entries
(or determine lack thereof) for the blocks you add or remove.

Or did I get your answer wrong? ;)

//Jim


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Tomas Forsman
On 01 August, 2012 - opensolarisisdeadlongliveopensolaris sent me these 1,8K 
bytes:

  From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
  Sent: Wednesday, August 01, 2012 9:56 AM
  
  On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
   From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
   boun...@opensolaris.org] On Behalf Of Jim Klimov
  
   Availability of the DDT is IMHO crucial to a deduped pool, so
   I won't be surprised to see it forced to triple copies.
  
   IMHO, the more important thing for dedup moving forward is to create an
  option to dedicate a fast device (SSD or whatever) to the DDT.  So all those
  little random IO operations never hit the rusty side of the pool.
  
  That's something you can already do with an L2ARC. In the future I plan
  on investigating implementing a set of more fine-grained ARC and L2ARC
  policy tuning parameters that would give more control into the hands of
  admins over how the ARC/L2ARC cache is used.
 
 L2ARC is a read cache.  Hence the R and C in L2ARC.

Adaptive Replacement Cache, right.

 This means two major things:
 #1  Writes don't benefit, 
 and
 #2  There's no way to load the whole DDT into the cache anyway.  So you're 
 guaranteed to have performance degradation with the dedup.

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 Well, there are at least a couple of failure scenarios where
 copies>1 are good:
 
 1) A single-disk pool, as in a laptop. Noise on the bus,
 media degradation, or any other reason to misread or
 miswrite a block can result in a failed pool. 

How does mac/win/lin handle this situation?  (Not counting btrfs.)

Such noise might result in a temporarily faulted pool (blue screen of death) 
that is fully recovered after reboot.  Meanwhile you're always paying for it in 
terms of performance, and it's all solvable via pool redundancy.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 2012-08-01 22:07, opensolarisisdeadlongliveopensolaris пишет:
  L2ARC is a read cache.  Hence the R and C in L2ARC.
 
 R is replacement, but what the hell ;)
 
  This means two major things:
  #1  Writes don't benefit,
  and
  #2  There's no way to load the whole DDT into the cache anyway.  So you're
 guaranteed to have performance degradation with the dedup.
 
 If the whole DDT does make it into the cache, or onto an SSD
 storing an extra copy of all pool metadata, then searching
 for a particular entry in DDT would be faster. When you write
 (or delete) and need to update the counters in DDT, or even
 ultimately remove an unreferenced entry, then you benefit on
 writes as well - you don't take as long to find DDT entries
 (or determine lack thereof) for the blocks you add or remove.
 
 Or did I get your answer wrong? ;)

Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot 
destroy (happens every 15 minutes) you've got a lot of entries you need to 
write.  Those are all scattered about the pool...  Even if you can find them 
fast, it's still a bear.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 23:40, opensolarisisdeadlongliveopensolaris пишет:


Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot 
destroy (happens every 15 minutes) you've got a lot of entries you need to 
write.  Those are all scattered about the pool...  Even if you can find them 
fast, it's still a bear.


No, these entries you need to update are scattered around your
SSD (be it ARC or a hypothetical SSD-based copy of metadata
which I also campaigned for some time ago). We agreed (or
assumed) that with SSDs in place you can find the DDT entries
to update relatively fast now. The values are changed in RAM
and flushed to disk as part of an upcoming TXG commit, likely
in a limited number of disk head strokes (lots to coalesce),
and the way I see it - the updated copy remains in the ARC
instead of the obsolete DDT entry, and can make it into L2ARC
sometime in the future, as well.

//Jim


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Jim Klimov

2012-08-01 23:34, opensolarisisdeadlongliveopensolaris пишет:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

Well, there are at least a couple of failure scenarios where
copies>1 are good:

1) A single-disk pool, as in a laptop. Noise on the bus,
 media degradation, or any other reason to misread or
 miswrite a block can result in a failed pool.


How does mac/win/lin handle this situation?  (Not counting btrfs.)

Such noise might result in a temporarily faulted pool (blue screen of death) 
that is fully recovered after reboot.



In some of my cases I was lucky enough to get a corrupted /sbin/init
or something like that once, and the box had no other BE's yet, so the
OS could not do anything reasonable after boot. It is different from a
corrupted zpool, but ended in a useless OS image due to one broken
sector nonetheless.


 Meanwhile you're always paying for it in terms of performance, and 
it's all solvable via pool redundancy.


For a single-disk box, copies IS the redundancy. ;)

The discussion did stray off from my original question, though ;)

//Jim


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Peter Jeremy
On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote:
I think a fantastic idea for dealing with the DDT (and all other
metadata for that matter) would be an option to put (a copy of)
metadata exclusively on a SSD.

This is on my wishlist as well.  I believe ZEVO supports it so possibly
it'll be available in ZFS in the near future.

-- 
Peter Jeremy




Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-31 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 
 The copies thing is really only for laptops, where the likelihood of
 redundancy is very low

ZFS also stores multiple copies of things that it considers extra important.  
I'm not sure what exactly - uber block, or stuff like that...

When you set the copies property, you're just making it apply to other stuff
that otherwise would be only 1.



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread GREGG WONDERLY

On Jul 29, 2012, at 3:12 PM, opensolarisisdeadlongliveopensolaris 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
   I wondered if the copies attribute can be considered sort
 of equivalent to the number of physical disks - limited to seek
 times though. Namely, for the same amount of storage on a 4-HDD
 box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
 even 4*3tb@copies=3, for example.
 
 The first question - reliability...
 
 copies might be on the same disk.  So it's not guaranteed to help if you 
 have a disk failure.

I thought I understood that copies would not be on the same disk, I guess I 
need to go read up on this again.

Gregg Wonderly


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread John Martin

On 07/29/12 14:52, Bob Friesenhahn wrote:


My opinion is that complete hard drive failure and block-level media
failure are two totally different things.


That would depend on the recovery behavior of the drive for
block-level media failure.  A drive whose firmware does excessive
(reports of up to 2 minutes) retries of a bad sector may be
indistinguishable from a failed drive.  See previous discussions
of the firmware differences between desktop and enterprise drives.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread Brandon High
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY gregg...@gmail.com wrote:
 I thought I understood that copies would not be on the same disk, I guess I 
 need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread Nico Williams
The copies thing is really only for laptops, where the likelihood of
redundancy is very low (there are some high-end laptops with multiple
drives, but those are relatively rare) and where this idea is better
than nothing.  It's also nice that copies can be set in a per-dataset
manner (whereas RAID-Zn and mirroring are for pool-wide redundancy,
not per-dataset), so you could set it > 1 on home directories but not
on /.

Nico
--


[zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-29 Thread Jim Klimov

Hello all,

  Over the past few years there have been many posts suggesting
that for modern HDDs (several TB in size, around 100-200MB/s best
speed) the rebuild times grow exponentially, so to build a well
protected pool with these disks one has to plan for about three
disks' worth of redundancy - that is, three- or four-way mirrors,
or raidz3 - just to allow systems to survive a disk outage (with
acceptably high probability of success) while one is resilvering.

  There were many posts on this matter from esteemed members of
the list, including (but certainly not limited to) these articles:
* https://blogs.oracle.com/ahl/entry/triple_parity_raid_z
* https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid
* http://queue.acm.org/detail.cfm?id=1670144
* http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

  Now, this brings me to such a question: when people build a
home-NAS box, they are quite constrained in terms of the number
of directly attached disks (about 4-6 bays), or even if they
use external JBODs - to the number of disks in them (up to 8,
which does allow a 5+3 raidz3 set in a single box, which still
seems like a large overhead to some buyers - a 4*2 mirror would
give about as much space and higher performance, but may have
unacceptably less redundancy). If I want to have considerable
storage, with proper reliability, and just a handful of drives,
what are my best options?

  I wondered if the copies attribute can be considered sort
of equivalent to the number of physical disks - limited to seek
times though. Namely, for the same amount of storage on a 4-HDD
box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
even 4*3tb@copies=3, for example.
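
(The back-of-the-envelope arithmetic behind those three variants - ignoring
raidz allocation overhead and the fact that copies does not guarantee
placement on distinct disks:)

  # Usable space for a 4-disk raidz1: roughly (disks - parity) * size / copies
  def usable_tb(disks, size_tb, parity=1, copies=1):
      return (disks - parity) * size_tb / copies

  for size_tb, copies in ((1, 1), (2, 2), (3, 3)):
      print(size_tb, copies, usable_tb(4, size_tb, copies=copies))
  # -> about 3 TB usable in each case, which is why they look comparable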

  To simplify the matters, let's assume that this is a small
box (under 10GB RAM) not using dedup, though it would likely
use compression :)

  Question to theorists and practitioners: is any of these options
better or worse than the others, in terms of reliability and
access/rebuild/scrub speeds, for either a single-sector error
or for a full-disk replacement?

  Would extra copies on larger disks actually provide the extra
reliability, or only add overheads and complicate/degrade the
situation?

  Would the use of several copies cripple the write speeds?

  Can the extra copies be used by zio scheduler to optimize and
speed up reads, like extra mirror sides would?

Thanks,
//Jim Klimov



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-29 Thread Roy Sigurd Karlsbakk
copies won't help much if the pool is unavailable. It may, however, help if,
say, you have a RAIDz2, and two drives die, and there are errors on a third
drive, but not sufficiently bad for zfs to reject the pool.

roy

- Opprinnelig melding -
 Hello all,
 
 Over the past few years there have been many posts suggesting
 that for modern HDDs (several TB size, around 100-200MB/s best
 speed) the rebuild times grow exponentially, so to build a well
 protected pool with these disks one has to plan for about three
 disk's worth of redundancy - that is, three- or four-way mirrors,
 or raidz3 - just to allow systems to survive a disk outage (with
 acceptably high probability of success) while one is resilvering.
 
 There were many posts on this matter from esteemed members of
 the list, including (but certainly not limited to) these articles:
 * https://blogs.oracle.com/ahl/entry/triple_parity_raid_z
 * https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid
 * http://queue.acm.org/detail.cfm?id=1670144
 *
 http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
 
 Now, this brings me to such a question: when people build a
 home-NAS box, they are quite constrained in terms of the number
 of directly attached disks (about 4-6 bays), or even if they
 use external JBODs - to the number of disks in them (up to 8,
 which does allow a 5+3 raidz3 set in a single box, which still
 seems like a large overhead to some buyers - a 4*2 mirror would
 give about as much space and higher performance, but may have
 unacceptably less redundancy). If I want to have considerable
 storage, with proper reliability, and just a handful of drives,
 what are my best options?
 
 I wondered if the copies attribute can be considered sort
 of equivalent to the number of physical disks - limited to seek
 times though. Namely, for the same amount of storage on a 4-HDD
 box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
 even 4*3tb@copies=3, for example.
 
 To simplify the matters, let's assume that this is a small
 box (under 10GB RAM) not using dedup, though it would likely
 use compression :)
 
 Question to theorists and practitioners: is any of these options
 better or worse than the others, in terms of reliability and
 access/rebuild/scrub speeds, for either a single-sector error
 or for a full-disk replacement?
 
 Would extra copies on larger disks actually provide the extra
 reliability, or only add overheads and complicate/degrade the
 situation?
 
 Would the use of several copies cripple the write speeds?
 
 Can the extra copies be used by zio scheduler to optimize and
 speed up reads, like extra mirror sides would?
 
 Thanks,
 //Jim Klimov
 

-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of xenotypic etymology. In most cases, adequate and relevant synonyms
exist in Norwegian.


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-29 Thread Bob Friesenhahn

On Sun, 29 Jul 2012, Jim Klimov wrote:


 Would extra copies on larger disks actually provide the extra
reliability, or only add overheads and complicate/degrade the
situation?


My opinion is that complete hard drive failure and block-level media 
failure are two totally different things.  Complete hard drive failure 
rates should not be directly related to total storage size whereas the 
probability of media failure per drive is directly related to total 
storage size.  Given this, and assuming that complete hard drive 
failure occurs much less often than partial media failure, using the 
copies feature should be pretty effective.



 Would the use of several copies cripple the write speeds?


It would reduce the write rate to 1/2, or divide it by whatever number of
copies you have requested.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-29 Thread opensolarisisdeadlongliveopensolaris
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
I wondered if the copies attribute can be considered sort
 of equivalent to the number of physical disks - limited to seek
 times though. Namely, for the same amount of storage on a 4-HDD
 box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
 even 4*3tb@copies=3, for example.

The first question - reliability...

copies might be on the same disk.  So it's not guaranteed to help if you have 
a disk failure.

Let's try this:  Take a disk, slice it into two partitions, and then make a 
mirror using the 2 partitions.  This is about as useful as the copies property. 
 Half the write speed, half the usable disk capacity, improved redundancy 
against bad blocks, but no better redundancy against disk failure.  (copies 
will actually be better, because unlike the partitioning scenario, copies 
will sometimes write the extra copies to other disks.)

Re: the assumption - lower performance with larger disks...  rebuild time 
growing exponentially...

I don't buy it, and I don't see that argument being made in any of the messages 
you referenced.  Rebuild time is dependent on the amount of data in the vdev 
and the layout of said data, so if you consider a mirror of 3T versus 6 vdev's 
all mirroring 500G, then in that situation the larger disks resilver slower.  
(Because it's a larger amount of data that needs to resilver.  You have to 
resilver all your data instead of 1/6th of your data.)
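
(To put rough numbers on that, using the 100-200MB/s figure from the original
post - a purely sequential best case that real resilvers, which are mostly
random I/O, never achieve:)

  # Best-case resilver time: amount of data divided by sequential throughput.
  def resilver_hours(data_tb, mb_per_s):
      return data_tb * 1e6 / mb_per_s / 3600

  print(resilver_hours(3.0, 150))   # ~5.6 hours to resilver 3 TB at 150 MB/s
  print(resilver_hours(0.5, 150))   # ~0.9 hours for a 500 GB vdev's worth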
