[zfs-discuss] zpool errors without fmdump or dmesg errors

2013-01-19 Thread Stephan Budach

Hi all,

I am running S11 on a Dell PE650. It has 5 zpools attached that are made
out of 240 drives, connected via fibre. On Thursday, all of a sudden,
two out of three zpools on one FC channel showed numerous errors, and one
of them showed this:


root@solaris11a:~# zpool status vsmPool01
  pool: vsmPool01
 state: SUSPENDED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jan 19 08:53:28 2013
344G scanned out of 24,7T at 128M/s, 55h18m to go
45,9G resilvered, 1,36% done
config:

NAME                          STATE     READ WRITE CKSUM
vsmPool01                     UNAVAIL      0     0     0  experienced I/O failures
  mirror-0                    UNAVAIL      0     0     0  experienced I/O failures
    c0t201A001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2006001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2005001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-1                    UNAVAIL      0     0     0  experienced I/O failures
    c0t2006001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t201B001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2007001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-2                    UNAVAIL      0     0     0  experienced I/O failures
    c0t2007001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t201C001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2008001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-3                    UNAVAIL      0     0     0  experienced I/O failures
    c0t2008001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2009001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t201D001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-4                    UNAVAIL      0     0     0  experienced I/O failures
    c0t2009001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t201E001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t200A001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-5                    UNAVAIL      0     0     0  experienced I/O failures
    c0t200A001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    spare-1                   UNAVAIL      0     0     0  experienced I/O failures
      c0t201F001378E06A18d0   UNAVAIL      0     0     0  experienced I/O failures
      c0t2015001378E0E198d0   UNAVAIL      0     0     0  experienced I/O failures  (resilvering)
    c0t200B001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-6                    UNAVAIL      0     0     0  experienced I/O failures
    c0t200B001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2020001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t200C001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-7                    UNAVAIL      0     0     0  experienced I/O failures
    spare-0                   UNAVAIL      0     0     0  experienced I/O failures
      c0t2021001378E06A18d0   UNAVAIL      0     0     0  experienced I/O failures
      c0t2014001378E0DE98d0   UNAVAIL      0     0     0  experienced I/O failures  (resilvering)
    c0t200D001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t200C001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-8                    UNAVAIL      0     0     0  experienced I/O failures
    c0t200D001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2022001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t200E001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-9                    UNAVAIL      0     0     0  experienced I/O failures
    c0t200E001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2023001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t200F001378E0E198d0     UNAVAIL      0     0     0  experienced I/O failures
  mirror-10                   UNAVAIL      0     0     0  experienced I/O failures
    c0t200F001378E0DE98d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2024001378E06A18d0     UNAVAIL      0     0     0  experienced I/O failures
    c0t2010001378E0E198d0     UNAVAIL      0     0     0

Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.

Oh, I forgot to mention - The above logic only makes sense for mirrors and 
stripes.  Not for raidz (or raid-5/6/dp in general)

If you have a pool of mirrors or stripes, the system isn't forced to subdivide 
a 4k block onto multiple disks, so it works very well.  But if you have a pool 
blocksize of 4k and let's say a 5-disk raidz (capacity of 4 disks) then the 4k 
block gets divided into 1k on each disk and 1k parity on the parity disk.  Now, 
since the hardware only supports block sizes of 4k ... You can see there's a 
lot of wasted space, and if you do a bunch of it, you'll also have a lot of 
wasted time waiting for seeks/latency.
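
A rough back-of-envelope sketch (Python) of the subdivision described above, assuming the block really is split evenly across the data disks and every fragment gets rounded up to a whole 4k physical sector; see Richard Elling's reply further down this digest for how the raidz allocator actually behaves:

    # Toy model of the worst case described above: a 4k logical block split
    # across the data disks of a raidz1, every fragment occupying a full sector.
    SECTOR = 4096  # 4k-sector ("advanced format") drives

    def naive_raidz1_alloc(block_size, ndisks, sector=SECTOR):
        data_disks = ndisks - 1                     # raidz1: one parity column
        fragment = -(-block_size // data_disks)     # ceil division
        per_disk = -(-fragment // sector) * sector  # each fragment rounded up to a sector
        return per_disk * ndisks                    # data columns plus the parity column

    used = naive_raidz1_alloc(4096, ndisks=5)
    print(used)                 # 20480 bytes on disk for 4096 bytes of data
    print(1 - 4096 / used)      # 0.8 -> 80% of the allocation is padding/parity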

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Stephan Budach

Hi,

I am always experiencing chksum errors while scrubbing my zpool(s), but
I have never experienced chksum errors while resilvering. Does anybody know
why that would be? This happens on all of my servers, Sun Fire 4170M2 and
Dell PE 650, and on any FC storage that I have.


I just had a major issue where two of my zpools were suspended because
every single drive had been marked as UNAVAIL due to "experienced
I/O failures".


Now, this zpool is made of 3-way mirrors and currently 13 out of 15 
vdevs are resilvering (which they had gone through yesterday as well) 
and I never got any error while resilvering. I have been all over the 
setup to find any glitch or bad part, but I couldn't come up with 
anything significant.


Doesn't this sound improbable? Wouldn't one expect to encounter other
chksum errors while resilvering is running?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Jim Klimov

Hello all,

  While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact unique - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.

  Thus these many unique deduped blocks are just a burden when my
system writes into the datasets with dedup enabled, when it walks the
superfluously large DDT, when it has to store this DDT on disk and in
ARC, maybe during the scrubbing... These entries bring lots of headache
(or performance degradation) for zero gain.

  So I thought it would be a nice feature to let ZFS go over the DDT
(I won't care if it requires to offline/export the pool) and evict the
entries with count==1 as well as locate the block-pointer tree entries
on disk and clear the dedup bits, making such blocks into regular unique
ones. This would require rewriting metadata (less DDT, new blockpointer)
but should not touch or reallocate the already-saved userdata (blocks'
contents) on the disk. The new BP without the dedup bit set would have
the same contents of other fields (though its parents would of course
have to be changed more - new DVAs, new checksums...)

  In the end my pool would only track as deduped those blocks which do
already have two or more references - which, given the static nature
of such backup box, should be enough (i.e. new full backups of the same
source data would remain deduped and use no extra space, while unique
data won't waste the resources being accounted as deduped).
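
To make the shape of the proposed walk concrete, a toy model in Python (a dict standing in for the DDT and plain objects for block pointers - purely illustrative, not ZFS code):

    # Conceptual sketch: evict DDT entries with refcount == 1 and clear the
    # dedup flag on the matching block pointers; user data stays where it is.
    from dataclasses import dataclass

    @dataclass
    class BlockPtr:
        checksum: str
        dedup: bool = False

    def undedup_unique(ddt, bptree):
        unique = {csum for csum, refcnt in ddt.items() if refcnt == 1}
        for bp in bptree:                  # walk the block-pointer tree
            if bp.dedup and bp.checksum in unique:
                bp.dedup = False           # rewrite BP metadata only
        for csum in unique:
            del ddt[csum]                  # shrink the DDT
        return len(unique)

    ddt = {"aa": 1, "bb": 3, "cc": 1}
    tree = [BlockPtr("aa", True), BlockPtr("bb", True), BlockPtr("cc", True)]
    print(undedup_unique(ddt, tree), ddt)  # 2 {'bb': 3}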

What do you think?
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Stephan Budach wrote:


Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs are 
resilvering (which they had gone through yesterday as well) and I never got 
any error while resilvering. I have been all over the setup to find any 
glitch or bad part, but I couldn't come up with anything significant.


Doesn't this sound improbable? Wouldn't one expect to encounter other chksum
errors while resilvering is running?


I can't attest to chksum errors since I have yet to see one on my 
machines (have seen several complete disk failures, or disks faulted 
by the system though).  Checksum errors are bad and not seeing them 
should be the normal case.


Resilver may in fact be just verifying that the pool disks are 
coherent via metadata.  This might happen if the fiber channel is 
flapping.


Regarding the dire fiber channel issue, are you using fiber channel 
switches or direct connections to the storage array(s)?  If you are 
using switches, are they stable or are they doing something terrible 
like resetting?  Do you have duplex connectivity?  Have you verified 
that your FC HBA's firmware is correct?


Did you check for messages in /var/adm/messages which might indicate 
when and how FC connectivity has been lost?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)

The way I get it, resilvering is related to scrubbing but limited
in impact such that it rebuilds a particular top-level vdev (i.e.
one of the component mirrors) with an assigned-bad and new device.

So they both should walk the block-pointer tree from the uberblock
(current BP tree root) until they ultimately read all the BP entries
and validate the userdata with checksums. But while scrub walks and
verifies the whole pool and fixes discrepancies (logging checksum
errors), the resilver verifies a particular TLVdev (and maybe has
a cut-off earliest TXG for disks which fell out of the pool and
later returned into it - with a known latest TXG that is assumed
valid on this disk) and the process expects there to be errors -
it is intent on (partially) rewriting one of the devices in it.
Hmmm... Maybe that's why there are no errors logged? I don't know :)

As for practice, I also have one Thumper that logs errors on a
couple of drives upon every scrub. I think it was related to
connectors, at least replugging the disks helped a lot (counts
went from tens per scrub to 0-3). One of the original 250Gb disks
was replaced with a 3Tb one and a 250Gb partition became part of
the old pool (the remainder became a new test pool over a single
device). Scrubbing the pools yields errors in those new 250Gb,
but never on the 2.75Tb single-disk pool... so go figure :)

Overall, intermittent errors might be attributed to non-ECC RAM/CPUs
(not our case), temperature affecting the mechanics and electronics
(conditioned server room - not our case), electric power variations
and noise (other systems in the room on the same and other UPSes
don't complain like this), and cable/connector/HBA degradation
(oxidation, wear, etc. - likely all that remains as a cause for us).
This example concerns the Thumper's internal disks, so at least we
can rule out problems with components further down the chain -
external cables, disk trays, etc...

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Stephan Budach

Am 19.01.13 18:17, schrieb Bob Friesenhahn:

On Sat, 19 Jan 2013, Stephan Budach wrote:


Now, this zpool is made of 3-way mirrors and currently 13 out of 15 
vdevs are resilvering (which they had gone through yesterday as well) 
and I never got any error while resilvering. I have been all over the 
setup to find any glitch or bad part, but I couldn't come up with 
anything significant.


Doesn't this sound improbable? Wouldn't one expect to encounter other
chksum errors while resilvering is running?


I can't attest to chksum errors since I have yet to see one on my 
machines (have seen several complete disk failures, or disks faulted 
by the system though).  Checksum errors are bad and not seeing them 
should be the normal case.
I know, and it's really bugging me that I seem to have these chksum
errors on all of my machines, be it Sun gear or Dell.


Resilver may in fact be just verifying that the pool disks are 
coherent via metadata.  This might happen if the fiber channel is 
flapping.


Regarding the dire fiber channel issue, are you using fiber channel 
switches or direct connections to the storage array(s)? If you are 
using switches, are they stable or are they doing something terrible 
like resetting?  Do you have duplex connectivity?  Have you verified 
that your FC HBA's firmware is correct?

Looking on my FC switches, I am noticing such errors like these:

[656][Thu Dec 06 03:33:04.795 UTC 2012][I][8600.001E][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged out of nameserver.]
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port: 2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port: 2][SYNC_ACQ]
[659][Thu Dec 06 03:37:10.582 UTC 2012][I][8600.001D][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged into nameserver.]
[660][Sun Dec 09 04:18:32.324 UTC 2012][I][8600.001E][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged out of nameserver.]
[661][Sun Dec 09 04:18:32.326 UTC 2012][I][8600.0020][Port][Port: 10][SYNC_LOSS]
[662][Sun Dec 09 04:18:32.913 UTC 2012][I][8600.001F][Port][Port: 10][SYNC_ACQ]
[663][Sun Dec 09 04:18:33.024 UTC 2012][I][8600.001D][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged into nameserver.]


Just ignore the timestamp, as it seems that the time is not set
correctly, but the dates match my two issues from today and Thursday,
which accounts for three days. I didn't catch that before, but it seems
to clearly indicate a problem with the FC connection…


But, what do I make of this information?



Did you check for messages in /var/adm/messages which might indicate 
when and how FC connectivity has been lost?
Well, this is the scariest part to me. Neither fmdump nor dmesg
showed anything that would indicate a connectivity issue - at least not
the last time.


Bob


Thanks,
Stephan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user 
requested scrubbing.  Unplugging a USB drive which is part of a mirror 
for a short while results in considerable activity when it is plugged 
back in.  It is as if zfs does not trust the device which was 
temporarily unplugged and does a full validation of it.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Bob Friesenhahn

On Sat, 19 Jan 2013, Stephan Budach wrote:


Just ignore the timestamp, as it seems that the time is not set correctly,
but the dates match my two issues from today and Thursday, which accounts for
three days. I didn't catch that before, but it seems to clearly indicate a
problem with the FC connection…


But, what do I make of this information?


I don't know, but the issue/problem seems to be below the zfs level, so
you need to fix that lower level before worrying about zfs.


Did you check for messages in /var/adm/messages which might indicate when 
and how FC connectivity has been lost?
Well, this is the scariest part to me. Neither fmdump nor dmesg showed
anything that would indicate a connectivity issue - at least not the last
time.


Weird.  I wonder if multipathing is working for you at all.  With my 
direct-connect setup, if a path is lost, then there is quite a lot of 
messaging to /var/adm/messages.  I also see a lot of messaging related 
to multipathing when the system boots and first starts using the 
array.  However, with the direct-connect setup, the HBA can report 
problems immediately if it sees a loss of signal.  Your issues might 
be on the other side of the switch (on the storage array side) so the 
local HBA does not see the problem and timeouts are used.  Make sure 
to check the logs in your storage array to see if it is encountering 
resets or flapping connectivity.


Do you have duplex switches so that there are fully-redundant paths, 
or is only one switch used?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 20:08, Bob Friesenhahn wrote:

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user requested
scrubbing.  Unplugging a USB drive which is part of a mirror for a short
while results in considerable activity when it is plugged back in.  It
is as if zfs does not trust the device which was temporarily unplugged
and does a full validation of it.


Now, THAT would be resilvering - and by default it should be a limited
one, with a cutoff at the last TXG known to the disk that went MIA/AWOL.
The disk's copy of the pool label (4 copies in fact) records the last
TXG it knew safely. So the resilver should only try to validate and
copy over the blocks whose BP entries' birth TXG number is above that.
And since these blocks' components (mirror copies or raidz parity/data
parts) are expected to be missing on this device, mismatches are likely
not reported - I am not sure there's any attempt to even detect them.
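
A toy sketch (Python) of that kind of cutoff - just the per-block filter a resilver might apply, given the last TXG recorded in the returning disk's labels; speculation on my part, not actual ZFS code:

    # Decide whether a block pointer needs to be resilvered onto a disk that
    # dropped out of top-level vdev `resilver_vdev` after committing `last_txg`.
    def needs_resilver(bp_birth_txg, bp_vdev, resilver_vdev, last_txg):
        if bp_vdev != resilver_vdev:
            return False                  # lives on another top-level vdev
        return bp_birth_txg > last_txg    # only data born while the disk was away

    # Disk on top-level vdev 3 went away after TXG 1000:
    blocks = [(990, 3), (1005, 3), (1010, 1), (1200, 3)]   # (birth TXG, vdev)
    print([b for b in blocks if needs_resilver(*b, resilver_vdev=3, last_txg=1000)])
    # [(1005, 3), (1200, 3)]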

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Stephan Budach

Am 19.01.13 20:18, schrieb Bob Friesenhahn:

On Sat, 19 Jan 2013, Stephan Budach wrote:


Just ignore the timestamp, as it seems that the time is not set
correctly, but the dates match my two issues from today and Thursday,
which accounts for three days. I didn't catch that before, but it
seems to clearly indicate a problem with the FC connection…


But, what do I make of this information?


I don't know, but the issue/problem seems to be below the zfs level, so
you need to fix that lower level before worrying about zfs.

Yes, I do think that as well.


Did you check for messages in /var/adm/messages which might indicate 
when and how FC connectivity has been lost?
Well, this is the scariest part to me. Neither fmdump nor dmesg
showed anything that would indicate a connectivity issue - at least
not the last time.


Weird.  I wonder if multipathing is working for you at all.  With my 
direct-connect setup, if a path is lost, then there is quite a lot of 
messaging to /var/adm/messages.  I also see a lot of messaging related 
to multipathing when the system boots and first starts using the 
array.  However, with the direct-connect setup, the HBA can report 
problems immediately if it sees a loss of signal.  Your issues might 
be on the other side of the switch (on the storage array side) so the 
local HBA does not see the problem and timeouts are used.  Make sure 
to check the logs in your storage array to see if it is encountering 
resets or flapping connectivity.

I will check that.


Do you have duplex switches so that there are fully-redundant paths, 
or is only one switch used?
Well, no… I don't have enough switch ports on my FC SAN, but we will
replace these SANboxes with Nexus switches from Cisco this year, and I
will have multipathing then.




Bob

Thanks,
Stephan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 20:23, Jim Klimov wrote:

On 2013-01-19 20:08, Bob Friesenhahn wrote:

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user requested
scrubbing.  Unplugging a USB drive which is part of a mirror for a short
while results in considerable activity when it is plugged back in.  It
is as if zfs does not trust the device which was temporarily unplugged
and does a full validation of it.


Now, THAT would be resilvering - and by default it should be a limited
one, with a cutoff at the last TXG known to the disk that went MIA/AWOL.
The disk's copy of the pool label (4 copies in fact) records the last
TXG it knew safely. So the resilver should only try to validate and
copy over the blocks whose BP entries' birth TXG number is above that.
And since these blocks' components (mirror copies or raidz parity/data
parts) are expected to be missing on this device, mismatches are likely
not reported - I am not sure there's any attempt to even detect them.


And regarding the considerable activity - AFAIK there is little way
for ZFS to reliably read and test TXGs newer than X other than to
walk the whole current tree of block pointers and go deeper into those
that match the filter (TLVDEV number in DVA, and optionally TXG numbers
in birth/physical fields).

So likely the resilver does much of the same activity that a full scrub
would - at least in terms of reading all of the pool's metadata (though
maybe not all copies thereof).

My 2c and my speculation,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Richard Elling
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Oh, I forgot to mention - The above logic only makes sense for mirrors and 
 stripes.  Not for raidz (or raid-5/6/dp in general)
 
 If you have a pool of mirrors or stripes, the system isn't forced to 
 subdivide a 4k block onto multiple disks, so it works very well.  But if you 
 have a pool blocksize of 4k and let's say a 5-disk raidz (capacity of 4 
 disks) then the 4k block gets divided into 1k on each disk and 1k parity on 
 the parity disk.  Now, since the hardware only supports block sizes of 4k ... 
 You can see there's a lot of wasted space, and if you do a bunch of it, 
 you'll also have a lot of wasted time waiting for seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised
of 4k sector disks, then there will be one data and one parity block. There
will not be 4 data + 1 parity with 75% space wastage. Rather, the space
allocation more closely resembles a variant of mirroring, like some vendors
call RAID-1E.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Jim Klimov

On 2013-01-19 23:39, Richard Elling wrote:

This is not quite true for raidz. If there is a 4k write to a raidz
comprised of 4k sector disks, then
there will be one data and one parity block. There will not be 4 data +
1 parity with 75%
space wastage. Rather, the space allocation more closely resembles a
variant of mirroring,
like some vendors call RAID-1E


I agree with this exact reply, but as I posted sometime late last year,
reporting on my digging in the bowels of ZFS and my problematic pool,
for a 6-disk raidz2 set I only saw allocations (including two parity
disks) divisible by 3 sectors, even if the amount of the (compressed)
userdata was not so rounded. I.e. I had either miniature files or tails
of files fitting into one sector plus two parities (overall a 3 sector
allocation), or tails ranging 2-4 sectors and occupying 6 with parity
(while 2 or 3 sectors could use just 4 or 5 w/parities, respectively).

I am not sure what these numbers mean - 3 being a case for one userdata
sector plus both parities or for half of 6-disk stripe - both such
explanations fit in my case.
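
For what it's worth, multiples of three are exactly what you would get if the raidz allocator rounds every allocation up to a multiple of (nparity + 1) sectors so that no unallocatable single-sector holes are left behind - my reading of the asize logic, sketched in Python (counting sectors, not bytes):

    # Sketch: data sectors spread over (ndisks - nparity) columns, one parity
    # sector per row per parity disk, and the total rounded up to a multiple
    # of (nparity + 1) sectors.
    def raidz_asize(data_sectors, ndisks, nparity):
        data_cols = ndisks - nparity
        rows = -(-data_sectors // data_cols)          # ceil: stripe rows needed
        asize = data_sectors + rows * nparity         # add parity sectors
        rem = asize % (nparity + 1)
        return asize if rem == 0 else asize + (nparity + 1) - rem

    for d in (1, 2, 3, 4):                            # tails on a 6-disk raidz2
        print(d, "->", raidz_asize(d, ndisks=6, nparity=2))
    # 1 -> 3, 2 -> 6, 3 -> 6, 4 -> 6  (vs. 3, 4, 5, 6 without the round-up)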

But yes, with current raidz allocation there are many ways to waste
space. And those small percentages (or not so small) do add up.
Rectifying this example, i.e. allocating only as much as is used,
does not seem like an incompatible on-disk format change, and should
be doable within the write-queue logic. Maybe it would cause tradeoffs
in efficiency; however, ZFS does explicitly rotate starting disks
of allocations every few megabytes in order to even out the loads
among spindles (normally parity disks don't have to be accessed -
unless mismatches occur on data disks). Disabling such padding would
only help achieve this goal and save space at the same time...

My 2c,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter.  This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.
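
A minimal sketch of that gate in Python, with a plain fixed-size Bloom filter standing in for the scalable on-disk one (the write-path names are just stand-ins to show the decision):

    # Toy Bloom-filter gate for the write path: first sight of a checksum goes
    # down the ordinary (non-dedup) path and is remembered; a probable repeat
    # goes down the dedup path and would land in the DDT.
    import hashlib

    class BloomFilter:
        def __init__(self, nbits=1 << 20, nhashes=4):
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, key):
            for i in range(self.nhashes):
                h = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(h[:8], "little") % self.nbits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    seen = BloomFilter()

    def write_block(checksum):
        if checksum in seen:         # probably seen before: take the dedup path
            return "dedup write"     # (stand-in for the DDT lookup/insert)
        seen.add(checksum)           # first sighting: remember it, write normally
        return "plain write"

    print(write_block(b"abc"), write_block(b"abc"))   # plain write dedup write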

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash.  It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Richard Elling
bloom filters are a great fit for this :-)

  -- richard



On Jan 19, 2013, at 5:59 PM, Nico Williams n...@cryptonector.com wrote:

 I've wanted a system where dedup applies only to blocks being written
 that have a good chance of being dups of others.
 
 I think one way to do this would be to keep a scalable Bloom filter
 (on disk) into which one inserts block hashes.
 
 To decide if a block needs dedup one would first check the Bloom
 filter, then if the block is in it, use the dedup code path, else the
 non-dedup codepath and insert the block in the Bloom filter.  This
 means that the filesystem would store *two* copies of any
 deduplicatious block, with one of those not being in the DDT.
 
 This would allow most writes of non-duplicate blocks to be faster than
 normal dedup writes, but still slower than normal non-dedup writes:
 the Bloom filter will add some cost.
 
 The nice thing about this is that Bloom filters can be sized to fit in
 main memory, and will be much smaller than the DDT.
 
 It's very likely that this is a bit too obvious to just work.
 
 Of course, it is easier to just use flash.  It's also easier to just
 not dedup: the most highly deduplicatious data (VM images) is
 relatively easy to manage using clones and snapshots, to a point
 anyways.
 
 Nico
 --
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss