Re: [zfs-discuss] Data loss by memory corruption?

2012-01-16 Thread John Martin

On 01/16/12 11:08, David Magda wrote:


The conclusions are hardly unreasonable:


While the reliability mechanisms in ZFS are able to provide reasonable
robustness against disk corruptions, memory corruptions still remain a
serious problem to data integrity.


I've heard the same thing said ("use ECC!") on this list many times over
the years.


I believe the whole paragraph quoted from the USENIX paper above is
important:

  While the reliability mechanisms in ZFS are able to
  provide reasonable robustness against disk corruptions,
  memory corruptions still remain a serious problem to
  data integrity. Our results for memory corruptions
  indicate cases where bad data is returned to the user,
  operations silently fail, and the whole system crashes.
  Our probability analysis shows that one single bit flip
  has small but non-negligible chances to cause failures
  such as reading/writing corrupt data and system crashing.

The authors provide probability calculations in section 6.3
for single bit flips.  ECC provides detection and correction
of single bit flips.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatibility

2012-01-16 Thread sol
Thanks for that, Matt, very reassuring :-)

>
> There is plenty of good will between everyone who's worked on ZFS -- current 
> Oracle employees, former employees, and those never employed by Oracle.  We 
> would all like to see all implementations of ZFS be the highest quality 
> possible.  I'd like to think that we all try to achieve that to the extent 
> that it is possible within our corporate priorities.
>
> --matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones

2012-01-16 Thread Matthew Ahrens
On Mon, Jan 16, 2012 at 11:34 AM, Jim Klimov  wrote:

> 2012-01-16 23:14, Matthew Ahrens wrote:
>
>> On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov wrote:
>>
>>While reading about zfs on-disk formats, I wondered once again
>>why is it not possible to create a snapshot on existing data,
>>not of the current TXG but of some older point-in-time?
>>
>>
>> It is not possible because the older data may no longer exist on-disk.
>>  For example, you want to take a snapshot from 10 txg's ago.  But since
>> then we have created a new file, which modified the containing
>> directory.  So we freed the directory block from 10 txg's ago.  That
>> freed block is then a candidate for reallocation.
>>
>> Existence of old uberblocks in the ring buffer does not indicate that
>> the data they reference is still valid.  This is the reason that "zpool
>> import -F" does not always work.
>>
>
> Hmmm... the way I got it (but again have no prooflinks handy)
> was that ZFS "recently" got a deferred-reuse feature to just
> guarantee those rollbacks, basically. I am not sure which
> builds or distros that might be included in.
>
> If you authoritatively say it's not there (or not in illumos),
> I'm going to trust you ;)
>

It's definitely not there in Illumos.  See TXG_DEFER_SIZE.  There was talk
of changing it at Oracle, don't know if that ever happened.  If you have a
S11 system you could probably use mdb to look at the size of the
ms_defermap.
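
A rough sketch of that kind of poking around (hedged: it assumes root access,
mdb -k with the ZFS dcmds/walkers available on that build, and that the field
is still called ms_defermap there; the metaslab_t address below is made up and
would have to be located by hand first):

  # list spa_t addresses for the imported pools
  echo '::walk spa' | mdb -k
  # print the defer maps of one metaslab (address is hypothetical)
  echo 'ffffff01ce80a000::print -t metaslab_t ms_defermap' | mdb -k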

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC, block based or file based?

2012-01-16 Thread Matthew Ahrens
On Fri, Jan 13, 2012 at 4:49 PM, Matt Banks  wrote:

> I'm sorry to be asking such a basic question that would seem to be easily
> found on Google, but after 30 minutes of "googling" and looking through
> this lists' archives, I haven't found a definitive answer.
>
> Is the L2ARC caching scheme based on files or blocks?
>

Blocks.


> The reason I ask: We have several databases that are stored in single
> large files of 500GB or more.
>
> So, is L2ARC doing us any good if the entire file can't be cached at once?
>

It will, if your working set is not larger than the L2ARC.
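
For example, a rough way to watch whether the working set fits is to look at
the L2ARC kstats (field names assumed from the usual arcstats; verify against
'kstat -n arcstats' on your release):

  # hit rate and current size of the L2ARC
  kstat -p zfs:0:arcstats | egrep 'l2_(size|hits|misses)'
  # a persistently low l2_hits/(l2_hits+l2_misses) ratio suggests the
  # working set is larger than ARC + L2ARC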

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones

2012-01-16 Thread Jim Klimov

2012-01-16 23:14, Matthew Ahrens wrote:

On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov <jimkli...@cos.ru> wrote:

While reading about zfs on-disk formats, I wondered once again
why is it not possible to create a snapshot on existing data,
not of the current TXG but of some older point-in-time?


It is not possible because the older data may no longer exist on-disk.
  For example, you want to take a snapshot from 10 txg's ago.  But since
then we have created a new file, which modified the containing
directory.  So we freed the directory block from 10 txg's ago.  That
freed block is then a candidate for reallocation.

Existence of old uberblocks in the ring buffer does not indicate that
the data they reference is still valid.  This is the reason that "zpool
import -F" does not always work.


Hmmm... the way I got it (but again have no prooflinks handy)
was that ZFS "recently" got a deferred-reuse feature to just
guarantee those rollbacks, basically. I am not sure which
builds or distros that might be included in.

If you authoritatively say it's not there (or not in illumos),
I'm going to trust you ;)

What about injecting snapshots into static data - before at
least one existing snapshot? Is that possible? I do get your
point about missing older directory data and possible invalidity
of the snapshot as a ZPL dataset (and probably a bad basis for
a writeable clone)... but let's call them checkpoints then, and
limit use for zfs send and fencing of erred ranges ;)

Is that technically possible or logically reasonable?

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatibility

2012-01-16 Thread Matthew Ahrens
On Thu, Jan 5, 2012 at 6:53 AM, sol  wrote:

>
> I would have liked to think that there was some good-will between the ex-
> and current-members of the zfs team, in the sense that the people who
> created zfs but then left Oracle still care about it enough to want the
> Oracle version to be as bug-free as possible.
>

There is plenty of good will between everyone who's worked on ZFS --
current Oracle employees, former employees, and those never employed by
Oracle.  We would all like to see all implementations of ZFS be the highest
quality possible.  I'd like to think that we all try to achieve that to the
extent that it is possible within our corporate priorities.

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Upgrade

2012-01-16 Thread Matthew Ahrens
On Thu, Jan 5, 2012 at 7:17 PM, Ivan Rodriguez  wrote:

> Dear list,
>
>  I'm about to upgrade a zpool from version 10 to version 29; I suppose
> this upgrade will fix several performance issues that are present on
> version 10. However, inside that pool we have several zfs filesystems,
> all of them at version 1. My first question is: is there a problem with
> performance, or any other problem, if you operate a version-29 zpool
> with version-1 zfs filesystems?
>

There is no problem using a recent pool version with an old filesystem
version.


> Is it better to upgrade zfs to the latest version?
>

Upgrade if you want the new filesystem features (e.g. case insensitivity,
user accounting & quotas).
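
For illustration (the pool name "tank" is made up; the flags are the standard
zpool/zfs upgrade options, so double-check the man pages on your release):

  zpool upgrade -v         # pool versions this release supports
  zfs upgrade -v           # filesystem versions this release supports
  zpool get version tank   # current pool version
  zfs get -r version tank  # current filesystem versions
  zfs upgrade -r tank      # upgrade every filesystem in the pool
  # or pin a specific target version: zfs upgrade -V 5 tank/somefs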


> Can we jump from zfs version 1 to 5?
>

Yes.


> Are there any implications for zfs send/receive with filesystems and
> pools of different versions?
>

Some filesystem versions require a corresponding pool version.  E.g. fs
version 4 (userquotas) requires pool version 15.  So you can not send a
version 4 filesystem to a pool that is version < 15.  Also, you can not
send a version X filesystem to a machine that does not understand that
filesystem version.
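
A hedged example of checking both ends before sending (host, pool, and
dataset names are made up):

  zfs get version tank/db                  # e.g. 4 (userquota) needs pool >= 15
  ssh recvhost zpool get version backup    # pool version on the receiving side
  ssh recvhost zfs upgrade -v              # fs versions the receiver understands
  zfs send tank/db@snap | ssh recvhost zfs receive backup/db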

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones

2012-01-16 Thread Matthew Ahrens
On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov  wrote:

> While reading about zfs on-disk formats, I wondered once again
> why is it not possible to create a snapshot on existing data,
> not of the current TXG but of some older point-in-time?
>

It is not possible because the older data may no longer exist on-disk.  For
example, you want to take a snapshot from 10 txg's ago.  But since then we
have created a new file, which modified the containing directory.  So we
freed the directory block from 10 txg's ago.  That freed block is then a
candidate for reallocation.

Existence of old uberblocks in the ring buffer does not indicate that the
data they reference is still valid.  This is the reason that "zpool import
-F" does not always work.
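
For illustration (the pool name "tank" and the device path are made up):

  zdb -l /dev/rdsk/c0t0d0s0  # dump the vdev labels, which hold the uberblock ring
  zdb -u tank                # show the currently active uberblock/txg
  zpool import -nF tank      # dry run: report whether a txg rewind would succeed
  zpool import -F tank       # rewind import; fails if the needed blocks were reused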

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread David Magda
On Mon, January 16, 2012 11:22, Bob Friesenhahn wrote:
> This seems very unlikely since the future needs of Apple show little
> requirement for zfs.  Apple only offers one computer model which
> provides ECC and a disk drive configuration which is marginally useful
> for zfs.  This computer model has a very limited user-base which is
> primarily people in the video and desktop imaging/publishing world.
> Apple already exited the server market, for which they only ever
> offered a single limited-use model (Xserve).

Having "real" snapshots would certainly help Time Machine. That said,
Apple has managed to add on-disk Time Machine snapshots and better
encryption in 10.7 (Lion) via a logical file manager:

http://arstechnica.com/apple/reviews/2011/07/mac-os-x-10-7.ars/13

Zpools aren't the only feature in ZFS after all. Oh well.

> There would likely be a market if someone was to sell pre-packaged zfs
> for Apple OS-X at a much higher price than the operating system
> itself.

Already a product:

http://tenscomplement.com/
http://arstechnica.com/apple/news/2011/03/how-zfs-is-slowly-making-its-way-to-mac-os-x.ars


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Chris Ridd

On 16 Jan 2012, at 16:56, Rich Teer wrote:

> On Mon, 16 Jan 2012, Freddie Cash wrote:
> 
>>> There would likely be a market if someone was to sell pre-packaged zfs for
>>> Apple OS-X at a much higher price than the operating system itself.
> 
> 10's Complement (?) are planning such a thing, although I have no idea
> on their pricing.  The software is still in development.

They have announced pricing for 2 of their 4 ZFS products.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Rich Teer
On Mon, 16 Jan 2012, Freddie Cash wrote:

> As an FS for their TimeMachine NAS boxes (Time Capsule, I think),
> though, ZFS would be a good fit.  Similar to how the Time Slider works
> in Sun/Oracle's version of Nautilus/GNOME2.  Especially if they expand
> the boxes to use 4 drives (2x mirror), and had the pool
> pre-configured.

Agreed.

> As a desktop/laptop FS, though, ZFS (in its current incarnation) is
> overkill and unwieldy.  Especially since most of these machines only
> have room for a single HD.

I respectfully disagree: end-to-end checksums are always a good thing,
and a simple single-drive {desk,lap}top could use a single pool and gain
all the benefits of ZFS with none of the "unwieldiness", although again
I disagree that ZFS is unwieldy.

> > There would likely be a market if someone was to sell pre-packaged zfs for
> > Apple OS-X at a much higher price than the operating system itself.

10's Complement (?) are planning such a thing, although I have no idea
on their pricing.  The software is still in development.

-- 
Rich Teer, Publisher
Vinylphile Magazine
.  *   * . * .* .
 .   *   .   .*
* .  . /\ ( .  . *
 . .  / .\   . * .
.*.  / *  \  . .
  . /*   o \ .
*   '''||'''   .
www.vinylphilemag.com   **
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Freddie Cash
On Mon, Jan 16, 2012 at 8:22 AM, Bob Friesenhahn
 wrote:
> On Mon, 16 Jan 2012, David Magda wrote:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html
>>
>> Perhaps Apple can come to an agreement with Oracle when they couldn't with
>> Sun.
>
> This seems very unlikely since the future needs of Apple show little
> requirement for zfs.  Apple only offers one computer model which provides
> ECC and a disk drive configuration which is marginally useful for zfs.  This
> computer model has a very limited user-base which is primarily people in the
> video and desktop imaging/publishing world. Apple already exited the server
> market, for which they only ever offered a single limited-use model (Xserve).

As an FS for their TimeMachine NAS boxes (Time Capsule, I think),
though, ZFS would be a good fit.  Similar to how the Time Slider works
in Sun/Oracle's version of Nautilus/GNOME2.  Especially if they expand
the boxes to use 4 drives (2x mirror), and had the pool
pre-configured.

As a desktop/laptop FS, though, ZFS (in its current incarnation) is
overkill and unwieldy.  Especially since most of these machines only
have room for a single HD.

> There would likely be a market if someone was to sell pre-packaged zfs for
> Apple OS-X at a much higher price than the operating system itself.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Bob Friesenhahn

On Mon, 16 Jan 2012, David Magda wrote:


http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html

Perhaps Apple can come to an agreement with Oracle when they couldn't with
Sun.


This seems very unlikely since the future needs of Apple show little 
requirement for zfs.  Apple only offers one computer model which 
provides ECC and a disk drive configuration which is marginally useful 
for zfs.  This computer model has a very limited user-base which is 
primarily people in the video and desktop imaging/publishing world. 
Apple already exited the server market, for which they only ever 
offered a single limited-use model (Xserve).


There would likely be a market if someone was to sell pre-packaged zfs 
for Apple OS-X at a much higher price than the operating system 
itself.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread David Magda
On Mon, January 16, 2012 11:05, Rich Teer wrote:
> On Sun, 15 Jan 2012, Toby Thain wrote:
>
>> Rumours have long circulated, even before the brief public debacle of
>> ZFS in OS X - "is it in Leopard...yes it's in...no it's not...yes it's
>> in...oh damn, it's really not" - that Apple is building their own clone
>> of ZFS.
>
> I don't know why Apple don't just get off the pot and officially adopt
> ZFS. I mean, they've embraced DTrace, so what's stopping them from using
> ZFS too?

This was discussed already:

>> [On Sat Oct 24 14:14:19 UTC 2009, David Magda wrote:]
>>
>> Apple can currently just take the ZFS CDDL code and incorporate it
>> (like they did with DTrace), but it may be that they wanted a "private
>> license" from Sun (with appropriate technical support and
>> indemnification), and the two entities couldn't come to mutually
>> agreeable terms.
>
> I cannot disclose details, but that is the essence of it.

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html

Perhaps Apple can come to an agreement with Oracle when they couldn't with
Sun.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss by memory corruption?

2012-01-16 Thread David Magda
On Mon, January 16, 2012 01:19, Richard Elling wrote:

>> [1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf
>
> Yes. Netapp has funded those researchers in the past. Looks like a FUD
> piece to me.
> Lookout everyone, the memory system you bought from Intel might suck!

From the paper:

> This material is based upon work supported by the National Science
> Foundation under the following grants: CCF-0621487, CNS-0509474,
> CNS-0834392, CCF-0811697, CCF-0811697, CCF-0937959, as well as by generous
> donations from NetApp, Inc, Sun Microsystems, and Google.

So Sun paid to FUD themselves?

The conclusions are hardly unreasonable:

> While the reliability mechanisms in ZFS are able to provide reasonable
> robustness against disk corruptions, memory corruptions still remain a
> serious problem to data integrity.

I've heard the same thing said ("use ECC!") on this list many times over
the years.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Rich Teer
On Sun, 15 Jan 2012, Toby Thain wrote:

> Rumours have long circulated, even before the brief public debacle of ZFS in
> OS X - "is it in Leopard...yes it's in...no it's not...yes it's in...oh damn,
> it's really not" - that Apple is building their own clone of ZFS.

I don't know why Apple don't just get off the pot and officially adopt ZFS.
I mean, they've embraced DTrace, so what's stopping them from using ZFS too?

-- 
Rich Teer, Publisher
Vinylphile Magazine
.  *   * . * .* .
 .   *   .   .*
* .  . /\ ( .  . *
 . .  / .\   . * .
.*.  / *  \  . .
  . /*   o \ .
*   '''||'''   .
www.vinylphilemag.com   **
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-16 Thread Gary Mills
On Mon, Jan 16, 2012 at 09:13:03AM -0600, Bob Friesenhahn wrote:
> On Mon, 16 Jan 2012, Jim Klimov wrote:
> >
> >I think that in order to create a truly fragmented ZFS layout,
> >Edward needs to do sync writes (without a ZIL?) so that every
> >block and its metadata go to disk (coalesced as they may be)
> >and no two blocks of the file would be sequenced on disk together.
> >Although creating snapshots should give that effect...
> 
> In my experience, most files on Unix systems are re-written from
> scratch.  For example, when one edits a file in an editor, the editor
> loads the file into memory, performs the edit, and then writes out
> the whole file.  Given sufficient free disk space, these files are
> unlikely to be fragmented.
> 
> Slowly written log files and random-access databases are the worst
> cases for causing fragmentation.

The case I've seen was with an IMAP server with many users.  E-mail
folders were represented as ZFS directories, and e-mail messages as
files within those directories.  New messages arrived randomly in the
INBOX folder, so that those files were written all over the place on
the storage.  Users also deleted many messages from their INBOX
folder, but the files were retained in snapshots for two weeks.  On
IMAP session startup, the server typically had to read all of the
messages in the INBOX folder, making this portion slow.  The server
also had to refresh the folder whenever new messages arrived, making
that portion slow as well.  Performance degraded when the storage
became 50% full.  It would increase markedly when the oldest snapshot
was deleted.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-16 Thread Bob Friesenhahn

On Mon, 16 Jan 2012, Jim Klimov wrote:


I think that in order to create a truly fragmented ZFS layout,
Edward needs to do sync writes (without a ZIL?) so that every
block and its metadata go to disk (coalesced as they may be)
and no two blocks of the file would be sequenced on disk together.
Although creating snapshots should give that effect...


Creating snapshots does not in itself cause fragmentation since COW 
would cause that level of fragmentation to exist anyway.  However, 
snapshots cause old blocks to be maintained so the disk becomes more 
full, fresh blocks may be less appropriately situated, and the disk 
seeks may become more expensive due to needing to seek over more 
tracks.


In my experience, most files on Unix systems are re-written from 
scratch.  For example, when one edits a file in an editor, the editor 
loads the file into memory, performs the edit, and then writes out the 
whole file.  Given sufficient free disk space, these files are 
unlikely to be fragmented.


Slowly written log files and random-access databases are the worst 
cases for causing fragmentation.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-16 Thread Jim Klimov

2012-01-16 8:39, Bob Friesenhahn wrote:

On Sun, 15 Jan 2012, Edward Ned Harvey wrote:


While I'm waiting for this to run, I'll make some predictions:
The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
the initial sequential file should take ~16 sec
After fragmentation, it should be essentially random 4k fragments (32768
bits). I figure each time the head is able to find useful data, it takes


The 4k fragments is the part I don't agree with. Zfs does not do that.
If you were to run raidzN over a wide enough array of disks you could
end up with 4K fragments (distributed across the disks), but then you
would always have 4K fragments.



I think that in order to create a truly fragmented ZFS layout,
Edward needs to do sync writes (without a ZIL?) so that every
block and its metadata go to disk (coalesced as they may be)
and no two blocks of the file would be sequenced on disk together.
Although creating snapshots should give that effect...

He would have to fight hard to defeat ZFS's anti-fragmentation
attempts overall - while this is possible on very full pools ;)
Hint: pre-fill Ed's test pool to 90%, then run the tests :)

I think that to go forward about discussing defragmentation
tools, we should define a metric of fragmentation - as Bob and
Edward have often brought up. This implies accounting for
the effects on end-user of some mix of factors like:

1) Size of "free" reads and writes, i.e. cheap prefetch of
   a HDD's track as opposed to seeking; reads of an SSD block
   (those 256KB that are sliced into 4/8KB pages) as opposed
   to random reads of pages from separate SSD blocks.
   Seeks to neighboring tracks may be faster than full-disk
   seeks, but they are slower than no seeks at all.

   For an optimal read-performance, we might want to prefetch
   whole tracks/blocks (not 64Kb from the position of ZFS's
   wanted block, but the whole track including this block,
   reversely knowing the sector numbers of start and end).

   Effect: we might not need to fully defragment data, but
   rather make long-enough ranges "correctly" positioned
   on the media. These may span neighboring tracks/blocks.

   We do need to know media's performance characteristics
   to do this optimally (i.e. which disk tracks have which
   byte-lengths, and where does each circle start in terms
   of LBA offsets).

   Also, disks' internal reallocation to spare blocks
   may lead to uncontrollable random seeks, degrading
   performance over time, but an FS is unlikely to have
   control or knowledge of that.

   Metric: start-addresses and lengths of fastest-read
   locations (i.e. whole tracks or SSD blocks) on leaf
   storage. May be variable within the storage device.


2) In case of ZFS - reads of contiguously allocated and
   monotonously increasing block numbers of data from a
   file's or zvol's most current version (live dataset
   as opposed to block history change in snapshots and
   the monotonous TXG number increase in on-disk blocks).
   This may be in unresolvable conflict with clones and
   deduplication, so some files or volumes can not be
   made contiguous without breaking continuity of others.
   Still, some "overall contiguousness" can be optimised.

   For users it might also be important to have many files
   from some directory stored close to each other, especially
   if these are small files used together somehow (sourcecode,
   thumbnails, whatever).

   Effect: fast reads of most-current datasets.
   Metric: length of continuous (DVA) stretches of current
   logical block numbers of userdata divided by total data
   size. Amount of separate fragments somehow included ;)
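   (A rough zdb-based way to eyeball this is sketched after this list.)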

3) In case of ZFS - fast access to metadata, especially
   branches of the current blockpointer tree in sequence
   of increasing TXG numbers.

   Effect: fast reads of metadata, i.e. scrubbing.
   Metric: length of continuous (DVA) stretches of current
   block pointer trees in same-or-increasing TXG numbers
   divided by total size of the tree (branch).
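
   A hedged illustration of the DVA-contiguity metric from item 2
   (dataset name and object number below are made up, and the exact
   zdb output format varies by build):

      # dump the block pointer tree of one file object; the L0 entries
      # carry DVA[0]=<vdev:offset:size> triples whose offsets show how
      # contiguously the file's blocks sit on disk
      zdb -ddddd tank/home 1234 | grep 'L0'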

There is likely no absolute fragmentation or defragmentation,
but there are some optimisations. For example, ZFS's attempts
to coalesce 10Mb of data during one write into one metaslab
may suffice. And we do actually see performance hits when it
can't find stretches long enough (quickly enough) with pools
over empirical 80% fill-up. Defragmentation might set the aim
of clearing up enough 10Mb-long stretches of free space and
relocate smaller fragments of current user-data or {monotonous
BPTree} metadata into these clearings.

In particular, even if we have old data in snapshots, but
it is stored in long 10Mb+ contiguous stretches, we might
just leave it there. It is already about as good as it gets.

Also, as I proposed elsewhere, the metadata might be stored
in separate stretches of physical disk space - thus different
aims of defragmenting userdata and metadata (and free space)
would not conflict.

What do you think?
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Jim Klimov

Thanks again for answering! :)

2012-01-16 10:08, Richard Elling wrote:

On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:


"Does raidzN actually protect against bitrot?"
That's a kind of radical, possibly offensive, question formula
that I have lately.


Simple answer: no. raidz provides data protection. Checksums verify
data is correct. Two different parts of the storage solution.


Meaning - a data-block checksum mismatch allows an error to be detected;
afterwards, raidz permutations matching the checksum allow it to be fixed
(if enough redundancy is available)? Right?


raidz uses an algorithm to try permutations of data and parity to
verify against the checksum. Once the checksum matches, repair
can begin.


Ok, nice to have this statement confirmed so many times now ;)

How do per-disk cksum errors get counted then for raidz - thanks
to permutation of fixable errors we can detect which disk:sector
returned mismatching data? Likewise, for unfixable errors we can't
know the faulty disk - unless one had explicitly erred?

So, if my 6-disk raidz2 couldn't fix the error, it either occurred
on 3 disks' parts of one stripe, or in RAM/CPU (a SPOF) before the
data and checksum were written to disk? In the latter case no single
disk is at fault for returning bad data, so the per-disk cksum
counters stay at zero? ;)



2*) How are the sector-ranges on-physical-disk addressed by
ZFS? Are there special block pointers with some sort of
physical LBA addresses in place of DVAs and with checksums?
I think there should be (claimed end-to-end checksumming)
but wanted to confirm.


No.


Ok, so basically there is the vdev_raidz_map_alloc() algorithm
to convert DVAs into leaf addresses, and it is always going to
be the same for all raidz's?

For example, such lack of explicit addressing would not let ZFS
reallocate one disk's bad media sector into another location -
the disk is always expected to do that reliably and successfully?




2**) Alternatively, how does raidzN get into situation like
"I know there is an error somewhere, but don't know where"?
Does this signal simultaneous failures in different disks
of one stripe?
How *do* some things get fixed then - can only dittoed data
or metadata be salvaged from second good copies on raidZ?


No. See the seminal blog on raidz
http://blogs.oracle.com/bonwick/entry/raid_z



3) Is it true that in recent ZFS the metadata is stored in
a mirrored layout, even for raidzN pools? That is, does
the raidzN layout only apply to userdata blocks now?
If "yes":


Yes, for Solaris 11. No, for all other implementations, at this time.


Are there plans to do this for illumos, etc.?
I thought that my oi_148a's disks' IO patterns matched the
idea of mirroring metadata; now I'll have to explain that
data with some secondary ideas ;)




3*) Is such mirroring applied over physical VDEVs or over
top-level VDEVs? For certain 512/4096 bytes of a metadata
block, are there two (ditto-mirror) or more (ditto over
raidz) physical sectors of storage directly involved?


It is done in the top-level vdev. For more information see the manual,
"What's New in ZFS?" - Oracle Solaris ZFS Administration Guide:
docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html



3**) If small blocks, sized 1-or-few sectors, are fanned out
in incomplete raidz stripes (i.e. 512b parity + 512b data)
does this actually lead to +100% overhead for small data,
double that (200%) for dittoed data/copies=2?


The term "incomplete" does not apply here. The stripe written is
complete: data + parity.


Just to clarify, I meant variable-width stripes as opposed
to "full-width stripe" writes in other RAIDs. That is, to
update one sector of data on a 6-disk raid6 I'd need to
write 6 sectors; while on raidz2 I need to write only two.
No extra reply solicited here ;)


Does this apply to metadata in particular? ;)


lost context here, for non-Solaris 11 implementations, metadata is
no different than data with copies=[23]


The question here was whether writes of metadata (assumed
to be a small number of sectors down to one per block)
incur writes of parity, of ditto copies, or of parity and
copies, increasing storage requirements by several times.

One background thought was that I wanted to make sense of
my last year's experience with a zvol whose blocksize was
1 sector (4kb), and the metadata overhead (consumption of
free space) was about the same as userdata size. At that
time I thought it's because I have a 1-sector metadata
block to address each 1-sector data block of the volume;
but now I think the overhead would be closer to 400% of
userdata size...




Does this large factor apply to ZVOLs with fixed block
size being defined "small" (i.e. down to the minimum 512b/4k
available for these disks)?


NB, there are a few slides in my ZFS tutorials where we talk about this.
http://www.slideshare.net/relling/useni