> Maybe this is a dumb question, but I've never written a
> filesystem.  Is there a fundamental reason why you cannot have
> some files mirrored, with others as raidz, and others with no
> resilience? This would allow a pool to initially exist on one
> disk, then gracefully change between different resilience
> strategies as you add disks and the requirements change.

Actually, it's an excellent question.  And a deep one.
It goes to the very heart of why the traditional factoring
of storage into filesystems and volumes is such a bad idea.

In a typical filesystem, each block is represented by a small
integer -- typically 32 or 64 bits -- indicating its location
on disk.  To make a filesystem talk to multiple disks, you
either need to add another integer -- a device number -- to
each block pointer, or you need to generate virtual block
numbers.  Doing the former requires modifying the filesystem;
doing the latter does not, which is why volumes caught on
in the first place.  It was expedient.
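
To make that concrete, here's a rough sketch in C of the two options
(the structures are made up purely for illustration -- they're not
actual filesystem or ZFS definitions):

    #include <stdint.h>

    /* A traditional filesystem block pointer: just a block number. */
    typedef uint64_t blkno_t;

    /* Option 1: add a device number to every block pointer.
     * This requires changing the filesystem's on-disk format. */
    struct dev_blkptr {
            uint32_t dev;           /* which disk */
            uint64_t blkno;         /* block number on that disk */
    };

    /* Option 2: a volume manager presents one flat virtual block
     * space and translates virtual block numbers to (disk, block)
     * underneath.  The filesystem keeps using plain blkno_t and
     * never knows the "disk" isn't a single physical device. */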

The simplest example of block virtualization is a concatenation
of two disks.  For simplicity, assume all disks have 100 blocks.
To create a 200-block volume using disks A and B, we assign virtual
blocks 0-99 to A and 100-199 to B.  As far as the filesystem is
concerned, it's just looking at a 200-block logical device.
But when it issues a read for (say) logical block 137, the volume
manager will actually map that to physical block 37 of disk B.
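
In code, the mapping might look something like this (a toy sketch,
assuming two 100-block disks):

    #include <stdint.h>

    #define DISK_BLOCKS 100

    struct phys_loc {
            int      disk;          /* 0 = disk A, 1 = disk B */
            uint64_t blkno;         /* physical block on that disk */
    };

    static struct phys_loc
    concat_map(uint64_t vblk)
    {
            struct phys_loc loc;

            loc.disk  = vblk / DISK_BLOCKS;  /* 0-99 -> A, 100-199 -> B */
            loc.blkno = vblk % DISK_BLOCKS;  /* e.g. virtual 137 -> block 37 of B */
            return (loc);
    }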

A stripe (RAID-0) is similar, except that instead of putting
the low blocks on A and the high ones on B, you put the even
ones on A and the odd ones on B.  So disk A stores virtual
blocks 0, 2, 4, 6, ... on physical blocks 0, 1, 2, 3, etc.
The advantage of striping is that when you issue a read of
(say) 10 blocks, that maps into 5 blocks on each disk, and you
can read from those disks in parallel.  So you get up to double
the bandwidth (less for small I/O, because then the per-I/O
overhead dominates, but I digress).
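
The stripe mapping is just as small (same toy phys_loc struct as in
the concatenation sketch above):

    /* Two-disk stripe: even virtual blocks on A, odd ones on B,
     * each stored contiguously on its disk. */
    static struct phys_loc
    stripe_map(uint64_t vblk)
    {
            struct phys_loc loc;

            loc.disk  = vblk % 2;   /* even -> A, odd -> B */
            loc.blkno = vblk / 2;   /* virtual 0, 2, 4, ... -> physical 0, 1, 2, ... */
            return (loc);
    }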

A mirror (RAID-1) is even simpler -- it's just a 1-1 mapping
of logical to physical block numbers on two or more disks.

RAID-4 is only slightly more complex.  The rule here is that all
disks XOR to zero (i.e., if you XOR the nth block of each disk
together, you get a block of zeroes), so you can lose any one disk
and still be able to reconstruct the data.  The block mapping is
just like a stripe, except that there's a parity disk as well.
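
Reconstruction is nothing more than XOR -- something like this sketch
(illustrative only, not real RAID code):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCKSIZE 512

    /* Rebuild a missing block by XORing together all the surviving
     * blocks at the same offset (the remaining data blocks plus the
     * parity block, in any order). */
    static void
    xor_reconstruct(uint8_t *missing, uint8_t *const surviving[], int nsurviving)
    {
            for (size_t off = 0; off < BLOCKSIZE; off++) {
                    uint8_t x = 0;

                    for (int d = 0; d < nsurviving; d++)
                            x ^= surviving[d][off];
                    missing[off] = x;
            }
    }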

RAID-5 is like RAID-4, but the parity rotates at some fixed
interval so that you don't have a single 'hot' parity disk.
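
One common rotation scheme (again just a sketch -- real implementations
vary) puts the parity for stripe s on disk (s mod n):

    /* Which disk holds the parity for a given stripe, rotating
     * round-robin so no single disk takes all the parity writes. */
    static int
    raid5_parity_disk(uint64_t stripe, int ndisks)
    {
            return ((int)(stripe % ndisks));
    }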

RAID-6 is a variant on RAID-4/5 that (using somewhat subtler
mathematics) can survive two disk failures, not just one.

Now here's the key limitation of this scheme, which is so obvious
that it's easy to miss:  the relationship between replicas of your
data is expressed in terms of the *devices*, not the *data*.

That's why a traditional filesystem can't offer different
RAID levels using the same devices -- because the RAID levels
are device-wide in nature.  In a mirror, all disks are identical.
In a RAID-4/5 group, all disks XOR to zero.  Mixing (say) mirroring
with RAID-5 doesn't work because in the event of disk failure, the
volume manager would have no idea how to reconstruct missing data.

RAID-Z takes a different approach.  We were designing a filesystem
as well, so we could make the block pointers as semantically rich
as we wanted.  To that end, the block pointers in ZFS contain data
layout information.  One nice side effect of this is that we don't
need fixed-width RAID stripes.  If you have 4+1 RAID-Z, we'll store
128k as 4x32k plus 32k of parity, just like any RAID system would.
But if you only need to store 3 sectors, we won't do a partial-stripe
update of an existing 5-wide stripe; instead, we'll just allocate
four sectors, and store the data and its parity.  The stripe width
is variable on a per-block basis.  And, although we don't support it
yet, so is the replication model.  The rule for how to reconstruct
a given block is described explicitly in the block pointer, not
implicitly by the device configuration.
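
For a rough feel of the arithmetic (my own sketch, not ZFS source),
here's how a 4+1 RAID-Z group might size an allocation:

    #include <stdint.h>

    /* Sectors allocated for a block of data_sectors, given ndata
     * data columns and nparity parity columns: one parity sector
     * per (up to) ndata data sectors.  The stripe width follows
     * the block, not the other way around. */
    static uint64_t
    raidz_asize(uint64_t data_sectors, int ndata, int nparity)
    {
            uint64_t parity = nparity * ((data_sectors + ndata - 1) / ndata);

            return (data_sectors + parity);
    }

    /* 128k = 256 512-byte sectors: 256 + 64 parity = 320 sectors
     * (4x32k of data plus 32k of parity).  A 3-sector write:
     * 3 + 1 = 4 sectors, with no read-modify-write of an old stripe. */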

So to answer your question: no, it's not pie in the sky.  It's a
great idea.  Per-file or even per-block replication is something
we've thought about in depth, built into the on-disk format,
and plan to support in the future.

The main issues are administrative.  ZFS is all about ease of use
(when it's not busy being all about data integrity), so getting the
interface to be simple and intuitive is important -- and not as
simple as it sounds.  If your free disk space might be used for
single-copy data, or might be used for mirrored data, then
how much free space do you have?  Questions like that need
to be answered, and answered in ways that make sense.

(Note: would anyone ever really want per-block replication levels?
It's not as crazy as it sounds.  A couple of examples: replicating
only the first block, so that even if you lose data, you still know the
file type and have some idea of what it contained; replicating only the first
(say) 1GB, so that most files are replicated, but giant mpegs and
core files aren't; or in a database, replicating only those records
that have a particular field set.)

Jeff
