[zfs-code] Extending RAIDZ.

James Blackburn Thu, 13 Sep 2007 22:15:34 +0100

On 9/12/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
> > Well I read this email having just written a mammoth one in the other
> > thread, my thoughts:
> >
> > The main difficulty in this, as far as I see it, is you're
> > intentionally moving data on a checksummed copy-on-write filesystem
> > ;).  At the very least this is creating lots of work before we even
> > start to address the problem (and given that the ZFS guys are
> > undoubtedly working on device removal, that effort would be wasted).
> > I think this is probably more difficult than it's worth -- re-writing
> > data should be a separate non RAID-Z specific feature (once you're
> > changing the block pointers, you need to update the checksums, and you
> > need to ensure that you're maintaining consistency, preserve
> > snapshots, etc. etc.). Surely it would be much easier to leave the
> > data as is and version the array's disk layout?
>
> I've some time to experiment with my idea. What I did was:
>
> 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children; this
>    lets me use a hacked-up 'zpool attach' with RAIDZ.
> 2. Turn on logging of all writes into the RAIDZ vdev (offset+size).
> 3. zpool create tank raidz disk0 disk1 disk2
> 4. zpool attach tank disk0 disk3
> 5. zpool export tank
> 6. Backout 1.
> 7. Use a special tool, that will read all blocks written earlier. I use
>    only three disks for reading and logged offset+size pairs.
> 8. Use the same tool to write the data back, but now use four disks.
> 9. Try to: zpool import tank
>
> Yeah, 9 fails. It shows that pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know details
> about vdev internals, like the number of disks, I think. What I basically
> did was adding one disk. ZFS can ask raidz vdev for a block using
> exactly the same offset+size as before. This should be enough, but
> isn't. Checksum is stored with a block pointer in a leaf vdev? If so,
> why?


All that's currently needed to resolve a block pointer is the vdev +
offset.  Of course the checksum needs to be correct too.  So, assuming
that you have added the extra disk, moved the blocks around, and
updated the offsets correctly, the likely problem is checksums.  Every
block pointer stores the checksum of its child, and the pointer itself
lives inside a block that is checksummed by its own parent.  So if you
change a block's location and update the block pointer offset, the
block holding that pointer changes, and the checksum stored in the
parent's block pointer will be wrong.
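
To make that concrete, here is a rough sketch of the effect. It is a toy model, not ZFS code: `BlockPtr`, `serialize()` and the offsets are made up for illustration, SHA-256 just stands in for whatever checksum the pool uses, and a real block pointer carries more fields than this.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPtr:
    """Toy block pointer: where the child lives, plus the child's checksum."""
    vdev: int
    offset: int
    checksum: bytes


def checksum(data: bytes) -> bytes:
    # Stand-in checksum function (assumption: any strong checksum behaves alike here).
    return hashlib.sha256(data).digest()


def serialize(ptrs) -> bytes:
    # Model an indirect block as the concatenation of the pointers it holds.
    return b"".join(
        p.vdev.to_bytes(4, "little") + p.offset.to_bytes(8, "little") + p.checksum
        for p in ptrs
    )


# Leaf data block, the indirect block pointing at it, and the pointer to that
# indirect block (held one level further up).
leaf = b"file data"
leaf_ptr = BlockPtr(vdev=0, offset=0x1000, checksum=checksum(leaf))
indirect = serialize([leaf_ptr])
root_ptr = BlockPtr(vdev=0, offset=0x8000, checksum=checksum(indirect))

# Move the leaf: its data is unchanged, so the checksum in its pointer is
# still valid, but the pointer's offset (and so the indirect block) changes.
moved_ptr = BlockPtr(vdev=0, offset=0x2000, checksum=leaf_ptr.checksum)
new_indirect = serialize([moved_ptr])

leaf_ok = checksum(leaf) == moved_ptr.checksum           # leaf still verifies
parent_ok = checksum(new_indirect) == root_ptr.checksum  # indirect block no longer does

print("leaf verifies:", leaf_ok)
print("indirect block verifies:", parent_ok)
```

The leaf still verifies against its own pointer, but the block containing the updated pointer no longer matches the checksum held one level up.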

If you're re-writing/moving data, you'll need to re-write checksums as
well, or switch them off.
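
As a sketch of what "re-write checksums as well" implies, the toy model below rebuilds every ancestor of a moved block, from the bottom up. The pointer encoding (`make_ptr`), the single-pointer chain and the offsets are assumptions made for the example. It also keeps the ancestors at their old offsets, which a copy-on-write filesystem would not normally do.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Stand-in checksum function for the sketch.
    return hashlib.sha256(data).digest()


def make_ptr(offset: int, child: bytes) -> bytes:
    # Hypothetical encoding: 8-byte offset followed by the child's checksum.
    return offset.to_bytes(8, "little") + checksum(child)


def build_chain(leaf: bytes, offsets):
    """Return blocks from leaf to root.

    blocks[i + 1] holds a single pointer to blocks[i], which is stored
    at offsets[i].
    """
    blocks = [leaf]
    for off in offsets:
        blocks.append(make_ptr(off, blocks[-1]))
    return blocks


def verify(blocks) -> bool:
    # Each block's stored checksum must match the block directly below it.
    return all(
        parent[8:] == checksum(child)
        for parent, child in zip(blocks[1:], blocks[:-1])
    )


leaf = b"file data"
original = build_chain(leaf, [0x1000, 0x8000, 0x20000])

# Naive move: only the pointer to the leaf is updated.
naive = list(original)
naive[1] = make_ptr(0x2000, leaf)

# Full fix: the new offset is propagated, so every ancestor is re-checksummed.
fixed = build_chain(leaf, [0x2000, 0x8000, 0x20000])

print("original verifies:", verify(original))
print("naive move verifies:", verify(naive))
print("re-checksummed chain verifies:", verify(fixed))
```

Note that moving one leaf changes every block on the path to the root, which is why the rewrite cannot stop at the first pointer.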

James
