> Maybe this is a dumb question, but I've never written a
> filesystem -- is there a fundamental reason why you cannot have
> some files mirrored, with others as raidz, and others with no
> resilience? This would allow a pool to initially exist on one
> disk, then gracefully change between different resilience
> strategies as you add disks and the requirements change.
Actually, it's an excellent question. And a deep one. It goes to the very heart of why the traditional factoring of storage into filesystems and volumes is such a bad idea.

In a typical filesystem, each block is represented by a small integer -- typically 32 or 64 bits -- indicating its location on disk. To make a filesystem talk to multiple disks, you either need to add another integer -- a device number -- to each block pointer, or you need to generate virtual block numbers. Doing the former requires modifying the filesystem; doing the latter does not, which is why volumes caught on in the first place. It was expedient.

The simplest example of block virtualization is a concatenation of two disks. For simplicity, assume all disks have 100 blocks. To create a 200-block volume using disks A and B, we assign virtual blocks 0-99 to A and 100-199 to B. As far as the filesystem is concerned, it's just looking at a 200-block logical device. But when it issues a read for (say) logical block 137, the volume manager will actually map that to physical block 37 of disk B.

A stripe (RAID-0) is similar, except that instead of putting the low blocks on A and the high ones on B, you put the even ones on A and the odd ones on B. So disk A stores virtual blocks 0, 2, 4, 6, ... on physical blocks 0, 1, 2, 3, etc. The advantage of striping is that when you issue a read of (say) 10 blocks, that maps into 5 blocks on each disk, and you can read from those disks in parallel. So you get up to double the bandwidth (less for small I/O, because then the per-I/O overhead dominates, but I digress).

A mirror (RAID-1) is even simpler -- it's just a 1-1 mapping of logical to physical block numbers on two or more disks.

RAID-4 is only slightly more complex. The rule here is that all disks XOR to zero (i.e., if you XOR the nth block of each disk together, you get a block of zeroes), so you can lose any one disk and still be able to reconstruct the data.
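To make the mappings above concrete, here's a toy sketch in Python using the 100-block disks from the example. The function names are mine, purely for illustration -- real volume managers are of course far more general than this:

```python
# Toy logical-to-physical mappings for two 100-block disks, A and B.
# These follow the examples in the text; not real volume-manager code.

DISK_BLOCKS = 100

def concat_map(logical):
    """Concatenation: virtual blocks 0-99 live on A, 100-199 on B."""
    if logical < DISK_BLOCKS:
        return ("A", logical)
    return ("B", logical - DISK_BLOCKS)

def stripe_map(logical):
    """RAID-0: even virtual blocks on A, odd ones on B."""
    disk = "A" if logical % 2 == 0 else "B"
    return (disk, logical // 2)

def mirror_map(logical):
    """RAID-1: the same physical block number on both disks."""
    return [("A", logical), ("B", logical)]

print(concat_map(137))   # ('B', 37), as in the example above
print(stripe_map(6))     # ('A', 3): virtual 6 is the 4th even block on A
```

Note how the filesystem never sees any of this -- it just issues reads and writes against logical block numbers, which is exactly the expedience (and the limitation) being described.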
The block mapping is just like a stripe, except that there's a parity disk as well. RAID-5 is like RAID-4, but the parity rotates at some fixed interval so that you don't have a single 'hot' parity disk. RAID-6 is a variant on RAID-4/5 that (using slightly subtler mathematics) can survive two disk failures, not just one.

Now here's the key limitation of this scheme, which is so obvious that it's easy to miss: the relationship between replicas of your data is expressed in terms of the *devices*, not the *data*. That's why a traditional filesystem can't offer different RAID levels using the same devices -- the RAID levels are device-wide in nature. In a mirror, all disks are identical. In a RAID-4/5 group, all disks XOR to zero. Mixing (say) mirroring with RAID-5 doesn't work because in the event of disk failure, the volume manager would have no idea how to reconstruct the missing data.

RAID-Z takes a different approach. We were designing a filesystem as well, so we could make the block pointers as semantically rich as we wanted. To that end, the block pointers in ZFS contain data layout information. One nice side effect of this is that we don't need fixed-width RAID stripes. If you have 4+1 RAID-Z, we'll store 128k as 4x32k plus 32k of parity, just like any RAID system would. But if you only need to store 3 sectors, we won't do a partial-stripe update of an existing 5-wide stripe; instead, we'll just allocate four sectors and store the data and its parity. The stripe width is variable on a per-block basis. And, although we don't support it yet, so is the replication model. The rule for how to reconstruct a given block is described explicitly in the block pointer, not implicitly by the device configuration.

So to answer your question: no, it's not pie in the sky. It's a great idea. Per-file or even per-block replication is something we've thought about in depth, built into the on-disk format, and plan to support in the future. The main issues are administrative.
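The "all disks XOR to zero" rule and the variable-width idea from the RAID-Z paragraph above can be sketched in a few lines. This is illustrative Python only -- the real RAID-Z allocator is far more involved, and the function names here are just for this sketch:

```python
# Sketch of XOR parity and variable-width stripes. Illustrative only.

def xor_parity(data_blocks):
    """Parity block such that all data blocks XOR'd with it give zero."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity

def reconstruct(surviving_blocks):
    """Rebuild the one missing block: it's the XOR of everything left."""
    return xor_parity(surviving_blocks)

# A full-width write on a 4+1 layout: four data blocks plus parity.
data = [bytes([i] * 8) for i in (1, 2, 3, 4)]
p = xor_parity(data)

# Lose data[2]; rebuild it from the other three data blocks plus parity.
rebuilt = reconstruct([data[0], data[1], data[3], p])
assert rebuilt == data[2]

# Variable width: a 3-sector write gets 3 data sectors + 1 parity sector
# allocated fresh, rather than a partial update of a fixed 5-wide stripe.
small = [bytes([9] * 8), bytes([8] * 8), bytes([7] * 8)]
small_parity = xor_parity(small)   # 4 sectors allocated in total
```

The point of the sketch is the last few lines: because the block pointer records how each block was laid out, nothing forces every write into the same fixed geometry.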
ZFS is all about ease of use (when it's not busy being all about data integrity), so getting the interface to be simple and intuitive is important -- and not as simple as it sounds. If your free disk space might be used for single-copy data, or might be used for mirrored data, then how much free space do you have? Questions like that need to be answered, and answered in ways that make sense.

(Note: would anyone ever really want per-block replication levels? It's not as crazy as it sounds. A couple of examples: replicating only the first block, so that even if you lose data, you know the file type and have some idea what it contained; replicating only the first (say) 1GB, so that most files are replicated, but giant mpegs and core files aren't; or in a database, replicating only those records that have a particular field set.)

Jeff

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss