Re: [PATCH 6/6] Btrfs: do aio_write instead of write
On Tue, Jun 1, 2010 at 11:19 PM, Chris Mason <chris.ma...@oracle.com> wrote:
>> Is it ok not to unlock_extent if !ordered? I don't know if you fixed
>> this in a later version but it stuck out to me :)
>
> The construct is confusing. Ordered extents track things that we have
> allocated on disk and need to write. New ones can't be created while we
> have the extent range locked. But we can't force old ones to disk with
> the lock held.
>
> So, we lock then lookup, and if we find nothing we can safely do our
> operation because no io is in progress. We unlock a little later on, or
> at endio time.
>
> If we find an ordered extent we drop the lock and wait for that IO to
> finish, then loop again.

Ok, that's fair enough. Maybe it's worth commenting; I'm sure I'm not
the only one surprised.

Thanks,

--
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] Btrfs: do aio_write instead of write
On Sat, May 22, 2010 at 3:03 AM, Josef Bacik <jo...@redhat.com> wrote:
> +	while (1) {
> +		lock_extent(tree, start, end, GFP_NOFS);
> +		ordered = btrfs_lookup_ordered_extent(inode, start);
> +		if (!ordered)
> +			break;
> +		unlock_extent(tree, start, end, GFP_NOFS);

Is it ok not to unlock_extent if !ordered? I don't know if you fixed
this in a later version but it stuck out to me :)
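As Chris Mason explains elsewhere in this thread, the asymmetry is deliberate: on the !ordered path the range is intentionally kept locked and released later, or at endio time. Filled out with the wait-and-retry steps, the loop reads roughly like the following sketch (my reconstruction in kernel style, not the actual patch text; the two ordered-extent calls at the bottom are assumed from the surrounding btrfs code):

```c
/* Sketch only -- not quoted from the patch. */
while (1) {
	/* With the range locked, no new ordered extent can appear. */
	lock_extent(tree, start, end, GFP_NOFS);
	ordered = btrfs_lookup_ordered_extent(inode, start);
	if (!ordered)
		break;	/* no IO in flight: proceed while still holding
			 * the lock; it is dropped later or at endio time */
	/* An old ordered extent can't be forced to disk under the lock:
	 * drop it, wait for that IO to finish, then loop and re-check. */
	unlock_extent(tree, start, end, GFP_NOFS);
	btrfs_start_ordered_extent(inode, ordered, 1);
	btrfs_put_ordered_extent(ordered);
}
```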
Re: Data Deduplication with the help of an online filesystem check
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjcl...@web.de> wrote:
> Hi,
> during the last half year I thought a little bit about doing dedup for
> my backup program: not only with fixed blocks (which is implemented),
> but with moving blocks (with all offsets in a file: 1 byte, 2 byte,
> ...). That means I have to do *lots* of comparisons (size of file -
> blocksize). Even if it's not the same problem, it must be very fast,
> and that's the same problem as the one discussed here.
>
> My solution (not yet implemented) is as follows (hopefully I remember
> well): I calculate a checksum of 24 bits (there can be another size).
> This means I can have 2^24 different checksums. Therefore, I hold a bit
> vector of 0.5 GB in memory (I hope I remember well, I'm just in a hotel
> and have no calculator): one bit for each possibility. This vector is
> initialized with zeros. For each calculated checksum of a block, I set
> the according bit in the bit vector.
>
> It's very fast to check if a block with a special checksum exists in
> the filesystem (backup for me) by checking the appropriate bit in the
> bit vector. If it doesn't exist, it's a new block. If it exists, there
> needs to be a separate 'real' check whether it's really the same block
> (which is slow, but that's happening 1% of the time).

Which means you have to refer to each block in some unique way from the
bit vector, making it a block pointer vector instead. That's only 64
times more expensive for a 64-bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served with a self-sizing data structure
like a hash map, or even a binary tree, or a hash map with each bucket
being a binary tree, etc. You can use any sized hash and it won't affect
the number of nodes you have to store. You can trade off CPU against RAM
easily as required, just by selecting an appropriate data structure.
A bit vector, and especially a pointer vector, has extremely bad RAM
requirements in every case: even if you're deduping a mere 10 blocks
you're still allocating and initialising 2^24 entries. The least you
could do is adaptively switch to a more efficient data structure if you
see the number of blocks is low enough.
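To make the trade-off concrete, here is a minimal C sketch of the bit-vector scheme (the 24-bit checksum is an arbitrary FNV-1a fold, purely illustrative). Note the fixed cost: 2^24 bits is 2 MiB no matter how few blocks are inserted, and the pointer-vector variant (2^24 64-bit offsets) would be 128 MiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CSUM_BITS  24
#define CSUM_SPACE (1u << CSUM_BITS)

/* The fixed cost of the scheme: 2^24 bits = 2 MiB, allocated up front
 * regardless of how many blocks are actually inserted. */
static uint8_t bitvec[CSUM_SPACE / 8];

/* Hypothetical 24-bit block checksum: FNV-1a folded to 24 bits.
 * Any checksum of the right width would do. */
static uint32_t csum24(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h & (CSUM_SPACE - 1);
}

static void bitvec_set(uint32_t c)
{
    bitvec[c >> 3] |= (uint8_t)(1u << (c & 7));
}

static int bitvec_test(uint32_t c)
{
    return (bitvec[c >> 3] >> (c & 7)) & 1;
}
```

A set bit only means "some block with this checksum was seen", so a hit must still be confirmed by comparing the candidate blocks byte for byte. A hash map keyed on the checksum, by contrast, sizes itself to the blocks actually present.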
Re: kernel bug in file-item.c
On Thu, Apr 30, 2009 at 5:04 AM, Marc R. O'Connor <mrocon...@oel.state.nj.us> wrote:
>> If you can stomach it, you can get a second opinion from the bootable
>> windows memory testing iso:
>> http://oca.microsoft.com/en/windiag.asp
>
> It will be hard but I might just try it. Two versions of memtest86+ die
> in the middle of the scan.

Ugh. Also try memtester in Linux. Boot up as you normally do and give it
a fair chunk of your RAM to run on. Unlike memtest86+ it leaves all the
actual low-level stuff to the kernel; it doesn't even need root, let
alone its own boot environment. At least then you can thoroughly rule
out a memtest86+ bug.
Re: kernel bug in file-item.c
On Fri, May 1, 2009 at 2:54 AM, Tracy Reed <tr...@ultraviolet.org> wrote:
> Sorry, I was unclear: I meant manage in a public-relations sort of way,
> not in a technical way. You are absolutely right that bad RAM or CPU
> means you are hosed.

Even so, it's a perfect opportunity not to make things worse by trying
to write data after fundamental assertions have already failed. My most
recent data loss scenario with ext3 involved a little
kernel/hardware/whatever glitch that would have been harmless on its
own, but which ext3 took as a cue to completely mangle metadata. I found
XML config files with PNG blocks in them, etc.
Re: Data Deduplication with the help of an online filesystem check
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> So you need an extra index either way. It makes sense to keep the
> crc32c csums for fast verification of the data read from disk and only
> use the expensive csums for dedup.

What about self-healing? With only a CRC32 to distinguish a good block
from a bad one, statistically you can expect an incorrectly healed block
roughly once every few billion corrupted blocks. That may not happen on
your machine, but it'll happen on somebody's, since the probability is
way too high for it not to happen to somebody. Even just a 64-bit
checksum would drop the probability plenty, but I'd really only start
with 128 bits. NetApp uses 64 bits per 4k block; ZFS uses 256 bits per
block, chained back to the root like a highly dynamic Merkle tree.

In the CRC case the only safe redundancy is one that keeps 3+ copies of
the block, so the raw data itself can be compared, at which point you
may as well have just been using self-healing RAID1 without checksums.
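The arithmetic behind that estimate is worth spelling out: a random corruption slips past an n-bit checksum with probability about 2^-n, so over some number of corruption-repair events the expected count of silently mis-healed blocks is simply events x 2^-n. A toy C sketch of the calculation (the event counts are illustrative, not measurements from any real array):

```c
#include <assert.h>

/* Probability that a random corruption slips past an n-bit checksum:
 * the corrupt block matches the stored checksum by chance, 1 in 2^n. */
static double collision_prob(unsigned bits)
{
    double p = 1.0;

    while (bits--)
        p /= 2.0; /* exact halving in binary floating point */
    return p;
}

/* Expected number of silently mis-healed blocks after `corruptions`
 * repair events, each verified only by an n-bit checksum. */
static double expected_bad_heals(double corruptions, unsigned bits)
{
    return corruptions * collision_prob(bits);
}
```

With 32 bits, about one bad heal is expected per 2^32 (~4 billion) corrupted blocks; at 64 bits the same event count expects only ~2^-32 bad heals, which is why even a modest widening of the checksum changes the picture so much.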
Re: LVM vs btrfs as a volume manager for SANs
On Thu, Apr 23, 2009 at 5:20 AM, Tomasz Chmielewski <man...@wpkg.org> wrote:
> - what happens if the SAN machine crashes while the iSCSI file images
>   were being written to? With LVM and its block devices, I'm somehow
>   more confident it wouldn't cause more data loss than necessary.

I would hope that COW applies in such a setting. It certainly does for
ZFS, making it an excellent backend for iSCSI. At least, having it on
btrfs shouldn't make it any less reliable than on LVM, as long as btrfs
does its job correctly.
Re: Btrfs development plans
On Tue, Apr 21, 2009 at 5:46 PM, Stephan von Krawczynski <sk...@ithnet.com> wrote:
> On Mon, 20 Apr 2009 12:38:57 -0400 Chris Mason <chris.ma...@oracle.com> wrote:
>> The short answer from my point of view is yes. This doesn't really
>> change the motivations for working on btrfs or the problems we're
>> trying to solve.
>
> ... which sounds logical to me. From looking at the project for a while
> one can see you are trying to solve problems that are not really linux'
> ones...

Even so, I certainly hope that btrfs ends up at least as reliable and
feature-complete as ZFS, if ZFS itself cannot be merged into Linux.
That's a big ask, but now that ZFS's IP has been brought into Oracle,
perhaps a lot of patent and copyright issues can be smoothed over,
giving btrfs a huge advantage relative to what it had before the
acquisition.
Re: interesting use case for multiple devices and delayed raid?
On Wed, Apr 1, 2009 at 8:17 PM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> I have a use case that I wonder if anyone might find interesting
> involving multiple device support and delayed raid.
>
> Let's say I have a system with two disks of equal size (to make it
> easy) which has sporadic, heavy, write requirements. At some points in
> time there will be multiple files being appended to simultaneously and
> at other times, there will be no activity at all. The write activity is
> time sensitive, however, so the filesystem must be able to provide
> guaranteed (only in a loose sense -- not looking for real QoS
> reservation semantics) bandwidths at times. Let's say slightly (but
> within the realm of reality) less than the bandwidth of the two disks
> combined.

I assume you mean read bandwidth, since write bandwidth cannot be
increased by mirroring, only by striping. If you intend to stripe first,
then mirror later as time permits, this is the kind of sophistication
you will need to write into the program code itself.

A filesystem is a handy abstraction, but you are by no means limited to
using it. If you have very special needs, you can get pretty far by
writing your own meta-filesystem to add semantics you don't have in your
kernel filesystem of choice; that's what every single database
application does. You can get even further by writing a complete
user-space filesystem as part of your program, or a shared daemon, and
the performance isn't really that bad.

> I also want both the metadata and file data mirrored between the two
> disks so that I can afford to lose one of the disks and not lose (most
> of) my data. It is not a strict requirement that all data be
> immediately mirrored however.

This is handled by DragonFly BSD's HAMMER filesystem. A master gets
written to, and asynchronously updates a slave, even over a network. It
is transactionally consistent and virtually impossible to corrupt as
long as the disk media is stable. However, as far as I know it won't
spread reads, so you'll still get the performance of one disk.

A more complete solution, which requires no software changes, would be
to have 3 or 4 disks: a stripe for really fast reads and writes, and
another disk (or another stripe) acting as a slave to the data being
written to the primary stripe. This seems to do what you want, at a
small price premium.
Re: interesting use case for multiple devices and delayed raid?
On Thu, Apr 2, 2009 at 8:04 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
>> A more complete solution, that requires no software changes, would be
>> to have 3 or 4 disks. A stripe for really fast reads and writes, and
>> another disk (or another stripe) to act as a slave to the data being
>> written to the primary stripe. This seems to do what you want, at a
>> small price premium.
>
> No. That's not really what I am describing at all.

Well, you get the bandwidth of 2 disks when reading and writing, and the
data is still mirrored to a second stripe as time permits -- kind of
like a delayed RAID10.

> I apologize if my original description was unclear. Hopefully it is
> more so now.

Yes. It'll be up to the actual filesystem devs to weigh in on whether
it's worth implementing.
Re: metadata copied/data not copied?
On Tue, Mar 17, 2009 at 1:35 AM, CSights <csig...@fastmail.fm> wrote:
> Hi everyone,
> I'm curious what would happen in btrfs if the following commands were
> issued:
> # cp file1 file2
> # chown newuser:newgroup file2
> Where file1 was owned by olduser:oldgroup.
> If I understand copy-on-write correctly the cp would merely create a
> new pointer (or whatever it is called :( ) containing the file's
> metadata, but the file contents would not actually be duplicated.

Certainly not. cp is a userspace program with no knowledge of filesystem
details like COW or, as you describe, deduplication.

The semantics you are looking for come with links, except that they
don't. Hard links are implemented at the filesystem level, but they are
not copy-on-write at the user-space level: if you write to the linked
file, the change will appear via all other links too. Soft links are
nothing more than a transparent shortcut that happens to give most of
the semantics people want from hard links, in a much more flexible way.
But neither allows a file to have contradictory ownership; you need ACLs
to hack up access rights to mimic that.

Hard links and soft links make multiple paths refer to the same file,
not merely the same contents with different metadata. Otherwise you
could soft link /bin/sh into your home directory, setuid the link, and
own the machine. Clearly this is not the case, and won't be with btrfs
either. And if you're not talking about soft or hard links, I've no idea
how you thought that would work.

It is possible for a filesystem to detect identical blocks between
files, but without more guidance it would be very expensive to do so,
and with questionable benefits.
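The "same file, not merely the same contents" point is easy to check with stat(2): a hard link shares the original's inode, so owner, mode and data blocks are necessarily shared too. A minimal POSIX sketch (the helper name and any paths used with it are mine, not from the thread):

```c
#include <sys/stat.h>

/* Return 1 if paths a and b name the very same file (same device and
 * inode number), 0 if they are distinct files, -1 on stat() error.
 * Two hard links to one file return 1; a file and an ordinary copy
 * made by cp return 0, because cp creates a brand-new inode. */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

This is also why chown through one hard link changes the ownership seen through every other link: there is only one inode to change.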
Re: metadata copied/data not copied?
On Tue, Mar 17, 2009 at 8:14 AM, Dmitri Nikulin <dniku...@gmail.com> wrote:
> Otherwise you could soft link /bin/sh into your home directory, setuid
> the link, and own the machine.

Sorry, that was a terrible example; only root can setuid anyway. A
better example is linking to /bin/sh and making your link writable, then
using that to inject malicious code, which is just as good and would be
possible with the semantics you described.
Re: questions about GRUB and BTRFS
On Wed, Feb 25, 2009 at 12:32 PM, Anthony Roberts <btrfs-de...@arbitraryconstant.com> wrote:
> Hi,
>
> A quick googling turns up posts that GRUB support for BTRFS is planned.
> My curiosity is more towards how this will be managed, because the way
> this is currently implemented with software RAID/LVM is quite
> haphazard. I therefore have some questions about GRUB + BTRFS:

IANAGD (I Am Not A GRUB Developer), but I'll post some intuitive
responses.

> -With GRUB booting, it's easy to think of awkward use cases and
> limitations unless it's capable of discovering BTRFS instances, and can
> boot by specifying BTRFS UUID + subvolume. That seems quite ambitious,
> but is this planned eventually?

I don't know how much filesystem code can be crammed into the pre-/boot
parts of GRUB, but I doubt it's enough to support btrfs' advanced
features like object-level striping. For comparison, here is how the two
major ZFS operating systems support root on ZFS:

Solaris (OpenSolaris, Nexenta, etc.) supports booting from ZFS using
GRUB, but ONLY from plain or mirrored pools, not striped or raid-z. I'm
not sure about linear, if the kernel is installed on anything but the
first vdev.

FreeBSD unofficially supports / on ZFS very well, but you still need a
/boot to let the bootloader find the kernel and modules. However, the
kernel itself can be given a ZFS pool and path such as
zfs:pool/freebsd/root, and it will find all of the ZFS metadata it needs
from disk blocks and the small cache in /boot. In return for this /boot
you get the ability to boot right off RAID-Z or whatever you like,
because it's the kernel, with full driver and filesystem code, doing the
work instead of very limited bootloader code.

> -Might it be possible to tweak the userspace component of GRUB to
> install the bootloader to every member device? This seems necessary for
> reliable booting and rebuilding after a dead disk.

Even if you couldn't tweak GRUB, device-mapper already has an easy way
to mirror just the boot blocks per disk. However, GRUB would get
confused, since the virtual device does not map to a BIOS boot device;
legacy BIOS booting is a pain that way. You may as well just write a
shell script to automatically invoke grub-install for each device
individually.

> -64 kb at the beginning of the device is plenty for MBR + GRUB stage 1
> + 1.5. Might this allow bootable BTRFS without partitions being used at
> all?

The space used for partitioning is negligible. However, we're on the
cusp of disks that are too big to partition with MBR, and GPT booting
doesn't seem well supported yet. As far as I know, we don't even have a
way to boot straight off LVM (because GRUB doesn't support it, and for a
kernel and initrd you need a supported partition), and btrfs would only
be more difficult.

> There's obviously no point in getting worked up about this before
> production ready support is available in the first place. :) However,
> I am curious about what sort of implementation is planned.

Well before production-ready support is there, people will already want
to test btrfs as their / (which should be automagic, as for FreeBSD ZFS)
and /boot (because they're difficult that way). Long before reiser4 was
even proposed for mainline merge, it already had GRUB support.
Enthusiasts will always believe that even /boot should be fortified with
COW, checksums and snapshotting :)

Especially if btrfs is intended to be the next default Linux filesystem,
as quoted in many places, it will need /boot support in some form. I'll
personally keep an ext3 /boot for a long time, just because recovery is
easier that way.
Re: ssd optimised mode
On Tue, Feb 24, 2009 at 3:10 PM, Martin K. Petersen <martin.peter...@oracle.com> wrote:
>>>>> "Dmitri" == Dmitri Nikulin <dniku...@gmail.com> writes:
>
> Dmitri> If that's the case, why is it marketed for Windows Vista only,
> Dmitri> and referring to filesystem features like marking unused
> Dmitri> blocks? Surely if it was at the device level it would be
> Dmitri> OS-neutral, and marketed as such.
>
> The article you posted references some benchmarketing numbers involving
> Vista. That does not imply it's a Windows-only product.
>
> http://en.wikipedia.org/wiki/ExtremeFFS

So it seems. I misread the marketing material, and I'm very relieved
that Linux will benefit just as much from the improvement, indeed
probably more. Thank you very much for correcting me, Martin and
Dongjun.
Re: ssd optimised mode
On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin <djshi...@gmail.com> wrote:
> A well-designed SSD should survive power cycling and should provide
> atomicity of flush operation regardless of the underlying flash
> operations. I don't expect that users of SSD have different
> requirements about atomicity.

Well, that's my point: it *should* provide atomicity, but is this the
case for consumer SSDs? It is certainly NOT the case for cheap USB-based
flash media, and AFAIK not for CF either.
Re: ssd optimised mode
On Mon, Feb 23, 2009 at 2:17 PM, Seth Huang <seth...@gmail.com> wrote:
> On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin <djshi...@gmail.com> wrote:
>> A well-designed SSD should survive power cycling and should provide
>> atomicity of flush operation regardless of the underlying flash
>> operations. I don't expect that users of SSD have different
>> requirements about atomicity.
>
> A reliable system should be based on the assumption that the underlying
> parts are unreliable. Therefore, we should do as much as possible to
> ensure reliability in our filesystem instead of leaning on the SSDs.

I generally agree with this approach; however, it would clearly have a
performance penalty. If possible it should be optional, so that on
reliable media the hardware can do the hard work and the software can
perform well. But it might be too much to ask that btrfs support
mkfs/mount options for every distinct class of storage (rotating, bad
SSD, good SSD, USB flash, holographic cube, electron spin, etc.).
Re: ssd optimised mode
On Sat, Feb 21, 2009 at 3:30 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> The short answer is that in ssd mode we don't try to avoid random
> reads.

In the ideal future where SSDs can be run without a flimsy hardware FTL,
and btrfs can use something like UBI directly, would SSD mode also be
able to enable more intelligent wear levelling and safer use of
eraseblocks?

I've read that one of the potentially crippling limitations of ZFS is
that even its reliability features depend largely on being able to
perform atomic writes, which are currently impossible (?) on flash
media, where a block has to be erased before it can be updated --
clearly not an atomic operation. Is there any solution to this that
doesn't depend on a battery backup? Clearly it's not something a
filesystem can practically solve.
Re: btrfs and swap files on SSD's ?
On Wed, Jan 21, 2009 at 12:02 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> There are patches to support swap over NFS that might make it safe to
> use on btrfs. At any rate, it is a fixable problem.

FreeBSD has been able to run swap over NFS for as long as I can
remember; what is different in Linux that makes it especially difficult?
I've read that swap over non-trivial filesystems is hazardous, as it may
lead to a situation in which memory allocation can fail in the swap/FS
code that was meant to make allocation possible again.

If btrfs is to take the role of a RAID and volume manager, it would
certainly be very useful to be able to run swap on it, since that frees
up other volumes from an administrative standpoint.