Re: btrfs csum failed on git .pack file
What a strange coincidence that it affected git pack files in both cases. It's almost too improbable... probably more than a coincidence, I think; the question is what, though.

Some SSDs (or rather the cheap wear-levelling controllers in things like USB sticks) have firmware which tries to recognise the data structures of common filesystems (like FAT and NTFS), and uses information in those structures to optimise the allocation and erasure of blocks (for example the free-space linked list in FAT). If the data you were saving to the disk happened to resemble one of those data structures, you might have triggered one of those algorithms, causing data corruption. This is common in high-performance USB sticks because they want to pre-erase blocks on file deletion for operating systems that don't support SCSI TRIM - I imagine the same technology might carry across to cheap SSDs.

There is not much btrfs can do about it, though. If the piece of data that triggers the bug could be identified, workarounds could possibly be introduced for the particular buggy controllers.

Oliver Mattos
(resent as I emailed the wrong recipients before)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/25] [btrfs] BUG to BUG_ON changes
> > >        if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
> > >                BUG();
> > ^ You seem to have missed one.
> Actually that one was left on purpose because BUG_ON calls are not to have
> any side effects and I do not know enough about btrfs to know what
> BTRFS_NODEPTRS_PER_BLOCK does so it was left as is.

BTRFS_NODEPTRS_PER_BLOCK has no side effects - it's defined as:

#define BTRFS_NODEPTRS_PER_BLOCK(r) (((r)->nodesize - \
                                      sizeof(struct btrfs_header)) / \
                                      sizeof(struct btrfs_key_ptr))

so it should be fine to put it in the BUG_ON.

----- Original Message -----
From: Stoyan Gaydarov <stoyboy...@gmail.com>
To: davidj...@gmail.com
Cc: linux-ker...@vger.kernel.org; chris.ma...@oracle.com; linux-btrfs@vger.kernel.org
Sent: Tuesday, March 10, 2009 6:16 PM
Subject: Re: [PATCH 01/25] [btrfs] BUG to BUG_ON changes

On Tue, Mar 10, 2009 at 4:16 AM, David John <davidj...@gmail.com> wrote:
> Stoyan Gaydarov wrote:
> > Signed-off-by: Stoyan Gaydarov <stoyboy...@gmail.com>
> > ---
> >  fs/btrfs/ctree.c            |    3 +--
> >  fs/btrfs/extent-tree.c      |    3 +--
> >  fs/btrfs/free-space-cache.c |    3 +--
> >  fs/btrfs/tree-log.c         |    3 +--
> >  4 files changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 37f31b5..2c590b0 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -2174,8 +2174,7 @@ static int insert_ptr(struct btrfs_trans_handle *trans, struct btrfs_root
> >        BUG_ON(!path->nodes[level]);
> >        lower = path->nodes[level];
> >        nritems = btrfs_header_nritems(lower);
> > -      if (slot > nritems)
> > -              BUG();
> > +      BUG_ON(slot > nritems);
> >        if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
> >                BUG();
> ^ You seem to have missed one.

Actually that one was left on purpose because BUG_ON calls are not to have
any side effects and I do not know enough about btrfs to know what
BTRFS_NODEPTRS_PER_BLOCK does so it was left as is.
> >        if (slot != nritems) {
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 9abf81f..0314ab6 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -672,8 +672,7 @@ static noinline int insert_extents(struct btrfs_trans_handle *trans,
> >                                keys+i, data_size+i, total-i);
> >        BUG_ON(ret < 0);
> > -      if (last && ret > 1)
> > -              BUG();
> > +      BUG_ON(last && ret > 1);
> >
> >        leaf = path->nodes[0];
> >        for (c = 0; c < ret; c++) {
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index d1e5f0e..b0c7661 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -267,8 +267,7 @@ static int __btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
> >  out:
> >        if (ret) {
> >                printk(KERN_ERR "btrfs: unable to add free space :%d\n", ret);
> > -              if (ret == -EEXIST)
> > -                      BUG();
> > +              BUG_ON(ret == -EEXIST);
> >        }
> >        kfree(alloc_info);
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 9c462fb..2c892f6 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -1150,8 +1150,7 @@ insert:
> >        ret = insert_one_name(trans, root, path, key->objectid, key->offset,
> >                              name, name_len, log_type, &log_key);
> > -      if (ret && ret != -ENOENT)
> > -              BUG();
> > +      BUG_ON(ret && ret != -ENOENT);
> >        goto out;
> >  }

--
-Stoyan
Re: ssd optimised mode
Hi,

While this sounds nice in theory, in reality, since eraseblocks are generally very large and blocks are remapped in hardware (the FTL), you can never be sure which data blocks are at risk when re-writing just one block. There is a good chance that rewriting one block of data somewhere will wipe out a block of data at a totally different location on the disk. Since the filesystem has no understanding of the hardware FTL, there isn't much that can be done about this at the filesystem level.

The only thing that could be done is metadata mirroring on the same disk, but I suspect even that would only marginally improve reliability: since both copies of the metadata are likely to be written to storage at nearly the same time, they are quite likely to be re-mapped into the same eraseblock, and then if you lose one, you lose both.

The solution to this is twofold:

1 :- Filesystems should be able to detect when a non-atomic write has corrupted the filesystem and tell the user - e.g. "Filesystem corruption found; likely hardware malfunction" (since with a filesystem like btrfs, once all the big software bugs are resolved, the only thing that can cause disk corruption is a hardware issue).

2 :- Someone who feels this is close to their heart needs to test every big brand of SSD and name and shame those where writes are non-atomic. As soon as it is publicised that certain brands of SSD put data at risk, the manufacturers will be very fast to resolve it.

----- Original Message -----
From: Seth Huang <seth...@gmail.com>
To: linux-btrfs@vger.kernel.org
Sent: Monday, February 23, 2009 3:17 AM
Subject: Re: ssd optimised mode

On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin <djshi...@gmail.com> wrote:
> A well-designed SSD should survive power cycling and should provide
> atomicity of the flush operation regardless of the underlying flash
> operations. I don't expect that users of SSDs have different requirements
> about atomicity.
A reliable system should be built on the assumption that the underlying parts are unreliable. Therefore, we should do as much as possible to ensure reliability in our filesystem instead of leaning on the SSDs.

--
Seth Huang
Re: Data De-duplication
> Neat. Thanks much. It'd be cool to output the results of each of your
> hashes to a database so you can get a feel for how many duplicate blocks
> there are cross-files as well. I'd like to run this in a similar setup on
> all my VMware VMDK files and get an idea of how much space savings there
> would be across 20+ Windows 2003 VMDK files... probably *lots* of common
> blocks.
>
> Ray

Hi,

Currently it DOES do cross-file block matching - that's why it takes so long to run :-)

If you remove everything after the word "sort", it will produce verbose output, which you could then stick into an SQL database if you wanted. You could relatively easily format the output for direct input to an SQL database if you modified the line with the dd in it within the first while loop. You can also remove the sort and the pipe before it to get unsorted output - the advantage being that it takes less time.

I'm guessing that if you had the time to run this on multi-gigabyte disk images, you'd find that as much as 80% of the blocks are duplicated between any two virtual machines of the same operating system. That means if you have 20 Win2k3 VMs and the first VM image of Windows + software is 2GB (after nulls are removed), the total size for 20 VMs could be ~6GB (remembering there will be extra redundancy the more VMs you add) - not a bad saving.

Thanks
Oliver
Re: Data De-duplication
> > 2) Keep a tree of checksums for data blocks, so that a bit of data can be
> > located by its checksum. Whenever a data block is about to be written,
> > check if the block matches any known block, and if it does then don't
> > bother duplicating the data on disk. I suspect this option may not be
> > realistic for performance reasons.
>
> When compression was added, the writeback path was changed to make option
> #2 viable, at least in the case where the admin is willing to risk hash
> collisions on strong hashes. When a direct read comparison is required
> before sharing blocks, it is probably best done by a stand-alone utility,
> since we don't want to wait for a read of a full extent every time we want
> to write one.

Can we assume hash collisions won't occur? I mean, if it's a 256-bit hash, then even with 256TB of data and one hash per block, the chances of a collision are still too small to calculate on gnome calculator. The only issue is that if the hash algorithm is later found to be flawed, a malicious bit of data could be stored on the disk whose hash collides with some more important data, potentially allowing the contents of one file to be replaced with another.

Even if we don't assume hash collisions won't occur (e.g. for CRCs), the write performance when writing duplicate files is equal to the read performance of the disk, since for every block written by a program, one block will need to be read and no blocks written. This is still better than the plain write case (as most devices read faster than they write), and has the added advantage of saving lots of space.
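The "too small to calculate" claim checks out with a birthday-bound estimate. A quick sketch (the 4 KiB block size is my assumption; the post doesn't fix one):

```python
# Birthday-bound estimate of an accidental collision among n blocks
# hashed with a 256-bit hash: p ~= n^2 / 2^(bits+1).
from math import log2

bits = 256
block_size = 4096                        # assumed 4 KiB blocks
n_blocks = (256 * 2**40) // block_size   # 256 TB of data -> 2^36 blocks

log2_p = 2 * log2(n_blocks) - (bits + 1)
print(f"n_blocks = 2^{log2(n_blocks):.0f}, collision probability ~ 2^{log2_p:.0f}")
# prints: n_blocks = 2^36, collision probability ~ 2^-185
```

At roughly 2^-185, an accidental collision is vastly less likely than undetected hardware corruption; the real risk is the adversarial one described above.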
Re: Data De-duplication
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote:
> > When a direct read comparison is required before sharing blocks, it is
> > probably best done by a stand-alone utility, since we don't want to wait
> > for a read of a full extent every time we want to write one.
>
> Can a stand-alone utility do this without a race? E.g., if a portion of
> one of the files has already been read by the util, but is changed before
> the util has a chance to actually do the ioctl to duplicate the range. If
> we assume someone is keeping a hash table of likely matches, might it make
> sense to have a verify-and-dup ioctl that does this safely?

This sounds good, because then a standard user could safely do this to any file as long as he had read access to both files, but wouldn't need write access (since the operation doesn't actually modify the file data).
Re: Data De-duplication
I see quite a few uses for this, and while it looks like the kernel-mode automatic de-dup-on-write code might be costly in performance, require disk format changes, and be controversial, it sounds like the user-mode utility could be implemented today.

It looks like a simple script could do the job - just iterate through every file in the filesystem, run md5sum on every block of every file, and whenever a duplicate is found call an ioctl to remove the duplicate data. By md5summing each block it can also effectively compress disk images. While not very efficient, it should work, and having something like this in the toolkit would mean that as soon as btrfs gets stable enough for everyday use, it would immediately outdo every other Linux filesystem in terms of space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good. I'm willing to bet even the average user has, say, 1-2% of data duplicated somewhere on the HD due to accidental copies instead of moves, the same application installed to two different paths, two users who happen to have the same file each saved in their home folder, etc., so even the average user will benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see how much duplicate wasted data I really have.

Thanks
Oliver
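The scan loop described above is easy to sketch. Here is a rough proof of concept in Python rather than bash (sha256 in place of md5sum; the 4096-byte block size is an assumption, and the "remove the duplicate via an ioctl" step is filesystem-specific, so this version only reports what it would merge):

```python
# Walk a directory tree, hash every fixed-size block of every file,
# and collect blocks whose contents have been seen before. The actual
# merge step (an ioctl in the proposal above) is left out; this only
# reports candidate duplicates.
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size

def find_duplicate_blocks(root):
    seen = {}        # digest -> (path, offset) of first occurrence
    duplicates = []  # (dup_path, dup_offset, orig_path, orig_offset)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    offset = 0
                    while block := f.read(BLOCK_SIZE):
                        digest = hashlib.sha256(block).digest()
                        if digest in seen:
                            duplicates.append((path, offset) + seen[digest])
                        else:
                            seen[digest] = (path, offset)
                        offset += len(block)
            except OSError:
                continue  # unreadable file; skip it
    return duplicates
```

A real utility would byte-compare the two blocks before merging them (hashes alone can collide) and would need filesystem support to actually share the extents.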
Re: Data De-duplication
> It would be interesting to see how many duplicate *blocks* there are
> across the filesystem, agnostic to files... Is this something your script
> does, Oliver?

My script doesn't yet exist, although when created it would, yes. I was thinking of just making a bash script, using dd to extract 512-byte chunks of files, piping them through md5sum and saving the results in a large index file. Next, just iterate through the index file looking for duplicate hashes. In fact, that sounds so easy I might do it right now... (only to proof-of-concept stage - a real utility would probably want to be written in a compiled language and use proper trees for faster searching)
Data De-duplication
Hi,

Say I download a large file from the net to /mnt/a.iso. I then download the same file again to /mnt/b.iso. These files now have the same content, but are stored twice since the copies weren't made with the bcp utility. The same occurs if a directory tree with duplicate files (created with bcp) is put through a non-aware program - for example, tarred and then untarred again.

This could be improved in two ways:

1) Make a utility which checks the checksums for all the data extents; if the checksums of data match for two files, then check the file data, and if the file data matches, keep only one copy. It could be run as a cron job to free up disk space on systems where duplicate data is common (e.g. virtual machine images).

2) Keep a tree of checksums for data blocks, so that a bit of data can be located by its checksum. Whenever a data block is about to be written, check if the block matches any known block, and if it does then don't bother duplicating the data on disk. I suspect this option may not be realistic for performance reasons.

If either is possible, then thought needs to be put into whether it's worth doing at a file level or a partial-file level (i.e. if I have two similar files, can the space used by the identical parts of the files be saved).

Has any thought been put into either 1) or 2) - is either possible or desired?

Thanks
Oliver
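Option 2) is essentially a content-addressed write path. A toy model of the idea (a Python dict stands in for the on-disk checksum tree; the class and method names are mine, not anything in btrfs):

```python
# Toy model of dedup-on-write: keep a map from block checksum to
# physical block, and only store a block whose hash is new.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = []   # physical block storage
        self.by_hash = {}  # checksum -> physical block index

    def write_block(self, data: bytes) -> int:
        """Return the physical index holding `data`, sharing it if possible."""
        digest = hashlib.sha256(data).digest()
        idx = self.by_hash.get(digest)
        if idx is not None:
            return idx     # duplicate block: no new storage consumed
        self.blocks.append(data)
        self.by_hash[digest] = len(self.blocks) - 1
        return len(self.blocks) - 1
```

Writing the same file twice then consumes physical storage only once, which is exactly the a.iso/b.iso case above; the performance question is the cost of the checksum lookup on every write.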
Weak Allocation
Hi,

I presume that in the design of btrfs, like most other filesystems, a block on the underlying storage is either allocated (i.e. storing metadata or file data) or deallocated (possibly blank, or containing leftover garbage, but with irrelevant contents).

Does btrfs have any system that could allow adding, at a later point in time, a feature for "weak allocation" of blocks? By this I mean the block is allocated (i.e. storing useful data), but if another file of higher priority needs to be written and there are no (or not many) alternate blocks readily available for use, then the data will be replaced.

I could foresee uses for a feature like that as a cache - for example, my web browsing cache is not vital data, and as such doesn't need to use up disk space, but it might as well use up any disk space that would otherwise go unused. The cache data can always be regenerated, so losing the data isn't a problem. Other uses could be persistent network caches (i.e. storing copies of remote files on the network so they can be accessed faster locally) - again, the cache data isn't critical to the operation of the system, so it could be stored in weakly allocated blocks. Further uses could be caches of compressed files (decompressed versions of the same files are also saved in other blocks, and depending on IO and CPU load, either the compressed or decompressed version is used).

From a user-land perspective, these files could be created with a special flag which specifies that they are only weakly allocated, meaning that any time the file has no open file descriptors, it could vanish if the underlying filesystem wants to use the space it occupies for something else. A file could have a priority value which specifies how important it is, and therefore how likely it is to be erased if a new block needs to be allocated.
The block allocator would use this information, together with the physical layout of the data, to decide where to place new data so as to avoid fragmentation while retaining possibly useful data for future use.

Let me know your ideas on this - at the moment it's only an idea, but I'm interested to know whether a) it would be possible to implement in a complex filesystem like btrfs, and b) it would prove useful if implemented.

Thanks
Oliver.

PS. I realise this could be implemented with a user-space daemon which polls available disk space and deletes caches when disk space gets low (as Windows does with shadow copies), but that hardly seems ideal, since it can't intelligently choose which caches to delete to reduce fragmentation, and large sudden disk allocations will fail.