Re: btrfs csum failed on git .pack file

2009-09-09 Thread Oliver Mattos



What a strange coincidence that it affected git pack files in both cases.
It's almost too improbable...



Probably more than a coincidence, I think - the question is what, though...


Some SSD drives (or rather the cheap wear-levelling controllers in things
like USB sticks) have firmware which tries to recognise certain data
structures of common filesystems (like FAT and NTFS), and uses information
in those data structures to optimise the allocation and erasure of blocks
(for example the free-space linked list in FAT).  If the data you were
saving to the disk happened to resemble one of those data structures, you
might have triggered one of those algorithms, which would cause data
corruption.  This is common in high-performance USB sticks, because they
want to pre-erase blocks on file deletion for operating systems that don't
support TRIM - I imagine the same technique might carry across to cheap SSDs.

Not much BTRFS can do about it though.  If the piece of data that triggers
the bug could be identified, workarounds could possibly be introduced for
the particular buggy controllers.

Oliver Mattos

(resent as I emailed the wrong recipients before) 




Re: [PATCH 01/25] [btrfs] BUG to BUG_ON changes

2009-03-10 Thread Oliver Mattos



if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
BUG();


^ You seem to have missed one.


Actually that one was left on purpose: the condition passed to BUG_ON must
not have any side effects, and I do not know enough about btrfs to know
whether BTRFS_NODEPTRS_PER_BLOCK has any, so it was left as is.



BTRFS_NODEPTRS_PER_BLOCK has no side effects - it's defined as:

#define BTRFS_NODEPTRS_PER_BLOCK(r) (((r)->nodesize - \
                                      sizeof(struct btrfs_header)) / \
                                     sizeof(struct btrfs_key_ptr))

so it should be fine to put it in the BUG_ON.
- Original Message - 
From: Stoyan Gaydarov stoyboy...@gmail.com

To: davidj...@gmail.com
Cc: linux-ker...@vger.kernel.org; chris.ma...@oracle.com; 
linux-btrfs@vger.kernel.org

Sent: Tuesday, March 10, 2009 6:16 PM
Subject: Re: [PATCH 01/25] [btrfs] BUG to BUG_ON changes



On Tue, Mar 10, 2009 at 4:16 AM, David John davidj...@gmail.com wrote:

Stoyan Gaydarov wrote:

Signed-off-by: Stoyan Gaydarov stoyboy...@gmail.com
---
fs/btrfs/ctree.c | 3 +--
fs/btrfs/extent-tree.c | 3 +--
fs/btrfs/free-space-cache.c | 3 +--
fs/btrfs/tree-log.c | 3 +--
4 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 37f31b5..2c590b0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2174,8 +2174,7 @@ static int insert_ptr(struct btrfs_trans_handle *trans, struct btrfs_root

BUG_ON(!path->nodes[level]);
lower = path->nodes[level];
nritems = btrfs_header_nritems(lower);
- if (slot > nritems)
- BUG();
+ BUG_ON(slot > nritems);
if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
BUG();


^ You seem to have missed one.


Actually that one was left on purpose because BUG_ON calls are not to
have any side effects and I do not know enough about btrfs to know
what BTRFS_NODEPTRS_PER_BLOCK does so it was left as is.




if (slot != nritems) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9abf81f..0314ab6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -672,8 +672,7 @@ static noinline int insert_extents(struct btrfs_trans_handle *trans,

keys+i, data_size+i, total-i);
BUG_ON(ret < 0);

- if (last && ret > 1)
- BUG();
+ BUG_ON(last && ret > 1);

leaf = path->nodes[0];
for (c = 0; c < ret; c++) {
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d1e5f0e..b0c7661 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -267,8 +267,7 @@ static int __btrfs_add_free_space(struct btrfs_block_group_cache *block_group,

out:
if (ret) {
printk(KERN_ERR "btrfs: unable to add free space :%d\n", ret);
- if (ret == -EEXIST)
- BUG();
+ BUG_ON(ret == -EEXIST);
}

kfree(alloc_info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9c462fb..2c892f6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1150,8 +1150,7 @@ insert:
ret = insert_one_name(trans, root, path, key->objectid, key->offset,
name, name_len, log_type, &log_key);

- if (ret && ret != -ENOENT)
- BUG();
+ BUG_ON(ret && ret != -ENOENT);
goto out;
}






--

-Stoyan


Re: ssd optimised mode

2009-02-23 Thread Oliver Mattos

Hi,

While this sounds nice in theory, in practice, since erase blocks are 
generally very large and the hardware does its own block remapping (the FTL), 
you can never be sure which data blocks are at risk when re-writing just one 
block.  There is a good chance that rewriting one block of data somewhere will 
wipe out a block of data at a totally different location on the disk.


Since the filesystem has no understanding of the hardware FTL, there isn't 
much that can be done about this at the filesystem level.


The only thing that could be done is metadata mirroring on the same disk, 
but I suspect even that would only marginally improve reliability: since 
both copies of the metadata are likely to be written to storage at nearly 
the same time, they are quite likely to be remapped into the same block - 
and then if you lose one, you lose both.


The solution to this is twofold:
1 :- Filesystems should be able to detect when a non-atomic write has 
corrupted the filesystem and tell the user - e.g. "Filesystem corruption 
found, likely hardware malfunction" (since with a filesystem like btrfs, 
once all the big software bugs are resolved, the only thing that can cause 
disk corruption is a hardware issue).


2 :- Someone who feels strongly about this needs to test every big brand of 
SSD and name and shame those whose writes are non-atomic - as soon as it is 
publicised that certain brands of SSD put data at risk, the manufacturers 
will be very quick to resolve it.



- Original Message - 
From: Seth Huang seth...@gmail.com

To: linux-btrfs@vger.kernel.org
Sent: Monday, February 23, 2009 3:17 AM
Subject: Re: ssd optimised mode



On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin djshi...@gmail.com wrote:
A well-designed SSD should survive power cycling and should provide 
atomicity of flush operation regardless of the underlying flash operations. 
I don't expect that users of SSD have different requirements about atomicity.


A reliable system should be based on the assumption that the underlying 
parts are unreliable. Therefore, we should do as much as possible to ensure 
reliability in our filesystem instead of leaning on the SSDs.


--
Seth Huang



Re: Data De-duplication

2008-12-11 Thread Oliver Mattos
 Neat.  Thanks much.  It'd be cool to output the results of each of your
 hashes to a database so you can get a feel for how many duplicate
 blocks there are cross-files as well.
 
 I'd like to run this in a similar setup on all my VMware VMDK files and
 get an idea of how much space savings there would be across 20+ Windows
 2003 VMDK files... probably *lots* of common blocks.
 
 Ray
Hi,

Currently it DOES do cross-file block matching - that's why it takes sooo
long to run :-)

If you remove everything after the word sort, it will produce verbose
output, which you could then stick into some SQL database if you wanted.
You could relatively easily format the output for direct input to an SQL
database by modifying the line with the dd in it within the first while
loop.  You can also remove the sort and the pipe before it to get unsorted
output - the advantage of this is that it takes less time.

I'm guessing that if you had the time to run this on multi-gigabyte disk
images you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines running the same operating system.

That means if you have 20 Win 2k3 VMs and the first VM image of Windows
+ software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.
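
As a rough worked example (treating the 80% figure as a flat per-image
duplication rate - that is my assumption, not a measurement):

\[
2\,\mathrm{GB} + 19 \times 2\,\mathrm{GB} \times (1 - 0.8) \approx 9.6\,\mathrm{GB},
\]

so the ~6GB figure further assumes the duplication rate climbs well above
80% once many near-identical images are in the pool.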

Thanks
Oliver



Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
  2)  Keep a tree of checksums for data blocks, so that a bit of data can
  be located by it's checksum.  Whenever a data block is about to be
  written check if the block matches any known block, and if it does then
  don't bother duplicating the data on disk.  I suspect this option may
  not be realistic for performance reasons.
  
 
 When compression was added, the writeback path was changed to make
 option #2 viable, at least in the case where the admin is willing to
 risk hash collisions on strong hashes.  When a direct read
 comparison is required before sharing blocks, it is probably best done
 by a stand-alone utility, since we don't want to wait for a read of a full
 extent every time we want to write one.
 

Can we assume hash collisions won't occur?  I mean, if it's a 256-bit
hash, then even with 256TB of data and one hash per block, the chances
of a collision are still too small to calculate on GNOME Calculator.
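
As a rough back-of-the-envelope check (assuming 4KB blocks and an ideal
256-bit hash - the block size is my assumption, not a figure from the
mail): 256TB is about n = 2^36 blocks, and the birthday bound gives

\[
P_{\text{collision}} \;\lesssim\; \frac{n^2}{2 \cdot 2^{256}}
  = \frac{(2^{36})^2}{2^{257}}
  = 2^{-185} \approx 2 \times 10^{-56},
\]

which is vastly smaller than the chance of an undetected disk or RAM error
corrupting the same data.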

The only issue is that if the hash algorithm is later found to be flawed,
a malicious piece of data whose hash collides with some more important data
could be stored on the disk, potentially allowing the contents of one file
to be replaced with another.

Even if we don't assume hash collisions won't occur (eg. for CRCs), the
write performance when writing duplicate files is equal to the read
performance of the disk, since for every block written by a program, one
block will need to be read and no blocks written.  This is still better
than the plain write case (as most devices read faster than they write),
and has the added advantage of saving lots of space.



Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote:
  When a direct read
  comparison is required before sharing blocks, it is probably best done
  by a stand-alone utility, since we don't want to wait for a read of a full
  extent every time we want to write one.
 
 Can a stand-alone utility do this without a race? Eg, if a portion of one
 of the files has already been read by the util, but is changed before the
 util has a chance to actually do the ioctl to duplicate the range.
 
 If we assume someone is keeping a hash table of likely matches, might it
 make sense to have a verify-and-dup ioctl that does this safely?
 

This sounds good because then a standard user could safely do this to
any file as long as he had read access to both files, but wouldn't need
write access (since the operation doesn't actually modify the file
data).



Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
I see quite a few uses for this, and while it looks like the kernel-mode
automatic de-dup-on-write code might be costly in performance, require disk
format changes, and be controversial, it sounds like the user-mode utility
could be implemented today.

It looks like a simple script could do the job - just iterate through
every file in the filesystem, run md5sum on every block of every file, and
whenever a duplicate is found, call an ioctl to remove the duplicate
data.  By md5summing each block it can also effectively compress disk
images.

While not very efficient, it should work, and having something like this
in the toolkit would mean that as soon as btrfs gets stable enough for
everyday use it would immediately outdo every other Linux filesystem in
terms of space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good.  I'm
willing to bet even the average user has, say, 1-2% of data duplicated
somewhere on the hard drive - accidental copies instead of moves, the same
application installed to two different paths, two users who happen to have
the same file saved in their home folders, and so on - so even the average
user will benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see
how much duplicate wasted data I really have.

Thanks
Oliver




Re: Data De-duplication

2008-12-10 Thread Oliver Mattos

 It would be interesting to see how many duplicate *blocks* there are
 across the filesystem, agnostic to files...
 
 Is this something your script does, Oliver?

My script doesn't yet exist, although when created it would, yes.  I was
thinking of just making a BASH script that uses dd to extract 512-byte
chunks of files, pipes them through md5sum and saves the result in a large
index file, then iterates through the index file looking for duplicate
hashes.

In fact that sounds so easy I might do it right now...  (only to
proof-of-concept stage - a real utility would probably want to be written
in a compiled language and use proper trees for faster searching)
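
For what it's worth, here is a minimal sketch of what that proof of concept
could look like (this script is not from the original mails; it assumes
bash, GNU coreutils and a scan root of /mnt, and it will be painfully slow
because it runs one dd and one md5sum per 512-byte block):

#!/bin/bash
# block-dup-scan.sh - proof-of-concept duplicate-block scanner (sketch only).
# Hashes every 512-byte chunk of every file under /mnt, writes a sorted
# "hash file blocknumber" index, then lists hashes that occur more than once.
# Filenames containing newlines will confuse the index; fine for a demo.

find /mnt -type f -print0 |
while IFS= read -r -d '' f; do
    size=$(stat -c %s "$f")
    nblocks=$(( (size + 511) / 512 ))
    blk=0
    while [ "$blk" -lt "$nblocks" ]; do
        # one dd + md5sum invocation per block is what makes this so slow
        hash=$(dd if="$f" bs=512 skip="$blk" count=1 2>/dev/null |
               md5sum | cut -d' ' -f1)
        printf '%s %s %s\n' "$hash" "$f" "$blk"
        blk=$(( blk + 1 ))
    done
done | sort > blockhashes.txt

# any md5 appearing more than once marks a candidate duplicate block
uniq -w32 -c blockhashes.txt | awk '$1 > 1'

Dropping the final sort/uniq stage gives the raw per-block listing, which
could instead be loaded into an SQL database as discussed above.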



Data De-duplication

2008-12-09 Thread Oliver Mattos
Hi,

Say I download a large file from the net to /mnt/a.iso.  I then download
the same file again to /mnt/b.iso.  These files now have the same
content, but are stored twice since the copies weren't made with the bcp
utility.

The same occurs if a directory tree with duplicate files (created with
bcp) is put through a non-aware program - for example tarred and then
untarred again.

This could be improved in two ways:

1)  Make a utility which checks the checksums of all the data extents; if
the checksums match for two files, then compare the file data, and if the
file data matches, keep only one copy.  It could be run as a cron job to
free up disk space on systems where duplicate data is common (eg. virtual
machine images).

2)  Keep a tree of checksums for data blocks, so that a piece of data can
be located by its checksum.  Whenever a data block is about to be written,
check if the block matches any known block, and if it does then don't
bother duplicating the data on disk.  I suspect this option may not be
realistic for performance reasons.

If either is possible then thought needs to be put into whether it's worth
doing at a file level or a partial-file level (ie. if I have two similar
files, can the space used by the identical parts of the files be saved?)

Has any thought been put into either 1) or 2) - is either possible or
desired?

Thanks
Oliver



Weak Allocation

2008-12-02 Thread Oliver Mattos
Hi,

I presume that in the design of BTRFS, like most other filesystems, a
block on the underlying storage is either allocated (ie. storing metadata
or file data) or deallocated (possibly blank or containing leftover
garbage, but the contents are irrelevant).

Does BTRFS have any mechanism that would allow adding, at a later point in
time, a feature for weak allocation of blocks?  By that I mean the block
is allocated (ie. storing useful data), but if another, higher-priority
file needs to be written and there are few or no alternate blocks readily
available for use, the data will be replaced.

I could foresee uses for a feature like that as a cache - for example my
web browsing cache is not vital data, and as such doesn't need to use up
disk space, but it might as well use up any disk space that would otherwise
go unused.  The cache data can always be regenerated, so losing the data
isn't a problem.

Other uses of the feature could be persistent network caches (ie. storing
copies of remote files on the network so they can be accessed faster
locally), but again the cache data isn't critical to the operation of the
system, so it could be stored in weakly allocated blocks.  Further uses
could be caches of compressed files (decompressed versions of the same
files are also saved in other blocks, and depending on IO and CPU load
either the compressed or decompressed version is used).

From a user-land perspective, these files could be created with a
special flag which specifies they are only weakly allocated, which
means any time the file has no open file descriptors it could vanish
if the underlying filesystem wants to use the space it occupies for
something else.  A file could have a priority value which specifies
how important it is, and therefore how likely it is to be erased if a
new block needs to be allocated.  The block allocator would use this
information, together with the physical layout of the data to decide
where to place new data to avoid fragmentation while retaining possibly
useful data for future use.

Let me know your ideas on this - at the moment it's only an idea, but I'm
interested to know a) whether it would be possible to implement in a
complex filesystem like btrfs, and b) whether it would prove useful if
implemented.

Thanks
Oliver.

PS.  I realise this could be implemented with a user-space daemon which
polls available disk space and deletes caches when disk space gets low (as
Windows does with shadow copies), but that hardly seems ideal, since it
can't intelligently choose which caches to delete to reduce fragmentation,
and large sudden disk allocations will fail.




