Re: btrfs is using 25% more disk than it should

2014-12-23 Thread Zygo Blaxell
On Sat, Dec 20, 2014 at 06:28:22AM -0500, Josef Bacik wrote:
 We now have two extents
 with the same bytenr but with different lengths.  
[...]
 Then there is the problem of actually returning the free space.  Now
 if we drop all of the refs for an extent we know the space is free
 and we return it to the allocator.  With the above example we can't
 do that anymore, we have to check the extent tree for any area that
 is left overlapping the area we just freed.  This adds another
 search to every btrfs_free_extent operation, which slows the whole
 system down and again leaves us with weird corner cases and pain for
 the users.  Plus this would be an incompatible format change, so it
 would require setting a feature flag in the fs and being rolled out
 voluntarily.

Ouchie.

 Now I have another solution, but I'm not convinced it's awesome
 either.  Take the same example above, but instead we split the
 original extent in the extent tree so we avoid all the mess of
 having overlapping ranges

Would this work for a read-only snapshot?  For a read-write snapshot
it would be as if we had modified both (or all, if there are multiple
snapshots) versions of the tree with split extents.

 This wouldn't require a format change so everybody would get
 this behaviour as soon as we turned it on

It could be a mount option, like autodefrag, off by default until the
bugs were worked out.

Arguably there could be a 'garbage-collection tool', similar to 'btrfs
fi defrag', that could be used to clean out any large partially-obscured
extents from specific files.  This might be important for deduplication
as well (although the extent-same code looks like it does split extents?).

Definitely something to think about.  Thanks for the detailed
explanations.





Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Daniele Testa
Ok, so this is what I did:

1. Copied the sparse 315GB file (with 302GB inside) to another server
2. Re-formatted the btrfs partition
3. chattr +C on the parent dir
4. Copied the 315GB file back to the btrfs partition (the file is not
sparse any more due to the copying)

This is the end result:

root@s4 /opt/drives/ssd # ls -alhs
total 316G
 16K drwxr-xr-x 1 libvirt-qemu libvirt-qemu   42 Dec 20 07:00 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
315G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 20 09:11 disk_208.img
   0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 20 06:53 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
316G    .

root@s4 /opt/drives/ssd # df -h
/dev/md3        411G  316G   94G  78% /opt/drives/ssd

root@s4 /opt/drives/ssd # btrfs filesystem df /opt/drives/ssd
Data, single: total=323.01GiB, used=315.08GiB
System, DUP: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=880.00KiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=16.00MiB, used=0.00

root@s4 /opt/drives/ssd # lsattr
---------------- ./snapshots
---------------C ./disk_208.img

As you can see, it looks much better now. The file takes as much space
as it should and the metadata is only 880KiB.

I will do some writes inside the VM and see if the file grows on the
outside. If everything is ok, it should not.
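
A quick way to sanity-check this, as a minimal sketch (assuming the paths
above):

lsattr /opt/drives/ssd/disk_208.img    # expect 'C' among the flags
btrfs filesystem df /opt/drives/ssd    # re-run after writing inside the VM;
                                       # with nodatacow in effect, Data used
                                       # should stay pinned near the file size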

2014-12-20 5:17 GMT+08:00 Josef Bacik jba...@fb.com:
 On 12/19/2014 04:10 PM, Josef Bacik wrote:

 On 12/18/2014 09:59 AM, Daniele Testa wrote:

 Hey,

 I am hoping you guys can shed some light on my issue. I know it's a
 common question that people see differences in disk usage depending on
 how it is calculated, but I still think that my issue is weird.

 root@s4 / # mount
 /dev/md3 on /opt/drives/ssd type btrfs
 (rw,noatime,compress=zlib,discard,nospace_cache)

 root@s4 / # btrfs filesystem df /opt/drives/ssd
 Data: total=407.97GB, used=404.08GB
 System, DUP: total=8.00MB, used=52.00KB
 System: total=4.00MB, used=0.00
 Metadata, DUP: total=1.25GB, used=672.21MB
 Metadata: total=8.00MB, used=0.00

 root@s4 /opt/drives/ssd # ls -alhs
 total 302G
 4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
 disk_208.img
 0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

 root@s4 /opt/drives/ssd # du -h
 0   ./snapshots
 302G    .

 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
 that partition, I have one single sparse file, taking 302GB of space
 (max 315GB). The snapshots directory is completely empty.

 However, for some weird reason, btrfs seems to think it takes 404GB.
 The big file is a disk that I use in a virtual server and when I write
 stuff inside that virtual server, the disk-usage of the btrfs
 partition on the host keeps increasing even if the sparse-file is
 constant at 302GB. I even have 100GB of free disk-space inside that
 virtual disk-file. Writing 1GB inside the virtual disk-file seems to
 increase the usage by about 4-5GB on the outside.

 Does anyone have a clue on what is going on? How can the difference
 and behaviour be like this when I just have one single file? Is it
 also normal to have 672MB of metadata for a single file?


 Hello and welcome to the wonderful world of btrfs, where COW can really
 suck hard without being super clear why!  It's 4pm on a Friday right
 before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
 use pretty pictures.  You have this case to start with

 file offset 0   offset 302g
 [-prealloced 302g extent--]

 (man it's impressive I got all that lined up right)

 On disk you have 2 things.  First your file, which has file extents
 that say

 inode 256, file offset 0, size 302g, offset 0, disk bytenr 123,
 disklen 302g

 and then the extent tree, which keeps track of actual allocated
 space, has this

 extent bytenr 123, len 302g, refs 1

 Now say you boot up your virt image and it writes 1 4k block to offset
 0.  Now you have this

 [4k][302g-4k--]

 And for your inode you now have this

 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
 disklen 302g

 and in your extent tree you have

 extent bytenr 123, len 302g, refs 1
 extent bytenr whatever, len 4k, refs 1

 See that?  Your file is still the same size; it is still 302g.  If you
 cp'ed it right now it would copy 302g of information.  But what have you
 actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
 your virt thing decides to write to the middle, let's say at offset 12k.
 Now you have this

 inode 256, file 

Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Josef Bacik

On 12/20/2014 01:18 AM, Daniele Testa wrote:

But I read somewhere that compression should be turned off on mounts
that just store large VM-images. Is that wrong?



It doesn't really matter frankly.  Usually virt images are preallocated 
with fallocate which means compression doesn't happen as writes into 
fallocated areas aren't compressed, but you aren't doing that so you 
would be getting some compression.



Btw, I am not pre-allocating space for the images. I use sparse files with:

dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G

It creates the file in a few ms.
Is it better to use fallocate with btrfs?



It depends.  If you are going to use nodatacow for your virt images then 
I would definitely suggest using fallocate since you'll get a nice 
contiguous chunk of data for your virt images.



If I use sparse files, it adds a benefit when I want to copy/move the
image-file to another server.
Like if the 300GB sparse file just has 10GB of data in it, I only need
to copy 10GB when moving it to another server.
Would the same be true with fallocate?



No, but send/receive would only copy 10GB, but the resulting file would 
be sparse.



Anyways, would disabling CoW (by putting +C on the parent dir) prevent
the performance issues and 2*filesize issue?



Yes.  Thanks,

Josef
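
A minimal sketch of the options discussed above (file names and the remote
host are assumptions):

# sparse creation, equivalent to the dd one-byte trick:
truncate -s 300G drive.img
# or, for nodatacow setups, preallocate (set +C on the empty dir first):
chattr +C .
fallocate -l 300G drive.img
# moving a mostly-empty image without copying the holes:
rsync -avS drive.img root@otherhost:/opt/drives/ssd/   # -S preserves sparseness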


Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Robert White

On 12/19/2014 01:17 PM, Josef Bacik wrote:

tl;dr: Cow means you can in the worst case end up using 2 * filesize -
blocksize of data on disk and the file will appear to be filesize.  Thanks,


Isn't the worst case more like N^log(N) (where N is the file size in
blocks) in the pernicious case?


Staggered block overwrites can peer down through gaps to create more
than two layers of retention. The only real requirement is that each
layer get smaller than the one before it so as to leave some of each of
its predecessor visible.


So if I make a file of size N blocks, then overwrite it with N-1 blocks,
then overwrite it again with N-2 blocks (etc.), I can easily create a
deep slope of obscured data.


[---------------------------]
 [-------------------------]
  [-----------------------]
   [---------------------]
    [-------------------]
(etc...)


Or would I have to bracket the front and back

------------------
 ----------------
  --------------

Or could I bracket the sides

-----------------
---           ---
 --           --
  -           -

There's got to be pathological patterns like this that can end up with a
heck of a lot of hidden data.



Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Josef Bacik

On 12/20/2014 12:52 AM, Zygo Blaxell wrote:

On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:

And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size; it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what have you
actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
your virt thing decides to write to the middle, let's say at offset 12k.
Now you have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extent, so we bumped the ref
count.  This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.


Wait, *what*?

OK, I did a small experiment, and found that btrfs actually does do
something like this.  Can't argue with fact, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner.  :-P



So we've thought about changing this, and will eventually, but it's kind 
of difficult.  Above is an example of what happens currently, so the 
split code for file extents is kind of big and scary, check 
__btrfs_drop_extents.  We would have to fix that to adjust the 
disk_bytenr and disk_num_bytes, which isn't too bad since we already are 
doing this dance and adjusting offset.  The trick would be when updating 
the extent references, we would have to split those extents.  So say we 
have a 128mb extent and we write 4k at 1mb.  If we split the extent refs 
we'd have this afterwards


(note this isn't how they'd be ordered on disk, just written this way so 
it makes logical sense)


extent bytenr 0, len 1mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 127mb-4k, refs 1

Ok so now we have 3 extents in the extent tree to describe essentially 2 
ranges that are in use, but we get back the 4k so that's nice.  But wait 
there's more!  What if we're snapshotted?  We can't just drop that 4k 
because somebody else has a reference to it.  So what do we do?  Well we 
could do something like this


extent bytenr 0, len 1mb, refs 1
extent bytenr 0, len 128mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 127mb-4k, refs 1

This creates all sorts of problems for us.  We now have two extents with 
the same bytenr but with different lengths.  This could be ok, we'd have 
to add a bunch of checks to make sure we're looking at the right extent, 
but it wouldn't be horrible.  I imagine we'd be fixing weird corruption 
bugs for a few releases though while we found all of the corner cases we 
missed.


Then there is the problem of actually returning the free space.  Now if 
we drop all of the refs for an extent we know the space is free and we 
return it to the allocator.  With the above example we can't do that 
anymore, we have to check the extent tree for any area that is left 
overlapping the area we just freed.  This adds another search to every
btrfs_free_extent operation, which slows the whole system down and again
leaves us with weird corner cases and pain for the users.  Plus this
would be an incompatible format change, so it would require setting a
feature flag in the fs and being rolled out voluntarily.


Now I have another solution, but I'm not convinced it's awesome either. 
 Take the same example above, but instead we split the original extent 
in the extent tree so we avoid all the mess of having overlapping ranges 
and get this instead


extent bytenr 0, len 1mb, refs 2
extent bytenr 1mb, len 4k, refs 1  -- part of the original extent
pointed to by the snapshot
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 127mb-4k, refs 2

So yay we've solved the problem of overlapping extents and bonus this is 
backwards compatible.  So why don't we do this?  Well all the reasons I 
listed above about corner cases and much pain for our users.  This 
wouldn't require a format change so everybody would get this behaviour 
as soon as we turned it on, and I feel I would be doing a lot of fsck 
work for the next 6 months.  Plus we would have to add a 'split' 
operation to the extent operations that copies all of the 

Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Josef Bacik

On 12/20/2014 06:23 AM, Robert White wrote:

On 12/19/2014 01:17 PM, Josef Bacik wrote:

tl;dr: Cow means you can in the worst case end up using 2 * filesize -
blocksize of data on disk and the file will appear to be filesize.
Thanks,


Isn't the worst case more like N^log(N) (where N is the file size in
blocks) in the pernicious case?

Staggered block overwrites can peer down through gaps to create more
than two layers of retention. The only real requirement is that each
layer get smaller than the one before it so as to leave some of each of
its predecessor visible.

So if I make a file of size N blocks, then overwrite it with N-1 blocks,
then overwrite it again with N-2 blocks (etc.), I can easily create a
deep slope of obscured data.

[---------------------------]
 [-------------------------]
  [-----------------------]
   [---------------------]
    [-------------------]
(etc...)


Or would I have to bracket the front and back

------------------
 ----------------
  --------------

Or could I bracket the sides

-----------------
---           ---
 --           --
  -           -

There's got to be pathological patterns like this that can end up with a
heck of a lot of hidden data.


Just the sloped case would do it; the pathological case would result in
way more space used than you expect.  So I guess the worst case would be
something like


(num_blocks + (num_blocks - 1)!) * blocksize

in actual space usage.  Our extents are limited to 128mb in size, but
still that ends up being pretty huge.  I'm actually going to do this 
locally and see what happens.  Thanks,


Josef
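
For anyone who wants to try the same experiment, a sketch of the
staggered-overwrite pattern (mountpoint and sizes are assumptions):

mnt=/mnt/btrfs-test
n=1024                                   # file size in 4k blocks
dd if=/dev/zero of=$mnt/f bs=4K count=$n 2>/dev/null
for ((i = n - 1; i > 0; i--)); do
    # each pass rewrites one block fewer than the last, from offset 0
    dd if=/dev/urandom of=$mnt/f bs=4K count=$i conv=notrunc 2>/dev/null
    sync   # force each generation to disk so it becomes its own extent
done
btrfs filesystem df $mnt                 # Data used tends toward n(n+1)/2 blocks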


Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Robert White

On 12/20/2014 03:39 AM, Josef Bacik wrote:

On 12/20/2014 06:23 AM, Robert White wrote:

On 12/19/2014 01:17 PM, Josef Bacik wrote:

tl;dr: Cow means you can in the worst case end up using 2 * filesize -
blocksize of data on disk and the file will appear to be filesize.
Thanks,


Isn't the worst case more like N^log(N) (where N is the file size in
blocks) in the pernicious case?

Staggered block overwrites can peer down through gaps to create more
than two layers of retention. The only real requirement is that each
layer get smaller than the one before it so as to leave some of each of
its predecessor visible.

So if I make a file of size N blocks, then overwrite it with N-1 blocks,
then overwrite it again with N-2 blocks (etc.), I can easily create a
deep slope of obscured data.

[---------------------------]
 [-------------------------]
  [-----------------------]
   [---------------------]
    [-------------------]
(etc...)


Or would I have to bracket the front and back

------------------
 ----------------
  --------------

Or could I bracket the sides

-----------------
---           ---
 --           --
  -           -

There's got to be pathological patterns like this that can end up with a
heck of a lot of hidden data.


Just the sloped case would do it; the pathological case would result in
way more space used than you expect.  So I guess the worst case would be
something like

(num_blocks + (num_blocks - 1)!) * blocksize


I think that for a single file it's not factorial but a consecutive sum
(one of Gauss' equations).

so max = ((n * (n+1)) / 2) * blocksize

A lot smaller than factorial, but still (n^2+n)/2 blocks, which is
nothing to discard lightly.
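
Plugging in numbers, as an illustration (assuming 4KiB blocks and the
128MiB extent cap quoted below): a single 128MiB extent is n = 32768
blocks, so

max = (32768 * 32769 / 2) * 4KiB ~= 2TiB

of space pinned by one pathological file, before any extents get freed.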




in actual space usage. Our extents are limited to 128mb in size, but
still that ends up being pretty huge.  I'm actually going to do this
locally and see what happens.  Thanks,

Josef





Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Robert White

On 12/19/2014 01:10 PM, Josef Bacik wrote:

On 12/18/2014 09:59 AM, Daniele Testa wrote:

Hey,

I am hoping you guys can shed some light on my issue. I know it's a
common question that people see differences in disk usage depending on
how it is calculated, but I still think that my issue is weird.

root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)

root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00

root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
disk_208.img
0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
302G    .

As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
that partition, I have one single sparse file, taking 302GB of space
(max 315GB). The snapshots directory is completely empty.

However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of free disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage by about 4-5GB on the outside.

Does anyone have a clue on what is going on? How can the difference
and behaviour be like this when I just have one single file? Is it
also normal to have 672MB of metadata for a single file?



Hello and welcome to the wonderful world of btrfs, where COW can really
suck hard without being super clear why!  It's 4pm on a Friday right
before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
use pretty pictures.  You have this case to start with

file offset 0   offset 302g
[-prealloced 302g extent--]

(man it's impressive I got all that lined up right)

On disk you have 2 things.  First your file, which has file extents
that say

inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g

and then the extent tree, which keeps track of actual allocated space,
has this

extent bytenr 123, len 302g, refs 1

Now say you boot up your virt image and it writes 1 4k block to offset
0.  Now you have this

[4k][302g-4k--]

And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size; it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what have you
actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
your virt thing decides to write to the middle, let's say at offset 12k.
Now you have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extent, so we bumped the ref
count.  This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.

We split big extents with cow, so unless you've got lots of space to
spare or are going to use nodatacow you should probably not pre-allocate
virt images.  Thanks,


Still too new to the code base to offer much other than pseudocode...

Is it easy to find all the inodes that are using a particular extent 
at runtime?


It occurs to me that since every extent starts life with exactly one
owner, a scrupulous breaking of extents can prevent the unbounded
left-overlap problem...


If the preexisting extent is always broken up into two or three new
extents wherever it's being referenced, then problematic overlaps are
eliminated and dead data can be discarded as soon as it's actually dead.


So in the exemplar case
'.' == preexisting extent
'+' == new written extent
'-' == preexisting described by new 

Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/18/2014 9:59 AM, Daniele Testa wrote:
 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On 
 that partition, I have one single sparse file, taking 302GB of
 space (max 315GB). The snapshots directory is completely empty.

So you don't have any snapshots or other subvolumes?

 However, for some weird reason, btrfs seems to think it takes
 404GB. The big file is a disk that I use in a virtual server and
 when I write stuff inside that virtual server, the disk-usage of
 the btrfs partition on the host keeps increasing even if the
 sparse-file is constant at 302GB. I even have 100GB of free
 disk-space inside that virtual disk-file. Writing 1GB inside the
 virtual disk-file seems to increase the usage by about 4-5GB on the
 outside.

Did you flag the file as nodatacow?

 Does anyone have a clue on what is going on? How can the
 difference and behaviour be like this when I just have one single
 file? Is it also normal to have 672MB of metadata for a single
 file?

You probably have the data checksums enabled and that isn't
unreasonable for checksums on 302g of data.




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Daniele Testa
No, I don't have any snapshots or subvolumes. Only that single file.

The file has both checksums and datacow on it. I will do chattr +C
on the parent dir and re-create the file to make sure all files are
marked as nodatacow.

Should I also turn off checksums with the mount-flags if this
filesystem only contains big VM-files? Or is it not needed if I put +C
on the parent dir?

2014-12-20 2:53 GMT+08:00 Phillip Susi ps...@ubuntu.com:

 On 12/18/2014 9:59 AM, Daniele Testa wrote:
 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
 that partition, I have one single sparse file, taking 302GB of
 space (max 315GB). The snapshots directory is completely empty.

 So you don't have any snapshots or other subvolumes?

 However, for some weird reason, btrfs seems to think it takes
 404GB. The big file is a disk that I use in a virtual server and
 when I write stuff inside that virtual server, the disk-usage of
 the btrfs partition on the host keeps increasing even if the
 sparse-file is constant at 302GB. I even have 100GB of free
 disk-space inside that virtual disk-file. Writing 1GB inside the
 virtual disk-file seems to increase the usage by about 4-5GB on the
 outside.

 Did you flag the file as nodatacow?

 Does anyone have a clue on what is going on? How can the
 difference and behaviour be like this when I just have one single
 file? Is it also normal to have 672MB of metadata for a single
 file?

 You probably have the data checksums enabled and that isn't
 unreasonable for checksums on 302g of data.




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/19/2014 2:59 PM, Daniele Testa wrote:
 No, I don't have any snapshots or subvolumes. Only that single
 file.
 
 The file has both checksums and datacow on it. I will do chattr
 +C on the parent dir and re-create the file to make sure all files
 are marked as nodatacow.
 
 Should I also turn off checksums with the mount-flags if this 
 filesystem only contains big VM-files? Or is it not needed if I put
 +C on the parent dir?

If you don't want the overhead of those checksums, then yeah.  Also I
would question why you are using btrfs to hold only big vm files in
the first place.  You would be better off using lvm thinp volumes
instead of files, though personally I prefer to just use regular lvm
volumes and manually allocate enough space.  It avoids the
fragmentation you get from thin provisioning (or qcow2) at the cost
of a bit of overallocated space and the need to do some manual
resizing to add more if and when it is needed.
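
A sketch of the LVM alternatives being suggested (volume group and LV
names are made up for illustration):

lvcreate -L 350G --thinpool pool0 vg0           # thin pool
lvcreate -V 300G --thin -n disk_208 vg0/pool0   # thin-provisioned volume
# or a plain, fully-allocated LV, grown manually if and when needed:
lvcreate -L 310G -n disk_208 vg0
lvextend -L +50G vg0/disk_208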



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Josef Bacik

On 12/18/2014 09:59 AM, Daniele Testa wrote:

Hey,

I am hoping you guys can shed some light on my issue. I know it's a
common question that people see differences in disk usage depending on
how it is calculated, but I still think that my issue is weird.

root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)

root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00

root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
302G    .

As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
that partition, I have one single sparse file, taking 302GB of space
(max 315GB). The snapshots directory is completely empty.

However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of free disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage by about 4-5GB on the outside.

Does anyone have a clue on what is going on? How can the difference
and behaviour be like this when I just have one single file? Is it
also normal to have 672MB of metadata for a single file?



Hello and welcome to the wonderful world of btrfs, where COW can really 
suck hard without being super clear why!  It's 4pm on a Friday right 
before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to 
use pretty pictures.  You have this case to start with


file offset 0   offset 302g
[-prealloced 302g extent--]

(man it's impressive I got all that lined up right)

On disk you have 2 things.  First your file, which has file extents
that say


inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g

and then the extent tree, which keeps track of actual allocated space,
has this


extent bytenr 123, len 302g, refs 1

Now say you boot up your virt image and it writes 1 4k block to offset 
0.  Now you have this


[4k][302g-4k--]

And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
disklen 302g


and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size; it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what have you
actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
your virt thing decides to write to the middle, let's say at offset 12k.
Now you have this


inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
disklen 302g


and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extent, so we bumped the ref
count.  This will happen over and over again until we have completely 
overwritten the original extent, at which point your space usage will go 
back down to ~302g.


We split big extents with cow, so unless you've got lots of space to 
spare or are going to use nodatacow you should probably not pre-allocate 
virt images.  Thanks,


Josef
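
An easy way to watch the split happen, as a sketch (mountpoint and sizes
are assumptions):

mnt=/mnt/btrfs-test
fallocate -l 1G $mnt/img             # one preallocated extent
sync; btrfs filesystem df $mnt       # note Data used
dd if=/dev/urandom of=$mnt/img bs=4K count=1 seek=3 conv=notrunc
sync; btrfs filesystem df $mnt       # used grows by ~4k while the whole
                                     # 1G extent stays allocated
filefrag -v $mnt/img                 # shows the file extents split around 12k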


Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Josef Bacik

On 12/19/2014 02:59 PM, Daniele Testa wrote:

No, I don't have any snapshots or subvolumes. Only that single file.

The file has both checksums and datacow on it. I will do chattr +C
on the parent dir and re-create the file to make sure all files are
marked as nodatacow.

Should I also turn off checksums with the mount-flags if this
filesystem only contains big VM-files? Or is it not needed if I put +C
on the parent dir?


Please God don't turn off checksums.  Checksums are tracked in
metadata anyway, so they won't show up in the data accounting.  Our csums
are 8 bytes per block, so basic math says you are going to max out at
604 megabytes for that big of a file.
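
Spelling that math out, as an illustration (assuming 4KiB blocks and the
302g of data actually written): 302GiB / 4KiB = 79,167,488 blocks, times
8 bytes of csum per block = 633,339,904 bytes, i.e. 604MiB.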


Please people, try to only take advice from people who know what they are
talking about.  So unless it's from somebody who has commits in
btrfs/btrfs-progs, take their feedback with a grain of salt.  Thanks,


Josef



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Josef Bacik

On 12/19/2014 04:10 PM, Josef Bacik wrote:

On 12/18/2014 09:59 AM, Daniele Testa wrote:

Hey,

I am hoping you guys can shed some light on my issue. I know it's a
common question that people see differences in disk usage depending on
how it is calculated, but I still think that my issue is weird.

root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)

root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00

root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
disk_208.img
0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
302G    .

As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
that partition, I have one single sparse file, taking 302GB of space
(max 315GB). The snapshots directory is completely empty.

However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of free disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage by about 4-5GB on the outside.

Does anyone have a clue on what is going on? How can the difference
and behaviour be like this when I just have one single file? Is it
also normal to have 672MB of metadata for a single file?



Hello and welcome to the wonderful world of btrfs, where COW can really
suck hard without being super clear why!  It's 4pm on a Friday right
before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
use pretty pictures.  You have this case to start with

file offset 0   offset 302g
[-prealloced 302g extent--]

(man it's impressive I got all that lined up right)

On disk you have 2 things.  First your file, which has file extents
that say

inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g

and then the extent tree, which keeps track of actual allocated space,
has this

extent bytenr 123, len 302g, refs 1

Now say you boot up your virt image and it writes 1 4k block to offset
0.  Now you have this

[4k][302g-4k--]

And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size; it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what have you
actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
your virt thing decides to write to the middle, let's say at offset 12k.
Now you have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extent, so we bumped the ref
count.  This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.

We split big extents with cow, so unless you've got lots of space to
spare or are going to use nodatacow you should probably not pre-allocate
virt images.  Thanks,



Sorry should have added a

tl;dr: Cow means you can in the worst case end up using 2 * filesize - 
blocksize of data on disk and the file will appear to be filesize.  Thanks,


Josef



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Phillip Susi

On 12/19/2014 4:15 PM, Josef Bacik wrote:
 Please God don't turn off of checksums.  Checksums are tracked in 
 metadata anyway, they won't show up in the data accounting.  Our
 csums are 8 bytes per block, so basic math says you are going to
 max out at 604 megabytes for that big of a file.

Yes, and it is exactly that metadata space he is complaining about.
So if you don't want to use up all of that space (and have no use for
the checksums), then you turn them off.

 Please people try to only take advice from people who know what
 they are talking about.  So unless it's from somebody who has
 commits in btrfs/btrfs-progs take their feedback with a grain of
 salt.  Thanks,

Well that is rather arrogant and rude.  For that matter, I *do* have
commits in btrfs-progs.




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Josef Bacik

On 12/19/2014 04:53 PM, Phillip Susi wrote:


On 12/19/2014 4:15 PM, Josef Bacik wrote:

Please God don't turn off of checksums.  Checksums are tracked in
metadata anyway, they won't show up in the data accounting.  Our
csums are 8 bytes per block, so basic math says you are going to
max out at 604 megabytes for that big of a file.


Yes, and it is exactly that metadata space he is complaining about.
So if you don't want to use up all of that space ( and have no use for
the checksums ), then you turn them off.


Please people try to only take advice from people who know what
they are talking about.  So unless it's from somebody who has
commits in btrfs/btrfs-progs take their feedback with a grain of
salt.  Thanks,


Well that is rather arrogant and rude.  For that matter, I *do* have
commits in btrfs-progs.



root@destiny ~/btrfs-progs# git log --oneline --author=Phillip Susi
c65345d btrfs-progs: document --rootdir mkfs switch
f6b6e93 btrfs-progs: removed extraneous whitespace from mkfs man page

Sorry I should have qualified that statement better.

So unless it's from somebody who has had commits to meaningful portions
of btrfs/btrfs-progs, take their feedback with a grain of salt.


There are too many people on this list who give random horribly wrong
advice to users that can result in data loss or corruption.  Now I'll
admit I read her question wrong, so what you said wasn't incorrect; I'm
sorry for that.  I've seen a lot of people responding to questions
recently whom I don't recognize and who have been completely full of
crap; I just assumed you were in that camp as well.  Thanks,


Josef


Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Duncan
Daniele Testa posted on Sat, 20 Dec 2014 03:59:42 +0800 as excerpted:

 The file has both checksums and datacow on it. I will do chattr +C
 on the parent dir and re-create the file to make sure all files are
 marked as nodatacow.
 
 Should I also turn off checksums with the mount-flags if this filesystem
 only contains big VM-files? Or is it not needed if I put +C on the parent
 dir?

FWIW...

Turning off datacow, whether by chattr +C on the parent dir before 
creating the file, or via mount option, turns off checksumming as well.  
(For completeness, it also turns off compression, but I don't think that 
applies in your case.)

In general, active VM images (and database files) with default flags tend 
to get very highly fragmented very fast, due to btrfs' default COW on a 
file with a heavy internal rewrite pattern (as opposed to append-only 
or full rename/replace on rewrite).  For relatively small files with this 
rewrite pattern (think typical desktop firefox sqlite database files of a
quarter GiB or less), the btrfs autodefrag mount option can be helpful,
but because it triggers a rewrite of the entire file, as filesize goes 
up, the viability of autodefrag goes down, and at somewhere around half a 
gig, autodefrag doesn't work so well any more, particularly on very 
active files where the incoming rewrite stream may be faster than btrfs 
can rewrite the entire file.

Making heavy-internal-rewrite pattern files of over say half a GiB in 
size nocow is one suggested solution.  However, snapshots lock in place 
the existing version, causing a one-time COW after a snapshot.  If people 
are doing frequent automated snapshots (say once an hour), this can be a 
big problem, as the file ends up fragmenting pretty badly with these
one-time-COW writes as well.  That's where snapshots come into the picture.

There are ways to work around the problem (put the files in question on a 
subvolume and don't snapshot it as often as the parent, setup a cron job 
to do say weekly defrag on the files in question, etc), but since you 
don't have snapshots going anyway, that's not a concern for you except as 
a preventative -- consider it if you /do/ start doing snapshots.
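
As a concrete sketch of that kind of setup (paths and schedule are
assumptions, not from the thread):

btrfs subvolume create /opt/drives/ssd/vm-images  # child subvolumes are not
                                                  # included in snapshots of
                                                  # the parent
chattr +C /opt/drives/ssd/vm-images               # new files inherit nodatacow
# optional weekly defrag via cron for anything still COW'd:
# 0 3 * * 0  root  btrfs filesystem defragment -r /opt/drives/ssd/vm-images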

So anyway, as I said, creating the file nocow (whether by mount option or 
chattr) will turn off checksumming too.  But on something that is
frequently internally rewritten, where corruption will very likely
corrupt the VM anyway and there are already mechanisms in place to deal
with that (either
VM integrity mechanisms, or backups, or simply disposable VMs, fire up a 
new one when necessary), at least with btrfs single-mode-data where 
there's no second copy to restore from if the checksum /does/ fail, 
turning off checksumming isn't necessarily as bad as it may seem anyway.

And it /should/ save you some on the metadata... tho I'd not consider 
that savings worth turning off checksumming if that were the /only/ 
reason, on its own.  The metadata difference is more a nice side-effect 
of an already commonly recommended practice for large VM image files, 
than something you'd turn off checksumming for in the first place.  
Certainly, on most files I'd prefer the checksums, and in fact am running 
btrfs raid1 mode here specifically to get the benefit of having a second 
copy to retrieve from if the first attempted copy fails checksum.  But VM 
images and database files are a bit of an exception.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Duncan
Josef Bacik posted on Fri, 19 Dec 2014 16:17:08 -0500 as excerpted:

 tl;dr: Cow means you can in the worst case end up using 2 * filesize -
 blocksize of data on disk and the file will appear to be filesize.

Thanks for the tl;dr /and/ the very sensible longer explanation.  That's 
a very nice thing to know and to file away for further reference. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Zygo Blaxell
On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
 And for your inode you now have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
 disklen 302g
 
 and in your extent tree you have
 
 extent bytenr 123, len 302g, refs 1
 extent bytenr whatever, len 4k, refs 1
 
 See that?  Your file is still the same size; it is still 302g.  If you
 cp'ed it right now it would copy 302g of information.  But what have you
 actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
 your virt thing decides to write to the middle, let's say at offset 12k.
 Now you have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
 inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
 disklen 4k
 inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
 disklen 302g
 
 and in the extent tree you have this
 
 extent bytenr 123, len 302g, refs 2
 extent bytenr whatever, len 4k, refs 1
 extent bytenr notimportant, len 4k, refs 1
 
 See that refs 2 change?  We split the original extent, so we have 2 file
 extents pointing to the same physical extent, so we bumped the ref
 count.  This will happen over and over again until we have completely
 overwritten the original extent, at which point your space usage will go
 back down to ~302g.

Wait, *what*?

OK, I did a small experiment, and found that btrfs actually does do
something like this.  Can't argue with fact, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner.  :-P

The above quoted scenario is a little oversimplified.  Chances are that
302G file is made of much smaller extents (128M..256M).  If the VM is
writing 4K randomly everywhere then those 128M+ extents are not going
away any time soon.  Even the extents that are dropped stick around for
a few btrfs transaction commits before they go away.

I couldn't reproduce this behavior until I realized the extents I was
overwriting in my tests were exactly the same size and position as
the extents on disk.  I changed the offset slightly and found that
partially-overwritten extents do in fact stick around in their entirety.

There seems to be an unexpected benefit for compression here:  compression
keeps the extents small, so many small updates will be less likely to
leave big mostly-unused extents lying around the filesystem.
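
The experiment is straightforward to repeat, roughly like this (sizes and
mountpoint are assumptions):

mnt=/mnt/btrfs-test
dd if=/dev/zero of=$mnt/f bs=8M count=1 2>/dev/null; sync
dd if=/dev/urandom of=$mnt/f bs=4K count=1 seek=1 conv=notrunc; sync
filefrag -v $mnt/f           # three file extents, two still pointing into
                             # the original 8M extent
btrfs filesystem df $mnt     # Data used stays ~8M + 4k until the whole
                             # original extent is overwritten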




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Daniele Testa
But I read somewhere that compression should be turned off on mounts
that just store large VM-images. Is that wrong?

Btw, I am not pre-allocating space for the images. I use sparse files with:

dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G

It creates the file in a few ms.
Is it better to use fallocate with btrfs?

If I use sparse files, it adds a benefit when I want to copy/move the
image-file to another server.
Like if the 300GB sparse file just has 10GB of data in it, I only need
to copy 10GB when moving it to another server.
Would the same be true with fallocate?

Anyways, would disabling CoW (by putting +C on the parent dir) prevent
the performance issues and 2*filesize issue?

2014-12-20 13:52 GMT+08:00 Zygo Blaxell ce3g8...@umail.furryterror.org:
 On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
 And for your inode you now have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
 disklen 302g
 
 and in your extent tree you have
 
 extent bytenr 123, len 302g, refs 1
 extent bytenr whatever, len 4k, refs 1
 
 See that?  Your file is still the same size; it is still 302g.  If you
 cp'ed it right now it would copy 302g of information.  But what have you
 actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
 your virt thing decides to write to the middle, let's say at offset 12k.
 Now you have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
 inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
 disklen 4k
 inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
 disklen 302g
 
 and in the extent tree you have this
 
 extent bytenr 123, len 302g, refs 2
 extent bytenr whatever, len 4k, refs 1
 extent bytenr notimportant, len 4k, refs 1
 
 See that refs 2 change?  We split the original extent, so we have 2 file
 extents pointing to the same physical extent, so we bumped the ref
 count.  This will happen over and over again until we have completely
 overwritten the original extent, at which point your space usage will go
 back down to ~302g.

 Wait, *what*?

 OK, I did a small experiment, and found that btrfs actually does do
 something like this.  Can't argue with fact, though it would be nice if
 btrfs could be smarter and drop unused portions of the original extent
 sooner.  :-P

 The above quoted scenario is a little oversimplified.  Chances are that
 302G file is made of much smaller extents (128M..256M).  If the VM is
 writing 4K randomly everywhere then those 128M+ extents are not going
 away any time soon.  Even the extents that are dropped stick around for
 a few btrfs transaction commits before they go away.

 I couldn't reproduce this behavior until I realized the extents I was
 overwriting in my tests were exactly the same size and position as
 the extents on disk.  I changed the offset slightly and found that
 partially-overwritten extents do in fact stick around in their entirety.

 There seems to be an unexpected benefit for compression here:  compression
 keeps the extents small, so many small updates will be less likely to
 leave big mostly-unused extents lying around the filesystem.


Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Duncan
Daniele Testa posted on Sat, 20 Dec 2014 14:18:31 +0800 as excerpted:

 Anyways, would disabling CoW (by putting +C on the parent dir) prevent
 the performance issues and 2*filesize issue?

It should, provided you don't then start snapshotting the file (which I 
don't believe you intend to do but just in case...).

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



btrfs is using 25% more disk than it should

2014-12-18 Thread Daniele Testa
Hey,

I am hoping you guys can shed some light on my issue. I know it's a
common question that people see differences in disk usage depending on
how it is calculated, but I still think that my issue is weird.

root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)

root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00

root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
   0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
302G    .

As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
that partition, I have one single sparse file, taking 302GB of space
(max 315GB). The snapshots directory is completely empty.

However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of free disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage by about 4-5GB on the outside.

Does anyone have a clue on what is going on? How can the difference
and behaviour be like this when I just have one single file? Is it
also normal to have 672MB of metadata for a single file?

Regards,
Daniele