Re: btrfs is using 25% more disk than it should

2014-12-20 Thread Daniele Testa
Ok, so this is what I did:

1. Copied the sparse 315GB file (with 302GB of data inside) to another server
2. Re-formatted the btrfs partition
3. Ran chattr +C on the parent dir
4. Copied the 315GB file back to the btrfs partition (the file is no
longer sparse due to the copying; rough command sketch below)
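
Roughly, the commands were (just a sketch from memory; "otherserver" and
the /backup path are placeholders, and the mount options are omitted):

rsync -S /opt/drives/ssd/disk_208.img otherserver:/backup/   # 1. sparse-aware copy off the box
umount /opt/drives/ssd                                       # 2. re-create the filesystem
mkfs.btrfs -f /dev/md3
mount /dev/md3 /opt/drives/ssd
chattr +C /opt/drives/ssd                                    # 3. new files created here inherit No_COW
rsync otherserver:/backup/disk_208.img /opt/drives/ssd/      # 4. plain copy back, so no longer sparse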

This is the end result:

root@s4 /opt/drives/ssd # ls -alhs
total 316G
 16K drwxr-xr-x 1 libvirt-qemu libvirt-qemu   42 Dec 20 07:00 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
315G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 20 09:11 disk_208.img
   0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 20 06:53 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
316G    .

root@s4 /opt/drives/ssd # df -h
/dev/md3        411G  316G   94G  78% /opt/drives/ssd

root@s4 /opt/drives/ssd # btrfs filesystem df /opt/drives/ssd
Data, single: total=323.01GiB, used=315.08GiB
System, DUP: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=880.00KiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=16.00MiB, used=0.00

root@s4 /opt/drives/ssd # lsattr
---------------- ./snapshots
---------------C ./disk_208.img

As you can see, it looks much better now. The file takes only as much
space as it should, and the metadata is only 880KiB.

I will do some writes inside the VM and see if the file grows on the
outside. If everything is ok, it should not.
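
One way to watch it from the host (a sketch using the same paths as above):

watch -n 60 'ls -ls /opt/drives/ssd/disk_208.img; btrfs filesystem df /opt/drives/ssd'

If nodatacow behaves as expected, the allocated size reported by ls -ls
should stay at roughly the file size instead of creeping past it.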

2014-12-20 5:17 GMT+08:00 Josef Bacik jba...@fb.com:
 On 12/19/2014 04:10 PM, Josef Bacik wrote:

 On 12/18/2014 09:59 AM, Daniele Testa wrote:

 Hey,

 I am hoping you guys can shed some light on my issue. I know that it's
 a common question that people see differences in the disk used when
 running different calculations, but I still think that my issue is
 weird.

 root@s4 / # mount
 /dev/md3 on /opt/drives/ssd type btrfs
 (rw,noatime,compress=zlib,discard,nospace_cache)

 root@s4 / # btrfs filesystem df /opt/drives/ssd
 Data: total=407.97GB, used=404.08GB
 System, DUP: total=8.00MB, used=52.00KB
 System: total=4.00MB, used=0.00
 Metadata, DUP: total=1.25GB, used=672.21MB
 Metadata: total=8.00MB, used=0.00

 root@s4 /opt/drives/ssd # ls -alhs
 total 302G
 4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
    0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

 root@s4 /opt/drives/ssd # du -h
 0   ./snapshots
 302G    .

 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
 that partition, I have one single sparse file, taking 302GB of space
 (max 315GB). The snapshots directory is completely empty.

 However, for some weird reason, btrfs seems to think it takes 404GB.
 The big file is a disk that I use in a virtual server and when I write
 stuff inside that virtual server, the disk-usage of the btrfs
 partition on the host keeps increasing even if the sparse-file is
 constant at 302GB. I even have 100GB of free disk-space inside that
 virtual disk-file. Writing 1GB inside the virtual disk-file seems to
 increase the usage about 4-5GB on the outside.

 Does anyone have a clue on what is going on? How can the difference
 and behaviour be like this when I just have one single file? Is it
 also normal to have 672MB of metadata for a single file?


 Hello and welcome to the wonderful world of btrfs, where COW can really
 suck hard without being super clear why!  It's 4pm on a Friday right
 before I'm gone for 2 weeks, so I'm a bit happy and drunk, so I'm going
 to use pretty pictures.  You have this case to start with:

 file offset 0                                             offset 302g
 [--------------------prealloced 302g extent--------------------]

 (man it's impressive I got all that lined up right)

 On disk you have 2 things.  First, your file, which has a file extent
 item that says

 inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g

 and then the extent tree, which keeps track of actual allocated space,
 has this

 extent bytenr 123, len 302g, refs 1

 Now say you boot up your virt image and it writes one 4k block to offset
 0.  Now you have this

 [4k][302g-4k--]

 And for your inode you now have this

 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
 disklen 302g

 and in your extent tree you have

 extent bytenr 123, len 302g, refs 1
 extent bytenr whatever, len 4k, refs 1

 See that?  Your file is still the same size, it is still 302g.  If you
 cp'ed it right now it would copy 302g of information.  But what have you
 actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
 your virt thing decides to write to the middle, let's say at offset 12k,
 now you have this

 inode 256, file

Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Daniele Testa
No, I don't have any snapshots or subvolumes. Only that single file.

The file has both checksums and datacow on it. I will do chattr +C
on the parent dir and re-create the file to make sure all files are
marked as nodatacow.

Should I also turn off checksums with the mount-flags if this
filesystem only contains big VM-files? Or is it not needed if I put +C
on the parent dir?
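
For reference, these are the two routes to nodatacow that I know of (just
a sketch, untested on this box; as far as I understand, both also disable
data checksums for the affected files):

chattr +C /opt/drives/ssd                     # per-directory: new files created under it inherit No_COW
lsattr -d /opt/drives/ssd                     # verify the attribute is set on the dir

mount -o nodatacow /dev/md3 /opt/drives/ssd   # whole-filesystem mount option (implies nodatasum)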

2014-12-20 2:53 GMT+08:00 Phillip Susi ps...@ubuntu.com:

 On 12/18/2014 9:59 AM, Daniele Testa wrote:
 As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
 that partition, I have one single sparse file, taking 302GB of
 space (max 315GB). The snapshots directory is completely empty.

 So you don't have any snapshots or other subvolumes?

 However, for some weird reason, btrfs seems to think it takes
 404GB. The big file is a disk that I use in a virtual server and
 when I write stuff inside that virtual server, the disk-usage of
 the btrfs partition on the host keeps increasing even if the
 sparse-file is constant at 302GB. I even have 100GB of free
 disk-space inside that virtual disk-file. Writing 1GB inside the
 virtual disk-file seems to increase the usage about 4-5GB on the
 outside.

 Did you flag the file as nodatacow?

 Does anyone have a clue on what is going on? How can the
 difference and behaviour be like this when I just have one single
 file? Is it also normal to have 672MB of metadata for a single
 file?

 You probably have the data checksums enabled and that isn't
 unreasonable for checksums on 302g of data.




Re: btrfs is using 25% more disk than it should

2014-12-19 Thread Daniele Testa
But I read somewhere that compression should be turned off on mounts
that just store large VM-images. Is that wrong?

Btw, I am not pre-allocating space for the images. I use sparse files with:

dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G

It creates the file in a few ms.
Is it better to use fallocate with btrfs?
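
For comparison, a few ways to create the same instant 300G image (a
sketch; the file name is just an example):

dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G   # sparse: writes a single byte at the end
truncate -s 300G drive.img                            # sparse: writes nothing at all
fallocate -l 300G drive.img                           # preallocated: reserves the full 300G of
                                                      # extents up front, so it is not sparse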

If I use sparse files, there is a benefit when I want to copy/move the
image-file to another server. For example, if the 300GB sparse file just
has 10GB of data in it, I only need to copy 10GB when moving it to
another server. Would the same be true with fallocate?
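
(The saving on copy only holds if the copy tool preserves the holes; a
sketch, with made-up destinations:)

cp --sparse=always drive.img /mnt/backup/drive.img    # keeps the destination copy sparse
rsync -S drive.img otherserver:/opt/drives/ssd/       # same idea, writing a sparse copy remotely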

Anyway, would disabling CoW (by putting +C on the parent dir) prevent
the performance issues and the 2*filesize issue?

2014-12-20 13:52 GMT+08:00 Zygo Blaxell ce3g8...@umail.furryterror.org:
 On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
 And for your inode you now have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123,
 disklen 302g
 
 and in your extent tree you have
 
 extent bytenr 123, len 302g, refs 1
 extent bytenr whatever, len 4k, refs 1
 
 See that?  Your file is still the same size, it is still 302g.  If you
 cp'ed it right now it would copy 302g of information.  But what have you
 actually allocated on disk?  Well, that's now 302g + 4k.  Now let's say
 your virt thing decides to write to the middle, let's say at offset 12k,
 now you have this
 
 inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g),
 disklen 4k
 inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
 inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever,
 disklen 4k
 inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123,
 disklen 302g
 
 and in the extent tree you have this
 
 extent bytenr 123, len 302g, refs 2
 extent bytenr whatever, len 4k, refs 1
 extent bytenr notimportant, len 4k, refs 1
 
 See that refs 2 change?  We split the original extent, so we have 2 file
 extents pointing to the same physical extent, so we bumped the ref
 count.  This will happen over and over again until we have completely
 overwritten the original extent, at which point your space usage will go
 back down to ~302g.

 Wait, *what*?

 OK, I did a small experiment, and found that btrfs actually does do
 something like this.  Can't argue with facts, though it would be nice if
 btrfs could be smarter and drop unused portions of the original extent
 sooner.  :-P

 The above quoted scenario is a little oversimplified.  Chances are that
 302G file is made of much smaller extents (128M..256M).  If the VM is
 writing 4K randomly everywhere then those 128M+ extents are not going
 away any time soon.  Even the extents that are dropped stick around for
 a few btrfs transaction commits before they go away.

 I couldn't reproduce this behavior until I realized the extents I was
 overwriting in my tests were exactly the same size and position as the
 extents on disk.  I changed the offset slightly and found that
 partially-overwritten extents do in fact stick around in their entirety.
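
 For anyone who wants to poke at this, the test looked roughly like the
 following (a sketch with made-up paths and sizes, not my exact commands;
 it assumes compression and nodatacow are off for the test file):

 dd if=/dev/urandom of=/mnt/btrfs/test.img bs=1M count=1024   # 1G of incompressible data
 sync
 filefrag -v /mnt/btrfs/test.img        # a handful of large extents
 btrfs filesystem df /mnt/btrfs         # note Data used

 dd if=/dev/urandom of=/mnt/btrfs/test.img bs=4k count=1 seek=12345 conv=notrunc
 sync
 filefrag -v /mnt/btrfs/test.img        # the overwritten extent is now mapped in two pieces,
                                        # plus one new 4k extent somewhere else
 btrfs filesystem df /mnt/btrfs         # Data used grows by ~4k; the 4k that was overwritten
                                        # inside the old extent stays allocated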

 There seems to be an unexpected benefit for compression here:  compression
 keeps the extents small, so many small updates will be less likely to
 leave big mostly-unused extents lying around the filesystem.


btrfs is using 25% more disk than it should

2014-12-18 Thread Daniele Testa
Hey,

I am hoping you guys can shed some light on my issue. I know that it's
a common question that people see differences in the disk used when
running different calculations, but I still think that my issue is
weird.

root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)

root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00

root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root   42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
   0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots

root@s4 /opt/drives/ssd # du -h
0   ./snapshots
302G    .

As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
that partition, I have one single sparse file, taking 302GB of space
(max 315GB). The snapshots directory is completely empty.

However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of free disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage about 4-5GB on the outside.

Does anyone have a clue on what is going on? How can the difference
and behaviour be like this when I just have one single file? Is it
also normal to have 672MB of metadata for a single file?
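
In other words, the three numbers I am comparing (same paths as above):

du -h --apparent-size /opt/drives/ssd/disk_208.img   # logical size of the image (315G)
du -h /opt/drives/ssd/disk_208.img                   # blocks actually backing the file (302G)
btrfs filesystem df /opt/drives/ssd                  # what btrfs has allocated for data (404G)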

Regards,
Daniele


Extra info

2014-12-18 Thread Daniele Testa
Sorry, did not read the guidelines correctly. Here comes more info:

root@s4 /opt/drives/ssd # uname -a
Linux s4.podnix.com 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux

root@s4 /opt/drives/ssd # btrfs --version
Btrfs Btrfs v0.19

root@s4 /opt/drives/ssd # btrfs fi show
Label: none  uuid: 752ed11b-defc-4717-b4c9-a9e08ad64ba6
Total devices 1 FS bytes used 404.74GB
devid    1 size 410.50GB used 410.50GB path /dev/md3

Regards,
Daniele


Re: Extra info

2014-12-18 Thread Daniele Testa
I am running latest Debian stable. However, I used backports to update
the kernel to 3.16.

root@s4 /opt/drives/ssd # uname -a
Linux s4.podnix.com 3.16.0-0.bpo.4-amd64 #1 SMP Debian
3.16.7-ckt2-1~bpo70+1 (2014-12-08) x86_64 GNU/Linux

root@s4 /opt/drives/ssd # btrfs --version
Btrfs v3.14.1

It still reports over-use, so I am running a defrag on the file:

root@s4 /opt/drives/ssd # btrfs filesystem defragment /opt/drives/ssd/disk_208.img

But I see it slowly eats even more disk space during the defrag. I had
about 7GB free before. When it went down close to 1GB, I cancelled it,
as I'm afraid it will corrupt the file if it runs out of space.

Do you know how btrfs behaves if it runs out of space during a defrag?
Any other ideas on how I can solve it?
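
For reference, this is roughly how I was watching the space while the
defrag ran (just a sketch):

watch -n 30 'btrfs filesystem df /opt/drives/ssd; btrfs filesystem show /dev/md3'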

Regards,
Daniele


2014-12-18 23:35 GMT+08:00 Hugo Mills h...@carfax.org.uk:
 On Thu, Dec 18, 2014 at 11:02:34PM +0800, Daniele Testa wrote:
 Sorry, did not read the guidelines correctly. Here comes more info:

 root@s4 /opt/drives/ssd # uname -a
 Linux s4.podnix.com 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux

This is your problem. I think the difficulty is that writes into
 the middle of an extent didn't split the extent and allow the
 overwritten area to be reclaimed, so the whole extent still takes up
 space.

IIRC, Josef fixed this about 18 months ago. You should upgrade your
 kernel to something that isn't written in cuneiform (like 3.18, say),
 and defrag the file in question. I think that should fix the problem.

 root@s4 /opt/drives/ssd # btrfs --version
 Btrfs Btrfs v0.19

This is also an antique, and probably needs an upgrade too
 (although it's less critical than the kernel).

Hugo.

 root@s4 /opt/drives/ssd # btrfs fi show
 Label: none  uuid: 752ed11b-defc-4717-b4c9-a9e08ad64ba6
 Total devices 1 FS bytes used 404.74GB
 devid    1 size 410.50GB used 410.50GB path /dev/md3

 Regards,
 Daniele

 --
 Hugo Mills             | Python is executable pseudocode; perl is executable
 hugo@... carfax.org.uk | line-noise.
 http://carfax.org.uk/  |
 PGP: 65E74AC0          |                                          Ben Burton