Re: Ordering of directory operations maintained across system crashes in Btrfs?

2014-03-13 Thread Goswin von Brederlow
On Mon, Mar 03, 2014 at 11:56:49AM -0600, thanumalayan mad wrote:
 Chris,
 
 Great, thanks. Any guesses whether other filesystems (disk-based) do
 things similar to the last two examples you pointed out? Saying "we
 think 3 normal filesystems reorder stuff" seems to motivate
 application developers to fix bugs ...
 
 Also, just for more information, the sequence we observed was,
 
 Thread A:
 
 unlink(foo)
 rename(somefile X, somefile Y)
 fsync(somefile Z)
 
 The source and destination of the renamed file are unrelated to the
 fsync. But the rename happens in the fsync()'s transaction, while
 unlink() is delayed. I guess this has something to do with backrefs
 too.
 
 Thanks,
 Thanu
 
 On Mon, Mar 3, 2014 at 11:43 AM, Chris Mason c...@fb.com wrote:
  On 02/25/2014 09:01 PM, thanumalayan mad wrote:
 
  Hi all,
 
  Slightly complicated question.
 
  Assume I do two directory operations in a Btrfs partition (such as an
  unlink() and a rename()), one after the other, and a crash happens
  after the rename(). Can Btrfs (the current version) send the second
  operation to the disk first, so that after the crash, I observe the
  effects of rename() but not the effects of the unlink()?
 
  I think I am observing Btrfs re-ordering an unlink() and a rename(),
  and I just want to confirm that my observation is true. Also, if Btrfs
  does send directory operations to disk out of order, is there some
  limitation on this? Like, is this restricted to only unlink() and
  rename()?
 
  I am looking at some (buggy) applications that use Btrfs, and this
  behavior seems to affect them.
 
 
  There isn't a single answer for this one.
 
  You might have
 
  Thread A:
 
  unlink(foo);
  rename(somefile, somefile2);
  crash
 
  This should always have the unlink happen before or in the same transaction
  as the rename.
 
  Thread A:
 
  unlink(dirA/foo);
  rename(dirB/somefile, dirB/somefile2);
 
  Here you're at the mercy of what is happening in dirB.  If someone fsyncs
  that directory, it may hit the disk before the unlink.
 
  Thread A:
 
  unlink(foo);
  rename(somefile, somefile2);
  fsync(somefile);
 
  This one is even fuzzier.  Backrefs allow us to do some file fsyncs without
  touching the directory, making it possible the unlink will hit disk after
  the fsync.
 
  -chris

As I understand it, POSIX only guarantees that the in-core data is
updated by the syscalls in order. On a crash anything can happen. If the
application needs something to be committed to disk then it needs to
fsync(). Specifically, it needs to fsync() the changed files AND
directories.

From man fsync:

   Calling  fsync()  does  not  necessarily  ensure  that the entry in the
   directory containing the file has  also  reached  disk.   For  that  an
   explicit fsync() on a file descriptor for the directory is also needed.

So the fsync(somefile) above doesn't necessarily force the rename to
disk.


My experience with FUSE tells me that at least FUSE handles operations
in parallel and only blocks a later operation if it is affected by an
earlier one. An unlink in one directory can (and will) run in
parallel with a rename in another directory. Then, depending on how
threads get scheduled, the rename can complete before the unlink.

My conclusion is that you need to fsync() the directory to ensure the
metadata update has made it to disk, if you require that. Otherwise
you have to be able to cope with (meta)data loss on a crash.


Note: https://code.google.com/p/leveldb/issues/detail?id=189 talks a
lot about journaling and claims that any journaling filesystem should
preserve the order. I think that is rather pointless for two reasons:

1) The journal gets replayed after a crash, so whatever order the
two journal entries are written in doesn't matter. They both make it to
disk. You can't see one without the other. This assumes you
fsync()ed the dirs to force the metadata change into the journal in
the first place.

2) btrfs afaik doesn't have any journal, since COW already guarantees
atomic updates and crash protection.


Overall I also think the fear of fsync() is overrated for this issue.
This would only happen on program start or whenever you open a
database. Not something that happens every second.

MfG
Goswin
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about df and btrfs fi df

2014-02-18 Thread Goswin von Brederlow
On Mon, Feb 17, 2014 at 06:08:20PM +0100, David Sterba wrote:
 On Mon, Feb 10, 2014 at 01:41:23PM -0500, Josef Bacik wrote:
  
  
  On 02/10/2014 01:36 PM, cwillu wrote:
  IMO, used should definitely include metadata, especially given that we
  inline small files.
  
  I can convince myself both that this implies that we should roll it
  into b_avail, and that we should go the other way and only report the
  actual used number for metadata as well, so I might just plead
  insanity here.
  
  
  I could be convinced to do this.  So we have
  
  total: (total disk bytes) / (raid multiplier)
  used: (total used in data block groups) +
  (total used in metadata block groups)
  avail: total - (total used in data block groups +
  total size of metadata block groups)
 
 The size of global block reserve should be IMO subtracted from 'avail',
 this reports the space as free, but is in fact not.

How much global block reserve is there? Does that explain why I can't
use the last 270G of my 19TB btrfs?
 
 The used amount of the global reserve might be included into
 filesystem 'used', but I've observed the global reserve used for short
 periods of time under some heavy stress, I'm convinced it needs to be
 accounted in the df report.

As a comparison, the ext2/3/4 filesystems have a % reserved for root and
do not show this in available. So you can get a filesystem with 0 bytes
free that root can still write to.

I would argue that available should not include the reserve. It is not
available for normal operations, right?

MfG
Goswin


Re: know mount location with in FS

2014-02-18 Thread Goswin von Brederlow
On Tue, Feb 18, 2014 at 11:06:38AM +0800, Anand Jain wrote:
 
 For what reason?
 
 Remember that a single block device can be mounted in multiple places
  (or bind-mounted, etc), so there is not even necessarily a single
  answer to that question.
 
 -Eric
 
  Yes indeed. (the attempt is should we be able to maintain all
  the mount points as a list saved/updated under per fs_devices. ?)
 
  some of the exported symbols at fs/namei.c looks closely
  related to the purpose here, but it didn't help unless
  I missed something.
 
  any comment is helpful..
 
  The reason:
 First of all, btrfs-progs has used scan-all-disks very
 liberally, which isn't a scalable design (imagine a data
 center with 1000s of LUNs).
 Even a simple check_mounted() does scan-all-disks (when
 total_disk > 1); that isn't necessary if the kernel could
 let it know.
 Scanning for btrfs has the expensive step of reading each super-block,
 and the effect is that, in general, most btrfs-progs commands
 are very slow when something like a scrub is running.
 check_mounted() fails when seeding is used, since
 /proc/self/mounts would show the disk with the lowest devid, and in
 the most common scenario that will be a seed disk (which has a
 different FSID from the actual disk in question).
 Further, the most severe problem is that some btrfs-progs threads
 do scan-all-disks more than once during the thread's lifetime.
 So a total revamp of this design has become an immediate need.
 
 What I am planning is
- btrfs-progs to init btrfs-disk-list once per required thread
  (mostly use BTRFS_IOC_GET_DEVS, which would dump anything
  and everything about the btrfs devices)
- the btrfs-disk-list is obtained from the kernel first, and will
  be filled in with the remaining disks the kernel isn't aware of.
- If the step one also provides the mount point(s) from the
  kernel that would complete the loop with what end user
  would want to know.
 
 
 Thanks, Anand

What about mount points outside the current filesystem namespace, or
ones that should be shortened to the filesystem namespace (e.g. in a
chroot the leading dirs need to be cut)?

MfG
Goswin


Re: ENOSPC with 270GiB free

2014-02-18 Thread Goswin von Brederlow
On Tue, Feb 18, 2014 at 08:58:10AM -0500, Josef Bacik wrote:
 
 On 02/16/2014 08:58 AM, Goswin von Brederlow wrote:
  Hi,
  
  I'm getting an ENOSPC error from btrfs despite there still being
  plenty of space left:
  
   % df -m /mnt/nas3
   Filesystem 1M-blocks Used Available Use% Mounted on
   /dev/mapper/nas3-a  19077220 18805132    270773  99% /mnt/nas3

   % btrfs fi show
   Label: none  uuid: 4b18f84e-2499-41ca-81ff-fe1783c11491
   Total devices 1 FS bytes used 17.91TiB
   devid1 size 18.19TiB used 17.94TiB path /dev/mapper/nas3-a

   Btrfs v3.12

   % btrfs fi df
   Data, single: total=17.89TiB, used=17.88TiB
   System, DUP: total=32.00MiB, used=1.92MiB
   Metadata, DUP: total=25.50GiB, used=24.89GiB
  
  As you can see there are still 270GiB free and plenty of block
  groups free on the device too.
  
  So why isn't btrfs allocating a new block group to store more
  data?
  
  
 
 What kernel?  Can you give btrfs-next a try?  Mount with -o
 enospc_debug and when you get enospc send the dmesg.  Thanks,
 
 Josef

Standard Debian linux kernel:

Linux nas3 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux

Compiling a new kernel and rebooting will take some time. But it looks
like I can remount with enospc_debug. I will first try whether that
outputs anything useful. So far I got:

[258988.006643] btrfs: disk space caching is enabled

MfG
Goswin


Re: ENOSPC with 270GiB free

2014-02-17 Thread Goswin von Brederlow
On Mon, Feb 17, 2014 at 08:42:23AM +0100, Dan van der Ster wrote:
 Did you already try this?? [1]:
 
btrfs fi balance start -dusage=5 /mnt/nas3
 
 Cheers, dan
 
 [1] 
 https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
 
 On Sun, Feb 16, 2014 at 2:58 PM, Goswin von Brederlow goswin-...@web.de 
 wrote:
  Hi,
 
  I'm getting an ENOSPC error from btrfs despite there still being plenty
  of space left:
 
  % df -m /mnt/nas3
  Filesystem 1M-blocks Used Available Use% Mounted on
  /dev/mapper/nas3-a  19077220 18805132    270773  99% /mnt/nas3
 
  % btrfs fi show
  Label: none  uuid: 4b18f84e-2499-41ca-81ff-fe1783c11491
  Total devices 1 FS bytes used 17.91TiB
  devid1 size 18.19TiB used 17.94TiB path /dev/mapper/nas3-a
 
  Btrfs v3.12
 
  % btrfs fi df
  Data, single: total=17.89TiB, used=17.88TiB
  System, DUP: total=32.00MiB, used=1.92MiB
  Metadata, DUP: total=25.50GiB, used=24.89GiB
 
  As you can see there are still 270GiB free and plenty of block groups
  free on the device too.
 
  So why isn't btrfs allocating a new block group to store more data?
 
  MfG
  Goswin

I did and that isn't the problem. Balancing only frees up partially
used block groups so they can be reused. But the problem is that the
remaining free block groups are not getting used.

MfG
Goswin


Re: btrfsck does not fix

2014-02-17 Thread Goswin von Brederlow
On Mon, Feb 17, 2014 at 03:20:58AM +, Duncan wrote:
 Chris Murphy posted on Sun, 16 Feb 2014 12:54:44 -0700 as excerpted:
  Also, 10 hours to balance two disks at 2.3TB seems like a long time. I'm
  not sure if that's expected.
 
 FWIW, I think you may not realize how big 2.3 TiB is, and/or how slow 
 spinning rust can be when dealing with TiBs of potentially fragmented 
 data...
 
 2.3TiB * 1024GiB/TiB * 1024 MiB/GiB / 10 hours / 60 min/hr / 60 sec/min =
 
 66.99... real close to 67 MiB/sec
 
 Since it's multiple TiB we're talking and only two devices, that's almost 
 certainly spinning rust, not SSD, and on spinning rust, 67 MiB/sec really 
 isn't /that/ bad, especially if the filesystem wasn't new and had been 
 reasonably used, thus likely had some fragmentation to deal with.

Don't forget that that is 67MiB/s reading data and 67MiB/s writing
data, giving a total of 134MiB/s.

Still, on a good system each disk should manage about that speed, so it's
about 50% of the theoretical maximum. Which is quite good given that the
disks will need to seek between every read and write. In comparison,
moving data with LVM gets only about half that speed, and that doesn't
even have the overhead of a filesystem to deal with.

MfG
Goswin


ENOSPC with 270GiB free

2014-02-16 Thread Goswin von Brederlow
Hi,

I'm getting an ENOSPC error from btrfs despite there still being plenty
of space left:

% df -m /mnt/nas3
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/nas3-a  19077220 18805132    270773  99% /mnt/nas3

% btrfs fi show
Label: none  uuid: 4b18f84e-2499-41ca-81ff-fe1783c11491
Total devices 1 FS bytes used 17.91TiB
devid1 size 18.19TiB used 17.94TiB path /dev/mapper/nas3-a

Btrfs v3.12

% btrfs fi df
Data, single: total=17.89TiB, used=17.88TiB
System, DUP: total=32.00MiB, used=1.92MiB
Metadata, DUP: total=25.50GiB, used=24.89GiB

As you can see there are still 270GiB free and plenty of block groups
free on the device too.

So why isn't btrfs allocating a new block group to store more data?

MfG
Goswin


Re: ENOSPC with 270GiB free

2014-02-16 Thread Goswin von Brederlow
On Sun, Feb 16, 2014 at 06:18:38PM +, Duncan wrote:
 Goswin von Brederlow posted on Sun, 16 Feb 2014 14:58:08 +0100 as
 excerpted:
 
  As you can see there are still 270GiB free and plenty of block groups
  free on the device too.
  
  So why isn't btrfs allocating a new block group to store more data?
 
 I saw this on a much (much) smaller filesystem a few weeks ago, when I 
 redid my /boot.  In my case it was under a gig total, so mixed-mode, but 
 copying files over in a particular order errored some of them out with 
 ENOSPC.  But the way I was copying (using mc) left the ones that hadn't 
 copied selected, and I tried a copy of them again, and/or used mc's 
 directory-diff to find the missing files and copy them over again.  After 
 about three times, they all copied.
 
 So some combination of size and metadata wasn't triggering a new block 
 allocation, but coming in a different order, it triggered fine.  Again, 
 this was mixed-mode, so data/metadata blocks mixed, and it didn't matter 
 which ran out first since they were combined.
 
 I wonder if you're running into something similar.  Can you try doing the 
 copy in a different order, or is it one big file?
 

I'm using rsync, and towards the last few GB before it gives ENOSPC the
filesystem gets really slow and eats more and more CPU time. I'm
copying multi-gigabyte files; the second-to-last file managed 2MB/s
and the failed one came down to ~100K/s at the end. So trying a lot of
different files isn't really feasible, time-wise.

MfG
Goswin