Re: [PATCH] btrfs-progs: Documentation: Add filter section for btrfs-balance.

2014-06-04 Thread Duncan
Qu Wenruo posted on Tue, 03 Jun 2014 14:20:08 +0800 as excerpted:

 The man page for 'btrfs-balance' mentions filters but does not explain
 them, which makes it hard for end users to use the '-d', '-m' or '-s'
 options.
 
 This patch will use the explanations from
 https://btrfs.wiki.kernel.org/index.php/Balance_Filters to enrich the
 man page.
 
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com

Thanks.  The wiki-only nature of the balance-filters documentation has 
been a bit of a complication.  This should fix that. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] btrfs-progs: fix compiler warning

2014-06-04 Thread Qu Wenruo


 Original Message 
Subject: [PATCH 1/1] btrfs-progs: fix compiler warning
From: Christian Hesse m...@eworm.de
To: linux-btrfs@vger.kernel.org
Date: 2014年06月03日 19:29

gcc 4.9.0 gives a warning: array subscript is above array bounds

Checking for greater or equal instead of just equal fixes this.

Signed-off-by: Christian Hesse m...@eworm.de
---
  cmds-restore.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 96b97e1..534a49e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -169,7 +169,7 @@ again:
break;
}
  
-	if (level == BTRFS_MAX_LEVEL)
+	if (level >= BTRFS_MAX_LEVEL)
 		return 1;
 
 	slot = path->slots[level] + 1;

Also, I failed to reproduce the bug,
using gcc-4.9.0-3 from the Arch Linux core repo.

It seems to be related to the default gcc flags from the distribution?

Thanks,
Qu


Re: [PATCH 1/1] btrfs-progs: fix compiler warning

2014-06-04 Thread Christian Hesse
Qu Wenruo quwen...@cn.fujitsu.com on Wed, 2014/06/04 14:48:
 
  Original Message 
 Subject: [PATCH 1/1] btrfs-progs: fix compiler warning
 From: Christian Hesse m...@eworm.de
 To: linux-btrfs@vger.kernel.org
 Date: 2014年06月03日 19:29
  gcc 4.9.0 gives a warning: array subscript is above array bounds
 
  Checking for greater or equal instead of just equal fixes this.
 
  Signed-off-by: Christian Hesse m...@eworm.de
  ---
cmds-restore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
 
  diff --git a/cmds-restore.c b/cmds-restore.c
  index 96b97e1..534a49e 100644
  --- a/cmds-restore.c
  +++ b/cmds-restore.c
  @@ -169,7 +169,7 @@ again:
  break;
  }

  -   if (level == BTRFS_MAX_LEVEL)
  +   if (level >= BTRFS_MAX_LEVEL)
  return 1;

  slot = path->slots[level] + 1;

 Also, I failed to reproduce the bug.
 Using gcc-4.9.0-3 from the Arch Linux core repo.

Exactly the same here. ;)

 It seems to be related to default gcc flags from distribution?

Probably. I did compile with optimization, so adding -O2 may do the trick:

make CFLAGS="${CFLAGS} -O2" all
-- 
Schoene Gruesse
Chris
 O ascii ribbon campaign
   stop html mail - www.asciiribbon.org




Re: [PATCH v4] Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled

2014-06-04 Thread Liu Bo
On Sun, Jun 01, 2014 at 01:50:28AM +0100, Filipe David Borba Manana wrote:
 If the NO_HOLES feature is enabled holes don't have file extent items in
 the btree that represent them anymore. This made the clone operation
 ignore the gaps that exist between consecutive file extent items and
 therefore not create the holes at the destination. When not using the
 NO_HOLES feature, the holes were created at the destination.
 
 A test case for xfstests follows.

Reviewed-by: Liu Bo bo.li@oracle.com

-liubo

 
 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
 
 V2: Deal with holes at the boundaries of the cloning range and that
 either overlap the boundary completely or partially.
 Test case for xfstests updated too to test these 2 cases.
 
 V3: Deal with the case where the cloning range overlaps (partially or
 completely) a hole at the end of the source file, and might increase
 the size of the target file.
 Updated the test for xfstests to cover these cases too.
 
 V4: Moved some duplicated code into an helper function.
 
  fs/btrfs/ioctl.c | 108 
 ++-
  1 file changed, 83 insertions(+), 25 deletions(-)
 
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 04ece8f..95194a9 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -2983,6 +2983,37 @@ out:
   return ret;
  }
  
 +static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 +  struct inode *inode,
 +  u64 endoff,
 +  const u64 destoff,
 +  const u64 olen)
 +{
 + struct btrfs_root *root = BTRFS_I(inode)->root;
 + int ret;
 +
 + inode_inc_iversion(inode);
 + inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 + /*
 +  * We round up to the block size at eof when determining which
 +  * extents to clone above, but shouldn't round up the file size.
 +  */
 + if (endoff > destoff + olen)
 + 	endoff = destoff + olen;
 + if (endoff > inode->i_size)
 + btrfs_i_size_write(inode, endoff);
 +
 + ret = btrfs_update_inode(trans, root, inode);
 + if (ret) {
 + btrfs_abort_transaction(trans, root, ret);
 + btrfs_end_transaction(trans, root);
 + goto out;
 + }
 + ret = btrfs_end_transaction(trans, root);
 +out:
 + return ret;
 +}
 +
  /**
   * btrfs_clone() - clone a range from inode file to another
   *
 @@ -2995,7 +3026,8 @@ out:
   * @destoff: Offset within @inode to start clone
   */
  static int btrfs_clone(struct inode *src, struct inode *inode,
 -u64 off, u64 olen, u64 olen_aligned, u64 destoff)
 +const u64 off, const u64 olen, const u64 olen_aligned,
 +const u64 destoff)
  {
  struct btrfs_root *root = BTRFS_I(inode)->root;
   struct btrfs_path *path = NULL;
 @@ -3007,8 +3039,9 @@ static int btrfs_clone(struct inode *src, struct inode 
 *inode,
   int slot;
   int ret;
   int no_quota;
 - u64 len = olen_aligned;
 + const u64 len = olen_aligned;
   u64 last_disko = 0;
 + u64 last_dest_end = destoff;
  
   ret = -ENOMEM;
   buf = vmalloc(btrfs_level_size(root, 0));
 @@ -3076,7 +3109,7 @@ process_slot:
   u64 disko = 0, diskl = 0;
   u64 datao = 0, datal = 0;
   u8 comp;
 - u64 endoff;
 + u64 drop_start;
  
   extent = btrfs_item_ptr(leaf, slot,
   struct btrfs_file_extent_item);
 @@ -3125,6 +3158,18 @@ process_slot:
   new_key.offset = destoff;
  
   /*
 +  * Deal with a hole that doesn't have an extent item
 +  * that represents it (NO_HOLES feature enabled).
 +  * This hole is either in the middle of the cloning
 +  * range or at the beginning (fully overlaps it or
 +  * partially overlaps it).
 +  */
 + if (new_key.offset != last_dest_end)
 + drop_start = last_dest_end;
 + else
 + drop_start = new_key.offset;
 +
 + /*
* 1 - adjusting old extent (we may have to split it)
* 1 - add new extent
* 1 - inode update
 @@ -3153,7 +3198,7 @@ process_slot:
   }
  
   ret = btrfs_drop_extents(trans, root, inode,
 -  new_key.offset,
 +  drop_start,
new_key.offset + datal,

Race condition between btrfs and udev

2014-06-04 Thread Wang Shilong

Originally this problem was reproduced by the following scripts:

# dd if=/dev/zero of=data bs=1M count=50
# losetup /dev/loop1 data
# i=1
# while [ 1 ]
   do
   mkfs.btrfs -fK /dev/loop1 > /dev/null || exit 1
   i=$((i+1))
   echo "loop $i"
   done

Further, an easy way to trigger this problem is by running the following C
code repeatedly:


#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_EXCL);

	if (fd < 0) {
		perror("fail to open");
		exit(1);
	}
	close(fd);
	return 0;
}

Here, @argv[1] needs to be a btrfs block device.

So the problem is that an RW open triggers a udev event, which calls
btrfs_scan_one_device(). In btrfs_scan_one_device(), the block device is
opened with the EXCL flag... meanwhile, if another program tries to open
that device with O_EXCL, it fails with EBUSY.

I don't know whether this is a serious problem. Right now there are two
places in btrfs-progs that try to open a device with O_EXCL:

1. in utils.c: test_dev_for_mkfs()
2. in disk-io.c: __open_ctree_fd()

Any ideas on this? Maybe we can remove the @EXCL flag from btrfs-progs?

Thanks,
Wang



Re: All free space eaten during defragmenting (3.14)

2014-06-04 Thread Duncan
Peter Chant posted on Tue, 03 Jun 2014 23:21:55 +0100 as excerpted:

 On 06/03/2014 05:46 AM, Duncan wrote:
 
 Of course if you were using something like find and executing defrag on
 each found entry, then yes it would recurse, as find would recurse
 across filesystems and keep going (unless you told it not to using
 find's -xdev option).
 
 I did not know the recursive option existed.  However, I'd previously
 cursed the tools not having a recursive option or being recursive by
 default.  If there is now a recursive option it would be really perverse
 to use find to implement a recursive defrag.

Defrag's -r/recursive option is reasonably new, but checking the btrfs-
progs git tree (since I run the git version) says that was commit 
c2c5353b, which git describe says was v0.19-725, so it should be in btrfs-
progs v3.12.  So it's not /that/ new.  Anyone still running something 
earlier than that really should update. =:^)

But the wiki recommended using find from back before the builtin 
recursive option, and I can well imagine people with already working 
scripts not wanting to fix what isn't (for them) broken. =:^)  So I 
imagine there will be find-and-defrag users for some time, tho they 
should even now be on their way to becoming a rather small percentage,
at least for folks following the keep-current recommendations.

Meanwhile, this question is bugging me so let me just ask it.  The OP was 
from a different email address (szotsaki@gmail), and once I noticed that 
I've been assuming that you and the OP are different people, tho in my 
first reply to you I assumed you were the OP.  So just to clear things 
up, different people and I can't assume that what he wrote about his case 
applies to you, correct? =:^)

 Meanwhile, you mention the autodefrag mount option.  Assuming you have
 it on all the time, there shouldn't be that much to defrag, *EXCEPT* if
 the -c/compress option is used as well.  If you aren't also using the
 compress mount option by default, then you are effectively telling
 defrag to compress everything as it goes, so it will
 defrag-and-compress all files.  Which wouldn't be a problem with
 snapshot-aware defrag as it'd compress for all snapshots at the same
 time too.  But with snapshot-aware defrag currently disabled, that
 would effectively force ALL files to be rewritten in order to compress
 them, thereby breaking the COW link with the other snapshots and
 duplicating ALL data.
 
 I've got compress=lzo, options from fstab:
 device=/dev/sdb,device=/dev/sdc,autodefrag,defaults,inode_cache,noatime,
 compress=lzo
 
 I'm running kernel 3.13.6.  Not sure if snapshot-aware-defrag is enabled
 or disabled in this version.

A git search says (linus' mainline tree) commit 8101c8db, merge commit 
878a876b, with git describe labeling the merge commit as v3.14-rc1-13, so 
it would be in v3.14-rc2.  However, the commit in question was CCed to 
stable@, so it should have made it into a 3.13.x stable release as well.  
Whether it's in 3.13.6 specifically, I couldn't say without checking the 
stable tree or changelog, which should be easier for you to do since 
you're actually running it.  (Hint, I simply searched on defrag, here; 
it ended up being the third hit back from 3.14.0, I believe, so it 
shouldn't be horribly buried, at least.)

 Unfortunately I really don't understand how COW works here.
 I understand the basic idea but have no idea how it is implemented
 in btrfs or any other fs.

FWIW, I think only the kernel/filesystem or at least developer types 
/really/ understand COW, but I /think/ I have a reasonable sysadmin's-
level understanding of the practical effects in terms of btrfs, simply 
from watching the list.

Meanwhile, not that it has any bearing on this thread, but about your 
mount options, FWIW you may wish to remove that inode_cache option.  I 
don't claim to have a full understanding, but from what I've picked up 
from various dev remarks, it's not necessary at all on 64-bit systems 
(well, unless you have really small files filling an exabyte size 
filesystem!) since the inode-space is large enough finding free inode 
numbers isn't an issue, and while it can be of help in specific 
situations on 32-bit systems, there's two problems with it that make it 
not suitable for the general case: (1) on large filesystems (I'm not sure 
how large but I'd guess it's TiB scale) there's danger of inode-number-
collision due to 32-bit-overflow, and (2) it must be regenerated at every 
mount, which at least on TiB-scale spinning rust can trigger several 
minutes of intense drive activity while it does so.  (The btrfs wiki now 
says it's not recommended, but has a somewhat different explanation.  
While I'm not a coder and thus in no position to say for sure based on 
the code, I believe the wiki's explanation isn't quite correct, but 
either way, it's still not recommended.)

The use-cases where inode_cache might be worthwhile are thus all 32-bit, 
and include things like busy email 

Re: What to do about snapshot-aware defrag

2014-06-04 Thread Erkki Seppala
Martin m_bt...@ml1.co.uk writes:

 The *ONLY* application that I know of that uses atime is Mutt and then
 *only* for mbox files!...

However, users, such as myself :), can be interested in when a certain
file has been last accessed. With snapshots I can even get an idea of
all the times the file has been accessed.

 *And go KISS and move on faster* better?

Well, it is uncertain to me whether it is truly better that btrfs would,
after that point, no longer truly support atime, if using it results in
blowing up snapshot sizes. People might at that point even consider just
using LVM2 snapshots (shudder) ;).

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/



Partition tables / Output of parted

2014-06-04 Thread Stefan Malte Schumacher
Hello

I have created multiple filesystems with btrfs, in all cases directly
on the devices themselves without creating partitions beforehand.  Now,
if I open the disks containing the multi-device filesystem in parted,
it reports the partition table as loop and shows one partition with
btrfs which covers the whole disk. Opening the disk with the
single-device filesystem in parted shows a partition table of type
unknown and no partition on the disk itself.

I am unsure how to interpret this output. Two possible explanations
come to mind: a) Btrfs does create partitions, but only if a filesystem spans
multiple devices or b) the output of parted is faulty and no actual
partition is created in both cases. 

Could someone elaborate on this question or just point out where I can
find documentation on this issue?

Yours sincerely
Stefan 


[PATCH 10/12] trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/

2014-06-04 Thread Antonio Ospite
Signed-off-by: Antonio Ospite a...@ao2.it
Cc: Chris Mason c...@fb.com
Cc: Josef Bacik jba...@fb.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2f6d7b1..b0a206f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3051,11 +3051,11 @@ process_slot:
 * | - extent - |
 */
 
-   /* substract range b */
+   /* subtract range b */
	if (key.offset + datal > off + len)
datal = off + len - key.offset;
 
-   /* substract range a */
+   /* subtract range a */
	if (off > key.offset) {
datao += off - key.offset;
datal -= off - key.offset;
-- 
2.0.0



Re: What to do about snapshot-aware defrag

2014-06-04 Thread Martin
On 04/06/14 10:19, Erkki Seppala wrote:
 Martin m_bt...@ml1.co.uk writes:
 
 The *ONLY* application that I know of that uses atime is Mutt and then
 *only* for mbox files!...
 
 However, users, such as myself :), can be interested in when a certain
 file has been last accessed. With snapshots I can even get an idea of
 all the times the file has been accessed.
 
 *And go KISS and move on faster* better?
 
  Well, it is uncertain to me whether it is truly better that btrfs would,
  after that point, no longer truly support atime, if using it results in
  blowing up snapshot sizes. People might at that point even consider just
  using LVM2 snapshots (shudder) ;).

Not quite... My emphasis is:


1:

Go KISS for the defrag and accept that any atime use will render the
defrag ineffective. Give a note that the noatime mount option should be
used.


2:

Consider using noatime as a /default/ being as there are no known
'must-use' use cases. Those users still wanting atime can add that as a
mount option with the note that atime use reduces the snapshot defrag
effectiveness.


(The for/against atime is a good subject for another thread!)
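For reference, option 2 amounts to an fstab entry along these lines (the UUID and mount point are placeholders):

```
# /etc/fstab -- noatime disables access-time updates entirely.
# relatime (the kernel default since 2.6.30) still updates atime up to
# once a day per file, which is enough to dirty metadata in snapshots.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  noatime,compress=lzo  0  0
```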


Go fast KISS!

Regards,
Martin





Re: Partition tables / Output of parted

2014-06-04 Thread Russell Coker
On Wed, 4 Jun 2014 13:19:16 Stefan Malte Schumacher wrote:
 I have created multiple filesystems with btrfs, in all cases directly
 on the devices themself without creating partitions beforehand.

I do that sometimes, it works well.  I've done the same thing with Ext2/3 in 
the past as well, it's no big deal.

 Now,
 if I open the disks containing the multi-device filesystem in parted
 it outputs the partion table as loop and shows one partition with
 btrfs which covers the whole disk.

http://lists.alioth.debian.org/pipermail/parted-devel/2009-May/002840.html

A Google search on "Partition Table: loop" turned up the above explanation
as the third result.

 I am unsure how to interpret this output. Two possible explanations
 come to mind: a) Btrfs does create partitions, but only if a filesystem
 spans multiple devices or b) the output of parted is faulty and no actual
 partition is created in both cases.

BTRFS doesn't create partitions.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/



Re: [PATCH 0/8] Add support for LZ4 compression

2014-06-04 Thread Chris Mason
On 06/03/2014 11:53 AM, David Sterba wrote:
 On Sat, May 31, 2014 at 11:48:28PM +, Philip Worrall wrote:
 LZ4 is a lossless data compression algorithm that is focused on 
 compression and decompression speed. LZ4 gives a slightly worse
 compression ratio compared with LZO (and much worse than Zlib)
 but compression speeds are *generally* similar to LZO. 
 Decompression tends to be much faster under LZ4 compared 
 with LZO hence it makes more sense to use LZ4 compression
 when your workload involves a higher proportion of reads.

 The following patch set adds LZ4 compression support to BTRFS
 using the existing kernel implementation. It is based on the 
 changeset for LZO support in 2011. Once a filesystem has been 
 mounted with LZ4 compression enabled older versions of BTRFS 
 will be unable to read it. This implementation is however 
 backwards compatible with filesystems that currently use 
 LZO or Zlib compression. Existing data will remain unchanged 
 but any new files that you create will be compressed with LZ4.
 
 tl;dr simply copying what btrfs+LZO does will not buy us anything in
 terms of speedup or space savings.

I have a slightly different reason for holding off on these.  Disk
format changes are forever, and we need a really strong use case for
pulling them in.

With that said, thanks for spending all of the time on this.  Pulling in
Dave's idea to stream larger compression blocks through lzo (or any new
alg) might be enough to push performance much higher, and better show
case the differences between new algorithms.

The whole reason I chose zlib originally was because its streaming
interface was a better fit for how FS IO worked.

-chris


Re: Partition tables / Output of parted

2014-06-04 Thread Mike Fleetwood
On 4 June 2014 14:30, Russell Coker russ...@coker.com.au wrote:
 On Wed, 4 Jun 2014 13:19:16 Stefan Malte Schumacher wrote:
 I have created multiple filesystems with btrfs, in all cases directly
 on the devices themself without creating partitions beforehand.

 I do that sometimes, it works well.  I've done the same thing with Ext2/3 in
 the past as well, it's no big deal.

 Now,
 if I open the disks containing the multi-device filesystem in parted
 it outputs the partion table as loop and shows one partition with
 btrfs which covers the whole disk.

 http://lists.alioth.debian.org/pipermail/parted-devel/2009-May/002840.html

 A Google search on Partition Table: loop turned up the above explanation as
 the third result.

 I am unsure how to interpret this output. Two possible explanations
 come to mind: a) Btrfs does create partitions, but only if a filesystem
 spans multiple devices or b) the output of parted is faulty and no actual
 partition is created in both cases.

 BTRFS doesn't create partitions.

c) Parted (libparted) is merely displaying a pretend loop partition
table as a way to represent the situation of a file system covering
the whole disk, in its view of the world where all disks have a
partition table.

Mike


Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread Arnd Bergmann
On Tuesday 03 June 2014, Dave Chinner wrote:
 On Tue, Jun 03, 2014 at 04:22:19PM +0200, Arnd Bergmann wrote:
  On Monday 02 June 2014 14:57:26 H. Peter Anvin wrote:
   On 06/02/2014 12:55 PM, Arnd Bergmann wrote:
  The possible uses I can see for non-ktime_t types in the kernel are:
  * inodes need 96 bit timestamps to represent the full range of values
that can be stored in a file system, you made a convincing argument
for that. Almost everything else can fit into 64 bit on a 32-bit
kernel, in theory also on a 64-bit kernel if we want that.
 
 Just to be pedantic, inodes don't need 96 bit timestamps - some
 filesystems can *support up to* 96 bit timestamps. If the kernel
 only supports 64 bit timestamps and that's all the kernel can
 represent, then the upper bits of the 96 bit on-disk inode
 timestamps simply remain zero.

I meant the reverse: since we have file systems that can store
96-bit timestamps when using 64-bit kernels, we need to extend
32-bit kernels to have the same internal representation so we
can actually read those file systems correctly.

 If you move the filesystem between kernels with different time
 ranges, then the filesystem needs to be able to tell the kernel what
 it's supported range is.  This is where having the VFS limit the
 range of supported timestamps is important: the limit is the
 min(kernel range, filesystem range). This allows the filesystems
 to be independent of the kernel time representation, and the kernel
 to be independent of the physical filesystem time encoding

I agree it makes sense to let the kernel know about the limits
of the file system it accesses, but for the reverse, we're probably
better off just making the kernel representation large enough (i.e.
96 bits) so it can work with any known file system. We need another
check at the user space boundary to turn that into a value that the
user can understand, but that's another problem.

Arnd


Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread Arnd Bergmann
On Monday 02 June 2014, Joseph S. Myers wrote:
 On Mon, 2 Jun 2014, Arnd Bergmann wrote:
 
  Ok. Sorry about missing linux-api, I confused it with linux-arch, which
  may not be as relevant here, except for the one question whether we
  actually want to have the new ABI on all 32-bit architectures or only
  as an opt-in for those that expect to stay around for another 24 years.
 
 For glibc I think it will make the most sense to add the support for 
 64-bit time_t across all architectures that currently have 32-bit time_t 
 (with the new interfaces having fallback support to implementation in 
 terms of the 32-bit kernel interfaces, if the 64-bit syscalls are 
 unavailable either at runtime or in the kernel headers against which glibc 
 is compiled - this fallback code will of course need to check for overflow 
 when passing a time value to the kernel, hopefully with error handling 
 consistent with whatever the kernel ends up doing when a filesystem can't 
 support a timestamp).  If some architectures don't provide the new 
 interfaces in the kernel then that will mean the fallback code in glibc 
 can't be removed until glibc support for those architectures is removed 
 (as opposed to removing it when glibc no longer supports kernels predating 
 the kernel support).

Ok, that's a good reason to just provide the new interfaces on all
architectures right away. Thanks for the insight!

Arnd


Re: [PATCH 1/1] btrfs-progs: fix compiler warning

2014-06-04 Thread David Sterba
On Wed, Jun 04, 2014 at 09:19:26AM +0200, Christian Hesse wrote:
  It seems to be related to default gcc flags from distribution?
 
 Probably. I did compile with optimization, so adding -O2 may do the trick:
 
 make CFLAGS=${CFLAGS} -O2 all

The warning appears with -O2, so the question is if gcc is not able to
reason about the values (ie. a false positive) or if there's a bug that
I don't see.


Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread Nicolas Pitre
On Wed, 4 Jun 2014, Arnd Bergmann wrote:

 On Tuesday 03 June 2014, Dave Chinner wrote:
  Just to be pedantic, inodes don't need 96 bit timestamps - some
  filesystems can *support up to* 96 bit timestamps. If the kernel
  only supports 64 bit timestamps and that's all the kernel can
  represent, then the upper bits of the 96 bit on-disk inode
  timestamps simply remain zero.
 
 I meant the reverse: since we have file systems that can store
 96-bit timestamps when using 64-bit kernels, we need to extend
 32-bit kernels to have the same internal representation so we
 can actually read those file systems correctly.
 
  If you move the filesystem between kernels with different time
  ranges, then the filesystem needs to be able to tell the kernel what
  it's supported range is.  This is where having the VFS limit the
  range of supported timestamps is important: the limit is the
  min(kernel range, filesystem range). This allows the filesystems
  to be independent of the kernel time representation, and the kernel
  to be independent of the physical filesystem time encoding
 
 I agree it makes sense to let the kernel know about the limits
 of the file system it accesses, but for the reverse, we're probably
 better off just making the kernel representation large enough (i.e.
 96 bits) so it can work with any known file system.

Depends...  96 bit handling may get prohibitive on 32-bit archs.

The important point here is for the kernel to be able to represent the 
time _range_ used by any known filesystem, not necessarily the time 
_precision_.

For example, a 64 bit representation can be made of 40 bits for seconds 
spanning 34865 years, and 24 bits for fractional seconds providing 
precision down to 60 nanosecs.  That ought to be plenty good on 32 bit 
systems while still being cheap to handle.


Nicolas


Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread Arnd Bergmann
On Wednesday 04 June 2014 13:30:32 Nicolas Pitre wrote:
 On Wed, 4 Jun 2014, Arnd Bergmann wrote:
 
  On Tuesday 03 June 2014, Dave Chinner wrote:
   Just to be pedantic, inodes don't need 96 bit timestamps - some
   filesystems can *support up to* 96 bit timestamps. If the kernel
   only supports 64 bit timestamps and that's all the kernel can
   represent, then the upper bits of the 96 bit on-disk inode
   timestamps simply remain zero.
  
  I meant the reverse: since we have file systems that can store
  96-bit timestamps when using 64-bit kernels, we need to extend
  32-bit kernels to have the same internal representation so we
  can actually read those file systems correctly.
  
   If you move the filesystem between kernels with different time
   ranges, then the filesystem needs to be able to tell the kernel what
   it's supported range is.  This is where having the VFS limit the
   range of supported timestamps is important: the limit is the
   min(kernel range, filesystem range). This allows the filesystems
   to be independent of the kernel time representation, and the kernel
   to be independent of the physical filesystem time encoding
  
  I agree it makes sense to let the kernel know about the limits
  of the file system it accesses, but for the reverse, we're probably
  better off just making the kernel representation large enough (i.e.
  96 bits) so it can work with any known file system.
 
 Depends...  96 bit handling may get prohibitive on 32-bit archs.
 
 The important point here is for the kernel to be able to represent the 
 time _range_ used by any known filesystem, not necessarily the time 
 _precision_.
 
 For example, a 64 bit representation can be made of 40 bits for seconds 
 spanning 34865 years, and 24 bits for fractional seconds providing 
 precision down to 60 nanosecs.  That ought to be plenty good on 32 bit 
 systems while still being cheap to handle.

I have checked earlier that we don't do any computation on inode
time stamps in common code, we just pass them around, so there is
very little runtime overhead. There is a small bit of space overhead
(12 bytes) per inode, but that structure is already on the order of
500 bytes.

For other timekeeping stuff in the kernel, I agree that using some
64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds,
...) has advantages, that's exactly the point I was making earlier
against simply extending the internal time_t/timespec to 64-bit
seconds for everything.

Arnd
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about snapshot-aware defrag

2014-06-04 Thread Chris Murphy

On Jun 4, 2014, at 7:15 AM, Martin m_bt...@ml1.co.uk wrote:
 
 Consider using noatime as a /default/ being as there are no known
 'must-use' use cases.

The quote I'm finding on the interwebs is that POSIX “requires that operating 
systems maintain file system metadata that records when each file was last 
accessed.” I'm not sure if upstream kernel projects aim for LSB (and thus 
POSIX) compliance by default and let distros opt out; or the opposite.
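For what it's worth, on the distro/user side this is just a mount option; a hedged /etc/fstab sketch (device and mount point are made up):

```
# Mount btrfs without atime updates (example values)
/dev/sdb2  /mnt/data  btrfs  noatime  0  0
```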

 Those users still wanting atime can add that as a
 mount option with the note that atime use reduces the snapshot defrag
 effectiveness.

I can imagine some optimizations for Btrfs that are easier than other file 
systems, like a way to point metadata chunks to specific devices, for example 
metadata to persistent memory, while the data goes to conventional hard drives.


Chris Murphy



[PATCH] btrfs-progs: canonicalize pathnames for device commands

2014-06-04 Thread Jeff Mahoney
mount(8) will canonicalize pathnames before passing them to the kernel.
Links to e.g. /dev/sda will be resolved to /dev/sda. Links to /dev/dm-#
will be resolved using the name of the device mapper table to
/dev/mapper/name.

Btrfs will use whatever name the user passes to it, regardless of whether
it is canonical or not. That means that if a 'btrfs device ready' is
issued on any device node pointing to the original device, it will adopt
the new name instead of the name that was used during mount.

Mounting using /dev/sdb2 will result in df:
/dev/sdb2  209715200 39328 207577088   1% /mnt

# ls -la /dev/whatever-i-like
lrwxrwxrwx 1 root root 4 Jun  4 13:36 /dev/whatever-i-like -> sdb2
# btrfs dev ready /dev/whatever-i-like
# df /mnt
/dev/whatever-i-like 209715200 39328 207577088   1% /mnt

Likewise, mounting with /dev/mapper/whatever and using /dev/dm-0 with a
btrfs device command results in df showing /dev/dm-0. This can happen with
multipath devices with friendly names enabled and doing something like 
'partprobe' which (at least with our version) ends up issuing a 'change'
uevent on the sysfs node. That *always* uses the dm-# name, and we get
confused users.

This patch does the same canonicalization of the paths that mount does
so that we don't end up having inconsistent names reported by -show_devices
later.

Signed-off-by: Jeff Mahoney je...@suse.com
---
 cmds-device.c  |   60 -
 cmds-replace.c |   13 ++--
 utils.c|   57 ++
 utils.h|2 +
 4 files changed, 117 insertions(+), 15 deletions(-)

--- a/cmds-device.c
+++ b/cmds-device.c
@@ -95,6 +95,7 @@ static int cmd_add_dev(int argc, char **
int devfd, res;
u64 dev_block_count = 0;
int mixed = 0;
+   char *path;
 
res = test_dev_for_mkfs(argv[i], force, estr);
if (res) {
@@ -118,15 +119,24 @@ static int cmd_add_dev(int argc, char **
goto error_out;
}
 
-   strncpy_null(ioctl_args.name, argv[i]);
+   path = canonicalize_path(argv[i]);
+   if (!path) {
+   fprintf(stderr,
+   "ERROR: Could not canonicalize pathname '%s': %s\n",
+   argv[i], strerror(errno));
+   ret++;
+   goto error_out;
+   }
+
+   strncpy_null(ioctl_args.name, path);
res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV, ioctl_args);
e = errno;
-   if(res < 0){
+   if (res < 0) {
fprintf(stderr, "ERROR: error adding the device '%s' - %s\n",
-   argv[i], strerror(e));
+   path, strerror(e));
ret++;
}
-
+   free(path);
}
 
 error_out:
@@ -242,6 +252,7 @@ static int cmd_scan_dev(int argc, char *
 
for( i = devstart ; i < argc ; i++ ){
struct btrfs_ioctl_vol_args args;
+   char *path;
 
if (!is_block_device(argv[i])) {
fprintf(stderr,
@@ -249,9 +260,17 @@ static int cmd_scan_dev(int argc, char *
ret = 1;
goto close_out;
}
-   printf("Scanning for Btrfs filesystems in '%s'\n", argv[i]);
+   path = canonicalize_path(argv[i]);
+   if (!path) {
+   fprintf(stderr,
+   "ERROR: Could not canonicalize path '%s': %s\n",
+   argv[i], strerror(errno));
+   ret = 1;
+   goto close_out;
+   }
+   printf("Scanning for Btrfs filesystems in '%s'\n", path);
 
-   strncpy_null(args.name, argv[i]);
+   strncpy_null(args.name, path);
/*
 * FIXME: which are the error code returned by this ioctl ?
 * it seems that is impossible to understand if there no is
@@ -262,9 +281,11 @@ static int cmd_scan_dev(int argc, char *
 
if( ret < 0 ){
fprintf(stderr, "ERROR: unable to scan the device '%s' - %s\n",
-   argv[i], strerror(e));
+   path, strerror(e));
+   free(path);
goto close_out;
}
+   free(path);
}
 
 close_out:
@@ -284,6 +305,7 @@ static int cmd_ready_dev(int argc, char
struct  btrfs_ioctl_vol_args args;
int fd;
int ret;
+   char*path;
 
if (check_argc_min(argc, 2))
usage(cmd_ready_dev_usage);
@@ -293,22 +315,34 @@ static int cmd_ready_dev(int argc, char
perror(failed to open 

Very slow filesystem

2014-06-04 Thread Igor M
Hello,

Why does btrfs become EXTREMELY slow after some time (months) of usage?
This has now happened a second time; the first time I thought it was a
hard drive fault, but now the drive seems ok.
The filesystem is mounted with compress-force=lzo and is used for MySQL
databases; files are mostly big, 2G-8G.
Copying from this filesystem is unbelievably slow. It goes from 500
KB/s to maybe 5 MB/s, faster for some files.
hdparm -t and dd show 130 MB/s+. There are no errors on the drive, and none in the logs.

Can I somehow get the position of a file on disk, so I can try a raw
read with dd or something to make sure it's not a drive fault?
As I said, I tried dd and speeds are normal, but maybe there is a
problem with only some sectors.
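One way to attempt exactly that, sketched with made-up file names and offsets (and note, as mentioned later in this thread, that filefrag's extent report on btrfs isn't fully reliable when compression is in use):

```shell
# Map the file's physical extents (offsets reported in filesystem blocks):
filefrag -v /mnt/old/mysql/big-table.ibd

# Pick a physical_offset from the output and read that region raw from
# the device, bypassing the filesystem, to see if the disk itself is slow:
dd if=/dev/sde bs=4K skip=123456789 count=262144 of=/dev/null
```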

Below are btrfs version and info:

# uname -a
Linux voyager 3.14.2 #1 SMP Tue May 6 09:25:40 CEST 2014 x86_64
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux
That's the current kernel; when the filesystem was created it was some 3.x, I don't remember.

# btrfs --version
Btrfs v0.20-rc1-358-g194aa4a (now I'm upgraded to Btrfs v3.14.2)

# btrfs fi show
Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
Total devices 1 FS bytes used 2.36TiB
devid1 size 2.73TiB used 2.38TiB path /dev/sde

Label: none  uuid: 09898e7a-b0b4-4a26-a956-a833514c17f6
Total devices 1 FS bytes used 1.05GiB
devid1 size 3.64TiB used 5.04GiB path /dev/sdb

Btrfs v3.14.2

# btrfs fi df /mnt/old
Data, single: total=2.36TiB, used=2.35TiB
System, DUP: total=8.00MiB, used=264.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=8.50GiB, used=7.13GiB
Metadata, single: total=8.00MiB, used=0.00


Re: Very slow filesystem

2014-06-04 Thread Fajar A. Nugraha
On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote:
 Hello,

 Why btrfs becames EXTREMELY slow after some time (months) of usage ?

 # btrfs fi show
 Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
 Total devices 1 FS bytes used 2.36TiB
 devid1 size 2.73TiB used 2.38TiB path /dev/sde

 # btrfs fi df /mnt/old
 Data, single: total=2.36TiB, used=2.35TiB

Is that the fs that is slow?

It's almost full. Most filesystems would exhibit really bad
performance when close to full due to fragmentation issue (threshold
vary, but 80-90% full usually means you need to start adding space).
You should free up some space (e.g. add a new disk so it becomes
multi-device, or delete some files) and rebalance/defrag.
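For btrfs that might look like the following (the added device name is hypothetical; the usage filter keeps the balance from rewriting every chunk):

```shell
# Grow the pool with a second disk, then compact partly-empty chunks:
btrfs device add /dev/sdf /mnt/old
btrfs balance start -dusage=75 /mnt/old   # rebalance only data chunks <75% full
btrfs filesystem df /mnt/old              # check the new allocation
```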

-- 
Fajar


Re: Very slow filesystem

2014-06-04 Thread Roman Mamedov
On Thu, 5 Jun 2014 05:27:33 +0700
Fajar A. Nugraha l...@fajar.net wrote:

 On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote:
  Hello,
 
  Why btrfs becames EXTREMELY slow after some time (months) of usage ?
 
  # btrfs fi show
  Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
  Total devices 1 FS bytes used 2.36TiB
  devid1 size 2.73TiB used 2.38TiB path /dev/sde
 
  # btrfs fi df /mnt/old
  Data, single: total=2.36TiB, used=2.35TiB
 
 Is that the fs that is slow?
 
 It's almost full.

Really, is it? The device size is 2.73 TiB, while only 2.35 TiB is used. About
400 GiB should be free. That's not "almost full". The btrfs fi df readings
may be a little confusing, but usually it's those who ask questions on this
list who are confused by them, not those who (try to) answer. :)

-- 
With respect,
Roman




Re: Very slow filesystem

2014-06-04 Thread Igor M
On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha l...@fajar.net wrote:
 On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote:
 Hello,

 Why btrfs becames EXTREMELY slow after some time (months) of usage ?

 # btrfs fi show
 Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
 Total devices 1 FS bytes used 2.36TiB
 devid1 size 2.73TiB used 2.38TiB path /dev/sde

 # btrfs fi df /mnt/old
 Data, single: total=2.36TiB, used=2.35TiB

 Is that the fs that is slow?

 It's almost full. Most filesystems would exhibit really bad
 performance when close to full due to fragmentation issue (threshold
 vary, but 80-90% full usually means you need to start adding space).
 You should free up some space (e.g. add a new disk so it becomes
 multi-device, or delete some files) and rebalance/defrag.

 --
 Fajar

Yes, this one is slow. I know it's getting full; I'm copying to a new
disk now (it will take days or even weeks!).
It shouldn't be that fragmented, as data is mostly just added. But
still, can reading become so slow just because of fullness and
fragmentation?
It just seems strange to me. If it were 60 MB/s instead of 130, fine, but
it's so much slower. I'll delete some files and see if it gets faster, but
it will take hours to copy them to the new disk.


Re: Very slow filesystem

2014-06-04 Thread Timofey Titovets
I may be mistaken, but I think:
btrfstune -x dev # can improve performance because this decreases metadata size
Also, in recent versions of btrfs-progs the default node size changed
from 4k to 16k, which can also help (but for this you must reformat the fs).
To clean up the btrfs fi df / output, you can try:
btrfs bal start -f -sconvert=dup,soft -mconvert=dup,soft path
  Data, single: total=52.01GiB, used=49.29GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.50GiB, used=483.77MiB
Also, disable compression or use it without the force option; if I
understand properly, it also causes additional fragmentation (filefrag is helpful here).
Also, to defragment individual files you can simply copy them; the copy
is created unfragmented.

2014-06-05 1:45 GMT+03:00 Igor M igor...@gmail.com:
 On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha l...@fajar.net wrote:
 On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote:
 Hello,

 Why btrfs becames EXTREMELY slow after some time (months) of usage ?

 # btrfs fi show
 Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
 Total devices 1 FS bytes used 2.36TiB
 devid1 size 2.73TiB used 2.38TiB path /dev/sde

 # btrfs fi df /mnt/old
 Data, single: total=2.36TiB, used=2.35TiB

 Is that the fs that is slow?

 It's almost full. Most filesystems would exhibit really bad
 performance when close to full due to fragmentation issue (threshold
 vary, but 80-90% full usually means you need to start adding space).
 You should free up some space (e.g. add a new disk so it becomes
 multi-device, or delete some files) and rebalance/defrag.

 --
 Fajar

 Yes this one is slow. I know it's getting full I'm just copying to new
 disk (it will take days or even weeks!).
 It shouldn't be so much fragmented, data is mostly just added. But
 still, can reading became so slow just because fullness and
 fragmentation ?
 It just seems strange to me. If it would be 60Mb/s instead 130, but so
 much slower. I'll delete some files and see if it will be faster, but
 it will take hours to copy them to new disk.



-- 
Best regards,
Timofey.


Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread H. Peter Anvin
On 06/04/2014 12:24 PM, Arnd Bergmann wrote:
 
 For other timekeeping stuff in the kernel, I agree that using some
 64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds,
 ...) has advantages, that's exactly the point I was making earlier
 against simply extending the internal time_t/timespec to 64-bit
 seconds for everything.
 

How much of a performance issue is it to make time_t 64 bits, and for
the bits there are, how hard are they to fix?

-hpa




[PATCH] btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56

2014-06-04 Thread Gui Hecheng
To return EOPNOTSUPP is more user friendly than to return EINVAL,
and then user-space tool will show that the dev_replace operation
for raid56 is not currently supported rather than showing that
there is an invalid argument.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 9f22905..2af6e66 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -313,7 +313,7 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 
if (btrfs_fs_incompat(fs_info, RAID56)) {
btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 
switch (args-start.cont_reading_from_srcdev_mode) {
-- 
1.8.1.4



[PATCH] btrfs-progs: show meaningful msgs for replace cmd upon raid56

2014-06-04 Thread Gui Hecheng
This depends on the kernel patch:
[PATCH] btrfs:replace EINVAL with EOPNOTSUPP for dev_replace

This catches the EOPNOTSUPP and output msg that says dev_replace raid56
is not currently supported. Note that the msg will only be shown when
run dev_replace not in background.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 cmds-replace.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/cmds-replace.c b/cmds-replace.c
index 9eb981b..8b18110 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -301,6 +301,10 @@ static int cmd_start_replace(int argc, char **argv)
"ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s, %s\n",
path, strerror(errno),
replace_dev_result2string(start_args.result));
+
+   if (errno == EOPNOTSUPP)
+   fprintf(stderr, "WARNING: dev_replace cannot yet handle RAID5/RAID6\n");
+
goto leave_with_error;
}
 
-- 
1.8.1.4



[PATCH] mount: add btrfs to mount.8

2014-06-04 Thread Gui Hecheng
Based on Documentation/filesystems/btrfs.txt

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 sys-utils/mount.8 | 186 ++
 1 file changed, 186 insertions(+)

diff --git a/sys-utils/mount.8 b/sys-utils/mount.8
index efa1ae8..ec8eab3 100644
--- a/sys-utils/mount.8
+++ b/sys-utils/mount.8
@@ -671,6 +671,7 @@ currently supported include:
 .IR adfs ,
 .IR affs ,
 .IR autofs ,
+.IR btrfs ,
 .IR cifs ,
 .IR coda ,
 .IR coherent ,
@@ -1245,6 +1246,191 @@ Give blocksize. Allowed values are 512, 1024, 2048, 
4096.
 These options are accepted but ignored.
 (However, quota utilities may react to such strings in
 .IR /etc/fstab .)
+.SH Mount options for btrfs
+Btrfs is a copy on write filesystem for Linux aimed at
+implementing advanced features while focusing on fault tolerance,
+repair and easy administration.
+.TP
+.BI alloc_start= bytes
+Debugging option to force all block allocations above a certain
+byte threshold on each block device.  The value is specified in
+bytes, optionally with a K, M, or G suffix, case insensitive.
+Default is 1MB.
+.TP
+.B autodefrag
Disable/enable auto defragmentation.
Auto defragmentation detects small random writes into files and queues
them up for the defrag process.  Works best for small files;
not well suited for large database workloads.
+.TP
+\fBcheck_int\fP|\fBcheck_int_data\fP|\fBcheck_int_print_mask=\fP\,\fIvalue\fP
+These debugging options control the behavior of the integrity checking
module (the BTRFS_FS_CHECK_INTEGRITY config option is required).
+
+.B check_int
+enables the integrity checker module, which examines all
+block write requests to ensure on-disk consistency, at a large
+memory and CPU cost.  
+
+.B check_int_data
+includes extent data in the integrity checks, and
+implies the check_int option.
+
+.B check_int_print_mask
+takes a bitmask of BTRFSIC_PRINT_MASK_* values
+as defined in fs/btrfs/check-integrity.c, to control the integrity
+checker module behavior.
+
+See comments at the top of
+.IR fs/btrfs/check-integrity.c
+for more info.
+.TP
+.BI commit= seconds
+Set the interval of periodic commit, 30 seconds by default. Higher
+values defer data being synced to permanent storage with obvious
+consequences when the system crashes. The upper bound is not forced,
+but a warning is printed if it's more than 300 seconds (5 minutes).
+.TP
+\fBcompress\fP|\fBcompress=\fP\,\fItype\fP|\fBcompress-force\fP|\fBcompress-force=\fP\,\fItype\fP
Control BTRFS file data compression.  Type may be specified as "zlib",
"lzo" or "no" (for no compression, used for remounting).  If no type
is specified, zlib is used.  If "compress-force" is specified,
+all files will be compressed, whether or not they compress well.
+If compression is enabled, nodatacow and nodatasum are disabled.
+.TP
+.B degraded
+Allow mounts to continue with missing devices.  A read-write mount may
+fail with too many devices missing, for example if a stripe member
+is completely missing.
+.TP
+.BI device= devicepath
+Specify a device during mount so that ioctls on the control device
+can be avoided.  Especially useful when trying to mount a multi-device
+setup as root.  May be specified multiple times for multiple devices.
+.TP
+.B discard
+Disable/enable discard mount option.
+Discard issues frequent commands to let the block device reclaim space
+freed by the filesystem.
+This is useful for SSD devices, thinly provisioned
+LUNs and virtual machine images, but may have a significant
+performance impact.  (The fstrim command is also available to
+initiate batch trims from userspace).
+.TP
+.B enospc_debug
+Disable/enable debugging option to be more verbose in some ENOSPC conditions.
+.TP
+.BI fatal_errors= action
+Action to take when encountering a fatal error: 
+  bug - BUG() on a fatal error.  This is the default.
+  panic - panic() on a fatal error.
+.TP
+.B flushoncommit
+The
+.B flushoncommit
+mount option forces any data dirtied by a write in a
+prior transaction to commit as part of the current commit.  This makes
+the committed state a fully consistent view of the file system from the
+application's perspective (i.e., it includes all completed file system
+operations).  This was previously the behavior only when a snapshot is
+created.
+.TP
+.B inode_cache
+Enable free inode number caching.   Defaults to off due to an overflow
+problem when the free space crcs don't fit inside a single page.
+.TP
+.BI max_inline= bytes
+Specify the maximum amount of space, in bytes, that can be inlined in
+a metadata B-tree leaf.  The value is specified in bytes, optionally 
+with a K, M, or G suffix, case insensitive.  In practice, this value
+is limited by the root sector size, with some space unavailable due
+to leaf headers.  For a 4k sectorsize, max inline data is ~3900 bytes.
+.TP
+.BI metadata_ratio= value
+Specify that 1 metadata chunk should be allocated after every
.IR value
+data chunks.  Off by default.
+.TP
+.B noacl
+Enable/disable support for Posix 

Re: Very slow filesystem

2014-06-04 Thread Duncan
Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:

 Why btrfs becames EXTREMELY slow after some time (months) of usage ?
 This is now happened second time, first time I though it was hard drive
 fault, but now drive seems ok.
 Filesystem is mounted with compress-force=lzo and is used for MySQL
 databases, files are mostly big 2G-8G.

That's the problem right there, database access pattern on files over 1 
GiB in size, but the problem along with the fix has been repeated over 
and over and over and over... again on this list, and it's covered on the 
btrfs wiki as well, so I guess you haven't checked existing answers 
before you asked the same question yet again.

Never-the-less, here's the basic answer yet again...

Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a 
particular file rewrite pattern, that being frequently changed and 
rewritten data internal to an existing file (as opposed to appended to 
it, like a log file).  In the normal case, such an internal-rewrite 
pattern triggers copies of the rewritten blocks every time they change, 
*HIGHLY* fragmenting this type of files after only a relatively short 
period.  While compression changes things up a bit (filefrag doesn't know 
how to deal with it yet and its report isn't reliable), it's not unusual 
to see people with several-gig files with this sort of write pattern on 
btrfs without compression find filefrag reporting literally hundreds of 
thousands of extents!

For smaller files with this access pattern (think firefox/thunderbird 
sqlite database files and the like), typically up to a few hundred MiB or 
so, btrfs' autodefrag mount option works reasonably well, as when it sees 
a file fragmenting due to rewrite, it'll queue up that file for 
background defrag via sequential copy, deleting the old fragmented copy 
after the defrag is done.

For larger files (say a gig plus) with this access pattern, typically 
larger database files as well as VM images, autodefrag doesn't scale so 
well, as the whole file must be rewritten each time, and at that size the 
changes can come faster than the file can be rewritten.  So a different 
solution must be used for them.

The recommended solution for larger internal-rewrite-pattern files is to 
give them the NOCOW file attribute (chattr +C) , so they're updated in 
place.  However, this attribute cannot be added to a file with existing 
data and have things work as expected.  NOCOW must be added to the file 
before it contains data.  The easiest way to do that is to set the 
attribute on the subdir that will contain the files and let the files 
inherit the attribute as they are created.  Then you can copy (not move, 
and don't use cp's --reflink option) existing files into the new subdir, 
such that the new copy gets created with the NOCOW attribute.

NOCOW files are updated in-place, thereby eliminating the fragmentation 
that would otherwise occur, keeping them fast to access.
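A minimal sketch of that setup, with hypothetical paths:

```shell
# Set NOCOW on an empty directory first; files created (or copied)
# into it afterwards inherit the attribute:
mkdir /srv/mysql-nocow
chattr +C /srv/mysql-nocow
lsattr -d /srv/mysql-nocow            # should show the 'C' flag
# Copy, don't move, and don't reflink -- the data must be rewritten:
cp --reflink=never /var/lib/mysql/big.ibd /srv/mysql-nocow/
```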

However, there are a few caveats.  Setting NOCOW turns off file 
compression and checksumming as well, which is actually what you want for 
such files as it eliminates race conditions and other complex issues that 
would otherwise occur when trying to update the files in-place (thus the 
reason such features aren't part of most non-COW filesystems, which 
update in-place by default).

Additionally, taking a btrfs snapshot locks the existing data in place 
for the snapshot, so the first rewrite to a file block (4096 bytes, I 
believe) after a snapshot will always be COW, even if the file has the 
NOCOW attribute set.  Some people run automatic snapshotting software and 
can be taking snapshots as often as once a minute.  Obviously, this 
effectively almost kills NOCOW entirely, since it's then only effective 
on changes after the first one between shapshots, and with snapshots only 
a minute apart, the file fragments almost as fast as it would have 
otherwise!

So snapshots and the NOCOW attribute basically don't get along with each 
other.  But because snapshots stop at subvolume boundaries, one method to 
avoid snapshotting NOCOW files is to put your NOCOW files, already in 
their own subdirs if using the suggestion above, into dedicated subvolumes 
as well.  That lets you continue taking snapshots of the parent subvolume, 
without snapshotting the dedicated subvolumes containing the NOCOW 
database or VM-image files.

You'd then do conventional backups of your database and VM-image files, 
instead of snapshotting them.
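Sketched out, assuming the parent subvolume is mounted at /mnt/data (the layout is hypothetical):

```shell
# Dedicated subvolume for NOCOW files; snapshots of the parent stop at
# the subvolume boundary, so its contents are left out of the snapshot:
btrfs subvolume create /mnt/data/vm-images
chattr +C /mnt/data/vm-images
btrfs subvolume snapshot -r /mnt/data /mnt/data/snap-2014-06-05
# The snapshot contains only an empty stub directory for vm-images.
```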

Of course if you're not using btrfs snapshots in the first place, you can 
avoid the whole subvolume thing, and just put your NOCOW files in their 
own subdirs, setting NOCOW on the subdir as suggested above, so files 
(and further subdirs, nested subdirs inherit the NOCOW as well) inherit 
the NOCOW of the subdir they're created in, at that creation.

Meanwhile, it can be noted that once you turn off COW/compression/
checksumming, and if you're not snapshotting, you're 

Re: Very slow filesystem

2014-06-04 Thread Fajar A. Nugraha
(resending to the list as plain text, the original reply was rejected
due to HTML format)

On Thu, Jun 5, 2014 at 10:05 AM, Duncan 1i5t5.dun...@cox.net wrote:

 Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:

  Why btrfs becames EXTREMELY slow after some time (months) of usage ?
  This is now happened second time, first time I though it was hard drive
  fault, but now drive seems ok.
  Filesystem is mounted with compress-force=lzo and is used for MySQL
  databases, files are mostly big 2G-8G.

 That's the problem right there, database access pattern on files over 1
 GiB in size, but the problem along with the fix has been repeated over
 and over and over and over... again on this list, and it's covered on the
 btrfs wiki as well

Which part on the wiki? It's not on
https://btrfs.wiki.kernel.org/index.php/FAQ or
https://btrfs.wiki.kernel.org/index.php/UseCases

 so I guess you haven't checked existing answers
 before you asked the same question yet again.

 Never-the-less, here's the basic answer yet again...

 Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a
 particular file rewrite pattern, that being frequently changed and
 rewritten data internal to an existing file (as opposed to appended to
 it, like a log file).  In the normal case, such an internal-rewrite
 pattern triggers copies of the rewritten blocks every time they change,
 *HIGHLY* fragmenting this type of files after only a relatively short
 period.  While compression changes things up a bit (filefrag doesn't know
 how to deal with it yet and its report isn't reliable), it's not unusual
 to see people with several-gig files with this sort of write pattern on
 btrfs without compression find filefrag reporting literally hundreds of
 thousands of extents!

 For smaller files with this access pattern (think firefox/thunderbird
 sqlite database files and the like), typically up to a few hundred MiB or
 so, btrfs' autodefrag mount option works reasonably well, as when it sees
 a file fragmenting due to rewrite, it'll queue up that file for
 background defrag via sequential copy, deleting the old fragmented copy
 after the defrag is done.

 For larger files (say a gig plus) with this access pattern, typically
 larger database files as well as VM images, autodefrag doesn't scale so
 well, as the whole file must be rewritten each time, and at that size the
 changes can come faster than the file can be rewritten.  So a different
 solution must be used for them.


If COW and rewrite is the main issue, why don't zfs experience the
extreme slowdown (that is, not if you have sufficient free space
available, like 20% or so)?

-- 
Fajar


Re: Very slow filesystem

2014-06-04 Thread Duncan
Fajar A. Nugraha posted on Thu, 05 Jun 2014 10:22:49 +0700 as excerpted:

 (resending to the list as plain text, the original reply was rejected
 due to HTML format)
 
 On Thu, Jun 5, 2014 at 10:05 AM, Duncan 1i5t5.dun...@cox.net wrote:

 Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:

  Why btrfs becames EXTREMELY slow after some time (months) of usage ?
  This is now happened second time, first time I though it was hard
  drive fault, but now drive seems ok.
  Filesystem is mounted with compress-force=lzo and is used for MySQL
  databases, files are mostly big 2G-8G.

 That's the problem right there, database access pattern on files over 1
 GiB in size, but the problem along with the fix has been repeated over
 and over and over and over... again on this list, and it's covered on
 the btrfs wiki as well
 
 Which part on the wiki? It's not on
 https://btrfs.wiki.kernel.org/index.php/FAQ or
 https://btrfs.wiki.kernel.org/index.php/UseCases

Most of the discussion and information is on the list, but there's a 
limited amount of information on the wiki in at least three places.  Two 
are on the mount options page, in the autodefrag and nodatacow options 
description:

* Autodefrag says it's well suited to bdb and sqlite dbs but not vm 
images or big dbs (yet).

* Nodatacow says performance gain is usually under 5% *UNLESS* the 
workload is random writes to large db files, where the difference can be 
VERY large.  (There's also mention of the fact that this turns off 
checksumming and compression.)

Of course that's the nodatacow mount option, not the NOCOW file 
attribute, which isn't to my knowledge discussed on the wiki, and given 
the wiki wording, one does indeed have to read a bit between the lines, 
but it is there if one looks.  That was certainly enough hint for me to 
mark the issue for further study as I did my initial pre-mkfs.btrfs 
research, for instance, and that it was a problem, with additional 
detail, was quickly confirmed once I checked the list.

* Additionally, there some discussion in the FAQ under Can copy-on-write 
be turned off for data blocks?, including discussion of the command used 
(chattr +C), a link to a script, a shell commands example, and the hint 
will produce file suitable for a raw VM image -- the blocks will be 
updated in-place and are preallocated.


FWIW, if I did wiki editing there'd probably be a dedicated page 
discussing it, but for better or worse, I seem to work best on mailing 
lists and newsgroups.  Every time I've tried contributing on the web, 
even to a web forum, which one would think would be close enough to 
lists/groups for me to adapt to, it simply hasn't gone much of anywhere.  
So these days I let other people more comfortable with editing wikis or 
doing web forums do that (sometimes by quoting my list posts nearly 
verbatim or simply linking to them, which I'm fine with, as that's where 
much of the info I post comes from in the first place), and I stick to 
the lists.  Since I don't directly contribute to the wiki I don't much 
criticize it, but there are indeed at least hints there for those who 
can read them -- something I did myself, so I know it's not asking the 
impossible.

 If COW and rewrite is the main issue, why don't zfs experience the
 extreme slowdown (that is, not if you have sufficient free space
 available, like 20% or so)?

My personal opinion?  Primarily two things:

1) zfs is far more mature than btrfs and has been in production use for 
many years now, while btrfs is still barely getting the huge warnings 
stripped off.  There's a lot of btrfs optimization possible that simply 
hasn't occurred yet, as the focus is still on real data-destruction-risk 
bugs; in fact, btrfs isn't yet feature-complete either, so there's 
still focus on raw feature development as well.  When btrfs gets to the 
maturity level that zfs is at now, I expect a lot of the problems we 
have now will have been dramatically reduced if not eliminated.  (And 
the devs are indeed working on this problem, among others.)

2) Stating the obvious, while both btrfs and zfs are COW-based and have 
other similarities, btrfs is a different filesystem, with an entirely 
different implementation and a somewhat different emphasis.  There 
consequently WILL be some differences, even when both are mature 
filesystems.  It's entirely possible that something about the btrfs 
implementation makes it less suitable in general to this particular use-
case.

Additionally, while I don't have zfs experience myself nor do I find it a 
particularly feasible option for me due to licensing and political 
issues, from what I've read it tends to handle certain issues by simply 
throwing gigs on gigs of memory at the problem.  Btrfs is designed to 
require far less memory, and as such, will by definition be somewhat more 
limited in spots.  (Arguably, this is simply a specific case of #2 above, 
they're