Re: [PATCH] btrfs-progs: ins: Add v2 ioctl support in logical-resolve

2018-07-18 Thread Qu Wenruo


On 2018年07月19日 13:59, Lu Fengqi wrote:
> On Thu, Jul 19, 2018 at 01:51:59PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018年07月19日 13:15, Misono Tomohiro wrote:
>>> Add -i (ignore offset) option to logical-resolve command
>>> to show how BTRFS_IOC_LOGICAL_INO_V2 ioctl works
>>> (returns every ref to the extent of given logical address).
>>>
>>> [Example]
>>> $ mkfs.btrfs -f $DEV
>>> $ mount $DEV /mnt
>>>
>>> $ dd if=ev/urandom of=/mnt/file bs=4k count0
>>> # split above extent
>>> $ dd if=ev/urandom of=/mnt/file bs=4k seek count=1 conv=notrunc
>>> $ btrfs filesystem sync
>>>
>>> # v1
>>> $ btrfs inspect-internal logical-resolve -P 13631488 /mnt
>>> inode 257 offset 0 root 5
>>>
>>> # v2
>>> $ btrfs inspect-internal logical-resolve -iP 13631488 /mnt
>>> inode 257 offset 0 root 5
>>> inode 257 offset 45056 root 5
>>>
>>> Signed-off-by: Misono Tomohiro 
>>> ---
>>>  Documentation/btrfs-inspect-internal.asciidoc |  4 
>>>  cmds-inspect.c| 17 +++--
>>>  ioctl.h   | 10 +-
>>>  libbtrfsutil/btrfs.h  | 10 +-
>>
>> Not related to this patch itself, but I'm wondering could we just use
>> /usr/include/linux/btrfs.h?
>>
>> So we could save 2 same copies of headers here.
> 
> If we want to compile the newest btrfs-progs, but the older
> /usr/include/linux/btrfs.h maybe doesn't have already defined the ioctl,
> this will cause the compile error.

Makes sense.

> I am curious about why the
> libbtrfsutil need to copy the part of ioctl.h instead of including the
> ioctl.h directly?

Maybe for distribution purpose? Some binding may be delivered as
independent package just as Omar tries to do.
So in that case independent header makes sense.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs-progs: ins: Add v2 ioctl support in logical-resolve

2018-07-18 Thread Lu Fengqi
On Thu, Jul 19, 2018 at 01:51:59PM +0800, Qu Wenruo wrote:
>
>
>On 2018年07月19日 13:15, Misono Tomohiro wrote:
>> Add -i (ignore offset) option to logical-resolve command
>> to show how BTRFS_IOC_LOGICAL_INO_V2 ioctl works
>> (returns every ref to the extent of given logical address).
>> 
>> [Example]
>> $ mkfs.btrfs -f $DEV
>> $ mount $DEV /mnt
>> 
>> $ dd if=/dev/urandom of=/mnt/file bs=4k count=100
>> # split above extent
>> $ dd if=/dev/urandom of=/mnt/file bs=4k seek=10 count=1 conv=notrunc
>> $ btrfs filesystem sync
>> 
>> # v1
>> $ btrfs inspect-internal logical-resolve -P 13631488 /mnt
>> inode 257 offset 0 root 5
>> 
>> # v2
>> $ btrfs inspect-internal logical-resolve -iP 13631488 /mnt
>> inode 257 offset 0 root 5
>> inode 257 offset 45056 root 5
>> 
>> Signed-off-by: Misono Tomohiro 
>> ---
>>  Documentation/btrfs-inspect-internal.asciidoc |  4 
>>  cmds-inspect.c| 17 +++--
>>  ioctl.h   | 10 +-
>>  libbtrfsutil/btrfs.h  | 10 +-
>
>Not related to this patch itself, but I'm wondering could we just use
>/usr/include/linux/btrfs.h?
>
>So we could save 2 same copies of headers here.

If we want to compile the newest btrfs-progs, but the older
/usr/include/linux/btrfs.h maybe doesn't have already defined the ioctl,
this will cause the compile error. I am curious about why the
libbtrfsutil need to copy the part of ioctl.h instead of including the
ioctl.h directly?

-- 
Thanks,
Lu

>
>Thanks,
>Qu
>
>>  4 files changed, 37 insertions(+), 4 deletions(-)
>> 
>> diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
>> b/Documentation/btrfs-inspect-internal.asciidoc
>> index e2db6466..a55c9add 100644
>> --- a/Documentation/btrfs-inspect-internal.asciidoc
>> +++ b/Documentation/btrfs-inspect-internal.asciidoc
>> @@ -125,6 +125,10 @@ skip the path resolving and print the inodes instead
>>  verbose mode, print count of returned paths and all ioctl() return values
>>  -s 
>>  set internal buffer for storing the file names to 'bufsize', default is 
>> 4096, maximum 64k
>> +-i
>> +ignore offset and return all the ref information
>> +which points to the extent containing given logical address.
>> +This requires version 2 ioctl support (BTRFS_IOC_LOGICAL_INO_V2, since 
>> 4.15).
>>  
>>  *min-dev-size* [options] ::
>>  (needs root privileges)
>> diff --git a/cmds-inspect.c b/cmds-inspect.c
>> index 2fc50c1a..d47eeacb 100644
>> --- a/cmds-inspect.c
>> +++ b/cmds-inspect.c
>> @@ -131,6 +131,9 @@ static const char * const 
>> cmd_inspect_logical_resolve_usage[] = {
>>  "-s bufsize  set inode container's size. This is used to increase 
>> inode",
>>  "container's size in case it is not enough to read all the 
>> ",
>>  "resolved results. The max value one can set is 64k",
>> +"-i  ignore offset and return all the ref information",
>> +"which points to the extent containing given logical 
>> address",
>> +"(requires version 2 ioctl support)",
>>  NULL
>>  };
>>  
>> @@ -142,7 +145,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
>> **argv)
>>  int verbose = 0;
>>  int getpath = 1;
>>  int bytes_left;
>> +int ignore_offset = 0;
>>  struct btrfs_ioctl_logical_ino_args loi;
>> +unsigned long ioctl_num = BTRFS_IOC_LOGICAL_INO;
>>  struct btrfs_data_container *inodes;
>>  u64 size = 4096;
>>  char full_path[PATH_MAX];
>> @@ -151,7 +156,7 @@ static int cmd_inspect_logical_resolve(int argc, char 
>> **argv)
>>  
>>  optind = 0;
>>  while (1) {
>> -int c = getopt(argc, argv, "Pvs:");
>> +int c = getopt(argc, argv, "Pvs:i");
>>  if (c < 0)
>>  break;
>>  
>> @@ -165,6 +170,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
>> **argv)
>>  case 's':
>>  size = arg_strtou64(optarg);
>>  break;
>> +case 'i':
>> +ignore_offset = 1;
>> +break;
>>  default:
>>  usage(cmd_inspect_logical_resolve_usage);
>>  }
>> @@ -183,13 +191,18 @@ static int cmd_inspect_logical_resolve(int argc, char 
>> **argv)
>>  loi.size = size;
>>  loi.inodes = ptr_to_u64(inodes);
>>  
>> +if (ignore_offset) {
>> +loi.flags = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
>> +ioctl_num = BTRFS_IOC_LOGICAL_INO_V2;
>> +}
>> +
>>  fd = btrfs_open_dir(argv[optind + 1], &dirstream, 1);
>>  if (fd < 0) {
>>  ret = 12;
>>  goto out;
>>  }
>>  
>> -ret = ioctl(fd, BTRFS_IOC_LOGICAL_INO, &loi);
>> +ret = ioctl(fd, ioctl_num, &loi);
>>  if (ret < 0) {
>>  error("logical ino ioctl: %m");
>>  goto out;
>> diff --git a/ioctl.h b/ioctl.h
>> index 709e996f..74f30c20 100644
>> --- a/ioctl.h
>> +++ b

Re: [PATCH] btrfs-progs: ins: Add v2 ioctl support in logical-resolve

2018-07-18 Thread Qu Wenruo


On 2018年07月19日 13:15, Misono Tomohiro wrote:
> Add -i (ignore offset) option to logical-resolve command
> to show how BTRFS_IOC_LOGICAL_INO_V2 ioctl works
> (returns every ref to the extent of given logical address).
> 
> [Example]
> $ mkfs.btrfs -f $DEV
> $ mount $DEV /mnt
> 
> $ dd if=/dev/urandom of=/mnt/file bs=4k count=100
> # split above extent
> $ dd if=/dev/urandom of=/mnt/file bs=4k seek=10 count=1 conv=notrunc
> $ btrfs filesystem sync
> 
> # v1
> $ btrfs inspect-internal logical-resolve -P 13631488 /mnt
> inode 257 offset 0 root 5
> 
> # v2
> $ btrfs inspect-internal logical-resolve -iP 13631488 /mnt
> inode 257 offset 0 root 5
> inode 257 offset 45056 root 5
> 
> Signed-off-by: Misono Tomohiro 
> ---
>  Documentation/btrfs-inspect-internal.asciidoc |  4 
>  cmds-inspect.c| 17 +++--
>  ioctl.h   | 10 +-
>  libbtrfsutil/btrfs.h  | 10 +-

Not related to this patch itself, but I'm wondering could we just use
/usr/include/linux/btrfs.h?

So we could save 2 same copies of headers here.

Thanks,
Qu

>  4 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
> b/Documentation/btrfs-inspect-internal.asciidoc
> index e2db6466..a55c9add 100644
> --- a/Documentation/btrfs-inspect-internal.asciidoc
> +++ b/Documentation/btrfs-inspect-internal.asciidoc
> @@ -125,6 +125,10 @@ skip the path resolving and print the inodes instead
>  verbose mode, print count of returned paths and all ioctl() return values
>  -s 
>  set internal buffer for storing the file names to 'bufsize', default is 
> 4096, maximum 64k
> +-i
> +ignore offset and return all the ref information
> +which points to the extent containing given logical address.
> +This requires version 2 ioctl support (BTRFS_IOC_LOGICAL_INO_V2, since 4.15).
>  
>  *min-dev-size* [options] ::
>  (needs root privileges)
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index 2fc50c1a..d47eeacb 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -131,6 +131,9 @@ static const char * const 
> cmd_inspect_logical_resolve_usage[] = {
>   "-s bufsize  set inode container's size. This is used to increase 
> inode",
>   "container's size in case it is not enough to read all the 
> ",
>   "resolved results. The max value one can set is 64k",
> + "-i  ignore offset and return all the ref information",
> + "which points to the extent containing given logical 
> address",
> + "(requires version 2 ioctl support)",
>   NULL
>  };
>  
> @@ -142,7 +145,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
> **argv)
>   int verbose = 0;
>   int getpath = 1;
>   int bytes_left;
> + int ignore_offset = 0;
>   struct btrfs_ioctl_logical_ino_args loi;
> + unsigned long ioctl_num = BTRFS_IOC_LOGICAL_INO;
>   struct btrfs_data_container *inodes;
>   u64 size = 4096;
>   char full_path[PATH_MAX];
> @@ -151,7 +156,7 @@ static int cmd_inspect_logical_resolve(int argc, char 
> **argv)
>  
>   optind = 0;
>   while (1) {
> - int c = getopt(argc, argv, "Pvs:");
> + int c = getopt(argc, argv, "Pvs:i");
>   if (c < 0)
>   break;
>  
> @@ -165,6 +170,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
> **argv)
>   case 's':
>   size = arg_strtou64(optarg);
>   break;
> + case 'i':
> + ignore_offset = 1;
> + break;
>   default:
>   usage(cmd_inspect_logical_resolve_usage);
>   }
> @@ -183,13 +191,18 @@ static int cmd_inspect_logical_resolve(int argc, char 
> **argv)
>   loi.size = size;
>   loi.inodes = ptr_to_u64(inodes);
>  
> + if (ignore_offset) {
> + loi.flags = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
> + ioctl_num = BTRFS_IOC_LOGICAL_INO_V2;
> + }
> +
>   fd = btrfs_open_dir(argv[optind + 1], &dirstream, 1);
>   if (fd < 0) {
>   ret = 12;
>   goto out;
>   }
>  
> - ret = ioctl(fd, BTRFS_IOC_LOGICAL_INO, &loi);
> + ret = ioctl(fd, ioctl_num, &loi);
>   if (ret < 0) {
>   error("logical ino ioctl: %m");
>   goto out;
> diff --git a/ioctl.h b/ioctl.h
> index 709e996f..74f30c20 100644
> --- a/ioctl.h
> +++ b/ioctl.h
> @@ -491,10 +491,16 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_ino_path_args) 
> == 56);
>  struct btrfs_ioctl_logical_ino_args {
>   __u64   logical;/* in */
>   __u64   size;   /* in */
> - __u64   reserved[4];
> + __u64   reserved[3];/* must be 0 for now */
> + __u64   flags;

[PATCH] btrfs-progs: ins: Add v2 ioctl support in logical-resolve

2018-07-18 Thread Misono Tomohiro
Add -i (ignore offset) option to logical-resolve command
to show how BTRFS_IOC_LOGICAL_INO_V2 ioctl works
(returns every ref to the extent of given logical address).

[Example]
$ mkfs.btrfs -f $DEV
$ mount $DEV /mnt

$ dd if=/dev/urandom of=/mnt/file bs=4k count=100
# split above extent
$ dd if=/dev/urandom of=/mnt/file bs=4k seek=10 count=1 conv=notrunc
$ btrfs filesystem sync

# v1
$ btrfs inspect-internal logical-resolve -P 13631488 /mnt
inode 257 offset 0 root 5

# v2
$ btrfs inspect-internal logical-resolve -iP 13631488 /mnt
inode 257 offset 0 root 5
inode 257 offset 45056 root 5

Signed-off-by: Misono Tomohiro 
---
 Documentation/btrfs-inspect-internal.asciidoc |  4 
 cmds-inspect.c| 17 +++--
 ioctl.h   | 10 +-
 libbtrfsutil/btrfs.h  | 10 +-
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
b/Documentation/btrfs-inspect-internal.asciidoc
index e2db6466..a55c9add 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -125,6 +125,10 @@ skip the path resolving and print the inodes instead
 verbose mode, print count of returned paths and all ioctl() return values
 -s 
 set internal buffer for storing the file names to 'bufsize', default is 4096, 
maximum 64k
+-i
+ignore offset and return all the ref information
+which points to the extent containing given logical address.
+This requires version 2 ioctl support (BTRFS_IOC_LOGICAL_INO_V2, since 4.15).
 
 *min-dev-size* [options] ::
 (needs root privileges)
diff --git a/cmds-inspect.c b/cmds-inspect.c
index 2fc50c1a..d47eeacb 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -131,6 +131,9 @@ static const char * const 
cmd_inspect_logical_resolve_usage[] = {
"-s bufsize  set inode container's size. This is used to increase 
inode",
"container's size in case it is not enough to read all the 
",
"resolved results. The max value one can set is 64k",
+   "-i  ignore offset and return all the ref information",
+   "which points to the extent containing given logical 
address",
+   "(requires version 2 ioctl support)",
NULL
 };
 
@@ -142,7 +145,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
**argv)
int verbose = 0;
int getpath = 1;
int bytes_left;
+   int ignore_offset = 0;
struct btrfs_ioctl_logical_ino_args loi;
+   unsigned long ioctl_num = BTRFS_IOC_LOGICAL_INO;
struct btrfs_data_container *inodes;
u64 size = 4096;
char full_path[PATH_MAX];
@@ -151,7 +156,7 @@ static int cmd_inspect_logical_resolve(int argc, char 
**argv)
 
optind = 0;
while (1) {
-   int c = getopt(argc, argv, "Pvs:");
+   int c = getopt(argc, argv, "Pvs:i");
if (c < 0)
break;
 
@@ -165,6 +170,9 @@ static int cmd_inspect_logical_resolve(int argc, char 
**argv)
case 's':
size = arg_strtou64(optarg);
break;
+   case 'i':
+   ignore_offset = 1;
+   break;
default:
usage(cmd_inspect_logical_resolve_usage);
}
@@ -183,13 +191,18 @@ static int cmd_inspect_logical_resolve(int argc, char 
**argv)
loi.size = size;
loi.inodes = ptr_to_u64(inodes);
 
+   if (ignore_offset) {
+   loi.flags = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   ioctl_num = BTRFS_IOC_LOGICAL_INO_V2;
+   }
+
fd = btrfs_open_dir(argv[optind + 1], &dirstream, 1);
if (fd < 0) {
ret = 12;
goto out;
}
 
-   ret = ioctl(fd, BTRFS_IOC_LOGICAL_INO, &loi);
+   ret = ioctl(fd, ioctl_num, &loi);
if (ret < 0) {
error("logical ino ioctl: %m");
goto out;
diff --git a/ioctl.h b/ioctl.h
index 709e996f..74f30c20 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -491,10 +491,16 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_ino_path_args) == 
56);
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   reserved[3];/* must be 0 for now */
+   __u64   flags;  /* in, v2 only */
/* struct btrfs_data_container  *inodes;out   */
__u64   inodes;
 };
+/*
+ * Return every ref to the extent, not just those containing logical block.
+ * Requires logical == extent bytenr.
+ */
+#define BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET   (1ULL << 0)
 
 enum btrfs_dev_stat_values {
  

Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-18 Thread Marc MERLIN
On Wed, Jul 18, 2018 at 10:42:21PM +0300, Andrei Borzenkov wrote:
> > Any help from other experienced developers would definitely help to
> > solve why memory of 'btrfs check' is not swapped out or why OOM killer
> > is not triggered.
> 
> Almost all used memory is marked as "active" and active pages are not
> swapped. Page is active if it was accessed recently. Is it possible that
> btrfs logic does frequent scans across all allocated memory?
> >>
> >> Active: 30381404 kB
> >> Inactive: 585952 kB

That is a very good find.

Yes, the linux kernel VM may be smart enough not to swap pages that got used
recently and when btrfs slurps all the extents to cross check everything, I
think it does cross reference them all many times.
This is why it can run in a few hours when btrfs check lowmem requires days
to run in a similar situation.

I'm not sure if there is a good way around this, but it's good to know that
btrfs repair can effectively abuse the linux VM in a way that it'll take
everything down without OOM having a chance to trigger.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
Related on XFS list.

https://www.spinics.net/lists/linux-xfs/msg20722.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-07-18 13:40, Chris Murphy wrote:
>>
>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy 
>> wrote:
>>
>>> I don't know for sure, but based on the addresses reported before and
>>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>>> the originally fallocated addresses for dd. So maybe it is COWing into
>>> new blocks, but is just as quickly deallocating the fallocated blocks
>>> as it goes, and hence doesn't end up in enospc?
>>
>>
>> Previous thread is "Problem with file system" from August 2017. And
>> there's these reproduce steps from Austin which have fallocate coming
>> after the dd.
>>
>>  truncate --size=4G ./test-fs
>>  mkfs.btrfs ./test-fs
>>  mkdir ./test
>>  mount -t auto ./test-fs ./test
>>  dd if=/dev/zero of=./test/test bs=65536 count=32768
>>  fallocate -l 2147483650 ./test/test && echo "Success!"
>>
>>
>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
>> fallocate in half.
>>
>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
>> [chris@f28s btrfs]$ sync
>> [chris@f28s btrfs]$ df -h
>> FilesystemSize  Used Avail Use% Mounted on
>> /dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>>
>>
>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
>> it, this fails, but I kinda expect that because there's only 1.1G free
>> space. But maybe that's what you're saying is the bug, it shouldn't
>> fail?
>
> Yes, you're right, I had things backwards (well, kind of, this does work on
> ext4 and regular XFS, so it arguably should work here).

I guess I'm confused what it even means to fallocate over a file with
in-use blocks unless either -d or -p options are used. And from the
man page, I don't grok the distinction between -d and -p either. But
based on their descriptions I'd expect they both should work without
enospc.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Goffredo Baroncelli
On 07/18/2018 09:20 AM, Duncan wrote:
> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
> excerpted:
> 
>> On 07/17/2018 11:12 PM, Duncan wrote:
>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>> excerpted:
>>>
 On 07/15/2018 04:37 PM, waxhead wrote:
>>>
 Striping and mirroring/pairing are orthogonal properties; mirror and
 parity are mutually exclusive.
>>>
>>> I can't agree.  I don't know whether you meant that in the global
>>> sense,
>>> or purely in the btrfs context (which I suspect), but either way I
>>> can't agree.
>>>
>>> In the pure btrfs context, while striping and mirroring/pairing are
>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>> flexible enough to allow both together and the feature may at some
>>> point be added, so it makes sense to have a layout notation format
>>> flexible enough to allow it as well.
>>
>> When I say orthogonal, It means that these can be combined: i.e. you can
>> have - striping (RAID0)
>> - parity  (?)
>> - striping + parity  (e.g. RAID5/6)
>> - mirroring  (RAID1)
>> - mirroring + striping  (RAID10)
>>
>> However you can't have mirroring+parity; this means that a notation
>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>> too verbose.
> 
> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
> on top of raid0.  
And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profile; however the 
only ones that make sense are the ones above.

The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below). 
[]
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this. 
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device 
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
> alternate device pairs, it's effectively striped at the macro level, with 
> the 1 GiB device-level chunks effectively being huge individual device 
> strips of 1 GiB.

The striping concept is based to the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel 
from different disks.
With a "stripe size" of 1GB, it is very unlikely that this would happens.

 
> At 1 GiB strip size it doesn't have the typical performance advantage of 
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
> strips/chunks.



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-18 Thread Andrei Borzenkov
18.07.2018 03:05, Qu Wenruo пишет:
> 
> 
> On 2018年07月18日 04:59, Marc MERLIN wrote:
>> Ok, I did more testing. Qu is right that btrfs check does not crash the 
>> kernel.
>> It just takes all the memory until linux hangs everywhere, and somehow (no 
>> idea why) 
>> the OOM killer never triggers.
> 
> No OOM triggers? That's a little strange.
> Maybe it's related to how kernel handles memory over-commit?
> 
> And for the hang, I think it's related to some memory allocation failure
> and error handler just didn't handle it well, so it's causing deadlock
> for certain page.
> 
> ENOMEM handling is pretty common but hardly verified, so it's not that
> strange, but we must locate the problem.
> 
>> Details below:
>>
>> On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote:
>>> Here is what I got when the system was not doing well (it took minutes to 
>>> run):
>>>
>>>  total   used   free sharedbuffers cached
>>> Mem:  32643788   32070952 572836  0 1021604378772
>>> -/+ buffers/cache:   275900205053768
>>> Swap: 15616764 973596   14643168
>>
>> ok, the reason it was not that close to 0 was due to /dev/shm it seems.
>> I cleared that, and now I can get it to go to near 0 again.
>> I'm wrong about the system being fully crashed, it's not, it's just very
>> close to being hung.
>> I can type killall -9 btrfs in the serial console and wait a few minutes.
>> The system eventually recovers, but it's impossible to fix anything via ssh 
>> apparently because networking does not get to run when I'm in this state.
>>
>> I'm not sure why my system reproduces this easy while Qu's system does not, 
>> but Qu was right that the kernel is not dead and that it's merely a problem 
>> of userspace
>> taking all the RAM and somehow not being killed by OOM
> 
> In my system, at least I'm not using btrfs as root fs, and for the
> memory eating program I normally ensure it's eating all the memory +
> swap, so OOM killer is always triggered, maybe that's the cause.
> 
> So in your case, maybe it's btrfs not really taking up all memory, thus
> OOM killer not triggered.
> 
>>
>> I checked the PID and don't see why it's not being killed:
>> gargamel:/proc/31006# grep . oom*
>> oom_adj:0
>> oom_score:221   << this increases a lot, but OOM never kills it
>> oom_score_adj:0
>>
>> I have these variables:
>> /proc/sys/vm/oom_dump_tasks:1
>> /proc/sys/vm/oom_kill_allocating_task:0
>> /proc/sys/vm/overcommit_kbytes:0
>> /proc/sys/vm/overcommit_memory:0
>> /proc/sys/vm/overcommit_ratio:50  << is this bad (seems default)
> 
> Any kernel dmesg about OOM killer triggered?
> 
>>
>> Here is my system when it virtually died:
>> ER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>> root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49   1:35 ./btrfs 
>> check /dev/mapper/dshelf2
>>
>>  total   used   free sharedbuffers cached
>> Mem:  32643788   32180100 463688  0  44664 119508
>> -/+ buffers/cache:   32015928 627860
>> Swap: 15616764 443676   15173088
> 
> For swap, it looks like only some other program's memory is swapped out,
> not btrfs'.
> 
> And unfortunately, I'm not so familiar with OOM/MM code outside of
> filesystem.
> Any help from other experienced developers would definitely help to
> solve why memory of 'btrfs check' is not swapped out or why OOM killer
> is not triggered.
> 

Almost all used memory is marked as "active" and active pages are not
swapped. Page is active if it was accessed recently. Is it possible that
btrfs logic does frequent scans across all allocated memory?



> Thanks,
> Qu
> 
>>
>> MemTotal:   32643788 kB
>> MemFree:  463440 kB
>> MemAvailable:  44864 kB
>> Buffers:   44664 kB
>> Cached:   120360 kB
>> SwapCached:87064 kB
>> Active: 30381404 kB
>> Inactive: 585952 kB
>> Active(anon):   30334696 kB
>> Inactive(anon):   474624 kB
>> Active(file):  46708 kB
>> Inactive(file):   111328 kB
>> Unevictable:5616 kB
>> Mlocked:5616 kB
>> SwapTotal:  15616764 kB
>> SwapFree:   15173088 kB
>> Dirty:  1636 kB
>> Writeback: 4 kB
>> AnonPages:  30734240 kB
>> Mapped:67236 kB
>> Shmem:  3036 kB
>> Slab: 267884 kB
>> SReclaimable:  51528 kB
>> SUnreclaim:   216356 kB
>> KernelStack:   10144 kB
>> PageTables:69284 kB
>> NFS_Unstable:  0 kB
>> Bounce:0 kB
>> WritebackTmp:  0 kB
>> CommitLimit:31938656 kB
>> Committed_AS:   32865492 kB
>> VmallocTotal:   34359738367 kB
>> VmallocUsed:   0 kB
>> VmallocChunk:  0 kB
>> HardwareCorrupted: 0 kB
>> AnonHugePages: 0 kB
>> ShmemHugePages:0 kB
>> ShmemPmdMapped:0 kB
>> CmaTotal:  16384 kB
>> CmaFree:   0 kB
>> HugePages_Total:   0
>> HugePages_Free:

Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 13:40, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy  wrote:


I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but is just as quickly deallocating the fallocated blocks
as it goes, and hence doesn't end up in enospc?


Previous thread is "Problem with file system" from August 2017. And
there's these reproduce steps from Austin which have fallocate coming
after the dd.

 truncate --size=4G ./test-fs
 mkfs.btrfs ./test-fs
 mkdir ./test
 mount -t auto ./test-fs ./test
 dd if=/dev/zero of=./test/test bs=65536 count=32768
 fallocate -l 2147483650 ./test/test && echo "Success!"


My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.

[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp


Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kinda expect that because there's only 1.1G free
space. But maybe that's what you're saying is the bug, it shouldn't
fail?
Yes, you're right, I had things backwards (well, kind of, this does work 
on ext4 and regular XFS, so it arguably should work here).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy  wrote:

> I don't know for sure, but based on the addresses reported before and
> after dd for the fallocated tmp file, it looks like Btrfs is not using
> the originally fallocated addresses for dd. So maybe it is COWing into
> new blocks, but is just as quickly deallocating the fallocated blocks
> as it goes, and hence doesn't end up in enospc?

Previous thread is "Problem with file system" from August 2017. And
there's these reproduce steps from Austin which have fallocate coming
after the dd.

truncate --size=4G ./test-fs
mkfs.btrfs ./test-fs
mkdir ./test
mount -t auto ./test-fs ./test
dd if=/dev/zero of=./test/test bs=65536 count=32768
fallocate -l 2147483650 ./test/test && echo "Success!"


My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.

[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp


Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kinda expect that because there's only 1.1G free
space. But maybe that's what you're saying is the bug, it shouldn't
fail?



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 11:06 AM, Austin S. Hemmelgarn
 wrote:
> On 2018-07-18 13:04, Chris Murphy wrote:
>>
>> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>>>
>>> I'm not sure.  In this particular case, this will fail on BTRFS for any X
>>> larger than just short of one third of the total free space.  I would
>>> expect
>>> it to fail for any X larger than just short of half instead.
>>
>>
>> I'm confused. I can't get it to fail when X is 3/4 of free space.
>>
>> lvcreate -V 2g -T vg/thintastic -n btrfstest
>> mkfs.btrfs -M /dev/mapper/vg-btrfstest
>> mount /dev/mapper/vg-btrfstest /mnt/btrfs
>> cd /mnt/btrfs
>> fallocate -l 1500m tmp
>> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
>>
>> Succeeds. No enospc. This is on kernel 4.17.6.
>
> Odd, I could have sworn it would fail reliably.  Unless something has
> changed since I last tested though, doing it with X equal to the free space
> on the filesystem will fail.

OK well X is being defined twice here so I can't tell if I'm doing
this correctly. There's fallocate X and that's 75% of free space for
the empty fs at the time of fallocate.

And then there's dd which is 1450m which is ~2.67x the free space at
the time of dd.

I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but is just as quickly deallocating the fallocated blocks
as it goes, and hence doesn't end up in enospc?



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 13:04, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
 wrote:



I'm not sure.  In this particular case, this will fail on BTRFS for any X
larger than just short of one third of the total free space.  I would expect
it to fail for any X larger than just short of half instead.


I'm confused. I can't get it to fail when X is 3/4 of free space.

lvcreate -V 2g -T vg/thintastic -n btrfstest
mkfs.btrfs -M /dev/mapper/vg-btrfstest
mount /dev/mapper/vg-btrfstest /mnt/btrfs
cd /mnt/btrfs
fallocate -l 1500m tmp
dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450

Succeeds. No enospc. This is on kernel 4.17.6.
Odd, I could have sworn it would fail reliably.  Unless something has 
changed since I last tested though, doing it with X equal to the free 
space on the filesystem will fail.



Copied from terminal:

[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G   17M  2.0G   1% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1572864000 (384000 blocks of 4096 bytes)
  ext: logical_offset:physical_offset: length:   expected: flags:
0:0..   32767:  16400.. 49167:  32768: unwritten
1:32768..   65535:  56576.. 89343:  32768:  49168: unwritten
2:65536..   98303: 109824..142591:  32768:  89344: unwritten
3:98304..  131071: 163072..195839:  32768: 142592: unwritten
4:   131072..  163839: 216320..249087:  32768: 195840: unwritten
5:   163840..  196607: 269568..302335:  32768: 249088: unwritten
6:   196608..  229375: 322816..355583:  32768: 302336: unwritten
7:   229376..  262143: 376064..408831:  32768: 355584: unwritten
8:   262144..  294911: 429312..462079:  32768: 408832: unwritten
9:   294912..  327679: 482560..515327:  32768: 462080: unwritten
   10:   327680..  344063:  89344..105727:  16384: 515328: unwritten
   11:   344064..  360447: 142592..158975:  16384: 105728: unwritten
   12:   360448..  376831: 195840..212223:  16384: 158976: unwritten
   13:   376832..  383999: 249088..256255:   7168: 212224:
last,unwritten,eof
tmp: 14 extents found
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  543M  74% /mnt/btrfs
[chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
1450+0 records in
1450+0 records out
1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  591M  72% /mnt/btrfs
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1520435200 (371200 blocks of 4096 bytes)
  ext: logical_offset:physical_offset: length:   expected: flags:
0:0..   16383: 302336..318719:  16384:
1:16384..   32767: 355584..371967:  16384: 318720:
2:32768..   49151: 408832..425215:  16384: 371968:
3:49152..   65535: 462080..478463:  16384: 425216:
4:65536..   73727: 515328..523519:   8192: 478464:
5:73728..   86015:   3328.. 15615:  12288: 523520:
6:86016..   98303: 256256..268543:  12288:  15616:
7:98304..  104959:  49168.. 55823:   6656: 268544:
8:   104960..  109047: 105728..109815:   4088:  55824:
9:   109048..  113143: 158976..163071:   4096: 109816:
   10:   113144..  117239: 212224..216319:   4096: 163072:
   11:   117240..  121335: 318720..322815:   4096: 216320:
   12:   121336..  125431: 371968..376063:   4096: 322816:
   13:   125432..  128251: 425216..428035:   2820: 376064:
   14:   128252..  131071: 478464..481283:   2820: 428036:
   15:   131072..  132409:   1460..  2797:   1338: 481284:
   16:   132410..  165177: 322816..355583:  32768:   2798:
   17:   165178..  197945: 376064..408831:  32768: 355584:
   18:   197946..  230713: 429312..462079:  32768: 408832:
   19:   230714..  263481: 482560..515327:  32768: 462080:
   20:   263482..  296249:  16400.. 49167:  32768: 515328:
   21:   296250..  327687:  56576.. 88013:  31438:  49168:
   22:   327688..  328711: 428036..429059:   1024:  88014:
   23:   328712..  361479: 109824..142591:  32768: 429060:
   24:   361480..  371199:  88014.. 97733:   9720: 142592: last,eof
tmp: 25 extents found
[chris@f28s btrfs]$


*shrug*




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in

Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
 wrote:

>
> I'm not sure.  In this particular case, this will fail on BTRFS for any X
> larger than just short of one third of the total free space.  I would expect
> it to fail for any X larger than just short of half instead.

I'm confused. I can't get it to fail when X is 3/4 of free space.

lvcreate -V 2g -T vg/thintastic -n btrfstest
mkfs.btrfs -M /dev/mapper/vg-btrfstest
mount /dev/mapper/vg-btrfstest /mnt/btrfs
cd /mnt/btrfs
fallocate -l 1500m tmp
dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450

Succeeds. No enospc. This is on kernel 4.17.6.


Copied from terminal:

[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G   17M  2.0G   1% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1572864000 (384000 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   32767:  16400.. 49167:  32768: unwritten
   1:32768..   65535:  56576.. 89343:  32768:  49168: unwritten
   2:65536..   98303: 109824..142591:  32768:  89344: unwritten
   3:98304..  131071: 163072..195839:  32768: 142592: unwritten
   4:   131072..  163839: 216320..249087:  32768: 195840: unwritten
   5:   163840..  196607: 269568..302335:  32768: 249088: unwritten
   6:   196608..  229375: 322816..355583:  32768: 302336: unwritten
   7:   229376..  262143: 376064..408831:  32768: 355584: unwritten
   8:   262144..  294911: 429312..462079:  32768: 408832: unwritten
   9:   294912..  327679: 482560..515327:  32768: 462080: unwritten
  10:   327680..  344063:  89344..105727:  16384: 515328: unwritten
  11:   344064..  360447: 142592..158975:  16384: 105728: unwritten
  12:   360448..  376831: 195840..212223:  16384: 158976: unwritten
  13:   376832..  383999: 249088..256255:   7168: 212224:
last,unwritten,eof
tmp: 14 extents found
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  543M  74% /mnt/btrfs
[chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
1450+0 records in
1450+0 records out
1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  591M  72% /mnt/btrfs
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1520435200 (371200 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   16383: 302336..318719:  16384:
   1:16384..   32767: 355584..371967:  16384: 318720:
   2:32768..   49151: 408832..425215:  16384: 371968:
   3:49152..   65535: 462080..478463:  16384: 425216:
   4:65536..   73727: 515328..523519:   8192: 478464:
   5:73728..   86015:   3328.. 15615:  12288: 523520:
   6:86016..   98303: 256256..268543:  12288:  15616:
   7:98304..  104959:  49168.. 55823:   6656: 268544:
   8:   104960..  109047: 105728..109815:   4088:  55824:
   9:   109048..  113143: 158976..163071:   4096: 109816:
  10:   113144..  117239: 212224..216319:   4096: 163072:
  11:   117240..  121335: 318720..322815:   4096: 216320:
  12:   121336..  125431: 371968..376063:   4096: 322816:
  13:   125432..  128251: 425216..428035:   2820: 376064:
  14:   128252..  131071: 478464..481283:   2820: 428036:
  15:   131072..  132409:   1460..  2797:   1338: 481284:
  16:   132410..  165177: 322816..355583:  32768:   2798:
  17:   165178..  197945: 376064..408831:  32768: 355584:
  18:   197946..  230713: 429312..462079:  32768: 408832:
  19:   230714..  263481: 482560..515327:  32768: 462080:
  20:   263482..  296249:  16400.. 49167:  32768: 515328:
  21:   296250..  327687:  56576.. 88013:  31438:  49168:
  22:   327688..  328711: 428036..429059:   1024:  88014:
  23:   328712..  361479: 109824..142591:  32768: 429060:
  24:   361480..  371199:  88014.. 97733:   9720: 142592: last,eof
tmp: 25 extents found
[chris@f28s btrfs]$


*shrug*


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: btrfs_close_devices(): only fsync() if device->writeable

2018-07-18 Thread David Sterba
On Wed, Jun 06, 2018 at 02:30:41AM -0400, james harvey wrote:
> Prevent unnecessary error from failing fsync(), if opened read only.
> 
> Performed 'grep "writeable = " *.h *.c' to make sure there were no odd
> situations where fsync() might still be desired here.  They're all straight-
> forward.  The only situation where writeable will be 0 is if 
> btrfs_open_devices
> is given flags without O_RDWR.  There is no situation where a writeable volume
> temporarily becomes unwriteable, or anything like that.  Given that it's being
> opened O_RDWR, there's no reason to attempt fsync().
> 
> utils.c
> 
>int btrfs_add_to_fsid() {
>   ...
>   device->writeable = 1;
> 
> volumes.c
> 
>int btrfs_close_devices() {
>   ...
>   while (!list_empty(&fs_devices->devices)) {
>  ...
>  // just after the fsync() being patched
> 267: device->writeable = 0;
>...
>int btrfs_open_devices() {
>   ...
>   list_for_each_entry(device, &fs_devices->devices, dev_list) {
>  ...
>  if (flags & O_RDWR)
> 332:device->writeable = 1
> 
> kernel btrfs_close_devices() does not have a corresponding fsync() that I see.
> 
> Signed-off-by: James Harvey 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()

2018-07-18 Thread David Sterba
On Thu, Jun 07, 2018 at 03:20:02AM -0400, james harvey wrote:
> btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and 
> BTRFS_METADATA_KEY,
> which are the types we're looking for.
> 
> Signed-off-by: James Harvey 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19] qgroup unused parameter cleanup

2018-07-18 Thread David Sterba
On Wed, Jul 18, 2018 at 02:45:23PM +0800, Lu Fengqi wrote:
> The transaction handler can provide fs_info, so we can fetch fs_info or
> quota_root(indirectly) from trans. Just remove the redundant parameter
> from qgroup functions.

Added to misc-next now, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 09:07, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
 wrote:


If you're doing a training presentation, it may be worth mentioning that
preallocation with fallocate() does not behave the same on BTRFS as it does
on other filesystems.  For example, the following sequence of commands:

 fallocate -l X ./tmp
 dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any value of
X between zero and just below the total amount of free space on the
filesystem.  On BTRFS though, it will reliably fail with ENOSPC for values
of X that are greater than _half_ of the total amount of free space on the
filesystem (actually, greater than just short of half).  In essence,
preallocating space does not prevent COW semantics for the first write
unless the file is marked NOCOW.


Is this a bug, or is it suboptimal behavior, or is it intentional?
It's been discussed before, though I can't find the email thread right 
now.  Pretty much, this is _technically_ not incorrect behavior, as the 
documentation for fallocate doesn't say that subsequent writes can't 
fail due to lack of space.  I personally consider it a bug though 
because it breaks from existing behavior in a way that is avoidable and 
defies user expectations.


There are two issues here:

1. Regions preallocated with fallocate still do COW on the first write 
to any given block in that region.  This can be handled by either 
treating the first write to each block as NOCOW, or by allocating a bit 
of extra space and doing a rotating approach like this for writes:

- Write goes into the extra space.
- Once the write is done, convert the region covered by the write
  into a new block of extra space.
- When the final block of the preallocated region is written,
  deallocate the extra space.
2. Preallocation does not completely account for necessary metadata 
space that will be needed to store the data there.  This may not be 
necessary if the first issue is addressed properly.


And then I wonder what happens with XFS COW:

  fallocate -l X ./tmp
  cp --reflink ./tmp ./tmp2
  dd if=/dev/zero of=./tmp bs=1 count=X
I'm not sure.  In this particular case, this will fail on BTRFS for any 
X larger than just short of one third of the total free space.  I would 
expect it to fail for any X larger than just short of half instead.


ZFS gets around this by not supporting fallocate (well, kind of, if 
you're using glibc and call posix_fallocate, that _will_ work, but it 
will take forever because it works by writing out each block of space 
that's being allocated, which, ironically, means that that still suffers 
from the same issue potentially that we have).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
 wrote:

> If you're doing a training presentation, it may be worth mentioning that
> preallocation with fallocate() does not behave the same on BTRFS as it does
> on other filesystems.  For example, the following sequence of commands:
>
> fallocate -l X ./tmp
> dd if=/dev/zero of=./tmp bs=1 count=X
>
> Will always work on ext4, XFS, and most other filesystems, for any value of
> X between zero and just below the total amount of free space on the
> filesystem.  On BTRFS though, it will reliably fail with ENOSPC for values
> of X that are greater than _half_ of the total amount of free space on the
> filesystem (actually, greater than just short of half).  In essence,
> preallocating space does not prevent COW semantics for the first write
> unless the file is marked NOCOW.

Is this a bug, or is it suboptimal behavior, or is it intentional?

And then I wonder what happens with XFS COW:

 fallocate -l X ./tmp
 cp --reflink ./tmp ./tmp2
 dd if=/dev/zero of=./tmp bs=1 count=X



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Hugo Mills
On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:
> 
> >> As implemented in BTRFS, raid1 doesn't have striping.
> > 
> > The argument is that because there's only two copies, on multi-device
> > btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> > alternate device pairs, it's effectively striped at the macro level,
> > with the 1 GiB device-level chunks effectively being huge individual
> > device strips of 1 GiB.
> > 
> > At 1 GiB strip size it doesn't have the typical performance advantage of
> > striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> > strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually 
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing 
> phrases.  I've seen the argument made on-list, but while I understand the 
> argument and agree with it to some extent, I'm still a bit uncomfortable 
> with it and don't normally make it myself, this thread being a noted 
> exception tho originally I simply repeated what someone else already said 
> in-thread, because I too agree it's stretching things a bit.  But it does 
> appear to be a useful conceptual equivalency for some, and I do see the 
> similarity.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just 
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
> (code or not, accidental or not, it's a reasonably accurate high-level 
> description of how it ends up working most of the time with equivalent 
> sized devices).)

   Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

   Hugo.

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 03:20, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


On 07/15/2018 04:37 PM, waxhead wrote:



Striping and mirroring/pairing are orthogonal properties; mirror and
parity are mutually exclusive.


I can't agree.  I don't know whether you meant that in the global
sense,
or purely in the btrfs context (which I suspect), but either way I
can't agree.

In the pure btrfs context, while striping and mirroring/pairing are
orthogonal today, Hugo's whole point was that btrfs is theoretically
flexible enough to allow both together and the feature may at some
point be added, so it makes sense to have a layout notation format
flexible enough to allow it as well.


When I say orthogonal, It means that these can be combined: i.e. you can
have - striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation
where both 'C' ( = number of copy) and 'P' ( = number of parities) is
too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

While it's not possible today on (pure) btrfs (it's possible today with
md/dm-raid or hardware-raid handling one layer), it's theoretically
possible both for btrfs and in general, and it could be added to btrfs in
the future, so a notation with the flexibility to allow parity and
mirroring together does make sense, and having just that sort of
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I
can see a case being made for it if one layer is hardware/firmware raid,
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't
arguably be a at least as good a match to the use-case.  Perhaps one of
the other experts in such things here might help with that.


Question #2: historically RAID10 is requires 4 disks. However I am
guessing if the stripe could be done on a different number of disks:
What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
that every 64k, the data are stored on a different disk


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.
Actually, it also behaves like LVM and MD RAID10 for any number of 
devices greater than 2, though the exact placement may diverge because 
of BTRFS's concept of different chunk types.  In LVM and MD RAID10, each 
block is stored as two copies, and what disks it ends up on is dependent 
on the block number modulo the number of disks (so, for 3 disks A, B, 
and C, block 0 is on A and B, block 1 is on C and A, and block 2 is on B 
and C, with subsequent blocks following the same pattern).  In an 
idealized model of BTRFS with only one chunk type, you get exactly the 
same behavior (because BTRFS allocates chunks based on disk utilization, 
and prefers lower numbered disks to higher ones in the event of a tie).


At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 04:39, Duncan wrote:

Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level,
with the 1 GiB device-level chunks effectively being huge individual
device strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.


I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing
phrases.  I've seen the argument made on-list, but while I understand the
argument and agree with it to some extent, I'm still a bit uncomfortable
with it and don't normally make it myself, this thread being a noted
exception tho originally I simply repeated what someone else already said
in-thread, because I too agree it's stretching things a bit.  But it does
appear to be a useful conceptual equivalency for some, and I do see the
similarity.
If the file is larger than the data chunk size, it _is_ striped, because 
it spans multiple chunks which are on separate devices.  Otherwise, it's 
more similar to what in GlusterFS is called a 'distributed volume'.  In 
such a Gluster volume, each file is entirely stored on one node (or you 
have a complete copy on N nodes where N is the number of replicas), with 
the selection of what node is used for the next file created being based 
on which node has the most free space.


That said, the main reason I explain single and raid1 the way I do is 
that I've found it's a much simpler way to explain generically how they 
work to people who already have storage background but may not care 
about the specifics.


Perhaps it's a case of coder's view (no code doing it that way, it's just
a coincidental oddity conditional on equal sizes), vs. sysadmin's view
(code or not, accidental or not, it's a reasonably accurate high-level
description of how it ends up working most of the time with equivalent
sized devices).)


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-17 13:54, Martin Steigerwald wrote:

Nikolay Borisov - 17.07.18, 10:16:

On 17.07.2018 11:02, Martin Steigerwald wrote:

Nikolay Borisov - 17.07.18, 09:20:

On 16.07.2018 23:58, Wolf wrote:

Greetings,
I would like to ask what what is healthy amount of free space to
keep on each device for btrfs to be happy?

This is how my disk array currently looks like

 [root@dennas ~]# btrfs fi usage /raid
 
 Overall:

 Device size:  29.11TiB
 Device allocated: 21.26TiB
 Device unallocated:7.85TiB
 Device missing:  0.00B
 Used: 21.18TiB
 Free (estimated):  3.96TiB  (min: 3.96TiB)
 Data ratio:   2.00
 Metadata ratio:   2.00
 Global reserve:  512.00MiB  (used: 0.00B)


[…]


Btrfs does quite good job of evenly using space on all devices.
No,
how low can I let that go? In other words, with how much space
free/unallocated remaining space should I consider adding new
disk?


Btrfs will start running into problems when you run out of
unallocated space. So the best advice will be monitor your device
unallocated, once it gets really low - like 2-3 gb I will suggest
you run balance which will try to free up unallocated space by
rewriting data more compactly into sparsely populated block
groups. If after running balance you haven't really freed any
space then you should consider adding a new drive and running
balance to even out the spread of data/metadata.


What are these issues exactly?


For example if you have plenty of data space but your metadata is full
then you will be getting ENOSPC.


Of that one I am aware.

This just did not happen so far.

I did not yet add it explicitly to the training slides, but I just make
myself a note to do that.

Anything else?


If you're doing a training presentation, it may be worth mentioning that 
preallocation with fallocate() does not behave the same on BTRFS as it 
does on other filesystems.  For example, the following sequence of commands:


fallocate -l X ./tmp
dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any value 
of X between zero and just below the total amount of free space on the 
filesystem.  On BTRFS though, it will reliably fail with ENOSPC for 
values of X that are greater than _half_ of the total amount of free 
space on the filesystem (actually, greater than just short of half).  In 
essence, preallocating space does not prevent COW semantics for the 
first write unless the file is marked NOCOW.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.

I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually 
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing 
phrases.  I've seen the argument made on-list, but while I understand the 
argument and agree with it to some extent, I'm still a bit uncomfortable 
with it and don't normally make it myself, this thread being a noted 
exception tho originally I simply repeated what someone else already said 
in-thread, because I too agree it's stretching things a bit.  But it does 
appear to be a useful conceptual equivalency for some, and I do see the 
similarity.

Perhaps it's a case of coder's view (no code doing it that way, it's just 
a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
(code or not, accidental or not, it's a reasonably accurate high-level 
description of how it ends up working most of the time with equivalent 
sized devices).)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 13/19] btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent

2018-07-18 Thread Lu Fengqi
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because there shouldn't hit this condition.

Signed-off-by: Lu Fengqi 
---
v2:
Remove the WARN_ON(trans == NULL) suggested by Qu Wenruo.

 fs/btrfs/qgroup.c   | 15 ++-
 fs/btrfs/qgroup.h   |  5 ++---
 fs/btrfs/tree-log.c |  2 +-
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index c85c1a0e933a..7621ad34244f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1579,10 +1579,10 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info 
*fs_info,
return 0;
 }
 
-int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes,
-   gfp_t gfp_flag)
+int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
+ u64 num_bytes, gfp_t gfp_flag)
 {
+   struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_qgroup_extent_record *record;
struct btrfs_delayed_ref_root *delayed_refs;
int ret;
@@ -1590,8 +1590,6 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle 
*trans,
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
|| bytenr == 0 || num_bytes == 0)
return 0;
-   if (WARN_ON(trans == NULL))
-   return -EINVAL;
record = kmalloc(sizeof(*record), gfp_flag);
if (!record)
return -ENOMEM;
@@ -1644,8 +1642,8 @@ int btrfs_qgroup_trace_leaf_items(struct 
btrfs_trans_handle *trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
 
-   ret = btrfs_qgroup_trace_extent(trans, fs_info, bytenr,
-   num_bytes, GFP_NOFS);
+   ret = btrfs_qgroup_trace_extent(trans, bytenr, num_bytes,
+   GFP_NOFS);
if (ret)
return ret;
}
@@ -1796,8 +1794,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle 
*trans,
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
 
-   ret = btrfs_qgroup_trace_extent(trans, fs_info,
-   child_bytenr,
+   ret = btrfs_qgroup_trace_extent(trans, child_bytenr,
fs_info->nodesize,
GFP_NOFS);
if (ret)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 385367989ed6..0215dc0b1710 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -212,9 +212,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info 
*fs_info,
  * Return <0 for error, like memory allocation failure or invalid parameter
  * (NULL trans)
  */
-int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes,
-   gfp_t gfp_flag);
+int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
+ u64 num_bytes, gfp_t gfp_flag);
 
 /*
  * Inform qgroup to trace all leaf items of data
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7b7498f1f641..10f6a4223897 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -685,7 +685,7 @@ static noinline int replay_one_extent(struct 
btrfs_trans_handle *trans,
 * as the owner of the file extent changed from log tree
 * (doesn't affect qgroup) to fs/file tree(affects qgroup)
 */
-   ret = btrfs_qgroup_trace_extent(trans, fs_info,
+   ret = btrfs_qgroup_trace_extent(trans,
btrfs_file_extent_disk_bytenr(eb, item),
btrfs_file_extent_disk_num_bytes(eb, item),
GFP_NOFS);
-- 
2.18.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/19] btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent

2018-07-18 Thread Qu Wenruo


On 2018年07月18日 15:54, Lu Fengqi wrote:
> On Wed, Jul 18, 2018 at 02:58:06PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018年07月18日 14:45, Lu Fengqi wrote:
>>> It can be fetched from the transaction handle.
>>>
>>> Signed-off-by: Lu Fengqi 
>>> ---
>>>  fs/btrfs/qgroup.c   | 13 ++---
>>>  fs/btrfs/qgroup.h   |  5 ++---
>>>  fs/btrfs/tree-log.c |  2 +-
>>>  3 files changed, 9 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>> index c85c1a0e933a..01add73cb2aa 100644
>>> --- a/fs/btrfs/qgroup.c
>>> +++ b/fs/btrfs/qgroup.c
>>> @@ -1579,10 +1579,10 @@ int btrfs_qgroup_trace_extent_post(struct 
>>> btrfs_fs_info *fs_info,
>>> return 0;
>>>  }
>>>  
>>> -int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans,
>>> -   struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes,
>>> -   gfp_t gfp_flag)
>>> +int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>>> + u64 num_bytes, gfp_t gfp_flag)
>>>  {
>>> +   struct btrfs_fs_info *fs_info = trans->fs_info;
>>
>> Just lines below, we do extra WARN_ON(trans == NULL).
>> So if we really hit some case with NULL trans, this would cause a NULL
>> pointer dereference.
>>
>> Although I have to admit, I'm a little paranoid about possible NULL
>> trans passed in.
>> So maybe it's a good timing to remove that WARN_ON() too?
> 
> Sorry, I didn't notice this WARN_ON(trans == NULL). However, I have
> confirmed that the callers of btrfs_qgroup_trace_{extent, leaf_items,
> subtree} should never pass NULL as trans. In my opinion the WARN_ON() can
> be removed without any bad effect.
> 
Then removing that WARN_ON() would be pretty nice.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 13/19] btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent

2018-07-18 Thread Lu Fengqi
On Wed, Jul 18, 2018 at 02:58:06PM +0800, Qu Wenruo wrote:
>
>
>On 2018年07月18日 14:45, Lu Fengqi wrote:
>> It can be fetched from the transaction handle.
>> 
>> Signed-off-by: Lu Fengqi 
>> ---
>>  fs/btrfs/qgroup.c   | 13 ++---
>>  fs/btrfs/qgroup.h   |  5 ++---
>>  fs/btrfs/tree-log.c |  2 +-
>>  3 files changed, 9 insertions(+), 11 deletions(-)
>> 
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index c85c1a0e933a..01add73cb2aa 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1579,10 +1579,10 @@ int btrfs_qgroup_trace_extent_post(struct 
>> btrfs_fs_info *fs_info,
>>  return 0;
>>  }
>>  
>> -int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans,
>> -struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes,
>> -gfp_t gfp_flag)
>> +int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>> +  u64 num_bytes, gfp_t gfp_flag)
>>  {
>> +struct btrfs_fs_info *fs_info = trans->fs_info;
>
>Just lines below, we do extra WARN_ON(trans == NULL).
>So if we really hit some case with NULL trans, this would cause a NULL
>pointer dereference.
>
>Although I have to admit, I'm a little paranoid about possible NULL
>trans passed in.
>So maybe it's a good timing to remove that WARN_ON() too?

Sorry, I didn't notice this WARN_ON(trans == NULL). However, I have
confirmed that the callers of btrfs_qgroup_trace_{extent, leaf_items,
subtree} should never pass NULL as trans. In my opinion the WARN_ON() can
be removed without any bad effect.

-- 
Thanks,
Lu

>
>Thanks,
>Qu
>
>>  struct btrfs_qgroup_extent_record *record;
>>  struct btrfs_delayed_ref_root *delayed_refs;
>>  int ret;
>> @@ -1644,8 +1644,8 @@ int btrfs_qgroup_trace_leaf_items(struct 
>> btrfs_trans_handle *trans,
>>  
>>  num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
>>  
>> -ret = btrfs_qgroup_trace_extent(trans, fs_info, bytenr,
>> -num_bytes, GFP_NOFS);
>> +ret = btrfs_qgroup_trace_extent(trans, bytenr, num_bytes,
>> +GFP_NOFS);
>>  if (ret)
>>  return ret;
>>  }
>> @@ -1796,8 +1796,7 @@ int btrfs_qgroup_trace_subtree(struct 
>> btrfs_trans_handle *trans,
>>  btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>>  path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
>>  
>> -ret = btrfs_qgroup_trace_extent(trans, fs_info,
>> -child_bytenr,
>> +ret = btrfs_qgroup_trace_extent(trans, child_bytenr,
>>  fs_info->nodesize,
>>  GFP_NOFS);
>>  if (ret)
>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>> index 385367989ed6..0215dc0b1710 100644
>> --- a/fs/btrfs/qgroup.h
>> +++ b/fs/btrfs/qgroup.h
>> @@ -212,9 +212,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info 
>> *fs_info,
>>   * Return <0 for error, like memory allocation failure or invalid parameter
>>   * (NULL trans)
>>   */
>> -int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans,
>> -struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes,
>> -gfp_t gfp_flag);
>> +int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>> +  u64 num_bytes, gfp_t gfp_flag);
>>  
>>  /*
>>   * Inform qgroup to trace all leaf items of data
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 7b7498f1f641..10f6a4223897 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -685,7 +685,7 @@ static noinline int replay_one_extent(struct 
>> btrfs_trans_handle *trans,
>>   * as the owner of the file extent changed from log tree
>>   * (doesn't affect qgroup) to fs/file tree(affects qgroup)
>>   */
>> -ret = btrfs_qgroup_trace_extent(trans, fs_info,
>> +ret = btrfs_qgroup_trace_extent(trans,
>>  btrfs_file_extent_disk_bytenr(eb, item),
>>  btrfs_file_extent_disk_num_bytes(eb, item),
>>  GFP_NOFS);
>> 
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>> 
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>> 
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>> 
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>> 
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
> 
> When I say orthogonal, It means that these can be combined: i.e. you can
> have - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
> 
> However you can't have mirroring+parity; this means that a notation
> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
> too verbose.

Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
top of mirroring or mirroring on top of raid5/6, much as raid10 is 
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
on top of raid0.  

While it's not possible today on (pure) btrfs (it's possible today with 
md/dm-raid or hardware-raid handling one layer), it's theoretically 
possible both for btrfs and in general, and it could be added to btrfs in 
the future, so a notation with the flexibility to allow parity and 
mirroring together does make sense, and having just that sort of 
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I 
can see a case being made for it if one layer is hardware/firmware raid, 
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't 
arguably be a at least as good a match to the use-case.  Perhaps one of 
the other experts in such things here might help with that.

>>> Question #2: historically RAID10 is requires 4 disks. However I am
>>> guessing if the stripe could be done on a different number of disks:
>>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>>> that every 64k, the data are stored on a different disk
>> 
>> As someone else pointed out, md/lvm-raid10 already work like this. 
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
> 
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there's only two copies, on multi-device 
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
alternate device pairs, it's effectively striped at the macro level, with 
the 1 GiB device-level chunks effectively being huge individual device 
strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of 
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
strips/chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19] qgroup unused parameter cleanup

2018-07-18 Thread Qu Wenruo


On 2018年07月18日 14:45, Lu Fengqi wrote:
> The transaction handler can provide fs_info, so we can fetch fs_info or
> quota_root(indirectly) from trans. Just remove the redundant parameter
> from qgroup functions.
> 
> No functional change.

The whole serious looks pretty good.

Although for some call sites, we have extra check on NULL trans and
return -EINVAL (should be ASSERT though).
Maybe it's a good time to remove that paranoid check?

Thanks,
Qu

> 
> Lu Fengqi (19):
>   btrfs: qgroup: Drop quota_root parameter from add_qgroup_relation_item
>   btrfs: qgroup: Drop quota_root parameter from del_qgroup_relation_item
>   btrfs: qgroup: Drop quota_root parameter from del_qgroup_item
>   btrfs: qgroup: Drop root parameter from update_qgroup_limit_item
>   btrfs: qgroup: Drop root parameter from update_qgroup_info_item
>   btrfs: qgroup: Drop quota_root and fs_info parameters from
> update_qgroup_status_item
>   btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation
>   btrfs: qgroup: Drop fs_info parameter from __del_qgroup_relation
>   btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation
>   btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup
>   btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup
>   btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup
>   btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent
>   btrfs: qgroup: Drop fs_info parameter from
> btrfs_qgroup_trace_leaf_items
>   btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree
>   btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent
>   btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups
>   btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit
>   btrfs: qgroup: Drop fs_info parameter from qgroup_rescan_leaf
> 
>  fs/btrfs/extent-tree.c|   4 +-
>  fs/btrfs/ioctl.c  |  18 ++--
>  fs/btrfs/qgroup.c | 163 +-
>  fs/btrfs/qgroup.h |  40 -
>  fs/btrfs/relocation.c |   5 +-
>  fs/btrfs/tests/qgroup-tests.c |  24 ++---
>  fs/btrfs/transaction.c|   5 +-
>  fs/btrfs/tree-log.c   |   2 +-
>  8 files changed, 123 insertions(+), 138 deletions(-)
> 



signature.asc
Description: OpenPGP digital signature