[PATCH V8] Btrfs: enhance raid1/10 balance heuristic

2018-11-13 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether any of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick to that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices
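
A minimal sketch of the selection logic described above (an illustration
only: simplified names, power-of-two rounding assumed; not the kernel code
in the diff below):

/* Illustrative model of the heuristic, not the btrfs implementation. */
#define ALIGN_DOWN(x, a)  ((x) & ~((a) - 1))	/* 'a' must be a power of two */

static int pick_mirror(int pid, int num, const int qlen[], const int nonrot[])
{
	int optimal = pid % num;	/* default: spread readers by pid */
	int round = 8;			/* coarse step for rotational devices */
	int all_nonrot = 1;
	int i;

	for (i = 0; i < num; i++)
		if (!nonrot[i])
			all_nonrot = 0;
	if (all_nonrot)
		round = 2;		/* SSDs tolerate finer switching */

	/* Mixed setup: prefer a non-rotational mirror as the starting point. */
	for (i = 0; i < num; i++)
		if (nonrot[i] && !nonrot[optimal])
			optimal = i;

	/* Repick only when another mirror is clearly less loaded. */
	for (i = 0; i < num; i++)
		if (ALIGN_DOWN(qlen[i], round) < ALIGN_DOWN(qlen[optimal], round))
			optimal = i;

	return optimal;
}

With round = 8, queue lengths 0..7 all round down to 0, so small differences
do not make the balancer ping-pong between mirrors on every request.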

Some bench results from the mailing list
(Dmitrii Tcvetkov ):
Benchmark summary (arithmetic mean of 3 runs):
       | Mainline   | Patch
-------+------------+------------
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s

mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06

mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56

mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36

With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14

With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next
  v5 -> v6:
- Fix spelling
- Include bench results
  v6 -> v7:
- Fixes based on Nikolay Borisov review:
  * Assume num == 2
  * Remove "for" loop based on that assumption, where possible
  v7 -> v8:
- Add comment about magic '2' num in guess function

Signed-off-by: Timofey Titovets 
Tested-by: Dmitrii Tcvetkov 
Reviewed-by: Dmitrii Tcvetkov 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 104 -
 2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index cff6bdf27226..4ba5ede8969e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f435d397019e..d9b5cf31514a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -28,6 +29,8 @@
 #include "dev-replace.h"
 #include "sysfs.h"
 
+#define BTRFS_RAID_1_10_MAX_MIRRORS 2
+
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
[BTRFS_RAID_RAID10] = {
.sub_stripes= 2,
@@ -5166,6 +5169,104 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor; large for HDD, small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(ui

Re: [PATCH v2] btrfs: add zstd compression level support

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 04:52, Nick Terrell :
>
>
>
> > On Nov 12, 2018, at 4:33 PM, David Sterba  wrote:
> >
> > On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> >> From: Jennifer Liu 
> >>
> >> Adds zstd compression level support to btrfs. Zstd requires
> >> different amounts of memory for each level, so the design had
> >> to be modified to allow set_level() to allocate memory. We
> >> preallocate one workspace of the maximum size to guarantee
> >> forward progress. This feature is expected to be useful for
> >> read-mostly filesystems, or when creating images.
> >>
> >> Benchmarks run in qemu on Intel x86 with a single core.
> >> The benchmark measures the time to copy the Silesia corpus [0] to
> >> a btrfs filesystem 10 times, then read it back.
> >>
> >> The two important things to note are:
> >> - The decompression speed and memory remains constant.
> >>  The memory required to decompress is the same as level 1.
> >> - The compression speed and ratio will vary based on the source.
> >>
> >> Level  Ratio  Compression  Decompression  Compression Memory
> >> 1      2.59   153 MB/s     112 MB/s       0.8 MB
> >> 2      2.67   136 MB/s     113 MB/s       1.0 MB
> >> 3      2.72   106 MB/s     115 MB/s       1.3 MB
> >> 4      2.78    86 MB/s     109 MB/s       0.9 MB
> >> 5      2.83    69 MB/s     109 MB/s       1.4 MB
> >> 6      2.89    53 MB/s     110 MB/s       1.5 MB
> >> 7      2.91    40 MB/s     112 MB/s       1.4 MB
> >> 8      2.92    34 MB/s     110 MB/s       1.8 MB
> >> 9      2.93    27 MB/s     109 MB/s       1.8 MB
> >> 10     2.94    22 MB/s     109 MB/s       1.8 MB
> >> 11     2.95    17 MB/s     114 MB/s       1.8 MB
> >> 12     2.95    13 MB/s     113 MB/s       1.8 MB
> >> 13     2.95    10 MB/s     111 MB/s       2.3 MB
> >> 14     2.99     7 MB/s     110 MB/s       2.6 MB
> >> 15     3.03     6 MB/s     110 MB/s       2.6 MB
> >>
> >> [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> >>
> >> Signed-off-by: Jennifer Liu 
> >> Signed-off-by: Nick Terrell 
> >> Reviewed-by: Omar Sandoval 
> >> ---
> >> v1 -> v2:
> >> - Don't reflow the unchanged line.
> >>
> >> fs/btrfs/compression.c | 169 +
> >> fs/btrfs/compression.h |  18 +++--
> >> fs/btrfs/lzo.c |   5 +-
> >> fs/btrfs/super.c   |   7 +-
> >> fs/btrfs/zlib.c|  33 
> >> fs/btrfs/zstd.c|  74 +-
> >> 6 files changed, 202 insertions(+), 104 deletions(-)
> >>
> >> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> >> index 2955a4ea2fa8..b46652cb653e 100644
> >> --- a/fs/btrfs/compression.c
> >> +++ b/fs/btrfs/compression.c
> >> @@ -822,9 +822,12 @@ void __init btrfs_init_compress(void)
> >>
> >>  /*
> >>   * Preallocate one workspace for each compression type so
> >> - * we can guarantee forward progress in the worst case
> >> + * we can guarantee forward progress in the worst case.
> >> + * Provide the maximum compression level to guarantee large
> >> + * enough workspace.
> >>   */
> >> -workspace = btrfs_compress_op[i]->alloc_workspace();
> >> +workspace = btrfs_compress_op[i]->alloc_workspace(
> >> +btrfs_compress_op[i]->max_level);
>
> We provide the max level here, so we have at least one workspace per
> compression type that is large enough.
>
> >>  if (IS_ERR(workspace)) {
> >>  pr_warn("BTRFS: cannot preallocate compression 
> >> workspace, will try later\n");
> >>  } else {
> >> @@ -835,23 +838,78 @@ void __init btrfs_init_compress(void)
> >>  }
> >> }
> >>
> >> +/*
> >> + * put a workspace struct back on the list or free it if we have enough
> >> + * idle ones sitting around
> >> + */
> >> +static void __free_workspace(int type, struct list_head *workspace,
> >> + bool heuristic)
> >> +{
> >> +int idx = type - 1;
> >> +struct list_head *idle_ws;
> >> +spinlock_t *ws_lock;
> >> +atomic_t *total_ws;
> >> +wait_queue_head_t *ws_wait;
> >> +int *free_ws;
> >> +
> >> +if (heuristic) {
> >> +	if (heuristic) {
> >> +		idle_ws  = &btrfs_heuristic_ws.idle_ws;
> >> +		ws_lock  = &btrfs_heuristic_ws.ws_lock;
> >> +		total_ws = &btrfs_heuristic_ws.total_ws;
> >> +		ws_wait  = &btrfs_heuristic_ws.ws_wait;
> >> +		free_ws  = &btrfs_heuristic_ws.free_ws;
> >> +	} else {
> >> +		idle_ws  = &btrfs_comp_ws[idx].idle_ws;
> >> +		ws_lock  = &btrfs_comp_ws[idx].ws_lock;
> >> +total_ws = 

[PATCH V7] Btrfs: enhance raid1/10 balance heuristic

2018-11-12 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether any of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick to that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices

Some bench results from the mailing list
(Dmitrii Tcvetkov ):
Benchmark summary (arithmetic mean of 3 runs):
       | Mainline   | Patch
-------+------------+------------
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s

mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06

mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56

mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36

With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14

With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next
  v5 -> v6:
- Fix spelling
- Include bench results
  v6 -> v7:
- Fixes based on Nikolay Borisov review:
  * Assume num == 2
  * Remove "for" loop based on that assumption, where possible
  * No functional changes

Signed-off-by: Timofey Titovets 
Tested-by: Dmitrii Tcvetkov 
Reviewed-by: Dmitrii Tcvetkov 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 100 -
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index be5bab20b2ab..939f0c6a2d79 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..a6632cc2bfab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5159,6 +5160,102 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor; large for HDD, small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+	 * Try to prevent switching on every sneeze
+	 * by rounding the queue length down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal expected to be pid % num_stripes
+ *
+ * Th

Re: [PATCH V6] Btrfs: enhance raid1/10 balance heuristic

2018-11-12 Thread Timofey Titovets
Mon, 12 Nov 2018 at 10:28, Nikolay Borisov :
>
>
>
> On 25.09.18 at 21:38, Timofey Titovets wrote:
> > Currently the btrfs raid1/10 balancer distributes read requests across
> > mirrors based on pid % number of mirrors.
> >
> > Make the logic aware of:
> >  - whether any of the underlying devices is non-rotational
> >  - the queue length of the underlying devices
> >
> > By default keep the pid % num_mirrors guess, but:
> >  - if one of the mirrors is non-rotational, repick it as the optimal one
> >  - if an underlying mirror has a shorter queue length than the optimal
> >    one, repick to that mirror
> >
> > To avoid round-robin request balancing, round the queue length down:
> >  - by 8 for rotational devices
> >  - by 2 for non-rotational devices
> >
> > Some bench results from mail list
> > (Dmitrii Tcvetkov ):
> > Benchmark summary (arithmetic mean of 3 runs):
> >  Mainline Patch
> > 
> > RAID1  | 18.9 MiB/s | 26.5 MiB/s
> > RAID10 | 30.7 MiB/s | 30.7 MiB/s
> > 
> > mainline, fio got lucky to read from first HDD (quite slow HDD):
> > Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
> >   read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
> >   lat (msec): min=2, max=825, avg=60.17, stdev=65.06
> > 
> > mainline, fio got lucky to read from second HDD (much more modern):
> > Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
> >   read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
> >   lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56
> > 
> > mainline, fio got lucky to read from an SSD:
> > Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
> >   read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
> >   lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36
> > 
> > With the patch, 2 HDDs:
> > Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
> >   read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
> >   lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14
> > 
> > With the patch, HDD(old one)+SSD:
> > Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
> >   read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
> >   lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32
> >
> > Changes:
> >   v1 -> v2:
> > - Use helper part_in_flight() from genhd.c
> >   to get queue length
> > - Move guess code to guess_optimal()
> > - Change balancer logic, try use pid % mirror by default
> >   Make balancing on spinning rust if one of underline devices
> >   are overloaded
> >   v2 -> v3:
> > - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
> >   v3 -> v4:
> > - Rebased on latest misc-next
> >   v4 -> v5:
> > - Rebased on latest misc-next
> >   v5 -> v6:
> > - Fix spelling
> > - Include bench results
> >
> > Signed-off-by: Timofey Titovets 
> > Tested-by: Dmitrii Tcvetkov 
> > Reviewed-by: Dmitrii Tcvetkov 
> > ---
> >  block/genhd.c  |   1 +
> >  fs/btrfs/volumes.c | 111 -
> >  2 files changed, 110 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 9656f9e9f99e..5ea5acc88d3c 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
> >  				atomic_read(&part->in_flight[1]);
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(part_in_flight);
> >
> >  void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
> >  unsigned int inflight[2])
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index c95af358b71f..fa7dd6ac087f 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include "ctree.h"
> >  #include "extent_map.h"
> > @@ -5201,6 +5202,111 @@ int btrfs_is_pa

Re: [PATCH V6] Btrfs: enhance raid1/10 balance heuristic

2018-11-11 Thread Timofey Titovets
Gentle ping.
Tue, 25 Sep 2018 at 21:38, Timofey Titovets :
>
> Currently the btrfs raid1/10 balancer distributes read requests across
> mirrors based on pid % number of mirrors.
>
> Make the logic aware of:
>  - whether any of the underlying devices is non-rotational
>  - the queue length of the underlying devices
>
> By default keep the pid % num_mirrors guess, but:
>  - if one of the mirrors is non-rotational, repick it as the optimal one
>  - if an underlying mirror has a shorter queue length than the optimal
>    one, repick to that mirror
>
> To avoid round-robin request balancing, round the queue length down:
>  - by 8 for rotational devices
>  - by 2 for non-rotational devices
>
> Some bench results from mail list
> (Dmitrii Tcvetkov ):
> Benchmark summary (arithmetic mean of 3 runs):
>  Mainline Patch
> 
> RAID1  | 18.9 MiB/s | 26.5 MiB/s
> RAID10 | 30.7 MiB/s | 30.7 MiB/s
> 
> mainline, fio got lucky to read from first HDD (quite slow HDD):
> Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
>   read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
>   lat (msec): min=2, max=825, avg=60.17, stdev=65.06
> 
> mainline, fio got lucky to read from second HDD (much more modern):
> Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
>   read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
>   lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56
> 
> mainline, fio got lucky to read from an SSD:
> Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
>   read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
>   lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36
> 
> With the patch, 2 HDDs:
> Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
>   read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
>   lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14
> 
> With the patch, HDD(old one)+SSD:
> Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
>   read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
>   lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32
>
> Changes:
>   v1 -> v2:
> - Use helper part_in_flight() from genhd.c
>   to get queue length
> - Move guess code to guess_optimal()
> - Change balancer logic, try use pid % mirror by default
>   Make balancing on spinning rust if one of underline devices
>   are overloaded
>   v2 -> v3:
> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>   v3 -> v4:
> - Rebased on latest misc-next
>   v4 -> v5:
> - Rebased on latest misc-next
>   v5 -> v6:
> - Fix spelling
> - Include bench results
>
> Signed-off-by: Timofey Titovets 
> Tested-by: Dmitrii Tcvetkov 
> Reviewed-by: Dmitrii Tcvetkov 
> ---
>  block/genhd.c  |   1 +
>  fs/btrfs/volumes.c | 111 -
>  2 files changed, 110 insertions(+), 2 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 9656f9e9f99e..5ea5acc88d3c 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
> 				atomic_read(&part->in_flight[1]);
> 	}
>  }
> +EXPORT_SYMBOL_GPL(part_in_flight);
>
>  void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
>unsigned int inflight[2])
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c95af358b71f..fa7dd6ac087f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
> return ret;
>  }
>
> +/**
> + * bdev_get_queue_len - return rounded down in flight queue length of bdev
> + *
> + * @bdev: target bdev
> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
> + */
> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
> +{
> +   int sum;
> +   struct hd_struct *bd_part = bdev->bd_part;
> +   struct request_queue *rq = bdev_get_queue(bdev);
>

Re: [PATCH v15.1 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree

2018-11-08 Thread Timofey Titovets
ts, free the one to insert.
> +*/
> +   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
> +   kfree(ihash);
> +   ret = 0;
> +   goto out;
> +   }
> +
> +   list_add(&ihash->lru_list, &dedupe_info->lru_list);
> +   dedupe_info->current_nr++;
> +
> +   /* Remove the last dedupe hash if we exceed limit */
> +   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
> +   struct inmem_hash *last;
> +
> +   last = list_entry(dedupe_info->lru_list.prev,
> + struct inmem_hash, lru_list);
> +   __inmem_del(dedupe_info, last);
> +   }
> +out:
> +   mutex_unlock(&dedupe_info->lock);
> +   return 0;
> +}
> +
> +int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
> +struct btrfs_dedupe_hash *hash)
> +{
> +   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
> +
> +   if (!fs_info->dedupe_enabled || !hash)
> +   return 0;
> +
> +   if (WARN_ON(dedupe_info == NULL))
> +   return -EINVAL;
> +
> +   if (WARN_ON(!btrfs_dedupe_hash_hit(hash)))
> +   return -EINVAL;
> +
> +   /* ignore old hash */
> +   if (dedupe_info->blocksize != hash->num_bytes)
> +   return 0;
> +
> +   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
> +   return inmem_add(dedupe_info, hash);
> +   return -EINVAL;
> +}
> --
> 2.19.1
>
>
>

Reviewed-by: Timofey Titovets 

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH v15.1 02/13] btrfs: dedupe: Introduce function to initialize dedupe info

2018-11-08 Thread Timofey Titovets
k */
> +   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
> +   /* only one limit is accepted for enable*/
> +   if (dargs->limit_nr && dargs->limit_mem) {
> +   dargs->limit_nr = 0;
> +   dargs->limit_mem = 0;
> +   return -EINVAL;
> +   }
> +
> +   if (!limit_nr && !limit_mem)
> +   dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
> +   else {
> +   u64 tmp = (u64)-1;
> +
> +   if (limit_mem) {
> +   tmp = div_u64(limit_mem,
> +   (sizeof(struct inmem_hash)) +
> +   btrfs_hash_sizes[hash_algo]);
> +   /* Too small limit_mem to fill a hash item */
> +   if (!tmp) {
> +   dargs->limit_mem = 0;
> +   dargs->limit_nr = 0;
> +   return -EINVAL;
> +   }
> +   }
> +   if (!limit_nr)
> +   limit_nr = (u64)-1;
> +
> +   dargs->limit_nr = min(tmp, limit_nr);
> +   }
> +   }
> +   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
> +   dargs->limit_nr = 0;
> +
> +   return 0;
> +}
> +
> +int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
> +   struct btrfs_ioctl_dedupe_args *dargs)
> +{
> +   struct btrfs_dedupe_info *dedupe_info;
> +   int ret = 0;
> +
> +   ret = check_dedupe_parameter(fs_info, dargs);
> +   if (ret < 0)
> +   return ret;
> +
> +   dedupe_info = fs_info->dedupe_info;
> +   if (dedupe_info) {
> +   /* Check if we are re-enable for different dedupe config */
> +   if (dedupe_info->blocksize != dargs->blocksize ||
> +   dedupe_info->hash_algo != dargs->hash_algo ||
> +   dedupe_info->backend != dargs->backend) {
> +   btrfs_dedupe_disable(fs_info);
> +   goto enable;
> +   }
> +
> +   /* On-fly limit change is OK */
> +   mutex_lock(&dedupe_info->lock);
> +   fs_info->dedupe_info->limit_nr = dargs->limit_nr;
> +   mutex_unlock(&dedupe_info->lock);
> +   return 0;
> +   }
> +
> +enable:
> +   dedupe_info = init_dedupe_info(dargs);
> +   if (IS_ERR(dedupe_info))
> +   return PTR_ERR(dedupe_info);
> +   fs_info->dedupe_info = dedupe_info;
> +   /* We must ensure dedupe_bs is modified after dedupe_info */
> +   smp_wmb();
> +   fs_info->dedupe_enabled = 1;
> +   return ret;
> +}
> +
> +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
> +{
> +   /* Place holder for bisect, will be implemented in later patches */
> +   return 0;
> +}
> diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
> index 222ce7b4d827..87f5b7ce7766 100644
> --- a/fs/btrfs/dedupe.h
> +++ b/fs/btrfs/dedupe.h
> @@ -52,6 +52,18 @@ static inline int btrfs_dedupe_hash_hit(struct 
> btrfs_dedupe_hash *hash)
> return (hash && hash->bytenr);
>  }
>
> +static inline int btrfs_dedupe_hash_size(u16 algo)
> +{
> +   if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes)))
> +   return -EINVAL;
> +   return sizeof(struct btrfs_dedupe_hash) + btrfs_hash_sizes[algo];
> +}
> +
> +static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo)
> +{
> +   return kzalloc(btrfs_dedupe_hash_size(algo), GFP_NOFS);
> +}
> +
>  /*
>   * Initial inband dedupe info
>   * Called at dedupe enable time.
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 9cd15d2a40aa..ba879ac931f2 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -683,6 +683,9 @@ struct btrfs_ioctl_get_dev_stats {
>  /* Hash algorithm, only support SHA256 yet */
>  #define BTRFS_DEDUPE_HASH_SHA256   0
>
> +/* Default dedupe limit on number of hash */
> +#define BTRFS_DEDUPE_LIMIT_NR_DEFAULT  (32 * 1024)
> +
>  /*
>   * This structure is used for dedupe enable/disable/configure
>   * and status ioctl.
> --
> 2.19.1
>
>
>
Reviewed-by: Timofey Titovets 

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH v15.1 01/13] btrfs: dedupe: Introduce dedupe framework and its header

2018-11-08 Thread Timofey Titovets
eturn <0 for any error
> + * (tree operation error for some backends)
> + */
> +int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
> +   struct inode *inode, u64 file_pos,
> +   struct btrfs_dedupe_hash *hash);
> +
> +/*
> + * Add a dedupe hash into dedupe info
> + * Return 0 for success
> + * Return <0 for any error
> + * (tree operation error for some backends)
> + */
> +int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
> +struct btrfs_dedupe_hash *hash);
> +
> +/*
> + * Remove a dedupe hash from dedupe info
> + * Return 0 for success
> + * Return <0 for any error
> + * (tree operation error for some backends)
> + *
> + * NOTE: if hash deletion error is not handled well, it will lead
> + * to corrupted fs, as later dedupe write can points to non-exist or even
> + * wrong extent.
> + */
> +int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr);
>  #endif
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b0ab41da91d1..d1fa9d90cc8f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2678,6 +2678,7 @@ int open_ctree(struct super_block *sb,
> 	mutex_init(&fs_info->reloc_mutex);
> 	mutex_init(&fs_info->delalloc_root_mutex);
> 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> +	mutex_init(&fs_info->dedupe_ioctl_lock);
> 	seqlock_init(&fs_info->profiles_lock);
>
> 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5ca1d21fc4a7..9cd15d2a40aa 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -20,6 +20,7 @@
>  #ifndef _UAPI_LINUX_BTRFS_H
>  #define _UAPI_LINUX_BTRFS_H
>  #include 
> +#include 
>  #include 
>
>  #define BTRFS_IOCTL_MAGIC 0x94
> @@ -667,6 +668,39 @@ struct btrfs_ioctl_get_dev_stats {
> __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
>  };
>
> +/* In-band dedupe related */
> +#define BTRFS_DEDUPE_BACKEND_INMEMORY  0
> +#define BTRFS_DEDUPE_BACKEND_ONDISK1
> +
> +/* Only support inmemory yet, so count is still only 1 */
> +#define BTRFS_DEDUPE_BACKEND_COUNT 1
> +
> +/* Dedup block size limit and default value */
> +#define BTRFS_DEDUPE_BLOCKSIZE_MAX SZ_8M
> +#define BTRFS_DEDUPE_BLOCKSIZE_MIN SZ_16K
> +#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT SZ_128K
> +
> +/* Hash algorithm, only support SHA256 yet */
> +#define BTRFS_DEDUPE_HASH_SHA256   0
> +
> +/*
> + * This structure is used for dedupe enable/disable/configure
> + * and status ioctl.
> + * Reserved range should be set to 0xff.
> + */
> +struct btrfs_ioctl_dedupe_args {
> +   __u16 cmd;  /* In: command */
> +   __u64 blocksize;    /* In/Out: blocksize */
> +   __u64 limit_nr; /* In/Out: limit nr for inmem backend */
> +   __u64 limit_mem;/* In/Out: limit mem for inmem backend */
> +   __u64 current_nr;   /* Out: current hash nr */
> +   __u16 backend;  /* In/Out: current backend */
> +   __u16 hash_algo;/* In/Out: hash algorithm */
> +   u8 status;  /* Out: enabled or disabled */
> +   u8 flags;   /* In: special flags for ioctl */
> +   u8 __unused[472];   /* Pad to 512 bytes */
> +};
> +
>  #define BTRFS_QUOTA_CTL_ENABLE 1
>  #define BTRFS_QUOTA_CTL_DISABLE2
>  #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED3
> --
> 2.19.1

Reviewed-by: Timofey Titovets 

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH v2] btrfs: add zstd compression level support

2018-11-01 Thread Timofey Titovets
 level = 9;
> -
> -   workspace->level = level > 0 ? level : 3;
> -}
> -
>  const struct btrfs_compress_op btrfs_zlib_compress = {
> .alloc_workspace= zlib_alloc_workspace,
> .free_workspace = zlib_free_workspace,
> .compress_pages = zlib_compress_pages,
> .decompress_bio = zlib_decompress_bio,
> .decompress = zlib_decompress,
> -   .set_level  = zlib_set_level,
> +   .set_level  = zlib_set_level,
> +   .max_level  = BTRFS_ZLIB_MAX_LEVEL,
> +   .default_level  = BTRFS_ZLIB_DEFAULT_LEVEL,
>  };
> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> index af6ec59972f5..e5d7c2eae65c 100644
> --- a/fs/btrfs/zstd.c
> +++ b/fs/btrfs/zstd.c
> @@ -19,12 +19,13 @@
>
>  #define ZSTD_BTRFS_MAX_WINDOWLOG 17
>  #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
> -#define ZSTD_BTRFS_DEFAULT_LEVEL 3
> +#define BTRFS_ZSTD_DEFAULT_LEVEL 3
> +#define BTRFS_ZSTD_MAX_LEVEL 15
>
> -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
> +static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len,
> +unsigned int level)
>  {
> -   ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL,
> -   src_len, 0);
> +   ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
>
> if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
> params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
> @@ -37,10 +38,25 @@ struct workspace {
> size_t size;
> char *buf;
> struct list_head list;
> +   unsigned int level;
> ZSTD_inBuffer in_buf;
> ZSTD_outBuffer out_buf;
>  };
>
> +static bool zstd_reallocate_mem(struct workspace *workspace, int size)
> +{
> +   void *new_mem;
> +
> +   new_mem = kvmalloc(size, GFP_KERNEL);
> +   if (new_mem) {
> +   kvfree(workspace->mem);
> +   workspace->mem = new_mem;
> +   workspace->size = size;
> +   return true;
> +   }
> +   return false;
> +}
> +
>  static void zstd_free_workspace(struct list_head *ws)
>  {
> struct workspace *workspace = list_entry(ws, struct workspace, list);
> @@ -50,10 +66,34 @@ static void zstd_free_workspace(struct list_head *ws)
> kfree(workspace);
>  }
>
> -static struct list_head *zstd_alloc_workspace(void)
> +static bool zstd_set_level(struct list_head *ws, unsigned int level)
> +{
> +   struct workspace *workspace = list_entry(ws, struct workspace, list);
> +   ZSTD_parameters params;
> +   int size;
> +
> +   if (level > BTRFS_ZSTD_MAX_LEVEL)
> +   level = BTRFS_ZSTD_MAX_LEVEL;
> +
> +   if (level == 0)
> +   level = BTRFS_ZSTD_DEFAULT_LEVEL;
> +
> +   params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0);
> +   size = max_t(size_t,
> +   ZSTD_CStreamWorkspaceBound(params.cParams),
> +   ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
> +   if (size > workspace->size) {
> +   if (!zstd_reallocate_mem(workspace, size))
> +   return false;
> +   }
> +   workspace->level = level;
> +   return true;
> +}
> +
> +static struct list_head *zstd_alloc_workspace(unsigned int level)
>  {
> ZSTD_parameters params =
> -   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT);
> +   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT, 
> level);
> struct workspace *workspace;
>
> workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
> @@ -69,6 +109,7 @@ static struct list_head *zstd_alloc_workspace(void)
> goto fail;
>
> INIT_LIST_HEAD(>list);
> +   zstd_set_level(>list, level);
>
> return >list;
>  fail:
> @@ -95,7 +136,8 @@ static int zstd_compress_pages(struct list_head *ws,
> unsigned long len = *total_out;
> const unsigned long nr_dest_pages = *out_pages;
> unsigned long max_out = nr_dest_pages * PAGE_SIZE;
> -   ZSTD_parameters params = zstd_get_btrfs_parameters(len);
> +   ZSTD_parameters params = zstd_get_btrfs_parameters(len,
> +  workspace->level);
>
> *out_pages = 0;
> *total_out = 0;
> @@ -419,15 +461,13 @@ static int zstd_decompress(struct list_head *ws, 
> unsigned char *data_in,
> return ret;
>  }
>
> -static void zstd_set_level(struct list_head *ws, unsigned int type)
> -{
> -}
> -
>  const struct btrfs_compress_op btrfs_zstd_compress = {
> -   .alloc_workspace = zstd_alloc_workspace,
> -   .free_workspace = zstd_free_workspace,
> -   .compress_pages = zstd_compress_pages,
> -   .decompress_bio = zstd_decompress_bio,
> -   .decompress = zstd_decompress,
> -   .set_level = zstd_set_level,
> +   .alloc_workspace= zstd_alloc_workspace,
> +   .free_workspace = zstd_free_workspace,
> +   .compress_pages = zstd_compress_pages,
> +   .decompress_bio = zstd_decompress_bio,
> +   .decompress = zstd_decompress,
> +   .set_level  = zstd_set_level,
> +   .max_level  = BTRFS_ZSTD_MAX_LEVEL,
> +   .default_level  = BTRFS_ZSTD_DEFAULT_LEVEL,
>  };
> --
> 2.17.1

Reviewed-by: Timofey Titovets 

You didn't mention, so:
Did you test compression ratio/performance with compress-force or just compress?

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH RESEND] Btrfs: make should_defrag_range() understood compressed extents

2018-10-06 Thread Timofey Titovets
Tue, 18 Sep 2018 at 13:09, Timofey Titovets :
>
> From: Timofey Titovets 
>
>  Both the defrag ioctl and autodefrag call btrfs_defrag_file()
>  for file defragmentation.
>
>  The kernel default target extent size is 256KiB.
>  The btrfs-progs default is 32MiB.
>
>  Both are bigger than the maximum size of a compressed extent, 128KiB.
>  That leads to rewriting all compressed data on disk.
>
>  Fix that by checking compressed extents with different logic.
>
>  In addition, make should_defrag_range() aware of the compressed extent
>  type: if the requested target compression is the same as the current
>  extent's compression type, just don't recompress/rewrite the extent,
>  to avoid useless recompression of already compressed extents.
>
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/ioctl.c | 28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a990a9045139..0a5ea1ccc89d 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1142,7 +1142,7 @@ static bool defrag_check_next_extent(struct inode 
> *inode, struct extent_map *em)
>
>  static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
>u64 *last_len, u64 *skip, u64 *defrag_end,
> -  int compress)
> +  int compress, int compress_type)
>  {
> struct extent_map *em;
> int ret = 1;
> @@ -1177,8 +1177,29 @@ static int should_defrag_range(struct inode *inode, 
> u64 start, u32 thresh,
>  * real extent, don't bother defragging it
>  */
> if (!compress && (*last_len == 0 || *last_len >= thresh) &&
> -   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
> +   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
> ret = 0;
> +   goto out;
> +   }
> +
> +
> +   /*
> +* Try not recompress compressed extents
> +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
> +* recompress all compressed extents
> +*/
> +   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
> +   if (!compress) {
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   } else {
> +   if (em->compress_type != compress_type)
> +   goto out;
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   }
> +   }
> +
>  out:
> /*
>  * last_len ends up being a counter of how many bytes we've defragged.
> @@ -1477,7 +1498,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
> *file,
>
> if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
> 					 extent_thresh, &last_len, &skip,
> -					 &defrag_end, do_compress)){
> +					 &defrag_end, do_compress,
> +					 compress_type)){
> unsigned long next;
> /*
>  * the should_defrag function tells us how much to 
> skip
> --
> 2.19.0

OK, if no one likes that patch,
can we at least fix autodefrag on compressed files
by changing the default extent target size from 256K to 128K?
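
To make the numbers concrete, a small illustration (constants as stated in
the patch description above; the check is a simplified form of the one in
should_defrag_range(), not the exact kernel code):

/* Illustration only: why the default thresholds always rewrite
 * compressed extents. */
#define MAX_COMPRESSED_EXTENT	(128 * 1024)		/* BTRFS_MAX_UNCOMPRESSED */
#define KERNEL_DEFAULT_THRESH	(256 * 1024)		/* btrfs_defrag_file() default */
#define PROGS_DEFAULT_THRESH	(32 * 1024 * 1024)	/* btrfs-progs default */

static int old_should_defrag(unsigned long em_len, unsigned long thresh)
{
	/* Simplified pre-patch rule: only extents >= threshold are skipped. */
	return em_len < thresh;		/* 1 = rewrite the extent */
}

/*
 * old_should_defrag(MAX_COMPRESSED_EXTENT, KERNEL_DEFAULT_THRESH) == 1 and
 * old_should_defrag(MAX_COMPRESSED_EXTENT, PROGS_DEFAULT_THRESH)  == 1:
 * even a maximal 128KiB compressed extent is below both default thresholds,
 * so it gets rewritten every time.
 */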

-- 
Have a nice day,
Timofey.


[PATCH V6] Btrfs: enhance raid1/10 balance heuristic

2018-09-25 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether any of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick to that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices

Some bench results from the mailing list
(Dmitrii Tcvetkov ):
Benchmark summary (arithmetic mean of 3 runs):
       | Mainline   | Patch
-------+------------+------------
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s

mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06

mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56

mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) 
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36

With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14

With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next
  v5 -> v6:
- Fix spelling
- Include bench results

Signed-off-by: Timofey Titovets 
Tested-by: Dmitrii Tcvetkov 
Reviewed-by: Dmitrii Tcvetkov 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor; large for HDD, small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+	 * Try to prevent switching on every sneeze
+	 * by rounding the queue length down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal expected to be pid % num_stripes
+ *
+ * That's generally OK for spreading load
+ * Add some balancing based on device queue length
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if load of drives are equal, use pid % num_

Re: [PATCH V5 RESEND] Btrfs: enhance raid1/10 balance heuristic

2018-09-20 Thread Timofey Titovets
Thu, 20 Sep 2018 at 12:05, Peter Becker :
>
> i like the idea.
> do you have any benchmarks for this change?
>
> the general logic looks good for me.

https://patchwork.kernel.org/patch/10137909/

>
> Tested-by: Dmitrii Tcvetkov 
>
> Benchmark summary (arithmetic mean of 3 runs):
> Mainline Patch
> --
> RAID1 | 18.9 MiB/s | 26.5 MiB/s
> RAID10 | 30.7 MiB/s | 30.7 MiB/s


> fio configuration:
> [global]
> ioengine=libaio
> buffered=0
> direct=1
> bssplit=32k/100
> size=8G
> directory=/mnt/
> iodepth=16
> time_based
> runtime=900
>
> [test-fio]
> rw=randread
>
> All tests were run on 4 HDD btrfs filesystem in a VM with 4 Gb
> of ram on idle host. Full results attached to the email.

Also:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg71758.html
- - -
So, IIRC, it works at least.


[PATCH V5 RESEND] Btrfs: enhance raid1/10 balance heuristic

2018-09-18 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether any of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick to that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next

Signed-off-by: Timofey Titovets 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor; large for HDD, small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+	 * Try to prevent switching on every sneeze
+	 * by rounding the queue length down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal expected to be pid % num_stripes
+ *
+ * That's generally OK for spreading load
+ * Add some balancing based on device queue length
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *    so if the load on the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick a non-rotational one
+ *    as optimal and repick if another dev has a "significantly" shorter queue
+ *  - Repick the optimal mirror if another mirror's queue length is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For mixed case, pick non rotational dev as optimal */
+   if 

[PATCH RESEND] Btrfs: make should_defrag_range() understood compressed extents

2018-09-18 Thread Timofey Titovets
From: Timofey Titovets 

 Both the defrag ioctl and autodefrag call btrfs_defrag_file()
 for file defragmentation.

 The kernel default target extent size is 256KiB.
 The btrfs-progs default is 32MiB.

 Both are bigger than the maximum size of a compressed extent, 128KiB.
 That leads to rewriting all compressed data on disk.

 Fix that by checking compressed extents with different logic.

 In addition, make should_defrag_range() aware of the compressed extent
 type: if the requested target compression is the same as the current
 extent's compression type, just don't recompress/rewrite the extent,
 to avoid useless recompression of already compressed extents.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/ioctl.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a990a9045139..0a5ea1ccc89d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1142,7 +1142,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em)
 
 static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
   u64 *last_len, u64 *skip, u64 *defrag_end,
-  int compress)
+  int compress, int compress_type)
 {
struct extent_map *em;
int ret = 1;
@@ -1177,8 +1177,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
 * real extent, don't bother defragging it
 */
if (!compress && (*last_len == 0 || *last_len >= thresh) &&
-   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
+   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
ret = 0;
+   goto out;
+   }
+
+
+   /*
+* Try not recompress compressed extents
+* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
+* recompress all compressed extents
+*/
+   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
+   if (!compress) {
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   } else {
+   if (em->compress_type != compress_type)
+   goto out;
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   }
+   }
+
 out:
/*
 * last_len ends up being a counter of how many bytes we've defragged.
@@ -1477,7 +1498,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 					 extent_thresh, &last_len, &skip,
-					 &defrag_end, do_compress)){
+					 &defrag_end, do_compress,
+					 compress_type)){
unsigned long next;
/*
 * the should_defrag function tells us how much to skip
-- 
2.19.0


Re: [PATCH V5] Btrfs: enhance raid1/10 balance heuristic

2018-09-13 Thread Timofey Titovets
Sat, 7 Jul 2018 at 18:24, Timofey Titovets :
>
> From: Timofey Titovets 
>
> Currently the btrfs raid1/10 balancer distributes read requests across
> mirrors based on pid % number of mirrors.
>
> Make the logic aware of:
>  - whether any of the underlying devices is non-rotational
>  - the queue length of the underlying devices
>
> By default keep the pid % num_mirrors guess, but:
>  - if one of the mirrors is non-rotational, repick it as the optimal one
>  - if an underlying mirror has a shorter queue length than the optimal
>    one, repick to that mirror
>
> To avoid round-robin request balancing, round the queue length down:
>  - by 8 for rotational devices
>  - by 2 for non-rotational devices
>
> Changes:
>   v1 -> v2:
> - Use helper part_in_flight() from genhd.c
>   to get queue lenght
> - Move guess code to guess_optimal()
> - Change balancer logic, try use pid % mirror by default
>   Make balancing on spinning rust if one of underline devices
>   are overloaded
>   v2 -> v3:
> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>   v3 -> v4:
> - Rebased on latest misc-next
>   v4 -> v5:
> - Rebased on latest misc-next
>
> Signed-off-by: Timofey Titovets 
> ---
>  block/genhd.c  |   1 +
>  fs/btrfs/volumes.c | 111 -
>  2 files changed, 110 insertions(+), 2 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 9656f9e9f99e..5ea5acc88d3c 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
> 				atomic_read(&part->in_flight[1]);
> 	}
>  }
> +EXPORT_SYMBOL_GPL(part_in_flight);
>
>  void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
>unsigned int inflight[2])
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c95af358b71f..fa7dd6ac087f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
> return ret;
>  }
>
> +/**
> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
> + *
> + * @bdev: target bdev
> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
> + */
> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
> +{
> +   int sum;
> +   struct hd_struct *bd_part = bdev->bd_part;
> +   struct request_queue *rq = bdev_get_queue(bdev);
> +   uint32_t inflight[2] = {0, 0};
> +
> +   part_in_flight(rq, bd_part, inflight);
> +
> +   sum = max_t(uint32_t, inflight[0], inflight[1]);
> +
> +   /*
> +* Try prevent switch for every sneeze
> +* By roundup output num by some value
> +*/
> +   return ALIGN_DOWN(sum, round_down);
> +}
> +
> +/**
> + * guess_optimal - return guessed optimal mirror
> + *
> + * Optimal expected to be pid % num_stripes
> + *
> + * That's generaly ok for spread load
> + * Add some balancer based on queue leght to device
> + *
> + * Basic ideas:
> + *  - Sequential read generate low amount of request
> + *so if load of drives are equal, use pid % num_stripes balancing
> + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
> + *and repick if other dev have "significant" less queue lenght
> + *  - Repick optimal if queue leght of other mirror are less
> + */
> +static int guess_optimal(struct map_lookup *map, int num, int optimal)
> +{
> +   int i;
> +   int round_down = 8;
> +   int qlen[num];
> +   bool is_nonrot[num];
> +   bool all_bdev_nonrot = true;
> +   bool all_bdev_rotate = true;
> +   struct block_device *bdev;
> +
> +   if (num == 1)
> +   return optimal;
> +
> +   /* Check accessible bdevs */
> +   for (i = 0; i < num; i++) {
> +   /* Init for missing bdevs */
> +   is_nonrot[i] = false;
> +   qlen[i] = INT_MAX;
> +   bdev = map->stripes[i].dev->bdev;
> +   if (bdev) {
> +   qlen[i] = 0;
> +   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
> +   if (is_nonrot[i])
> +   all_bdev_rotate = false;
> +   else
> +   all_bdev_nonrot = fa

Re: dduper - Offline btrfs deduplication tool

2018-09-05 Thread Timofey Titovets
Fri, 24 Aug 2018 at 7:41, Lakshmipathi.G :
>
> Hi -
>
> dduper is an offline dedupe tool. Instead of reading whole file blocks and
> computing checksum, It works by fetching checksum from BTRFS csum tree. This
> hugely improves the performance.
>
> dduper works like:
> - Read csum for given two files.
> - Find matching location.
> - Pass the location to ioctl_ficlonerange directly
>   instead of ioctl_fideduperange
>
> By default, dduper adds safty check to above steps by creating a
> backup reflink file and compares the md5sum after dedupe.
> If the backup file matches new deduped file, then backup file is
> removed. You can skip this check by passing --skip option. Here is
> sample cli usage [1] and quick demo [2]
>
> Some performance numbers: (with -skip option)
>
> Dedupe two 1GB files with same  content - 1.2 seconds
> Dedupe two 5GB files with same  content - 8.2 seconds
> Dedupe two 10GB files with same  content - 13.8 seconds
>
> dduper requires `btrfs inspect-internal dump-csum` command, you can use
> this branch [3] or apply patch by yourself [4]
>
> [1] 
> https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md
> [2] http://giis.co.in/btrfs_dedupe.gif
> [3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b  dump_csum
> [4] https://patchwork.kernel.org/patch/10540229/
>
> Please remember its version-0.1, so test it out, if you plan to use dduper 
> real data.
> Let me know, if you have suggestions or feedback or bugs :)
>
> Cheers.
> Lakshmipathi.G
>

One question:
Why not ioctl_fideduperange()?
i.e. by going through ioctl_ficlonerange you lose the main benefit of that
ioctl - atomicity.
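
For reference, a minimal userspace sketch of the ioctl_fideduperange() path
(error handling trimmed; a sketch, not dduper code). The kernel locks and
compares both ranges and only then shares the extents, which is the
atomicity referred to above:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FIDEDUPERANGE, struct file_dedupe_range */

static int dedupe_range(int src_fd, int dst_fd, __u64 offset, __u64 len)
{
	struct file_dedupe_range *r;
	int ret;

	/* One destination entry follows the fixed header. */
	r = calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
	if (!r)
		return -1;
	r->src_offset = offset;
	r->src_length = len;
	r->dest_count = 1;
	r->info[0].dest_fd = dst_fd;
	r->info[0].dest_offset = offset;

	ret = ioctl(src_fd, FIDEDUPERANGE, r);
	if (ret == 0 && r->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduped %llu bytes\n",
		       (unsigned long long)r->info[0].bytes_deduped);

	free(r);
	return ret;
}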


-- 
Have a nice day,
Timofey.


[PATCH V5] Btrfs: enhance raid1/10 balance heuristic

2018-07-07 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether any of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick to that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  and rebalance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next

Signed-off-by: Timofey Titovets 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, big for HDD and small for SSD, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try to prevent switching on every sneeze
+* by rounding the output down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes
+ *
+ * That's generally ok for spreading load.
+ * Add some balancing based on the queue length of the devices.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if the load of the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *one as optimal and repick if another dev has a "significantly" shorter
+ *queue
+ *  - Repick the optimal if the queue length of another mirror is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For mixed case, pick non rotational dev as optimal */
+   if 

[PATCH RESEND V4] Btrfs: enhance raid1/10 balance heuristic

2018-07-07 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer balances requests to mirrors
based on pid % num of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - If one of the mirrors is non-rotational, repick it as optimal
 - If an underlying mirror has a shorter queue length than the optimal
   one, repick that mirror

To avoid round-robin request balancing,
let's round down the queue length:
 - By 8 for rotational devs
 - By 2 for all non-rotational devs

Changes:
  v1 -> v2:
- Use the helper part_in_flight() from genhd.c
  to get the queue length
- Move the guess code to guess_optimal()
- Change the balancer logic: try pid % mirror by default,
  balance onto spinning rust if one of the underlying devices
  is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next

Signed-off-by: Timofey Titovets 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct 
*part,
atomic_read(>in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, big for HDD and small for SSD, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try to prevent switching on every sneeze
+* by rounding the output down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes
+ *
+ * That's generally ok for spreading load.
+ * Add some balancing based on the queue length of the devices.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if the load of the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *one as optimal and repick if another dev has a "significantly" shorter
+ *queue
+ *  - Repick the optimal if the queue length of another mirror is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For mixed case, pick non rotational dev as optimal */
+   if (all_bdev_rotate == all_bdev_nonrot) {
+

Re: [PATCH 2/4] Btrfs: make should_defrag_range() understand compressed extents

2018-05-29 Thread Timofey Titovets
On Tue, 19 Dec 2017 at 13:02, Timofey Titovets wrote:

>   Both, defrag ioctl and autodefrag - call btrfs_defrag_file()
>   for file defragmentation.

>   Kernel default target extent size - 256KiB.
>   Btrfs progs default - 32MiB.

>   Both bigger then maximum size of compressed extent - 128KiB.
>   That lead to rewrite all compressed data on disk.

>   Fix that by check compression extents with different logic.

>   As addition, make should_defrag_range() understood compressed extent
type,
>   if requested target compression are same as current extent compression
type.
>   Just don't recompress/rewrite extents.
>   To avoid useless recompression of compressed extents.

> Signed-off-by: Timofey Titovets 
> ---
>   fs/btrfs/ioctl.c | 28 +---
>   1 file changed, 25 insertions(+), 3 deletions(-)

> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 45a47d0891fc..b29ea1f0f621 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode
*inode, struct extent_map *em)

>   static int should_defrag_range(struct inode *inode, u64 start, u32
thresh,
> u64 *last_len, u64 *skip, u64 *defrag_end,
> -  int compress)
> +  int compress, int compress_type)
>   {
>  struct extent_map *em;
>  int ret = 1;
> @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode
*inode, u64 start, u32 thresh,
>   * real extent, don't bother defragging it
>   */
>  if (!compress && (*last_len == 0 || *last_len >= thresh) &&
> -   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
> +   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
>  ret = 0;
> +   goto out;
> +   }
> +
> +
> +   /*
> +* Try not recompress compressed extents
> +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
> +* recompress all compressed extents
> +*/
> +   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
> +   if (!compress) {
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   } else {
> +   if (em->compress_type != compress_type)
> +   goto out;
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   }
> +   }
> +
>   out:
>  /*
>   * last_len ends up being a counter of how many bytes we've
defragged.
> @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct
file *file,

>  if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
>   extent_thresh, _len, ,
> -_end, do_compress)){
> +_end, do_compress,
> +compress_type)){
>  unsigned long next;
>  /*
>   * the should_defrag function tells us how much
to skip
> --
> 2.15.1

Maybe, then, if we don't want to add duct tape for compressed extents and
defrag, we can just change the default kernel target extent size from
256KiB to 128KiB?

That will also fix the issue with autodefrag and compression enabled.
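
A minimal sketch of that change (assuming the default still comes from the
extent_thresh fallback in btrfs_defrag_file(); BTRFS_MAX_UNCOMPRESSED is
128KiB):

	/* fs/btrfs/ioctl.c, btrfs_defrag_file() - sketch, not a tested patch */
	if (extent_thresh == 0)
		extent_thresh = BTRFS_MAX_UNCOMPRESSED; /* 128KiB instead of SZ_256K */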

Thanks.
-- 
Have a nice day,
Timofey.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Timofey Titovets
On Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn wrote:

> On 2018-05-19 04:54, Niccolò Belli wrote:
> > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> >> With a bit of work, it's possible to handle things sanely.  You can
> >> deduplicate data from snapshots, even if they are read-only (you need
> >> to pass the `-A` option to duperemove and run it as root), so it's
> >> perfectly reasonable to only defrag the main subvolume, and then
> >> deduplicate the snapshots against that (so that they end up all being
> >> reflinks to the main subvolume).  Of course, this won't work if you're
> >> short on space, but if you're dealing with snapshots, you should have
> >> enough space that this will work (because even without defrag, it's
> >> fully possible for something to cause the snapshots to suddenly take
> >> up a lot more space).
> >
> > Been there, tried that. Unfortunately even if I skip the defreg a simple
> >
> > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> >
> > is going to eat more space than it was previously available (probably
> > due to autodefrag?).
> It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> ioctl).  There's two things involved here:

> * BTRFS has somewhat odd and inefficient handling of partial extents.
> When part of an extent becomes unused (because of a CLONE ioctl, or an
> EXTENT_SAME ioctl, or something similar), that part stays allocated
> until the whole extent would be unused.
> * You're using the default deduplication block size (128k), which is
> larger than your filesystem block size (which is at most 64k, most
> likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> can split extents.

That's the metadata node/leaf size, which is != the fs block size.
The btrfs fs block size == machine page size currently.

> Because of this, if a duplicate region happens to overlap the front of
> an already shared extent, and the end of said shared extent isn't
> aligned with the deduplication block size, the EXTENT_SAME call will
> deduplicate the first part, creating a new shared extent, but not the
> tail end of the existing shared region, and all of that original shared
> region will stick around, taking up extra space that it wasn't before.

> Additionally, if only part of an extent is duplicated, then that area of
> the extent will stay allocated, because the rest of the extent is still
> referenced (so you won't necessarily see any actual space savings).

> You can mitigate this by telling duperemove to use the same block size
> as your filesystem using the `-b` option.   Note that using a smaller
> block size will also slow down the deduplication process and greatly
> increase the size of the hash file.

duperemove -b controls how the data is hashed, not more or less, and only
supports 4KiB..1MiB.

And the dedup block size changes the efficiency of deduplication, while the
count of hash-block pairs changes the hash file size and the time
complexity.

Let's assume that 'A' is 1KiB of data and 'AAAA' is 4KiB of a repeated pattern.

So, for example, you have 2 files of 2x4KiB blocks:
1: 'AAAABBBB'
2: 'BBBBAAAA'

With -b 8KiB the hash of the first file's block is not the same as the second's.
But with -b 4KiB duperemove will see 'AAAA' and 'BBBB' in both files,
and then those blocks will be deduped.

Also, duperemove has 2 modes of deduping:
1. By extents
2. By blocks

Thanks.

--
Have a nice day,
Timofey.


Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set

2018-05-14 Thread Timofey Titovets
On Mon, 14 May 2018 at 20:32, David Sterba wrote:

> On Mon, May 14, 2018 at 03:02:10PM +0800, Qu Wenruo wrote:
> > As btrfs(5) specified:
> >
> >   Note
> >   If nodatacow or nodatasum are enabled, compression is disabled.
> >
> > If NODATASUM or NODATACOW set, we should not compress the extent.
> >
> > And in fact, we have bug report about corrupted compressed extent
> > leading to memory corruption in mail list.

> Link please.

> > Although it's mostly buggy lzo implementation causing the problem, btrfs
> > still needs to be fixed to meet the specification.

> That's very vague, what's the LZO bug? If the input is garbage and lzo
> decompression cannot decompress it, it's not a lzo bug.

> > Reported-by: James Harvey 
> > Signed-off-by: Qu Wenruo 
> > ---
> >  fs/btrfs/inode.c | 8 
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index d241285a0d2a..dbef3f404559 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -396,6 +396,14 @@ static inline int inode_need_compress(struct inode
*inode, u64 start, u64 end)
> >  {
> >   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >
> > + /*
> > +  * Btrfs doesn't support compression without csum or CoW.
> > +  * This should have the highest priority.
> > +  */
> > + if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW ||
> > + BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
> > + return 0;

> This is also the wrong place to fix that, NODATASUM or NODATACOW inode
> should never make it to compress_file_range (that calls
> inode_need_compress).


David, I've talked about that some time ago:
https://www.spinics.net/lists/linux-btrfs/msg73137.html

NoCow files can *easily* be compressed.
```
➜  ~ touch test
➜  ~ chattr +C test
➜  ~ lsattr test
---C-- test
➜  ~ dd if=/dev/zero of=./test bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00099878 s, 1.0 GB/s
➜  ~ sync
➜  ~ filefrag -v test
Filesystem type is: 9123683e
File size of test is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
   0:0.. 255:   88592741..  88592996:256:
last,eof
test: 1 extent found
➜  ~ btrfs fi def -vrczstd test
test
➜  ~ filefrag -v test
Filesystem type is: 9123683e
File size of test is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:   3125..  3156: 32: encoded
   1:   32..  63:   3180..  3211: 32:   3157: encoded
   2:   64..  95:   3185..  3216: 32:   3212: encoded
   3:   96.. 127:   3188..  3219: 32:   3217: encoded
   4:  128.. 159:   3263..  3294: 32:   3220: encoded
   5:  160.. 191:   3355..  3386: 32:   3295: encoded
   6:  192.. 223:   3376..  3407: 32:   3387: encoded
   7:  224.. 255:   3411..  3442: 32:   3408:
last,encoded,eof
test: 8 extents found
```
-- 
Have a nice day,
Timofey.


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread Timofey Titovets
On Fri, 11 May 2018 at 20:32, Omar Sandoval wrote:

> On Fri, May 11, 2018 at 06:49:16PM +0200, David Sterba wrote:
> > On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
> > > On Fri, May 11, 2018 at 4:57 PM, David Sterba 
wrote:
> > > > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > > > arrays can be 32KiB large. To avoid allocation failures due to
> > > > fragmented memory, use the allocation with fallback to vmalloc.
> > > >
> > > > Signed-off-by: David Sterba 
> > > > ---
> > > >
> > > > This depends on the patches that remove the 16MiB restriction in the
> > > > dedupe ioctl, but contextually can be applied to the current code
too.
> > > >
> > > > https://patchwork.kernel.org/patch/10374941/
> > > >
> > > >  fs/btrfs/ioctl.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > > index b572e38b4b64..a7f517009cd7 100644
> > > > --- a/fs/btrfs/ioctl.c
> > > > +++ b/fs/btrfs/ioctl.c
> > > > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode
*src, u64 loff, u64 olen,
> > > >  * locking. We use an array for the page pointers. Size of
the array is
> > > >  * bounded by len, which is in turn bounded by
BTRFS_MAX_DEDUPE_LEN.
> > > >  */
> > > > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > > > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > > > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > > > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > >
> > > Kvzalloc should take 2 parameters and not 3.
> >
> > And the right function is kvmalloc_array.
> >
> > > Also, aren't the corresponding kvfree() calls missing?
> >
> > Yes, thanks for catching it. The updated version:
> >
> > From: David Sterba 
> > Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
> >
> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > arrays can be 32KiB large. To avoid allocation failures due to
> > fragmented memory, use the allocation with fallback to vmalloc.
> >
> > Signed-off-by: David Sterba 
> > ---
> >  fs/btrfs/ioctl.c | 16 +---
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index b572e38b4b64..4fcfa05ed960 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src,
u64 loff, u64 olen,
> >* locking. We use an array for the page pointers. Size of the
array is
> >* bounded by len, which is in turn bounded by
BTRFS_MAX_DEDUPE_LEN.
> >*/
> > - cmp.src_pages = kcalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > - cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *),
GFP_KERNEL);
> > + cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +GFP_KERNEL);
> > + cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +GFP_KERNEL);

> kcalloc() implies __GFP_ZERO, do we need that here?

AFAIK, yes, because:
btrfs_cmp_data_free():
...
pg = cmp->src_pages[i];
if (pg) {...}
..

And we rely on that if an error happens in gather_extent_pages().

Thanks.
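
For illustration, a minimal sketch of keeping the zeroing with the fallback
allocator (assuming the kvmalloc_array()/kvfree() calls from David's patch;
kvcalloc() would also work on kernels that have it):

	/*
	 * Zero the arrays so btrfs_cmp_data_free() can safely test
	 * cmp.src_pages[i] / cmp.dst_pages[i] for pages that were never
	 * gathered because gather_extent_pages() failed part way.
	 */
	cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
				       GFP_KERNEL | __GFP_ZERO);
	cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
				       GFP_KERNEL | __GFP_ZERO);
	if (!cmp.src_pages || !cmp.dst_pages) {
		ret = -ENOMEM;
		goto out_free;
	}
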
> >   if (!cmp.src_pages || !cmp.dst_pages) {
> > - kfree(cmp.src_pages);
> > - kfree(cmp.dst_pages);
> > - return -ENOMEM;
> > + ret = -ENOMEM;
> > + goto out_free;
> >   }
> >
> >   if (same_inode)
> > @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src,
u64 loff, u64 olen,
> >   else
> >   btrfs_double_inode_unlock(src, dst);
> >
> > - kfree(cmp.src_pages);
> > - kfree(cmp.dst_pages);
> > +out_free:
> > + kvfree(cmp.src_pages);
> > + kvfree(cmp.dst_pages);
> >
> >   return ret;
> >  }
> > --
> > 2.16.2
> >


-- 
Have a nice day,
Timofey.


[RFC PATCH] raid6_pq: Add a module option to prefer an algorithm

2018-05-03 Thread Timofey Titovets
Skip testing unnecessary algorithms to speed up module initialization.

For my systems:
  Before: 1.510s (initrd)
  After:  977ms  (initrd) # prefer set to the fastest algorithm

Dmesg after patch:
[1.190042] raid6: avx2x4   gen() 28153 MB/s
[1.246683] raid6: avx2x4   xor() 19440 MB/s
[1.246684] raid6: using algorithm avx2x4 gen() 28153 MB/s
[1.246684] raid6:  xor() 19440 MB/s, rmw enabled
[1.246685] raid6: using avx2x2 recovery algorithm
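
Usage sketch (assuming the parameter name added by this patch): with raid6_pq
built as a module, 'modprobe raid6_pq prefer_name=avx2x4'; when built in, boot
with 'raid6_pq.prefer_name=avx2x4' on the kernel command line.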

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
CC: linux-btrfs@vger.kernel.org
---
 lib/raid6/algos.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 5065b1e7e327..abfcb4107fc3 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -30,6 +30,11 @@ EXPORT_SYMBOL(raid6_empty_zero_page);
 #endif
 #endif
 
+static char *prefer_name = "";
+
+module_param(prefer_name, charp, 0);
+MODULE_PARM_DESC(prefer_name, "Prefer gen/xor() algorithm");
+
 struct raid6_calls raid6_call;
 EXPORT_SYMBOL_GPL(raid6_call);
 
@@ -155,10 +160,27 @@ static inline const struct raid6_calls *raid6_choose_gen(
 {
unsigned long perf, bestgenperf, bestxorperf, j0, j1;
int start = (disks>>1)-1, stop = disks-3;   /* work on the second 
half of the disks */
-   const struct raid6_calls *const *algo;
-   const struct raid6_calls *best;
+   const struct raid6_calls *const *algo = NULL;
+   const struct raid6_calls *best = NULL;
+
+   if (strlen(prefer_name)) {
+   for (algo = raid6_algos; strlen(prefer_name) && *algo; algo++) {
+   if (!strncmp(prefer_name, (*algo)->name, 8)) {
+   best = *algo;
+   break;
+   }
+   }
+   if (!best)
+   pr_info("raid6: %-8s prefer not found\n", prefer_name);
+   }
+
+
+   if (!algo)
+   algo = raid6_algos;
 
-   for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; 
*algo; algo++) {
+   for (bestgenperf = 0, bestxorperf = 0; *algo; algo++) {
if (!best || (*algo)->prefer >= best->prefer) {
if ((*algo)->valid && !(*algo)->valid())
continue;
-- 
2.17.0


[PATCH V3 3/3] Btrfs: btrfs_extent_same() reuse cmp workspace

2018-05-01 Thread Timofey Titovets
We support big dedup requests by splitting the range into several smaller
ones and calling the dedup logic over each of them.

Instead of allocating/deallocating on each call, let's reuse the allocated
memory.

Changes:
  v3:
- Split from one patch into 3 patches

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 80 +---
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 38ce990e9b4c..f2521bc0b069 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2769,8 +2769,6 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
put_page(pg);
}
}
-   kfree(cmp->src_pages);
-   kfree(cmp->dst_pages);
 }
 
 static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
@@ -2779,40 +2777,14 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
u64 loff,
 {
int ret;
int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
-   struct page **src_pgarr, **dst_pgarr;
 
-   /*
-* We must gather up all the pages before we initiate our
-* extent locking. We use an array for the page pointers. Size
-* of the array is bounded by len, which is in turn bounded by
-* BTRFS_MAX_DEDUPE_LEN.
-*/
-   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   if (!src_pgarr || !dst_pgarr) {
-   kfree(src_pgarr);
-   kfree(dst_pgarr);
-   return -ENOMEM;
-   }
cmp->num_pages = num_pages;
-   cmp->src_pages = src_pgarr;
-   cmp->dst_pages = dst_pgarr;
 
-   /*
-* If deduping ranges in the same inode, locking rules make it mandatory
-* to always lock pages in ascending order to avoid deadlocks with
-* concurrent tasks (such as starting writeback/delalloc).
-*/
-   if (src == dst && dst_loff < loff) {
-   swap(src_pgarr, dst_pgarr);
-   swap(loff, dst_loff);
-   }
-
-   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
+   ret = gather_extent_pages(src, cmp->src_pages, num_pages, loff);
if (ret)
goto out;
 
-   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
+   ret = gather_extent_pages(dst, cmp->dst_pages, num_pages, dst_loff);
 
 out:
if (ret)
@@ -2883,11 +2855,11 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
 }
 
 static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-  struct inode *dst, u64 dst_loff)
+  struct inode *dst, u64 dst_loff,
+  struct cmp_pages *cmp)
 {
int ret;
u64 len = olen;
-   struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
@@ -2927,7 +2899,7 @@ static int __btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
}
 
 again:
-   ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, );
+   ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, cmp);
if (ret)
return ret;
 
@@ -2950,7 +2922,7 @@ static int __btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * Ranges in the io trees already unlocked. Now unlock all
 * pages before waiting for all IO to complete.
 */
-   btrfs_cmp_data_free();
+   btrfs_cmp_data_free(cmp);
if (same_inode) {
btrfs_wait_ordered_range(src, same_lock_start,
 same_lock_len);
@@ -2963,12 +2935,12 @@ static int __btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
ASSERT(ret == 0);
if (WARN_ON(ret)) {
/* ranges in the io trees already unlocked */
-   btrfs_cmp_data_free();
+   btrfs_cmp_data_free(cmp);
return ret;
}
 
/* pass original length for comparison so we stay within i_size */
-   ret = btrfs_cmp_data(olen, );
+   ret = btrfs_cmp_data(olen, cmp);
if (ret == 0)
ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);
 
@@ -2978,7 +2950,7 @@ static int __btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
else
btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
 
-   btrfs_cmp_data_free();
+   btrfs_cmp_data_free(cmp);
 
return ret;
 }
@@ -2989,6 +2961,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, 
u64 olen,
 struct inode *dst, u64 dst_loff)
 {
int ret;
+   struct cmp_pages cmp;
+   int num_pages = PAGE_ALIGN(BTRFS_MAX_DEDUPE_LEN) >> PAGE_SHIF

[PATCH V3 2/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-05-01 Thread Timofey Titovets
Right now btrfs_dedupe_file_range() is restricted to a 16MiB range to
limit the locking time and memory requirements of the dedup ioctl().

For a too big input range the code silently clamps the range to 16MiB.

Let's remove that restriction by iterating over the dedup range.
That's backward compatible and will not change anything for requests
smaller than 16MiB.
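
As a worked example of the arithmetic below: with olen = 40MiB and
BTRFS_MAX_DEDUPE_LEN = 16MiB, chunk_count = div_u64(olen, 16MiB) = 2 and
tail_len = olen % 16MiB = 8MiB, so the loop issues two 16MiB
__btrfs_extent_same() calls and one final 8MiB call for the tail.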

Changes:
  v3:
- Split from one patch into 3 patches

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fb8beedb0359..38ce990e9b4c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2983,11 +2983,14 @@ static int __btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
return ret;
 }
 
+#define BTRFS_MAX_DEDUPE_LEN   SZ_16M
+
 static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 struct inode *dst, u64 dst_loff)
 {
int ret;
bool same_inode = (src == dst);
+   u64 i, tail_len, chunk_count;
 
if (olen == 0)
return 0;
@@ -2998,13 +3001,28 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
return -EINVAL;
}
 
+   tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
+   chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
+
if (same_inode)
inode_lock(src);
else
btrfs_double_inode_lock(src, dst);
 
-   ret = __btrfs_extent_same(src, loff, olen, dst, dst_loff);
+   for (i = 0; i < chunk_count; i++) {
+   ret = __btrfs_extent_same(src, loff, BTRFS_MAX_DEDUPE_LEN,
+ dst, dst_loff);
+   if (ret)
+   goto out;
+
+   loff += BTRFS_MAX_DEDUPE_LEN;
+   dst_loff += BTRFS_MAX_DEDUPE_LEN;
+   }
+
+   if (tail_len > 0)
+   ret = __btrfs_extent_same(src, loff, tail_len, dst, dst_loff);
 
+out:
if (same_inode)
inode_unlock(src);
else
@@ -3013,8 +3031,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, 
u64 olen,
return ret;
 }
 
-#define BTRFS_MAX_DEDUPE_LEN   SZ_16M
-
 ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
struct file *dst_file, u64 dst_loff)
 {
@@ -3023,9 +3039,6 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, 
u64 loff, u64 olen,
u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
ssize_t res;
 
-   if (olen > BTRFS_MAX_DEDUPE_LEN)
-   olen = BTRFS_MAX_DEDUPE_LEN;
-
if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
/*
 * Btrfs does not support blocksize < page_size. As a
-- 
2.17.0


[PATCH V3 1/3] Btrfs: split btrfs_extent_same() for simplification

2018-05-01 Thread Timofey Titovets
Split btrfs_extent_same() for simplification and in preparation for
calling it several times over the target files.

Move most of the logic to __btrfs_extent_same()
and leave in btrfs_extent_same() the things which must happen only once.

Changes:
  v3:
- Split from one patch into 3 patches

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 64 ++--
 1 file changed, 35 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f0e62e4f8fe7..fb8beedb0359 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2882,8 +2882,8 @@ static int extent_same_check_offsets(struct inode *inode, 
u64 off, u64 *plen,
return 0;
 }
 
-static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-struct inode *dst, u64 dst_loff)
+static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+  struct inode *dst, u64 dst_loff)
 {
int ret;
u64 len = olen;
@@ -2892,21 +2892,13 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
-   if (same_inode)
-   inode_lock(src);
-   else
-   btrfs_double_inode_lock(src, dst);
-
ret = extent_same_check_offsets(src, loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
ret = extent_same_check_offsets(dst, dst_loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode) {
/*
@@ -2923,32 +2915,21 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * allow an unaligned length so long as it ends at
 * i_size.
 */
-   if (len != olen) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (len != olen)
+   return -EINVAL;
 
/* Check for overlapping ranges */
-   if (dst_loff + len > loff && dst_loff < loff + len) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (dst_loff + len > loff && dst_loff < loff + len)
+   return -EINVAL;
 
same_lock_start = min_t(u64, loff, dst_loff);
same_lock_len = max_t(u64, loff, dst_loff) + len - 
same_lock_start;
}
 
-   /* don't make the dst file partly checksummed */
-   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
-   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
-
 again:
ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, );
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode)
ret = lock_extent_range(src, same_lock_start, same_lock_len,
@@ -2998,7 +2979,32 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
 
btrfs_cmp_data_free();
-out_unlock:
+
+   return ret;
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+struct inode *dst, u64 dst_loff)
+{
+   int ret;
+   bool same_inode = (src == dst);
+
+   if (olen == 0)
+   return 0;
+
+   /* don't make the dst file partly checksummed */
+   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
+   return -EINVAL;
+   }
+
+   if (same_inode)
+   inode_lock(src);
+   else
+   btrfs_double_inode_lock(src, dst);
+
+   ret = __btrfs_extent_same(src, loff, olen, dst, dst_loff);
+
if (same_inode)
inode_unlock(src);
else
-- 
2.17.0


[PATCH V3 0/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-05-01 Thread Timofey Titovets
Right now btrfs_dedupe_file_range() is restricted to a 16MiB range to
limit the locking time and memory requirements of the dedup ioctl().

For a too big input range the code silently clamps the range to 16MiB.

Let's remove that restriction by iterating over the dedup range.
That's backward compatible and will not change anything for requests
smaller than 16MiB.

Changes:
  v1 -> v2:
- Refactor btrfs_cmp_data_prepare and btrfs_extent_same
- Keep the memory of the pages array between iterations
- Lock inodes once, not on each iteration
- Small in-place cleanups
  v2 -> v3:
- Split into several patches

Timofey Titovets (3):
  Btrfs: split btrfs_extent_same() for simplification
  Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
  Btrfs: btrfs_extent_same() reuse cmp workspace

 fs/btrfs/ioctl.c | 161 ++-
 1 file changed, 91 insertions(+), 70 deletions(-)

-- 
2.17.0


Re: [PATCH 4/4] [RESEND] Btrfs: reduce size of struct btrfs_inode

2018-04-28 Thread Timofey Titovets
On Thu, 26 Apr 2018 at 16:44, David Sterba <dste...@suse.cz> wrote:

> On Wed, Apr 25, 2018 at 02:37:17AM +0300, Timofey Titovets wrote:
> > Currently btrfs_inode have size equal 1136 bytes. (On x86_64).
> >
> > struct btrfs_inode store several vars releated to compression code,
> > all states use 1 or 2 bits.
> >
> > Lets declare bitfields for compression releated vars, to reduce
> > sizeof btrfs_inode to 1128 bytes.

> Unfortunatelly, this has no big effect. The inodes are allocated from a
> slab page, that's 4k and there are at most 3 inodes there. Snippet from
> /proc/slabinfo:

> # name  <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
>
> btrfs_inode   256043 278943   1096  3  1

> The size on my box is 1096 as it's 4.14, but this should not matter to
> demonstrate the idea.

> objperslab is 3 here, ie. there are 3 btrfs_inode in the page, and
> there's 4096 - 3 * 1096 = 808 of slack space. In order to pack 4 inodes
> per page, we'd have to squeeze the inode size to 1024 bytes. I've looked
> into that and did not see enough members to remove or substitute. IIRC
> there were like 24-32 bytes possible to shave, but that was it.

> Once we'd get to 1024, adding anything new to btrfs_inode would be quite
> difficult and as it goes, there's always something to add to the inode.

> So I'd take a different approach, to regroup items and decide by
> cacheline access patterns what to put together and what to separate.

> The maximum size of inode before going to 2 objects per page is 1365, so
> there's enough space for cacheline alignments.

Maybe I misunderstood something, but I thought that a slab combines several
pages in a contiguous range, so an object in a slab can cross a page boundary.
So all the calculations depend heavily on the slab size.

i.e. on my machine that looks quite different:
# name  <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
btrfs_inode 142475 146272  1136   28  8

So,
PAGE_SIZE * pagesperslab / objperslab
4096 * 8 / 28 = 1170.28

4096*8 - 1136*28 = 960

That looks like an object can cross a page boundary in a slab.
So, if the size is reduced to 1128:
4096 * 8 / 29 = 1129.93
4096*8 - 1128*29 = 56

Did I miss something?

Thanks.
-- 
Have a nice day,
Timofey.


Re: [PATCH 1/4] [RESEND] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-04-28 Thread Timofey Titovets
On Thu, 26 Apr 2018 at 17:05, David Sterba <dste...@suse.cz> wrote:

> On Wed, Apr 25, 2018 at 02:37:14AM +0300, Timofey Titovets wrote:
> > At now btrfs_dedupe_file_range() restricted to 16MiB range for
> > limit locking time and memory requirement for dedup ioctl()
> >
> > For too big input range code silently set range to 16MiB
> >
> > Let's remove that restriction by do iterating over dedup range.
> > That's backward compatible and will not change anything for request
> > less then 16MiB.
> >
> > Changes:
> >   v1 -> v2:
> > - Refactor btrfs_cmp_data_prepare and btrfs_extent_same
> > - Store memory of pages array between iterations
> > - Lock inodes once, not on each iteration
> > - Small inplace cleanups

> I think this patch should be split into more, there are several logical
> changes mixed together.

> I can add the patch to for-next to see if there are any problems caught
> by the existing test, but will expect more revisions of the patch. I
> don't see any fundamental problems so far.

> Suggested changes:
> * factor out __btrfs_extent_same
> * adjust parameters if needed by the followup patches
> * add the chunk counting logic
> * any other cleanups

Thanks, I will try to split it out.

-- 
Have a nice day,
Timofey.


Re: [PATCH V4] Btrfs: enhance raid1/10 balance heuristic

2018-04-25 Thread Timofey Titovets
2018-04-25 10:54 GMT+03:00 Misono Tomohiro <misono.tomoh...@jp.fujitsu.com>:
> On 2018/04/25 9:20, Timofey Titovets wrote:
>> Currently btrfs raid1/10 balancer bаlance requests to mirrors,
>> based on pid % num of mirrors.
>>
>> Make logic understood:
>>  - if one of underline devices are non rotational
>>  - Queue leght to underline devices
>>
>> By default try use pid % num_mirrors guessing, but:
>>  - If one of mirrors are non rotational, repick optimal to it
>>  - If underline mirror have less queue leght then optimal,
>>repick to that mirror
>>
>> For avoid round-robin request balancing,
>> lets round down queue leght:
>>  - By 8 for rotational devs
>>  - By 2 for all non rotational devs
>>
>> Changes:
>>   v1 -> v2:
>> - Use helper part_in_flight() from genhd.c
>>   to get queue lenght
>> - Move guess code to guess_optimal()
>> - Change balancer logic, try use pid % mirror by default
>>   Make balancing on spinning rust if one of underline devices
>>   are overloaded
>>   v2 -> v3:
>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>   v3 -> v4:
>> - Rebased on latest misc-next
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  block/genhd.c  |   1 +
>>  fs/btrfs/volumes.c | 111 -
>>  2 files changed, 110 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 9656f9e9f99e..5ea5acc88d3c 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct 
>> hd_struct *part,
>>   atomic_read(>in_flight[1]);
>>   }
>>  }
>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>
>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>  {
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index c95af358b71f..fa7dd6ac087f 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -16,6 +16,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include "ctree.h"
>>  #include "extent_map.h"
>> @@ -5148,7 +5149,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, 
>> u64 logical, u64 len)
>>   /*
>>* There could be two corrupted data stripes, we need
>>* to loop retry in order to rebuild the correct data.
>> -  *
>> +  *
>>* Fail a stripe at a time on every retry except the
>>* stripe under reconstruction.
>>*/
>> @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
>> *fs_info, u64 logical, u64 len)
>>   return ret;
>>  }
>>
>> +/**
>> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
>> + *
>> + * @bdev: target bdev
>> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
>> + */
>> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
>> +{
>> + int sum;
>> + struct hd_struct *bd_part = bdev->bd_part;
>> + struct request_queue *rq = bdev_get_queue(bdev);
>> + uint32_t inflight[2] = {0, 0};
>> +
>> + part_in_flight(rq, bd_part, inflight);
>> +
>> + sum = max_t(uint32_t, inflight[0], inflight[1]);
>> +
>> + /*
>> +  * Try prevent switch for every sneeze
>> +  * By roundup output num by some value
>> +  */
>> + return ALIGN_DOWN(sum, round_down);
>> +}
>> +
>> +/**
>> + * guess_optimal - return guessed optimal mirror
>> + *
>> + * Optimal expected to be pid % num_stripes
>> + *
>> + * That's generaly ok for spread load
>> + * Add some balancer based on queue leght to device
>> + *
>> + * Basic ideas:
>> + *  - Sequential read generate low amount of request
>> + *so if load of drives are equal, use pid % num_stripes balancing
>> + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
>> + *and repick if other dev have "significant" less queue lenght
>
> The code looks always choosing the queue with the lowest length regardless
> of the amount of queue length difference. So, this "significant" may be wrong?

Yes, but before the code looks at the queue length, we round it down by 8;
maybe you got confused because I hide A

[PATCH V4] Btrfs: enhance raid1/10 balance heuristic

2018-04-24 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer balances requests to mirrors
based on pid % num of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - If one of the mirrors is non-rotational, repick it as optimal
 - If an underlying mirror has a shorter queue length than the optimal
   one, repick that mirror

To avoid round-robin request balancing,
let's round down the queue length:
 - By 8 for rotational devs
 - By 2 for all non-rotational devs

Changes:
  v1 -> v2:
- Use the helper part_in_flight() from genhd.c
  to get the queue length
- Move the guess code to guess_optimal()
- Change the balancer logic: try pid % mirror by default,
  balance onto spinning rust if one of the underlying devices
  is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct 
*part,
atomic_read(>in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5148,7 +5149,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
logical, u64 len)
/*
 * There could be two corrupted data stripes, we need
 * to loop retry in order to rebuild the correct data.
-* 
+*
 * Fail a stripe at a time on every retry except the
 * stripe under reconstruction.
 */
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, big for HDD and small for SSD, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try to prevent switching on every sneeze
+* by rounding the output down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes
+ *
+ * That's generally ok for spreading load.
+ * Add some balancing based on the queue length of the devices.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if the load of the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *one as optimal and repick if another dev has a "significantly" shorter
+ *queue
+ *  - Repick the optimal if the queue length of another mirror is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_dow

[PATCH 2/4] [RESEND] Btrfs: make should_defrag_range() understand compressed extents

2018-04-24 Thread Timofey Titovets
 Both the defrag ioctl and autodefrag call btrfs_defrag_file()
 for file defragmentation.

 The kernel default target extent size is 256KiB.
 The btrfs-progs default is 32MiB.

 Both are bigger than the maximum size of a compressed extent - 128KiB.
 That leads to rewriting all compressed data on disk.

 Fix that by checking compressed extents with different logic.

 In addition, make should_defrag_range() understand the compressed extent type:
 if the requested target compression is the same as the current extent
 compression type, just don't recompress/rewrite the extents,
 to avoid useless recompression of compressed extents.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 45a47d0891fc..b29ea1f0f621 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, 
struct extent_map *em)
 
 static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
   u64 *last_len, u64 *skip, u64 *defrag_end,
-  int compress)
+  int compress, int compress_type)
 {
struct extent_map *em;
int ret = 1;
@@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 
start, u32 thresh,
 * real extent, don't bother defragging it
 */
if (!compress && (*last_len == 0 || *last_len >= thresh) &&
-   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
+   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
ret = 0;
+   goto out;
+   }
+
+
+   /*
+* Try not recompress compressed extents
+* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
+* recompress all compressed extents
+*/
+   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
+   if (!compress) {
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   } else {
+   if (em->compress_type != compress_type)
+   goto out;
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   }
+   }
+
 out:
/*
 * last_len ends up being a counter of how many bytes we've defragged.
@@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress)){
+_end, do_compress,
+compress_type)){
unsigned long next;
/*
 * the should_defrag function tells us how much to skip
-- 
2.15.1


[PATCH 4/4] [RESEND] Btrfs: reduce size of struct btrfs_inode

2018-04-24 Thread Timofey Titovets
Currently btrfs_inode has a size of 1136 bytes (on x86_64).

struct btrfs_inode stores several vars related to the compression code;
all of those states use 1 or 2 bits.

Let's declare bitfields for the compression-related vars to reduce
sizeof(struct btrfs_inode) to 1128 bytes.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/btrfs_inode.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9eb0c92ee4b4..9d29d7e68757 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -181,13 +181,13 @@ struct btrfs_inode {
/*
 * Cached values of inode properties
 */
-   unsigned prop_compress; /* per-file compression algorithm */
+   unsigned prop_compress : 2; /* per-file compression algorithm */
/*
 * Force compression on the file using the defrag ioctl, could be
 * different from prop_compress and takes precedence if set
 */
-   unsigned defrag_compress;
-   unsigned change_compress;
+   unsigned defrag_compress : 2;
+   unsigned change_compress : 1;
 
struct btrfs_delayed_node *delayed_node;
 
-- 
2.15.1


[PATCH 1/4] [RESEND] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-04-24 Thread Timofey Titovets
Right now btrfs_dedupe_file_range() is restricted to a 16MiB range to
limit the locking time and memory requirements of the dedup ioctl().

For a too big input range the code silently clamps the range to 16MiB.

Let's remove that restriction by iterating over the dedup range.
That's backward compatible and will not change anything for requests
smaller than 16MiB.

Changes:
  v1 -> v2:
- Refactor btrfs_cmp_data_prepare and btrfs_extent_same
- Keep the memory of the pages array between iterations
- Lock inodes once, not on each iteration
- Small in-place cleanups

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 160 ---
 1 file changed, 94 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index be5bd81b3669..45a47d0891fc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
put_page(pg);
}
}
-   kfree(cmp->src_pages);
-   kfree(cmp->dst_pages);
+
+   cmp->num_pages = 0;
 }
 
 static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
@@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
u64 loff,
  u64 len, struct cmp_pages *cmp)
 {
int ret;
-   int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
-   struct page **src_pgarr, **dst_pgarr;
-
-   /*
-* We must gather up all the pages before we initiate our
-* extent locking. We use an array for the page pointers. Size
-* of the array is bounded by len, which is in turn bounded by
-* BTRFS_MAX_DEDUPE_LEN.
-*/
-   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   if (!src_pgarr || !dst_pgarr) {
-   kfree(src_pgarr);
-   kfree(dst_pgarr);
-   return -ENOMEM;
-   }
-   cmp->num_pages = num_pages;
-   cmp->src_pages = src_pgarr;
-   cmp->dst_pages = dst_pgarr;
 
/*
 * If deduping ranges in the same inode, locking rules make it mandatory
 * to always lock pages in ascending order to avoid deadlocks with
 * concurrent tasks (such as starting writeback/delalloc).
 */
-   if (src == dst && dst_loff < loff) {
-   swap(src_pgarr, dst_pgarr);
+   if (src == dst && dst_loff < loff)
swap(loff, dst_loff);
-   }
 
-   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
+   cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+   ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
if (ret)
goto out;
 
-   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
+   ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
dst_loff);
 
 out:
if (ret)
@@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
return 0;
 }
 
-static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-struct inode *dst, u64 dst_loff)
+static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+  struct inode *dst, u64 dst_loff,
+  struct cmp_pages *cmp)
 {
int ret;
u64 len = olen;
-   struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
-   if (same_inode)
-   inode_lock(src);
-   else
-   btrfs_double_inode_lock(src, dst);
-
ret = extent_same_check_offsets(src, loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
ret = extent_same_check_offsets(dst, dst_loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode) {
/*
@@ -3119,32 +3092,21 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * allow an unaligned length so long as it ends at
 * i_size.
 */
-   if (len != olen) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (len != olen)
+   return -EINVAL;
 
/* Check for overlapping ranges */
-   if (dst_loff + len > loff && dst_loff < loff + len) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (dst_loff + len > loff && dst_loff < loff + len)
+   return -EINVAL;
 

[PATCH 0/4] [RESEND] Btrfs: just a bunch of patches to ioctl.c

2018-04-24 Thread Timofey Titovets
1st patch: remove the 16MiB restriction from the extent_same ioctl()
by iterating over the passed range.

I did not see much difference in performance, so it just removes a
logical restriction.

2nd-3rd patches: update the defrag ioctl():
 - Fix the bad behaviour of fully rewriting all compressed
   extents in the defrag range (that also makes autodefrag on a compressed fs
   not so expensive)
 - Allow userspace to specify NONE as the target compression type,
   which allows users to uncompress files by defragmentation with btrfs-progs
 - Make the defrag ioctl understand the requested compression type and the
   current compression type of extents, to make btrfs fi def -rc
   an idempotent operation.
   i.e. it is now possible to say "make all extents compressed with lzo",
   and btrfs will not recompress lzo-compressed data.
   Same for zlib, zstd, none.
   (patch to btrfs-progs in a PR on kdave's GitHub).

4th patch: reduce the size of struct btrfs_inode
 - btrfs_inode stores fields like prop_compress, defrag_compress and,
   after the 3rd patch, change_compress.
   They use unsigned as a type and use 12 bytes in sum.
   But change_compress is a bitflag, and prop_compress/defrag_compress
   only store a compression type, which currently uses 0-3 of 2^32-1.

   So, make those vars bitfields and reduce the size of btrfs_inode:
   1136 -> 1128.

Timofey Titovets (4):
  Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
  Btrfs: make should_defrag_range() understand compressed extents
  Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation
  Btrfs: reduce size of struct btrfs_inode

 fs/btrfs/btrfs_inode.h |   5 +-
 fs/btrfs/inode.c   |   4 +-
 fs/btrfs/ioctl.c   | 203 +++--
 3 files changed, 133 insertions(+), 79 deletions(-)

-- 
2.15.1


[PATCH 3/4] [RESEND] Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation

2018-04-24 Thread Timofey Titovets
Currently the defrag ioctl only supports recompressing files with a
specified compression type.
Allow setting the compression type to none when calling defrag, and use
BTRFS_DEFRAG_RANGE_COMPRESS as the flag that signals the user requested
a change of compression type (see the sketch below).
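
For illustration only, a rough sketch of how userspace might use this to
uncompress a whole file once the patch is applied (error handling kept
minimal; the path is the caller's business):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

static int defrag_uncompress(const char *path)
{
	struct btrfs_ioctl_defrag_range_args range;
	int fd = open(path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	memset(&range, 0, sizeof(range));
	range.len = (__u64)-1;				/* whole file */
	range.flags = BTRFS_DEFRAG_RANGE_COMPRESS;	/* request compression change */
	range.compress_type = 0;			/* 0 == none, i.e. uncompress */
	ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &range);
	close(fd);
	return ret;
}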

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/ioctl.c   | 17 ++---
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 63f0ccc92a71..9eb0c92ee4b4 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -187,6 +187,7 @@ struct btrfs_inode {
 * different from prop_compress and takes precedence if set
 */
unsigned defrag_compress;
+   unsigned change_compress;
 
struct btrfs_delayed_node *delayed_node;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46df5e2a64e7..7af8f1784788 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, u64 start, u64 end)
if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
return 1;
/* defrag ioctl */
-   if (BTRFS_I(inode)->defrag_compress)
-   return 1;
+   if (BTRFS_I(inode)->change_compress)
+   return BTRFS_I(inode)->defrag_compress;
/* bad compression ratios */
if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS)
return 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b29ea1f0f621..40f5e5678eac 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
unsigned long cluster = max_cluster;
u64 new_align = ~((u64)SZ_128K - 1);
struct page **pages = NULL;
-   bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
+   bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
 
if (isize == 0)
return 0;
@@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
if (range->start >= isize)
return -EINVAL;
 
-   if (do_compress) {
+   if (change_compress) {
if (range->compress_type > BTRFS_COMPRESS_TYPES)
return -EINVAL;
-   if (range->compress_type)
-   compress_type = range->compress_type;
+   compress_type = range->compress_type;
}
 
if (extent_thresh == 0)
@@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 				 extent_thresh, &last_len, &skip,
-				 &defrag_end, do_compress,
+				 &defrag_end, change_compress,
 compress_type)){
unsigned long next;
/*
@@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
}
 
inode_lock(inode);
-   if (do_compress)
+   if (change_compress) {
+   BTRFS_I(inode)->change_compress = change_compress;
BTRFS_I(inode)->defrag_compress = compress_type;
+   }
+
ret = cluster_pages_for_defrag(inode, pages, i, cluster);
if (ret < 0) {
inode_unlock(inode);
@@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
ret = defrag_count;
 
 out_ra:
-   if (do_compress) {
+   if (change_compress) {
inode_lock(inode);
+   BTRFS_I(inode)->change_compress = 0;
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
inode_unlock(inode);
}
-- 
2.15.1


Re: Recovery from full metadata with all device space consumed?

2018-04-19 Thread Timofey Titovets
2018-04-20 1:08 GMT+03:00 Drew Bloechl :
> I've got a btrfs filesystem that I can't seem to get back to a useful
> state. The symptom I started with is that rename() operations started
> dying with ENOSPC, and it looks like the metadata allocation on the
> filesystem is full:
>
> # btrfs fi df /broken
> Data, RAID0: total=3.63TiB, used=67.00GiB
> System, RAID1: total=8.00MiB, used=224.00KiB
> Metadata, RAID1: total=3.00GiB, used=2.50GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> All of the consumable space on the backing devices also seems to be in
> use:
>
> # btrfs fi show /broken
> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
> Total devices 4 FS bytes used 69.50GiB
> devid1 size 931.51GiB used 931.51GiB path /dev/sda1
> devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
> devid3 size 931.51GiB used 931.51GiB path /dev/sdc1
> devid4 size 931.51GiB used 931.51GiB path /dev/sdd1
>
> Even the smallest balance operation I can start fails (this doesn't
> change even with an extra temporary device added to the filesystem):
>
> # btrfs balance start -v -dusage=1 /broken
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=1
> ERROR: error during balancing '/broken': No space left on device
> There may be more info in syslog - try dmesg | tail
> # dmesg | tail -1
> [11554.296805] BTRFS info (device sdc1): 757 enospc errors during
> balance
>
> The current kernel is 4.15.0 from Debian's stretch-backports
> (specifically linux-image-4.15.0-0.bpo.2-amd64), but it was Debian's
> 4.9.30 when the filesystem got into this state. I upgraded it in the
> hopes that a newer kernel would be smarter, but no dice.
>
> btrfs-progs is currently at v4.7.3.
>
> Most of what this filesystem stores is Prometheus 1.8's TSDB for its
> metrics, which are constantly written at around 50MB/second. The
> filesystem never really gets full as far as data goes, but there's a lot
> of never-ending churn for what data is there.
>
> Question 1: Are there other steps that can be tried to rescue a
> filesystem in this state? I still have it mounted in the same state, and
> I'm willing to try other things or extract debugging info.
>
> Question 2: Is there something I could have done to prevent this from
> happening in the first place?
>
> Thanks!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I'm not sure why this is happening, but if you are stuck in that state:
  - Reboot, to make sure no other problem exists.
  - Temporarily add another device to the FS, for example a zram device.
Once that frees up a small part of the FS, delete the temporary device
from the FS and continue balancing chunks; a rough example is shown
below.
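
A rough example of that sequence, assuming the filesystem is mounted at
/broken and a zram device is used as the temporary member (device names
and the size are only examples):

  # modprobe zram
  # echo 4G > /sys/block/zram0/disksize
  # btrfs device add /dev/zram0 /broken
  # btrfs balance start -dusage=5 /broken
  # btrfs device remove /dev/zram0 /broken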

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic

2018-02-20 Thread Timofey Titovets
Gentle ping.

2018-01-03 0:23 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> 2018-01-02 21:31 GMT+03:00 Liu Bo <bo.li@oracle.com>:
>> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
>>> Currently btrfs raid1/10 balancer bаlance requests to mirrors,
>>> based on pid % num of mirrors.
>>>
>>> Make logic understood:
>>>  - if one of underline devices are non rotational
>>>  - Queue leght to underline devices
>>>
>>> By default try use pid % num_mirrors guessing, but:
>>>  - If one of mirrors are non rotational, repick optimal to it
>>>  - If underline mirror have less queue leght then optimal,
>>>repick to that mirror
>>>
>>> For avoid round-robin request balancing,
>>> lets round down queue leght:
>>>  - By 8 for rotational devs
>>>  - By 2 for all non rotational devs
>>>
>>
>> Sorry for making a late comment on v3.
>>
>> It's good to choose non-rotational if it could.
>>
>> But I'm not sure whether it's a good idea to guess queue depth here
>> because filesystem is still at a high position of IO stack.  It'd
>> probably get good results when running tests, but in practical mixed
>> workloads, the underlying queue depth will be changing all the time.
>
> First version supposed for SSD, SSD + HDD only cases.
> At that version that just a "attempt", make LB on hdd.
> That can be easy dropped, if we decide that's a bad behaviour.
>
> If i understood correctly, which counters used,
> we check count of I/O ops that device processing currently
> (i.e. after merging & etc),
> not queue what not send (i.e. before merging & etc).
>
> i.e. we guessed based on low level block io stuff.
> As example that not work on zram devs (AFAIK, as zram don't have that 
> counters).
>
> So, no matter at which level we check that.
>
>> In fact, I think for rotational disks, more merging and less seeking
>> make more sense, even in raid1/10 case.
>>
>> Thanks,
>>
>> -liubo
>
> queue_depth changing must not create big problems there,
> i.e. round_down must make all changes "equal".
>
> For hdd, if we have a "big" (8..16?) queue depth,
> with high probability that hdd overloaded,
> and if other hdd have much less load
> (may be instead of round_down, that better use abs diff > 8)
> we try to requeue requests to other hdd.
>
> That will not show true equal distribution, but in case where
> one disks have more load, and pid based mechanism fail to make LB,
> we will just move part of load to other hdd.
>
> Until load distribution will not changed.
>
> May be for HDD that need to make threshold more aggressive, like 16
> (i.e. afaik SATA drives have hw rq len 31, so just use half of that).
>
> Thanks.
>
>>> Changes:
>>>   v1 -> v2:
>>> - Use helper part_in_flight() from genhd.c
>>>   to get queue lenght
>>> - Move guess code to guess_optimal()
>>> - Change balancer logic, try use pid % mirror by default
>>>   Make balancing on spinning rust if one of underline devices
>>>   are overloaded
>>>   v2 -> v3:
>>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>>
>>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>>> ---
>>>  block/genhd.c  |   1 +
>>>  fs/btrfs/volumes.c | 115 
>>> -
>>>  2 files changed, 114 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 96a66f671720..a77426a7 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct 
>>> hd_struct *part,
>>>   atomic_read(>in_flight[1]);
>>>   }
>>>  }
>>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>>
>>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>>  {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 49810b70afd3..a3b80ba31d4d 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -27,6 +27,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include "ctree.h"
>>>  #include "extent_map.h"
>>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct

Re: [PATCH 0/4] Btrfs: just bunch of patches to ioctl.c

2018-02-09 Thread Timofey Titovets
Gentle ping

2018-01-09 13:53 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> Gentle ping
>
> 2017-12-19 13:02 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
>> 1st patch, remove 16MiB restriction from extent_same ioctl(),
>> by doing iterations over passed range.
>>
>> I did not see much difference in performance, so it's just remove
>> logic restriction.
>>
>> 2-3 pathes, update defrag ioctl():
>>  - Fix bad behaviour with full rewriting all compressed
>>extents in defrag range. (that also make autodefrag on compressed fs
>>not so expensive)
>>  - Allow userspace specify NONE as target compression type,
>>that allow users to uncompress files by defragmentation with btrfs-progs
>>  - Make defrag ioctl understood requested compression type and current
>>compression type of extents, to make btrfs fi def -rc
>>idempotent operation.
>>i.e. now possible to say, make all extents compressed with lzo,
>>and btrfs will not recompress lzo compressed data.
>>Same for zlib, zstd, none.
>>(patch to btrfs-progs in PR on kdave GitHub).
>>
>> 4th patch, reduce size of struct btrfs_inode
>>  - btrfs_inode store fields like: prop_compress, defrag_compress and
>>after 3rd patch, change_compress.
>>They use unsigned as a type, and use 12 bytes in sum.
>>But change_compress is a bitflag, and prop_compress/defrag_compress
>>only store compression type, that currently use 0-3 of 2^32-1.
>>
>>So, set a bitfields on that vars, and reduce size of btrfs_inode:
>>1136 -> 1128.
>>
>> Timofey Titovets (4):
>>   Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
>>   Btrfs: make should_defrag_range() understood compressed extents
>>   Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation
>>   Btrfs: reduce size of struct btrfs_inode
>>
>>  fs/btrfs/btrfs_inode.h |   5 +-
>>  fs/btrfs/inode.c   |   4 +-
>>  fs/btrfs/ioctl.c   | 203 
>> +++--
>>  3 files changed, 133 insertions(+), 79 deletions(-)
>>
>> --
>> 2.15.1
>
>
>
> --
> Have a nice day,
> Timofey.



-- 
Have a nice day,
Timofey.


Re: invalid files names, btrfs check can't repair it

2018-01-12 Thread Timofey Titovets
2018-01-13 0:04 GMT+03:00 Sebastian Andrzej Siewior :
> Hi,
>
> so I had bad memory and before I realized it and removed it btrfs took some
> damage. Now I have this:
>
> |ls -lh crap/
> |ls: cannot access 'crap/2f3f379b2a3d7499471edb74869efe-1948311.d': No such 
> file or directory
> |ls: cannot access 'crap/454bf066ddfbf42e0f3b77ea71c82f-878732.o': No such 
> file or directory
> |total 0
> |-? ? ? ? ?? 2f3f379b2a3d7499471edb74869efe-1948311.d
> |-? ? ? ? ?? 454bf066ddfbf42e0f3b77ea71c82f-878732.o
>
> and in dmesg I see:
>
> | BTRFS critical (device sda4): invalid dir item type: 33
> | BTRFS critical (device sda4): invalid dir item name len: 8231
>
> `btrfs check' (from v4.14.1) finds them and prints them but has no idea
> what to do with it. Would it be possible to let the check tool rename
> the offended filename to something (like its inode number) put it in
> lost+found if it has any data attached to it and otherwise simply remove
> it? Right now I can't remove that folder.
>
> Sebastian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Deletion:
If that happens inside a subvolume, you can create a new subvolume,
reflink the data into it and delete the old subvolume (a rough example
is below). I don't know of any other way to fix such entries.
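
A rough example, assuming the damaged directory lives in a subvolume
mounted at /mnt/old and the rest of its content is still readable; copy
everything except the broken entries, then drop the old subvolume
(paths are only examples):

  # btrfs subvolume create /mnt/new
  # cp -a --reflink=always /mnt/old/. /mnt/new/
  # btrfs subvolume delete /mnt/old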

P.S.
I have hit that issue even without bad RAM, just from some system
hangs/resets (I use notreelog as a workaround for now).

Thanks.
-- 
Have a nice day,
Timofey.


Re: btrfs subvolume mount with different options

2018-01-12 Thread Timofey Titovets
2018-01-12 20:49 GMT+03:00 Konstantin V. Gavrilenko :
> Hi list,
>
> just wondering whether it is possible to mount two subvolumes with different 
> mount options, i.e.
>
> |
> |- /a  defaults,compress-force=lza
> |
> |- /b  defaults,nodatacow
>
>
> since, when both subvolumes are mounted, and when I change the option for one 
> it is changed for all of them.
>
>
> thanks in advance.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Not possible for now.

-- 
Have a nice day,
Timofey.


Re: Recommendations for balancing as part of regular maintenance?

2018-01-10 Thread Timofey Titovets
2018-01-10 21:33 GMT+03:00 Tom Worster :
> On 10 Jan 2018, at 12:01, Austin S. Hemmelgarn wrote:
>
>> On 2018-01-10 11:30, Tom Worster wrote:
>>
>> Also, for future reference, the term we typically use is ENOSPC, as that's
>> the symbolic name for the error code you get when this happens (or when your
>> filesystem is just normally full), but I actually kind of like your name for
>> it too, it conveys the exact condition being discussed in a way that should
>> be a bit easier for non-technical types to understand.
>
>
> Iiuc, ENOSPC is _exhaustion_ of unallocated space, which is a specific case
> of depletion.
>
> I sought a term to refer to the phenomenon of unallocated space shrinking
> beyond what filesystem use would demand and how it ratchets down. Hence a
> sysop needs to manage DoUS. ENOSPC is likely a failure of such management.
>
>
>>> - Some experienced users say that, to resolve a problem with DoUS, they
>>> would rather recreate the filesystem than run balance.
>>
>> This is kind of independent of BTRFS.
>
>
> Yes. I mentioned it only because it was, to me, a striking statement of lack
> of confidence in balance.
>
>
>>> But if Duncan is right (which, for me, is practically the same as
>>> consensus on the proposition) that problems with corruption while running
>>> balance are associated with heavy coincident IO activity, then I can see a
>>> reasonable way forwards. I can even see how general recommendations for
>>> BTRFS maintenance might develop.
>>
>> As I commented above, I would tend to believe Duncan is right in this case
>> (both because it makes sense, and because he seems to generally be right
>> about this type of thing).  That said, I really do think that normal user
>> I/O is probably not the issue, but low-level filesystem operations are.
>> That said, there is no reason that BTRFS shouldn't either:
>> 1. Handle this just fine without causing corruption.
>> or:
>> 2. Extend the mutex used to prevent concurrent balances to cover other
>> operations that might cause issues (that is, make it so you can't scrub a
>> filesystem while it's being balanced, or defragment it, or whatever else).
>
>
> Yes, but backtracking a bit, I think there's another really important point
> here. Assuming Duncan's right, it's not so hard to develop guidelines for
> general BTRFS management that include DoUS among other topics. Duncan's
> other email today contains or implies quite a lot of those guidelines.
>
> Or, to put it another way, it's enough for me. I think I know what to do
> now. And that much could be written down for the benefit of others.
>
> Tom
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

My two cents:
I run about ~50 different systems
(VCS systems, MySQL DBs, web servers, Elasticsearch nodes, etc.).
All of them run btrfs only and run fine, even with automatic snapshot
rotation on some of them (btrfs makes my life easier and I like it).

Most of them are small VMs, from 3GiB to 512GiB (I use compression
everywhere), and none of them needs balance; the only thing I take care
of is to always keep some unallocated space on them.

Most of them settle at some stable used/allocated/unallocated ratio.

I.e. as I see it from the point of view of this conversation:
we run balance to reallocate data and make more unallocated space,
but if someone already has plenty of it, that is useless, no?

Example: on my notebook I have 60% allocated by data/metadata chunks
and only 40% really used by data; even when I had 90% allocated and
85% used, I did not run into ENOSPC problems (256GiB SSD).

And if I run balance at all, I run it only to fight the btrfs discard
processing bug, which leads to trimming only unallocated space
(probably fixed already).

So if we talk about "regular" runs of balance, maybe it makes sense to
check free space first: only if the system has some high percentage of
space allocated, like 80%, and plenty of allocated-but-unused space, is
a balance actually needed, no? For example, with a check like the one
sketched below.
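
A rough sketch of such a maintenance policy, assuming the filesystem is
mounted at /mnt (the ~80% trigger and the usage filters are only
example values, not a recommendation):

  # btrfs filesystem usage /mnt
  (if most of the device space is allocated but chunk usage is low,
   e.g. above ~80% allocated:)
  # btrfs balance start -dusage=20 -musage=20 /mnt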

(I'm not saying btrfs has no problems; I do see some rare, hateful bugs
on some systems, but most of them are internal btrfs problems or
problems in how btrfs cooperates with applications.)

Thanks.
-- 
Have a nice day,
Timofey.


Re: [PATCH 0/4] Btrfs: just bunch of patches to ioctl.c

2018-01-09 Thread Timofey Titovets
Gentle ping

2017-12-19 13:02 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> 1st patch, remove 16MiB restriction from extent_same ioctl(),
> by doing iterations over passed range.
>
> I did not see much difference in performance, so it's just remove
> logic restriction.
>
> 2-3 pathes, update defrag ioctl():
>  - Fix bad behaviour with full rewriting all compressed
>extents in defrag range. (that also make autodefrag on compressed fs
>not so expensive)
>  - Allow userspace specify NONE as target compression type,
>that allow users to uncompress files by defragmentation with btrfs-progs
>  - Make defrag ioctl understood requested compression type and current
>compression type of extents, to make btrfs fi def -rc
>idempotent operation.
>i.e. now possible to say, make all extents compressed with lzo,
>and btrfs will not recompress lzo compressed data.
>Same for zlib, zstd, none.
>(patch to btrfs-progs in PR on kdave GitHub).
>
> 4th patch, reduce size of struct btrfs_inode
>  - btrfs_inode store fields like: prop_compress, defrag_compress and
>after 3rd patch, change_compress.
>They use unsigned as a type, and use 12 bytes in sum.
>But change_compress is a bitflag, and prop_compress/defrag_compress
>only store compression type, that currently use 0-3 of 2^32-1.
>
>    So, set a bitfields on that vars, and reduce size of btrfs_inode:
>1136 -> 1128.
>
> Timofey Titovets (4):
>   Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
>   Btrfs: make should_defrag_range() understood compressed extents
>   Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation
>   Btrfs: reduce size of struct btrfs_inode
>
>  fs/btrfs/btrfs_inode.h |   5 +-
>  fs/btrfs/inode.c   |   4 +-
>  fs/btrfs/ioctl.c   | 203 
> +++--
>  3 files changed, 133 insertions(+), 79 deletions(-)
>
> --
> 2.15.1



-- 
Have a nice day,
Timofey.


Re: [PATCH] generic/015: Change the test filesystem size to 101mb

2018-01-08 Thread Timofey Titovets
2018-01-08 15:54 GMT+03:00 Qu Wenruo :
>
>
> On 2018年01月08日 16:43, Nikolay Borisov wrote:
>> This test has been failing for btrfs for quite some time,
>> at least since 4.7. There are 2 implementation details of btrfs that
>> it exposes:
>>
>> 1. Currently btrfs filesystem under 100mb are created in Mixed block
>> group mode. Freespace accounting for it is not 100% accurate - I've
>> observed around 100-200kb discrepancy between a newly created filesystem,
>> then writing a file and deleting it and checking the free space. This
>> falls within %3 and not %1 as hardcoded in the test.
>>
>> 2. BTRFS won't flush it's delayed allocation on file deletion if less
>> than 32mb are deleted. On such files we need to perform sync (missing
>> in the test) or wait until time elapses for transaction commit.
>
> I'm a little confused about the 32mb limit.
>
> My personal guess about the reason to delay space freeing would be:
> 1) Performance
>Btrfs tree operation (at least for write) is slow due to its tree
>design.
>So it may makes sense to delay space freeing.
>
>But in that case, 32MB may seems to small to really improve the
>performance. (Max file extent size is 128M, delaying one item
>deletion doesn't really improve performance)
>
> 2) To avoid later new allocation to rewrite the data.
>It's possible that freed space of deleted inode A get allocated to
>new file extents. And a power loss happens before we commit the
>transaction.
>
>In that case, if everything else works fine, we should be reverted to
>previous transaction where deleted inode A still exists.
>But we lost its data, as its data is overwritten by other file
>extents. And any read will just cause csum error.
>
>But in that case, there shouldn't be any 32MB limit, but all deletion
>of orphan inodes should be delayed.
>
>And further more, this can be addressed using log tree, to log such
>deletion so at recovery time, we just delete that inode.
>
> So I'm wonder if we can improve btrfs deletion behavior.
>
>
>>
>> Since mixed mode is somewhat deprecated and btrfs is not really intended
>> to be used on really small devices let's just adjust the test to
>> create a 101mb fs, which doesn't use mixed mode and really test
>> freespace accounting.
>
> Despite of some btrfs related questions, I'm wondering if there is any
> standard specifying (POSIX?) how a filesystem should behave when
> unlinking a file.
>
> Should the space freeing be synchronized? And how should statfs report
> available space?
>
> In short, I'm wondering if this test and its expected behavior is
> generic enough for all filesystems.
>
> Thanks,
> Qu
>
>>
>> Signed-off-by: Nikolay Borisov 
>> ---
>>  tests/generic/015 | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/generic/015 b/tests/generic/015
>> index 78f2b13..416c4ae 100755
>> --- a/tests/generic/015
>> +++ b/tests/generic/015
>> @@ -53,7 +53,7 @@ _supported_os Linux
>>  _require_scratch
>>  _require_no_large_scratch_dev
>>
>> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
>> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>>  || _fail "mkfs failed"
>>  _scratch_mount || _fail "mount failed"
>>  out=$SCRATCH_MNT/fillup.$$
>>
>

AFAIK all filesystems, including btrfs, return from unlink() (if the
file is not open) only after the space has been freed.
So by the time unlink() returns, the free space must already be
released (a small illustration follows the links below).
Proofs: [1] [2] [3]
[4] - POSIX does not seem to describe this behaviour.

1. http://man7.org/linux/man-pages/man2/unlink.2.html
2. https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow
3. https://www.spinics.net/lists/linux-btrfs/msg59901.html
4. https://www.unix.com/man-page/posix/1P/unlink/
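
A minimal sketch of the check being discussed, assuming the file sits
on the filesystem of the current directory (path is taken from argv,
error handling omitted):

#include <stdio.h>
#include <sys/statfs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct statfs before, after;

	if (argc < 2)
		return 1;
	statfs(".", &before);		/* free space before unlink() */
	unlink(argv[1]);
	statfs(".", &after);		/* free space right after unlink() returns */
	printf("f_bfree before: %lu, after: %lu\n",
	       (unsigned long)before.f_bfree, (unsigned long)after.f_bfree);
	return 0;
}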

Thanks.
-- 
Have a nice day,
Timofey.


Re: [PATCH 0/2] Remove custom crc32c init code from btrfs

2018-01-08 Thread Timofey Titovets
2018-01-08 12:45 GMT+03:00 Nikolay Borisov <nbori...@suse.com>:
> So here is a small 2 patch set which removes btrfs' manual initialisation of
> the lower level crc32c module. Explanation why is ok can be found in Patch 
> 2/2.
>
> Patch 1/2 just adds a function to the generic crc32c header which allows
> querying the actual crc32c implementaiton used (i.e. software or 
> hw-accelerated)
> to retain current btrfs behavior. This is mainly used for debugging purposes
> and is independent.
>
> Nikolay Borisov (2):
>   libcrc32c: Add crc32c_impl function
>   btrfs: Remove custom crc32c init code
>
>  fs/btrfs/Kconfig   |  3 +--
>  fs/btrfs/Makefile  |  2 +-
>  fs/btrfs/check-integrity.c |  4 ++--
>  fs/btrfs/ctree.h   | 16 ++
>  fs/btrfs/dir-item.c|  1 -
>  fs/btrfs/disk-io.c |  4 ++--
>  fs/btrfs/extent-tree.c | 10 -
>  fs/btrfs/hash.c| 54 
> --
>  fs/btrfs/hash.h| 43 
>  fs/btrfs/inode-item.c  |  1 -
>  fs/btrfs/inode.c   |  1 -
>  fs/btrfs/props.c   |  2 +-
>  fs/btrfs/send.c|  4 ++--
>  fs/btrfs/super.c   | 14 
>  fs/btrfs/tree-log.c|  2 +-
>  include/linux/crc32c.h |  1 +
>  lib/libcrc32c.c|  6 ++
>  17 files changed, 42 insertions(+), 126 deletions(-)
>  delete mode 100644 fs/btrfs/hash.c
>  delete mode 100644 fs/btrfs/hash.h
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reviewed-by: Timofey Titovets <nefelim...@gmail.com>

P.S.
Maybe it is overkill to remove hash.c completely?
I.e. if we have a "plan" to support another hash algorithm,
we will still need some abstraction for that.

In-band dedup doesn't touch hash.*, so no one else should be affected.

Thanks.

-- 
Have a nice day,
Timofey.


Re: [PATCH 1/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-01-08 Thread Timofey Titovets
2017-12-20 0:23 GMT+03:00 Darrick J. Wong <darrick.w...@oracle.com>:
> On Tue, Dec 19, 2017 at 01:02:44PM +0300, Timofey Titovets wrote:
>> At now btrfs_dedupe_file_range() restricted to 16MiB range for
>> limit locking time and memory requirement for dedup ioctl()
>>
>> For too big input range code silently set range to 16MiB
>>
>> Let's remove that restriction by do iterating over dedup range.
>> That's backward compatible and will not change anything for request
>> less then 16MiB.
>>
>> Changes:
>>   v1 -> v2:
>> - Refactor btrfs_cmp_data_prepare and btrfs_extent_same
>> - Store memory of pages array between iterations
>> - Lock inodes once, not on each iteration
>> - Small inplace cleanups
>
> /me wonders if you could take advantage of vfs_clone_file_prep_inodes,
> which takes care of the content comparison (and flushing files, and inode
> checks, etc.) ?
>
> (ISTR Qu Wenruo(??) or someone remarking that this might not work well
> with btrfs locking model, but I could be mistaken about all that...)
>
> --D

Sorry, I don't have enough knowledge to give an authoritative answer.
I can only say that I lightly tested it by adding the call before
btrfs_extent_same() (with the inode locks held), and at least that
works.

Thanks.

>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/ioctl.c | 160 
>> ---
>>  1 file changed, 94 insertions(+), 66 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index be5bd81b3669..45a47d0891fc 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
>>   put_page(pg);
>>   }
>>   }
>> - kfree(cmp->src_pages);
>> - kfree(cmp->dst_pages);
>> +
>> + cmp->num_pages = 0;
>>  }
>>
>>  static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
>> @@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
>> u64 loff,
>> u64 len, struct cmp_pages *cmp)
>>  {
>>   int ret;
>> - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
>> - struct page **src_pgarr, **dst_pgarr;
>> -
>> - /*
>> -  * We must gather up all the pages before we initiate our
>> -  * extent locking. We use an array for the page pointers. Size
>> -  * of the array is bounded by len, which is in turn bounded by
>> -  * BTRFS_MAX_DEDUPE_LEN.
>> -  */
>> - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
>> - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
>> - if (!src_pgarr || !dst_pgarr) {
>> - kfree(src_pgarr);
>> - kfree(dst_pgarr);
>> - return -ENOMEM;
>> - }
>> - cmp->num_pages = num_pages;
>> - cmp->src_pages = src_pgarr;
>> - cmp->dst_pages = dst_pgarr;
>>
>>   /*
>>* If deduping ranges in the same inode, locking rules make it 
>> mandatory
>>* to always lock pages in ascending order to avoid deadlocks with
>>* concurrent tasks (such as starting writeback/delalloc).
>>*/
>> - if (src == dst && dst_loff < loff) {
>> - swap(src_pgarr, dst_pgarr);
>> + if (src == dst && dst_loff < loff)
>>   swap(loff, dst_loff);
>> - }
>>
>> - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
>> + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
>> +
>> + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
>>   if (ret)
>>   goto out;
>>
>> - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
>> + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
>> dst_loff);
>>
>>  out:
>>   if (ret)
>> @@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode 
>> *inode, u64 off, u64 *plen,
>>   return 0;
>>  }
>>
>> -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>> -  struct inode *dst, u64 dst_loff)
>> +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>> +struct inode *dst, u64 dst_loff,
>> +struct cmp_page

Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2018-01-03 Thread Timofey Titovets
2018-01-03 14:40 GMT+03:00 Filipe Manana <fdman...@gmail.com>:
> On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets <nefelim...@gmail.com> 
> wrote:
>> Insert sort are generaly perform better then bubble sort,
>> by have less iterations on avarage.
>> That version also try place element to right position
>> instead of raw swap.
>>
>> I'm not sure how many stripes per bio raid56,
>> btrfs try to store (and try to sort).
>
> If you don't know it, besides unlikely to be doing the best possible
> thing here, you might actually make things worse or not offering any
> benefit. IOW, you should know it for sure before submitting such
> changes.
>
> You should know if the number of elements to sort is big enough such
> that an insertion sort is faster than a bubble sort, and more
> importantly, measure it and mention it in the changelog.
> As it is, you are showing lack of understanding of the code and
> component you are touching, and leaving many open questions such as
> how faster this is, why insertion sort and not a
> quick/merge/heap/whatever sort, etc.
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”

Sorry, you are right,
I should have done some tests and investigation before sending a patch.
(I just tried to rely on some magic math intuition.)

The input size depends on the number of devices,
so on small arrays, like 3-5 elements, there is no meaningful difference.

Example: raid6 (with 4 disks) produces stripe line addresses like:
1. 4641783808 4641849344 4641914880 18446744073709551614
2. 4641652736 4641718272 18446744073709551614 4641587200
3. 18446744073709551614 4636475392 4636540928 4636606464
4. 4641521664 18446744073709551614 4641390592 4641456128

For that number of elements any sorting algorithm is fast enough.

Let's consider those addresses as random non-repeating numbers.

We can use a tool like Sound Of Sorting (SoS) to make some
easy-to-interpret tests of the algorithms.

(Sorry, there is no script to reproduce this, as SoS does not provide a
CLI; the numbers were gathered by hand, running SoS with different
parameters.)

Table (source data points are also in the attachment):

Sort algo | Metric         | 3 disks | 4    | 6    | 8    | 10    | 12    | 14    | AVG
Bubble    | Comparisons    | 3       | 6    | 15   | 28   | 45    | 66    | 91    | ~36.29
Bubble    | Array accesses | 7.8     | 18.2 | 45.8 | 81.8 | 133.4 | 192   | 268.6 | 106.8
Insertion | Comparisons    | 2.8     | 5    | 11.6 | 17   | 28.6  | 39.4  | 55.2  | 22.8
Insertion | Array accesses | 8.4     | 13.6 | 31   | 48.8 | 80.4  | 109.6 | 155.8 | ~63.94

I.e. for sizes like 3-4 there is not much difference, but insertion
sort works faster on bigger arrays (up to 1.7x for a 14-disk array).

Does that make sense?
I think yes; in any case that is several dozen machine instructions
which can be spent elsewhere. A minimal sketch of the insertion sort
being discussed is below.
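
A minimal sketch (illustration only, not the submitted patch) of an
insertion sort over the raid56 stripe physical addresses discussed
above:

#include <stdint.h>
#include <stddef.h>

static void sort_stripe_addrs(uint64_t *addr, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		uint64_t cur = addr[i];
		size_t j = i;

		/* shift larger elements right instead of swapping pairwise */
		while (j > 0 && addr[j - 1] > cur) {
			addr[j] = addr[j - 1];
			j--;
		}
		addr[j] = cur;
	}
}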

P.S. Heap sort, which is also available in the kernel via sort(), would
add too much overhead for this small number of devices; heap sort only
starts to win over insertion sort at around 16+ elements (a call sketch
is below).
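
A sketch of calling the kernel helper mentioned above (lib/sort.c,
which is a heap sort); shown only to illustrate the alternative being
discussed, not proposed code:

#include <linux/sort.h>
#include <linux/types.h>

static int cmp_u64(const void *a, const void *b)
{
	const u64 x = *(const u64 *)a;
	const u64 y = *(const u64 *)b;

	if (x < y)
		return -1;
	return x > y ? 1 : 0;
}

static void sort_stripe_addrs_heap(u64 *addrs, size_t nr)
{
	sort(addrs, nr, sizeof(u64), cmp_u64, NULL);
}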

/* Snob mode on */
P.P.S.
Sorting algorithms that need additional memory (merge sort and the
like) are not worth comparing in our case, though they would of course
be faster.
/* Snob mode off */

Thanks.
-- 
Have a nice day,
Timofey.


Bubble_vs_Insertion.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic

2018-01-02 Thread Timofey Titovets
2018-01-02 21:31 GMT+03:00 Liu Bo <bo.li@oracle.com>:
> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
>> Currently btrfs raid1/10 balancer bаlance requests to mirrors,
>> based on pid % num of mirrors.
>>
>> Make logic understood:
>>  - if one of underline devices are non rotational
>>  - Queue leght to underline devices
>>
>> By default try use pid % num_mirrors guessing, but:
>>  - If one of mirrors are non rotational, repick optimal to it
>>  - If underline mirror have less queue leght then optimal,
>>repick to that mirror
>>
>> For avoid round-robin request balancing,
>> lets round down queue leght:
>>  - By 8 for rotational devs
>>  - By 2 for all non rotational devs
>>
>
> Sorry for making a late comment on v3.
>
> It's good to choose non-rotational if it could.
>
> But I'm not sure whether it's a good idea to guess queue depth here
> because filesystem is still at a high position of IO stack.  It'd
> probably get good results when running tests, but in practical mixed
> workloads, the underlying queue depth will be changing all the time.

The first version was intended only for the SSD and SSD + HDD cases.
In this version it is just an "attempt" to do load balancing on HDDs.
It can easily be dropped if we decide that is bad behaviour.

If I understood correctly which counters are used,
we check the count of I/O requests the device is currently processing
(i.e. after merging etc.),
not the queue of requests not yet sent (i.e. before merging etc.).

I.e. we guess based on low-level block I/O information.
As an example, that does not work on zram devices (AFAIK, as zram does
not have those counters).

So it does not really matter at which level we check that.

> In fact, I think for rotational disks, more merging and less seeking
> make more sense, even in raid1/10 case.
>
> Thanks,
>
> -liubo

A changing queue depth should not create big problems there,
i.e. the round_down should make all such changes "equal".

For an HDD, if we see a "big" (8..16?) queue depth, that HDD is
overloaded with high probability, and if the other HDD has much less
load (maybe instead of round_down it would be better to use an absolute
difference > 8), we try to requeue requests to the other HDD.

That will not give a truly equal distribution, but in the case where
one disk has more load and the pid-based mechanism fails to balance,
we will simply move part of the load to the other HDD,
until the load distribution changes again.

Maybe for HDDs the threshold needs to be more aggressive, like 16
(AFAIK SATA drives have a hardware queue length of 31, so just use half
of that). Both variants are sketched below.
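
A rough sketch (illustration only, not the patch itself) of the two
compare-and-repick variants mentioned above: rounding both queue
lengths down versus requiring an absolute difference above a threshold:

#include <stdbool.h>
#include <stdlib.h>

#define HDD_ROUND_DOWN	8	/* example threshold from the discussion */

/* variant used in the patch: compare rounded-down queue lengths */
static bool repick_by_round_down(int qlen_optimal, int qlen_other)
{
	return (qlen_other / HDD_ROUND_DOWN) < (qlen_optimal / HDD_ROUND_DOWN);
}

/* alternative mentioned above: repick only when the raw queue lengths
 * differ by more than the threshold */
static bool repick_by_abs_diff(int qlen_optimal, int qlen_other)
{
	return qlen_other < qlen_optimal &&
	       abs(qlen_optimal - qlen_other) > HDD_ROUND_DOWN;
}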

Thanks.

>> Changes:
>>   v1 -> v2:
>> - Use helper part_in_flight() from genhd.c
>>   to get queue lenght
>> - Move guess code to guess_optimal()
>> - Change balancer logic, try use pid % mirror by default
>>   Make balancing on spinning rust if one of underline devices
>>   are overloaded
>>   v2 -> v3:
>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  block/genhd.c  |   1 +
>>  fs/btrfs/volumes.c | 115 
>> -
>>  2 files changed, 114 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 96a66f671720..a77426a7 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct 
>> hd_struct *part,
>>   atomic_read(>in_flight[1]);
>>   }
>>  }
>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>
>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>  {
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 49810b70afd3..a3b80ba31d4d 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -27,6 +27,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include "ctree.h"
>>  #include "extent_map.h"
>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
>> *fs_info, u64 logical, u64 len)
>>   return ret;
>>  }
>>
>> +/**
>> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
>> + *
>> + * @bdev: target bdev
>> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
>> + */
>> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
>> +{
>> + int sum;
>> + struct hd_struct *bd_part = bdev->bd_part;
>> + struct request_queue *rq = bdev_get_queue(bdev);
>> + ui

Re: [PATCH 1/2] Btrfs: heuristic: replace workspace managment code by mempool API

2017-12-30 Thread Timofey Titovets
2017-12-24 7:55 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> Currently compression code have custom workspace/memory cache
> for guarantee forward progress on high memory pressure.
>
> That api can be replaced with mempool API, which can guarantee the same.
> Main goal is simplify/cleanup code and replace it with general solution.
>
> I try avoid use of atomic/lock/wait stuff,
> as that all already hidden in mempool API.
> Only thing that must be racy safe is initialization of
> mempool.
>
> So i create simple mempool_alloc_wrap, which will handle
> mempool_create failures, and sync threads work by cmpxchg()
> on mempool_t pointer.
>
> Another logic difference between our custom stuff and mempool:
>  - ws find/free mosly reuse current workspaces whenever possible.
>  - mempool use alloc/free of provided helpers with more
>aggressive use of __GFP_NOMEMALLOC, __GFP_NORETRY, GFP_NOWARN,
>and only use already preallocated space when memory get tight.
>
> Not sure which approach are better, but simple stress tests with
> writing stuff on compressed fs on ramdisk show negligible difference on
> 8 CPU Virtual Machine with Intel Xeon E5-2420 0 @ 1.90GHz (+-1%).
>
> Other needed changes to use mempool:
>  - memalloc_nofs_{save,restore} move to each place where kvmalloc
>will be used in call chain.
>  - mempool_create return pointer to mampool or NULL,
>no error, so macros like IS_ERR(ptr) can't be used.
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/compression.c | 197 
> ++---
>  1 file changed, 106 insertions(+), 91 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 208334aa6c6e..02bd60357f04 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -768,46 +769,46 @@ struct heuristic_ws {
> struct bucket_item *bucket;
> /* Sorting buffer */
> struct bucket_item *bucket_b;
> -   struct list_head list;
>  };
>
> -static void free_heuristic_ws(struct list_head *ws)
> +static void heuristic_ws_free(void *element, void *pool_data)
>  {
> -   struct heuristic_ws *workspace;
> +   struct heuristic_ws *ws = (struct heuristic_ws *) element;
>
> -   workspace = list_entry(ws, struct heuristic_ws, list);
> -
> -   kvfree(workspace->sample);
> -   kfree(workspace->bucket);
> -   kfree(workspace->bucket_b);
> -   kfree(workspace);
> +   kfree(ws->sample);
> +   kfree(ws->bucket);
> +   kfree(ws->bucket_b);
> +   kfree(ws);
>  }
>
> -static struct list_head *alloc_heuristic_ws(void)
> +static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data)
>  {
> -   struct heuristic_ws *ws;
> +   struct heuristic_ws *ws = kzalloc(sizeof(*ws), gfp_mask);
>
> -   ws = kzalloc(sizeof(*ws), GFP_KERNEL);
> if (!ws)
> -   return ERR_PTR(-ENOMEM);
> +   return NULL;
>
> -   ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL);
> +   /*
> +* We can handle allocation failures and
> +* slab have caches for 8192 byte allocations
> +*/
> +   ws->sample = kmalloc(MAX_SAMPLE_SIZE, gfp_mask);
> if (!ws->sample)
> goto fail;
>
> -   ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL);
> +   ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), gfp_mask);
> if (!ws->bucket)
> goto fail;
>
> -   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), 
> GFP_KERNEL);
> +   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), gfp_mask);
> if (!ws->bucket_b)
> goto fail;
>
> -   INIT_LIST_HEAD(>list);
> -   return >list;
> +   return ws;
> +
>  fail:
> -   free_heuristic_ws(>list);
> -   return ERR_PTR(-ENOMEM);
> +   heuristic_ws_free(ws, NULL);
> +   return NULL;
>  }
>
>  struct workspaces_list {
> @@ -821,9 +822,12 @@ struct workspaces_list {
> wait_queue_head_t ws_wait;
>  };
>
> -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
> +struct workspace_stor {
> +   mempool_t *pool;
> +};
>
> -static struct workspaces_list btrfs_heuristic_ws;
> +static struct workspace_stor btrfs_heuristic_ws_stor;
> +static struct workspaces_list btrfs_comp_ws[BTRFS_COMP

[PATCH V3] Btrfs: enhance raid1/10 balance heuristic

2017-12-30 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer balances requests across mirrors
based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - If one of the mirrors is non-rotational, repick it as the optimal one
 - If another mirror has a smaller queue length than the optimal one,
   repick that mirror

To avoid round-robin request balancing,
round the queue length down:
 - by 8 for rotational devices
 - by 2 for all non-rotational devices

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic, try use pid % mirror by default
  Make balancing on spinning rust if one of underline devices
  are overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 115 -
 2 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..a77426a7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 			atomic_read(&part->in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49810b70afd3..a3b80ba31d4d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: round factor big for hdd and small for ssd, like 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try prevent switch for every sneeze
+* By roundup output num by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal expected to be pid % num_stripes
+ *
+ * That's generaly ok for spread load
+ * Add some balancer based on queue leght to device
+ *
+ * Basic ideas:
+ *  - Sequential read generate low amount of request
+ *so if load of drives are equal, use pid % num_stripes balancing
+ *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
+ *and repick if other dev have "significant" less queue lenght
+ *  - Repick optimal if queue leght of other mirror are less
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For mixed case, pick non rotational dev as optimal */
+   if (all_bdev_rotate == all_bdev_nonrot) {
+   for (i = 0; i < num; i++) {
+   if (is_nonrot[i])
+

Re: [PATCH v2] Btrfs: enhance raid1/10 balance heuristic

2017-12-30 Thread Timofey Titovets
2017-12-30 11:14 GMT+03:00 Dmitrii Tcvetkov <demfl...@demfloro.ru>:
> On Sat, 30 Dec 2017 03:15:20 +0300
> Timofey Titovets <nefelim...@gmail.com> wrote:
>
>> 2017-12-29 22:14 GMT+03:00 Dmitrii Tcvetkov <demfl...@demfloro.ru>:
>> > On Fri, 29 Dec 2017 21:44:19 +0300
>> > Dmitrii Tcvetkov <demfl...@demfloro.ru> wrote:
>> >> > +/**
>> >> > + * guess_optimal - return guessed optimal mirror
>> >> > + *
>> >> > + * Optimal expected to be pid % num_stripes
>> >> > + *
>> >> > + * That's generaly ok for spread load
>> >> > + * Add some balancer based on queue leght to device
>> >> > + *
>> >> > + * Basic ideas:
>> >> > + *  - Sequential read generate low amount of request
>> >> > + *so if load of drives are equal, use pid % num_stripes
>> >> > balancing
>> >> > + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as
>> >> > optimal
>> >> > + *and repick if other dev have "significant" less queue
>> >> > lenght
>> >> > + *  - Repick optimal if queue leght of other mirror are less
>> >> > + */
>> >> > +static int guess_optimal(struct map_lookup *map, int optimal)
>> >> > +{
>> >> > +   int i;
>> >> > +   int round_down = 8;
>> >> > +   int num = map->num_stripes;
>> >>
>> >> num has to be initialized from map->sub_stripes if we're reading
>> >> RAID10, otherwise there will be NULL pointer dereference
>> >>
>> >
>> > Check can be like:
>> > if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>> > num = map->sub_stripes;
>> >
>> >>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct
>> >>btrfs_fs_info *fs_info,
>> >>   stripe_index += mirror_num - 1;
>> >>   else {
>> >>   int old_stripe_index = stripe_index;
>> >>+  optimal = guess_optimal(map,
>> >>+  current->pid %
>> >>map->num_stripes);
>> >>   stripe_index = find_live_mirror(fs_info, map,
>> >> stripe_index,
>> >> map->sub_stripes,
>> >> stripe_index +
>> >>-current->pid %
>> >>map->sub_stripes,
>> >>+optimal,
>> >> dev_replace_is_ongoing);
>> >>   mirror_num = stripe_index - old_stripe_index
>> >> + 1; }
>> >>--
>> >>2.15.1
>> >
>> > Also here calculation should be with map->sub_stripes too.
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe
>> > linux-btrfs" in the body of a message to majord...@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> Why you think we need such check?
>> I.e. guess_optimal always called for find_live_mirror()
>> Both in same context, like that:
>>
>> if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
>>   u32 factor = map->num_stripes / map->sub_stripes;
>>
>>   stripe_nr = div_u64_rem(stripe_nr, factor, _index);
>>   stripe_index *= map->sub_stripes;
>>
>>   if (need_full_stripe(op))
>> num_stripes = map->sub_stripes;
>>   else if (mirror_num)
>> stripe_index += mirror_num - 1;
>>   else {
>> int old_stripe_index = stripe_index;
>> stripe_index = find_live_mirror(fs_info, map,
>>   stripe_index,
>>   map->sub_stripes, stripe_index +
>>   current->pid % map->sub_stripes,
>>   dev_replace_is_ongoing);
>> mirror_num = stripe_index - old_stripe_index + 1;
>> }
>>
>> That useless to check that internally
>
> My bad, so only need to call
> guess_optimal(map, current->pid % map->sub_stripes)
> in RAID10 branch.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yes, my bad, a copy-paste error; it will be fixed in v3.

Thanks

-- 
Have a nice day,
Timofey.


Re: [PATCH v2] Btrfs: enhance raid1/10 balance heuristic

2017-12-29 Thread Timofey Titovets
2017-12-29 22:14 GMT+03:00 Dmitrii Tcvetkov :
> On Fri, 29 Dec 2017 21:44:19 +0300
> Dmitrii Tcvetkov  wrote:
>> > +/**
>> > + * guess_optimal - return guessed optimal mirror
>> > + *
>> > + * Optimal expected to be pid % num_stripes
>> > + *
>> > + * That's generaly ok for spread load
>> > + * Add some balancer based on queue leght to device
>> > + *
>> > + * Basic ideas:
>> > + *  - Sequential read generate low amount of request
>> > + *so if load of drives are equal, use pid % num_stripes
>> > balancing
>> > + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as
>> > optimal
>> > + *and repick if other dev have "significant" less queue lenght
>> > + *  - Repick optimal if queue leght of other mirror are less
>> > + */
>> > +static int guess_optimal(struct map_lookup *map, int optimal)
>> > +{
>> > +   int i;
>> > +   int round_down = 8;
>> > +   int num = map->num_stripes;
>>
>> num has to be initialized from map->sub_stripes if we're reading
>> RAID10, otherwise there will be NULL pointer dereference
>>
>
> Check can be like:
> if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> num = map->sub_stripes;
>
>>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct
>>btrfs_fs_info *fs_info,
>>   stripe_index += mirror_num - 1;
>>   else {
>>   int old_stripe_index = stripe_index;
>>+  optimal = guess_optimal(map,
>>+  current->pid %
>>map->num_stripes);
>>   stripe_index = find_live_mirror(fs_info, map,
>> stripe_index,
>> map->sub_stripes,
>> stripe_index +
>>-current->pid %
>>map->sub_stripes,
>>+optimal,
>> dev_replace_is_ongoing);
>>   mirror_num = stripe_index - old_stripe_index
>> + 1; }
>>--
>>2.15.1
>
> Also here calculation should be with map->sub_stripes too.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Why do you think we need such a check?
I.e. guess_optimal() is always called for find_live_mirror(),
both in the same context, like this:

if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
  u32 factor = map->num_stripes / map->sub_stripes;

  stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index);
  stripe_index *= map->sub_stripes;

  if (need_full_stripe(op))
num_stripes = map->sub_stripes;
  else if (mirror_num)
stripe_index += mirror_num - 1;
  else {
int old_stripe_index = stripe_index;
stripe_index = find_live_mirror(fs_info, map,
  stripe_index,
  map->sub_stripes, stripe_index +
  current->pid % map->sub_stripes,
  dev_replace_is_ongoing);
mirror_num = stripe_index - old_stripe_index + 1;
}

So it is useless to check that internally.

---
Also, fio results for an all-HDD raid1, provided by waxhead:

Original:

Disk-4k-randread-depth-32: (g=0): rw=randread, bs=(R) 4096B-512KiB,
(W) 4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=32
Disk-4k-read-depth-8: (g=0): rw=read, bs=(R) 4096B-512KiB, (W)
4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=8
Disk-4k-randwrite-depth-8: (g=0): rw=randwrite, bs=(R) 4096B-512KiB,
(W) 4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=8
fio-3.1
Starting 3 processes
Disk-4k-randread-depth-32: Laying out IO file (1 file / 65536MiB)
Jobs: 3 (f=3): [r(1),R(1),w(1)][100.0%][r=120MiB/s,w=9.88MiB/s][r=998,w=96
IOPS][eta 00m:00s]
Disk-4k-randread-depth-32: (groupid=0, jobs=1): err= 0: pid=3132: Fri
Dec 29 16:16:33 2017
   read: IOPS=375, BW=41.3MiB/s (43.3MB/s)(24.2GiB/600128msec)
slat (usec): min=15, max=206039, avg=88.71, stdev=990.35
clat (usec): min=357, max=3487.1k, avg=85022.93, stdev=141872.25
 lat (usec): min=399, max=3487.2k, avg=85112.58, stdev=141880.31
clat percentiles (msec):
 |  1.00th=[5],  5.00th=[7], 10.00th=[9], 20.00th=[   13],
 | 30.00th=[   19], 40.00th=[   27], 50.00th=[   39], 60.00th=[   56],
 | 70.00th=[   83], 80.00th=[  127], 90.00th=[  209], 95.00th=[  300],
 | 99.00th=[  600], 99.50th=[  852], 99.90th=[ 1703], 99.95th=[ 2165],
 | 99.99th=[ 2937]
   bw (  KiB/s): min=  392, max=75824, per=30.46%, avg=42736.09,
stdev=12019.09, samples=1186
   iops: min=3, max=  500, avg=380.24, stdev=99.50, samples=1186
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.29%, 10=12.33%, 20=19.67%, 50=24.92%
  lat (msec)   : 100=17.51%, 250=18.05%, 500=5.72%, 750=0.85%, 1000=0.28%
  lat (msec)   : 2000=0.29%, >=2000=0.07%
  cpu  : usr=0.67%, sys=4.62%, ctx=215716, majf=0, minf=526
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, 

[PATCH v2] Btrfs: enhance raid1/10 balance heuristic

2017-12-28 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer balances requests to mirrors
based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - If one of the mirrors is non-rotational, repick it as optimal
 - If another mirror has a shorter queue than the optimal one,
   repick that mirror

To avoid round-robin request balancing,
round down the queue length (see the small example right below):
 - by 8 for rotational devs
 - by 2 for all non-rotational devs
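
As a small worked example (user-space sketch, not kernel code), with a round-down factor of 8 the in-flight counts 0..7 all fall into the same bucket, so two HDD mirrors with e.g. 3 and 6 requests in flight still look equally loaded and the pid-based choice is kept:

#include <stdio.h>

/* same power-of-two rounding the patch gets from ALIGN_DOWN() */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	int q;

	for (q = 0; q < 20; q++)
		printf("inflight=%2d -> hdd bucket=%2d, ssd bucket=%2d\n",
		       q, ALIGN_DOWN(q, 8), ALIGN_DOWN(q, 2));
	return 0;
}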

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get the queue length
- Move guess code to guess_optimal()
- Change balancer logic: use pid % num_mirrors by default,
  and only rebalance between spinning rust mirrors if one of the
  underlying devices is overloaded

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 116 -
 2 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..a77426a7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct 
*part,
atomic_read(&part->in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9a04245003ab..1c84534df9a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5216,6 +5217,112 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, big for HDD and small for SSD, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+* Try to prevent switching mirrors on every sneeze
+* by rounding the output down by some value
+*/
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal is expected to be pid % num_stripes
+ *
+ * That's generally ok for spreading load
+ * Add some balancing based on per-device queue length
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if the load of the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *one as optimal and repick if the other dev has a significantly
+ *shorter queue
+ *  - Repick optimal if the queue of the other mirror is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int num = map->num_stripes;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For mixed case, pick non rotational dev as optimal */
+   if (all_bdev_rotate == all_bdev_nonrot) {
+   for (i = 0; i < num; i++) {
+   if (is_nonrot[i])
+   optimal = i;
+   }
+   }

Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices

2017-12-28 Thread Timofey Titovets
2017-12-28 11:06 GMT+03:00 Dmitrii Tcvetkov <demfl...@demfloro.ru>:
> On Thu, 28 Dec 2017 01:39:31 +0300
> Timofey Titovets <nefelim...@gmail.com> wrote:
>
>> Currently the btrfs raid1/10 balancer balances requests to mirrors
>> based on pid % number of mirrors.
>>
>> Update the logic to make it aware of whether the underlying device is non-rotational.
>>
>> If one of the mirrors is non-rotational, then all read requests will be moved to
>> the non-rotational device.
>>
>> If both mirrors are non-rotational, calculate the sum of
>> pending and in-flight requests for the queue on each bdev and use the
>> device with the shortest queue.
>>
>> P.S.
>> Inspired by md-raid1 read balancing
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/volumes.c | 59
>> ++ 1 file changed, 59
>> insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 9a04245003ab..98bc2433a920 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info
>> *fs_info, u64 logical, u64 len) return ret;
>>  }
>>
>> +static inline int bdev_get_queue_len(struct block_device *bdev)
>> +{
>> + int sum = 0;
>> + struct request_queue *rq = bdev_get_queue(bdev);
>> +
>> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
>> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
>> +
>
> This won't work as expected if bdev is controlled by blk-mq, these
> counters will be zero. AFAIK to get this info in block layer agnostic way
> part_in_flight[1] has to be used. It extracts these counters approriately.
>
> But it needs to be EXPORT_SYMBOL()'ed in block/genhd.c so we can continue
> to build btrfs as module.
>
>> + /*
>> +  * Try prevent switch for every sneeze
>> +  * By roundup output num by 2
>> +  */
>> + return ALIGN(sum, 2);
>> +}
>> +
>>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   struct map_lookup *map, int first, int num,
>>   int optimal, int dev_replace_is_ongoing)
>>  {
>>   int i;
>>   int tolerance;
>> + struct block_device *bdev;
>>   struct btrfs_device *srcdev;
>> + bool all_bdev_nonrot = true;
>>
>>   if (dev_replace_is_ongoing &&
>>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info
>> *fs_info, else
>>   srcdev = NULL;
>>
>> + /*
>> +  * Optimal is expected to be pid % num
>> +  * That's generally ok for spinning rust drives
>> +  * But if one of the mirrors is non-rotational,
>> +  * that bdev can show better performance
>> +  *
>> +  * if one of the disks is non-rotational:
>> +  *  - set optimal to the non-rotational device
>> +  * if both disks are non-rotational:
>> +  *  - set optimal to the bdev with the shortest queue
>> +  * If both disks are spinning rust:
>> +  *  - leave the old pid % num
>> +  */
>> + for (i = 0; i < num; i++) {
>> + bdev = map->stripes[i].dev->bdev;
>> + if (!bdev)
>> + continue;
>> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
>> + optimal = i;
>> + else
>> + all_bdev_nonrot = false;
>> + }
>> +
>> + if (all_bdev_nonrot) {
>> + int qlen;
>> + /* Force the following logic's choice by initializing with some big
>> number */
>> + int optimal_dev_rq_count = 1 << 24;
>
> Probably better to use INT_MAX macro instead.
>
> [1] https://elixir.free-electrons.com/linux/v4.15-rc5/source/block/genhd.c#L68
>
> --

Thank you very much!

-- 
Have a nice day,
Timofey.
--


[PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2017-12-28 Thread Timofey Titovets
Insertion sort generally performs better than bubble sort
by needing fewer iterations on average.
This version also places each element at its final position
instead of doing raw swaps.

I'm not sure how many stripes per bio btrfs raid56
tries to store (and sort).

So this is also a bit shorter, just in the name of great justice.
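
For illustration, a stand-alone user-space sketch of the same idea as the kernel diff below (hypothetical data; the parity stripe is marked by the largest key so it sorts last):

#include <stdio.h>

static int parity_smaller(unsigned long long a, unsigned long long b)
{
	return a > b;
}

int main(void)
{
	/* ~0ULL stands in for the parity/syndrome raid_map value */
	unsigned long long raid_map[] = { 3, 1, ~0ULL, 2 };
	int n = 4, i, j;

	for (i = 1; i < n; i++) {
		unsigned long long l = raid_map[i];

		for (j = i - 1; j >= 0; j--) {
			if (!parity_smaller(raid_map[j], l))
				break;
			raid_map[j + 1] = raid_map[j];
		}
		raid_map[j + 1] = l;
	}

	for (i = 0; i < n; i++)
		printf("%llu ", raid_map[i]);	/* prints 1 2 3 <parity> */
	printf("\n");
	return 0;
}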

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/volumes.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 98bc2433a920..7195fc8c49b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5317,29 +5317,24 @@ static inline int parity_smaller(u64 a, u64 b)
return a > b;
 }
 
-/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
+/* Insertion-sort the stripe set to put the parity/syndrome stripes last */
 static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes)
 {
struct btrfs_bio_stripe s;
-   int i;
+   int i, j;
u64 l;
-   int again = 1;
 
-   while (again) {
-   again = 0;
-   for (i = 0; i < num_stripes - 1; i++) {
-   if (parity_smaller(bbio->raid_map[i],
-  bbio->raid_map[i+1])) {
-   s = bbio->stripes[i];
-   l = bbio->raid_map[i];
-   bbio->stripes[i] = bbio->stripes[i+1];
-   bbio->raid_map[i] = bbio->raid_map[i+1];
-   bbio->stripes[i+1] = s;
-   bbio->raid_map[i+1] = l;
-
-   again = 1;
-   }
+   for (i = 1; i < num_stripes; i++) {
+   s = bbio->stripes[i];
+   l = bbio->raid_map[i];
+   for (j = i - 1; j >= 0; j--) {
+   if (!parity_smaller(bbio->raid_map[j], l))
+   break;
+   bbio->stripes[j+1]  = bbio->stripes[j];
+   bbio->raid_map[j+1] = bbio->raid_map[j];
}
+   bbio->stripes[j+1]  = s;
+   bbio->raid_map[j+1] = l;
}
 }
 
-- 
2.15.1
--


[PATCH 2/2] Btrfs: compression: replace workspace management code by mempool API

2017-12-23 Thread Timofey Titovets
Mostly a cleanup of the old code, replacing
the old API with the new one.

1. Drop the old linked-list based approach

2. Replace all ERR_PTR(-ENOMEM) with NULL, as the mempool code
   only understands NULL

3. mempool calls the alloc methods on create/resize,
   so to be safe, move memalloc_nofs_{save,restore} to the
   appropriate places

4. Update btrfs_compress_op to use void *ws instead of list_head *ws

5. LZO: more aggressive use of kmalloc for the order-1 allocation,
   for a more aggressive fallback to the mempool

6. Refactor the alloc functions to check every allocation,
   because the mempool flags are aggressive and allocations can fail more
   frequently.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 213 +
 fs/btrfs/compression.h |  12 +--
 fs/btrfs/lzo.c |  64 +--
 fs/btrfs/zlib.c|  56 +++--
 fs/btrfs/zstd.c|  49 +++-
 5 files changed, 148 insertions(+), 246 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 02bd60357f04..869df3f5bd1b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -811,23 +811,12 @@ static void *heuristic_ws_alloc(gfp_t gfp_mask, void 
*pool_data)
return NULL;
 }
 
-struct workspaces_list {
-   struct list_head idle_ws;
-   spinlock_t ws_lock;
-   /* Number of free workspaces */
-   int free_ws;
-   /* Total number of allocated workspaces */
-   atomic_t total_ws;
-   /* Waiters for a free workspace */
-   wait_queue_head_t ws_wait;
-};
-
 struct workspace_stor {
mempool_t *pool;
 };
 
 static struct workspace_stor btrfs_heuristic_ws_stor;
-static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
+static struct workspace_stor btrfs_comp_stor[BTRFS_COMPRESS_TYPES];
 
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
_zlib_compress,
@@ -837,14 +826,14 @@ static const struct btrfs_compress_op * const 
btrfs_compress_op[] = {
 
 void __init btrfs_init_compress(void)
 {
-   struct list_head *workspace;
int i;
-   mempool_t *pool = btrfs_heuristic_ws_stor.pool;
+   mempool_t *pool;
 
/*
 * Preallocate one workspace for heuristic so
 * we can guarantee forward progress in the worst case
 */
+   pool = btrfs_heuristic_ws_stor.pool;
pool = mempool_create(1, heuristic_ws_alloc,
 heuristic_ws_free, NULL);
 
@@ -852,23 +841,17 @@ void __init btrfs_init_compress(void)
pr_warn("BTRFS: cannot preallocate heuristic workspace, will 
try later\n");
 
for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
-   INIT_LIST_HEAD(_comp_ws[i].idle_ws);
-   spin_lock_init(_comp_ws[i].ws_lock);
-   atomic_set(_comp_ws[i].total_ws, 0);
-   init_waitqueue_head(_comp_ws[i].ws_wait);
-
+   pool = btrfs_comp_stor[i].pool;
/*
 * Preallocate one workspace for each compression type so
 * we can guarantee forward progress in the worst case
 */
-   workspace = btrfs_compress_op[i]->alloc_workspace();
-   if (IS_ERR(workspace)) {
+   pool = mempool_create(1, btrfs_compress_op[i]->alloc_workspace,
+ btrfs_compress_op[i]->free_workspace,
+ NULL);
+
+   if (pool == NULL)
pr_warn("BTRFS: cannot preallocate compression 
workspace, will try later\n");
-   } else {
-   atomic_set(_comp_ws[i].total_ws, 1);
-   btrfs_comp_ws[i].free_ws = 1;
-   list_add(workspace, _comp_ws[i].idle_ws);
-   }
}
 }
 
@@ -881,6 +864,7 @@ static void *mempool_alloc_wrap(struct workspace_stor *stor)
int ncpu = num_online_cpus();
 
while (unlikely(stor->pool == NULL)) {
+   int i;
mempool_t *pool;
void *(*ws_alloc)(gfp_t gfp_mask, void *pool_data);
void (*ws_free)(void *element, void *pool_data);
@@ -888,6 +872,13 @@ static void *mempool_alloc_wrap(struct workspace_stor 
*stor)
if (stor == _heuristic_ws_stor) {
ws_alloc = heuristic_ws_alloc;
ws_free  = heuristic_ws_free;
+   } else {
+   for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
+   if (stor == _comp_stor[i])
+   break;
+   }
+   ws_alloc = btrfs_compress_op[i]->alloc_workspace;
+   ws_free  = btrfs_compress_op[i]->free_workspace;
}
 
pool = mempool_create(1, ws_alloc, ws_free, NULL);
@@ -915,7 +906,12 @@ static void *mempool_alloc_wrap(struct workspace_s

[PATCH 0/2] Btrfs: heuristic/compression convert workspace memory cache

2017-12-23 Thread Timofey Titovets
Attempt to simplify/clean up the compression code.
Lightly tested under high memory pressure.
At least everything looks like it works as expected.

The first patch includes preparation work for
replacing the old linked-list based approach
with a new one based on the mempool API.
It converts only one part as a proof of concept: heuristic memory management.
It defines the usage pattern and mempool_alloc_wrap(), which handles
pool resizing and pool init errors.

The second patch moves zlib/lzo/zstd to the new mempool API.

Timofey Titovets (2):
  Btrfs: heuristic: replace workspace management code by mempool API
  Btrfs: compression: replace workspace management code by mempool API

 fs/btrfs/compression.c | 332 -
 fs/btrfs/compression.h |  12 +-
 fs/btrfs/lzo.c |  64 ++
 fs/btrfs/zlib.c|  56 +
 fs/btrfs/zstd.c|  49 +---
 5 files changed, 215 insertions(+), 298 deletions(-)

-- 
2.15.1
--


[PATCH 1/2] Btrfs: heuristic: replace workspace management code by mempool API

2017-12-23 Thread Timofey Titovets
Currently the compression code has a custom workspace/memory cache
to guarantee forward progress under high memory pressure.

That API can be replaced with the mempool API, which guarantees the same.
The main goal is to simplify/clean up the code and replace it with a
general solution.

I try to avoid using atomic/lock/wait primitives directly,
as all of that is already hidden in the mempool API.
The only thing that must be race safe is the initialization of the
mempool.

So I created a simple mempool_alloc_wrap(), which handles
mempool_create() failures and synchronizes threads with cmpxchg()
on the mempool_t pointer.

Another logic difference between our custom code and mempool:
 - the workspace find/free path mostly reuses current workspaces
   whenever possible.
 - mempool uses the provided alloc/free helpers with more
   aggressive use of __GFP_NOMEMALLOC, __GFP_NORETRY, __GFP_NOWARN,
   and only uses the preallocated space when memory gets tight.

Not sure which approach is better, but simple stress tests with
writing data to a compressed fs on a ramdisk show a negligible difference
on an 8-CPU virtual machine with an Intel Xeon E5-2420 0 @ 1.90GHz (+-1%).

Other changes needed to use the mempool API (a minimal usage sketch follows
below):
 - memalloc_nofs_{save,restore} moves to each place where kvmalloc
   will be used in the call chain.
 - mempool_create() returns a pointer to the mempool or NULL,
   no error pointer, so macros like IS_ERR(ptr) can't be used.
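
A minimal sketch of the mempool pattern relied on here (illustrative only, with a hypothetical fixed-size workspace; this is not the btrfs code itself):

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

static mempool_t *ws_pool;

static void *ws_alloc(gfp_t gfp_mask, void *pool_data)
{
	return kzalloc(128, gfp_mask);	/* hypothetical workspace size */
}

static void ws_free(void *element, void *pool_data)
{
	kfree(element);
}

static int __init ws_pool_init(void)
{
	/* one preallocated element guarantees forward progress */
	ws_pool = mempool_create(1, ws_alloc, ws_free, NULL);
	return ws_pool ? 0 : -ENOMEM;	/* NULL on failure, not ERR_PTR */
}

static void ws_pool_use(void)
{
	void *ws = mempool_alloc(ws_pool, GFP_NOFS);

	/* ... use the workspace ... */
	mempool_free(ws, ws_pool);
}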

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 197 ++---
 1 file changed, 106 insertions(+), 91 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 208334aa6c6e..02bd60357f04 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -768,46 +769,46 @@ struct heuristic_ws {
struct bucket_item *bucket;
/* Sorting buffer */
struct bucket_item *bucket_b;
-   struct list_head list;
 };
 
-static void free_heuristic_ws(struct list_head *ws)
+static void heuristic_ws_free(void *element, void *pool_data)
 {
-   struct heuristic_ws *workspace;
+   struct heuristic_ws *ws = (struct heuristic_ws *) element;
 
-   workspace = list_entry(ws, struct heuristic_ws, list);
-
-   kvfree(workspace->sample);
-   kfree(workspace->bucket);
-   kfree(workspace->bucket_b);
-   kfree(workspace);
+   kfree(ws->sample);
+   kfree(ws->bucket);
+   kfree(ws->bucket_b);
+   kfree(ws);
 }
 
-static struct list_head *alloc_heuristic_ws(void)
+static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data)
 {
-   struct heuristic_ws *ws;
+   struct heuristic_ws *ws = kzalloc(sizeof(*ws), gfp_mask);
 
-   ws = kzalloc(sizeof(*ws), GFP_KERNEL);
if (!ws)
-   return ERR_PTR(-ENOMEM);
+   return NULL;
 
-   ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL);
+   /*
+* We can handle allocation failures and
+* slab have caches for 8192 byte allocations
+*/
+   ws->sample = kmalloc(MAX_SAMPLE_SIZE, gfp_mask);
if (!ws->sample)
goto fail;
 
-   ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL);
+   ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), gfp_mask);
if (!ws->bucket)
goto fail;
 
-   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
+   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), gfp_mask);
if (!ws->bucket_b)
goto fail;
 
-   INIT_LIST_HEAD(>list);
-   return >list;
+   return ws;
+
 fail:
-   free_heuristic_ws(>list);
-   return ERR_PTR(-ENOMEM);
+   heuristic_ws_free(ws, NULL);
+   return NULL;
 }
 
 struct workspaces_list {
@@ -821,9 +822,12 @@ struct workspaces_list {
wait_queue_head_t ws_wait;
 };
 
-static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
+struct workspace_stor {
+   mempool_t *pool;
+};
 
-static struct workspaces_list btrfs_heuristic_ws;
+static struct workspace_stor btrfs_heuristic_ws_stor;
+static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
 
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
_zlib_compress,
@@ -835,21 +839,17 @@ void __init btrfs_init_compress(void)
 {
struct list_head *workspace;
int i;
+   mempool_t *pool = btrfs_heuristic_ws_stor.pool;
 
-   INIT_LIST_HEAD(_heuristic_ws.idle_ws);
-   spin_lock_init(_heuristic_ws.ws_lock);
-   atomic_set(_heuristic_ws.total_ws, 0);
-   init_waitqueue_head(_heuristic_ws.ws_wait);
+   /*
+* Preallocate one workspace for heuristic so
+* we can guarantee forward progress in the worst case
+*/
+   pool = mempool_create(1, heuristic_ws_alloc,
+  

[RFC PATCH] Btrfs: replace custom heuristic ws allocation logic with mempool API

2017-12-22 Thread Timofey Titovets
Currently the btrfs compression code uses a custom wrapper
to store the allocated compression/heuristic workspaces.

That logic tries to store at least ncpu+1 workspaces of each type.

As far as I can see, that logic fully reimplements the
mempool API.
So I think that using the mempool API can simplify the code
and allow cleaning it up.

This is a proof of concept patch, I have tested it (at least it works),
and the future version will look mostly the same.

If that is acceptable,
the next steps will be:
1. Create mempool_alloc_w()
   that will resize the mempool to the appropriate size of ncpu+1
   and will create the appropriate mempool if creation failed in __init
   (see the sketch after this list).

2. Convert the per-compression-type workspaces to mempool.
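
A rough sketch of what step 1 could look like (assumptions: the existing heuristic_ws_alloc()/heuristic_ws_free() helpers and direct access to pool->min_nr; not a tested implementation, and the late-creation path would still need the cmpxchg() synchronization described in the later versions of this series):

static void *mempool_alloc_w(mempool_t **pool, gfp_t gfp)
{
	/* create the pool late if the __init allocation failed */
	if (unlikely(!*pool))
		*pool = mempool_create(1, heuristic_ws_alloc,
				       heuristic_ws_free, NULL);
	if (!*pool)
		return NULL;

	/* grow the reserve towards ncpu + 1 */
	if ((*pool)->min_nr < num_online_cpus() + 1)
		mempool_resize(*pool, num_online_cpus() + 1);

	return mempool_alloc(*pool, gfp);
}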

Thanks.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
Cc: David Sterba <dste...@suse.com>
---
 fs/btrfs/compression.c | 123 -
 1 file changed, 39 insertions(+), 84 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 208334aa6c6e..cf47089b9ec0 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -768,14 +769,11 @@ struct heuristic_ws {
struct bucket_item *bucket;
/* Sorting buffer */
struct bucket_item *bucket_b;
-   struct list_head list;
 };
 
-static void free_heuristic_ws(struct list_head *ws)
+static void heuristic_ws_free(void *element, void *pool_data)
 {
-   struct heuristic_ws *workspace;
-
-   workspace = list_entry(ws, struct heuristic_ws, list);
+   struct heuristic_ws *workspace = (struct heuristic_ws *) element;
 
kvfree(workspace->sample);
kfree(workspace->bucket);
@@ -783,13 +781,12 @@ static void free_heuristic_ws(struct list_head *ws)
kfree(workspace);
 }
 
-static struct list_head *alloc_heuristic_ws(void)
+static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data)
 {
-   struct heuristic_ws *ws;
+   struct heuristic_ws *ws = kmalloc(sizeof(*ws), GFP_KERNEL);
 
-   ws = kzalloc(sizeof(*ws), GFP_KERNEL);
if (!ws)
-   return ERR_PTR(-ENOMEM);
+   return ws;
 
ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL);
if (!ws->sample)
@@ -803,11 +800,14 @@ static struct list_head *alloc_heuristic_ws(void)
if (!ws->bucket_b)
goto fail;
 
-   INIT_LIST_HEAD(>list);
-   return >list;
+   return ws;
+
 fail:
-   free_heuristic_ws(>list);
-   return ERR_PTR(-ENOMEM);
+   kvfree(ws->sample);
+   kfree(ws->bucket);
+   kfree(ws->bucket_b);
+   kfree(ws);
+   return NULL;
 }
 
 struct workspaces_list {
@@ -821,10 +821,9 @@ struct workspaces_list {
wait_queue_head_t ws_wait;
 };
 
+static mempool_t *btrfs_heuristic_ws_pool;
 static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
 
-static struct workspaces_list btrfs_heuristic_ws;
-
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
_zlib_compress,
_lzo_compress,
@@ -836,20 +835,15 @@ void __init btrfs_init_compress(void)
struct list_head *workspace;
int i;
 
-   INIT_LIST_HEAD(_heuristic_ws.idle_ws);
-   spin_lock_init(_heuristic_ws.ws_lock);
-   atomic_set(_heuristic_ws.total_ws, 0);
-   init_waitqueue_head(_heuristic_ws.ws_wait);
+   /*
+* Try preallocate pool with minimum size for successful
+* initialization of btrfs module
+*/
+   btrfs_heuristic_ws_pool = mempool_create(1, heuristic_ws_alloc,
+   heuristic_ws_free, NULL);
 
-   workspace = alloc_heuristic_ws();
-   if (IS_ERR(workspace)) {
-   pr_warn(
-   "BTRFS: cannot preallocate heuristic workspace, will try later\n");
-   } else {
-   atomic_set(_heuristic_ws.total_ws, 1);
-   btrfs_heuristic_ws.free_ws = 1;
-   list_add(workspace, _heuristic_ws.idle_ws);
-   }
+   if (IS_ERR(btrfs_heuristic_ws_pool))
+   pr_warn("BTRFS: cannot preallocate heuristic workspace, will 
try later\n");
 
for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
INIT_LIST_HEAD(_comp_ws[i].idle_ws);
@@ -878,7 +872,7 @@ void __init btrfs_init_compress(void)
  * Preallocation makes a forward progress guarantees and we do not return
  * errors.
  */
-static struct list_head *__find_workspace(int type, bool heuristic)
+static struct list_head *find_workspace(int type)
 {
struct list_head *workspace;
int cpus = num_online_cpus();
@@ -890,19 +884,11 @@ static struct list_head *__find_workspace(int type, bool 
heuristic)
wait_queue_head_t *ws_wait;
int *free_ws;
 
-   if (heuristic) {
-   idle_ws  = _heuristic_ws.idle_ws;
-   ws_lock  = _heuristic_ws.ws_lock;
-

Btrfs allows compression on NoDataCow files? (AFAIK it should not, but it does)

2017-12-20 Thread Timofey Titovets
How to reproduce:
touch test_file
chattr +C test_file
dd if=/dev/zero of=test_file bs=1M count=1
btrfs fi def -vrczlib test_file
filefrag -v test_file

test_file
Filesystem type is: 9123683e
File size of test_file is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
  0:0..  31:   72917050..  72917081: 32: encoded
  1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
  2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
  3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
  4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
  5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
  6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
  7:  224.. 255:   72967084..  72967115: 32:   72944984:
last,encoded,eof
test_file: 8 extents found

I can't find right now where that error happens in the code,
but it's reproducible on Linux 4.14.8.
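
One possible place for a guard, as an untested assumption on my side, would be inode_need_compress(), something like:

	/* hypothetical, untested: never compress NOCOW inodes */
	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)
		return 0;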

Thanks.

-- 
Have a nice day,
Timofey.
--


[PATCH 1/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2017-12-19 Thread Timofey Titovets
At the moment btrfs_dedupe_file_range() is restricted to a 16MiB range to
limit locking time and the memory requirement of the dedup ioctl().

For a too large input range the code silently clamps the range to 16MiB.

Let's remove that restriction by iterating over the dedup range
(see the sketch below the changelog).
That's backward compatible and will not change anything for requests
smaller than 16MiB.

Changes:
  v1 -> v2:
- Refactor btrfs_cmp_data_prepare and btrfs_extent_same
- Store memory of pages array between iterations
- Lock inodes once, not on each iteration
- Small inplace cleanups
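
An illustrative sketch of the iteration idea (hypothetical helper name; in the diff below the per-chunk work is done by __btrfs_extent_same() with a prepared cmp_pages):

/* dedupe_one_chunk() stands in for the real per-chunk work */
static int dedupe_one_chunk(struct inode *src, u64 loff, u64 len,
			    struct inode *dst, u64 dst_loff);

static int dedupe_range_chunked(struct inode *src, u64 loff, u64 olen,
				struct inode *dst, u64 dst_loff)
{
	u64 i, tail, chunks = olen / BTRFS_MAX_DEDUPE_LEN;
	int ret = 0;

	for (i = 0; i < chunks; i++) {
		ret = dedupe_one_chunk(src, loff, BTRFS_MAX_DEDUPE_LEN,
				       dst, dst_loff);
		if (ret)
			return ret;
		loff += BTRFS_MAX_DEDUPE_LEN;
		dst_loff += BTRFS_MAX_DEDUPE_LEN;
	}

	tail = olen % BTRFS_MAX_DEDUPE_LEN;
	if (tail)
		ret = dedupe_one_chunk(src, loff, tail, dst, dst_loff);

	return ret;
}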

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 160 ---
 1 file changed, 94 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index be5bd81b3669..45a47d0891fc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
put_page(pg);
}
}
-   kfree(cmp->src_pages);
-   kfree(cmp->dst_pages);
+
+   cmp->num_pages = 0;
 }
 
 static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
@@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
u64 loff,
  u64 len, struct cmp_pages *cmp)
 {
int ret;
-   int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
-   struct page **src_pgarr, **dst_pgarr;
-
-   /*
-* We must gather up all the pages before we initiate our
-* extent locking. We use an array for the page pointers. Size
-* of the array is bounded by len, which is in turn bounded by
-* BTRFS_MAX_DEDUPE_LEN.
-*/
-   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   if (!src_pgarr || !dst_pgarr) {
-   kfree(src_pgarr);
-   kfree(dst_pgarr);
-   return -ENOMEM;
-   }
-   cmp->num_pages = num_pages;
-   cmp->src_pages = src_pgarr;
-   cmp->dst_pages = dst_pgarr;
 
/*
 * If deduping ranges in the same inode, locking rules make it mandatory
 * to always lock pages in ascending order to avoid deadlocks with
 * concurrent tasks (such as starting writeback/delalloc).
 */
-   if (src == dst && dst_loff < loff) {
-   swap(src_pgarr, dst_pgarr);
+   if (src == dst && dst_loff < loff)
swap(loff, dst_loff);
-   }
 
-   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
+   cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+   ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
if (ret)
goto out;
 
-   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
+   ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
dst_loff);
 
 out:
if (ret)
@@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
return 0;
 }
 
-static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-struct inode *dst, u64 dst_loff)
+static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+  struct inode *dst, u64 dst_loff,
+  struct cmp_pages *cmp)
 {
int ret;
u64 len = olen;
-   struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
-   if (same_inode)
-   inode_lock(src);
-   else
-   btrfs_double_inode_lock(src, dst);
-
ret = extent_same_check_offsets(src, loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
ret = extent_same_check_offsets(dst, dst_loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode) {
/*
@@ -3119,32 +3092,21 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * allow an unaligned length so long as it ends at
 * i_size.
 */
-   if (len != olen) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (len != olen)
+   return -EINVAL;
 
/* Check for overlapping ranges */
-   if (dst_loff + len > loff && dst_loff < loff + len) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (dst_loff + len > loff && dst_loff < loff + len)
+   return -EINVAL;
 

[PATCH 3/4] Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation

2017-12-19 Thread Timofey Titovets
Currently the defrag ioctl only supports recompressing files with a specified
compression type.
Allow setting the compression type to none when calling defrag, and use
BTRFS_DEFRAG_RANGE_COMPRESS as the flag indicating that the user requests a
change of compression type.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/ioctl.c   | 17 ++---
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 63f0ccc92a71..9eb0c92ee4b4 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -187,6 +187,7 @@ struct btrfs_inode {
 * different from prop_compress and takes precedence if set
 */
unsigned defrag_compress;
+   unsigned change_compress;
 
struct btrfs_delayed_node *delayed_node;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46df5e2a64e7..7af8f1784788 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, 
u64 start, u64 end)
if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
return 1;
/* defrag ioctl */
-   if (BTRFS_I(inode)->defrag_compress)
-   return 1;
+   if (BTRFS_I(inode)->change_compress)
+   return BTRFS_I(inode)->defrag_compress;
/* bad compression ratios */
if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS)
return 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b29ea1f0f621..40f5e5678eac 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
unsigned long cluster = max_cluster;
u64 new_align = ~((u64)SZ_128K - 1);
struct page **pages = NULL;
-   bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
+   bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
 
if (isize == 0)
return 0;
@@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
if (range->start >= isize)
return -EINVAL;
 
-   if (do_compress) {
+   if (change_compress) {
if (range->compress_type > BTRFS_COMPRESS_TYPES)
return -EINVAL;
-   if (range->compress_type)
-   compress_type = range->compress_type;
+   compress_type = range->compress_type;
}
 
if (extent_thresh == 0)
@@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress,
+_end, change_compress,
 compress_type)){
unsigned long next;
/*
@@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
}
 
inode_lock(inode);
-   if (do_compress)
+   if (change_compress) {
+   BTRFS_I(inode)->change_compress = change_compress;
BTRFS_I(inode)->defrag_compress = compress_type;
+   }
+
ret = cluster_pages_for_defrag(inode, pages, i, cluster);
if (ret < 0) {
inode_unlock(inode);
@@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
ret = defrag_count;
 
 out_ra:
-   if (do_compress) {
+   if (change_compress) {
inode_lock(inode);
+   BTRFS_I(inode)->change_compress = 0;
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
inode_unlock(inode);
}
-- 
2.15.1
--


[PATCH 2/4] Btrfs: make should_defrag_range() understood compressed extents

2017-12-19 Thread Timofey Titovets
 Both the defrag ioctl and autodefrag call btrfs_defrag_file()
 for file defragmentation.

 The kernel default target extent size is 256KiB.
 The btrfs-progs default is 32MiB.

 Both are bigger than the maximum size of a compressed extent, 128KiB.
 That leads to rewriting all compressed data on disk.

 Fix that by checking compressed extents with different logic: a compressed
 extent that already has the 128KiB maximum length is not fragmented and
 does not need to be rewritten.

 In addition, make should_defrag_range() aware of the compressed extent type:
 if the requested target compression is the same as the current extent
 compression type, just don't recompress/rewrite the extents,
 to avoid useless recompression of compressed extents.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 45a47d0891fc..b29ea1f0f621 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, 
struct extent_map *em)
 
 static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
   u64 *last_len, u64 *skip, u64 *defrag_end,
-  int compress)
+  int compress, int compress_type)
 {
struct extent_map *em;
int ret = 1;
@@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 
start, u32 thresh,
 * real extent, don't bother defragging it
 */
if (!compress && (*last_len == 0 || *last_len >= thresh) &&
-   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
+   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
ret = 0;
+   goto out;
+   }
+
+
+   /*
+* Try not recompress compressed extents
+* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
+* recompress all compressed extents
+*/
+   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
+   if (!compress) {
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   } else {
+   if (em->compress_type != compress_type)
+   goto out;
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   }
+   }
+
 out:
/*
 * last_len ends up being a counter of how many bytes we've defragged.
@@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress)){
+_end, do_compress,
+compress_type)){
unsigned long next;
/*
 * the should_defrag function tells us how much to skip
-- 
2.15.1
--


[PATCH 0/4] Btrfs: just bunch of patches to ioctl.c

2017-12-19 Thread Timofey Titovets
1st patch: remove the 16MiB restriction from the extent_same ioctl()
by iterating over the passed range.

I did not see much difference in performance, so it just removes
a logical restriction.

2nd-3rd patches: update the defrag ioctl():
 - Fix the bad behaviour of fully rewriting all compressed
   extents in the defrag range (that also makes autodefrag on a compressed fs
   not so expensive)
 - Allow userspace to specify NONE as the target compression type,
   which allows users to uncompress files by defragmentation with btrfs-progs
 - Make the defrag ioctl aware of the requested compression type and the
   current compression type of extents, to make btrfs fi def -rc
   an idempotent operation.
   I.e. it is now possible to say "make all extents compressed with lzo",
   and btrfs will not recompress lzo-compressed data.
   Same for zlib, zstd, none.
   (patch to btrfs-progs in a PR on kdave's GitHub).

4th patch: reduce the size of struct btrfs_inode
 - btrfs_inode stores fields like prop_compress, defrag_compress and,
   after the 3rd patch, change_compress.
   They use unsigned as the type and take 12 bytes in total.
   But change_compress is a bit flag, and prop_compress/defrag_compress
   only store the compression type, which currently uses values 0-3 out
   of a 32-bit range.

   So make those vars bit-fields and reduce the size of btrfs_inode:
   1136 -> 1128.

Timofey Titovets (4):
  Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
  Btrfs: make should_defrag_range() understood compressed extents
  Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation
  Btrfs: reduce size of struct btrfs_inode

 fs/btrfs/btrfs_inode.h |   5 +-
 fs/btrfs/inode.c   |   4 +-
 fs/btrfs/ioctl.c   | 203 +++--
 3 files changed, 133 insertions(+), 79 deletions(-)

-- 
2.15.1
--


[PATCH 4/4] Btrfs: reduce size of struct btrfs_inode

2017-12-19 Thread Timofey Titovets
Currently btrfs_inode has a size of 1136 bytes (on x86_64).

struct btrfs_inode stores several vars related to the compression code,
and all of their states fit in 1 or 2 bits.

Let's declare bit-fields for the compression related vars to reduce
sizeof(struct btrfs_inode) to 1128 bytes (see the sketch below).
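
A self-contained sketch of the idea (illustrative struct names, not btrfs code):

#include <stdio.h>

struct compress_state_before {
	unsigned prop_compress;
	unsigned defrag_compress;
	unsigned change_compress;
};

struct compress_state_after {
	unsigned prop_compress : 2;	/* compression type fits in 2 bits */
	unsigned defrag_compress : 2;
	unsigned change_compress : 1;	/* plain flag */
};

int main(void)
{
	/* typically prints 12 and 4 on x86_64 */
	printf("%zu %zu\n", sizeof(struct compress_state_before),
	       sizeof(struct compress_state_after));
	return 0;
}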

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/btrfs_inode.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9eb0c92ee4b4..9d29d7e68757 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -181,13 +181,13 @@ struct btrfs_inode {
/*
 * Cached values of inode properties
 */
-   unsigned prop_compress; /* per-file compression algorithm */
+   unsigned prop_compress : 2; /* per-file compression algorithm */
/*
 * Force compression on the file using the defrag ioctl, could be
 * different from prop_compress and takes precedence if set
 */
-   unsigned defrag_compress;
-   unsigned change_compress;
+   unsigned defrag_compress : 2;
+   unsigned change_compress : 1;
 
struct btrfs_delayed_node *delayed_node;
 
-- 
2.15.1
--


[PATCH] Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation

2017-12-15 Thread Timofey Titovets
Currently the defrag ioctl only supports compressing files with a specified
compression type. Allow setting the compression type to none when calling defrag.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/ioctl.c   | 17 ++---
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 63f0ccc92a71..9eb0c92ee4b4 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -187,6 +187,7 @@ struct btrfs_inode {
 * different from prop_compress and takes precedence if set
 */
unsigned defrag_compress;
+   unsigned change_compress;
 
struct btrfs_delayed_node *delayed_node;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46df5e2a64e7..7af8f1784788 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, 
u64 start, u64 end)
if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
return 1;
/* defrag ioctl */
-   if (BTRFS_I(inode)->defrag_compress)
-   return 1;
+   if (BTRFS_I(inode)->change_compress)
+   return BTRFS_I(inode)->defrag_compress;
/* bad compression ratios */
if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS)
return 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 12d4fa5d6dec..b777c8f53153 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
unsigned long cluster = max_cluster;
u64 new_align = ~((u64)SZ_128K - 1);
struct page **pages = NULL;
-   bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
+   bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
 
if (isize == 0)
return 0;
@@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
if (range->start >= isize)
return -EINVAL;
 
-   if (do_compress) {
+   if (change_compress) {
if (range->compress_type > BTRFS_COMPRESS_TYPES)
return -EINVAL;
-   if (range->compress_type)
-   compress_type = range->compress_type;
+   compress_type = range->compress_type;
}
 
if (extent_thresh == 0)
@@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress,
+_end, change_compress,
 compress_type)){
unsigned long next;
/*
@@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
}
 
inode_lock(inode);
-   if (do_compress)
+   if (change_compress) {
+   BTRFS_I(inode)->change_compress = change_compress;
BTRFS_I(inode)->defrag_compress = compress_type;
+   }
+
ret = cluster_pages_for_defrag(inode, pages, i, cluster);
if (ret < 0) {
inode_unlock(inode);
@@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
ret = defrag_count;
 
 out_ra:
-   if (do_compress) {
+   if (change_compress) {
inode_lock(inode);
+   BTRFS_I(inode)->change_compress = 0;
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
inode_unlock(inode);
}
-- 
2.15.1

--


Re: [PATCH] Btrfs: make should_defrag_range() understood compressed extents

2017-12-15 Thread Timofey Titovets
Also,
in theory that breaks the following case:
when the fs is mounted with compress=no, you can uncompress data with btrfs fi
def -vr , because of a "bug" in the logic/defaults.
And the man page says that btrfs "Currently it's not possible to select no
compression. See also section EXAMPLES."

I have two simple patches that can provide a way (one for btrfs-progs
and one for btrfs.ko)
to uncompress data on an fs mounted with compress=no, by running btrfs fi def
-vrcnone 
But the behavior on an fs mounted with compress will be to recompress data with
the algorithm selected by "compress",
because of the inode_need_compress() logic.

In theory that also must be fixed.


2017-12-14 16:37 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> Compile tested and "battle" tested
>
> 2017-12-14 16:35 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
>> Both, defrag ioctl and autodefrag - call btrfs_defrag_file()
>> for file defragmentation.
>>
>> Kernel target extent size default is 256KiB
>> Btrfs progs by default, use 32MiB.
>>
>> Both bigger then max (not fragmented) compressed extent size 128KiB.
>> That lead to rewrite all compressed data on disk.
>>
>> Fix that and also make should_defrag_range() understood
>> if requested target compression are same as current extent compression type.
>> To avoid useless recompression of compressed extents.
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/ioctl.c | 28 +---
>>  1 file changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index be5bd81b3669..12d4fa5d6dec 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode 
>> *inode, struct extent_map *em)
>>
>>  static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
>>u64 *last_len, u64 *skip, u64 *defrag_end,
>> -  int compress)
>> +  int compress, int compress_type)
>>  {
>> struct extent_map *em;
>> int ret = 1;
>> @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, 
>> u64 start, u32 thresh,
>>  * real extent, don't bother defragging it
>>  */
>> if (!compress && (*last_len == 0 || *last_len >= thresh) &&
>> -   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
>> +   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
>> ret = 0;
>> +   goto out;
>> +   }
>> +
>> +
>> +   /*
>> +* Try not recompress compressed extents
>> +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
>> +* recompress all compressed extents
>> +*/
>> +   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
>> +   if (!compress) {
>> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
>> +   ret = 0;
>> +   } else {
>> +   if (em->compress_type != compress_type)
>> +   goto out;
>> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
>> +   ret = 0;
>> +   }
>> +   }
>> +
>>  out:
>> /*
>>  * last_len ends up being a counter of how many bytes we've 
>> defragged.
>> @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
>> *file,
>>
>> if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
>>  extent_thresh, _len, ,
>> -_end, do_compress)){
>> +_end, do_compress,
>> +compress_type)){
>> unsigned long next;
>> /*
>>  * the should_defrag function tells us how much to 
>> skip
>> --
>> 2.15.1
>>
>
>
>
> --
> Have a nice day,
> Timofey.



-- 
Have a nice day,
Timofey.
--


Re: [BUG?] Defrag on compressed FS do massive data rewrites

2017-12-14 Thread Timofey Titovets
I sent:
[PATCH] Btrfs: make should_defrag_range() understood compressed extents

If you want you can test that, it fixes btrfs fi def and autodefrag
behavior with compressed data.
It also understands when the user tries to recompress data with an old/new
compression algorithm.

2017-12-14 14:27 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> 2017-12-14 8:58 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
>> Timofey Titovets posted on Thu, 14 Dec 2017 02:05:35 +0300 as excerpted:
>>
>>> Also, same problem exist for autodefrag case i.e.:
>>> write 4KiB at start of compressed file autodefrag code add that file to
>>> autodefrag queue, call btrfs_defrag_file, set range from start to u64-1.
>>> That will trigger to full file rewrite, as all extents are smaller then
>>> 256KiB.
>>>
>>> (if i understood all correctly).
>>
>> If so, it's rather ironic, because that's how I believed autodefrag to
>> work, whole-file, for quite some time.  Then I learned otherwise, but I
>> always enable both autodefrag and compress=lzo on all my btrfs, so it
>> looks like at least for my use-case, I was correct with the whole-file
>> assumption after all.  (Well, at least for files that are actually
>> compressed, I don't run compress-force=lzo, just compress=lzo, so a
>> reasonable number of files aren't actually compressed anyway, and they'd
>> do the partial-file rewrite I had learned to be normal.)
>>
>> --
>> Duncan - List replies preferred.   No HTML msgs.
>> "Every nonfree program has a lord, a master --
>> and if you use the program, he is your master."  Richard Stallman
>>
>> --
>
> btrfs fi def can easy avoid that problem if properly documented,
> but i think that fix must be in place, because "full rewrite" by
> autodefrag are unacceptable behaviour.
>
> How i see, "How it works":
> Both, defrag ioctl and autodefrag - call btrfs_defrag_file() (ioctl.c).
> btrfs_defrag_file() set extent_thresh to args, if args not provided
> (autodefrag not initialize range->extent_thresh);
> use default:
> if (extent_thresh == 0)
>  extent_thresh = SZ_256K;
>
> Later btrfs_defrag_file() try defrag file from start by "index" (page
> number from start, file virtually splitted to page sized blocks).
> Than it call should_defrag_range(), if need make i+1 or skip range by
> info from should_defrag_range().
>
> should_defrag_range() get extent for specified start offset:
> em = defrag_lookup_extent(inode, start);
>
> Later (em->len >= thresh || (!next_mergeable && !prev_mergeable)) will
> fail condition because len (<128KiB) < thresh (256KiB or 32MiB usual).
> So extent will be rewritten.
>
> struct extent_map{}; have two potential useful info:
>
> ...
> u64 len;
> ...
>
> unsigned int compress_type;
> ...
>
> As i see by code len store "real" length, so
> in theory that just need to add additional check later like:
> if (em->len = BTRFS_MAX_UNCOMPRESSED && em->compress_type > 0)
> ret = 0;
>
> That must fix problem by "true" insight check for compressed extents.
>
> Thanks.
>
> P.S.
> (May be someone from more experienced devs can comment that?)
>
> --
> Have a nice day,
> Timofey.



-- 
Have a nice day,
Timofey.
--


Re: [PATCH] Btrfs: make should_defrag_range() understood compressed extents

2017-12-14 Thread Timofey Titovets
Compile tested and "battle" tested

2017-12-14 16:35 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> Both, defrag ioctl and autodefrag - call btrfs_defrag_file()
> for file defragmentation.
>
> Kernel target extent size default is 256KiB
> Btrfs progs by default, use 32MiB.
>
> Both bigger then max (not fragmented) compressed extent size 128KiB.
> That lead to rewrite all compressed data on disk.
>
> Fix that and also make should_defrag_range() understood
> if requested target compression are same as current extent compression type.
> To avoid useless recompression of compressed extents.
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/ioctl.c | 28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index be5bd81b3669..12d4fa5d6dec 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode 
> *inode, struct extent_map *em)
>
>  static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
>u64 *last_len, u64 *skip, u64 *defrag_end,
> -  int compress)
> +  int compress, int compress_type)
>  {
> struct extent_map *em;
> int ret = 1;
> @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, 
> u64 start, u32 thresh,
>  * real extent, don't bother defragging it
>  */
> if (!compress && (*last_len == 0 || *last_len >= thresh) &&
> -   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
> +   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
> ret = 0;
> +   goto out;
> +   }
> +
> +
> +   /*
> +* Try not recompress compressed extents
> +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
> +* recompress all compressed extents
> +*/
> +   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
> +   if (!compress) {
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   } else {
> +   if (em->compress_type != compress_type)
> +   goto out;
> +   if (em->len == BTRFS_MAX_UNCOMPRESSED)
> +   ret = 0;
> +   }
> +   }
> +
>  out:
> /*
>  * last_len ends up being a counter of how many bytes we've defragged.
> @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
> *file,
>
> if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
>  extent_thresh, _len, ,
> -_end, do_compress)){
> +_end, do_compress,
> +compress_type)){
> unsigned long next;
> /*
>  * the should_defrag function tells us how much to 
> skip
> --
> 2.15.1
>



-- 
Have a nice day,
Timofey.
--


[PATCH] Btrfs: make should_defrag_range() understood compressed extents

2017-12-14 Thread Timofey Titovets
Both the defrag ioctl and autodefrag call btrfs_defrag_file()
for file defragmentation.

The kernel target extent size default is 256KiB,
btrfs-progs by default uses 32MiB.

Both are bigger than the max (not fragmented) compressed extent size, 128KiB.
That leads to rewriting all compressed data on disk.

Fix that, and also make should_defrag_range() check
whether the requested target compression is the same as the current extent
compression type, to avoid useless recompression of compressed extents.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index be5bd81b3669..12d4fa5d6dec 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, 
struct extent_map *em)
 
 static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
   u64 *last_len, u64 *skip, u64 *defrag_end,
-  int compress)
+  int compress, int compress_type)
 {
struct extent_map *em;
int ret = 1;
@@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 
start, u32 thresh,
 * real extent, don't bother defragging it
 */
if (!compress && (*last_len == 0 || *last_len >= thresh) &&
-   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
+   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
ret = 0;
+   goto out;
+   }
+
+
+   /*
+* Try not recompress compressed extents
+* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to
+* recompress all compressed extents
+*/
+   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
+   if (!compress) {
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   } else {
+   if (em->compress_type != compress_type)
+   goto out;
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   }
+   }
+
 out:
/*
 * last_len ends up being a counter of how many bytes we've defragged.
@@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress)){
+_end, do_compress,
+compress_type)){
unsigned long next;
/*
 * the should_defrag function tells us how much to skip
-- 
2.15.1

--


Re: [RFC PATCH] Btrfs: btrfs_defrag_file() force use target extent size SZ_128KiB for compressed data

2017-12-14 Thread Timofey Titovets
Ignore that patch please, I will send another.

2017-12-14 2:25 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> The defrag heuristic uses the extent length as a threshold;
> kernel autodefrag uses SZ_256K and btrfs-progs uses SZ_32M as the
> target extent length.
>
> Problem:
> Compressed extents always have a length < 128KiB (BTRFS_MAX_COMPRESSED),
> so btrfs_defrag_file() always rewrites all extents in the defrag range.
>
> Hot-fix that by forcing the target extent size to BTRFS_MAX_COMPRESSED
> if the file is allowed to be compressed.
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/ioctl.c | 23 +++
>  1 file changed, 23 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index be5bd81b3669..952364ff4108 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1232,6 +1232,26 @@ static int cluster_pages_for_defrag(struct inode 
> *inode,
>
>  }
>
> +static inline int inode_use_compression(struct inode *inode)
> +{
> +   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> +   /* force compress */
> +   if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
> +   return 1;
> +   /* defrag ioctl */
> +   if (BTRFS_I(inode)->defrag_compress)
> +   return 1;
> +   /* bad compression ratios */
> +   if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS)
> +   return 0;
> +   if (btrfs_test_opt(fs_info, COMPRESS) ||
> +   BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS ||
> +   BTRFS_I(inode)->prop_compress)
> +   return 1;
> +   return 0;
> +}
> +
>  int btrfs_defrag_file(struct inode *inode, struct file *file,
>   struct btrfs_ioctl_defrag_range_args *range,
>   u64 newer_than, unsigned long max_to_defrag)
> @@ -1270,6 +1290,9 @@ int btrfs_defrag_file(struct inode *inode, struct file 
> *file,
> compress_type = range->compress_type;
> }
>
> +   if (inode_use_compression(inode))
> +   extent_thresh = BTRFS_MAX_COMPRESSED;
> +
> if (extent_thresh == 0)
> extent_thresh = SZ_256K;
>
> --
> 2.15.1



-- 
Have a nice day,
Timofey.


Re: [BUG?] Defrag on compressed FS do massive data rewrites

2017-12-14 Thread Timofey Titovets
2017-12-14 8:58 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
> Timofey Titovets posted on Thu, 14 Dec 2017 02:05:35 +0300 as excerpted:
>
>> Also, same problem exist for autodefrag case i.e.:
>> write 4KiB at start of compressed file autodefrag code add that file to
>> autodefrag queue, call btrfs_defrag_file, set range from start to u64-1.
>> That will trigger to full file rewrite, as all extents are smaller then
>> 256KiB.
>>
>> (if i understood all correctly).
>
> If so, it's rather ironic, because that's how I believed autodefrag to
> work, whole-file, for quite some time.  Then I learned otherwise, but I
> always enable both autodefrag and compress=lzo on all my btrfs, so it
> looks like at least for my use-case, I was correct with the whole-file
> assumption after all.  (Well, at least for files that are actually
> compressed, I don't run compress-force=lzo, just compress=lzo, so a
> reasonable number of files aren't actually compressed anyway, and they'd
> do the partial-file rewrite I had learned to be normal.)
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

btrfs fi def can easily avoid that problem if it is properly documented,
but I think a fix must be in place, because a "full rewrite" by
autodefrag is unacceptable behaviour.

As I read it, "how it works" is:
Both the defrag ioctl and autodefrag call btrfs_defrag_file() (ioctl.c).
btrfs_defrag_file() sets extent_thresh from the passed args; if no value
is provided (autodefrag does not initialize range->extent_thresh),
it uses the default:
if (extent_thresh == 0)
 extent_thresh = SZ_256K;

Later btrfs_defrag_file() tries to defrag the file from the start by
"index" (the page number from the start; the file is virtually split
into page sized blocks).
Then it calls should_defrag_range() and, based on the returned info,
either moves on to i+1 or skips the range.

should_defrag_range() gets the extent for the specified start offset:
em = defrag_lookup_extent(inode, start);

Later the condition (em->len >= thresh || (!next_mergeable && !prev_mergeable))
fails, because len (< 128KiB) < thresh (usually 256KiB or 32MiB).
So the extent will be rewritten.

struct extent_map {} has two potentially useful fields:

...
u64 len;
...

unsigned int compress_type;
...

As far as I can see from the code, len stores the "real" length, so
in theory we just need to add an additional check later, like:
if (em->len == BTRFS_MAX_UNCOMPRESSED && em->compress_type > 0)
ret = 0;

That should fix the problem with a "true" insight check for compressed extents.

Thanks.

P.S.
(Maybe someone from the more experienced devs can comment on that?)

-- 
Have a nice day,
Timofey.


[RFC PATCH] Btrfs: btrfs_defrag_file() force use target extent size SZ_128KiB for compressed data

2017-12-13 Thread Timofey Titovets
The defrag heuristic uses extent length as a threshold:
kernel autodefrag uses 256KiB and btrfs-progs uses 32MiB as the
target extent length.

Problem:
Compressed extents always have a length < 128KiB (BTRFS_MAX_COMPRESSED),
so btrfs_defrag_file() always rewrites all extents in the defrag range.

Hotfix that by forcing the target extent size to BTRFS_MAX_COMPRESSED
if the file is allowed to be compressed.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index be5bd81b3669..952364ff4108 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1232,6 +1232,26 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 }
 
+static inline int inode_use_compression(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+   /* force compress */
+   if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
+   return 1;
+   /* defrag ioctl */
+   if (BTRFS_I(inode)->defrag_compress)
+   return 1;
+   /* bad compression ratios */
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS)
+   return 0;
+   if (btrfs_test_opt(fs_info, COMPRESS) ||
+   BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS ||
+   BTRFS_I(inode)->prop_compress)
+   return 1;
+   return 0;
+}
+
 int btrfs_defrag_file(struct inode *inode, struct file *file,
  struct btrfs_ioctl_defrag_range_args *range,
  u64 newer_than, unsigned long max_to_defrag)
@@ -1270,6 +1290,9 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
compress_type = range->compress_type;
}
 
+   if (inode_use_compression(inode))
+   extent_thresh = BTRFS_MAX_COMPRESSED;
+
if (extent_thresh == 0)
extent_thresh = SZ_256K;
 
-- 
2.15.1


Re: [BUG?] Defrag on compressed FS do massive data rewrites

2017-12-13 Thread Timofey Titovets
2017-12-14 1:09 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> Hi, i see massive data rewrites of defragmented files when work with
> btrfs fi def .
> Before, i just thought it's a design problem - i.e. defrag always
> rewrite data to new place.
>
> At now, i read the code and see 2 bad cases:
> 1. With -c all extents of data will be rewriten, always.
> 2. btrfs use "bad" default target extent size, i.e. kernel by default
> try get 256KiB extent, btrfs-progres use 32MiB as a threshold.
>  Both of them make ioctl code should_defrag_range() think, that extent
> are "too" fragmented, and rewrite all compressed extents.
>
> Does that behavior expected?
>
> i.e. only way that i can safely use on my data are:
> btrfs fi def -vr -t 128KiB 
> That will defrag all fragmented compressed extents.
>
> "Hacky" solution that i see for now, is a create copy of inode_need_compress()
> for defrag ioctl, and if file must be compressed, force use of 128KiB
> as target extent.
> or at least document that not obvious behaviour.
>
> Thanks!
>
> --
> Have a nice day,
> Timofey.

Also, the same problem exists for the autodefrag case,
i.e.:
write 4KiB at the start of a compressed file;
the autodefrag code adds that file to the autodefrag queue and calls
btrfs_defrag_file() with a range from the start to (u64)-1.
That will trigger a full file rewrite, as all extents are smaller than 256KiB.

(if I understood everything correctly).

Thanks.


-- 
Have a nice day,
Timofey.


[BUG?] Defrag on compressed FS do massive data rewrites

2017-12-13 Thread Timofey Titovets
Hi, I see massive data rewrites of defragmented files when working with
btrfs fi def .
Before, I just thought it was a design problem, i.e. that defrag always
rewrites data to a new place.

Now that I have read the code, I see 2 bad cases:
1. With -c, all data extents will always be rewritten.
2. btrfs uses a "bad" default target extent size, i.e. the kernel by default
tries to get 256KiB extents, and btrfs-progs uses 32MiB as the threshold.
 Both of them make the ioctl code in should_defrag_range() think that extents
are "too" fragmented, and so it rewrites all compressed extents.

Is that behavior expected?

i.e. the only invocation that I can safely use on my data is:
btrfs fi def -vr -t 128KiB 
That will defrag all fragmented compressed extents.

The "hacky" solution that I see for now is to create a copy of inode_need_compress()
for the defrag ioctl and, if the file must be compressed, force the use of 128KiB
as the target extent size,
or at least document that non-obvious behaviour.

Thanks!

-- 
Have a nice day,
Timofey.


Re: [PATCH v2] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2017-12-13 Thread Timofey Titovets
Compile tested && battle tested with btrfs-extent-same from duperemove.

Performance-wise, I see a negligible difference.

Thanks

2017-12-13 3:45 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> At now btrfs_dedupe_file_range() restricted to 16MiB range for
> limit locking time and memory requirement for dedup ioctl()
>
> For too big input range code silently set range to 16MiB
>
> Let's remove that restriction by do iterating over dedup range.
> That's backward compatible and will not change anything for request
> less then 16MiB.
>
> Changes:
>   v1 -> v2:
> - Refactor btrfs_cmp_data_prepare and btrfs_extent_same
> - Store memory of pages array between iterations
> - Lock inodes once, not on each iteration
>     - Small inplace cleanups
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/ioctl.c | 160 
> ---
>  1 file changed, 94 insertions(+), 66 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index d136ff0522e6..b17dcab1bb0c 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2985,8 +2985,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
> put_page(pg);
> }
> }
> -   kfree(cmp->src_pages);
> -   kfree(cmp->dst_pages);
> +
> +   cmp->num_pages = 0;
>  }
>
>  static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
> @@ -2994,41 +2994,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
> u64 loff,
>   u64 len, struct cmp_pages *cmp)
>  {
> int ret;
> -   int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
> -   struct page **src_pgarr, **dst_pgarr;
> -
> -   /*
> -* We must gather up all the pages before we initiate our
> -* extent locking. We use an array for the page pointers. Size
> -* of the array is bounded by len, which is in turn bounded by
> -* BTRFS_MAX_DEDUPE_LEN.
> -*/
> -   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> -   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> -   if (!src_pgarr || !dst_pgarr) {
> -   kfree(src_pgarr);
> -   kfree(dst_pgarr);
> -   return -ENOMEM;
> -   }
> -   cmp->num_pages = num_pages;
> -   cmp->src_pages = src_pgarr;
> -   cmp->dst_pages = dst_pgarr;
>
> /*
>  * If deduping ranges in the same inode, locking rules make it 
> mandatory
>  * to always lock pages in ascending order to avoid deadlocks with
>  * concurrent tasks (such as starting writeback/delalloc).
>  */
> -   if (src == dst && dst_loff < loff) {
> -   swap(src_pgarr, dst_pgarr);
> +   if (src == dst && dst_loff < loff)
> swap(loff, dst_loff);
> -   }
>
> -   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
> +   cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
> +
> +   ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
> if (ret)
> goto out;
>
> -   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
> +   ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
> dst_loff);
>
>  out:
> if (ret)
> @@ -3098,31 +3079,23 @@ static int extent_same_check_offsets(struct inode 
> *inode, u64 off, u64 *plen,
> return 0;
>  }
>
> -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> -struct inode *dst, u64 dst_loff)
> +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> +  struct inode *dst, u64 dst_loff,
> +  struct cmp_pages *cmp)
>  {
> int ret;
> u64 len = olen;
> -   struct cmp_pages cmp;
> bool same_inode = (src == dst);
> u64 same_lock_start = 0;
> u64 same_lock_len = 0;
>
> -   if (len == 0)
> -   return 0;
> -
> -   if (same_inode)
> -   inode_lock(src);
> -   else
> -   btrfs_double_inode_lock(src, dst);
> -
> ret = extent_same_check_offsets(src, loff, &len, olen);
> if (ret)
> -   goto out_unlock;
> +   return ret;
>
> ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
> if (ret)
> -   goto out_unlock;
> +   return ret;
>
> if (

Re: [PATCH v3 0/5] define BTRFS_DEV_STATE

2017-12-13 Thread Timofey Titovets
2017-12-13 5:26 GMT+03:00 David Sterba :
> On Wed, Dec 13, 2017 at 06:38:12AM +0800, Anand Jain wrote:
>>
>>
>> On 12/13/2017 01:42 AM, David Sterba wrote:
>> > On Sun, Dec 10, 2017 at 05:15:17PM +0800, Anand Jain wrote:
>> >> As of now device properties and states are being represented as int
>> >> variable, patches here makes them bit flags instead. Further, wip
>> >> patches such as device failed state needs this cleanup.
>> >>
>> >> v2:
>> >>   Adds BTRFS_DEV_STATE_REPLACE_TGT
>> >>   Adds BTRFS_DEV_STATE_FLUSH_SENT
>> >>   Drops BTRFS_DEV_STATE_CAN_DISCARD
>> >>   Starts bit flag from the bit 0
>> >>   Drops unrelated change - declare btrfs_device
>> >>
>> >> v3:
>> >>   Fix static checker warning, define respective dev state as bit number
>> >
>> > The define numbers are fixed but the whitespace changes that I made in
>> > misc-next
>>
>>   Will do next time. Thanks. I don't see misc-next. Is it for-next ?
>
> The kernel.org repository only gets the latest for-next, that is
> assembled from the pending branches, and also after some testing. You
> could still find 'misc-next' inside the for-next branch, but it's not
> obvious.
>
> All the development branches are pushed to
>
> https://github.com/kdave/btrfs-devel or
> http://repo.or.cz/linux-2.6/btrfs-unstable.git
>
> more frequently than the k.org/for-next is updated. I thought this has
> become a common knowledge, but yet it's not documented on the wiki so.
> Let's fix that.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I didn't know about your GitHub copy;
maybe it makes sense to update:
https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories

and add new links to:
 - k.org/for-next
 - https://github.com/kdave/btrfs-devel or
 - http://repo.or.cz/linux-2.6/btrfs-unstable.git

because the current link to
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
looks a bit outdated.

Thanks.
-- 
Have a nice day,
Timofey.


[PATCH v2] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2017-12-12 Thread Timofey Titovets
Currently btrfs_dedupe_file_range() is restricted to a 16MiB range to
limit locking time and the memory requirement of the dedup ioctl().

For a bigger input range, the code silently clamps the range to 16MiB.

Let's remove that restriction by iterating over the dedup range.
That's backward compatible and will not change anything for requests
smaller than 16MiB.

Changes:
  v1 -> v2:
- Refactor btrfs_cmp_data_prepare and btrfs_extent_same
- Store memory of pages array between iterations
- Lock inodes once, not on each iteration
- Small inplace cleanups
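
For context, a minimal userspace sketch (not part of this patch, and using
made-up fd/length variables) of how a caller could issue a dedupe request
larger than 16MiB through the generic FIDEDUPERANGE ioctl once the
restriction is gone:

#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdlib.h>

/* Dedupe the first `len` bytes of src_fd into dst_fd (offsets 0/0). */
static int dedupe_range(int src_fd, int dst_fd, __u64 len)
{
	struct file_dedupe_range *arg;
	int ret;

	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	if (!arg)
		return -1;

	arg->src_offset = 0;
	arg->src_length = len;		/* may now exceed 16MiB */
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst_fd;
	arg->info[0].dest_offset = 0;

	ret = ioctl(src_fd, FIDEDUPERANGE, arg);
	/* on success, info[0].bytes_deduped and info[0].status hold the result */
	free(arg);
	return ret;
}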

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/ioctl.c | 160 ---
 1 file changed, 94 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d136ff0522e6..b17dcab1bb0c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2985,8 +2985,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
put_page(pg);
}
}
-   kfree(cmp->src_pages);
-   kfree(cmp->dst_pages);
+
+   cmp->num_pages = 0;
 }
 
 static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
@@ -2994,41 +2994,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
u64 loff,
  u64 len, struct cmp_pages *cmp)
 {
int ret;
-   int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
-   struct page **src_pgarr, **dst_pgarr;
-
-   /*
-* We must gather up all the pages before we initiate our
-* extent locking. We use an array for the page pointers. Size
-* of the array is bounded by len, which is in turn bounded by
-* BTRFS_MAX_DEDUPE_LEN.
-*/
-   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   if (!src_pgarr || !dst_pgarr) {
-   kfree(src_pgarr);
-   kfree(dst_pgarr);
-   return -ENOMEM;
-   }
-   cmp->num_pages = num_pages;
-   cmp->src_pages = src_pgarr;
-   cmp->dst_pages = dst_pgarr;
 
/*
 * If deduping ranges in the same inode, locking rules make it mandatory
 * to always lock pages in ascending order to avoid deadlocks with
 * concurrent tasks (such as starting writeback/delalloc).
 */
-   if (src == dst && dst_loff < loff) {
-   swap(src_pgarr, dst_pgarr);
+   if (src == dst && dst_loff < loff)
swap(loff, dst_loff);
-   }
 
-   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
+   cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+   ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
if (ret)
goto out;
 
-   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
+   ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
dst_loff);
 
 out:
if (ret)
@@ -3098,31 +3079,23 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
return 0;
 }
 
-static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-struct inode *dst, u64 dst_loff)
+static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+  struct inode *dst, u64 dst_loff,
+  struct cmp_pages *cmp)
 {
int ret;
u64 len = olen;
-   struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
-   if (same_inode)
-   inode_lock(src);
-   else
-   btrfs_double_inode_lock(src, dst);
-
 ret = extent_same_check_offsets(src, loff, &len, olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
 ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode) {
/*
@@ -3139,32 +3112,21 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * allow an unaligned length so long as it ends at
 * i_size.
 */
-   if (len != olen) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (len != olen)
+   return -EINVAL;
 
/* Check for overlapping ranges */
-   if (dst_loff + len > loff && dst_loff < loff + len) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (dst_loff + len > loff && dst_loff < loff + len)
+   return -EINVAL;
 

Re: [PATCH 0/3] Minor compression heuristic cleanups

2017-12-12 Thread Timofey Titovets
2017-12-12 23:55 GMT+03:00 David Sterba <dste...@suse.com>:
> The callback pointers for radix_sort are not needed, we don't plan to
> export the function now. The compiler is smart enough to replace the
> indirect calls with direct ones, so there's no change in the resulting
> asm code.
>
> David Sterba (3):
>   btrfs: heuristic: open code get_num callback of radix sort
>   btrfs: heuristic: open code copy_call callback of radix sort
>   btrfs: heuristic: call get4bits directly
>
>  fs/btrfs/compression.c | 42 +++---
>  1 file changed, 11 insertions(+), 31 deletions(-)
>
> --
> 2.15.1
>

Thanks!
For the whole series:
Reviewed-by: Timofey Titovets

-- 
Have a nice day,
Timofey.


Re: defragmenting best practice?

2017-12-10 Thread Timofey Titovets
2017-12-11 8:18 GMT+03:00 Dave :
> On Tue, Oct 31, 2017 someone wrote:
>>
>>
>> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
>> > nocow -- it will NOT be snapshotted
>
> I did exactly this. It servers the purpose of avoiding snapshots.
> However, today I saw the following at
> https://wiki.archlinux.org/index.php/Btrfs
>
> Note: From Btrfs Wiki Mount options: within a single file system, it
> is not possible to mount some subvolumes with nodatacow and others
> with datacow. The mount option of the first mounted subvolume applies
> to any other subvolumes.
>
> That makes me think my nodatacow mount option on $HOME/.cache is not
> effective. True?
>
> (My subjective performance results have not been as good as hoped for
> with the tweaks I have tried so far.)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

True. For those particular dirs that you may want to mark as nocow, you need to
use chattr, like this:
rm -rf ~/.cache
mkdir ~/.cache
chattr +C ~/.cache

-- 
Have a nice day,
Timofey.


Just curious, whats happened with btrfs & rcu skiplist in 2013?

2017-12-10 Thread Timofey Titovets
Subject says it all:
I found https://lwn.net/Articles/554885/,
https://lwn.net/Articles/553047/
and some other messages, like the judy/RCU ones, etc.

But nothing after June 2013. Just curious, maybe someone knows why
that work has stalled,
or just what happened in the end? (e.g. did the RT guys simply say
"too bad for us"..)

Thanks!
-- 
Have a nice day,
Timofey.


[PATCH v3] Btrfs: heuristic replace heap sort with radix sort

2017-12-05 Thread Timofey Titovets
The slowest part of the heuristic for now is the kernel heap sort():
it can take up to 55% of the runtime, sorting the bucket items.

As the sort will be called on most data sets to correctly compute
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

Add a general radix_sort function.
Radix sort requires 2 buffers: one the full size of the input array
and one to store the counters (jump addresses).

That increases the usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base for the calculation,
to keep the counters array acceptably small (16 elements * 8 bytes).

This radix sort implementation has several adjustment points;
I added them to make the radix sort generally usable in the kernel,
like heap sort, if needed.

Performance tested in a userspace copy of the heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap  sort
- average <-> random data: ~6000 MiB/s - radix sort
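
To illustrate the approach, here is a standalone userspace sketch (an
assumption-laden illustration, not the kernel code below): it sorts plain
u64 keys ascending with 4-bit digits, two buffers and 16 counters, while
the patch maps the digits in reverse order so the bucket counters come out
descending:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RADIX_BASE	4
#define COUNTERS_SIZE	(1 << RADIX_BASE)	/* 16 counters, one per 4-bit digit */

/* LSD radix sort: the input array, a scratch buffer of the same size,
 * and 16 counters turned into jump addresses on every pass. */
static void radix_sort_u64(uint64_t *array, uint64_t *buf, uint32_t num)
{
	uint32_t counters[COUNTERS_SIZE];
	uint64_t max = 0;
	uint32_t shift, i;

	for (i = 0; i < num; i++)
		if (array[i] > max)
			max = array[i];

	/* one pass per 4-bit digit, least significant digit first */
	for (shift = 0; shift < 64 && (max >> shift); shift += RADIX_BASE) {
		memset(counters, 0, sizeof(counters));
		for (i = 0; i < num; i++)
			counters[(array[i] >> shift) & (COUNTERS_SIZE - 1)]++;
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];
		/* stable scatter into the scratch buffer, back to front */
		for (i = num; i > 0; i--) {
			uint8_t d = (array[i - 1] >> shift) & (COUNTERS_SIZE - 1);
			buf[--counters[d]] = array[i - 1];
		}
		memcpy(array, buf, num * sizeof(*array));
	}
}

int main(void)
{
	uint64_t a[] = { 48, 33, 4, 260, 7, 48 };
	uint64_t tmp[6];
	uint32_t i;

	radix_sort_u64(a, tmp, 6);
	for (i = 0; i < 6; i++)
		printf("%llu ", (unsigned long long)a[i]);
	printf("\n");	/* prints: 4 7 33 48 48 260 */
	return 0;
}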

Changes:
  v1 -> v2:
- Tested on Big Endian
- Drop most of multiply operations
- Separately allocate sort buffer

Changes:
  v2 -> v3:
- Fix uint -> u conversion
- Reduce stack size, by reduce vars sizes to u32,
  restrict input array size to u32
  Assume that kernel will never try sorting arrays > 2^32
- Drop max_cell arg (precheck - correctly find max value by it self)

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 135 ++---
 1 file changed, 128 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 06ef50712acd..9573f4491367 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -752,6 +751,8 @@ struct heuristic_ws {
u32 sample_size;
/* Buckets store counters for each byte value */
struct bucket_item *bucket;
+   /* Sorting buffer */
+   struct bucket_item *bucket_b;
struct list_head list;
 };
 
@@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
 
kvfree(workspace->sample);
kfree(workspace->bucket);
+   kfree(workspace->bucket_b);
kfree(workspace);
 }
 
@@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
if (!ws->bucket)
goto fail;
 
+   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
+   if (!ws->bucket_b)
+   goto fail;
+
 INIT_LIST_HEAD(&ws->list);
 return &ws->list;
 fail:
@@ -1278,13 +1284,127 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static int bucket_comp_rev(const void *lv, const void *rv)
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline u8 get4bits(u64 num, u32 shift) {
+   u8 low4bits;
+   num = num >> shift;
+   /* Reverse order */
+   low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+   return low4bits;
+}
+
+static inline void copy_cell(void *dst, u32 dest_i, void *src, u32 src_i)
+{
+   struct bucket_item *dstv = (struct bucket_item *) dst;
+   struct bucket_item *srcv = (struct bucket_item *) src;
+   dstv[dest_i] = srcv[src_i];
+}
+
+static inline u64 get_num(const void *a, u32 i)
+{
+   struct bucket_item *av = (struct bucket_item *) a;
+   return av[i].count;
+}
+
+/*
+ * Use 4 bits as radix base
+ * Use 16 u32 counters for calculating new possition in buf array
+ *
+ * @array - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *  must be equal in size to @array
+ * @num   - array size
+ * @get_num   - function to extract number from array
+ * @copy_cell - function to copy data from array to array_buf
+ *  and vise versa
+ * @get4bits  - function to get 4 bits from number at specified offset
+ */
+
+static void radix_sort(void *array, void *array_buf, u32 num,
+  u64 (*get_num)(const void *,  u32 i),
+  void (*copy_cell)(void *dest, u32 dest_i,
+void* src,  u32 src_i),
+  u8 (*get4bits)(u64 num, u32 shift))
 {
-   const struct bucket_item *l = (const struct bucket_item *)lv;
-   const struct bucket_item *r = (const struct bucket_item *)rv;
+   u64 max_num;
+   u64 buf_num;
+   u32 counters[COUNTERS_SIZE];
+   u32 new_addr;
+   u32 addr;
+   u32 bitlen;
+   u32 shift;
+   int i;
+
+   /*
+* Try avoid useless loop iterations
+* For small numbers stored in big counters
+* example: 48 33 4 ... in 64bit array
+*/
+   max_num = get_num(array, 0);
+   for (i = 1; i <

Re: [PATCH v2] Btrfs: heuristic replace heap sort with radix sort

2017-12-04 Thread Timofey Titovets
2017-12-05 0:24 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> 2017-12-04 23:47 GMT+03:00 David Sterba <dste...@suse.cz>:
>> On Mon, Dec 04, 2017 at 12:30:33AM +0300, Timofey Titovets wrote:
>>> Slowest part of heuristic for now is kernel heap sort()
>>> It's can take up to 55% of runtime on sorting bucket items.
>>>
>>> As sorting will always call on most data sets to get correctly
>>> byte_core_set_size, the only way to speed up heuristic, is to
>>> speed up sort on bucket.
>>>
>>> Add a general radix_sort function.
>>> Radix sort require 2 buffers, one full size of input array
>>> and one for store counters (jump addresses).
>>>
>>> That increase usage per heuristic workspace +1KiB
>>> 8KiB + 1KiB -> 8KiB + 2KiB
>>>
>>> That is LSD Radix, i use 4 bit as a base for calculating,
>>> to make counters array acceptable small (16 elements * 8 byte).
>>>
>>> That Radix sort implementation have several points to adjust,
>>> I added him to make radix sort general usable in kernel,
>>> like heap sort, if needed.
>>>
>>> Performance tested in userspace copy of heuristic code,
>>> throughput:
>>> - average <-> random data: ~3500 MiB/s - heap  sort
>>> - average <-> random data: ~6000 MiB/s - radix sort
>>>
>>> Changes:
>>>   v1 -> v2:
>>> - Tested on Big Endian
>>> - Drop most of multiply operations
>>> - Separately allocate sort buffer
>>>
>>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>>> ---
>>>  fs/btrfs/compression.c | 147 
>>> ++---
>>>  1 file changed, 140 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>>> index ae016699d13e..19b52982deda 100644
>>> --- a/fs/btrfs/compression.c
>>> +++ b/fs/btrfs/compression.c
>>> @@ -33,7 +33,6 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> -#include 
>>>  #include 
>>>  #include "ctree.h"
>>>  #include "disk-io.h"
>>> @@ -752,6 +751,8 @@ struct heuristic_ws {
>>>   u32 sample_size;
>>>   /* Buckets store counters for each byte value */
>>>   struct bucket_item *bucket;
>>> + /* Sorting buffer */
>>> + struct bucket_item *bucket_b;
>>>   struct list_head list;
>>>  };
>>>
>>> @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
>>>
>>>   kvfree(workspace->sample);
>>>   kfree(workspace->bucket);
>>> + kfree(workspace->bucket_b);
>>>   kfree(workspace);
>>>  }
>>>
>>> @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
>>>   if (!ws->bucket)
>>>   goto fail;
>>>
>>> + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), 
>>> GFP_KERNEL);
>>> + if (!ws->bucket_b)
>>> + goto fail;
>>> +
>>>   INIT_LIST_HEAD(>list);
>>>   return >list;
>>>  fail:
>>> @@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
>>>   return entropy_sum * 100 / entropy_max;
>>>  }
>>>
>>> -/* Compare buckets by size, ascending */
>>> -static int bucket_comp_rev(const void *lv, const void *rv)
>>> +#define RADIX_BASE 4
>>> +#define COUNTERS_SIZE (1 << RADIX_BASE)
>>> +
>>> +static inline uint8_t get4bits(uint64_t num, int shift) {
>>> + uint8_t low4bits;
>>> + num = num >> shift;
>>> + /* Reverse order */
>>> + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
>>> + return low4bits;
>>> +}
>>> +
>>> +static inline void copy_cell(void *dst, int dest_i, void *src, int src_i)
>>>  {
>>> - const struct bucket_item *l = (const struct bucket_item *)lv;
>>> - const struct bucket_item *r = (const struct bucket_item *)rv;
>>> + struct bucket_item *dstv = (struct bucket_item *) dst;
>>> + struct bucket_item *srcv = (struct bucket_item *) src;
>>> + dstv[dest_i] = srcv[src_i];
>>> +}
>>>
>>> - return r->count - l->count;
>>> +static inline uint64_t get_num(const void *a, int i)
>>> +{
>>> + st

Re: [PATCH v2] Btrfs: heuristic replace heap sort with radix sort

2017-12-04 Thread Timofey Titovets
2017-12-04 23:47 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Mon, Dec 04, 2017 at 12:30:33AM +0300, Timofey Titovets wrote:
>> Slowest part of heuristic for now is kernel heap sort()
>> It's can take up to 55% of runtime on sorting bucket items.
>>
>> As sorting will always call on most data sets to get correctly
>> byte_core_set_size, the only way to speed up heuristic, is to
>> speed up sort on bucket.
>>
>> Add a general radix_sort function.
>> Radix sort require 2 buffers, one full size of input array
>> and one for store counters (jump addresses).
>>
>> That increase usage per heuristic workspace +1KiB
>> 8KiB + 1KiB -> 8KiB + 2KiB
>>
>> That is LSD Radix, i use 4 bit as a base for calculating,
>> to make counters array acceptable small (16 elements * 8 byte).
>>
>> That Radix sort implementation have several points to adjust,
>> I added him to make radix sort general usable in kernel,
>> like heap sort, if needed.
>>
>> Performance tested in userspace copy of heuristic code,
>> throughput:
>> - average <-> random data: ~3500 MiB/s - heap  sort
>> - average <-> random data: ~6000 MiB/s - radix sort
>>
>> Changes:
>>   v1 -> v2:
>> - Tested on Big Endian
>> - Drop most of multiply operations
>> - Separately allocate sort buffer
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/compression.c | 147 
>> ++---
>>  1 file changed, 140 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index ae016699d13e..19b52982deda 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -33,7 +33,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  #include 
>>  #include "ctree.h"
>>  #include "disk-io.h"
>> @@ -752,6 +751,8 @@ struct heuristic_ws {
>>   u32 sample_size;
>>   /* Buckets store counters for each byte value */
>>   struct bucket_item *bucket;
>> + /* Sorting buffer */
>> + struct bucket_item *bucket_b;
>>   struct list_head list;
>>  };
>>
>> @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
>>
>>   kvfree(workspace->sample);
>>   kfree(workspace->bucket);
>> + kfree(workspace->bucket_b);
>>   kfree(workspace);
>>  }
>>
>> @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
>>   if (!ws->bucket)
>>   goto fail;
>>
>> + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
>> + if (!ws->bucket_b)
>> + goto fail;
>> +
>>   INIT_LIST_HEAD(>list);
>>   return >list;
>>  fail:
>> @@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
>>   return entropy_sum * 100 / entropy_max;
>>  }
>>
>> -/* Compare buckets by size, ascending */
>> -static int bucket_comp_rev(const void *lv, const void *rv)
>> +#define RADIX_BASE 4
>> +#define COUNTERS_SIZE (1 << RADIX_BASE)
>> +
>> +static inline uint8_t get4bits(uint64_t num, int shift) {
>> + uint8_t low4bits;
>> + num = num >> shift;
>> + /* Reverse order */
>> + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
>> + return low4bits;
>> +}
>> +
>> +static inline void copy_cell(void *dst, int dest_i, void *src, int src_i)
>>  {
>> - const struct bucket_item *l = (const struct bucket_item *)lv;
>> - const struct bucket_item *r = (const struct bucket_item *)rv;
>> + struct bucket_item *dstv = (struct bucket_item *) dst;
>> + struct bucket_item *srcv = (struct bucket_item *) src;
>> + dstv[dest_i] = srcv[src_i];
>> +}
>>
>> - return r->count - l->count;
>> +static inline uint64_t get_num(const void *a, int i)
>> +{
>> + struct bucket_item *av = (struct bucket_item *) a;
>> + return av[i].count;
>> +}
>> +
>> +/*
>> + * Use 4 bits as radix base
>> + * Use 16 uint64_t counters for calculating new possition in buf array
>> + *
>> + * @array - array that will be sorted
>> + * @array_buf - buffer array to store sorting results
>> + *  must be equal in size to @array
>> + * @num   - array size
>> + * @max_cell  - Link to element with maximum possible value
&g

Re: [PATCH v4] Btrfs: compress_file_range() change page dirty status once

2017-12-04 Thread Timofey Titovets
Gentle ping

2017-10-24 1:29 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> We need to call extent_range_clear_dirty_for_io()
> on compression range to prevent application from changing
> page content, while pages compressing.
>
> extent_range_clear_dirty_for_io() run on each loop iteration,
> "(end - start)" can be much (up to 1024 times) bigger
> then compression range (BTRFS_MAX_UNCOMPRESSED).
>
> That produce extra calls to page managment code.
>
> Fix that behaviour by call extent_range_clear_dirty_for_io()
> only once.
>
> v1 -> v2:
>  - Make that more obviously and more safeprone
>
> v2 -> v3:
>  - Rebased on:
>Btrfs: compress_file_range() remove dead variable num_bytes
>  - Update change log
>  - Add comments
>
> v3 -> v4:
>  - Rebased on: kdave for-next
>  - To avoid dirty bit clear/set behaviour change
>call clear_bit once, istead of per compression range
>
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
> ---
>  fs/btrfs/inode.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b93fe05a39c7..5816dd3cb6e6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -536,8 +536,10 @@ static noinline void compress_file_range(struct inode 
> *inode,
>  * If the compression fails for any reason, we set the pages
>  * dirty again later on.
>  */
> -   extent_range_clear_dirty_for_io(inode, start, end);
> -   redirty = 1;
> +   if (!redirty) {
> +   extent_range_clear_dirty_for_io(inode, start, end);
> +   redirty = 1;
> +   }
>
> /* Compression level is applied here and only here */
> ret = btrfs_compress_pages(
> --
> 2.14.2



-- 
Have a nice day,
Timofey.


[PATCH v2] Btrfs: heuristic replace heap sort with radix sort

2017-12-03 Thread Timofey Titovets
The slowest part of the heuristic for now is the kernel heap sort():
it can take up to 55% of the runtime, sorting the bucket items.

As the sort will be called on most data sets to correctly compute
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

Add a general radix_sort function.
Radix sort requires 2 buffers: one the full size of the input array
and one to store the counters (jump addresses).

That increases the usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base for the calculation,
to keep the counters array acceptably small (16 elements * 8 bytes).

This radix sort implementation has several adjustment points;
I added them to make the radix sort generally usable in the kernel,
like heap sort, if needed.

Performance tested in a userspace copy of the heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap  sort
- average <-> random data: ~6000 MiB/s - radix sort

Changes:
  v1 -> v2:
- Tested on Big Endian
- Drop most of multiply operations
- Separately allocate sort buffer

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 147 ++---
 1 file changed, 140 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ae016699d13e..19b52982deda 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -752,6 +751,8 @@ struct heuristic_ws {
u32 sample_size;
/* Buckets store counters for each byte value */
struct bucket_item *bucket;
+   /* Sorting buffer */
+   struct bucket_item *bucket_b;
struct list_head list;
 };
 
@@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
 
kvfree(workspace->sample);
kfree(workspace->bucket);
+   kfree(workspace->bucket_b);
kfree(workspace);
 }
 
@@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
if (!ws->bucket)
goto fail;
 
+   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
+   if (!ws->bucket_b)
+   goto fail;
+
 INIT_LIST_HEAD(&ws->list);
 return &ws->list;
 fail:
@@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static int bucket_comp_rev(const void *lv, const void *rv)
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline uint8_t get4bits(uint64_t num, int shift) {
+   uint8_t low4bits;
+   num = num >> shift;
+   /* Reverse order */
+   low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+   return low4bits;
+}
+
+static inline void copy_cell(void *dst, int dest_i, void *src, int src_i)
 {
-   const struct bucket_item *l = (const struct bucket_item *)lv;
-   const struct bucket_item *r = (const struct bucket_item *)rv;
+   struct bucket_item *dstv = (struct bucket_item *) dst;
+   struct bucket_item *srcv = (struct bucket_item *) src;
+   dstv[dest_i] = srcv[src_i];
+}
 
-   return r->count - l->count;
+static inline uint64_t get_num(const void *a, int i)
+{
+   struct bucket_item *av = (struct bucket_item *) a;
+   return av[i].count;
+}
+
+/*
+ * Use 4 bits as radix base
+ * Use 16 uint64_t counters for calculating new possition in buf array
+ *
+ * @array - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *  must be equal in size to @array
+ * @num   - array size
+ * @max_cell  - Link to element with maximum possible value
+ *  that can be used to cap radix sort iterations
+ *  if we know maximum value before call sort
+ * @get_num   - function to extract number from array
+ * @copy_cell - function to copy data from array to array_buf
+ *  and vise versa
+ * @get4bits  - function to get 4 bits from number at specified offset
+ */
+
+static void radix_sort(void *array, void *array_buf,
+  int num,
+  const void *max_cell,
+  uint64_t (*get_num)(const void *, int i),
+  void (*copy_cell)(void *dest, int dest_i,
+void* src, int src_i),
+  uint8_t (*get4bits)(uint64_t num, int shift))
+{
+   u64 max_num;
+   uint64_t buf_num;
+   uint64_t counters[COUNTERS_SIZE];
+   uint64_t new_addr;
+   int i;
+   int addr;
+   int bitlen;
+   int shift;
+
+   /*
+* Try avoid useless loop iterations
+* For small numbers stored in big counters
+* example: 48 33 4 ... in 64bit array
+*/
+   if (!max_ce

Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread Timofey Titovets
2017-11-28 21:48 GMT+03:00 David Sterba :
> On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote:
>> As we all know, under certain circumstances, it is more appropriate to
>> create some subvolumes rather than keep everything in the same
>> subvolume. As the condition of demand change, the user may need to
>> convert a previous directory to a subvolume. For this reason,how about
>> adding an ioctl to convert a directory to a subvolume?
>
> I'd say too difficult to get everything right in kernel. This is
> possible to be done in userspace, with existing tools.
>
> The problem is that the conversion cannot be done atomically in most
> cases, so even if it's just one ioctl call, there are several possible
> intermediate states that would exist during the call. Reporting where
> did the ioctl fail would need some extended error code semantics.
>
>> Users can convert by the scripts mentioned in this
>> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
>> it easier to use the off-the-shelf btrfs subcommand?
>
> Adding a subcommand would work, though I'd rather avoid reimplementing
> 'cp -ax' or 'rsync -ax'.  We want to copy the files preserving all
> attributes, with reflink, and be able to identify partially synced
> files, and not cross the mountpoints or subvolumes.
>
> The middle step with snapshotting the containing subvolume before
> syncing the data is also a valid option, but not always necessary.
>
>> After an initial consideration, our implementation is broadly divided
>> into the following steps:
>> 1. Freeze the filesystem or set the subvolume above the source directory
>> to read-only;
>
> Freezing the filesystme will freeze all IO, so this would not work, but
> I understand what you mean. The file data are synced before the snapshot
> is taken, but nothing prevents applications to continue writing data.
>
> Open and live files is a problem and don't see a nice solution here.
>
>> 2. Perform a pre-check, for example, check if a cross-device link
>> creation during the conversion;
>
> Cross-device links are not a problem as long as we use 'cp' ie. the
> manual creation of files in the target.
>
>> 3. Perform conversion, such as creating a new subvolume and moving the
>> contents of the source directory;
>> 4. Thaw the filesystem or restore the subvolume writable property.
>>
>> In fact, I am not so sure whether this use of freeze is appropriate
>> because the source directory the user needs to convert may be located
>> at / or /home and this pre-check and conversion process may take a long
>> time, which can lead to some shell and graphical application suspended.
>
> I think the closest operation is a read-only remount, which is not
> always possible due to open files and can otherwise considered as quite
> intrusive operation to the whole system. And the root filesystem cannot
> be easily remounted read-only in the systemd days anyway.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

My 2c:
when we are talking about a 'fast' conversion of a dir to a subvolume
(i.e. I like the idea where the ioctl call is fast),
it can be done like this (sorry if I misunderstood something and this is
a rant, or I'm crazy..):

To make the idea clearer, in userspace it could look like:
1. Create a snapshot of the parent subvol of that dir
2. Clean up all data in the snapshot, except the content of the dir
3. Move the content of that dir to the snapshot root
4. Replace the dir with that snapshot/subvol
i.e. no copy, no cp, only rename() and garbage collecting.

In the kernel that would in "theory" look like:
1. Copy the subvol root inode
2. Replace the root inode with the target dir inode
3. Replace the target dir in the old subvol with the new subvol
4. GC the old dir content from the parent subvol, and GC all the useless
content around the dir in the new subvol

That may be the fastest way for the user, but it will not solve the problems
with opened files etc.;
still, it should be fast from the user's point of view, and all the other stuff
can simply be cleaned up in the background.

Thanks
-- 
Have a nice day,
Timofey.


Re: zstd compression

2017-11-16 Thread Timofey Titovets
2017-11-16 19:32 GMT+03:00 Austin S. Hemmelgarn :
> On 2017-11-16 08:43, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Thu, 16 Nov 2017 07:30:47 -0500 as
>> excerpted:
>>
>>> On 2017-11-15 16:31, Duncan wrote:

 Austin S. Hemmelgarn posted on Wed, 15 Nov 2017 07:57:06 -0500 as
 excerpted:

> The 'compress' and 'compress-force' mount options only impact newly
> written data.  The compression used is stored with the metadata for
> the extents themselves, so any existing data on the volume will be
> read just fine with whatever compression method it was written with,
> while new data will be written with the specified compression method.
>
> If you want to convert existing files, you can use the '-c' option to
> the defrag command to do so.


 ... Being aware of course that using defrag to recompress files like
 that will break 100% of the existing reflinks, effectively (near)
 doubling data usage if the files are snapshotted, since the snapshot
 will now share 0% of its extents with the newly compressed files.
>>>
>>> Good point, I forgot to mention that.


 (The actual effect shouldn't be quite that bad, as some files are
 likely to be uncompressed due to not compressing well, and I'm not sure
 if defrag -c rewrites them or not.  Further, if there's multiple
 snapshots data usage should only double with respect to the latest one,
 the data delta between it and previous snapshots won't be doubled as
 well.)
>>>
>>> I'm pretty sure defrag is equivalent to 'compress-force', not
>>> 'compress', but I may be wrong.
>>
>>
>> But... compress-force doesn't actually force compression _all_ the time.
>> Rather, it forces btrfs to continue checking whether compression is worth
>> it for each "block"[1] of the file, instead of giving up if the first
>> quick try at the beginning says that block won't compress.
>>
>> So what I'm saying is that if the snapshotted data is already compressed,
>> think (pre-)compressed tarballs or image files such as jpeg that are
>> unlikely to /easily/ compress further and might well actually be _bigger_
>> once the compression algorithm is run over them, defrag -c will likely
>> fail to compress them further even if it's the equivalent of compress-
>> force, and thus /should/ leave them as-is, not breaking the reflinks of
>> the snapshots and thus not doubling the data usage for that file, or more
>> exactly, that extent of that file.
>>
>> Tho come to think of it, is defrag -c that smart, to actually leave the
>> data as-is if it doesn't compress further, or does it still rewrite it
>> even if it doesn't compress, thus breaking the reflink and doubling the
>> usage regardless?
>
> I'm not certain how compression factors in, but if you aren't compressing
> the file, it will only get rewritten if it's fragmented (which is shy
> defragmenting the system root directory is usually insanely fast on most
> systems, stuff there is almost never fragmented).
>>
>>
>> ---
>> [1] Block:  I'm not positive it's the usual 4K block in this case.  I
>> think I read that it's 16K, but I might be confused on that.  But
>> regardless the size, the point is, with compress-force btrfs won't give
>> up like simple compress will if the first "block" doesn't compress, it'll
>> keep trying.
>>
>> Of course the new compression heuristic changes this a bit too, but the
>> same general idea holds, compress-force continues to try for the entire
>> file, compress will give up much faster.
>
> I'm not actually sure, I would think it checks 128k blocks of data (the
> effective block size for compression), but if it doesn't it should be
> checking at the filesystem block size (which means 16k on most recently
> created filesystems).
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Defragmentation of data on btrfs simply rewrites data if the data doesn't
meet some criteria.
All that -c does is say which compression method to apply to newly
written data, no more, no less.
On the write side, the FS sees long/short data ranges to write (see
compress_file_range()); if compression is needed, it splits the data into
128KiB chunks and passes them to the compression logic.
The compression logic gives up by itself in 2 cases:
1. Compressing the first 2 (or 3?) page sized blocks of the 128KiB chunk
makes the data bigger -> give up -> write the data as is
2. After compression is done, if compression does not free at least one
sector size -> write the data as is

i.e.
If you write 16 KiB at a time, btrfs will compress each separate write as 16 KiB.
If you write 1 MiB at a time, btrfs will split it into 128 KiB chunks.
If you write 1025 KiB, btrfs will split it into 128 KiB chunks and the last 1 KiB
will be written as is.

JFYI:
All that the heuristic logic does (i.e. compress, not compress-force) is:
on every write, the kernel checks whether compression is 

Re: [PATCH 4/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2017-11-14 Thread Timofey Titovets
Sorry, I thought I could test that and send you some feedback,
but for now there is no time.
I will check it later and try to add memory reusing.

So, please just ignore the patches for now.

Thanks

2017-10-10 20:36 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Tue, Oct 03, 2017 at 06:06:04PM +0300, Timofey Titovets wrote:
>> At now btrfs_dedupe_file_range() restricted to 16MiB range for
>> limit locking time and memory requirement for dedup ioctl()
>>
>> For too big input rage code silently set range to 16MiB
>>
>> Let's remove that restriction by do iterating over dedup range.
>> That's backward compatible and will not change anything for request
>> less then 16MiB.
>
> This would make the ioctl more pleasant to use. So far I haven't found
> any problems to do the iteration. One possible speedup could be done to
> avoid the repeated allocations in btrfs_extent_same if we're going to
> iterate more than once.
>
> As this would mean the 16MiB length restriction is gone, this needs to
> bubble up to the documentation
> (http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html)
>
> Have you tested the behaviour with larger ranges?



-- 
Have a nice day,
Timofey.


Re: raid1: can't readd removed dev while the fs is mounted

2017-10-27 Thread Timofey Titovets
2017-10-28 1:40 GMT+03:00 Julien Muchembled :
> Hello,
>
> I have 2 disks in RAID1, each one having 2 partitions:
> - 1 for / (BtrFS)
> - 1 for /home (MD/XFS)
>
> For some reasons, 1 disk was removed and readded. I had no issue at readding 
> it to the MD array, but for BtrFS, I had to reboot.
>
> Then, I tried to investigate more using qemu with systemrescuecd (kernel 
> 4.9.30 and btrfs-progs v4.9.1). From:
>
>   /sys/devices/pci:00/:00:01.1/ata1/host0
>
> I use:
>
>   # echo 1 > target0:0:1/0:0:1:0/delete
>
> to remove sdb and after some changes in the mount point:
>
>   # echo '- - -' > scsi_host/host0/scan
>
> to readd it.
>
> Then I executed
>
>   # btrfs scrub start -B -d /mnt/tmp
>
> to fix things but I only get uncorrectable errors and the dmesg is full of 
> 'i/o error' lines
>
> Maybe some command is required so that BtrFS accept to reuse the device. I 
> tried:
>
>   # btrfs replace start -B -f 2 /dev/sdb /mnt/tmp
>   ERROR: ioctl(DEV_REPLACE_STATUS) failed on "/mnt/tmp": Inappropriate ioctl 
> for device
>
> Julien
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

AFAIK, no, for now btrfs just can't reuse the device - there is no way,
at least for a mounted FS.
AFAIK, btrfs has patches for dynamic device states, but those patches are not merged.

-- 
Have a nice day,
Timofey.


[PATCH v4] Btrfs: compress_file_range() change page dirty status once

2017-10-23 Thread Timofey Titovets
We need to call extent_range_clear_dirty_for_io()
on the compression range to prevent the application from changing
page content while the pages are being compressed.

extent_range_clear_dirty_for_io() runs on each loop iteration,
and "(end - start)" can be much (up to 1024 times) bigger
than the compression range (BTRFS_MAX_UNCOMPRESSED).

That produces extra calls to the page management code.

Fix that behaviour by calling extent_range_clear_dirty_for_io()
only once.

v1 -> v2:
 - Make that more obviously and more safeprone

v2 -> v3:
 - Rebased on:
   Btrfs: compress_file_range() remove dead variable num_bytes
 - Update change log
 - Add comments

v3 -> v4:
 - Rebased on: kdave for-next
 - To avoid dirty bit clear/set behaviour change
   call clear_bit once, istead of per compression range

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/inode.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b93fe05a39c7..5816dd3cb6e6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -536,8 +536,10 @@ static noinline void compress_file_range(struct inode 
*inode,
 * If the compression fails for any reason, we set the pages
 * dirty again later on.
 */
-   extent_range_clear_dirty_for_io(inode, start, end);
-   redirty = 1;
+   if (!redirty) {
+   extent_range_clear_dirty_for_io(inode, start, end);
+   redirty = 1;
+   }

/* Compression level is applied here and only here */
ret = btrfs_compress_pages(
--
2.14.2


Re: [PATCH v8 0/6] Btrfs: populate heuristic with code

2017-10-23 Thread Timofey Titovets
2017-10-22 16:44 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
> 2017-10-20 16:45 GMT+03:00 David Sterba <dste...@suse.cz>:
>> On Fri, Oct 20, 2017 at 01:48:01AM +0300, Timofey Titovets wrote:
>>> 2017-10-19 18:39 GMT+03:00 David Sterba <dste...@suse.cz>:
>>> > On Fri, Sep 29, 2017 at 06:22:00PM +0200, David Sterba wrote:
>>> >> On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote:
>>> >> > Compile tested, hand tested on live system
>>> >> >
>>> >> > Change v7 -> v8
>>> >> >   - All code moved to compression.c (again)
>>> >> >   - Heuristic workspaces inmplemented another way
>>> >> > i.e. only share logic with compression workspaces
>>> >> >   - Some style fixes suggested by Devid
>>> >> >   - Move sampling function from heuristic code
>>> >> > (I'm afraid of big functions)
>>> >> >   - Much more comments and explanations
>>> >>
>>> >> Thanks for the update, I went through the patches and they looked good
>>> >> enough to be put into for-next. I may have more comments about a few
>>> >> things, but nothing serious that would hinder testing.
>>> >
>>> > I did a final pass through the patches and edited comments where I was
>>> > not able to understand them. Please check the updated patches in [1] if
>>> > I did not accidentally change the meaning.
>>>
>>> I don't see a link [1] in mail, may be you missed it?
>>
>> Yeah, sorry:
>> https://github.com/kdave/btrfs-devel/commits/ext/timofey/heuristic
>
> I re-read the updated comments, they look OK to me
> (I only found one typo and left a comment).
>
>
> Thanks
> --
> Have a nice day,
> Timofey.

Can you please try the patch attached?

I have been thinking for some time about the performance hit of the heuristic
and how to avoid sorting.

The patch tries to find the min/max values in the array before sorting,
and uses (max - min) to filter the edge data cases where the
byte core set size is < 64 or > 200.
It's a bit of a hacky workaround =\,
but it shows roughly the same speedup on my data set as using radix sort
(i.e. a 2x speed up).

Thanks.

-- 
Have a nice day,
Timofey.
From fb2a329828e64ad0e224a8cb97dbc17147149629 Mon Sep 17 00:00:00 2001
From: Timofey Titovets <nefelim...@gmail.com>
Date: Mon, 23 Oct 2017 21:24:29 +0300
Subject: [PATCH] Btrfs: heuristic try avoid bucket sorting on edge data cases

The heap sort used in the kernel is too slow and costly,
so let's make some statistical assumptions about edge input data cases,
based on the observed difference between min/max values in the bucket.

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 0ca16909894e..56b67ec4fb5b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1310,8 +1310,46 @@ static int byte_core_set_size(struct heuristic_ws *ws)
 	u32 i;
 	u32 coreset_sum = 0;
 	const u32 core_set_threshold = ws->sample_size * 90 / 100;
+	struct bucket_item *max, *min;
+	struct bucket_item tmp;
 	struct bucket_item *bucket = ws->bucket;
 
+
+	/* Presort for find min/max value */
+	max = &bucket[0];
+	min = &bucket[BUCKET_SIZE - 1];
+	for (i = 1; i < BUCKET_SIZE - 1; i++) {
+		if (bucket[i].count > max->count) {
+			tmp = *max;
+			*max = bucket[i];
+			bucket[i] = tmp;
+		}
+		if (bucket[i].count < min->count) {
+			tmp = *min;
+			*min = bucket[i];
+			bucket[i] = tmp;
+		}
+	}
+
+	/*
+	 * Hack to avoid sorting on edge data cases (sorting is too costly),
+	 * i.e. quickly filter out easily compressible
+	 * and badly compressible data.
+	 * Based on the observed count distribution on different data sets.
+	 *
+	 * Assumption 1: for badly compressible data the difference between
+	 * min/max will be less than 0.6% of the sample size
+	 *
+	 * Assumption 2: for well compressible data the difference between
+	 * min/max will be far bigger than 4% of the sample size
+	 */
+
+	if (max->count - min->count < ws->sample_size * 6 / 1000)
+		return BYTE_CORE_SET_HIGH + 1;
+
+	if (max->count - min->count > ws->sample_size * 4 / 100)
+		return BYTE_CORE_SET_LOW - 1;
+
 	/* Sort in reverse order */
 	sort(bucket, BUCKET_SIZE, sizeof(*bucket), &bucket_comp_rev, NULL);
 
-- 
2.14.2



Re: [PATCH v8 0/6] Btrfs: populate heuristic with code

2017-10-22 Thread Timofey Titovets
2017-10-20 16:45 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Fri, Oct 20, 2017 at 01:48:01AM +0300, Timofey Titovets wrote:
>> 2017-10-19 18:39 GMT+03:00 David Sterba <dste...@suse.cz>:
>> > On Fri, Sep 29, 2017 at 06:22:00PM +0200, David Sterba wrote:
>> >> On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote:
>> >> > Compile tested, hand tested on live system
>> >> >
>> >> > Change v7 -> v8
>> >> >   - All code moved to compression.c (again)
>> >> >   - Heuristic workspaces implemented another way
>> >> > i.e. only share logic with compression workspaces
>> >> >   - Some style fixes suggested by David
>> >> >   - Move sampling function from heuristic code
>> >> > (I'm afraid of big functions)
>> >> >   - Much more comments and explanations
>> >>
>> >> Thanks for the update, I went through the patches and they looked good
>> >> enough to be put into for-next. I may have more comments about a few
>> >> things, but nothing serious that would hinder testing.
>> >
>> > I did a final pass through the patches and edited comments where I was
>> > not able to understand them. Please check the updated patches in [1] if
>> > I did not accidentally change the meaning.
>>
>> I don't see a link [1] in mail, may be you missed it?
>
> Yeah, sorry:
> https://github.com/kdave/btrfs-devel/commits/ext/timofey/heuristic

I re-read the updated comments, they look OK to me
(I only found one typo and left a comment).


Thanks
-- 
Have a nice day,
Timofey.


Re: [PATCH v8 0/6] Btrfs: populate heuristic with code

2017-10-19 Thread Timofey Titovets
2017-10-19 18:39 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Fri, Sep 29, 2017 at 06:22:00PM +0200, David Sterba wrote:
>> On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote:
>> > Compile tested, hand tested on live system
>> >
>> > Change v7 -> v8
>> >   - All code moved to compression.c (again)
>> >> >   - Heuristic workspaces implemented another way
>> > i.e. only share logic with compression workspaces
>> >> >   - Some style fixes suggested by David
>> >   - Move sampling function from heuristic code
>> > (I'm afraid of big functions)
>> >   - Much more comments and explanations
>>
>> Thanks for the update, I went through the patches and they looked good
>> enough to be put into for-next. I may have more comments about a few
>> things, but nothing serious that would hinder testing.
>
> I did a final pass through the patches and edited comments where I was
> not able to understand them. Please check the updated patches in [1] if
> I did not accidentally change the meaning.

I don't see a link [1] in the mail, maybe you missed it?
I looked at my patches in the for-next branch, and they don't look
changed, so I assume your link does not point at kernel.org %).

> I'm about to add the patchset to the main patch pile for 4.15 soon.
> Further tuning is possible and such patches will be probably accepted
> during the 4.15 development cycle once the as parts have landed. It's
> desirable to gather some testing results of heuristic effects on various
> data types. So far I've been watching for performance drops only.

Just for my information, do you compare compress + heuristic against
forced compression?

P.S.
Just to sync on what we expect from the heuristic:
some performance drop on easily compressible data is expected, because
the heuristic is not free,
but how big is that drop?
The main reason for the heuristic is to save the cpu/latency cost on badly
compressible data,
compared to compressing it directly.
That allows providing a stable worst-case latency/throughput for userspace.

P.P.S.
I sent some emails before where I showed the slow paths in the heuristic
(sort(), ilog2()).
So I expect the kernel to see the same slowdowns on those paths,
but I don't have enough skills yet to do kernel profiling.

> In case the heuristic would turn out to cause problems we can't fix
> during 4.15 cycle, we can still disable it. This is only a last resort
> measure but we need to be prepared.
kk

Thanks.


-- 
Have a nice day,
Timofey.


Re: [PATCH 3/4] Btrfs: handle unaligned tail of data ranges more efficient

2017-10-15 Thread Timofey Titovets
Maybe then just add a comment to at least one of those functions?
Like:
/*
 * Handle the unaligned end; end is inclusive, so it is always unaligned
 */

Or something like:
/*
 * The kernel uses paging, so a range
 * almost always has an aligned start like 0
 * and an unaligned end like 8192 - 1
 */

Or do we assume that everybody who looks at kernel code must already understand
such basic things?

Thanks


2017-10-10 19:37 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Tue, Oct 03, 2017 at 06:06:03PM +0300, Timofey Titovets wrote:
>> Right now, while switching page bits in data ranges,
>> we always handle +1 page, to cover the case
>> where the end of the data range is not page aligned
>
> The 'end' is inclusive and thus not aligned in most cases, ie. it's
> offset 4095 in the page, so the IS_ALIGNED is always true and the code
> is equivalent to the existing condition (index <= end_index).



-- 
Have a nice day,
Timofey.


Re: [PATCH 2/4] Btrfs: clear_dirty only on pages only in compression range

2017-10-13 Thread Timofey Titovets
2017-10-10 19:22 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Tue, Oct 03, 2017 at 06:06:02PM +0300, Timofey Titovets wrote:
>> We need to call extent_range_clear_dirty_for_io()
>> on the compression range to prevent the application from changing
>> page content while the pages are being compressed.
>>
>> But "(end - start)" can be much (up to 1024 times) bigger
>> than the compression range (BTRFS_MAX_UNCOMPRESSED), so optimize that
>> by calculating the compression range for
>> that loop iteration, and flip bits only on that range
>
> I'm not sure this is safe to do. Compress_file_range gets the whole
> range [start,end] from async_cow_start, and other parts of code might
> rely on the status of the whole range, ie. not partially redirtied.

I checked some kernel code; the io path is complex =\.
I see 3 approaches:
1. It is used to signal the upper level that we failed to write those pages,
so on another sync() call the kernel can send that data for writing again.
2. We lock the pages against any changes while we are processing the data.
3. Writes are serialized, i.e. if I understood correctly, that allows
not sending the pages down again for several sync requests.

My code above will handle the first and second case fine, and in theory could
cause some issues with 3, but that doesn't matter.

The main design problem from my point of view is that we call
that function many times in the loop, for example:
compress_file_range() gets: 0 - 1048576
extent_range_clear_dirty_for_io() will get called for:
0 - 1048576
131072 - 1048576
262144 - 1048576
... & etc
(the loop advances in 131072-byte steps, so for this 1 MiB range the call is
repeated 8 times)

So, we can beat that in a different way.
I first thought about moving extent_range_clear_dirty_for_io() out of the
loop, but I think this is a better approach.

What about:
if (!redirty) {
extent_range_clear_dirty_for_io(inode, start, end);
redirty = 1;
}

That will also cover all the above cases, because it locks the whole range, as before.
But we only call it once, and that fixes the design issue.

What do you think?

Thanks.
-- 
Have a nice day,
Timofey.


[RFC PATCH] Btrfs: heuristic replace heap sort with radix sort

2017-10-12 Thread Timofey Titovets
The slowest part of the heuristic for now is the kernel heap sort().
It can take up to 55% of the runtime, sorting bucket items.

As sorting is called on most data sets to get a correct
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

So, add a general radix_sort function.
Radix sort requires 2 buffers, one the full size of the input array
and one to store counters (jump addresses).

For the buffer array, just allocate BUCKET_SIZE*2 for the bucket,
and use the free tail as a buffer, to improve data locality.
That increases the usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base,
to keep the counters array acceptably small (16 elements * 8 bytes).

Not tested on big endian.
I try to handle that with some kernel macros.

Performance tested in a userspace copy of the heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s  - heap  sort
- average <-> random data: ~6000 MiB/s +71% - radix sort

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 153 +
 1 file changed, 141 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 01738b9a8dc7..f320fbb1de17 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -753,6 +753,7 @@ struct heuristic_ws {
u32 sample_size;
/* Bucket store counter for each byte type */
struct bucket_item *bucket;
+   struct bucket_item *bucket_buf;
struct list_head list;
 };

@@ -778,10 +779,12 @@ static struct list_head *alloc_heuristic_ws(void){
if (!ws->sample)
goto fail;

-   ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL);
+   ws->bucket = kcalloc(BUCKET_SIZE*2, sizeof(*ws->bucket), GFP_KERNEL);
if (!ws->bucket)
goto fail;

+   ws->bucket_buf = &ws->bucket[BUCKET_SIZE];
+
	INIT_LIST_HEAD(&ws->list);
	return &ws->list;
 fail:
@@ -1225,6 +1228,137 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
return 1;
 }

+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline u8 get4bits(u64 num, int shift) {
+   u8 low4bits;
+   num = num >> shift;
+   /* Reverse order */
+   low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+   return low4bits;
+}
+
+static inline void copy_cell(void *dst, const void *src)
+{
+   struct bucket_item *dstv = (struct bucket_item *) dst;
+   struct bucket_item *srcv = (struct bucket_item *) src;
+   *dstv = *srcv;
+}
+
+static inline u64 get_num(const void *a)
+{
+   struct bucket_item *av = (struct bucket_item *) a;
+   return cpu_to_le32(av->count);
+}
+
+/*
+ * Kernel compatible radix sort implementation
+ * Use 4 bits as radix base
+ * Use 16 64bit counters for calculating the new position in the buf array
+ * Tested only on Little Endian
+ *
+ * @array - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *  must be equal in size to @array
+ * @num   - array size
+ * @size  - item size
+ * @max_cell  - Link to element with maximum possible value
+ *  that can be used to cap radix sort iterations
+ *  if we know maximum value before call sort
+ * @get_num   - function to extract number from array
+ * @copy_cell - function to copy data from array to array_buf
+ *  and vice versa
+ * @get4bits  - function to get 4 bits from number at specified offset
+ */
+static void radix_sort(void *array, void *array_buf,
+  int num, int size,
+  const void *max_cell,
+  u64 (*get_num)(const void *),
+  void (*copy_cell)(void *dest, const void* src),
+  u8 (*get4bits)(u64 num, int shift))
+{
+   u64 max_num;
+   u64 buf_num;
+   u64 counters[COUNTERS_SIZE];
+   u64 new_addr;
+   s64 i;
+   int addr;
+   int bitlen;
+   int shift;
+
+   /*
+* Try to avoid useless loop iterations
+* for small numbers stored in big counters,
+* example: 48 33 4 ... in a 64bit array
+*/
+   if (!max_cell) {
+   max_num = get_num(array);
+   for (i = 0 + size; i < num*size; i += size) {
+   buf_num = get_num(array + i);
+   if (le64_to_cpu(buf_num) > le64_to_cpu(max_num))
+   max_num = buf_num;
+   }
+   } else {
+   max_num = get_num(max_cell);
+   }
+
+   buf_num = ilog2(le64_to_cpu(max_num));
+   bitlen = ALIGN(buf_num, RADIX_BASE*2);
+
+   shift = 0;
+   while (shift < bitlen) {
+   memset(counters, 0, sizeof(counters));
+
+   for (i = 0; i < num*size; i += size) {
+  

Re: [RFC] Btrfs: compression heuristic performance

2017-10-12 Thread Timofey Titovets
Just an info update.

I did some more benchmarking, optimization and testing.
(Wrote a kernel-style generalized version of radix sort (that can be ported to sort.c))
(Some memory allocation changes & etc)
(Also, my mistake: I inverted the percentage numbers earlier)

New, cleaner numbers (I disabled the useless printf etc.):
  Stats mode heap sort (in kernel):
- average ... random data - ~3000 MiB/s
- Zeroed - ~4400 MiB/s
  Stats mode radix sort:
- average ... random data - ~4900 MiB/s +63%
- Zeroed - ~4100 MiB/s -7%  # radix has some problems with
sorting zeroes :)

  Non stats mode heap sort:
- average ... random data - ~3500 MiB/s
- Zeroed - ~11500 MiB/s
  Non stats mode radix sort:
- average ... random data - ~6000 MiB/s +71%
- Zeroed - ~11500 MiB/s # as expected no diff, sorting does nothing
in that case

Another hot (and slow) path in the heuristic is the simple ilog2 =\
I gave up on porting the in-kernel x86 ilog2 to userspace, so I'm not sure
what happens with it in the kernel,
so JFYI:
it only has a significant effect with radix sort, because heap sort is
much slower...
So, numbers:
  Stats mode simple ilog2:
- average ... random data - ~3000 MiB/s
- Zeroed - ~4400 MiB/s
Stats mode fast ilog2[1]:
- average ... random data - ~3500 MiB/s +16%
- Zeroed - ~4400 MiB/s  # ilog2 just does nothing on zeroes

  Non stats mode simple ilog2:
- average ... random data - ~3500 MiB/s
- Zeroed - ~11500 MiB/s
   Non stats mode fast ilog2:
# Because byte_core_set_size already filters out the really bad data
- average ... random data - ~3600 MiB/s +2%
- Zeroed - ~11500 MiB/s # as expected no diff, ilog2 does nothing in that case
# with radix sort
- average ... random data - ~6200 MiB/s +77%

So, as the heuristic tries to save CPU time compared to compression,
I think it is acceptable to sacrifice 1KiB of memory for the buffered data
(the usual time/memory trade-off).
I will write an RFC adding radix sort to compression.c soon.
It's hard for me to find other kernel places where it could be used for now.
IMHO, if someone else wants it too, we can always move it to lib later.

Thanks.

Notes:
1. Fast cross-platform 64-bit ilog2 (found on the web)
const int tab64[64] = {
63,  0, 58,  1, 59, 47, 53,  2,
60, 39, 48, 27, 54, 33, 42,  3,
61, 51, 37, 40, 49, 18, 28, 20,
55, 30, 34, 11, 43, 14, 22,  4,
62, 57, 46, 52, 38, 26, 32, 41,
50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16,  9, 12,
44, 24, 15,  8, 23,  7,  6,  5};

int ilog2_64 (uint64_t value)
{
/* Smear the highest set bit into every lower bit position */
value |= value >> 1;
value |= value >> 2;
value |= value >> 4;
value |= value >> 8;
value |= value >> 16;
value |= value >> 32;
/* Isolate the highest set bit and map it to its index via the lookup table */
return tab64[((uint64_t)((value - (value >> 1))*0x07EDD5E59A4E28C2)) >> 58];
}
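
As a quick sanity check of the table-based ilog2_64() above, it can be
compared against a naive bit scan; this is only a hypothetical test harness
(ilog2_naive() and the sample values are mine, not part of the tool), assuming
tab64[] and ilog2_64() are compiled in the same file:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reference implementation: count how many times the value can be halved */
static int ilog2_naive(uint64_t v)
{
	int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

int main(void)
{
	/* A few hand-picked values: small numbers, powers of two and their neighbours */
	uint64_t samples[] = { 1, 2, 3, 8191, 8192, 1ULL << 40, UINT64_MAX };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		assert(ilog2_64(samples[i]) == ilog2_naive(samples[i]));
	printf("ilog2_64() matches the naive bit scan\n");
	return 0;
}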

-- 
Have a nice day,
Timofey.


[RFC] Btrfs: compression heuristic performance

2017-10-11 Thread Timofey Titovets
Hi David, (and list folks in CC of course).

TL;DR
Sorting is the slowest part of the heuristic for now (in the worst case ~55% of the run time).
The kernel uses heap sort, so the heuristic also uses that.
Radix sort speeds the heuristic up a lot (~+30-40%),
but radix has some tradeoffs (i.e. it needs more memory, though that can be
easily fixed in theory).
Is it a reasonable idea to port radix sort to the kernel?

Long read version below.
---
I already mentioned that I ported the heuristic code back from the kernel to
userspace [1],
to make experiments and profiling easier for me.

(So I spent about a week on experiments and testing.)

(I profiled with Valgrind first.)
I found that the heuristic bottleneck for now is the kernel heap sort()
(sort uses about 55% of the runtime, which looks crazy to me).

And as sort is used in byte_core_set_size(), it will be called for all
non-text data,
i.e. from my point of view in 90% of cases.

(The tool has 2 modes in main, stats and non stats %) )
To show the scale of the problem:
  Stats mode:
- For average and worst case data (random) ~2600 MiB/s
- Zeroed data - ~3400 MiB/s
  Non stats (kernel like)
- For average and worst case data (random) ~3000 MiB/s
- Zeroed data - ~8000 MiB/s


I spent several days trying to beat that in different ways =\
The only way I found that shows a good profit
is replacing the kernel heap sort with radix_sort() [2]   =\

With radix sort:
  Stats mode:
- For average and worst case data (random) ~3800 MiB/s ~+31%
- Zeroed data - ~3400 MiB/s (not sure why the performance is the same)
  Non stats (kernel like)
- For average and worst case data (random) ~4800 MiB/s ~+37%
- Zeroed data - ~8000 MiB/s

(Inlining the radix sort shows an additional +200MiB/s)

That seems cool, but please wait a second: nothing comes for free.

Radix sort needs 2 buffers, one for the counters and one for the output array.

So in our case (I use 4 bits as the radix base):
16 * 4 bytes for the counters (stored on the stack of the sort function)
+ 4*256 bytes for the bucket_buffer (stored in the heuristic workspace).

So, in theory, that makes a heuristic workspace weigh more, i.e. +1KiB per CPU.
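
To make the 4-bit/16-counter scheme above concrete, here is a minimal
userspace sketch of an LSD radix sort over plain u32 values (my simplification
for illustration only; the real code sorts struct bucket_item by ->count and
inverts the digit to get descending order):

#include <stdint.h>
#include <string.h>

#define RADIX_BASE	4
#define COUNTERS_SIZE	(1 << RADIX_BASE)	/* 16 counters per pass */

static void radix_sort_u32(uint32_t *array, uint32_t *buf, int num)
{
	uint32_t counters[COUNTERS_SIZE];
	int shift, i;

	/* 32-bit keys, 4 bits per pass -> 8 passes */
	for (shift = 0; shift < 32; shift += RADIX_BASE) {
		memset(counters, 0, sizeof(counters));

		/* Count how many keys fall into each of the 16 buckets */
		for (i = 0; i < num; i++)
			counters[(array[i] >> shift) & (COUNTERS_SIZE - 1)]++;

		/* Prefix sums turn the counts into end positions in buf */
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];

		/* Stable scatter, walking backwards so equal keys keep their order */
		for (i = num - 1; i >= 0; i--)
			buf[--counters[(array[i] >> shift) & (COUNTERS_SIZE - 1)]] = array[i];

		/* Next pass reads what this pass wrote */
		memcpy(array, buf, num * sizeof(*array));
	}
}

The counters fit in 16 * 4 bytes on the stack and buf is the second buffer
mentioned above; the kernel version additionally caps the number of passes by
ilog2() of the maximum value.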

As the heuristic has some assumptions, i.e. we really only care about the sample size,
and the sample size in our case is 8KiB max (i.e. 8KiB per CPU),
it is possible to replace the u32 counters in the buckets with u16 counters.

And get the same:
8KiB sample per CPU
1KiB buckets per CPU

But the main question:
does that make sense?

Thanks!

Links:
1. https://github.com/Nefelim4ag/companalyze
2. 
https://github.com/Nefelim4ag/companalyze/commit/1cd371eac5056927e5f567c6bd32e179ba859629
-- 
Have a nice day,
Timofey.


[PATCH v2] Btrfs: heuristic add shannon entropy calculation

2017-10-08 Thread Timofey Titovets
The byte distribution check in the heuristic will filter edge data
cases and sometimes fails to classify the input data.

Let's fix that by adding a Shannon entropy calculation,
which will cover classification of most other data types.

As Shannon entropy needs log2 with some precision to work,
let's use ilog2(N), and to increase the precision,
do ilog2(pow(N, 4)).

The Shannon entropy formula is slightly rearranged to avoid signed numbers
and divisions.
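
For reference, the rearranged integer form can be read off the textbook
formula (S is the sample size, c_i the per-byte bucket counts; this note is
mine, not part of the patch):

  H = -\sum_i p_i \log_2 p_i,  with  p_i = c_i / S
    = (1/S) \sum_i c_i (\log_2 S - \log_2 c_i)

With ilog2_w(n) = ilog2(n^4) ~= 4*log2(n), the code sums
c_i * (ilog2_w(S) - ilog2_w(c_i)), divides by S and reports the result as a
percentage of the maximum 8 * ilog2_w(2) = 32.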

Changes:
  v1 -> v2:
- Replace log2_lshift4 with ilog2
- To compensate for the accuracy loss of ilog2
  @ENTROPY_LVL_ACEPTABLE 70 -> 65
  @ENTROPY_LVL_HIGH  85 -> 80
- Drop usage of u64 from shannon calculation

Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
---
 fs/btrfs/compression.c | 83 +-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 0060bc4ae5f2..8efbce5633b5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -1224,6 +1225,60 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
return 1;
 }

+/*
+ * Shannon Entropy calculation
+ *
+ * Pure byte distribution analysis fails to determine
+ * compressibility of data. Try calculating entropy to
+ * estimate the average minimum number of bits needed
+ * to encode the sampled data.
+ *
+ * For convenience, return a percentage of needed bits
+ * instead of the bit amount directly.
+ *
+ * @ENTROPY_LVL_ACEPTABLE - below that threshold the sample has low byte
+ * entropy and is compressible with high probability
+ *
+ * @ENTROPY_LVL_HIGH - data are not compressible with high probability
+ *
+ * Use of ilog2() decreases precision, so to get an ~correct answer
+ * the LVLs are decreased by 5.
+ */
+
+#define ENTROPY_LVL_ACEPTABLE 65
+#define ENTROPY_LVL_HIGH 80
+
+/*
+ * To increase precision in the shannon_entropy calculation,
+ * let's do pow(n, M) to save more digits after the decimal point
+ *
+ * Max int bit length is 64
+ * ilog2(MAX_SAMPLE_SIZE) -> 13
+ * 13*4 = 52 < 64 -> M = 4
+ * So use pow(n, 4)
+ */
+static inline u32 ilog2_w(u64 n)
+{
+   return ilog2(n * n * n * n);
+}
+
+static u32 shannon_entropy(struct heuristic_ws *ws)
+{
+   const u32 entropy_max = 8*ilog2_w(2);
+   u32 entropy_sum = 0;
+   u32 p, p_base, sz_base;
+   u32 i;
+
+   sz_base = ilog2_w(ws->sample_size);
+   for (i = 0; i < BUCKET_SIZE && ws->bucket[i].count > 0; i++) {
+   p = ws->bucket[i].count;
+   p_base = ilog2_w(p);
+   entropy_sum += p * (sz_base - p_base);
+   }
+
+   entropy_sum /= ws->sample_size;
+   return entropy_sum * 100 / entropy_max;
+}

 /* Compare buckets by size, ascending */
 static inline int bucket_comp_rev(const void *lv, const void *rv)
@@ -1404,7 +1459,7 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
struct heuristic_ws *ws;
u32 i;
u8 byte;
-   int ret = 1;
+   int ret = 0;

ws = list_entry(ws_list, struct heuristic_ws, list);

@@ -1439,6 +1494,32 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
goto out;
}

+   i = shannon_entropy(ws);
+   if (i <= ENTROPY_LVL_ACEPTABLE) {
+   ret = 4;
+   goto out;
+   }
+
+   /*
+* At entropy levels below ENTROPY_LVL_HIGH additional
+* analysis is needed to give the green light to compression.
+* For now just assume that compression at this level is
+* not worth the resources, because:
+* 1. it is possible to defrag the data later
+* 2. the cases where this would really turn out to be compressible data are rare,
+*   ex. 150 byte types, every bucket has a counter at level ~54.
+*   The heuristic will be confused; that can happen when data
+*   have some internal repeated patterns like abbacbbc..
+*   that can be detected only by analyzing the sample for byte pairs
+*/
+   if (i < ENTROPY_LVL_HIGH) {
+   ret = 5;
+   goto out;
+   } else {
+   ret = 0;
+   goto out;
+   }
+
 out:
__free_workspace(0, ws_list, true);
return ret;
--
2.14.2


  1   2   3   4   >