[PATCH 7/8] wbt: add general throttling mechanism

2016-09-07 Thread Jens Axboe
We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
   wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.
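
For orientation, here is a rough sketch of how a consumer could drive the
interface declared in include/linux/wbt.h below. It is illustrative only: the
real block-layer hookup is in patch 8/8, and my_dev, my_request, the stat ops
argument and the queue depth value are made-up names and numbers for the
example.

/*
 * Illustrative sketch only, not part of the patch: one way a consumer
 * (the block layer in patch 8/8, or NFS) might use the interface that
 * include/linux/wbt.h declares. my_dev, my_request and the flag values
 * passed around here are invented for the example.
 */
#include <linux/spinlock.h>
#include <linux/wbt.h>

struct my_request {
	struct wb_issue_stat wb_stat;	/* per-IO cookie that wbt stamps and reads */
};

struct my_dev {
	struct rq_wb *rq_wb;
	spinlock_t lock;
};

static void my_dev_setup(struct my_dev *dev, struct backing_dev_info *bdi,
			 struct wb_stat_ops *stat_ops, void *cookie)
{
	/* error handling omitted for brevity */
	dev->rq_wb = wbt_init(bdi, stat_ops, cookie);
	wbt_set_queue_depth(dev->rq_wb, 32);	/* whatever the device reports */
	wbt_set_write_cache(dev->rq_wb, true);	/* device has a volatile write cache */
}

static void my_dev_submit(struct my_dev *dev, struct my_request *rq,
			  unsigned int rw_flags)
{
	/* may sleep until inflight buffered writeback drops below the current limit */
	unsigned int wb_acct = wbt_wait(dev->rq_wb, rw_flags, &dev->lock);

	wbt_track(&rq->wb_stat, wb_acct);	/* remember tracked/read state */
	wbt_issue(dev->rq_wb, &rq->wb_stat);	/* stamp the issue time */
}

static void my_dev_complete(struct my_dev *dev, struct my_request *rq)
{
	/* completion latency feeds the wbt_lat/wbt_stat/wbt_step machinery above */
	wbt_done(dev->rq_wb, &rq->wb_stat);
}

The value returned by wbt_wait() is presumably the WBT_TRACKED/WBT_READ state
that wbt_track() records in the per-IO wb_issue_stat, so the completion side
can tell tracked writes and reads apart.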

Signed-off-by: Jens Axboe 
---
 include/linux/wbt.h| 120 
 include/trace/events/wbt.h | 153 ++
 lib/Kconfig|   3 +
 lib/Makefile   |   1 +
 lib/wbt.c  | 679 +
 5 files changed, 956 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index ..5ffcd1409c2f
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,120 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+enum {
+   ISSUE_STAT_TRACKED  = 1ULL << 63,
+   ISSUE_STAT_READ = 1ULL << 62,
+   ISSUE_STAT_MASK = ISSUE_STAT_TRACKED | ISSUE_STAT_READ,
+   ISSUE_STAT_TIME_MASK= ~ISSUE_STAT_MASK,
+
+   WBT_TRACKED = 1,
+   WBT_READ= 2,
+};
+
+struct wb_issue_stat {
+   u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+   stat->time = (stat->time & ISSUE_STAT_MASK) |
+   (ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+   return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+   stat->time |= ISSUE_STAT_TRACKED;
+}
+
+static inline void wbt_clear_state(struct wb_issue_stat *stat)
+{
+   stat->time &= ~(ISSUE_STAT_TRACKED | ISSUE_STAT_READ);
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+   return (stat->time & ISSUE_STAT_TRACKED) != 0;
+}
+
+static inline void wbt_mark_read(struct wb_issue_stat *stat)
+{
+   stat->time |= ISSUE_STAT_READ;
+}
+
+static inline bool wbt_is_read(struct wb_issue_stat *stat)
+{
+   return (stat->time & ISSUE_STAT_READ) != 0;
+}
+
+struct wb_stat_ops {
+   void (*get)(void *, struct blk_rq_stat *);
+   bool (*is_current)(struct blk_rq_stat *);
+   void (*clear)(void *);
+};
+
+struct rq_wb {
+   /*
+* Settings that govern how we throttle
+*/
+   unsigned int wb_background; /* background writeback */
+   unsigned int wb_normal; /* normal writeback */
+   unsigned int wb_max;/* max throughput writeback */
+   int scale_step;
+   bool scaled_max;
+
+   u64 win_nsec;   /* default window size */
+   u64 cur_win_nsec;   /* current window size */
+
+   /*
+* Number of consecutive periods where we don't have enough
+* information to make a firm scale up/down decision.
+*/
+   unsigned int unknown_cnt;
+
+   struct timer_list window_timer;
+
+   s64 sync_issue;
+   void *sync_cookie;
+
+   unsigned int wc;
+   unsigned int queue_depth;
+
+   unsigned long last_issue;   /* last non-throttled issue */
+   unsigned long last_comp;/* last non-throttled comp */
+   unsigned long min_lat_nsec;
+   struct backing_dev_info *bdi;
+   struct request_queue *q;
+   wait_queue_head_t wait;
+   atomic_t inflight;
+
+   struct wb_stat_ops *stat_ops;
+   void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+unsigned int wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_disable(struct rq_wb *);
+void wbt_track(struct wb_issue_stat *, unsigned int);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index ..926c7ee0ef4e
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,153 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt


Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-09-01 Thread Jens Axboe

On 09/01/2016 12:05 PM, Omar Sandoval wrote:

diff --git a/lib/Kconfig b/lib/Kconfig
index d79909dc01ec..5a65a1f91889 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -550,4 +550,8 @@ config STACKDEPOT
bool
select STACKTRACE

+config WBT
+   bool
+   select SCALE_BITMAP


Looks like this snuck in from your experiments to get this to work on
top of scale_bitmap?


Oops yes, it is indeed. Killed, thanks.


+	if (waitqueue_active(&rwb->wait)) {
+		int diff = limit - inflight;
+
+		if (!inflight || diff >= rwb->wb_background / 2)
+			wake_up_nr(&rwb->wait, 1);


wake_up(&rwb->wait)?


Yeah, that'd be cleaner. I think this is a leftover from when I 
experimented with batched wakeups, with nr != 1. I'll change it to just 
wake_up().
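
For reference, the tail then reads like this with just that swap applied
(sketch of the change discussed here, nothing else altered):

	if (waitqueue_active(&rwb->wait)) {
		int diff = limit - inflight;

		if (!inflight || diff >= rwb->wb_background / 2)
			wake_up(&rwb->wait);
	}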


--
Jens Axboe




Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-09-01 Thread Omar Sandoval
On Wed, Aug 31, 2016 at 11:05:50AM -0600, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes. Or NFS can tap into it, to accomplish the same.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, 
> max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.
> 
> Signed-off-by: Jens Axboe 
> ---
>  include/linux/wbt.h| 118 +
>  include/trace/events/wbt.h | 122 ++
>  lib/Kconfig|   4 +
>  lib/Makefile   |   1 +
>  lib/wbt.c  | 587 +
>  5 files changed, 832 insertions(+)
>  create mode 100644 include/linux/wbt.h
>  create mode 100644 include/trace/events/wbt.h
>  create mode 100644 lib/wbt.c
> 

[snip]

> diff --git a/lib/Kconfig b/lib/Kconfig
> index d79909dc01ec..5a65a1f91889 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -550,4 +550,8 @@ config STACKDEPOT
>   bool
>   select STACKTRACE
>  
> +config WBT
> + bool
> + select SCALE_BITMAP

Looks like this snuck in from your experiments to get this to work on
top of scale_bitmap?

[snip]

> +void __wbt_done(struct rq_wb *rwb)
> +{
> +	int inflight, limit;
> +
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
> +	/*
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
> +	 */
> +	if (unlikely(!rwb_enabled(rwb))) {
> +		wake_up_all(&rwb->wait);
> +		return;
> +	}
> +
> +	/*
> +	 * If the device does write back caching, drop further down
> +	 * before we wake people up.
> +	 */
> +	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> +		limit = 0;
> +	else
> +		limit = rwb->wb_normal;
> +
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit.
> +	 */
> +	if (inflight && inflight >= limit)
> +		return;
> +
> +	if (waitqueue_active(&rwb->wait)) {
> +		int diff = limit - inflight;
> +
> +		if (!inflight || diff >= rwb->wb_background / 2)
> +			wake_up_nr(&rwb->wait, 1);

wake_up(&rwb->wait)?

-- 
Omar


[PATCH 7/8] wbt: add general throttling mechanism

2016-08-31 Thread Jens Axboe
We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
   wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.
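
As a reading aid for the wbt_step line above, here is a hedged sketch of one
relationship between queue depth, scale_step and the three limits that matches
those numbers; the authoritative calculation is in lib/wbt.c, which is not
quoted in this excerpt.

/*
 * Sketch only: one relationship consistent with the example trace line
 * "step=1, background=8, normal=16, max=32" above, assuming a device
 * queue depth of 64. The real formula lives in lib/wbt.c (not quoted).
 */
static void example_wb_limits(unsigned int queue_depth, unsigned int scale_step,
			      unsigned int *background, unsigned int *normal,
			      unsigned int *max)
{
	*max = 1 + ((queue_depth - 1) >> scale_step);	/* roughly halves per step */
	*normal = (*max + 1) / 2;
	*background = (*max + 1) / 4;
}

With queue_depth = 64 and scale_step = 1 this gives max = 32, normal = 16 and
background = 8, matching the trace; read this way, a step down bumps
scale_step and shrinks all three limits.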

Signed-off-by: Jens Axboe 
---
 include/linux/wbt.h| 118 +
 include/trace/events/wbt.h | 122 ++
 lib/Kconfig|   4 +
 lib/Makefile   |   1 +
 lib/wbt.c  | 587 +
 5 files changed, 832 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index ..14473d550a18
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,118 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+enum {
+   ISSUE_STAT_TRACKED  = 1ULL << 63,
+   ISSUE_STAT_READ = 1ULL << 62,
+   ISSUE_STAT_MASK = ISSUE_STAT_TRACKED | ISSUE_STAT_READ,
+   ISSUE_STAT_TIME_MASK= ~ISSUE_STAT_MASK,
+
+   WBT_TRACKED = 1,
+   WBT_READ= 2,
+};
+
+struct wb_issue_stat {
+   u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+   stat->time = (stat->time & ISSUE_STAT_MASK) |
+   (ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+   return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+   stat->time |= ISSUE_STAT_TRACKED;
+}
+
+static inline void wbt_clear_state(struct wb_issue_stat *stat)
+{
+   stat->time &= ~(ISSUE_STAT_TRACKED | ISSUE_STAT_READ);
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+   return (stat->time & ISSUE_STAT_TRACKED) != 0;
+}
+
+static inline void wbt_mark_read(struct wb_issue_stat *stat)
+{
+   stat->time |= ISSUE_STAT_READ;
+}
+
+static inline bool wbt_is_read(struct wb_issue_stat *stat)
+{
+   return (stat->time & ISSUE_STAT_READ) != 0;
+}
+
+struct wb_stat_ops {
+   void (*get)(void *, struct blk_rq_stat *);
+   void (*clear)(void *);
+};
+
+struct rq_wb {
+   /*
+* Settings that govern how we throttle
+*/
+   unsigned int wb_background; /* background writeback */
+   unsigned int wb_normal; /* normal writeback */
+   unsigned int wb_max;/* max throughput writeback */
+   unsigned int scale_step;
+
+   u64 win_nsec;   /* default window size */
+   u64 cur_win_nsec;   /* current window size */
+
+   /*
+* Number of consecutive periods where we don't have enough
+* information to make a firm scale up/down decision.
+*/
+   unsigned int unknown_cnt;
+
+   struct timer_list window_timer;
+
+   s64 sync_issue;
+   void *sync_cookie;
+
+   unsigned int wc;
+   unsigned int queue_depth;
+
+   unsigned long last_issue;   /* last non-throttled issue */
+   unsigned long last_comp;/* last non-throttled comp */
+   unsigned long min_lat_nsec;
+   struct backing_dev_info *bdi;
+   struct request_queue *q;
+   wait_queue_head_t wait;
+   atomic_t inflight;
+
+   struct wb_stat_ops *stat_ops;
+   void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+unsigned int wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_disable(struct rq_wb *);
+void wbt_track(struct wb_issue_stat *, unsigned int);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index ..a4b8b2e57bb1
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,122 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)

Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jens Axboe

On 05/03/2016 12:14 PM, Jens Axboe wrote:

On 05/03/2016 10:59 AM, Jens Axboe wrote:

On 05/03/2016 09:48 AM, Jan Kara wrote:

On Tue 03-05-16 17:40:32, Jan Kara wrote:

On Tue 03-05-16 11:34:10, Jan Kara wrote:

Yeah, once I'll hunt down that regression with old disk, I can have
a look
into how writeback throttling plays together with blkio-controller.


So I've tried the following script (note that you need cgroup v2 for
writeback IO to be throttled):

---
mkdir /sys/fs/cgroup/group1
echo 1000 >/sys/fs/cgroup/group1/io.weight
dd if=/dev/zero of=/mnt/file1 bs=1M count=1&
DD1=$!
echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs

mkdir /sys/fs/cgroup/group2
echo 100 >/sys/fs/cgroup/group2/io.weight
#echo "259:65536 wbps=500" >/sys/fs/cgroup/group2/io.max
echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
dd if=/dev/zero of=/mnt/file2 bs=1M count=1&
DD2=$!
echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs

while true; do
 sleep 1
 kill -USR1 $DD1
 kill -USR1 $DD2
 echo '==='
done
---

and watched the progress of the dd processes in different cgroups.
The 1/10
weight difference has no effect with your writeback patches - the
situation
after one minute:

3120+1 records in
3120+1 records out
3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
3217+1 records in
3217+1 records out
3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s

I should add that even without your patches the progress doesn't quite
correspond to the weight ratio:


Forgot to fill in corresponding data for unpatched kernel here:

5962+2 records in
5962+2 records out
6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
1502+0 records in
1502+0 records out
1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s


Thanks for testing this, I'll see what we can do about that. It stands
to reason that we'll throttle a heavier writer more, statistically. But
I'm assuming this above test was run basically with just the writes
going, so no real competition? And hence we end up throttling them
equally much, destroying the weighting in the process. But for both
cases, we basically don't pay any attention to cgroup weights.


but still there is noticeable difference to cgroups with different
weights.

OTOH blk-throttle combines well with your patches: Limiting one
cgroup to
5 M/s results in numbers like:

3883+2 records in
3883+2 records out
4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
413+0 records in
413+0 records out
433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s

which is fine and comparable with unpatched kernel. Higher throughput
number is because we do buffered writes and dd reports what it wrote
into
page cache. And there is no wonder blk-throttle combines fine - it
throttles bios which happens before we reach writeback throttling
mechanism.


OK, that's good, at least that part works fine. And yes, the throttle
path is hit before we end up in the make_request_fn, which is where wbt
drops in.


So I believe this demonstrates that your writeback throttling just doesn't
work well with selective scheduling policy that happens below it because it
can essentially lead to IO priority inversion issues...


Is this testing still done on the QD=1 ATA disk? Not too surprising that
this falls apart, since we have very little room to maneuver. I wonder
if a normal SATA with NCQ would behave better in this regard. I'll have
to test a bit and think about how we can best handle this case.


I think what we'll do for now is just disable wbt IFF we have a non-root
cgroup attached to CFQ. Done here:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=7315756efe76bbdf83076fc9dbc569bbb4da5d32


That was a bit too untested.. This should be better, it taps into where 
cfq normally notices a difference in blkcg:


http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=9b89e1bb666bd036a4cb1313479435087fb86ba0
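
A minimal sketch of that idea (not the linked commit, which is not quoted
here), assuming the request_queue carries the rq_wb pointer that patch 8/8
adds:

/*
 * Sketch only, not the commit linked above: once CFQ sees a non-root
 * blkcg on the queue, switch writeback throttling off so it cannot
 * invert the proportional-share decisions made below it. Assumes
 * q->rq_wb exists, as added by patch 8/8 of this series.
 */
static void maybe_disable_wbt(struct request_queue *q, struct blkcg *blkcg)
{
	if (blkcg != &blkcg_root && q->rq_wb)
		wbt_disable(q->rq_wb);
}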


--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jens Axboe

On 05/03/2016 10:59 AM, Jens Axboe wrote:

On 05/03/2016 09:48 AM, Jan Kara wrote:

On Tue 03-05-16 17:40:32, Jan Kara wrote:

On Tue 03-05-16 11:34:10, Jan Kara wrote:

Yeah, once I'll hunt down that regression with old disk, I can have
a look
into how writeback throttling plays together with blkio-controller.


So I've tried the following script (note that you need cgroup v2 for
writeback IO to be throttled):

---
mkdir /sys/fs/cgroup/group1
echo 1000 >/sys/fs/cgroup/group1/io.weight
dd if=/dev/zero of=/mnt/file1 bs=1M count=1&
DD1=$!
echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs

mkdir /sys/fs/cgroup/group2
echo 100 >/sys/fs/cgroup/group2/io.weight
#echo "259:65536 wbps=500" >/sys/fs/cgroup/group2/io.max
echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
dd if=/dev/zero of=/mnt/file2 bs=1M count=1&
DD2=$!
echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs

while true; do
 sleep 1
 kill -USR1 $DD1
 kill -USR1 $DD2
 echo  '==='
done
---

and watched the progress of the dd processes in different cgroups.
The 1/10
weight difference has no effect with your writeback patches - the
situation
after one minute:

3120+1 records in
3120+1 records out
3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
3217+1 records in
3217+1 records out
3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s

I should add that even without your patches the progress doesn't quite
correspond to the weight ratio:


Forgot to fill in corresponding data for unpatched kernel here:

5962+2 records in
5962+2 records out
6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
1502+0 records in
1502+0 records out
1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s


Thanks for testing this, I'll see what we can do about that. It stands
to reason that we'll throttle a heavier writer more, statistically. But
I'm assuming this above test was run basically with just the writes
going, so no real competition? And hence we end up throttling them
equally much, destroying the weighting in the process. But for both
cases, we basically don't pay any attention to cgroup weights.


but still there is noticeable difference to cgroups with different
weights.

OTOH blk-throttle combines well with your patches: Limiting one
cgroup to
5 M/s results in numbers like:

3883+2 records in
3883+2 records out
4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
413+0 records in
413+0 records out
433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s

which is fine and comparable with unpatched kernel. Higher throughput
number is because we do buffered writes and dd reports what it wrote
into
page cache. And there is no wonder blk-throttle combines fine - it
throttles bios which happens before we reach writeback throttling
mechanism.


OK, that's good, at least that part works fine. And yes, the throttle
path is hit before we end up in the make_request_fn, which is where wbt
drops in.


So I believe this demonstrates that your writeback throttling just doesn't
work well with selective scheduling policy that happens below it because it
can essentially lead to IO priority inversion issues...


Is this testing still done on the QD=1 ATA disk? Not too surprising that
this falls apart, since we have very little room to maneuver. I wonder
if a normal SATA with NCQ would behave better in this regard. I'll have
to test a bit and think about how we can best handle this case.


I think what we'll do for now is just disable wbt IFF we have a non-root 
cgroup attached to CFQ. Done here:


http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=7315756efe76bbdf83076fc9dbc569bbb4da5d32

We don't have a strong need for wbt (supposedly) since CFQ should take 
care of most of it, if you have policies set for proportional sharing.


Longer term it's not a concern either, as we'll move away from that 
model anyway.


--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jens Axboe

On 05/03/2016 09:48 AM, Jan Kara wrote:

On Tue 03-05-16 17:40:32, Jan Kara wrote:

On Tue 03-05-16 11:34:10, Jan Kara wrote:

Yeah, once I'll hunt down that regression with old disk, I can have a look
into how writeback throttling plays together with blkio-controller.


So I've tried the following script (note that you need cgroup v2 for
writeback IO to be throttled):

---
mkdir /sys/fs/cgroup/group1
echo 1000 >/sys/fs/cgroup/group1/io.weight
dd if=/dev/zero of=/mnt/file1 bs=1M count=1&
DD1=$!
echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs

mkdir /sys/fs/cgroup/group2
echo 100 >/sys/fs/cgroup/group2/io.weight
#echo "259:65536 wbps=500" >/sys/fs/cgroup/group2/io.max
echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
dd if=/dev/zero of=/mnt/file2 bs=1M count=1&
DD2=$!
echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs

while true; do
 sleep 1
 kill -USR1 $DD1
 kill -USR1 $DD2
 echo  '==='
done
---

and watched the progress of the dd processes in different cgroups. The 1/10
weight difference has no effect with your writeback patches - the situation
after one minute:

3120+1 records in
3120+1 records out
3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
3217+1 records in
3217+1 records out
3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s

I should add that even without your patches the progress doesn't quite
correspond to the weight ratio:


Forgot to fill in corresponding data for unpatched kernel here:

5962+2 records in
5962+2 records out
6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
1502+0 records in
1502+0 records out
1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s


Thanks for testing this, I'll see what we can do about that. It stands 
to reason that we'll throttle a heavier writer more, statistically. But 
I'm assuming this above test was run basically with just the writes 
going, so no real competition? And hence we end up throttling them 
equally much, destroying the weighting in the process. But for both 
cases, we basically don't pay any attention to cgroup weights.



but still there is noticeable difference to cgroups with different weights.

OTOH blk-throttle combines well with your patches: Limiting one cgroup to
5 M/s results in numbers like:

3883+2 records in
3883+2 records out
4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
413+0 records in
413+0 records out
433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s

which is fine and comparable with unpatched kernel. Higher throughput
number is because we do buffered writes and dd reports what it wrote into
page cache. And there is no wonder blk-throttle combines fine - it
throttles bios which happens before we reach writeback throttling
mechanism.


OK, that's good, at least that part works fine. And yes, the throttle 
path is hit before we end up in the make_request_fn, which is where wbt 
drops in.
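
Put as a rough ordering sketch (names simplified, illustrative only, not
actual function signatures):

/*
 * Rough ordering for a buffered write, per the discussion above:
 *
 *   submit_bio()
 *     -> blkcg / blk-throttle limits applied to the bio
 *     -> make_request_fn()          <- wbt_wait() throttles here (patch 8/8)
 *          -> request allocation / dispatch to the device
 */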



So I believe this demonstrates that your writeback throttling just doesn't
work well with selective scheduling policy that happens below it because it
can essentially lead to IO priority inversion issues...


Is this testing still done on the QD=1 ATA disk? Not too surprising that
this falls apart, since we have very little room to maneuver. I wonder 
if a normal SATA with NCQ would behave better in this regard. I'll have 
to test a bit and think about how we can best handle this case.


--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jan Kara
On Tue 03-05-16 17:40:32, Jan Kara wrote:
> On Tue 03-05-16 11:34:10, Jan Kara wrote:
> > Yeah, once I'll hunt down that regression with old disk, I can have a look
> > into how writeback throttling plays together with blkio-controller.
> 
> So I've tried the following script (note that you need cgroup v2 for
> writeback IO to be throttled):
> 
> ---
> mkdir /sys/fs/cgroup/group1
> echo 1000 >/sys/fs/cgroup/group1/io.weight
> dd if=/dev/zero of=/mnt/file1 bs=1M count=1&
> DD1=$!
> echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs
> 
> mkdir /sys/fs/cgroup/group2
> echo 100 >/sys/fs/cgroup/group2/io.weight
> #echo "259:65536 wbps=500" >/sys/fs/cgroup/group2/io.max
> echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
> dd if=/dev/zero of=/mnt/file2 bs=1M count=1&
> DD2=$!
> echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs
> 
> while true; do
> sleep 1
> kill -USR1 $DD1
> kill -USR1 $DD2
> echo  '==='
> done
> ---
> 
> and watched the progress of the dd processes in different cgroups. The 1/10
> weight difference has no effect with your writeback patches - the situation
> after one minute:
> 
> 3120+1 records in
> 3120+1 records out
> 3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
> 3217+1 records in
> 3217+1 records out
> 3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s
> 
> I should add that even without your patches the progress doesn't quite
> correspond to the weight ratio:

Forgot to fill in corresponding data for unpatched kernel here:

5962+2 records in
5962+2 records out
6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
1502+0 records in
1502+0 records out
1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s

> but still there is noticeable difference to cgroups with different weights.
> 
> OTOH blk-throttle combines well with your patches: Limiting one cgroup to
> 5 M/s results in numbers like:
> 
> 3883+2 records in
> 3883+2 records out
> 4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
> 413+0 records in
> 413+0 records out
> 433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s
> 
> which is fine and comparable with unpatched kernel. Higher throughput
> number is because we do buffered writes and dd reports what it wrote into
> page cache. And there is no wonder blk-throttle combines fine - it
> throttles bios which happens before we reach writeback throttling
> mechanism.
> 
> So I believe this demonstrates that your writeback throttling just doesn't
> work well with selective scheduling policy that happens below it because it
> can essentially lead to IO priority inversion issues...
> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jan Kara
On Tue 03-05-16 11:34:10, Jan Kara wrote:
> Yeah, once I'll hunt down that regression with old disk, I can have a look
> into how writeback throttling plays together with blkio-controller.

So I've tried the following script (note that you need cgroup v2 for
writeback IO to be throttled):

---
mkdir /sys/fs/cgroup/group1
echo 1000 >/sys/fs/cgroup/group1/io.weight
dd if=/dev/zero of=/mnt/file1 bs=1M count=1&
DD1=$!
echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs

mkdir /sys/fs/cgroup/group2
echo 100 >/sys/fs/cgroup/group2/io.weight
#echo "259:65536 wbps=500" >/sys/fs/cgroup/group2/io.max
echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
dd if=/dev/zero of=/mnt/file2 bs=1M count=1&
DD2=$!
echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs

while true; do
sleep 1
kill -USR1 $DD1
kill -USR1 $DD2
echo  '==='
done
---

and watched the progress of the dd processes in different cgroups. The 1/10
weight difference has no effect with your writeback patches - the situation
after one minute:

3120+1 records in
3120+1 records out
3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
3217+1 records in
3217+1 records out
3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s

I should add that even without your patches the progress doesn't quite
correspond to the weight ratio:
...

but still there is noticeable difference to cgroups with different weights.

OTOH blk-throttle combines well with your patches: Limiting one cgroup to
5 M/s results in numbers like:

3883+2 records in
3883+2 records out
4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
413+0 records in
413+0 records out
433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s

which is fine and comparable with unpatched kernel. Higher throughput
number is because we do buffered writes and dd reports what it wrote into
page cache. And there is no wonder blk-throttle combines fine - it
throttles bios which happens before we reach writeback throttling
mechanism.

So I believe this demonstrates that your writeback throttling just doesn't
work well with selective scheduling policy that happens below it because it
can essentially lead to IO priority inversion issues...

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jens Axboe

On 05/03/2016 09:22 AM, Jan Kara wrote:

On Tue 03-05-16 08:23:27, Jens Axboe wrote:

On 05/03/2016 03:34 AM, Jan Kara wrote:

On Thu 28-04-16 12:53:50, Jens Axboe wrote:

2) As far as I can see in patch 8/8, you have plugged the throttling above
the IO scheduler. When there are e.g. multiple cgroups with different IO
limits operating, this throttling can lead to strange results (like a
cgroup with low limit using up all available background "slots" and thus
effectively stopping background writeback for other cgroups)? So won't
it make more sense to plug this below the IO scheduler? Now I understand
there may be other problems with this but I think we should put more
thought to that and provide some justification in changelogs.


One complexity is that we have to do this early for blk-mq, since once you
get a request, you're already sitting on the hw tag. CoDel should actually
work fine at each hop, so hopefully this will as well.


OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?


Not necessarily. For IO scheduling, basically we care about two parts:

1) Are you allowed to allocate the resources to queue some IO
2) Are you allowed to dispatch


But then it seems suboptimal to waste a relatively scarce resource (which
HW tag is AFAIU) just because you happen to run from a cgroup that is
bandwidth limited and thus are not allowed to dispatch?


For some cases, you are absolutely right, and #1 is the main one. For 
your case of QD=1, that's obviously the case. For SATA, it's a bit more 
grey zone, and for others (nvme, scsi, etc), it's not really a scarce 
resource so #2 is the bigger part of it.


--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jan Kara
On Tue 03-05-16 08:23:27, Jens Axboe wrote:
> On 05/03/2016 03:34 AM, Jan Kara wrote:
> >On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >>>2) As far as I can see in patch 8/8, you have plugged the throttling above
> >>>the IO scheduler. When there are e.g. multiple cgroups with different 
> >>> IO
> >>>limits operating, this throttling can lead to strange results (like a
> >>>cgroup with low limit using up all available background "slots" and 
> >>> thus
> >>>effectively stopping background writeback for other cgroups)? So won't
> >>>it make more sense to plug this below the IO scheduler? Now I 
> >>> understand
> >>>there may be other problems with this but I think we should put more
> >>>thought to that and provide some justification in changelogs.
> >>
> >>One complexity is that we have to do this early for blk-mq, since once you
> >>get a request, you're already sitting on the hw tag. CoDel should actually
> >>work fine at each hop, so hopefully this will as well.
> >
> >OK, I see. But then this suggests that any IO scheduling and / or
> >cgroup-related throttling should happen before we get a request for blk-mq
> >as well? And then we can still do writeback throttling below that layer?
> 
> Not necessarily. For IO scheduling, basically we care about two parts:
> 
> 1) Are you allowed to allocate the resources to queue some IO
> 2) Are you allowed to dispatch

But then it seems suboptimal to waste a relatively scarce resource (which
HW tag is AFAIU) just because you happen to run from a cgroup that is
bandwidth limited and thus are not allowed to dispatch?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jens Axboe

On 05/03/2016 03:34 AM, Jan Kara wrote:

On Thu 28-04-16 12:53:50, Jens Axboe wrote:

2) As far as I can see in patch 8/8, you have plugged the throttling above
the IO scheduler. When there are e.g. multiple cgroups with different IO
limits operating, this throttling can lead to strange results (like a
cgroup with low limit using up all available background "slots" and thus
effectively stopping background writeback for other cgroups)? So won't
it make more sense to plug this below the IO scheduler? Now I understand
there may be other problems with this but I think we should put more
thought to that and provide some justification in changelogs.


One complexity is that we have to do this early for blk-mq, since once you
get a request, you're already sitting on the hw tag. CoDel should actually
work fine at each hop, so hopefully this will as well.


OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?


Not necessarily. For IO scheduling, basically we care about two parts:

1) Are you allowed to allocate the resources to queue some IO
2) Are you allowed to dispatch

The latter part can still be handled independently, and the former as 
well of course, wbt just deals with throttling back #1 for buffered writes.
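
As a rough, self-contained illustration of that split (purely schematic, with
invented names, not code from the patch set): gate #1 is the allocation /
admission step that wbt throttles for buffered writeback, gate #2 is the
dispatch step where a scheduler or cgroup policy can still hold an
already-allocated request back.

#include <stdbool.h>
#include <stdio.h>

/* gate #1: may we allocate a request (and, for blk-mq, a hw tag) at all? */
static bool may_allocate(unsigned int inflight, unsigned int wb_limit)
{
	return inflight < wb_limit;
}

/* gate #2: may an already-allocated request be dispatched right now?
 * This is where an IO scheduler or cgroup bandwidth policy can still
 * hold the request back, even though it already occupies a tag.       */
static bool may_dispatch(bool policy_allows)
{
	return policy_allows;
}

int main(void)
{
	unsigned int inflight = 7, wb_normal = 8;

	if (may_allocate(inflight, wb_normal) && !may_dispatch(false))
		printf("allocated but not dispatchable: tag held while throttled\n");
	return 0;
}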



But yes, fairness is something that we have to pay attention to. Right now
the wait queue has no priority associated with it, that should probably be
improved to be able to wakeup in a more appropriate order.
Needs testing, but hopefully it works out since if you do run into
starvation, then you'll go to the back of the queue for the next attempt.


Yeah, once I hunt down that regression with the old disk, I can have a look
into how writeback throttling plays together with the blkio-controller.


Thanks!

--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-05-03 Thread Jan Kara
On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >2) As far as I can see in patch 8/8, you have plugged the throttling above
> >the IO scheduler. When there are e.g. multiple cgroups with different IO
> >limits operating, this throttling can lead to strange results (like a
> >cgroup with low limit using up all available background "slots" and thus
> >effectively stopping background writeback for other cgroups)? So won't
> >it make more sense to plug this below the IO scheduler? Now I understand
> >there may be other problems with this but I think we should put more
> >thought to that and provide some justification in changelogs.
> 
> One complexity is that we have to do this early for blk-mq, since once you
> get a request, you're already sitting on the hw tag. CoDel should actually
> work fine at each hop, so hopefully this will as well.

OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?

> But yes, fairness is something that we have to pay attention to. Right now
> the wait queue has no priority associated with it, that should probably be
> improved to be able to wakeup in a more appropriate order.
> Needs testing, but hopefully it works out since if you do run into
> starvation, then you'll go to the back of the queue for the next attempt.

Yeah, once I hunt down that regression with the old disk, I can have a look
into how writeback throttling plays together with the blkio-controller.

> >>+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> >>+{
> >>+   u64 thislat;
> >>+
> >>+   /*
> >>+* If our stored sync issue exceeds the window size, or it
> >>+* exceeds our min target AND we haven't logged any entries,
> >>+* flag the latency as exceeded.
> >>+*/
> >>+   thislat = rwb_sync_issue_lat(rwb);
> >>+   if (thislat > rwb->cur_win_nsec ||
> >>+   (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> >>+   trace_wbt_lat(rwb->bdi, thislat);
> >>+   return LAT_EXCEEDED;
> >>+   }
> >
> >So I'm trying to wrap my head around this. If I read the code right,
> >rwb_sync_issue_lat() returns the time that has passed since issuing a sync
> >request that is still running. We basically randomly pick which sync
> >request we track, as we always start tracking a sync request when one is
> >issued and we are not tracking any at that moment. This is to detect the
> >case when the latency of sync IO is very large compared to the measurement
> >window, so we would not get enough samples to make it valid?
> 
> Right, that's pretty close. Since wbt uses the completion latencies to make
> decisions, if an IO hasn't completed, we don't know about it. If the device
> is flooded with writes, and we then issue a read, maybe that read won't
> complete for multiple monitoring windows. During that time, we keep thinking
> everything is fine. But in reality, it's not completing because of the write
> load. So this logic attempts to track the single sync IO request case. If
> that exceeds a monitoring window of time and we saw no other sync IO in that
> window, then treat that case as if it had completed but exceeded the min
> latency. And then scale back.
> 
> We'll always treat a state sample with 1 read as valuable, but for this
> case, we don't have that sample until it completes.
> 
> Does that make more sense?

OK, makes sense. Thanks for the explanation.

Honza
-- 
Jan Kara 
SUSE Labs, CR
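
Restating the tracking logic described above as a compact sketch (simplified
stand-in types, illustrative only; the decision mirrors the quoted
__latency_exceeded() fragment):

#include <stdint.h>
#include <stdio.h>

struct blk_rq_stat_lite { unsigned int nr_samples; };	/* simplified read stats */

struct rq_wb_lite {
	uint64_t cur_win_nsec;	/* current monitoring window */
	uint64_t min_lat_nsec;	/* latency target */
	uint64_t sync_issue;	/* issue time of the tracked, still-running sync IO (0 = none) */
};

/* time the tracked sync request has been outstanding */
static uint64_t rwb_sync_issue_lat(const struct rq_wb_lite *rwb, uint64_t now_ns)
{
	return rwb->sync_issue ? now_ns - rwb->sync_issue : 0;
}

/*
 * Flag the window if the tracked sync IO has been outstanding longer than
 * the window, or longer than the target while no read completed in this
 * window at all.
 */
static int sync_latency_exceeded(const struct rq_wb_lite *rwb,
				 const struct blk_rq_stat_lite *reads,
				 uint64_t now_ns)
{
	uint64_t thislat = rwb_sync_issue_lat(rwb, now_ns);

	return thislat > rwb->cur_win_nsec ||
	       (thislat > rwb->min_lat_nsec && !reads->nr_samples);
}

int main(void)
{
	struct rq_wb_lite rwb = { 100000000ULL, 2000000ULL, 500000000ULL };
	struct blk_rq_stat_lite reads = { 0 };

	/* the read was issued at t=500ms and is still outstanding at t=650ms */
	printf("exceeded=%d\n", sync_latency_exceeded(&rwb, &reads, 650000000ULL));
	return 0;
}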


Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-28 Thread Jens Axboe

On 04/28/2016 12:53 PM, Jens Axboe wrote:



Probably the comment could explain more of "why we do this?" than pure
"what we do".


Agree, if you find it confusing, then it needs updating. I'll update the
comment.


http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle

This should address your review comments, I believe.

--
Jens Axboe



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-28 Thread Jens Axboe

On 04/28/2016 05:05 AM, Jan Kara wrote:

I have some comments below...


+struct rq_wb {
+   /*
+* Settings that govern how we throttle
+*/
+   unsigned int wb_background; /* background writeback */
+   unsigned int wb_normal; /* normal writeback */
+   unsigned int wb_max;/* max throughput writeback */
+   unsigned int scale_step;
+
+   u64 win_nsec;   /* default window size */
+   u64 cur_win_nsec;   /* current window size */
+
+   unsigned int unknown_cnt;


It would be useful to have a comment here explaining that 'unknown_cnt' is
a number of consecutive periods in which we didn't have enough data to
decide about queue scaling (at least this is what I understood from the
code).


Agree, I'll add that comment.


+
+   struct timer_list window_timer;
+
+   s64 sync_issue;
+   void *sync_cookie;


So I'm somewhat wondering: What is protecting consistency of this
structure? The limits, scale_step, cur_win_nsec, unknown_cnt are updated only
from timer so those should be safe. However sync_issue & sync_cookie are
accessed from IO submission and completion path and there we need some
protection to keep those two in sync. It seems q->queue_lock should mostly
achieve those except for blk-mq submission path calling wbt_wait() which
doesn't hold queue_lock.


Right, it's designed such that only the timer will be updating these 
values, and that part is serialized. For sync_issue and sync_cookie, the 
important part there is that we never dereference sync_cookie. That's 
why it's a void * now. So we just use it as a hint. And yes, if the IO 
happens to complete at just the time we are looking at it, we could get 
a false positive or false negative. That's going to be noise, and 
nothing we need to worry about. It's deliberate that I don't do any 
locking for that, the only reason we pass in the queue_lock is to be 
able to drop it for sleeping.



It seems you were aware of the possible races and the code handles them
mostly fine (although I wouldn't bet too much there is not some weird
corner case). However it would be good to comment on this somewhere and
explain what the rules for these two fields are.


Agree, it does warrant a good code comment. If we look at the edge 
cases, one would be:


We look at sync_issue and decide that we're now too late, at the same 
time as the sync_cookie gets cleared. For this case, we'll count it as 
an exceed and scale down. In reality we were late, so it doesn't matter. 
Even if it was the exact time, it's still prudent to scale down as we're 
going to miss soon.


A more worrying case would be two issues that happen at the same time, 
and only one gets set. Let's assume the one that doesn't get set is the 
one that ends up taking a long time to complete. We'll miss scaling down 
in this case, we'll only notice when it completes and shows up in the 
stats. Not ideal, but it's still being handled in the fashion that was 
originally intended, at completion time.
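
A condensed sketch of that hint-only tracking, pieced together from the
description above and the wbt_done() fragment quoted later in the thread
(simplified types, deliberately no locking, since the cookie is only compared
and never dereferenced):

#include <stdint.h>
#include <stdio.h>

struct wb_issue_stat_lite { uint64_t time; };	/* stand-in for struct wb_issue_stat */

struct rq_wb_lite {
	int64_t sync_issue;	/* issue time of the tracked sync request */
	void *sync_cookie;	/* opaque hint; never dereferenced */
};

/* issue path: start tracking a sync request only if none is tracked yet */
static void sync_issue_track(struct rq_wb_lite *rwb,
			     struct wb_issue_stat_lite *stat, int64_t now_ns)
{
	if (!rwb->sync_cookie) {
		rwb->sync_cookie = stat;
		rwb->sync_issue = now_ns;
	}
}

/* completion path, as in the wbt_done() fragment quoted later in the thread */
static void sync_issue_complete(struct rq_wb_lite *rwb,
				struct wb_issue_stat_lite *stat)
{
	if (rwb->sync_cookie == stat) {
		rwb->sync_issue = 0;
		rwb->sync_cookie = NULL;
	}
}

int main(void)
{
	struct rq_wb_lite rwb = { 0, NULL };
	struct wb_issue_stat_lite a = { 0 }, b = { 0 };

	sync_issue_track(&rwb, &a, 1000);
	sync_issue_track(&rwb, &b, 2000);	/* ignored: already tracking a */
	sync_issue_complete(&rwb, &b);		/* no-op: the cookie is not b */
	sync_issue_complete(&rwb, &a);		/* clears the hint */
	printf("tracked=%p issue=%lld\n", rwb.sync_cookie, (long long)rwb.sync_issue);
	return 0;
}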



diff --git a/lib/wbt.c b/lib/wbt.c
new file mode 100644
index ..650da911f24f
--- /dev/null
+++ b/lib/wbt.c
@@ -0,0 +1,524 @@
+/*
+ * buffered writeback throttling. loosely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.


I'm wondering about two things:

1) There is a logic somewhat in this direction in blk_queue_start_tag().
Probably it should be removed after your patches land?


You're referring to the read/write separation in the legacy tagging? Yes 
agree, we can kill that once this goes in.



2) As far as I can see in patch 8/8, you have plugged the throttling above
the IO scheduler. When there are e.g. multiple cgroups with different IO
limits operating, this throttling can lead to strange results (like a
cgroup with low limit using up all available background "slots" and thus
effectively stopping background writeback for other cgroups)? So won't
it make more sense to plug this below the IO scheduler? Now I understand
there may be other problems with this but I think we should put more
thought to that and provide some justification in changelogs.


One complexity is that we have to do this early for blk-mq, since once 
you get a request, you're already sitting on the hw tag. CoDel should 
actually work fine at each hop, so hopefully this will as well.


But yes, fairness is something that we have to pay attention to. Right 
now the 

Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-28 Thread Jan Kara
On Tue 26-04-16 09:55:30, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes. Or NFS can tap into it, to accomplish the same.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, 
> max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

I have some comments below...

> +struct rq_wb {
> + /*
> +  * Settings that govern how we throttle
> +  */
> + unsigned int wb_background; /* background writeback */
> + unsigned int wb_normal; /* normal writeback */
> + unsigned int wb_max;/* max throughput writeback */
> + unsigned int scale_step;
> +
> + u64 win_nsec;   /* default window size */
> + u64 cur_win_nsec;   /* current window size */
> +
> + unsigned int unknown_cnt;

It would be useful to have a comment here explaining that 'unknown_cnt' is
a number of consecutive periods in which we didn't have enough data to
decide about queue scaling (at least this is what I understood from the
code).

> +
> + struct timer_list window_timer;
> +
> + s64 sync_issue;
> + void *sync_cookie;

So I'm somewhat wondering: What is protecting consistency of this
structure? The limits, scale_step, cur_win_nsec, unknown_cnt are updated only
from timer so those should be safe. However sync_issue & sync_cookie are
accessed from IO submission and completion path and there we need some
protection to keep those two in sync. It seems q->queue_lock should mostly
achieve those except for blk-mq submission path calling wbt_wait() which
doesn't hold queue_lock.

It seems you were aware of the possible races and the code handles them
mostly fine (although I wouldn't bet too much there is not some weird
corner case). However it would be good to comment on this somewhere and
explain what the rules for these two fields are.

> +
> + unsigned int wc;
> + unsigned int queue_depth;
> +
> + unsigned long last_issue;   /* last non-throttled issue */
> + unsigned long last_comp;/* last non-throttled comp */
> + unsigned long min_lat_nsec;
> + struct backing_dev_info *bdi;
> + struct request_queue *q;
> + wait_queue_head_t wait;
> + atomic_t inflight;
> +
> + struct wb_stat_ops *stat_ops;
> + void *ops_data;
> +};
...
> diff --git a/lib/wbt.c b/lib/wbt.c
> new file mode 100644
> index ..650da911f24f
> --- /dev/null
> +++ b/lib/wbt.c
> @@ -0,0 +1,524 @@
> +/*
> + * buffered writeback throttling. loosely based on CoDel. We can't drop
> + * packets for IO scheduling, so the logic is something like this:
> + *
> + * - Monitor latencies in a defined window of time.
> + * - If the minimum latency in the above window exceeds some target, increment
> + *   scaling step and scale down queue depth by a factor of 2x. The monitoring
> + *   window is then shrunk to 100 / sqrt(scaling step + 1).
> + * - For any window where we don't have solid data on what the latencies
> + *   look like, retain status quo.
> + * - If latencies look good, decrement scaling step.

I'm wondering about two things:

1) There is a logic somewhat in this direction in blk_queue_start_tag().
   Probably it should be removed after your patches land?

2) As far as I can see in patch 8/8, you have plugged the throttling above
   the IO scheduler. When there are e.g. multiple cgroups with different IO
   limits operating, this throttling can lead to strange results (like a
   cgroup with low limit using up all available background "slots" and thus
   effectively stopping background writeback for other cgroups)? So won't
   it make more sense to plug this below the IO scheduler? Now I understand
   there may be other problems with this but I think we should put more
   thought to that and provide some justification in changelogs.

> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> + unsigned int depth;
> +
> + if (!rwb->min_lat_nsec) {
> + rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> + return;
> + }
> +
> + depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> + /*
> +  * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +  */

The comment looks a bit out of place here since we don't reduce max depth
here. We just use whatever is set in scale_step...

> + rwb->wb_max = 1 + ((depth - 1) >> min(31U, 
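
The quoted code is truncated here by the archive, but the full arithmetic
appears in the fragments quoted later in the thread: wb_max = 1 + ((depth - 1)
>> min(31U, scale_step)), wb_normal = (wb_max + 1) / 2, wb_background =
(wb_max + 3) / 4. A small worked example (RWB_MAX_DEPTH assumed to be 64 here,
purely for illustration) reproduces the numbers from the quoted wbt_step trace
at step=1 (background=8, normal=16, max=32); the window estimate follows the
100 / sqrt(step + 1) rule from the lib/wbt.c comment and is only approximate
(the trace shows window=72727272 ns, about 72.7 ms, because the kernel uses
integer arithmetic):

#include <math.h>
#include <stdio.h>

#define RWB_MAX_DEPTH	64	/* assumed cap, chosen so step=1 matches the quoted trace */

static void show_limits(unsigned int queue_depth, unsigned int scale_step)
{
	unsigned int depth = queue_depth < RWB_MAX_DEPTH ? queue_depth : RWB_MAX_DEPTH;
	unsigned int shift = scale_step < 31 ? scale_step : 31;
	unsigned int wb_max, wb_normal, wb_background;

	/* same arithmetic as the quoted calc_wb_limits() */
	wb_max = 1 + ((depth - 1) >> shift);
	wb_normal = (wb_max + 1) / 2;
	wb_background = (wb_max + 3) / 4;

	printf("step=%u: max=%u normal=%u background=%u window~%.1f ms\n",
	       scale_step, wb_max, wb_normal, wb_background,
	       100.0 / sqrt(scale_step + 1.0));	/* per the lib/wbt.c comment */
}

int main(void)
{
	unsigned int step;

	for (step = 0; step <= 2; step++)
		show_limits(128, step);	/* e.g. a queue depth of 128 */
	return 0;
}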

Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-27 Thread xiakaixu
On 2016/4/27 23:21, Jens Axboe wrote:
> On 04/27/2016 06:06 AM, xiakaixu wrote:
>>> +void __wbt_done(struct rq_wb *rwb)
>>> +{
>>> +int inflight, limit = rwb->wb_normal;
>>> +
>>> +/*
>>> + * If the device does write back caching, drop further down
>>> + * before we wake people up.
>>> + */
>>> +if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>>> +limit = 0;
>>> +else
>>> +limit = rwb->wb_normal;
>>> +
>>> +/*
>>> + * Don't wake anyone up if we are above the normal limit. If
>>> + * throttling got disabled (limit == 0) with waiters, ensure
>>> + * that we wake them up.
>>> + */
>>> +inflight = atomic_dec_return(&rwb->inflight);
>>> +if (limit && inflight >= limit) {
>>> +if (!rwb->wb_max)
>>> +wake_up_all(&rwb->wait);
>>> +return;
>>> +}
>>> +
>> Hi Jens,
>>
>> Just a little confused about this. The rwb->wb_max can't be 0 if the variable
>> 'limit' does not equal 0. So the if (!rwb->wb_max) branch may not make
>> sense.
> 
> You are right, it doesn't make a lot of sense. I think it suffers from code 
> shuffling. How about the attached? The important part is that we wake up 
> waiters, if wbt got disabled while we had tracked IO in flight.
>
Hi Jens,

The modified patch in the other mail looks better. Maybe there are still
some places that could be improved. You can find them in that mail.



-- 
Regards
Kaixu Xia



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-27 Thread Jens Axboe

On 04/27/2016 06:06 AM, xiakaixu wrote:

+void __wbt_done(struct rq_wb *rwb)
+{
+   int inflight, limit = rwb->wb_normal;
+
+   /*
+* If the device does write back caching, drop further down
+* before we wake people up.
+*/
+   if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+   limit = 0;
+   else
+   limit = rwb->wb_normal;
+
+   /*
+* Don't wake anyone up if we are above the normal limit. If
+* throttling got disabled (limit == 0) with waiters, ensure
+* that we wake them up.
+*/
+   inflight = atomic_dec_return(&rwb->inflight);
+   if (limit && inflight >= limit) {
+   if (!rwb->wb_max)
+   wake_up_all(&rwb->wait);
+   return;
+   }
+

Hi Jens,

Just a little confused about this. The rwb->wb_max can't be 0 if the variable
'limit' does not equal 0. So the if (!rwb->wb_max) branch may not make
sense.


You are right, it doesn't make a lot of sense. I think it suffers from 
code shuffling. How about the attached? The important part is that we 
wake up waiters, if wbt got disabled while we had tracked IO in flight.
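
For context on what "wbt got disabled" means here: the helpers quoted further
down in the thread show rwb_enabled() is simply a check of wb_normal (return
rwb && rwb->wb_normal != 0), and calc_wb_limits() zeroes all three limits when
min_lat_nsec is 0. A condensed sketch of that relationship, with simplified
field types for illustration:

#include <stdbool.h>
#include <stdio.h>

/* simplified stand-in for the relevant struct rq_wb fields */
struct rq_wb_lite {
	unsigned int wb_background;
	unsigned int wb_normal;
	unsigned int wb_max;
	unsigned long min_lat_nsec;
};

/* mirrors the quoted rwb_enabled(): throttling is on iff wb_normal != 0 */
static bool rwb_enabled(const struct rq_wb_lite *rwb)
{
	return rwb && rwb->wb_normal != 0;
}

/* mirrors the disable branch of the quoted calc_wb_limits() */
static void wbt_disable(struct rq_wb_lite *rwb)
{
	rwb->min_lat_nsec = 0;
	rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
	/* with the fix below, the next __wbt_done() sees !rwb_enabled()
	 * and wakes all waiters instead of leaving them stuck */
}

int main(void)
{
	struct rq_wb_lite rwb = { 8, 16, 32, 2000000UL };

	wbt_disable(&rwb);
	printf("enabled=%d\n", rwb_enabled(&rwb));
	return 0;
}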


--
Jens Axboe

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..a6b80c135510 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 


Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-27 Thread xiakaixu

> + return rwb && rwb->wb_normal != 0;
> +}
> +
> +/*
> + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
> + * false if 'v' + 1 would be bigger than 'below'.
> + */
> +static bool atomic_inc_below(atomic_t *v, int below)
> +{
> + int cur = atomic_read(v);
> +
> + for (;;) {
> + int old;
> +
> + if (cur >= below)
> + return false;
> + old = atomic_cmpxchg(v, cur, cur + 1);
> + if (old == cur)
> + break;
> + cur = old;
> + }
> +
> + return true;
> +}
> +
> +static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
> +{
> + if (rwb_enabled(rwb)) {
> + const unsigned long cur = jiffies;
> +
> + if (cur != *var)
> + *var = cur;
> + }
> +}
> +
> +void __wbt_done(struct rq_wb *rwb)
> +{
> + int inflight, limit = rwb->wb_normal;
> +
> + /*
> +  * If the device does write back caching, drop further down
> +  * before we wake people up.
> +  */
> + if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> + limit = 0;
> + else
> + limit = rwb->wb_normal;
> +
> + /*
> +  * Don't wake anyone up if we are above the normal limit. If
> +  * throttling got disabled (limit == 0) with waiters, ensure
> +  * that we wake them up.
> +  */
> + inflight = atomic_dec_return(&rwb->inflight);
> + if (limit && inflight >= limit) {
> + if (!rwb->wb_max)
> + wake_up_all(&rwb->wait);
> + return;
> + }
> +
Hi Jens,

Just a little confused about this. The rwb->wb_max can't be 0 if the variable
'limit' does not equal 0. So the if (!rwb->wb_max) branch may not make
sense.


> + if (waitqueue_active(&rwb->wait)) {
> + int diff = limit - inflight;
> +
> + if (!inflight || diff >= rwb->wb_background / 2)
> + wake_up_nr(&rwb->wait, 1);
> + }
> +}
> +
> +/*
> + * Called on completion of a request. Note that it's also called when
> + * a request is merged, when the request gets freed.
> + */
> +void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
> +{
> + if (!rwb)
> + return;
> +
> + if (!wbt_tracked(stat)) {
> + if (rwb->sync_cookie == stat) {
> + rwb->sync_issue = 0;
> + rwb->sync_cookie = NULL;
> + }
> +
> + wb_timestamp(rwb, &rwb->last_comp);
> + } else {
> + WARN_ON_ONCE(stat == rwb->sync_cookie);
> + __wbt_done(rwb);
> + wbt_clear_tracked(stat);
> + }
> +}
> +
> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> + unsigned int depth;
> +
> + if (!rwb->min_lat_nsec) {
> + rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> + return;
> + }
> +
> + depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> + /*
> +  * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +  */
> + rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> + rwb->wb_normal = (rwb->wb_max + 1) / 2;
> + rwb->wb_background = (rwb->wb_max + 3) / 4;
> +}
> +
> +static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> +{
> + /*
> +  * We need at least one read sample, and a minimum of
> +  * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
> +  * that it's writes impacting us, and not just some sole read on
> +  * a device that is in a lower power state.
> +  */
> + return stat[0].nr_samples >= 1 &&
> + stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
> +}
> +



[PATCH 7/8] wbt: add general throttling mechanism

2016-04-26 Thread Jens Axboe
We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
   wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, 
max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe 
---
 include/linux/wbt.h|  95 
 include/trace/events/wbt.h | 122 +++
 lib/Kconfig|   3 +
 lib/Makefile   |   1 +
 lib/wbt.c  | 524 +
 5 files changed, 745 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index ..c8a12795416b
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,95 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#define ISSUE_STAT_MASK(1ULL << 63)
+#define ISSUE_STAT_TIME_MASK   ~ISSUE_STAT_MASK
+
+struct wb_issue_stat {
+   u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+   stat->time = (stat->time & ISSUE_STAT_MASK) |
+   (ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+   return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+   stat->time |= ISSUE_STAT_MASK;
+}
+
+static inline void wbt_clear_tracked(struct wb_issue_stat *stat)
+{
+   stat->time &= ~ISSUE_STAT_MASK;
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+   return (stat->time & ISSUE_STAT_MASK) != 0;
+}
+
+struct wb_stat_ops {
+   void (*get)(void *, struct blk_rq_stat *);
+   void (*clear)(void *);
+};
+
+struct rq_wb {
+   /*
+* Settings that govern how we throttle
+*/
+   unsigned int wb_background; /* background writeback */
+   unsigned int wb_normal; /* normal writeback */
+   unsigned int wb_max;/* max throughput writeback */
+   unsigned int scale_step;
+
+   u64 win_nsec;   /* default window size */
+   u64 cur_win_nsec;   /* current window size */
+
+   unsigned int unknown_cnt;
+
+   struct timer_list window_timer;
+
+   s64 sync_issue;
+   void *sync_cookie;
+
+   unsigned int wc;
+   unsigned int queue_depth;
+
+   unsigned long last_issue;   /* last non-throttled issue */
+   unsigned long last_comp;/* last non-throttled comp */
+   unsigned long min_lat_nsec;
+   struct backing_dev_info *bdi;
+   struct request_queue *q;
+   wait_queue_head_t wait;
+   atomic_t inflight;
+
+   struct wb_stat_ops *stat_ops;
+   void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+bool wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void 
*);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index ..a4b8b2e57bb1
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,122 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WBT_H
+
+#include 
+#include 
+
+/**
+ * wbt_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(wbt_stat,
+
+   TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+
+   TP_ARGS(bdi, stat),
+
+   TP_STRUCT__entry(
+   __array(char, name, 32)
+   __field(s64, rmean)
+   __field(u64, rmin)
+   __field(u64, rmax)
+   __field(s64, rnr_samples)
+   __field(s64, rtime)
+   __field(s64, wmean)
+   __field(u64, wmin)
+   __field(u64, wmax)
+   __field(s64, wnr_samples)
+   __field(s64,