Re: [blk] WARNING: CPU: 0 PID: 300 at kernel/softirq.c:156 _local_bh_enable_ip()
Hi,

I think one fix is to use "_local_bh_enable()" instead of "local_bh_enable()"
in u64_stats_fetch_retry_bh(). There's no enabling of irqs in
_local_bh_enable().

But I wonder why we do "WARN_ON_ONCE(!irqs_disabled())" in _local_bh_enable().
What's the harm if someone calls _local_bh_enable() with irqs_disabled()?

Is the change below acceptable?

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 490fcbb..f446763 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -142,9 +142,11 @@ EXPORT_SYMBOL(_local_bh_enable);
 void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
+	unsigned long flags;
+
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
-	local_irq_disable();
+	raw_local_irq_save(flags);
 #endif
 	/*
 	 * Are softirqs going to be turned on now:
@@ -167,7 +169,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	preempt_count_dec();
 #ifdef CONFIG_TRACE_IRQFLAGS
-	local_irq_enable();
+	raw_local_irq_restore(flags);
 #endif
 	preempt_check_resched();
 }

Zhiguo

On Tue, Apr 22, 2014 at 9:06 AM, Jens Axboe wrote:
> On 2014-04-20 14:30, Jet Chen wrote:
>>
>> Hi Zhiguo,
>>
>> I got the below dmesg and the first bad commit is
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> commit 2c575026fae6e63771bd2a4c1d407214a8096a89
>> Author:     Hong Zhiguo
>> AuthorDate: Wed Nov 20 10:35:05 2013 -0700
>> Commit:     Jens Axboe
>> CommitDate: Wed Nov 20 15:33:04 2013 -0700
>>
>>     Update of blkg_stat and blkg_rwstat may happen in bh context.
>>     While u64_stats_fetch_retry is only preempt_disable on 32bit
>>     UP system. This is not enough to avoid preemption by bh and
>>     may read strange 64 bit value.
>>
>>     Signed-off-by: Hong Zhiguo
>>     Acked-by: Tejun Heo
>>     Cc: sta...@kernel.org
>>     Signed-off-by: Jens Axboe
>>
>> +-----------------------------------------------------------+------------+------------+
>> |                                                           | 82023bb7f7 | 2c575026fa |
>> +-----------------------------------------------------------+------------+------------+
>> | boot_successes                                            | 47         | 0          |
>> | boot_failures                                             | 13         | 20         |
>> | BUG:kernel_boot_hang                                      | 13         | 5          |
>> | inconsistent_IN-SOFTIRQ-W-SOFTIRQ-ON-W_usage              | 0          | 15         |
>> | backtrace:smpboot_thread_fn                               | 0          | 15         |
>> | backtrace:vfs_read                                        | 0          | 9          |
>> | backtrace:SyS_read                                        | 0          | 9          |
>> | WARNING:CPU:PID:at_kernel/softirq.c:_local_bh_enable_ip() | 0          | 7          |
>> | BUG:spinlock_lockup_suspected_on_CPU                      | 0          | 1          |
>> | backtrace:do_mount                                        | 0          | 6          |
>> | backtrace:SyS_mount                                       | 0          | 6          |
>> | backtrace:async_run_entry_fn                              | 0          | 0          |
>> +-----------------------------------------------------------+------------+------------+
>>
>> [   12.318816] WARNING: CPU: 0 PID: 300 at kernel/softirq.c:156
>> _local_bh_enable_ip+0x34/0x97()
>
> Does this still happen with 3.15-rc2?
>
> --
> Jens Axboe
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
best regards
Hong Zhiguo
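The point of the patch above is that the old code's unconditional
local_irq_enable() on exit silently turns interrupts back on even when the
caller entered with them disabled, while raw_local_irq_save()/restore()
preserves the caller's state. A minimal userspace model of that difference
(an illustration only, not kernel code; "irqs_enabled" stands in for the CPU
interrupt flag):

```c
#include <stdbool.h>

/* Simulated CPU interrupt-enable flag. */
static bool irqs_enabled = true;

/* Models raw_local_irq_save(): remember the flag, then disable. */
static unsigned long model_irq_save(void)
{
    unsigned long flags = irqs_enabled;
    irqs_enabled = false;
    return flags;
}

/* Models raw_local_irq_restore(): put back whatever the caller had. */
static void model_irq_restore(unsigned long flags)
{
    irqs_enabled = flags;
}

/* Old path: blindly re-enables interrupts on exit. */
static void bh_enable_old(void)
{
    irqs_enabled = false;   /* local_irq_disable() */
    /* ... softirq bookkeeping ... */
    irqs_enabled = true;    /* local_irq_enable(): wrong if caller had irqs off */
}

/* Patched path: leaves the caller's irq state untouched. */
static void bh_enable_new(void)
{
    unsigned long flags = model_irq_save();
    /* ... softirq bookkeeping ... */
    model_irq_restore(flags);
}
```

With the caller holding a lock irq-disabled (as in the queue_lock scenario in
this thread), the old path exits with interrupts on and the new path exits
with them still off.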
Re: [PATCH v4 1/2] blk-throttle: simplify logic by token bucket algorithm
Hi, Tejun, Vivek and Jens,

I did the tests and you affirmed the idea, and Vivek said he'd review the
last version of the patch. But it seems he left the blkio area more than
half a year ago. What should I do next to make progress?

BRs
Zhiguo

On Sun, Oct 20, 2013 at 8:11 PM, Hong Zhiguo wrote:
> The token bucket algorithm (http://en.wikipedia.org/wiki/Token_bucket)
> is very simple and widely used for rate limiting and shaping.
> It's well suited to blk-throttle, and it naturally works for
> hierarchical cgroups. So I took it to replace the original time
> _slicing_|_trimming_|_extending_ logic.
>
> The advantage is simplicity, reliability and less fluctuation.
>
> About 300 lines of code for time-slicing are replaced with 60 lines
> of code for the token bucket in blk-throttle.c.
>
> I've tested this patch with fio at rw=randread and rw=randrw. It
> behaves almost the same as the original time-slicing implementation,
> with less fluctuation. It's also been tested on a hierarchical setup.
>
> Signed-off-by: Hong Zhiguo
> Tested-by: Hong Zhiguo
> ---
>  block/blk-throttle.c | 285 +--
>  1 file changed, 50 insertions(+), 235 deletions(-)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8331aba..45d4d91 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2,6 +2,8 @@
>   * Interface for controlling IO bandwidth on a request queue
>   *
>   * Copyright (C) 2010 Vivek Goyal
> + *
> + * Simplify the logic with token bucket, Hong Zhiguo
>   */
>
>  #include <linux/module.h>
> @@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
>  /* Total max dispatch from all groups in one round */
>  static int throtl_quantum = 32;
>
> -/* Throttling is performed over 100ms slice and after that slice is renewed */
> -static unsigned long throtl_slice = HZ/10;	/* 100 ms */
> -
>  static struct blkcg_policy blkcg_policy_throtl;
>
>  /* A workqueue to queue throttle related work */
> @@ -91,6 +90,11 @@ struct tg_stats_cpu {
>  	struct blkg_rwstat serviced;
>  };
>
> +/* Depth of iops token bucket */
> +#define THROTL_BURST_IO 8
> +/* Depth of bps token bucket */
> +#define THROTL_BURST_BYTES (4096 * 8)
> +
>  struct throtl_grp {
>  	/* must be the first member */
>  	struct blkg_policy_data pd;
>
> @@ -133,14 +137,13 @@ struct throtl_grp {
>  	/* IOPS limits */
>  	unsigned int iops[2];
>
> -	/* Number of bytes disptached in current slice */
> -	uint64_t bytes_disp[2];
> -	/* Number of bio's dispatched in current slice */
> -	unsigned int io_disp[2];
> +	/* Token for throttling by bps */
> +	uint64_t bytes_token[2];
> +	/* Token for throttling by iops */
> +	unsigned int io_token[2];
>
> -	/* When did we start a new slice */
> -	unsigned long slice_start[2];
> -	unsigned long slice_end[2];
> +	/* Time check-point */
> +	unsigned long t_c[2];
>
>  	/* Per cpu stats pointer */
>  	struct tg_stats_cpu __percpu *stats_cpu;
>
> @@ -680,171 +683,39 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
>  	return false;
>  }
>
> -static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
> -					bool rw, unsigned long start)
> -{
> -	tg->bytes_disp[rw] = 0;
> -	tg->io_disp[rw] = 0;
> -
> -	/*
> -	 * Previous slice has expired. We must have trimmed it after last
> -	 * bio dispatch. That means since start of last slice, we never used
> -	 * that bandwidth. Do try to make use of that bandwidth while giving
> -	 * credit.
> -	 */
> -	if (time_after_eq(start, tg->slice_start[rw]))
> -		tg->slice_start[rw] = start;
> -
> -	tg->slice_end[rw] = jiffies + throtl_slice;
> -	throtl_log(&tg->service_queue,
> -		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
> -		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
> -		   tg->slice_end[rw], jiffies);
> -}
> -
> -static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
> -{
> -	tg->bytes_disp[rw] = 0;
> -	tg->io_disp[rw] = 0;
> -	tg->slice_start[rw] = jiffies;
> -	tg->slice_end[rw] = jiffies + throtl_slice;
> -	throtl_log(&tg->service_queue,
> -		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
> -		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
> -		   tg->slice_end[rw], jiffies);
> -}
> -
> -static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
> -					unsigned long jiffy_end)
> -{
> -	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> -}
> -
> -static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
> -				       unsigned long jiffy_end)
> -{
> -	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> -
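The token-bucket admission logic the patch describes can be sketched in a few
lines of userspace C. This is an illustrative model only, not the actual
blk-throttle code; names, units and the tick granularity are simplified, and
`depth` plays the role of THROTL_BURST_BYTES:

```c
#include <stdint.h>

/* A single token bucket: tokens accrue at `rate` per tick, capped at `depth`. */
struct bucket {
    uint64_t tokens;      /* bytes currently available */
    uint64_t rate;        /* bytes credited per tick (the bps limit) */
    uint64_t depth;       /* burst cap, analogous to THROTL_BURST_BYTES */
    uint64_t last_tick;   /* last time tokens were credited */
};

/* Credit tokens for elapsed time, capped at the bucket depth. */
static void bucket_refill(struct bucket *b, uint64_t now)
{
    b->tokens += (now - b->last_tick) * b->rate;
    if (b->tokens > b->depth)
        b->tokens = b->depth;
    b->last_tick = now;
}

/* A bio of `size` bytes may pass iff enough tokens are banked. */
static int bucket_admit(struct bucket *b, uint64_t size, uint64_t now)
{
    bucket_refill(b, now);
    if (b->tokens < size)
        return 0;         /* queue the bio and wait for more tokens */
    b->tokens -= size;
    return 1;             /* dispatch immediately */
}
```

The depth cap is what keeps a long idle period from banking unbounded credit,
which is exactly the over-trim concern discussed later in this thread.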
Re: [blkg] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Hi,

Thanks for the report. The q->queue_lock may be taken in irq context. And in
the sys_read() context we hold q->queue_lock with irqs disabled and then call
local_bh_enable(), which turns irqs back on. This opens a window for a double
acquire of queue_lock.

One fix is to use "_local_bh_enable()" instead of "local_bh_enable()" in
u64_stats_fetch_retry_bh(). There's no enabling of irqs in _local_bh_enable().

But I wonder why we do "WARN_ON_ONCE(!irqs_disabled())" in _local_bh_enable().
What's the harm if someone calls _local_bh_enable() with irqs_disabled()?

Is the change below acceptable?

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 490fcbb..f446763 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -142,9 +142,11 @@ EXPORT_SYMBOL(_local_bh_enable);
 void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
+	unsigned long flags;
+
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
-	local_irq_disable();
+	raw_local_irq_save(flags);
 #endif
 	/*
 	 * Are softirqs going to be turned on now:
@@ -167,7 +169,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	preempt_count_dec();
 #ifdef CONFIG_TRACE_IRQFLAGS
-	local_irq_enable();
+	raw_local_irq_restore(flags);
 #endif
 	preempt_check_resched();
 }

Zhiguo

On Wed, Mar 5, 2014 at 9:21 PM, Fengguang Wu wrote:
> Greetings,
>
> I got the below dmesg and the first bad commit is
>
> commit 2c575026fae6e63771bd2a4c1d407214a8096a89
> Author:     Hong Zhiguo
> AuthorDate: Wed Nov 20 10:35:05 2013 -0700
> Commit:     Jens Axboe
> CommitDate: Wed Nov 20 15:33:04 2013 -0700
>
>     Update of blkg_stat and blkg_rwstat may happen in bh context.
>     While u64_stats_fetch_retry is only preempt_disable on 32bit
>     UP system. This is not enough to avoid preemption by bh and
>     may read strange 64 bit value.
> +------------------------------------------------------------+----+
> | boot_successes                                             | 0  |
> | boot_failures                                              | 19 |
> | inconsistent_IN-HARDIRQ-W-HARDIRQ-ON-W_usage               | 12 |
> | backtrace:vfs_read                                         | 10 |
> | backtrace:SyS_read                                         | 10 |
> | BUG:spinlock_lockup_suspected_on_CPU                       | 1  |
> | WARNING:CPU:PID:at_kernel/softirq.c:__local_bh_enable_ip() | 9  |
> | backtrace:do_mount                                         | 9  |
> | backtrace:SyS_mount                                        | 9  |
> | inconsistent_SOFTIRQ-ON-W-IN-SOFTIRQ-W_usage               | 7  |
> | backtrace:floppy_work_workfn                               | 7  |
> +------------------------------------------------------------+----+
>
> [   91.904407] [ INFO: inconsistent lock state ]
> [   91.905533] 3.14.0-rc3-wl-01773-g311db85 #11 Tainted: G        W
> [   91.906880] ---------------------------------
> [   91.908013] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [   91.909378] blkid/3517 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   91.910360]  (&(&q->__queue_lock)->rlock){?.-...}, at: [<c143f35a>] blk_flush_plug_list+0x15a/0x240
> [   91.910360] {IN-HARDIRQ-W} state was registered at:
> [   91.910360]   [<c106235a>] __lock_acquire+0xb5a/0x1aa0
> [   91.910360]   [<c10639c7>] lock_acquire+0x77/0xa0
> [   91.910360]   [<c200b83d>] _raw_spin_lock_irqsave+0x4d/0x60
> [   91.910360]   [<c143e9fb>] blk_end_bidi_request+0x3b/0x60
> [   91.910360]   [<c143eab2>] blk_end_request+0x12/0x20
> [   91.910360]   [<c16382c7>] ide_end_rq+0x27/0x50
> [   91.910360]   [<c1638467>] ide_complete_rq+0x27/0x50
> [   91.910360]   [<c164ff75>] cdrom_newpc_intr+0x315/0x8e0
> [   91.910360]   [<c163873a>] ide_intr+0x15a/0x1f0
> [   91.910360]   [<c106a270>] handle_irq_event_percpu+0x20/0x120
> [   91.910360]   [<c106a39c>] handle_irq_event+0x2c/0x50
> [   91.910360]   [<c106c376>] handle_edge_irq+0x66/0x120
> [   91.910360] irq event stamp: 2248
> [   91.910360] hardirqs last  enabled at (2245): [<c200b9d2>] _raw_spin_unlock_irq+0x22/0x50
> [   91.910360] hardirqs last disabled at (2246): [<c143f2f7>] blk_flush_plug_list+0xf7/0x240
> [   91.910360] softirqs last  enabled at (2248): [<c1456ea5>] __cfq_set_active_queue+0x45/0x1d0
> [   91.910360] softirqs last disabled at (2247): [<c1456ea5>] __cfq_set_active_queue+0x45/0x1d0
> [   91.910360]
> [   91.910360] other info that might help us debug this:
> [   91.910360]  Possible unsafe locking scenario:
> [   91.910360]
> [   91.910360]        CPU0
> [   91.910360]        ----
> [   91.910360]   lock(&(&q->__queue_lock)->rlock);
> [   91.910360]   <Interrupt>
> [   91.910360]     lock(&(&q->__queue_lock)->rlock);
> [   91.910360]
> [   91.910360]  *** DEADLOCK ***
> [   91.910360]
> [   91.910360] 1 lock held by blkid/3517:
> [   91.910360]  #0:  (&(&q->__queue_lock)->rlock){?.-...}, at: [<c143f35a>] blk_flush_plug_list+0x15a/0x240
> [   91.910360]
> [   91.910360] stack backtrace:
> [
Re: [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
Hi, Vivek,

I tested PATCH v4 with the same basic hierarchical setups as before, and I
got similar results.

Preparation
1) mount the blkio subsys with "__DEVEL__sane_behavior"
2) create 3 levels of directories under the blkio mount point:
   mkdir 1
   mkdir 1/2
   mkdir 1/2/3
   mkdir 4
3) start 4 bash sessions and write their PIDs into:
   1/cgroup.procs
   1/2/cgroup.procs
   1/2/3/cgroup.procs
   4/cgroup.procs
4) prepare 4 10MB files on sdb (ext4 fs)

Note: in the hierarchy graphs below:
"[50k]" means the configured value of read_bps_device is 50kB/s
"(50k)" means the bandwidth reported by dd is 50kB/s

Test A: 1 process throttled by an ancestor group
=================================================
Hierarchy set-up (echo "8:16 204800" > 1/blkio.throttle.read_bps_device):
|-- 1 [200k]
|   `-- 2 [-]
|       `-- 3 [-]
`-- 4 [-]

dd within group 3 (drop cache, then: dd if=10M-file-3 of=/dev/null)
Result: 206kB/s
(I ran the same test without the token-bucket patch; the result was 205kB/s)

dd within group 2 (drop cache, then: dd if=10M-file-2 of=/dev/null)
Result: 206kB/s

Test B: 4 processes in 3 levels of hierarchy
=============================================
Hierarchy set-up:
echo "8:16 204800" > 1/blkio.throttle.read_bps_device
echo "8:16 102400" > 1/2/blkio.throttle.read_bps_device
echo "8:16 51200"  > 1/2/3/blkio.throttle.read_bps_device
echo "8:16 51200"  > 4/blkio.throttle.read_bps_device

|-- 1 [200k]
|   `-- 2 [100k]
|       `-- 3 [50k]
`-- 4 [50k]

start 4 dd processes from the 4 bash sessions (dd if=10M-file-x of=/dev/null)
Result:
|-- 1 (104k)
|   `-- 2 (52.1k)
|       `-- 3 (51.3k)
`-- 4 (51.4k)

On Wed, Oct 23, 2013 at 5:02 AM, Vivek Goyal wrote:
> Hi Hong,
>
> This approach looks good in general. The only downside I can think of is
> the updating of nr_requests throughout the hierarchy. So the deeper the
> hierarchy, the higher the overhead.
>
> I am not sure if that's a concern or not. I will have a closer look
> at the patches tomorrow and do some testing too.
>
> Thanks
> Vivek

--
best regards
Hong Zhiguo
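The measured numbers above (each group's throughput tracking the tightest
limit on its path to the root, e.g. group 3 capped near 50k even though its
parents allow more) follow from requiring a bio to obtain tokens at every
level of the hierarchy. A hypothetical model, not the blk-throttle
implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* One token bucket per group in the hierarchy. */
struct tb {
    uint64_t tokens, rate, depth;
};

/*
 * path[0] is the leaf group, path[n-1] the root.  A bio is dispatched only
 * when every group on the path has tokens for it, so the effective rate is
 * the minimum limit along the path.
 */
static int admit_path(struct tb *path[], size_t n, uint64_t size)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (path[i]->tokens < size)
            return 0;               /* blocked at level i: bio must wait */
    for (i = 0; i < n; i++)
        path[i]->tokens -= size;    /* charge every level on dispatch */
    return 1;
}

/* One clock tick: every group earns `rate` tokens, capped at `depth`. */
static void tick_all(struct tb *gs[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        gs[i]->tokens += gs[i]->rate;
        if (gs[i]->tokens > gs[i]->depth)
            gs[i]->tokens = gs[i]->depth;
    }
}
```

Running this with limits in the 200/100/50 ratio of Test B reproduces the
shape of the results: the leaf's throughput equals the smallest limit on its
path.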
Re: [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
Hi, Vivek,

Sorry, I don't have time to test them carefully this week. I'll do it this
weekend.

The updating of nr_queued_tree, level by level, only happens once per bio.
We already do the computation (and maybe a queue operation) level by level
for every bio, even when it passes through without being queued on the tree.
And the computation (and queue operation) at each level is much bigger than
updating one counter (nr_queued_tree). So updating nr_queued_tree doesn't
introduce notable overhead compared to the existing overhead of blk-throttle.

Zhiguo

On Wed, Oct 23, 2013 at 5:02 AM, Vivek Goyal wrote:
> On Sun, Oct 20, 2013 at 08:11:12PM +0800, Hong Zhiguo wrote:
>> From: Hong Zhiguo
>>
>> Why
>>
>> Pointed out by Vivek: tokens generated during an idle period should
>> be trimmed. Otherwise a huge bio may be permitted immediately.
>> Overlimit behaviour may be observed during a short I/O throughput
>> test.
>>
>> Vivek also pointed out: we should not over-trim for hierarchical
>> groups. Suppose a subtree of groups is idle when a big bio comes.
>> The token of the child group is trimmed and not enough, so the bio is
>> queued on the child group. After some jiffies the child group has
>> reserved enough tokens and the bio climbs up. If we trim the parent
>> group again at this point, the bio will wait much longer than expected.
>>
>> Analysis
>>
>> When the bio is queued on the child group, all its ancestor groups
>> become non-idle. They should start to reserve tokens for that
>> bio from this moment, and the tokens they reserved before should be
>> trimmed at this moment.
>>
>> How
>>
>> service_queue now has a new member nr_queued_tree[2], representing
>> the number of bios waiting on the subtree rooted at this sq.
>>
>> When a bio is queued on the hierarchy for the first time, the
>> nr_queued_tree of all ancestors and of the child group itself is
>> increased. When a bio climbs up, the nr_queued_tree of the child
>> group is decreased.
>>
>> When nr_queued_tree turns from zero to one, the tokens reserved
>> before are trimmed. And after this switch, the group will never be
>> trimmed while it is reserving tokens for a bio waiting on one of its
>> descendant groups.
>
> Hi Hong,
>
> This approach looks good in general. The only downside I can think of is
> the updating of nr_requests throughout the hierarchy. So the deeper the
> hierarchy, the higher the overhead.
>
> I am not sure if that's a concern or not. I will have a closer look
> at the patches tomorrow and do some testing too.
>
> Thanks
> Vivek

--
best regards
Hong Zhiguo
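The trim-on-transition rule described in the quoted changelog can be modeled
compactly: a group's banked tokens are clamped to the burst depth only when
its nr_queued_tree flips from zero to one, and never while bios are still
waiting in its subtree. This is an illustrative model under those stated
assumptions, not the patch's actual code:

```c
#include <stdint.h>

/* Per-group state for the trim-on-idle-to-busy-transition rule. */
struct grp {
    uint64_t tokens;
    uint64_t burst;               /* cap, analogous to THROTL_BURST_BYTES */
    unsigned int nr_queued_tree;  /* bios waiting anywhere in this subtree */
};

/* A bio gets queued somewhere in this group's subtree. */
static void subtree_enqueue(struct grp *g)
{
    /* Trim only on the 0 -> 1 transition: drop tokens earned while idle. */
    if (g->nr_queued_tree++ == 0 && g->tokens > g->burst)
        g->tokens = g->burst;
}

/* A waiting bio climbs up out of this subtree. */
static void subtree_dequeue(struct grp *g)
{
    g->nr_queued_tree--;
}

/* Periodic refill: no clamp here, trimming happens only at the transition,
 * so a busy group keeps every token it reserves for its waiting bio. */
static void refill(struct grp *g, uint64_t amount)
{
    g->tokens += amount;
}
```

This captures the point of the patch: idle credit is discarded exactly once,
when the subtree becomes busy, so tokens reserved on behalf of an
already-queued bio are never trimmed away.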
Re: [PATCH v3] blk-throttle: simplify logic by token bucket algorithm
Hi, Vivek,

Thanks for your comments. I didn't realize the ancestor over-trim issue before.

Trimming of the iops token is not necessary, since a bio always costs _one_ iops token. Trimming it won't change whether the current iops token is zero or not.

For hierarchical trimming, as you pointed out, there are two issues in my v3 patch:
1) When a bio climbs up, the parent group is not trimmed.
2) When a bio climbs up, we should not trim groups level by level; that would block the bio much longer than expected.

The time-keeping method you proposed helps. But what about bios waiting on multiple child groups that climb up in a different order than they were queued on the child groups? Some tokens are still missed. And how does this work for multiple levels of ancestors?

I got another idea and implementation for these issues. The basic idea is: when a bio is queued on a child group, all its ancestor groups become non-idle. They should start to reserve tokens for that bio from this moment, and the tokens they reserved before should be trimmed at this moment.

I'll also send the implementation later. Please help to review the idea and the code. Thanks.

On Fri, Oct 18, 2013 at 11:55 PM, Vivek Goyal wrote:
> On Thu, Oct 17, 2013 at 08:17:52PM +0800, Hong Zhiguo wrote:
>
> I think the same logic needs to be applied to iops too?
>
> Also, how does it work in case of hierarchy? IIUC, when a bio is being
> propagated upwards, we first add it to the parent, and then later
> update the dispatch time, which will in turn call tg_with_in_bps_limit().
>
> So by the time you come here after bio propagation, the bio is already
> on the service queue and it will be entitled to *unlimited* idle tokens.
>
> On the flip side, we can't do blind truncation of idle tokens in the
> parent group. The reason is that once a bio gets queued in a child
> group, effectively it should be subjected to the parent's limits right
> now too, and not only when it climbs up the tree.
>
> To solve this problem I had put in the following patch.
>
> commit 32ee5bc4787dfbdb280b4d81a338dcdd55918c1e
> Author: Vivek Goyal
> Date: Tue May 14 13:52:38 2013 -0700
>
>     blk-throttle: Account for child group's start time in parent while
>     bio climbs
>
> The above solution was not perfect but seemed like a reasonable
> approximation. We might have to do something similar. I think we can
> keep track of the time when a bio starts waiting in a group (gets to
> the head of the list) and then, when that bio is dispatched, pass that
> time to the parent group. Now the parent can give more tokens to the
> bio based on the wait start time passed by the child.
>
> For example, say Twaitstart is the time a bio started waiting at the
> head of the queue of a group, and Tidle is the time when the parent
> became idle. When the child passes this bio to the parent, it will also
> pass Twaitstart. When it comes time for token calculation, the parent
> can do the following:
>
>     if (time_after(Twaitstart, Tidle))
>         start_time = Twaitstart;
>     else
>         start_time = Tidle;
>
>     token_eligibility = (jiffies - start_time) * rate;
>
> This is far from perfect, but it should work reasonably.
>
> Thanks
> Vivek

--
best regards
Hong Zhiguo
Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
Hi, Vivek,

Thanks for your elaboration. I got your points. I'll update the patch to have such logic. Do you think adding the logic below in tg_with_in_bps_limit is enough?

    if (!sq->nr_queued[rw]) {
        trim the token to bucket depth;
    }

Thanks

On Wed, Oct 16, 2013 at 10:14 PM, Vivek Goyal wrote:
> On Wed, Oct 16, 2013 at 02:09:40PM +0800, Hong zhi guo wrote:
>> Hi, Vivek,
>>
>> Thanks for your careful review. I'll rename t_c to last_dispatch, it's
>> much better.
>>
>> For the big burst issue, I have a different opinion. Let's discuss it.
>>
>> Any time, a big IO means a big burst. Even if it's throttled the first
>> time, queued in the service_queue, and then waits for a while, when
>> it's delivered out, it IS still a big burst. So throttling the bio for
>> another while makes no sense.
>
> If a malicious application is creating a big BIO and sending it out,
> then the effective IO rate as seen by the application will be much
> higher than the throttling limit.
>
> So yes, a burst is going to happen anyway when a big IO is dispatched
> to disk, but the question is when that burst should be allowed, and
> what effective IO rate the application should see.
>
>> If a group has been idle for 5 minutes, then it owns the credit to
>> deliver a big IO (within 300 * bps bytes). And the extra credit will
>> be cut off after the delivery.
>
> I think there are a couple of issues here.
>
> - First of all, if you think that a group is entitled to tokens even
>   when it is not doing IO, then why are you truncating the tokens
>   after dispatch of a BIO?
>
> - Second, in general it does not seem right that a group is entitled
>   to tokens even when no IO is happening or the group is not
>   backlogged. That would mean a group can do no IO for 10 hours and
>   then suddenly be entitled to those tokens after 10 hours, with a
>   huge burst.
>
> So I think you also agree that a group should not be entitled to
> tokens when the group is not backlogged, and that's why you seem to be
> truncating extra tokens after dispatch of a BIO. If that's the case,
> then even for the first BIO, ideally a group should not be given
> tokens for idle time.
>
> Just think that an application has a huge BIO, say of size 2MB, and
> the group has a limit of say 8KB per second. Now if the group has been
> idling long enough, this BIO will be dispatched immediately, and the
> effective rate the group gets is much higher than 8KB/s. Which is not
> right, IMO.
>
> If you argue that token entitlement for idle groups is not right, and
> that doing it for the first BIO in a batch is an exception for
> simplicity reasons, that still might be fine. But right now that does
> not seem to be the case.
>
> Thanks
> Vivek

--
best regards
Hong Zhiguo
Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
Hi, Vivek,

Thanks for your careful review. I'll rename t_c to last_dispatch, it's much better.

For the big burst issue, I have a different opinion. Let's discuss it.

Any time, a big IO means a big burst. Even if it's throttled the first time, queued in the service_queue, and then waits for a while, when it's delivered out, it IS still a big burst. So throttling the bio for another while makes no sense.

If a group has been idle for 5 minutes, then it owns the credit to deliver a big IO (within 300 * bps bytes). And the extra credit will be cut off after the delivery.

Waiting for your comments.

Zhiguo

On Wed, Oct 16, 2013 at 1:32 AM, Vivek Goyal wrote:
> On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote:
>
> [..]
>
> Hi Hong,
>
> Thanks for the token based throttling implementation. In general it
> looks good and it simplifies the logic. A couple of comments/concerns
> below.
>
>> @@ -133,14 +135,13 @@ struct throtl_grp {
>>      /* IOPS limits */
>>      unsigned int iops[2];
>>
>> -    /* Number of bytes disptached in current slice */
>> -    uint64_t bytes_disp[2];
>> -    /* Number of bio's dispatched in current slice */
>> -    unsigned int io_disp[2];
>> +    /* Token for throttling by bps */
>> +    uint64_t bytes_token[2];
>> +    /* Token for throttling by iops */
>> +    unsigned int io_token[2];
>>
>> -    /* When did we start a new slice */
>> -    unsigned long slice_start[2];
>> -    unsigned long slice_end[2];
>> +    /* Time check-point */
>> +    unsigned long t_c[2];
>
> Can we rename this variable to say last_dispatch[2]?
>
> [..]
>
>> @@ -852,41 +708,26 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>>                               unsigned long *wait)
>>  {
>>      bool rw = bio_data_dir(bio);
>> -    u64 bytes_allowed, extra_bytes, tmp;
>> -    unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
>> +    u64 extra_bytes, token;
>> +    unsigned long jiffy_wait;
>>
>> -    jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>> -
>> -    /* Slice has just started. Consider one slice interval */
>> -    if (!jiffy_elapsed)
>> -            jiffy_elapsed_rnd = throtl_slice;
>> -
>> -    jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
>> -
>> -    tmp = tg->bps[rw] * jiffy_elapsed_rnd;
>> -    do_div(tmp, HZ);
>> -    bytes_allowed = tmp;
>> +    token = (u64)tg->bps[rw] * (jiffies - tg->t_c[rw]);
>
> So here we seem to be calculating how many tokens a group is eligible
> for. This is done based on the time since the last check, and the last
> check time was updated at the last dispatch from the group.
>
> What happens if no IO happens in a group for some time, say for 5
> minutes, and then one big IO comes in? IIUC, we will allow a big burst
> of IO for the first IO (tokens worth of a full 5 minutes will be given
> to this group). I think we should have a mechanism where we don't allow
> too big a burst for the first IO. (Possibly limit the burst to
> THROTL_BURST_IO and THROTL_BURST_BYTES unless a bigger IO is already
> queued and we are waiting for it.)
>
> In the previous time-slice logic, I took care of it by extending the
> time slice by 100ms: if no IO happens in the group for 100ms, the time
> slice expires and the next IO gets at most 100ms worth of credit (the
> equivalent of burst limits).
>
> Thanks
> Vivek

--
best regards
Hong Zhiguo
Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
Hi, Tejun,

I did the test for a 3-level hierarchy. It works.

Preparation
===========
1) mount the blkio subsystem with "__DEVEL__sane_behavior"
2) create 3 levels of directories under the blkio mount point:
   mkdir 1
   mkdir 1/2
   mkdir 1/2/3
3) start 3 bash sessions and write their PIDs into:
   1/cgroup.procs
   1/2/cgroup.procs
   1/2/3/cgroup.procs
4) prepare 3 10MB files on sdb (ext4 fs)

Note: in the hierarchy graphs below:
"[50k]" means the configured value for read_bps_device is 50kB/s
"(50k)" means the bandwidth reported by dd is 50kB/s

Test A: 1 process throttled by an ancestor group
================================================
Hierarchy set-up (echo "8:16 204800" > 1/blkio.throttle.read_bps_device):

1 [200k]
`-- 2 [-]
    `-- 3 [-]

dd within group 3 (drop cache, then: dd if=10M-file-3 of=/dev/null)
Result: 206kB/s
(I did the same test without the token-bucket patch; the result is 205kB/s)

dd within group 2 (drop cache, then: dd if=10M-file-2 of=/dev/null)
Result: 205kB/s

Test B: 3 processes in 3 levels of hierarchy
============================================
Hierarchy set-up:
echo "8:16 204800" > 1/blkio.throttle.read_bps_device
echo "8:16 102400" > 1/2/blkio.throttle.read_bps_device
echo "8:16 51200" > 1/2/3/blkio.throttle.read_bps_device

1 [200k]
`-- 2 [100k]
    `-- 3 [50k]

start 3 dd processes from the 3 bash sessions (dd if=10M-file-x of=/dev/null)
Result:
1 (103k)
`-- 2 (51.9k)
    `-- 3 (51.4k)

--
best regards
Hong Zhiguo
Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
Hi, Tejun,

I've not tested a hierarchical setup yet. I'll do it tomorrow. BTW, what kind of setup do you expect? Is a hierarchy of 2 levels enough?

Zhiguo

On Mon, Oct 14, 2013 at 9:36 PM, Tejun Heo wrote:
> Hello,
>
> Yes, this definitely is the direction we wanna take it. I'll wait for
> Vivek to chime in, but have you also tested a hierarchical setup?
>
> Thanks.
>
> --
> tejun

--
best regards
Hong Zhiguo
Re: [PATCH 1/6] net-next: replace obsolete NLMSG_* with type safe nlmsg_*
Thanks, Thomas. But I didn't change any formatting, I just did the substitution in place. Should I re-format and re-send the patch?

On Thu, Mar 28, 2013 at 10:32 PM, Thomas Graf wrote:
> On 03/28/13 at 12:47am, Hong Zhiguo wrote:
>> diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
>> index 505b30a..467fb92 100644
>> --- a/net/ipv4/udp_diag.c
>> +++ b/net/ipv4/udp_diag.c
>> @@ -64,9 +64,9 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
>>              goto out;
>>
>>      err = -ENOMEM;
>> -    rep = alloc_skb(NLMSG_SPACE((sizeof(struct inet_diag_msg) +
>> -                                 sizeof(struct inet_diag_meminfo) +
>> -                                 64)), GFP_KERNEL);
>> +    rep = nlmsg_new(sizeof(struct inet_diag_msg) +
>> +                    sizeof(struct inet_diag_meminfo) + 64,
>> +                    GFP_KERNEL);
>
> This is formatted incorrectly, otherwise the patch looks good.

--
best regards
Hong Zhiguo
Re: [PATCH 5/6] scsi: replace obsolete NLMSG_* with type safe nlmsg_*
Thanks, Thomas. But I didn't change any formatting, I just did the substitution in place.

On Thu, Mar 28, 2013 at 10:45 PM, Thomas Graf wrote:
> On 03/28/13 at 12:53am, Hong Zhiguo wrote:
>> Signed-off-by: Hong Zhiguo
>
> There are some formatting errors but the Netlink bits themselves
> look good.

--
best regards
Hong Zhiguo