Re: [PATCH] workqueue: fix a workqueue kernel panic issue.
On Mon, Sep 22, 2014 at 03:40:55AM -0700, Yifan Zhang wrote: > You can tell it is a bug when pwq = get_work_pwq() return NULL, and > cpu_intensive = pwq->wq->flags use it w/o check. A bug somewhere else. > Normally get_work_pwq doesn't return NULL, but we had a bug in code > which makes INIT_WORK(, do_work) is called in multi-thread. In > some cases, work_struct is re-init just before get_work_pwq is > called, it makes work_struct->data is invalid and thus causes the > problem. It is indeed a bug of ourselves, and after fix it there is > no such issue. But I wonder we still a NULL check before dereference > pwq here anyway, since get_work_pwq may return NULL in some cases. Do you realize how timing dependent that particular pattern of breakage is? If you're doing INIT_WORK() in racy way, there are many places which can break in workqueue. It's not that different from random memory corruption. It doesn't make any sense at all to add a special case code for that in one particular place where this specific incidence happens to trigger. In general, don't do things like this anywhere in the kernel. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] workqueue: fix a workqueue kernel panic issue.
On Mon, Sep 22, 2014 at 03:40:55AM -0700, Yifan Zhang wrote: You can tell it is a bug when pwq = get_work_pwq() return NULL, and cpu_intensive = pwq-wq-flags use it w/o check. A bug somewhere else. Normally get_work_pwq doesn't return NULL, but we had a bug in code which makes INIT_WORK(work, do_work) is called in multi-thread. In some cases, work_struct is re-init just before get_work_pwq is called, it makes work_struct-data is invalid and thus causes the problem. It is indeed a bug of ourselves, and after fix it there is no such issue. But I wonder we still a NULL check before dereference pwq here anyway, since get_work_pwq may return NULL in some cases. Do you realize how timing dependent that particular pattern of breakage is? If you're doing INIT_WORK() in racy way, there are many places which can break in workqueue. It's not that different from random memory corruption. It doesn't make any sense at all to add a special case code for that in one particular place where this specific incidence happens to trigger. In general, don't do things like this anywhere in the kernel. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] workqueue: fix a workqueue kernel panic issue.
Hi Tejun, This issue is found when we got a kernel panic: [ 943.673177] c3 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 10ms [ 943.717242] c0 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 30ms [ 949.571415] c1 779 (Binder_3) B52ISP version: HW 3.36, SWM 1.4, revision 871 [ 949.579291] c1 779 (Binder_3) b52_sensor_set_fmt: mbus code not match [ 949.749224] c0 779 (Binder_3) b52_sensor_set_fmt: mbus code not match [ 949.803474] c0 7844 (kworker/0:2) Unable to handle kernel NULL pointer dereference at virtual address 0008 [ 949.813433] c0 7844 (kworker/0:2) pgd = ffc7d000 [ 949.818707] c0 7844 (kworker/0:2) [0008] *pgd=01214003, *pmd=01215003, *pte=00e0d4200407 [ 949.829002] c0 7844 (kworker/0:2) Internal error: Oops: 9607 [#1] PREEMPT SMP [ 949.836425] Modules linked in: tzdd galcore ssipcmisck(P) audiostub cidatattydev gs_modem ccinetdev cci_datastub citty iml_module seh cploaddev msocketk [ 949.850077] c0 7844 (kworker/0:2) CPU: 0 PID: 7844 Comm: kworker/0:2 Tainted: P 3.10.33 #1 [ 949.859318] c0 7844 (kworker/0:2) task: ffc015517500 ti: ffc027c48000 task.ti: ffc027c48000 [ 949.868641] c0 7844 (kworker/0:2) PC is at process_one_work+0x38/0x410 [ 949.875110] c0 7844 (kworker/0:2) LR is at worker_thread+0x13c/0x3c0 [ 949.881407] c0 7844 (kworker/0:2) pc : [] lr : [] pstate: 61c5 [ 949.890634] c0 7844 (kworker/0:2) sp : ffc027c4bd60 [ 949.895810] R29: ffc027c4bd60 R28: 0009 [ 949.901091] R27: ffc0008da108 R26: 0001 [ 949.906372] R25: ffc000aa2902 R24: [ 949.911653] R23: ffc027c48000 R22: ffc02b3b11b0 [ 949.916934] R21: ffc034da09c0 R20: ffc02b3b1180 [ 949.922215] R19: ffc000c05380 R18: [ 949.927497] R17: R16: ffc0001909a4 [ 949.932777] R15: R14: ca81 [ 949.938059] R13: f6f99a80 R12: 03d1 [ 949.943341] R11: 0004 R10: ffc000bef3a8 [ 949.948622] R9 : ffc027c4bbe0 R8 : ffc015517920 [ 949.953903] R7 : ffc000724d98 R6 : ffc000724d98 [ 949.959183] R5 : R4 : [ 949.964465] R3 : ffc034da0d40 R2 : ffc034da09c0 [ 949.969746] R1 : ffc000c05380 R0 : [ 949.975019] c0 7844 (kworker/0:2) [ 949.978390] c0 7844 (kworker/0:2) [ 949.978390] PC: ffcbbb54: You can tell it is a bug when pwq = get_work_pwq() return NULL, and cpu_intensive = pwq->wq->flags use it w/o check. Normally get_work_pwq doesn't return NULL, but we had a bug in code which makes INIT_WORK(, do_work) is called in multi-thread. In some cases, work_struct is re-init just before get_work_pwq is called, it makes work_struct->data is invalid and thus causes the problem. It is indeed a bug of ourselves, and after fix it there is no such issue. But I wonder we still a NULL check before dereference pwq here anyway, since get_work_pwq may return NULL in some cases. What do you think :-) ? BR, YIfan -Original Message- From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of Tejun Heo Sent: 2014年9月22日 11:40 To: Yifan Zhang Cc: Jing Xiang; linux-kernel@vger.kernel.org Subject: Re: [PATCH] workqueue: fix a workqueue kernel panic issue. On Sun, Sep 21, 2014 at 08:30:32PM -0700, Yifan Zhang wrote: > Hi Tejun, > > What's do you think of this patch ? Any concern ? Hmmm? Haven't I already responded to this patch? > -Original Message- > From: Yifan Zhang [mailto:zhan...@marvell.com] > Sent: 2014年9月17日 16:18 > To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org > Cc: Yifan Zhang > Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. > > if created workqueue in multi-thread unsynchronized, Can you please elaborate? > get_work_pwq() may return NULL, which cause kernel panic. Judge > get_work_pwq() return value before use > pwq->wq->flags. > > Signed-off-by: Yifan Zhang > --- > kernel/workqueue.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c index > 5dbe22a..d3ac87f 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1947,9 +1947,19 @@ __acquires(>lock) { > struct pool_workqueue *pwq = get_work_pwq(work); > struct worker_pool *pool = worker->pool; > - bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; > + bool cpu_intensive; > int work_color; > struct worker *collision; > + > + if (pwq == NULL) { > + pr_err("BUG: invalid struct work_struct.data %lu\n", > + atomic_long_read(>data)); > + dump_stack(); > + return; I have difficult time seeing how the above piece of code would be acceptable but maybe the situation you're trying to explain is weird enough to justify it. Thanks. -- tejun
RE: [PATCH] workqueue: fix a workqueue kernel panic issue.
Hi Tejun, This issue is found when we got a kernel panic: [ 943.673177] c3 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 10ms [ 943.717242] c0 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 30ms [ 949.571415] c1 779 (Binder_3) B52ISP version: HW 3.36, SWM 1.4, revision 871 [ 949.579291] c1 779 (Binder_3) b52_sensor_set_fmt: mbus code not match [ 949.749224] c0 779 (Binder_3) b52_sensor_set_fmt: mbus code not match [ 949.803474] c0 7844 (kworker/0:2) Unable to handle kernel NULL pointer dereference at virtual address 0008 [ 949.813433] c0 7844 (kworker/0:2) pgd = ffc7d000 [ 949.818707] c0 7844 (kworker/0:2) [0008] *pgd=01214003, *pmd=01215003, *pte=00e0d4200407 [ 949.829002] c0 7844 (kworker/0:2) Internal error: Oops: 9607 [#1] PREEMPT SMP [ 949.836425] Modules linked in: tzdd galcore ssipcmisck(P) audiostub cidatattydev gs_modem ccinetdev cci_datastub citty iml_module seh cploaddev msocketk [ 949.850077] c0 7844 (kworker/0:2) CPU: 0 PID: 7844 Comm: kworker/0:2 Tainted: P 3.10.33 #1 [ 949.859318] c0 7844 (kworker/0:2) task: ffc015517500 ti: ffc027c48000 task.ti: ffc027c48000 [ 949.868641] c0 7844 (kworker/0:2) PC is at process_one_work+0x38/0x410 [ 949.875110] c0 7844 (kworker/0:2) LR is at worker_thread+0x13c/0x3c0 [ 949.881407] c0 7844 (kworker/0:2) pc : [ffcbbbd4] lr : [ffcbcab8] pstate: 61c5 [ 949.890634] c0 7844 (kworker/0:2) sp : ffc027c4bd60 [ 949.895810] R29: ffc027c4bd60 R28: 0009 [ 949.901091] R27: ffc0008da108 R26: 0001 [ 949.906372] R25: ffc000aa2902 R24: [ 949.911653] R23: ffc027c48000 R22: ffc02b3b11b0 [ 949.916934] R21: ffc034da09c0 R20: ffc02b3b1180 [ 949.922215] R19: ffc000c05380 R18: [ 949.927497] R17: R16: ffc0001909a4 [ 949.932777] R15: R14: ca81 [ 949.938059] R13: f6f99a80 R12: 03d1 [ 949.943341] R11: 0004 R10: ffc000bef3a8 [ 949.948622] R9 : ffc027c4bbe0 R8 : ffc015517920 [ 949.953903] R7 : ffc000724d98 R6 : ffc000724d98 [ 949.959183] R5 : R4 : [ 949.964465] R3 : ffc034da0d40 R2 : ffc034da09c0 [ 949.969746] R1 : ffc000c05380 R0 : [ 949.975019] c0 7844 (kworker/0:2) [ 949.978390] c0 7844 (kworker/0:2) [ 949.978390] PC: ffcbbb54: You can tell it is a bug when pwq = get_work_pwq() return NULL, and cpu_intensive = pwq-wq-flags use it w/o check. Normally get_work_pwq doesn't return NULL, but we had a bug in code which makes INIT_WORK(work, do_work) is called in multi-thread. In some cases, work_struct is re-init just before get_work_pwq is called, it makes work_struct-data is invalid and thus causes the problem. It is indeed a bug of ourselves, and after fix it there is no such issue. But I wonder we still a NULL check before dereference pwq here anyway, since get_work_pwq may return NULL in some cases. What do you think :-) ? BR, YIfan -Original Message- From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of Tejun Heo Sent: 2014年9月22日 11:40 To: Yifan Zhang Cc: Jing Xiang; linux-kernel@vger.kernel.org Subject: Re: [PATCH] workqueue: fix a workqueue kernel panic issue. On Sun, Sep 21, 2014 at 08:30:32PM -0700, Yifan Zhang wrote: Hi Tejun, What's do you think of this patch ? Any concern ? Hmmm? Haven't I already responded to this patch? -Original Message- From: Yifan Zhang [mailto:zhan...@marvell.com] Sent: 2014年9月17日 16:18 To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org Cc: Yifan Zhang Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. if created workqueue in multi-thread unsynchronized, Can you please elaborate? get_work_pwq() may return NULL, which cause kernel panic. Judge get_work_pwq() return value before use pwq-wq-flags. Signed-off-by: Yifan Zhang zhan...@marvell.com --- kernel/workqueue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5dbe22a..d3ac87f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1947,9 +1947,19 @@ __acquires(pool-lock) { struct pool_workqueue *pwq = get_work_pwq(work); struct worker_pool *pool = worker-pool; - bool cpu_intensive = pwq-wq-flags WQ_CPU_INTENSIVE; + bool cpu_intensive; int work_color; struct worker *collision; + + if (pwq == NULL) { + pr_err(BUG: invalid struct work_struct.data %lu\n, + atomic_long_read(work-data)); + dump_stack(); + return; I have difficult time seeing how the above piece of code would be acceptable but maybe the situation you're trying to explain is weird enough to justify it. Thanks
Re: [PATCH] workqueue: fix a workqueue kernel panic issue.
On Sun, Sep 21, 2014 at 08:30:32PM -0700, Yifan Zhang wrote: > Hi Tejun, > > What's do you think of this patch ? Any concern ? Hmmm? Haven't I already responded to this patch? > -Original Message- > From: Yifan Zhang [mailto:zhan...@marvell.com] > Sent: 2014年9月17日 16:18 > To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org > Cc: Yifan Zhang > Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. > > if created workqueue in multi-thread unsynchronized, Can you please elaborate? > get_work_pwq() may return NULL, which cause kernel panic. Judge > get_work_pwq() return value before use > pwq->wq->flags. > > Signed-off-by: Yifan Zhang > --- > kernel/workqueue.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5dbe22a..d3ac87f > 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1947,9 +1947,19 @@ __acquires(>lock) { > struct pool_workqueue *pwq = get_work_pwq(work); > struct worker_pool *pool = worker->pool; > - bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; > + bool cpu_intensive; > int work_color; > struct worker *collision; > + > + if (pwq == NULL) { > + pr_err("BUG: invalid struct work_struct.data %lu\n", > + atomic_long_read(>data)); > + dump_stack(); > + return; I have difficult time seeing how the above piece of code would be acceptable but maybe the situation you're trying to explain is weird enough to justify it. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] workqueue: fix a workqueue kernel panic issue.
Hi Tejun, What's do you think of this patch ? Any concern ? BR, Yifan -Original Message- From: Yifan Zhang [mailto:zhan...@marvell.com] Sent: 2014年9月17日 16:18 To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org Cc: Yifan Zhang Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. if created workqueue in multi-thread unsynchronized, get_work_pwq() may return NULL, which cause kernel panic. Judge get_work_pwq() return value before use pwq->wq->flags. Signed-off-by: Yifan Zhang --- kernel/workqueue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5dbe22a..d3ac87f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1947,9 +1947,19 @@ __acquires(>lock) { struct pool_workqueue *pwq = get_work_pwq(work); struct worker_pool *pool = worker->pool; - bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; + bool cpu_intensive; int work_color; struct worker *collision; + + if (pwq == NULL) { + pr_err("BUG: invalid struct work_struct.data %lu\n", + atomic_long_read(>data)); + dump_stack(); + return; + } + + cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; + #ifdef CONFIG_LOCKDEP /* * It is permissible to free the struct work_struct from -- 1.7.9.5 N�Р骒r��yb�X�肚�v�^�)藓{.n�+�伐�{��赙zXФ�≤�}��财�z�:+v�����赙zZ+��+zf"�h���~i���z��wア�?�ㄨ��&�)撷f��^j谦y�m��@A�a囤� 0鹅h���i
RE: [PATCH] workqueue: fix a workqueue kernel panic issue.
Hi Tejun, What's do you think of this patch ? Any concern ? BR, Yifan -Original Message- From: Yifan Zhang [mailto:zhan...@marvell.com] Sent: 2014年9月17日 16:18 To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org Cc: Yifan Zhang Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. if created workqueue in multi-thread unsynchronized, get_work_pwq() may return NULL, which cause kernel panic. Judge get_work_pwq() return value before use pwq-wq-flags. Signed-off-by: Yifan Zhang zhan...@marvell.com --- kernel/workqueue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5dbe22a..d3ac87f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1947,9 +1947,19 @@ __acquires(pool-lock) { struct pool_workqueue *pwq = get_work_pwq(work); struct worker_pool *pool = worker-pool; - bool cpu_intensive = pwq-wq-flags WQ_CPU_INTENSIVE; + bool cpu_intensive; int work_color; struct worker *collision; + + if (pwq == NULL) { + pr_err(BUG: invalid struct work_struct.data %lu\n, + atomic_long_read(work-data)); + dump_stack(); + return; + } + + cpu_intensive = pwq-wq-flags WQ_CPU_INTENSIVE; + #ifdef CONFIG_LOCKDEP /* * It is permissible to free the struct work_struct from -- 1.7.9.5 N�Р骒r��yb�X�肚�v�^�)藓{.n�+�伐�{��赙zXФ�≤�}��财�z�j:+v�����赙zZ+��+zf"�h���~i���z��wア�?�ㄨ���)撷f��^j谦y�m��@A�a囤� 0鹅h���i
Re: [PATCH] workqueue: fix a workqueue kernel panic issue.
On Sun, Sep 21, 2014 at 08:30:32PM -0700, Yifan Zhang wrote: Hi Tejun, What's do you think of this patch ? Any concern ? Hmmm? Haven't I already responded to this patch? -Original Message- From: Yifan Zhang [mailto:zhan...@marvell.com] Sent: 2014年9月17日 16:18 To: Tejun Heo; Jing Xiang; linux-kernel@vger.kernel.org Cc: Yifan Zhang Subject: [PATCH] workqueue: fix a workqueue kernel panic issue. if created workqueue in multi-thread unsynchronized, Can you please elaborate? get_work_pwq() may return NULL, which cause kernel panic. Judge get_work_pwq() return value before use pwq-wq-flags. Signed-off-by: Yifan Zhang zhan...@marvell.com --- kernel/workqueue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5dbe22a..d3ac87f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1947,9 +1947,19 @@ __acquires(pool-lock) { struct pool_workqueue *pwq = get_work_pwq(work); struct worker_pool *pool = worker-pool; - bool cpu_intensive = pwq-wq-flags WQ_CPU_INTENSIVE; + bool cpu_intensive; int work_color; struct worker *collision; + + if (pwq == NULL) { + pr_err(BUG: invalid struct work_struct.data %lu\n, + atomic_long_read(work-data)); + dump_stack(); + return; I have difficult time seeing how the above piece of code would be acceptable but maybe the situation you're trying to explain is weird enough to justify it. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/