~RE: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-30 Thread Jialing Fu
> [...]
>
 Hi, ulf

 We find this bug on Intel-C3230RK platform for very small probability.

 Whereas I can easily reproduce this case if I add a mdelay(1) or 
 longer delay as Jialing did.

 This patch seems useful to me. Should we push it forward? :)
>>>
>>>
>>> It seems like a very good idea!
>>>
>>> Should we add a fixes tag to it?
>>
>>
>> That's cool, but how to add a fixes tag?
>>
>> [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)
>>
>
> A fixes tag points to an old commit which introduced the bug. If we can't 
> find one, we can add a Cc tag to "stable". Just search the git log and you 
> will find examples.
>
> Like add one line as below?
> Fixes: 2220eedfd7ae ("mmc: fix async request mechanism for sequential 
> read scenarios")
>

That's it, Jialing. From my git blame, seems this bug has been introduced for a 
long time, but I feel strange that no one had captured it before you did.
[Jialing Fu] Shawn,
Yes, this bug is very hard to duplicate in my experiment.
But it happens indeed, I had suffered this bug about 2 years before I fixed it.
Totally I got bug reports 3 times and about 3~4 Ramdump files.
At first, I failed to get useful clue and even through it was DDR stability 
issue.

Below if my analysis:
As what I had commented in the fix patch,  only the below "LineB" still gets 
"wait" from "mrq" as the compiler's result, then this bug may be triggered.
If the compiler has some optimism which lineB doesn't need to fetch "wait" from 
"mrq" again, the issue can't happen.
  static void mmc_wait_data_done(struct mmc_request *mrq)
  {
 mrq->host->context_info.is_done_rcv = true;//LineA
 // If below line still gets host from "mrq" as the result of
 // compiler, the panic happens as we traced.
 wake_up_interruptible(>host->context_info.wait); //LineB
  }
Also, I suspect the bug may be triggered if "IRQ" or "ICache line missing" just 
happen between LineA and LineB. 
Especial the "Icache missing" case, it is easier to happens than IRQ.
I disassemble my code, and find LineA and LineB's assemble codes are located in 
two different cache line in my fail case. If you are interesting, you can check 
your assemble code too. 


Anyway, I will add a fixes tag and send v2 ASAP. :)

>
> Kind regards
> Uffe
>


--
Best Regards
Shawn Lin

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

~RE: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-30 Thread Jialing Fu
 [...]

 Hi, ulf

 We find this bug on Intel-C3230RK platform for very small probability.

 Whereas I can easily reproduce this case if I add a mdelay(1) or 
 longer delay as Jialing did.

 This patch seems useful to me. Should we push it forward? :)


 It seems like a very good idea!

 Should we add a fixes tag to it?


 That's cool, but how to add a fixes tag?

 [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)


 A fixes tag points to an old commit which introduced the bug. If we can't 
 find one, we can add a Cc tag to stable. Just search the git log and you 
 will find examples.

 Like add one line as below?
 Fixes: 2220eedfd7ae (mmc: fix async request mechanism for sequential 
 read scenarios)


That's it, Jialing. From my git blame, seems this bug has been introduced for a 
long time, but I feel strange that no one had captured it before you did.
[Jialing Fu] Shawn,
Yes, this bug is very hard to duplicate in my experiment.
But it happens indeed, I had suffered this bug about 2 years before I fixed it.
Totally I got bug reports 3 times and about 3~4 Ramdump files.
At first, I failed to get useful clue and even through it was DDR stability 
issue.

Below if my analysis:
As what I had commented in the fix patch,  only the below LineB still gets 
wait from mrq as the compiler's result, then this bug may be triggered.
If the compiler has some optimism which lineB doesn't need to fetch wait from 
mrq again, the issue can't happen.
  static void mmc_wait_data_done(struct mmc_request *mrq)
  {
 mrq-host-context_info.is_done_rcv = true;//LineA
 // If below line still gets host from mrq as the result of
 // compiler, the panic happens as we traced.
 wake_up_interruptible(mrq-host-context_info.wait); //LineB
  }
Also, I suspect the bug may be triggered if IRQ or ICache line missing just 
happen between LineA and LineB. 
Especial the Icache missing case, it is easier to happens than IRQ.
I disassemble my code, and find LineA and LineB's assemble codes are located in 
two different cache line in my fail case. If you are interesting, you can check 
your assemble code too. 


Anyway, I will add a fixes tag and send v2 ASAP. :)


 Kind regards
 Uffe



--
Best Regards
Shawn Lin

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Shawn Lin

在 2015/8/28 18:22, Jialing Fu 写道:


[...]


Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or
longer delay as Jialing did.

This patch seems useful to me. Should we push it forward? :)



It seems like a very good idea!

Should we add a fixes tag to it?



That's cool, but how to add a fixes tag?

[Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)



A fixes tag points to an old commit which introduced the bug. If we can't find one, we 
can add a Cc tag to "stable". Just search the git log and you will find 
examples.

Like add one line as below?
Fixes: 2220eedfd7ae ("mmc: fix async request mechanism for sequential read 
scenarios")



That's it, Jialing. From my git blame, seems this bug has been 
introduced for a long time, but I feel strange that no one had captured 
it before you did.


Anyway, I will add a fixes tag and send v2 ASAP. :)



Kind regards
Uffe




--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Jialing Fu

[...]

>>> Hi, ulf
>>>
>>> We find this bug on Intel-C3230RK platform for very small probability.
>>>
>>> Whereas I can easily reproduce this case if I add a mdelay(1) or  
>>> longer delay as Jialing did.
>>>
>>> This patch seems useful to me. Should we push it forward? :)
>>
>>
>> It seems like a very good idea!
>>
>> Should we add a fixes tag to it?
>
>
> That's cool, but how to add a fixes tag?
>
> [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)
>

A fixes tag points to an old commit which introduced the bug. If we can't find 
one, we can add a Cc tag to "stable". Just search the git log and you will find 
examples.

Like add one line as below?
Fixes: 2220eedfd7ae ("mmc: fix async request mechanism for sequential read 
scenarios")


Kind regards
Uffe
N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Ulf Hansson
[...]

>>> Hi, ulf
>>>
>>> We find this bug on Intel-C3230RK platform for very small probability.
>>>
>>> Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
>>> delay as Jialing did.
>>>
>>> This patch seems useful to me. Should we push it forward? :)
>>
>>
>> It seems like a very good idea!
>>
>> Should we add a fixes tag to it?
>
>
> That's cool, but how to add a fixes tag?
>
> [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)
>

A fixes tag points to an old commit which introduced the bug. If we
can't find one, we can add a Cc tag to "stable". Just search the git
log and you will find examples.

Kind regards
Uffe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Shawn Lin

在 2015/8/28 16:55, Ulf Hansson 写道:

On 28 August 2015 at 05:25, Shawn Lin  wrote:

On 2015/8/28 11:13, Shawn Lin wrote:


From: Jialing Fu 

The following panic is captured in ker3.14, but the issue still exists
in latest kernel.
-
[   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer
dereference
at virtual address 0578
..
[   20.738499] c0 3136 (Compiler) PC is at
_raw_spin_lock_irqsave+0x24/0x60
[   20.738527] c0 3136 (Compiler) LR is at
_raw_spin_lock_irqsave+0x20/0x60
[   20.740134] c0 3136 (Compiler) Call trace:
[   20.740165] c0 3136 (Compiler) []
_raw_spin_lock_irqsave+0x24/0x60
[   20.740200] c0 3136 (Compiler) [] __wake_up+0x1c/0x54
[   20.740230] c0 3136 (Compiler) []
mmc_wait_data_done+0x28/0x34
[   20.740262] c0 3136 (Compiler) []
mmc_request_done+0xa4/0x220
[   20.740314] c0 3136 (Compiler) []
sdhci_tasklet_finish+0xac/0x264
[   20.740352] c0 3136 (Compiler) []
tasklet_action+0xa0/0x158
[   20.740382] c0 3136 (Compiler) []
__do_softirq+0x10c/0x2e4
[   20.740411] c0 3136 (Compiler) [] irq_exit+0x8c/0xc0
[   20.740439] c0 3136 (Compiler) []
handle_IRQ+0x48/0xac
[   20.740469] c0 3136 (Compiler) []
gic_handle_irq+0x38/0x7c
--
Because in SMP, "mrq" has race condition between below two paths:
path1: CPU0: 
static void mmc_wait_data_done(struct mmc_request *mrq)
{
   mrq->host->context_info.is_done_rcv = true;
   //
   // If CPU0 has just finished "is_done_rcv = true" in path1, and at
   // this moment, IRQ or ICache line missing happens in CPU0.
   // What happens in CPU1 (path2)?
   //
   // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
   // path2 would have chance to break from wait_event_interruptible
   // in mmc_wait_for_data_req_done and continue to run for next
   // mmc_request (mmc_blk_rw_rq_prep).
   //
   // Within mmc_blk_rq_prep, mrq is cleared to 0.
   // If below line still gets host from "mrq" as the result of
   // compiler, the panic happens as we traced.
   wake_up_interruptible(>host->context_info.wait);
}

path2: CPU1: 
static int mmc_wait_for_data_req_done(...
{
   ...
   while (1) {
 wait_event_interruptible(context_info->wait,
 (context_info->is_done_rcv ||
  context_info->is_new_req));
static void mmc_blk_rw_rq_prep(...
 {
 ...
 memset(brq, 0, sizeof(struct mmc_blk_request));

This issue happens very coincidentally; however adding mdelay(1) in
mmc_wait_data_done as below could duplicate it easily.

 static void mmc_wait_data_done(struct mmc_request *mrq)
 {
   mrq->host->context_info.is_done_rcv = true;
+mdelay(1);
   wake_up_interruptible(>host->context_info.wait);
  }



Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
delay as Jialing did.

This patch seems useful to me. Should we push it forward? :)


It seems like a very good idea!

Should we add a fixes tag to it?


That's cool, but how to add a fixes tag?

[Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)



[...]

Kind regards
Uffe






--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Ulf Hansson
On 28 August 2015 at 05:25, Shawn Lin  wrote:
> On 2015/8/28 11:13, Shawn Lin wrote:
>>
>> From: Jialing Fu 
>>
>> The following panic is captured in ker3.14, but the issue still exists
>> in latest kernel.
>> -
>> [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer
>> dereference
>> at virtual address 0578
>> ..
>> [   20.738499] c0 3136 (Compiler) PC is at
>> _raw_spin_lock_irqsave+0x24/0x60
>> [   20.738527] c0 3136 (Compiler) LR is at
>> _raw_spin_lock_irqsave+0x20/0x60
>> [   20.740134] c0 3136 (Compiler) Call trace:
>> [   20.740165] c0 3136 (Compiler) []
>> _raw_spin_lock_irqsave+0x24/0x60
>> [   20.740200] c0 3136 (Compiler) [] __wake_up+0x1c/0x54
>> [   20.740230] c0 3136 (Compiler) []
>> mmc_wait_data_done+0x28/0x34
>> [   20.740262] c0 3136 (Compiler) []
>> mmc_request_done+0xa4/0x220
>> [   20.740314] c0 3136 (Compiler) []
>> sdhci_tasklet_finish+0xac/0x264
>> [   20.740352] c0 3136 (Compiler) []
>> tasklet_action+0xa0/0x158
>> [   20.740382] c0 3136 (Compiler) []
>> __do_softirq+0x10c/0x2e4
>> [   20.740411] c0 3136 (Compiler) [] irq_exit+0x8c/0xc0
>> [   20.740439] c0 3136 (Compiler) []
>> handle_IRQ+0x48/0xac
>> [   20.740469] c0 3136 (Compiler) []
>> gic_handle_irq+0x38/0x7c
>> --
>> Because in SMP, "mrq" has race condition between below two paths:
>> path1: CPU0: 
>>static void mmc_wait_data_done(struct mmc_request *mrq)
>>{
>>   mrq->host->context_info.is_done_rcv = true;
>>   //
>>   // If CPU0 has just finished "is_done_rcv = true" in path1, and at
>>   // this moment, IRQ or ICache line missing happens in CPU0.
>>   // What happens in CPU1 (path2)?
>>   //
>>   // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
>>   // path2 would have chance to break from wait_event_interruptible
>>   // in mmc_wait_for_data_req_done and continue to run for next
>>   // mmc_request (mmc_blk_rw_rq_prep).
>>   //
>>   // Within mmc_blk_rq_prep, mrq is cleared to 0.
>>   // If below line still gets host from "mrq" as the result of
>>   // compiler, the panic happens as we traced.
>>   wake_up_interruptible(>host->context_info.wait);
>>}
>>
>> path2: CPU1: 
>>static int mmc_wait_for_data_req_done(...
>>{
>>   ...
>>   while (1) {
>> wait_event_interruptible(context_info->wait,
>> (context_info->is_done_rcv ||
>>  context_info->is_new_req));
>>static void mmc_blk_rw_rq_prep(...
>> {
>> ...
>> memset(brq, 0, sizeof(struct mmc_blk_request));
>>
>> This issue happens very coincidentally; however adding mdelay(1) in
>> mmc_wait_data_done as below could duplicate it easily.
>>
>> static void mmc_wait_data_done(struct mmc_request *mrq)
>> {
>>   mrq->host->context_info.is_done_rcv = true;
>> +mdelay(1);
>>   wake_up_interruptible(>host->context_info.wait);
>>  }
>>
>
> Hi, ulf
>
> We find this bug on Intel-C3230RK platform for very small probability.
>
> Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
> delay as Jialing did.
>
> This patch seems useful to me. Should we push it forward? :)

It seems like a very good idea!

Should we add a fixes tag to it?

[...]

Kind regards
Uffe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Jialing Fu

[...]

 Hi, ulf

 We find this bug on Intel-C3230RK platform for very small probability.

 Whereas I can easily reproduce this case if I add a mdelay(1) or  
 longer delay as Jialing did.

 This patch seems useful to me. Should we push it forward? :)


 It seems like a very good idea!

 Should we add a fixes tag to it?


 That's cool, but how to add a fixes tag?

 [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)


A fixes tag points to an old commit which introduced the bug. If we can't find 
one, we can add a Cc tag to stable. Just search the git log and you will find 
examples.

Like add one line as below?
Fixes: 2220eedfd7ae (mmc: fix async request mechanism for sequential read 
scenarios)


Kind regards
Uffe
N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Shawn Lin

在 2015/8/28 16:55, Ulf Hansson 写道:

On 28 August 2015 at 05:25, Shawn Lin shawn@rock-chips.com wrote:

On 2015/8/28 11:13, Shawn Lin wrote:


From: Jialing Fu j...@marvell.com

The following panic is captured in ker3.14, but the issue still exists
in latest kernel.
-
[   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer
dereference
at virtual address 0578
..
[   20.738499] c0 3136 (Compiler) PC is at
_raw_spin_lock_irqsave+0x24/0x60
[   20.738527] c0 3136 (Compiler) LR is at
_raw_spin_lock_irqsave+0x20/0x60
[   20.740134] c0 3136 (Compiler) Call trace:
[   20.740165] c0 3136 (Compiler) [ffc0008ee900]
_raw_spin_lock_irqsave+0x24/0x60
[   20.740200] c0 3136 (Compiler) [ffcdd024] __wake_up+0x1c/0x54
[   20.740230] c0 3136 (Compiler) [ffc000639414]
mmc_wait_data_done+0x28/0x34
[   20.740262] c0 3136 (Compiler) [ffc0006391a0]
mmc_request_done+0xa4/0x220
[   20.740314] c0 3136 (Compiler) [ffc000656894]
sdhci_tasklet_finish+0xac/0x264
[   20.740352] c0 3136 (Compiler) [ffca2b58]
tasklet_action+0xa0/0x158
[   20.740382] c0 3136 (Compiler) [ffca2078]
__do_softirq+0x10c/0x2e4
[   20.740411] c0 3136 (Compiler) [ffca24bc] irq_exit+0x8c/0xc0
[   20.740439] c0 3136 (Compiler) [ffc8489c]
handle_IRQ+0x48/0xac
[   20.740469] c0 3136 (Compiler) [ffc81428]
gic_handle_irq+0x38/0x7c
--
Because in SMP, mrq has race condition between below two paths:
path1: CPU0: tasklet context
static void mmc_wait_data_done(struct mmc_request *mrq)
{
   mrq-host-context_info.is_done_rcv = true;
   //
   // If CPU0 has just finished is_done_rcv = true in path1, and at
   // this moment, IRQ or ICache line missing happens in CPU0.
   // What happens in CPU1 (path2)?
   //
   // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
   // path2 would have chance to break from wait_event_interruptible
   // in mmc_wait_for_data_req_done and continue to run for next
   // mmc_request (mmc_blk_rw_rq_prep).
   //
   // Within mmc_blk_rq_prep, mrq is cleared to 0.
   // If below line still gets host from mrq as the result of
   // compiler, the panic happens as we traced.
   wake_up_interruptible(mrq-host-context_info.wait);
}

path2: CPU1: The mmcqd thread runs mmc_queue_thread
static int mmc_wait_for_data_req_done(...
{
   ...
   while (1) {
 wait_event_interruptible(context_info-wait,
 (context_info-is_done_rcv ||
  context_info-is_new_req));
static void mmc_blk_rw_rq_prep(...
 {
 ...
 memset(brq, 0, sizeof(struct mmc_blk_request));

This issue happens very coincidentally; however adding mdelay(1) in
mmc_wait_data_done as below could duplicate it easily.

 static void mmc_wait_data_done(struct mmc_request *mrq)
 {
   mrq-host-context_info.is_done_rcv = true;
+mdelay(1);
   wake_up_interruptible(mrq-host-context_info.wait);
  }



Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
delay as Jialing did.

This patch seems useful to me. Should we push it forward? :)


It seems like a very good idea!

Should we add a fixes tag to it?


That's cool, but how to add a fixes tag?

[Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)



[...]

Kind regards
Uffe






--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Ulf Hansson
[...]

 Hi, ulf

 We find this bug on Intel-C3230RK platform for very small probability.

 Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
 delay as Jialing did.

 This patch seems useful to me. Should we push it forward? :)


 It seems like a very good idea!

 Should we add a fixes tag to it?


 That's cool, but how to add a fixes tag?

 [Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)


A fixes tag points to an old commit which introduced the bug. If we
can't find one, we can add a Cc tag to stable. Just search the git
log and you will find examples.

Kind regards
Uffe
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Ulf Hansson
On 28 August 2015 at 05:25, Shawn Lin shawn@rock-chips.com wrote:
 On 2015/8/28 11:13, Shawn Lin wrote:

 From: Jialing Fu j...@marvell.com

 The following panic is captured in ker3.14, but the issue still exists
 in latest kernel.
 -
 [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer
 dereference
 at virtual address 0578
 ..
 [   20.738499] c0 3136 (Compiler) PC is at
 _raw_spin_lock_irqsave+0x24/0x60
 [   20.738527] c0 3136 (Compiler) LR is at
 _raw_spin_lock_irqsave+0x20/0x60
 [   20.740134] c0 3136 (Compiler) Call trace:
 [   20.740165] c0 3136 (Compiler) [ffc0008ee900]
 _raw_spin_lock_irqsave+0x24/0x60
 [   20.740200] c0 3136 (Compiler) [ffcdd024] __wake_up+0x1c/0x54
 [   20.740230] c0 3136 (Compiler) [ffc000639414]
 mmc_wait_data_done+0x28/0x34
 [   20.740262] c0 3136 (Compiler) [ffc0006391a0]
 mmc_request_done+0xa4/0x220
 [   20.740314] c0 3136 (Compiler) [ffc000656894]
 sdhci_tasklet_finish+0xac/0x264
 [   20.740352] c0 3136 (Compiler) [ffca2b58]
 tasklet_action+0xa0/0x158
 [   20.740382] c0 3136 (Compiler) [ffca2078]
 __do_softirq+0x10c/0x2e4
 [   20.740411] c0 3136 (Compiler) [ffca24bc] irq_exit+0x8c/0xc0
 [   20.740439] c0 3136 (Compiler) [ffc8489c]
 handle_IRQ+0x48/0xac
 [   20.740469] c0 3136 (Compiler) [ffc81428]
 gic_handle_irq+0x38/0x7c
 --
 Because in SMP, mrq has race condition between below two paths:
 path1: CPU0: tasklet context
static void mmc_wait_data_done(struct mmc_request *mrq)
{
   mrq-host-context_info.is_done_rcv = true;
   //
   // If CPU0 has just finished is_done_rcv = true in path1, and at
   // this moment, IRQ or ICache line missing happens in CPU0.
   // What happens in CPU1 (path2)?
   //
   // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
   // path2 would have chance to break from wait_event_interruptible
   // in mmc_wait_for_data_req_done and continue to run for next
   // mmc_request (mmc_blk_rw_rq_prep).
   //
   // Within mmc_blk_rq_prep, mrq is cleared to 0.
   // If below line still gets host from mrq as the result of
   // compiler, the panic happens as we traced.
   wake_up_interruptible(mrq-host-context_info.wait);
}

 path2: CPU1: The mmcqd thread runs mmc_queue_thread
static int mmc_wait_for_data_req_done(...
{
   ...
   while (1) {
 wait_event_interruptible(context_info-wait,
 (context_info-is_done_rcv ||
  context_info-is_new_req));
static void mmc_blk_rw_rq_prep(...
 {
 ...
 memset(brq, 0, sizeof(struct mmc_blk_request));

 This issue happens very coincidentally; however adding mdelay(1) in
 mmc_wait_data_done as below could duplicate it easily.

 static void mmc_wait_data_done(struct mmc_request *mrq)
 {
   mrq-host-context_info.is_done_rcv = true;
 +mdelay(1);
   wake_up_interruptible(mrq-host-context_info.wait);
  }


 Hi, ulf

 We find this bug on Intel-C3230RK platform for very small probability.

 Whereas I can easily reproduce this case if I add a mdelay(1) or  longer
 delay as Jialing did.

 This patch seems useful to me. Should we push it forward? :)

It seems like a very good idea!

Should we add a fixes tag to it?

[...]

Kind regards
Uffe
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-28 Thread Shawn Lin

在 2015/8/28 18:22, Jialing Fu 写道:


[...]


Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or
longer delay as Jialing did.

This patch seems useful to me. Should we push it forward? :)



It seems like a very good idea!

Should we add a fixes tag to it?



That's cool, but how to add a fixes tag?

[Fixes] mmc: core: fix race condition in mmc_wait_data_done ?   :)



A fixes tag points to an old commit which introduced the bug. If we can't find one, we 
can add a Cc tag to stable. Just search the git log and you will find 
examples.

Like add one line as below?
Fixes: 2220eedfd7ae (mmc: fix async request mechanism for sequential read 
scenarios)



That's it, Jialing. From my git blame, seems this bug has been 
introduced for a long time, but I feel strange that no one had captured 
it before you did.


Anyway, I will add a fixes tag and send v2 ASAP. :)



Kind regards
Uffe




--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-27 Thread Shawn Lin

On 2015/8/28 11:13, Shawn Lin wrote:

From: Jialing Fu 

The following panic is captured in ker3.14, but the issue still exists
in latest kernel.
-
[   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer 
dereference
at virtual address 0578
..
[   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
[   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
[   20.740134] c0 3136 (Compiler) Call trace:
[   20.740165] c0 3136 (Compiler) [] 
_raw_spin_lock_irqsave+0x24/0x60
[   20.740200] c0 3136 (Compiler) [] __wake_up+0x1c/0x54
[   20.740230] c0 3136 (Compiler) [] 
mmc_wait_data_done+0x28/0x34
[   20.740262] c0 3136 (Compiler) [] 
mmc_request_done+0xa4/0x220
[   20.740314] c0 3136 (Compiler) [] 
sdhci_tasklet_finish+0xac/0x264
[   20.740352] c0 3136 (Compiler) [] tasklet_action+0xa0/0x158
[   20.740382] c0 3136 (Compiler) [] __do_softirq+0x10c/0x2e4
[   20.740411] c0 3136 (Compiler) [] irq_exit+0x8c/0xc0
[   20.740439] c0 3136 (Compiler) [] handle_IRQ+0x48/0xac
[   20.740469] c0 3136 (Compiler) [] gic_handle_irq+0x38/0x7c
--
Because in SMP, "mrq" has race condition between below two paths:
path1: CPU0: 
   static void mmc_wait_data_done(struct mmc_request *mrq)
   {
  mrq->host->context_info.is_done_rcv = true;
  //
  // If CPU0 has just finished "is_done_rcv = true" in path1, and at
  // this moment, IRQ or ICache line missing happens in CPU0.
  // What happens in CPU1 (path2)?
  //
  // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
  // path2 would have chance to break from wait_event_interruptible
  // in mmc_wait_for_data_req_done and continue to run for next
  // mmc_request (mmc_blk_rw_rq_prep).
  //
  // Within mmc_blk_rq_prep, mrq is cleared to 0.
  // If below line still gets host from "mrq" as the result of
  // compiler, the panic happens as we traced.
  wake_up_interruptible(>host->context_info.wait);
   }

path2: CPU1: 
   static int mmc_wait_for_data_req_done(...
   {
  ...
  while (1) {
wait_event_interruptible(context_info->wait,
(context_info->is_done_rcv ||
 context_info->is_new_req));
   static void mmc_blk_rw_rq_prep(...
{
...
memset(brq, 0, sizeof(struct mmc_blk_request));

This issue happens very coincidentally; however adding mdelay(1) in
mmc_wait_data_done as below could duplicate it easily.

static void mmc_wait_data_done(struct mmc_request *mrq)
{
  mrq->host->context_info.is_done_rcv = true;
+mdelay(1);
  wake_up_interruptible(>host->context_info.wait);
 }



Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or  longer 
delay as Jialing did.


This patch seems useful to me. Should we push it forward? :)



At runtime, IRQ or ICache line missing may just happen at the same place
of the mdelay(1).

This patch gets the mmc_context_info at the beginning of function, it can
avoid this race condition.

Signed-off-by: Jialing Fu 
Tested-by: Shawn Lin 
---

  drivers/mmc/core/core.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 664b617..0520064 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -358,8 +358,10 @@ EXPORT_SYMBOL(mmc_start_bkops);
   */
  static void mmc_wait_data_done(struct mmc_request *mrq)
  {
-   mrq->host->context_info.is_done_rcv = true;
-   wake_up_interruptible(>host->context_info.wait);
+   struct mmc_context_info *context_info = >host->context_info;
+
+   context_info->is_done_rcv = true;
+   wake_up_interruptible(_info->wait);
  }

  static void mmc_wait_done(struct mmc_request *mrq)




--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] mmc: core: fix race condition in mmc_wait_data_done

2015-08-27 Thread Shawn Lin

On 2015/8/28 11:13, Shawn Lin wrote:

From: Jialing Fu j...@marvell.com

The following panic is captured in ker3.14, but the issue still exists
in latest kernel.
-
[   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer 
dereference
at virtual address 0578
..
[   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
[   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
[   20.740134] c0 3136 (Compiler) Call trace:
[   20.740165] c0 3136 (Compiler) [ffc0008ee900] 
_raw_spin_lock_irqsave+0x24/0x60
[   20.740200] c0 3136 (Compiler) [ffcdd024] __wake_up+0x1c/0x54
[   20.740230] c0 3136 (Compiler) [ffc000639414] 
mmc_wait_data_done+0x28/0x34
[   20.740262] c0 3136 (Compiler) [ffc0006391a0] 
mmc_request_done+0xa4/0x220
[   20.740314] c0 3136 (Compiler) [ffc000656894] 
sdhci_tasklet_finish+0xac/0x264
[   20.740352] c0 3136 (Compiler) [ffca2b58] tasklet_action+0xa0/0x158
[   20.740382] c0 3136 (Compiler) [ffca2078] __do_softirq+0x10c/0x2e4
[   20.740411] c0 3136 (Compiler) [ffca24bc] irq_exit+0x8c/0xc0
[   20.740439] c0 3136 (Compiler) [ffc8489c] handle_IRQ+0x48/0xac
[   20.740469] c0 3136 (Compiler) [ffc81428] gic_handle_irq+0x38/0x7c
--
Because in SMP, mrq has race condition between below two paths:
path1: CPU0: tasklet context
   static void mmc_wait_data_done(struct mmc_request *mrq)
   {
  mrq-host-context_info.is_done_rcv = true;
  //
  // If CPU0 has just finished is_done_rcv = true in path1, and at
  // this moment, IRQ or ICache line missing happens in CPU0.
  // What happens in CPU1 (path2)?
  //
  // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
  // path2 would have chance to break from wait_event_interruptible
  // in mmc_wait_for_data_req_done and continue to run for next
  // mmc_request (mmc_blk_rw_rq_prep).
  //
  // Within mmc_blk_rq_prep, mrq is cleared to 0.
  // If below line still gets host from mrq as the result of
  // compiler, the panic happens as we traced.
  wake_up_interruptible(mrq-host-context_info.wait);
   }

path2: CPU1: The mmcqd thread runs mmc_queue_thread
   static int mmc_wait_for_data_req_done(...
   {
  ...
  while (1) {
wait_event_interruptible(context_info-wait,
(context_info-is_done_rcv ||
 context_info-is_new_req));
   static void mmc_blk_rw_rq_prep(...
{
...
memset(brq, 0, sizeof(struct mmc_blk_request));

This issue happens very coincidentally; however adding mdelay(1) in
mmc_wait_data_done as below could duplicate it easily.

static void mmc_wait_data_done(struct mmc_request *mrq)
{
  mrq-host-context_info.is_done_rcv = true;
+mdelay(1);
  wake_up_interruptible(mrq-host-context_info.wait);
 }



Hi, ulf

We find this bug on Intel-C3230RK platform for very small probability.

Whereas I can easily reproduce this case if I add a mdelay(1) or  longer 
delay as Jialing did.


This patch seems useful to me. Should we push it forward? :)



At runtime, IRQ or ICache line missing may just happen at the same place
of the mdelay(1).

This patch gets the mmc_context_info at the beginning of function, it can
avoid this race condition.

Signed-off-by: Jialing Fu j...@marvell.com
Tested-by: Shawn Lin shawn@rock-chips.com
---

  drivers/mmc/core/core.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 664b617..0520064 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -358,8 +358,10 @@ EXPORT_SYMBOL(mmc_start_bkops);
   */
  static void mmc_wait_data_done(struct mmc_request *mrq)
  {
-   mrq-host-context_info.is_done_rcv = true;
-   wake_up_interruptible(mrq-host-context_info.wait);
+   struct mmc_context_info *context_info = mrq-host-context_info;
+
+   context_info-is_done_rcv = true;
+   wake_up_interruptible(context_info-wait);
  }

  static void mmc_wait_done(struct mmc_request *mrq)




--
Best Regards
Shawn Lin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/