RE: [PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

2016-06-01 Thread Dexuan Cui
> >> [5.472143] BUG: unable to handle kernel paging request at
> >> 00079fff5288
> >> [5.477107] IP: [] vmbus_onoffer+0x311/0x570
> >> [hv_vmbus]
> >> ...
> >>   Vitaly
> >
> > I can't reproduce the panic somehow, but I did find a bug in
> vmbus_process_offer():
> >
> > "hv_event_tasklet_disable(channel) and
> hv_event_tasklet_enable(channel)"
> > are buggy: the 'channel' parameter should be 'newchannel'.
> >
> > This was a copy-and-paste bug... Sorry!
> > Can you fix this and see if the panic will disappear in your side?
> 
> This fixes the issue I'm seeing, thanks!
> 
>   Vitaly

Many thanks, Vitaly!

I'll post v4 shortly.

Thanks,
-- Dexuan



Re: [PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

2016-06-01 Thread Vitaly Kuznetsov
Dexuan Cui  writes:

>> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
>> Sent: Wednesday, June 1, 2016 0:27
>> To: Dexuan Cui 
>> Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
>> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
>> jasow...@redhat.com; KY Srinivasan ; Haiyang
>> Zhang ; rolf.neugeba...@docker.com;
>> dave.sc...@docker.com; ian.campb...@docker.com
>> Subject: Re: [PATCH v3] Drivers: hv: vmbus: fix the race when querying &
>> updating the percpu list
>> 
>> Dexuan Cui  writes:
>> 
>> > There is a rare race when we remove an entry from the global list
>> > hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
>> > percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
>> > process_chn_event() -> pcpu_relid2channel() is trying to query the list,
>> > we can get the kernel fault.
>> >
>> > Similarly, we also have the issue in the code path: vmbus_process_offer()
>> ->
>> > percpu_channel_enq().
>> >
>> > We can resolve the issue by disabling the tasklet when updating the list.
>> >
>> > The patch also moves vmbus_release_relid() to a later place where
>> > the channel has been removed from the per-cpu and the global lists.
>> >
>> > Reported-by: Rolf Neugebauer 
>> > Cc: Vitaly Kuznetsov 
>> > Signed-off-by: Dexuan Cui 
>> 
>> Tested 4.7-rc1 with this path applied and kernel always crashes on boot
>> (WS2016TP5, 12 CPU SMP guest, Generation 2):
>> 
>> [5.464251] hv_vmbus: Hyper-V Host Build:14300-10.0-1-0.1006; Vmbus
>> version:4.0
>> [5.471666] hv_vmbus: Unknown GUID: f8e65716-3cb3-4a06-9a60-
>> 1889c5cccab5
>> [5.472143] BUG: unable to handle kernel paging request at
>> 00079fff5288
>> [5.477107] IP: [] vmbus_onoffer+0x311/0x570
>> [hv_vmbus]
>> ...
>>   Vitaly
>
> I can't reproduce the panic somehow, but I did find a bug in 
> vmbus_process_offer():
>
> "hv_event_tasklet_disable(channel) and hv_event_tasklet_enable(channel)"
> are buggy: the 'channel' parameter should be 'newchannel'.
>
> This was a copy-and-paste bug... Sorry!
> Can you fix this and see if the panic will disappear in your side?

This fixes the issue I'm seeing, thanks!

-- 
  Vitaly


RE: [PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

2016-05-31 Thread Dexuan Cui
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Wednesday, June 1, 2016 0:27
> To: Dexuan Cui 
> Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> jasow...@redhat.com; KY Srinivasan ; Haiyang
> Zhang ; rolf.neugeba...@docker.com;
> dave.sc...@docker.com; ian.campb...@docker.com
> Subject: Re: [PATCH v3] Drivers: hv: vmbus: fix the race when querying &
> updating the percpu list
> 
> Dexuan Cui  writes:
> 
> > There is a rare race when we remove an entry from the global list
> > hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
> > percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
> > process_chn_event() -> pcpu_relid2channel() is trying to query the list,
> > we can get the kernel fault.
> >
> > Similarly, we also have the issue in the code path: vmbus_process_offer()
> ->
> > percpu_channel_enq().
> >
> > We can resolve the issue by disabling the tasklet when updating the list.
> >
> > The patch also moves vmbus_release_relid() to a later place where
> > the channel has been removed from the per-cpu and the global lists.
> >
> > Reported-by: Rolf Neugebauer 
> > Cc: Vitaly Kuznetsov 
> > Signed-off-by: Dexuan Cui 
> 
> Tested 4.7-rc1 with this path applied and kernel always crashes on boot
> (WS2016TP5, 12 CPU SMP guest, Generation 2):
> 
> [5.464251] hv_vmbus: Hyper-V Host Build:14300-10.0-1-0.1006; Vmbus
> version:4.0
> [5.471666] hv_vmbus: Unknown GUID: f8e65716-3cb3-4a06-9a60-
> 1889c5cccab5
> [5.472143] BUG: unable to handle kernel paging request at
> 00079fff5288
> [5.477107] IP: [] vmbus_onoffer+0x311/0x570
> [hv_vmbus]
> ...
>   Vitaly

I can't reproduce the panic somehow, but I did find a bug in 
vmbus_process_offer():

"hv_event_tasklet_disable(channel) and hv_event_tasklet_enable(channel)"
are buggy: the 'channel' parameter should be 'newchannel'.

This was a copy-and-paste bug... Sorry!
Can you fix this and see if the panic will disappear in your side?

Thanks,
-- Dexuan


Re: [PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

2016-05-31 Thread Vitaly Kuznetsov
Dexuan Cui  writes:

> There is a rare race when we remove an entry from the global list
> hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
> percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
> process_chn_event() -> pcpu_relid2channel() is trying to query the list,
> we can get the kernel fault.
>
> Similarly, we also have the issue in the code path: vmbus_process_offer() ->
> percpu_channel_enq().
>
> We can resolve the issue by disabling the tasklet when updating the list.
>
> The patch also moves vmbus_release_relid() to a later place where
> the channel has been removed from the per-cpu and the global lists.
>
> Reported-by: Rolf Neugebauer 
> Cc: Vitaly Kuznetsov 
> Signed-off-by: Dexuan Cui 

Tested 4.7-rc1 with this path applied and kernel always crashes on boot
(WS2016TP5, 12 CPU SMP guest, Generation 2):

[5.464251] hv_vmbus: Hyper-V Host Build:14300-10.0-1-0.1006; Vmbus 
version:4.0
[5.471666] hv_vmbus: Unknown GUID: f8e65716-3cb3-4a06-9a60-1889c5cccab5
[5.472143] BUG: unable to handle kernel paging request at 00079fff5288
[5.477107] IP: [] vmbus_onoffer+0x311/0x570 [hv_vmbus]
[5.477107] PGD 0 
[5.477107] Oops:  [#1] SMP
[5.477107] Modules linked in: hv_vmbus
[5.477107] CPU: 11 PID: 189 Comm: kworker/11:1 Not tainted 
4.7.0-rc1_dc1_test+ #262
[5.477107] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[5.477107] Workqueue: hv_vmbus_con vmbus_onmessage_work [hv_vmbus]
[5.477107] task: 8801796e4480 ti: 8801796e8000 task.ti: 
8801796e8000
[5.477107] RIP: 0010:[]  [] 
vmbus_onoffer+0x311/0x570 [hv_vmbus]
[5.477107] RSP: 0018:8801796ebc50  EFLAGS: 00010286
[5.477107] RAX: 8801 RBX: 880032641000 RCX: 0050
[5.477107] RDX: 0004 RSI:  RDI: 880032641000
[5.477107] RBP: 8801796ebd10 R08: 0001 R09: 0001
[5.477107] R10:  R11: 0001 R12: 0010
[5.477107] R13: 4a063cb3f8e65716 R14: b5caccc58918609a R15: a0008b60
[5.477107] FS:  () GS:88017c00() 
knlGS:
[5.477107] CS:  0010 DS:  ES:  CR0: 80050033
[5.477107] CR2: 00079fff5288 CR3: 32613000 CR4: 001406e0
[5.477107] Stack:
[5.477107]  880032641780 88003264102c 00100146 
a000646e
[5.477107]  8801796e5090 8801796e4480 004f827d 
0001
[5.477107]   8801796ebce8 810eaebc 
796e5058
[5.477107] Call Trace:
[5.477107]  [] ? __lock_acquire+0x3dc/0x730
[5.477107]  [] vmbus_onmessage+0x33/0xa0 [hv_vmbus]
[5.477107]  [] vmbus_onmessage_work+0x21/0x30 [hv_vmbus]
[5.653321]  [] process_one_work+0x1ff/0x6d0
[5.653321]  [] ? process_one_work+0x181/0x6d0
[5.653321]  [] worker_thread+0x4e/0x490
[5.653321]  [] ? process_one_work+0x6d0/0x6d0
[5.653321]  [] ? process_one_work+0x6d0/0x6d0
[5.653321]  [] kthread+0x101/0x120
[5.653321]  [] ret_from_fork+0x1f/0x40
[5.653321]  [] ? kthread_create_on_node+0x250/0x250
[5.653321] Code: 74 24 08 48 c7 c7 60 6c 00 a0 e8 0a 9e 1b e1 b8 10 00 00 
00 66 89 44 24 16 44 89 e6 48 89 df e8 f6 f9 ff ff 41 8b 87 f4 02 00 00 <48> 8b 
14 c5 80 12 03 a0 f0 ff 42 10 48 8b 42 08 a8 02 75 f8 0f 
[5.653321] RIP  [] vmbus_onoffer+0x311/0x570 [hv_vmbus]
[5.653321]  RSP 
[5.653321] CR2: 00079fff5288
[5.653321] ---[ end trace 62df6070997f1f10 ]---
[5.653321] Kernel panic - not syncing: Fatal exception
[5.653321] Kernel Offset: disabled
[5.653321] ---[ end Kernel panic - not syncing: Fatal exception
[5.653480] [ cut here ]

I can investigate it tomorrow if this doesn't reproduce for you.



-- 
  Vitaly


[PATCH v3] Drivers: hv: vmbus: fix the race when querying & updating the percpu list

2016-05-21 Thread Dexuan Cui
There is a rare race when we remove an entry from the global list
hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
process_chn_event() -> pcpu_relid2channel() is trying to query the list,
we can get the kernel fault.

Similarly, we also have the issue in the code path: vmbus_process_offer() ->
percpu_channel_enq().

We can resolve the issue by disabling the tasklet when updating the list.

The patch also moves vmbus_release_relid() to a later place where
the channel has been removed from the per-cpu and the global lists.

Reported-by: Rolf Neugebauer 
Cc: Vitaly Kuznetsov 
Signed-off-by: Dexuan Cui 
---

v2: added tasklet_schedule() after tasklet_enable(). Thanks, Vitaly!

v3: I shouldn't have moved percpu_channel_deq()
from  hv_process_channel_removal() to vmbus_close_internal(): the
channel couldn't be removed from the per-cpu list, if the channel state
was not CHANNEL_OPENED_STATE. v3 fixed this.

v3 also added 2 wrapper functions for the disable/enable stuff.
v3 also moved vmbus_release_relid() to a later safe place.

You can also get v3 by
https://github.com/dcui/linux/commit/2f314fb9352020f70b094bf31539afd3a18a5545

 drivers/hv/channel.c  |  6 ++
 drivers/hv/channel_mgmt.c | 32 
 include/linux/hyperv.h|  3 +++
 3 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 56dd261..ef8e9b5 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -512,7 +512,6 @@ static void reset_channel_cb(void *arg)
 static int vmbus_close_internal(struct vmbus_channel *channel)
 {
struct vmbus_channel_close_channel *msg;
-   struct tasklet_struct *tasklet;
int ret;
 
/*
@@ -524,8 +523,7 @@ static int vmbus_close_internal(struct vmbus_channel 
*channel)
 * To resolve the race, we can serialize them by disabling the
 * tasklet when the latter is running here.
 */
-   tasklet = hv_context.event_dpc[channel->target_cpu];
-   tasklet_disable(tasklet);
+   hv_event_tasklet_disable(channel);
 
/*
 * In case a device driver's probe() fails (e.g.,
@@ -591,7 +589,7 @@ static int vmbus_close_internal(struct vmbus_channel 
*channel)
get_order(channel->ringbuffer_pagecount * PAGE_SIZE));
 
 out:
-   tasklet_enable(tasklet);
+   hv_event_tasklet_enable(channel);
 
return ret;
 }
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 38b682ba..3bcf141 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -21,6 +21,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -303,16 +304,32 @@ static void vmbus_release_relid(u32 relid)
vmbus_post_msg(&msg, sizeof(struct vmbus_channel_relid_released));
 }
 
+void hv_event_tasklet_disable(struct vmbus_channel *channel)
+{
+   struct tasklet_struct *tasklet;
+   tasklet = hv_context.event_dpc[channel->target_cpu];
+   tasklet_disable(tasklet);
+}
+
+void hv_event_tasklet_enable(struct vmbus_channel *channel)
+{
+   struct tasklet_struct *tasklet;
+   tasklet = hv_context.event_dpc[channel->target_cpu];
+   tasklet_enable(tasklet);
+
+   /* In case there is any pending event */
+   tasklet_schedule(tasklet);
+}
+
 void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid)
 {
unsigned long flags;
struct vmbus_channel *primary_channel;
 
-   vmbus_release_relid(relid);
-
BUG_ON(!channel->rescind);
BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
 
+   hv_event_tasklet_disable(channel);
if (channel->target_cpu != get_cpu()) {
put_cpu();
smp_call_function_single(channel->target_cpu,
@@ -321,6 +338,7 @@ void hv_process_channel_removal(struct vmbus_channel 
*channel, u32 relid)
percpu_channel_deq(channel);
put_cpu();
}
+   hv_event_tasklet_enable(channel);
 
if (channel->primary_channel == NULL) {
list_del(&channel->listentry);
@@ -341,6 +359,8 @@ void hv_process_channel_removal(struct vmbus_channel 
*channel, u32 relid)
cpumask_clear_cpu(channel->target_cpu,
  &primary_channel->alloced_cpus_in_node);
 
+   vmbus_release_relid(relid);
+
free_channel(channel);
 }
 
@@ -409,6 +429,7 @@ static void vmbus_process_offer(struct vmbus_channel 
*newchannel)
 
init_vp_index(newchannel, dev_type);
 
+   hv_event_tasklet_disable(channel);
if (newchannel->target_cpu != get_cpu()) {
put_cpu();
smp_call_function_single(newchannel->target_cpu,
@@ -418,6 +439,7 @@ static void vmbus_process_offer(struct vmbus_channel 
*newchannel)
percpu_channel_enq(newchannel);
put_cpu();
}
+