[PATCH 0/7] Drivers: hv: Some miscellaneous fixes
From: K. Y. SrinivasanSome miscellaneous fixes and enhancements. These patches were all sent earlier but failed to apply clean on Greg's tree. These have now been rebased. Alex Ng (2): Drivers: hv: utils: Continue to poll VSS channel after handling requests. Drivers: hv: utils: Check VSS daemon is listening before a hot backup K. Y. Srinivasan (1): Drivers: hv: Introduce a policy for controlling channel affinity Vitaly Kuznetsov (4): Drivers: hv: cleanup vmbus_open() for wrap around mappings Drivers: hv: ring_buffer: wrap around mappings for ring buffers Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer() Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw() drivers/hv/channel.c | 66 --- drivers/hv/channel_mgmt.c | 68 +++-- drivers/hv/hv_snapshot.c | 93 ++--- drivers/hv/hyperv_vmbus.h |4 +- drivers/hv/ring_buffer.c | 61 + include/linux/hyperv.h| 55 -- tools/hv/hv_vss_daemon.c |3 + 7 files changed, 193 insertions(+), 157 deletions(-) -- 1.7.4.1
[PATCH 0/7] Drivers: hv: Some miscellaneous fixes
From: K. Y. Srinivasan Some miscellaneous fixes and enhancements. These patches were all sent earlier but failed to apply clean on Greg's tree. These have now been rebased. Alex Ng (2): Drivers: hv: utils: Continue to poll VSS channel after handling requests. Drivers: hv: utils: Check VSS daemon is listening before a hot backup K. Y. Srinivasan (1): Drivers: hv: Introduce a policy for controlling channel affinity Vitaly Kuznetsov (4): Drivers: hv: cleanup vmbus_open() for wrap around mappings Drivers: hv: ring_buffer: wrap around mappings for ring buffers Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer() Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw() drivers/hv/channel.c | 66 --- drivers/hv/channel_mgmt.c | 68 +++-- drivers/hv/hv_snapshot.c | 93 ++--- drivers/hv/hyperv_vmbus.h |4 +- drivers/hv/ring_buffer.c | 61 + include/linux/hyperv.h| 55 -- tools/hv/hv_vss_daemon.c |3 + 7 files changed, 193 insertions(+), 157 deletions(-) -- 1.7.4.1
[PATCH 5/5] Drivers: hv: balloon: Use available memory value in pressure report
From: Alex NgReports for available memory should use the si_mem_available() value. The previous freeram value does not include available page cache memory. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 13 + 1 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index d55e0e7..fdf8da9 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1075,7 +1075,6 @@ static unsigned long compute_balloon_floor(void) static void post_status(struct hv_dynmem_device *dm) { struct dm_status status; - struct sysinfo val; unsigned long now = jiffies; unsigned long last_post = last_post_time; @@ -1087,7 +1086,6 @@ static void post_status(struct hv_dynmem_device *dm) if (!time_after(now, (last_post_time + HZ))) return; - si_meminfo(); memset(, 0, sizeof(struct dm_status)); status.hdr.type = DM_STATUS_REPORT; status.hdr.size = sizeof(struct dm_status); @@ -1103,7 +1101,7 @@ static void post_status(struct hv_dynmem_device *dm) * num_pages_onlined) as committed to the host, otherwise it can try * asking us to balloon them out. */ - status.num_avail = val.freeram; + status.num_avail = si_mem_available(); status.num_committed = vm_memory_committed() + dm->num_pages_ballooned + (dm->num_pages_added > dm->num_pages_onlined ? @@ -1209,7 +1207,7 @@ static void balloon_up(struct work_struct *dummy) int ret; bool done = false; int i; - struct sysinfo val; + long avail_pages; unsigned long floor; /* The host balloons pages in 2M granularity. */ @@ -1221,12 +1219,12 @@ static void balloon_up(struct work_struct *dummy) */ alloc_unit = 512; - si_meminfo(); + avail_pages = si_mem_available(); floor = compute_balloon_floor(); /* Refuse to balloon below the floor, keep the 2M granularity. */ - if (val.freeram < num_pages || val.freeram - num_pages < floor) { - num_pages = val.freeram > floor ? (val.freeram - floor) : 0; + if (avail_pages < num_pages || avail_pages - num_pages < floor) { + num_pages = avail_pages > floor ? (avail_pages - floor) : 0; num_pages -= num_pages % PAGES_IN_2M; } @@ -1237,7 +1235,6 @@ static void balloon_up(struct work_struct *dummy) bl_resp->hdr.size = sizeof(struct dm_balloon_response); bl_resp->more_pages = 1; - num_pages -= num_ballooned; num_ballooned = alloc_balloon_pages(_device, num_pages, bl_resp, alloc_unit); -- 1.7.4.1
[PATCH 2/5] Drivers: hv: balloon: account for gaps in hot add regions
From: Vitaly KuznetsovI'm observing the following hot add requests from the WS2012 host: hot_add_req: start_pfn = 0x108200 count = 330752 hot_add_req: start_pfn = 0x158e00 count = 193536 hot_add_req: start_pfn = 0x188400 count = 239616 As the host doesn't specify hot add regions we're trying to create 128Mb-aligned region covering the first request, we create the 0x108000 - 0x16 region and we add 0x108000 - 0x158e00 memory. The second request passes the pfn_covered() check, we enlarge the region to 0x108000 - 0x19 and add 0x158e00 - 0x188200 memory. The problem emerges with the third request as it starts at 0x188400 so there is a 0x200 gap which is not covered. As the end of our region is 0x19 now it again passes the pfn_covered() check were we just adjust the covered_end_pfn and make it 0x188400 instead of 0x188200 which means that we'll try to online 0x188200-0x188400 pages but these pages were never assigned to us and we crash. We can't react to such requests by creating new hot add regions as it may happen that the whole suggested range falls into the previously identified 128Mb-aligned area so we'll end up adding nothing or create intersecting regions and our current logic doesn't allow that. Instead, create a list of such 'gaps' and check for them in the page online callback. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 131 +- 1 files changed, 94 insertions(+), 37 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 4ae26d6..18766f6 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -441,6 +441,16 @@ struct hv_hotadd_state { unsigned long covered_end_pfn; unsigned long ha_end_pfn; unsigned long end_pfn; + /* +* A list of gaps. +*/ + struct list_head gap_list; +}; + +struct hv_hotadd_gap { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; }; struct balloon_state { @@ -596,18 +606,46 @@ static struct notifier_block hv_memory_nb = { .priority = 0 }; +/* Check if the particular page is backed and can be onlined and online it. */ +static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) +{ + unsigned long cur_start_pgp; + unsigned long cur_end_pgp; + struct hv_hotadd_gap *gap; + + cur_start_pgp = (unsigned long)pfn_to_page(has->covered_start_pfn); + cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); -static void hv_bring_pgs_online(unsigned long start_pfn, unsigned long size) + /* The page is not backed. */ + if (((unsigned long)pg < cur_start_pgp) || + ((unsigned long)pg >= cur_end_pgp)) + return; + + /* Check for gaps. */ + list_for_each_entry(gap, >gap_list, list) { + cur_start_pgp = (unsigned long) + pfn_to_page(gap->start_pfn); + cur_end_pgp = (unsigned long) + pfn_to_page(gap->end_pfn); + if (((unsigned long)pg >= cur_start_pgp) && + ((unsigned long)pg < cur_end_pgp)) { + return; + } + } + + /* This frame is currently backed; online the page. */ + __online_page_set_limits(pg); + __online_page_increment_counters(pg); + __online_page_free(pg); +} + +static void hv_bring_pgs_online(struct hv_hotadd_state *has, + unsigned long start_pfn, unsigned long size) { int i; - for (i = 0; i < size; i++) { - struct page *pg; - pg = pfn_to_page(start_pfn + i); - __online_page_set_limits(pg); - __online_page_increment_counters(pg); - __online_page_free(pg); - } + for (i = 0; i < size; i++) + hv_page_online_one(has, pfn_to_page(start_pfn + i)); } static void hv_mem_hot_add(unsigned long start, unsigned long size, @@ -684,26 +722,24 @@ static void hv_online_page(struct page *pg) list_for_each(cur, _device.ha_region_list) { has = list_entry(cur, struct hv_hotadd_state, list); cur_start_pgp = (unsigned long) - pfn_to_page(has->covered_start_pfn); - cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); + pfn_to_page(has->start_pfn); + cur_end_pgp = (unsigned long)pfn_to_page(has->end_pfn); - if (((unsigned long)pg >= cur_start_pgp) && - ((unsigned long)pg < cur_end_pgp)) { - /* -* This frame is currently backed; online the -* page. -*/ - __online_page_set_limits(pg); -
[PATCH 4/5] Drivers: hv: balloon: replace ha_region_mutex with spinlock
From: Vitaly Kuznetsovlockdep reports possible circular locking dependency when udev is used for memory onlining: systemd-udevd/3996 is trying to acquire lock: ((memory_chain).rwsem){.+}, at: [] __blocking_notifier_call_chain+0x4e/0xc0 but task is already holding lock: (_device.ha_region_mutex){+.+.+.}, at: [] hv_memory_notifier+0x5e/0xc0 [hv_balloon] ... which is probably a false positive because we take and release ha_region_mutex from memory notifier chain depending on the arg. No real deadlocks were reported so far (though I'm not really sure about preemptible kernels...) but we don't really need to hold the mutex for so long. We use it to protect ha_region_list (and its members) and the num_pages_onlined counter. None of these operations require us to sleep and nothing is slow, switch to using spinlock with interrupts disabled. While on it, replace list_for_each -> list_for_each_entry as we actually need entries in all these cases, drop meaningless list_empty() checks. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 98 +- 1 files changed, 53 insertions(+), 45 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 3441326..d55e0e7 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -547,7 +547,11 @@ struct hv_dynmem_device { */ struct task_struct *thread; - struct mutex ha_region_mutex; + /* +* Protects ha_region_list, num_pages_onlined counter and individual +* regions from ha_region_list. +*/ + spinlock_t ha_lock; /* * A list of hot-add regions. @@ -571,18 +575,14 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, void *v) { struct memory_notify *mem = (struct memory_notify *)v; + unsigned long flags; switch (val) { - case MEM_GOING_ONLINE: - mutex_lock(_device.ha_region_mutex); - break; - case MEM_ONLINE: + spin_lock_irqsave(_device.ha_lock, flags); dm_device.num_pages_onlined += mem->nr_pages; + spin_unlock_irqrestore(_device.ha_lock, flags); case MEM_CANCEL_ONLINE: - if (val == MEM_ONLINE || - mutex_is_locked(_device.ha_region_mutex)) - mutex_unlock(_device.ha_region_mutex); if (dm_device.ha_waiting) { dm_device.ha_waiting = false; complete(_device.ol_waitevent); @@ -590,10 +590,11 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, break; case MEM_OFFLINE: - mutex_lock(_device.ha_region_mutex); + spin_lock_irqsave(_device.ha_lock, flags); dm_device.num_pages_onlined -= mem->nr_pages; - mutex_unlock(_device.ha_region_mutex); + spin_unlock_irqrestore(_device.ha_lock, flags); break; + case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: case MEM_CANCEL_OFFLINE: break; @@ -657,9 +658,12 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, unsigned long start_pfn; unsigned long processed_pfn; unsigned long total_pfn = pfn_count; + unsigned long flags; for (i = 0; i < (size/HA_CHUNK); i++) { start_pfn = start + (i * HA_CHUNK); + + spin_lock_irqsave(_device.ha_lock, flags); has->ha_end_pfn += HA_CHUNK; if (total_pfn > HA_CHUNK) { @@ -671,11 +675,11 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, } has->covered_end_pfn += processed_pfn; + spin_unlock_irqrestore(_device.ha_lock, flags); init_completion(_device.ol_waitevent); dm_device.ha_waiting = !memhp_auto_online; - mutex_unlock(_device.ha_region_mutex); nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); ret = add_memory(nid, PFN_PHYS((start_pfn)), (HA_CHUNK << PAGE_SHIFT)); @@ -692,9 +696,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, */ do_hot_add = false; } + spin_lock_irqsave(_device.ha_lock, flags); has->ha_end_pfn -= HA_CHUNK; has->covered_end_pfn -= processed_pfn; - mutex_lock(_device.ha_region_mutex); + spin_unlock_irqrestore(_device.ha_lock, flags); break; } @@ -708,7 +713,6 @@ static void
[PATCH 5/5] Drivers: hv: balloon: Use available memory value in pressure report
From: Alex Ng Reports for available memory should use the si_mem_available() value. The previous freeram value does not include available page cache memory. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 13 + 1 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index d55e0e7..fdf8da9 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1075,7 +1075,6 @@ static unsigned long compute_balloon_floor(void) static void post_status(struct hv_dynmem_device *dm) { struct dm_status status; - struct sysinfo val; unsigned long now = jiffies; unsigned long last_post = last_post_time; @@ -1087,7 +1086,6 @@ static void post_status(struct hv_dynmem_device *dm) if (!time_after(now, (last_post_time + HZ))) return; - si_meminfo(); memset(, 0, sizeof(struct dm_status)); status.hdr.type = DM_STATUS_REPORT; status.hdr.size = sizeof(struct dm_status); @@ -1103,7 +1101,7 @@ static void post_status(struct hv_dynmem_device *dm) * num_pages_onlined) as committed to the host, otherwise it can try * asking us to balloon them out. */ - status.num_avail = val.freeram; + status.num_avail = si_mem_available(); status.num_committed = vm_memory_committed() + dm->num_pages_ballooned + (dm->num_pages_added > dm->num_pages_onlined ? @@ -1209,7 +1207,7 @@ static void balloon_up(struct work_struct *dummy) int ret; bool done = false; int i; - struct sysinfo val; + long avail_pages; unsigned long floor; /* The host balloons pages in 2M granularity. */ @@ -1221,12 +1219,12 @@ static void balloon_up(struct work_struct *dummy) */ alloc_unit = 512; - si_meminfo(); + avail_pages = si_mem_available(); floor = compute_balloon_floor(); /* Refuse to balloon below the floor, keep the 2M granularity. */ - if (val.freeram < num_pages || val.freeram - num_pages < floor) { - num_pages = val.freeram > floor ? (val.freeram - floor) : 0; + if (avail_pages < num_pages || avail_pages - num_pages < floor) { + num_pages = avail_pages > floor ? (avail_pages - floor) : 0; num_pages -= num_pages % PAGES_IN_2M; } @@ -1237,7 +1235,6 @@ static void balloon_up(struct work_struct *dummy) bl_resp->hdr.size = sizeof(struct dm_balloon_response); bl_resp->more_pages = 1; - num_pages -= num_ballooned; num_ballooned = alloc_balloon_pages(_device, num_pages, bl_resp, alloc_unit); -- 1.7.4.1
[PATCH 2/5] Drivers: hv: balloon: account for gaps in hot add regions
From: Vitaly Kuznetsov I'm observing the following hot add requests from the WS2012 host: hot_add_req: start_pfn = 0x108200 count = 330752 hot_add_req: start_pfn = 0x158e00 count = 193536 hot_add_req: start_pfn = 0x188400 count = 239616 As the host doesn't specify hot add regions we're trying to create 128Mb-aligned region covering the first request, we create the 0x108000 - 0x16 region and we add 0x108000 - 0x158e00 memory. The second request passes the pfn_covered() check, we enlarge the region to 0x108000 - 0x19 and add 0x158e00 - 0x188200 memory. The problem emerges with the third request as it starts at 0x188400 so there is a 0x200 gap which is not covered. As the end of our region is 0x19 now it again passes the pfn_covered() check were we just adjust the covered_end_pfn and make it 0x188400 instead of 0x188200 which means that we'll try to online 0x188200-0x188400 pages but these pages were never assigned to us and we crash. We can't react to such requests by creating new hot add regions as it may happen that the whole suggested range falls into the previously identified 128Mb-aligned area so we'll end up adding nothing or create intersecting regions and our current logic doesn't allow that. Instead, create a list of such 'gaps' and check for them in the page online callback. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 131 +- 1 files changed, 94 insertions(+), 37 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 4ae26d6..18766f6 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -441,6 +441,16 @@ struct hv_hotadd_state { unsigned long covered_end_pfn; unsigned long ha_end_pfn; unsigned long end_pfn; + /* +* A list of gaps. +*/ + struct list_head gap_list; +}; + +struct hv_hotadd_gap { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; }; struct balloon_state { @@ -596,18 +606,46 @@ static struct notifier_block hv_memory_nb = { .priority = 0 }; +/* Check if the particular page is backed and can be onlined and online it. */ +static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) +{ + unsigned long cur_start_pgp; + unsigned long cur_end_pgp; + struct hv_hotadd_gap *gap; + + cur_start_pgp = (unsigned long)pfn_to_page(has->covered_start_pfn); + cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); -static void hv_bring_pgs_online(unsigned long start_pfn, unsigned long size) + /* The page is not backed. */ + if (((unsigned long)pg < cur_start_pgp) || + ((unsigned long)pg >= cur_end_pgp)) + return; + + /* Check for gaps. */ + list_for_each_entry(gap, >gap_list, list) { + cur_start_pgp = (unsigned long) + pfn_to_page(gap->start_pfn); + cur_end_pgp = (unsigned long) + pfn_to_page(gap->end_pfn); + if (((unsigned long)pg >= cur_start_pgp) && + ((unsigned long)pg < cur_end_pgp)) { + return; + } + } + + /* This frame is currently backed; online the page. */ + __online_page_set_limits(pg); + __online_page_increment_counters(pg); + __online_page_free(pg); +} + +static void hv_bring_pgs_online(struct hv_hotadd_state *has, + unsigned long start_pfn, unsigned long size) { int i; - for (i = 0; i < size; i++) { - struct page *pg; - pg = pfn_to_page(start_pfn + i); - __online_page_set_limits(pg); - __online_page_increment_counters(pg); - __online_page_free(pg); - } + for (i = 0; i < size; i++) + hv_page_online_one(has, pfn_to_page(start_pfn + i)); } static void hv_mem_hot_add(unsigned long start, unsigned long size, @@ -684,26 +722,24 @@ static void hv_online_page(struct page *pg) list_for_each(cur, _device.ha_region_list) { has = list_entry(cur, struct hv_hotadd_state, list); cur_start_pgp = (unsigned long) - pfn_to_page(has->covered_start_pfn); - cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); + pfn_to_page(has->start_pfn); + cur_end_pgp = (unsigned long)pfn_to_page(has->end_pfn); - if (((unsigned long)pg >= cur_start_pgp) && - ((unsigned long)pg < cur_end_pgp)) { - /* -* This frame is currently backed; online the -* page. -*/ - __online_page_set_limits(pg); - __online_page_increment_counters(pg); -
[PATCH 4/5] Drivers: hv: balloon: replace ha_region_mutex with spinlock
From: Vitaly Kuznetsov lockdep reports possible circular locking dependency when udev is used for memory onlining: systemd-udevd/3996 is trying to acquire lock: ((memory_chain).rwsem){.+}, at: [] __blocking_notifier_call_chain+0x4e/0xc0 but task is already holding lock: (_device.ha_region_mutex){+.+.+.}, at: [] hv_memory_notifier+0x5e/0xc0 [hv_balloon] ... which is probably a false positive because we take and release ha_region_mutex from memory notifier chain depending on the arg. No real deadlocks were reported so far (though I'm not really sure about preemptible kernels...) but we don't really need to hold the mutex for so long. We use it to protect ha_region_list (and its members) and the num_pages_onlined counter. None of these operations require us to sleep and nothing is slow, switch to using spinlock with interrupts disabled. While on it, replace list_for_each -> list_for_each_entry as we actually need entries in all these cases, drop meaningless list_empty() checks. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 98 +- 1 files changed, 53 insertions(+), 45 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 3441326..d55e0e7 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -547,7 +547,11 @@ struct hv_dynmem_device { */ struct task_struct *thread; - struct mutex ha_region_mutex; + /* +* Protects ha_region_list, num_pages_onlined counter and individual +* regions from ha_region_list. +*/ + spinlock_t ha_lock; /* * A list of hot-add regions. @@ -571,18 +575,14 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, void *v) { struct memory_notify *mem = (struct memory_notify *)v; + unsigned long flags; switch (val) { - case MEM_GOING_ONLINE: - mutex_lock(_device.ha_region_mutex); - break; - case MEM_ONLINE: + spin_lock_irqsave(_device.ha_lock, flags); dm_device.num_pages_onlined += mem->nr_pages; + spin_unlock_irqrestore(_device.ha_lock, flags); case MEM_CANCEL_ONLINE: - if (val == MEM_ONLINE || - mutex_is_locked(_device.ha_region_mutex)) - mutex_unlock(_device.ha_region_mutex); if (dm_device.ha_waiting) { dm_device.ha_waiting = false; complete(_device.ol_waitevent); @@ -590,10 +590,11 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, break; case MEM_OFFLINE: - mutex_lock(_device.ha_region_mutex); + spin_lock_irqsave(_device.ha_lock, flags); dm_device.num_pages_onlined -= mem->nr_pages; - mutex_unlock(_device.ha_region_mutex); + spin_unlock_irqrestore(_device.ha_lock, flags); break; + case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: case MEM_CANCEL_OFFLINE: break; @@ -657,9 +658,12 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, unsigned long start_pfn; unsigned long processed_pfn; unsigned long total_pfn = pfn_count; + unsigned long flags; for (i = 0; i < (size/HA_CHUNK); i++) { start_pfn = start + (i * HA_CHUNK); + + spin_lock_irqsave(_device.ha_lock, flags); has->ha_end_pfn += HA_CHUNK; if (total_pfn > HA_CHUNK) { @@ -671,11 +675,11 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, } has->covered_end_pfn += processed_pfn; + spin_unlock_irqrestore(_device.ha_lock, flags); init_completion(_device.ol_waitevent); dm_device.ha_waiting = !memhp_auto_online; - mutex_unlock(_device.ha_region_mutex); nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); ret = add_memory(nid, PFN_PHYS((start_pfn)), (HA_CHUNK << PAGE_SHIFT)); @@ -692,9 +696,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, */ do_hot_add = false; } + spin_lock_irqsave(_device.ha_lock, flags); has->ha_end_pfn -= HA_CHUNK; has->covered_end_pfn -= processed_pfn; - mutex_lock(_device.ha_region_mutex); + spin_unlock_irqrestore(_device.ha_lock, flags); break; } @@ -708,7 +713,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, if
[PATCH 3/5] Drivers: hv: balloon: don't wait for ol_waitevent when memhp_auto_online is enabled
From: Vitaly KuznetsovWith the recently introduced in-kernel memory onlining (MEMORY_HOTPLUG_DEFAULT_ONLINE) these is no point in waiting for pages to come online in the driver and we can get rid of the waiting. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 15 +-- 1 files changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 18766f6..3441326 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -673,7 +673,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, has->covered_end_pfn += processed_pfn; init_completion(_device.ol_waitevent); - dm_device.ha_waiting = true; + dm_device.ha_waiting = !memhp_auto_online; mutex_unlock(_device.ha_region_mutex); nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); @@ -699,12 +699,15 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, } /* -* Wait for the memory block to be onlined. -* Since the hot add has succeeded, it is ok to -* proceed even if the pages in the hot added region -* have not been "onlined" within the allowed time. +* Wait for the memory block to be onlined when memory onlining +* is done outside of kernel (memhp_auto_online). Since the hot +* add has succeeded, it is ok to proceed even if the pages in +* the hot added region have not been "onlined" within the +* allowed time. */ - wait_for_completion_timeout(_device.ol_waitevent, 5*HZ); + if (dm_device.ha_waiting) + wait_for_completion_timeout(_device.ol_waitevent, + 5*HZ); mutex_lock(_device.ha_region_mutex); post_status(_device); } -- 1.7.4.1
[PATCH 1/5] Drivers: hv: balloon: keep track of where ha_region starts
From: Vitaly KuznetsovWindows 2012 (non-R2) does not specify hot add region in hot add requests and the logic in hot_add_req() is trying to find a 128Mb-aligned region covering the request. It may also happen that host's requests are not 128Mb aligned and the created ha_region will start before the first specified PFN. We can't online these non-present pages but we don't remember the real start of the region. This is a regression introduced by the commit 5a75d733 ("Drivers: hv: hv_balloon: don't lose memory when onlining order is not natural"). While the idea of keeping the 'moving window' was wrong (as there is no guarantee that hot add requests come ordered) we should still keep track of covered_start_pfn. This is not a revert, the logic is different. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c |7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index df35fb7..4ae26d6 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -430,13 +430,14 @@ struct dm_info_msg { * currently hot added. We hot add in multiples of 128M * chunks; it is possible that we may not be able to bring * online all the pages in the region. The range - * covered_end_pfn defines the pages that can + * covered_start_pfn:covered_end_pfn defines the pages that can * be brough online. */ struct hv_hotadd_state { struct list_head list; unsigned long start_pfn; + unsigned long covered_start_pfn; unsigned long covered_end_pfn; unsigned long ha_end_pfn; unsigned long end_pfn; @@ -682,7 +683,8 @@ static void hv_online_page(struct page *pg) list_for_each(cur, _device.ha_region_list) { has = list_entry(cur, struct hv_hotadd_state, list); - cur_start_pgp = (unsigned long)pfn_to_page(has->start_pfn); + cur_start_pgp = (unsigned long) + pfn_to_page(has->covered_start_pfn); cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); if (((unsigned long)pg >= cur_start_pgp) && @@ -854,6 +856,7 @@ static unsigned long process_hot_add(unsigned long pg_start, list_add_tail(_region->list, _device.ha_region_list); ha_region->start_pfn = rg_start; ha_region->ha_end_pfn = rg_start; + ha_region->covered_start_pfn = pg_start; ha_region->covered_end_pfn = pg_start; ha_region->end_pfn = rg_start + rg_size; } -- 1.7.4.1
[PATCH 3/5] Drivers: hv: balloon: don't wait for ol_waitevent when memhp_auto_online is enabled
From: Vitaly Kuznetsov With the recently introduced in-kernel memory onlining (MEMORY_HOTPLUG_DEFAULT_ONLINE) these is no point in waiting for pages to come online in the driver and we can get rid of the waiting. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c | 15 +-- 1 files changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 18766f6..3441326 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -673,7 +673,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, has->covered_end_pfn += processed_pfn; init_completion(_device.ol_waitevent); - dm_device.ha_waiting = true; + dm_device.ha_waiting = !memhp_auto_online; mutex_unlock(_device.ha_region_mutex); nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); @@ -699,12 +699,15 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, } /* -* Wait for the memory block to be onlined. -* Since the hot add has succeeded, it is ok to -* proceed even if the pages in the hot added region -* have not been "onlined" within the allowed time. +* Wait for the memory block to be onlined when memory onlining +* is done outside of kernel (memhp_auto_online). Since the hot +* add has succeeded, it is ok to proceed even if the pages in +* the hot added region have not been "onlined" within the +* allowed time. */ - wait_for_completion_timeout(_device.ol_waitevent, 5*HZ); + if (dm_device.ha_waiting) + wait_for_completion_timeout(_device.ol_waitevent, + 5*HZ); mutex_lock(_device.ha_region_mutex); post_status(_device); } -- 1.7.4.1
[PATCH 1/5] Drivers: hv: balloon: keep track of where ha_region starts
From: Vitaly Kuznetsov Windows 2012 (non-R2) does not specify hot add region in hot add requests and the logic in hot_add_req() is trying to find a 128Mb-aligned region covering the request. It may also happen that host's requests are not 128Mb aligned and the created ha_region will start before the first specified PFN. We can't online these non-present pages but we don't remember the real start of the region. This is a regression introduced by the commit 5a75d733 ("Drivers: hv: hv_balloon: don't lose memory when onlining order is not natural"). While the idea of keeping the 'moving window' was wrong (as there is no guarantee that hot add requests come ordered) we should still keep track of covered_start_pfn. This is not a revert, the logic is different. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_balloon.c |7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index df35fb7..4ae26d6 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -430,13 +430,14 @@ struct dm_info_msg { * currently hot added. We hot add in multiples of 128M * chunks; it is possible that we may not be able to bring * online all the pages in the region. The range - * covered_end_pfn defines the pages that can + * covered_start_pfn:covered_end_pfn defines the pages that can * be brough online. */ struct hv_hotadd_state { struct list_head list; unsigned long start_pfn; + unsigned long covered_start_pfn; unsigned long covered_end_pfn; unsigned long ha_end_pfn; unsigned long end_pfn; @@ -682,7 +683,8 @@ static void hv_online_page(struct page *pg) list_for_each(cur, _device.ha_region_list) { has = list_entry(cur, struct hv_hotadd_state, list); - cur_start_pgp = (unsigned long)pfn_to_page(has->start_pfn); + cur_start_pgp = (unsigned long) + pfn_to_page(has->covered_start_pfn); cur_end_pgp = (unsigned long)pfn_to_page(has->covered_end_pfn); if (((unsigned long)pg >= cur_start_pgp) && @@ -854,6 +856,7 @@ static unsigned long process_hot_add(unsigned long pg_start, list_add_tail(_region->list, _device.ha_region_list); ha_region->start_pfn = rg_start; ha_region->ha_end_pfn = rg_start; + ha_region->covered_start_pfn = pg_start; ha_region->covered_end_pfn = pg_start; ha_region->end_pfn = rg_start + rg_size; } -- 1.7.4.1
[PATCH 0/5] Drivers: hv: balloon: Miscellaneous fixes.
From: K. Y. SrinivasanMiscellaneous fixes to the balloon driver. Alex Ng (1): Drivers: hv: balloon: Use available memory value in pressure report Vitaly Kuznetsov (4): Drivers: hv: balloon: keep track of where ha_region starts Drivers: hv: balloon: account for gaps in hot add regions Drivers: hv: balloon: don't wait for ol_waitevent when memhp_auto_online is enabled Drivers: hv: balloon: replace ha_region_mutex with spinlock drivers/hv/hv_balloon.c | 254 ++- 1 files changed, 161 insertions(+), 93 deletions(-) -- 1.7.4.1
[PATCH 0/5] Drivers: hv: balloon: Miscellaneous fixes.
From: K. Y. Srinivasan Miscellaneous fixes to the balloon driver. Alex Ng (1): Drivers: hv: balloon: Use available memory value in pressure report Vitaly Kuznetsov (4): Drivers: hv: balloon: keep track of where ha_region starts Drivers: hv: balloon: account for gaps in hot add regions Drivers: hv: balloon: don't wait for ol_waitevent when memhp_auto_online is enabled Drivers: hv: balloon: replace ha_region_mutex with spinlock drivers/hv/hv_balloon.c | 254 ++- 1 files changed, 161 insertions(+), 93 deletions(-) -- 1.7.4.1
[PATCH 2/2] Drivers: hv: utils: Check VSS daemon is listening before a hot backup
From: Alex NgHyper-V host will send a VSS_OP_HOT_BACKUP request to check if guest is ready for a live backup/snapshot. The driver should respond to the check only if the daemon is running and listening to requests. This allows the host to fallback to standard snapshots in case the VSS daemon is not running. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_snapshot.c |9 ++--- tools/hv/hv_vss_daemon.c |3 +++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c index 4c8dd20..4cfd854 100644 --- a/drivers/hv/hv_snapshot.c +++ b/drivers/hv/hv_snapshot.c @@ -136,6 +136,11 @@ static int vss_on_msg(void *msg, int len) return vss_handle_handshake(vss_msg); } else if (vss_transaction.state == HVUTIL_USERSPACE_REQ) { vss_transaction.state = HVUTIL_USERSPACE_RECV; + + if (vss_msg->vss_hdr.operation == VSS_OP_HOT_BACKUP) + vss_transaction.msg->vss_cf.flags = + VSS_HBU_NO_AUTO_RECOVERY; + if (cancel_delayed_work_sync(_timeout_work)) { vss_respond_to_host(vss_msg->error); /* Transaction is finished, reset the state. */ @@ -195,6 +200,7 @@ static void vss_handle_request(struct work_struct *dummy) */ case VSS_OP_THAW: case VSS_OP_FREEZE: + case VSS_OP_HOT_BACKUP: if (vss_transaction.state < HVUTIL_READY) { /* Userspace is not registered yet */ vss_respond_to_host(HV_E_FAIL); @@ -203,9 +209,6 @@ static void vss_handle_request(struct work_struct *dummy) vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; vss_send_op(); return; - case VSS_OP_HOT_BACKUP: - vss_transaction.msg->vss_cf.flags = VSS_HBU_NO_AUTO_RECOVERY; - break; case VSS_OP_GET_DM_INFO: vss_transaction.msg->dm_info.flags = 0; break; diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c index 5d51d6f..e082980 100644 --- a/tools/hv/hv_vss_daemon.c +++ b/tools/hv/hv_vss_daemon.c @@ -250,6 +250,9 @@ int main(int argc, char *argv[]) syslog(LOG_ERR, "/etc/fstab and /proc/mounts"); } break; + case VSS_OP_HOT_BACKUP: + syslog(LOG_INFO, "VSS: op=CHECK HOT BACKUP\n"); + break; default: syslog(LOG_ERR, "Illegal op:%d\n", op); } -- 1.7.4.1
[PATCH 2/2] Drivers: hv: utils: Check VSS daemon is listening before a hot backup
From: Alex Ng Hyper-V host will send a VSS_OP_HOT_BACKUP request to check if guest is ready for a live backup/snapshot. The driver should respond to the check only if the daemon is running and listening to requests. This allows the host to fallback to standard snapshots in case the VSS daemon is not running. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_snapshot.c |9 ++--- tools/hv/hv_vss_daemon.c |3 +++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c index 4c8dd20..4cfd854 100644 --- a/drivers/hv/hv_snapshot.c +++ b/drivers/hv/hv_snapshot.c @@ -136,6 +136,11 @@ static int vss_on_msg(void *msg, int len) return vss_handle_handshake(vss_msg); } else if (vss_transaction.state == HVUTIL_USERSPACE_REQ) { vss_transaction.state = HVUTIL_USERSPACE_RECV; + + if (vss_msg->vss_hdr.operation == VSS_OP_HOT_BACKUP) + vss_transaction.msg->vss_cf.flags = + VSS_HBU_NO_AUTO_RECOVERY; + if (cancel_delayed_work_sync(_timeout_work)) { vss_respond_to_host(vss_msg->error); /* Transaction is finished, reset the state. */ @@ -195,6 +200,7 @@ static void vss_handle_request(struct work_struct *dummy) */ case VSS_OP_THAW: case VSS_OP_FREEZE: + case VSS_OP_HOT_BACKUP: if (vss_transaction.state < HVUTIL_READY) { /* Userspace is not registered yet */ vss_respond_to_host(HV_E_FAIL); @@ -203,9 +209,6 @@ static void vss_handle_request(struct work_struct *dummy) vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; vss_send_op(); return; - case VSS_OP_HOT_BACKUP: - vss_transaction.msg->vss_cf.flags = VSS_HBU_NO_AUTO_RECOVERY; - break; case VSS_OP_GET_DM_INFO: vss_transaction.msg->dm_info.flags = 0; break; diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c index 5d51d6f..e082980 100644 --- a/tools/hv/hv_vss_daemon.c +++ b/tools/hv/hv_vss_daemon.c @@ -250,6 +250,9 @@ int main(int argc, char *argv[]) syslog(LOG_ERR, "/etc/fstab and /proc/mounts"); } break; + case VSS_OP_HOT_BACKUP: + syslog(LOG_INFO, "VSS: op=CHECK HOT BACKUP\n"); + break; default: syslog(LOG_ERR, "Illegal op:%d\n", op); } -- 1.7.4.1
[PATCH 1/2] Drivers: hv: utils: Continue to poll VSS channel after handling requests.
From: Alex NgMultiple VSS_OP_HOT_BACKUP requests may arrive in quick succession, even though the host only signals once. The driver wass handling the first request while ignoring the others in the ring buffer. We should poll the VSS channel after handling a request to continue processing other requests. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_snapshot.c | 89 + 1 files changed, 42 insertions(+), 47 deletions(-) diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c index 3fba14e..4c8dd20 100644 --- a/drivers/hv/hv_snapshot.c +++ b/drivers/hv/hv_snapshot.c @@ -67,11 +67,11 @@ static const char vss_devname[] = "vmbus/hv_vss"; static __u8 *recv_buffer; static struct hvutil_transport *hvt; -static void vss_send_op(struct work_struct *dummy); static void vss_timeout_func(struct work_struct *dummy); +static void vss_handle_request(struct work_struct *dummy); static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func); -static DECLARE_WORK(vss_send_op_work, vss_send_op); +static DECLARE_WORK(vss_handle_request_work, vss_handle_request); static void vss_poll_wrapper(void *channel) { @@ -150,8 +150,7 @@ static int vss_on_msg(void *msg, int len) return 0; } - -static void vss_send_op(struct work_struct *dummy) +static void vss_send_op(void) { int op = vss_transaction.msg->vss_hdr.operation; int rc; @@ -168,6 +167,8 @@ static void vss_send_op(struct work_struct *dummy) vss_msg->vss_hdr.operation = op; vss_transaction.state = HVUTIL_USERSPACE_REQ; + + schedule_delayed_work(_timeout_work, VSS_USERSPACE_TIMEOUT); rc = hvutil_transport_send(hvt, vss_msg, sizeof(*vss_msg)); if (rc) { pr_warn("VSS: failed to communicate to the daemon: %d\n", rc); @@ -182,6 +183,40 @@ static void vss_send_op(struct work_struct *dummy) return; } +static void vss_handle_request(struct work_struct *dummy) +{ + switch (vss_transaction.msg->vss_hdr.operation) { + /* +* Initiate a "freeze/thaw" operation in the guest. +* We respond to the host once the operation is complete. +* +* We send the message to the user space daemon and the operation is +* performed in the daemon. +*/ + case VSS_OP_THAW: + case VSS_OP_FREEZE: + if (vss_transaction.state < HVUTIL_READY) { + /* Userspace is not registered yet */ + vss_respond_to_host(HV_E_FAIL); + return; + } + vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; + vss_send_op(); + return; + case VSS_OP_HOT_BACKUP: + vss_transaction.msg->vss_cf.flags = VSS_HBU_NO_AUTO_RECOVERY; + break; + case VSS_OP_GET_DM_INFO: + vss_transaction.msg->dm_info.flags = 0; + break; + default: + break; + } + + vss_respond_to_host(0); + hv_poll_channel(vss_transaction.recv_channel, vss_poll_wrapper); +} + /* * Send a response back to the host. */ @@ -266,48 +301,8 @@ void hv_vss_onchannelcallback(void *context) vss_transaction.recv_req_id = requestid; vss_transaction.msg = (struct hv_vss_msg *)vss_msg; - switch (vss_msg->vss_hdr.operation) { - /* -* Initiate a "freeze/thaw" -* operation in the guest. -* We respond to the host once -* the operation is complete. -* -* We send the message to the -* user space daemon and the -* operation is performed in -* the daemon. -*/ - case VSS_OP_FREEZE: - case VSS_OP_THAW: - if (vss_transaction.state < HVUTIL_READY) { - /* Userspace is not registered yet */ - vss_respond_to_host(HV_E_FAIL); - return; - } - vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; - schedule_work(_send_op_work); - schedule_delayed_work(_timeout_work, - VSS_USERSPACE_TIMEOUT); - return; - - case VSS_OP_HOT_BACKUP: - vss_msg->vss_cf.flags = -
[PATCH 1/2] Drivers: hv: utils: Continue to poll VSS channel after handling requests.
From: Alex Ng Multiple VSS_OP_HOT_BACKUP requests may arrive in quick succession, even though the host only signals once. The driver wass handling the first request while ignoring the others in the ring buffer. We should poll the VSS channel after handling a request to continue processing other requests. Signed-off-by: Alex Ng Signed-off-by: K. Y. Srinivasan --- drivers/hv/hv_snapshot.c | 89 + 1 files changed, 42 insertions(+), 47 deletions(-) diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c index 3fba14e..4c8dd20 100644 --- a/drivers/hv/hv_snapshot.c +++ b/drivers/hv/hv_snapshot.c @@ -67,11 +67,11 @@ static const char vss_devname[] = "vmbus/hv_vss"; static __u8 *recv_buffer; static struct hvutil_transport *hvt; -static void vss_send_op(struct work_struct *dummy); static void vss_timeout_func(struct work_struct *dummy); +static void vss_handle_request(struct work_struct *dummy); static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func); -static DECLARE_WORK(vss_send_op_work, vss_send_op); +static DECLARE_WORK(vss_handle_request_work, vss_handle_request); static void vss_poll_wrapper(void *channel) { @@ -150,8 +150,7 @@ static int vss_on_msg(void *msg, int len) return 0; } - -static void vss_send_op(struct work_struct *dummy) +static void vss_send_op(void) { int op = vss_transaction.msg->vss_hdr.operation; int rc; @@ -168,6 +167,8 @@ static void vss_send_op(struct work_struct *dummy) vss_msg->vss_hdr.operation = op; vss_transaction.state = HVUTIL_USERSPACE_REQ; + + schedule_delayed_work(_timeout_work, VSS_USERSPACE_TIMEOUT); rc = hvutil_transport_send(hvt, vss_msg, sizeof(*vss_msg)); if (rc) { pr_warn("VSS: failed to communicate to the daemon: %d\n", rc); @@ -182,6 +183,40 @@ static void vss_send_op(struct work_struct *dummy) return; } +static void vss_handle_request(struct work_struct *dummy) +{ + switch (vss_transaction.msg->vss_hdr.operation) { + /* +* Initiate a "freeze/thaw" operation in the guest. +* We respond to the host once the operation is complete. +* +* We send the message to the user space daemon and the operation is +* performed in the daemon. +*/ + case VSS_OP_THAW: + case VSS_OP_FREEZE: + if (vss_transaction.state < HVUTIL_READY) { + /* Userspace is not registered yet */ + vss_respond_to_host(HV_E_FAIL); + return; + } + vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; + vss_send_op(); + return; + case VSS_OP_HOT_BACKUP: + vss_transaction.msg->vss_cf.flags = VSS_HBU_NO_AUTO_RECOVERY; + break; + case VSS_OP_GET_DM_INFO: + vss_transaction.msg->dm_info.flags = 0; + break; + default: + break; + } + + vss_respond_to_host(0); + hv_poll_channel(vss_transaction.recv_channel, vss_poll_wrapper); +} + /* * Send a response back to the host. */ @@ -266,48 +301,8 @@ void hv_vss_onchannelcallback(void *context) vss_transaction.recv_req_id = requestid; vss_transaction.msg = (struct hv_vss_msg *)vss_msg; - switch (vss_msg->vss_hdr.operation) { - /* -* Initiate a "freeze/thaw" -* operation in the guest. -* We respond to the host once -* the operation is complete. -* -* We send the message to the -* user space daemon and the -* operation is performed in -* the daemon. -*/ - case VSS_OP_FREEZE: - case VSS_OP_THAW: - if (vss_transaction.state < HVUTIL_READY) { - /* Userspace is not registered yet */ - vss_respond_to_host(HV_E_FAIL); - return; - } - vss_transaction.state = HVUTIL_HOSTMSG_RECEIVED; - schedule_work(_send_op_work); - schedule_delayed_work(_timeout_work, - VSS_USERSPACE_TIMEOUT); - return; - - case VSS_OP_HOT_BACKUP: - vss_msg->vss_cf.flags = -VSS_HBU_NO_AUTO_RECOVERY; -
[PATCH 0/2] Drivers: hv: util: Some fixes to the backup driver
From: K. Y. SrinivasanSome fixes to the backup driver. Alex Ng (2): Drivers: hv: utils: Continue to poll VSS channel after handling requests. Drivers: hv: utils: Check VSS daemon is listening before a hot backup drivers/hv/hv_snapshot.c | 92 ++--- tools/hv/hv_vss_daemon.c |3 + 2 files changed, 48 insertions(+), 47 deletions(-) -- 1.7.4.1
[PATCH 0/2] Drivers: hv: util: Some fixes to the backup driver
From: K. Y. Srinivasan Some fixes to the backup driver. Alex Ng (2): Drivers: hv: utils: Continue to poll VSS channel after handling requests. Drivers: hv: utils: Check VSS daemon is listening before a hot backup drivers/hv/hv_snapshot.c | 92 ++--- tools/hv/hv_vss_daemon.c |3 + 2 files changed, 48 insertions(+), 47 deletions(-) -- 1.7.4.1
[PATCH 1/1] Tools: hv: kvp: ensure kvp device fd is closed on exec
From: Vitaly KuznetsovKVP daemon does fork()/exec() (with popen()) so we need to close our fds to avoid sharing them with child processes. The immediate implication of not doing so I see is SELinux complaining about 'ip' trying to access '/dev/vmbus/hv_kvp'. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- tools/hv/hv_kvp_daemon.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 0d9f48e..bc7adb8 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1433,7 +1433,7 @@ int main(int argc, char *argv[]) openlog("KVP", 0, LOG_USER); syslog(LOG_INFO, "KVP starting; pid is:%d", getpid()); - kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR); + kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR | O_CLOEXEC); if (kvp_fd < 0) { syslog(LOG_ERR, "open /dev/vmbus/hv_kvp failed; error: %d %s", -- 1.7.4.1
[PATCH 1/1] Tools: hv: kvp: ensure kvp device fd is closed on exec
From: Vitaly Kuznetsov KVP daemon does fork()/exec() (with popen()) so we need to close our fds to avoid sharing them with child processes. The immediate implication of not doing so I see is SELinux complaining about 'ip' trying to access '/dev/vmbus/hv_kvp'. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan --- tools/hv/hv_kvp_daemon.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 0d9f48e..bc7adb8 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1433,7 +1433,7 @@ int main(int argc, char *argv[]) openlog("KVP", 0, LOG_USER); syslog(LOG_INFO, "KVP starting; pid is:%d", getpid()); - kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR); + kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR | O_CLOEXEC); if (kvp_fd < 0) { syslog(LOG_ERR, "open /dev/vmbus/hv_kvp failed; error: %d %s", -- 1.7.4.1
[PATCH 1/1] Drivers: hv: Introduce a policy for controlling channel affinity
From: K. Y. SrinivasanIntroduce a mechanism to control how channels will be affinitized. We will support two policies: 1. HV_BALANCED: All performance critical channels will be dstributed evenly amongst all the available NUMA nodes. Once the Node is assigned, we will assign the CPU based on a simple round robin scheme. 2. HV_LOCALIZED: Only the primary channels are distributed across all NUMA nodes. Sub-channels will be in the same NUMA node as the primary channel. This is the current behaviour. The default policy will be the HV_BALANCED as it can minimize the remote memory access on NUMA machines with applications that span NUMA nodes. Signed-off-by: K. Y. Srinivasan Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel_mgmt.c | 68 +--- include/linux/hyperv.h| 23 +++ 2 files changed, 62 insertions(+), 29 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 8345869..aaa2c4b 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -338,8 +338,9 @@ void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid) * We need to free the bit for init_vp_index() to work in the case * of sub-channel, when we reload drivers like hv_netvsc. */ - cpumask_clear_cpu(channel->target_cpu, - _channel->alloced_cpus_in_node); + if (channel->affinity_policy == HV_LOCALIZED) + cpumask_clear_cpu(channel->target_cpu, + _channel->alloced_cpus_in_node); free_channel(channel); } @@ -524,17 +525,17 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) } /* -* We distribute primary channels evenly across all the available -* NUMA nodes and within the assigned NUMA node we will assign the -* first available CPU to the primary channel. -* The sub-channels will be assigned to the CPUs available in the -* NUMA node evenly. +* Based on the channel affinity policy, we will assign the NUMA +* nodes. */ - if (!primary) { + + if ((channel->affinity_policy == HV_BALANCED) || (!primary)) { while (true) { next_node = next_numa_node_id++; - if (next_node == nr_node_ids) + if (next_node == nr_node_ids) { next_node = next_numa_node_id = 0; + continue; + } if (cpumask_empty(cpumask_of_node(next_node))) continue; break; @@ -558,15 +559,17 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) cur_cpu = -1; - /* -* Normally Hyper-V host doesn't create more subchannels than there -* are VCPUs on the node but it is possible when not all present VCPUs -* on the node are initialized by guest. Clear the alloced_cpus_in_node -* to start over. -*/ - if (cpumask_equal(>alloced_cpus_in_node, - cpumask_of_node(primary->numa_node))) - cpumask_clear(>alloced_cpus_in_node); + if (primary->affinity_policy == HV_LOCALIZED) { + /* +* Normally Hyper-V host doesn't create more subchannels +* than there are VCPUs on the node but it is possible when not +* all present VCPUs on the node are initialized by guest. +* Clear the alloced_cpus_in_node to start over. +*/ + if (cpumask_equal(>alloced_cpus_in_node, + cpumask_of_node(primary->numa_node))) + cpumask_clear(>alloced_cpus_in_node); + } while (true) { cur_cpu = cpumask_next(cur_cpu, _mask); @@ -577,17 +580,24 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) continue; } - /* -* NOTE: in the case of sub-channel, we clear the sub-channel -* related bit(s) in primary->alloced_cpus_in_node in -* hv_process_channel_removal(), so when we reload drivers -* like hv_netvsc in SMP guest, here we're able to re-allocate -* bit from primary->alloced_cpus_in_node. -*/ - if (!cpumask_test_cpu(cur_cpu, - >alloced_cpus_in_node)) { - cpumask_set_cpu(cur_cpu, - >alloced_cpus_in_node); + if (primary->affinity_policy == HV_LOCALIZED) { + /* +* NOTE: in the case of sub-channel, we clear the +
[PATCH 1/1] Drivers: hv: Introduce a policy for controlling channel affinity
From: K. Y. Srinivasan Introduce a mechanism to control how channels will be affinitized. We will support two policies: 1. HV_BALANCED: All performance critical channels will be dstributed evenly amongst all the available NUMA nodes. Once the Node is assigned, we will assign the CPU based on a simple round robin scheme. 2. HV_LOCALIZED: Only the primary channels are distributed across all NUMA nodes. Sub-channels will be in the same NUMA node as the primary channel. This is the current behaviour. The default policy will be the HV_BALANCED as it can minimize the remote memory access on NUMA machines with applications that span NUMA nodes. Signed-off-by: K. Y. Srinivasan Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel_mgmt.c | 68 +--- include/linux/hyperv.h| 23 +++ 2 files changed, 62 insertions(+), 29 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 8345869..aaa2c4b 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -338,8 +338,9 @@ void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid) * We need to free the bit for init_vp_index() to work in the case * of sub-channel, when we reload drivers like hv_netvsc. */ - cpumask_clear_cpu(channel->target_cpu, - _channel->alloced_cpus_in_node); + if (channel->affinity_policy == HV_LOCALIZED) + cpumask_clear_cpu(channel->target_cpu, + _channel->alloced_cpus_in_node); free_channel(channel); } @@ -524,17 +525,17 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) } /* -* We distribute primary channels evenly across all the available -* NUMA nodes and within the assigned NUMA node we will assign the -* first available CPU to the primary channel. -* The sub-channels will be assigned to the CPUs available in the -* NUMA node evenly. +* Based on the channel affinity policy, we will assign the NUMA +* nodes. */ - if (!primary) { + + if ((channel->affinity_policy == HV_BALANCED) || (!primary)) { while (true) { next_node = next_numa_node_id++; - if (next_node == nr_node_ids) + if (next_node == nr_node_ids) { next_node = next_numa_node_id = 0; + continue; + } if (cpumask_empty(cpumask_of_node(next_node))) continue; break; @@ -558,15 +559,17 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) cur_cpu = -1; - /* -* Normally Hyper-V host doesn't create more subchannels than there -* are VCPUs on the node but it is possible when not all present VCPUs -* on the node are initialized by guest. Clear the alloced_cpus_in_node -* to start over. -*/ - if (cpumask_equal(>alloced_cpus_in_node, - cpumask_of_node(primary->numa_node))) - cpumask_clear(>alloced_cpus_in_node); + if (primary->affinity_policy == HV_LOCALIZED) { + /* +* Normally Hyper-V host doesn't create more subchannels +* than there are VCPUs on the node but it is possible when not +* all present VCPUs on the node are initialized by guest. +* Clear the alloced_cpus_in_node to start over. +*/ + if (cpumask_equal(>alloced_cpus_in_node, + cpumask_of_node(primary->numa_node))) + cpumask_clear(>alloced_cpus_in_node); + } while (true) { cur_cpu = cpumask_next(cur_cpu, _mask); @@ -577,17 +580,24 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) continue; } - /* -* NOTE: in the case of sub-channel, we clear the sub-channel -* related bit(s) in primary->alloced_cpus_in_node in -* hv_process_channel_removal(), so when we reload drivers -* like hv_netvsc in SMP guest, here we're able to re-allocate -* bit from primary->alloced_cpus_in_node. -*/ - if (!cpumask_test_cpu(cur_cpu, - >alloced_cpus_in_node)) { - cpumask_set_cpu(cur_cpu, - >alloced_cpus_in_node); + if (primary->affinity_policy == HV_LOCALIZED) { + /* +* NOTE: in the case of sub-channel, we clear the +* sub-channel related bit(s) in +*
[PATCH 1/4] Drivers: hv: cleanup vmbus_open() for wrap around mappings
From: Vitaly KuznetsovIn preparation for doing wrap around mappings for ring buffers cleanup vmbus_open() function: - check that ring sizes are PAGE_SIZE aligned (they are for all in-kernel drivers now); - kfree(open_info) on error only after we kzalloc() it (not an issue as it is valid to call kfree(NULL); - rename poorly named labels; - use alloc_pages() instead of __get_free_pages() as we need struct page pointer for future. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/channel.c | 43 +++ 1 files changed, 23 insertions(+), 20 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index e3a0048..901b6ce 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -81,6 +81,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, unsigned long t; struct page *page; + if (send_ringbuffer_size % PAGE_SIZE || + recv_ringbuffer_size % PAGE_SIZE) + return -EINVAL; + spin_lock_irqsave(>lock, flags); if (newchannel->state == CHANNEL_OPEN_STATE) { newchannel->state = CHANNEL_OPENING_STATE; @@ -100,17 +104,16 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, recv_ringbuffer_size)); if (!page) - out = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, - get_order(send_ringbuffer_size + - recv_ringbuffer_size)); - else - out = (void *)page_address(page); + page = alloc_pages(GFP_KERNEL|__GFP_ZERO, + get_order(send_ringbuffer_size + +recv_ringbuffer_size)); - if (!out) { + if (!page) { err = -ENOMEM; - goto error0; + goto error_set_chnstate; } + out = page_address(page); in = (void *)((unsigned long)out + send_ringbuffer_size); newchannel->ringbuffer_pages = out; @@ -122,14 +125,14 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } ret = hv_ringbuffer_init( >inbound, in, recv_ringbuffer_size); if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } @@ -144,7 +147,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } /* Create and init the channel open message */ @@ -153,7 +156,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, GFP_KERNEL); if (!open_info) { err = -ENOMEM; - goto error_gpadl; + goto error_free_gpadl; } init_completion(_info->waitevent); @@ -169,7 +172,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (userdatalen > MAX_USER_DEFINED_BYTES) { err = -EINVAL; - goto error_gpadl; + goto error_free_gpadl; } if (userdatalen) @@ -185,13 +188,13 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error1; + goto error_clean_msglist; } t = wait_for_completion_timeout(_info->waitevent, 5*HZ); if (t == 0) { err = -ETIMEDOUT; - goto error1; + goto error_clean_msglist; } spin_lock_irqsave(_connection.channelmsg_lock, flags); @@ -200,25 +203,25 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (open_info->response.open_result.status) { err = -EAGAIN; - goto error_gpadl; + goto error_free_gpadl; } newchannel->state = CHANNEL_OPENED_STATE; kfree(open_info); return 0; -error1: +error_clean_msglist: spin_lock_irqsave(_connection.channelmsg_lock, flags); list_del(_info->msglistentry); spin_unlock_irqrestore(_connection.channelmsg_lock, flags); -error_gpadl: +error_free_gpadl: vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle); - -error0: + kfree(open_info); +error_free_pages: free_pages((unsigned long)out, get_order(send_ringbuffer_size + recv_ringbuffer_size)); - kfree(open_info);
[PATCH 1/4] Drivers: hv: cleanup vmbus_open() for wrap around mappings
From: Vitaly Kuznetsov In preparation for doing wrap around mappings for ring buffers cleanup vmbus_open() function: - check that ring sizes are PAGE_SIZE aligned (they are for all in-kernel drivers now); - kfree(open_info) on error only after we kzalloc() it (not an issue as it is valid to call kfree(NULL); - rename poorly named labels; - use alloc_pages() instead of __get_free_pages() as we need struct page pointer for future. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/channel.c | 43 +++ 1 files changed, 23 insertions(+), 20 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index e3a0048..901b6ce 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -81,6 +81,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, unsigned long t; struct page *page; + if (send_ringbuffer_size % PAGE_SIZE || + recv_ringbuffer_size % PAGE_SIZE) + return -EINVAL; + spin_lock_irqsave(>lock, flags); if (newchannel->state == CHANNEL_OPEN_STATE) { newchannel->state = CHANNEL_OPENING_STATE; @@ -100,17 +104,16 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, recv_ringbuffer_size)); if (!page) - out = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, - get_order(send_ringbuffer_size + - recv_ringbuffer_size)); - else - out = (void *)page_address(page); + page = alloc_pages(GFP_KERNEL|__GFP_ZERO, + get_order(send_ringbuffer_size + +recv_ringbuffer_size)); - if (!out) { + if (!page) { err = -ENOMEM; - goto error0; + goto error_set_chnstate; } + out = page_address(page); in = (void *)((unsigned long)out + send_ringbuffer_size); newchannel->ringbuffer_pages = out; @@ -122,14 +125,14 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } ret = hv_ringbuffer_init( >inbound, in, recv_ringbuffer_size); if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } @@ -144,7 +147,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error0; + goto error_free_pages; } /* Create and init the channel open message */ @@ -153,7 +156,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, GFP_KERNEL); if (!open_info) { err = -ENOMEM; - goto error_gpadl; + goto error_free_gpadl; } init_completion(_info->waitevent); @@ -169,7 +172,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (userdatalen > MAX_USER_DEFINED_BYTES) { err = -EINVAL; - goto error_gpadl; + goto error_free_gpadl; } if (userdatalen) @@ -185,13 +188,13 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (ret != 0) { err = ret; - goto error1; + goto error_clean_msglist; } t = wait_for_completion_timeout(_info->waitevent, 5*HZ); if (t == 0) { err = -ETIMEDOUT; - goto error1; + goto error_clean_msglist; } spin_lock_irqsave(_connection.channelmsg_lock, flags); @@ -200,25 +203,25 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, if (open_info->response.open_result.status) { err = -EAGAIN; - goto error_gpadl; + goto error_free_gpadl; } newchannel->state = CHANNEL_OPENED_STATE; kfree(open_info); return 0; -error1: +error_clean_msglist: spin_lock_irqsave(_connection.channelmsg_lock, flags); list_del(_info->msglistentry); spin_unlock_irqrestore(_connection.channelmsg_lock, flags); -error_gpadl: +error_free_gpadl: vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle); - -error0: + kfree(open_info); +error_free_pages: free_pages((unsigned long)out, get_order(send_ringbuffer_size + recv_ringbuffer_size)); - kfree(open_info); +error_set_chnstate: newchannel->state = CHANNEL_OPEN_STATE; return
[PATCH 2/4] Drivers: hv: ring_buffer: wrap around mappings for ring buffers
From: Vitaly KuznetsovMake it possible to always use a single memcpy() or to provide a direct link to a packet on the ring buffer by creating virtual mapping for two copies of the ring buffer with vmap(). Utilize currently empty hv_ringbuffer_cleanup() to do the unmap. While on it, replace sizeof(struct hv_ring_buffer) check in hv_ringbuffer_init() with BUILD_BUG_ON() as it is a compile time check. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/channel.c | 29 ++--- drivers/hv/hyperv_vmbus.h |4 ++-- drivers/hv/ring_buffer.c | 39 +-- 3 files changed, 49 insertions(+), 23 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index 901b6ce..aad26da 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -75,7 +75,6 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, { struct vmbus_channel_open_channel *open_msg; struct vmbus_channel_msginfo *open_info = NULL; - void *in, *out; unsigned long flags; int ret, err = 0; unsigned long t; @@ -113,23 +112,21 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, goto error_set_chnstate; } - out = page_address(page); - in = (void *)((unsigned long)out + send_ringbuffer_size); - - newchannel->ringbuffer_pages = out; + newchannel->ringbuffer_pages = page_address(page); newchannel->ringbuffer_pagecount = (send_ringbuffer_size + recv_ringbuffer_size) >> PAGE_SHIFT; - ret = hv_ringbuffer_init( - >outbound, out, send_ringbuffer_size); + ret = hv_ringbuffer_init(>outbound, page, +send_ringbuffer_size >> PAGE_SHIFT); if (ret != 0) { err = ret; goto error_free_pages; } - ret = hv_ringbuffer_init( - >inbound, in, recv_ringbuffer_size); + ret = hv_ringbuffer_init(>inbound, +[send_ringbuffer_size >> PAGE_SHIFT], +recv_ringbuffer_size >> PAGE_SHIFT); if (ret != 0) { err = ret; goto error_free_pages; @@ -140,10 +137,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, newchannel->ringbuffer_gpadlhandle = 0; ret = vmbus_establish_gpadl(newchannel, -newchannel->outbound.ring_buffer, -send_ringbuffer_size + -recv_ringbuffer_size, ->ringbuffer_gpadlhandle); + page_address(page), + send_ringbuffer_size + + recv_ringbuffer_size, + >ringbuffer_gpadlhandle); if (ret != 0) { err = ret; @@ -219,8 +216,10 @@ error_free_gpadl: vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle); kfree(open_info); error_free_pages: - free_pages((unsigned long)out, - get_order(send_ringbuffer_size + recv_ringbuffer_size)); + hv_ringbuffer_cleanup(>outbound); + hv_ringbuffer_cleanup(>inbound); + __free_pages(page, +get_order(send_ringbuffer_size + recv_ringbuffer_size)); error_set_chnstate: newchannel->state = CHANNEL_OPEN_STATE; return err; diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index ddcc348..a5b4442 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -522,8 +522,8 @@ extern unsigned int host_info_edx; /* Interface */ -int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, void *buffer, - u32 buflen); +int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, + struct page *pages, u32 pagecnt); void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index e3edcae..7e21c2c 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -27,6 +27,8 @@ #include #include #include +#include +#include #include "hyperv_vmbus.h" @@ -243,22 +245,46 @@ void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info, /* Initialize the ring buffer. */ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - void *buffer, u32 buflen) + struct page *pages, u32 page_cnt) { - if (sizeof(struct hv_ring_buffer) != PAGE_SIZE) - return -EINVAL; + int i; + struct page **pages_wraparound; + +
[PATCH 4/4] Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw()
From: Vitaly KuznetsovWith wrap around mappings in place we can always provide drivers with direct links to packets on the ring buffer, even when they wrap around. Do the required updates to get_next_pkt_raw()/put_pkt_raw() Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- include/linux/hyperv.h | 32 +++- 1 files changed, 11 insertions(+), 21 deletions(-) diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 362acf0..897e4a7 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1466,31 +1466,23 @@ static inline struct vmpacket_descriptor * get_next_pkt_raw(struct vmbus_channel *channel) { struct hv_ring_buffer_info *ring_info = >inbound; - u32 read_loc = ring_info->priv_read_index; + u32 priv_read_loc = ring_info->priv_read_index; void *ring_buffer = hv_get_ring_buffer(ring_info); - struct vmpacket_descriptor *cur_desc; - u32 packetlen; u32 dsize = ring_info->ring_datasize; - u32 delta = read_loc - ring_info->ring_buffer->read_index; + /* +* delta is the difference between what is available to read and +* what was already consumed in place. We commit read index after +* the whole batch is processed. +*/ + u32 delta = priv_read_loc >= ring_info->ring_buffer->read_index ? + priv_read_loc - ring_info->ring_buffer->read_index : + (dsize - ring_info->ring_buffer->read_index) + priv_read_loc; u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta); if (bytes_avail_toread < sizeof(struct vmpacket_descriptor)) return NULL; - if ((read_loc + sizeof(*cur_desc)) > dsize) - return NULL; - - cur_desc = ring_buffer + read_loc; - packetlen = cur_desc->len8 << 3; - - /* -* If the packet under consideration is wrapping around, -* return failure. -*/ - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1)) - return NULL; - - return cur_desc; + return ring_buffer + priv_read_loc; } /* @@ -1502,16 +1494,14 @@ static inline void put_pkt_raw(struct vmbus_channel *channel, struct vmpacket_descriptor *desc) { struct hv_ring_buffer_info *ring_info = >inbound; - u32 read_loc = ring_info->priv_read_index; u32 packetlen = desc->len8 << 3; u32 dsize = ring_info->ring_datasize; - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize) - BUG(); /* * Include the packet trailer. */ ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER; + ring_info->priv_read_index %= dsize; } /* -- 1.7.4.1
[PATCH 2/4] Drivers: hv: ring_buffer: wrap around mappings for ring buffers
From: Vitaly Kuznetsov Make it possible to always use a single memcpy() or to provide a direct link to a packet on the ring buffer by creating virtual mapping for two copies of the ring buffer with vmap(). Utilize currently empty hv_ringbuffer_cleanup() to do the unmap. While on it, replace sizeof(struct hv_ring_buffer) check in hv_ringbuffer_init() with BUILD_BUG_ON() as it is a compile time check. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/channel.c | 29 ++--- drivers/hv/hyperv_vmbus.h |4 ++-- drivers/hv/ring_buffer.c | 39 +-- 3 files changed, 49 insertions(+), 23 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index 901b6ce..aad26da 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -75,7 +75,6 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, { struct vmbus_channel_open_channel *open_msg; struct vmbus_channel_msginfo *open_info = NULL; - void *in, *out; unsigned long flags; int ret, err = 0; unsigned long t; @@ -113,23 +112,21 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, goto error_set_chnstate; } - out = page_address(page); - in = (void *)((unsigned long)out + send_ringbuffer_size); - - newchannel->ringbuffer_pages = out; + newchannel->ringbuffer_pages = page_address(page); newchannel->ringbuffer_pagecount = (send_ringbuffer_size + recv_ringbuffer_size) >> PAGE_SHIFT; - ret = hv_ringbuffer_init( - >outbound, out, send_ringbuffer_size); + ret = hv_ringbuffer_init(>outbound, page, +send_ringbuffer_size >> PAGE_SHIFT); if (ret != 0) { err = ret; goto error_free_pages; } - ret = hv_ringbuffer_init( - >inbound, in, recv_ringbuffer_size); + ret = hv_ringbuffer_init(>inbound, +[send_ringbuffer_size >> PAGE_SHIFT], +recv_ringbuffer_size >> PAGE_SHIFT); if (ret != 0) { err = ret; goto error_free_pages; @@ -140,10 +137,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, newchannel->ringbuffer_gpadlhandle = 0; ret = vmbus_establish_gpadl(newchannel, -newchannel->outbound.ring_buffer, -send_ringbuffer_size + -recv_ringbuffer_size, ->ringbuffer_gpadlhandle); + page_address(page), + send_ringbuffer_size + + recv_ringbuffer_size, + >ringbuffer_gpadlhandle); if (ret != 0) { err = ret; @@ -219,8 +216,10 @@ error_free_gpadl: vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle); kfree(open_info); error_free_pages: - free_pages((unsigned long)out, - get_order(send_ringbuffer_size + recv_ringbuffer_size)); + hv_ringbuffer_cleanup(>outbound); + hv_ringbuffer_cleanup(>inbound); + __free_pages(page, +get_order(send_ringbuffer_size + recv_ringbuffer_size)); error_set_chnstate: newchannel->state = CHANNEL_OPEN_STATE; return err; diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index ddcc348..a5b4442 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -522,8 +522,8 @@ extern unsigned int host_info_edx; /* Interface */ -int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, void *buffer, - u32 buflen); +int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, + struct page *pages, u32 pagecnt); void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index e3edcae..7e21c2c 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -27,6 +27,8 @@ #include #include #include +#include +#include #include "hyperv_vmbus.h" @@ -243,22 +245,46 @@ void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info, /* Initialize the ring buffer. */ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - void *buffer, u32 buflen) + struct page *pages, u32 page_cnt) { - if (sizeof(struct hv_ring_buffer) != PAGE_SIZE) - return -EINVAL; + int i; + struct page **pages_wraparound; + + BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
[PATCH 4/4] Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw()
From: Vitaly Kuznetsov With wrap around mappings in place we can always provide drivers with direct links to packets on the ring buffer, even when they wrap around. Do the required updates to get_next_pkt_raw()/put_pkt_raw() Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- include/linux/hyperv.h | 32 +++- 1 files changed, 11 insertions(+), 21 deletions(-) diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 362acf0..897e4a7 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1466,31 +1466,23 @@ static inline struct vmpacket_descriptor * get_next_pkt_raw(struct vmbus_channel *channel) { struct hv_ring_buffer_info *ring_info = >inbound; - u32 read_loc = ring_info->priv_read_index; + u32 priv_read_loc = ring_info->priv_read_index; void *ring_buffer = hv_get_ring_buffer(ring_info); - struct vmpacket_descriptor *cur_desc; - u32 packetlen; u32 dsize = ring_info->ring_datasize; - u32 delta = read_loc - ring_info->ring_buffer->read_index; + /* +* delta is the difference between what is available to read and +* what was already consumed in place. We commit read index after +* the whole batch is processed. +*/ + u32 delta = priv_read_loc >= ring_info->ring_buffer->read_index ? + priv_read_loc - ring_info->ring_buffer->read_index : + (dsize - ring_info->ring_buffer->read_index) + priv_read_loc; u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta); if (bytes_avail_toread < sizeof(struct vmpacket_descriptor)) return NULL; - if ((read_loc + sizeof(*cur_desc)) > dsize) - return NULL; - - cur_desc = ring_buffer + read_loc; - packetlen = cur_desc->len8 << 3; - - /* -* If the packet under consideration is wrapping around, -* return failure. -*/ - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1)) - return NULL; - - return cur_desc; + return ring_buffer + priv_read_loc; } /* @@ -1502,16 +1494,14 @@ static inline void put_pkt_raw(struct vmbus_channel *channel, struct vmpacket_descriptor *desc) { struct hv_ring_buffer_info *ring_info = >inbound; - u32 read_loc = ring_info->priv_read_index; u32 packetlen = desc->len8 << 3; u32 dsize = ring_info->ring_datasize; - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize) - BUG(); /* * Include the packet trailer. */ ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER; + ring_info->priv_read_index %= dsize; } /* -- 1.7.4.1
[PATCH 3/4] Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer()
From: Vitaly KuznetsovWith wrap around mappings for ring buffers we can always use a single memcpy() to do the job. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/ring_buffer.c | 24 +++- 1 files changed, 3 insertions(+), 21 deletions(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 7e21c2c..08043da 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -172,18 +172,7 @@ static u32 hv_copyfrom_ringbuffer( void *ring_buffer = hv_get_ring_buffer(ring_info); u32 ring_buffer_size = hv_get_ring_buffersize(ring_info); - u32 frag_len; - - /* wrap-around detected at the src */ - if (destlen > ring_buffer_size - start_read_offset) { - frag_len = ring_buffer_size - start_read_offset; - - memcpy(dest, ring_buffer + start_read_offset, frag_len); - memcpy(dest + frag_len, ring_buffer, destlen - frag_len); - } else - - memcpy(dest, ring_buffer + start_read_offset, destlen); - + memcpy(dest, ring_buffer + start_read_offset, destlen); start_read_offset += destlen; start_read_offset %= ring_buffer_size; @@ -204,15 +193,8 @@ static u32 hv_copyto_ringbuffer( { void *ring_buffer = hv_get_ring_buffer(ring_info); u32 ring_buffer_size = hv_get_ring_buffersize(ring_info); - u32 frag_len; - - /* wrap-around detected! */ - if (srclen > ring_buffer_size - start_write_offset) { - frag_len = ring_buffer_size - start_write_offset; - memcpy(ring_buffer + start_write_offset, src, frag_len); - memcpy(ring_buffer, src + frag_len, srclen - frag_len); - } else - memcpy(ring_buffer + start_write_offset, src, srclen); + + memcpy(ring_buffer + start_write_offset, src, srclen); start_write_offset += srclen; start_write_offset %= ring_buffer_size; -- 1.7.4.1
[PATCH 3/4] Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer()
From: Vitaly Kuznetsov With wrap around mappings for ring buffers we can always use a single memcpy() to do the job. Signed-off-by: Vitaly Kuznetsov Signed-off-by: K. Y. Srinivasan Tested-by: Dexuan Cui --- drivers/hv/ring_buffer.c | 24 +++- 1 files changed, 3 insertions(+), 21 deletions(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 7e21c2c..08043da 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -172,18 +172,7 @@ static u32 hv_copyfrom_ringbuffer( void *ring_buffer = hv_get_ring_buffer(ring_info); u32 ring_buffer_size = hv_get_ring_buffersize(ring_info); - u32 frag_len; - - /* wrap-around detected at the src */ - if (destlen > ring_buffer_size - start_read_offset) { - frag_len = ring_buffer_size - start_read_offset; - - memcpy(dest, ring_buffer + start_read_offset, frag_len); - memcpy(dest + frag_len, ring_buffer, destlen - frag_len); - } else - - memcpy(dest, ring_buffer + start_read_offset, destlen); - + memcpy(dest, ring_buffer + start_read_offset, destlen); start_read_offset += destlen; start_read_offset %= ring_buffer_size; @@ -204,15 +193,8 @@ static u32 hv_copyto_ringbuffer( { void *ring_buffer = hv_get_ring_buffer(ring_info); u32 ring_buffer_size = hv_get_ring_buffersize(ring_info); - u32 frag_len; - - /* wrap-around detected! */ - if (srclen > ring_buffer_size - start_write_offset) { - frag_len = ring_buffer_size - start_write_offset; - memcpy(ring_buffer + start_write_offset, src, frag_len); - memcpy(ring_buffer, src + frag_len, srclen - frag_len); - } else - memcpy(ring_buffer + start_write_offset, src, srclen); + + memcpy(ring_buffer + start_write_offset, src, srclen); start_write_offset += srclen; start_write_offset %= ring_buffer_size; -- 1.7.4.1
[PATCH 0/4] Drivers: hv: vmbus: Make in-place consumption always possible
From: K. Y. SrinivasanMake in-place consumption of VMBus packets always possible. Currently we forbid it when a packet 'wraps around' the ring so we can't provide a single pointer to it. The idea if this series is dead simple: let's make a single virtual mapping for two copies (actually, two sets of pages which consist the ring buffer) of the ring buffer. With such a mapping we can always provide a pointers for in-place consumption to drivers. Copy path can also benefit from such mappings as we eliminate the need for conditional checking in copy_to/ copy_from functions and use a single memcpy(). Vitaly Kuznetsov (4): Drivers: hv: cleanup vmbus_open() for wrap around mappings Drivers: hv: ring_buffer: wrap around mappings for ring buffers Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer() Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw() drivers/hv/channel.c | 68 +++-- drivers/hv/hyperv_vmbus.h |4 +- drivers/hv/ring_buffer.c | 61 +++- include/linux/hyperv.h| 32 +++-- 4 files changed, 83 insertions(+), 82 deletions(-) -- 1.7.4.1
[PATCH 0/4] Drivers: hv: vmbus: Make in-place consumption always possible
From: K. Y. Srinivasan Make in-place consumption of VMBus packets always possible. Currently we forbid it when a packet 'wraps around' the ring so we can't provide a single pointer to it. The idea if this series is dead simple: let's make a single virtual mapping for two copies (actually, two sets of pages which consist the ring buffer) of the ring buffer. With such a mapping we can always provide a pointers for in-place consumption to drivers. Copy path can also benefit from such mappings as we eliminate the need for conditional checking in copy_to/ copy_from functions and use a single memcpy(). Vitaly Kuznetsov (4): Drivers: hv: cleanup vmbus_open() for wrap around mappings Drivers: hv: ring_buffer: wrap around mappings for ring buffers Drivers: hv: ring_buffer: use wrap around mappings in hv_copy{from,to}_ringbuffer() Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw() drivers/hv/channel.c | 68 +++-- drivers/hv/hyperv_vmbus.h |4 +- drivers/hv/ring_buffer.c | 61 +++- include/linux/hyperv.h| 32 +++-- 4 files changed, 83 insertions(+), 82 deletions(-) -- 1.7.4.1
[PATCH RESEND net-next] netvsc: Use the new in-place consumption APIs in the rx path
From: K. Y. SrinivasanUse the new APIs for eliminating a copy on the receive path. These new APIs also help in minimizing the number of memory barriers we end up issuing (in the ringbuffer code) since we can better control when we want to expose the ring state to the host. The patch is being resent to address earlier email issues. Signed-off-by: K. Y. Srinivasan --- drivers/net/hyperv/netvsc.c | 88 +-- 1 files changed, 59 insertions(+), 29 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 719cb35..8cd4c19 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1141,6 +1141,39 @@ static inline void netvsc_receive_inband(struct hv_device *hdev, } } +static void netvsc_process_raw_pkt(struct hv_device *device, + struct vmbus_channel *channel, + struct netvsc_device *net_device, + struct net_device *ndev, + u64 request_id, + struct vmpacket_descriptor *desc) +{ + struct nvsp_message *nvmsg; + + nvmsg = (struct nvsp_message *)((unsigned long) + desc + (desc->offset8 << 3)); + + switch (desc->type) { + case VM_PKT_COMP: + netvsc_send_completion(net_device, channel, device, desc); + break; + + case VM_PKT_DATA_USING_XFER_PAGES: + netvsc_receive(net_device, channel, device, desc); + break; + + case VM_PKT_DATA_INBAND: + netvsc_receive_inband(device, net_device, nvmsg); + break; + + default: + netdev_err(ndev, "unhandled packet type %d, tid %llx\n", + desc->type, request_id); + break; + } +} + + void netvsc_channel_cb(void *context) { int ret; @@ -1153,7 +1186,7 @@ void netvsc_channel_cb(void *context) unsigned char *buffer; int bufferlen = NETVSC_PACKET_SIZE; struct net_device *ndev; - struct nvsp_message *nvmsg; + bool need_to_commit = false; if (channel->primary_channel != NULL) device = channel->primary_channel->device_obj; @@ -1167,39 +1200,36 @@ void netvsc_channel_cb(void *context) buffer = get_per_channel_state(channel); do { + desc = get_next_pkt_raw(channel); + if (desc != NULL) { + netvsc_process_raw_pkt(device, + channel, + net_device, + ndev, + desc->trans_id, + desc); + + put_pkt_raw(channel, desc); + need_to_commit = true; + continue; + } + if (need_to_commit) { + need_to_commit = false; + commit_rd_index(channel); + } + ret = vmbus_recvpacket_raw(channel, buffer, bufferlen, _recvd, _id); if (ret == 0) { if (bytes_recvd > 0) { desc = (struct vmpacket_descriptor *)buffer; - nvmsg = (struct nvsp_message *)((unsigned long) -desc + (desc->offset8 << 3)); - switch (desc->type) { - case VM_PKT_COMP: - netvsc_send_completion(net_device, - channel, - device, desc); - break; - - case VM_PKT_DATA_USING_XFER_PAGES: - netvsc_receive(net_device, channel, - device, desc); - break; - - case VM_PKT_DATA_INBAND: - netvsc_receive_inband(device, - net_device, - nvmsg); - break; - - default: - netdev_err(ndev, - "unhandled packet type %d, " - "tid %llx len %d\n", - desc->type, request_id, - bytes_recvd);
[PATCH RESEND net-next] netvsc: Use the new in-place consumption APIs in the rx path
From: K. Y. Srinivasan Use the new APIs for eliminating a copy on the receive path. These new APIs also help in minimizing the number of memory barriers we end up issuing (in the ringbuffer code) since we can better control when we want to expose the ring state to the host. The patch is being resent to address earlier email issues. Signed-off-by: K. Y. Srinivasan --- drivers/net/hyperv/netvsc.c | 88 +-- 1 files changed, 59 insertions(+), 29 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 719cb35..8cd4c19 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1141,6 +1141,39 @@ static inline void netvsc_receive_inband(struct hv_device *hdev, } } +static void netvsc_process_raw_pkt(struct hv_device *device, + struct vmbus_channel *channel, + struct netvsc_device *net_device, + struct net_device *ndev, + u64 request_id, + struct vmpacket_descriptor *desc) +{ + struct nvsp_message *nvmsg; + + nvmsg = (struct nvsp_message *)((unsigned long) + desc + (desc->offset8 << 3)); + + switch (desc->type) { + case VM_PKT_COMP: + netvsc_send_completion(net_device, channel, device, desc); + break; + + case VM_PKT_DATA_USING_XFER_PAGES: + netvsc_receive(net_device, channel, device, desc); + break; + + case VM_PKT_DATA_INBAND: + netvsc_receive_inband(device, net_device, nvmsg); + break; + + default: + netdev_err(ndev, "unhandled packet type %d, tid %llx\n", + desc->type, request_id); + break; + } +} + + void netvsc_channel_cb(void *context) { int ret; @@ -1153,7 +1186,7 @@ void netvsc_channel_cb(void *context) unsigned char *buffer; int bufferlen = NETVSC_PACKET_SIZE; struct net_device *ndev; - struct nvsp_message *nvmsg; + bool need_to_commit = false; if (channel->primary_channel != NULL) device = channel->primary_channel->device_obj; @@ -1167,39 +1200,36 @@ void netvsc_channel_cb(void *context) buffer = get_per_channel_state(channel); do { + desc = get_next_pkt_raw(channel); + if (desc != NULL) { + netvsc_process_raw_pkt(device, + channel, + net_device, + ndev, + desc->trans_id, + desc); + + put_pkt_raw(channel, desc); + need_to_commit = true; + continue; + } + if (need_to_commit) { + need_to_commit = false; + commit_rd_index(channel); + } + ret = vmbus_recvpacket_raw(channel, buffer, bufferlen, _recvd, _id); if (ret == 0) { if (bytes_recvd > 0) { desc = (struct vmpacket_descriptor *)buffer; - nvmsg = (struct nvsp_message *)((unsigned long) -desc + (desc->offset8 << 3)); - switch (desc->type) { - case VM_PKT_COMP: - netvsc_send_completion(net_device, - channel, - device, desc); - break; - - case VM_PKT_DATA_USING_XFER_PAGES: - netvsc_receive(net_device, channel, - device, desc); - break; - - case VM_PKT_DATA_INBAND: - netvsc_receive_inband(device, - net_device, - nvmsg); - break; - - default: - netdev_err(ndev, - "unhandled packet type %d, " - "tid %llx len %d\n", - desc->type, request_id, - bytes_recvd); -
[PATCH 2/3] Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg()
From: K. Y. SrinivasanThe current delay between retries is unnecessarily high and is negatively affecting the time it takes to boot the system. Signed-off-by: K. Y. Srinivasan --- drivers/hv/connection.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index fcf8a02..78e6368 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -439,7 +439,7 @@ int vmbus_post_msg(void *buffer, size_t buflen) union hv_connection_id conn_id; int ret = 0; int retries = 0; - u32 msec = 1; + u32 usec = 1; conn_id.asu32 = 0; conn_id.u.id = VMBUS_MESSAGE_CONNECTION_ID; @@ -472,9 +472,9 @@ int vmbus_post_msg(void *buffer, size_t buflen) } retries++; - msleep(msec); - if (msec < 2048) - msec *= 2; + udelay(usec); + if (usec < 2048) + usec *= 2; } return ret; } -- 1.7.4.1
[PATCH 1/3] Drivers: hv: vmbus: Enable explicit signaling policy for NIC channels
From: K. Y. SrinivasanFor synthetic NIC channels, enable explicit signaling policy as netvsc wants to explicitly control when the host is to be signaled. Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel.c | 18 -- drivers/hv/channel_mgmt.c |2 ++ drivers/hv/hyperv_vmbus.h |3 ++- drivers/hv/ring_buffer.c | 15 --- 4 files changed, 20 insertions(+), 18 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index a68830c..da022d3 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -657,7 +657,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, num_vecs, - , lock); + , lock, channel->signal_policy); /* * Signalling the host is conditional on many factors: @@ -678,11 +678,6 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, * mechanism which can hurt the performance otherwise. */ - if (channel->signal_policy) - signal = true; - else - kick_q = true; - if (((ret == 0) && kick_q && signal) || (ret && !is_hvsock_channel(channel))) vmbus_setevent(channel); @@ -775,7 +770,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); /* * Signalling the host is conditional on many factors: @@ -793,11 +788,6 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, * enough condition that it should not matter. */ - if (channel->signal_policy) - signal = true; - else - kick_q = true; - if (((ret == 0) && kick_q && signal) || (ret)) vmbus_setevent(channel); @@ -859,7 +849,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); if (ret == 0 && signal) vmbus_setevent(channel); @@ -924,7 +914,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); if (ret == 0 && signal) vmbus_setevent(channel); diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index b6c1211..8345869 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -406,6 +406,8 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) } dev_type = hv_get_dev_type(>offermsg.offer.if_type); + if (dev_type == HV_NIC) + set_channel_signal_state(newchannel, HV_SIGNAL_POLICY_EXPLICIT); init_vp_index(newchannel, dev_type); diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index dfa9fac..ddcc348 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -529,7 +529,8 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); int hv_ringbuffer_write(struct hv_ring_buffer_info *ring_info, struct kvec *kv_list, - u32 kv_count, bool *signal, bool lock); + u32 kv_count, bool *signal, bool lock, + enum hv_signal_policy policy); int hv_ringbuffer_read(struct hv_ring_buffer_info *inring_info, void *buffer, u32 buflen, u32 *buffer_actual_len, diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index fe586bf..e3edcae 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -66,12 +66,20 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi) *arrived. */ -static bool hv_need_to_signal(u32 old_write, struct hv_ring_buffer_info *rbi) +static bool hv_need_to_signal(u32 old_write, struct hv_ring_buffer_info *rbi, + enum hv_signal_policy policy) { virt_mb(); if (READ_ONCE(rbi->ring_buffer->interrupt_mask)) return false; + /* +* When the client wants to control signaling, +* we only honour the host interrupt mask. +*/ + if (policy == HV_SIGNAL_POLICY_EXPLICIT) + return true; + /* check interrupt_mask before read_index */
[PATCH 2/3] Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg()
From: K. Y. Srinivasan The current delay between retries is unnecessarily high and is negatively affecting the time it takes to boot the system. Signed-off-by: K. Y. Srinivasan --- drivers/hv/connection.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index fcf8a02..78e6368 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -439,7 +439,7 @@ int vmbus_post_msg(void *buffer, size_t buflen) union hv_connection_id conn_id; int ret = 0; int retries = 0; - u32 msec = 1; + u32 usec = 1; conn_id.asu32 = 0; conn_id.u.id = VMBUS_MESSAGE_CONNECTION_ID; @@ -472,9 +472,9 @@ int vmbus_post_msg(void *buffer, size_t buflen) } retries++; - msleep(msec); - if (msec < 2048) - msec *= 2; + udelay(usec); + if (usec < 2048) + usec *= 2; } return ret; } -- 1.7.4.1
[PATCH 1/3] Drivers: hv: vmbus: Enable explicit signaling policy for NIC channels
From: K. Y. Srinivasan For synthetic NIC channels, enable explicit signaling policy as netvsc wants to explicitly control when the host is to be signaled. Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel.c | 18 -- drivers/hv/channel_mgmt.c |2 ++ drivers/hv/hyperv_vmbus.h |3 ++- drivers/hv/ring_buffer.c | 15 --- 4 files changed, 20 insertions(+), 18 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index a68830c..da022d3 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -657,7 +657,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, num_vecs, - , lock); + , lock, channel->signal_policy); /* * Signalling the host is conditional on many factors: @@ -678,11 +678,6 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, * mechanism which can hurt the performance otherwise. */ - if (channel->signal_policy) - signal = true; - else - kick_q = true; - if (((ret == 0) && kick_q && signal) || (ret && !is_hvsock_channel(channel))) vmbus_setevent(channel); @@ -775,7 +770,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); /* * Signalling the host is conditional on many factors: @@ -793,11 +788,6 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, * enough condition that it should not matter. */ - if (channel->signal_policy) - signal = true; - else - kick_q = true; - if (((ret == 0) && kick_q && signal) || (ret)) vmbus_setevent(channel); @@ -859,7 +849,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); if (ret == 0 && signal) vmbus_setevent(channel); @@ -924,7 +914,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel, bufferlist[2].iov_len = (packetlen_aligned - packetlen); ret = hv_ringbuffer_write(>outbound, bufferlist, 3, - , lock); + , lock, channel->signal_policy); if (ret == 0 && signal) vmbus_setevent(channel); diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index b6c1211..8345869 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -406,6 +406,8 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) } dev_type = hv_get_dev_type(>offermsg.offer.if_type); + if (dev_type == HV_NIC) + set_channel_signal_state(newchannel, HV_SIGNAL_POLICY_EXPLICIT); init_vp_index(newchannel, dev_type); diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index dfa9fac..ddcc348 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -529,7 +529,8 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); int hv_ringbuffer_write(struct hv_ring_buffer_info *ring_info, struct kvec *kv_list, - u32 kv_count, bool *signal, bool lock); + u32 kv_count, bool *signal, bool lock, + enum hv_signal_policy policy); int hv_ringbuffer_read(struct hv_ring_buffer_info *inring_info, void *buffer, u32 buflen, u32 *buffer_actual_len, diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index fe586bf..e3edcae 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -66,12 +66,20 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi) *arrived. */ -static bool hv_need_to_signal(u32 old_write, struct hv_ring_buffer_info *rbi) +static bool hv_need_to_signal(u32 old_write, struct hv_ring_buffer_info *rbi, + enum hv_signal_policy policy) { virt_mb(); if (READ_ONCE(rbi->ring_buffer->interrupt_mask)) return false; + /* +* When the client wants to control signaling, +* we only honour the host interrupt mask. +*/ + if (policy == HV_SIGNAL_POLICY_EXPLICIT) + return true; + /* check interrupt_mask before read_index */ virt_rmb(); /* @@ -264,7
[PATCH 3/3] Drivers: hv: vmbus: Implement a mechanism to tag the channel for low latency
From: K. Y. SrinivasanOn Hyper-V, performance critical channels use the monitor mechanism to signal the host when the guest posts mesages for the host. This mechanism minimizes the hypervisor intercepts and also makes the host more efficient in that each time the host is woken up, it processes a batch of messages as opposed to just one. The goal here is improve the throughput and this is at the expense of increased latency. Implement a mechanism to let the client driver decide if latency is important. Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel.c |7 ++- include/linux/hyperv.h | 35 +++ 2 files changed, 41 insertions(+), 1 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index da022d3..e3a0048 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -43,7 +43,12 @@ static void vmbus_setevent(struct vmbus_channel *channel) { struct hv_monitor_page *monitorpage; - if (channel->offermsg.monitor_allocated) { + /* +* For channels marked as in "low latency" mode +* bypass the monitor page mechanism. +*/ + if ((channel->offermsg.monitor_allocated) && + (!channel->low_latency)) { /* Each u32 represents 32 channels */ sync_set_bit(channel->offermsg.child_relid & 31, (unsigned long *) vmbus_connection.send_int_page + diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index b10954a..362acf0 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -850,6 +850,31 @@ struct vmbus_channel { * ring lock to preserve the current behavior. */ bool acquire_ring_lock; + /* +* For performance critical channels (storage, networking +* etc,), Hyper-V has a mechanism to enhance the throughput +* at the expense of latency: +* When the host is to be signaled, we just set a bit in a shared page +* and this bit will be inspected by the hypervisor within a certain +* window and if the bit is set, the host will be signaled. The window +* of time is the monitor latency - currently around 100 usecs. This +* mechanism improves throughput by: +* +* A) Making the host more efficient - each time it wakes up, +*potentially it will process morev number of packets. The +*monitor latency allows a batch to build up. +* B) By deferring the hypercall to signal, we will also minimize +*the interrupts. +* +* Clearly, these optimizations improve throughput at the expense of +* latency. Furthermore, since the channel is shared for both +* control and data messages, control messages currently suffer +* unnecessary latency adversley impacting performance and boot +* time. To fix this issue, permit tagging the channel as being +* in "low latency" mode. In this mode, we will bypass the monitor +* mechanism. +*/ + bool low_latency; }; @@ -891,6 +916,16 @@ static inline void set_channel_pending_send_size(struct vmbus_channel *c, c->outbound.ring_buffer->pending_send_sz = size; } +static inline void set_low_latency_mode(struct vmbus_channel *c) +{ + c->low_latency = true; +} + +static inline void clear_low_latency_mode(struct vmbus_channel *c) +{ + c->low_latency = false; +} + void vmbus_onmessage(void *context); int vmbus_request_offers(void); -- 1.7.4.1
[PATCH 3/3] Drivers: hv: vmbus: Implement a mechanism to tag the channel for low latency
From: K. Y. Srinivasan On Hyper-V, performance critical channels use the monitor mechanism to signal the host when the guest posts mesages for the host. This mechanism minimizes the hypervisor intercepts and also makes the host more efficient in that each time the host is woken up, it processes a batch of messages as opposed to just one. The goal here is improve the throughput and this is at the expense of increased latency. Implement a mechanism to let the client driver decide if latency is important. Signed-off-by: K. Y. Srinivasan --- drivers/hv/channel.c |7 ++- include/linux/hyperv.h | 35 +++ 2 files changed, 41 insertions(+), 1 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index da022d3..e3a0048 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -43,7 +43,12 @@ static void vmbus_setevent(struct vmbus_channel *channel) { struct hv_monitor_page *monitorpage; - if (channel->offermsg.monitor_allocated) { + /* +* For channels marked as in "low latency" mode +* bypass the monitor page mechanism. +*/ + if ((channel->offermsg.monitor_allocated) && + (!channel->low_latency)) { /* Each u32 represents 32 channels */ sync_set_bit(channel->offermsg.child_relid & 31, (unsigned long *) vmbus_connection.send_int_page + diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index b10954a..362acf0 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -850,6 +850,31 @@ struct vmbus_channel { * ring lock to preserve the current behavior. */ bool acquire_ring_lock; + /* +* For performance critical channels (storage, networking +* etc,), Hyper-V has a mechanism to enhance the throughput +* at the expense of latency: +* When the host is to be signaled, we just set a bit in a shared page +* and this bit will be inspected by the hypervisor within a certain +* window and if the bit is set, the host will be signaled. The window +* of time is the monitor latency - currently around 100 usecs. This +* mechanism improves throughput by: +* +* A) Making the host more efficient - each time it wakes up, +*potentially it will process morev number of packets. The +*monitor latency allows a batch to build up. +* B) By deferring the hypercall to signal, we will also minimize +*the interrupts. +* +* Clearly, these optimizations improve throughput at the expense of +* latency. Furthermore, since the channel is shared for both +* control and data messages, control messages currently suffer +* unnecessary latency adversley impacting performance and boot +* time. To fix this issue, permit tagging the channel as being +* in "low latency" mode. In this mode, we will bypass the monitor +* mechanism. +*/ + bool low_latency; }; @@ -891,6 +916,16 @@ static inline void set_channel_pending_send_size(struct vmbus_channel *c, c->outbound.ring_buffer->pending_send_sz = size; } +static inline void set_low_latency_mode(struct vmbus_channel *c) +{ + c->low_latency = true; +} + +static inline void clear_low_latency_mode(struct vmbus_channel *c) +{ + c->low_latency = false; +} + void vmbus_onmessage(void *context); int vmbus_request_offers(void); -- 1.7.4.1
[PATCH 0/3] Drivers: hv: vmbus: Miscellaneous adjustments
From: K. Y. SrinivasanSome miscellaneous adjustments to the vmbus driver. K. Y. Srinivasan (3): Drivers: hv: vmbus: Enable explicit signaling policy for NIC channels Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg() Drivers: hv: vmbus: Implement a mechanism to tag the channel for low latency drivers/hv/channel.c | 25 ++--- drivers/hv/channel_mgmt.c |2 ++ drivers/hv/connection.c |8 drivers/hv/hyperv_vmbus.h |3 ++- drivers/hv/ring_buffer.c | 15 --- include/linux/hyperv.h| 35 +++ 6 files changed, 65 insertions(+), 23 deletions(-) -- 1.7.4.1
[PATCH 0/3] Drivers: hv: vmbus: Miscellaneous adjustments
From: K. Y. Srinivasan Some miscellaneous adjustments to the vmbus driver. K. Y. Srinivasan (3): Drivers: hv: vmbus: Enable explicit signaling policy for NIC channels Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg() Drivers: hv: vmbus: Implement a mechanism to tag the channel for low latency drivers/hv/channel.c | 25 ++--- drivers/hv/channel_mgmt.c |2 ++ drivers/hv/connection.c |8 drivers/hv/hyperv_vmbus.h |3 ++- drivers/hv/ring_buffer.c | 15 --- include/linux/hyperv.h| 35 +++ 6 files changed, 65 insertions(+), 23 deletions(-) -- 1.7.4.1