Re: [PATCH net] Revert "vhost: cache used event for better performance"
On 07/26/2017 10:03 AM, Jason Wang wrote: > This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it > was reported to break vhost_net. We want to cache used event and use > it to check for notification. We try to valid cached used event by > checking whether or not it was ahead of new, but this is not correct > all the time, it could be stale and there's no way to know about this. > > Signed-off-by: Jason WangThis would then qualify for stable ?
Re: [Patch net] tap: convert a mutex to a spinlock
On 07/05/2017 10:50 PM, Cong Wang wrote: > We are not allowed to block on the RCU reader side, so can't > just hold the mutex as before. As a quick fix, convert it to > a spinlock. > > Fixes: d9f1f61c0801 ("tap: Extending tap device create/destroy APIs") > Reported-by: Christian Borntraeger <borntrae...@de.ibm.com> > Cc: Sainath Grandhi <sainath.gran...@intel.com> > Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com> Tested-by: Christian Borntraeger <borntrae...@de.ibm.com> > --- > drivers/net/tap.c | 18 +- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/drivers/net/tap.c b/drivers/net/tap.c > index 4d4173d..d88ae3c 100644 > --- a/drivers/net/tap.c > +++ b/drivers/net/tap.c > @@ -106,7 +106,7 @@ struct major_info { > struct rcu_head rcu; > dev_t major; > struct idr minor_idr; > - struct mutex minor_lock; > + spinlock_t minor_lock; > const char *device_name; > struct list_head next; > }; > @@ -416,15 +416,15 @@ int tap_get_minor(dev_t major, struct tap_dev *tap) > goto unlock; > } > > - mutex_lock(_major->minor_lock); > - retval = idr_alloc(_major->minor_idr, tap, 1, TAP_NUM_DEVS, > GFP_KERNEL); > + spin_lock(_major->minor_lock); > + retval = idr_alloc(_major->minor_idr, tap, 1, TAP_NUM_DEVS, > GFP_ATOMIC); > if (retval >= 0) { > tap->minor = retval; > } else if (retval == -ENOSPC) { > netdev_err(tap->dev, "Too many tap devices\n"); > retval = -EINVAL; > } > - mutex_unlock(_major->minor_lock); > + spin_unlock(_major->minor_lock); > > unlock: > rcu_read_unlock(); > @@ -442,12 +442,12 @@ void tap_free_minor(dev_t major, struct tap_dev *tap) > goto unlock; > } > > - mutex_lock(_major->minor_lock); > + spin_lock(_major->minor_lock); > if (tap->minor) { > idr_remove(_major->minor_idr, tap->minor); > tap->minor = 0; > } > - mutex_unlock(_major->minor_lock); > + spin_unlock(_major->minor_lock); > > unlock: > rcu_read_unlock(); > @@ -467,13 +467,13 @@ static struct tap_dev *dev_get_by_tap_file(int major, > int minor) > goto unlock; > } > > - mutex_lock(_major->minor_lock); > + spin_lock(_major->minor_lock); > tap = idr_find(_major->minor_idr, minor); > if (tap) { > dev = tap->dev; > dev_hold(dev); > } > - mutex_unlock(_major->minor_lock); > + spin_unlock(_major->minor_lock); > > unlock: > rcu_read_unlock(); > @@ -1227,7 +1227,7 @@ static int tap_list_add(dev_t major, const char > *device_name) > tap_major->major = MAJOR(major); > > idr_init(_major->minor_idr); > - mutex_init(_major->minor_lock); > + spin_lock_init(_major->minor_lock); > > tap_major->device_name = device_name; >
locking issues in macvtap (looks like due to tap: Extending tap device create/destroy APIs)
Sainath, with rcu debugging and lock debugging I get the following splats. I think doing a mutex_lock while in an rcu read-side is not allowed, since mutex_lock can sleep. This is in 4.11 and 4.12 and seems to be introduced with commit d9f1f61c0801a7("tap: Extending tap device create/destroy APIs"). Christian [ 125.678015] === [ 125.678018] [ ERR: suspicious RCU usage. ] [ 125.678022] 4.11.0+ #18 Not tainted [ 125.678025] --- [ 125.678028] ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section! [ 125.678031] other info that might help us debug this: [ 125.678035] rcu_scheduler_active = 2, debug_locks = 0 [ 125.678038] 2 locks held by libvirtd/3050: [ 125.678041] #0: (rtnl_mutex){+.+.+.}, at: [<00772b02>] rtnl_newlink+0x2ea/0x880 [ 125.678057] #1: (rcu_read_lock){..}, at: [<03ff800dad00>] tap_get_minor+0x0/0x1d8 [tap] [ 125.678068] stack backtrace: [ 125.678073] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18 [ 125.678076] Hardware name: IBM 2964 NC9 704 (LPAR) [ 125.678079] Stack: [ 125.678081]00fa977cb230 00fa977cb2c0 0003 [ 125.678091]00fa977cb360 00fa977cb2d8 00fa977cb2d8 0020 [ 125.678100] 03ff0020 00fa000a 00fa000a [ 125.678109]000c 00fa977cb328 [ 125.678119]008e2510 001139ac 00fa977cb2c0 00fa977cb318 [ 125.678150] Call Trace: [ 125.678157] ([<00113872>] show_trace+0xea/0xf0) [ 125.678160] [<00113950>] show_stack+0x68/0xe0 [ 125.678165] [<0057ef8c>] dump_stack+0x94/0xd8 [ 125.678172] [<001a4422>] ___might_sleep+0x21a/0x268 [ 125.678177] [<008ca842>] __mutex_lock+0x52/0x968 [ 125.678180] [<008cb192>] mutex_lock_nested+0x3a/0x48 [ 125.678184] [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] [ 125.678188] [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] [ 125.678191] [<0019bfbe>] notifier_call_chain+0x56/0x98 [ 125.678195] [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 [ 125.678200] [<0075d014>] register_netdevice+0x3f4/0x508 [ 125.678204] [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 [macvlan] [ 125.678207] [<03ff80177564>] macvtap_newlink+0xbc/0xf0 [macvtap] [ 125.678211] [<00772e32>] rtnl_newlink+0x61a/0x880 [ 125.678214] [<0077313c>] rtnetlink_rcv_msg+0xa4/0x248 [ 125.678219] [<0079cec0>] netlink_rcv_skb+0xd8/0x108 [ 125.678222] [<0076f538>] rtnetlink_rcv+0x48/0x58 [ 125.678226] [<0079c750>] netlink_unicast+0x178/0x1f8 [ 125.678229] [<0079cbd4>] netlink_sendmsg+0x304/0x3b0 [ 125.678233] [<00730676>] sock_sendmsg+0x6e/0x80 [ 125.678237] [<007311b0>] ___sys_sendmsg+0x2a0/0x2a8 [ 125.678240] [<007324d8>] __sys_sendmsg+0x60/0xa8 [ 125.678244] [<00732ed4>] SyS_socketcall+0x33c/0x390 [ 125.678248] [<008d08bc>] system_call+0xc4/0x258 [ 125.678251] INFO: lockdep is turned off. [ 125.678255] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 [ 125.678257] in_atomic(): 1, irqs_disabled(): 0, pid: 3050, name: libvirtd [ 125.678261] INFO: lockdep is turned off. [ 125.678264] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18 [ 125.678267] Hardware name: IBM 2964 NC9 704 (LPAR) [ 125.678269] Stack: [ 125.678272]00fa977cb230 00fa977cb2c0 0003 [ 125.678281]00fa977cb360 00fa977cb2d8 00fa977cb2d8 0020 [ 125.678290] 00fa0020 00fa000a 00fa000a [ 125.678298]000c 00fa977cb328 [ 125.678308]008e2510 001139ac 00fa977cb2c0 00fa977cb318 [ 125.678323] Call Trace: [ 125.678326] ([<00113872>] show_trace+0xea/0xf0) [ 125.678330] [<00113950>] show_stack+0x68/0xe0 [ 125.678334] [<0057ef8c>] dump_stack+0x94/0xd8 [ 125.678337] [<001a438e>] ___might_sleep+0x186/0x268 [ 125.678341] [<008ca842>] __mutex_lock+0x52/0x968 [ 125.678346] [<008cb192>] mutex_lock_nested+0x3a/0x48 [ 125.678350] [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] [ 125.678354] [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] [ 125.678357] [<0019bfbe>] notifier_call_chain+0x56/0x98 [ 125.678360] [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 [ 125.678364] [<0075d014>] register_netdevice+0x3f4/0x508 [ 125.678368] [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 [macvlan] [ 125.678371] [<03ff80177564>]
Re: rhashtable - Cap total number of entries to 2^31
On 04/28/2017 01:31 PM, Herbert Xu wrote: > On Fri, Apr 28, 2017 at 12:23:15PM +0200, Christian Borntraeger wrote: >> >> I can reproduce this boot failure on s390 bisected to >> commit 6d684e54690caef45cf14051ddeb7c71beeb681b >>rhashtable: Cap total number of entries to 2^31 >> in linux-next from Apr 28 > > It should go away with > > https://patchwork.ozlabs.org/patch/756233/ > > Thanks, > Yes it does. Tested-by: Christian Borntraeger <borntrae...@de.ibm.com> would be nice to have it in the next linux-next version.
Re: rhashtable - Cap total number of entries to 2^31
On 04/28/2017 12:21 AM, Florian Fainelli wrote: > On 04/27/2017 02:16 PM, Florian Fainelli wrote: >> Hi Herbert, >> >> On 04/26/2017 10:44 PM, Herbert Xu wrote: >>> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote: From: Florian WestphalDate: Tue, 25 Apr 2017 16:17:49 +0200 > I'd have less of an issue with this if we'd be talking about > something computationally expensive, but this is about storing > an extra value inside a struct just to avoid one "shr" in insert path... Agreed, this shift is probably filling an available cpu cycle :-) >>> >>> OK, but we need to have an extra field for another reason anyway. >>> The problem is that we're not capping the total number of elements >>> in the hashtable when max_size is not set, this means that nelems >>> can overflow which will cause havoc with the automatic shrinking >>> when it tries to fit 2^32 entries into a minimum-sized table. >>> >>> So I'm taking that hole back for now :) >>> >>> ---8<--- >>> When max_size is not set or if it set to a sufficiently large >>> value, the nelems counter can overflow. This would cause havoc >>> with the automatic shrinking as it would then attempt to fit a >>> huge number of entries into a tiny hash table. >>> >>> This patch fixes this by adding max_elems to struct rhashtable >>> to cap the number of elements. This is set to 2^31 as nelems is >>> not a precise count. This is sufficiently smaller than UINT_MAX >>> that it should be safe. >>> >>> When max_size is set max_elems will be lowered to at most twice >>> max_size as is the status quo. >>> >>> Signed-off-by: Herbert Xu >> >> This commit: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b >> >> makes my ARMv7 (32-bit) system panic on boot with the log below. I can >> test net-next (or net) and report back if you want me to test anything. >> Thanks! > > And another on with a QEMU guest: > > [0.389212] NET: Registered protocol family 16 > [0.388807] Kernel panic - not syncing: rtnetlink_init: cannot > initialize rtnetlink > [0.388807] > [0.389445] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 4.11.0-rc8-02077-ge221c1f0fe25 #1 > [0.389745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 > [0.390219] Call Trace: > [0.391406] dump_stack+0x51/0x78 > [0.391585] panic+0xc7/0x20e > [0.391740] ? register_pernet_operations+0xa1/0xd0 > [0.392031] rtnetlink_init+0x22/0x1a0 > [0.392190] netlink_proto_init+0x168/0x184 > [0.392359] ? ptp_classifier_init+0x26/0x30 > [0.392528] ? netlink_net_init+0x2e/0x2e > [0.392692] do_one_initcall+0x54/0x190 > [0.392852] ? parse_args+0x248/0x400 > [0.393033] kernel_init_freeable+0x127/0x1b6 > [0.393208] ? kernel_init_freeable+0x1b6/0x1b6 > [0.393389] ? rest_init+0x70/0x70 > [0.393533] kernel_init+0x9/0x100 > [0.393676] ret_from_fork+0x29/0x40 > [0.394555] ---[ end Kernel panic - not syncing: rtnetlink_init: > cannot initialize rtnetlink > [0.394555] > > I traced this down to: > > rtnetlink_net_init() > netlink_kernel_create() > netlink_insert() > __netlink_insert() > rhashtable_lookup_insert_key() > __rhashtable_insert_fast() > rht_grow_above_max() > > And indeed we have: > > ht->nelemts = 0 > ht->max_elems = 0 > > such that rht_grow_above_max() returns true. > > With your commit we actually take this branch: > > if (ht->p.max_size < ht->max_elems / 2) > ht->max_elems = ht->p.max_size * 2; > > since max_size = 0 we have max_elems = 0 as well. > > Candidate fix #1: > > diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h > index 45f89369c4c8..ad9020e1609c 100644 > --- a/include/linux/rhashtable.h > +++ b/include/linux/rhashtable.h > @@ -329,7 +329,7 @@ static inline bool rht_grow_above_100(const struct > rhashtable *ht, > static inline bool rht_grow_above_max(const struct rhashtable *ht, > const struct bucket_table *tbl) > { > - return atomic_read(>nelems) >= ht->max_elems; > + return ht->p.max_size && atomic_read(>nelems) >= ht->max_elems; > } > > Candidate fix #2: > > diff --git a/lib/rhashtable.c b/lib/rhashtable.c > index 751630bbe409..6b4f07760fec 100644 > --- a/lib/rhashtable.c > +++ b/lib/rhashtable.c > @@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht, > > /* Cap total entries at 2^31 to avoid nelems overflow. */ > ht->max_elems = 1u << 31; > - if (ht->p.max_size < ht->max_elems / 2) > + if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2)) > ht->max_elems = ht->p.max_size * 2; > > ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE); > > Number #2 does not introduce an additional conditional on the fastpath, > so I
Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE
On 03/02/2017 10:45 PM, Arnd Bergmann wrote: > On Thu, Mar 2, 2017 at 8:00 PM, Christian Borntraeger > <borntrae...@de.ibm.com> wrote: >> On 03/02/2017 06:55 PM, Arnd Bergmann wrote: >>> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger >>> <borntrae...@de.ibm.com> wrote: >>>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote: >>>>> >>>>> This attempts a rewrite of the two macros, using a simpler implementation >>>>> for the most common case of having a naturally aligned 1, 2, 4, or (on >>>>> 64-bit architectures) 8 byte object that can be accessed with a single >>>>> instruction. For these, we go back to a volatile pointer dereference >>>>> that we had with the ACCESS_ONCE macro. >>>> >>>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could >>>> removed the volatile statement on aggregate types like the following one >>>> >>>> union ipte_control { >>>> unsigned long val; >>>> struct { >>>> unsigned long k : 1; >>>> unsigned long kh : 31; >>>> unsigned long kg : 32; >>>> }; >>>> }; >>>> >>>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145 >>>> >>>> If I see that right, your __ALIGNED_WORD(x) >>>> macro would say that for above structure sizeof(x) == sizeof(long)) is >>>> true, >>>> so it would fall back to the old volatile cast and might reintroduce the >>>> old compiler bug? >> >> Oh dear, I should double check my sentences in emails before sending...anyway >> the full story is referenced in >> >> commit 60815cf2e05057db5b78e398d9734c493560b11e >> Merge tag 'for-linus' of >> git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux >> which has a pointer to >> http://marc.info/?i=54611D86.4040306%40de.ibm.com >> which contains the full story. > > Ok, got it. So I guess the behavior of forcing aligned accesses on aligned > data is accidental, and allowing non-power-of-two arguments is also not > the main purpose. Right. The main purpose is to read/write _ONCE_. You can assume a somewhat atomic access for sizes <= word size. And there are certainly places that rely on that. But the *ONCE thing is mostly used for things where we used barrier() 10 years ago. Maybe we could just bail out on new compilers if we get > either of those? That might catch code that accidentally does something > that is inherently non-atomic or that causes a trap when the intention was > to have a simple atomic access. I think Linus stated that its ok to assume that the compiler is smart enough to uses a single instruction to access aligned and properly sized scalar types for *ONCE. Back then when I changed ACCESS_ONCE there were many places that did use it for non-atomic, > word size accesses. For example on some architectures a pmd_t is a typedef to an array, for which there is no way to read that atomically. So the focus must be on the "ONCE" part. If some code uses a properly aligned, word sized object we can also assume atomic access. If the access is not properly sized/aligned we do not get atomicity, but we do get the "ONCE". But adding a check for alignment/size would break the compilation of some code.
Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE
On 03/02/2017 06:55 PM, Arnd Bergmann wrote: > On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger > <borntrae...@de.ibm.com> wrote: >> On 03/02/2017 05:38 PM, Arnd Bergmann wrote: >>> >>> This attempts a rewrite of the two macros, using a simpler implementation >>> for the most common case of having a naturally aligned 1, 2, 4, or (on >>> 64-bit architectures) 8 byte object that can be accessed with a single >>> instruction. For these, we go back to a volatile pointer dereference >>> that we had with the ACCESS_ONCE macro. >> >> We had changed that back then because gcc 4.6 and 4.7 had a bug that could >> removed the volatile statement on aggregate types like the following one >> >> union ipte_control { >> unsigned long val; >> struct { >> unsigned long k : 1; >> unsigned long kh : 31; >> unsigned long kg : 32; >> }; >> }; >> >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145 >> >> If I see that right, your __ALIGNED_WORD(x) >> macro would say that for above structure sizeof(x) == sizeof(long)) is true, >> so it would fall back to the old volatile cast and might reintroduce the >> old compiler bug? Oh dear, I should double check my sentences in emails before sending...anyway the full story is referenced in commit 60815cf2e05057db5b78e398d9734c493560b11e Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux which has a pointer to http://marc.info/?i=54611D86.4040306%40de.ibm.com which contains the full story. > > Ah, right, that's the missing piece. For some reason I didn't find > the reference in the source or the git log. > >> Could you maybe you fence your simple macro for anything older than 4.9? >> After >> all there was no kasan support anyway on these older gcc version. > > Yes, that should work, thanks!
Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE
On 03/02/2017 05:38 PM, Arnd Bergmann wrote: > When CONFIG_KASAN is enabled, the READ_ONCE/WRITE_ONCE macros cause > rather large kernel stacks, e.g.: > > mm/vmscan.c: In function 'shrink_page_list': > mm/vmscan.c:1333:1: error: the frame size of 3456 bytes is larger than 3072 > bytes [-Werror=frame-larger-than=] > block/cfq-iosched.c: In function 'cfqg_stats_add_aux': > block/cfq-iosched.c:750:1: error: the frame size of 4048 bytes is larger than > 3072 bytes [-Werror=frame-larger-than=] > fs/btrfs/disk-io.c: In function 'open_ctree': > fs/btrfs/disk-io.c:3314:1: error: the frame size of 3136 bytes is larger than > 3072 bytes [-Werror=frame-larger-than=] > fs/btrfs/relocation.c: In function 'build_backref_tree': > fs/btrfs/relocation.c:1193:1: error: the frame size of 4336 bytes is larger > than 3072 bytes [-Werror=frame-larger-than=] > fs/fscache/stats.c: In function 'fscache_stats_show': > fs/fscache/stats.c:287:1: error: the frame size of 6512 bytes is larger than > 3072 bytes [-Werror=frame-larger-than=] > fs/jbd2/commit.c: In function 'jbd2_journal_commit_transaction': > fs/jbd2/commit.c:1139:1: error: the frame size of 3760 bytes is larger than > 3072 bytes [-Werror=frame-larger-than=] > > This attempts a rewrite of the two macros, using a simpler implementation > for the most common case of having a naturally aligned 1, 2, 4, or (on > 64-bit architectures) 8 byte object that can be accessed with a single > instruction. For these, we go back to a volatile pointer dereference > that we had with the ACCESS_ONCE macro. We had changed that back then because gcc 4.6 and 4.7 had a bug that could removed the volatile statement on aggregate types like the following one union ipte_control { unsigned long val; struct { unsigned long k : 1; unsigned long kh : 31; unsigned long kg : 32; }; }; See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145 If I see that right, your __ALIGNED_WORD(x) macro would say that for above structure sizeof(x) == sizeof(long)) is true, so it would fall back to the old volatile cast and might reintroduce the old compiler bug? Could you maybe you fence your simple macro for anything older than 4.9? After all there was no kasan support anyway on these older gcc version. Christian
Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs
On 02/03/2017 07:19 AM, Ben Serebrin wrote: [...] > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -1502,20 +1502,44 @@ static void virtnet_set_affinity(struct virtnet_info > *vi) >* queue pairs, we let the queue pairs to be private to one cpu by >* setting the affinity hint to eliminate the contention. >*/ > - if (vi->curr_queue_pairs == 1 || > - vi->max_queue_pairs != num_online_cpus()) { > + if (vi->curr_queue_pairs == 1) { > virtnet_clean_affinity(vi, -1); > return; > } > > + /* If there are more cpus than queues, then assign the queues' > + * interrupts to the first cpus until we run out. > + */ > i = 0; > for_each_online_cpu(cpu) { > + if (i == vi->max_queue_pairs) > + break; > virtqueue_set_affinity(vi->rq[i].vq, cpu); > virtqueue_set_affinity(vi->sq[i].vq, cpu); > - netif_set_xps_queue(vi->dev, cpumask_of(cpu), i); > i++; > } > > + /* Stripe the XPS affinities across the online CPUs. > + * Hyperthread pairs are typically assigned such that Linux's > + * CPU X and X + (numcpus / 2) are hyperthread twins, so we cause > + * hyperthread twins to share TX queues, in the case where there are > + * more cpus than queues. > + */ This is not always true. E.g. on s390 the SMT threads are usually paired even/odd. e.g. [cborntra@s38lp08 linux]$ lscpu -e CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS 0 000 00:0:0:0 yesyeshorizontal 0 1 000 01:1:1:1 yesyeshorizontal 1 2 000 12:2:2:2 yesyeshorizontal 2 3 000 13:3:3:3 yesyeshorizontal 3 4 000 24:4:4:4 yesyeshorizontal 4 5 000 25:5:5:5 yesyeshorizontal 5 6 000 36:6:6:6 yesyeshorizontal 6 This does not matter yet for s390 (as virtio is usally doen via the ccw bus) but maybe we should consider an future patch to provide some arch-specific striping hints. Or would it make sense to change the s390 layout for SMT twins because there is more code that expects all threads 0 at the front and all threads 1 at the end? > + for (i = 0; i < vi->max_queue_pairs; i++) { > + struct cpumask mask; > + int skip = i; > + > + cpumask_clear(); > + for_each_online_cpu(cpu) { > + while (skip--) > + cpu = cpumask_next(cpu, cpu_online_mask); > + if (cpu < num_possible_cpus()) > + cpumask_set_cpu(cpu, ); > + skip = vi->max_queue_pairs - 1; > + } > + netif_set_xps_queue(vi->dev, , i); > + } > + > vi->affinity_hint_set = true; > } > >
Re: [PATCH 5/6] treewide: use kv[mz]alloc* rather than opencoded variants
On 01/12/2017 04:37 PM, Michal Hocko wrote: > diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c > index 4f74511015b8..e6bbb33d2956 100644 > --- a/arch/s390/kvm/kvm-s390.c > +++ b/arch/s390/kvm/kvm-s390.c > @@ -1126,10 +1126,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct > kvm_s390_skeys *args) > if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX) > return -EINVAL; > > - keys = kmalloc_array(args->count, sizeof(uint8_t), > - GFP_KERNEL | __GFP_NOWARN); > - if (!keys) > - keys = vmalloc(sizeof(uint8_t) * args->count); > + keys = kvmalloc(args->count * sizeof(uint8_t), GFP_KERNEL); > if (!keys) > return -ENOMEM; > > @@ -1171,10 +1168,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct > kvm_s390_skeys *args) > if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX) > return -EINVAL; > > - keys = kmalloc_array(args->count, sizeof(uint8_t), > - GFP_KERNEL | __GFP_NOWARN); > - if (!keys) > - keys = vmalloc(sizeof(uint8_t) * args->count); > + keys = kvmalloc(sizeof(uint8_t) * args->count, GFP_KERNEL); > if (!keys) > return -ENOMEM; KVM/s390 parts Acked-by: Christian Borntraeger <borntrae...@de.ibm.com>
Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
On 11/25/2016 10:08 PM, Michael S. Tsirkin wrote: > On Fri, Nov 25, 2016 at 05:49:45PM +0100, Christian Borntraeger wrote: >> On 11/25/2016 05:17 PM, Peter Zijlstra wrote: >>> On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote: >>>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote: >>> >>>>> What are use cases for such primitive that won't be OK with "read once >>>>> _and_ atomically"? >>>> >>>> I have none to hand. >>> >>> Whatever triggers the __builtin_memcpy() paths, and even the size==8 >>> paths on 32bit. >>> >>> You could put a WARN in there to easily find them. >> >> There were several cases that I found during writing the *ONCE stuff. >> For example there are some 32bit ppc variants with 64bit PTEs. Some for >> others (I think sparc). And the mm/ code is perfectly fine with these >> PTE accesses being done NOT atomic. > > In that case do we even need _ONCE at all? Yes. For example look at gup_pmd_range. Here several checks are made on the pmd. It is important the the check for pmd_none is made on the same value than the check for pmd_trans_huge, but it is not important that the value is still up to date. And there are really cases where we cannot read the thing atomically, e.g. on m68k and sparc(32bit) pmd_t is defined as array of longs. Another problem is that a compiler can implement the following code as 2 memory reads (e.g. if you have compare instructions that work on memory) instead of a memory read and 2 compares int check(unsigned long *value_p) { unsigned long value = *value_p; if (condition_a(value)) return 1; if (condition_b(value)) return 2; return 3; } With READ_ONCE you forbid that. In past times you would have used barrier() after the assignment to achieve the same goal. > Are there assumptions these are two 32 bit reads? It depends on the code. Some places (e.g. in gup) assumes that the access via READ_ONCE is atomic (which it is for sane compilers as long as the pointer is <= word size). In some others places just one bit is tested. > > >> >>> >>> The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that >>> they compiletime validate this the size is 'right' and can runtime check >>> alignment constraints. >>> >>> IE, they are strictly stronger than {READ,WRITE}_ONCE(). >>> >
Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
On 11/25/2016 06:28 PM, Mark Rutland wrote: > On Fri, Nov 25, 2016 at 05:49:45PM +0100, Christian Borntraeger wrote: >> On 11/25/2016 05:17 PM, Peter Zijlstra wrote: >>> On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote: >>>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote: >>> >>>>> What are use cases for such primitive that won't be OK with "read once >>>>> _and_ atomically"? >>>> >>>> I have none to hand. >>> >>> Whatever triggers the __builtin_memcpy() paths, and even the size==8 >>> paths on 32bit. >>> >>> You could put a WARN in there to easily find them. >> >> There were several cases that I found during writing the *ONCE stuff. >> For example there are some 32bit ppc variants with 64bit PTEs. Some for >> others (I think sparc). > > We have similar on 32-bit ARM w/ LPAE. LPAE implies that a naturally > aligned 64-bit access is single-copy atomic, which is what makes that > ok. > >> And the mm/ code is perfectly fine with these PTE accesses being done >> NOT atomic. > > That strikes me as surprising. Is there some mutual exclusion that > prevents writes from occuring wherever a READ_ONCE() happens to a PTE? See for example mm/memory.c handle_pte_fault. ---snip /* * some architectures can have larger ptes than wordsize, * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee * atomic accesses. The code below just needs a consistent * view for the ifs and we later double check anyway with the * ptl lock held. So here a barrier will do. */ ---snip--- The trick is that the code only does a specific check, but all other accesses are under the pte lock.
Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
On 11/25/2016 05:17 PM, Peter Zijlstra wrote: > On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote: >> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote: > >>> What are use cases for such primitive that won't be OK with "read once >>> _and_ atomically"? >> >> I have none to hand. > > Whatever triggers the __builtin_memcpy() paths, and even the size==8 > paths on 32bit. > > You could put a WARN in there to easily find them. There were several cases that I found during writing the *ONCE stuff. For example there are some 32bit ppc variants with 64bit PTEs. Some for others (I think sparc). And the mm/ code is perfectly fine with these PTE accesses being done NOT atomic. > > The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that > they compiletime validate this the size is 'right' and can runtime check > alignment constraints. > > IE, they are strictly stronger than {READ,WRITE}_ONCE(). >
Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
On 11/25/2016 12:22 PM, Mark Rutland wrote: > On Thu, Nov 24, 2016 at 10:36:58PM +0200, Michael S. Tsirkin wrote: >> On Thu, Nov 24, 2016 at 10:25:11AM +, Mark Rutland wrote: >>> For several reasons, it would be beneficial to kill off ACCESS_ONCE() >>> tree-wide, in favour of {READ,WRITE}_ONCE(). These work with aggregate >>> types, >>> more obviously document their intended behaviour, and are necessary for >>> tools >>> like KTSAN to work correctly (as otherwise reads and writes cannot be >>> instrumented separately). >>> >>> While it's possible to script the bulk of this tree-wide conversion, some >>> cases >>> such as the virtio code, require some manual intervention. This series moves >>> the virtio and vringh code over to {READ,WRITE}_ONCE(), in the process >>> fixing a >>> bug in the virtio headers. >>> >>> Thanks, >>> Mark. >> >> I don't have a problem with this specific patchset. > > Good to hear. :) > > Does that mean you're happy to queue these patches? Or would you prefer > a new posting at some later point, with ack/review tags accumulated? > >> Though I really question the whole _ONCE APIs esp with >> aggregate types - these seem to generate a memcpy and >> an 8-byte read/writes sometimes, and I'm pretty sure this simply >> can't be read/written at once on all architectures. > > Yes, in cases where the access is larger than the machine can perform in > a single access, this will result in a memcpy. > > My understanding is that this has always been the case with > ACCESS_ONCE(), where multiple accesses were silently/implicitly > generated by the compiler. > > We could add some compile-time warnings for those cases. I'm not sure if > there's a reason we avoided doing that so far; perhaps Christian has a > some idea. My first version had this warning, but it was removed later on as requested by Linus http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02670.html ---snip--- Get rid of the f*cking size checks etc on READ_ONCE() and friends. They are about - wait for it - "reading a value once". Note how it doesn't say ANYTHING about "atomic" or anything like that. It's about reading *ONCE*. ---snip---
Re: [PATCH 2/3] vringh: kill off ACCESS_ONCE()
On 11/24/2016 11:25 AM, Mark Rutland wrote: > Despite living under drivers/ vringh.c is also used as part of the userspace > virtio tools. Before we can kill off the ACCESS_ONCE()definition in the tools, > we must convert vringh.c to use {READ,WRITE}_ONCE(). > > This patch does so, along with the required include of for > the relevant definitions. The userspace tools provide their own definitions in > their own . > > Signed-off-by: Mark Rutland <mark.rutl...@arm.com> > Cc: Jason Wang <jasow...@redhat.com> > Cc: Michael S. Tsirkin <m...@redhat.com> > Cc: k...@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: netdev@vger.kernel.org > Cc: virtualizat...@lists.linux-foundation.org > --- > drivers/vhost/vringh.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c > index 3bb02c6..bb8971f 100644 > --- a/drivers/vhost/vringh.c > +++ b/drivers/vhost/vringh.c > @@ -3,6 +3,7 @@ > * > * Since these may be in userspace, we use (inline) accessors. > */ > +#include > #include > #include > #include > @@ -820,13 +821,13 @@ EXPORT_SYMBOL(vringh_need_notify_user); > static inline int getu16_kern(const struct vringh *vrh, > u16 *val, const __virtio16 *p) > { > - *val = vringh16_to_cpu(vrh, ACCESS_ONCE(*p)); > + *val = vringh16_to_cpu(vrh, READ_ONCE(*p)); > return 0; > } > > static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 > val) > { > - ACCESS_ONCE(*p) = cpu_to_vringh16(vrh, val); > + WRITE_ONCE(*p, cpu_to_vringh16(vrh, val)); > return 0; > } > Makes sense Reviewed-by: Christian Borntraeger <borntrae...@de.ibm.com>
Re: [PATCH V3 3/3] vhost_net: basic polling support
On 02/26/2016 09:42 AM, Jason Wang wrote: > This patch tries to poll for new added tx buffer or socket receive > queue for a while at the end of tx/rx processing. The maximum time > spent on polling were specified through a new kind of vring ioctl. > > Signed-off-by: Jason Wang> --- > drivers/vhost/net.c| 79 > +++--- > drivers/vhost/vhost.c | 14 > drivers/vhost/vhost.h | 1 + > include/uapi/linux/vhost.h | 6 > 4 files changed, 95 insertions(+), 5 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 9eda69e..c91af93 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -287,6 +287,44 @@ static void vhost_zerocopy_callback(struct ubuf_info > *ubuf, bool success) > rcu_read_unlock_bh(); > } > > +static inline unsigned long busy_clock(void) > +{ > + return local_clock() >> 10; > +} > + > +static bool vhost_can_busy_poll(struct vhost_dev *dev, > + unsigned long endtime) > +{ > + return likely(!need_resched()) && > +likely(!time_after(busy_clock(), endtime)) && > +likely(!signal_pending(current)) && > +!vhost_has_work(dev) && > +single_task_running(); > +} > + > +static int vhost_net_tx_get_vq_desc(struct vhost_net *net, > + struct vhost_virtqueue *vq, > + struct iovec iov[], unsigned int iov_size, > + unsigned int *out_num, unsigned int *in_num) > +{ > + unsigned long uninitialized_var(endtime); > + int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), > + out_num, in_num, NULL, NULL); > + > + if (r == vq->num && vq->busyloop_timeout) { > + preempt_disable(); > + endtime = busy_clock() + vq->busyloop_timeout; > + while (vhost_can_busy_poll(vq->dev, endtime) && > +vhost_vq_avail_empty(vq->dev, vq)) > + cpu_relax(); Can you use cpu_relax_lowlatency (which should be the same as cpu_relax for almost everybody but s390? cpu_relax (without low latency might give up the time slice when running under another hypervisor (like LPAR on s390), which might not be what we want here. [...] > +static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) > +{ > + struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX]; > + struct vhost_virtqueue *vq = >vq; > + unsigned long uninitialized_var(endtime); > + int len = peek_head_len(sk); > + > + if (!len && vq->busyloop_timeout) { > + /* Both tx vq and rx socket were polled here */ > + mutex_lock(>mutex); > + vhost_disable_notify(>dev, vq); > + > + preempt_disable(); > + endtime = busy_clock() + vq->busyloop_timeout; > + > + while (vhost_can_busy_poll(>dev, endtime) && > +skb_queue_empty(>sk_receive_queue) && > +vhost_vq_avail_empty(>dev, vq)) > + cpu_relax(); here as well.
Re: [PATCH v2] vhost: replace % with & on data path
On 11/30/2015 10:15 AM, Michael S. Tsirkin wrote: > We know vring num is a power of 2, so use & > to mask the high bits. Makes a lot of sense and virtio_ring.c seems to use the same logic. Acked-by: Christian Borntraeger <borntrae...@de.ibm.com> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com> > --- > > Changes from v1: drop an unrelated chunk > > drivers/vhost/vhost.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 080422f..ad2146a 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1369,7 +1369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > /* Grab the next descriptor number they're advertising, and increment >* the index we've seen. */ > if (unlikely(__get_user(ring_head, > - >avail->ring[last_avail_idx % vq->num]))) { > + >avail->ring[last_avail_idx & (vq->num - > 1)]))) { > vq_err(vq, "Failed to read head: idx %d address %p\n", > last_avail_idx, > >avail->ring[last_avail_idx % vq->num]); > @@ -1489,7 +1489,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue > *vq, > u16 old, new; > int start; > > - start = vq->last_used_idx % vq->num; > + start = vq->last_used_idx & (vq->num - 1); > used = vq->used->ring + start; > if (count == 1) { > if (__put_user(heads[0].id, >id)) { > @@ -1531,7 +1531,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct > vring_used_elem *heads, > { > int start, n, r; > > - start = vq->last_used_idx % vq->num; > + start = vq->last_used_idx & (vq->num - 1); > n = vq->num - start; > if (n < count) { > r = __vhost_add_used_n(vq, heads, n); > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).
On 11/16/2015 02:25 PM, Jiri Pirko wrote: > Mon, Nov 16, 2015 at 02:19:55PM CET, mkube...@suse.cz wrote: >> On Mon, Nov 16, 2015 at 02:00:32PM +0100, Christian Borntraeger wrote: >>> >>> on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) >>> I get errors like: >>> >>> [ 10.940523] Ebtables v2.0 registered >>> [ 11.685609] bridge: automatic filtering via arp/ip/ip6tables has been >>> deprecated. Update your scripts to load br_netfilter if you need this. >>> [ 11.686252] virbr0-nic: set_features() failed (-1); wanted >>> 0x008048c1, left 0x0080001b48c9 >>> [ 11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic) >>> [ 11.794556] virbr0-nic: failed to set HW ageing time >>> [ 11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic) >>> [ 11.794644] virbr0-nic: set_features() failed (-1); wanted >>> 0x008048c1, left 0x0080001b48c9 >>> [ 11.794675] device virbr0-nic entered promiscuous mode >>> [ 11.794706] virbr0: set_features() failed (-1); wanted >>> 0x00801fdb78c9, left 0x00801fff78e9 >>> [ 11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team >>> [ 12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max) >>> [ 12.086880] virbr0: set_features() failed (-1); wanted >>> 0x00801fdb78c9, left 0x00801fff78e9 >>> [ 12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic) >>> [ 12.086926] virbr0-nic: failed to set HW ageing time >>> [ 12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic) >>> [ 12.086928] virbr0: port 1(virbr0-nic) entered listening state >>> [ 12.086934] virbr0: port 1(virbr0-nic) entered listening state >>> [ 12.087023] virbr0: set_features() failed (-1); wanted >>> 0x00801fdb78c9, left 0x00801fff78e9 >>> [ 12.087057] virbr0-nic: set_features() failed (-1); wanted >>> 0x008048c1, left 0x0080001b48c9 >>> [ 12.087088] virbr0: set_features() failed (-1); wanted >>> 0x00801fdb78c9, left 0x00801fff78e9 >>> [ 12.087122] virbr0-nic: set_features() failed (-1); wanted >>> 0x008048c1, left 0x0080001b48c9 >>> [ 12.087153] virbr0: set_features() failed (-1); wanted >>> 0x00801fdb78c9, left 0x00801fff78e9 >>> [ 12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic) >>> [ 12.102637] virbr0: port 1(virbr0-nic) entered disabled state >>> >> ... >>> >>> [ 260.414547] macvtap0: set_features() failed (-1); wanted >>> 0x0004001f5209, left 0x0004001f5a09 >>> [ 260.414555] macvtap0: set_features() failed (-1); wanted >>> 0x0004001f5209, left 0x0004001f5a09 >>> [ 260.541477] macvtap0: set_features() failed (-1); wanted >>> 0x0004001f5209, left 0x0004001f5a09 >>> [ 260.541532] macvtap0: set_features() failed (-1); wanted >>> 0x0004001f5209, left 0x0004001f5a09 >>> [ 260.541603] macvtap0: set_features() failed (-1); wanted >>> 0x0004001f5209, left 0x0004001f5a09 >>> [ 260.984503] device enccw0.0.f500 entered promiscuous mode >>> >>> Does somebody have an idea? >> >> Looks like the issue addressed by >> >> http://patchwork.ozlabs.org/patch/544307/ > > second issues (stp state) is fixed by: > http://patchwork.ozlabs.org/patch/544246/ Yes, Tested-by: Christian Borntraeger <borntrae...@de.ibm.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).
Jiri, on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) I get errors like: [ 10.940523] Ebtables v2.0 registered [ 11.685609] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this. [ 11.686252] virbr0-nic: set_features() failed (-1); wanted 0x008048c1, left 0x0080001b48c9 [ 11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic) [ 11.794556] virbr0-nic: failed to set HW ageing time [ 11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic) [ 11.794644] virbr0-nic: set_features() failed (-1); wanted 0x008048c1, left 0x0080001b48c9 [ 11.794675] device virbr0-nic entered promiscuous mode [ 11.794706] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, left 0x00801fff78e9 [ 11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team [ 12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max) [ 12.086880] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, left 0x00801fff78e9 [ 12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic) [ 12.086926] virbr0-nic: failed to set HW ageing time [ 12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic) [ 12.086928] virbr0: port 1(virbr0-nic) entered listening state [ 12.086934] virbr0: port 1(virbr0-nic) entered listening state [ 12.087023] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, left 0x00801fff78e9 [ 12.087057] virbr0-nic: set_features() failed (-1); wanted 0x008048c1, left 0x0080001b48c9 [ 12.087088] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, left 0x00801fff78e9 [ 12.087122] virbr0-nic: set_features() failed (-1); wanted 0x008048c1, left 0x0080001b48c9 [ 12.087153] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, left 0x00801fff78e9 [ 12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic) [ 12.102637] virbr0: port 1(virbr0-nic) entered disabled state bisecting _one_ of the problem (the error message change, but I said "bad" whenever there was at least one error) points to commit 0bc05d585d381c30de3fdf955730df31593d2101 Author: Jiri PirkoAuthorDate: Wed Oct 14 19:40:50 2015 +0200 Commit: David S. Miller CommitDate: Thu Oct 15 06:09:48 2015 -0700 switchdev: allow caller to explicitly request attr_set as deferred Macvtap has a similar issue on 4.4-rc1: [ 260.414547] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, left 0x0004001f5a09 [ 260.414555] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, left 0x0004001f5a09 [ 260.541477] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, left 0x0004001f5a09 [ 260.541532] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, left 0x0004001f5a09 [ 260.541603] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, left 0x0004001f5a09 [ 260.984503] device enccw0.0.f500 entered promiscuous mode Does somebody have an idea? Christian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).
On 11/16/2015 02:19 PM, Michal Kubecek wrote: > On Mon, Nov 16, 2015 at 02:00:32PM +0100, Christian Borntraeger wrote: >> >> on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) >> I get errors like: >> >> [ 10.940523] Ebtables v2.0 registered >> [ 11.685609] bridge: automatic filtering via arp/ip/ip6tables has been >> deprecated. Update your scripts to load br_netfilter if you need this. >> [ 11.686252] virbr0-nic: set_features() failed (-1); wanted >> 0x008048c1, left 0x0080001b48c9 >> [ 11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic) >> [ 11.794556] virbr0-nic: failed to set HW ageing time >> [ 11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic) >> [ 11.794644] virbr0-nic: set_features() failed (-1); wanted >> 0x008048c1, left 0x0080001b48c9 >> [ 11.794675] device virbr0-nic entered promiscuous mode >> [ 11.794706] virbr0: set_features() failed (-1); wanted >> 0x00801fdb78c9, left 0x00801fff78e9 >> [ 11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team >> [ 12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max) >> [ 12.086880] virbr0: set_features() failed (-1); wanted >> 0x00801fdb78c9, left 0x00801fff78e9 >> [ 12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic) >> [ 12.086926] virbr0-nic: failed to set HW ageing time >> [ 12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic) >> [ 12.086928] virbr0: port 1(virbr0-nic) entered listening state >> [ 12.086934] virbr0: port 1(virbr0-nic) entered listening state >> [ 12.087023] virbr0: set_features() failed (-1); wanted >> 0x00801fdb78c9, left 0x00801fff78e9 >> [ 12.087057] virbr0-nic: set_features() failed (-1); wanted >> 0x008048c1, left 0x0080001b48c9 >> [ 12.087088] virbr0: set_features() failed (-1); wanted >> 0x00801fdb78c9, left 0x00801fff78e9 >> [ 12.087122] virbr0-nic: set_features() failed (-1); wanted >> 0x008048c1, left 0x0080001b48c9 >> [ 12.087153] virbr0: set_features() failed (-1); wanted >> 0x00801fdb78c9, left 0x00801fff78e9 >> [ 12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic) >> [ 12.102637] virbr0: port 1(virbr0-nic) entered disabled state >> > ... >> >> [ 260.414547] macvtap0: set_features() failed (-1); wanted >> 0x0004001f5209, left 0x0004001f5a09 >> [ 260.414555] macvtap0: set_features() failed (-1); wanted >> 0x0004001f5209, left 0x0004001f5a09 >> [ 260.541477] macvtap0: set_features() failed (-1); wanted >> 0x0004001f5209, left 0x0004001f5a09 >> [ 260.541532] macvtap0: set_features() failed (-1); wanted >> 0x0004001f5209, left 0x0004001f5a09 >> [ 260.541603] macvtap0: set_features() failed (-1); wanted >> 0x0004001f5209, left 0x0004001f5a09 >> [ 260.984503] device enccw0.0.f500 entered promiscuous mode >> >> Does somebody have an idea? > > Looks like the issue addressed by > > http://patchwork.ozlabs.org/patch/544307/ Yes. The set_features error is indeed fixed. FWIW; Tested-by: Christian Borntraeger <borntrae...@de.ibm.com> The other error is independent. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/1] macvtap regression since 3.18
Michael, here is a fixup for macvtap. It was detected by running several different patterns via uperf over multiple network cards. Within some minutes, the network traffic stalled and Matt bisected it down to 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF"). Turns out that this patch broke sndbuf setting. I dont know exactly what went wrong in that testcase, but an earlier version of the patch did the trick and the testcase is now stable again. I was tempted to add another variable, but u,up,s,sp seem to have a meaning (signed,signed pointer) so I made u an unsigned int again. Long term we might want to refactor the code to have int sk_sendbuf; short ifr_flags; int vnet_hdr_sz; or something like that. But this rework would be to big for stable. Some of the casts that I added are not strictly necessary as get/put_user already do this, but better safe than sorry as we dont want to rely on the implementation of macros. Opinions? Christian Borntraeger (1): macvtap: Fix regression for macvtap ioctls drivers/net/macvtap.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) -- 2.3.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] macvtap: Fix regression for macvtap ioctls
To avoid overwriting the upper bits of the flags, commit 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF") changed the variable u from unsigned int to unsigned short and added some ORing logic for the flags. This introduced at least one regression: - TUNSETSNDBUF supports int as its size and also uses the now short u as buffer - this breaks any sendbuf size > 64k Let's change u back to unsigned int, keep the ORing and handle the overwrite issue with casts and masking. Cc: Michael S. Tsirkin <m...@redhat.com> Cc: David S. Miller <da...@davemloft.net> Reported-by: Mark A. Peloquin Bisected-by: Matthew Rosato <mjros...@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com> Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF") Cc: sta...@vger.kernel.org --- drivers/net/macvtap.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index edd7734..c33fe41 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -1060,7 +1060,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd, void __user *argp = (void __user *)arg; struct ifreq __user *ifr = argp; unsigned int __user *up = argp; - unsigned short u; + unsigned int u; int __user *sp = argp; struct sockaddr sa; int s; @@ -1076,7 +1076,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd, if ((u & ~MACVTAP_FEATURES) != (IFF_NO_PI | IFF_TAP)) ret = -EINVAL; else - q->flags = (q->flags & ~MACVTAP_FEATURES) | u; + q->flags = (q->flags & ~MACVTAP_FEATURES) | (short) u; return ret; @@ -1089,9 +1089,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd, } ret = 0; - u = q->flags; if (copy_to_user(>ifr_name, vlan->dev->name, IFNAMSIZ) || - put_user(u, >ifr_flags)) + put_user((short) q->flags, >ifr_flags)) ret = -EFAULT; macvtap_put_vlan(vlan); rtnl_unlock(); -- 2.3.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] macvtap: fix TUNSETSNDBUF values > 64k
Am 18.09.2015 um 12:41 schrieb Michael S. Tsirkin: > Upon TUNSETSNDBUF, macvtap reads the requested sndbuf size into > a local variable u. > commit 39ec7de7092b ("macvtap: fix uninitialized access on > TUNSETIFF") changed its type to u16 (which is the right thing to > do for all other macvtap ioctls), breaking all values > 64k. > > The value of TUNSETSNDBUF is actually a signed 32 bit integer, so > the right thing to do is to read it into an int. > > Cc: David S. Miller <da...@davemloft.net> > Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF") > Reported-by: Mark A. Peloquin > Bisected-by: Matthew Rosato <mjros...@linux.vnet.ibm.com> > Reported-by: Christian Borntraeger <borntrae...@de.ibm.com> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com> You can add Tested-by: Matthew Rosato <mjros...@linux.vnet.ibm.com> as this looks identical to an early version of my patch which was tested, by Matt. (I send you the other version that changes back u as I felt that u and up are named to identify unsigned) and please add Acked-by: Christian Borntraeger <borntrae...@de.ibm.com> what about CC: sta...@vger.kernel.org Christian > --- > > This patch probably makes sense on stable. > > drivers/net/macvtap.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c > index edd7734..248478c 100644 > --- a/drivers/net/macvtap.c > +++ b/drivers/net/macvtap.c > @@ -,10 +,10 @@ static long macvtap_ioctl(struct file *file, unsigned > int cmd, > return 0; > > case TUNSETSNDBUF: > - if (get_user(u, up)) > + if (get_user(s, sp)) > return -EFAULT; > > - q->sk.sk_sndbuf = u; > + q->sk.sk_sndbuf = s; > return 0; > > case TUNGETVNETHDRSZ: > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] tun, macvtap: higher order allocations for skbs
Am 18.06.2015 um 12:20 schrieb Michael S. Tsirkin: Needs more testing. Anyone see anything wrong with this? Can you explain the motivation? FWIW, basic networking between two guest over macvtap still seems to work on s390 so I dont see any obvious regression. Christian Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/net/macvtap.c | 2 +- drivers/net/tun.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 928f3f4..80e87e4 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -610,7 +610,7 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad, linear = len; skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, -err, 0); +err, 1); if (!skb) return NULL; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index cb376b2d..8f2f1e5 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1069,7 +1069,7 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile, linear = len; skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, -err, 0); +err, 1); if (!skb) return ERR_PTR(err); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend] virtio_net: Fix oops on early interrupts - introduced by virtio reset code
Am Montag, 11. Februar 2008 schrieb Anthony Liguori: The reset support is in Linus's tree so we should try to push it for -rc2. You are right. My repository was borked. will push it to Jeff Garzik. Thanks Jeff can you schedule this fix into your network driver updates? Thanks --- With the latest virtio_reset patches I got the following oops: Unable to handle kernel pointer dereference at virtual kernel address Oops: 0004 [#1] PREEMPT SMP Modules linked in: CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168 Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78) Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0001 0efd0e60 0001 0f866040 008de4c8 1237 1237 0f977dd8 0020 001132bc 0f977e08 0f977dd8 Krnl Code: 0047597c: e3104034 lg %r1,48(%r4) 00475982: b9040001 lgr %r0,%r1 00475986: b9810003 ogr %r0,%r3 0047598a: eb1040300030 csg %r1,%r0,48(%r4) 00475990: a744fff9 brc 4,475982 00475994: a7110001 tmll%r1,1 00475998: a7840009 brc 8,4759aa 0047599c: e340b0b80004 lg %r4,184(%r11) Call Trace: ([01500f978000] 0x1500f978000) [004779a6] vring_interrupt+0x72/0x88 [00491d9c] kvm_extint_handler+0x34/0x44 [0010d2d4] do_extint+0xc0/0xfc [00113b5a] ext_no_vtime+0x1c/0x20 [0010a0b6] cpu_idle+0x21a/0x230 ([0010a096] cpu_idle+0x1fa/0x230) [0057dfe4] start_secondary+0xa0/0xb4 We must initialize vdev-priv before we use the notify hypercall as vdev-priv is used in skb_recv_done. So lets move the assignment of vdev-priv before we call try_fill_recv. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] Acked-by: Anthony Liguori [EMAIL PROTECTED] --- drivers/net/virtio_net.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight); vi-dev = dev; vi-vdev = vdev; + vdev-priv = vi; /* We expect two virtqueues, receive then send. */ vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done); @@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d } pr_debug(virtnet: registered device %s\n, dev-name); - vdev-priv = vi; return 0; unregister: -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: Fix oops on early interrupts - introduced by virtio reset code
Am Montag, 11. Februar 2008 schrieb Anthony Liguori: The reset support is in Linus's tree so we should try to push it for -rc2. You are right. My repository was borked. will push it to Jeff Garzik. Thanks Jeff can you schedule this fix into your network driver updates? Thanks --- With the latest virtio_reset patches I got the following oops: Unable to handle kernel pointer dereference at virtual kernel address Oops: 0004 [#1] PREEMPT SMP Modules linked in: CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168 Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78) Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0001 0efd0e60 0001 0f866040 008de4c8 1237 1237 0f977dd8 0020 001132bc 0f977e08 0f977dd8 Krnl Code: 0047597c: e3104034 lg %r1,48(%r4) 00475982: b9040001 lgr %r0,%r1 00475986: b9810003 ogr %r0,%r3 0047598a: eb1040300030 csg %r1,%r0,48(%r4) 00475990: a744fff9 brc 4,475982 00475994: a7110001 tmll%r1,1 00475998: a7840009 brc 8,4759aa 0047599c: e340b0b80004 lg %r4,184(%r11) Call Trace: ([01500f978000] 0x1500f978000) [004779a6] vring_interrupt+0x72/0x88 [00491d9c] kvm_extint_handler+0x34/0x44 [0010d2d4] do_extint+0xc0/0xfc [00113b5a] ext_no_vtime+0x1c/0x20 [0010a0b6] cpu_idle+0x21a/0x230 ([0010a096] cpu_idle+0x1fa/0x230) [0057dfe4] start_secondary+0xa0/0xb4 We must initialize vdev-priv before we use the notify hypercall as vdev-priv is used in skb_recv_done. So lets move the assignment of vdev-priv before we call try_fill_recv. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] Acked-by: Anthony Liguori [EMAIL PROTECTED] --- drivers/net/virtio_net.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight); vi-dev = dev; vi-vdev = vdev; + vdev-priv = vi; /* We expect two virtqueues, receive then send. */ vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done); @@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d } pr_debug(virtnet: registered device %s\n, dev-name); - vdev-priv = vi; return 0; unregister: -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: Fix oops on early interrupts - introduced by virtio reset code
Avi, this fixes a problem that was introduced by the virtio_reset patches. Can you apply that fix to kvm.git as a bugfix, as the virtio_reset infrastructure is not on Linus upstream yet? Anthony, Dor, are you ok with that change? -- With the latest virtio_reset patches I got the following oops: Unable to handle kernel pointer dereference at virtual kernel address Oops: 0004 [#1] PREEMPT SMP Modules linked in: CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168 Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78) Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0001 0efd0e60 0001 0f866040 008de4c8 1237 1237 0f977dd8 0020 001132bc 0f977e08 0f977dd8 Krnl Code: 0047597c: e3104034 lg %r1,48(%r4) 00475982: b9040001 lgr %r0,%r1 00475986: b9810003 ogr %r0,%r3 0047598a: eb1040300030 csg %r1,%r0,48(%r4) 00475990: a744fff9 brc 4,475982 00475994: a7110001 tmll%r1,1 00475998: a7840009 brc 8,4759aa 0047599c: e340b0b80004 lg %r4,184(%r11) Call Trace: ([01500f978000] 0x1500f978000) [004779a6] vring_interrupt+0x72/0x88 [00491d9c] kvm_extint_handler+0x34/0x44 [0010d2d4] do_extint+0xc0/0xfc [00113b5a] ext_no_vtime+0x1c/0x20 [0010a0b6] cpu_idle+0x21a/0x230 We must initialize vdev-priv before we use the notify hypercall as vdev-priv is used in skb_recv_done. So lets move the assignment of vdev-priv before we call try_fill_recv. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- drivers/net/virtio_net.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight); vi-dev = dev; vi-vdev = vdev; + vdev-priv = vi; /* We expect two virtqueues, receive then send. */ vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done); @@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d } pr_debug(virtnet: registered device %s\n, dev-name); - vdev-priv = vi; return 0; unregister: -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: Fix open - interrupt race
I got the following oops during interface ifup. Unfortunately its not easily reproducable so I cant say for sure that my fix fixes this problem, but I am confident and I think its correct anyway: 2kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234! 4illegal operation: 0001 [#1] PREEMPT SMP 4Modules linked in: 4CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91 4Process swapper (pid: 0, task: 00800938, ksp: 0084ddb8) 4Krnl PSW : 040430018000 00466374 (vring_disable_cb+0x30/0x34) 4 R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3 4Krnl GPRS: 0001 0001 10003800 00466344 4 0e980900 008848b0 0084e748 4 0087b300 1237 1237 0f85bdd8 4 0e980920 001137c0 00464754 0f85bdd8 4Krnl Code: 00466368: e3b0b074lg %r11,112(%r11) 4 0046636e: 07febcr 15,%r14 4 00466370: a7f40001brc 15,466372 4 00466374: a7f4fff6brc 15,466360 4 00466378: eb7ff0500024stmg %r7,%r15,80(%r15) 4 0046637e: a7f13e00tmll%r15,15872 4 00466382: b90400eflgr %r14,%r15 4 00466386: a7840001brc 8,466388 4Call Trace: 4([000201500f85c000] 0x201500f85c000) 4 [00466556] vring_interrupt+0x72/0x88 4 [004801a0] kvm_extint_handler+0x34/0x44 4 [0010d22c] do_extint+0xbc/0xf8 4 [00113f98] ext_no_vtime+0x16/0x1a 4 [0010a182] cpu_idle+0x216/0x238 4([0010a162] cpu_idle+0x1f6/0x238) 4 [00568656] rest_init+0xaa/0xb8 4 [0084ee2c] start_kernel+0x3fc/0x490 4 [00100020] _stext+0x20/0x80 4 4 0Kernel panic - not syncing: Fatal exception in interrupt 4 After looking at the code and the dump I think the following scenario happened: Ifup was running on cpu2 and the interrupt arrived on cpu0. Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was called by vring_interrupt, executed netif_rx_schedule_prep, which succeeded and therefore called disable_cb. This triggered the BUG_ON, as interrupts were already disabled by cpu 2. I think the proper solution is to make the call to disable_cb depend on the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep in the same way as skb_recv_done. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- drivers/net/virtio_net.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -321,10 +321,12 @@ static int virtnet_open(struct net_devic /* If all buffers were filled by other side before we napi_enabled, we * won't get another interrupt, so process any outstanding packets -* now. virtnet_poll wants re-enable the queue, so we disable here. */ - vi-rvq-vq_ops-disable_cb(vi-rvq); - netif_rx_schedule(vi-dev, vi-napi); - +* now. virtnet_poll wants re-enable the queue, so we disable here. +* We synchronize against interrupts via NAPI_STATE_SCHED */ + if (netif_rx_schedule_prep(dev, vi-napi)) { + vi-rvq-vq_ops-disable_cb(vi-rvq); + __netif_rx_schedule(dev, vi-napi); + } return 0; } -- IBM Deutschland Entwicklung GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschäftsführung: Herbert Kircher Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: Fix open - interrupt race
Jeff, Rusty is (supposed to be) on vacation. Can you send this fix against virtio_net with your next network driver fixes for 2.6.25? Thank you - I got the following oops during interface ifup. Unfortunately its not easily reproducable so I cant say for sure that my fix fixes this problem, but I am confident and I think its correct anyway: 2kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234! 4illegal operation: 0001 [#1] PREEMPT SMP 4Modules linked in: 4CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91 4Process swapper (pid: 0, task: 00800938, ksp: 0084ddb8) 4Krnl PSW : 040430018000 00466374 (vring_disable_cb+0x30/0x34) 4 R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3 4Krnl GPRS: 0001 0001 10003800 00466344 4 0e980900 008848b0 0084e748 4 0087b300 1237 1237 0f85bdd8 4 0e980920 001137c0 00464754 0f85bdd8 4Krnl Code: 00466368: e3b0b074lg %r11,112(%r11) 4 0046636e: 07febcr 15,%r14 4 00466370: a7f40001brc 15,466372 4 00466374: a7f4fff6brc 15,466360 4 00466378: eb7ff0500024stmg %r7,%r15,80(%r15) 4 0046637e: a7f13e00tmll%r15,15872 4 00466382: b90400eflgr %r14,%r15 4 00466386: a7840001brc 8,466388 4Call Trace: 4([000201500f85c000] 0x201500f85c000) 4 [00466556] vring_interrupt+0x72/0x88 4 [004801a0] kvm_extint_handler+0x34/0x44 4 [0010d22c] do_extint+0xbc/0xf8 4 [00113f98] ext_no_vtime+0x16/0x1a 4 [0010a182] cpu_idle+0x216/0x238 4([0010a162] cpu_idle+0x1f6/0x238) 4 [00568656] rest_init+0xaa/0xb8 4 [0084ee2c] start_kernel+0x3fc/0x490 4 [00100020] _stext+0x20/0x80 4 4 0Kernel panic - not syncing: Fatal exception in interrupt 4 After looking at the code and the dump I think the following scenario happened: Ifup was running on cpu2 and the interrupt arrived on cpu0. Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was called by vring_interrupt, executed netif_rx_schedule_prep, which succeeded and therefore called disable_cb. This triggered the BUG_ON, as interrupts were already disabled by cpu 2. I think the proper solution is to make the call to disable_cb depend on the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep in the same way as skb_recv_done. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] Acked-by: Rusty Russell [EMAIL PROTECTED] --- drivers/net/virtio_net.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -321,10 +321,12 @@ static int virtnet_open(struct net_devic /* If all buffers were filled by other side before we napi_enabled, we * won't get another interrupt, so process any outstanding packets -* now. virtnet_poll wants re-enable the queue, so we disable here. */ - vi-rvq-vq_ops-disable_cb(vi-rvq); - netif_rx_schedule(vi-dev, vi-napi); - +* now. virtnet_poll wants re-enable the queue, so we disable here. +* We synchronize against interrupts via NAPI_STATE_SCHED */ + if (netif_rx_schedule_prep(dev, vi-napi)) { + vi-rvq-vq_ops-disable_cb(vi-rvq); + __netif_rx_schedule(dev, vi-napi); + } return 0; } -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio_net and SMP guests
Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger: Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger: Am Dienstag, 18. Dezember 2007 schrieb Rusty Russell: To me this points to doing interrupt suppression a different way. If we have a -disable_cb() virtio function, and call it before we call netif_rx_schedule, does that fix it? The fix looks good and I agree with it. There is one problem that I try to find for some days, but the following BUG_ON triggers: static void vring_disable_cb(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); START_USE(vq); BUG_ON(vq-vring.avail-flags VRING_AVAIL_F_NO_INTERRUPT); vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; END_USE(vq); } Ok, I found it: static int virtnet_open(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); try_fill_recv(vi); /* If we didn't even get one input buffer, we're useless. */ if (vi-num == 0) return -ENOMEM; --- int for new packet static void skb_recv_done(struct virtqueue *rvq) { struct virtnet_info *vi = rvq-vdev-priv; /* Suppress further interrupts. */ rvq-vq_ops-disable_cb(rvq); netif_rx_schedule(vi-dev, vi-napi); } - poll is not yet possible, no softirq - return from interrupt napi_enable(vi-napi); vi-rvq-vq_ops-disable_cb(vi-rvq); --- BUG: its already disabled Btw. this problem also happens on single processor guests. What about the following patch: --- drivers/net/virtio_net.c |9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -179,9 +179,12 @@ static void try_fill_recv(struct virtnet static void skb_recv_done(struct virtqueue *rvq) { struct virtnet_info *vi = rvq-vdev-priv; - /* Suppress further interrupts. */ - rvq-vq_ops-disable_cb(rvq); - netif_rx_schedule(vi-dev, vi-napi); + /* Schedule NAPI, Suppress further interrupts if successful. */ + + if (netif_rx_schedule_prep(vi-dev, vi-napi)) { + rvq-vq_ops-disable_cb(rvq); + __netif_rx_schedule(vi-dev, vi-napi); + } } static int virtnet_poll(struct napi_struct *napi, int budget) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
Am Donnerstag, 13. Dezember 2007 schrieb Dor Laor: You're right I got confused somehow. So in that case setting the driver status field on open in addition to your enable will do the trick. On DRIVER_OPEN the host will trigger an interrupt if the queue is not empty.. Thanks, Dor After looking into some other drivers, I prefer my first patch (moving napi_enable) ;-) There are some drivers like xen-netfront, b44, which call napi_enable before the buffers are passed to the hardware. So it seems that moving napi is also a valid option. But maybe I can just wait until Rusty returns from vacation (I will leave next week) so everything might be wonderful when I return ;-) Rusty, if you decide to apply my patch, there is one downside: The debugging code in virtio_ring sometimes triggers with a false positive: try_fill_recv calls vring_kick. Here we do a notify to the host. This might cause an immediate interrupt, triggering the poll routine before vring_kick can run END_USE. This is no real problem, as we dont touch the vq after notify. Christian, facing 64 guest cpus unconvering all kind of races -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor: This is why initially I suggested another status code in order to split the ring logic with driver status. but also not filling any buffers as long as VIRTIO_CONFIG_DEV_OPEN is not set. I will have a look but I think that add_status needs to be called It can fill the buffers even without dev_open, when the dev is finally opened the host will issue an interrupt if there are pending buffers. There is a problem associated with that scheme. Setting the value in the config space does not notify the hypervisor. That means the host will not send an interrupt. The interrupt is sent if the host tries to send the next packet. If somehow the host manages to use up all buffers before the device is finally opened, we have a problem. (I'm not sure it's worth solving, maybe just drop them like you suggested). As said above, dropping seems to me the preferred method. Did I miss something? Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
Am Mittwoch, 12. Dezember 2007 schrieb Rusty Russell: On Wednesday 12 December 2007 00:16:12 Christian Borntraeger wrote: That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in virtio_ring.c - maybe we can use that. Its hidden in callback and restart handling, what about adding an explicit startup? Yes, I debated whether to make this a separate hook or not; the current method reduces the number of function calls without having two ways of disabling callbacks. In this case, simply starting devices with callbacks disabled and renaming 'restart' to 'enable' (or something) and calling it at the beginning is probably sufficient? So you suggest something like the following patch? It seems to work but there is still a theoretical race at startup. Therefore, I tend to agree with Dor that a separate hook seems prefereable, so I am not fully sure if the patch is the final solution: ps: Its ok to answer that after your vacation. --- drivers/block/virtio_blk.c |3 ++- drivers/net/virtio_net.c |5 - drivers/virtio/virtio_ring.c |9 - include/linux/virtio.h |4 ++-- 4 files changed, 12 insertions(+), 9 deletions(-) Index: kvm/drivers/virtio/virtio_ring.c === --- kvm.orig/drivers/virtio/virtio_ring.c +++ kvm/drivers/virtio/virtio_ring.c @@ -220,7 +220,7 @@ static void *vring_get_buf(struct virtqu return ret; } -static bool vring_restart(struct virtqueue *_vq) +static bool vring_enable(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); @@ -264,7 +264,7 @@ static struct virtqueue_ops vring_vq_ops .add_buf = vring_add_buf, .get_buf = vring_get_buf, .kick = vring_kick, - .restart = vring_restart, + .enable = vring_enable, .shutdown = vring_shutdown, }; @@ -299,9 +299,8 @@ struct virtqueue *vring_new_virtqueue(un vq-in_use = false; #endif - /* No callback? Tell other side not to bother us. */ - if (!callback) - vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; + /* disable interrupts until we enable them */ + vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; /* Put everything in free lists. */ vq-num_free = num; Index: kvm/include/linux/virtio.h === --- kvm.orig/include/linux/virtio.h +++ kvm/include/linux/virtio.h @@ -41,7 +41,7 @@ struct virtqueue * vq: the struct virtqueue we're talking about. * len: the length written into the buffer * Returns NULL or the data token handed to add_buf. - * @restart: restart callbacks after callback returned false. + * @enable: restart callbacks after callback returned false. * vq: the struct virtqueue we're talking about. * This returns false (and doesn't re-enable) if there are pending * buffers in the queue, to avoid a race. @@ -65,7 +65,7 @@ struct virtqueue_ops { void *(*get_buf)(struct virtqueue *vq, unsigned int *len); - bool (*restart)(struct virtqueue *vq); + bool (*enable)(struct virtqueue *vq); void (*shutdown)(struct virtqueue *vq); }; Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -201,7 +201,7 @@ again: /* Out of packets? */ if (received budget) { netif_rx_complete(vi-dev, napi); - if (unlikely(!vi-rvq-vq_ops-restart(vi-rvq)) + if (unlikely(!vi-rvq-vq_ops-enable(vi-rvq)) netif_rx_reschedule(vi-dev, napi)) goto again; } @@ -292,6 +292,9 @@ static int virtnet_open(struct net_devic return -ENOMEM; napi_enable(vi-napi); + + vi-rvq-vq_ops-enable(vi-rvq); + vi-svq-vq_ops-enable(vi-svq); return 0; } Index: kvm/drivers/block/virtio_blk.c === --- kvm.orig/drivers/block/virtio_blk.c +++ kvm/drivers/block/virtio_blk.c @@ -183,7 +183,8 @@ static int virtblk_probe(struct virtio_d err = PTR_ERR(vblk-vq); goto out_free_vblk; } - + /* enable interrupts */ + vblk-vq-vq_ops-enable(vblk-vq); vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req)); if (!vblk-pool) { err = -ENOMEM; -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each transmit
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor: --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -406,10 +405,10 @@ again: Hmm, while I agree in general with the patch, I fail to find the proper version of virtio_net where this patch applies. I tried kvm.git and linux-2.6.git from kernel.org. Can you give me a pointer to the repository where you work on virtio? Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each transmit
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor: Christian Borntraeger wrote: Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor: --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -406,10 +405,10 @@ again: Hmm, while I agree in general with the patch, I fail to find the proper version of virtio_net where this patch applies. I tried kvm.git and linux-2.6.git from kernel.org. Can you give me a pointer to the repository where you work on virtio? Sorry for that, I added some debug prints of my one. Here it is: *git clone git*://kvm.*qumranet*.com/home/*dor*/src/linux-2.6-nv use branch 'virtio'. Ah, ok. I will look into that branch. BTW: what git repository do you use? I use Avis git from kernel.org: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor: I think the change below handles the race. Otherwise please detail the use case. [...] @@ -292,6 +292,9 @@ static int virtnet_open(struct net_devic return -ENOMEM; napi_enable(vi-napi); + + vi-rvq-vq_ops-enable(vi-rvq); + vi-svq-vq_ops-enable(vi-svq); If you change it to: if (!vi-rvq-vq_ops-enable(vi-rvq)) vi-rvq-vq_ops-kick(vi-rvq); if (!vi-rvq-vq_ops-enable(vi-svq)) vi-rvq-vq_ops-kick(vi-svq); You solve the race of packets already waiting in the queue without triggering the irq. Hmm, I dont fully understand your point. I think this will work as long as the host has not consumed all inbound buffers. It will also require that the host sends an additional packet, no? If no additional packet comes the host has no reason to send an interrupt just because it got a notify hypercall. kick inside a guest also does not trigger the poll routine. It also wont work on the following scenario: in virtnet open we will allocate buffers and send them to the host using the kick callback. The host can now use _all_ buffers for incoming data while interrupts are still disabled and the guest is not running.( Lets say the host bridge has lots of multicast traffic and the guest gets not scheduled for a while). When the guest now continues and enables the interrupts nothing happens. Doing a kick does not help, as the host code will bail out with no dma memory for transfer. Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend] virtio_net: remove double ether_setup
Hello Rusty, this is a small fix for virtio_net. virtnet_probe already calls alloc_etherdev, which calls ether_setup. There is no need to do that again. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- drivers/net/virtio_net.c |1 - 1 file changed, 1 deletion(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -329,11 +329,10 @@ static int virtnet_probe(struct virtio_d dev = alloc_etherdev(sizeof(struct virtnet_info)); if (!dev) return -ENOMEM; /* Set up network device as normal. */ - ether_setup(dev); dev-open = virtnet_open; dev-stop = virtnet_close; dev-hard_start_xmit = start_xmit; dev-features = NETIF_F_HIGHDMA; SET_NETDEV_DEV(dev, vdev-dev); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resent] virtio_net: Fix stalled inbound traffic on early packets
Hello Rusty, while implementing and testing virtio on s390 I found a problem in virtio_net: The current virtio_net driver has a startup race, which prevents any incoming traffic: If try_fill_recv submits buffers to the host system data might be filled in and an interrupt is sent, before napi_enable finishes. In that case the interrupt will kick skb_recv_done which will then call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED is set - which is not as we did not run napi_enable. No poll routine is scheduled. Furthermore, skb_recv_done returns false, we disables interrupts for this device. One solution is the enable napi before inbound buffer are available. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- drivers/net/virtio_net.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -285,13 +285,15 @@ static int virtnet_open(struct net_devic { struct virtnet_info *vi = netdev_priv(dev); + napi_enable(vi-napi); try_fill_recv(vi); /* If we didn't even get one input buffer, we're useless. */ - if (vi-num == 0) + if (vi-num == 0) { + napi_disable(vi-napi); return -ENOMEM; + } - napi_enable(vi-napi); return 0; } -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
2nd try. I somehow enable html on the last post Dor Laor wrote: Christian Borntraeger wrote: Hello Rusty, while implementing and testing virtio on s390 I found a problem in virtio_net: The current virtio_net driver has a startup race, which prevents any incoming traffic: If try_fill_recv submits buffers to the host system data might be filled in and an interrupt is sent, before napi_enable finishes. In that case the interrupt will kick skb_recv_done which will then call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED is set - which is not as we did not run napi_enable. No poll routine is scheduled. Furthermore, skb_recv_done returns false, we disables interrupts for this device. One solution is the enable napi before inbound buffer are available. But then you might get recv interrupt without a buffer. If I look at the current implementation (lguest) no interrupt is sent if there is not buffer available, no? On the other hand, if the host does send an interrupt when no buffer is available, this is also no problem. Looking at virtnet_poll, it seems it can cope with an empty ring. The way other physical NICs doing it is by dis/en/abling interrupt using registers (look at e1000). I suggest we can export add_status and use the original code but before enabling napi add a call to add_status(dev, VIRTIO_CONFIG_DEV_OPEN). The host won't trigger an irq until it sees the above. That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in virtio_ring.c - maybe we can use that. Its hidden in callback and restart handling, what about adding an explicit startup? BTW: Rusty is on vacation and that's probably the reason he didn't respond. Regards, Dor. Ok, didnt know that. At the moment I can live with my private patch while we work on a final solution. Meanwhile I will try to debug virtio on SMP guests - I still see some strange races on our test system. (But block and net is now working on s390 and can cope with medium load. ) Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets
Am Dienstag, 11. Dezember 2007 schrieb Christian Borntraeger: The way other physical NICs doing it is by dis/en/abling interrupt using registers (look at e1000). I suggest we can export add_status and use the original code but before enabling napi add a call to add_status(dev, VIRTIO_CONFIG_DEV_OPEN). The host won't trigger an irq until it sees the above. That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in virtio_ring.c - maybe we can use that. Its hidden in callback and restart handling, what about adding an explicit startup? Ok, just to give an example what I thought about: --- drivers/block/virtio_blk.c |3 ++- drivers/net/virtio_net.c |2 ++ drivers/virtio/virtio_ring.c | 16 +--- include/linux/virtio.h |5 + 4 files changed, 22 insertions(+), 4 deletions(-) Index: kvm/drivers/virtio/virtio_ring.c === --- kvm.orig/drivers/virtio/virtio_ring.c +++ kvm/drivers/virtio/virtio_ring.c @@ -241,6 +241,16 @@ static bool vring_restart(struct virtque return true; } +static void vring_startup(struct virtqueue *_vq) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + START_USE(vq); + if (vq-vq.callback) + vq-vring.avail-flags = ~VRING_AVAIL_F_NO_INTERRUPT; + END_USE(vq); +} + + irqreturn_t vring_interrupt(int irq, void *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); @@ -265,6 +275,7 @@ static struct virtqueue_ops vring_vq_ops .get_buf = vring_get_buf, .kick = vring_kick, .restart = vring_restart, + .startup = vring_startup, .shutdown = vring_shutdown, }; @@ -299,9 +310,8 @@ struct virtqueue *vring_new_virtqueue(un vq-in_use = false; #endif - /* No callback? Tell other side not to bother us. */ - if (!callback) - vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; + /* disable interrupts until we enable them */ + vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT; /* Put everything in free lists. */ vq-num_free = num; Index: kvm/include/linux/virtio.h === --- kvm.orig/include/linux/virtio.h +++ kvm/include/linux/virtio.h @@ -45,6 +45,9 @@ struct virtqueue * vq: the struct virtqueue we're talking about. * This returns false (and doesn't re-enable) if there are pending * buffers in the queue, to avoid a race. + * @startup: enable callbacks + * vq: the struct virtqueue we're talking abount + * Returns 0 or an error * @shutdown: unadd all buffers. * vq: the struct virtqueue we're talking about. * Remove everything from the queue. @@ -67,6 +70,8 @@ struct virtqueue_ops { bool (*restart)(struct virtqueue *vq); + void (*startup) (struct virtqueue *vq); + void (*shutdown)(struct virtqueue *vq); }; Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -292,6 +292,8 @@ static int virtnet_open(struct net_devic return -ENOMEM; napi_enable(vi-napi); + + vi-rvq-vq_ops-startup(vi-rvq); return 0; } Index: kvm/drivers/block/virtio_blk.c === --- kvm.orig/drivers/block/virtio_blk.c +++ kvm/drivers/block/virtio_blk.c @@ -183,7 +183,8 @@ static int virtblk_probe(struct virtio_d err = PTR_ERR(vblk-vq); goto out_free_vblk; } - + /* enable interrupts */ + vblk-vq-vq_ops-startup(vblk-vq); vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req)); if (!vblk-pool) { err = -ENOMEM; There is still one small problem: what if the host fills up all host-to-guest buffers before we call startup? So I start to think that your solution is better, given that the host is not only not sending interrupts but also not filling any buffers as long as VIRTIO_CONFIG_DEV_OPEN is not set. I will have a look but I think that add_status needs to be called after napi_enable, otherwise we have the same race. Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: remove double ether_setup
Am Mittwoch, 12. Dezember 2007 schrieb Rusty Russell: Can you send straight to akpm or davem? I'm supposed to be on vacation at this is a small fix for virtio_net. virtnet_probe already calls alloc_etherdev, which calls ether_setup. There is no need to do that again. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] Acked-by: Rusty Russell [EMAIL PROTECTED] --- drivers/net/virtio_net.c |1 - 1 file changed, 1 deletion(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -329,11 +329,10 @@ static int virtnet_probe(struct virtio_d dev = alloc_etherdev(sizeof(struct virtnet_info)); if (!dev) return -ENOMEM; /* Set up network device as normal. */ - ether_setup(dev); dev-open = virtnet_open; dev-stop = virtnet_close; dev-hard_start_xmit = start_xmit; dev-features = NETIF_F_HIGHDMA; SET_NETDEV_DEV(dev, vdev-dev); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio_net: Fix stalled inbound traffic on early packets
The current virtio_net driver has a startup race, which prevents any incoming traffic: If try_fill_recv submits buffers to the host system data might be filled in and an interrupt is sent, before napi_enable finishes. In that case the interrupt will kick skb_recv_done which will then call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED is set - which is not as we did not run napi_enable. No poll routine is scheduled. Furthermore, skb_recv_done returns false, we disables interrupts for this device. One solution is the enable napi before inbound buffer are available. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- drivers/net/virtio_net.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Index: kvm/drivers/net/virtio_net.c === --- kvm.orig/drivers/net/virtio_net.c +++ kvm/drivers/net/virtio_net.c @@ -285,13 +285,15 @@ static int virtnet_open(struct net_devic { struct virtnet_info *vi = netdev_priv(dev); + napi_enable(vi-napi); try_fill_recv(vi); /* If we didn't even get one input buffer, we're useless. */ - if (vi-num == 0) + if (vi-num == 0) { + napi_disable(vi-napi); return -ENOMEM; + } - napi_enable(vi-napi); return 0; } -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio: free transmit skbs when notified, not on next xmit.
Am Montag, 19. November 2007 schrieb Rusty Russell: This fixes a potential dangling xmit problem. We also suppress refill interrupts until we need them. (Anthony and I have been working on performance recently, so this is a WIP). Signed-off-by: Rusty Russell [EMAIL PROTECTED] Thanks Rusty, that was the 2nd next item on my todo list :-) Acked-by: Christian Borntraeger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] IPVS: Fix sysctl warnings about missing strategy
Running the latest git code I get the following messages during boot: sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy I removed the binary sysctl handler for those messages and also removed the definitions in ip_vs.h. The alternative would be to implement a proper strategy handler, but syscall sysctl is deprecated. There are other sysctl definitions that are commented out or work with the default sysctl_data strategy. I did not touch these. Eric, IPVS team, are you ok with that change? CC: Eric W. Biederman [EMAIL PROTECTED] CC: Wensong Zhang [EMAIL PROTECTED] CC: Simon Horman [EMAIL PROTECTED] CC: Julian Anastasov [EMAIL PROTECTED] Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- include/net/ip_vs.h |4 kernel/sysctl_check.c |4 net/ipv4/ipvs/ip_vs_ctl.c |4 3 files changed, 12 deletions(-) Index: linux-2.6/include/net/ip_vs.h === --- linux-2.6.orig/include/net/ip_vs.h +++ linux-2.6/include/net/ip_vs.h @@ -336,9 +336,6 @@ enum { NET_IPV4_VS_DEBUG_LEVEL=1, NET_IPV4_VS_AMEMTHRESH=2, NET_IPV4_VS_AMDROPRATE=3, - NET_IPV4_VS_DROP_ENTRY=4, - NET_IPV4_VS_DROP_PACKET=5, - NET_IPV4_VS_SECURE_TCP=6, NET_IPV4_VS_TO_ES=7, NET_IPV4_VS_TO_SS=8, NET_IPV4_VS_TO_SR=9, @@ -355,7 +352,6 @@ enum { NET_IPV4_VS_LBLCR_EXPIRE=20, NET_IPV4_VS_CACHE_BYPASS=22, NET_IPV4_VS_EXPIRE_NODEST_CONN=23, - NET_IPV4_VS_SYNC_THRESHOLD=24, NET_IPV4_VS_NAT_ICMP_SEND=25, NET_IPV4_VS_EXPIRE_QUIESCENT_TEMPLATE=26, NET_IPV4_VS_LAST Index: linux-2.6/net/ipv4/ipvs/ip_vs_ctl.c === --- linux-2.6.orig/net/ipv4/ipvs/ip_vs_ctl.c +++ linux-2.6/net/ipv4/ipvs/ip_vs_ctl.c @@ -1451,7 +1451,6 @@ static struct ctl_table vs_vars[] = { .proc_handler = proc_dointvec, }, { - .ctl_name = NET_IPV4_VS_DROP_ENTRY, .procname = drop_entry, .data = sysctl_ip_vs_drop_entry, .maxlen = sizeof(int), @@ -1459,7 +1458,6 @@ static struct ctl_table vs_vars[] = { .proc_handler = proc_do_defense_mode, }, { - .ctl_name = NET_IPV4_VS_DROP_PACKET, .procname = drop_packet, .data = sysctl_ip_vs_drop_packet, .maxlen = sizeof(int), @@ -1467,7 +1465,6 @@ static struct ctl_table vs_vars[] = { .proc_handler = proc_do_defense_mode, }, { - .ctl_name = NET_IPV4_VS_SECURE_TCP, .procname = secure_tcp, .data = sysctl_ip_vs_secure_tcp, .maxlen = sizeof(int), @@ -1597,7 +1594,6 @@ static struct ctl_table vs_vars[] = { .proc_handler = proc_dointvec, }, { - .ctl_name = NET_IPV4_VS_SYNC_THRESHOLD, .procname = sync_threshold, .data = sysctl_ip_vs_sync_threshold, .maxlen = sizeof(sysctl_ip_vs_sync_threshold), Index: linux-2.6/kernel/sysctl_check.c === --- linux-2.6.orig/kernel/sysctl_check.c +++ linux-2.6/kernel/sysctl_check.c @@ -242,9 +242,6 @@ static struct trans_ctl_table trans_net_ { NET_IPV4_VS_AMEMTHRESH, amemthresh }, { NET_IPV4_VS_DEBUG_LEVEL, debug_level }, { NET_IPV4_VS_AMDROPRATE, am_droprate }, - { NET_IPV4_VS_DROP_ENTRY, drop_entry }, - { NET_IPV4_VS_DROP_PACKET, drop_packet }, - { NET_IPV4_VS_SECURE_TCP, secure_tcp }, { NET_IPV4_VS_TO_ES,timeout_established }, { NET_IPV4_VS_TO_SS,timeout_synsent }, { NET_IPV4_VS_TO_SR,timeout_synrecv }, @@ -260,7 +257,6 @@ static struct trans_ctl_table trans_net_ { NET_IPV4_VS_CACHE_BYPASS, cache_bypass }, { NET_IPV4_VS_EXPIRE_NODEST_CONN, expire_nodest_conn }, { NET_IPV4_VS_EXPIRE_QUIESCENT_TEMPLATE, expire_quiescent_template }, - { NET_IPV4_VS_SYNC_THRESHOLD, sync_threshold }, { NET_IPV4_VS_NAT_ICMP_SEND,nat_icmp_send }, { NET_IPV4_VS_LBLC_EXPIRE, lblc_expiration }, { NET_IPV4_VS_LBLCR_EXPIRE, lblcr_expiration }, - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http
[PATCH] note that NETIF_F_LLTX is deprecated (was: [kvm-devel][PATCH 3/6] virtio net driver)
Am Freitag, 21. September 2007 schrieb Herbert Xu: Please don't use LLTX in new drivers. We're trying to get rid of it since it's 1) unnecessary; 2) causes problems with AF_PACKET seeing things twice. I suggest to document that LLTX is deprecated. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] --- Documentation/networking/netdevices.txt |3 ++- include/linux/netdevice.h |3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6/include/linux/netdevice.h === --- linux-2.6.orig/include/linux/netdevice.h +++ linux-2.6/include/linux/netdevice.h @@ -339,7 +339,8 @@ struct net_device #define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */ #define NETIF_F_VLAN_CHALLENGED1024/* Device cannot handle VLAN packets */ #define NETIF_F_GSO2048/* Enable software GSO. */ -#define NETIF_F_LLTX 4096/* LockLess TX */ +#define NETIF_F_LLTX 4096/* LockLess TX - deprecated. Please */ + /* do not use LLTX in new drivers */ #define NETIF_F_MULTI_QUEUE16384 /* Has multiple TX/RX queues */ /* Segmentation offload features */ Index: linux-2.6/Documentation/networking/netdevices.txt === --- linux-2.6.orig/Documentation/networking/netdevices.txt +++ linux-2.6/Documentation/networking/netdevices.txt @@ -73,7 +73,8 @@ dev-hard_start_xmit: has to lock by itself when needed. It is recommended to use a try lock for this and return NETDEV_TX_LOCKED when the spin lock fails. The locking there should also properly protect against - set_multicast_list. + set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated. + Dont use it for new drivers. Context: Process with BHs disabled or BH (timer), will be called with interrupts disabled by netconsole. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet device (v2.1)
Am Mittwoch, 11. Juli 2007 schrieb Pavel Emelianov: drivers/net/veth.c | 452 include/net/veth.h | 14 + I know, I am late in the game, but wont the name collide somewhat with drivers/net/ibmveth.h, drivers/net/iseries_veth.c, and drivers/net/ibmveth.c? Christian - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] ehea: interface to network stack
, + cqe-wr_id)] = NULL; + dev_kfree_skb(skb); + } + } + cqe = ehea_poll_rq1(qp, wqe_index); + } + + dev-quota -= processed; + *budget -= processed; + + port_res-p_state.ehea_poll += 1; + + port_res-rx_packets += processed; + + ehea_refill_rq1(port_res, last_wqe_index, processed_RQ1); + ehea_refill_rq2(port_res, processed_RQ2); + ehea_refill_rq3(port_res, processed_RQ3); + + intreq = ((port_res-p_state.ehea_poll 0xF) == 0xF); + + EDEB_EX(7, processed=%d, *budget=%d, dev-quota=%d, + processed, *budget, dev-quota); + + if (!cqe || intreq) { + netif_rx_complete(dev); + ehea_reset_cq_ep(port_res-recv_cq); + ehea_reset_cq_n1(port_res-recv_cq); + cqe = ipz_qeit_get_valid(qp-ipz_rqueue1); + EDEB(7, CQE=%p, break ehea_poll while loop, cqe); + if (!cqe || intreq) + return 0; + if (!netif_rx_reschedule(dev, my_quota)) + return 0; + } + return 1; +} The poll function seems too long and therefore hard to review. Please consider splitting it. -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Netchannles: first stage has been completed. Further ideas.
Hello Evgeniy, +asmlinkage long sys_netchannel_control(void __user *arg) [...] + if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control))) + return -ERESTARTSYS; ^^^ [...] + if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control))) + return -ERESTARTSYS; ^^^ I think this should be -EFAULT instead of -ERESTARTSYS, right? -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Netchannles: first stage has been completed. Further ideas.
On Tuesday 18 July 2006 13:51, Evgeniy Polyakov wrote: I think this should be -EFAULT instead of -ERESTARTSYS, right? I have no strong feeling on what must be returned in that case. As far as I see, copy*user can fail due to absence of the next destination page, so -ERESTARTSYS makes sence, but if failure happens due to process size limitation, -EFAULT is correct. If I am not completely mistaken ERESTARTSYS is wrong. include/linux/errno.h says userspace should never see ERESTARTSYS, therefore we should only return it if we were interrupted by a signal as do_signal takes care of ERESTARTSYS. Furthermore, copy*user transparently faults in necessary pages as long as the address is valid in the user context. Let's change it to -EFAULT. Thanks :-) -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] changing value of NETDEV_ALIGN to cacheline size
On Tuesday 16 May 2006 00:49, David S. Miller wrote: From: Rick Jones [EMAIL PROTECTED] Date: Mon, 15 May 2006 14:39:23 -0700 How about: How about, just leave it alone? :-) Agreed. Currently it only makes a difference with slab debugging, which hurts performance no matter what we do. -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] changing value of NETDEV_ALIGN to cacheline size
while digging through the alloc_netdev function I asked myself why there is a fixed alignment for netdevices. Is there a special reason for choosing 32? If yes, I suggest to add a comment to the definition. If not, I suspect cacheline alignment might be beneficial: struct net_device contains several variables which are cache aligned (poll_list, queue_lock .). As far as I can see, the compiler tries to increase the size of the structure to make that possible, but expects the whole structure to be aligned to cache line size as well. With the current value for NETDEV_ALIGN we dont align struct net_device to the cache line, instead we align it to 32 bytes. That means that poll_list, queue_lock and friends are not always cache aligned, but 32 bytes aligned. The only reason why everything worked so far is the slab allocator design, which gives us a page aligned struct net_device in most cases. I think we should not rely on the behaviour of the memory allocator and use a different value for NETDEV_ALIGN instead. Any comments or corrections? cheers Christian The patch below is compile and boot tested on s390 and x86. Signed-off-by: Christian Borntraeger [EMAIL PROTECTED] include/linux/netdevice.h |2 +- net/core/dev.c|2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- --- a/include/linux/netdevice.h 4 Apr 2006 07:25:47 - +++ b/include/linux/netdevice.h 15 May 2006 11:06:05 - @@ -504,7 +504,7 @@ struct class_device class_dev; }; -#defineNETDEV_ALIGN32 +#defineNETDEV_ALIGNL1_CACHE_BYTES #defineNETDEV_ALIGN_CONST (NETDEV_ALIGN - 1) static inline void *netdev_priv(struct net_device *dev) --- a/net/core/dev.c4 Apr 2006 07:25:50 - +++ b/net/core/dev.c15 May 2006 11:06:05 - @@ -2986,7 +2986,7 @@ struct net_device *dev; int alloc_size; - /* ensure 32-byte alignment of both the device and private area */ + /* ensure cacheline alignment of both the device and private area */ alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) ~NETDEV_ALIGN_CONST; alloc_size += sizeof_priv + NETDEV_ALIGN_CONST; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] ipv4: initialize arp_tbl rw lock
As spinlock debugging still does not work with the qeth driver I want to pick up the discussion. On Saturday 08 April 2006 12:12, Andrew Morton wrote: Heiko Carstens [EMAIL PROTECTED] wrote: [...] -vmlinux-main := $(core-y) $(libs-y) $(drivers-y) $(net-y) +vmlinux-main := $(core-y) $(libs-y) $(net-y) $(drivers-y) wonders what this will break What about putting this patch into mm and find out? I have a bad feeling that one day we're going to come unstuck with this practice. Is there anything which dictates that the linker has to lay sections out in .o-file-order? Perhaps net initcalls should be using something higher priority than device_initcall(). I agree that the initcall order offers a lot of room for improvement (like dependencies). Is anybody aware of any work into this direction? -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html