Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-07-26 Thread Christian Borntraeger
On 07/26/2017 10:03 AM, Jason Wang wrote:
> This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it
> was reported to break vhost_net. We want to cache used event and use
> it to check for notification. We try to valid cached used event by
> checking whether or not it was ahead of new, but this is not correct
> all the time, it could be stale and there's no way to know about this.
> 
> Signed-off-by: Jason Wang 

This would then qualify for stable ?



Re: [Patch net] tap: convert a mutex to a spinlock

2017-07-06 Thread Christian Borntraeger
On 07/05/2017 10:50 PM, Cong Wang wrote:
> We are not allowed to block on the RCU reader side, so can't
> just hold the mutex as before. As a quick fix, convert it to
> a spinlock.
> 
> Fixes: d9f1f61c0801 ("tap: Extending tap device create/destroy APIs")
> Reported-by: Christian Borntraeger <borntrae...@de.ibm.com>
> Cc: Sainath Grandhi <sainath.gran...@intel.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>

Tested-by: Christian Borntraeger <borntrae...@de.ibm.com>


> ---
>  drivers/net/tap.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 4d4173d..d88ae3c 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -106,7 +106,7 @@ struct major_info {
>   struct rcu_head rcu;
>   dev_t major;
>   struct idr minor_idr;
> - struct mutex minor_lock;
> + spinlock_t minor_lock;
>   const char *device_name;
>   struct list_head next;
>  };
> @@ -416,15 +416,15 @@ int tap_get_minor(dev_t major, struct tap_dev *tap)
>   goto unlock;
>   }
> 
> - mutex_lock(_major->minor_lock);
> - retval = idr_alloc(_major->minor_idr, tap, 1, TAP_NUM_DEVS, 
> GFP_KERNEL);
> + spin_lock(_major->minor_lock);
> + retval = idr_alloc(_major->minor_idr, tap, 1, TAP_NUM_DEVS, 
> GFP_ATOMIC);
>   if (retval >= 0) {
>   tap->minor = retval;
>   } else if (retval == -ENOSPC) {
>   netdev_err(tap->dev, "Too many tap devices\n");
>   retval = -EINVAL;
>   }
> - mutex_unlock(_major->minor_lock);
> + spin_unlock(_major->minor_lock);
> 
>  unlock:
>   rcu_read_unlock();
> @@ -442,12 +442,12 @@ void tap_free_minor(dev_t major, struct tap_dev *tap)
>   goto unlock;
>   }
> 
> - mutex_lock(_major->minor_lock);
> + spin_lock(_major->minor_lock);
>   if (tap->minor) {
>   idr_remove(_major->minor_idr, tap->minor);
>   tap->minor = 0;
>   }
> - mutex_unlock(_major->minor_lock);
> + spin_unlock(_major->minor_lock);
> 
>  unlock:
>   rcu_read_unlock();
> @@ -467,13 +467,13 @@ static struct tap_dev *dev_get_by_tap_file(int major, 
> int minor)
>   goto unlock;
>   }
> 
> - mutex_lock(_major->minor_lock);
> + spin_lock(_major->minor_lock);
>   tap = idr_find(_major->minor_idr, minor);
>   if (tap) {
>   dev = tap->dev;
>   dev_hold(dev);
>   }
> - mutex_unlock(_major->minor_lock);
> + spin_unlock(_major->minor_lock);
> 
>  unlock:
>   rcu_read_unlock();
> @@ -1227,7 +1227,7 @@ static int tap_list_add(dev_t major, const char 
> *device_name)
>   tap_major->major = MAJOR(major);
> 
>   idr_init(_major->minor_idr);
> - mutex_init(_major->minor_lock);
> + spin_lock_init(_major->minor_lock);
> 
>   tap_major->device_name = device_name;
> 



locking issues in macvtap (looks like due to tap: Extending tap device create/destroy APIs)

2017-07-03 Thread Christian Borntraeger
Sainath,

with rcu debugging and lock debugging I get the following splats.
I think doing a mutex_lock while in an rcu read-side is not allowed,
since mutex_lock can sleep.

This is in 4.11 and 4.12 and seems to be introduced with commit
d9f1f61c0801a7("tap: Extending tap device create/destroy APIs").


Christian


[  125.678015] ===
[  125.678018] [ ERR: suspicious RCU usage.  ]
[  125.678022] 4.11.0+ #18 Not tainted
[  125.678025] ---
[  125.678028] ./include/linux/rcupdate.h:521 Illegal context switch in RCU 
read-side critical section!
[  125.678031] 
   other info that might help us debug this:

[  125.678035] 
   rcu_scheduler_active = 2, debug_locks = 0
[  125.678038] 2 locks held by libvirtd/3050:
[  125.678041]  #0:  (rtnl_mutex){+.+.+.}, at: [<00772b02>] 
rtnl_newlink+0x2ea/0x880
[  125.678057]  #1:  (rcu_read_lock){..}, at: [<03ff800dad00>] 
tap_get_minor+0x0/0x1d8 [tap]
[  125.678068] 
   stack backtrace:
[  125.678073] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18
[  125.678076] Hardware name: IBM 2964 NC9 704 (LPAR)
[  125.678079] Stack:
[  125.678081]00fa977cb230 00fa977cb2c0 0003 

[  125.678091]00fa977cb360 00fa977cb2d8 00fa977cb2d8 
0020
[  125.678100] 03ff0020 00fa000a 
00fa000a
[  125.678109]000c 00fa977cb328  

[  125.678119]008e2510 001139ac 00fa977cb2c0 
00fa977cb318
[  125.678150] Call Trace:
[  125.678157] ([<00113872>] show_trace+0xea/0xf0)
[  125.678160]  [<00113950>] show_stack+0x68/0xe0 
[  125.678165]  [<0057ef8c>] dump_stack+0x94/0xd8 
[  125.678172]  [<001a4422>] ___might_sleep+0x21a/0x268 
[  125.678177]  [<008ca842>] __mutex_lock+0x52/0x968 
[  125.678180]  [<008cb192>] mutex_lock_nested+0x3a/0x48 
[  125.678184]  [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] 
[  125.678188]  [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] 
[  125.678191]  [<0019bfbe>] notifier_call_chain+0x56/0x98 
[  125.678195]  [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 
[  125.678200]  [<0075d014>] register_netdevice+0x3f4/0x508 
[  125.678204]  [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 
[macvlan] 
[  125.678207]  [<03ff80177564>] macvtap_newlink+0xbc/0xf0 [macvtap] 
[  125.678211]  [<00772e32>] rtnl_newlink+0x61a/0x880 
[  125.678214]  [<0077313c>] rtnetlink_rcv_msg+0xa4/0x248 
[  125.678219]  [<0079cec0>] netlink_rcv_skb+0xd8/0x108 
[  125.678222]  [<0076f538>] rtnetlink_rcv+0x48/0x58 
[  125.678226]  [<0079c750>] netlink_unicast+0x178/0x1f8 
[  125.678229]  [<0079cbd4>] netlink_sendmsg+0x304/0x3b0 
[  125.678233]  [<00730676>] sock_sendmsg+0x6e/0x80 
[  125.678237]  [<007311b0>] ___sys_sendmsg+0x2a0/0x2a8 
[  125.678240]  [<007324d8>] __sys_sendmsg+0x60/0xa8 
[  125.678244]  [<00732ed4>] SyS_socketcall+0x33c/0x390 
[  125.678248]  [<008d08bc>] system_call+0xc4/0x258 
[  125.678251] INFO: lockdep is turned off.
[  125.678255] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:747
[  125.678257] in_atomic(): 1, irqs_disabled(): 0, pid: 3050, name: libvirtd
[  125.678261] INFO: lockdep is turned off.
[  125.678264] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18
[  125.678267] Hardware name: IBM 2964 NC9 704 (LPAR)
[  125.678269] Stack:
[  125.678272]00fa977cb230 00fa977cb2c0 0003 

[  125.678281]00fa977cb360 00fa977cb2d8 00fa977cb2d8 
0020
[  125.678290] 00fa0020 00fa000a 
00fa000a
[  125.678298]000c 00fa977cb328  

[  125.678308]008e2510 001139ac 00fa977cb2c0 
00fa977cb318
[  125.678323] Call Trace:
[  125.678326] ([<00113872>] show_trace+0xea/0xf0)
[  125.678330]  [<00113950>] show_stack+0x68/0xe0 
[  125.678334]  [<0057ef8c>] dump_stack+0x94/0xd8 
[  125.678337]  [<001a438e>] ___might_sleep+0x186/0x268 
[  125.678341]  [<008ca842>] __mutex_lock+0x52/0x968 
[  125.678346]  [<008cb192>] mutex_lock_nested+0x3a/0x48 
[  125.678350]  [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] 
[  125.678354]  [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] 
[  125.678357]  [<0019bfbe>] notifier_call_chain+0x56/0x98 
[  125.678360]  [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 
[  125.678364]  [<0075d014>] register_netdevice+0x3f4/0x508 
[  125.678368]  [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 
[macvlan] 
[  125.678371]  [<03ff80177564>] 

Re: rhashtable - Cap total number of entries to 2^31

2017-04-28 Thread Christian Borntraeger
On 04/28/2017 01:31 PM, Herbert Xu wrote:
> On Fri, Apr 28, 2017 at 12:23:15PM +0200, Christian Borntraeger wrote:
>>
>> I can reproduce this boot failure on s390 bisected to 
>> commit 6d684e54690caef45cf14051ddeb7c71beeb681b
>>rhashtable: Cap total number of entries to 2^31
>> in linux-next from Apr 28
> 
> It should go away with
> 
> https://patchwork.ozlabs.org/patch/756233/
> 
> Thanks,
> 
Yes it does.

Tested-by: Christian Borntraeger <borntrae...@de.ibm.com>

would be nice to have it in the next linux-next version.



Re: rhashtable - Cap total number of entries to 2^31

2017-04-28 Thread Christian Borntraeger
On 04/28/2017 12:21 AM, Florian Fainelli wrote:
> On 04/27/2017 02:16 PM, Florian Fainelli wrote:
>> Hi Herbert,
>>
>> On 04/26/2017 10:44 PM, Herbert Xu wrote:
>>> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote:
 From: Florian Westphal 
 Date: Tue, 25 Apr 2017 16:17:49 +0200

> I'd have less of an issue with this if we'd be talking about
> something computationally expensive, but this is about storing
> an extra value inside a struct just to avoid one "shr" in insert path...

 Agreed, this shift is probably filling an available cpu cycle :-)
>>>
>>> OK, but we need to have an extra field for another reason anyway.
>>> The problem is that we're not capping the total number of elements
>>> in the hashtable when max_size is not set, this means that nelems
>>> can overflow which will cause havoc with the automatic shrinking
>>> when it tries to fit 2^32 entries into a minimum-sized table.
>>>
>>> So I'm taking that hole back for now :)
>>>
>>> ---8<---
>>> When max_size is not set or if it set to a sufficiently large
>>> value, the nelems counter can overflow.  This would cause havoc
>>> with the automatic shrinking as it would then attempt to fit a
>>> huge number of entries into a tiny hash table.
>>>
>>> This patch fixes this by adding max_elems to struct rhashtable
>>> to cap the number of elements.  This is set to 2^31 as nelems is
>>> not a precise count.  This is sufficiently smaller than UINT_MAX
>>> that it should be safe.
>>>
>>> When max_size is set max_elems will be lowered to at most twice
>>> max_size as is the status quo.
>>>
>>> Signed-off-by: Herbert Xu 
>>
>> This commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b
>>
>> makes my ARMv7 (32-bit) system panic on boot with the log below. I can
>> test net-next (or net) and report back if you want me to test anything.
>> Thanks!
> 
> And another on with a QEMU guest:
> 
> [0.389212] NET: Registered protocol family 16
> [0.388807] Kernel panic - not syncing: rtnetlink_init: cannot
> initialize rtnetlink
> [0.388807]
> [0.389445] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.11.0-rc8-02077-ge221c1f0fe25 #1
> [0.389745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
> [0.390219] Call Trace:
> [0.391406]  dump_stack+0x51/0x78
> [0.391585]  panic+0xc7/0x20e
> [0.391740]  ? register_pernet_operations+0xa1/0xd0
> [0.392031]  rtnetlink_init+0x22/0x1a0
> [0.392190]  netlink_proto_init+0x168/0x184
> [0.392359]  ? ptp_classifier_init+0x26/0x30
> [0.392528]  ? netlink_net_init+0x2e/0x2e
> [0.392692]  do_one_initcall+0x54/0x190
> [0.392852]  ? parse_args+0x248/0x400
> [0.393033]  kernel_init_freeable+0x127/0x1b6
> [0.393208]  ? kernel_init_freeable+0x1b6/0x1b6
> [0.393389]  ? rest_init+0x70/0x70
> [0.393533]  kernel_init+0x9/0x100
> [0.393676]  ret_from_fork+0x29/0x40
> [0.394555] ---[ end Kernel panic - not syncing: rtnetlink_init:
> cannot initialize rtnetlink
> [0.394555]
> 
> I traced this down to:
> 
> rtnetlink_net_init()
>   netlink_kernel_create()
>  netlink_insert()
>   __netlink_insert()
>  rhashtable_lookup_insert_key()
> __rhashtable_insert_fast()
> rht_grow_above_max()
> 
> And indeed we have:
> 
> ht->nelemts = 0
> ht->max_elems = 0
> 
> such that rht_grow_above_max() returns true.
> 
> With your commit we actually take this branch:
> 
> if (ht->p.max_size < ht->max_elems / 2)
>   ht->max_elems = ht->p.max_size * 2;
> 
> since max_size = 0 we have max_elems = 0 as well.
> 
> Candidate fix #1:
> 
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 45f89369c4c8..ad9020e1609c 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -329,7 +329,7 @@ static inline bool rht_grow_above_100(const struct
> rhashtable *ht,
>  static inline bool rht_grow_above_max(const struct rhashtable *ht,
>   const struct bucket_table *tbl)
>  {
> -   return atomic_read(>nelems) >= ht->max_elems;
> +   return ht->p.max_size && atomic_read(>nelems) >= ht->max_elems;
>  }
> 
> Candidate fix #2:
> 
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 751630bbe409..6b4f07760fec 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
> 
> /* Cap total entries at 2^31 to avoid nelems overflow. */
> ht->max_elems = 1u << 31;
> -   if (ht->p.max_size < ht->max_elems / 2)
> +   if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
> ht->max_elems = ht->p.max_size * 2;
> 
> ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
> 
> Number #2 does not introduce an additional conditional on the fastpath,
> so I 

Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-03 Thread Christian Borntraeger
On 03/02/2017 10:45 PM, Arnd Bergmann wrote:
> On Thu, Mar 2, 2017 at 8:00 PM, Christian Borntraeger
> <borntrae...@de.ibm.com> wrote:
>> On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
>>> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
>>> <borntrae...@de.ibm.com> wrote:
>>>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>>>
>>>>> This attempts a rewrite of the two macros, using a simpler implementation
>>>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>>>> instruction.  For these, we go back to a volatile pointer dereference
>>>>> that we had with the ACCESS_ONCE macro.
>>>>
>>>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>>>> removed the volatile statement on aggregate types like the following one
>>>>
>>>> union ipte_control {
>>>> unsigned long val;
>>>> struct {
>>>> unsigned long k  : 1;
>>>> unsigned long kh : 31;
>>>> unsigned long kg : 32;
>>>> };
>>>> };
>>>>
>>>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>>>
>>>> If I see that right, your __ALIGNED_WORD(x)
>>>> macro would say that for above structure  sizeof(x) == sizeof(long)) is 
>>>> true,
>>>> so it would fall back to the old volatile cast and might reintroduce the
>>>> old compiler bug?
>>
>> Oh dear, I should double check my sentences in emails before sending...anyway
>> the full story is referenced in
>>
>> commit 60815cf2e05057db5b78e398d9734c493560b11e
>> Merge tag 'for-linus' of 
>> git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
>> which has a pointer to
>> http://marc.info/?i=54611D86.4040306%40de.ibm.com
>> which contains the full story.
> 
> Ok, got it. So I guess the behavior of forcing aligned accesses on aligned
> data is accidental, and allowing non-power-of-two arguments is also not
> the main purpose.


Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
atomic access for sizes <= word size. And there are certainly places that
rely on that. But the *ONCE thing is mostly used for things where we used
barrier() 10 years ago.


 Maybe we could just bail out on new compilers if we get
> either of those? That might catch code that accidentally does something
> that is inherently non-atomic or that causes a trap when the intention was
> to have a simple atomic access.

I think Linus stated that its ok to assume that the compiler is smart enough 
to uses a single instruction to access aligned and properly sized scalar types
for *ONCE.

Back then when I changed ACCESS_ONCE there were many places that did use it
for non-atomic, > word size accesses. For example on some architectures a pmd_t
is a typedef to an array, for which there is no way to read that atomically.
So the focus must be on the "ONCE" part.

If some code uses a properly aligned, word sized object we can also assume 
atomic access. If the access is not properly sized/aligned we do not get
atomicity, but we do get the "ONCE".
But adding a check for alignment/size would break the compilation of some
code.





Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-02 Thread Christian Borntraeger
On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
> <borntrae...@de.ibm.com> wrote:
>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>
>>> This attempts a rewrite of the two macros, using a simpler implementation
>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>> instruction.  For these, we go back to a volatile pointer dereference
>>> that we had with the ACCESS_ONCE macro.
>>
>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>> removed the volatile statement on aggregate types like the following one
>>
>> union ipte_control {
>> unsigned long val;
>> struct {
>> unsigned long k  : 1;
>> unsigned long kh : 31;
>> unsigned long kg : 32;
>> };
>> };
>>
>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>
>> If I see that right, your __ALIGNED_WORD(x)
>> macro would say that for above structure  sizeof(x) == sizeof(long)) is true,
>> so it would fall back to the old volatile cast and might reintroduce the
>> old compiler bug?

Oh dear, I should double check my sentences in emails before sending...anyway
the full story is referenced in 

commit 60815cf2e05057db5b78e398d9734c493560b11e
Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
which has a pointer to
http://marc.info/?i=54611D86.4040306%40de.ibm.com
which contains the full story.

> 
> Ah, right, that's the missing piece. For some reason I didn't find
> the reference in the source or the git log.
> 
>> Could you maybe you fence your simple macro for anything older than 4.9? 
>> After
>> all there was no kasan support anyway on these older gcc version.
> 
> Yes, that should work, thanks!



Re: [PATCH 02/26] rewrite READ_ONCE/WRITE_ONCE

2017-03-02 Thread Christian Borntraeger
On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
> When CONFIG_KASAN is enabled, the READ_ONCE/WRITE_ONCE macros cause
> rather large kernel stacks, e.g.:
> 
> mm/vmscan.c: In function 'shrink_page_list':
> mm/vmscan.c:1333:1: error: the frame size of 3456 bytes is larger than 3072 
> bytes [-Werror=frame-larger-than=]
> block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
> block/cfq-iosched.c:750:1: error: the frame size of 4048 bytes is larger than 
> 3072 bytes [-Werror=frame-larger-than=]
> fs/btrfs/disk-io.c: In function 'open_ctree':
> fs/btrfs/disk-io.c:3314:1: error: the frame size of 3136 bytes is larger than 
> 3072 bytes [-Werror=frame-larger-than=]
> fs/btrfs/relocation.c: In function 'build_backref_tree':
> fs/btrfs/relocation.c:1193:1: error: the frame size of 4336 bytes is larger 
> than 3072 bytes [-Werror=frame-larger-than=]
> fs/fscache/stats.c: In function 'fscache_stats_show':
> fs/fscache/stats.c:287:1: error: the frame size of 6512 bytes is larger than 
> 3072 bytes [-Werror=frame-larger-than=]
> fs/jbd2/commit.c: In function 'jbd2_journal_commit_transaction':
> fs/jbd2/commit.c:1139:1: error: the frame size of 3760 bytes is larger than 
> 3072 bytes [-Werror=frame-larger-than=]
> 
> This attempts a rewrite of the two macros, using a simpler implementation
> for the most common case of having a naturally aligned 1, 2, 4, or (on
> 64-bit architectures) 8  byte object that can be accessed with a single
> instruction.  For these, we go back to a volatile pointer dereference
> that we had with the ACCESS_ONCE macro.

We had changed that back then because gcc 4.6 and 4.7 had a bug that could
removed the volatile statement on aggregate types like the following one

union ipte_control {
unsigned long val;
struct {
unsigned long k  : 1;
unsigned long kh : 31;
unsigned long kg : 32;
};
};

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145

If I see that right, your __ALIGNED_WORD(x)
macro would say that for above structure  sizeof(x) == sizeof(long)) is true,
so it would fall back to the old volatile cast and might reintroduce the 
old compiler bug?

Could you maybe you fence your simple macro for anything older than 4.9? After
all there was no kasan support anyway on these older gcc version.

Christian



Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-06 Thread Christian Borntraeger
On 02/03/2017 07:19 AM, Ben Serebrin wrote:
[...]
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1502,20 +1502,44 @@ static void virtnet_set_affinity(struct virtnet_info 
> *vi)
>* queue pairs, we let the queue pairs to be private to one cpu by
>* setting the affinity hint to eliminate the contention.
>*/
> - if (vi->curr_queue_pairs == 1 ||
> - vi->max_queue_pairs != num_online_cpus()) {
> + if (vi->curr_queue_pairs == 1) {
>   virtnet_clean_affinity(vi, -1);
>   return;
>   }
> 
> + /* If there are more cpus than queues, then assign the queues'
> +  * interrupts to the first cpus until we run out.
> +  */
>   i = 0;
>   for_each_online_cpu(cpu) {
> + if (i == vi->max_queue_pairs)
> + break;
>   virtqueue_set_affinity(vi->rq[i].vq, cpu);
>   virtqueue_set_affinity(vi->sq[i].vq, cpu);
> - netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
>   i++;
>   }
> 
> + /* Stripe the XPS affinities across the online CPUs.
> +  * Hyperthread pairs are typically assigned such that Linux's
> +  * CPU X and X + (numcpus / 2) are hyperthread twins, so we cause
> +  * hyperthread twins to share TX queues, in the case where there are
> +  * more cpus than queues.
> +  */

This is not always true. E.g. on s390 the SMT threads are usually paired 
even/odd.

e.g.

[cborntra@s38lp08 linux]$ lscpu -e
CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
0   000  00:0:0:0 yesyeshorizontal   0
1   000  01:1:1:1 yesyeshorizontal   1
2   000  12:2:2:2 yesyeshorizontal   2
3   000  13:3:3:3 yesyeshorizontal   3
4   000  24:4:4:4 yesyeshorizontal   4
5   000  25:5:5:5 yesyeshorizontal   5
6   000  36:6:6:6 yesyeshorizontal   6

This does not matter yet for s390 (as virtio is usally doen via the ccw bus)
but maybe we should consider an future patch to provide some arch-specific
striping hints.

Or would it make sense to change the s390 layout for SMT twins because there
is more code that expects all threads 0 at the front and all threads 1 at the
end?


> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + struct cpumask mask;
> + int skip = i;
> +
> + cpumask_clear();
> + for_each_online_cpu(cpu) {
> + while (skip--)
> + cpu = cpumask_next(cpu, cpu_online_mask);
> + if (cpu < num_possible_cpus())
> + cpumask_set_cpu(cpu, );
> + skip = vi->max_queue_pairs - 1;
> + }
> + netif_set_xps_queue(vi->dev, , i);
> + }
> +
>   vi->affinity_hint_set = true;
>  }
> 
> 



Re: [PATCH 5/6] treewide: use kv[mz]alloc* rather than opencoded variants

2017-01-12 Thread Christian Borntraeger
On 01/12/2017 04:37 PM, Michal Hocko wrote:
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 4f74511015b8..e6bbb33d2956 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1126,10 +1126,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct 
> kvm_s390_skeys *args)
>   if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
>   return -EINVAL;
> 
> - keys = kmalloc_array(args->count, sizeof(uint8_t),
> -  GFP_KERNEL | __GFP_NOWARN);
> - if (!keys)
> - keys = vmalloc(sizeof(uint8_t) * args->count);
> + keys = kvmalloc(args->count * sizeof(uint8_t), GFP_KERNEL);
>   if (!keys)
>   return -ENOMEM;
> 
> @@ -1171,10 +1168,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct 
> kvm_s390_skeys *args)
>   if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
>   return -EINVAL;
> 
> - keys = kmalloc_array(args->count, sizeof(uint8_t),
> -  GFP_KERNEL | __GFP_NOWARN);
> - if (!keys)
> - keys = vmalloc(sizeof(uint8_t) * args->count);
> + keys = kvmalloc(sizeof(uint8_t) * args->count, GFP_KERNEL);
>   if (!keys)
>   return -ENOMEM;

KVM/s390 parts

Acked-by: Christian Borntraeger <borntrae...@de.ibm.com>



Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Christian Borntraeger
On 11/25/2016 10:08 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 25, 2016 at 05:49:45PM +0100, Christian Borntraeger wrote:
>> On 11/25/2016 05:17 PM, Peter Zijlstra wrote:
>>> On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote:
>>>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
>>>
>>>>> What are use cases for such primitive that won't be OK with "read once
>>>>> _and_ atomically"?
>>>>
>>>> I have none to hand.
>>>
>>> Whatever triggers the __builtin_memcpy() paths, and even the size==8
>>> paths on 32bit.
>>>
>>> You could put a WARN in there to easily find them.
>>
>> There were several cases that I found during writing the *ONCE stuff.
>> For example there are some 32bit ppc variants with 64bit PTEs. Some for
>> others (I think sparc). And the mm/ code is perfectly fine with these
>> PTE accesses being done NOT atomic.
> 
> In that case do we even need _ONCE at all?

Yes. For example look at gup_pmd_range. Here several checks are made on the pmd.
It is important the the check for pmd_none is made on the same value than
the check for pmd_trans_huge, but it is not important that the value is still up
to date. 
And there are really cases where we cannot read the  thing atomically, e.g. on 
m68k and sparc(32bit) pmd_t is defined as array of longs.

Another problem is that a compiler can implement the following code as 2 memory
reads (e.g. if you have compare instructions that work on memory) instead of a 
memory read and 2 compares

int check(unsigned long *value_p) {
unsigned long value = *value_p;
if (condition_a(value))
return 1;
if (condition_b(value))
return 2;
return 3;
}

With READ_ONCE you forbid that. In past times you would have used barrier() 
after 
the assignment to achieve the same goal.


> Are there assumptions these are two 32 bit reads?

It depends on the code. Some places (e.g. in gup) assumes that the access via
READ_ONCE is atomic (which it is for sane compilers as long as the pointer
is <= word size). In some others places just one bit is tested.
> 
> 
>>
>>>
>>> The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that
>>> they compiletime validate this the size is 'right' and can runtime check
>>> alignment constraints.
>>>
>>> IE, they are strictly stronger than {READ,WRITE}_ONCE().
>>>
> 



Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Christian Borntraeger
On 11/25/2016 06:28 PM, Mark Rutland wrote:
> On Fri, Nov 25, 2016 at 05:49:45PM +0100, Christian Borntraeger wrote:
>> On 11/25/2016 05:17 PM, Peter Zijlstra wrote:
>>> On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote:
>>>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
>>>
>>>>> What are use cases for such primitive that won't be OK with "read once
>>>>> _and_ atomically"?
>>>>
>>>> I have none to hand.
>>>
>>> Whatever triggers the __builtin_memcpy() paths, and even the size==8
>>> paths on 32bit.
>>>
>>> You could put a WARN in there to easily find them.
>>
>> There were several cases that I found during writing the *ONCE stuff.
>> For example there are some 32bit ppc variants with 64bit PTEs. Some for
>> others (I think sparc).
> 
> We have similar on 32-bit ARM w/ LPAE. LPAE implies that a naturally
> aligned 64-bit access is single-copy atomic, which is what makes that
> ok.
> 
>> And the mm/ code is perfectly fine with these PTE accesses being done
>> NOT atomic.
> 
> That strikes me as surprising. Is there some mutual exclusion that
> prevents writes from occuring wherever a READ_ONCE() happens to a PTE?

See for example mm/memory.c handle_pte_fault.

---snip

/*
 * some architectures can have larger ptes than wordsize,
 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
 * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
 * atomic accesses.  The code below just needs a consistent
 * view for the ifs and we later double check anyway with the
 * ptl lock held. So here a barrier will do.
 */
---snip---

The trick is that the code only does a specific check, but all other accesses 
are under
the pte lock.



Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Christian Borntraeger
On 11/25/2016 05:17 PM, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 04:10:04PM +, Mark Rutland wrote:
>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
>>> What are use cases for such primitive that won't be OK with "read once
>>> _and_ atomically"?
>>
>> I have none to hand.
> 
> Whatever triggers the __builtin_memcpy() paths, and even the size==8
> paths on 32bit.
> 
> You could put a WARN in there to easily find them.

There were several cases that I found during writing the *ONCE stuff.
For example there are some 32bit ppc variants with 64bit PTEs. Some for
others (I think sparc). And the mm/ code is perfectly fine with these
PTE accesses being done NOT atomic.


> 
> The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that
> they compiletime validate this the size is 'right' and can runtime check
> alignment constraints.
> 
> IE, they are strictly stronger than {READ,WRITE}_ONCE().
> 



Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Christian Borntraeger
On 11/25/2016 12:22 PM, Mark Rutland wrote:
> On Thu, Nov 24, 2016 at 10:36:58PM +0200, Michael S. Tsirkin wrote:
>> On Thu, Nov 24, 2016 at 10:25:11AM +, Mark Rutland wrote:
>>> For several reasons, it would be beneficial to kill off ACCESS_ONCE()
>>> tree-wide, in favour of {READ,WRITE}_ONCE(). These work with aggregate 
>>> types,
>>> more obviously document their intended behaviour, and are necessary for 
>>> tools
>>> like KTSAN to work correctly (as otherwise reads and writes cannot be
>>> instrumented separately).
>>>
>>> While it's possible to script the bulk of this tree-wide conversion, some 
>>> cases
>>> such as the virtio code, require some manual intervention. This series moves
>>> the virtio and vringh code over to {READ,WRITE}_ONCE(), in the process 
>>> fixing a
>>> bug in the virtio headers.
>>>
>>> Thanks,
>>> Mark.
>>
>> I don't have a problem with this specific patchset.
> 
> Good to hear. :)
> 
> Does that mean you're happy to queue these patches? Or would you prefer
> a new posting at some later point, with ack/review tags accumulated?
> 
>> Though I really question the whole _ONCE APIs esp with
>> aggregate types - these seem to generate a memcpy and
>> an 8-byte read/writes sometimes, and I'm pretty sure this simply
>> can't be read/written at once on all architectures.
> 
> Yes, in cases where the access is larger than the machine can perform in
> a single access, this will result in a memcpy.
> 
> My understanding is that this has always been the case with
> ACCESS_ONCE(), where multiple accesses were silently/implicitly
> generated by the compiler.
> 
> We could add some compile-time warnings for those cases. I'm not sure if
> there's a reason we avoided doing that so far; perhaps Christian has a
> some idea.

My first version had this warning, but it was removed later on as requested
by Linus

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02670.html
---snip---

Get rid of the f*cking size checks etc on READ_ONCE() and friends.

They are about - wait for it - "reading a value once".

Note how it doesn't say ANYTHING about "atomic" or anything like that.
It's about reading *ONCE*.

---snip---



Re: [PATCH 2/3] vringh: kill off ACCESS_ONCE()

2016-11-24 Thread Christian Borntraeger
On 11/24/2016 11:25 AM, Mark Rutland wrote:
> Despite living under drivers/ vringh.c is also used as part of the userspace
> virtio tools. Before we can kill off the ACCESS_ONCE()definition in the tools,
> we must convert vringh.c to use {READ,WRITE}_ONCE().
> 
> This patch does so, along with the required include of  for
> the relevant definitions. The userspace tools provide their own definitions in
> their own .
> 
> Signed-off-by: Mark Rutland <mark.rutl...@arm.com>
> Cc: Jason Wang <jasow...@redhat.com>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> Cc: k...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> ---
>  drivers/vhost/vringh.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
> index 3bb02c6..bb8971f 100644
> --- a/drivers/vhost/vringh.c
> +++ b/drivers/vhost/vringh.c
> @@ -3,6 +3,7 @@
>   *
>   * Since these may be in userspace, we use (inline) accessors.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -820,13 +821,13 @@ EXPORT_SYMBOL(vringh_need_notify_user);
>  static inline int getu16_kern(const struct vringh *vrh,
> u16 *val, const __virtio16 *p)
>  {
> - *val = vringh16_to_cpu(vrh, ACCESS_ONCE(*p));
> + *val = vringh16_to_cpu(vrh, READ_ONCE(*p));
>   return 0;
>  }
> 
>  static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 
> val)
>  {
> - ACCESS_ONCE(*p) = cpu_to_vringh16(vrh, val);
> + WRITE_ONCE(*p, cpu_to_vringh16(vrh, val));
>   return 0;
>  }
> 

Makes sense

Reviewed-by: Christian Borntraeger <borntrae...@de.ibm.com>



Re: [PATCH V3 3/3] vhost_net: basic polling support

2016-02-28 Thread Christian Borntraeger
On 02/26/2016 09:42 AM, Jason Wang wrote:
> This patch tries to poll for new added tx buffer or socket receive
> queue for a while at the end of tx/rx processing. The maximum time
> spent on polling were specified through a new kind of vring ioctl.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c| 79 
> +++---
>  drivers/vhost/vhost.c  | 14 
>  drivers/vhost/vhost.h  |  1 +
>  include/uapi/linux/vhost.h |  6 
>  4 files changed, 95 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..c91af93 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -287,6 +287,44 @@ static void vhost_zerocopy_callback(struct ubuf_info 
> *ubuf, bool success)
>   rcu_read_unlock_bh();
>  }
> 
> +static inline unsigned long busy_clock(void)
> +{
> + return local_clock() >> 10;
> +}
> +
> +static bool vhost_can_busy_poll(struct vhost_dev *dev,
> + unsigned long endtime)
> +{
> + return likely(!need_resched()) &&
> +likely(!time_after(busy_clock(), endtime)) &&
> +likely(!signal_pending(current)) &&
> +!vhost_has_work(dev) &&
> +single_task_running();
> +}
> +
> +static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> + struct vhost_virtqueue *vq,
> + struct iovec iov[], unsigned int iov_size,
> + unsigned int *out_num, unsigned int *in_num)
> +{
> + unsigned long uninitialized_var(endtime);
> + int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> + out_num, in_num, NULL, NULL);
> +
> + if (r == vq->num && vq->busyloop_timeout) {
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> + while (vhost_can_busy_poll(vq->dev, endtime) &&
> +vhost_vq_avail_empty(vq->dev, vq))
> + cpu_relax();


Can you use cpu_relax_lowlatency (which should be the same as cpu_relax for 
almost
everybody but s390? cpu_relax (without low latency might give up the time slice
when running under another hypervisor (like LPAR on s390), which might not be 
what
we want here.



[...] 
> +static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
> +{
> + struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX];
> + struct vhost_virtqueue *vq = >vq;
> + unsigned long uninitialized_var(endtime);
> + int len = peek_head_len(sk);
> +
> + if (!len && vq->busyloop_timeout) {
> + /* Both tx vq and rx socket were polled here */
> + mutex_lock(>mutex);
> + vhost_disable_notify(>dev, vq);
> +
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> +
> + while (vhost_can_busy_poll(>dev, endtime) &&
> +skb_queue_empty(>sk_receive_queue) &&
> +vhost_vq_avail_empty(>dev, vq))
> + cpu_relax();

here as well.



Re: [PATCH v2] vhost: replace % with & on data path

2015-11-30 Thread Christian Borntraeger
On 11/30/2015 10:15 AM, Michael S. Tsirkin wrote:
> We know vring num is a power of 2, so use &
> to mask the high bits.

Makes a lot of sense and virtio_ring.c seems to use the same logic.

Acked-by: Christian Borntraeger <borntrae...@de.ibm.com>

> 
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
> 
> Changes from v1: drop an unrelated chunk
> 
>  drivers/vhost/vhost.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 080422f..ad2146a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1369,7 +1369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
>   if (unlikely(__get_user(ring_head,
> - >avail->ring[last_avail_idx % vq->num]))) {
> + >avail->ring[last_avail_idx & (vq->num - 
> 1)]))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
>  last_avail_idx,
>  >avail->ring[last_avail_idx % vq->num]);
> @@ -1489,7 +1489,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
> *vq,
>   u16 old, new;
>   int start;
> 
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   used = vq->used->ring + start;
>   if (count == 1) {
>   if (__put_user(heads[0].id, >id)) {
> @@ -1531,7 +1531,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
> vring_used_elem *heads,
>  {
>   int start, n, r;
> 
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   n = vq->num - start;
>   if (n < count) {
>   r = __vhost_add_used_n(vq, heads, n);
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).

2015-11-16 Thread Christian Borntraeger
On 11/16/2015 02:25 PM, Jiri Pirko wrote:
> Mon, Nov 16, 2015 at 02:19:55PM CET, mkube...@suse.cz wrote:
>> On Mon, Nov 16, 2015 at 02:00:32PM +0100, Christian Borntraeger wrote:
>>>
>>> on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) 
>>> I get errors like:
>>>
>>> [   10.940523] Ebtables v2.0 registered
>>> [   11.685609] bridge: automatic filtering via arp/ip/ip6tables has been 
>>> deprecated. Update your scripts to load br_netfilter if you need this.
>>> [   11.686252] virbr0-nic: set_features() failed (-1); wanted 
>>> 0x008048c1, left 0x0080001b48c9
>>> [   11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic)
>>> [   11.794556] virbr0-nic: failed to set HW ageing time
>>> [   11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic)
>>> [   11.794644] virbr0-nic: set_features() failed (-1); wanted 
>>> 0x008048c1, left 0x0080001b48c9
>>> [   11.794675] device virbr0-nic entered promiscuous mode
>>> [   11.794706] virbr0: set_features() failed (-1); wanted 
>>> 0x00801fdb78c9, left 0x00801fff78e9
>>> [   11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team
>>> [   12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
>>> [   12.086880] virbr0: set_features() failed (-1); wanted 
>>> 0x00801fdb78c9, left 0x00801fff78e9
>>> [   12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic)
>>> [   12.086926] virbr0-nic: failed to set HW ageing time
>>> [   12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic)
>>> [   12.086928] virbr0: port 1(virbr0-nic) entered listening state
>>> [   12.086934] virbr0: port 1(virbr0-nic) entered listening state
>>> [   12.087023] virbr0: set_features() failed (-1); wanted 
>>> 0x00801fdb78c9, left 0x00801fff78e9
>>> [   12.087057] virbr0-nic: set_features() failed (-1); wanted 
>>> 0x008048c1, left 0x0080001b48c9
>>> [   12.087088] virbr0: set_features() failed (-1); wanted 
>>> 0x00801fdb78c9, left 0x00801fff78e9
>>> [   12.087122] virbr0-nic: set_features() failed (-1); wanted 
>>> 0x008048c1, left 0x0080001b48c9
>>> [   12.087153] virbr0: set_features() failed (-1); wanted 
>>> 0x00801fdb78c9, left 0x00801fff78e9
>>> [   12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic)
>>> [   12.102637] virbr0: port 1(virbr0-nic) entered disabled state
>>>
>> ...
>>>
>>> [  260.414547] macvtap0: set_features() failed (-1); wanted 
>>> 0x0004001f5209, left 0x0004001f5a09
>>> [  260.414555] macvtap0: set_features() failed (-1); wanted 
>>> 0x0004001f5209, left 0x0004001f5a09
>>> [  260.541477] macvtap0: set_features() failed (-1); wanted 
>>> 0x0004001f5209, left 0x0004001f5a09
>>> [  260.541532] macvtap0: set_features() failed (-1); wanted 
>>> 0x0004001f5209, left 0x0004001f5a09
>>> [  260.541603] macvtap0: set_features() failed (-1); wanted 
>>> 0x0004001f5209, left 0x0004001f5a09
>>> [  260.984503] device enccw0.0.f500 entered promiscuous mode
>>>
>>> Does somebody have an idea?
>>
>> Looks like the issue addressed by
>>
>>  http://patchwork.ozlabs.org/patch/544307/
> 
> second issues (stp state) is fixed by:
> http://patchwork.ozlabs.org/patch/544246/

Yes, 

Tested-by: Christian Borntraeger <borntrae...@de.ibm.com>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).

2015-11-16 Thread Christian Borntraeger
Jiri,

on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) I 
get errors like:

[   10.940523] Ebtables v2.0 registered
[   11.685609] bridge: automatic filtering via arp/ip/ip6tables has been 
deprecated. Update your scripts to load br_netfilter if you need this.
[   11.686252] virbr0-nic: set_features() failed (-1); wanted 
0x008048c1, left 0x0080001b48c9
[   11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic)
[   11.794556] virbr0-nic: failed to set HW ageing time
[   11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic)
[   11.794644] virbr0-nic: set_features() failed (-1); wanted 
0x008048c1, left 0x0080001b48c9
[   11.794675] device virbr0-nic entered promiscuous mode
[   11.794706] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, 
left 0x00801fff78e9
[   11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team
[   12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   12.086880] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, 
left 0x00801fff78e9
[   12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic)
[   12.086926] virbr0-nic: failed to set HW ageing time
[   12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic)
[   12.086928] virbr0: port 1(virbr0-nic) entered listening state
[   12.086934] virbr0: port 1(virbr0-nic) entered listening state
[   12.087023] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, 
left 0x00801fff78e9
[   12.087057] virbr0-nic: set_features() failed (-1); wanted 
0x008048c1, left 0x0080001b48c9
[   12.087088] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, 
left 0x00801fff78e9
[   12.087122] virbr0-nic: set_features() failed (-1); wanted 
0x008048c1, left 0x0080001b48c9
[   12.087153] virbr0: set_features() failed (-1); wanted 0x00801fdb78c9, 
left 0x00801fff78e9
[   12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic)
[   12.102637] virbr0: port 1(virbr0-nic) entered disabled state



bisecting _one_ of the problem (the error message change, but I said "bad" 
whenever there was at least
one error) points to 

commit 0bc05d585d381c30de3fdf955730df31593d2101
Author: Jiri Pirko 
AuthorDate: Wed Oct 14 19:40:50 2015 +0200
Commit: David S. Miller 
CommitDate: Thu Oct 15 06:09:48 2015 -0700

switchdev: allow caller to explicitly request attr_set as deferred


Macvtap has a similar issue on 4.4-rc1:

[  260.414547] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, 
left 0x0004001f5a09
[  260.414555] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, 
left 0x0004001f5a09
[  260.541477] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, 
left 0x0004001f5a09
[  260.541532] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, 
left 0x0004001f5a09
[  260.541603] macvtap0: set_features() failed (-1); wanted 0x0004001f5209, 
left 0x0004001f5a09
[  260.984503] device enccw0.0.f500 entered promiscuous mode


Does somebody have an idea?

Christian

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.4-rc1 errors on bridge/macvtap interfaces (maybe commit 0bc05d58 switchdev: allow caller to exp...).

2015-11-16 Thread Christian Borntraeger
On 11/16/2015 02:19 PM, Michal Kubecek wrote:
> On Mon, Nov 16, 2015 at 02:00:32PM +0100, Christian Borntraeger wrote:
>>
>> on 4.4-rc1 running on an s390x box (so qeth OSA network cards as real NICs) 
>> I get errors like:
>>
>> [   10.940523] Ebtables v2.0 registered
>> [   11.685609] bridge: automatic filtering via arp/ip/ip6tables has been 
>> deprecated. Update your scripts to load br_netfilter if you need this.
>> [   11.686252] virbr0-nic: set_features() failed (-1); wanted 
>> 0x008048c1, left 0x0080001b48c9
>> [   11.794553] virbr0: error setting offload STP state on port 1(virbr0-nic)
>> [   11.794556] virbr0-nic: failed to set HW ageing time
>> [   11.794558] virbr0: error setting offload STP state on port 1(virbr0-nic)
>> [   11.794644] virbr0-nic: set_features() failed (-1); wanted 
>> 0x008048c1, left 0x0080001b48c9
>> [   11.794675] device virbr0-nic entered promiscuous mode
>> [   11.794706] virbr0: set_features() failed (-1); wanted 
>> 0x00801fdb78c9, left 0x00801fff78e9
>> [   11.962901] ip_tables: (C) 2000-2006 Netfilter Core Team
>> [   12.025317] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
>> [   12.086880] virbr0: set_features() failed (-1); wanted 
>> 0x00801fdb78c9, left 0x00801fff78e9
>> [   12.086924] virbr0: error setting offload STP state on port 1(virbr0-nic)
>> [   12.086926] virbr0-nic: failed to set HW ageing time
>> [   12.086927] virbr0: error setting offload STP state on port 1(virbr0-nic)
>> [   12.086928] virbr0: port 1(virbr0-nic) entered listening state
>> [   12.086934] virbr0: port 1(virbr0-nic) entered listening state
>> [   12.087023] virbr0: set_features() failed (-1); wanted 
>> 0x00801fdb78c9, left 0x00801fff78e9
>> [   12.087057] virbr0-nic: set_features() failed (-1); wanted 
>> 0x008048c1, left 0x0080001b48c9
>> [   12.087088] virbr0: set_features() failed (-1); wanted 
>> 0x00801fdb78c9, left 0x00801fff78e9
>> [   12.087122] virbr0-nic: set_features() failed (-1); wanted 
>> 0x008048c1, left 0x0080001b48c9
>> [   12.087153] virbr0: set_features() failed (-1); wanted 
>> 0x00801fdb78c9, left 0x00801fff78e9
>> [   12.102634] virbr0: error setting offload STP state on port 1(virbr0-nic)
>> [   12.102637] virbr0: port 1(virbr0-nic) entered disabled state
>>
> ...
>>
>> [  260.414547] macvtap0: set_features() failed (-1); wanted 
>> 0x0004001f5209, left 0x0004001f5a09
>> [  260.414555] macvtap0: set_features() failed (-1); wanted 
>> 0x0004001f5209, left 0x0004001f5a09
>> [  260.541477] macvtap0: set_features() failed (-1); wanted 
>> 0x0004001f5209, left 0x0004001f5a09
>> [  260.541532] macvtap0: set_features() failed (-1); wanted 
>> 0x0004001f5209, left 0x0004001f5a09
>> [  260.541603] macvtap0: set_features() failed (-1); wanted 
>> 0x0004001f5209, left 0x0004001f5a09
>> [  260.984503] device enccw0.0.f500 entered promiscuous mode
>>
>> Does somebody have an idea?
> 
> Looks like the issue addressed by
> 
>   http://patchwork.ozlabs.org/patch/544307/

Yes. The set_features error is indeed fixed. 
FWIW;
Tested-by: Christian Borntraeger <borntrae...@de.ibm.com>


The other error is independent.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] macvtap regression since 3.18

2015-09-18 Thread Christian Borntraeger
Michael,

here is a fixup for macvtap. It was detected by running
several different patterns via uperf over multiple network
cards. Within some minutes, the network traffic stalled
and Matt bisected it down to 39ec7de7092b ("macvtap: fix
uninitialized access on TUNSETIFF"). Turns out that
this patch broke sndbuf setting. I dont know exactly what
went wrong in that testcase, but an earlier version of
the patch did the trick and the testcase is now stable
again.

I was tempted to add another variable, but u,up,s,sp seem
to have a meaning (signed,signed pointer) so I  made u
an unsigned int again. Long term we might want to refactor
the code to have 

int sk_sendbuf;
short ifr_flags;
int vnet_hdr_sz;

or something like that. But this rework would be to big
for stable.

Some of the casts that I added are not strictly necessary
as get/put_user already do this, but better safe than
sorry as we dont want to rely on the implementation of
macros. Opinions?


Christian Borntraeger (1):
  macvtap: Fix regression for macvtap ioctls

 drivers/net/macvtap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] macvtap: Fix regression for macvtap ioctls

2015-09-18 Thread Christian Borntraeger
To avoid overwriting the upper bits of the flags, commit
39ec7de7092b ("macvtap: fix uninitialized access on
TUNSETIFF") changed the variable u from unsigned int to
unsigned short and added some ORing logic for the flags.
This introduced at least one regression:
- TUNSETSNDBUF supports int as its size and also uses the now
  short u as buffer - this breaks any sendbuf size > 64k

Let's change u back to unsigned int, keep the ORing and
handle the overwrite issue with casts and masking.

Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: David S. Miller <da...@davemloft.net>
Reported-by: Mark A. Peloquin
Bisected-by: Matthew Rosato <mjros...@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>
Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
Cc: sta...@vger.kernel.org
---
 drivers/net/macvtap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index edd7734..c33fe41 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1060,7 +1060,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
void __user *argp = (void __user *)arg;
struct ifreq __user *ifr = argp;
unsigned int __user *up = argp;
-   unsigned short u;
+   unsigned int u;
int __user *sp = argp;
struct sockaddr sa;
int s;
@@ -1076,7 +1076,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
if ((u & ~MACVTAP_FEATURES) != (IFF_NO_PI | IFF_TAP))
ret = -EINVAL;
else
-   q->flags = (q->flags & ~MACVTAP_FEATURES) | u;
+   q->flags = (q->flags & ~MACVTAP_FEATURES) | (short) u;
 
return ret;
 
@@ -1089,9 +1089,8 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
}
 
ret = 0;
-   u = q->flags;
if (copy_to_user(>ifr_name, vlan->dev->name, IFNAMSIZ) ||
-   put_user(u, >ifr_flags))
+   put_user((short) q->flags, >ifr_flags))
ret = -EFAULT;
macvtap_put_vlan(vlan);
rtnl_unlock();
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] macvtap: fix TUNSETSNDBUF values > 64k

2015-09-18 Thread Christian Borntraeger
Am 18.09.2015 um 12:41 schrieb Michael S. Tsirkin:
> Upon TUNSETSNDBUF,  macvtap reads the requested sndbuf size into
> a local variable u.
> commit 39ec7de7092b ("macvtap: fix uninitialized access on
> TUNSETIFF") changed its type to u16 (which is the right thing to
> do for all other macvtap ioctls), breaking all values > 64k.
> 
> The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
> the right thing to do is to read it into an int.
> 
> Cc: David S. Miller <da...@davemloft.net>
> Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
> Reported-by: Mark A. Peloquin
> Bisected-by: Matthew Rosato <mjros...@linux.vnet.ibm.com>
> Reported-by: Christian Borntraeger <borntrae...@de.ibm.com>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>

You can add
Tested-by:  Matthew Rosato <mjros...@linux.vnet.ibm.com>
as this looks identical to an early version of my patch which was tested,
by Matt. (I send you the other version that changes back u as I felt that
u and up are named to identify unsigned)

and please add
Acked-by: Christian Borntraeger <borntrae...@de.ibm.com>

what about
CC: sta...@vger.kernel.org

Christian

> ---
> 
> This patch probably makes sense on stable.
> 
>  drivers/net/macvtap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index edd7734..248478c 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -,10 +,10 @@ static long macvtap_ioctl(struct file *file, unsigned 
> int cmd,
>   return 0;
> 
>   case TUNSETSNDBUF:
> - if (get_user(u, up))
> + if (get_user(s, sp))
>   return -EFAULT;
> 
> - q->sk.sk_sndbuf = u;
> + q->sk.sk_sndbuf = s;
>   return 0;
> 
>   case TUNGETVNETHDRSZ:
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] tun, macvtap: higher order allocations for skbs

2015-06-18 Thread Christian Borntraeger
Am 18.06.2015 um 12:20 schrieb Michael S. Tsirkin:
 Needs more testing. Anyone see anything wrong with this?
Can you explain the motivation? 
FWIW, basic networking between two guest over macvtap still
seems to work on s390 so I dont see any obvious regression.

Christian

 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/net/macvtap.c | 2 +-
  drivers/net/tun.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
 index 928f3f4..80e87e4 100644
 --- a/drivers/net/macvtap.c
 +++ b/drivers/net/macvtap.c
 @@ -610,7 +610,7 @@ static inline struct sk_buff *macvtap_alloc_skb(struct 
 sock *sk, size_t prepad,
   linear = len;
 
   skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
 -err, 0);
 +err, 1);
   if (!skb)
   return NULL;
 
 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index cb376b2d..8f2f1e5 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -1069,7 +1069,7 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
 *tfile,
   linear = len;
 
   skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
 -err, 0);
 +err, 1);
   if (!skb)
   return ERR_PTR(err);
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] virtio_net: Fix oops on early interrupts - introduced by virtio reset code

2008-02-18 Thread Christian Borntraeger
Am Montag, 11. Februar 2008 schrieb Anthony Liguori:
 The reset support is in Linus's tree so we should try to push it for -rc2.

You are right. My repository was borked. will push it to Jeff Garzik. Thanks

Jeff can you schedule this fix into your network driver updates? Thanks
---


With the latest virtio_reset patches I got the following oops:

Unable to handle kernel pointer dereference at virtual kernel address 

Oops: 0004 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168
Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78)
Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98)
   R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0001  0efd0e60 0001
    0f866040  
   008de4c8 1237 1237 0f977dd8
   0020 001132bc 0f977e08 0f977dd8
Krnl Code: 0047597c: e3104034   lg  %r1,48(%r4)
   00475982: b9040001   lgr %r0,%r1
   00475986: b9810003   ogr %r0,%r3
  0047598a: eb1040300030   csg %r1,%r0,48(%r4)
   00475990: a744fff9   brc 4,475982
   00475994: a7110001   tmll%r1,1
   00475998: a7840009   brc 8,4759aa
   0047599c: e340b0b80004   lg  %r4,184(%r11)
Call Trace:
([01500f978000] 0x1500f978000)
 [004779a6] vring_interrupt+0x72/0x88
 [00491d9c] kvm_extint_handler+0x34/0x44
 [0010d2d4] do_extint+0xc0/0xfc
 [00113b5a] ext_no_vtime+0x1c/0x20
 [0010a0b6] cpu_idle+0x21a/0x230
([0010a096] cpu_idle+0x1fa/0x230)
 [0057dfe4] start_secondary+0xa0/0xb4

We must initialize vdev-priv before we use the notify hypercall as 
vdev-priv is used in skb_recv_done. So lets move the assignment of 
vdev-priv before we call try_fill_recv.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Acked-by: Anthony Liguori [EMAIL PROTECTED]

---
 drivers/net/virtio_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d
netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight);
vi-dev = dev;
vi-vdev = vdev;
+   vdev-priv = vi;
 
/* We expect two virtqueues, receive then send. */
vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done);
@@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d
}
 
pr_debug(virtnet: registered device %s\n, dev-name);
-   vdev-priv = vi;
return 0;
 
 unregister:
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: Fix oops on early interrupts - introduced by virtio reset code

2008-02-11 Thread Christian Borntraeger
Am Montag, 11. Februar 2008 schrieb Anthony Liguori:
 The reset support is in Linus's tree so we should try to push it for -rc2.

You are right. My repository was borked. will push it to Jeff Garzik. Thanks

Jeff can you schedule this fix into your network driver updates? Thanks
---


With the latest virtio_reset patches I got the following oops:

Unable to handle kernel pointer dereference at virtual kernel address 

Oops: 0004 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168
Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78)
Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98)
   R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0001  0efd0e60 0001
    0f866040  
   008de4c8 1237 1237 0f977dd8
   0020 001132bc 0f977e08 0f977dd8
Krnl Code: 0047597c: e3104034   lg  %r1,48(%r4)
   00475982: b9040001   lgr %r0,%r1
   00475986: b9810003   ogr %r0,%r3
  0047598a: eb1040300030   csg %r1,%r0,48(%r4)
   00475990: a744fff9   brc 4,475982
   00475994: a7110001   tmll%r1,1
   00475998: a7840009   brc 8,4759aa
   0047599c: e340b0b80004   lg  %r4,184(%r11)
Call Trace:
([01500f978000] 0x1500f978000)
 [004779a6] vring_interrupt+0x72/0x88
 [00491d9c] kvm_extint_handler+0x34/0x44
 [0010d2d4] do_extint+0xc0/0xfc
 [00113b5a] ext_no_vtime+0x1c/0x20
 [0010a0b6] cpu_idle+0x21a/0x230
([0010a096] cpu_idle+0x1fa/0x230)
 [0057dfe4] start_secondary+0xa0/0xb4

We must initialize vdev-priv before we use the notify hypercall as 
vdev-priv is used in skb_recv_done. So lets move the assignment of 
vdev-priv before we call try_fill_recv.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Acked-by: Anthony Liguori [EMAIL PROTECTED]

---
 drivers/net/virtio_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d
netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight);
vi-dev = dev;
vi-vdev = vdev;
+   vdev-priv = vi;
 
/* We expect two virtqueues, receive then send. */
vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done);
@@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d
}
 
pr_debug(virtnet: registered device %s\n, dev-name);
-   vdev-priv = vi;
return 0;
 
 unregister:
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: Fix oops on early interrupts - introduced by virtio reset code

2008-02-11 Thread Christian Borntraeger
Avi,
this fixes a problem that was introduced by the virtio_reset patches. 
Can you apply that fix to kvm.git as a bugfix,  as the virtio_reset
infrastructure is not on Linus upstream yet?

Anthony, Dor,
are you ok with that change?

--

With the latest virtio_reset patches I got the following oops:

Unable to handle kernel pointer dereference at virtual kernel address 

Oops: 0004 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 Not tainted 2.6.24zlive-guest-10577-g63f5307-dirty #168
Process swapper (pid: 0, task: 0f866040, ksp: 0f86fd78)
Krnl PSW : 040410018000 0047598a (skb_recv_done+0x52/0x98)
   R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0001  0efd0e60 0001
    0f866040  
   008de4c8 1237 1237 0f977dd8
   0020 001132bc 0f977e08 0f977dd8
Krnl Code: 0047597c: e3104034   lg  %r1,48(%r4)
   00475982: b9040001   lgr %r0,%r1
   00475986: b9810003   ogr %r0,%r3
  0047598a: eb1040300030   csg %r1,%r0,48(%r4)
   00475990: a744fff9   brc 4,475982
   00475994: a7110001   tmll%r1,1
   00475998: a7840009   brc 8,4759aa
   0047599c: e340b0b80004   lg  %r4,184(%r11)
Call Trace:
([01500f978000] 0x1500f978000)
 [004779a6] vring_interrupt+0x72/0x88
 [00491d9c] kvm_extint_handler+0x34/0x44
 [0010d2d4] do_extint+0xc0/0xfc
 [00113b5a] ext_no_vtime+0x1c/0x20
 [0010a0b6] cpu_idle+0x21a/0x230


We must initialize vdev-priv before we use the notify hypercall as 
vdev-priv is used in skb_recv_done. So lets move the assignment of 
vdev-priv before we call try_fill_recv.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -361,6 +361,7 @@ static int virtnet_probe(struct virtio_d
netif_napi_add(dev, vi-napi, virtnet_poll, napi_weight);
vi-dev = dev;
vi-vdev = vdev;
+   vdev-priv = vi;
 
/* We expect two virtqueues, receive then send. */
vi-rvq = vdev-config-find_vq(vdev, 0, skb_recv_done);
@@ -395,7 +396,6 @@ static int virtnet_probe(struct virtio_d
}
 
pr_debug(virtnet: registered device %s\n, dev-name);
-   vdev-priv = vi;
return 0;
 
 unregister:
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: Fix open - interrupt race

2008-02-05 Thread Christian Borntraeger
I got the following oops during interface ifup. Unfortunately its not 
easily reproducable so I cant say for sure that my fix fixes this
problem, but I am confident and I think its correct anyway:

   2kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234!
4illegal operation: 0001 [#1] PREEMPT SMP
4Modules linked in:
4CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91
4Process swapper (pid: 0, task: 00800938, ksp: 0084ddb8)
4Krnl PSW : 040430018000 00466374 (vring_disable_cb+0x30/0x34)
4   R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
4Krnl GPRS: 0001 0001 10003800 
00466344
4   0e980900 008848b0 0084e748 

4   0087b300 1237 1237 
0f85bdd8
4   0e980920 001137c0 00464754 
0f85bdd8
4Krnl Code: 00466368: e3b0b074lg  %r11,112(%r11)
4   0046636e: 07febcr 15,%r14
4   00466370: a7f40001brc 15,466372
4  00466374: a7f4fff6brc 15,466360
4   00466378: eb7ff0500024stmg
%r7,%r15,80(%r15)
4   0046637e: a7f13e00tmll%r15,15872
4   00466382: b90400eflgr %r14,%r15
4   00466386: a7840001brc 8,466388
4Call Trace:
4([000201500f85c000] 0x201500f85c000)
4 [00466556] vring_interrupt+0x72/0x88
4 [004801a0] kvm_extint_handler+0x34/0x44
4 [0010d22c] do_extint+0xbc/0xf8
4 [00113f98] ext_no_vtime+0x16/0x1a
4 [0010a182] cpu_idle+0x216/0x238
4([0010a162] cpu_idle+0x1f6/0x238)
4 [00568656] rest_init+0xaa/0xb8
4 [0084ee2c] start_kernel+0x3fc/0x490
4 [00100020] _stext+0x20/0x80
4
4 0Kernel panic - not syncing: Fatal exception in interrupt
4

After looking at the code and the dump I think the following scenario
happened: Ifup was running on cpu2 and the interrupt arrived on cpu0.
Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb
but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was
called by vring_interrupt, executed netif_rx_schedule_prep, which
succeeded and therefore called disable_cb. This triggered the BUG_ON,
as interrupts were already disabled by cpu 2.

I think the proper solution is to make the call to disable_cb depend on
the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep
in the same way as skb_recv_done. 


Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -321,10 +321,12 @@ static int virtnet_open(struct net_devic
 
/* If all buffers were filled by other side before we napi_enabled, we
 * won't get another interrupt, so process any outstanding packets
-* now.  virtnet_poll wants re-enable the queue, so we disable here. */
-   vi-rvq-vq_ops-disable_cb(vi-rvq);
-   netif_rx_schedule(vi-dev, vi-napi);
-
+* now.  virtnet_poll wants re-enable the queue, so we disable here.
+* We synchronize against interrupts via NAPI_STATE_SCHED */
+   if (netif_rx_schedule_prep(dev, vi-napi)) {
+   vi-rvq-vq_ops-disable_cb(vi-rvq);
+   __netif_rx_schedule(dev, vi-napi);
+   }
return 0;
 }
-- 
IBM Deutschland Entwicklung GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher 
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: Fix open - interrupt race

2008-02-05 Thread Christian Borntraeger
Jeff,

Rusty is (supposed to be) on vacation. Can you send this fix against
virtio_net with your next network driver fixes for 2.6.25? 

Thank you

-

I got the following oops during interface ifup. Unfortunately its not 
easily reproducable so I cant say for sure that my fix fixes this
problem, but I am confident and I think its correct anyway:

   2kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234!
4illegal operation: 0001 [#1] PREEMPT SMP
4Modules linked in:
4CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91
4Process swapper (pid: 0, task: 00800938, ksp: 0084ddb8)
4Krnl PSW : 040430018000 00466374 (vring_disable_cb+0x30/0x34)
4   R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
4Krnl GPRS: 0001 0001 10003800 
00466344
4   0e980900 008848b0 0084e748 

4   0087b300 1237 1237 
0f85bdd8
4   0e980920 001137c0 00464754 
0f85bdd8
4Krnl Code: 00466368: e3b0b074lg  %r11,112(%r11)
4   0046636e: 07febcr 15,%r14
4   00466370: a7f40001brc 15,466372
4  00466374: a7f4fff6brc 15,466360
4   00466378: eb7ff0500024stmg
%r7,%r15,80(%r15)
4   0046637e: a7f13e00tmll%r15,15872
4   00466382: b90400eflgr %r14,%r15
4   00466386: a7840001brc 8,466388
4Call Trace:
4([000201500f85c000] 0x201500f85c000)
4 [00466556] vring_interrupt+0x72/0x88
4 [004801a0] kvm_extint_handler+0x34/0x44
4 [0010d22c] do_extint+0xbc/0xf8
4 [00113f98] ext_no_vtime+0x16/0x1a
4 [0010a182] cpu_idle+0x216/0x238
4([0010a162] cpu_idle+0x1f6/0x238)
4 [00568656] rest_init+0xaa/0xb8
4 [0084ee2c] start_kernel+0x3fc/0x490
4 [00100020] _stext+0x20/0x80
4
4 0Kernel panic - not syncing: Fatal exception in interrupt
4

After looking at the code and the dump I think the following scenario
happened: Ifup was running on cpu2 and the interrupt arrived on cpu0.
Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb
but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was
called by vring_interrupt, executed netif_rx_schedule_prep, which
succeeded and therefore called disable_cb. This triggered the BUG_ON,
as interrupts were already disabled by cpu 2.

I think the proper solution is to make the call to disable_cb depend on
the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep
in the same way as skb_recv_done. 


Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Acked-by: Rusty Russell [EMAIL PROTECTED]

---
 drivers/net/virtio_net.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -321,10 +321,12 @@ static int virtnet_open(struct net_devic
 
/* If all buffers were filled by other side before we napi_enabled, we
 * won't get another interrupt, so process any outstanding packets
-* now.  virtnet_poll wants re-enable the queue, so we disable here. */
-   vi-rvq-vq_ops-disable_cb(vi-rvq);
-   netif_rx_schedule(vi-dev, vi-napi);
-
+* now.  virtnet_poll wants re-enable the queue, so we disable here.
+* We synchronize against interrupts via NAPI_STATE_SCHED */
+   if (netif_rx_schedule_prep(dev, vi-napi)) {
+   vi-rvq-vq_ops-disable_cb(vi-rvq);
+   __netif_rx_schedule(dev, vi-napi);
+   }
return 0;
 }
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio_net and SMP guests

2008-01-10 Thread Christian Borntraeger
Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger:
 Am Donnerstag, 10. Januar 2008 schrieb Christian Borntraeger:
  Am Dienstag, 18. Dezember 2007 schrieb Rusty Russell:
   To me this points to doing interrupt suppression a different way.  If 
we
   have a -disable_cb() virtio function, and call it before we call
   netif_rx_schedule, does that fix it?
  
  The fix looks good and I agree with it.
  
  There is one problem that I try to find for some days, but the following 
  BUG_ON triggers:
  
  static void vring_disable_cb(struct virtqueue *_vq)
  {
  struct vring_virtqueue *vq = to_vvq(_vq);
  
  START_USE(vq);
     BUG_ON(vq-vring.avail-flags  VRING_AVAIL_F_NO_INTERRUPT);
  vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT;
  END_USE(vq);
  }
 
 Ok, I found it:
 
 static int virtnet_open(struct net_device *dev)
 {
 struct virtnet_info *vi = netdev_priv(dev);
 try_fill_recv(vi);
 /* If we didn't even get one input buffer, we're useless. */
 if (vi-num == 0)
 return -ENOMEM;
   --- int for new packet
   static void skb_recv_done(struct 
 virtqueue *rvq)
   {
   struct virtnet_info *vi = 
 rvq-vdev-priv;
   /* Suppress further interrupts. 
 */
   rvq-vq_ops-disable_cb(rvq);
   netif_rx_schedule(vi-dev, 
 vi-napi);
   }
   - poll is not yet possible, no softirq
   - return from interrupt
 napi_enable(vi-napi);
   vi-rvq-vq_ops-disable_cb(vi-rvq);
   --- BUG: its already disabled

Btw. this problem also happens on single processor guests.

What about the following patch:

---
 drivers/net/virtio_net.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -179,9 +179,12 @@ static void try_fill_recv(struct virtnet
 static void skb_recv_done(struct virtqueue *rvq)
 {
struct virtnet_info *vi = rvq-vdev-priv;
-   /* Suppress further interrupts. */
-   rvq-vq_ops-disable_cb(rvq);
-   netif_rx_schedule(vi-dev, vi-napi);
+   /* Schedule NAPI, Suppress further interrupts if successful. */
+
+   if (netif_rx_schedule_prep(vi-dev, vi-napi)) {
+   rvq-vq_ops-disable_cb(rvq);
+   __netif_rx_schedule(vi-dev, vi-napi);
+   }
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-13 Thread Christian Borntraeger
Am Donnerstag, 13. Dezember 2007 schrieb Dor Laor:
 You're right I got confused somehow.
 So in that case setting the driver status field on open in addition to 
 your enable will do the trick.
 On DRIVER_OPEN the host will trigger an interrupt if the queue is not 
 empty..
 Thanks,
 Dor

After looking into some other drivers, I prefer my first patch (moving 
napi_enable) ;-)

There are some drivers like xen-netfront, b44, which call napi_enable before 
the buffers are passed to the hardware. So it seems that moving napi is 
also a valid option.

But maybe I can just wait until Rusty returns from vacation (I will leave 
next week) so everything might be wonderful when I return ;-)

Rusty, if you decide to apply my patch, there is one downside: The debugging 
code in virtio_ring sometimes triggers with a false positive:

try_fill_recv calls vring_kick. Here we do a notify to the host. This might 
cause an immediate interrupt, triggering the poll routine before vring_kick 
can run END_USE. This is no real problem, as we dont touch the vq after 
notify.

Christian, facing 64 guest cpus unconvering all kind of races

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-12 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor:
 This is why initially I suggested another status code in order to split 
 the ring logic with driver status.
  but also not filling any buffers as long as VIRTIO_CONFIG_DEV_OPEN is 
not 
  set. I will have a look but I think that add_status needs to be called 

 It can fill the buffers even without dev_open, when the dev is finally 
 opened the host will issue an interrupt if there are pending buffers. 

There is a problem associated with that scheme. Setting the value in the 
config space does not notify the hypervisor. That means the host will not 
send an interrupt. The interrupt is sent if the host tries to send the next 
packet. If somehow the host manages to use up all buffers before the device 
is finally opened, we have a problem.
 (I'm not sure it's worth solving, maybe  just drop them like you 
suggested).
As said above, dropping seems to me the preferred method. Did I miss 
something?

Christian
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-12 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Rusty Russell:
 On Wednesday 12 December 2007 00:16:12 Christian Borntraeger wrote:
  That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in
  virtio_ring.c - maybe we can use that. Its hidden in callback and
  restart handling, what about adding an explicit startup?
 
 Yes, I debated whether to make this a separate hook or not; the current
 method  reduces the number of function calls without having two ways of
 disabling  callbacks.
 
 In this case, simply starting devices with callbacks disabled and 
 renaming 'restart' to 'enable' (or something) and calling it at the
 beginning is probably sufficient?

So you suggest something like the following patch? It seems to work but 
there is still a theoretical race at startup. Therefore, I tend to agree 
with Dor that a separate hook seems prefereable, so I am not fully sure if 
the patch is the final solution:

ps: Its ok to answer that after your vacation. 

---
 drivers/block/virtio_blk.c   |3 ++-
 drivers/net/virtio_net.c |5 -
 drivers/virtio/virtio_ring.c |9 -
 include/linux/virtio.h   |4 ++--
 4 files changed, 12 insertions(+), 9 deletions(-)

Index: kvm/drivers/virtio/virtio_ring.c
===
--- kvm.orig/drivers/virtio/virtio_ring.c
+++ kvm/drivers/virtio/virtio_ring.c
@@ -220,7 +220,7 @@ static void *vring_get_buf(struct virtqu
return ret;
 }
 
-static bool vring_restart(struct virtqueue *_vq)
+static bool vring_enable(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
 
@@ -264,7 +264,7 @@ static struct virtqueue_ops vring_vq_ops
.add_buf = vring_add_buf,
.get_buf = vring_get_buf,
.kick = vring_kick,
-   .restart = vring_restart,
+   .enable = vring_enable,
.shutdown = vring_shutdown,
 };
 
@@ -299,9 +299,8 @@ struct virtqueue *vring_new_virtqueue(un
vq-in_use = false;
 #endif
 
-   /* No callback?  Tell other side not to bother us. */
-   if (!callback)
-   vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT;
+   /* disable interrupts until we enable them */
+   vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT;
 
/* Put everything in free lists. */
vq-num_free = num;
Index: kvm/include/linux/virtio.h
===
--- kvm.orig/include/linux/virtio.h
+++ kvm/include/linux/virtio.h
@@ -41,7 +41,7 @@ struct virtqueue
  * vq: the struct virtqueue we're talking about.
  * len: the length written into the buffer
  * Returns NULL or the data token handed to add_buf.
- * @restart: restart callbacks after callback returned false.
+ * @enable: restart callbacks after callback returned false.
  * vq: the struct virtqueue we're talking about.
  * This returns false (and doesn't re-enable) if there are pending
  * buffers in the queue, to avoid a race.
@@ -65,7 +65,7 @@ struct virtqueue_ops {
 
void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
 
-   bool (*restart)(struct virtqueue *vq);
+   bool (*enable)(struct virtqueue *vq);
 
void (*shutdown)(struct virtqueue *vq);
 };
Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -201,7 +201,7 @@ again:
/* Out of packets? */
if (received  budget) {
netif_rx_complete(vi-dev, napi);
-   if (unlikely(!vi-rvq-vq_ops-restart(vi-rvq))
+   if (unlikely(!vi-rvq-vq_ops-enable(vi-rvq))
 netif_rx_reschedule(vi-dev, napi))
goto again;
}
@@ -292,6 +292,9 @@ static int virtnet_open(struct net_devic
return -ENOMEM;
 
napi_enable(vi-napi);
+
+   vi-rvq-vq_ops-enable(vi-rvq);
+   vi-svq-vq_ops-enable(vi-svq);
return 0;
 }
 
Index: kvm/drivers/block/virtio_blk.c
===
--- kvm.orig/drivers/block/virtio_blk.c
+++ kvm/drivers/block/virtio_blk.c
@@ -183,7 +183,8 @@ static int virtblk_probe(struct virtio_d
err = PTR_ERR(vblk-vq);
goto out_free_vblk;
}
-
+   /* enable interrupts */
+   vblk-vq-vq_ops-enable(vblk-vq);
vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk-pool) {
err = -ENOMEM;

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each transmit

2007-12-12 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor:
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -406,10 +405,10 @@ again:

Hmm, while I agree in general with the patch, I fail to find the proper 
version of virtio_net where this patch applies. I tried kvm.git and 
linux-2.6.git from kernel.org. Can you give me a pointer to the repository 
where you work on virtio?

Christian
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each transmit

2007-12-12 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor:
 Christian Borntraeger wrote:
 
  Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor:
   --- a/drivers/net/virtio_net.c
   +++ b/drivers/net/virtio_net.c
   @@ -406,10 +405,10 @@ again:
 
  Hmm, while I agree in general with the patch, I fail to find the proper
  version of virtio_net where this patch applies. I tried kvm.git and
  linux-2.6.git from kernel.org. Can you give me a pointer to the 
repository
  where you work on virtio?
 
 Sorry for that, I added some debug prints of my one.
 Here it is: *git clone 
 git*://kvm.*qumranet*.com/home/*dor*/src/linux-2.6-nv use branch 'virtio'.

Ah, ok. I will look into that branch.

 BTW: what git repository do you use?

I use Avis git from kernel.org:
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm

Christian
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-12 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Dor Laor:
 I think the change below handles the race. Otherwise please detail the 
 use case.
[...]
  @@ -292,6 +292,9 @@ static int virtnet_open(struct net_devic
  return -ENOMEM;
 
  napi_enable(vi-napi);
  +
  +   vi-rvq-vq_ops-enable(vi-rvq);
  +   vi-svq-vq_ops-enable(vi-svq);
 
 If you change it to:
 if (!vi-rvq-vq_ops-enable(vi-rvq))
 vi-rvq-vq_ops-kick(vi-rvq);
 if (!vi-rvq-vq_ops-enable(vi-svq))
 vi-rvq-vq_ops-kick(vi-svq);
 
 You solve the race of packets already waiting in the queue without 
 triggering the irq.

Hmm, I dont fully understand your point. I think this will work as long as 
the host has not consumed all inbound buffers. It will also require that 
the host sends an additional packet, no? If no additional packet comes the 
host has no reason to send an interrupt just because it got a notify 
hypercall. kick inside a guest also does not trigger the poll routine. 

It also wont work on the following scenario:
in virtnet open we will allocate buffers and send them to the host using the 
kick callback. The host can now use _all_ buffers for incoming data while 
interrupts are still disabled and the guest is not running.( Lets say the 
host bridge has lots of multicast traffic and the guest gets not scheduled 
for a while). When the guest now continues and enables the interrupts 
nothing happens. Doing a kick does not help, as the host code will bail out 
with no dma memory for transfer.

Christian

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] virtio_net: remove double ether_setup

2007-12-11 Thread Christian Borntraeger
Hello Rusty,

this is a small fix for virtio_net.
virtnet_probe already calls alloc_etherdev, which calls ether_setup.
There is no need to do that again.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |1 -
 1 file changed, 1 deletion(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -329,11 +329,10 @@ static int virtnet_probe(struct virtio_d
dev = alloc_etherdev(sizeof(struct virtnet_info));
if (!dev)
return -ENOMEM;
 
/* Set up network device as normal. */
-   ether_setup(dev);
dev-open = virtnet_open;
dev-stop = virtnet_close;
dev-hard_start_xmit = start_xmit;
dev-features = NETIF_F_HIGHDMA;
SET_NETDEV_DEV(dev, vdev-dev);
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resent] virtio_net: Fix stalled inbound traffic on early packets

2007-12-11 Thread Christian Borntraeger
Hello Rusty,

while implementing and testing virtio on s390 I found a problem in
virtio_net: The current virtio_net driver has a startup race, which
prevents any incoming traffic:

If try_fill_recv submits buffers to the host system data might be
filled in and an interrupt is sent, before napi_enable finishes.
In that case the interrupt will kick skb_recv_done which will then
call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED
is set - which is not as we did not run napi_enable. No poll routine
is scheduled. Furthermore, skb_recv_done returns false, we disables
interrupts for this device.

One solution is the enable napi before inbound buffer are available.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -285,13 +285,15 @@ static int virtnet_open(struct net_devic
 {
struct virtnet_info *vi = netdev_priv(dev);
 
+   napi_enable(vi-napi);
try_fill_recv(vi);
 
/* If we didn't even get one input buffer, we're useless. */
-   if (vi-num == 0)
+   if (vi-num == 0) {
+   napi_disable(vi-napi);
return -ENOMEM;
+   }
 
-   napi_enable(vi-napi);
return 0;
 }
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-11 Thread Christian Borntraeger
2nd try. I somehow enable html on the last post

Dor Laor wrote:
 Christian Borntraeger wrote:

 Hello Rusty,

 while implementing and testing virtio on s390 I found a problem in
 virtio_net: The current virtio_net driver has a startup race, which
 prevents any incoming traffic:

 If try_fill_recv submits buffers to the host system data might be
 filled in and an interrupt is sent, before napi_enable finishes.
 In that case the interrupt will kick skb_recv_done which will then
 call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED
 is set - which is not as we did not run napi_enable. No poll routine
 is scheduled. Furthermore, skb_recv_done returns false, we disables
 interrupts for this device.

 One solution is the enable napi before inbound buffer are available.

 But then you might get recv interrupt without a buffer.

If I look at the current implementation (lguest) no interrupt is sent if
there is not buffer available, no?
On the other hand, if the host does send an interrupt when no buffer is
available, this is also no problem. Looking at virtnet_poll, it seems it
can cope with an empty ring. 

 The way other physical NICs doing it is by dis/en/abling interrupt 
 using registers (look at e1000).
 I suggest we can export add_status and use the original code but
 before enabling napi add a call to add_status(dev, 
 VIRTIO_CONFIG_DEV_OPEN).
 The host won't trigger an irq until it sees the above.

That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in
virtio_ring.c - maybe we can use that. Its hidden in callback and
restart handling, what about adding an explicit startup?

 BTW: Rusty is on vacation and that's probably the reason he didn't 
 respond.
 Regards,
 Dor.

Ok, didnt know that. At the moment I can live with my private patch
while we work on a final solution. Meanwhile I will try to debug virtio
on SMP guests - I still see some strange races on our test system. (But
block and net is now working on s390 and can cope with medium load. )

Christian


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] [PATCH resent] virtio_net: Fix stalled inbound trafficon early packets

2007-12-11 Thread Christian Borntraeger
Am Dienstag, 11. Dezember 2007 schrieb Christian Borntraeger:
  The way other physical NICs doing it is by dis/en/abling interrupt 
  using registers (look at e1000).
  I suggest we can export add_status and use the original code but
  before enabling napi add a call to add_status(dev, 
  VIRTIO_CONFIG_DEV_OPEN).
  The host won't trigger an irq until it sees the above.
 
 That would also work. We already have VRING_AVAIL_F_NO_INTERRUPT in
 virtio_ring.c - maybe we can use that. Its hidden in callback and
 restart handling, what about adding an explicit startup?

Ok, just to give an example what I thought about:
---
 drivers/block/virtio_blk.c   |3 ++-
 drivers/net/virtio_net.c |2 ++
 drivers/virtio/virtio_ring.c |   16 +---
 include/linux/virtio.h   |5 +
 4 files changed, 22 insertions(+), 4 deletions(-)

Index: kvm/drivers/virtio/virtio_ring.c
===
--- kvm.orig/drivers/virtio/virtio_ring.c
+++ kvm/drivers/virtio/virtio_ring.c
@@ -241,6 +241,16 @@ static bool vring_restart(struct virtque
return true;
 }
 
+static void vring_startup(struct virtqueue *_vq)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   START_USE(vq);
+   if (vq-vq.callback)
+   vq-vring.avail-flags = ~VRING_AVAIL_F_NO_INTERRUPT;
+   END_USE(vq);
+}
+
+
 irqreturn_t vring_interrupt(int irq, void *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
@@ -265,6 +275,7 @@ static struct virtqueue_ops vring_vq_ops
.get_buf = vring_get_buf,
.kick = vring_kick,
.restart = vring_restart,
+   .startup = vring_startup,
.shutdown = vring_shutdown,
 };
 
@@ -299,9 +310,8 @@ struct virtqueue *vring_new_virtqueue(un
vq-in_use = false;
 #endif
 
-   /* No callback?  Tell other side not to bother us. */
-   if (!callback)
-   vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT;
+   /* disable interrupts until we enable them */
+   vq-vring.avail-flags |= VRING_AVAIL_F_NO_INTERRUPT;
 
/* Put everything in free lists. */
vq-num_free = num;
Index: kvm/include/linux/virtio.h
===
--- kvm.orig/include/linux/virtio.h
+++ kvm/include/linux/virtio.h
@@ -45,6 +45,9 @@ struct virtqueue
  * vq: the struct virtqueue we're talking about.
  * This returns false (and doesn't re-enable) if there are pending
  * buffers in the queue, to avoid a race.
+ * @startup: enable callbacks
+ * vq: the struct virtqueue we're talking abount
+ * Returns 0 or an error
  * @shutdown: unadd all buffers.
  * vq: the struct virtqueue we're talking about.
  * Remove everything from the queue.
@@ -67,6 +70,8 @@ struct virtqueue_ops {
 
bool (*restart)(struct virtqueue *vq);
 
+   void (*startup) (struct virtqueue *vq);
+
void (*shutdown)(struct virtqueue *vq);
 };
 
Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -292,6 +292,8 @@ static int virtnet_open(struct net_devic
return -ENOMEM;
 
napi_enable(vi-napi);
+
+   vi-rvq-vq_ops-startup(vi-rvq);
return 0;
 }
 
Index: kvm/drivers/block/virtio_blk.c
===
--- kvm.orig/drivers/block/virtio_blk.c
+++ kvm/drivers/block/virtio_blk.c
@@ -183,7 +183,8 @@ static int virtblk_probe(struct virtio_d
err = PTR_ERR(vblk-vq);
goto out_free_vblk;
}
-
+   /* enable interrupts */
+   vblk-vq-vq_ops-startup(vblk-vq);
vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk-pool) {
err = -ENOMEM;



There is still one small problem: what if the host fills up all 
host-to-guest buffers before we call startup? So I start to think that your 
solution is better, given that the host is not only not sending interrupts 
but also not filling any buffers as long as VIRTIO_CONFIG_DEV_OPEN is not 
set. I will have a look but I think that add_status needs to be called 
after napi_enable, otherwise we have the same race.

Christian
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: remove double ether_setup

2007-12-11 Thread Christian Borntraeger
Am Mittwoch, 12. Dezember 2007 schrieb Rusty Russell:
 Can you send straight to akpm or davem?  I'm supposed to be on vacation at 

this is a small fix for virtio_net.
virtnet_probe already calls alloc_etherdev, which calls ether_setup.
There is no need to do that again.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Acked-by: Rusty Russell [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |1 -
 1 file changed, 1 deletion(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -329,11 +329,10 @@ static int virtnet_probe(struct virtio_d
dev = alloc_etherdev(sizeof(struct virtnet_info));
if (!dev)
return -ENOMEM;
 
/* Set up network device as normal. */
-   ether_setup(dev);
dev-open = virtnet_open;
dev-stop = virtnet_close;
dev-hard_start_xmit = start_xmit;
dev-features = NETIF_F_HIGHDMA;
SET_NETDEV_DEV(dev, vdev-dev);


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio_net: Fix stalled inbound traffic on early packets

2007-12-06 Thread Christian Borntraeger
The current virtio_net driver has a startup race, which prevents any
incoming traffic:

If try_fill_recv submits buffers to the host system data might be
filled in and an interrupt is sent, before napi_enable finishes.
In that case the interrupt will kick skb_recv_done which will then
call netif_rx_schedule. netif_rx_schedule checks, if NAPI_STATE_SCHED
is set - which is not as we did not run napi_enable. No poll routine
is scheduled. Furthermore, skb_recv_done returns false, we disables
interrupts for this device.

One solution is the enable napi before inbound buffer are available.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -285,13 +285,15 @@ static int virtnet_open(struct net_devic
 {
struct virtnet_info *vi = netdev_priv(dev);
 
+   napi_enable(vi-napi);
try_fill_recv(vi);
 
/* If we didn't even get one input buffer, we're useless. */
-   if (vi-num == 0)
+   if (vi-num == 0) {
+   napi_disable(vi-napi);
return -ENOMEM;
+   }
 
-   napi_enable(vi-napi);
return 0;
 }
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] virtio: free transmit skbs when notified, not on next xmit.

2007-11-19 Thread Christian Borntraeger
Am Montag, 19. November 2007 schrieb Rusty Russell:
 This fixes a potential dangling xmit problem.
 
 We also suppress refill interrupts until we need them.
 (Anthony and I have been working on performance recently, so this
 is a WIP).
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]

Thanks Rusty,
that was the 2nd next item on my todo list :-)

Acked-by: Christian Borntraeger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IPVS: Fix sysctl warnings about missing strategy

2007-11-13 Thread Christian Borntraeger
Running the latest git code I get the following messages during boot:
sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy
[...] 
sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing 
strategy

I removed the binary sysctl handler for those messages and also removed
the definitions in ip_vs.h. The alternative would be to implement a 
proper strategy handler, but syscall sysctl is deprecated.

There are other sysctl definitions that are commented out or work with 
the default sysctl_data strategy. I did not touch these. 

Eric, IPVS team, are you ok with that change?

CC: Eric W. Biederman [EMAIL PROTECTED]
CC: Wensong Zhang [EMAIL PROTECTED]
CC: Simon Horman [EMAIL PROTECTED]
CC: Julian Anastasov [EMAIL PROTECTED]
Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]

---
 include/net/ip_vs.h   |4 
 kernel/sysctl_check.c |4 
 net/ipv4/ipvs/ip_vs_ctl.c |4 
 3 files changed, 12 deletions(-)

Index: linux-2.6/include/net/ip_vs.h
===
--- linux-2.6.orig/include/net/ip_vs.h
+++ linux-2.6/include/net/ip_vs.h
@@ -336,9 +336,6 @@ enum {
NET_IPV4_VS_DEBUG_LEVEL=1,
NET_IPV4_VS_AMEMTHRESH=2,
NET_IPV4_VS_AMDROPRATE=3,
-   NET_IPV4_VS_DROP_ENTRY=4,
-   NET_IPV4_VS_DROP_PACKET=5,
-   NET_IPV4_VS_SECURE_TCP=6,
NET_IPV4_VS_TO_ES=7,
NET_IPV4_VS_TO_SS=8,
NET_IPV4_VS_TO_SR=9,
@@ -355,7 +352,6 @@ enum {
NET_IPV4_VS_LBLCR_EXPIRE=20,
NET_IPV4_VS_CACHE_BYPASS=22,
NET_IPV4_VS_EXPIRE_NODEST_CONN=23,
-   NET_IPV4_VS_SYNC_THRESHOLD=24,
NET_IPV4_VS_NAT_ICMP_SEND=25,
NET_IPV4_VS_EXPIRE_QUIESCENT_TEMPLATE=26,
NET_IPV4_VS_LAST
Index: linux-2.6/net/ipv4/ipvs/ip_vs_ctl.c
===
--- linux-2.6.orig/net/ipv4/ipvs/ip_vs_ctl.c
+++ linux-2.6/net/ipv4/ipvs/ip_vs_ctl.c
@@ -1451,7 +1451,6 @@ static struct ctl_table vs_vars[] = {
.proc_handler   = proc_dointvec,
},
{
-   .ctl_name   = NET_IPV4_VS_DROP_ENTRY,
.procname   = drop_entry,
.data   = sysctl_ip_vs_drop_entry,
.maxlen = sizeof(int),
@@ -1459,7 +1458,6 @@ static struct ctl_table vs_vars[] = {
.proc_handler   = proc_do_defense_mode,
},
{
-   .ctl_name   = NET_IPV4_VS_DROP_PACKET,
.procname   = drop_packet,
.data   = sysctl_ip_vs_drop_packet,
.maxlen = sizeof(int),
@@ -1467,7 +1465,6 @@ static struct ctl_table vs_vars[] = {
.proc_handler   = proc_do_defense_mode,
},
{
-   .ctl_name   = NET_IPV4_VS_SECURE_TCP,
.procname   = secure_tcp,
.data   = sysctl_ip_vs_secure_tcp,
.maxlen = sizeof(int),
@@ -1597,7 +1594,6 @@ static struct ctl_table vs_vars[] = {
.proc_handler   = proc_dointvec,
},
{
-   .ctl_name   = NET_IPV4_VS_SYNC_THRESHOLD,
.procname   = sync_threshold,
.data   = sysctl_ip_vs_sync_threshold,
.maxlen = sizeof(sysctl_ip_vs_sync_threshold),
Index: linux-2.6/kernel/sysctl_check.c
===
--- linux-2.6.orig/kernel/sysctl_check.c
+++ linux-2.6/kernel/sysctl_check.c
@@ -242,9 +242,6 @@ static struct trans_ctl_table trans_net_
{ NET_IPV4_VS_AMEMTHRESH,   amemthresh },
{ NET_IPV4_VS_DEBUG_LEVEL,  debug_level },
{ NET_IPV4_VS_AMDROPRATE,   am_droprate },
-   { NET_IPV4_VS_DROP_ENTRY,   drop_entry },
-   { NET_IPV4_VS_DROP_PACKET,  drop_packet },
-   { NET_IPV4_VS_SECURE_TCP,   secure_tcp },
{ NET_IPV4_VS_TO_ES,timeout_established },
{ NET_IPV4_VS_TO_SS,timeout_synsent },
{ NET_IPV4_VS_TO_SR,timeout_synrecv },
@@ -260,7 +257,6 @@ static struct trans_ctl_table trans_net_
{ NET_IPV4_VS_CACHE_BYPASS, cache_bypass },
{ NET_IPV4_VS_EXPIRE_NODEST_CONN,   expire_nodest_conn },
{ NET_IPV4_VS_EXPIRE_QUIESCENT_TEMPLATE,
expire_quiescent_template },
-   { NET_IPV4_VS_SYNC_THRESHOLD,   sync_threshold },
{ NET_IPV4_VS_NAT_ICMP_SEND,nat_icmp_send },
{ NET_IPV4_VS_LBLC_EXPIRE,  lblc_expiration },
{ NET_IPV4_VS_LBLCR_EXPIRE, lblcr_expiration },
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http

[PATCH] note that NETIF_F_LLTX is deprecated (was: [kvm-devel][PATCH 3/6] virtio net driver)

2007-09-21 Thread Christian Borntraeger
Am Freitag, 21. September 2007 schrieb Herbert Xu:
 Please don't use LLTX in new drivers.  We're trying to get rid
 of it since it's
 
 1) unnecessary;
 2) causes problems with AF_PACKET seeing things twice.

I suggest to document that LLTX is deprecated. 

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]

---
 Documentation/networking/netdevices.txt |3 ++-
 include/linux/netdevice.h   |3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/netdevice.h
===
--- linux-2.6.orig/include/linux/netdevice.h
+++ linux-2.6/include/linux/netdevice.h
@@ -339,7 +339,8 @@ struct net_device
 #define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
 #define NETIF_F_VLAN_CHALLENGED1024/* Device cannot handle VLAN 
packets */
 #define NETIF_F_GSO2048/* Enable software GSO. */
-#define NETIF_F_LLTX   4096/* LockLess TX */
+#define NETIF_F_LLTX   4096/* LockLess TX - deprecated. Please */
+   /* do not use LLTX in new drivers */
 #define NETIF_F_MULTI_QUEUE16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
Index: linux-2.6/Documentation/networking/netdevices.txt
===
--- linux-2.6.orig/Documentation/networking/netdevices.txt
+++ linux-2.6/Documentation/networking/netdevices.txt
@@ -73,7 +73,8 @@ dev-hard_start_xmit:
has to lock by itself when needed. It is recommended to use a try lock
for this and return NETDEV_TX_LOCKED when the spin lock fails.
The locking there should also properly protect against 
-   set_multicast_list.
+   set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated.
+   Dont use it for new drivers.
 
Context: Process with BHs disabled or BH (timer),
 will be called with interrupts disabled by netconsole.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Christian Borntraeger
Am Mittwoch, 11. Juli 2007 schrieb Pavel Emelianov:
  drivers/net/veth.c   |  452 
  include/net/veth.h   |   14 +

I know, I am late in the game, but wont the name collide somewhat with 
drivers/net/ibmveth.h, drivers/net/iseries_veth.c, and drivers/net/ibmveth.c?

Christian
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] ehea: interface to network stack

2006-08-09 Thread Christian Borntraeger
,
 +   cqe-wr_id)] = NULL;
 + dev_kfree_skb(skb);
 + }
 + }
 + cqe = ehea_poll_rq1(qp, wqe_index);
 + }
 +
 + dev-quota -= processed;
 + *budget -= processed;
 +
 + port_res-p_state.ehea_poll += 1;
 +
 + port_res-rx_packets += processed;
 +
 + ehea_refill_rq1(port_res, last_wqe_index, processed_RQ1);
 + ehea_refill_rq2(port_res, processed_RQ2);
 + ehea_refill_rq3(port_res, processed_RQ3);
 +
 + intreq = ((port_res-p_state.ehea_poll  0xF) == 0xF);
 +
 + EDEB_EX(7, processed=%d, *budget=%d, dev-quota=%d,
 + processed, *budget, dev-quota);
 +
 + if (!cqe || intreq) {
 + netif_rx_complete(dev);
 + ehea_reset_cq_ep(port_res-recv_cq);
 + ehea_reset_cq_n1(port_res-recv_cq);
 + cqe = ipz_qeit_get_valid(qp-ipz_rqueue1);
 + EDEB(7, CQE=%p, break ehea_poll while loop, cqe);
 + if (!cqe || intreq)
 + return 0;
 + if (!netif_rx_reschedule(dev, my_quota))
 + return 0;
 + }
 + return 1;
 +}

The poll function seems too long and therefore hard to review. Please consider 
splitting it. 


-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Christian Borntraeger
Hello Evgeniy,

 +asmlinkage long sys_netchannel_control(void __user *arg)
[...]
 + if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
 + return -ERESTARTSYS;
^^^
[...]
 + if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
 + return -ERESTARTSYS;
^^^

I think this should be -EFAULT instead of -ERESTARTSYS, right?

-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Christian Borntraeger
On Tuesday 18 July 2006 13:51, Evgeniy Polyakov wrote:
  I think this should be -EFAULT instead of -ERESTARTSYS, right?

 I have no strong feeling on what must be returned in that case.
 As far as I see, copy*user can fail due to absence of the next
 destination page, so -ERESTARTSYS makes sence, but if failure happens due
 to process size limitation, -EFAULT is correct.

If I am not completely mistaken ERESTARTSYS is wrong. 
include/linux/errno.h says userspace should never see ERESTARTSYS, therefore 
we should only return it if we were interrupted by a signal as do_signal 
takes care of ERESTARTSYS. Furthermore, copy*user transparently faults in 
necessary pages as long as the address is valid in the user context. 

 Let's change it to -EFAULT.

Thanks :-)


-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] changing value of NETDEV_ALIGN to cacheline size

2006-05-16 Thread Christian Borntraeger
On Tuesday 16 May 2006 00:49, David S. Miller wrote:
 From: Rick Jones [EMAIL PROTECTED]
 Date: Mon, 15 May 2006 14:39:23 -0700

  How about:

 How about, just leave it alone? :-)

Agreed. Currently it only makes a difference with slab debugging, which hurts 
performance no matter what we do.



-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] changing value of NETDEV_ALIGN to cacheline size

2006-05-15 Thread Christian Borntraeger
while digging through the alloc_netdev function I asked myself why there is a
fixed alignment for netdevices. Is there a special reason for choosing 32? If
yes, I suggest to add a comment to the definition. 

If not, I suspect cacheline alignment might be beneficial:
struct net_device contains several variables which are cache aligned 
(poll_list, queue_lock .). As far as I can see, the compiler tries to 
increase the size of the structure to make that possible, but expects the
whole structure to be aligned to cache line size as well. With the current
value for NETDEV_ALIGN we dont align struct net_device to the cache line,
instead we align it to 32 bytes. That means that poll_list, queue_lock and 
friends are not always cache aligned, but 32 bytes aligned.

The only reason why everything worked so far is the slab allocator design, 
which gives us a page aligned struct net_device in most cases. I think we 
should not rely on the behaviour of the memory allocator and use a different 
value for NETDEV_ALIGN instead. Any comments or corrections?

cheers Christian



The patch below is compile and boot tested on s390 and x86.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]

include/linux/netdevice.h |2 +-
net/core/dev.c|2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--

--- a/include/linux/netdevice.h 4 Apr 2006 07:25:47 -
+++ b/include/linux/netdevice.h 15 May 2006 11:06:05 -
@@ -504,7 +504,7 @@
struct class_device class_dev;
 };
 
-#defineNETDEV_ALIGN32
+#defineNETDEV_ALIGNL1_CACHE_BYTES
 #defineNETDEV_ALIGN_CONST  (NETDEV_ALIGN - 1)
 
 static inline void *netdev_priv(struct net_device *dev)
--- a/net/core/dev.c4 Apr 2006 07:25:50 -
+++ b/net/core/dev.c15 May 2006 11:06:05 -
@@ -2986,7 +2986,7 @@
struct net_device *dev;
int alloc_size;
 
-   /* ensure 32-byte alignment of both the device and private area */
+   /* ensure cacheline alignment of both the device and private area */
alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST)  ~NETDEV_ALIGN_CONST;
alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] ipv4: initialize arp_tbl rw lock

2006-04-19 Thread Christian Borntraeger
As spinlock debugging still does not work with the qeth driver I want to pick 
up the discussion.

On Saturday 08 April 2006 12:12, Andrew Morton wrote:
 Heiko Carstens [EMAIL PROTECTED] wrote:
[...]
   -vmlinux-main := $(core-y) $(libs-y) $(drivers-y) $(net-y)
   +vmlinux-main := $(core-y) $(libs-y) $(net-y) $(drivers-y)

 wonders what this will break

What about putting this patch into mm and find out?

 I have a bad feeling that one day we're going to come unstuck with this
 practice.  Is there anything which dictates that the linker has to lay
 sections out in .o-file-order?

 Perhaps net initcalls should be using something higher priority than
 device_initcall().

I agree that the initcall order offers a lot of room for improvement (like 
dependencies). Is anybody aware of any work into this direction?

-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html