Re: [PATCH] mlx5: avoid unused variable warning

2016-05-18 Thread Amir Vadai"
On Wed, May 18, 2016 at 04:21:07PM +0200, Arnd Bergmann wrote:
> When CONFIG_NET_CLS_ACT is disabled, we get a new warning in the mlx5
> ethernet driver because the tc_for_each_action() loop never references
> the iterator:
> 
> mellanox/mlx5/core/en_tc.c: In function 'mlx5e_stats_flower':
> mellanox/mlx5/core/en_tc.c:431:20: error: unused variable 'a' 
> [-Werror=unused-variable]
>   struct tc_action *a;
> 
> This changes the dummy tc_for_each_action() macro by adding a
> cast to void, letting the compiler know that the variable is
> intentionally declared but not used here. I could not come up
> with a nicer workaround, but this seems to do the trick.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: aad7e08d39bd ("net/mlx5e: Hardware offloaded flower filter statistics 
> support")
> Fixes: 00175aec941e ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
> ---
Acked-By: Amir Vadai 

Thanks Arnd.


Re: [RFC v1 07/11] net/mlx4: Cache irq_desc->affinity instead of irq_desc

2015-05-07 Thread Amir Vadai
On 5/5/2015 5:53 PM, Thomas Gleixner wrote:
> On Tue, 5 May 2015, Amir Vadai wrote:
> 
>> On Mon, May 4, 2015 at 6:10 PM, Thomas Gleixner  wrote:
>>> On Mon, 4 May 2015, Amir Vadai wrote:
>>>
>>>> On Mon, May 4, 2015 at 6:15 AM, Jiang Liu  
>>>> wrote:
>>>>> The field 'affinity' in irq_desc won't change once the irq_desc data
>>>>> structure is created. So cache irq_desc->affinity instead of irq_desc.
>>>>> This also helps to hide struct irq_desc from device drivers.
>>>>
>>>> Hi Jiang,
>>>>
>>>> I might not understand the new changes irq core, but up until now
>>>> affinity was changed when the user changed it through
>>>> /proc/irq//smp_affinity.
>>>> This code is monitoring the affinity from the napi_poll context to
>>>> detect affinity changes, and prevent napi from keep running on the
>>>> wrong CPU.
>>>> Therefore, the affinity can't be cached at the beginning. Please
>>>> revert this caching.
>>>
>>> Sigh. All of this is completely wrong. Polling in the internals of
>>> irq_desc is just crap.
>> Hi,
>>
>>>
>>> irq_set_affinity_notifier() has been added for exactly this purpose.
>> rmap is already registered on irq_set_affinity_notifier().
>> Actually, I did post an extension to the affinity notifier, to use
>> affinity chain which you rejected [1] (and BTW, convinced me that this
>> is not the way to do it).
>> Current code is according to your suggestion.
>>
>> [1] - https://lkml.org/lkml/2014/5/26/222
> 
> No, it's not. I wrote:
> 
>  "So what's the point of adding notifier call chain complexity,
>   ordering problems etc., if you can simply note the fact that the
>   affinity changed in the rmap itself and check that in the RX path?"
> 
> I did not tell you to fiddle in irqdesc, really.
> 
> Thanks,
> 
>   tglx
> 

It is problematic, since I didn't want to link the napi polling affinity
issue and CONFIG_CPU_RMAP being selected. That's why I CC'd you when
submitted the patch [1]. Guess it was my bad that I didn't do it in a
better way that would give you a chance to fully review it.

Anyway, I understand now that I shouldn't access irqdesc attributes from
a device driver.
I will see if I can live with disabling this whole feature (migrate a
currently looping NAPI when affinity is changed), when CONFIG_CPU_RMAP
is not selected. If not, I might modify the rmap code to give me this
information even when it is not selected. I assume that rmap notifier
callback can access this information.

Thanks,
Amir

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v1 07/11] net/mlx4: Cache irq_desc->affinity instead of irq_desc

2015-05-05 Thread Amir Vadai
On Mon, May 4, 2015 at 6:10 PM, Thomas Gleixner  wrote:
> On Mon, 4 May 2015, Amir Vadai wrote:
>
>> On Mon, May 4, 2015 at 6:15 AM, Jiang Liu  wrote:
>> > The field 'affinity' in irq_desc won't change once the irq_desc data
>> > structure is created. So cache irq_desc->affinity instead of irq_desc.
>> > This also helps to hide struct irq_desc from device drivers.
>>
>> Hi Jiang,
>>
>> I might not understand the new changes irq core, but up until now
>> affinity was changed when the user changed it through
>> /proc/irq//smp_affinity.
>> This code is monitoring the affinity from the napi_poll context to
>> detect affinity changes, and prevent napi from keep running on the
>> wrong CPU.
>> Therefore, the affinity can't be cached at the beginning. Please
>> revert this caching.
>
> Sigh. All of this is completely wrong. Polling in the internals of
> irq_desc is just crap.
Hi,

>
> irq_set_affinity_notifier() has been added for exactly this purpose.
rmap is already registered on irq_set_affinity_notifier().
Actually, I did post an extension to the affinity notifier, to use
affinity chain which you rejected [1] (and BTW, convinced me that this
is not the way to do it).
Current code is according to your suggestion.

[1] - https://lkml.org/lkml/2014/5/26/222

Amir

>
> Please get rid of your irq_desc abuse.
>
> Thanks,
>
> tglx
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v1 07/11] net/mlx4: Cache irq_desc->affinity instead of irq_desc

2015-05-05 Thread Amir Vadai
On Mon, May 4, 2015 at 5:00 PM, Jiang Liu  wrote:
> On 2015/5/4 20:10, Amir Vadai wrote:
>> On Mon, May 4, 2015 at 6:15 AM, Jiang Liu  wrote:
>>> The field 'affinity' in irq_desc won't change once the irq_desc data
>>> structure is created. So cache irq_desc->affinity instead of irq_desc.
>>> This also helps to hide struct irq_desc from device drivers.
>>
>> Hi Jiang,
>>
>> I might not understand the new changes irq core, but up until now
>> affinity was changed when the user changed it through
>> /proc/irq//smp_affinity.
>> This code is monitoring the affinity from the napi_poll context to
>> detect affinity changes, and prevent napi from keep running on the
>> wrong CPU.
>> Therefore, the affinity can't be cached at the beginning. Please
>> revert this caching.
> Hi Amir,
> Thanks for review:) We want to hide irq_desc implementation
> details from device drivers, so made these changes.
> Function irq_get_affinity_mask() returns 'struct cpumask *'
> and we cache the returned pointer. On the other hand, user may change
> IRQ affinity through /proc/irq//smp_affinity, but that only
> changes the bitmap pointed to by the cached pointer and won't change
> the pointer itself. So it should always return the latest affinity
> setting by calling cpumask_test_cpu(cpu_curr, cq->irq_affinity).
> Or am I missing something here?
No - you are completely right - I missed it.

Anyway I tested it to make sure, and everything works as expected.

Acked-By: Amir Vadai 

Thanks,
Amir


> Thanks!
> Gerry
>>
>> Thanks,
>> Amir
>>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v1 07/11] net/mlx4: Cache irq_desc->affinity instead of irq_desc

2015-05-04 Thread Amir Vadai
On Mon, May 4, 2015 at 6:15 AM, Jiang Liu  wrote:
> The field 'affinity' in irq_desc won't change once the irq_desc data
> structure is created. So cache irq_desc->affinity instead of irq_desc.
> This also helps to hide struct irq_desc from device drivers.

Hi Jiang,

I might not understand the new changes irq core, but up until now
affinity was changed when the user changed it through
/proc/irq//smp_affinity.
This code is monitoring the affinity from the napi_poll context to
detect affinity changes, and prevent napi from keep running on the
wrong CPU.
Therefore, the affinity can't be cached at the beginning. Please
revert this caching.

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mlx4_en: Use correct loop cursor in error path.

2015-04-29 Thread Amir Vadai
On Thu, Apr 30, 2015 at 1:59 AM, Benjamin Poirier  wrote:
> Signed-off-by: Benjamin Poirier 
> Fixes: 9e311e7 ("net/mlx4_en: Use affinity hint")
> ---

Good catch.

Acked-by: Amir Vadai 

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits

2015-01-08 Thread Amir Vadai
On 1/6/2015 7:36 PM, David Decotigny wrote:
> Interesting. It seems that the band-aid I was proposing is already
> obsolete. We could still use the remaining reserved 16 bits to encode
> 5 more bits per mask (that is: 53 bits / mask total). But if I
> understand you, it would allow us to survive only a few months longer,
> as opposed to a few weeks.
> 
> One short-term alternative solution I can imagine is the following:
> /* For example bitmap-based for variable length: */
> struct ethtool_link_mode {
>   __u32 cmd;
>   __u8 autoneg :1;
>   __u8 duplex :2;
>  __u16 supported_nbits;
>   __u16 advertising_nbits;
>   __u16 lp_advertising_nbits;
>   __u32 reserved[4];
>   __u8 masks[0];
> };
> /* Or simpler, statically limited to 64b / mask, but easier to migrate
> to for driver authors: */
I think the first options is better. A driver will have to do changes in
order to support >32 link modes, so better change it once now, without
having to change it again for >64 link modes.

> struct ethtool_link_mode {
>   __u32 cmd;
>   __u8 autoneg :1;
>   __u8 duplex :2;
>__u64 supported;
>   __u64 advertising;
>   __u64 lp_advertising;
>   __u32 reserved[4];
> };
> #define ETHTOOL_GLINK_MODE 0x004a
> #define ETHTOOL_SLINK_MODE 0x004b
> struct ethtool_ops {
> ...
>int (*get_link_mode)(struct net_device *, struct ethtool_link_mode *);
>int (*set_link_mode)(struct net_device *, struct ethtool_link_mode *);
> };
> 
> The same thing required for EEE.
Yeh :(

> 
> I am not sure about moving the autoneg and duplex fields into the new
> struct. Especially the "duplex" field.
I think so too. ethtool user space will call ETHTOOL_[GS]SET and after
that ETHTOOL_[GS]LINK_MODE (if supported). No need to get the
duplex/autoneg fields again.

> 
> Then the idea would be to update the ethtool user-space tool to try
> get/set_link mode when reporting/changing the autoneg/advertising
> settings.
> 
> Both will require significant effort from the driver authors.
> Especially if the variable-length bitmap approach is preferred:
>  - most drivers currently use simple bitwise arithmetic in their code,
> and that goes far beyond get/set_settings, it is sometimes part of the
> core driver logic. They will have to migrate to the bitmap API if they
> want to use the larger bitmaps (note: no change needed if they are
> happy with <= 32b / mask)
As I said above, it will save as doing this work again in the future,
and more problematic, save another version to backport in the future. In
addition, not all drivers will have to do it, only if >32 link speeds is
needed - this work will be required.

>  - we would have to progressively deprecate the use of #define
> ADVERTISED_1000baseT_Full in favor of an enum of the bit indices.
Maybe we could use some macro juggling to define the legacy macro's
using enum for the first 32 bits, and fail the compilation if used on
>32. For example, calling this:
DEFINE_LINK_MODE(ADVERTISED_1000baseT_Full, 5)

Will add the following:
ADVERTISED_1000baseT_Full_SHIFT = 5
ADVERTISED_1000baseT_Full = (1<<5)

DEFINE_LINK_MODE(ADVERTISED_10baseKR5_Full, 50) will add:
ADVERTISED_10baseKR5_Full_SHIFT = 50
ADVERTISED_10baseKR5_Full = #error new link speeds must be defined
using [gs]et_link_speed

This will break compilation if ADVERTISED_10baseKR5_Full is used in
[gs]et_settings (I know the '#error' will not print something very
pretty - I used it only to explain what I meant)

> 
> Any feedback welcome. In the meantime, I am going to propose a v3 of
> current option with 53 bits / mask. I can also propose a prototype of
> the scheme described above, please let me know.
I think that it is better to do it once, and skip the 53 bits / mask
version.
I'll be happy to assist.

Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits

2015-01-06 Thread Amir Vadai
On 1/6/2015 4:54 AM, David Decotigny wrote:
> This patch series increases the width of the supported/advertising
> ethtool masks from 32 bits to 48. This should allow to breathe for a
> couple more years (or... months?).
Hi,

Nice work.
What worries me is that it is going to be for weeks more than years or
months...

Mellanox is about to release next month a driver for a new NIC, with 3
new speeds * few link modes for each + new link modes for 10G.
It seems that we will need to consume almost all the new bits.

Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mlx4: don't duplicate kvfree()

2014-11-20 Thread Amir Vadai
On Thu, Nov 20, 2014 at 10:15 AM, Al Viro  wrote:
> Signed-off-by: Al Viro 
> ---
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c 
> b/drivers/net/ethernet/mellanox/mlx4/mr.c
> index 193a6ad..d6f5496 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mr.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
> @@ -130,10 +130,7 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int 
> max_order)
>
>  err_out_free:
> for (i = 0; i <= buddy->max_order; ++i)
> -   if (buddy->bits[i] && is_vmalloc_addr(buddy->bits[i]))
> -   vfree(buddy->bits[i]);
> -   else
> -   kfree(buddy->bits[i]);
> +   kvfree(buddy->bits[i]);
>
>  err_out:
> kfree(buddy->bits);
> @@ -147,10 +144,7 @@ static void mlx4_buddy_cleanup(struct mlx4_buddy *buddy)
> int i;
>
> for (i = 0; i <= buddy->max_order; ++i)
> -   if (is_vmalloc_addr(buddy->bits[i]))
> -   vfree(buddy->bits[i]);
> -   else
> -   kfree(buddy->bits[i]);
> +   kvfree(buddy->bits[i]);
>
> kfree(buddy->bits);
> kfree(buddy->num_free);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Thanks Al

Acked-by: Amir Vadai 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] mellanox: Change en_print to return void

2014-09-22 Thread Amir Vadai
On 9/22/2014 8:40 PM, Joe Perches wrote:
> No caller or macro uses the return value so make it void.
> 
> Signed-off-by: Joe Perches 
> ---
> This change is associated to a desire to eventually
> change printk to return void.
> 
>  drivers/net/ethernet/mellanox/mlx4/en_main.c | 17 +++--
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 ++--
>  2 files changed, 9 insertions(+), 12 deletions(-)

Thanks Joe.

Acked-By: Amir Vadai 



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next V6 2/2] net/mlx4_en: Use affinity hint

2014-06-02 Thread Amir Vadai

On 6/2/2014 8:13 AM, Eric Dumazet wrote:

On Sun, 2014-06-01 at 21:56 -0700, David Miller wrote:


Indeed you have to provide a dummy version for a non-SMP build etc.

I'm reverting.



Hi David. I think your revert took one wrong commit.


# git show ee39facbf82e73e468c504d2b40e83e2d223c28c | diffstat -p1 -w70
  drivers/net/ethernet/micrel/ks8851.c |   50 ++-
  include/linux/cpumask.h  |2
  lib/cpumask.c|   64 -
  3 files changed, 28 insertions(+), 88 deletions(-)





Hi,

Yeh, Eric is right and it seems that 2a82e40 "net: ks8851: Don't use 
regulator_get_optional()" was reverted by mistake instead of 70a640d: 
"net/mlx4_en: Use affinity hint"


I'm working on a fixed version of the affinity patches - this time I 
will double check the CONFIG_SMP/CONFIG_CPUMASK_OFFSTACK combinations.


I'm preparing a public git with Mellanox updates, so that Mellanox 
drivers patches will pass 0-DAY kernel build testing, before landing in 
net-next.


Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Extend irq_set_affinity_notifier() to use a call chain

2014-05-27 Thread Amir Vadai

On 5/26/2014 3:39 PM, Thomas Gleixner wrote:

[...]



The rmap _IS_ instantiated by the driver, and both the driver and the
networking core know about it.

So it's not completely different consumers. Just because it's a
library does not mean it's disjunct from the code which uses it.

Aside of the fact, that maintaining a per irq notifier chain is going
to be ugly as hell due to life time and locking issues, it's just
opening a can of worms. How do you make sure that the invocation order
is correct? What are the dependency rules of the driver restarting the
napi session versus updating the rmap?

Even if you'd solve that and have a callback in the driver, then the
callback never can restart the napi session directly. All it can do is
set a flag which needs to be checked in the RX path, right?

So what's the point of adding notifier call chain complexity, ordering
problems etc., if you can simply note the fact that the affinity
changed in the rmap itself and check that in the RX path?


I will try to find a solution in the spirit of what you suggested - to 
let the rmap library notify napi about affinity changes - without adding 
this complexity to the code.


Thanks,
Amir



Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Extend irq_set_affinity_notifier() to use a call chain

2014-05-26 Thread Amir Vadai

On 5/26/2014 2:34 PM, Thomas Gleixner wrote:

On Mon, 26 May 2014, Amir Vadai wrote:


On 5/26/2014 2:15 PM, Thomas Gleixner wrote:

On Sun, 25 May 2014, Amir Vadai wrote:

In order to do that, I need to add a new irq affinity notification
callback (In addition to the existing cpu_rmap notification). For
that I would like to extend irq_set_affinity_notifier() to have a
notifier call-chain instead of a single notifier callback.


Why? "I would like" is a non argument.


Current implementation enables only one callback to be registered for irq
affinity change notifications.


I'm well aware of that.


cpu_rmap is registered be notified - for RFS purposes.  mlx4_en (and
probably other network drivers) needs to be notified too, in order
to stop the napi polling on the old cpu and move to the new one.  To
enable more than 1 notification callbacks, I suggest to use a
notifier call chain.


You are not describing what needs to be notified and why. Please
explain the details of that and how the RFS (whatever that is) and the
network driver are connected

The goal of RFS is to increase datacache hitrate by steering
kernel processing of packets in multi-queue devices to the CPU where the 
application thread consuming the packet is running.


In order to select the right queue, the networking stack needs to have a 
reverse map of IRQ affinty. This is the rmap that was added by Ben 
Hutchings [1]. To keep the rmap updated, cpu_rmap registers on the 
affinity notify.


This is the first affinity callback - it is located as a general library 
and not under net/...


The motivation to the second irq affinity callback is:
When traffic starts, first packet fires an interrupt which starts the 
napi polling on the cpu according the irq affinity.
If there is always packets to be consumed by the napi polling, no 
further interrupts will be fired, and napi will consume all the packets 
from the cpu it was started.
If the user changes the irq affinity, napi polling will continue to be 
done from the original cpu.
Only when the traffic will pause, napi session will be finished, and 
when traffic will resume, the new napi session will be done from the new 
cpu.
This is a problematic behavior, because from the user point of view, cpu 
affinity can't be changed in a non-stop traffic scenario.


To solve this, the network driver should be notified on irq affinity 
change event, and restart the napi session. This could be done by 
closing the napi session and arming the interrupts. Next packet arrives 
will trigger an interrupt and napi will session will start, this time on 
the new CPU.


> and why this notification cannot be
> propagated inside the network stack itself.

To my understanding, those are two different consumers to the same 
event, one is a general library to maintain a reverse irq affinity map, 
and the other is networking specific, and maybe even a networking driver 
specific.


[1] - c39649c lib: cpu_rmap: CPU affinity reverse-mapping

Thanks,
Amir



notifier chains are almost always a clear sign for a design disaster
and I'm not going to even think about it before I do not have a
concice explanation of the problem at hand and why a notifier chain is
a good solution.

Thanks,

tglx




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Extend irq_set_affinity_notifier() to use a call chain

2014-05-26 Thread Amir Vadai

On 5/26/2014 2:15 PM, Thomas Gleixner wrote:

On Sun, 25 May 2014, Amir Vadai wrote:

In order to do that, I need to add a new irq affinity notification
callback (In addition to the existing cpu_rmap notification). For
that I would like to extend irq_set_affinity_notifier() to have a
notifier call-chain instead of a single notifier callback.


Why? "I would like" is a non argument.
Current implementation enables only one callback to be registered for 
irq affinity change notifications.


cpu_rmap is registered be notified - for RFS purposes.
mlx4_en (and probably other network drivers) needs to be notified too, 
in order to stop the napi polling on the old cpu and move to the new one.
To enable more than 1 notification callbacks, I suggest to use a 
notifier call chain.


Amir



Thanks,

tglx



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next V6 2/2] net/mlx4_en: Use affinity hint

2014-05-25 Thread Amir Vadai
From: Yuval Atias 

The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it.  To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores.  If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias 
Signed-off-by: Amir Vadai 
---
 drivers/infiniband/hw/mlx4/main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |  6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 30 ++
 drivers/net/ethernet/mellanox/mlx4/eq.c| 13 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 include/linux/mlx4/device.h|  2 +-
 6 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index a9638ae..8c88960 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1837,7 +1837,7 @@ static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, 
struct mlx4_ib_dev *ibdev)
 i, j, dev->pdev->bus->name);
/* Set IRQ for specific name (per ring) */
if (mlx4_assign_eq(dev, name, NULL,
-  &ibdev->eq_table[eq])) {
+  &ibdev->eq_table[eq], NULL)) {
/* Use legacy (same as mlx4_en driver) */
pr_warn("Can't allocate EQ %d; reverting to 
legacy\n", eq);
ibdev->eq_table[eq] =
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c 
b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 636963d..ea2cd72 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -118,11 +118,15 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct 
mlx4_en_cq *cq,
if (cq->is_tx == RX) {
if (mdev->dev->caps.comp_pool) {
if (!cq->vector) {
+   struct mlx4_en_rx_ring *ring =
+   priv->rx_ring[cq->ring];
+
sprintf(name, "%s-%d", priv->dev->name,
cq->ring);
/* Set IRQ for specific name (per ring) */
if (mlx4_assign_eq(mdev->dev, name, rmap,
-  &cq->vector)) {
+  &cq->vector,
+  ring->affinity_mask)) {
cq->vector = (cq->ring + 1 + priv->port)
% mdev->dev->caps.num_comp_vectors;
mlx4_warn(mdev, "Failed assigning an EQ 
to %s, falling back to legacy EQ's\n",
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 5bb7eda..826d150 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1531,6 +1531,32 @@ static void mlx4_en_linkstate(struct work_struct *work)
mutex_unlock(&mdev->state_lock);
 }
 
+static void mlx4_en_init_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
+{
+   struct mlx4_en_rx_ring *ring = priv->rx_ring[ring_idx];
+   int numa_node = priv->mdev->dev->numa_node;
+
+   if (numa_node == -1)
+   return;
+
+   if (!zalloc_cpumask_var(&ring->affinity_mask, GFP_KERNEL)) {
+   en_err(priv, "Failed to allocate core mask\n");
+   return;
+   }
+
+   if (cpumask_set_cpu_local_first(ring_idx, numa_node,
+   ring->affinity_mask)) {
+   en_err(priv, "Failed setting affinity hint\n");
+   free_cpumask_var(ring->affinity_mask);
+   ring->affinity_mask = NULL;
+   }
+}
+
+static void mlx4_en_free_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
+{
+   free_cpumask_var(priv->rx_ring[ring_idx]->affinity_mask);
+   priv->rx_ring[ring_idx]->affinity_mask = NULL;
+}
 
 int mlx4_en_start_port(struct net_device *dev)
 {
@@ -1572,6 +1598,8 @@ int mlx4_en_start_port(struct net_device *dev)
 
mlx4_en_cq_init_lock(cq);
 
+   mlx4_en_init_affinity_hint(priv, i);
+
err = mlx4_en_acti

[PATCH net-next V6 0/2] cpumask,net: Affinity hint helper function

2014-05-25 Thread Amir Vadai
Hi,

This patchset will set affinity hint to influence IRQs to be allocated on the
same NUMA node as the one where the card resides. As discussed in
http://www.spinics.net/lists/netdev/msg271497.html

If the number of IRQs allocated is greater than the number of local NUMA cores, 
all
local cores will be used first, and the rest of the IRQs will be on a remote
NUMA node.
If no NUMA support - IRQ's and cores will be mapped 1:1

Since the utility function to calculate the mapping could be useful in other mq
drivers in the kernel, it was added to cpumask.[ch]

This patchset was tested and applied on top of net-next since the first
consumer is a network device (mlx4_en).  Over commit 506724c: "tg3: Override
clock, link aware and link idle mode during NVRAM dump"

I couldn't find a maintainer for cpumask.c, so only added the kernel mailing
list

Amir

Changes from V5:
- Moved the utility function from kernel/irq/manage.c to lib/cpumask.c, and
  renamed it's name accordingly to cpumask_set_cpu_local_first()
- Added some comments as Thomas Gleixner suggested
- Changed -EINVAL to -EAGAIN, that describes the error situtation better.

Changes from V4:
- Patch 1/2: irq: Utility function to get affinity_hint by policy
  Thank you Ben for the great review:
  - Moved the function it kernel/irq/manage.c since it could be useful for
block mq devices
  - Fixed Typo's
  - Use cpumask_t * instead of cpumask_var_t in function header  
  - Restructured the function to remove NULL assignment in a cpumask_var_t
  - Fix for offline local CPU's

Changes from V3:
- Patch 2/2: net/mlx4_en: Use affinity hint
  - somehow patch file was corrupted

Changes from V2:
- Patch 1/2: net: Utility function to get affinity_hint by policy
  - Fixed style issues

Changes from V1:
- Patch 1/2: net: Utility function to get affinity_hint by policy
  - Fixed error flow to return -EINVAL on error (thanks govind)
- Patch 2/2: net/mlx4_en: Use affinity hint
  - Set ring->affinity_hint to NULL on error

Changes from V0:
- Fixed small style issues


Amir Vadai (1):
  cpumask: Utility function to set n'th cpu - local cpu first

Yuval Atias (1):
  net/mlx4_en: Use affinity hint

 drivers/infiniband/hw/mlx4/main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |  6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 30 
 drivers/net/ethernet/mellanox/mlx4/eq.c| 13 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 include/linux/cpumask.h|  2 +
 include/linux/mlx4/device.h|  2 +-
 lib/cpumask.c  | 64 ++
 8 files changed, 116 insertions(+), 4 deletions(-)

-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next V6 1/2] cpumask: Utility function to set n'th cpu - local cpu first

2014-05-25 Thread Amir Vadai
This function sets the n'th cpu - local cpu's first.
For example: in a 16 cores server with even cpu's local, will get the
following values:
cpumask_set_cpu_local_first(0, numa, cpumask) => cpu 0 is set
cpumask_set_cpu_local_first(1, numa, cpumask) => cpu 2 is set
...
cpumask_set_cpu_local_first(7, numa, cpumask) => cpu 14 is set
cpumask_set_cpu_local_first(8, numa, cpumask) => cpu 1 is set
cpumask_set_cpu_local_first(9, numa, cpumask) => cpu 3 is set
...
cpumask_set_cpu_local_first(15, numa, cpumask) => cpu 15 is set

Curently this function will be used by multi queue networking devices to
calculate the irq affinity mask, such that as many local cpu's as
possible will be utilized to handle the mq device irq's.

Signed-off-by: Amir Vadai 
---
 include/linux/cpumask.h |  2 ++
 lib/cpumask.c   | 64 +
 2 files changed, 66 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d08e4d2..3551d66 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -257,6 +257,8 @@ static inline void cpumask_set_cpu(unsigned int cpu, struct 
cpumask *dstp)
set_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
 
+int cpumask_set_cpu_local_first(int i, int numa_node, cpumask_t *dstp);
+
 /**
  * cpumask_clear_cpu - clear a cpu in a cpumask
  * @cpu: cpu number (< nr_cpu_ids)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index b810b75..14049a9 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -163,4 +163,68 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
memblock_free_early(__pa(mask), cpumask_size());
 }
+
+/**
+ * cpumask_set_cpu_local_first - set i'th cpu with local numa cpu's first
+ *
+ * @i: index number
+ * @numa_node: local numa_node
+ * @dstp: cpumask with the relevant cpu bit set according to the policy
+ *
+ * This function sets the cpumask according to a numa aware policy.
+ * cpumask could be used as an affinity hint for the IRQ related to a
+ * queue. When the policy is to spread queues across cores - local cores
+ * first.
+ *
+ * Returns 0 on success, -ENOMEM for no memory, and -EAGAIN when failed to set
+ * the cpu bit and need to re-call the function.
+ */
+int cpumask_set_cpu_local_first(int i, int numa_node, cpumask_t *dstp)
+{
+   cpumask_var_t mask;
+   int cpu;
+   int ret = 0;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   i %= num_online_cpus();
+
+   if (!cpumask_of_node(numa_node)) {
+   /* Use all online cpu's for non numa aware system */
+   cpumask_copy(mask, cpu_online_mask);
+   } else {
+   int n;
+
+   cpumask_and(mask,
+   cpumask_of_node(numa_node), cpu_online_mask);
+
+   n = cpumask_weight(mask);
+   if (i >= n) {
+   i -= n;
+
+   /* If index > number of local cpu's, mask out local
+* cpu's
+*/
+   cpumask_andnot(mask, cpu_online_mask, mask);
+   }
+   }
+
+   for_each_cpu(cpu, mask) {
+   if (--i < 0)
+   goto out;
+   }
+
+   ret = -EAGAIN;
+
+out:
+   free_cpumask_var(mask);
+
+   if (!ret)
+   cpumask_set_cpu(cpu, dstp);
+
+   return ret;
+}
+EXPORT_SYMBOL(cpumask_set_cpu_local_first);
+
 #endif
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Extend irq_set_affinity_notifier() to use a call chain

2014-05-25 Thread Amir Vadai

On 5/25/2014 3:15 PM, Amir Vadai wrote:

Hi,

I'm working for Mellanox on mlx4_en NIC driver.

We need to be able to be notified on irq affinity changes.
This is because, during non-stop full bandwidth traffic, napi will poll
constantly and no interrupt will be fired. Because of that, even if the
user changes the irq affinity, polling will continue to be done on the
original CPU that was chosen on the first packet.
We would like to be notified when the affinity is changed. When such an
event happen, the driver will arm the interrupts and end the napi
session. An interrupt will start a new napi session on the right CPU.

In order to do that, I need to add a new irq affinity notification
callback (In addition to the existing cpu_rmap notification). For that I
would like to extend irq_set_affinity_notifier() to have a notifier
call-chain instead of a single notifier callback.

I wanted to hear your opinion on this, and unless there is a better
solution, will send an RFC later on.

References:
- http://patchwork.ozlabs.org/patch/65244/ - Review done by Thomas
Glexiber to Ben Hutchings first version of the irq affinity notifiers.
- http://patchwork.ozlabs.org/patch/79593/ - Final version of
irq_set_affinity_notifier() that was applied

Thanks,
Amir


Didn't mention that a patch to set irq affinity notifier was already 
added to mlx4_en [1], and now aRFS is broken because cpu_rmap callback 
is dropped when mlx4_en callback is set.


[1] - http://patchwork.ozlabs.org/patch/348669/

Amir

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Extend irq_set_affinity_notifier() to use a call chain

2014-05-25 Thread Amir Vadai

Hi,

I'm working for Mellanox on mlx4_en NIC driver.

We need to be able to be notified on irq affinity changes.
This is because, during non-stop full bandwidth traffic, napi will poll 
constantly and no interrupt will be fired. Because of that, even if the 
user changes the irq affinity, polling will continue to be done on the 
original CPU that was chosen on the first packet.
We would like to be notified when the affinity is changed. When such an 
event happen, the driver will arm the interrupts and end the napi 
session. An interrupt will start a new napi session on the right CPU.


In order to do that, I need to add a new irq affinity notification 
callback (In addition to the existing cpu_rmap notification). For that I 
would like to extend irq_set_affinity_notifier() to have a notifier 
call-chain instead of a single notifier callback.


I wanted to hear your opinion on this, and unless there is a better 
solution, will send an RFC later on.


References:
- http://patchwork.ozlabs.org/patch/65244/ - Review done by Thomas 
Glexiber to Ben Hutchings first version of the irq affinity notifiers.
- http://patchwork.ozlabs.org/patch/79593/ - Final version of 
irq_set_affinity_notifier() that was applied


Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next] genirq: Eliminate explicit dependency between IRQ affinity notifiers and CONFIG_SMP

2014-05-20 Thread Amir Vadai
From: Eyal Perry 

Instead of requiring each consumer of the IRQ affinity notifier to have
themselves be explicitly dependent on CONFIG_SMP, make the definition of
struct irq_affinity_notify to exist independently of that config option
and introduce a stub for irq_set_affinity_notifier() under non SMP
configuration.

Fixes: 2eacc23 ("net/mlx4_core: Enforce irq affinity changes
immediatly")

Signed-off-by: Eyal Perry 
Signed-off-by: Amir Vadai 
---
 include/linux/interrupt.h | 46 ++
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 97ac926..3f74c059 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -199,6 +199,26 @@ extern int check_wakeup_irqs(void);
 static inline int check_wakeup_irqs(void) { return 0; }
 #endif
 
+/**
+ * struct irq_affinity_notify - context for notification of IRQ affinity 
changes
+ * @irq:   Interrupt to which notification applies
+ * @kref:  Reference count, for internal use
+ * @work:  Work item, for internal use
+ * @notify:Function to be called on change.  This will be
+ * called in process context.
+ * @release:   Function to be called on release.  This will be
+ * called in process context.  Once registered, the
+ * structure must only be freed when this function is
+ * called or later.
+ */
+struct irq_affinity_notify {
+   unsigned int irq;
+   struct kref kref;
+   struct work_struct work;
+   void (*notify)(struct irq_affinity_notify *, const cpumask_t *mask);
+   void (*release)(struct kref *ref);
+};
+
 #if defined(CONFIG_SMP)
 
 extern cpumask_var_t irq_default_affinity;
@@ -242,26 +262,6 @@ extern int irq_select_affinity(unsigned int irq);
 
 extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 
-/**
- * struct irq_affinity_notify - context for notification of IRQ affinity 
changes
- * @irq:   Interrupt to which notification applies
- * @kref:  Reference count, for internal use
- * @work:  Work item, for internal use
- * @notify:Function to be called on change.  This will be
- * called in process context.
- * @release:   Function to be called on release.  This will be
- * called in process context.  Once registered, the
- * structure must only be freed when this function is
- * called or later.
- */
-struct irq_affinity_notify {
-   unsigned int irq;
-   struct kref kref;
-   struct work_struct work;
-   void (*notify)(struct irq_affinity_notify *, const cpumask_t *mask);
-   void (*release)(struct kref *ref);
-};
-
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify 
*notify);
 
@@ -284,6 +284,12 @@ static inline int irq_set_affinity_hint(unsigned int irq,
 {
return -EINVAL;
 }
+
+static inline int
+irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
+{
+   return 0;
+}
 #endif /* CONFIG_SMP */
 
 /*
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net/mlx4_en: Fix uninitialized use of 'port_up' in mlx4_en_set_channels()

2014-05-18 Thread Amir Vadai

On 5/18/2014 12:52 AM, Christian Engelmayer wrote:

Function mlx4_en_set_channels() stops running ports before performing the
requested action. In that case local variable 'port_up' is set so that the
port is restarted at the end of the function, however, in case the port was
not stopped, variable 'port_up' is left uninitialized and the behaviour is
undetermined. Detected by Coverity - CID 751497.

Signed-off-by: Christian Engelmayer 
---


Acked-By: Amir Vadai 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mellanox: Logging message cleanups

2014-05-05 Thread Amir Vadai

On 5/5/2014 6:35 AM, Joe Perches wrote:

Use a more current logging style.

o Coalesce formats
o Add missing spaces for coalesced formats
o Align arguments for modified formats
o Add missing newlines for some logging messages
o Use DRV_NAME as part of format instead of %s, DRV_NAME to
   reduce overall text.
o Use ..., ##__VA_ARGS__ instead of args... in macros
o Correct a few format typos
o Use a single line message where appropriate

Signed-off-by: Joe Perches 
---
  drivers/net/ethernet/mellanox/mlx4/cmd.c   |  85 ---
  drivers/net/ethernet/mellanox/mlx4/en_cq.c |   3 +-
  drivers/net/ethernet/mellanox/mlx4/en_ethtool.c|   6 +-
  drivers/net/ethernet/mellanox/mlx4/en_main.c   |   7 +-
  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   8 +-
  drivers/net/ethernet/mellanox/mlx4/en_rx.c |  17 +-
  drivers/net/ethernet/mellanox/mlx4/en_tx.c |   8 +-
  drivers/net/ethernet/mellanox/mlx4/eq.c|  85 +++
  drivers/net/ethernet/mellanox/mlx4/fw.c|  25 +--
  drivers/net/ethernet/mellanox/mlx4/main.c  | 246 +
  drivers/net/ethernet/mellanox/mlx4/mcg.c   |  16 +-
  drivers/net/ethernet/mellanox/mlx4/mlx4.h  |  17 +-
  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  40 ++--
  drivers/net/ethernet/mellanox/mlx4/mr.c|  14 +-
  drivers/net/ethernet/mellanox/mlx4/port.c  |  12 +-
  drivers/net/ethernet/mellanox/mlx4/profile.c   |  13 +-
  drivers/net/ethernet/mellanox/mlx4/qp.c|   7 +-
  drivers/net/ethernet/mellanox/mlx4/reset.c |  24 +-
  .../net/ethernet/mellanox/mlx4/resource_tracker.c  |  73 +++---
  drivers/net/ethernet/mellanox/mlx5/core/cmd.c  |   4 +-
  drivers/net/ethernet/mellanox/mlx5/core/eq.c   |   9 +-
  drivers/net/ethernet/mellanox/mlx5/core/main.c |  14 +-
  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|  30 +--
  drivers/net/ethernet/mellanox/mlx5/core/mr.c   |   8 +-
  .../net/ethernet/mellanox/mlx5/core/pagealloc.c|  15 +-
  drivers/net/ethernet/mellanox/mlx5/core/qp.c   |   4 +-
  26 files changed, 353 insertions(+), 437 deletions(-)



[...]



diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c 
b/drivers/net/ethernet/mellanox/mlx4/fw.c
index d16a4d1..0a87169 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -1,3 +1,4 @@
+

new line


  /*
   * Copyright (c) 2004, 2005 Topspin Communications.  All rights reserved.
   * Copyright (c) 2005, 2006, 2007, 2008 Mellanox Technologies. All rights 
reserved.
@@ -428,8 +429,7 @@ int mlx4_QUERY_FUNC_CAP(struct mlx4_dev *dev, u32 
gen_or_port,
} else if (dev->caps.port_type[gen_or_port] == MLX4_PORT_TYPE_IB) {
MLX4_GET(field, outbox, QUERY_FUNC_CAP_FLAGS0_OFFSET);
if (field & QUERY_FUNC_CAP_FLAGS0_FORCE_PHY_WQE_GID) {
-   mlx4_err(dev, "phy_wqe_gid is "
-"enforced on this ib port\n");
+   mlx4_err(dev, "phy_wqe_gid is enforced on this ib 
port\n");
err = -EPROTONOSUPPORT;
goto out;
    }


[...]

Acked-By: Amir Vadai 

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next V5 0/2] irq,net: Affinity hint helper function

2014-03-11 Thread Amir Vadai
Hi,

This patchset will set affinity hint to influence IRQs to be allocated on the
same NUMA node as the one where the card resides. As discussed in
http://www.spinics.net/lists/netdev/msg271497.html

If number of IRQs allocated is greater than the number of local NUMA cores, all
local cores will be used first, and the rest of the IRQs will be on a remote
NUMA node.
If no NUMA support - IRQ's and cores will be mapped 1:1

This patchset was tested and applied on top of net-next since the first
consumer is a network device (mlx4_en) - commit be14cc9: "vlan: use use
ether_addr_equal_64bits to instead of ether_addr_equal"

Amir

Changes from V4:
- Patch 1/2: irq: Utility function to get affinity_hint by policy
  Thank you Ben for the great review:
  - Moved the function it kernel/irq/manage.c since it could be useful for
block mq devices
  - Fixed Typo's
  - Use cpumask_t * instead of cpumask_var_t in function header  
  - Restructured the function to remove NULL assignment in a cpumask_var_t
  - Fix for offline local CPU's

Changes from V3:
- Patch 2/2: net/mlx4_en: Use affinity hint
  - somehow patch file was corrupted

Changes from V2:
- Patch 1/2: net: Utility function to get affinity_hint by policy
  - Fixed style issues

Changes from V1:
- Patch 1/2: net: Utility function to get affinity_hint by policy
  - Fixed error flow to return -EINVAL on error (thanks govind)
- Patch 2/2: net/mlx4_en: Use affinity hint
  - Set ring->affinity_hint to NULL on error

Changes from V0:
- Fixed small style issues


Amir Vadai (1):
  irq: Utility function to get affinity_hint by policy

Yuval Atias (1):
  net/mlx4_en: Use affinity hint

 drivers/infiniband/hw/mlx4/main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |  6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 30 +
 drivers/net/ethernet/mellanox/mlx4/eq.c| 14 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 include/linux/interrupt.h  |  9 
 include/linux/mlx4/device.h|  2 +-
 kernel/irq/manage.c| 58 ++
 8 files changed, 118 insertions(+), 4 deletions(-)

-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next V5 1/2] irq: Utility function to get affinity_hint by policy

2014-03-11 Thread Amir Vadai
This function sets the affinity_mask for a multi queue device according
to a numa aware policy. affinity_mask could be used as an affinity hint
for the IRQ related to this queue.
Current policy is to spread queues accross cores - local cores first.
It could be extended in the future.

CC: Prarit Bhargava 
CC: Govindarajulu Varadarajan 
Signed-off-by: Amir Vadai 
---
 include/linux/interrupt.h |  9 
 kernel/irq/manage.c   | 58 +++
 2 files changed, 67 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a2678d3..81baefb 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -208,6 +208,8 @@ extern int irq_select_affinity(unsigned int irq);
 
 extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 
+extern int irq_set_mq_dev_affinit_hint(int q, int numa_node,
+  cpumask_t *affinity_mask);
 /**
  * struct irq_affinity_notify - context for notification of IRQ affinity 
changes
  * @irq:   Interrupt to which notification applies
@@ -250,6 +252,13 @@ static inline int irq_set_affinity_hint(unsigned int irq,
 {
return -EINVAL;
 }
+
+static inline int irq_set_mq_dev_affinit_hint(int q, int numa_node,
+ cpumask_t *affinity_mask)
+{
+   return -EINVAL;
+}
+
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 481a13c..8d98e6e 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -221,6 +221,64 @@ int irq_set_affinity_hint(unsigned int irq, const struct 
cpumask *m)
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
 
+/**
+ * irq_set_mq_dev_affinit_hint - set affinity hint of a queue in multi queue
+ * device
+ * @q: queue index number
+ * @numa_node: prefered numa_node
+ * @affinity_mask: the relevant cpu bit is set according to the policy
+ *
+ * This function sets the affinity_mask according to a numa aware policy.
+ * affinity_mask could be used as an affinity hint for the IRQ related to this
+ * queue.
+ * The policy is to spread queues across cores - local cores first.
+ *
+ * Returns 0 on success, or a negative error code.
+ */
+int irq_set_mq_dev_affinit_hint(int q, int numa_node,
+   cpumask_t *affinity_mask)
+{
+   cpumask_var_t mask;
+   int affinity_cpu;
+   int ret = 0;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   q %= num_online_cpus();
+
+   if (!cpumask_of_node(numa_node)) {
+   cpumask_copy(mask, cpu_online_mask);
+   } else {
+   int n;
+
+   cpumask_and(mask,
+   cpumask_of_node(numa_node), cpu_online_mask);
+
+   n = cpumask_weight(mask);
+   if (q >= n) {
+   q -= n;
+   cpumask_andnot(mask, cpu_online_mask, mask);
+   }
+   }
+
+   for_each_cpu(affinity_cpu, mask) {
+   if (--q < 0)
+   goto out;
+   }
+
+   ret = -EINVAL;
+
+out:
+   free_cpumask_var(mask);
+
+   if (!ret)
+   cpumask_set_cpu(affinity_cpu, affinity_mask);
+
+   return ret;
+}
+EXPORT_SYMBOL(irq_set_mq_dev_affinit_hint);
+
 static void irq_affinity_notify(struct work_struct *work)
 {
struct irq_affinity_notify *notify =
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next V5 2/2] net/mlx4_en: Use affinity hint

2014-03-11 Thread Amir Vadai
From: Yuval Atias 

The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it.  To accomplish this, we use
netif_set_rx_queue_affinity_hint(), that sets the affinity hint
according the following policy:
First it maps IRQs to “close” numa cores.  If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias 
Signed-off-by: Amir Vadai 
---
 drivers/infiniband/hw/mlx4/main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |  6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 30 ++
 drivers/net/ethernet/mellanox/mlx4/eq.c| 14 +++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 include/linux/mlx4/device.h|  2 +-
 6 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index f9c12e9..7b4725d 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1837,7 +1837,7 @@ static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, 
struct mlx4_ib_dev *ibdev)
i, j, dev->pdev->bus->name);
/* Set IRQ for specific name (per ring) */
if (mlx4_assign_eq(dev, name, NULL,
-  &ibdev->eq_table[eq])) {
+  &ibdev->eq_table[eq], NULL)) {
/* Use legacy (same as mlx4_en driver) */
pr_warn("Can't allocate EQ %d; reverting to 
legacy\n", eq);
ibdev->eq_table[eq] =
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c 
b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 70e9532..b09418b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -119,11 +119,15 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct 
mlx4_en_cq *cq,
if (cq->is_tx == RX) {
if (mdev->dev->caps.comp_pool) {
if (!cq->vector) {
+   struct mlx4_en_rx_ring *ring =
+   priv->rx_ring[cq->ring];
+
sprintf(name, "%s-%d", priv->dev->name,
cq->ring);
/* Set IRQ for specific name (per ring) */
if (mlx4_assign_eq(mdev->dev, name, rmap,
-  &cq->vector)) {
+  &cq->vector,
+  ring->affinity_mask)) {
cq->vector = (cq->ring + 1 + priv->port)
% mdev->dev->caps.num_comp_vectors;
mlx4_warn(mdev, "Failed Assigning an EQ 
to "
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 3db5946..a6fc7a6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1521,6 +1521,32 @@ static void mlx4_en_linkstate(struct work_struct *work)
mutex_unlock(&mdev->state_lock);
 }
 
+static void mlx4_en_init_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
+{
+   struct mlx4_en_rx_ring *ring = priv->rx_ring[ring_idx];
+   int numa_node = priv->mdev->dev->numa_node;
+
+   if (numa_node == -1)
+   return;
+
+   if (!zalloc_cpumask_var(&ring->affinity_mask, GFP_KERNEL)) {
+   en_err(priv, "Failed to allocate core mask\n");
+   return;
+   }
+
+   if (irq_set_mq_dev_affinit_hint(ring_idx, numa_node,
+   ring->affinity_mask)) {
+   en_err(priv, "Failed setting affinity hint\n");
+   free_cpumask_var(ring->affinity_mask);
+   ring->affinity_mask = NULL;
+   }
+}
+
+static void mlx4_en_free_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
+{
+   free_cpumask_var(priv->rx_ring[ring_idx]->affinity_mask);
+   priv->rx_ring[ring_idx]->affinity_mask = NULL;
+}
 
 int mlx4_en_start_port(struct net_device *dev)
 {
@@ -1562,6 +1588,8 @@ int mlx4_en_start_port(struct net_device *dev)
 
mlx4_en_cq_init_lock(cq);
 
+   mlx4_en_init_affinity_hint(priv, i);
+
err = mlx4_en_activate_cq(priv, cq, i);
i

[PATCH net-next] net/mlx4_en: mlx4_en_verify_params() can be static

2014-03-05 Thread Amir Vadai
From: Fengguang Wu 

Fix static error introduced by commit:
b97b33a3df0439401f80f041eda507d4fffa0dbf [645/653] net/mlx4_en: Verify
mlx4_en module parameters

sparse warnings:
drivers/net/ethernet/mellanox/mlx4/en_main.c:335:6: sparse: symbol
'mlx4_en_verify_params' was not declared. Should it be static?

CC: net...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: Eugenia Emantayev 
Signed-off-by: Fengguang Wu 
Signed-off-by: Amir Vadai 
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c 
b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 3454437..0c59d4f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -332,7 +332,7 @@ static struct mlx4_interface mlx4_en_interface = {
.protocol   = MLX4_PROT_ETH,
 };
 
-void mlx4_en_verify_params(void)
+static void mlx4_en_verify_params(void)
 {
if (pfctx > MAX_PFC_TX) {
pr_warn("mlx4_en: WARNING: illegal module parameter pfctx 0x%x 
- should be in range 0-0x%x, will be changed to default (0)\n",
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mlx4_core: DMA-API: device driver tries to sync DMA memory it has not allocated

2014-02-27 Thread Amir Vadai
On 10/12/13 14:44 +0100, Gerald Schaefer wrote:
> Hi,
> 
> During network stress test with CONFIG_DMA_API_DEBUG=y we get the following
> message on the server side:
> 
> [57523.955982] mlx4_core :00:00.0: DMA-API: device driver tries to sync 
> DMA memory it has not allocated [device address=0x000143d2c202] 
> [size=1526 bytes]
> [57523.956018] [ cut here ]
> [57523.956021] WARNING: at 
> /home/dailybuild/builder/packages/BUILD/linux-3.12.3-20131209/lib/dma-debug.c:986
> [57523.956022] Modules linked in: mlx4_en dm_multipath scsi_dh mlx4_core 
> eadm_sch dm_mod lcs autofs4
> [57523.956034] CPU: 3 PID: 20 Comm: ksoftirqd/3 Not tainted 
> 3.12.3-65.x.20131209-s390xdefault #1
> [57523.956037] task: 00026f47cb60 ti: 00026f484000 task.ti: 
> 00026f484000
> [57523.956044] Krnl PSW : 0404e0018000 004a0422 
> (check_sync+0xf2/0x52c)
> [57523.956047]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 
> EA:3
> [57523.956047] Krnl GPRS: 0111 8b767e9c 0092 
> 
> [57523.956051]004a041e  05f6 
> 000143d2c202
> [57523.956053]00026f487a98 0001 00a71144 
> 00026f10dd60
> [57523.956055] 0075bb18 004a041e 
> 00026f487980
> [57523.956064] Krnl Code: 004a0412: c020002143d7larl
> %r2,8c8bc0
> [57523.956064]004a0418: c0e50012717abrasl   
> %r14,6ee70c
> [57523.956064]   #004a041e: a7f40001brc 
> 15,4a0420
> [57523.956064]   >004a0422: e330a012lt  
> %r3,0(%r10)
> [57523.956064]004a0428: a7740201brc 
> 7,4a082a
> [57523.956064]004a042c: a7f401f3brc 
> 15,4a0812
> [57523.956064]004a0430: d5078030c030clc 
> 48(8,%r8),48(%r12)
> [57523.956064]004a0436: a7c40070brc 
> 12,4a0516
> [57523.956081] Call Trace:
> [57523.956084] ([<004a041e>] check_sync+0xee/0x52c)
> [57523.956087]  [<004a0a26>] debug_dma_sync_single_for_cpu+0x66/0x74
> [57523.956094]  [<03ff801fcae0>] mlx4_en_complete_rx_desc+0x15c/0x240 
> [mlx4_en]
> [57523.956098]  [<03ff801fe1fa>] mlx4_en_process_rx_cq+0x56a/0xb58 
> [mlx4_en]
> [57523.956101]  [<03ff801fe8ea>] mlx4_en_poll_rx_cq+0x92/0xf8 [mlx4_en]
> [57523.956105]  [<005d558c>] net_rx_action+0xa0/0x2f0
> [57523.956109]  [<0013cd10>] __do_softirq+0x1e8/0x480
> [57523.956111]  [<0013cfe4>] run_ksoftirqd+0x3c/0x80
> [57523.956114]  [<0016f496>] smpboot_thread_fn+0x2c6/0x2e8
> [57523.956118]  [<00163834>] kthread+0xd0/0xe4
> [57523.956123]  [<006ff19e>] kernel_thread_starter+0x6/0xc
> [57523.956125]  [<006ff198>] kernel_thread_starter+0x0/0xc
> [57523.956127] 1 lock held by ksoftirqd/3/20:
> [57523.956129]  #0:  (&(&dma_entry_hash[i].lock)->rlock){-.-...}, at: 
> [<0049f1d6>] get_hash_bucket+0x4a/0x64
> [57523.956135] Last Breaking-Event-Address:
> [57523.956137]  [<004a041e>] check_sync+0xee/0x52c
> [57523.956139] ---[ end trace ec82af84e1d16c5f ]---
> 
> --

Hi Gerald,

It seems that I missed this mail.

I didn't see this warning in our lab in x86 platform.
Is it reproduced easily on PPC?
Any hints how to reproduce it in our lab?

Do you have a 'good' version that didn't have the warning?

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ixgbevf: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-02-18 Thread Amir Vadai
On 18/02/14 14:01 +0200, Amir Vadai wrote:
> On 18/02/14 11:11 +0100, Alexander Gordeev wrote:
> > As result of deprecation of MSI-X/MSI enablement functions
> > pci_enable_msix() and pci_enable_msi_block() all drivers
> > using these two interfaces need to be updated to use the
> > new pci_enable_msi_range() and pci_enable_msix_range()
> > interfaces.
> > 
> > Signed-off-by: Alexander Gordeev 
> > Cc: Jeff Kirsher 
> > Cc: Jesse Brandeburg 
> > Cc: Bruce Allan 
> > Cc: e1000-de...@lists.sourceforge.net
> > Cc: net...@vger.kernel.org
> > Cc: linux-...@vger.kernel.org
> > ---
> 
> Acked-By: Amir Vadai 
Oh, I'm still working for Mellanox - just replied to the wrong mail :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ixgbevf: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-02-18 Thread Amir Vadai
On 18/02/14 11:11 +0100, Alexander Gordeev wrote:
> As result of deprecation of MSI-X/MSI enablement functions
> pci_enable_msix() and pci_enable_msi_block() all drivers
> using these two interfaces need to be updated to use the
> new pci_enable_msi_range() and pci_enable_msix_range()
> interfaces.
> 
> Signed-off-by: Alexander Gordeev 
> Cc: Jeff Kirsher 
> Cc: Jesse Brandeburg 
> Cc: Bruce Allan 
> Cc: e1000-de...@lists.sourceforge.net
> Cc: net...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> ---

Acked-By: Amir Vadai 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mlx4: Use pci_enable_msix_range()

2014-02-02 Thread Amir Vadai
On 31/01/14 16:08 +0100, Alexander Gordeev wrote:
> As result of deprecation of MSI-X/MSI enablement functions
> pci_enable_msix() and pci_enable_msi_block() all drivers
> using these two interfaces need to be updated to use the
> new pci_enable_msi_range() and pci_enable_msix_range()
> interfaces.
> 
> Signed-off-by: Alexander Gordeev 
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c |   19 ---
>  1 files changed, 4 insertions(+), 15 deletions(-)
> 

Acked By: Amir Vadai 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: net/mlx4_core: clean up two functions

2014-01-13 Thread Amir Vadai
On 13/01/14 11:13 -0800, David Miller wrote:
> From: Paul Bolle 
> Date: Tue, 07 Jan 2014 14:00:22 +0100
> 
> > 0) These two patches are very similar. They both clean up a function to
> > help GCC understand the codeflow. Both help silence a warning.
> > 
> > 1) Compile tested only (on 32 bits x86). I do not have this hardware.
> > 
> > 2) Please note that there's no MAINTAINERS entry for mlx4_core. (Neither
> > is there an entry for the MLX4 IB driver.) Shouldn't it be added? 
> 
> These patches have been rotting for a week.  I know the mlx4 folks
> said the SRIOV guy inside Mellanox will look at it, but this is taking
> way too long.
> 
> This is absolutely unreasonable from Paul's perspective to have to wait
> this long for a review of these relatively simple patches.

You're absolutely right.
And yes, jack is very busy.
We will have a reply ready tomorrow morning.

Amir


> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: net: mlx4: slight optimization of addr compare

2013-12-29 Thread Amir Vadai
On 27/12/13 14:48 +0800, Ding Tianhong wrote:
> Use possibly more efficient ether_addr_equal
> to instead of memcmp.
> 
> Cc: Amir Vadai 
> Signed-off-by: Ding Tianhong 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c| 4 ++--
>  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 6f92090..7e43858 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -782,7 +782,7 @@ static void update_mclist_flags(struct mlx4_en_priv *priv,
>   list_for_each_entry(dst_tmp, dst, list) {
>   found = false;
>   list_for_each_entry(src_tmp, src, list) {
> - if (!memcmp(dst_tmp->addr, src_tmp->addr, ETH_ALEN)) {
> + if (ether_addr_equal(dst_tmp->addr, src_tmp->addr)) {
>   found = true;
>   break;
>   }
> @@ -797,7 +797,7 @@ static void update_mclist_flags(struct mlx4_en_priv *priv,
>   list_for_each_entry(src_tmp, src, list) {
>   found = false;
>   list_for_each_entry(dst_tmp, dst, list) {
> - if (!memcmp(dst_tmp->addr, src_tmp->addr, ETH_ALEN)) {
> + if (ether_addr_equal(dst_tmp->addr, src_tmp->addr)) {
>   dst_tmp->action = MCLIST_NONE;
>   found = true;
>   break;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> index 2f3f2bc..2e3232c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> @@ -3634,7 +3634,7 @@ static int validate_eth_header_mac(int slave, struct 
> _rule_hw *eth_header,
>   !is_broadcast_ether_addr(eth_header->eth.dst_mac)) {
>   list_for_each_entry_safe(res, tmp, rlist, list) {
>   be_mac = cpu_to_be64(res->mac << 16);
> - if (!memcmp(&be_mac, eth_header->eth.dst_mac, ETH_ALEN))
> + if (ether_addr_equal((u8 *)&be_mac, 
> eth_header->eth.dst_mac))
>   return 0;
>   }
>   pr_err("MAC %pM doesn't belong to VF %d, Steering rule 
> rejected\n",
> -- 
> 1.8.0
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Acked-By: Amir Vadai 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC net-next] net: epoll support for busy poll

2013-08-27 Thread Amir Vadai
On 26/08/2013 09:03, Eliezer Tamir wrote:
> On 26/08/2013 00:30, Amir Vadai wrote:
>> I'm on vacation, will test and have some inputs later this week when I be
>> back.
>>
> Hello Amir,
> 
> Ping me when you get back and I will send you my latest so you can
> play with it.
> 

Hi Eliezer,

You can send the latest version, so I can play with it and get some numbers.

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sockperf: add SO_LL socketop support

2013-06-12 Thread Amir Vadai
On 11/06/2013 17:26, Eliezer Tamir wrote:
> Add lls socket option support to sockperf.
> Right now we always get the option before set to show the option is
> working properly. We should probably remove that in an official release.
> use --lls (value in usecs) to override global setting.
> ---
> 
>  src/Defs.h   |3 +++
>  src/SockPerf.cpp |   54
> +-
>  2 files changed, 56 insertions(+), 1 deletions(-)
> 
> diff --git a/src/Defs.h b/src/Defs.h
> index e38e3a4..87b45a0 100644
> --- a/src/Defs.h
> +++ b/src/Defs.h
> @@ -161,6 +161,7 @@ enum {
>  OPT_OUTPUT_PRECISION,   //35
>  OPT_CLIENTPORT, //36
>  OPT_CLIENTIP,   //37
> +OPT_LLS,//38
>  };
> 
>  #define MODULE_NAME"sockperf"
> @@ -527,6 +528,8 @@ struct user_params_t {
>  //bool stream_mode; - use b_stream instead
>  int mthread_server;
>  struct timeval* select_timeout;
> +unsigned long lls_usecs;
> +bool lls_is_set;
>  int sock_buff_size;
>  int threads_num;
>  char threads_affinity[MAX_ARGV_SIZE];
> diff --git a/src/SockPerf.cpp b/src/SockPerf.cpp
> index 41daf95..d76320f 100644
> --- a/src/SockPerf.cpp
> +++ b/src/SockPerf.cpp
> @@ -207,6 +207,10 @@ static const AOPT_DESC  common_opt_desc[] =
>  "Limit the lifetime of the message (default 2)."
>  },
>  {
> +OPT_LLS, AOPT_ARG, aopt_set_literal( 0 ), aopt_set_string(
> "lls" ),
> +"Turn on LLS via socket option (value = us to poll)."
> +},
> +{
>  OPT_BUFFER_SIZE, AOPT_ARG, aopt_set_literal( 0 ),
> aopt_set_string( "buffer-size" ),
>  "Set total socket receive/send buffer  in bytes (system
> defined by default)."
>  },
> @@ -292,7 +296,7 @@ static int proc_mode_help( int id, int argc, const
> char **argv )
>  int   i = 0;
> 
>  printf(MODULE_NAME " is a tool for testing network latency and
> throughput.\n");
> -printf("version %s\n", STR(VERSION));
> +printf("version %s-lls\n", STR(VERSION));
>  printf("\n");
>  printf("Usage: " MODULE_NAME "  [options] [args]\n");
>  printf("Type: \'" MODULE_NAME "  --help\' for help on a
> specific subcommand.\n");
> @@ -1789,6 +1793,26 @@ static int parse_common_opt( const AOPT_OBJECT
> *common_obj )
>  s_user_params.is_nonblocked_send = true;
>  }
> 
> +if ( !rc && aopt_check(common_obj, OPT_LLS) ) {
> +const char* optarg = aopt_value(common_obj, OPT_LLS);
> +if (optarg) {
> +errno = 0;
> +int value = strtoul(optarg, NULL, 0);
> +if (errno != 0 || value < 0) {
> +log_msg("'-%d' Invalid LLS value: %s", OPT_LLS,
> optarg);
> +rc = SOCKPERF_ERR_BAD_ARGUMENT;
> +}
> +else {
> +s_user_params.lls_usecs = value;
> +s_user_params.lls_is_set = true;
> +}
> +}
> +else {
> +log_msg("'-%d' Invalid value", OPT_LLS);
> +rc = SOCKPERF_ERR_BAD_ARGUMENT;
> +}
> +}
> +
>  if ( !rc && aopt_check(common_obj, OPT_RECV_LOOPING) ) {
> 
>  const char* optarg = aopt_value(common_obj, OPT_RECV_LOOPING);
> @@ -2296,6 +2320,29 @@ int sock_set_reuseaddr(int fd)
>  return rc;
>  }
> 
> +#ifndef SO_LL
> +#define SO_LL 46
> +#endif
> +int sock_set_lls(int fd)
> +{
> +int rc = SOCKPERF_ERR_NONE;
> +unsigned long lls;
> +int size = sizeof(lls);
> +
> +if(getsockopt(fd, SOL_SOCKET, SO_LL, &lls, (socklen_t *)&size) < 0){
> +log_err("getsockopt(SO_LL) failed");
> +return SOCKPERF_ERR_SOCKET;
> +} else
> +log_msg("socket option SO_LL default was %lu, changing to %lu",
> lls, s_user_params.lls_usecs);
> +
> +if (setsockopt(fd, SOL_SOCKET, SO_LL, &(s_user_params.lls_usecs),
> sizeof(s_user_params.lls_usecs)) < 0) {
> +log_err("setsockopt(SO_LL) failed");
> +rc = SOCKPERF_ERR_SOCKET;
> +}
> +return rc;
> +}
> +
> +
>  int sock_set_snd_rcv_bufs(int fd)
>  {
>  /*
> @@ -2460,6 +2507,11 @@ int prepare_socket(int fd, struct fds_data *p_data)
>  }
> 
>  if (!rc &&
> +(s_user_params.lls_is_set == true))
> +{
> +rc = sock_set_lls(fd);
> +}
> +if (!rc &&
>  (s_user_params.sock_buff_size > 0))
>  {
>  rc = sock_set_snd_rcv_bufs(fd);
> 
> 
> 
> 
> 

Eliezer Hi,

Added sockperf maintainers to CC list.

To upload changes to sockperf code, please open a ticket on the sockperf
google code and then submit the patch
(https://code.google.com/p/sockperf/issues/).

You will need to wrap it in " #ifndef w-i-n-d-o-w-s" but sockperf
maintainers can help you do that on the sockperf pages.

You should look at the recently added TOS which is very similar
(https://code.google.com/p/sockperf/issues/detail?id=44).

Amir

Re: [PATCH v8 net-next 2/7] net: add low latency socket poll

2013-06-03 Thread Amir Vadai
On 03/06/2013 11:01, Eliezer Tamir wrote:
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 741db5fc..ae98c95 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static int one = 1;
>  
> @@ -284,6 +285,15 @@ static struct ctl_table net_core_table[] = {
>   .proc_handler   = flow_limit_table_len_sysctl
>   },
>  #endif /* CONFIG_NET_FLOW_LIMIT */
> +#ifdef CONFIG_NET_LL_RX_POLL
> + {
> + .procname   = "low_latency_poll",
> + .data   = &sysctl_net_ll_poll,
> + .maxlen = sizeof(unsigned long),
One more thing I just noticed: should be 'sizeof(int)'.
Because proc_dointvec is expecting an int, sizeof(unsigned long) will
produce an array of two ints.


> + .mode   = 0644,
> + .proc_handler   = proc_dointvec
> + },
> +#endif
>  #endif /* CONFIG_NET */
>   {
>   .procname   = "netdev_budget",

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 net-next 0/7] net: low latency Ethernet device polling

2013-06-03 Thread Amir Vadai
Acked-By: Amir Vadai 

Nice work Eliezer :)

On 03/06/2013 11:01, Eliezer Tamir wrote:
> David,
> 
> Here is v8 with yet more fixes.
> 
> Please consider applying.
> 
> Thanks to everyone for their input.
> 
> -Eliezer
> 
> change log:
> v8
> - split out udp and select/poll into seperate patches.
>   what used to be patch 2/5 is now three patches.
> - type corrections from Amir Vadai and Cong Wang:
>   one unsigned long that was left when chenging to cycles_t
>   int -> bool 
> - more detailed patch descriptions.
> 
> v7
> - suggested by Ben Hutchings and Eric Dumazet:
>   type fixes, static for globals in net/core.c,
>   avoid napi_id collisions in napi_hash_add()
> 
> v6
> - many small fixes suggested by Eric Dumazet:
>   data locality, typos, documentation
>   protect napi_hash insert/delete with a spinlock (napi_gen_id is no
>   longer atomic_t since it's only accessed with the spinlock held.)
> - added IPv6 TCP and UDP support (only minimally tested)
> 
> v5
> - corrections suggested by Ben Hutchings:
>   fixed typos, moved the config option and sysctl value from IPv4 to net
> - moved sk_mark_ll() to the protocol handlers
> - removed global id mechanism, replaced with a hashed napi_id.
>   based on code sample from Eric Dumazet
>   Note that ixgbe_free_q_vector() already waits an rcu grace period
>   before freeing the q_vector, so nothing additional needs to be done
>   when adding a call to napi_hash_del().
> - simple poll/select support
> 
> v4
> - removed separate config option for TCP as suggested Eric Dumazet.
> - added linux mib counter for packets received through the low latency path,
>   as suggested by Andi Kleen.
> - re-allow module unloading, remove module param, use a global generation id
>   instead to prevent the use of a stale napi pointer, as suggested
>   by Eric Dumazet
> - updated Documentation/networking/ip-sysctl.txt text
> 
> v3
> - coding style changes suggested by Dave Miller
> 
> v2
> - the sysctl knob is now in microseconds. The default value is now 0 (off).
> - for now the code depends at configure time on CONFIG_I86_TSC 
> - the napi reference in struct skb is now a union with the dma cookie
>   since the former is only used on RX and the latter on TX,
>   as suggested by Eric Dumazet.
> - we do a better job at honoring non-blocking operations.
> - removed busy-polling support for tcp_read_sock()
> - remove dynamic disabling of GRO
> - coding style fixes
> - disallow unloading the device module after the feature has been used
> 
> Credit:
> Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
> Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
> Mike Polehn, Anil Vasudevan, Don Wood
> Special thanks for finding bugs in earlier versions:
> Willem de Bruijn and Andi Kleen
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 net-next 2/5] net: implement support for low latency socket polling

2013-05-30 Thread Amir Vadai
On 30/05/2013 14:41, Eliezer Tamir wrote:
> diff --git a/fs/select.c b/fs/select.c
> index 8c1c96c..f116bf0 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -400,6 +401,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec 
> *end_time)
>   poll_table *wait;
>   int retval, i, timed_out = 0;
>   unsigned long slack = 0;
> + cycles_t ll_time = ll_end_time();
>  
>   rcu_read_lock();
>   retval = max_select_fd(n, fds);
> @@ -486,6 +488,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec 
> *end_time)
>   break;
>   }
>  
> + if (can_poll_ll(ll_time))
> + continue;
I don't see here discrimination between sockets that you want to poll
and other sockets.
So it means that select will busy poll every type of file descriptor
instead of going to sleep.
Should have a condition with something like sk_valid_ll()

>   /*
>* If this is the first loop and we have a timeout
>* given, then we convert to ktime_t and set the to


...

> +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
> +{
> +}
> +
> +static inline bool can_poll_ll(unsigned long end_time)
should use here cycles_t too.

> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_NET_LL_RX_POLL */
> +#endif /* _LINUX_NET_LL_POLL_H */


Also, something general about this patch. I think you should split out
to separate patches all the users of the feature, like you did for TCP.
I would suggest small patches for UDP, select and poll.

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 85/86] mlx4_en: fix allocation of CPU affinity reverse-map

2013-03-28 Thread Amir Vadai
This fix introduced a bug in SRIOV.

Should squash into it, the upstream commit:
f74d525bc973f2003b55b1f71f377e31fb5d3c8b - "net/mlx4_en: Disable RFS in
SRIOV virtual functions"

Thanks,
Amir


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: build failure after merge of the akpm tree

2013-02-11 Thread Amir Vadai
On 11/02/2013 09:13, Stephen Rothwell wrote:
> Hi Andrew,
> 
> After merging the akpm tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 
> 'mlx4_en_process_rx_cq':
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:628:53: error: macro 
> "hlist_for_each_entry_rcu" passed 4 arguments, but takes just 3
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:628:5: error: 
> 'hlist_for_each_entry_rcu' undeclared (first use in this function)
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:628:55: error: expected ';' before 
> '{' token
> 
> Caused by commit c07cb4b0ab78 ("net/mlx4_en: Manage hash of MAC addresses
> per port") from the net-next tree interacting with commit "hlist: drop
> the node parameter from iterators" from the akpm tree.
> 
> I applied the following merge fix patch for today:
> 
> From 7a10f5e7e8d1232d618307d568ea9a78dc4680bb Mon Sep 17 00:00:00 2001
> From: Stephen Rothwell 
> Date: Mon, 11 Feb 2013 18:01:23 +1100
> Subject: [PATCH] net/mlx4_en: fix up for hlist_for_each_entry_rcu API change
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index ce38654..19a9c05 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -617,7 +617,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
> mlx4_en_cq *cq, int bud
>  
>   if (is_multicast_ether_addr(ethh->h_dest)) {
>   struct mlx4_mac_entry *entry;
> - struct hlist_node *n;
>   struct hlist_head *bucket;
>   unsigned int mac_hash;
>  
> @@ -625,7 +624,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
> mlx4_en_cq *cq, int bud
>   mac_hash = ethh->h_source[MLX4_EN_MAC_HASH_IDX];
>   bucket = &priv->mac_hash[mac_hash];
>   rcu_read_lock();
> - hlist_for_each_entry_rcu(entry, n, bucket, 
> hlist) {
> + hlist_for_each_entry_rcu(entry, bucket, hlist) {
>       if (ether_addr_equal_64bits(entry->mac,
>   
> ethh->h_source)) {
>   rcu_read_unlock();
> 

Acked-By: Amir Vadai 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] lib: cpu_rmap: avoid flushing all workqueues

2013-01-07 Thread Amir Vadai

On 02/01/2013 23:52, David Decotigny wrote:

In some cases, free_irq_cpu_rmap() is called while holding a lock
(eg. rtnl). This can lead to deadlocks, because it invokes
flush_scheduled_work() which ends up waiting for whole system
workqueue to flush, but some pending works might try to acquire the
lock we are already holding.

This commit uses reference-counting to replace
irq_run_affinity_notifiers(). It also removes
irq_run_affinity_notifiers() altogether.

Signed-off-by: David Decotigny
---
  include/linux/cpu_rmap.h  |   13 +++---
  include/linux/interrupt.h |5 
  lib/cpu_rmap.c|   63 +
  3 files changed, 62 insertions(+), 19 deletions(-)



Acked-by: Amir Vadai 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] lib: cpu_rmap: avoid flushing all workqueues

2013-01-07 Thread Amir Vadai

On 02/01/2013 23:52, David Decotigny wrote:

In some cases, free_irq_cpu_rmap() is called while holding a lock
(eg. rtnl). This can lead to deadlocks, because it invokes
flush_scheduled_work() which ends up waiting for whole system
workqueue to flush, but some pending works might try to acquire the
lock we are already holding.

This commit uses reference-counting to replace
irq_run_affinity_notifiers(). It also removes
irq_run_affinity_notifiers() altogether.

Signed-off-by: David Decotigny
---
  include/linux/cpu_rmap.h  |   13 +++---
  include/linux/interrupt.h |5 
  lib/cpu_rmap.c|   63 +
  3 files changed, 62 insertions(+), 19 deletions(-)



Acked-by: Amir Vadai 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/