Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-20 Thread Phil Sutter
Hi Roman,

On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:
> So far, if the filter was too large to fit in the allocated skb, the
> kernel did not return any error and stopped dumping. Modify the dumper
> so that it returns -EMSGSIZE when a filter fails to dump and it is the
> first filter in the skb. If we are not first, we will get a next chance
> with more room.
> 
> I understand this is pretty near to being an API change, but the
> original design (silent truncation) can be considered a bug.
> 
> Note: The error case can happen pretty easily if you create a filter
> with 32 actions and have 4kb pages. Also recent versions of iproute try
> to be clever with their buffer allocation size, which in turn leads to

I'm curious, what does it lead to? :)

Thanks, Phil


Re: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-20 Thread Sargun Dhillon
On Mon, Feb 19, 2018 at 4:00 PM, Daniel Borkmann  wrote:
> On 02/19/2018 05:22 PM, Sargun Dhillon wrote:
>> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant
>> to be used for seccomp filters as an alternative to cBPF filters. The
>> program type has relatively limited capabilities in terms of helpers,
>> but that can be extended later on.
>>
>> The eBPF code loading is separated from attachment of the filter, so
>> a privileged user can load the filter, and pass it back to an
>> unprivileged user who can attach it and use it at a later time.
>>
>> In order to attach the filter itself, you need to supply a flag to the
>> seccomp syscall indicating that a eBPF filter is being attached, as
>> opposed to a cBPF one. Verification occurs at program load time,
>> so the user should only receive errors related to attachment.
>>
>> Signed-off-by: Sargun Dhillon 
> [...]
>> @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>
>>   spin_lock_irq(>sighand->siglock);
>>
>> - if (!seccomp_may_assign_mode(seccomp_mode))
>> + if (!seccomp_may_assign_mode(filter_mode))
>>   goto out;
>>
>>   ret = seccomp_attach_filter(flags, prepared);
>> @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>   /* Do not free the successfully attached filter. */
>>   prepared = NULL;
>>
>> - seccomp_assign_mode(current, seccomp_mode);
>> + seccomp_assign_mode(current, filter_mode);
>>  out:
>>   spin_unlock_irq(>sighand->siglock);
>>   if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, 
>> unsigned long filter_off,
>>   if (IS_ERR(filter))
>>   return PTR_ERR(filter);
>>
>> - fprog = filter->prog->orig_prog;
>> - if (!fprog) {
>> + if (!bpf_prog_was_classic(filter->prog)) {
>
> This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for
> dumping seccomp filters") and would cause a NULL ptr deref in case the filter
> was created with bpf_prog_create_from_user() with save_orig as false, so the
> if (!fprog) test for cBPF cannot be removed from here.
>
Isn't this function within:
#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
#endif

And, above, where bpf_prog_create_from_user is, save_prog is derived from:
const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);

Are there any other places this can be loaded, or this function can be
exposes if CONFIG_CHECKPOINT_RESTORE = n?


>>   /* This must be a new non-cBPF filter, since we save
>>* every cBPF filter's orig_prog above when
>>* CONFIG_CHECKPOINT_RESTORE is enabled.
>> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, 
>> unsigned long filter_off,
>>   goto out;
>>   }
>>
>> + fprog = filter->prog->orig_prog;
>>   ret = fprog->len;
>
> (See above.)
>
>>   if (!data)
>>   goto out;
>> @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct 
>> ctl_table *ro_table, int write,
>>   return 0;
>>  }
>>
>> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
>> +static bool seccomp_is_valid_access(int off, int size,
>> + enum bpf_access_type type,
>> + struct bpf_insn_access_aux *info)
>> +{
>> + if (type != BPF_READ)
>> + return false;
>> +
>> + if (off < 0 || off + size > sizeof(struct seccomp_data))
>> + return false;
>
> if (off % size != 0)
> return false;
>
>> + switch (off) {
>> + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]):
>> + return (size == sizeof(__u64));
>> + case bpf_ctx_range(struct seccomp_data, nr):
>> + return (size == FIELD_SIZEOF(struct seccomp_data, nr));
>> + case bpf_ctx_range(struct seccomp_data, arch):
>> + return (size == FIELD_SIZEOF(struct seccomp_data, arch));
>> + case bpf_ctx_range(struct seccomp_data, instruction_pointer):
>> + return (size == FIELD_SIZEOF(struct seccomp_data,
>> +  instruction_pointer));
>
> default:
> return false;
>
> [...]
>> +static const struct bpf_func_proto *
>> +seccomp_func_proto(enum bpf_func_id func_id)
>> +{
>> + switch (func_id) {
>> + case BPF_FUNC_get_current_uid_gid:
>> + return _get_current_uid_gid_proto;
>> + case BPF_FUNC_ktime_get_ns:
>> + return _ktime_get_ns_proto;
>> + case BPF_FUNC_get_prandom_u32:
>> + return _get_prandom_u32_proto;
>> + case BPF_FUNC_get_current_pid_tgid:
>> + return _get_current_pid_tgid_proto;
>
> Do you have a use-case description for the above helpers? Is the prandom/ktime
> one for simulating errors coming from the syscall? And the other two for
> 

Re: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-20 Thread Sargun Dhillon
On Mon, Feb 19, 2018 at 4:00 PM, Daniel Borkmann  wrote:
> On 02/19/2018 05:22 PM, Sargun Dhillon wrote:
>> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant
>> to be used for seccomp filters as an alternative to cBPF filters. The
>> program type has relatively limited capabilities in terms of helpers,
>> but that can be extended later on.
>>
>> The eBPF code loading is separated from attachment of the filter, so
>> a privileged user can load the filter, and pass it back to an
>> unprivileged user who can attach it and use it at a later time.
>>
>> In order to attach the filter itself, you need to supply a flag to the
>> seccomp syscall indicating that a eBPF filter is being attached, as
>> opposed to a cBPF one. Verification occurs at program load time,
>> so the user should only receive errors related to attachment.
>>
>> Signed-off-by: Sargun Dhillon 
> [...]
>> @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>
>>   spin_lock_irq(>sighand->siglock);
>>
>> - if (!seccomp_may_assign_mode(seccomp_mode))
>> + if (!seccomp_may_assign_mode(filter_mode))
>>   goto out;
>>
>>   ret = seccomp_attach_filter(flags, prepared);
>> @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>   /* Do not free the successfully attached filter. */
>>   prepared = NULL;
>>
>> - seccomp_assign_mode(current, seccomp_mode);
>> + seccomp_assign_mode(current, filter_mode);
>>  out:
>>   spin_unlock_irq(>sighand->siglock);
>>   if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, 
>> unsigned long filter_off,
>>   if (IS_ERR(filter))
>>   return PTR_ERR(filter);
>>
>> - fprog = filter->prog->orig_prog;
>> - if (!fprog) {
>> + if (!bpf_prog_was_classic(filter->prog)) {
>
> This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for
> dumping seccomp filters") and would cause a NULL ptr deref in case the filter
> was created with bpf_prog_create_from_user() with save_orig as false, so the
> if (!fprog) test for cBPF cannot be removed from here.
>
>>   /* This must be a new non-cBPF filter, since we save
>>* every cBPF filter's orig_prog above when
>>* CONFIG_CHECKPOINT_RESTORE is enabled.
>> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, 
>> unsigned long filter_off,
>>   goto out;
>>   }
>>
>> + fprog = filter->prog->orig_prog;
>>   ret = fprog->len;
>
> (See above.)
>
>>   if (!data)
>>   goto out;
>> @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct 
>> ctl_table *ro_table, int write,
>>   return 0;
>>  }
>>
>> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
>> +static bool seccomp_is_valid_access(int off, int size,
>> + enum bpf_access_type type,
>> + struct bpf_insn_access_aux *info)
>> +{
>> + if (type != BPF_READ)
>> + return false;
>> +
>> + if (off < 0 || off + size > sizeof(struct seccomp_data))
>> + return false;
>
> if (off % size != 0)
> return false;
>
Won't this break access to the instruction pointer, and args if
sizeof(int) != 4? Don't know any if any architectures fall under that.

>> + switch (off) {
>> + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]):
>> + return (size == sizeof(__u64));
>> + case bpf_ctx_range(struct seccomp_data, nr):
>> + return (size == FIELD_SIZEOF(struct seccomp_data, nr));
>> + case bpf_ctx_range(struct seccomp_data, arch):
>> + return (size == FIELD_SIZEOF(struct seccomp_data, arch));
>> + case bpf_ctx_range(struct seccomp_data, instruction_pointer):
>> + return (size == FIELD_SIZEOF(struct seccomp_data,
>> +  instruction_pointer));
>
> default:
> return false;
>
> [...]
>> +static const struct bpf_func_proto *
>> +seccomp_func_proto(enum bpf_func_id func_id)
>> +{
>> + switch (func_id) {
>> + case BPF_FUNC_get_current_uid_gid:
>> + return _get_current_uid_gid_proto;
>> + case BPF_FUNC_ktime_get_ns:
>> + return _ktime_get_ns_proto;
>> + case BPF_FUNC_get_prandom_u32:
>> + return _get_prandom_u32_proto;
>> + case BPF_FUNC_get_current_pid_tgid:
>> + return _get_current_pid_tgid_proto;
>
> Do you have a use-case description for the above helpers? Is the prandom/ktime
> one for simulating errors coming from the syscall? And the other two for
> orchestration purposes?
>
My specific use case with uid_guid and pid is for containers, I have a
use case where I can put systemd, or a privileged init system into a
container, pid 1, running as uid 0 will get access to whetever is

Re: [PATCH iproute2-next v2 2/7] iplink: Correctly report error when network device isn't found

2018-02-20 Thread Serhey Popovych
David Ahern wrote:
> On 2/20/18 2:37 PM, Serhey Popovych wrote:
>> Distinguish cases when "dev" parameter isn't given from cases where no
>> network device corresponding to "dev" is found.
>>
>> Do not check for index validity in xdp_parse(): caller should take care
>> of this because has more information (e.g. when "dev" is given or not
>> found) for this.
>>
>> Signed-off-by: Serhey Popovych 
>> ---
>>  ip/iplink.c |   16 +---
>>  ip/iplink_xdp.c |7 +--
>>  2 files changed, 14 insertions(+), 9 deletions(-)
>>
> 
> don't really see any benefit from this one.

This clearly shows to user what is wrong with it's command line: either
dev/name is missing while parsing VF/XDP or it specifies non existent
network device (or at least ll_name_to_index() can't resolve it).

Moving ifindex check from xdp_parse() to the caller is done to
consolidate all dev/name parsing and validating code in single place:
iplink_parse().

> 




signature.asc
Description: OpenPGP digital signature


Re: [PATCH iproute2-next v2] ip: link_gre6.c: Support IP6_TNL_F_ALLOW_LOCAL_REMOTE flag

2018-02-20 Thread Serhey Popovych
Petr Machata wrote:
> For IP-in-IP tunnels, one can specify the [no]allow-localremote command
> when configuring a device. Under the hood, this flips the
> IP6_TNL_F_ALLOW_LOCAL_REMOTE flag on the netdevice. However, ip6gretap
> and ip6erspan devices, where the flag is also relevant, are not IP-in-IP
> tunnels, and thus there's no way to configure the flag on these
> netdevices. Therefore introduce the command to link_gre6 as well.
> 
> The original support was introduced in commit
> 21440d19d957 ("ip: link_ip6tnl.c/ip6tunnel.c: Support 
> IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")

This would produce following checkpatch.pl error:

--
Commit b7a2f4a74cd0 ("ip: link_gre6.c: Support
IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")
--
WARNING: Possible unwrapped commit description (prefer a maximum 75
chars per line)
#14:
21440d19d957 ("ip: link_ip6tnl.c/ip6tunnel.c: Support
IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")

ERROR: Please use git commit description style 'commit <12+ chars of
sha1> ("")' - ie: 'commit 21440d19d957 ("ip:
link_ip6tnl.c/ip6tunnel.c: Support IP6_TNL_F_ALLOW_LOCAL_REMOT'
#14:
21440d19d957 ("ip: link_ip6tnl.c/ip6tunnel.c: Support
IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")

You probably should address it. In general change looks good to me.

> 
> Signed-off-by: Petr Machata 
> ---
> 
> Notes:
> Changes from v1 to v2:
> 
> - Rebase to iproute2-next
> 
>  ip/link_gre6.c| 11 +++
>  man/man8/ip-link.8.in | 14 ++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/ip/link_gre6.c b/ip/link_gre6.c
> index 6c77038..e0746bc 100644
> --- a/ip/link_gre6.c
> +++ b/ip/link_gre6.c
> @@ -48,6 +48,7 @@ static void gre_print_help(struct link_util *lu, int argc, 
> char **argv, FILE *f)
>   " [ dscp inherit ]\n"
>   " [ dev PHYS_DEV ]\n"
>   " [ fwmark MARK ]\n"
> + " [ [no]allow-localremote ]\n"
>   " [ external ]\n"
>   " [ noencap ]\n"
>   " [ encap { fou | gue | none } ]\n"
> @@ -346,6 +347,10 @@ get_failed:
>   invarg("invalid fwmark\n", *argv);
>   flags &= ~IP6_TNL_F_USE_ORIG_FWMARK;
>   }
> + } else if (strcmp(*argv, "allow-localremote") == 0) {
> + flags |= IP6_TNL_F_ALLOW_LOCAL_REMOTE;
> + } else if (strcmp(*argv, "noallow-localremote") == 0) {
> + flags &= ~IP6_TNL_F_ALLOW_LOCAL_REMOTE;
>   } else if (strcmp(*argv, "encaplimit") == 0) {
>   NEXT_ARG();
>   if (strcmp(*argv, "none") == 0) {
> @@ -534,6 +539,12 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
> struct rtattr *tb[])
>   if (oflags & GRE_CSUM)
>   print_bool(PRINT_ANY, "ocsum", "ocsum ", true);
>  
> + if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
> + print_bool(PRINT_ANY,
> +"ip6_tnl_f_allow_local_remote",
> +"allow-localremote ",
> +true);
> +
>   if (flags & IP6_TNL_F_USE_ORIG_FWMARK) {
>   print_bool(PRINT_ANY,
>  "ip6_tnl_f_use_orig_fwmark",
> diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
> index 481589e..5dee9fc 100644
> --- a/man/man8/ip-link.8.in
> +++ b/man/man8/ip-link.8.in
> @@ -793,6 +793,8 @@ the following additional arguments are supported:
>  ] [
>  .BI "dscp inherit"
>  ] [
> +.BI "[no]allow-localremote"
> +] [
>  .BI dev " PHYS_DEV "
>  ] [
>  .RB external
> @@ -857,6 +859,11 @@ flag is equivalent to the combination
>  - specifies a fixed flowlabel.
>  
>  .sp
> +.BI  [no]allow-localremote
> +- specifies whether to allow remote endpoint to have an address configured on
> +local host.
> +
> +.sp
>  .BI  tclass " TCLASS"
>  - specifies the traffic class field on
>  tunneled packets, which can be specified as either a two-digit
> @@ -927,6 +934,8 @@ the following additional arguments are supported:
>  ] [
>  .BR erspan_hwid " \fIhwid "
>  ] [
> +.BI "[no]allow-localremote"
> +] [
>  .RB external
>  ]
>  
> @@ -965,6 +974,11 @@ traffic's source port and direction.
>  is a 6-bit value for users to configure.
>  
>  .sp
> +.BI  [no]allow-localremote
> +- specifies whether to allow remote endpoint to have an address configured on
> +local host.
> +
> +.sp
>  .BR external
>  - make this tunnel externally controlled (or not, which is the default).
>  In the kernel, this is referred to as collect metadata mode.  This flag is
> 




signature.asc
Description: OpenPGP digital signature


Re: [PATCH iproute2-next v2 3/7] iplink: Use "dev" and "name" parameters interchangeable when possible

2018-02-20 Thread Serhey Popovych
Stephen Hemminger wrote:
> On Tue, 20 Feb 2018 23:37:25 +0200
> Serhey Popovych  wrote:
> 
>> Both of them accept network device name as argument, but have different
>> meaning:
>>
>>   dev  - is a device by it's name,
>>   name - name for specific device device.
>>
>> The only case where they treated separately is network device rename
>> case where need to specify both ifindex and new name. In rest of the
>> cases we can assume that dev == name.
> 
> I would rather keep name as only being a valid argument for set command.
> And reject use of:
> 
>ip li show name eth0
> 

But ip-link(8) man page is full of examples where "name" is used instead
of "dev". I suppose there are lot of configuration we can broke when
explicitly disabling "name" as an alternative to "dev".

Especially on the first lines in man page we have:

ip link add [ link DEVICE ] [ name ] NAME
...

and there is no reference to "dev".

What do you think?



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net-next] r8169: remove not needed PHY soft reset in rtl8168e_2_hw_phy_config

2018-02-20 Thread Heiner Kallweit
Am 21.02.2018 um 05:27 schrieb David Miller:
> From: Heiner Kallweit 
> Date: Tue, 20 Feb 2018 07:30:16 +0100
> 
>> rtl8169_init_phy() resets the PHY anyway after applying the chip-specific
>> PHY configuration. So we don't need to soft-reset the PHY as part of the
>> chip-specific configuration.
>>
>> Signed-off-by: Heiner Kallweit 
> 
> I'm not so comfortable with this.
> 
> There are so many r8169 chip variants out there.
> 
> And who knows, maybe one of them needs this second PHY reset or due to
> some way the driver is coded it is necessary.
> 
> Unless you can test this change on every r8169 chip type, I'm very
> reluctant to apply this patch.
> 
I understand the concern, the change however is in rtl8168e_2_hw_phy_config()
which is specific to chip version 34 (RTL8168evl).
On my system with this chip version the change didn't change system behavior.


> Sorry.
> 



Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
Hi.

On středa 21. února 2018 0:21:37 CET Eric Dumazet wrote:
> My latest patch (fixing BBR underestimation of cwnd)
> was meant for net tree, on a NIC where SG/TSO/GSO) are disabled.
> 
> ( ie when sk->sk_gso_max_segs is not set to 'infinite' )
> 
> It is packet scheduler independent really.
> 
> Tested here with pfifo_fast, TSO/GSO off.

Well, before the patch with BBR and sg off here it is ~450 Mbps for fq and 
~115 Mbps for pfifo_fast. So, comparing to what I see with the patch (850 and 
200 respectively), it is definitely an improvement.

Thanks.




Re: [PATCH iproute2-next] color: disable color when json output is requested

2018-02-20 Thread Vincent Bernat
 ❦ 20 février 2018 16:04 -0800, Stephen Hemminger  :

>> Instead of declaring -color and -json exclusive, ignore -color when
>> -json is provided. The rationale is to allow to put -color in an alias
>> for ip while still being able to use -json. -color is merely a
>> presentation suggestion and we can assume there is nothing to color in
>> the JSON output.
>> 
>> Signed-off-by: Vincent Bernat 
>
> Looks fine to me, this could even go into master.
> Need to update man page and make sure behavior is consistent
> across ip, tc, and bridge commands.

Currently, in master or net-next, only the "ip" command has a -color
option.
-- 
The human race is a race of cowards; and I am not only marching in that
procession but carrying a banner.
-- Mark Twain


nla_put_string() vs NLA_STRING

2018-02-20 Thread Kees Cook
Hi,

It seems that in at least one case[1], nla_put_string() is being used
on an NLA_STRING, which lacks a NULL terminator, which leads to
silliness when nla_put_string() uses strlen() to figure out the size:

/**
 * nla_put_string - Add a string netlink attribute to a socket buffer
 * @skb: socket buffer to add attribute to
 * @attrtype: attribute type
 * @str: NUL terminated string
*/
static inline int nla_put_string(struct sk_buff *skb, int attrtype,
const char *str)
{
return nla_put(skb, attrtype, strlen(str) + 1, str);
}


This is a problem at least here:

struct regulatory_request {
...
char alpha2[2];
...

static const struct nla_policy nl80211_policy[NUM_NL80211_ATTR] = {
...
[NL80211_ATTR_REG_ALPHA2] = { .type = NLA_STRING, .len = 2 },
...

AIUI, working with NLA_STRING needs nla_strlcpy() to "extract" them,
and that takes the nla_policy size normally to bounds-check the copy.


So, this specific problem needs fixing (in at least two places calling
nla_put_string(msg, NL80211_ATTR_REG_ALPHA2, ...)). While I suspect
it's only ever written an extra byte from the following variable in
the structure which is an enum nl80211_dfs_regions, I worry there
might be a lot more of these (though I'd hope unterminated strings are
uncommon for internal representation). And more generally, it seems
like only the NLA _input_ functions actually check nla_policy details.
It seems that the output functions should do the same too, yes?

-Kees

[1] https://github.com/copperhead/linux-hardened/issues/72

-- 
Kees Cook
Pixel Security


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-20 Thread Eric Dumazet
On Wed, 2018-02-21 at 04:41 +0100, Jason A. Donenfeld wrote:
> Before, if cb->start() failed, the module reference would never be put,
> because cb->cb_running is intentionally false at this point. Users are
> generally annoyed by this because they can no longer unload modules that
> leak references. Also, it may be possible to tediously wrap a reference
> counter back to zero, especially since module.c still uses atomic_inc
> instead of refcount_inc.
> 
> This patch expands the error path to simply call module_put if
> cb->start() fails.
> 
> Signed-off-by: Jason A. Donenfeld 
> ---
> This probably should be queued up for stable.

When was the bug added ? This would help a lot stable teams ...

Thanks.



[PATCH net] smsc75xx: fix smsc75xx_set_features()

2018-02-20 Thread Eric Dumazet
From: Eric Dumazet 

If an attempt is made to disable RX checksums, USB adapter is changed
but netdev->features is not, because smsc75xx_set_features() returns a
non zero value.

This throws errors from netdev_rx_csum_fault() :
: hw csum failure

Signed-off-by: Eric Dumazet 
Cc: Steve Glendinning 
---
 drivers/net/usb/smsc75xx.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index 
d0a113743195acae86931c51eea50b94ddadd487..7a6a1fe793090b8e28f5ef075f5ebc2ad385b5eb
 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -954,10 +954,11 @@ static int smsc75xx_set_features(struct net_device 
*netdev,
/* it's racing here! */
 
ret = smsc75xx_write_reg(dev, RFE_CTL, pdata->rfe_ctl);
-   if (ret < 0)
+   if (ret < 0) {
netdev_warn(dev->net, "Error writing RFE_CTL\n");
-
-   return ret;
+   return ret;
+   }
+   return 0;
 }
 
 static int smsc75xx_wait_ready(struct usbnet *dev, int in_pm)



Re: [PATCH iproute2-next v2 2/7] iplink: Correctly report error when network device isn't found

2018-02-20 Thread David Ahern
On 2/20/18 2:37 PM, Serhey Popovych wrote:
> Distinguish cases when "dev" parameter isn't given from cases where no
> network device corresponding to "dev" is found.
> 
> Do not check for index validity in xdp_parse(): caller should take care
> of this because has more information (e.g. when "dev" is given or not
> found) for this.
> 
> Signed-off-by: Serhey Popovych 
> ---
>  ip/iplink.c |   16 +---
>  ip/iplink_xdp.c |7 +--
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 

don't really see any benefit from this one.



Re: [PATCH iproute2-next v2 1/7] utils: Introduce and use nodev() helper routine

2018-02-20 Thread David Ahern
On 2/20/18 2:37 PM, Serhey Popovych wrote:

> diff --git a/ip/ipneigh.c b/ip/ipneigh.c
> index 0735424..9c9cd23 100644
> --- a/ip/ipneigh.c
> +++ b/ip/ipneigh.c
> @@ -178,11 +178,13 @@ static int ipneigh_modify(int cmd, int flags, int argc, 
> char **argv)
>  
>   ll_init_map();
>  
> - if (dev && (req.ndm.ndm_ifindex = ll_name_to_index(dev)) == 0) {
> - fprintf(stderr, "Cannot find device \"%s\"\n", dev);
> - return -1;
> + if (dev) {
> + req.ndm.ndm_ifindex = ll_name_to_index(dev);
> + if (!req.ndm.ndm_ifindex)
> + return nodev(dev);
>   }
>  
> +
>   if (rtnl_talk(, , NULL) < 0)
>   exit(2);
>  

Remove the extra newline



Re: [PATCH net-next] r8169: remove not needed PHY soft reset in rtl8168e_2_hw_phy_config

2018-02-20 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 20 Feb 2018 07:30:16 +0100

> rtl8169_init_phy() resets the PHY anyway after applying the chip-specific
> PHY configuration. So we don't need to soft-reset the PHY as part of the
> chip-specific configuration.
> 
> Signed-off-by: Heiner Kallweit 

I'm not so comfortable with this.

There are so many r8169 chip variants out there.

And who knows, maybe one of them needs this second PHY reset or due to
some way the driver is coded it is necessary.

Unless you can test this change on every r8169 chip type, I'm very
reluctant to apply this patch.

Sorry.


Re: [PATCH net-next] r8169: remove some WOL-related dead code

2018-02-20 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 20 Feb 2018 07:23:03 +0100

> Commit bde135a672bf "r8169: only enable PCI wakeups when WOL is active"
> removed the only user of flag RTL_FEATURE_WOL. So let's remove some
> now dead code.
> 
> Signed-off-by: Heiner Kallweit 

Applied, thank you.


Re: [PATCH V7 2/4] sctp: Add ip option support

2018-02-20 Thread Neil Horman
On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote:
> Add ip option support to allow LSM security modules to utilise CIPSO/IPv4
> and CALIPSO/IPv6 services.
> 
> Signed-off-by: Richard Haines 
> ---
> All SCTP lksctp-tools/src/func_tests run correctly in enforcing mode.
> All "./sctp-tests run" obtained from: https://github.com/sctp/sctp-tests
> pass.
> 
> V7 Changes:
> 1) Log when copy ip options fail for IPv4 and IPv6
> 2) Correct sctp_setsockopt_maxseg() function. Note that the lksctp-tools
> func_tests do not test with struct sctp_assoc_value. Just used simple test
> and okay.
> 3) Move calculation of overheads to sctp_packet_config().
> NOTE: Initially in sctp_packet_reset() I set packet->size and
> packet->overhead to zero (as it is a reset). This was okay for all the
> lksctp-tools function tests, however when running "sctp-tests" ndatshched
> tests it causes these to fail with an st_s.log entry of:
>   sid: 3, expected: 3
>   sid: 3, expected: 3
>   unexpected sid packet !!!
>   sid: 1, expected: 3
> 
> I then found sctp_packet_transmit() relies on setting
> "packet->size = packet->overhead;" to reset size to the current overhead
> after sending packets, hence the comment in sctp_packet_reset()
> 
>  include/net/sctp/sctp.h|  4 +++-
>  include/net/sctp/structs.h |  2 ++
>  net/sctp/chunk.c   | 10 +++---
>  net/sctp/ipv6.c| 45 ++---
>  net/sctp/output.c  | 34 +-
>  net/sctp/protocol.c| 38 ++
>  net/sctp/socket.c  | 11 ---
>  7 files changed, 117 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index f7ae6b0..25c5c87 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct 
> list_head *head)
>  static inline int sctp_frag_point(const struct sctp_association *asoc, int 
> pmtu)
>  {
>   struct sctp_sock *sp = sctp_sk(asoc->base.sk);
> + struct sctp_af *af = sp->pf->af;
>   int frag = pmtu;
>  
> - frag -= sp->pf->af->net_header_len;
> + frag -= af->ip_options_len(asoc->base.sk);
> + frag -= af->net_header_len;
>   frag -= sizeof(struct sctphdr) + sctp_datachk_len(>stream);
>  
>   if (asoc->user_frag)
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 03e92dd..ead5fce 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -491,6 +491,7 @@ struct sctp_af {
>   void(*ecn_capable)(struct sock *sk);
>   __u16   net_header_len;
>   int sockaddr_len;
> + int (*ip_options_len)(struct sock *sk);
>   sa_family_t sa_family;
>   struct list_head list;
>  };
> @@ -515,6 +516,7 @@ struct sctp_pf {
>   int (*addr_to_user)(struct sctp_sock *sk, union sctp_addr *addr);
>   void (*to_sk_saddr)(union sctp_addr *, struct sock *sk);
>   void (*to_sk_daddr)(union sctp_addr *, struct sock *sk);
> + void (*copy_ip_options)(struct sock *sk, struct sock *newsk);
>   struct sctp_af *af;
>  };
>  
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 991a530..d726d21 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -171,6 +171,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
> sctp_association *asoc,
>   struct list_head *pos, *temp;
>   struct sctp_chunk *chunk;
>   struct sctp_datamsg *msg;
> + struct sctp_sock *sp;
> + struct sctp_af *af;
>   int err;
>  
>   msg = sctp_datamsg_new(GFP_KERNEL);
> @@ -189,9 +191,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
> sctp_association *asoc,
>   /* This is the biggest possible DATA chunk that can fit into
>* the packet
>*/
> - max_data = asoc->pathmtu -
> -sctp_sk(asoc->base.sk)->pf->af->net_header_len -
> -sizeof(struct sctphdr) - sctp_datachk_len(>stream);
> + sp = sctp_sk(asoc->base.sk);
> + af = sp->pf->af;
> + max_data = asoc->pathmtu - af->net_header_len -
> +sizeof(struct sctphdr) - sctp_datachk_len(>stream) -
> +af->ip_options_len(asoc->base.sk);
>   max_data = SCTP_TRUNC4(max_data);
>  
>   /* If the the peer requested that we authenticate DATA chunks
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index e35d4f7..30a05a8 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -427,6 +427,41 @@ static void sctp_v6_copy_addrlist(struct list_head 
> *addrlist,
>   rcu_read_unlock();
>  }
>  
> +/* Copy over any ip options */
> +static void sctp_v6_copy_ip_options(struct sock *sk, struct sock *newsk)
> +{
> + struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
> + struct ipv6_txoptions *opt;
> +
> + newnp = inet6_sk(newsk);
> +
> + rcu_read_lock();
> + 

[PATCH] netlink: put module reference if dump start fails

2018-02-20 Thread Jason A. Donenfeld
Before, if cb->start() failed, the module reference would never be put,
because cb->cb_running is intentionally false at this point. Users are
generally annoyed by this because they can no longer unload modules that
leak references. Also, it may be possible to tediously wrap a reference
counter back to zero, especially since module.c still uses atomic_inc
instead of refcount_inc.

This patch expands the error path to simply call module_put if
cb->start() fails.

Signed-off-by: Jason A. Donenfeld 
---
This probably should be queued up for stable.

 net/netlink/af_netlink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2ad445c1d27c..07e8478068f0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2308,7 +2308,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff 
*skb,
if (cb->start) {
ret = cb->start(cb);
if (ret)
-   goto error_unlock;
+   goto error_put;
}
 
nlk->cb_running = true;
@@ -2328,6 +2328,8 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff 
*skb,
 */
return -EINTR;
 
+error_put:
+   module_put(control->module);
 error_unlock:
sock_put(sk);
mutex_unlock(nlk->cb_mutex);
-- 
2.16.1



Re: [PATCH] Carrier detect ok, don't turn off negotiation

2018-02-20 Thread Denis Du
Hi, David:

How  is your thinking about this patch?



>From b5902a4dfc709b62b704997ab64f31c9ef69a6db Mon Sep 17 00:00:00 2001 
From: Denis Du  
Date: Mon, 15 Jan 2018 17:26:06 -0500 
Subject: [PATCH] netdev: carrier detect ok, don't turn off negotiation 

Sometimes when physical lines have a just good noise to make the protocol 
handshaking fail, but the carrier detect still good. Then after remove of 
the noise, nobody will trigger this protocol to be start again to cause 
the link to never come back. The fix is when the carrier is still on, not 
terminate the protocol handshaking. 

Signed-off-by: Denis Du  
--- 
drivers/net/wan/hdlc_ppp.c | 5 - 
1 file changed, 4 insertions(+), 1 deletion(-) 

diff --git a/drivers/net/wan/hdlc_ppp.c b/drivers/net/wan/hdlc_ppp.c 
index afeca6b..ab8b3cb 100644 
--- a/drivers/net/wan/hdlc_ppp.c 
+++ b/drivers/net/wan/hdlc_ppp.c 
@@ -574,7 +574,10 @@ static void ppp_timer(struct timer_list *t) 
ppp_cp_event(proto->dev, proto->pid, TO_GOOD, 0, 0, 
 0, NULL); 
proto->restart_counter--; 
-} else 
+} else if (netif_carrier_ok(proto->dev)) 
+ppp_cp_event(proto->dev, proto->pid, TO_GOOD, 0, 0, 
+ 0, NULL); 
+else 
ppp_cp_event(proto->dev, proto->pid, TO_BAD, 0, 0, 
 0, NULL); 
break; 
-- 
2.1.4 





On ‎Tuesday‎, ‎February‎ ‎06‎, ‎2018‎ ‎12‎:‎06‎:‎43‎ ‎PM‎ ‎EST, Denis Du 
 wrote: 







Ok, I submit it  again.


In drivers/net/wan/hdlc_ppp.c, some noise on physical line can cause the 
carrier detect still ok, but the protocol will fail. So if carrier detect ok, 
don't turn off protocol negotiation

This patch is against the kernel version Linux 4.15-rc8





On Tuesday, February 6, 2018, 10:29:53 AM EST, David Miller 
 wrote: 





From: Denis Du 

Date: Tue, 6 Feb 2018 15:15:28 + (UTC)

> How  do you think my patch?
> 
> As you see, Krzysztof  think my patch is ok to be accepted.
> But if you have a better idea to fix it,I am glad to see it. Anyway, this 
> issue have to be fixed.


Please resubmit it and I'll think about it again, thank you.


Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-20 Thread David Miller
From: Roman Kapl 
Date: Mon, 19 Feb 2018 21:32:51 +0100

> So far, if the filter was too large to fit in the allocated skb, the
> kernel did not return any error and stopped dumping. Modify the dumper
> so that it returns -EMSGSIZE when a filter fails to dump and it is the
> first filter in the skb. If we are not first, we will get a next chance
> with more room.
> 
> I understand this is pretty near to being an API change, but the
> original design (silent truncation) can be considered a bug.
> 
> Note: The error case can happen pretty easily if you create a filter
> with 32 actions and have 4kb pages. Also recent versions of iproute try
> to be clever with their buffer allocation size, which in turn leads to
> 
> Signed-off-by: Roman Kapl 

Applied and queued up for -stable, thanks Roman.


Re: [PATCH RFC 3/3] netfilter: nf_tables: add BPF-based jit infrastructure

2018-02-20 Thread Alexei Starovoitov
On Tue, Feb 20, 2018 at 11:53:55AM +0100, Pablo Neira Ayuso wrote:
> Hi David,
> 
> On Mon, Feb 19, 2018 at 01:53:34PM -0500, David Miller wrote:
> > I'm very suprised that this is generating classical BPF filters.
> >
> > We have native eBPF and that is what anything generating new code
> > should be using, rather than the 20+ year old CBPF.
> 
> I'm not the only one that likes 90s interfaces after all ;-)
> 
> I'll explore how to generate eBPF code in the next patchset version.

from the user space please...
it was already explained few times in this thread and in other
threads (kprobe, seccomp, etc related) over the last years why
it's not ok to generate eBPF from the kernel.



Re: [Intel-wired-lan] [PATCH net-queue] e1000e: Fix check_for_link return value with autoneg off.

2018-02-20 Thread Neftin, Sasha

On 2/19/2018 22:12, Benjamin Poirier wrote:

When autoneg is off, the .check_for_link callback functions clear the
get_link_status flag and systematically return a "pseudo-error". This means
that the link is not detected as up until the next execution of the
e1000_watchdog_task() 2 seconds later.

Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
Signed-off-by: Benjamin Poirier 
---
  drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 +-
  drivers/net/ethernet/intel/e1000e/mac.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 31277d3bb7dc..ff308b05d68c 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1602,7 +1602,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
  
  	/* Auto-Neg is enabled.  Auto Speed Detection takes care

 * of MAC speed/duplex configuration.  So we only need to
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
b/drivers/net/ethernet/intel/e1000e/mac.c
index f457c5703d0c..db735644b312 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -450,7 +450,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
  
  	/* Auto-Neg is enabled.  Auto Speed Detection takes care

 * of MAC speed/duplex configuration.  So we only need to


Acked-by: Sasha Neftin 


[PATCH iproute2-next 2/4] ip: Display ip rule protocol used

2018-02-20 Thread Donald Sharp
Newer kernels are now accepting a protocol from the installing
program for who installed the rule.  This change allows us
to see this change if it is being specified by the installing
program.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index c40d76f1..b3e7d92c 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -341,6 +341,10 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
rtnl_rtntype_n2a(frh->action,
 b1, sizeof(b1)));
 
+   if (frh->proto != RTPROT_UNSPEC)
+   fprintf(fp, " proto %s ",
+   rtnl_rtprot_n2a(frh->proto, b1, sizeof(b1)));
+
fprintf(fp, "\n");
fflush(fp);
return 0;
-- 
2.14.3



[PATCH iproute2-next 3/4] ip: Allow rules to accept a specified protocol

2018-02-20 Thread Donald Sharp
Allow the specification of a protocol when the user
adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index b3e7d92c..fd242fee 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -675,6 +675,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
if (get_rt_realms_or_raw(, *argv))
invarg("invalid realms\n", *argv);
addattr32(, sizeof(req), FRA_FLOW, realm);
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 proto;
+   NEXT_ARG();
+   if (rtnl_rtprot_a2n(, *argv))
+   invarg("\"protocol\" value is invalid\n", 
*argv);
+   req.frh.proto = proto;
} else if (matches(*argv, "table") == 0 ||
   strcmp(*argv, "lookup") == 0) {
NEXT_ARG();
-- 
2.14.3



[PATCH iproute2-next 0/4] Allow 'ip rule' command to use protocol

2018-02-20 Thread Donald Sharp
Fix iprule.c to use the actual `struct fib_rule_hdr` and to
allow the end user to see and use the protocol keyword
for rule manipulations.

Donald Sharp (4):
  ip: Use the `struct fib_rule_hdr` for rules
  ip: Display ip rule protocol used
  ip: Allow rules to accept a specified protocol
  ip: Add ability to flush a rule based upon protocol

 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 167 -
 man/man8/ip-rule.8 |  18 -
 3 files changed, 117 insertions(+), 70 deletions(-)

-- 
2.14.3



[PATCH iproute2-next 4/4] ip: Add ability to flush a rule based upon protocol

2018-02-20 Thread Donald Sharp
Add code to allow the `ip rule flush protocol XXX`
command to be accepted and properly handled.

Additionally modify the documentation to be correct
with these changes.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c| 25 ++---
 man/man8/ip-rule.8 | 18 +-
 2 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index fd242fee..b69413dd 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -47,6 +47,7 @@ static void usage(void)
"[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ 
l3mdev ]\n"
"[ uidrange NUMBER-NUMBER ]\n"
"ACTION := [ table TABLE_ID ]\n"
+   "  [ protocol RPROTO ]\n"
"  [ nat ADDRESS ]\n"
"  [ realms [SRCREALM/]DSTREALM ]\n"
"  [ goto NUMBER ]\n"
@@ -71,6 +72,8 @@ static struct
struct fib_rule_uid_range range;
inet_prefix src;
inet_prefix dst;
+   int protocol;
+   int protocolmask;
 } filter;
 
 static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
@@ -398,6 +401,9 @@ static int flush_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
 
parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
+   if ((filter.protocol^frh->proto))
+   return 0;
+
if (tb[FRA_PRIORITY]) {
n->nlmsg_type = RTM_DELRULE;
n->nlmsg_flags = NLM_F_REQUEST;
@@ -422,12 +428,6 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
if (af == AF_UNSPEC)
af = AF_INET;
 
-   if (action != IPRULE_LIST && argc > 0) {
-   fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n",
-   action == IPRULE_SAVE ? "save" : "flush");
-   return -1;
-   }
-
switch (action) {
case IPRULE_SAVE:
if (save_rule_prep())
@@ -515,7 +515,18 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
NEXT_ARG();
if (get_prefix(, *argv, af))
invarg("from value is invalid\n", *argv);
-   } else {
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 prot;
+   NEXT_ARG();
+   filter.protocolmask = -1;
+   if (rtnl_rtprot_a2n(, *argv)) {
+   if (strcmp(*argv, "all") != 0)
+   invarg("invalid \"protocol\"\n", *argv);
+   prot = 0;
+   filter.protocolmask = 0;
+   }
+   filter.protocol = prot;
+   } else{
if (matches(*argv, "dst") == 0 ||
matches(*argv, "to") == 0) {
NEXT_ARG();
diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index a5c47981..98b2573d 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -50,6 +50,8 @@ ip-rule \- routing policy database management
 .IR ACTION " := [ "
 .B  table
 .IR TABLE_ID " ] [ "
+.B  protocol
+.IR RPROTO " ] [ "
 .B  nat
 .IR ADDRESS " ] [ "
 .B realms
@@ -240,6 +242,10 @@ The options preference and order are synonyms with 
priority.
 the routing table identifier to lookup if the rule selector matches.
 It is also possible to use lookup instead of table.
 
+.TP
+.BI protocol " RPROTO"
+the protocol who installed the rule in question.
+
 .TP
 .BI suppress_prefixlength " NUMBER"
 reject routing decisions that have a prefix length of NUMBER or less.
@@ -275,7 +281,11 @@ updates, it flushes the routing cache with
 .RE
 .TP
 .B ip rule flush - also dumps all the deleted rules.
-This command has no arguments.
+.RS
+.TP
+.BI protocol " RPROTO"
+Select the originating protocol.
+.RE
 .TP
 .B ip rule show - list rules
 This command has no arguments.
@@ -283,6 +293,12 @@ The options list or lst are synonyms with show.
 
 .TP
 .B ip rule save
+.RS
+.TP
+.BI protocl " RPROTO"
+Select the originating protocol.
+.RE
+.TP
 save rules table information to stdout
 .RS
 This command behaves like
-- 
2.14.3



[PATCH iproute2-next 1/4] ip: Use the `struct fib_rule_hdr` for rules

2018-02-20 Thread Donald Sharp
The iprule.c code was using `struct rtmsg` as the data
type to pass into the kernel for the netlink message.
While 'struct rtmsg' and `struct fib_rule_hdr` are
the same size and mostly the same, we should use
the correct data structure.  This commit translates
the data structures to have iprule.c use the correct
one.

Additionally copy over the modified fib_rules.h file

Signed-off-by: Donald Sharp 
---
 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 132 ++---
 2 files changed, 72 insertions(+), 62 deletions(-)

diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b642bf9..92553917 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -23,8 +23,8 @@ struct fib_rule_hdr {
__u8tos;
 
__u8table;
+   __u8proto;
__u8res1;   /* reserved */
-   __u8res2;   /* reserved */
__u8action;
 
__u32   flags;
diff --git a/ip/iprule.c b/ip/iprule.c
index a3abf2f6..c40d76f1 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -75,23 +75,23 @@ static struct
 
 static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 {
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
__u32 table;
 
-   if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family)
+   if (preferred_family != AF_UNSPEC && frh->family != preferred_family)
return false;
 
if (filter.prefmask &&
filter.pref ^ (tb[FRA_PRIORITY] ? rta_getattr_u32(tb[FRA_PRIORITY]) 
: 0))
return false;
-   if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT))
+   if (filter.not && !(frh->flags & FIB_RULE_INVERT))
return false;
 
if (filter.src.family) {
inet_prefix *f_src = 
 
-   if (f_src->family != r->rtm_family ||
-   f_src->bitlen > r->rtm_src_len)
+   if (f_src->family != frh->family ||
+   f_src->bitlen > frh->src_len)
return false;
 
if (inet_addr_match_rta(f_src, tb[FRA_SRC]))
@@ -101,15 +101,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct 
rtattr **tb, int host_len)
if (filter.dst.family) {
inet_prefix *f_dst = 
 
-   if (f_dst->family != r->rtm_family ||
-   f_dst->bitlen > r->rtm_dst_len)
+   if (f_dst->family != frh->family ||
+   f_dst->bitlen > frh->dst_len)
return false;
 
if (inet_addr_match_rta(f_dst, tb[FRA_DST]))
return false;
}
 
-   if (filter.tosmask && filter.tos ^ r->rtm_tos)
+   if (filter.tosmask && filter.tos ^ frh->tos)
return false;
 
if (filter.fwmark) {
@@ -159,7 +159,13 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return false;
}
 
-   table = rtm_get_table(r, tb);
+
+   /* struct fib_rule_hdr and struct rtmsg
+* were intentionally the same.  Since
+* the table is the rtm_table, just call
+* it.
+*/
+   table = rtm_get_table((struct rtmsg *)frh, tb);
if (filter.tb > 0 && filter.tb ^ table)
return false;
 
@@ -169,7 +175,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE *)arg;
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
__u32 table;
@@ -180,13 +186,13 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
return 0;
 
-   len -= NLMSG_LENGTH(sizeof(*r));
+   len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
return -1;
 
-   parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len);
+   parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
-   host_len = af_bit_len(r->rtm_family);
+   host_len = af_bit_len(frh->family);
 
if (!filter_nlmsg(n, tb, host_len))
return 0;
@@ -200,41 +206,41 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
else
fprintf(fp, "0:\t");
 
-   if (r->rtm_flags & FIB_RULE_INVERT)
+   if (frh->flags & FIB_RULE_INVERT)
fprintf(fp, "not ");
 
if (tb[FRA_SRC]) {
-   if (r->rtm_src_len != host_len) {
+   if (frh->src_len != host_len) {
fprintf(fp, "from %s/%u ",
-   

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Alexei Starovoitov
On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> 
>   Don't get me wrong, no software is safe from security issues, but if you
>   don't abstract your resources in the right way, you have more chance to
>   have experimence more problems.

interesting point.
The key part of iptables and nft design is heavy use of indirect calls.
The execution of single iptable rule is ~3 indirect calls.
Quite a lot worse in nft where every expression is an indirect call.
If my math is correct even simplest nft rule will get to ~10.
It was all fine until spectre2 was discovered and retpoline
now adds 20-30 cycles for each indirect call.
To put numbers in perspective the simple
for(...)
  indirect_call();
loop without retpoline does ~500 M iterations per second on 2+Ghz xeon.
clang -mretpoline
gcc -mindirect-branch=thunk
gcc -mindirect-branch=thunk-inline
produce slightly different code with performance of 80-90 M
iterations per second for the above loop.

Looks like iptables/nft did not abstract the resources in
the right way and now experiences more problems.

CPUs will eventually be fixed and IBRS_ALL will become reality.
Until then the kernel has to deal with the performance issues.

bpf and the networking stack will suffer from retpoline as well
and we need to work asap on devirtualization and other ideas.
For xdp a single indirect call we do per packet (to call into bpf prog)
is noticeable and we're experimenting with static_key-like approach to
call bpf program with direct call.
bpf_tail_calls will suffer too and cannot be accelerated as-is.
To solve that we're working on dynamic linking via verifier improvements.
C based bpf programs will use normal indirect calls, but verifier will
replace indirect with direct at pointer update time.
It's not going to be easy, but bpf and stack is fixable,
whereas iptables/nft are going to suffer until fixed CPUs find
their way into servers years from now.



[PATCH net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins

2018-02-20 Thread Andrew Lunn
MV88E6352 and later switches support GPIO control through the "Scratch
& Misc" global2 register. Two of the pins controlled this way on the
mv88e6390 family are the external MDIO pins. They can either by used
as part of the MII interface for port 0, GPIOs, or MDIO. Add a
function to configure them for MDIO, if possible, and call it when
registering the external MDIO bus.

Suggested-by: Russell King 
Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c|  9 +
 drivers/net/dsa/mv88e6xxx/global2.h | 14 
 drivers/net/dsa/mv88e6xxx/global2_scratch.c | 51 +
 3 files changed, 74 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 39c7ad7e490f..f9ed612a6e39 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2165,6 +2165,15 @@ static int mv88e6xxx_mdio_register(struct mv88e6xxx_chip 
*chip,
struct mii_bus *bus;
int err;
 
+   if (external) {
+   mutex_lock(>reg_lock);
+   err = mv88e6xxx_g2_scratch_gpio_set_ext_smi(chip, true);
+   mutex_unlock(>reg_lock);
+
+   if (err)
+   return err;
+   }
+
bus = devm_mdiobus_alloc_size(chip->dev, sizeof(*mdio_bus));
if (!bus)
return -ENOMEM;
diff --git a/drivers/net/dsa/mv88e6xxx/global2.h 
b/drivers/net/dsa/mv88e6xxx/global2.h
index 25f92b3d7157..c1c7df133d3f 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.h
+++ b/drivers/net/dsa/mv88e6xxx/global2.h
@@ -266,6 +266,11 @@
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL50x6D
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL60x6E
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL70x6F
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA0  0x70
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1  0x71
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU   BIT(2)
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2  0x72
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK 0x3
 
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_GPIO0
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_TRIG1
@@ -325,6 +330,9 @@ extern const struct mv88e6xxx_avb_ops mv88e6390_avb_ops;
 
 extern const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops;
 
+int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
+ bool external);
+
 #else /* !CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 static inline int mv88e6xxx_g2_require(struct mv88e6xxx_chip *chip)
@@ -465,6 +473,12 @@ static const struct mv88e6xxx_avb_ops mv88e6390_avb_ops = 
{};
 
 static const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {};
 
+int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
+ bool external)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 #endif /* _MV88E6XXX_GLOBAL2_H */
diff --git a/drivers/net/dsa/mv88e6xxx/global2_scratch.c 
b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
index 0ff12bff9f0e..632fd4d91c44 100644
--- a/drivers/net/dsa/mv88e6xxx/global2_scratch.c
+++ b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
@@ -238,3 +238,54 @@ const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {
.get_pctl = mv88e6352_g2_scratch_gpio_get_pctl,
.set_pctl = mv88e6352_g2_scratch_gpio_set_pctl,
 };
+
+/**
+ * mv88e6xxx_g2_gpio_set_ext_smi - set gpio muxing for external smi
+ * @chip: chip private data
+ * @external: set mux for external smi, or free for gpio usage
+ *
+ * Some mv88e6xxx models have GPIO pins that may be configured as
+ * an external SMI interface, or they may be made free for other
+ * GPIO uses.
+ */
+int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
+ bool external)
+{
+   int misc_cfg = MV88E6352_G2_SCRATCH_MISC_CFG;
+   int config_data1 = MV88E6352_G2_SCRATCH_CONFIG_DATA1;
+   int config_data2 = MV88E6352_G2_SCRATCH_CONFIG_DATA2;
+   bool no_cpu;
+   u8 p0_mode;
+   int err;
+   u8 val;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data2, );
+   if (err)
+   return err;
+
+   p0_mode = val & MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK;
+
+   if (p0_mode == 0x01 || p0_mode == 0x02)
+   return -EBUSY;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data1, );
+   if (err)
+   return err;
+
+   no_cpu = !!(val & MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU);
+
+   err = mv88e6xxx_g2_scratch_read(chip, misc_cfg, );
+   if (err)
+   return err;
+
+   /* NO_CPU being 0 inverts the meaning of the bit */
+   if (!no_cpu)
+   external = !external;
+
+   if (external)
+   val |= MV88E6352_G2_SCRATCH_MISC_CFG_NORMALSMI;
+   else
+   val &= ~MV88E6352_G2_SCRATCH_MISC_CFG_NORMALSMI;
+
+   

[net 06/11] net/mlx5e: Return error if prio is specified when offloading eswitch vlan push

2018-02-20 Thread Saeed Mahameed
From: Or Gerlitz 

This isn't supported when we emulate eswitch vlan push action which
is the current state of things.

Fixes: 8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads')
Signed-off-by: Or Gerlitz 
Reviewed-by: Mark Bloch 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index fd98b0dc610f..fa86a1466718 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2529,7 +2529,8 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
attr->action |= 
MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
} else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
-   if (tcf_vlan_push_proto(a) != 
htons(ETH_P_8021Q))
+   if (tcf_vlan_push_proto(a) != 
htons(ETH_P_8021Q) ||
+   tcf_vlan_push_prio(a))
return -EOPNOTSUPP;
 
attr->action |= 
MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
-- 
2.14.3



[net 01/11] net/mlx5e: Fix TCP checksum in LRO buffers

2018-02-20 Thread Saeed Mahameed
From: Gal Pressman 

When receiving an LRO packet, the checksum field is set by the hardware
to the checksum of the first coalesced packet. Obviously, this checksum
is not valid for the merged LRO packet and should be fixed.  We can use
the CQE checksum which covers the checksum of the entire merged packet
TCP payload to help us calculate the checksum incrementally.

Tested by sending IPv4/6 traffic with LRO enabled, RX checksum disabled
and watching nstat checksum error counters (in addition to the obvious
bandwidth drop caused by checksum errors).

This bug is usually "hidden" since LRO packets would go through the
CHECKSUM_UNNECESSARY flow which does not validate the packet checksum.

It's important to note that previous to this patch, LRO packets provided
with CHECKSUM_UNNECESSARY are indeed packets with a correct validated
checksum (even though the checksum inside the TCP header is incorrect),
since the hardware LRO aggregation is terminated upon receiving a packet
with bad checksum.

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++---
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0d4bb0688faa..e5c3ab46a24a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -546,20 +547,33 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
return true;
 }
 
+static void mlx5e_lro_update_tcp_hdr(struct mlx5_cqe64 *cqe, struct tcphdr 
*tcp)
+{
+   u8 l4_hdr_type = get_cqe_l4_hdr_type(cqe);
+   u8 tcp_ack = (l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA) ||
+(l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA);
+
+   tcp->check  = 0;
+   tcp->psh= get_cqe_lro_tcppsh(cqe);
+
+   if (tcp_ack) {
+   tcp->ack= 1;
+   tcp->ack_seq= cqe->lro_ack_seq_num;
+   tcp->window = cqe->lro_tcp_win;
+   }
+}
+
 static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 u32 cqe_bcnt)
 {
struct ethhdr   *eth = (struct ethhdr *)(skb->data);
struct tcphdr   *tcp;
int network_depth = 0;
+   __wsum check;
__be16 proto;
u16 tot_len;
void *ip_p;
 
-   u8 l4_hdr_type = get_cqe_l4_hdr_type(cqe);
-   u8 tcp_ack = (l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA) ||
-   (l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA);
-
proto = __vlan_get_protocol(skb, eth->h_proto, _depth);
 
tot_len = cqe_bcnt - network_depth;
@@ -576,23 +590,30 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, 
struct mlx5_cqe64 *cqe,
ipv4->check = 0;
ipv4->check = ip_fast_csum((unsigned char *)ipv4,
   ipv4->ihl);
+
+   mlx5e_lro_update_tcp_hdr(cqe, tcp);
+   check = csum_partial(tcp, tcp->doff * 4,
+csum_unfold((__force 
__sum16)cqe->check_sum));
+   /* Almost done, don't forget the pseudo header */
+   tcp->check = csum_tcpudp_magic(ipv4->saddr, ipv4->daddr,
+  tot_len - sizeof(struct iphdr),
+  IPPROTO_TCP, check);
} else {
+   u16 payload_len = tot_len - sizeof(struct ipv6hdr);
struct ipv6hdr *ipv6 = ip_p;
 
tcp = ip_p + sizeof(struct ipv6hdr);
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
 
ipv6->hop_limit = cqe->lro_min_ttl;
-   ipv6->payload_len   = cpu_to_be16(tot_len -
- sizeof(struct ipv6hdr));
-   }
-
-   tcp->psh = get_cqe_lro_tcppsh(cqe);
-
-   if (tcp_ack) {
-   tcp->ack= 1;
-   tcp->ack_seq= cqe->lro_ack_seq_num;
-   tcp->window = cqe->lro_tcp_win;
+   ipv6->payload_len   = cpu_to_be16(payload_len);
+
+   mlx5e_lro_update_tcp_hdr(cqe, tcp);
+   check = csum_partial(tcp, tcp->doff * 4,
+csum_unfold((__force 
__sum16)cqe->check_sum));
+   /* Almost done, don't forget the pseudo header */
+   tcp->check = csum_ipv6_magic(>saddr, >daddr, 
payload_len,
+IPPROTO_TCP, check);

[net 04/11] net/mlx5e: Eliminate build warnings on no previous prototype

2018-02-20 Thread Saeed Mahameed
From: Or Gerlitz 

Fix these gcc warnings on drivers/net/ethernet/mellanox/mlx5:

[..]/core/lib/clock.c:454:6: warning: no previous prototype for 
'mlx5_init_clock' [-Wmissing-prototypes]
[..]/core/lib/clock.c:510:6: warning: no previous prototype for 
'mlx5_cleanup_clock' [-Wmissing-prototypes]
[..]/core/en_main.c:3141:5: warning: no previous prototype for 'mlx5e_setup_tc' 
[-Wmissing-prototypes]

Signed-off-by: Or Gerlitz 
Reviewed-by: Matan Barak 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c   | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47bab842c5ee..a64b9226d281 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2994,8 +2994,8 @@ static int mlx5e_setup_tc_block(struct net_device *dev,
 }
 #endif
 
-int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
-  void *type_data)
+static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
+ void *type_data)
 {
switch (type) {
 #ifdef CONFIG_MLX5_ESWITCH
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index e159243e0fcf..857035583ccd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include "en.h"
+#include "clock.h"
 
 enum {
MLX5_CYCLES_SHIFT   = 23
-- 
2.14.3



[net 11/11] net/mlx5: Fix error handling when adding flow rules

2018-02-20 Thread Saeed Mahameed
From: Vlad Buslov 

If building match list or adding existing fg fails when
node is locked, function returned without unlocking it.
This happened if node version changed or adding existing fg
returned with EAGAIN after jumping to search_again_locked label.

Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in 
parallel")
Signed-off-by: Vlad Buslov 
Reviewed-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 6caa4a7ad869..31fc2cfac3b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1759,8 +1759,11 @@ _mlx5_add_flow_rules(struct mlx5_flow_table *ft,
 
/* Collect all fgs which has a matching match_criteria */
err = build_match_list(_head, ft, spec);
-   if (err)
+   if (err) {
+   if (take_write)
+   up_write_ref_node(>node);
return ERR_PTR(err);
+   }
 
if (!take_write)
up_read_ref_node(>node);
@@ -1769,8 +1772,11 @@ _mlx5_add_flow_rules(struct mlx5_flow_table *ft,
  dest_num, version);
free_match_list(_head);
if (!IS_ERR(rule) ||
-   (PTR_ERR(rule) != -ENOENT && PTR_ERR(rule) != -EAGAIN))
+   (PTR_ERR(rule) != -ENOENT && PTR_ERR(rule) != -EAGAIN)) {
+   if (take_write)
+   up_write_ref_node(>node);
return rule;
+   }
 
if (!take_write) {
nested_down_write_ref_node(>node, FS_LOCK_GRANDPARENT);
-- 
2.14.3



[net 08/11] net/mlx5: Use 128B cacheline size for 128B or larger cachelines

2018-02-20 Thread Saeed Mahameed
From: Daniel Jurgens 

The adapter uses the cache_line_128byte setting to set the bounds for
end padding. On systems where the cacheline size is greater than 128B
use 128B instead of the default of 64B. This results in fewer partial
cacheline writes. There's a 50% chance it will pad to the end of a 256B
cache line vs only 25% when using 64B.

Fixes: f32f5bd2eb7e ("net/mlx5: Configure cache line size for start and end 
padding")
Signed-off-by: Daniel Jurgens 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 2ef641c91c26..ae391e4b7070 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -551,7 +551,7 @@ static int handle_hca_cap(struct mlx5_core_dev *dev)
MLX5_SET(cmd_hca_cap,
 set_hca_cap,
 cache_line_128byte,
-cache_line_size() == 128 ? 1 : 0);
+cache_line_size() >= 128 ? 1 : 0);
 
if (MLX5_CAP_GEN_MAX(dev, dct))
MLX5_SET(cmd_hca_cap, set_hca_cap, dct, 1);
-- 
2.14.3



[net 07/11] net/mlx5e: Specify numa node when allocating drop rq

2018-02-20 Thread Saeed Mahameed
From: Gal Pressman 

When allocating a drop rq, no numa node is explicitly set which means
allocations are done on node zero. This is not necessarily the nearest
numa node to the HCA, and even worse, might even be a memoryless numa
node.

Choose the numa_node given to us by the pci device in order to properly
allocate the coherent dma memory instead of assuming zero is valid.

Fixes: 556dd1b9c313 ("net/mlx5e: Set drop RQ's necessary parameters only")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a64b9226d281..da94c8cba5ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1768,13 +1768,16 @@ static void mlx5e_build_rq_param(struct mlx5e_priv 
*priv,
param->wq.linear = 1;
 }
 
-static void mlx5e_build_drop_rq_param(struct mlx5e_rq_param *param)
+static void mlx5e_build_drop_rq_param(struct mlx5_core_dev *mdev,
+ struct mlx5e_rq_param *param)
 {
void *rqc = param->rqc;
void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
 
MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST);
MLX5_SET(wq, wq, log_wq_stride,ilog2(sizeof(struct mlx5e_rx_wqe)));
+
+   param->wq.buf_numa_node = dev_to_node(>pdev->dev);
 }
 
 static void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
@@ -2634,6 +2637,9 @@ static int mlx5e_alloc_drop_cq(struct mlx5_core_dev *mdev,
   struct mlx5e_cq *cq,
   struct mlx5e_cq_param *param)
 {
+   param->wq.buf_numa_node = dev_to_node(>pdev->dev);
+   param->wq.db_numa_node  = dev_to_node(>pdev->dev);
+
return mlx5e_alloc_cq_common(mdev, param, cq);
 }
 
@@ -2645,7 +2651,7 @@ static int mlx5e_open_drop_rq(struct mlx5_core_dev *mdev,
struct mlx5e_cq *cq = _rq->cq;
int err;
 
-   mlx5e_build_drop_rq_param(_param);
+   mlx5e_build_drop_rq_param(mdev, _param);
 
err = mlx5e_alloc_drop_cq(mdev, cq, _param);
if (err)
-- 
2.14.3



[net 10/11] net/mlx5: E-Switch, Fix drop counters use before creation

2018-02-20 Thread Saeed Mahameed
From: Eugenia Emantayev 

First use of drop counters happens in esw_apply_vport_conf function,
while they are allocated later in the flow. Fix that by moving
esw_vport_create_drop_counters function to be called before the first use.

Fixes: b8a0dbe3a90b ("net/mlx5e: E-switch, Add steering drop counters")
Signed-off-by: Eugenia Emantayev 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 5ecf2cddc16d..c2b1d7d351fc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1529,6 +1529,10 @@ static void esw_enable_vport(struct mlx5_eswitch *esw, 
int vport_num,
 
esw_debug(esw->dev, "Enabling VPORT(%d)\n", vport_num);
 
+   /* Create steering drop counters for ingress and egress ACLs */
+   if (vport_num && esw->mode == SRIOV_LEGACY)
+   esw_vport_create_drop_counters(vport);
+
/* Restore old vport configuration */
esw_apply_vport_conf(esw, vport);
 
@@ -1545,10 +1549,6 @@ static void esw_enable_vport(struct mlx5_eswitch *esw, 
int vport_num,
if (!vport_num)
vport->info.trusted = true;
 
-   /* create steering drop counters for ingress and egress ACLs */
-   if (vport_num && esw->mode == SRIOV_LEGACY)
-   esw_vport_create_drop_counters(vport);
-
esw_vport_change_handle_locked(vport);
 
esw->enabled_vports++;
-- 
2.14.3



[net 02/11] net/mlx5e: Fix loopback self test when GRO is off

2018-02-20 Thread Saeed Mahameed
From: Inbar Karmy 

When GRO is off, the transport header pointer in sk_buff is
initialized to network's header.

To find the udp header, instead of using udp_hdr() which assumes
skb_network_header was set, manually calculate the udp header offset.

Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
Signed-off-by: Inbar Karmy 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
index 5a4608281f38..707976482c09 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
@@ -216,7 +216,8 @@ mlx5e_test_loopback_validate(struct sk_buff *skb,
if (iph->protocol != IPPROTO_UDP)
goto out;
 
-   udph = udp_hdr(skb);
+   /* Don't assume skb_transport_header() was set */
+   udph = (struct udphdr *)((u8 *)iph + 4 * iph->ihl);
if (udph->dest != htons(9))
goto out;
 
-- 
2.14.3



[net 05/11] net/mlx5: Address static checker warnings on non-constant initializers

2018-02-20 Thread Saeed Mahameed
From: Or Gerlitz 

Address these sparse warnings on drivers/net/ethernet/mellanox/mlx5

[..]/core/diag/fs_tracepoint.c:99:53: warning: non-constant initializer for 
static object
[..]/core/diag/fs_tracepoint.c:102:53: warning: non-constant initializer for 
static object

etc

Signed-off-by: Or Gerlitz 
Reviewed-by: Matan Barak 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
index 0be4575b58a2..fd509160c8f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
@@ -96,10 +96,10 @@ static void print_lyr_2_4_hdrs(struct trace_seq *p,
  "%pI4");
} else if (ethertype.v == ETH_P_IPV6) {
static const struct in6_addr full_ones = {
-   .in6_u.u6_addr32 = {htonl(0x),
-   htonl(0x),
-   htonl(0x),
-   htonl(0x)},
+   .in6_u.u6_addr32 = 
{__constant_htonl(0x),
+   
__constant_htonl(0x),
+   
__constant_htonl(0x),
+   
__constant_htonl(0x)},
};
DECLARE_MASK_VAL(struct in6_addr, src_ipv6);
DECLARE_MASK_VAL(struct in6_addr, dst_ipv6);
-- 
2.14.3



[net 09/11] net/mlx5: Add header re-write to the checks for conflicting actions

2018-02-20 Thread Saeed Mahameed
From: Or Gerlitz 

We can't allow only some of the rules sharing an FTE to ask for
header re-write, add it to the conflicting action checks.

Fixes: 0d235c3fabb7 ('net/mlx5: Add hash table to search FTEs in a flow-group')
Signed-off-by: Or Gerlitz 
Reviewed-by: Matan Barak 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index c025c98700e4..6caa4a7ad869 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1429,7 +1429,8 @@ static bool check_conflicting_actions(u32 action1, u32 
action2)
 
if (xored_actions & (MLX5_FLOW_CONTEXT_ACTION_DROP  |
 MLX5_FLOW_CONTEXT_ACTION_ENCAP |
-MLX5_FLOW_CONTEXT_ACTION_DECAP))
+MLX5_FLOW_CONTEXT_ACTION_DECAP |
+MLX5_FLOW_CONTEXT_ACTION_MOD_HDR))
return true;
 
return false;
-- 
2.14.3



[pull request][net 00/11] Mellanox, mlx5 fixes 2018-02-20

2018-02-20 Thread Saeed Mahameed
Hi Dave,

The following pull request includes some fixes for the mlx5 core and
netdevice driver.

Please pull and let me know if there's any issue.

-stable 4.10.y:
('net/mlx5e: Fix loopback self test when GRO is off')

-stable 4.12.y:
('net/mlx5e: Specify numa node when allocating drop rq')

-stable 4.13.y:
('net/mlx5e: Verify inline header size do not exceed SKB linear size')

-stable 4.15.y:
('net/mlx5e: Fix TCP checksum in LRO buffers')
('net/mlx5: Fix error handling when adding flow rules')

Thanks,
Saeed.

--- 

The following changes since commit 7928b2cbe55b2a410a0f5c1f154610059c57b1b2:

  Linux 4.16-rc1 (2018-02-11 15:04:29 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5-fixes-2018-02-20

for you to fetch changes up to 9238e380e823a39983ee8d6b6ee8d1a9c4ba8a65:

  net/mlx5: Fix error handling when adding flow rules (2018-02-20 12:53:00 
-0800)


mlx5-fixes-2018-02-20


Daniel Jurgens (1):
  net/mlx5: Use 128B cacheline size for 128B or larger cachelines

Eran Ben Elisha (1):
  net/mlx5e: Verify inline header size do not exceed SKB linear size

Eugenia Emantayev (1):
  net/mlx5: E-Switch, Fix drop counters use before creation

Gal Pressman (2):
  net/mlx5e: Fix TCP checksum in LRO buffers
  net/mlx5e: Specify numa node when allocating drop rq

Inbar Karmy (1):
  net/mlx5e: Fix loopback self test when GRO is off

Or Gerlitz (4):
  net/mlx5e: Eliminate build warnings on no previous prototype
  net/mlx5: Address static checker warnings on non-constant initializers
  net/mlx5e: Return error if prio is specified when offloading eswitch vlan 
push
  net/mlx5: Add header re-write to the checks for conflicting actions

Vlad Buslov (1):
  net/mlx5: Fix error handling when adding flow rules

 .../mellanox/mlx5/core/diag/fs_tracepoint.c|  8 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 14 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 49 +++---
 .../net/ethernet/mellanox/mlx5/core/en_selftest.c  |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  8 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 13 --
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c|  1 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  2 +-
 10 files changed, 70 insertions(+), 33 deletions(-)


[net 03/11] net/mlx5e: Verify inline header size do not exceed SKB linear size

2018-02-20 Thread Saeed Mahameed
From: Eran Ben Elisha 

Driver tries to copy at least MLX5E_MIN_INLINE bytes into the control
segment of the WQE. It assumes that the linear part contains at least
MLX5E_MIN_INLINE bytes, which can be wrong.

Cited commit verified that driver will not copy more bytes into the
inline header part that the actual size of the packet. Re-factor this
check to make sure we do not exceed the linear part as well.

This fix is aligned with the current driver's assumption that the entire
L2 will be present in the linear part of the SKB.

Fixes: 6aace17e64f4 ("net/mlx5e: Fix inline header size for small packets")
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 569b42a01026..11b4f1089d1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -176,7 +176,7 @@ static inline u16 mlx5e_calc_min_inline(enum 
mlx5_inline_modes mode,
default:
hlen = mlx5e_skb_l2_header_offset(skb);
}
-   return min_t(u16, hlen, skb->len);
+   return min_t(u16, hlen, skb_headlen(skb));
 }
 
 static inline void mlx5e_tx_skb_pull_inline(unsigned char **skb_data,
-- 
2.14.3



Re: [PATCH net-next] cxgb4: append firmware dump to vmcore in kernel panic

2018-02-20 Thread Jakub Kicinski
On Tue, 20 Feb 2018 16:51:03 -0800, Florian Fainelli wrote:
> On 02/20/2018 04:43 PM, Jakub Kicinski wrote:
> > On Mon, 19 Feb 2018 18:04:17 +0530, Rahul Lakkireddy wrote:  
> >> Our requirement is to analyze the state of firmware/hardware at the
> >> time of kernel panic.   
> > 
> > I was wondering about this since you posted the patch and I can't come
> > up with any specific scenario where kernel crash would correlate
> > clearly with device state in non-trivial way.
> > 
> > Perhaps there is something about cxgb4 HW/FW that makes this useful.
> > Could you explain?  Could you give a real life example of a bug?  
> > Is it related to the TOE-looking TLS offload Atul is posting?
> > 
> > Is the panic you're targeting here real or manually triggered from user
> > space to get a full dump of kernel and FW?
> > 
> > That's me trying to guess what you're doing.. :)
> 
> One case where this might be helpful is if you are chasing down DMA
> corruption and you would like to get a nearly instant capture of both
> the kernel's memory and the adapter which may be responsible for that.
> This is not probably 100% proof because there is a timing window during
> which the dumps of both contexts are going to happen, and that alone
> might be influencing the captured memory view. Just guessing of course.

Perhaps this is what you mean with the timing window - but with random
corruptions by the time kernel hits the corrupted memory 40/100Gb
adapter has likely forgotten all about those DMAs..  And IOMMUs are
pretty good at catching corruptions on big iron CPUs (i.e. it's easy to
catch them in testing, even if production environment runs iommu=pt).
At least that's my gut feeling/experience ;)


Re: [PATCH net-next] cxgb4: append firmware dump to vmcore in kernel panic

2018-02-20 Thread Florian Fainelli
On 02/20/2018 04:43 PM, Jakub Kicinski wrote:
> On Mon, 19 Feb 2018 18:04:17 +0530, Rahul Lakkireddy wrote:
>> Our requirement is to analyze the state of firmware/hardware at the
>> time of kernel panic. 
> 
> I was wondering about this since you posted the patch and I can't come
> up with any specific scenario where kernel crash would correlate
> clearly with device state in non-trivial way.
> 
> Perhaps there is something about cxgb4 HW/FW that makes this useful.
> Could you explain?  Could you give a real life example of a bug?  
> Is it related to the TOE-looking TLS offload Atul is posting?
> 
> Is the panic you're targeting here real or manually triggered from user
> space to get a full dump of kernel and FW?
> 
> That's me trying to guess what you're doing.. :)
> 

One case where this might be helpful is if you are chasing down DMA
corruption and you would like to get a nearly instant capture of both
the kernel's memory and the adapter which may be responsible for that.
This is not probably 100% proof because there is a timing window during
which the dumps of both contexts are going to happen, and that alone
might be influencing the captured memory view. Just guessing of course.
-- 
Florian


Re: [PATCH iproute2] ip: link_gre6.c: Support IP6_TNL_F_ALLOW_LOCAL_REMOTE flag

2018-02-20 Thread Petr Machata
Serhey Popovych  writes:

> Maybe it is better to rebase this against iproute2-next?

Sure, I sent a v2 rebased on top of iproute2.

Thanks,
Petr


[PATCH iproute2-next v2] ip: link_gre6.c: Support IP6_TNL_F_ALLOW_LOCAL_REMOTE flag

2018-02-20 Thread Petr Machata
For IP-in-IP tunnels, one can specify the [no]allow-localremote command
when configuring a device. Under the hood, this flips the
IP6_TNL_F_ALLOW_LOCAL_REMOTE flag on the netdevice. However, ip6gretap
and ip6erspan devices, where the flag is also relevant, are not IP-in-IP
tunnels, and thus there's no way to configure the flag on these
netdevices. Therefore introduce the command to link_gre6 as well.

The original support was introduced in commit
21440d19d957 ("ip: link_ip6tnl.c/ip6tunnel.c: Support 
IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")

Signed-off-by: Petr Machata 
---

Notes:
Changes from v1 to v2:

- Rebase to iproute2-next

 ip/link_gre6.c| 11 +++
 man/man8/ip-link.8.in | 14 ++
 2 files changed, 25 insertions(+)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 6c77038..e0746bc 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -48,6 +48,7 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
" [ dscp inherit ]\n"
" [ dev PHYS_DEV ]\n"
" [ fwmark MARK ]\n"
+   " [ [no]allow-localremote ]\n"
" [ external ]\n"
" [ noencap ]\n"
" [ encap { fou | gue | none } ]\n"
@@ -346,6 +347,10 @@ get_failed:
invarg("invalid fwmark\n", *argv);
flags &= ~IP6_TNL_F_USE_ORIG_FWMARK;
}
+   } else if (strcmp(*argv, "allow-localremote") == 0) {
+   flags |= IP6_TNL_F_ALLOW_LOCAL_REMOTE;
+   } else if (strcmp(*argv, "noallow-localremote") == 0) {
+   flags &= ~IP6_TNL_F_ALLOW_LOCAL_REMOTE;
} else if (strcmp(*argv, "encaplimit") == 0) {
NEXT_ARG();
if (strcmp(*argv, "none") == 0) {
@@ -534,6 +539,12 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
if (oflags & GRE_CSUM)
print_bool(PRINT_ANY, "ocsum", "ocsum ", true);
 
+   if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
+   print_bool(PRINT_ANY,
+  "ip6_tnl_f_allow_local_remote",
+  "allow-localremote ",
+  true);
+
if (flags & IP6_TNL_F_USE_ORIG_FWMARK) {
print_bool(PRINT_ANY,
   "ip6_tnl_f_use_orig_fwmark",
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 481589e..5dee9fc 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -793,6 +793,8 @@ the following additional arguments are supported:
 ] [
 .BI "dscp inherit"
 ] [
+.BI "[no]allow-localremote"
+] [
 .BI dev " PHYS_DEV "
 ] [
 .RB external
@@ -857,6 +859,11 @@ flag is equivalent to the combination
 - specifies a fixed flowlabel.
 
 .sp
+.BI  [no]allow-localremote
+- specifies whether to allow remote endpoint to have an address configured on
+local host.
+
+.sp
 .BI  tclass " TCLASS"
 - specifies the traffic class field on
 tunneled packets, which can be specified as either a two-digit
@@ -927,6 +934,8 @@ the following additional arguments are supported:
 ] [
 .BR erspan_hwid " \fIhwid "
 ] [
+.BI "[no]allow-localremote"
+] [
 .RB external
 ]
 
@@ -965,6 +974,11 @@ traffic's source port and direction.
 is a 6-bit value for users to configure.
 
 .sp
+.BI  [no]allow-localremote
+- specifies whether to allow remote endpoint to have an address configured on
+local host.
+
+.sp
 .BR external
 - make this tunnel externally controlled (or not, which is the default).
 In the kernel, this is referred to as collect metadata mode.  This flag is
-- 
2.4.11



Re: [PATCH net-next] cxgb4: append firmware dump to vmcore in kernel panic

2018-02-20 Thread Jakub Kicinski
On Mon, 19 Feb 2018 18:04:17 +0530, Rahul Lakkireddy wrote:
> Our requirement is to analyze the state of firmware/hardware at the
> time of kernel panic. 

I was wondering about this since you posted the patch and I can't come
up with any specific scenario where kernel crash would correlate
clearly with device state in non-trivial way.

Perhaps there is something about cxgb4 HW/FW that makes this useful.
Could you explain?  Could you give a real life example of a bug?  
Is it related to the TOE-looking TLS offload Atul is posting?

Is the panic you're targeting here real or manually triggered from user
space to get a full dump of kernel and FW?

That's me trying to guess what you're doing.. :)


Re: [PATCH net-next RFC 0/2] Convert OKI PCH GBE to mdiobus and phylib

2018-02-20 Thread Paul Burton
Hi Andrew,

On Tue, Feb 20, 2018 at 03:28:17AM +0100, Andrew Lunn wrote:
> Hi Paul
> 
> Here is my stab at converting the OKI PCH GBE to use the common MDIO
> bus and phylib drivers. This is compile tested only, and pretty much
> guaranteed to be broken. But hopefully it will help you. Feel free to
> pick it up and make it work. Without hardware, there is not much more
> i can do, other than answer questions.

Thanks - this will be useful :) I'll try to find some time to get it
running this week.

> The handling of the GPIO for reset is possibly wrong.
> 
> 1) It is probably active high, when active low is wanted. Your patch
>would sort this out.
> 
> 2) It is possible the reset happens too late. Since there is no device
>tree binding for the minnow board, it is not possible to register
>the gpio with the core code until after it has been probed and
>attached to the MAC. If the PHY is being held in reset, it might
>not respond to the bus scan, and hence we fail to find it.  If that
>is true, we need a different solution.

Note that unfortunately I don't have access to the Minnow platform which
the driver as-is currently supports, so I won't be able to observe any
Minnow-specific issues either.

Thanks,
Paul

> Andrew Lunn (2):
>   net: phy: at803x: Export at803x_debug_reg_mask()
>   net: ethernet: pch_gbe: Convert to mdiobus and phylib
> 
>  drivers/net/ethernet/oki-semi/pch_gbe/Kconfig  |   1 +
>  drivers/net/ethernet/oki-semi/pch_gbe/Makefile |   2 +-
>  drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h|  35 +-
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_api.c| 118 ---
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_api.h|   8 +-
>  .../ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c|  89 +
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 373 ++--
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_param.c  | 265 ---
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c| 377 
> -
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.h|  37 --
>  drivers/net/phy/at803x.c   |   5 +-
>  include/linux/at803x_phy.h |  16 +
>  12 files changed, 227 insertions(+), 1099 deletions(-)
>  delete mode 100644 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c
>  delete mode 100644 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.h
>  create mode 100644 include/linux/at803x_phy.h
> 
> -- 
> 2.16.1
> 


[PATCH iproute2-next] tc: implement color output

2018-02-20 Thread Stephen Hemminger
Implement the -color option; in this case -co is ambiguous
since it was already used for -conf.
For now this just means putting device name in color.

Signed-off-by: Stephen Hemminger 
---
 man/man8/tc.8  |  9 +++--
 tc/tc.c| 15 +++
 tc/tc_filter.c |  3 +--
 tc/tc_qdisc.c  |  3 +--
 tc/tc_util.c   | 11 +++
 tc/tc_util.h   |  1 +
 6 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index a58f46542340..a20b06bb43c0 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -94,10 +94,11 @@ tc \- show / manipulate traffic control settings
 \fB\-s\fR[\fItatistics\fR] |
 \fB\-d\fR[\fIetails\fR] |
 \fB\-r\fR[\fIaw\fR] |
-\fB\-p\fR[\fIretty\fR] |
 \fB\-i\fR[\fIec\fR] |
 \fB\-g\fR[\fIraph\fR] |
-\fB\-j\fR[\fIjson\fR] }
+\fB\-j\fR[\fIjson\fR] |
+\fB\-p\fR[\fIretty\fR] |
+\fB\-col\fR[\fIor\fR] }
 
 .SH DESCRIPTION
 .B Tc
@@ -685,6 +686,10 @@ option was specified. Classes can be filtered only by
 .BR "dev"
 option.
 
+.TP
+.BR "\ -color"
+Use color output.
+
 .TP
 .BR "\-j", " \-json"
 Display results in JSON format.
diff --git a/tc/tc.c b/tc/tc.c
index cfccf8750562..a31f075d1ffe 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -41,6 +41,7 @@ int use_iec;
 int force;
 bool use_names;
 int json;
+int color;
 
 static char *conf_file;
 
@@ -186,10 +187,11 @@ noexist:
 static void usage(void)
 {
fprintf(stderr, "Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }\n"
-   "   tc [-force] -batch filename\n"
-   "where  OBJECT := { qdisc | class | filter | action | 
monitor | exec }\n"
-   "   OPTIONS := { -s[tatistics] | -d[etails] | 
-r[aw] | -p[retty] | -b[atch] [filename] | -n[etns] name |\n"
-   "-nm | -nam[es] | { -cf | -conf } 
path } | -j[son]\n");
+   "   tc [-force] -batch filename\n"
+   "where  OBJECT := { qdisc | class | filter | action | monitor | 
exec }\n"
+   "   OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] | 
-b[atch] [filename] | -n[etns] name |\n"
+   "-nm | -nam[es] | { -cf | -conf } path } 
|\n"
+   "-j[son] -p[retty] -c[olor]\n");
 }
 
 static int do_cmd(int argc, char **argv, void *buf, size_t buflen)
@@ -476,6 +478,8 @@ int main(int argc, char **argv)
matches(argv[1], "-conf") == 0) {
NEXT_ARG();
conf_file = argv[1];
+   } else if (matches(argv[1], "-color") == 0) {
+   ++color;
} else if (matches(argv[1], "-timestamp") == 0) {
timestamp++;
} else if (matches(argv[1], "-tshort") == 0) {
@@ -490,6 +494,9 @@ int main(int argc, char **argv)
argc--; argv++;
}
 
+   if (color & !json)
+   enable_color();
+
if (batch_file)
return batch(batch_file);
 
diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index 5c31a4cea658..c7701fd8956e 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -303,8 +303,7 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
   t->tcm_block_index);
} else {
if (!filter_ifindex || filter_ifindex != t->tcm_ifindex)
-   print_string(PRINT_ANY, "dev", "dev %s ",
-ll_index_to_name(t->tcm_ifindex));
+   print_devname(PRINT_ANY, t->tcm_ifindex);
 
if (!filter_parent || filter_parent != t->tcm_parent) {
if (t->tcm_parent == TC_H_ROOT)
diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index 2fcf04c336df..78b80e67cf57 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -270,8 +270,7 @@ int print_qdisc(const struct sockaddr_nl *who,
print_string(PRINT_FP, NULL, " ", NULL);
 
if (filter_ifindex == 0)
-   print_string(PRINT_ANY, "dev", "dev %s ",
-ll_index_to_name(t->tcm_ifindex));
+   print_devname(PRINT_ANY, t->tcm_ifindex);
 
if (t->tcm_parent == TC_H_ROOT)
print_bool(PRINT_ANY, "root", "root ", true);
diff --git a/tc/tc_util.c b/tc/tc_util.c
index aceb0d944933..7f6a8e3109c4 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -444,6 +444,17 @@ int get_size_and_cell(unsigned int *size, int *cell_log, 
char *str)
return 0;
 }
 
+void print_devname(enum output_type type, int ifindex)
+{
+   const char *ifname = ll_index_to_name(ifindex);
+
+   if (!is_json_context())
+   printf("dev ");
+
+   print_color_string(type, COLOR_IFNAME,
+  "dev", "%s ", ifname);
+}
+
 void print_size(char *buf, int len, __u32 sz)
 {
double tmp = sz;
diff --git a/tc/tc_util.h b/tc/tc_util.h
index cd2ff5964e19..6632c4f9c528 100644
--- a/tc/tc_util.h
+++ 

[PATCH net-next v4 2/2] ipvlan: selects master_l3 device instead of depending on it

2018-02-20 Thread Matteo Croce
The L3 Master device is just a glue between the core networking code and
device drivers, so it should be selected automatically rather than
requiring to be enabled explicitly.

Signed-off-by: Matteo Croce 
---
 drivers/net/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 3234c6618d75..d88b78a17440 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -150,7 +150,7 @@ config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
 depends on NETFILTER
-depends on NET_L3_MASTER_DEV
+select NET_L3_MASTER_DEV
 ---help---
   This allows one to create virtual devices off of a main interface
   and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
-- 
2.14.3



[PATCH net-next v4 0/2] Remove IPVlan module dependencies on IPv6 and L3 Master dev

2018-02-20 Thread Matteo Croce
The IPVlan module currently depends on IPv6 and L3 Master dev.
Refactor the code to allow building IPVlan module regardless of the value
of CONFIG_IPV6 as done in other drivers like VxLAN or GENEVE.
Also change the CONFIG_NET_L3_MASTER_DEV dependency into a select,
since compiling L3 Master device alone has little sense.

$ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
CONFIG_IPV6=y
CONFIG_IPVLAN=m
$ ll drivers/net/ipvlan/ipvlan.ko
48K drivers/net/ipvlan/ipvlan.ko

$ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
# CONFIG_IPV6 is not set
CONFIG_IPVLAN=m
$ ll drivers/net/ipvlan/ipvlan.ko
44K drivers/net/ipvlan/ipvlan.ko

Matteo Croce (2):
  ipvlan: drop ipv6 dependency
  ipvlan: selects master_l3 device instead of depending on it

 drivers/net/Kconfig  |  3 +-
 drivers/net/ipvlan/ipvlan_core.c | 72 ++--
 drivers/net/ipvlan/ipvlan_main.c | 48 +--
 3 files changed, 86 insertions(+), 37 deletions(-)

-- 
2.14.3



[PATCH net-next v4 1/2] ipvlan: drop ipv6 dependency

2018-02-20 Thread Matteo Croce
IPVlan has an hard dependency on IPv6, refactor the ipvlan code to allow
compiling it with IPv6 disabled, move duplicate code into addr_equal()
and refactor series of if-else into a switch.

Signed-off-by: Matteo Croce 
---
v4: more descriptive commit message and fix checkpatch.pl warnings

 drivers/net/Kconfig  |  1 -
 drivers/net/ipvlan/ipvlan_core.c | 72 ++--
 drivers/net/ipvlan/ipvlan_main.c | 48 +--
 3 files changed, 85 insertions(+), 36 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 944ec3c9282c..3234c6618d75 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,7 +149,6 @@ config MACVTAP
 config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
-depends on IPV6
 depends on NETFILTER
 depends on NET_L3_MASTER_DEV
 ---help---
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index c1f008fe4e1d..1b5dc200b573 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -35,6 +35,7 @@ void ipvlan_count_rx(const struct ipvl_dev *ipvlan,
 }
 EXPORT_SYMBOL_GPL(ipvlan_count_rx);
 
+#if IS_ENABLED(CONFIG_IPV6)
 static u8 ipvlan_get_v6_hash(const void *iaddr)
 {
const struct in6_addr *ip6_addr = iaddr;
@@ -42,6 +43,12 @@ static u8 ipvlan_get_v6_hash(const void *iaddr)
return __ipv6_addr_jhash(ip6_addr, ipvlan_jhash_secret) &
   IPVLAN_HASH_MASK;
 }
+#else
+static u8 ipvlan_get_v6_hash(const void *iaddr)
+{
+   return 0;
+}
+#endif
 
 static u8 ipvlan_get_v4_hash(const void *iaddr)
 {
@@ -51,6 +58,23 @@ static u8 ipvlan_get_v4_hash(const void *iaddr)
   IPVLAN_HASH_MASK;
 }
 
+static bool addr_equal(bool is_v6, struct ipvl_addr *addr, const void *iaddr)
+{
+   if (!is_v6 && addr->atype == IPVL_IPV4) {
+   struct in_addr *i4addr = (struct in_addr *)iaddr;
+
+   return addr->ip4addr.s_addr == i4addr->s_addr;
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (is_v6 && addr->atype == IPVL_IPV6) {
+   struct in6_addr *i6addr = (struct in6_addr *)iaddr;
+
+   return ipv6_addr_equal(>ip6addr, i6addr);
+#endif
+   }
+
+   return false;
+}
+
 static struct ipvl_addr *ipvlan_ht_addr_lookup(const struct ipvl_port *port,
   const void *iaddr, bool is_v6)
 {
@@ -59,15 +83,9 @@ static struct ipvl_addr *ipvlan_ht_addr_lookup(const struct 
ipvl_port *port,
 
hash = is_v6 ? ipvlan_get_v6_hash(iaddr) :
   ipvlan_get_v4_hash(iaddr);
-   hlist_for_each_entry_rcu(addr, >hlhead[hash], hlnode) {
-   if (is_v6 && addr->atype == IPVL_IPV6 &&
-   ipv6_addr_equal(>ip6addr, iaddr))
-   return addr;
-   else if (!is_v6 && addr->atype == IPVL_IPV4 &&
-addr->ip4addr.s_addr ==
-   ((struct in_addr *)iaddr)->s_addr)
+   hlist_for_each_entry_rcu(addr, >hlhead[hash], hlnode)
+   if (addr_equal(is_v6, addr, iaddr))
return addr;
-   }
return NULL;
 }
 
@@ -93,13 +111,9 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev 
*ipvlan,
 {
struct ipvl_addr *addr;
 
-   list_for_each_entry(addr, >addrs, anode) {
-   if ((is_v6 && addr->atype == IPVL_IPV6 &&
-   ipv6_addr_equal(>ip6addr, iaddr)) ||
-   (!is_v6 && addr->atype == IPVL_IPV4 &&
-   addr->ip4addr.s_addr == ((struct in_addr *)iaddr)->s_addr))
+   list_for_each_entry(addr, >addrs, anode)
+   if (addr_equal(is_v6, addr, iaddr))
return addr;
-   }
return NULL;
 }
 
@@ -150,6 +164,7 @@ static void *ipvlan_get_L3_hdr(struct ipvl_port *port, 
struct sk_buff *skb, int
lyr3h = ip4h;
break;
}
+#if IS_ENABLED(CONFIG_IPV6)
case htons(ETH_P_IPV6): {
struct ipv6hdr *ip6h;
 
@@ -188,6 +203,7 @@ static void *ipvlan_get_L3_hdr(struct ipvl_port *port, 
struct sk_buff *skb, int
}
break;
}
+#endif
default:
return NULL;
}
@@ -337,14 +353,18 @@ static struct ipvl_addr *ipvlan_addr_lookup(struct 
ipvl_port *port,
 {
struct ipvl_addr *addr = NULL;
 
-   if (addr_type == IPVL_IPV6) {
+   switch (addr_type) {
+#if IS_ENABLED(CONFIG_IPV6)
+   case IPVL_IPV6: {
struct ipv6hdr *ip6h;
struct in6_addr *i6addr;
 
ip6h = (struct ipv6hdr *)lyr3h;
i6addr = use_dest ? >daddr : >saddr;
addr = ipvlan_ht_addr_lookup(port, i6addr, true);
-   } else if (addr_type == IPVL_ICMPV6) {
+   break;
+   }
+   case IPVL_ICMPV6: {
struct nd_msg *ndmh;
struct in6_addr *i6addr;
 
@@ -356,14 

Re: [PATCH iproute2-next] color: disable color when json output is requested

2018-02-20 Thread Stephen Hemminger
On Wed, 21 Feb 2018 00:28:04 +0100
Vincent Bernat  wrote:

> Instead of declaring -color and -json exclusive, ignore -color when
> -json is provided. The rationale is to allow to put -color in an alias
> for ip while still being able to use -json. -color is merely a
> presentation suggestion and we can assume there is nothing to color in
> the JSON output.
> 
> Signed-off-by: Vincent Bernat 

Looks fine to me, this could even go into master.
Need to update man page and make sure behavior is consistent
across ip, tc, and bridge commands.


[PATCH iproute2-next] color: disable color when json output is requested

2018-02-20 Thread Vincent Bernat
Instead of declaring -color and -json exclusive, ignore -color when
-json is provided. The rationale is to allow to put -color in an alias
for ip while still being able to use -json. -color is merely a
presentation suggestion and we can assume there is nothing to color in
the JSON output.

Signed-off-by: Vincent Bernat 
---
 include/color.h | 1 -
 ip/ip.c | 7 ---
 lib/color.c | 8 
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/color.h b/include/color.h
index f6c351b77746..c80359d3e2e9 100644
--- a/include/color.h
+++ b/include/color.h
@@ -13,7 +13,6 @@ enum color_attr {
 };
 
 void enable_color(void);
-void check_if_color_enabled(void);
 void set_color_palette(void);
 int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...);
 enum color_attr ifa_family_color(__u8 ifa_family);
diff --git a/ip/ip.c b/ip/ip.c
index b15e6b66b3f6..ebf77aa2ee37 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -172,6 +172,7 @@ int main(int argc, char **argv)
 {
char *basename;
char *batch_file = NULL;
+   int color = 0;
 
basename = strrchr(argv[0], '/');
if (basename == NULL)
@@ -273,7 +274,7 @@ int main(int argc, char **argv)
}
rcvbuf = size;
} else if (matches(opt, "-color") == 0) {
-   enable_color();
+   ++color;
} else if (matches(opt, "-help") == 0) {
usage();
} else if (matches(opt, "-netns") == 0) {
@@ -293,8 +294,8 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (json)
-   check_if_color_enabled();
+   if (color && !json)
+   enable_color();
 
if (batch_file)
return batch(batch_file);
diff --git a/lib/color.c b/lib/color.c
index a13a4930b10c..da1f516cb249 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -92,14 +92,6 @@ void set_color_palette(void)
is_dark_bg = 1;
 }
 
-void check_if_color_enabled(void)
-{
-   if (color_is_enabled) {
-   fprintf(stderr, "Option \"-json\" conflicts with 
\"-color\".\n");
-   exit(1);
-   }
-}
-
 int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...)
 {
int ret = 0;
-- 
2.16.1



Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Eric Dumazet
On Tue, 2018-02-20 at 21:45 +0100, Oleksandr Natalenko wrote:
> On úterý 20. února 2018 21:09:37 CET Eric Dumazet wrote:
> > Also you can tune your NIC to accept few MSS per GSO/TSO packet
> > 
> > ip link set dev eth0 gso_max_segs 2
> > 
> > So even if TSO/GSO is there, BBR should not use sk->sk_gso_max_segs to
> > size its bursts, since burt sizes are also impacting GRO on the
> > receiver.
> 
> net-next + 7 patches (6 from the patchset + this one).

My latest patch (fixing BBR underestimation of cwnd)
was meant for net tree, on a NIC where SG/TSO/GSO) are disabled.

( ie when sk->sk_gso_max_segs is not set to 'infinite' )

It is packet scheduler independent really.

Tested here with pfifo_fast, TSO/GSO off.

Before patch :
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
691  (ss -temoi shows cwnd is stuck around 6 )
667
651
631
517

After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
   1733 (ss -temoi shows cwnd is around 386 )
   1778
   1746
   1781
   1718

Thanks.



Re: [PATCH 1/3] net: Kill net_mutex

2018-02-20 Thread Stephen Hemminger
On Mon, 19 Feb 2018 12:58:38 +0300
Kirill Tkhai  wrote:

> + struct list_headexit_list;  /* To linked to call pernet exit
> +  * methods on dead net (net_sem
> +  * read locked), or to 
> unregister
> +  * pernet ops (net_sem wr 
> locked).
> +  */

Sorry, that comment is completely unparseable.
Either you know what it does, and therefore comment is unnecessary
Or change comment to a valid explanation of the semantics of the list.

Maybe comments about locking model are best left to where
it is used in the code.


Re: [PATCH iproute2-next v2 3/7] iplink: Use "dev" and "name" parameters interchangeable when possible

2018-02-20 Thread Stephen Hemminger
On Tue, 20 Feb 2018 23:37:25 +0200
Serhey Popovych  wrote:

> Both of them accept network device name as argument, but have different
> meaning:
> 
>   dev  - is a device by it's name,
>   name - name for specific device device.
> 
> The only case where they treated separately is network device rename
> case where need to specify both ifindex and new name. In rest of the
> cases we can assume that dev == name.

I would rather keep name as only being a valid argument for set command.
And reject use of:

   ip li show name eth0


Re: [V9fs-developer] [PATCH] net/9p: avoid -ERESTARTSYS leak to userspace

2018-02-20 Thread Greg Kurz
Hi Al,

It's been two years without any sign of life from 9p maintainers... :-\

Would you apply (or nack) this patch ?

Thanks,

--
Greg

PS: in the case you apply it, probable Cc sta...@vger.kernel.org as well


On Thu, 08 Feb 2018 18:38:49 +0100

Greg Kurz  wrote:

> If it was interrupted by a signal, the 9p client may need to send some
> more requests to the server for cleanup before returning to userspace.
> 
> To avoid such a last minute request to be interrupted right away, the
> client memorizes if a signal is pending, clear TIF_SIGPENDING, handle
> the request and call recalc_sigpending() before returning.
> 
> Unfortunately, if the transmission of this cleanup request fails for any
> reason, the transport returns an error and the client propagates it right
> away, without calling recalc_sigpending().
> 
> This ends up with -ERESTARTSYS from the initially interrupted request
> crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
> request. The specific signal handling code, which is responsible for
> converting -ERESTARTSYS to -EINTR is not called, and userspace receives
> the confusing errno value:
> 
> open: Unknown error 512 (512)
> 
> This is really hard to hit in real life. I discovered the issue while
> working on hot-unplug of a virtio-9p-pci device with an instrumented
> QEMU allowing to control request completion.
> 
> Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
> error path actually. Their code flow is a bit obscure and the best
> thing to do would probably be a full rewrite: to really ensure this
> situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
> never happen.
> 
> But given the general lack of interest for the 9p code, I won't risk
> breaking more things. So this patch simply fix the buggy paths in both
> functions with a trivial label+goto.
> 
> Thanks to Laurent Dufour for his help and suggestions on how to find
> the root cause and how to fix it.
> 
> Signed-off-by: Greg Kurz 
> ---
>  net/9p/client.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 4c8cf9c1631a..5154eaf19fff 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -769,7 +769,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const 
> char *fmt, ...)
>   if (err < 0) {
>   if (err != -ERESTARTSYS && err != -EFAULT)
>   c->status = Disconnected;
> - goto reterr;
> + goto recalc_sigpending;
>   }
>  again:
>   /* Wait for the response */
> @@ -804,6 +804,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const 
> char *fmt, ...)
>   if (req->status == REQ_STATUS_RCVD)
>   err = 0;
>   }
> +recalc_sigpending:
>   if (sigpending) {
>   spin_lock_irqsave(>sighand->siglock, flags);
>   recalc_sigpending();
> @@ -867,7 +868,7 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client 
> *c, int8_t type,
>   if (err == -EIO)
>   c->status = Disconnected;
>   if (err != -ERESTARTSYS)
> - goto reterr;
> + goto recalc_sigpending;
>   }
>   if (req->status == REQ_STATUS_ERROR) {
>   p9_debug(P9_DEBUG_ERROR, "req_status error %d\n", req->t_err);
> @@ -885,6 +886,7 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client 
> *c, int8_t type,
>   if (req->status == REQ_STATUS_RCVD)
>   err = 0;
>   }
> +recalc_sigpending:
>   if (sigpending) {
>   spin_lock_irqsave(>sighand->siglock, flags);
>   recalc_sigpending();
> 
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> V9fs-developer mailing list
> v9fs-develo...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/v9fs-developer



Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Roopa Prabhu
On Tue, Feb 20, 2018 at 12:49 PM, David Ahern  wrote:
> On 2/20/18 1:44 PM, Roopa Prabhu wrote:
>> On Tue, Feb 20, 2018 at 12:33 PM, Stephen Hemminger
>>  wrote:
>>> On Tue, 20 Feb 2018 13:27:21 -0700
>>> David Ahern  wrote:
>>>
 On 2/20/18 1:17 PM, Serhey Popovych wrote:
> Stephen Hemminger wrote:
>> On Tue, 20 Feb 2018 21:39:51 +0200
>> Serhey Popovych  wrote:
>>
>>> Signed-off-by: Serhey Popovych 
>>> ---
>>>  ip/iplink.c |   12 ++--
>>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/ip/iplink.c b/ip/iplink.c
>>> index 74c377c..a2c8108 100644
>>> --- a/ip/iplink.c
>>> +++ b/ip/iplink.c
>>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
>>> iplink_req *req,
>>>   NEXT_ARG();
>>>   if (xdp_parse(, , req, dev_index,
>>> generic, drv, offload))
>>> - exit(-1);
>>> + return -1;
>>>   } else if (strcmp(*argv, "netns") == 0) {
>>>   NEXT_ARG();
>>>   if (netns != -1)
>>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int 
>>> flags, int argc, char **argv)
>>>   if (!dev) {
>>>   fprintf(stderr,
>>>   "Not enough information: \"dev\" argument is 
>>> required.\n");
>>> - exit(-1);
>>> + return -1;
>>>   }
>>>   if (cmd == RTM_NEWLINK && index) {
>>>   fprintf(stderr,
>>>   "index can be used only when creating 
>>> devices.\n");
>>> - exit(-1);
>>> + return -1;
>>>   }
>>>
>>>   req.i.ifi_index = ll_name_to_index(dev);
>>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>>>   if (!dev) {
>>>   fprintf(stderr,
>>>   "Not enough of information: \"dev\" argument is 
>>> required.\n");
>>> - exit(-1);
>>> + return -1;
>>>   }
>>>
>>>   if (newaddr || newbrd) {
>>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>>>   fprintf(stderr,
>>>   "Command \"%s\" is unknown, try \"ip link 
>>> help\".\n",
>>>   *argv);
>>> - exit(-1);
>>> + return -1;
>>>   }
>>>
>>>   argv++; argc--;
>>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>>>
>>>   fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>>>   *argv);
>>> - exit(-1);
>>> + return -1;
>>>  }
>>
>> Not sure I like this. If given bad input in batch it is better to stop 
>> and exit
>> rather than continuing with more bad data.
>
> When preparing this change I think in opposite direction: we want to
> continue batch mode if single line is broken.
>

 batch mode needs to stop on the line that fails. That said, batch still
 fails with the /exit/return/ change

 $ cat /tmp/ip.batch
 li sh
 li foo
 li add veth1 type veth peer name veth2

 Current command
 $ ip -batch /tmp/ip.batch
 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
 DEFAULT group default qlen 1000
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

 
 Command "foo" is unknown, try "ip link help".
 $ echo $?
 255

 $ ip/ip -batch /tmp/ip.batch
 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
 DEFAULT group default qlen 1000
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 
 Command "foo" is unknown, try "ip link help".
 Command failed /tmp/ip.batch:2
 dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
 1

 I like that better because it tells me the line that fails.
>>>
>>> Normally ip batch will exit on errors.
>>> The question is what about -force?
>>
>> on -force, it needs to continue.
>>
>
>
> It does.
>
> $ ip/ip -batch /tmp/ip.batch -force
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> Command "foo" is unknown, try "ip link help".
> Command failed /tmp/ip.batch:2
> RTNETLINK answers: Operation not permitted
> Command failed /tmp/ip.batch:3
>
> Which is an improvement over today where it just exits - ignoring the force.

cool :).

some other iproute2 commands have also received patches to continue on
-force over the last few years.


[RFC net-next 03/11] ip6mr: Align hash implementation to ipmr

2018-02-20 Thread Yuval Mintz
Since commit 8fb472c09b9d ("ipmr: improve hash scalability") ipmr has
been using rhashtable as a basis for its mfc routes, but ip6mr is
currently still using the old private MFC hash implementation.

Align ip6mr to the current ipmr implementation.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute6.h |  30 ++---
 net/ipv6/ip6mr.c| 313 ++--
 2 files changed, 184 insertions(+), 159 deletions(-)

diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index e1b9fb0..e2dac19 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_IPV6_MROUTE
 static inline int ip6_mroute_opt(int opt)
@@ -65,10 +66,20 @@ static inline void ip6_mr_cleanup(void)
 
 #define VIFF_STATIC 0x8000
 
+struct mfc6_cache_cmp_arg {
+   struct in6_addr mf6c_mcastgrp;
+   struct in6_addr mf6c_origin;
+};
+
 struct mfc6_cache {
-   struct list_head list;
-   struct in6_addr mf6c_mcastgrp;  /* Group the entry 
belongs to   */
-   struct in6_addr mf6c_origin;/* Source of packet 
*/
+   struct rhlist_head mnode;
+   union {
+   struct {
+   struct in6_addr mf6c_mcastgrp;
+   struct in6_addr mf6c_origin;
+   };
+   struct mfc6_cache_cmp_arg cmparg;
+   };
mifi_t mf6c_parent; /* Source interface 
*/
int mfc_flags;  /* Flags on line
*/
 
@@ -88,22 +99,13 @@ struct mfc6_cache {
unsigned char ttls[MAXMIFS];/* TTL thresholds   
*/
} res;
} mfc_un;
+   struct list_head list;
+   struct rcu_head rcu;
 };
 
 #define MFC_STATIC 1
 #define MFC_NOTIFY 2
 
-#define MFC6_LINES 64
-
-#define MFC6_HASH(a, g) (((__force u32)(a)->s6_addr32[0] ^ \
- (__force u32)(a)->s6_addr32[1] ^ \
- (__force u32)(a)->s6_addr32[2] ^ \
- (__force u32)(a)->s6_addr32[3] ^ \
- (__force u32)(g)->s6_addr32[0] ^ \
- (__force u32)(g)->s6_addr32[1] ^ \
- (__force u32)(g)->s6_addr32[2] ^ \
- (__force u32)(g)->s6_addr32[3]) % MFC6_LINES)
-
 #define MFC_ASSERT_THRESH (3*HZ)   /* Maximal freq. of asserts */
 
 struct rtmsg;
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 7792fc5..717370e 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -61,8 +61,9 @@ struct mr6_table {
struct sock __rcu   *mroute6_sk;
struct timer_list   ipmr_expire_timer;
struct list_headmfc6_unres_queue;
-   struct list_headmfc6_cache_array[MFC6_LINES];
struct vif_device   vif6_table[MAXMIFS];
+   struct rhltable mfc6_hash;
+   struct list_headmfc6_cache_list;
int maxvif;
atomic_tcache_resolve_queue_len;
boolmroute_do_assert;
@@ -299,10 +300,29 @@ static void __net_exit ip6mr_rules_exit(struct net *net)
 }
 #endif
 
+static int ip6mr_hash_cmp(struct rhashtable_compare_arg *arg,
+ const void *ptr)
+{
+   const struct mfc6_cache_cmp_arg *cmparg = arg->key;
+   struct mfc6_cache *c = (struct mfc6_cache *)ptr;
+
+   return !ipv6_addr_equal(>mf6c_mcastgrp, >mf6c_mcastgrp) ||
+  !ipv6_addr_equal(>mf6c_origin, >mf6c_origin);
+}
+
+static const struct rhashtable_params ip6mr_rht_params = {
+   .head_offset = offsetof(struct mfc6_cache, mnode),
+   .key_offset = offsetof(struct mfc6_cache, cmparg),
+   .key_len = sizeof(struct mfc6_cache_cmp_arg),
+   .nelem_hint = 3,
+   .locks_mul = 1,
+   .obj_cmpfn = ip6mr_hash_cmp,
+   .automatic_shrinking = true,
+};
+
 static struct mr6_table *ip6mr_new_table(struct net *net, u32 id)
 {
struct mr6_table *mrt;
-   unsigned int i;
 
mrt = ip6mr_get_table(net, id);
if (mrt)
@@ -314,10 +334,8 @@ static struct mr6_table *ip6mr_new_table(struct net *net, 
u32 id)
mrt->id = id;
write_pnet(>net, net);
 
-   /* Forwarding cache */
-   for (i = 0; i < MFC6_LINES; i++)
-   INIT_LIST_HEAD(>mfc6_cache_array[i]);
-
+   rhltable_init(>mfc6_hash, _rht_params);
+   INIT_LIST_HEAD(>mfc6_cache_list);
INIT_LIST_HEAD(>mfc6_unres_queue);
 
timer_setup(>ipmr_expire_timer, ipmr_expire_process, 0);
@@ -335,6 +353,7 @@ static void ip6mr_free_table(struct mr6_table *mrt)
 {
del_timer_sync(>ipmr_expire_timer);
mroute_clean_tables(mrt, true);
+   rhltable_destroy(>mfc6_hash);
kfree(mrt);
 }
 
@@ -344,7 +363,6 @@ struct ipmr_mfc_iter {
struct 

[RFC net-next 09/11] ipmr, ip6mr: Unite vif seq functions

2018-02-20 Thread Yuval Mintz
Same as previously done with the mfc seq, the logic for the vif seq is
refactored to be shared between ipmr and ip6mr.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute_base.h | 33 ++
 net/ipv4/ipmr.c | 49 +---
 net/ipv4/ipmr_base.c| 33 ++
 net/ipv6/ip6mr.c| 50 +
 4 files changed, 76 insertions(+), 89 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 413f103..edc6e6b 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -206,6 +206,12 @@ static inline void *mr_mfc_find(struct mr_table *mrt, void 
*hasharg)
 }
 
 #ifdef CONFIG_PROC_FS
+struct mr_vif_iter {
+   struct seq_net_private p;
+   struct mr_table *mrt;
+   int ct;
+};
+
 struct mr_mfc_iter {
struct seq_net_private p;
struct mr_table *mrt;
@@ -216,6 +222,16 @@ struct mr_mfc_iter {
 };
 
 #ifdef CONFIG_IP_MROUTE_COMMON
+void *mr_vif_seq_idx(struct net *net, struct mr_vif_iter *iter, loff_t pos);
+void *mr_vif_seq_next(struct seq_file *seq, void *v, loff_t *pos);
+
+static inline void *mr_vif_seq_start(struct seq_file *seq, loff_t *pos)
+{
+   return *pos ? mr_vif_seq_idx(seq_file_net(seq),
+seq->private, *pos - 1)
+   : SEQ_START_TOKEN;
+}
+
 /* These actually return 'struct mr_mfc *', but to avoid need for explicit
  * castings they simply return void.
  */
@@ -249,6 +265,23 @@ static inline void mr_mfc_seq_stop(struct seq_file *seq, 
void *v)
rcu_read_unlock();
 }
 #else
+static inline void *mr_vif_seq_idx(struct net *net, struct mr_vif_iter *iter,
+  loff_t pos)
+{
+   return NULL;
+}
+
+static inline void *mr_vif_seq_next(struct seq_file *seq,
+   void *v, loff_t *pos)
+{
+   return NULL;
+}
+
+static inline void *mr_vif_seq_start(struct seq_file *seq, loff_t *pos)
+{
+   return NULL;
+}
+
 static inline void *mr_mfc_seq_idx(struct net *net,
   struct mr_mfc_iter *it, loff_t pos)
 {
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0281f89..6039751 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2908,31 +2908,11 @@ static int ipmr_rtm_dumplink(struct sk_buff *skb, 
struct netlink_callback *cb)
 /* The /proc interfaces to multicast routing :
  * /proc/net/ip_mr_cache & /proc/net/ip_mr_vif
  */
-struct ipmr_vif_iter {
-   struct seq_net_private p;
-   struct mr_table *mrt;
-   int ct;
-};
-
-static struct vif_device *ipmr_vif_seq_idx(struct net *net,
-   struct ipmr_vif_iter *iter,
-   loff_t pos)
-{
-   struct mr_table *mrt = iter->mrt;
-
-   for (iter->ct = 0; iter->ct < mrt->maxvif; ++iter->ct) {
-   if (!VIF_EXISTS(mrt, iter->ct))
-   continue;
-   if (pos-- == 0)
-   return >vif_table[iter->ct];
-   }
-   return NULL;
-}
 
 static void *ipmr_vif_seq_start(struct seq_file *seq, loff_t *pos)
__acquires(mrt_lock)
 {
-   struct ipmr_vif_iter *iter = seq->private;
+   struct mr_vif_iter *iter = seq->private;
struct net *net = seq_file_net(seq);
struct mr_table *mrt;
 
@@ -2943,26 +2923,7 @@ static void *ipmr_vif_seq_start(struct seq_file *seq, 
loff_t *pos)
iter->mrt = mrt;
 
read_lock(_lock);
-   return *pos ? ipmr_vif_seq_idx(net, seq->private, *pos - 1)
-   : SEQ_START_TOKEN;
-}
-
-static void *ipmr_vif_seq_next(struct seq_file *seq, void *v, loff_t *pos)
-{
-   struct ipmr_vif_iter *iter = seq->private;
-   struct net *net = seq_file_net(seq);
-   struct mr_table *mrt = iter->mrt;
-
-   ++*pos;
-   if (v == SEQ_START_TOKEN)
-   return ipmr_vif_seq_idx(net, iter, 0);
-
-   while (++iter->ct < mrt->maxvif) {
-   if (!VIF_EXISTS(mrt, iter->ct))
-   continue;
-   return >vif_table[iter->ct];
-   }
-   return NULL;
+   return mr_vif_seq_start(seq, pos);
 }
 
 static void ipmr_vif_seq_stop(struct seq_file *seq, void *v)
@@ -2973,7 +2934,7 @@ static void ipmr_vif_seq_stop(struct seq_file *seq, void 
*v)
 
 static int ipmr_vif_seq_show(struct seq_file *seq, void *v)
 {
-   struct ipmr_vif_iter *iter = seq->private;
+   struct mr_vif_iter *iter = seq->private;
struct mr_table *mrt = iter->mrt;
 
if (v == SEQ_START_TOKEN) {
@@ -2996,7 +2957,7 @@ static int ipmr_vif_seq_show(struct seq_file *seq, void 
*v)
 
 static const struct seq_operations ipmr_vif_seq_ops = {
.start = ipmr_vif_seq_start,
-   .next  = ipmr_vif_seq_next,
+   .next  = mr_vif_seq_next,
.stop  = ipmr_vif_seq_stop,
.show  = 

[RFC net-next 06/11] ipmr, ip6mr: Make mfc_cache a common structure

2018-02-20 Thread Yuval Mintz
mfc_cache and mfc6_cache are almost identical - the main difference is
in the origin/group addresses and comparison-key. Make a common
structure encapsulating most of the multicast routing logic  - mr_mfc
and convert both ipmr and ip6mr into using it.

For easy conversion [casting, in this case] mr_mfc has to be the first
field inside every multicast routing abstraction utilizing it.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c |  21 +-
 include/linux/mroute.h|  45 +---
 include/linux/mroute6.h   |  23 +--
 include/linux/mroute_base.h   |  45 
 net/ipv4/ipmr.c   | 234 +++--
 net/ipv6/ip6mr.c  | 241 +++---
 6 files changed, 312 insertions(+), 297 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
index d20b143..978a3c7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
@@ -126,8 +126,8 @@ mlxsw_sp_mr_route_ivif_in_evifs(const struct 
mlxsw_sp_mr_route *mr_route)
 
switch (mr_route->mr_table->proto) {
case MLXSW_SP_L3_PROTO_IPV4:
-   ivif = mr_route->mfc4->mfc_parent;
-   return mr_route->mfc4->mfc_un.res.ttls[ivif] != 255;
+   ivif = mr_route->mfc4->_c.mfc_parent;
+   return mr_route->mfc4->_c.mfc_un.res.ttls[ivif] != 255;
case MLXSW_SP_L3_PROTO_IPV6:
/* fall through */
default:
@@ -364,7 +364,7 @@ mlxsw_sp_mr_route4_create(struct mlxsw_sp_mr_table 
*mr_table,
mr_route->mfc4 = mfc;
mr_route->mr_table = mr_table;
for (i = 0; i < MAXVIFS; i++) {
-   if (mfc->mfc_un.res.ttls[i] != 255) {
+   if (mfc->_c.mfc_un.res.ttls[i] != 255) {
err = mlxsw_sp_mr_route_evif_link(mr_route,
  _table->vifs[i]);
if (err)
@@ -374,7 +374,8 @@ mlxsw_sp_mr_route4_create(struct mlxsw_sp_mr_table 
*mr_table,
mr_route->min_mtu = mr_table->vifs[i].dev->mtu;
}
}
-   mlxsw_sp_mr_route_ivif_link(mr_route, _table->vifs[mfc->mfc_parent]);
+   mlxsw_sp_mr_route_ivif_link(mr_route,
+   _table->vifs[mfc->_c.mfc_parent]);
 
mr_route->route_action = mlxsw_sp_mr_route_action(mr_route);
return mr_route;
@@ -418,9 +419,9 @@ static void mlxsw_sp_mr_mfc_offload_set(struct 
mlxsw_sp_mr_route *mr_route,
switch (mr_route->mr_table->proto) {
case MLXSW_SP_L3_PROTO_IPV4:
if (offload)
-   mr_route->mfc4->mfc_flags |= MFC_OFFLOAD;
+   mr_route->mfc4->_c.mfc_flags |= MFC_OFFLOAD;
else
-   mr_route->mfc4->mfc_flags &= ~MFC_OFFLOAD;
+   mr_route->mfc4->_c.mfc_flags &= ~MFC_OFFLOAD;
break;
case MLXSW_SP_L3_PROTO_IPV6:
/* fall through */
@@ -943,10 +944,10 @@ static void mlxsw_sp_mr_route_stats_update(struct 
mlxsw_sp *mlxsw_sp,
 
switch (mr_route->mr_table->proto) {
case MLXSW_SP_L3_PROTO_IPV4:
-   if (mr_route->mfc4->mfc_un.res.pkt != packets)
-   mr_route->mfc4->mfc_un.res.lastuse = jiffies;
-   mr_route->mfc4->mfc_un.res.pkt = packets;
-   mr_route->mfc4->mfc_un.res.bytes = bytes;
+   if (mr_route->mfc4->_c.mfc_un.res.pkt != packets)
+   mr_route->mfc4->_c.mfc_un.res.lastuse = jiffies;
+   mr_route->mfc4->_c.mfc_un.res.pkt = packets;
+   mr_route->mfc4->_c.mfc_un.res.bytes = bytes;
break;
case MLXSW_SP_L3_PROTO_IPV6:
/* fall through */
diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index 8688c5d..63b36e6 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -81,28 +81,13 @@ struct mfc_cache_cmp_arg {
 
 /**
  * struct mfc_cache - multicast routing entries
- * @mnode: rhashtable list
+ * @_c: Common multicast routing information; has to be first [for casting]
  * @mfc_mcastgrp: destination multicast group address
  * @mfc_origin: source address
  * @cmparg: used for rhashtable comparisons
- * @mfc_parent: source interface (iif)
- * @mfc_flags: entry flags
- * @expires: unresolved entry expire time
- * @unresolved: unresolved cached skbs
- * @last_assert: time of last assert
- * @minvif: minimum VIF id
- * @maxvif: maximum VIF id
- * @bytes: bytes that have passed for this entry
- * @pkt: packets that have passed for this entry
- * @wrong_if: number of wrong source interface hits
- * @lastuse: time of last use of the group (traffic or update)
- * @ttls: OIF TTL threshold array

[RFC net-next 00/11] ipmr, ip6mr: Align multicast routing for IPv4 & IPv6

2018-02-20 Thread Yuval Mintz
Historically ip6mr was based [cut-n-paste] on ipmr and the two have not
diverged too much. Apparently as ipv4 multicast routing is more common
than its ipv6 brethren modifications since then are mostly one-way,
affecting ipmr while leaving ip6mr unchanged.

This series is meant to re-factor both ipmr and ip6mr into having common
structures [and some functionality], adding 2 new common files -
mroute_base.h and ipmr_base.c.

The series begins by bringing ip6mr up to speed to some of the changes
applied in the past to ipmr [#2, #3].
It is then possible to re-factor a lot of the common structures - 
vif devices [#1], mr_table [#4] mfc_cache [#6], and use the common
structures in both ipmr and ip6mr.

The rest of the patches re-factor some choice flows used by both ipmr
and ip6mr and eliminates duplicity.

This series would later allow for easy extension of ipmr offloading
to support ip6mr offloading as well, as almost all structures
related to the offloading would be shared between the two protocols.

Yuval Mintz (11):
  ipmr,ipmr6: Define a uniform vif_device
  ip6mr: Make mroute_sk rcu-based
  ip6mr: Align hash implementation to ipmr
  mroute*: Make mr_table a common struct
  ipmr, ip6mr: Unite creation of new mr_table
  ipmr, ip6mr: Make mfc_cache a common structure
  ipmr, ip6mr: Unite logic for searching in MFC cache
  ipmr, ip6mr: Unite mfc seq logic
  ipmr, ip6mr: Unite vif seq logic
  ip6mr: Remove MFC_NOTIFY and refactor flags
  ipmr, ip6mr: Unite dumproute flows

 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c |  21 +-
 include/linux/mroute.h|  88 +-
 include/linux/mroute6.h   |  62 +-
 include/linux/mroute_base.h   | 346 
 include/net/netns/ipv6.h  |   2 +-
 net/ipv4/Kconfig  |   5 +
 net/ipv4/Makefile |   1 +
 net/ipv4/ipmr.c   | 576 -
 net/ipv4/ipmr_base.c  | 323 +++
 net/ipv6/Kconfig  |   1 +
 net/ipv6/ip6_output.c |   2 +-
 net/ipv6/ip6mr.c  | 984 --
 12 files changed, 1240 insertions(+), 1171 deletions(-)
 create mode 100644 include/linux/mroute_base.h
 create mode 100644 net/ipv4/ipmr_base.c

-- 
2.4.3



[RFC net-next 11/11] ipmr, ip6mr: Unite dumproute flows

2018-02-20 Thread Yuval Mintz
The various MFC entries are being held in the same kind of mr_tables
for both ipmr and ip6mr, and their traversal logic is identical.
Also, with the exception of the addresses [and other small tidbits]
the major bulk of the nla setting is identical.

Unite as much of the dumping as possible between the two.
Notice this requires creating an mr_table iterator for each, as the
for-each preprocessor macro can't be used by the common logic.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute_base.h |  29 +
 net/ipv4/ipmr.c | 154 +---
 net/ipv4/ipmr_base.c| 123 +++
 net/ipv6/ip6mr.c| 149 +-
 4 files changed, 216 insertions(+), 239 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 2054118..2da30da 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -170,6 +170,16 @@ void *mr_mfc_find_parent(struct mr_table *mrt,
 void *mr_mfc_find_any_parent(struct mr_table *mrt, int vifi);
 void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg);
 
+int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
+  struct mr_mfc *c, struct rtmsg *rtm);
+int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
+struct mr_table *(*iter)(struct net *net,
+ struct mr_table *mrt),
+int (*fill)(struct mr_table *mrt,
+struct sk_buff *skb,
+u32 portid, u32 seq, struct mr_mfc *c,
+int cmd, int flags),
+spinlock_t *lock);
 #else
 static inline void vif_device_init(struct vif_device *v,
   struct net_device *dev,
@@ -207,6 +217,25 @@ static inline struct mr_mfc *mr_mfc_find_any(struct 
mr_table *mrt,
 {
return NULL;
 }
+
+static inline int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
+struct mr_mfc *c, struct rtmsg *rtm)
+{
+   return -EINVAL;
+}
+
+static inline int
+mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
+struct mr_table *(*iter)(struct net *net,
+ struct mr_table *mrt),
+int (*fill)(struct mr_table *mrt,
+struct sk_buff *skb,
+u32 portid, u32 seq, struct mr_mfc *c,
+int cmd, int flags),
+spinlock_t *lock)
+{
+   return -EINVAL;
+}
 #endif
 
 static inline void *mr_mfc_find(struct mr_table *mrt, void *hasharg)
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 6039751..446b372 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -104,8 +104,6 @@ static void ip_mr_forward(struct net *net, struct mr_table 
*mrt,
  struct mfc_cache *cache, int local);
 static int ipmr_cache_report(struct mr_table *mrt,
 struct sk_buff *pkt, vifi_t vifi, int assert);
-static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
- struct mr_mfc *c, struct rtmsg *rtm);
 static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
 int cmd);
 static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
@@ -116,6 +114,16 @@ static void ipmr_expire_process(struct timer_list *t);
 #define ipmr_for_each_table(mrt, net) \
list_for_each_entry_rcu(mrt, >ipv4.mr_tables, list)
 
+static struct mr_table *ipmr_mr_table_iter(struct net *net,
+  struct mr_table *mrt)
+{
+   if (!mrt)
+   return list_entry_rcu(>ipv4.mr_tables->next,
+ typeof(mrt), list);
+   return list_entry_rcu(mrt->list.next,
+ typeof(mrt), list);
+}
+
 static struct mr_table *ipmr_get_table(struct net *net, u32 id)
 {
struct mr_table *mrt;
@@ -283,6 +291,14 @@ EXPORT_SYMBOL(ipmr_rule_default);
 #define ipmr_for_each_table(mrt, net) \
for (mrt = net->ipv4.mrt; mrt; mrt = NULL)
 
+static struct mr_table *ipmr_mr_table_iter(struct net *net,
+  struct mr_table *mrt)
+{
+   if (!mrt)
+   return net->ipv4.mrt;
+   return NULL;
+}
+
 static struct mr_table *ipmr_get_table(struct net *net, u32 id)
 {
return net->ipv4.mrt;
@@ -1050,8 +1066,8 @@ static void ipmr_cache_resolve(struct net *net, struct 
mr_table *mrt,
struct nlmsghdr *nlh = skb_pull(skb,
sizeof(struct iphdr));
 
-   if (__ipmr_fill_mroute(mrt, skb, >_c,
-  nlmsg_data(nlh)) > 

[RFC net-next 04/11] mroute*: Make mr_table a common struct

2018-02-20 Thread Yuval Mintz
Following previous changes to ip6mr, mr_table and mr6_table are
basically the same [up to mr6_table having additional '6' suffixes to
its variable names].
Move the common structure definition into a common header; This
requires renaming all references in ip6mr to variables that had the
distinct suffix.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute.h  |  21 ---
 include/linux/mroute6.h |   1 -
 include/linux/mroute_base.h |  46 +++
 include/net/netns/ipv6.h|   2 +-
 net/ipv4/ipmr.c |   2 -
 net/ipv6/ip6mr.c| 311 
 6 files changed, 191 insertions(+), 192 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index b8aadff..8688c5d 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -4,8 +4,6 @@
 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -67,25 +65,6 @@ struct vif_entry_notifier_info {
 
 #define VIFF_STATIC 0x8000
 
-#define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL)
-
-struct mr_table {
-   struct list_headlist;
-   possible_net_t  net;
-   u32 id;
-   struct sock __rcu   *mroute_sk;
-   struct timer_list   ipmr_expire_timer;
-   struct list_headmfc_unres_queue;
-   struct vif_device   vif_table[MAXVIFS];
-   struct rhltable mfc_hash;
-   struct list_headmfc_cache_list;
-   int maxvif;
-   atomic_tcache_resolve_queue_len;
-   boolmroute_do_assert;
-   boolmroute_do_pim;
-   int mroute_reg_vif_num;
-};
-
 /* mfc_flags:
  * MFC_STATIC - the entry was added statically (not by a routing daemon)
  * MFC_OFFLOAD - the entry was offloaded to the hardware
diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index e2dac19..d5c8dc1 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_IPV6_MROUTE
 static inline int ip6_mroute_opt(int opt)
diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 0de651e..1cc944a 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -2,6 +2,9 @@
 #define __LINUX_MROUTE_BASE_H
 
 #include 
+#include 
+#include 
+#include 
 
 /**
  * struct vif_device - interface representor for multicast routing
@@ -32,6 +35,49 @@ struct vif_device {
__be32 local, remote;
 };
 
+#ifndef MAXVIFS
+/* This one is nasty; value is defined in uapi using different symbols for
+ * mroute and morute6 but both map into same 32.
+ */
+#define MAXVIFS32
+#endif
+
+#define VIF_EXISTS(_mrt, _idx) (!!((_mrt)->vif_table[_idx].dev))
+
+/**
+ * struct mr_table - a multicast routing table
+ * @list: entry within a list of multicast routing tables
+ * @net: net where this table belongs
+ * @id: identifier of the table
+ * @mroute_sk: socket associated with the table
+ * @ipmr_expire_timer: timer for handling unresolved routes
+ * @mfc_unres_queue: list of unresolved MFC entries
+ * @vif_table: array containing all possible vifs
+ * @mfc_hash: Hash table of all resolved routes for easy lookup
+ * @mfc_cache_list: list of resovled routes for possible traversal
+ * @maxvif: Identifier of highest value vif currently in use
+ * @cache_resolve_queue_len: current size of unresolved queue
+ * @mroute_do_assert: Whether to inform userspace on wrong ingress
+ * @mroute_do_pim: Whether to receive IGMP PIMv1
+ * @mroute_reg_vif_num: PIM-device vif index
+ */
+struct mr_table {
+   struct list_headlist;
+   possible_net_t  net;
+   u32 id;
+   struct sock __rcu   *mroute_sk;
+   struct timer_list   ipmr_expire_timer;
+   struct list_headmfc_unres_queue;
+   struct vif_device   vif_table[MAXVIFS];
+   struct rhltable mfc_hash;
+   struct list_headmfc_cache_list;
+   int maxvif;
+   atomic_tcache_resolve_queue_len;
+   boolmroute_do_assert;
+   boolmroute_do_pim;
+   int mroute_reg_vif_num;
+};
+
 #ifdef CONFIG_IP_MROUTE_COMMON
 void vif_device_init(struct vif_device *v,
 struct net_device *dev,
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 987cc45..0d177fa9 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -84,7 +84,7 @@ struct netns_ipv6 {
struct sock *mc_autojoin_sk;
 #ifdef CONFIG_IPV6_MROUTE
 #ifndef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
-   struct mr6_table*mrt6;
+   struct mr_table *mrt6;
 #else
struct list_headmr6_tables;
struct fib_rules_ops*mr6_rules_ops;
diff --git a/net/ipv4/ipmr.c 

[RFC net-next 05/11] ipmr, ip6mr: Unite creation of new mr_table

2018-02-20 Thread Yuval Mintz
Now that both ipmr and ip6mr are using the same mr_table structure,
we can have a common function to allocate & initialize a new instance.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute_base.h | 17 +
 net/ipv4/ipmr.c | 27 ++-
 net/ipv4/ipmr_base.c| 27 +++
 net/ipv6/ip6mr.c| 30 ++
 4 files changed, 64 insertions(+), 37 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 1cc944a..8053057 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -85,6 +85,13 @@ void vif_device_init(struct vif_device *v,
 unsigned char threshold,
 unsigned short flags,
 unsigned short get_iflink_mask);
+
+struct mr_table *
+mr_table_alloc(struct net *net, u32 id,
+  const struct rhashtable_params *rht_params,
+  void (*expire_func)(struct timer_list *t),
+  void (*table_set)(struct mr_table *mrt,
+struct net *net));
 #else
 static inline void vif_device_init(struct vif_device *v,
   struct net_device *dev,
@@ -94,5 +101,15 @@ static inline void vif_device_init(struct vif_device *v,
   unsigned short get_iflink_mask)
 {
 }
+
+static inline struct mr_table *
+mr_table_alloc(struct net *net, u32 id,
+  const struct rhashtable_params *rht_params,
+  void (*expire_func)(struct timer_list *t),
+  void (*table_set)(struct mr_table *mrt,
+struct net *net))
+{
+   return NULL;
+}
 #endif
 #endif
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 78046d2..67f752f 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -351,6 +351,14 @@ static const struct rhashtable_params ipmr_rht_params = {
.automatic_shrinking = true,
 };
 
+static void ipmr_new_table_set(struct mr_table *mr,
+  struct net *net)
+{
+#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
+   list_add_tail_rcu(>list, >ipv4.mr_tables);
+#endif
+}
+
 static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 {
struct mr_table *mrt;
@@ -363,23 +371,8 @@ static struct mr_table *ipmr_new_table(struct net *net, 
u32 id)
if (mrt)
return mrt;
 
-   mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
-   if (!mrt)
-   return ERR_PTR(-ENOMEM);
-   write_pnet(>net, net);
-   mrt->id = id;
-
-   rhltable_init(>mfc_hash, _rht_params);
-   INIT_LIST_HEAD(>mfc_cache_list);
-   INIT_LIST_HEAD(>mfc_unres_queue);
-
-   timer_setup(>ipmr_expire_timer, ipmr_expire_process, 0);
-
-   mrt->mroute_reg_vif_num = -1;
-#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
-   list_add_tail_rcu(>list, >ipv4.mr_tables);
-#endif
-   return mrt;
+   return mr_table_alloc(net, id, _rht_params,
+ ipmr_expire_process, ipmr_new_table_set);
 }
 
 static void ipmr_free_table(struct mr_table *mrt)
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 22758f8..3e21a58 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -26,3 +26,30 @@ void vif_device_init(struct vif_device *v,
v->link = dev->ifindex;
 }
 EXPORT_SYMBOL(vif_device_init);
+
+struct mr_table *
+mr_table_alloc(struct net *net, u32 id,
+  const struct rhashtable_params *rht_params,
+  void (*expire_func)(struct timer_list *t),
+  void (*table_set)(struct mr_table *mrt,
+struct net *net))
+{
+   struct mr_table *mrt;
+
+   mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
+   if (!mrt)
+   return NULL;
+   mrt->id = id;
+   write_pnet(>net, net);
+
+   rhltable_init(>mfc_hash, rht_params);
+   INIT_LIST_HEAD(>mfc_cache_list);
+   INIT_LIST_HEAD(>mfc_unres_queue);
+
+   timer_setup(>ipmr_expire_timer, expire_func, 0);
+
+   mrt->mroute_reg_vif_num = -1;
+   table_set(mrt, net);
+   return mrt;
+}
+EXPORT_SYMBOL(mr_table_alloc);
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 0095a43..9b11321 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -295,6 +294,14 @@ static const struct rhashtable_params ip6mr_rht_params = {
.automatic_shrinking = true,
 };
 
+static void ip6mr_new_table_set(struct mr_table *mrt,
+   struct net *net)
+{
+#ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
+   list_add_tail_rcu(>list, >ipv6.mr_tables);
+#endif
+}
+
 static struct mr_table *ip6mr_new_table(struct net *net, u32 id)
 {
struct mr_table *mrt;
@@ -303,25 +310,8 @@ static struct mr_table *ip6mr_new_table(struct net *net, 
u32 id)
if (mrt)
return mrt;
 

[RFC net-next 10/11] ip6mr: Remove MFC_NOTIFY and refactor flags

2018-02-20 Thread Yuval Mintz
MFC_NOTIFY exists in ip6mr, probably as some legacy code
[was already removed for ipmr in commit
06bd6c0370bb ("net: ipmr: remove unused MFC_NOTIFY flag and make the flags 
enum").
Remove it from ip6mr as well, and move the enum into a common file;
Notice MFC_OFFLOAD is currently only used by ipmr.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute.h  | 9 -
 include/linux/mroute6.h | 3 ---
 include/linux/mroute_base.h | 9 +
 net/ipv6/ip6mr.c| 3 ---
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index 63b36e6..7ed82e4 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -65,15 +65,6 @@ struct vif_entry_notifier_info {
 
 #define VIFF_STATIC 0x8000
 
-/* mfc_flags:
- * MFC_STATIC - the entry was added statically (not by a routing daemon)
- * MFC_OFFLOAD - the entry was offloaded to the hardware
- */
-enum {
-   MFC_STATIC = BIT(0),
-   MFC_OFFLOAD = BIT(1),
-};
-
 struct mfc_cache_cmp_arg {
__be32 mfc_mcastgrp;
__be32 mfc_origin;
diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index 6acf576..1ac38e6 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -81,9 +81,6 @@ struct mfc6_cache {
};
 };
 
-#define MFC_STATIC 1
-#define MFC_NOTIFY 2
-
 #define MFC_ASSERT_THRESH (3*HZ)   /* Maximal freq. of asserts */
 
 struct rtmsg;
diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index edc6e6b..2054118 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -45,6 +45,15 @@ struct vif_device {
 
 #define VIF_EXISTS(_mrt, _idx) (!!((_mrt)->vif_table[_idx].dev))
 
+/* mfc_flags:
+ * MFC_STATIC - the entry was added statically (not by a routing daemon)
+ * MFC_OFFLOAD - the entry was offloaded to the hardware
+ */
+enum {
+   MFC_STATIC = BIT(0),
+   MFC_OFFLOAD = BIT(1),
+};
+
 /**
  * struct mr_mfc - common multicast routing entries
  * @mnode: rhashtable list
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index d0df242..1717638 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2208,9 +2208,6 @@ int ip6mr_get_route(struct net *net, struct sk_buff *skb, 
struct rtmsg *rtm,
return err;
}
 
-   if (rtm->rtm_flags & RTM_F_NOTIFY)
-   cache->_c.mfc_flags |= MFC_NOTIFY;
-
err = __ip6mr_fill_mroute(mrt, skb, cache, rtm);
read_unlock(_lock);
return err;
-- 
2.4.3



[RFC net-next 08/11] ipmr, ip6mr: Unite mfc seq logic

2018-02-20 Thread Yuval Mintz
With the exception of the final dump, ipmr and ip6mr have the exact same
seq logic for traversing a given mr_table. Refactor that code and make
it common.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute_base.h | 69 
 net/ipv4/ipmr.c | 93 +++
 net/ipv4/ipmr_base.c| 62 +
 net/ipv6/ip6mr.c| 97 -
 4 files changed, 143 insertions(+), 178 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 18a1d75..413f103 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -203,4 +204,72 @@ static inline void *mr_mfc_find(struct mr_table *mrt, void 
*hasharg)
 {
return mr_mfc_find_parent(mrt, hasharg, -1);
 }
+
+#ifdef CONFIG_PROC_FS
+struct mr_mfc_iter {
+   struct seq_net_private p;
+   struct mr_table *mrt;
+   struct list_head *cache;
+
+   /* Lock protecting the mr_table's unresolved queue */
+   spinlock_t *lock;
+};
+
+#ifdef CONFIG_IP_MROUTE_COMMON
+/* These actually return 'struct mr_mfc *', but to avoid need for explicit
+ * castings they simply return void.
+ */
+void *mr_mfc_seq_idx(struct net *net,
+struct mr_mfc_iter *it, loff_t pos);
+void *mr_mfc_seq_next(struct seq_file *seq, void *v,
+ loff_t *pos);
+
+static inline void *mr_mfc_seq_start(struct seq_file *seq, loff_t *pos,
+struct mr_table *mrt, spinlock_t *lock)
+{
+   struct mr_mfc_iter *it = seq->private;
+
+   it->mrt = mrt;
+   it->cache = NULL;
+   it->lock = lock;
+
+   return *pos ? mr_mfc_seq_idx(seq_file_net(seq),
+seq->private, *pos - 1)
+   : SEQ_START_TOKEN;
+}
+
+static inline void mr_mfc_seq_stop(struct seq_file *seq, void *v)
+{
+   struct mr_mfc_iter *it = seq->private;
+   struct mr_table *mrt = it->mrt;
+
+   if (it->cache == >mfc_unres_queue)
+   spin_unlock_bh(it->lock);
+   else if (it->cache == >mfc_cache_list)
+   rcu_read_unlock();
+}
+#else
+static inline void *mr_mfc_seq_idx(struct net *net,
+  struct mr_mfc_iter *it, loff_t pos)
+{
+   return NULL;
+}
+
+static inline void *mr_mfc_seq_next(struct seq_file *seq, void *v,
+   loff_t *pos)
+{
+   return NULL;
+}
+
+static inline void *mr_mfc_seq_start(struct seq_file *seq, loff_t *pos,
+struct mr_table *mrt, spinlock_t *lock)
+{
+   return NULL;
+}
+
+static inline void mr_mfc_seq_stop(struct seq_file *seq, void *v)
+{
+}
+#endif
+#endif
 #endif
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0df4dd5..0281f89 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -3014,41 +3014,8 @@ static const struct file_operations ipmr_vif_fops = {
.release = seq_release_net,
 };
 
-struct ipmr_mfc_iter {
-   struct seq_net_private p;
-   struct mr_table *mrt;
-   struct list_head *cache;
-};
-
-static struct mfc_cache *ipmr_mfc_seq_idx(struct net *net,
- struct ipmr_mfc_iter *it, loff_t pos)
-{
-   struct mr_table *mrt = it->mrt;
-   struct mr_mfc *mfc;
-
-   rcu_read_lock();
-   it->cache = >mfc_cache_list;
-   list_for_each_entry_rcu(mfc, >mfc_cache_list, list)
-   if (pos-- == 0)
-   return (struct mfc_cache *)mfc;
-   rcu_read_unlock();
-
-   spin_lock_bh(_unres_lock);
-   it->cache = >mfc_unres_queue;
-   list_for_each_entry(mfc, it->cache, list)
-   if (pos-- == 0)
-   return (struct mfc_cache *)mfc;
-
-   spin_unlock_bh(_unres_lock);
-
-   it->cache = NULL;
-   return NULL;
-}
-
-
 static void *ipmr_mfc_seq_start(struct seq_file *seq, loff_t *pos)
 {
-   struct ipmr_mfc_iter *it = seq->private;
struct net *net = seq_file_net(seq);
struct mr_table *mrt;
 
@@ -3056,57 +3023,7 @@ static void *ipmr_mfc_seq_start(struct seq_file *seq, 
loff_t *pos)
if (!mrt)
return ERR_PTR(-ENOENT);
 
-   it->mrt = mrt;
-   it->cache = NULL;
-   return *pos ? ipmr_mfc_seq_idx(net, seq->private, *pos - 1)
-   : SEQ_START_TOKEN;
-}
-
-static void *ipmr_mfc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
-{
-   struct ipmr_mfc_iter *it = seq->private;
-   struct net *net = seq_file_net(seq);
-   struct mr_table *mrt = it->mrt;
-   struct mfc_cache *mfc = v;
-
-   ++*pos;
-
-   if (v == SEQ_START_TOKEN)
-   return ipmr_mfc_seq_idx(net, seq->private, 0);
-
-   if (mfc->_c.list.next != it->cache)
-   return (struct mfc_cache *)(list_entry(mfc->_c.list.next,
- 

[RFC net-next 01/11] ipmr,ipmr6: Define a uniform vif_device

2018-02-20 Thread Yuval Mintz
The two implementations have almost identical structures - vif_device and
mif_device. As a step toward uniforming the mr_tables, eliminate the
mif_device and relocate the vif_device definition into a new common
header file.

Also, introduce a common initializing function for setting most of the
vif_device fields in a new common source file. This requires modifying
the ipv{4,6] Kconfig and ipv4 makefile as we're introducing a new common
config option - CONFIG_IP_MROUTE_COMMON.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute.h  | 13 +---
 include/linux/mroute6.h | 11 +-
 include/linux/mroute_base.h | 52 +
 net/ipv4/Kconfig|  5 +
 net/ipv4/Makefile   |  1 +
 net/ipv4/ipmr.c | 32 +---
 net/ipv4/ipmr_base.c| 28 
 net/ipv6/Kconfig|  1 +
 net/ipv6/ip6mr.c| 37 
 9 files changed, 117 insertions(+), 63 deletions(-)
 create mode 100644 include/linux/mroute_base.h
 create mode 100644 net/ipv4/ipmr_base.c

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index 5396521..b8aadff 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_IP_MROUTE
 static inline int ip_mroute_opt(int opt)
@@ -56,18 +57,6 @@ static inline bool ipmr_rule_default(const struct fib_rule 
*rule)
 }
 #endif
 
-struct vif_device {
-   struct net_device   *dev;   /* Device we are using 
*/
-   struct netdev_phys_item_id dev_parent_id;   /* Device parent ID
*/
-   unsigned long   bytes_in,bytes_out;
-   unsigned long   pkt_in,pkt_out; /* Statistics   
*/
-   unsigned long   rate_limit; /* Traffic shaping (NI) 
*/
-   unsigned char   threshold;  /* TTL threshold
*/
-   unsigned short  flags;  /* Control flags
*/
-   __be32  local,remote;   /* Addresses(remote for 
tunnels)*/
-   int link;   /* Physical interface index 
*/
-};
-
 struct vif_entry_notifier_info {
struct fib_notifier_info info;
struct net_device *dev;
diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index 3014c52..e5e5b82 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -7,6 +7,7 @@
 #include   /* for struct sk_buff_head */
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_IPV6_MROUTE
 static inline int ip6_mroute_opt(int opt)
@@ -62,16 +63,6 @@ static inline void ip6_mr_cleanup(void)
 }
 #endif
 
-struct mif_device {
-   struct net_device   *dev;   /* Device we are using 
*/
-   unsigned long   bytes_in,bytes_out;
-   unsigned long   pkt_in,pkt_out; /* Statistics   
*/
-   unsigned long   rate_limit; /* Traffic shaping (NI) 
*/
-   unsigned char   threshold;  /* TTL threshold
*/
-   unsigned short  flags;  /* Control flags
*/
-   int link;   /* Physical interface index 
*/
-};
-
 #define VIFF_STATIC 0x8000
 
 struct mfc6_cache {
diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
new file mode 100644
index 000..0de651e
--- /dev/null
+++ b/include/linux/mroute_base.h
@@ -0,0 +1,52 @@
+#ifndef __LINUX_MROUTE_BASE_H
+#define __LINUX_MROUTE_BASE_H
+
+#include 
+
+/**
+ * struct vif_device - interface representor for multicast routing
+ * @dev: network device being used
+ * @bytes_in: statistic; bytes ingressing
+ * @bytes_out: statistic; bytes egresing
+ * @pkt_in: statistic; packets ingressing
+ * @pkt_out: statistic; packets egressing
+ * @rate_limit: Traffic shaping (NI)
+ * @threshold: TTL threshold
+ * @flags: Control flags
+ * @link: Physical interface index
+ * @dev_parent_id: device parent id
+ * @local: Local address
+ * @remote: Remote address for tunnels
+ */
+struct vif_device {
+   struct net_device *dev;
+   unsigned long bytes_in, bytes_out;
+   unsigned long pkt_in, pkt_out;
+   unsigned long rate_limit;
+   unsigned char threshold;
+   unsigned short flags;
+   int link;
+
+   /* Currently only used by ipmr */
+   struct netdev_phys_item_id dev_parent_id;
+   __be32 local, remote;
+};
+
+#ifdef CONFIG_IP_MROUTE_COMMON
+void vif_device_init(struct vif_device *v,
+struct net_device *dev,
+unsigned long rate_limit,
+unsigned char threshold,
+unsigned short flags,
+unsigned short get_iflink_mask);
+#else
+static inline void vif_device_init(struct vif_device *v,
+  struct net_device *dev,
+ 

[RFC net-next 02/11] ip6mr: Make mroute_sk rcu-based

2018-02-20 Thread Yuval Mintz
In ipmr the mr_table socket is handled under RCU. Introduce the same
for ip6mr.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute6.h |  6 +++---
 net/ipv6/ip6_output.c   |  2 +-
 net/ipv6/ip6mr.c| 43 ++-
 3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index e5e5b82..e1b9fb0 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -111,12 +111,12 @@ extern int ip6mr_get_route(struct net *net, struct 
sk_buff *skb,
   struct rtmsg *rtm, u32 portid);
 
 #ifdef CONFIG_IPV6_MROUTE
-extern struct sock *mroute6_socket(struct net *net, struct sk_buff *skb);
+bool mroute6_is_socket(struct net *net, struct sk_buff *skb);
 extern int ip6mr_sk_done(struct sock *sk);
 #else
-static inline struct sock *mroute6_socket(struct net *net, struct sk_buff *skb)
+static inline bool mroute6_is_socket(struct net *net, struct sk_buff *skb)
 {
-   return NULL;
+   return false;
 }
 static inline int ip6mr_sk_done(struct sock *sk)
 {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 997c7f1..a6eb0e6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -71,7 +71,7 @@ static int ip6_finish_output2(struct net *net, struct sock 
*sk, struct sk_buff *
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
-   ((mroute6_socket(net, skb) &&
+   ((mroute6_is_socket(net, skb) &&
 !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
 ipv6_chk_mcast_addr(dev, _hdr(skb)->daddr,
 _hdr(skb)->saddr))) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e397990..7792fc5 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -58,7 +58,7 @@ struct mr6_table {
struct list_headlist;
possible_net_t  net;
u32 id;
-   struct sock *mroute6_sk;
+   struct sock __rcu   *mroute6_sk;
struct timer_list   ipmr_expire_timer;
struct list_headmfc6_unres_queue;
struct list_headmfc6_cache_array[MFC6_LINES];
@@ -1121,6 +1121,7 @@ static void ip6mr_cache_resolve(struct net *net, struct 
mr6_table *mrt,
 static int ip6mr_cache_report(struct mr6_table *mrt, struct sk_buff *pkt,
  mifi_t mifi, int assert)
 {
+   struct sock *mroute6_sk;
struct sk_buff *skb;
struct mrt6msg *msg;
int ret;
@@ -1190,17 +1191,19 @@ static int ip6mr_cache_report(struct mr6_table *mrt, 
struct sk_buff *pkt,
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
 
-   if (!mrt->mroute6_sk) {
+   rcu_read_lock();
+   mroute6_sk = rcu_dereference(mrt->mroute6_sk);
+   if (!mroute6_sk) {
+   rcu_read_unlock();
kfree_skb(skb);
return -EINVAL;
}
 
mrt6msg_netlink_event(mrt, skb);
 
-   /*
-*  Deliver to user space multicast routing algorithms
-*/
+   /* Deliver to user space multicast routing algorithms */
ret = sock_queue_rcv_skb(mrt->mroute6_sk, skb);
+   rcu_read_unlock();
if (ret < 0) {
net_warn_ratelimited("mroute6: pending queue full, dropping 
entries\n");
kfree_skb(skb);
@@ -1584,11 +1587,11 @@ static int ip6mr_sk_init(struct mr6_table *mrt, struct 
sock *sk)
 
rtnl_lock();
write_lock_bh(_lock);
-   if (likely(mrt->mroute6_sk == NULL)) {
-   mrt->mroute6_sk = sk;
-   net->ipv6.devconf_all->mc_forwarding++;
-   } else {
+   if (rtnl_dereference(mrt->mroute6_sk)) {
err = -EADDRINUSE;
+   } else {
+   rcu_assign_pointer(mrt->mroute6_sk, sk);
+   net->ipv6.devconf_all->mc_forwarding++;
}
write_unlock_bh(_lock);
 
@@ -1614,9 +1617,9 @@ int ip6mr_sk_done(struct sock *sk)
 
rtnl_lock();
ip6mr_for_each_table(mrt, net) {
-   if (sk == mrt->mroute6_sk) {
+   if (sk == rtnl_dereference(mrt->mroute6_sk)) {
write_lock_bh(_lock);
-   mrt->mroute6_sk = NULL;
+   RCU_INIT_POINTER(mrt->mroute6_sk, NULL);
net->ipv6.devconf_all->mc_forwarding--;
write_unlock_bh(_lock);
inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
@@ -1630,11 +1633,12 @@ int ip6mr_sk_done(struct sock *sk)
}
}
rtnl_unlock();
+   synchronize_rcu();
 
return err;
 }
 
-struct sock *mroute6_socket(struct net *net, struct sk_buff *skb)
+bool mroute6_is_socket(struct net *net, struct sk_buff *skb)
 {
struct mr6_table *mrt;
struct flowi6 fl6 = {
@@ -1646,8 +1650,9 @@ 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jakub Kicinski
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
> 
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exacly the same way in both?

IMHO netvsc is a vendor specific driver which made a mistake on what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space.  I think it may very well get done in next versions of NM,
but isn't done yet.  Stephen also raised the point that not everybody is
using NM.


[RFC net-next 07/11] ipmr, ip6mr: Unite logic for searching in MFC cache

2018-02-20 Thread Yuval Mintz
ipmr and ip6mr utilize the exact same methods for searching the
hashed resolved connections, difference being only in the construction
of the hash comparison key.

In order to unite the flow, introduce an mr_table operation set that
would contain the protocol specific information required for common
flows, in this case - the hash parameters and a comparison key
representing a (*,*) route.

Signed-off-by: Yuval Mintz 
---
 include/linux/mroute_base.h | 52 +--
 net/ipv4/ipmr.c | 71 ++-
 net/ipv4/ipmr_base.c| 54 +++--
 net/ipv6/ip6mr.c| 74 +++--
 4 files changed, 134 insertions(+), 117 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 2769e2f..18a1d75 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -89,10 +89,23 @@ struct mr_mfc {
struct rcu_head rcu;
 };
 
+struct mr_table;
+
+/**
+ * struct mr_table_ops - callbacks and info for protocol-specific ops
+ * rht_params: parameters for accessing the MFC hash
+ * cmparg_any: a hash key to be used for matching on (*,*) routes
+ */
+struct mr_table_ops {
+   const struct rhashtable_params *rht_params;
+   void *cmparg_any;
+};
+
 /**
  * struct mr_table - a multicast routing table
  * @list: entry within a list of multicast routing tables
  * @net: net where this table belongs
+ * @op: protocol specific operations
  * @id: identifier of the table
  * @mroute_sk: socket associated with the table
  * @ipmr_expire_timer: timer for handling unresolved routes
@@ -109,6 +122,7 @@ struct mr_mfc {
 struct mr_table {
struct list_headlist;
possible_net_t  net;
+   struct mr_table_ops ops;
u32 id;
struct sock __rcu   *mroute_sk;
struct timer_list   ipmr_expire_timer;
@@ -133,10 +147,19 @@ void vif_device_init(struct vif_device *v,
 
 struct mr_table *
 mr_table_alloc(struct net *net, u32 id,
-  const struct rhashtable_params *rht_params,
+  struct mr_table_ops *ops,
   void (*expire_func)(struct timer_list *t),
   void (*table_set)(struct mr_table *mrt,
 struct net *net));
+
+/* These actually return 'struct mr_mfc *', but to avoid need for explicit
+ * castings they simply return void.
+ */
+void *mr_mfc_find_parent(struct mr_table *mrt,
+void *hasharg, int parent);
+void *mr_mfc_find_any_parent(struct mr_table *mrt, int vifi);
+void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg);
+
 #else
 static inline void vif_device_init(struct vif_device *v,
   struct net_device *dev,
@@ -147,14 +170,37 @@ static inline void vif_device_init(struct vif_device *v,
 {
 }
 
-static inline struct mr_table *
+static inline void *
 mr_table_alloc(struct net *net, u32 id,
-  const struct rhashtable_params *rht_params,
+  struct mr_table_ops *ops,
   void (*expire_func)(struct timer_list *t),
   void (*table_set)(struct mr_table *mrt,
 struct net *net))
 {
return NULL;
 }
+
+static inline void *mr_mfc_find_parent(struct mr_table *mrt,
+  void *hasharg, int parent)
+{
+   return NULL;
+}
+
+static inline void *mr_mfc_find_any_parent(struct mr_table *mrt,
+  int vifi)
+{
+   return NULL;
+}
+
+static inline struct mr_mfc *mr_mfc_find_any(struct mr_table *mrt,
+int vifi, void *hasharg)
+{
+   return NULL;
+}
 #endif
+
+static inline void *mr_mfc_find(struct mr_table *mrt, void *hasharg)
+{
+   return mr_mfc_find_parent(mrt, hasharg, -1);
+}
 #endif
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a47d061..0df4dd5 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -359,6 +359,16 @@ static void ipmr_new_table_set(struct mr_table *mr,
 #endif
 }
 
+static struct mfc_cache_cmp_arg ipmr_mr_table_ops_cmparg_any = {
+   .mfc_mcastgrp = htonl(INADDR_ANY),
+   .mfc_origin = htonl(INADDR_ANY),
+};
+
+static struct mr_table_ops ipmr_mr_table_ops = {
+   .rht_params = _rht_params,
+   .cmparg_any = _mr_table_ops_cmparg_any,
+};
+
 static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 {
struct mr_table *mrt;
@@ -371,7 +381,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 
id)
if (mrt)
return mrt;
 
-   return mr_table_alloc(net, id, _rht_params,
+   return mr_table_alloc(net, id, _mr_table_ops,
  ipmr_expire_process, ipmr_new_table_set);
 }
 
@@ -972,33 +982,8 @@ static struct mfc_cache *ipmr_cache_find(struct mr_table 
*mrt,
.mfc_mcastgrp = 

Re: [PATCH] bpf: hide a possibly unused variable

2018-02-20 Thread Daniel Borkmann
On 02/20/2018 11:08 PM, Arnd Bergmann wrote:
> On Tue, Feb 20, 2018 at 10:44 PM, Daniel Borkmann  
> wrote:
>> Hi Arnd,
>>
>> On 02/20/2018 10:16 PM, Arnd Bergmann wrote:
>>> The only user of this variable is inside of an #ifdef, causing
>>> a warning without CONFIG_INET:
>>>
>>> net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
>>> net/core/filter.c:3382:6: error: unused variable 'val' 
>>> [-Werror=unused-variable]
>>>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>>>
>>> This adds the same #ifdef around the declaration.
>>>
>>> Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
>>> Signed-off-by: Arnd Bergmann 
>>> ---
>>>  net/core/filter.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 08ab4c65a998..c3dc6d60b4bb 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -3379,7 +3379,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct 
>>> bpf_sock_ops_kern *, bpf_sock,
>>>  int, argval)
>>>  {
>>>   struct sock *sk = bpf_sock->sk;
>>> +#ifdef CONFIG_INET
>>>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>>> +#endif
>>
>> Looks good, thanks for the fix!
>>
>> Could you move the existing '#ifdef CONFIG_INET' to the beginning of
>> the function given the only error in case of !CONFIG_INET is -EINVAL
>> anyway? That would at least not increase the ifdef ugliness further.
> 
> Sure, sent a new version now. I decided to clean up that #ifdef
> check as well, since a IS_ENABLED() check is nicer anway.

Looks great, thanks! I'll get it into bpf once the pending pr got pulled.


Re: [PATCH] bpf: hide a possibly unused variable

2018-02-20 Thread Arnd Bergmann
On Tue, Feb 20, 2018 at 10:44 PM, Daniel Borkmann  wrote:
> Hi Arnd,
>
> On 02/20/2018 10:16 PM, Arnd Bergmann wrote:
>> The only user of this variable is inside of an #ifdef, causing
>> a warning without CONFIG_INET:
>>
>> net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
>> net/core/filter.c:3382:6: error: unused variable 'val' 
>> [-Werror=unused-variable]
>>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>>
>> This adds the same #ifdef around the declaration.
>>
>> Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
>> Signed-off-by: Arnd Bergmann 
>> ---
>>  net/core/filter.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 08ab4c65a998..c3dc6d60b4bb 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -3379,7 +3379,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct 
>> bpf_sock_ops_kern *, bpf_sock,
>>  int, argval)
>>  {
>>   struct sock *sk = bpf_sock->sk;
>> +#ifdef CONFIG_INET
>>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>> +#endif
>
> Looks good, thanks for the fix!
>
> Could you move the existing '#ifdef CONFIG_INET' to the beginning of
> the function given the only error in case of !CONFIG_INET is -EINVAL
> anyway? That would at least not increase the ifdef ugliness further.

Sure, sent a new version now. I decided to clean up that #ifdef
check as well, since a IS_ENABLED() check is nicer anway.

   Arnd


[PATCH] bpf: clean up unused-variable warning

2018-02-20 Thread Arnd Bergmann
The only user of this variable is inside of an #ifdef, causing
a warning without CONFIG_INET:

net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable]
  int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;

This replaces the #ifdef with a nicer IS_ENABLED() check that
makes the code more readable and avoids the warning.

Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
Signed-off-by: Arnd Bergmann 
---
 net/core/filter.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 08ab4c65a998..0c121adbdbaa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3381,17 +3381,13 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct 
bpf_sock_ops_kern *, bpf_sock,
struct sock *sk = bpf_sock->sk;
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
 
-   if (!sk_fullsock(sk))
+   if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
return -EINVAL;
 
-#ifdef CONFIG_INET
if (val)
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
 
return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
-#else
-   return -EINVAL;
-#endif
 }
 
 static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
-- 
2.9.0



Re: [PATCH] bpf: hide a possibly unused variable

2018-02-20 Thread Daniel Borkmann
Hi Arnd,

On 02/20/2018 10:16 PM, Arnd Bergmann wrote:
> The only user of this variable is inside of an #ifdef, causing
> a warning without CONFIG_INET:
> 
> net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
> net/core/filter.c:3382:6: error: unused variable 'val' 
> [-Werror=unused-variable]
>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> 
> This adds the same #ifdef around the declaration.
> 
> Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
> Signed-off-by: Arnd Bergmann 
> ---
>  net/core/filter.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 08ab4c65a998..c3dc6d60b4bb 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3379,7 +3379,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct 
> bpf_sock_ops_kern *, bpf_sock,
>  int, argval)
>  {
>   struct sock *sk = bpf_sock->sk;
> +#ifdef CONFIG_INET
>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> +#endif

Looks good, thanks for the fix!

Could you move the existing '#ifdef CONFIG_INET' to the beginning of
the function given the only error in case of !CONFIG_INET is -EINVAL
anyway? That would at least not increase the ifdef ugliness further.

Thanks a lot,
Daniel

>   if (!sk_fullsock(sk))
>   return -EINVAL;
> 



[PATCH iproute2-next v2 3/7] iplink: Use "dev" and "name" parameters interchangeable when possible

2018-02-20 Thread Serhey Popovych
Both of them accept network device name as argument, but have different
meaning:

  dev  - is a device by it's name,
  name - name for specific device device.

The only case where they treated separately is network device rename
case where need to specify both ifindex and new name. In rest of the
cases we can assume that dev == name.

With this change we do following:

  1) Kill ambiguity with both "dev" and "name" parameters given the same
 name:

   ip link {add|set} dev veth100a name veth100a ...

  2) Make sure we do not accept "name" more than once.

  3) For VF and XDP treat "name" as "dev". Fail in case of "dev" is
 given after VF and/or XDP parsing.

  4) Make veth and vxcan to accept both "name" and "dev" as their peer
 parameters, effectively following general ip-link(8) utility
 behaviour on link create:

   ip link add {name|dev} veth1a type veth peer {name|dev} veth1b

Signed-off-by: Serhey Popovych 
---
 ip/iplink.c |   34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index fc358fc..1359c0f 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -605,9 +605,15 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
req->i.ifi_flags &= ~IFF_UP;
} else if (strcmp(*argv, "name") == 0) {
NEXT_ARG();
+   if (*name)
+   duparg("name", *argv);
if (check_ifname(*argv))
invarg("\"name\" not a valid ifname", *argv);
*name = *argv;
+   if (!*dev) {
+   *dev = *name;
+   dev_index = ll_name_to_index(*dev);
+   }
} else if (strcmp(*argv, "index") == 0) {
NEXT_ARG();
if (*index)
@@ -665,6 +671,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
if (xdp_parse(, , req, dev_index,
  generic, drv, offload))
exit(-1);
+
+   if (offload && *name == *dev)
+   *dev = NULL;
} else if (strcmp(*argv, "netns") == 0) {
NEXT_ARG();
if (netns != -1)
@@ -755,6 +764,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
if (len < 0)
return -1;
addattr_nest_end(>n, vflist);
+
+   if (*name == *dev)
+   *dev = NULL;
} else if (matches(*argv, "master") == 0) {
int ifindex;
 
@@ -905,7 +917,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
 
if (strcmp(*argv, "dev") == 0)
NEXT_ARG();
-   if (*dev)
+   if (*dev != *name)
duparg2("dev", *argv);
if (check_ifname(*argv))
invarg("\"dev\" not a valid ifname", *argv);
@@ -915,6 +927,14 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
argc--; argv++;
}
 
+   /* Allow "ip link add dev" and "ip link add name" */
+   if (!*name)
+   *name = *dev;
+   else if (!*dev)
+   *dev = *name;
+   else if (!strcmp(*name, *dev))
+   *name = *dev;
+
if (dev_index && addr_len) {
int halen = nl_get_ll_addr_len(dev_index);
 
@@ -993,10 +1013,16 @@ static int iplink_modify(int cmd, unsigned int flags, 
int argc, char **argv)
req.i.ifi_index = ll_name_to_index(dev);
if (!req.i.ifi_index)
return nodev(dev);
+
+   /* Not renaming to the same name */
+   if (name == dev)
+   name = NULL;
} else {
-   /* Allow "ip link add dev" and "ip link add name" */
-   if (!name)
-   name = dev;
+   if (name != dev) {
+   fprintf(stderr,
+   "both \"name\" and \"dev\" cannot be used when 
creating devices.\n");
+   exit(-1);
+   }
 
if (link) {
int ifindex;
-- 
1.7.10.4



[PATCH iproute2-next v2 2/7] iplink: Correctly report error when network device isn't found

2018-02-20 Thread Serhey Popovych
Distinguish cases when "dev" parameter isn't given from cases where no
network device corresponding to "dev" is found.

Do not check for index validity in xdp_parse(): caller should take care
of this because has more information (e.g. when "dev" is given or not
found) for this.

Signed-off-by: Serhey Popovych 
---
 ip/iplink.c |   16 +---
 ip/iplink_xdp.c |7 +--
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 5471626..fc358fc 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -569,6 +569,14 @@ static int iplink_parse_vf(int vf, int *argcp, char 
***argvp,
return 0;
 }
 
+static void has_dev(const char *dev, int dev_index)
+{
+   if (!dev)
+   missarg("dev");
+   if (!dev_index)
+   exit(nodev(dev));
+}
+
 int iplink_parse(int argc, char **argv, struct iplink_req *req,
 char **name, char **type, char **link, char **dev,
 int *group, int *index)
@@ -650,6 +658,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
bool drv = strcmp(*argv, "xdpdrv") == 0;
bool offload = strcmp(*argv, "xdpoffload") == 0;
 
+   if (offload)
+   has_dev(*dev, dev_index);
+
NEXT_ARG();
if (xdp_parse(, , req, dev_index,
  generic, drv, offload))
@@ -732,15 +743,14 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
} else if (strcmp(*argv, "vf") == 0) {
struct rtattr *vflist;
 
+   has_dev(*dev, dev_index);
+
NEXT_ARG();
if (get_integer(,  *argv, 0))
invarg("Invalid \"vf\" value\n", *argv);
 
vflist = addattr_nest(>n, sizeof(*req),
  IFLA_VFINFO_LIST);
-   if (dev_index == 0)
-   missarg("dev");
-
len = iplink_parse_vf(vf, , , req, dev_index);
if (len < 0)
return -1;
diff --git a/ip/iplink_xdp.c b/ip/iplink_xdp.c
index 8382635..3df38b8 100644
--- a/ip/iplink_xdp.c
+++ b/ip/iplink_xdp.c
@@ -55,17 +55,12 @@ int xdp_parse(int *argc, char ***argv, struct iplink_req 
*req, __u32 ifindex,
.type = BPF_PROG_TYPE_XDP,
.argc = *argc,
.argv = *argv,
+   .ifindex = ifindex,
};
struct xdp_req xdp = {
.req = req,
};
 
-   if (offload) {
-   if (!ifindex)
-   incomplete_command();
-   cfg.ifindex = ifindex;
-   }
-
if (!force)
xdp.flags |= XDP_FLAGS_UPDATE_IF_NOEXIST;
if (generic)
-- 
1.7.10.4



[PATCH iproute2-next v2 5/7] iplink: Perform most of request buffer setups and checks in iplink_parse()

2018-02-20 Thread Serhey Popovych
To benefit other users (e.g. link_veth.c) of iplink_parse() from
additional attribute checks and setups made in iplink_modify().

This catches most of weired cobination of parameters to peer device
configuration in iplink_vxcan.c and link_veth.c. Use memcpy() to save
peer device @ifm data structure as now ->ifi_index is set inside
iplist_parse(): it will be inlined by the compiler anyway.

Drop @link, @group and @index from iplink_parse() parameters list: they
are not needed outside.

While there change return -1 to exit(-1) for group parsing errors: we
want to stop further command processing unless -force option is given
to get error line easily.

Signed-off-by: Serhey Popovych 
---
 ip/ip_common.h|3 +-
 ip/iplink.c   |  118 +
 ip/iplink_vxcan.c |   27 +++-
 ip/link_veth.c|   27 +++-
 4 files changed, 69 insertions(+), 106 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index e4e628b..f762821 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -133,8 +133,7 @@ struct link_util {
 struct link_util *get_link_kind(const char *kind);
 
 int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **link, char **dev,
-int *group, int *index);
+char **name, char **type, char **dev);
 
 /* iplink_bridge.c */
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
diff --git a/ip/iplink.c b/ip/iplink.c
index 4e9f571..e53d890 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -578,9 +578,9 @@ static void has_dev(const char *dev, int dev_index)
 }
 
 int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **link, char **dev,
-int *group, int *index)
+char **name, char **type, char **dev)
 {
+   char *link = NULL;
int ret, len;
char abuf[32];
int qlen = -1;
@@ -591,9 +591,10 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
int numrxqueues = -1;
int dev_index = 0;
int link_netnsid = -1;
+   int index = 0;
+   int group = -1;
int addr_len = 0;
 
-   *group = -1;
ret = argc;
 
while (argc > 0) {
@@ -616,14 +617,14 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
}
} else if (strcmp(*argv, "index") == 0) {
NEXT_ARG();
-   if (*index)
+   if (index)
duparg("index", *argv);
-   *index = atoi(*argv);
-   if (*index <= 0)
+   index = atoi(*argv);
+   if (index <= 0)
invarg("Invalid \"index\" value", *argv);
} else if (matches(*argv, "link") == 0) {
NEXT_ARG();
-   *link = *argv;
+   link = *argv;
} else if (matches(*argv, "address") == 0) {
NEXT_ARG();
addr_len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
@@ -816,10 +817,11 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
  *argv, len);
} else if (strcmp(*argv, "group") == 0) {
NEXT_ARG();
-   if (*group != -1)
+   if (group != -1)
duparg("group", *argv);
-   if (rtnl_group_a2n(group, *argv))
+   if (rtnl_group_a2n(, *argv))
invarg("Invalid \"group\" value\n", *argv);
+   addattr32(>n, sizeof(*req), IFLA_GROUP, group);
} else if (strcmp(*argv, "mode") == 0) {
int mode;
 
@@ -946,80 +948,47 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
}
}
 
-   return ret - argc;
-}
-
-static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
-{
-   char *dev = NULL;
-   char *name = NULL;
-   char *link = NULL;
-   char *type = NULL;
-   int index = 0;
-   int group;
-   struct link_util *lu = NULL;
-   struct iplink_req req = {
-   .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
-   .n.nlmsg_flags = NLM_F_REQUEST | flags,
-   .n.nlmsg_type = cmd,
-   .i.ifi_family = preferred_family,
-   };
-   int ret;
-
-   ret = iplink_parse(argc, argv,
-  , , , , , , );
-   if (ret < 0)
-   return ret;
-
-   argc -= ret;
-   argv += ret;
-
-   if (!(flags & NLM_F_CREATE) && index) {
+   if (!(req->n.nlmsg_flags & NLM_F_CREATE) && index) {
fprintf(stderr,
   

[PATCH iproute2-next v2 7/7] iplink: Reduce number of arguments to iplink_parse()

2018-02-20 Thread Serhey Popovych
Introduce new @struct iplink_parse_args data structure to consolidate
arguments to iplink_parse(). This will reduce number of arguments
passed to it.

Pass this data structure to ->parse_opt() in iplink specific modules:
it may be used to get network device name and other information.

Signed-off-by: Serhey Popovych 
---
 ip/ip_common.h   |   16 +---
 ip/iplink.c  |   34 ++
 ip/iplink_bond.c |4 +++-
 ip/iplink_bond_slave.c   |4 +++-
 ip/iplink_bridge.c   |4 +++-
 ip/iplink_bridge_slave.c |4 +++-
 ip/iplink_can.c  |4 +++-
 ip/iplink_geneve.c   |4 +++-
 ip/iplink_hsr.c  |4 +++-
 ip/iplink_ipoib.c|4 +++-
 ip/iplink_ipvlan.c   |4 +++-
 ip/iplink_macvlan.c  |4 +++-
 ip/iplink_vlan.c |4 +++-
 ip/iplink_vrf.c  |5 -
 ip/iplink_vxcan.c|   14 ++
 ip/iplink_vxlan.c|4 +++-
 ip/ipmacsec.c|4 +++-
 ip/link_gre.c|6 --
 ip/link_gre6.c   |6 --
 ip/link_ip6tnl.c |6 --
 ip/link_iptnl.c  |6 --
 ip/link_veth.c   |   14 ++
 ip/link_vti.c|6 --
 ip/link_vti6.c   |6 --
 24 files changed, 114 insertions(+), 57 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index f762821..aef70de 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -112,12 +112,23 @@ struct iplink_req {
charbuf[1024];
 };
 
+struct iplink_parse_args {
+   const char *dev;
+   const char *name;
+   const char *type;
+
+   /* This definitely must be the last one and initialized
+* by the caller of iplink_parse() that will initialize rest.
+*/
+   struct iplink_req *req;
+};
+
 struct link_util {
struct link_util*next;
const char  *id;
int maxattr;
int (*parse_opt)(struct link_util *, int, char **,
-struct nlmsghdr *);
+struct iplink_parse_args *);
void(*print_opt)(struct link_util *, FILE *,
 struct rtattr *[]);
void(*print_xstats)(struct link_util *, FILE *,
@@ -132,8 +143,7 @@ struct link_util {
 
 struct link_util *get_link_kind(const char *kind);
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **dev);
+int iplink_parse(int argc, char **argv, struct iplink_parse_args *pa);
 
 /* iplink_bridge.c */
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
diff --git a/ip/iplink.c b/ip/iplink.c
index e53d890..837e2b0 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -577,9 +578,12 @@ static void has_dev(const char *dev, int dev_index)
exit(nodev(dev));
 }
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **dev)
+int iplink_parse(int argc, char **argv, struct iplink_parse_args *pa)
 {
+   struct iplink_req *req = pa->req;
+   const char **dev = >dev;
+   const char **name = >name;
+   const char **type = >type;
char *link = NULL;
int ret, len;
char abuf[32];
@@ -597,6 +601,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
 
ret = argc;
 
+   memset(pa, 0, offsetof(struct iplink_parse_args, req));
+
while (argc > 0) {
if (strcmp(*argv, "up") == 0) {
req->i.ifi_change |= IFF_UP;
@@ -1016,32 +1022,32 @@ out:
 
 static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
-   char *dev = NULL;
-   char *name = NULL;
-   char *type = NULL;
struct iplink_req req = {
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
.n.nlmsg_flags = NLM_F_REQUEST | flags,
.n.nlmsg_type = cmd,
.i.ifi_family = preferred_family,
};
+   struct iplink_parse_args pa;
int ret;
 
-   ret = iplink_parse(argc, argv, , , , );
+   pa.req = 
+
+   ret = iplink_parse(argc, argv, );
if (ret < 0)
return ret;
 
-   if (type) {
+   if (pa.type) {
struct link_util *lu;
struct rtattr *linkinfo;
-   char *ulinep = strchr(type, '_');
+   char *ulinep = strchr(pa.type, '_');
int iflatype;
 
linkinfo = addattr_nest(, sizeof(req), IFLA_LINKINFO);
-   addattr_l(, sizeof(req), IFLA_INFO_KIND, type,
-strlen(type));
+   addattr_l(, sizeof(req), 

[PATCH iproute2-next v2 6/7] iplink: Move data structures to block of their users

2018-02-20 Thread Serhey Popovych
This will consolidate data and code using it in single place and prepare
for upcoming ->parse_opt() method change.

Signed-off-by: Serhey Popovych 
---
 ip/link_gre.c|   32 
 ip/link_gre6.c   |   32 
 ip/link_ip6tnl.c |   32 
 ip/link_iptnl.c  |   32 
 ip/link_vti.c|   32 
 ip/link_vti6.c   |   32 
 6 files changed, 96 insertions(+), 96 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index bc1cee8..6654525 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -66,22 +66,6 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
 static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 struct nlmsghdr *n)
 {
-   struct ifinfomsg *ifi = NLMSG_DATA(n);
-   struct {
-   struct nlmsghdr n;
-   struct ifinfomsg i;
-   } req = {
-   .n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
-   .n.nlmsg_flags = NLM_F_REQUEST,
-   .n.nlmsg_type = RTM_GETLINK,
-   .i.ifi_family = preferred_family,
-   .i.ifi_index = ifi->ifi_index,
-   };
-   struct nlmsghdr *answer;
-   struct rtattr *tb[IFLA_MAX + 1];
-   struct rtattr *linkinfo[IFLA_INFO_MAX+1];
-   struct rtattr *greinfo[IFLA_GRE_MAX + 1];
-   int len;
__u16 iflags = 0;
__u16 oflags = 0;
__be32 ikey = 0;
@@ -107,7 +91,23 @@ static int gre_parse_opt(struct link_util *lu, int argc, 
char **argv,
inet_prefix_reset();
 
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
+   struct ifinfomsg *ifi = NLMSG_DATA(n);
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsg i;
+   } req = {
+   .n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
+   .n.nlmsg_flags = NLM_F_REQUEST,
+   .n.nlmsg_type = RTM_GETLINK,
+   .i.ifi_family = preferred_family,
+   .i.ifi_index = ifi->ifi_index,
+   };
+   struct nlmsghdr *answer;
+   struct rtattr *tb[IFLA_MAX + 1];
+   struct rtattr *linkinfo[IFLA_INFO_MAX+1];
+   struct rtattr *greinfo[IFLA_GRE_MAX + 1];
const struct rtattr *rta;
+   int len;
 
if (rtnl_talk(, , ) < 0) {
 get_failed:
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 83c61e2..a92854d 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -77,22 +77,6 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
 static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 struct nlmsghdr *n)
 {
-   struct ifinfomsg *ifi = NLMSG_DATA(n);
-   struct {
-   struct nlmsghdr n;
-   struct ifinfomsg i;
-   } req = {
-   .n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
-   .n.nlmsg_flags = NLM_F_REQUEST,
-   .n.nlmsg_type = RTM_GETLINK,
-   .i.ifi_family = preferred_family,
-   .i.ifi_index = ifi->ifi_index,
-   };
-   struct nlmsghdr *answer;
-   struct rtattr *tb[IFLA_MAX + 1];
-   struct rtattr *linkinfo[IFLA_INFO_MAX+1];
-   struct rtattr *greinfo[IFLA_GRE_MAX + 1];
-   int len;
__u16 iflags = 0;
__u16 oflags = 0;
__be32 ikey = 0;
@@ -118,7 +102,23 @@ static int gre_parse_opt(struct link_util *lu, int argc, 
char **argv,
inet_prefix_reset();
 
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
+   struct ifinfomsg *ifi = NLMSG_DATA(n);
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsg i;
+   } req = {
+   .n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
+   .n.nlmsg_flags = NLM_F_REQUEST,
+   .n.nlmsg_type = RTM_GETLINK,
+   .i.ifi_family = preferred_family,
+   .i.ifi_index = ifi->ifi_index,
+   };
+   struct nlmsghdr *answer;
+   struct rtattr *tb[IFLA_MAX + 1];
+   struct rtattr *linkinfo[IFLA_INFO_MAX+1];
+   struct rtattr *greinfo[IFLA_GRE_MAX + 1];
const struct rtattr *rta;
+   int len;
 
if (rtnl_talk(, , ) < 0) {
 get_failed:
diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
index c7fef2e..edd7632 100644
--- a/ip/link_ip6tnl.c
+++ b/ip/link_ip6tnl.c
@@ -77,22 +77,6 @@ static void ip6tunnel_print_help(struct link_util *lu, int 
argc, char **argv,
 static int ip6tunnel_parse_opt(struct link_util *lu, int argc, char **argv,
   struct nlmsghdr *n)
 {
-   struct ifinfomsg 

[PATCH iproute2-next v2 0/7] iplink: Improve iplink_parse()

2018-02-20 Thread Serhey Popovych
This is main routine to parse ip-link(8) configuration parameters.

Main reason to improve it is to pass network device @name, @dev and
other parameters to kind specific ->parse_opt() function so they can use
this information.

For example later we will extend iplink_get() to parse netlink
attributes deeper and replace open coded rtnl_talk() in ip/tunnel
modules to simplify getting existing tunnel information.

Among main change there is a number of patches to prepare for it
that improve iplink_parse() in some way.

See individual patch description message for more information.

v2
  Terminate via exit() when failing to parse command line arguments
  to help identify failing line in batch mode.

Thanks,
Serhii

Serhey Popovych (7):
  utils: Introduce and use nodev() helper routine
  iplink: Correctly report error when network device isn't found
  iplink: Use "dev" and "name" parameters interchangeable when possible
  iplink: Follow documented behaviour when "index" is given
  iplink: Perform most of request buffer setups and checks in
iplink_parse()
  iplink: Move data structures to block of their users
  iplink: Reduce number of arguments to iplink_parse()

 bridge/fdb.c |   17 ++--
 bridge/link.c|8 +-
 bridge/mdb.c |   19 ++---
 bridge/vlan.c|7 +-
 include/utils.h  |1 +
 ip/ip6tunnel.c   |6 +-
 ip/ip_common.h   |   17 +++-
 ip/ipaddress.c   |7 +-
 ip/iplink.c  |  200 +++---
 ip/iplink_bond.c |8 +-
 ip/iplink_bond_slave.c   |4 +-
 ip/iplink_bridge.c   |   11 ++-
 ip/iplink_bridge_slave.c |4 +-
 ip/iplink_can.c  |4 +-
 ip/iplink_geneve.c   |4 +-
 ip/iplink_hsr.c  |4 +-
 ip/iplink_ipoib.c|4 +-
 ip/iplink_ipvlan.c   |4 +-
 ip/iplink_macvlan.c  |4 +-
 ip/iplink_vlan.c |4 +-
 ip/iplink_vrf.c  |5 +-
 ip/iplink_vxcan.c|   39 +++--
 ip/iplink_vxlan.c|   11 ++-
 ip/iplink_xdp.c  |7 +-
 ip/ipmacsec.c|4 +-
 ip/ipmroute.c|7 +-
 ip/ipneigh.c |   15 ++--
 ip/ipntable.c|6 +-
 ip/iproute.c |   36 +++--
 ip/iproute_lwtunnel.c|4 +-
 ip/iptunnel.c|6 +-
 ip/link_gre.c|   43 +-
 ip/link_gre6.c   |   43 +-
 ip/link_ip6tnl.c |   40 +-
 ip/link_iptnl.c  |   40 +-
 ip/link_veth.c   |   39 +++--
 ip/link_vti.c|   43 +-
 ip/link_vti6.c   |   43 +-
 lib/utils.c  |6 ++
 tc/m_mirred.c|6 +-
 tc/tc_class.c|   14 ++--
 tc/tc_filter.c   |   18 ++---
 tc/tc_qdisc.c|   12 +--
 43 files changed, 405 insertions(+), 419 deletions(-)

-- 
1.7.10.4



[PATCH iproute2-next v2 4/7] iplink: Follow documented behaviour when "index" is given

2018-02-20 Thread Serhey Popovych
Both ip-link(8) and error message when "index" parameter is given for
set/delete case says that index can only be given during network
device creation.

Follow this documented behaviour and get rid of ambiguous behaviour in
case of both "dev" and "index" specified for ip link delete scenario
(actually "index" being ignored in favor to "dev").

Prohibit "index" when configuring/deleting group of network devices.

Signed-off-by: Serhey Popovych 
---
 ip/iplink.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 1359c0f..4e9f571 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -974,6 +974,12 @@ static int iplink_modify(int cmd, unsigned int flags, int 
argc, char **argv)
argc -= ret;
argv += ret;
 
+   if (!(flags & NLM_F_CREATE) && index) {
+   fprintf(stderr,
+   "index can be used only when creating devices.\n");
+   exit(-1);
+   }
+
if (group != -1) {
if (dev)
addattr_l(, sizeof(req), IFLA_GROUP,
@@ -1004,11 +1010,6 @@ static int iplink_modify(int cmd, unsigned int flags, 
int argc, char **argv)
"Not enough information: \"dev\" argument is 
required.\n");
exit(-1);
}
-   if (cmd == RTM_NEWLINK && index) {
-   fprintf(stderr,
-   "index can be used only when creating 
devices.\n");
-   exit(-1);
-   }
 
req.i.ifi_index = ll_name_to_index(dev);
if (!req.i.ifi_index)
-- 
1.7.10.4



[PATCH iproute2-next v2 1/7] utils: Introduce and use nodev() helper routine

2018-02-20 Thread Serhey Popovych
There is a couple of places where we report error in case of no network
device is found. In all of them we output message in the same format to
stderr and either return -1 or 1 to the caller or exit with -1.

Introduce new helper function nodev() that takes name of the network
device caused error and returns -1 to it's caller. Either call exit()
or return to the caller to preserve behaviour before change.

Use -nodev() in traffic control (tc) code to return 1.

Simplify expression for checking for argument being 0/NULL in @if
statement.

Signed-off-by: Serhey Popovych 
---
 bridge/fdb.c  |   17 ++---
 bridge/link.c |8 +++-
 bridge/mdb.c  |   19 ++-
 bridge/vlan.c |7 ++-
 include/utils.h   |1 +
 ip/ip6tunnel.c|6 ++
 ip/ipaddress.c|7 +++
 ip/iplink.c   |   13 -
 ip/iplink_bond.c  |4 ++--
 ip/iplink_bridge.c|7 ++-
 ip/iplink_vxlan.c |7 ++-
 ip/ipmroute.c |7 +++
 ip/ipneigh.c  |   15 ---
 ip/ipntable.c |6 ++
 ip/iproute.c  |   36 
 ip/iproute_lwtunnel.c |4 ++--
 ip/iptunnel.c |6 ++
 ip/link_gre.c |7 ++-
 ip/link_gre6.c|7 ++-
 ip/link_ip6tnl.c  |4 ++--
 ip/link_iptnl.c   |4 ++--
 ip/link_vti.c |7 ++-
 ip/link_vti6.c|7 ++-
 lib/utils.c   |6 ++
 tc/m_mirred.c |6 ++
 tc/tc_class.c |   14 ++
 tc/tc_filter.c|   18 ++
 tc/tc_qdisc.c |   12 
 28 files changed, 98 insertions(+), 164 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index 8b133f9..2ba1cde 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -375,11 +375,8 @@ static int fdb_show(int argc, char **argv)
/*we'll keep around filter_dev for older kernels */
if (filter_dev) {
filter_index = ll_name_to_index(filter_dev);
-   if (filter_index == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   if (!filter_index)
+   return nodev(filter_dev);
req.ifm.ifi_index = filter_index;
}
 
@@ -464,8 +461,8 @@ static int fdb_modify(int cmd, int flags, int argc, char 
**argv)
} else if (strcmp(*argv, "via") == 0) {
NEXT_ARG();
via = ll_name_to_index(*argv);
-   if (via == 0)
-   invarg("invalid device\n", *argv);
+   if (!via)
+   exit(nodev(*argv));
} else if (strcmp(*argv, "self") == 0) {
req.ndm.ndm_flags |= NTF_SELF;
} else if (matches(*argv, "master") == 0) {
@@ -540,10 +537,8 @@ static int fdb_modify(int cmd, int flags, int argc, char 
**argv)
addattr32(, sizeof(req), NDA_IFINDEX, via);
 
req.ndm.ndm_ifindex = ll_name_to_index(d);
-   if (req.ndm.ndm_ifindex == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n", d);
-   return -1;
-   }
+   if (!req.ndm.ndm_ifindex)
+   return nodev(d);
 
if (rtnl_talk(, , NULL) < 0)
return -1;
diff --git a/bridge/link.c b/bridge/link.c
index 90c9734..1ea9c00 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -470,11 +470,9 @@ static int brlink_show(int argc, char **argv)
}
 
if (filter_dev) {
-   if ((filter_index = ll_name_to_index(filter_dev)) == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   filter_index = ll_name_to_index(filter_dev);
+   if (!filter_index)
+   return nodev(filter_dev);
}
 
if (show_details) {
diff --git a/bridge/mdb.c b/bridge/mdb.c
index 62dc8a0..e3f6978 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -312,11 +312,8 @@ static int mdb_show(int argc, char **argv)
 
if (filter_dev) {
filter_index = ll_name_to_index(filter_dev);
-   if (filter_index == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   if (!filter_index)
+   return nodev(filter_dev);
}
 
/* get mdb entries*/
@@ -418,16 +415,12 @@ static int mdb_modify(int cmd, int flags, int argc, char 
**argv)
}
 
req.bpm.ifindex = ll_name_to_index(d);
-   if (req.bpm.ifindex == 0) {
-   

[PATCH net] amd-xgbe: Restore PCI interrupt enablement setting on resume

2018-02-20 Thread Tom Lendacky
After resuming from suspend, the PCI device support must re-enable the
interrupt setting so that interrupts are actually delivered.

Signed-off-by: Tom Lendacky 
---

Please queue this patch up to stable releases 4.14 and above.

 drivers/net/ethernet/amd/xgbe/xgbe-pci.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
index 3e5833c..eb23f9b 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
@@ -426,6 +426,8 @@ static int xgbe_pci_resume(struct pci_dev *pdev)
struct net_device *netdev = pdata->netdev;
int ret = 0;
 
+   XP_IOWRITE(pdata, XP_INT_EN, 0x1f);
+
pdata->lpm_ctrl &= ~MDIO_CTRL1_LPOWER;
XMDIO_WRITE(pdata, MDIO_MMD_PCS, MDIO_CTRL1, pdata->lpm_ctrl);
 



[PATCH] bpf: hide a possibly unused variable

2018-02-20 Thread Arnd Bergmann
The only user of this variable is inside of an #ifdef, causing
a warning without CONFIG_INET:

net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable]
  int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;

This adds the same #ifdef around the declaration.

Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
Signed-off-by: Arnd Bergmann 
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 08ab4c65a998..c3dc6d60b4bb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3379,7 +3379,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct 
bpf_sock_ops_kern *, bpf_sock,
   int, argval)
 {
struct sock *sk = bpf_sock->sk;
+#ifdef CONFIG_INET
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+#endif
 
if (!sk_fullsock(sk))
return -EINVAL;
-- 
2.9.0



pull-request: bpf 2018-02-20

2018-02-20 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a memory leak in LPM trie's map_free() callback function, where
   the trie structure itself was not freed since initial implementation.
   Also a synchronize_rcu() was needed in order to wait for outstanding
   programs accessing the trie to complete, from Yonghong.

2) Fix sock_map_alloc()'s error path in order to correctly propagate
   the -EINVAL error in case of too large allocation requests. This
   was just recently introduced when fixing close hooks via ULP layer,
   fix from Eric.

3) Do not use GFP_ATOMIC in __cpu_map_entry_alloc(). Reason is that this
   will not work with the recent __ptr_ring_init_queue_alloc() conversion
   to kvmalloc_array(), where in case of fallback to vmalloc() that GFP
   flag is invalid, from Jason.

4) Fix two recent syzkaller warnings: i) fix bpf_prog_array_copy_to_user()
   when a prog query with a big number of ids was performed where we'd
   otherwise trigger a warning from allocator side, ii) fix a missing
   mlock precharge on arraymaps, from Daniel.

5) Two fixes for bpftool in order to avoid breaking JSON output when used
   in batch mode, from Quentin.

6) Move a pr_debug() in libbpf in order to avoid having an otherwise
   uninitialized variable in bpf_program__reloc_text(), from Jeremy.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit d4014d8cc6dfa964e3e66df525de2384e3583018:

  rds: do not call ->conn_alloc with GFP_KERNEL (2018-02-13 13:52:02 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to b1a2ce825737b0165cc08e6f98f8c0ea1affdd60:

  tools/libbpf: Avoid possibly using uninitialized variable (2018-02-20 
21:08:20 +0100)


Daniel Borkmann (3):
  bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query
  Merge branch 'bpf-bpftool-json-fixes'
  bpf: fix mlock precharge on arraymaps

Eric Dumazet (1):
  bpf: fix sock_map_alloc() error path

Jason Wang (1):
  bpf: cpumap: use GFP_KERNEL instead of GFP_ATOMIC in 
__cpu_map_entry_alloc()

Jeremy Cline (1):
  tools/libbpf: Avoid possibly using uninitialized variable

Quentin Monnet (2):
  tools: bpftool: preserve JSON for batch mode when dumping insns to file
  tools: bpftool: preserve JSON output on errors on batch file parsing

Yonghong Song (1):
  bpf: fix memory leak in lpm_trie map_free callback function

 kernel/bpf/arraymap.c| 28 
 kernel/bpf/core.c|  2 +-
 kernel/bpf/cpumap.c  |  2 +-
 kernel/bpf/lpm_trie.c| 11 +++
 kernel/bpf/sockmap.c |  3 ++-
 kernel/trace/bpf_trace.c |  2 ++
 tools/bpf/bpftool/main.c |  2 +-
 tools/bpf/bpftool/prog.c |  3 +++
 tools/lib/bpf/libbpf.c   |  5 +++--
 9 files changed, 36 insertions(+), 22 deletions(-)


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Alexander Duyck
On Tue, Feb 20, 2018 at 12:14 PM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote:
>>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
>>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>> > > > used by hypervisor to indicate that virtio_net interface should act as
>>> > > > a backup for another device with the same MAC address.
>>> > > >
>>> > > > Ppatch 2 is in response to the community request for a 3 netdev
>>> > > > solution.  However, it creates some issues we'll get into in a moment.
>>> > > > It extends virtio_net to use alternate datapath when available and
>>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>>> > > > an additional 'bypass' netdev that acts as a master device and 
>>> > > > controls
>>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> > > > associated with the same 'pci' device.  The user accesses the network
>>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' 
>>> > > > netdev
>>> > > > as default for transmits when it is available with link up and 
>>> > > > running.
>>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>>> > > are mature solutions, well tested, broadly used, with lots of issues
>>> > > resolved in the past. What you try to introduce is a weird shortcut
>>> > > that already has couple of issues as you mentioned and will certanly
>>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>>> > > with ideas like multiple VFs, LACP and similar bonding things.
>>> > The problem with the bond and team drivers is they are too large and
>>> > have too many interfaces available for configuration so as a result
>>> > they can really screw this interface up.
>>> What? Too large is which sense? Why "too many interfaces" is a problem?
>>> Also, team has only one interface to userspace team-generic-netlink.
>>>
>>>
>>> > Essentially this is meant to be a bond that is more-or-less managed by
>>> > the host, not the guest. We want the host to be able to configure it
>>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>>> virtio_net, pci vf.
>>> I don't see how host can do any managing of that, other than the
>>> obvious. But still, the active/backup decision is done in guest. This is
>>> a simple bond/team usecase. As I said, there is something needed to be
>>> implemented in userspace in order to handle re-appear of vf netdev.
>>> But that should be fairly easy to do in teamd.
>>
>>The host manages the active/backup decision by
>>- assigning the same MAC address to both VF and virtio interfaces
>>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>>take
>>  over the VFs datapath.
>>- only enable one datapath at anytime so that packets don't get looped back
>>- during live migration enable virtio datapth, unplug vf on the source and
>>replug
>>  vf on the destination.
>>
>>The VM is not expected and doesn't have any control of setting the MAC
>>address
>>or bringing up/down the links.
>>
>>This is the model that is currently supported with netvsc driver on Azure.
>
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exacly the same way in both?
>
> The fact that the netvsc/virtio_net kidnaps a netdev only because it
> has the same mac is going to give me some serious nighmares...
> I think we need to introduce some more strict checks.

In order for that to work we need to settle on a model for these. The
issue is that netvsc is using what we refer to as the "2 netdev" model
where they don't expose the paravirtual interface as its own netdev.
The opinion of Jakub and others has been that we should do a "3
netdev" model in the case of virtio_net since otherwise we will lose
functionality such as in-driver XDP and have to deal with an extra set
of qdiscs and Tx queue locks on transmit path.

Really at this point I am good either way, but we need to probably
have Stephen, Jakub, and whoever else had an opinion on the matter
sort out the 2 vs 3 argument before we could proceed on that. Most of
patch 2 in the set can easily be broken out into a separate file later
if we decide to go that route.

Thanks.

- Alex


Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread David Ahern
On 2/20/18 1:44 PM, Roopa Prabhu wrote:
> On Tue, Feb 20, 2018 at 12:33 PM, Stephen Hemminger
>  wrote:
>> On Tue, 20 Feb 2018 13:27:21 -0700
>> David Ahern  wrote:
>>
>>> On 2/20/18 1:17 PM, Serhey Popovych wrote:
 Stephen Hemminger wrote:
> On Tue, 20 Feb 2018 21:39:51 +0200
> Serhey Popovych  wrote:
>
>> Signed-off-by: Serhey Popovych 
>> ---
>>  ip/iplink.c |   12 ++--
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index 74c377c..a2c8108 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
>> iplink_req *req,
>>   NEXT_ARG();
>>   if (xdp_parse(, , req, dev_index,
>> generic, drv, offload))
>> - exit(-1);
>> + return -1;
>>   } else if (strcmp(*argv, "netns") == 0) {
>>   NEXT_ARG();
>>   if (netns != -1)
>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int 
>> flags, int argc, char **argv)
>>   if (!dev) {
>>   fprintf(stderr,
>>   "Not enough information: \"dev\" argument is 
>> required.\n");
>> - exit(-1);
>> + return -1;
>>   }
>>   if (cmd == RTM_NEWLINK && index) {
>>   fprintf(stderr,
>>   "index can be used only when creating 
>> devices.\n");
>> - exit(-1);
>> + return -1;
>>   }
>>
>>   req.i.ifi_index = ll_name_to_index(dev);
>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>>   if (!dev) {
>>   fprintf(stderr,
>>   "Not enough of information: \"dev\" argument is 
>> required.\n");
>> - exit(-1);
>> + return -1;
>>   }
>>
>>   if (newaddr || newbrd) {
>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>>   fprintf(stderr,
>>   "Command \"%s\" is unknown, try \"ip link 
>> help\".\n",
>>   *argv);
>> - exit(-1);
>> + return -1;
>>   }
>>
>>   argv++; argc--;
>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>>
>>   fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>>   *argv);
>> - exit(-1);
>> + return -1;
>>  }
>
> Not sure I like this. If given bad input in batch it is better to stop 
> and exit
> rather than continuing with more bad data.

 When preparing this change I think in opposite direction: we want to
 continue batch mode if single line is broken.

>>>
>>> batch mode needs to stop on the line that fails. That said, batch still
>>> fails with the /exit/return/ change
>>>
>>> $ cat /tmp/ip.batch
>>> li sh
>>> li foo
>>> li add veth1 type veth peer name veth2
>>>
>>> Current command
>>> $ ip -batch /tmp/ip.batch
>>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
>>> DEFAULT group default qlen 1000
>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>
>>> 
>>> Command "foo" is unknown, try "ip link help".
>>> $ echo $?
>>> 255
>>>
>>> $ ip/ip -batch /tmp/ip.batch
>>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
>>> DEFAULT group default qlen 1000
>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> 
>>> Command "foo" is unknown, try "ip link help".
>>> Command failed /tmp/ip.batch:2
>>> dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
>>> 1
>>>
>>> I like that better because it tells me the line that fails.
>>
>> Normally ip batch will exit on errors.
>> The question is what about -force?
> 
> on -force, it needs to continue.
> 


It does.

$ ip/ip -batch /tmp/ip.batch -force
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Command "foo" is unknown, try "ip link help".
Command failed /tmp/ip.batch:2
RTNETLINK answers: Operation not permitted
Command failed /tmp/ip.batch:3

Which is an improvement over today where it just exits - ignoring the force.


Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
On úterý 20. února 2018 21:09:37 CET Eric Dumazet wrote:
> Also you can tune your NIC to accept few MSS per GSO/TSO packet
> 
> ip link set dev eth0 gso_max_segs 2
> 
> So even if TSO/GSO is there, BBR should not use sk->sk_gso_max_segs to
> size its bursts, since burt sizes are also impacting GRO on the
> receiver.

net-next + 7 patches (6 from the patchset + this one).

Before playing with gso_max_segs:

BBR+fq
sg on:  4.39 Gbits/sec
sg off: 1.33 Gbits/sec

BBR+fq_codel
sg on:  4.02 Gbits/sec
sg off: 1.41 Gbits/sec

BBR+pfifo_fast
sg on:  3.66 Gbits/sec
sg off: 1.41 Gbits/sec

Reno+fq
sg on:  5.69 Gbits/sec
sg off: 1.53 Gbits/sec

Reno+fq_codel
sg on:  6.33 Gbits/sec
sg off: 1.50 Gbits/sec

Reno+pfifo_fast
sg on:  6.26 Gbits/sec
sg off: 1.48 Gbits/sec

After "ip link set dev eth1 gso_max_segs 2":

BBR+fq
sg on:  806 Mbits/sec
sg off: 886 Mbits/sec

BBR+fq_codel
sg on:  206 Mbits/sec
sg off: 207 Mbits/sec

BBR+pfifo_fast
sg on:  220 Mbits/sec
sg off: 200 Mbits/sec

Reno+fq
sg on:  2.16 Gbits/sec
sg off: 1.27 Gbits/sec

Reno+fq_codel
sg on:  2.45 Gbits/sec
sg off: 1.52 Gbits/sec

Reno+pfifo_fast
sg on:  2.31 Gbits/sec
sg off: 1.54 Gbits/sec

Oleksandr




Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Serhey Popovych
David Ahern wrote:
> On 2/20/18 1:17 PM, Serhey Popovych wrote:
>> Stephen Hemminger wrote:
>>> On Tue, 20 Feb 2018 21:39:51 +0200
>>> Serhey Popovych  wrote:
>>>
 Signed-off-by: Serhey Popovych 
 ---
  ip/iplink.c |   12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

 diff --git a/ip/iplink.c b/ip/iplink.c
 index 74c377c..a2c8108 100644
 --- a/ip/iplink.c
 +++ b/ip/iplink.c
 @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
 iplink_req *req,
NEXT_ARG();
if (xdp_parse(, , req, dev_index,
  generic, drv, offload))
 -  exit(-1);
 +  return -1;
} else if (strcmp(*argv, "netns") == 0) {
NEXT_ARG();
if (netns != -1)
 @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int 
 flags, int argc, char **argv)
if (!dev) {
fprintf(stderr,
"Not enough information: \"dev\" argument is 
 required.\n");
 -  exit(-1);
 +  return -1;
}
if (cmd == RTM_NEWLINK && index) {
fprintf(stderr,
"index can be used only when creating 
 devices.\n");
 -  exit(-1);
 +  return -1;
}
  
req.i.ifi_index = ll_name_to_index(dev);
 @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
if (!dev) {
fprintf(stderr,
"Not enough of information: \"dev\" argument is 
 required.\n");
 -  exit(-1);
 +  return -1;
}
  
if (newaddr || newbrd) {
 @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
fprintf(stderr,
"Command \"%s\" is unknown, try \"ip link 
 help\".\n",
*argv);
 -  exit(-1);
 +  return -1;
}
  
argv++; argc--;
 @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
  
fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
*argv);
 -  exit(-1);
 +  return -1;
  }
>>>
>>> Not sure I like this. If given bad input in batch it is better to stop and 
>>> exit
>>> rather than continuing with more bad data.
>>
>> When preparing this change I think in opposite direction: we want to
>> continue batch mode if single line is broken.
>>
> 
> batch mode needs to stop on the line that fails. That said, batch still
> fails with the /exit/return/ change
> 
> $ cat /tmp/ip.batch
> li sh
> li foo
> li add veth1 type veth peer name veth2
> 
> Current command
> $ ip -batch /tmp/ip.batch
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> 
> Command "foo" is unknown, try "ip link help".
> $ echo $?
> 255
> 
> $ ip/ip -batch /tmp/ip.batch
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> Command "foo" is unknown, try "ip link help".
> Command failed /tmp/ip.batch:2
> dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
> 1
> 
> I like that better because it tells me the line that fails.
> 
Okay, now is clear. Will prepare v2 according to these rules.




signature.asc
Description: OpenPGP digital signature


Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Roopa Prabhu
On Tue, Feb 20, 2018 at 12:33 PM, Stephen Hemminger
 wrote:
> On Tue, 20 Feb 2018 13:27:21 -0700
> David Ahern  wrote:
>
>> On 2/20/18 1:17 PM, Serhey Popovych wrote:
>> > Stephen Hemminger wrote:
>> >> On Tue, 20 Feb 2018 21:39:51 +0200
>> >> Serhey Popovych  wrote:
>> >>
>> >>> Signed-off-by: Serhey Popovych 
>> >>> ---
>> >>>  ip/iplink.c |   12 ++--
>> >>>  1 file changed, 6 insertions(+), 6 deletions(-)
>> >>>
>> >>> diff --git a/ip/iplink.c b/ip/iplink.c
>> >>> index 74c377c..a2c8108 100644
>> >>> --- a/ip/iplink.c
>> >>> +++ b/ip/iplink.c
>> >>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
>> >>> iplink_req *req,
>> >>>   NEXT_ARG();
>> >>>   if (xdp_parse(, , req, dev_index,
>> >>> generic, drv, offload))
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>   } else if (strcmp(*argv, "netns") == 0) {
>> >>>   NEXT_ARG();
>> >>>   if (netns != -1)
>> >>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int 
>> >>> flags, int argc, char **argv)
>> >>>   if (!dev) {
>> >>>   fprintf(stderr,
>> >>>   "Not enough information: \"dev\" argument is 
>> >>> required.\n");
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>   }
>> >>>   if (cmd == RTM_NEWLINK && index) {
>> >>>   fprintf(stderr,
>> >>>   "index can be used only when creating 
>> >>> devices.\n");
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>   }
>> >>>
>> >>>   req.i.ifi_index = ll_name_to_index(dev);
>> >>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>> >>>   if (!dev) {
>> >>>   fprintf(stderr,
>> >>>   "Not enough of information: \"dev\" argument is 
>> >>> required.\n");
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>   }
>> >>>
>> >>>   if (newaddr || newbrd) {
>> >>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>> >>>   fprintf(stderr,
>> >>>   "Command \"%s\" is unknown, try \"ip link 
>> >>> help\".\n",
>> >>>   *argv);
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>   }
>> >>>
>> >>>   argv++; argc--;
>> >>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>> >>>
>> >>>   fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>> >>>   *argv);
>> >>> - exit(-1);
>> >>> + return -1;
>> >>>  }
>> >>
>> >> Not sure I like this. If given bad input in batch it is better to stop 
>> >> and exit
>> >> rather than continuing with more bad data.
>> >
>> > When preparing this change I think in opposite direction: we want to
>> > continue batch mode if single line is broken.
>> >
>>
>> batch mode needs to stop on the line that fails. That said, batch still
>> fails with the /exit/return/ change
>>
>> $ cat /tmp/ip.batch
>> li sh
>> li foo
>> li add veth1 type veth peer name veth2
>>
>> Current command
>> $ ip -batch /tmp/ip.batch
>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
>> DEFAULT group default qlen 1000
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>
>> 
>> Command "foo" is unknown, try "ip link help".
>> $ echo $?
>> 255
>>
>> $ ip/ip -batch /tmp/ip.batch
>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
>> DEFAULT group default qlen 1000
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 
>> Command "foo" is unknown, try "ip link help".
>> Command failed /tmp/ip.batch:2
>> dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
>> 1
>>
>> I like that better because it tells me the line that fails.
>
> Normally ip batch will exit on errors.
> The question is what about -force?

on -force, it needs to continue.


Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Stephen Hemminger
On Tue, 20 Feb 2018 13:27:21 -0700
David Ahern  wrote:

> On 2/20/18 1:17 PM, Serhey Popovych wrote:
> > Stephen Hemminger wrote:  
> >> On Tue, 20 Feb 2018 21:39:51 +0200
> >> Serhey Popovych  wrote:
> >>  
> >>> Signed-off-by: Serhey Popovych 
> >>> ---
> >>>  ip/iplink.c |   12 ++--
> >>>  1 file changed, 6 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/ip/iplink.c b/ip/iplink.c
> >>> index 74c377c..a2c8108 100644
> >>> --- a/ip/iplink.c
> >>> +++ b/ip/iplink.c
> >>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
> >>> iplink_req *req,
> >>>   NEXT_ARG();
> >>>   if (xdp_parse(, , req, dev_index,
> >>> generic, drv, offload))
> >>> - exit(-1);
> >>> + return -1;
> >>>   } else if (strcmp(*argv, "netns") == 0) {
> >>>   NEXT_ARG();
> >>>   if (netns != -1)
> >>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int 
> >>> flags, int argc, char **argv)
> >>>   if (!dev) {
> >>>   fprintf(stderr,
> >>>   "Not enough information: \"dev\" argument is 
> >>> required.\n");
> >>> - exit(-1);
> >>> + return -1;
> >>>   }
> >>>   if (cmd == RTM_NEWLINK && index) {
> >>>   fprintf(stderr,
> >>>   "index can be used only when creating 
> >>> devices.\n");
> >>> - exit(-1);
> >>> + return -1;
> >>>   }
> >>>  
> >>>   req.i.ifi_index = ll_name_to_index(dev);
> >>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
> >>>   if (!dev) {
> >>>   fprintf(stderr,
> >>>   "Not enough of information: \"dev\" argument is 
> >>> required.\n");
> >>> - exit(-1);
> >>> + return -1;
> >>>   }
> >>>  
> >>>   if (newaddr || newbrd) {
> >>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
> >>>   fprintf(stderr,
> >>>   "Command \"%s\" is unknown, try \"ip link 
> >>> help\".\n",
> >>>   *argv);
> >>> - exit(-1);
> >>> + return -1;
> >>>   }
> >>>  
> >>>   argv++; argc--;
> >>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
> >>>  
> >>>   fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
> >>>   *argv);
> >>> - exit(-1);
> >>> + return -1;
> >>>  }  
> >>
> >> Not sure I like this. If given bad input in batch it is better to stop and 
> >> exit
> >> rather than continuing with more bad data.  
> > 
> > When preparing this change I think in opposite direction: we want to
> > continue batch mode if single line is broken.
> >   
> 
> batch mode needs to stop on the line that fails. That said, batch still
> fails with the /exit/return/ change
> 
> $ cat /tmp/ip.batch
> li sh
> li foo
> li add veth1 type veth peer name veth2
> 
> Current command
> $ ip -batch /tmp/ip.batch
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> 
> Command "foo" is unknown, try "ip link help".
> $ echo $?
> 255
> 
> $ ip/ip -batch /tmp/ip.batch
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> Command "foo" is unknown, try "ip link help".
> Command failed /tmp/ip.batch:2
> dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
> 1
> 
> I like that better because it tells me the line that fails.

Normally ip batch will exit on errors.
The question is what about -force?


Re: [PATCH] tools/libbpf: Avoid possibly using uninitialized variable

2018-02-20 Thread Daniel Borkmann
On 02/20/2018 02:00 AM, Jeremy Cline wrote:
> Fixes a GCC maybe-uninitialized warning introduced by 48cca7e44f9f.
> "text" is only initialized inside the if statement so only print debug
> info there.
> 
> Signed-off-by: Jeremy Cline 

Looks good, applied to bpf tree, thanks Jeremy!


Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread David Ahern
On 2/20/18 1:17 PM, Serhey Popovych wrote:
> Stephen Hemminger wrote:
>> On Tue, 20 Feb 2018 21:39:51 +0200
>> Serhey Popovych  wrote:
>>
>>> Signed-off-by: Serhey Popovych 
>>> ---
>>>  ip/iplink.c |   12 ++--
>>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/ip/iplink.c b/ip/iplink.c
>>> index 74c377c..a2c8108 100644
>>> --- a/ip/iplink.c
>>> +++ b/ip/iplink.c
>>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
>>> iplink_req *req,
>>> NEXT_ARG();
>>> if (xdp_parse(, , req, dev_index,
>>>   generic, drv, offload))
>>> -   exit(-1);
>>> +   return -1;
>>> } else if (strcmp(*argv, "netns") == 0) {
>>> NEXT_ARG();
>>> if (netns != -1)
>>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int flags, 
>>> int argc, char **argv)
>>> if (!dev) {
>>> fprintf(stderr,
>>> "Not enough information: \"dev\" argument is 
>>> required.\n");
>>> -   exit(-1);
>>> +   return -1;
>>> }
>>> if (cmd == RTM_NEWLINK && index) {
>>> fprintf(stderr,
>>> "index can be used only when creating 
>>> devices.\n");
>>> -   exit(-1);
>>> +   return -1;
>>> }
>>>  
>>> req.i.ifi_index = ll_name_to_index(dev);
>>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>>> if (!dev) {
>>> fprintf(stderr,
>>> "Not enough of information: \"dev\" argument is 
>>> required.\n");
>>> -   exit(-1);
>>> +   return -1;
>>> }
>>>  
>>> if (newaddr || newbrd) {
>>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>>> fprintf(stderr,
>>> "Command \"%s\" is unknown, try \"ip link 
>>> help\".\n",
>>> *argv);
>>> -   exit(-1);
>>> +   return -1;
>>> }
>>>  
>>> argv++; argc--;
>>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>>>  
>>> fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>>> *argv);
>>> -   exit(-1);
>>> +   return -1;
>>>  }
>>
>> Not sure I like this. If given bad input in batch it is better to stop and 
>> exit
>> rather than continuing with more bad data.
> 
> When preparing this change I think in opposite direction: we want to
> continue batch mode if single line is broken.
> 

batch mode needs to stop on the line that fails. That said, batch still
fails with the /exit/return/ change

$ cat /tmp/ip.batch
li sh
li foo
li add veth1 type veth peer name veth2

Current command
$ ip -batch /tmp/ip.batch
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


Command "foo" is unknown, try "ip link help".
$ echo $?
255

$ ip/ip -batch /tmp/ip.batch
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Command "foo" is unknown, try "ip link help".
Command failed /tmp/ip.batch:2
dsa@kenny:~/iproute2/iproute2-next.git$ echo $?
1

I like that better because it tells me the line that fails.


Re: [PATCH iproute2-next v2] ip link: add support to display extended tun attributes

2018-02-20 Thread David Ahern
On 2/20/18 11:40 AM, Stephen Hemminger wrote:
>>> diff --git a/ip/iptuntap.c b/ip/iptuntap.c
>>> index 4628db2832b4..07253870472f 100644
>>> --- a/ip/iptuntap.c
>>> +++ b/ip/iptuntap.c
>>> @@ -469,3 +469,89 @@ int do_iptuntap(int argc, char **argv)
>>> *argv);
>>> exit(-1);
>>>  }
>>> +
>>> +static void print_owner(FILE *f, uid_t uid)
>>> +{
>>> +   struct passwd *pw = getpwuid(uid);
>>> +
>>> +   if (pw)
>>> +   fprintf(f, "user %s ", pw->pw_name);
>>> +   else
>>> +   fprintf(f, "user %u ", uid);
>>> +}
>>> +
>>> +static void print_group(FILE *f, gid_t gid)
>>> +{
>>> +   struct group *group = getgrgid(gid);
>>> +
>>> +   if (group)
>>> +   fprintf(f, "group %s ", group->gr_name);
>>> +   else
>>> +   fprintf(f, "group %u ", gid);
>>> +}
>>> +  
>>
>>
>> Those helpers can be re-used to make 'ip tuntap show' better too.
> 
> These should support JSON output.
> 

Good point. Missed that detail. Sabrina: Please send a patch to fix the
json output.


Re: [PATCH iproute2-next] ip: don't colorize the master device

2018-02-20 Thread David Ahern
On 2/20/18 12:08 PM, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> Putting whole string "master eth0" in the interface name color
> is wrong and confusing. Let's just turn color off for all attributes
> of device.
> 
> Fixes: d92cc2d087b0 ("ipaddress: ll_map: Replace ll_idx_n2a() with 
> ll_index_to_name()")
> Signed-off-by: Stephen Hemminger 
> ---
>  ip/ipaddress.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 

Change in behavior too. Applied to iproute2-next



Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Serhey Popovych
Stephen Hemminger wrote:
> On Tue, 20 Feb 2018 21:39:51 +0200
> Serhey Popovych  wrote:
> 
>> Signed-off-by: Serhey Popovych 
>> ---
>>  ip/iplink.c |   12 ++--
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index 74c377c..a2c8108 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct 
>> iplink_req *req,
>>  NEXT_ARG();
>>  if (xdp_parse(, , req, dev_index,
>>generic, drv, offload))
>> -exit(-1);
>> +return -1;
>>  } else if (strcmp(*argv, "netns") == 0) {
>>  NEXT_ARG();
>>  if (netns != -1)
>> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int flags, 
>> int argc, char **argv)
>>  if (!dev) {
>>  fprintf(stderr,
>>  "Not enough information: \"dev\" argument is 
>> required.\n");
>> -exit(-1);
>> +return -1;
>>  }
>>  if (cmd == RTM_NEWLINK && index) {
>>  fprintf(stderr,
>>  "index can be used only when creating 
>> devices.\n");
>> -exit(-1);
>> +return -1;
>>  }
>>  
>>  req.i.ifi_index = ll_name_to_index(dev);
>> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>>  if (!dev) {
>>  fprintf(stderr,
>>  "Not enough of information: \"dev\" argument is 
>> required.\n");
>> -exit(-1);
>> +return -1;
>>  }
>>  
>>  if (newaddr || newbrd) {
>> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>>  fprintf(stderr,
>>  "Command \"%s\" is unknown, try \"ip link 
>> help\".\n",
>>  *argv);
>> -exit(-1);
>> +return -1;
>>  }
>>  
>>  argv++; argc--;
>> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>>  
>>  fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>>  *argv);
>> -exit(-1);
>> +return -1;
>>  }
> 
> Not sure I like this. If given bad input in batch it is better to stop and 
> exit
> rather than continuing with more bad data.

When preparing this change I think in opposite direction: we want to
continue batch mode if single line is broken.

If desired I will drop this one.

> 




signature.asc
Description: OpenPGP digital signature


Re: [PATCH iproute2] ip: link_gre6.c: Support IP6_TNL_F_ALLOW_LOCAL_REMOTE flag

2018-02-20 Thread Serhey Popovych
Petr Machata wrote:
> For IP-in-IP tunnels, one can specify the [no]allow-localremote command
> when configuring a device. Under the hood, this flips the
> IP6_TNL_F_ALLOW_LOCAL_REMOTE flag on the netdevice. However, ip6gretap
> and ip6erspan devices, where the flag is also relevant, are not IP-in-IP
> tunnels, and thus there's no way to configure the flag on these
> netdevices. Therefore introduce the command to link_gre6 as well.
> 
> The original support was introduced in commit
> 21440d19d957 ("ip: link_ip6tnl.c/ip6tunnel.c: Support 
> IP6_TNL_F_ALLOW_LOCAL_REMOTE flag")

Maybe it is better to rebase this against iproute2-next?

There are lot of changes already done here and there are plans on ip and
ipv6 tunnel modules variants merge.

> 
> Signed-off-by: Petr Machata 
> ---
>  ip/link_gre6.c| 11 +++
>  man/man8/ip-link.8.in | 14 ++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/ip/link_gre6.c b/ip/link_gre6.c
> index 4045f65..4c05344 100644
> --- a/ip/link_gre6.c
> +++ b/ip/link_gre6.c
> @@ -44,6 +44,7 @@ static void print_usage(FILE *f)
>   "  [ flowlabel FLOWLABEL ]\n"
>   "  [ dscp inherit ]\n"
>   "  [ fwmark MARK ]\n"
> + "  [ [no]allow-localremote ]\n"
>   "  [ dev PHYS_DEV ]\n"
>   "  [ noencap ]\n"
>   "  [ encap { fou | gue | none } 
> ]\n"
> @@ -348,6 +349,10 @@ get_failed:
>   invarg("invalid fwmark\n", *argv);
>   flags &= ~IP6_TNL_F_USE_ORIG_FWMARK;
>   }
> + } else if (strcmp(*argv, "allow-localremote") == 0) {
> + flags |= IP6_TNL_F_ALLOW_LOCAL_REMOTE;
> + } else if (strcmp(*argv, "noallow-localremote") == 0) {
> + flags &= ~IP6_TNL_F_ALLOW_LOCAL_REMOTE;
>   } else if (strcmp(*argv, "encaplimit") == 0) {
>   NEXT_ARG();
>   if (strcmp(*argv, "none") == 0) {
> @@ -534,6 +539,12 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
> struct rtattr *tb[])
>   if (oflags & GRE_CSUM)
>   print_bool(PRINT_ANY, "ocsum", "ocsum ", true);
>  
> + if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
> + print_bool(PRINT_ANY,
> +"ip6_tnl_f_allow_local_remote",
> +"allow-localremote ",
> +true);
> +
>   if (flags & IP6_TNL_F_USE_ORIG_FWMARK) {
>   print_bool(PRINT_ANY,
>  "ip6_tnl_f_use_orig_fwmark",
> diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
> index 481589e..5dee9fc 100644
> --- a/man/man8/ip-link.8.in
> +++ b/man/man8/ip-link.8.in
> @@ -793,6 +793,8 @@ the following additional arguments are supported:
>  ] [
>  .BI "dscp inherit"
>  ] [
> +.BI "[no]allow-localremote"
> +] [
>  .BI dev " PHYS_DEV "
>  ] [
>  .RB external
> @@ -857,6 +859,11 @@ flag is equivalent to the combination
>  - specifies a fixed flowlabel.
>  
>  .sp
> +.BI  [no]allow-localremote
> +- specifies whether to allow remote endpoint to have an address configured on
> +local host.
> +
> +.sp
>  .BI  tclass " TCLASS"
>  - specifies the traffic class field on
>  tunneled packets, which can be specified as either a two-digit
> @@ -927,6 +934,8 @@ the following additional arguments are supported:
>  ] [
>  .BR erspan_hwid " \fIhwid "
>  ] [
> +.BI "[no]allow-localremote"
> +] [
>  .RB external
>  ]
>  
> @@ -965,6 +974,11 @@ traffic's source port and direction.
>  is a 6-bit value for users to configure.
>  
>  .sp
> +.BI  [no]allow-localremote
> +- specifies whether to allow remote endpoint to have an address configured on
> +local host.
> +
> +.sp
>  .BR external
>  - make this tunnel externally controlled (or not, which is the default).
>  In the kernel, this is referred to as collect metadata mode.  This flag is
> 




signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jiri Pirko
Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote:
>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>> > > > used by hypervisor to indicate that virtio_net interface should act as
>> > > > a backup for another device with the same MAC address.
>> > > > 
>> > > > Ppatch 2 is in response to the community request for a 3 netdev
>> > > > solution.  However, it creates some issues we'll get into in a moment.
>> > > > It extends virtio_net to use alternate datapath when available and
>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>> > > > an additional 'bypass' netdev that acts as a master device and controls
>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> > > > associated with the same 'pci' device.  The user accesses the network
>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' 
>> > > > netdev
>> > > > as default for transmits when it is available with link up and running.
>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part
>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>> > > are mature solutions, well tested, broadly used, with lots of issues
>> > > resolved in the past. What you try to introduce is a weird shortcut
>> > > that already has couple of issues as you mentioned and will certanly
>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>> > > with ideas like multiple VFs, LACP and similar bonding things.
>> > The problem with the bond and team drivers is they are too large and
>> > have too many interfaces available for configuration so as a result
>> > they can really screw this interface up.
>> What? Too large is which sense? Why "too many interfaces" is a problem?
>> Also, team has only one interface to userspace team-generic-netlink.
>> 
>> 
>> > Essentially this is meant to be a bond that is more-or-less managed by
>> > the host, not the guest. We want the host to be able to configure it
>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>> virtio_net, pci vf.
>> I don't see how host can do any managing of that, other than the
>> obvious. But still, the active/backup decision is done in guest. This is
>> a simple bond/team usecase. As I said, there is something needed to be
>> implemented in userspace in order to handle re-appear of vf netdev.
>> But that should be fairly easy to do in teamd.
>
>The host manages the active/backup decision by
>- assigning the same MAC address to both VF and virtio interfaces
>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>take
>  over the VFs datapath.
>- only enable one datapath at anytime so that packets don't get looped back
>- during live migration enable virtio datapth, unplug vf on the source and
>replug
>  vf on the destination.
>
>The VM is not expected and doesn't have any control of setting the MAC
>address
>or bringing up/down the links.
>
>This is the model that is currently supported with netvsc driver on Azure.

Yeah, I can see it now :( I guess that the ship has sailed and we are
stuck with this ugly thing forever...

Could you at least make some common code that is shared in between
netvsc and virtio_net so this is handled in exacly the same way in both?

The fact that the netvsc/virtio_net kidnaps a netdev only because it
has the same mac is going to give me some serious nighmares...
I think we need to introduce some more strict checks.


Re: [PATCH iproute2-next 1/8] iplink: Return from function instead of calling exit()

2018-02-20 Thread Stephen Hemminger
On Tue, 20 Feb 2018 21:39:51 +0200
Serhey Popovych  wrote:

> Signed-off-by: Serhey Popovych 
> ---
>  ip/iplink.c |   12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/ip/iplink.c b/ip/iplink.c
> index 74c377c..a2c8108 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -653,7 +653,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
> *req,
>   NEXT_ARG();
>   if (xdp_parse(, , req, dev_index,
> generic, drv, offload))
> - exit(-1);
> + return -1;
>   } else if (strcmp(*argv, "netns") == 0) {
>   NEXT_ARG();
>   if (netns != -1)
> @@ -972,12 +972,12 @@ static int iplink_modify(int cmd, unsigned int flags, 
> int argc, char **argv)
>   if (!dev) {
>   fprintf(stderr,
>   "Not enough information: \"dev\" argument is 
> required.\n");
> - exit(-1);
> + return -1;
>   }
>   if (cmd == RTM_NEWLINK && index) {
>   fprintf(stderr,
>   "index can be used only when creating 
> devices.\n");
> - exit(-1);
> + return -1;
>   }
>  
>   req.i.ifi_index = ll_name_to_index(dev);
> @@ -1392,7 +1392,7 @@ static int do_set(int argc, char **argv)
>   if (!dev) {
>   fprintf(stderr,
>   "Not enough of information: \"dev\" argument is 
> required.\n");
> - exit(-1);
> + return -1;
>   }
>  
>   if (newaddr || newbrd) {
> @@ -1553,7 +1553,7 @@ static int iplink_afstats(int argc, char **argv)
>   fprintf(stderr,
>   "Command \"%s\" is unknown, try \"ip link 
> help\".\n",
>   *argv);
> - exit(-1);
> + return -1;
>   }
>  
>   argv++; argc--;
> @@ -1648,5 +1648,5 @@ int do_iplink(int argc, char **argv)
>  
>   fprintf(stderr, "Command \"%s\" is unknown, try \"ip link help\".\n",
>   *argv);
> - exit(-1);
> + return -1;
>  }

Not sure I like this. If given bad input in batch it is better to stop and exit
rather than continuing with more bad data.


  1   2   3   >