Re: [PATCH net-next v2 1/3] ipv4: support sport and dport in RTM_GETROUTE

2018-05-06 Thread kbuild test robot
Hi Roopa,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Roopa-Prabhu/fib-rule-selftest/20180507-094538
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv4/route.c:1271:31: sparse: expression using sizeof(void)
   net/ipv4/route.c:1271:31: sparse: expression using sizeof(void)
   net/ipv4/route.c:1274:16: sparse: expression using sizeof(void)
   net/ipv4/route.c:1274:16: sparse: expression using sizeof(void)
   net/ipv4/route.c:1295:15: sparse: expression using sizeof(void)
   net/ipv4/route.c:688:38: sparse: expression using sizeof(void)
   net/ipv4/route.c:712:38: sparse: expression using sizeof(void)
   net/ipv4/route.c:782:46: sparse: incorrect type in argument 2 (different
base types) @@ expected unsigned int [unsigned] [usertype] key @@ got
restricted __be32 [usertype] new_gw @@
   net/ipv4/route.c:782:46:expected unsigned int [unsigned] [usertype] key
   net/ipv4/route.c:782:46:got restricted __be32 [usertype] new_gw
>> net/ipv4/route.c:2695:29: sparse: incorrect type in initializer (different
>> base types) @@ expected int [signed] p @@ got restricted __be16 @@
   net/ipv4/route.c:2695:29:expected int [signed] p
   net/ipv4/route.c:2695:29:got restricted __be16
>> net/ipv4/route.c:2700:15: sparse: incorrect type in assignment (different
>> base types) @@ expected restricted __be16 [usertype] @@ got
>> int [signed] p @@
   net/ipv4/route.c:2700:15:expected restricted __be16 [usertype] 
   net/ipv4/route.c:2700:15:got int [signed] p
>> net/ipv4/route.c:2816:27: sparse: incorrect type in assignment (different
>> base types) @@ expected restricted __be16 [usertype] len @@ got
>> unsigned long @@
   net/ipv4/route.c:2816:27:expected restricted __be16 [usertype] len
   net/ipv4/route.c:2816:27:got unsigned long

vim +2695 net/ipv4/route.c

  2692  
  2693  static int nla_get_port(struct nlattr *attr, __be16 *port)
  2694  {
> 2695  int p = nla_get_be16(attr);
  2696  
  2697  if (p <= 0 || p >= 0x)
  2698  return -EINVAL;
  2699  
> 2700  *port = p;
  2701  return 0;
  2702  }
  2703  
  2704  static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr *nlh,
  2705 __be32 dst, __be32 src, struct flowi4 *fl4,
  2706 struct rtable *rt, struct fib_result *res)
  2707  {
  2708  struct net *net = sock_net(in_skb->sk);
  2709  struct rtmsg *rtm = nlmsg_data(nlh);
  2710  u32 table_id = RT_TABLE_MAIN;
  2711  struct sk_buff *skb;
  2712  int err = 0;
  2713  
  2714  skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
  2715  if (!skb) {
  2716  err = -ENOMEM;
  2717  return err;
  2718  }
  2719  
  2720  if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
  2721  table_id = res->table ? res->table->tb_id : 0;
  2722  
  2723  if (rtm->rtm_flags & RTM_F_FIB_MATCH)
  2724  err = fib_dump_info(skb, NETLINK_CB(in_skb).portid,
  2725  nlh->nlmsg_seq, RTM_NEWROUTE, table_id,
  2726  rt->rt_type, res->prefix, res->prefixlen,
  2727  fl4->flowi4_tos, res->fi, 0);
  2728  else
  2729  err = rt_fill_info(net, dst, src, rt, table_id,
  2730 fl4, skb, NETLINK_CB(in_skb).portid,
  2731 nlh->nlmsg_seq);
  2732  if (err < 0)
  2733  goto errout;
  2734  
  2735  return rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
  2736  
  2737  errout:
  2738  kfree_skb(skb);
  2739  return err;
  2740  }
  2741  
  2742  static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
  2743   struct netlink_ext_ack *extack)
  2744  {
  2745  struct net *net = sock_net(in_skb->sk);
  2746  struct nlattr *tb[RTA_MAX+1];
  2747  __be16 sport = 0, dport = 0;
  2748  struct fib_result res = {};
  2749  struct rtable *rt = NULL;
  2750  struct sk_buff *skb;
  2751  struct rtmsg *rtm;
  2752  struct flowi4 fl4;
  2753  struct iphdr *iph;
  2754  struct udphdr *udph;
  2755  __be32 dst = 0;
  2756  __be32 src = 0;
  2757  kuid_t uid;
  2758  u32 iif;
  2759  int err;
  2760  int mark;
  2761  
  2762  err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv4_policy,
  2763  extack);
  2764  if (err < 0)
  2765  

Re: linux-next: manual merge of the tip tree with the bpf-next tree

2018-05-06 Thread Stephen Rothwell
Hi all,

On Mon, 7 May 2018 12:09:09 +1000 Stephen Rothwell  
wrote:
>
> Today's linux-next merge of the tip tree got a conflict in:
> 
>   arch/x86/net/bpf_jit_comp.c
> 
> between commit:
> 
>   e782bdcf58c5 ("bpf, x64: remove ld_abs/ld_ind")
> 
> from the bpf-next tree and commit:
> 
>   5f26c50143f5 ("x86/bpf: Clean up non-standard comments, to make the code 
> more readable")
> 
> from the tip tree.
> 
> I fixed it up (the former commit removed some code modified by the latter,
> so I just removed it) and can carry the fix as necessary. This is now
> fixed as far as linux-next is concerned, but any non trivial conflicts
> should be mentioned to your upstream maintainer when your tree is
> submitted for merging.  You may also want to consider cooperating with
> the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Actually the tip tree commit has been added to the bpf-next tree as a
different commit, so dropping it from the tip tree will clean this up.

-- 
Cheers,
Stephen Rothwell




Re: [RFC PATCH 3/3] arcnet: com20020: Add ethtool support

2018-05-06 Thread Tobin C. Harding
On Sat, May 05, 2018 at 11:35:29PM +0200, Andrea Greco wrote:
> From: Andrea Greco 
> 
> Setup ethtols for export com20020 diag register
> 
> Signed-off-by: Andrea Greco 
> ---
>  drivers/net/arcnet/com20020-isa.c|  1 +
>  drivers/net/arcnet/com20020-membus.c |  1 +
>  drivers/net/arcnet/com20020.c| 29 +
>  drivers/net/arcnet/com20020.h|  1 +
>  drivers/net/arcnet/com20020_cs.c |  1 +
>  include/uapi/linux/if_arcnet.h   |  6 ++
>  6 files changed, 39 insertions(+)
> 
> diff --git a/drivers/net/arcnet/com20020-isa.c 
> b/drivers/net/arcnet/com20020-isa.c
> index 38fa60ddaf2e..44ab6dcccb58 100644
> --- a/drivers/net/arcnet/com20020-isa.c
> +++ b/drivers/net/arcnet/com20020-isa.c
> @@ -154,6 +154,7 @@ static int __init com20020_init(void)
>   dev->dev_addr[0] = node;
>  
>   dev->netdev_ops = &com20020_netdev_ops;
> + dev->ethtool_ops = &com20020_ethtool_ops;
>  
>   lp = netdev_priv(dev);
>   lp->backplane = backplane;
> diff --git a/drivers/net/arcnet/com20020-membus.c 
> b/drivers/net/arcnet/com20020-membus.c
> index 6e4a2f3a84f7..9eead734a3cf 100644
> --- a/drivers/net/arcnet/com20020-membus.c
> +++ b/drivers/net/arcnet/com20020-membus.c
> @@ -91,6 +91,7 @@ static int com20020_probe(struct platform_device *pdev)
>  
>   dev = alloc_arcdev(NULL);// Let autoassign name arc%d
>   dev->netdev_ops = &com20020_netdev_ops;
> + dev->ethtool_ops = &com20020_ethtool_ops;
>   lp = netdev_priv(dev);
>  
>   lp->card_flags = ARC_CAN_10MBIT;/* pretend all of them can 10Mbit */
> diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
> index abd32ed8ec9b..2089b45e81c8 100644
> --- a/drivers/net/arcnet/com20020.c
> +++ b/drivers/net/arcnet/com20020.c
> @@ -201,6 +201,34 @@ const struct net_device_ops com20020_netdev_ops = {
>   .ndo_set_rx_mode = com20020_set_mc_list,
>  };
>  
> +static int com20020_ethtool_regs_len(struct net_device *netdev)
> +{
> + return sizeof(struct com20020_ethtool_regs);
> +}
> +
> +static void com20020_ethtool_regs_read(struct net_device *dev,
> +struct ethtool_regs *regs, void *p)
> +{
> + struct arcnet_local *lp;
> + struct com20020_ethtool_regs *com_reg;
> +
> + lp = netdev_priv(dev);
> + memset(p, 0, sizeof(struct com20020_ethtool_regs));

perhaps:

struct arcnet_local *lp = netdev_priv(dev);
struct com20020_ethtool_regs *com_reg = p;

memset(com_reg, 0, sizeof(*com_reg));

> +
> + regs->version = 1;

Should this function really have a side effect?  If so, perhaps it could
be commented.

> +
> + com_reg = p;
> +
> + com_reg->status = lp->hw.status(dev) & 0xFF;
> + com_reg->diag_register = (lp->hw.status(dev) >> 8) & 0xFF;
> + com_reg->reconf_count = lp->num_recons;
> +}
> +
> +const struct ethtool_ops com20020_ethtool_ops = {
> + .get_regs = com20020_ethtool_regs_read,
> + .get_regs_len  = com20020_ethtool_regs_len,
> +};
> +

Hope this helps,
Tobin.


Re: [RFC PATCH 2/3] arcnet: com20020: Fixup missing SLOWARB bit

2018-05-06 Thread Tobin C. Harding
On Sat, May 05, 2018 at 11:37:54PM +0200, Andrea Greco wrote:
> From: Andrea Greco 
> 
> If com20020 clock is major of 40Mhz SLOWARB bit is requested.
> 
> Signed-off-by: Andrea Greco 
> ---
>  drivers/net/arcnet/com20020.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
> index f09ea77dd6a8..abd32ed8ec9b 100644
> --- a/drivers/net/arcnet/com20020.c
> +++ b/drivers/net/arcnet/com20020.c
> @@ -102,6 +102,10 @@ int com20020_check(struct net_device *dev)
>   lp->setup = lp->clockm ? 0 : (lp->clockp << 1);
>   lp->setup2 = (lp->clockm << 4) | 8;
>  
> + // If clock is major of 40Mhz, SLOWARB bit must be set

/* C89 style comments please :) */


Hope this helps,
Tobin.


Re: [RFC PATCH 1/3] arcnet: com20020: Add memory map of com20020

2018-05-06 Thread Tobin C. Harding
On Sat, May 05, 2018 at 11:34:45PM +0200, Andrea Greco wrote:
> From: Andrea Greco 

Hi Andrea,

Here are some (mostly stylistic) suggestions to help you get your driver merged.

> Add support for com20022I/com20020, memory mapped chip version.
> Support bus: Intel 80xx and Motorola 68xx.
> Bus size: Only 8 bit bus size is supported.
> Added related device tree bindings
> 
> Signed-off-by: Andrea Greco 
> ---
>  .../devicetree/bindings/net/smsc-com20020.txt  |  23 +++
>  drivers/net/arcnet/Kconfig |  12 +-
>  drivers/net/arcnet/Makefile|   1 +
>  drivers/net/arcnet/arcdevice.h |  27 ++-
>  drivers/net/arcnet/com20020-membus.c   | 191 
> +
>  drivers/net/arcnet/com20020.c  |   9 +-
>  6 files changed, 253 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt
>  create mode 100644 drivers/net/arcnet/com20020-membus.c
> 
> diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt 
> b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> new file mode 100644
> index ..39c5b19c55af
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> @@ -0,0 +1,23 @@
> +SMSC com20020, com20022I
> +
> +timeout: Arcnet timeout, checkout datashet
> +clockp: Clock Prescaler, checkout datashet
> +clockm: Clock multiplier, checkout datasheet
> +
> +phy-reset-gpios: Chip reset ppin
> +phy-irq-gpios: Chip irq pin
> +
> +com20020_A@0 {
> +compatible = "smsc,com20020";
> +
> + timeout = <0x3>;
> + backplane = <0x0>;
> +
> + clockp = <0x0>;
> + clockm = <0x3>;
> +
> + phy-reset-gpios = < 21 GPIO_ACTIVE_LOW>;
> + phy-irq-gpios = < 10 GPIO_ACTIVE_LOW>;
> +
> + status = "okay";
> +};
> diff --git a/drivers/net/arcnet/Kconfig b/drivers/net/arcnet/Kconfig
> index 39bd16f3f86d..d39faf45be1e 100644
> --- a/drivers/net/arcnet/Kconfig
> +++ b/drivers/net/arcnet/Kconfig
> @@ -3,7 +3,7 @@
>  #
>  
>  menuconfig ARCNET
> - depends on NETDEVICES && (ISA || PCI || PCMCIA)
> + depends on NETDEVICES
>   tristate "ARCnet support"
>   ---help---
> If you have a network card of this type, say Y and check out the
> @@ -129,5 +129,15 @@ config ARCNET_COM20020_CS
>  
> To compile this driver as a module, choose M here: the module will be
> called com20020_cs.  If unsure, say N.
> +config ARCNET_COM20020_MEMORY_BUS
> + bool "Support for COM20020 on external memory"
> + depends on ARCNET_COM20020 && !(ARCNET_COM20020_PCI || 
> ARCNET_COM20020_ISA || ARCNET_COM20020_CS)
> + help
> +   Say Y here if on your custom board mount com20020 or friends.
> +
> +   Com20022I support arcnet bus 10Mbitps.
> +   This driver support only 8bit

This driver only supports 8bit bus size.

>  , and DMA is not supported is attached 
> on your board at external interface bus.

This bit does not make sense, sorry.

> +   Supported bus Intel80xx / Motorola 68xx.
> +   This driver not work with other com20020 module: PCI or PCMCIA 
> compiled as [M].

I'm not sure exactly what you want to say here, perhaps:

  This driver does not work with other com20020 modules compiled
  as PCI or PCMCIA [M].
>  
>  endif # ARCNET
> diff --git a/drivers/net/arcnet/Makefile b/drivers/net/arcnet/Makefile
> index 53525e8ea130..19425c1e06f4 100644
> --- a/drivers/net/arcnet/Makefile
> +++ b/drivers/net/arcnet/Makefile
> @@ -14,3 +14,4 @@ obj-$(CONFIG_ARCNET_COM20020) += com20020.o
>  obj-$(CONFIG_ARCNET_COM20020_ISA) += com20020-isa.o
>  obj-$(CONFIG_ARCNET_COM20020_PCI) += com20020-pci.o
>  obj-$(CONFIG_ARCNET_COM20020_CS) += com20020_cs.o
> +obj-$(CONFIG_ARCNET_COM20020_MEMORY_BUS) += com20020-membus.o
> diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
> index d09b2b46ab63..16c608269cca 100644
> --- a/drivers/net/arcnet/arcdevice.h
> +++ b/drivers/net/arcnet/arcdevice.h
> @@ -201,7 +201,7 @@ struct ArcProto {
>   void (*rx)(struct net_device *dev, int bufnum,
>  struct archdr *pkthdr, int length);
>   int (*build_header)(struct sk_buff *skb, struct net_device *dev,
> - unsigned short ethproto, uint8_t daddr);
> + unsigned short ethproto, uint8_t daddr);

  + unsigned short ethproto, uint8_t daddr);

Please use Linux coding style style, parameters continuing on separate
line are aligned with opening parenthesis.

>   /* these functions return '1' if the skb can now be freed */
>   int (*prepare_tx)(struct net_device *dev, struct archdr *pkt,
> @@ -326,9 +326,9 @@ struct arcnet_local {
>   void (*recontrigger) (struct net_device * dev, int enable);
>  
>   void (*copy_to_card)(struct net_device *dev, int bufnum,
> -  

[PATCH net] net/tls: Fix connection stall on partial tls record

2018-05-06 Thread Andre Tomt
In the case of writing a partial tls record we forgot to clear the
ctx->in_tcp_sendpages flag, causing some connections to stall.

Fixes: c212d2c7fc47 ("net/tls: Don't recursively call push_record during 
tls_write_space callbacks")
Signed-off-by: Andre Tomt 
---
 net/tls/tls_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index cc03e00785c7..a02ebdfa0675 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -135,6 +135,7 @@ int tls_push_sg(struct sock *sk,
offset -= sg->offset;
ctx->partially_sent_offset = offset;
ctx->partially_sent_record = (void *)sg;
+   ctx->in_tcp_sendpages = false;
return ret;
}
 
-- 
2.17.0



linux-next: manual merge of the tip tree with the bpf-next tree

2018-05-06 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the tip tree got a conflict in:

  arch/x86/net/bpf_jit_comp.c

between commit:

  e782bdcf58c5 ("bpf, x64: remove ld_abs/ld_ind")

from the bpf-next tree and commit:

  5f26c50143f5 ("x86/bpf: Clean up non-standard comments, to make the code more 
readable")

from the tip tree.

I fixed it up (the former commit removed some code modified by the latter,
so I just removed it) and can carry the fix as necessary. This is now
fixed as far as linux-next is concerned, but any non trivial conflicts
should be mentioned to your upstream maintainer when your tree is
submitted for merging.  You may also want to consider cooperating with
the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH 00/51] Netfilter/IPVS updates for net-next

2018-05-06 Thread David Miller
From: Pablo Neira Ayuso 
Date: Mon,  7 May 2018 00:46:18 +0200

> 
> The following patchset contains Netfilter/IPVS updates for your net-next
> tree, more relevant updates in this batch are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled.


Re: [PATCH net-next v2 2/3] ipv6: support sport and dport in RTM_GETROUTE

2018-05-06 Thread David Ahern
On 5/6/18 6:59 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> This is a followup to fib6 rules sport and dport
> match support. Having them supported in getroute
> makes it easier to test fib6 rule lookups. Used by fib6 rule
> self tests.
> 
> Signed-off-by: Roopa Prabhu 
> ---
>  net/ipv6/route.c | 25 +
>  1 file changed, 25 insertions(+)

similar comments as IPv4 patch.


Re: [PATCH net-next v2 1/3] ipv4: support sport and dport in RTM_GETROUTE

2018-05-06 Thread David Ahern
On 5/6/18 6:59 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> This is a followup to fib rules sport, dport match support.
> Having them supported in getroute makes it easier to test
> fib rule lookups. Used by fib rule self tests. Before this patch
> getroute used same skb to pass through the route lookup and
> for the netlink getroute reply msg. This patch allocates separate
> skb's to keep flow dissector happy.
> 
> Signed-off-by: Roopa Prabhu 
> ---
>  include/uapi/linux/rtnetlink.h |   2 +
>  net/ipv4/route.c   | 151 
> ++---
>  2 files changed, 115 insertions(+), 38 deletions(-)
> 
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 9b15005..630ecf4 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -327,6 +327,8 @@ enum rtattr_type_t {
>   RTA_PAD,
>   RTA_UID,
>   RTA_TTL_PROPAGATE,
> + RTA_SPORT,
> + RTA_DPORT,

If you are going to add sport and dport because of the potential for FIB
rules, you need to add ip-proto as well. I realize existing code assumed
UDP, but the FIB rules cover any IP proto. Yes, I know this makes the
change much larger to generate tcp, udp as well as iphdr options; the
joys of new features. ;-)

I also suggest a comment that these new RTA attributes are used for
GETROUTE only.

And you need to add the new entries to rtm_ipv4_policy.
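For reference, the rtm_ipv4_policy entries being asked for would presumably mirror what the companion IPv6 patch in this series already adds to rtm_ipv6_policy:

```c
	[RTA_SPORT]	= { .type = NLA_U16 },
	[RTA_DPORT]	= { .type = NLA_U16 },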


>   __RTA_MAX
>  };
>  
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 1412a7b..e91ed62 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2568,11 +2568,10 @@ struct rtable *ip_route_output_flow(struct net *net, 
> struct flowi4 *flp4,
>  EXPORT_SYMBOL_GPL(ip_route_output_flow);
>  
>  /* called with rcu_read_lock held */
> -static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 
> table_id,
> - struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
> - u32 seq)
> +static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
> + struct rtable *rt, u32 table_id, struct flowi4 *fl4,
> + struct sk_buff *skb, u32 portid, u32 seq)
>  {
> - struct rtable *rt = skb_rtable(skb);
>   struct rtmsg *r;
>   struct nlmsghdr *nlh;
>   unsigned long expires = 0;
> @@ -2651,6 +2650,14 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
> __be32 src, u32 table_id,
>   from_kuid_munged(current_user_ns(), fl4->flowi4_uid)))
>   goto nla_put_failure;
>  
> + if (fl4->fl4_sport &&
> + nla_put_be16(skb, RTA_SPORT, fl4->fl4_sport))
> + goto nla_put_failure;
> +
> + if (fl4->fl4_dport &&
> + nla_put_be16(skb, RTA_DPORT, fl4->fl4_dport))
> + goto nla_put_failure;

Why return the attributes to the user? I can't see any value in that.
UID option is not returned either so there is precedence.


> +
>   error = rt->dst.error;
>  
>   if (rt_is_input_route(rt)) {
> @@ -2668,7 +2675,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
> __be32 src, u32 table_id,
>   }
>   } else
>  #endif
> - if (nla_put_u32(skb, RTA_IIF, skb->dev->ifindex))
> + if (nla_put_u32(skb, RTA_IIF, fl4->flowi4_iif))
>   goto nla_put_failure;
>   }
>  
> @@ -2683,35 +2690,86 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
> __be32 src, u32 table_id,
>   return -EMSGSIZE;
>  }
>  
> +static int nla_get_port(struct nlattr *attr, __be16 *port)
> +{
> + int p = nla_get_be16(attr);

__be16 p; here, since the attribute payload is already big-endian.

> +
> + if (p <= 0 || p >= 0x)
> + return -EINVAL;

This check is not needed by definition of be16.

> +
> + *port = p;
> + return 0;
> +}
> +
> +static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr 
> *nlh,
> +__be32 dst, __be32 src, struct flowi4 *fl4,
> +struct rtable *rt, struct fib_result *res)
> +{
> + struct net *net = sock_net(in_skb->sk);
> + struct rtmsg *rtm = nlmsg_data(nlh);
> + u32 table_id = RT_TABLE_MAIN;
> + struct sk_buff *skb;
> + int err = 0;
> +
> + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
> + if (!skb) {
> + err = -ENOMEM;
> + return err;
> + }

just 'return -ENOMEM' and without the {}.


> +
> + if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
> + table_id = res->table ? res->table->tb_id : 0;
> +
> + if (rtm->rtm_flags & RTM_F_FIB_MATCH)
> + err = fib_dump_info(skb, NETLINK_CB(in_skb).portid,
> + nlh->nlmsg_seq, RTM_NEWROUTE, table_id,
> + rt->rt_type, res->prefix, res->prefixlen,
> + fl4->flowi4_tos, res->fi, 0);
> + else
> + 

Re: [PATCH] isdn: eicon: fix a missing-check bug

2018-05-06 Thread YU Bo

Hello,
I just noticed your subject line; I think something is missing from it.
On Sat, May 05, 2018 at 02:32:46PM -0500, Wenwen Wang wrote:

In divasmain.c, the function divas_write() firstly invokes the function
diva_xdi_open_adapter() to open the adapter that matches with the adapter
number provided by the user, and then invokes the function diva_xdi_write()
to perform the write operation using the matched adapter. The two functions
diva_xdi_open_adapter() and diva_xdi_write() are located in diva.c.

In diva_xdi_open_adapter(), the user command is copied to the object 'msg'
from the userspace pointer 'src' through the function pointer 'cp_fn',
which eventually calls copy_from_user() to do the copy. Then, the adapter
number 'msg.adapter' is used to find out a matched adapter from the
'adapter_queue'. A matched adapter will be returned if it is found.
Otherwise, NULL is returned to indicate the failure of the verification on
the adapter number.

As mentioned above, if a matched adapter is returned, the function
diva_xdi_write() is invoked to perform the write operation. In this
function, the user command is copied once again from the userspace pointer
'src', which is the same as the 'src' pointer in diva_xdi_open_adapter() as
both of them are from the 'buf' pointer in divas_write(). Similarly, the
copy is achieved through the function pointer 'cp_fn', which finally calls
copy_from_user(). After the successful copy, the corresponding command
processing handler of the matched adapter is invoked to perform the write
operation.

It is obvious that there are two copies here from userspace, one is in
diva_xdi_open_adapter(), and one is in diva_xdi_write(). Plus, both of
these two copies share the same source userspace pointer, i.e., the 'buf'
pointer in divas_write(). Given that a malicious userspace process can race
to change the content pointed by the 'buf' pointer, this can pose potential
security issues. For example, in the first copy, the user provides a valid
adapter number to pass the verification process and a valid adapter can be
found. Then the user can modify the adapter number to an invalid number.
This way, the user can bypass the verification process of the adapter
number and inject inconsistent data.

To avoid such issues, this patch adds a check after the second copy in the
function diva_xdi_write(). If the adapter number is not equal to the one
obtained in the first copy, (-4) will be returned to divas_write(), which
will then return an error code -EINVAL.

Signed-off-by: Wenwen Wang 
---
drivers/isdn/hardware/eicon/diva.c  | 6 +-
drivers/isdn/hardware/eicon/divasmain.c | 3 +++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/isdn/hardware/eicon/diva.c 
b/drivers/isdn/hardware/eicon/diva.c
index 944a7f3..46cbf76 100644
--- a/drivers/isdn/hardware/eicon/diva.c
+++ b/drivers/isdn/hardware/eicon/diva.c
@@ -440,6 +440,7 @@ diva_xdi_write(void *adapter, void *os_handle, const void 
__user *src,
   int length, divas_xdi_copy_from_user_fn_t cp_fn)
{
diva_os_xdi_adapter_t *a = (diva_os_xdi_adapter_t *) adapter;
+   diva_xdi_um_cfg_cmd_t *p;
void *data;

if (a->xdi_mbox.status & DIVA_XDI_MBOX_BUSY) {
@@ -461,7 +462,10 @@ diva_xdi_write(void *adapter, void *os_handle, const void 
__user *src,

length = (*cp_fn) (os_handle, data, src, length);
if (length > 0) {
-   if ((*(a->interface.cmd_proc))
+   p = (diva_xdi_um_cfg_cmd_t *) data;
+   if (a->controller != (int)p->adapter) {
+   length = -4;
+   } else if ((*(a->interface.cmd_proc))
(a, (diva_xdi_um_cfg_cmd_t *) data, length)) {
length = -3;
}
diff --git a/drivers/isdn/hardware/eicon/divasmain.c 
b/drivers/isdn/hardware/eicon/divasmain.c
index b9980e8..a03c658 100644
--- a/drivers/isdn/hardware/eicon/divasmain.c
+++ b/drivers/isdn/hardware/eicon/divasmain.c
@@ -614,6 +614,9 @@ static ssize_t divas_write(struct file *file, const char 
__user *buf,
case -3:
ret = -ENXIO;
break;
+   case -4:
+   ret = -EINVAL;
+   break;
}
DBG_TRC(("write: ret %d", ret));
return (ret);
--
2.7.4



Re: [PATCH bpf-next v3 0/6] ipv6: sr: introduce seg6local End.BPF action

2018-05-06 Thread Alexei Starovoitov
On Sun, May 06, 2018 at 06:27:28PM +0100, Mathieu Xhonneux wrote:
> As of Linux 4.14, it is possible to define advanced local processing for
> IPv6 packets with a Segment Routing Header through the seg6local LWT
> infrastructure. This LWT implements the network programming principles
> defined in the IETF “SRv6 Network Programming” draft.
> 
> The implemented operations are generic, and it would be very interesting to
> be able to implement user-specific seg6local actions, without having to
> modify the kernel directly. To do so, this patchset adds an End.BPF action
> to seg6local, powered by some specific Segment Routing-related helpers,
> which provide SR functionalities that can be applied on the packet. This
> BPF hook would then allow to implement specific actions at native kernel
> speed such as OAM features, advanced SR SDN policies, SRv6 actions like
> Segment Routing Header (SRH) encapsulation depending on the content of
> the packet, etc ... 
> 
> This patchset is divided in 6 patches, whose main features are :
> 
> - A new seg6local action End.BPF with the corresponding new BPF program
>   type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
>   passed to the LWT seg6local through netlink, the same way as the LWT
>   BPF hook operates.
> - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
>   shrink a SRH and apply on a packet some of the generic SRv6 actions.
> - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
>   encapsulation (via IPv6 encapsulation or inlining if the packet contains
>   already an IPv6 header).
> 
> As this patchset adds a new LWT BPF hook, I took into account the result of
> the discussions when the LWT BPF infrastructure got merged. Hence, the
> seg6local BPF hook doesn’t allow write access to skb->data directly, only
> the SRH can be modified through specific helpers, which ensures that the
> integrity of the packet is maintained.
> More details are available in the related patches messages.
> 
> The performances of this BPF hook have been assessed with the BPF JIT
> enabled on a Intel Xeon X3440 processors with 4 cores and 8 threads
> clocked at 2.53 GHz. No throughput losses are noted with the seg6local
> BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
> TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> drops the throughput to 410kpps, and inlining a SRH via
> bpf_lwt_seg6_action drops the throughput to 420kpps.
> All throughputs are stable.
> 
> ---
> v2: move the SRH integrity state from skb->cb to a per-cpu buffer
> v3: - document helpers in man-page style
> - fix kbuild bugs
> - un-break BPF LWT out hook
> - bpf_push_seg6_encap is now static
> - preempt_enable is now called when the packet is dropped in
>   input_action_end_bpf

Please fix build issue that 0bot caught and resubmit.

Thanks



[PATCH net-next v2 2/3] ipv6: support sport and dport in RTM_GETROUTE

2018-05-06 Thread Roopa Prabhu
From: Roopa Prabhu 

This is a followup to fib6 rules sport and dport
match support. Having them supported in getroute
makes it easier to test fib6 rule lookups. Used by fib6 rule
self tests.

Signed-off-by: Roopa Prabhu 
---
 net/ipv6/route.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8ed1b51..bcdc056 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4071,6 +4071,8 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] 
= {
[RTA_UID]   = { .type = NLA_U32 },
[RTA_MARK]  = { .type = NLA_U32 },
[RTA_TABLE] = { .type = NLA_U32 },
+   [RTA_SPORT] = { .type = NLA_U16 },
+   [RTA_DPORT] = { .type = NLA_U16 },
 };
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -4728,6 +4730,17 @@ int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 arg->cb->nlh->nlmsg_seq, NLM_F_MULTI);
 }
 
+static int nla_get_port(struct nlattr *attr, __be16 *port)
+{
+   int p = nla_get_be16(attr);
+
+   if (p <= 0 || p >= 0x)
+   return -EINVAL;
+
+   *port = p;
+   return 0;
+}
+
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
  struct netlink_ext_ack *extack)
 {
@@ -4782,6 +4795,18 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
else
fl6.flowi6_uid = iif ? INVALID_UID : current_uid();
 
+   if (tb[RTA_SPORT]) {
   err = nla_get_port(tb[RTA_SPORT], &sport);
+   if (err)
+   goto errout;
+   }
+
+   if (tb[RTA_DPORT]) {
   err = nla_get_port(tb[RTA_DPORT], &dport);
+   if (err)
+   goto errout;
+   }
+
if (iif) {
struct net_device *dev;
int flags = 0;
-- 
2.1.4



[PATCH net-next v2 3/3] selftests: net: initial fib rule tests

2018-05-06 Thread Roopa Prabhu
From: Roopa Prabhu 

This adds a first set of tests for fib rule match/action for
ipv4 and ipv6. Initial tests only cover action lookup table.
can be extended to cover other actions in the future.
Uses ip route get to validate the rule lookup.

Signed-off-by: Roopa Prabhu 
---
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/fib_rule_tests.sh | 224 ++
 2 files changed, 225 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/fib_rule_tests.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 902820d..9a8f9b0 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -6,7 +6,7 @@ CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh udpgso.sh
-TEST_PROGS += udpgso_bench.sh
+TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh
 TEST_GEN_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/fib_rule_tests.sh 
b/tools/testing/selftests/net/fib_rule_tests.sh
new file mode 100644
index 000..01a250f
--- /dev/null
+++ b/tools/testing/selftests/net/fib_rule_tests.sh
@@ -0,0 +1,224 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test is for checking IPv4 and IPv6 FIB rules API
+
+ret=0
+
+PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
+IP="ip -netns testns"
+
+RTABLE=100
+GW_IP4=192.51.100.2
+SRC_IP=192.51.100.3
+GW_IP6=2001:db8:1::2
+SRC_IP6=2001:db8:1::3
+
+DEV_ADDR=192.51.100.1
+DEV=dummy0
+
+log_test()
+{
+   local rc=$1
+   local expected=$2
+   local msg="$3"
+
+   if [ ${rc} -eq ${expected} ]; then
+   nsuccess=$((nsuccess+1))
+   printf "\nTEST: %-50s  [ OK ]\n" "${msg}"
+   else
+   nfail=$((nfail+1))
+   printf "\nTEST: %-50s  [FAIL]\n" "${msg}"
+   if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+   echo
+   echo "hit enter to continue, 'q' to quit"
+   read a
+   [ "$a" = "q" ] && exit 1
+   fi
+   fi
+}
+
+log_section()
+{
+   echo
+   echo "##"
+   echo "TEST SECTION: $*"
+   echo "##"
+}
+
+setup()
+{
+   set -e
+   ip netns add testns
+   $IP link set dev lo up
+
+   $IP link add dummy0 type dummy
+   $IP link set dev dummy0 up
+   $IP address add 198.51.100.1/24 dev dummy0
+   $IP -6 address add 2001:db8:1::1/64 dev dummy0
+
+   set +e
+}
+
+cleanup()
+{
+   $IP link del dev dummy0 &> /dev/null
+   ip netns del testns
+}
+
+fib_check_iproute_support()
+{
+   ip rule help 2>&1 | grep -q $1
+   if [ $? -ne 0 ]; then
+   echo "SKIP: iproute2 iprule too old, missing $1 match"
+   return 1
+   fi
+
+   ip route get help 2>&1 | grep -q $2
+   if [ $? -ne 0 ]; then
+   echo "SKIP: iproute2 get route too old, missing $2 match"
+   return 1
+   fi
+
+   return 0
+}
+
+fib_rule6_del()
+{
+   $IP -6 rule del $1
+   log_test $? 0 "rule6 del $1"
+}
+
+fib_rule6_del_by_pref()
+{
+   pref=$($IP -6 rule show | grep "$1 lookup $RTABLE" | cut -d ":" -f 1)
+   $IP -6 rule del pref $pref
+}
+
+fib_rule6_test_match_n_redirect()
+{
+   local match="$1"
+   local getmatch="$2"
+
+   $IP -6 rule add $match table $RTABLE
+   $IP -6 route get $GW_IP6 $getmatch | grep -q "table $RTABLE"
+   log_test $? 0 "rule6 check: $1"
+
+   fib_rule6_del_by_pref "$match"
+   log_test $? 0 "rule6 del by pref: $match"
+}
+
+fib_rule6_test()
+{
+   # setup the fib rule redirect route
+   $IP -6 route add table $RTABLE default via $GW_IP6 dev $DEV onlink
+
+   match="oif $DEV"
+   fib_rule6_test_match_n_redirect "$match" "$match" "oif redirect to 
table"
+
+   match="from $SRC_IP6 iif $DEV"
+   fib_rule6_test_match_n_redirect "$match" "$match" "iif redirect to 
table"
+
+   match="tos 0x10"
+   fib_rule6_test_match_n_redirect "$match" "$match" "tos redirect to 
table"
+
+   match="fwmark 0x64"
+   getmatch="mark 0x64"
+   fib_rule6_test_match_n_redirect "$match" "$getmatch" "fwmark redirect 
to table"
+
+   fib_check_iproute_support "uidrange" "uid"
+   if [ $? -eq 0 ]; then
+   match="uidrange 100-100"
+   getmatch="uid 100"
+   fib_rule6_test_match_n_redirect "$match" "$getmatch" "uid 
redirect to table"
+   fi
+
+   fib_check_iproute_support "sport" "sport"
+   if [ $? -eq 0 ]; then
+   

[PATCH net-next v2 1/3] ipv4: support sport and dport in RTM_GETROUTE

2018-05-06 Thread Roopa Prabhu
From: Roopa Prabhu 

This is a followup to fib rules sport, dport match support.
Having them supported in getroute makes it easier to test
fib rule lookups. Used by fib rule self tests. Before this patch
getroute used the same skb to pass through the route lookup and
for the netlink getroute reply msg. This patch allocates separate
skbs to keep the flow dissector happy.
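As a rough illustration of what the new attributes look like on the wire, here is a minimal sketch (in Python, for brevity) of packing an RTA_SPORT attribute the way nla_put_be16() does: a 4-byte rtattr header followed by a big-endian 16-bit port, padded to a 4-byte boundary. The attribute value 27 is an assumption derived from the enum order in this patch (RTA_SPORT immediately follows RTA_TTL_PROPAGATE), not taken from a released uapi header.

```python
import struct

# Assumed value: RTA_SPORT follows RTA_TTL_PROPAGATE (26) in this
# patch's enum, so 27 here. Not from a released uapi header.
RTA_SPORT = 27
NLA_ALIGNTO = 4

def pack_rtattr_be16(rta_type: int, port: int) -> bytes:
    """Pack a struct rtattr carrying a __be16 payload, NLA-aligned."""
    payload = struct.pack(">H", port)       # __be16: network byte order
    rta_len = 4 + len(payload)              # rta_len counts header + payload, unpadded
    attr = struct.pack("=HH", rta_len, rta_type) + payload
    pad = (-len(attr)) % NLA_ALIGNTO        # pad the attribute out to 4 bytes
    return attr + b"\x00" * pad

attr = pack_rtattr_be16(RTA_SPORT, 53)
```

In the patch itself the equivalent bytes are emitted by nla_put_be16(skb, RTA_SPORT, fl4->fl4_sport) in rt_fill_info().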

Signed-off-by: Roopa Prabhu 
---
 include/uapi/linux/rtnetlink.h |   2 +
 net/ipv4/route.c   | 151 ++---
 2 files changed, 115 insertions(+), 38 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9b15005..630ecf4 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -327,6 +327,8 @@ enum rtattr_type_t {
RTA_PAD,
RTA_UID,
RTA_TTL_PROPAGATE,
+   RTA_SPORT,
+   RTA_DPORT,
__RTA_MAX
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 1412a7b..e91ed62 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2568,11 +2568,10 @@ struct rtable *ip_route_output_flow(struct net *net, 
struct flowi4 *flp4,
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
 /* called with rcu_read_lock held */
-static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
-   struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
-   u32 seq)
+static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
+   struct rtable *rt, u32 table_id, struct flowi4 *fl4,
+   struct sk_buff *skb, u32 portid, u32 seq)
 {
-   struct rtable *rt = skb_rtable(skb);
struct rtmsg *r;
struct nlmsghdr *nlh;
unsigned long expires = 0;
@@ -2651,6 +2650,14 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src, u32 table_id,
from_kuid_munged(current_user_ns(), fl4->flowi4_uid)))
goto nla_put_failure;
 
+   if (fl4->fl4_sport &&
+   nla_put_be16(skb, RTA_SPORT, fl4->fl4_sport))
+   goto nla_put_failure;
+
+   if (fl4->fl4_dport &&
+   nla_put_be16(skb, RTA_DPORT, fl4->fl4_dport))
+   goto nla_put_failure;
+
error = rt->dst.error;
 
if (rt_is_input_route(rt)) {
@@ -2668,7 +2675,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src, u32 table_id,
}
} else
 #endif
-   if (nla_put_u32(skb, RTA_IIF, skb->dev->ifindex))
+   if (nla_put_u32(skb, RTA_IIF, fl4->flowi4_iif))
goto nla_put_failure;
}
 
@@ -2683,35 +2690,86 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src, u32 table_id,
return -EMSGSIZE;
 }
 
+static int nla_get_port(struct nlattr *attr, __be16 *port)
+{
+   int p = nla_get_be16(attr);
+
+   if (p <= 0 || p >= 0xffff)
+   return -EINVAL;
+
+   *port = p;
+   return 0;
+}
+
+static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr 
*nlh,
+  __be32 dst, __be32 src, struct flowi4 *fl4,
+  struct rtable *rt, struct fib_result *res)
+{
+   struct net *net = sock_net(in_skb->sk);
+   struct rtmsg *rtm = nlmsg_data(nlh);
+   u32 table_id = RT_TABLE_MAIN;
+   struct sk_buff *skb;
+   int err = 0;
+
+   skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+   if (!skb) {
+   err = -ENOMEM;
+   return err;
+   }
+
+   if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
+   table_id = res->table ? res->table->tb_id : 0;
+
+   if (rtm->rtm_flags & RTM_F_FIB_MATCH)
+   err = fib_dump_info(skb, NETLINK_CB(in_skb).portid,
+   nlh->nlmsg_seq, RTM_NEWROUTE, table_id,
+   rt->rt_type, res->prefix, res->prefixlen,
+   fl4->flowi4_tos, res->fi, 0);
+   else
+   err = rt_fill_info(net, dst, src, rt, table_id,
+  fl4, skb, NETLINK_CB(in_skb).portid,
+  nlh->nlmsg_seq);
+   if (err < 0)
+   goto errout;
+
+   return rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
+
+errout:
+   kfree_skb(skb);
+   return err;
+}
+
 static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 struct netlink_ext_ack *extack)
 {
struct net *net = sock_net(in_skb->sk);
-   struct rtmsg *rtm;
struct nlattr *tb[RTA_MAX+1];
+   __be16 sport = 0, dport = 0;
struct fib_result res = {};
struct rtable *rt = NULL;
+   struct sk_buff *skb;
+   struct rtmsg *rtm;
struct flowi4 fl4;
+   struct iphdr *iph;
+   struct udphdr *udph;

[PATCH net-next v2 0/3] fib rule selftest

2018-05-06 Thread Roopa Prabhu
From: Roopa Prabhu 

This series adds a new test to test fib rules.
ip route get is used to test fib rule matches.
This series also extends ip route get to match on
sport and dport to test recent support of sport
and dport fib rule match.

v2 - address Ido's comment to make sport/dport ip route get
work correctly for input route get. I don't support ip route get
on ip-proto match yet. ip route get creates a udp packet and I
have left it at that. We could extend ip route get to support
a few ip proto matches in followup patches.


Roopa Prabhu (3):
  ipv4: support sport and dport in RTM_GETROUTE
  ipv6: support sport and dport in RTM_GETROUTE
  selftests: net: initial fib rule tests

 include/uapi/linux/rtnetlink.h|   2 +
 net/ipv4/route.c  | 152 -
 net/ipv6/route.c  |  25 +++
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/fib_rule_tests.sh | 224 ++
 5 files changed, 366 insertions(+), 39 deletions(-)
 create mode 100644 tools/testing/selftests/net/fib_rule_tests.sh

-- 
2.1.4



RE: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes

2018-05-06 Thread Woojung.Huh
Hi Florian,

> Well, the way the code is structure is that if you call that function
> with a test mode value that is not part of the standard set, it returns
> -EOPNOTSUPP, so if your particular PHY driver wants to "overlay"
> standard and non-standard modes, it can by using that hint.
> 
> This should work even if we have more standard test modes in the future
> because the test modes are dynamically fetched by user-space using the
> ETH_GSTRINGS ioctl().
> 
> Does that cover what you had in mind?
Basically, agree on your explanation.

My idea was making genphy_set_test() more expandable for other test modes
because it would be a good place to add more standard test modes later.

No problem to keep current codes.

Thanks.
Woojung


linux-next: manual merge of the net-next tree with the net tree

2018-05-06 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  kernel/bpf/syscall.c

between commit:

  9ef09e35e521 ("bpf: fix possible spectre-v1 in find_and_alloc_map()")

from the net tree and commit:

  a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")

from the net-next tree.

I fixed it up (I removed the conflicting addition of an include of
linux/btf.h in the latter commit as it had already been included
earlier in the file by a previous commit) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH bpf-next v3 3/6] bpf: Add IPv6 Segment Routing helpers

2018-05-06 Thread kbuild test robot
Hi Mathieu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Mathieu-Xhonneux/ipv6-sr-introduce-seg6local-End-BPF-action/20180506-233046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: parisc-allmodconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=parisc 

All errors (new ones prefixed by >>):

   net/core/filter.o: In function `bpf_push_seg6_encap':
>> (.text.bpf_push_seg6_encap+0x40): undefined reference to `seg6_validate_srh'
>> (.text.bpf_push_seg6_encap+0x74): undefined reference to `seg6_do_srh_inline'
>> (.text.bpf_push_seg6_encap+0xa8): undefined reference to `seg6_do_srh_encap'
>> (.text.bpf_push_seg6_encap+0xe8): undefined reference to 
>> `seg6_lookup_nexthop'
   net/core/filter.o: In function `bpf_lwt_seg6_store_bytes':
>> (.text.bpf_lwt_seg6_store_bytes+0x48): undefined reference to 
>> `seg6_bpf_srh_states'
   (.text.bpf_lwt_seg6_store_bytes+0x4c): undefined reference to 
`seg6_bpf_srh_states'
   net/core/filter.o: In function `bpf_lwt_seg6_action':
>> (.text.bpf_lwt_seg6_action+0x48): undefined reference to 
>> `seg6_bpf_srh_states'
   (.text.bpf_lwt_seg6_action+0x4c): undefined reference to 
`seg6_bpf_srh_states'
>> (.text.bpf_lwt_seg6_action+0xc8): undefined reference to `seg6_validate_srh'
>> (.text.bpf_lwt_seg6_action+0x12c): undefined reference to 
>> `seg6_lookup_nexthop'
   (.text.bpf_lwt_seg6_action+0x14c): undefined reference to 
`seg6_lookup_nexthop'
   net/core/filter.o: In function `bpf_lwt_seg6_adjust_srh':
>> (.text.bpf_lwt_seg6_adjust_srh+0x38): undefined reference to 
>> `seg6_bpf_srh_states'
   (.text.bpf_lwt_seg6_adjust_srh+0x3c): undefined reference to 
`seg6_bpf_srh_states'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH bpf-next v3 3/6] bpf: Add IPv6 Segment Routing helpers

2018-05-06 Thread kbuild test robot
Hi Mathieu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Mathieu-Xhonneux/ipv6-sr-introduce-seg6local-End-BPF-action/20180506-233046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: s390-allmodconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   net/core/filter.o: In function `bpf_push_seg6_encap':
   filter.c:(.text+0xaf4c): undefined reference to `seg6_validate_srh'
   filter.c:(.text+0xaf8a): undefined reference to `seg6_do_srh_inline'
   filter.c:(.text+0xafc4): undefined reference to `seg6_do_srh_encap'
   filter.c:(.text+0xb016): undefined reference to `seg6_lookup_nexthop'
   net/core/filter.o: In function `bpf_lwt_seg6_store_bytes':
>> (.text+0xb106): undefined reference to `seg6_bpf_srh_states'
   net/core/filter.o: In function `bpf_lwt_seg6_action':
   (.text+0xb2b0): undefined reference to `seg6_bpf_srh_states'
>> (.text+0xb334): undefined reference to `seg6_validate_srh'
>> (.text+0xb394): undefined reference to `seg6_lookup_nexthop'
   (.text+0xb3c4): undefined reference to `seg6_lookup_nexthop'
   net/core/filter.o: In function `bpf_lwt_seg6_adjust_srh':
   (.text+0xb492): undefined reference to `seg6_bpf_srh_states'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH 03/51] netfilter: ipvs: Add Maglev hashing scheduler

2018-05-06 Thread Pablo Neira Ayuso
From: Inju Song 

Implements Google's Maglev hashing algorithm as an IPVS scheduler.

Basically it provides consistent hashing but offers some special
features about disruption and load balancing.

 1) minimal disruption: when the set of destinations changes,
a connection will likely be sent to the same destination
as it was before.

 2) load balancing: each destination will receive an almost
equal number of connections.

See [3.4 Consistent Hashing] for details in:
https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
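For readers unfamiliar with Maglev, the table-population step described above can be sketched in a few lines. This is an illustrative toy (in Python, with SHA-256 standing in for the two hsiphash keys the kernel patch uses), not the kernel implementation: each destination derives a permutation of table slots from two hashes (an offset and a skip), and destinations take turns claiming their next free preferred slot until the prime-sized table is full.

```python
import hashlib

M = 13  # lookup table size; must be prime (the kernel uses 251..131071)

def _h(name: str, salt: str) -> int:
    # stand-in for the two keyed hsiphash calls in the real scheduler
    return int.from_bytes(hashlib.sha256((salt + name).encode()).digest()[:4], "big")

def populate(dests):
    offset = {d: _h(d, "one") % M for d in dests}
    skip = {d: _h(d, "two") % (M - 1) + 1 for d in dests}  # skip in 1..M-1
    nxt = {d: 0 for d in dests}
    table = [None] * M
    filled = 0
    while filled < M:
        for d in dests:                      # round-robin over destinations
            if filled == M:
                break
            # walk d's preference list (offset + j*skip mod M covers all
            # slots because M is prime) until a free slot is found
            c = (offset[d] + nxt[d] * skip[d]) % M
            while table[c] is not None:
                nxt[d] += 1
                c = (offset[d] + nxt[d] * skip[d]) % M
            table[c] = d
            nxt[d] += 1
            filled += 1
    return table

table = populate(["rs0", "rs1", "rs2"])
```

The round-robin turn-taking is what yields the almost-equal shares; the real scheduler additionally scales each destination's turns by weight / gcd(weights) so heavier servers claim proportionally more slots.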

Signed-off-by: Inju Song 
Signed-off-by: Julian Anastasov 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/ip_vs_mh.c | 540 ++
 1 file changed, 540 insertions(+)
 create mode 100644 net/netfilter/ipvs/ip_vs_mh.c

diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
new file mode 100644
index ..0f795b186eb3
--- /dev/null
+++ b/net/netfilter/ipvs/ip_vs_mh.c
@@ -0,0 +1,540 @@
+// SPDX-License-Identifier: GPL-2.0
+/* IPVS:   Maglev Hashing scheduling module
+ *
+ * Authors:Inju Song 
+ *
+ */
+
+/* The mh algorithm is to assign a preference list of all the lookup
+ * table positions to each destination and populate the table with
+ * the most-preferred position of destinations. Then it is to select
+ * destination with the hash key of source IP address through looking
+ * up the lookup table.
+ *
+ * The algorithm is detailed in:
+ * [3.4 Consistent Hashing]
+https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
+ *
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+#include 
+#include 
+
+#define IP_VS_SVC_F_SCHED_MH_FALLBACK  IP_VS_SVC_F_SCHED1 /* MH fallback */
+#define IP_VS_SVC_F_SCHED_MH_PORT  IP_VS_SVC_F_SCHED2 /* MH use port */
+
+struct ip_vs_mh_lookup {
+   struct ip_vs_dest __rcu *dest;  /* real server (cache) */
+};
+
+struct ip_vs_mh_dest_setup {
+   unsigned intoffset; /* starting offset */
+   unsigned intskip;   /* skip */
+   unsigned intperm;   /* next_offset */
+   int turns;  /* weight / gcd() and rshift */
+};
+
+/* Available prime numbers for MH table */
+static int primes[] = {251, 509, 1021, 2039, 4093,
+  8191, 16381, 32749, 65521, 131071};
+
+/* For IPVS MH entry hash table */
+#ifndef CONFIG_IP_VS_MH_TAB_INDEX
+#define CONFIG_IP_VS_MH_TAB_INDEX  12
+#endif
+#define IP_VS_MH_TAB_BITS  (CONFIG_IP_VS_MH_TAB_INDEX / 2)
+#define IP_VS_MH_TAB_INDEX (CONFIG_IP_VS_MH_TAB_INDEX - 8)
+#define IP_VS_MH_TAB_SIZE   primes[IP_VS_MH_TAB_INDEX]
+
+struct ip_vs_mh_state {
+   struct rcu_head rcu_head;
+   struct ip_vs_mh_lookup  *lookup;
+   struct ip_vs_mh_dest_setup  *dest_setup;
+   hsiphash_key_t  hash1, hash2;
+   int gcd;
+   int rshift;
+};
+
+static inline void generate_hash_secret(hsiphash_key_t *hash1,
+   hsiphash_key_t *hash2)
+{
+   hash1->key[0] = 2654435761UL;
+   hash1->key[1] = 2654435761UL;
+
+   hash2->key[0] = 2654446892UL;
+   hash2->key[1] = 2654446892UL;
+}
+
+/* Helper function to determine if server is unavailable */
+static inline bool is_unavailable(struct ip_vs_dest *dest)
+{
+   return atomic_read(&dest->weight) <= 0 ||
+  dest->flags & IP_VS_DEST_F_OVERLOAD;
+}
+
+/* Returns hash value for IPVS MH entry */
+static inline unsigned int
+ip_vs_mh_hashkey(int af, const union nf_inet_addr *addr,
+__be16 port, hsiphash_key_t *key, unsigned int offset)
+{
+   unsigned int v;
+   __be32 addr_fold = addr->ip;
+
+#ifdef CONFIG_IP_VS_IPV6
+   if (af == AF_INET6)
+   addr_fold = addr->ip6[0] ^ addr->ip6[1] ^
+   addr->ip6[2] ^ addr->ip6[3];
+#endif
+   v = (offset + ntohs(port) + ntohl(addr_fold));
+   return hsiphash(&v, sizeof(v), key);
+}
+
+/* Reset all the hash buckets of the specified table. */
+static void ip_vs_mh_reset(struct ip_vs_mh_state *s)
+{
+   int i;
+   struct ip_vs_mh_lookup *l;
+   struct ip_vs_dest *dest;
+
+   l = &s->lookup[0];
+   for (i = 0; i < IP_VS_MH_TAB_SIZE; i++) {
+   dest = rcu_dereference_protected(l->dest, 1);
+   if (dest) {
+   ip_vs_dest_put(dest);
+   RCU_INIT_POINTER(l->dest, NULL);
+   }
+   l++;
+   }
+}
+
+static int ip_vs_mh_permutate(struct ip_vs_mh_state *s,
+ struct ip_vs_service *svc)
+{
+   struct list_head *p;
+   struct ip_vs_mh_dest_setup 

[PATCH 00/51] Netfilter/IPVS updates for net-next

2018-05-06 Thread Pablo Neira Ayuso
Hi David,

The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:

1) Add Maglev support to IPVS. Moreover, store the latest server weight
   in IPVS since this is needed by Maglev, patches from Inju Song.

2) Preparation works to add iptables flowtable support, patches
   from Felix Fietkau.

3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
   packet is seen via new teardown state, also from Felix.

4) Add support for extended netlink error reporting for nf_tables.

5) Support for timeouts larger than 23 days in nf_tables, patch from
   Florian Westphal.

6) Always set an upper limit to dynamic sets, also from Florian.

7) Allow number generator to make map lookups, from Laura Garcia.

8) Use hash_32() instead of open-coded hashing in IPVS, from Vincent Bernat.

9) Extend ip6tables SRH match to support previous, next and last SID,
   from Ahmed Abdelsalam.

10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.

11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.

12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
   from Taehee Yoo.

13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
   Make rt and exthdr built-in too, again from Florian.

14) Missing initialization of tbl->entries in IPVS, from Cong Wang.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks.



The following changes since commit 415787d7799f4fccbe8d49cb0b8e5811be6b0389:

  ipv6: frags: fix a lockdep false positive (2018-04-18 23:19:39 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to b13468dc577498002cf4e62978359ff97ffcd187:

  netfilter: nft_dynset: fix timeout updates on 32bit (2018-05-07 00:05:22 
+0200)


Ahmed Abdelsalam (1):
  netfilter: ip6t_srh: extend SRH matching for previous, next and last SID

Arvind Yadav (1):
  netfilter: ipvs: Fix space before '[' error.

Cong Wang (2):
  ipvs: initialize tbl->entries after allocation
  ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()

Felix Fietkau (19):
  netfilter: nf_flow_table: use IP_CT_DIR_* values for FLOW_OFFLOAD_DIR_*
  netfilter: nf_flow_table: clean up flow_offload_alloc
  ipv6: make ip6_dst_mtu_forward inline
  netfilter: nf_flow_table: cache mtu in struct flow_offload_tuple
  netfilter: nf_flow_table: rename nf_flow_table.c to nf_flow_table_core.c
  netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table
  netfilter: nf_flow_table: move ip header check out of nf_flow_exceeds_mtu
  netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table
  netfilter: nf_flow_table: relax mixed ipv4/ipv6 flowtable dependencies
  netfilter: nf_flow_table: move init code to nf_flow_table_core.c
  netfilter: nf_flow_table: fix priv pointer for netdev hook
  netfilter: nf_flow_table: track flow tables in nf_flow_table directly
  netfilter: nf_flow_table: make flow_offload_dead inline
  netfilter: nf_flow_table: add a new flow state for tearing down offloading
  netfilter: nf_flow_table: in flow_offload_lookup, skip entries being 
deleted
  netfilter: nf_flow_table: add support for sending flows back to the slow 
path
  netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen
  netfilter: nf_flow_table: add missing condition for TCP state check
  netfilter: nf_flow_table: fix offloading connections with SNAT+DNAT

Fernando Fernandez Mancera (1):
  netfilter: extract Passive OS fingerprint infrastructure from xt_osf

Florent Fourcot (1):
  netfilter: ctnetlink: export nf_conntrack_max

Florian Westphal (8):
  netfilter: nf_tables: support timeouts larger than 23 days
  netfilter: nf_tables: always use an upper set size for dynsets
  netfilter: merge meta_bridge into nft_meta
  netfilter: nf_tables: make meta expression builtin
  netfilter: nf_tables: merge rt expression into nft core
  netfilter: nf_tables: merge exthdr expression into nft core
  netfilter: nf_nat: remove unused ct arg from lookup functions
  netfilter: nft_dynset: fix timeout updates on 32bit

Inju Song (3):
  netfilter: ipvs: Keep latest weight of destination
  netfilter: ipvs: Add Maglev hashing scheduler
  netfilter: ipvs: Add configurations of Maglev hashing

Laura Garcia Liebana (2):
  netfilter: nft_numgen: add map lookups for numgen statements
  netfilter: nft_numgen: enable hashing of one element

Pablo Neira Ayuso (3):
  netfilter: nf_tables: simplify lookup functions
  netfilter: nf_tables: initial support for extended ACK reporting
  Merge tag 'ipvs-for-v4.18' of 

[PATCH 10/51] netfilter: nf_flow_table: cache mtu in struct flow_offload_tuple

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Reduces the number of cache lines touched in the offload forwarding
path. This is safe because PMTU limits are bypassed for the forwarding
path (see commit f87c10a8aa1e for more details).

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h   |  2 ++
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 17 +++--
 net/ipv6/netfilter/nf_flow_table_ipv6.c | 17 +++--
 net/netfilter/nf_flow_table.c   |  8 ++--
 4 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index 09ba67598991..76ee5c81b752 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -55,6 +55,8 @@ struct flow_offload_tuple {
 
int oifidx;
 
+   u16 mtu;
+
struct dst_entry*dst_cache;
 };
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c 
b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index 0cd46bffa469..461b1815e633 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -178,7 +178,7 @@ static int nf_flow_tuple_ip(struct sk_buff *skb, const 
struct net_device *dev,
 }
 
 /* Based on ip_exceeds_mtu(). */
-static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 {
if (skb->len <= mtu)
return false;
@@ -192,17 +192,6 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff 
*skb, unsigned int mtu)
return true;
 }
 
-static bool nf_flow_exceeds_mtu(struct sk_buff *skb, const struct rtable *rt)
-{
-   u32 mtu;
-
-   mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
-   if (__nf_flow_exceeds_mtu(skb, mtu))
-   return true;
-
-   return false;
-}
-
 unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
@@ -233,9 +222,9 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 
dir = tuplehash->tuple.dir;
flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
-
rt = (const struct rtable *)flow->tuplehash[dir].tuple.dst_cache;
-   if (unlikely(nf_flow_exceeds_mtu(skb, rt)))
+
+   if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
return NF_ACCEPT;
 
if (skb_try_make_writable(skb, sizeof(*iph)))
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c 
b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index 207cb35569b1..0e6328490142 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -173,7 +173,7 @@ static int nf_flow_tuple_ipv6(struct sk_buff *skb, const 
struct net_device *dev,
 }
 
 /* Based on ip_exceeds_mtu(). */
-static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 {
if (skb->len <= mtu)
return false;
@@ -184,17 +184,6 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff 
*skb, unsigned int mtu)
return true;
 }
 
-static bool nf_flow_exceeds_mtu(struct sk_buff *skb, const struct rt6_info *rt)
-{
-   u32 mtu;
-
-   mtu = ip6_dst_mtu_forward(&rt->dst);
-   if (__nf_flow_exceeds_mtu(skb, mtu))
-   return true;
-
-   return false;
-}
-
 unsigned int
 nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
  const struct nf_hook_state *state)
@@ -225,9 +214,9 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 
dir = tuplehash->tuple.dir;
flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
-
rt = (struct rt6_info *)flow->tuplehash[dir].tuple.dst_cache;
-   if (unlikely(nf_flow_exceeds_mtu(skb, rt)))
+
+   if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
return NF_ACCEPT;
 
if (skb_try_make_writable(skb, sizeof(*ip6h)))
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index db0673a40b97..7403a0dfddf7 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -4,6 +4,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +25,7 @@ flow_offload_fill_dir(struct flow_offload *flow, struct 
nf_conn *ct,
 {
	struct flow_offload_tuple *ft = &flow->tuplehash[dir].tuple;
	struct nf_conntrack_tuple *ctt = &ct->tuplehash[dir].tuple;
+   struct dst_entry *dst = route->tuple[dir].dst;
 
ft->dir = dir;
 
@@ -30,10 +33,12 @@ flow_offload_fill_dir(struct flow_offload *flow, struct 
nf_conn *ct,
case NFPROTO_IPV4:
ft->src_v4 = ctt->src.u3.in;

[PATCH 09/51] ipv6: make ip6_dst_mtu_forward inline

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Just like ip_dst_mtu_maybe_forward(), to avoid a dependency with ipv6.ko.

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ip6_route.h | 21 +
 include/net/ipv6.h  |  2 --
 net/ipv6/ip6_output.c   | 22 --
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d5fb1e4ae7ac..376928c26d2d 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -279,6 +279,27 @@ static inline bool rt6_duplicate_nexthop(struct fib6_info 
*a, struct fib6_info *
   !lwtunnel_cmp_encap(a->fib6_nh.nh_lwtstate, 
b->fib6_nh.nh_lwtstate);
 }
 
+static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
+{
+   struct inet6_dev *idev;
+   unsigned int mtu;
+
+   if (dst_metric_locked(dst, RTAX_MTU)) {
+   mtu = dst_metric_raw(dst, RTAX_MTU);
+   if (mtu)
+   return mtu;
+   }
+
+   mtu = IPV6_MIN_MTU;
+   rcu_read_lock();
+   idev = __in6_dev_get(dst->dev);
+   if (idev)
+   mtu = idev->cnf.mtu6;
+   rcu_read_unlock();
+
+   return mtu;
+}
+
 struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
   struct net_device *dev, struct sk_buff *skb,
   const void *daddr);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 68b167d98879..765441867cfa 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -958,8 +958,6 @@ static inline struct sk_buff *ip6_finish_skb(struct sock 
*sk)
  _sk(sk)->cork);
 }
 
-unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst);
-
 int ip6_dst_lookup(struct net *net, struct sock *sk, struct dst_entry **dst,
   struct flowi6 *fl6);
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3db47986ef38..cec49e137dbb 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -383,28 +383,6 @@ static inline int ip6_forward_finish(struct net *net, 
struct sock *sk,
return dst_output(net, sk, skb);
 }
 
-unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
-{
-   unsigned int mtu;
-   struct inet6_dev *idev;
-
-   if (dst_metric_locked(dst, RTAX_MTU)) {
-   mtu = dst_metric_raw(dst, RTAX_MTU);
-   if (mtu)
-   return mtu;
-   }
-
-   mtu = IPV6_MIN_MTU;
-   rcu_read_lock();
-   idev = __in6_dev_get(dst->dev);
-   if (idev)
-   mtu = idev->cnf.mtu6;
-   rcu_read_unlock();
-
-   return mtu;
-}
-EXPORT_SYMBOL_GPL(ip6_dst_mtu_forward);
-
 static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 {
if (skb->len <= mtu)
-- 
2.11.0



[PATCH 04/51] netfilter: ipvs: Add configurations of Maglev hashing

2018-05-06 Thread Pablo Neira Ayuso
From: Inju Song 

To build the maglev hashing scheduler, add some configuration
to Kconfig and Makefile.

 - The compile configurations of MH are added to the Kconfig.

 - The MH build rule is added to the Makefile.

Signed-off-by: Inju Song 
Signed-off-by: Julian Anastasov 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/Kconfig  | 37 +
 net/netfilter/ipvs/Makefile |  1 +
 2 files changed, 38 insertions(+)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index b32fb0dbe237..05dc1b77e466 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -225,6 +225,25 @@ config IP_VS_SH
  If you want to compile it in kernel, say Y. To compile it as a
  module, choose M here. If unsure, say N.
 
+config IP_VS_MH
+   tristate "maglev hashing scheduling"
+   ---help---
+ The maglev consistent hashing scheduling algorithm provides the
+ Google's Maglev hashing algorithm as an IPVS scheduler. It assigns
+ network connections to the servers through looking up a statically
+ assigned special hash table called the lookup table. Maglev hashing
+ is to assign a preference list of all the lookup table positions
+ to each destination.
+
+ Through this operation, the Maglev hashing gives an almost equal
+ share of the lookup table to each of the destinations and provides
+ minimal disruption by using the lookup table. When the set of
+ destinations changes, a connection will likely be sent to the same
+ destination as it was before.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
 config IP_VS_SED
tristate "shortest expected delay scheduling"
---help---
@@ -266,6 +285,24 @@ config IP_VS_SH_TAB_BITS
  needs to be large enough to effectively fit all the destinations
  multiplied by their respective weights.
 
+comment 'IPVS MH scheduler'
+
+config IP_VS_MH_TAB_INDEX
+   int "IPVS maglev hashing table index of size (the prime numbers)"
+   range 8 17
+   default 12
+   ---help---
+ The maglev hashing scheduler maps source IPs to destinations
+ stored in a hash table. This table is assigned by a preference
+ list of the positions to each destination until all slots in
+ the table are filled. The index determines the prime for size of
+ the table as 251, 509, 1021, 2039, 4093, 8191, 16381, 32749,
+ 65521 or 131071. When using weights to allow destinations to
+ receive more connections, the table is assigned an amount
+ proportional to the weights specified. The table needs to be large
+ enough to effectively fit all the destinations multiplied by their
+ respective weights.
+
 comment 'IPVS application helper'
 
 config IP_VS_FTP
diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile
index c552993fa4b9..bfce2677fda2 100644
--- a/net/netfilter/ipvs/Makefile
+++ b/net/netfilter/ipvs/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_IP_VS_LBLC) += ip_vs_lblc.o
 obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o
 obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
+obj-$(CONFIG_IP_VS_MH) += ip_vs_mh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
 
-- 
2.11.0



[PATCH 08/51] netfilter: nf_flow_table: clean up flow_offload_alloc

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Reduce code duplication and make it much easier to read

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table.c | 93 ---
 1 file changed, 34 insertions(+), 59 deletions(-)

diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index ec410cae9307..db0673a40b97 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -16,6 +16,38 @@ struct flow_offload_entry {
struct rcu_head rcu_head;
 };
 
+static void
+flow_offload_fill_dir(struct flow_offload *flow, struct nf_conn *ct,
+ struct nf_flow_route *route,
+ enum flow_offload_tuple_dir dir)
+{
+   struct flow_offload_tuple *ft = &flow->tuplehash[dir].tuple;
+   struct nf_conntrack_tuple *ctt = &ct->tuplehash[dir].tuple;
+
+   ft->dir = dir;
+
+   switch (ctt->src.l3num) {
+   case NFPROTO_IPV4:
+   ft->src_v4 = ctt->src.u3.in;
+   ft->dst_v4 = ctt->dst.u3.in;
+   break;
+   case NFPROTO_IPV6:
+   ft->src_v6 = ctt->src.u3.in6;
+   ft->dst_v6 = ctt->dst.u3.in6;
+   break;
+   }
+
+   ft->l3proto = ctt->src.l3num;
+   ft->l4proto = ctt->dst.protonum;
+   ft->src_port = ctt->src.u.tcp.port;
+   ft->dst_port = ctt->dst.u.tcp.port;
+
+   ft->iifidx = route->tuple[dir].ifindex;
+   ft->oifidx = route->tuple[!dir].ifindex;
+
+   ft->dst_cache = route->tuple[dir].dst;
+}
+
 struct flow_offload *
 flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route *route)
 {
@@ -40,65 +72,8 @@ flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route *route)
 
entry->ct = ct;
 
-   switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num) {
-   case NFPROTO_IPV4:
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v4 =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.in;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v4 =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.in;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v4 =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3.in;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4 =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.in;
-   break;
-   case NFPROTO_IPV6:
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v6 =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.in6;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v6 =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.in6;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v6 =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3.in6;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v6 =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.in6;
-   break;
-   }
-
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l3proto =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l4proto =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l3proto =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l4proto =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
-
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_cache =
- route->tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_cache =
- route->tuple[FLOW_OFFLOAD_DIR_REPLY].dst;
-
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_port =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u.tcp.port;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_port =
-   ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u.tcp.port;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_port =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u.tcp.port;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_port =
-   ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u.tcp.port;
-
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dir =
-   FLOW_OFFLOAD_DIR_ORIGINAL;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dir =
-   FLOW_OFFLOAD_DIR_REPLY;
-
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx =
-   route->tuple[FLOW_OFFLOAD_DIR_ORIGINAL].ifindex;
-   flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.oifidx =
-   

[PATCH 14/51] netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Useful as preparation for adding iptables support for offload.

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv6/netfilter/nf_flow_table_ipv6.c | 232 
 net/netfilter/nf_flow_table_ip.c| 215 +
 2 files changed, 215 insertions(+), 232 deletions(-)

diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index 0e6328490142..f1804ce8d561 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -3,240 +3,8 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
 #include 
 #include 
-/* For layer 4 checksum field offset. */
-#include 
-#include 
-
-static int nf_flow_nat_ipv6_tcp(struct sk_buff *skb, unsigned int thoff,
-   struct in6_addr *addr,
-   struct in6_addr *new_addr)
-{
-   struct tcphdr *tcph;
-
-   if (!pskb_may_pull(skb, thoff + sizeof(*tcph)) ||
-   skb_try_make_writable(skb, thoff + sizeof(*tcph)))
-   return -1;
-
-   tcph = (void *)(skb_network_header(skb) + thoff);
-   inet_proto_csum_replace16(&tcph->check, skb, addr->s6_addr32,
- new_addr->s6_addr32, true);
-
-   return 0;
-}
-
-static int nf_flow_nat_ipv6_udp(struct sk_buff *skb, unsigned int thoff,
-   struct in6_addr *addr,
-   struct in6_addr *new_addr)
-{
-   struct udphdr *udph;
-
-   if (!pskb_may_pull(skb, thoff + sizeof(*udph)) ||
-   skb_try_make_writable(skb, thoff + sizeof(*udph)))
-   return -1;
-
-   udph = (void *)(skb_network_header(skb) + thoff);
-   if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) {
-   inet_proto_csum_replace16(&udph->check, skb, addr->s6_addr32,
- new_addr->s6_addr32, true);
-   if (!udph->check)
-   udph->check = CSUM_MANGLED_0;
-   }
-
-   return 0;
-}
-
-static int nf_flow_nat_ipv6_l4proto(struct sk_buff *skb, struct ipv6hdr *ip6h,
-   unsigned int thoff, struct in6_addr *addr,
-   struct in6_addr *new_addr)
-{
-   switch (ip6h->nexthdr) {
-   case IPPROTO_TCP:
-   if (nf_flow_nat_ipv6_tcp(skb, thoff, addr, new_addr) < 0)
-   return NF_DROP;
-   break;
-   case IPPROTO_UDP:
-   if (nf_flow_nat_ipv6_udp(skb, thoff, addr, new_addr) < 0)
-   return NF_DROP;
-   break;
-   }
-
-   return 0;
-}
-
-static int nf_flow_snat_ipv6(const struct flow_offload *flow,
-struct sk_buff *skb, struct ipv6hdr *ip6h,
-unsigned int thoff,
-enum flow_offload_tuple_dir dir)
-{
-   struct in6_addr addr, new_addr;
-
-   switch (dir) {
-   case FLOW_OFFLOAD_DIR_ORIGINAL:
-   addr = ip6h->saddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v6;
-   ip6h->saddr = new_addr;
-   break;
-   case FLOW_OFFLOAD_DIR_REPLY:
-   addr = ip6h->daddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v6;
-   ip6h->daddr = new_addr;
-   break;
-   default:
-   return -1;
-   }
-
-   return nf_flow_nat_ipv6_l4proto(skb, ip6h, thoff, &addr, &new_addr);
-}
-
-static int nf_flow_dnat_ipv6(const struct flow_offload *flow,
-struct sk_buff *skb, struct ipv6hdr *ip6h,
-unsigned int thoff,
-enum flow_offload_tuple_dir dir)
-{
-   struct in6_addr addr, new_addr;
-
-   switch (dir) {
-   case FLOW_OFFLOAD_DIR_ORIGINAL:
-   addr = ip6h->daddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v6;
-   ip6h->daddr = new_addr;
-   break;
-   case FLOW_OFFLOAD_DIR_REPLY:
-   addr = ip6h->saddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v6;
-   ip6h->saddr = new_addr;
-   break;
-   default:
-   return -1;
-   }
-
-   return nf_flow_nat_ipv6_l4proto(skb, ip6h, thoff, &addr, &new_addr);
-}
-
-static int nf_flow_nat_ipv6(const struct flow_offload *flow,
-   struct sk_buff *skb,
-   enum flow_offload_tuple_dir dir)
-{
-   struct ipv6hdr *ip6h = ipv6_hdr(skb);
-   unsigned int thoff = sizeof(*ip6h);
-
-   if (flow->flags & FLOW_OFFLOAD_SNAT &&
-   (nf_flow_snat_port(flow, skb, thoff, ip6h->nexthdr, dir) < 0 ||
-nf_flow_snat_ipv6(flow, skb, 

[PATCH 11/51] netfilter: nf_flow_table: rename nf_flow_table.c to nf_flow_table_core.c

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Preparation for adding more code to the same module

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Makefile  | 2 ++
 net/netfilter/{nf_flow_table.c => nf_flow_table_core.c} | 0
 2 files changed, 2 insertions(+)
 rename net/netfilter/{nf_flow_table.c => nf_flow_table_core.c} (100%)

diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index fd32bd2c9521..700c5d51e405 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -111,6 +111,8 @@ obj-$(CONFIG_NFT_FWD_NETDEV)+= nft_fwd_netdev.o
 
 # flow table infrastructure
 obj-$(CONFIG_NF_FLOW_TABLE)+= nf_flow_table.o
+nf_flow_table-objs := nf_flow_table_core.o
+
 obj-$(CONFIG_NF_FLOW_TABLE_INET) += nf_flow_table_inet.o
 
 # generic X tables 
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table_core.c
similarity index 100%
rename from net/netfilter/nf_flow_table.c
rename to net/netfilter/nf_flow_table_core.c
-- 
2.11.0



[PATCH 13/51] netfilter: nf_flow_table: move ip header check out of nf_flow_exceeds_mtu

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Allows the function to be shared with the IPv6 hook code

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_ip.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 034fda963392..103263e0c7c2 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -182,9 +182,6 @@ static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
if (skb->len <= mtu)
return false;
 
-   if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0)
-   return false;
-
if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
return false;
 
@@ -223,7 +220,8 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
rt = (const struct rtable *)flow->tuplehash[dir].tuple.dst_cache;
 
-   if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
+   if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)) &&
+   (ip_hdr(skb)->frag_off & htons(IP_DF)) != 0)
return NF_ACCEPT;
 
if (skb_try_make_writable(skb, sizeof(*iph)))
-- 
2.11.0



[PATCH 12/51] netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Allows some minor code sharing with the ipv6 hook code and is also
useful as preparation for adding iptables support for offload

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 241 ---
 net/netfilter/Makefile  |   2 +-
 net/netfilter/nf_flow_table_ip.c| 246 
 3 files changed, 247 insertions(+), 242 deletions(-)
 create mode 100644 net/netfilter/nf_flow_table_ip.c

diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index 461b1815e633..b6e43ff0c7b7 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -2,249 +2,8 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
 #include 
 #include 
-/* For layer 4 checksum field offset. */
-#include 
-#include 
-
-static int nf_flow_nat_ip_tcp(struct sk_buff *skb, unsigned int thoff,
- __be32 addr, __be32 new_addr)
-{
-   struct tcphdr *tcph;
-
-   if (!pskb_may_pull(skb, thoff + sizeof(*tcph)) ||
-   skb_try_make_writable(skb, thoff + sizeof(*tcph)))
-   return -1;
-
-   tcph = (void *)(skb_network_header(skb) + thoff);
-   inet_proto_csum_replace4(&tcph->check, skb, addr, new_addr, true);
-
-   return 0;
-}
-
-static int nf_flow_nat_ip_udp(struct sk_buff *skb, unsigned int thoff,
- __be32 addr, __be32 new_addr)
-{
-   struct udphdr *udph;
-
-   if (!pskb_may_pull(skb, thoff + sizeof(*udph)) ||
-   skb_try_make_writable(skb, thoff + sizeof(*udph)))
-   return -1;
-
-   udph = (void *)(skb_network_header(skb) + thoff);
-   if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) {
-   inet_proto_csum_replace4(&udph->check, skb, addr,
-new_addr, true);
-   if (!udph->check)
-   udph->check = CSUM_MANGLED_0;
-   }
-
-   return 0;
-}
-
-static int nf_flow_nat_ip_l4proto(struct sk_buff *skb, struct iphdr *iph,
- unsigned int thoff, __be32 addr,
- __be32 new_addr)
-{
-   switch (iph->protocol) {
-   case IPPROTO_TCP:
-   if (nf_flow_nat_ip_tcp(skb, thoff, addr, new_addr) < 0)
-   return NF_DROP;
-   break;
-   case IPPROTO_UDP:
-   if (nf_flow_nat_ip_udp(skb, thoff, addr, new_addr) < 0)
-   return NF_DROP;
-   break;
-   }
-
-   return 0;
-}
-
-static int nf_flow_snat_ip(const struct flow_offload *flow, struct sk_buff 
*skb,
-  struct iphdr *iph, unsigned int thoff,
-  enum flow_offload_tuple_dir dir)
-{
-   __be32 addr, new_addr;
-
-   switch (dir) {
-   case FLOW_OFFLOAD_DIR_ORIGINAL:
-   addr = iph->saddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4.s_addr;
-   iph->saddr = new_addr;
-   break;
-   case FLOW_OFFLOAD_DIR_REPLY:
-   addr = iph->daddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v4.s_addr;
-   iph->daddr = new_addr;
-   break;
-   default:
-   return -1;
-   }
-   csum_replace4(&iph->check, addr, new_addr);
-
-   return nf_flow_nat_ip_l4proto(skb, iph, thoff, addr, new_addr);
-}
-
-static int nf_flow_dnat_ip(const struct flow_offload *flow, struct sk_buff 
*skb,
-  struct iphdr *iph, unsigned int thoff,
-  enum flow_offload_tuple_dir dir)
-{
-   __be32 addr, new_addr;
-
-   switch (dir) {
-   case FLOW_OFFLOAD_DIR_ORIGINAL:
-   addr = iph->daddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v4.s_addr;
-   iph->daddr = new_addr;
-   break;
-   case FLOW_OFFLOAD_DIR_REPLY:
-   addr = iph->saddr;
-   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v4.s_addr;
-   iph->saddr = new_addr;
-   break;
-   default:
-   return -1;
-   }
-   csum_replace4(&iph->check, addr, new_addr);
-
-   return nf_flow_nat_ip_l4proto(skb, iph, thoff, addr, new_addr);
-}
-
-static int nf_flow_nat_ip(const struct flow_offload *flow, struct sk_buff *skb,
- enum flow_offload_tuple_dir dir)
-{
-   struct iphdr *iph = ip_hdr(skb);
-   unsigned int thoff = iph->ihl * 4;
-
-   if (flow->flags & FLOW_OFFLOAD_SNAT &&
-   (nf_flow_snat_port(flow, skb, thoff, iph->protocol, dir) < 0 ||
-nf_flow_snat_ip(flow, skb, iph, thoff, dir) < 0))
-   return -1;
-   if 

[PATCH 15/51] netfilter: nf_flow_table: relax mixed ipv4/ipv6 flowtable dependencies

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Since the offload hook code was moved, this table no longer depends on
the IPv4 and IPv6 flowtable modules

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 704b3832dbad..d20664b02ae4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -666,8 +666,7 @@ endif # NF_TABLES
 
 config NF_FLOW_TABLE_INET
tristate "Netfilter flow table mixed IPv4/IPv6 module"
-   depends on NF_FLOW_TABLE_IPV4
-   depends on NF_FLOW_TABLE_IPV6
+   depends on NF_FLOW_TABLE
help
   This option adds the flow table mixed IPv4/IPv6 support.
 
-- 
2.11.0



[PATCH 06/51] netfilter: xt_NFLOG: use nf_log_packet instead of nfulnl_log_packet.

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

nfulnl_log_packet() was added to make sure that the NFLOG target works as a
user-space-only logger, but nf_log_packet() can now find the proper log
function using NF_LOG_TYPE_ULOG and NF_LOG_TYPE_LOG.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nfnetlink_log.h | 17 -
 net/netfilter/nfnetlink_log.c |  8 +++-
 net/netfilter/xt_NFLOG.c  | 15 +++
 3 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/include/net/netfilter/nfnetlink_log.h b/include/net/netfilter/nfnetlink_log.h
index 612cfb63ac68..ea32a7d3cf1b 100644
--- a/include/net/netfilter/nfnetlink_log.h
+++ b/include/net/netfilter/nfnetlink_log.h
@@ -1,18 +1 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _KER_NFNETLINK_LOG_H
-#define _KER_NFNETLINK_LOG_H
-
-void
-nfulnl_log_packet(struct net *net,
- u_int8_t pf,
- unsigned int hooknum,
- const struct sk_buff *skb,
- const struct net_device *in,
- const struct net_device *out,
- const struct nf_loginfo *li_user,
- const char *prefix);
-
-#define NFULNL_COPY_DISABLED 0xff
-
-#endif /* _KER_NFNETLINK_LOG_H */
-
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 7b46aa4c478d..e5cc4d9b9ce7 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -37,7 +37,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -47,6 +46,7 @@
 #include "../bridge/br_private.h"
 #endif
 
+#define NFULNL_COPY_DISABLED   0xff
 #define NFULNL_NLBUFSIZ_DEFAULT NLMSG_GOODSIZE
 #define NFULNL_TIMEOUT_DEFAULT 100 /* every second */
 #define NFULNL_QTHRESH_DEFAULT 100 /* 100 packets */
@@ -618,7 +618,7 @@ static const struct nf_loginfo default_loginfo = {
 };
 
 /* log handler for internal netfilter logging api */
-void
+static void
 nfulnl_log_packet(struct net *net,
  u_int8_t pf,
  unsigned int hooknum,
@@ -633,7 +633,7 @@ nfulnl_log_packet(struct net *net,
struct nfulnl_instance *inst;
const struct nf_loginfo *li;
unsigned int qthreshold;
-   unsigned int plen;
+   unsigned int plen = 0;
struct nfnl_log_net *log = nfnl_log_pernet(net);
const struct nfnl_ct_hook *nfnl_ct = NULL;
struct nf_conn *ct = NULL;
@@ -648,7 +648,6 @@ nfulnl_log_packet(struct net *net,
if (!inst)
return;
 
-   plen = 0;
if (prefix)
plen = strlen(prefix) + 1;
 
@@ -760,7 +759,6 @@ nfulnl_log_packet(struct net *net,
/* FIXME: statistics */
goto unlock_and_release;
 }
-EXPORT_SYMBOL_GPL(nfulnl_log_packet);
 
 static int
 nfulnl_rcv_nl_event(struct notifier_block *this,
diff --git a/net/netfilter/xt_NFLOG.c b/net/netfilter/xt_NFLOG.c
index c7f8958cea4a..1ed0cac585c4 100644
--- a/net/netfilter/xt_NFLOG.c
+++ b/net/netfilter/xt_NFLOG.c
@@ -13,7 +13,6 @@
 #include 
 #include 
 #include 
-#include 
 
 MODULE_AUTHOR("Patrick McHardy ");
 MODULE_DESCRIPTION("Xtables: packet logging to netlink using NFLOG");
@@ -37,8 +36,9 @@ nflog_tg(struct sk_buff *skb, const struct xt_action_param *par)
if (info->flags & XT_NFLOG_F_COPY_LEN)
li.u.ulog.flags |= NF_LOG_F_COPY_LEN;
 
-   nfulnl_log_packet(net, xt_family(par), xt_hooknum(par), skb,
- xt_in(par), xt_out(par), &li, info->prefix);
+   nf_log_packet(net, xt_family(par), xt_hooknum(par), skb, xt_in(par),
+ xt_out(par), &li, "%s", info->prefix);
+
return XT_CONTINUE;
 }
 
@@ -50,7 +50,13 @@ static int nflog_tg_check(const struct xt_tgchk_param *par)
return -EINVAL;
if (info->prefix[sizeof(info->prefix) - 1] != '\0')
return -EINVAL;
-   return 0;
+
+   return nf_logger_find_get(par->family, NF_LOG_TYPE_ULOG);
+}
+
+static void nflog_tg_destroy(const struct xt_tgdtor_param *par)
+{
+   nf_logger_put(par->family, NF_LOG_TYPE_ULOG);
 }
 
 static struct xt_target nflog_tg_reg __read_mostly = {
@@ -58,6 +64,7 @@ static struct xt_target nflog_tg_reg __read_mostly = {
.revision   = 0,
.family = NFPROTO_UNSPEC,
.checkentry = nflog_tg_check,
+   .destroy= nflog_tg_destroy,
.target = nflog_tg,
.targetsize = sizeof(struct xt_nflog_info),
.me = THIS_MODULE,
-- 
2.11.0



[PATCH 17/51] netfilter: nf_flow_table: fix priv pointer for netdev hook

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

The offload ip hook expects a pointer to the flowtable, not to the
rhashtable. Since the rhashtable is the first member, this is safe for
the moment, but breaks as soon as the structure layout changes

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6cd9955916e5..517bb93c00fb 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5019,7 +5019,7 @@ static int nf_tables_flowtable_parse_hook(const struct nft_ctx *ctx,
flowtable->ops[i].pf= NFPROTO_NETDEV;
flowtable->ops[i].hooknum   = hooknum;
flowtable->ops[i].priority  = priority;
-   flowtable->ops[i].priv  = &flowtable->data.rhashtable;
+   flowtable->ops[i].priv  = &flowtable->data;
flowtable->ops[i].hook  = flowtable->data.type->hook;
flowtable->ops[i].dev   = dev_array[i];
flowtable->dev_name[i]  = kstrdup(dev_array[i]->name,
-- 
2.11.0



[PATCH 21/51] netfilter: nf_flow_table: in flow_offload_lookup, skip entries being deleted

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Preparation for sending flows back to the slow path

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_core.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 5a81e4f771e9..ff5e17a15963 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -184,8 +184,21 @@ struct flow_offload_tuple_rhash *
 flow_offload_lookup(struct nf_flowtable *flow_table,
struct flow_offload_tuple *tuple)
 {
-   return rhashtable_lookup_fast(&flow_table->rhashtable, tuple,
- nf_flow_offload_rhash_params);
+   struct flow_offload_tuple_rhash *tuplehash;
+   struct flow_offload *flow;
+   int dir;
+
+   tuplehash = rhashtable_lookup_fast(&flow_table->rhashtable, tuple,
+  nf_flow_offload_rhash_params);
+   if (!tuplehash)
+   return NULL;
+
+   dir = tuplehash->tuple.dir;
+   flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
+   if (flow->flags & (FLOW_OFFLOAD_DYING | FLOW_OFFLOAD_TEARDOWN))
+   return NULL;
+
+   return tuplehash;
 }
 EXPORT_SYMBOL_GPL(flow_offload_lookup);
 
-- 
2.11.0



[PATCH 16/51] netfilter: nf_flow_table: move init code to nf_flow_table_core.c

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Reduces duplication of .gc and .params in flowtable type definitions and
makes the API clearer

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h   |   6 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c |   3 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |   3 +-
 net/netfilter/nf_flow_table_core.c  | 102 +++-
 net/netfilter/nf_flow_table_inet.c  |   3 +-
 net/netfilter/nf_tables_api.c   |  22 +++
 6 files changed, 74 insertions(+), 65 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 76ee5c81b752..f876e32a60b8 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -14,9 +14,8 @@ struct nf_flowtable;
 struct nf_flowtable_type {
struct list_headlist;
int family;
-   void(*gc)(struct work_struct *work);
+   int (*init)(struct nf_flowtable *ft);
void(*free)(struct nf_flowtable *ft);
-   const struct rhashtable_params  *params;
nf_hookfn   *hook;
struct module   *owner;
 };
@@ -100,9 +99,8 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
 
 void nf_flow_table_cleanup(struct net *net, struct net_device *dev);
 
+int nf_flow_table_init(struct nf_flowtable *flow_table);
 void nf_flow_table_free(struct nf_flowtable *flow_table);
-void nf_flow_offload_work_gc(struct work_struct *work);
-extern const struct rhashtable_params nf_flow_offload_rhash_params;
 
 void flow_offload_dead(struct flow_offload *flow);
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index b6e43ff0c7b7..e1e56d7123d2 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -7,8 +7,7 @@
 
 static struct nf_flowtable_type flowtable_ipv4 = {
.family = NFPROTO_IPV4,
-   .params = &nf_flow_offload_rhash_params,
-   .gc = nf_flow_offload_work_gc,
+   .init   = nf_flow_table_init,
.free   = nf_flow_table_free,
.hook   = nf_flow_offload_ip_hook,
.owner  = THIS_MODULE,
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index f1804ce8d561..c511d206bf9b 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -8,8 +8,7 @@
 
 static struct nf_flowtable_type flowtable_ipv6 = {
.family = NFPROTO_IPV6,
-   .params = &nf_flow_offload_rhash_params,
-   .gc = nf_flow_offload_work_gc,
+   .init   = nf_flow_table_init,
.free   = nf_flow_table_free,
.hook   = nf_flow_offload_ipv6_hook,
.owner  = THIS_MODULE,
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 7403a0dfddf7..09d1be669c39 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -116,16 +116,50 @@ void flow_offload_dead(struct flow_offload *flow)
 }
 EXPORT_SYMBOL_GPL(flow_offload_dead);
 
+static u32 flow_offload_hash(const void *data, u32 len, u32 seed)
+{
+   const struct flow_offload_tuple *tuple = data;
+
+   return jhash(tuple, offsetof(struct flow_offload_tuple, dir), seed);
+}
+
+static u32 flow_offload_hash_obj(const void *data, u32 len, u32 seed)
+{
+   const struct flow_offload_tuple_rhash *tuplehash = data;
+
+   return jhash(&tuplehash->tuple, offsetof(struct flow_offload_tuple, dir), seed);
+}
+
+static int flow_offload_hash_cmp(struct rhashtable_compare_arg *arg,
+   const void *ptr)
+{
+   const struct flow_offload_tuple *tuple = arg->key;
+   const struct flow_offload_tuple_rhash *x = ptr;
+
+   if (memcmp(&x->tuple, tuple, offsetof(struct flow_offload_tuple, dir)))
+   return 1;
+
+   return 0;
+}
+
+static const struct rhashtable_params nf_flow_offload_rhash_params = {
+   .head_offset = offsetof(struct flow_offload_tuple_rhash, node),
+   .hashfn = flow_offload_hash,
+   .obj_hashfn = flow_offload_hash_obj,
+   .obj_cmpfn  = flow_offload_hash_cmp,
+   .automatic_shrinking= true,
+};
+
int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 {
flow->timeout = (u32)jiffies;
 
 rhashtable_insert_fast(&flow_table->rhashtable,
    &flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].node,
-  *flow_table->type->params);
+  nf_flow_offload_rhash_params);

[PATCH 26/51] netfilter: nf_tables: simplify lookup functions

2018-05-06 Thread Pablo Neira Ayuso
Replace the nf_tables_ prefix by nft_ and merge code into a single lookup
function whenever possible. In many cases the long function names pushed us
over the 80-character boundary; the shorter names save us ~50 LoC.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  12 +-
 net/netfilter/nf_tables_api.c | 249 +++---
 net/netfilter/nft_flow_offload.c  |   5 +-
 net/netfilter/nft_objref.c|   4 +-
 4 files changed, 110 insertions(+), 160 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 2f2062ae1c45..123e82a2f8bb 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1015,9 +1015,9 @@ static inline void *nft_obj_data(const struct nft_object *obj)
 
 #define nft_expr_obj(expr) *((struct nft_object **)nft_expr_priv(expr))
 
-struct nft_object *nf_tables_obj_lookup(const struct nft_table *table,
-   const struct nlattr *nla, u32 objtype,
-   u8 genmask);
+struct nft_object *nft_obj_lookup(const struct nft_table *table,
+ const struct nlattr *nla, u32 objtype,
+ u8 genmask);
 
 void nft_obj_notify(struct net *net, struct nft_table *table,
struct nft_object *obj, u32 portid, u32 seq,
@@ -1106,9 +1106,9 @@ struct nft_flowtable {
struct nf_flowtable data;
 };
 
-struct nft_flowtable *nf_tables_flowtable_lookup(const struct nft_table *table,
-const struct nlattr *nla,
-u8 genmask);
+struct nft_flowtable *nft_flowtable_lookup(const struct nft_table *table,
+  const struct nlattr *nla,
+  u8 genmask);
 
 void nft_register_flowtable_type(struct nf_flowtable_type *type);
 void nft_unregister_flowtable_type(struct nf_flowtable_type *type);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 16b67f54b3d2..f65e650b61aa 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -386,13 +386,17 @@ static struct nft_table *nft_table_lookup(const struct net *net,
 {
struct nft_table *table;
 
+   if (nla == NULL)
+   return ERR_PTR(-EINVAL);
+
 list_for_each_entry(table, &net->nft.tables, list) {
if (!nla_strcmp(nla, table->name) &&
table->family == family &&
nft_active_genmask(table, genmask))
return table;
}
-   return NULL;
+
+   return ERR_PTR(-ENOENT);
 }
 
 static struct nft_table *nft_table_lookup_byhandle(const struct net *net,
@@ -406,37 +410,6 @@ static struct nft_table *nft_table_lookup_byhandle(const struct net *net,
nft_active_genmask(table, genmask))
return table;
}
-   return NULL;
-}
-
-static struct nft_table *nf_tables_table_lookup(const struct net *net,
-   const struct nlattr *nla,
-   u8 family, u8 genmask)
-{
-   struct nft_table *table;
-
-   if (nla == NULL)
-   return ERR_PTR(-EINVAL);
-
-   table = nft_table_lookup(net, nla, family, genmask);
-   if (table != NULL)
-   return table;
-
-   return ERR_PTR(-ENOENT);
-}
-
-static struct nft_table *nf_tables_table_lookup_byhandle(const struct net *net,
-const struct nlattr *nla,
-u8 genmask)
-{
-   struct nft_table *table;
-
-   if (nla == NULL)
-   return ERR_PTR(-EINVAL);
-
-   table = nft_table_lookup_byhandle(net, nla, genmask);
-   if (table != NULL)
-   return table;
 
return ERR_PTR(-ENOENT);
 }
@@ -608,8 +581,7 @@ static int nf_tables_gettable(struct net *net, struct sock *nlsk,
 return netlink_dump_start(nlsk, skb, nlh, &c);
}
 
-   table = nf_tables_table_lookup(net, nla[NFTA_TABLE_NAME], family,
-  genmask);
+   table = nft_table_lookup(net, nla[NFTA_TABLE_NAME], family, genmask);
if (IS_ERR(table))
return PTR_ERR(table);
 
@@ -735,7 +707,7 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
int err;
 
name = nla[NFTA_TABLE_NAME];
-   table = nf_tables_table_lookup(net, name, family, genmask);
+   table = nft_table_lookup(net, name, family, genmask);
if (IS_ERR(table)) {
if (PTR_ERR(table) != -ENOENT)
return PTR_ERR(table);
@@ -893,12 +865,11 @@ static int nf_tables_deltable(struct net *net, struct sock *nlsk,
return nft_flush(, family);
 
  

[PATCH 33/51] netfilter: x_tables: remove duplicate ip6t_get_target function call

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

In check_target(), ip6t_get_target() is called twice; drop the duplicate call.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv6/netfilter/ip6_tables.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 65c9e1a58305..7097bbf95843 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -528,7 +528,6 @@ static int check_target(struct ip6t_entry *e, struct net *net, const char *name)
.family= NFPROTO_IPV6,
};
 
-   t = ip6t_get_target(e);
	return xt_check_target(&par, t->u.target_size - sizeof(*t),
   e->ipv6.proto,
   e->ipv6.invflags & IP6T_INV_PROTO);
-- 
2.11.0



[PATCH 29/51] netfilter: add NAT support for shifted portmap ranges

2018-05-06 Thread Pablo Neira Ayuso
From: Thierry Du Tre 

This is a patch proposal to support shifted ranges in portmaps.  (i.e. tcp/udp
incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100)

Currently DNAT only works for a single port or identical port ranges.  (i.e.
ports 5000-5100 on the WAN interface redirected to a LAN host while the
original destination port is not altered.)  When different port ranges are
configured, either 'random' mode should be used, or else all incoming
connections are mapped onto the first port in the redirect range (in the
described example, WAN:5000-5100 will all be mapped to 192.168.1.5:2000).

This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET
which uses a base port value to calculate an offset with the destination port
present in the incoming stream. That offset is then applied as index within the
redirect port range (index modulo rangewidth to handle range overflow).

In the described example the base port would be 5000. An incoming stream with
destination port 5004 would result in an offset value of 4, which means that
the NAT'ed stream will be using destination port 2004.

Other possibilities include deterministic mapping of larger or multiple ranges
to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port
51xx)

This patch does not change any current behavior. It just adds new NAT proto
range functionality which must be selected via the specific flag when its
use is intended.

A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed
which makes this functionality immediately available.

Signed-off-by: Thierry Du Tre 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/ipv4/nf_nat_masquerade.h |  2 +-
 include/net/netfilter/ipv6/nf_nat_masquerade.h |  2 +-
 include/net/netfilter/nf_nat.h |  2 +-
 include/net/netfilter/nf_nat_l3proto.h |  4 +-
 include/net/netfilter/nf_nat_l4proto.h |  8 +--
 include/net/netfilter/nf_nat_redirect.h|  2 +-
 include/uapi/linux/netfilter/nf_nat.h  | 12 -
 net/ipv4/netfilter/ipt_MASQUERADE.c|  2 +-
 net/ipv4/netfilter/nf_nat_h323.c   |  4 +-
 net/ipv4/netfilter/nf_nat_l3proto_ipv4.c   |  4 +-
 net/ipv4/netfilter/nf_nat_masquerade_ipv4.c|  4 +-
 net/ipv4/netfilter/nf_nat_pptp.c   |  2 +-
 net/ipv4/netfilter/nf_nat_proto_gre.c  |  2 +-
 net/ipv4/netfilter/nf_nat_proto_icmp.c |  2 +-
 net/ipv4/netfilter/nft_masq_ipv4.c |  2 +-
 net/ipv6/netfilter/ip6t_MASQUERADE.c   |  2 +-
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c   |  4 +-
 net/ipv6/netfilter/nf_nat_masquerade_ipv6.c|  4 +-
 net/ipv6/netfilter/nf_nat_proto_icmpv6.c   |  2 +-
 net/ipv6/netfilter/nft_masq_ipv6.c |  2 +-
 net/ipv6/netfilter/nft_redir_ipv6.c|  2 +-
 net/netfilter/nf_nat_core.c| 27 +-
 net/netfilter/nf_nat_helper.c  |  2 +-
 net/netfilter/nf_nat_proto_common.c|  9 ++--
 net/netfilter/nf_nat_proto_dccp.c  |  2 +-
 net/netfilter/nf_nat_proto_sctp.c  |  2 +-
 net/netfilter/nf_nat_proto_tcp.c   |  2 +-
 net/netfilter/nf_nat_proto_udp.c   |  4 +-
 net/netfilter/nf_nat_proto_unknown.c   |  2 +-
 net/netfilter/nf_nat_redirect.c|  6 +--
 net/netfilter/nf_nat_sip.c |  2 +-
 net/netfilter/nft_nat.c|  2 +-
 net/netfilter/xt_NETMAP.c  |  8 +--
 net/netfilter/xt_REDIRECT.c|  2 +-
 net/netfilter/xt_nat.c | 72 +++---
 net/openvswitch/conntrack.c|  4 +-
 36 files changed, 145 insertions(+), 71 deletions(-)

diff --git a/include/net/netfilter/ipv4/nf_nat_masquerade.h 
b/include/net/netfilter/ipv4/nf_nat_masquerade.h
index ebd869473603..cd24be4c4a99 100644
--- a/include/net/netfilter/ipv4/nf_nat_masquerade.h
+++ b/include/net/netfilter/ipv4/nf_nat_masquerade.h
@@ -6,7 +6,7 @@
 
 unsigned int
 nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,
-  const struct nf_nat_range *range,
+  const struct nf_nat_range2 *range,
   const struct net_device *out);
 
 void nf_nat_masquerade_ipv4_register_notifier(void);
diff --git a/include/net/netfilter/ipv6/nf_nat_masquerade.h 
b/include/net/netfilter/ipv6/nf_nat_masquerade.h
index 1ed4f2631ed6..0c3b5ebf0bb8 100644
--- a/include/net/netfilter/ipv6/nf_nat_masquerade.h
+++ b/include/net/netfilter/ipv6/nf_nat_masquerade.h
@@ -3,7 +3,7 @@
 #define _NF_NAT_MASQUERADE_IPV6_H_
 
 unsigned int
-nf_nat_masquerade_ipv6(struct sk_buff *skb, const struct nf_nat_range *range,
+nf_nat_masquerade_ipv6(struct sk_buff *skb, const struct nf_nat_range2 *range,
   const struct net_device *out);
 void nf_nat_masquerade_ipv6_register_notifier(void);
 void 

[PATCH 31/51] netfilter: ebtables: add ebt_free_table_info function

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

ebt_free_table_info() frees all of the chainstacks.
It is similar to xt_free_table_info(); this inline function
reduces duplicated code.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/ebtables.c | 39 +++
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 032e0fe45940..355410b13316 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -343,6 +343,16 @@ find_table_lock(struct net *net, const char *name, int 
*error,
"ebtable_", error, mutex);
 }
 
+static inline void ebt_free_table_info(struct ebt_table_info *info)
+{
+   int i;
+
+   if (info->chainstack) {
+   for_each_possible_cpu(i)
+   vfree(info->chainstack[i]);
+   vfree(info->chainstack);
+   }
+}
 static inline int
 ebt_check_match(struct ebt_entry_match *m, struct xt_mtchk_param *par,
unsigned int *cnt)
@@ -975,7 +985,7 @@ static void get_counters(const struct ebt_counter 
*oldcounters,
 static int do_replace_finish(struct net *net, struct ebt_replace *repl,
  struct ebt_table_info *newinfo)
 {
-   int ret, i;
+   int ret;
struct ebt_counter *counterstmp = NULL;
/* used to be able to unlock earlier */
struct ebt_table_info *table;
@@ -1051,13 +1061,8 @@ static int do_replace_finish(struct net *net, struct 
ebt_replace *repl,
  ebt_cleanup_entry, net, NULL);
 
vfree(table->entries);
-   if (table->chainstack) {
-   for_each_possible_cpu(i)
-   vfree(table->chainstack[i]);
-   vfree(table->chainstack);
-   }
+   ebt_free_table_info(table);
vfree(table);
-
vfree(counterstmp);
 
 #ifdef CONFIG_AUDIT
@@ -1078,11 +1083,7 @@ static int do_replace_finish(struct net *net, struct 
ebt_replace *repl,
 free_counterstmp:
vfree(counterstmp);
/* can be initialized in translate_table() */
-   if (newinfo->chainstack) {
-   for_each_possible_cpu(i)
-   vfree(newinfo->chainstack[i]);
-   vfree(newinfo->chainstack);
-   }
+   ebt_free_table_info(newinfo);
return ret;
 }
 
@@ -1147,8 +1148,6 @@ static int do_replace(struct net *net, const void __user 
*user,
 
 static void __ebt_unregister_table(struct net *net, struct ebt_table *table)
 {
-   int i;
-
	mutex_lock(&ebt_mutex);
	list_del(&table->list);
	mutex_unlock(&ebt_mutex);
@@ -1157,11 +1156,7 @@ static void __ebt_unregister_table(struct net *net, 
struct ebt_table *table)
if (table->private->nentries)
module_put(table->me);
vfree(table->private->entries);
-   if (table->private->chainstack) {
-   for_each_possible_cpu(i)
-   vfree(table->private->chainstack[i]);
-   vfree(table->private->chainstack);
-   }
+   ebt_free_table_info(table->private);
vfree(table->private);
kfree(table);
 }
@@ -1263,11 +1258,7 @@ int ebt_register_table(struct net *net, const struct 
ebt_table *input_table,
 free_unlock:
mutex_unlock(_mutex);
 free_chainstack:
-   if (newinfo->chainstack) {
-   for_each_possible_cpu(i)
-   vfree(newinfo->chainstack[i]);
-   vfree(newinfo->chainstack);
-   }
+   ebt_free_table_info(newinfo);
vfree(newinfo->entries);
 free_newinfo:
vfree(newinfo);
-- 
2.11.0



[PATCH 30/51] netfilter: add __exit mark to helper modules

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

There is no __exit mark on the fini functions in the helper modules,
because these exit functions used to be called from the init functions
on their error paths. Now that this is no longer the case, we can add
the __exit mark.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_ftp.c  | 3 +--
 net/netfilter/nf_conntrack_irc.c  | 6 +-
 net/netfilter/nf_conntrack_sane.c | 3 +--
 net/netfilter/nf_conntrack_sip.c  | 2 +-
 net/netfilter/nf_conntrack_tftp.c | 2 +-
 5 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/nf_conntrack_ftp.c b/net/netfilter/nf_conntrack_ftp.c
index f0e9a7511e1a..a11c304fb771 100644
--- a/net/netfilter/nf_conntrack_ftp.c
+++ b/net/netfilter/nf_conntrack_ftp.c
@@ -566,8 +566,7 @@ static const struct nf_conntrack_expect_policy 
ftp_exp_policy = {
.timeout= 5 * 60,
 };
 
-/* don't make this __exit, since it's called from __init ! */
-static void nf_conntrack_ftp_fini(void)
+static void __exit nf_conntrack_ftp_fini(void)
 {
nf_conntrack_helpers_unregister(ftp, ports_c * 2);
kfree(ftp_buffer);
diff --git a/net/netfilter/nf_conntrack_irc.c b/net/netfilter/nf_conntrack_irc.c
index 5523acce9d69..4099f4d79bae 100644
--- a/net/netfilter/nf_conntrack_irc.c
+++ b/net/netfilter/nf_conntrack_irc.c
@@ -232,8 +232,6 @@ static int help(struct sk_buff *skb, unsigned int protoff,
 static struct nf_conntrack_helper irc[MAX_PORTS] __read_mostly;
 static struct nf_conntrack_expect_policy irc_exp_policy;
 
-static void nf_conntrack_irc_fini(void);
-
 static int __init nf_conntrack_irc_init(void)
 {
int i, ret;
@@ -276,9 +274,7 @@ static int __init nf_conntrack_irc_init(void)
return 0;
 }
 
-/* This function is intentionally _NOT_ defined as __exit, because
- * it is needed by the init function */
-static void nf_conntrack_irc_fini(void)
+static void __exit nf_conntrack_irc_fini(void)
 {
nf_conntrack_helpers_unregister(irc, ports_c);
kfree(irc_buffer);
diff --git a/net/netfilter/nf_conntrack_sane.c 
b/net/netfilter/nf_conntrack_sane.c
index ae457f39d5ce..5072ff96ab33 100644
--- a/net/netfilter/nf_conntrack_sane.c
+++ b/net/netfilter/nf_conntrack_sane.c
@@ -173,8 +173,7 @@ static const struct nf_conntrack_expect_policy 
sane_exp_policy = {
.timeout= 5 * 60,
 };
 
-/* don't make this __exit, since it's called from __init ! */
-static void nf_conntrack_sane_fini(void)
+static void __exit nf_conntrack_sane_fini(void)
 {
nf_conntrack_helpers_unregister(sane, ports_c * 2);
kfree(sane_buffer);
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 4dbb5bad4363..148ce1a52cc7 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1609,7 +1609,7 @@ static const struct nf_conntrack_expect_policy 
sip_exp_policy[SIP_EXPECT_MAX + 1
},
 };
 
-static void nf_conntrack_sip_fini(void)
+static void __exit nf_conntrack_sip_fini(void)
 {
nf_conntrack_helpers_unregister(sip, ports_c * 4);
 }
diff --git a/net/netfilter/nf_conntrack_tftp.c 
b/net/netfilter/nf_conntrack_tftp.c
index 0ec6779fd5d9..548b673b3625 100644
--- a/net/netfilter/nf_conntrack_tftp.c
+++ b/net/netfilter/nf_conntrack_tftp.c
@@ -104,7 +104,7 @@ static const struct nf_conntrack_expect_policy 
tftp_exp_policy = {
.timeout= 5 * 60,
 };
 
-static void nf_conntrack_tftp_fini(void)
+static void __exit nf_conntrack_tftp_fini(void)
 {
nf_conntrack_helpers_unregister(tftp, ports_c * 2);
 }
-- 
2.11.0



[PATCH 18/51] netfilter: nf_flow_table: track flow tables in nf_flow_table directly

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Avoids having nf_flow_table depend on nftables (useful for future
iptables backport work)

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h |  1 +
 include/net/netfilter/nf_tables.h |  3 ---
 net/netfilter/nf_flow_table_core.c| 21 ++---
 net/netfilter/nf_tables_api.c | 17 -
 4 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index f876e32a60b8..ab408adba688 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -21,6 +21,7 @@ struct nf_flowtable_type {
 };
 
 struct nf_flowtable {
+   struct list_headlist;
struct rhashtable   rhashtable;
const struct nf_flowtable_type  *type;
struct delayed_work gc_work;
diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index cd368d1b8cb8..2f2062ae1c45 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1109,9 +1109,6 @@ struct nft_flowtable {
 struct nft_flowtable *nf_tables_flowtable_lookup(const struct nft_table *table,
 const struct nlattr *nla,
 u8 genmask);
-void nft_flow_table_iterate(struct net *net,
-   void (*iter)(struct nf_flowtable *flowtable, void *data),
-   void *data);
 
 void nft_register_flowtable_type(struct nf_flowtable_type *type);
 void nft_unregister_flowtable_type(struct nf_flowtable_type *type);
diff --git a/net/netfilter/nf_flow_table_core.c 
b/net/netfilter/nf_flow_table_core.c
index 09d1be669c39..e761359b56a9 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -18,6 +18,9 @@ struct flow_offload_entry {
struct rcu_head rcu_head;
 };
 
+static DEFINE_MUTEX(flowtable_lock);
+static LIST_HEAD(flowtables);
+
 static void
 flow_offload_fill_dir(struct flow_offload *flow, struct nf_conn *ct,
  struct nf_flow_route *route,
@@ -410,6 +413,10 @@ int nf_flow_table_init(struct nf_flowtable *flowtable)
queue_delayed_work(system_power_efficient_wq,
   &flowtable->gc_work, HZ);
 
+   mutex_lock(&flowtable_lock);
+   list_add(&flowtable->list, &flowtables);
+   mutex_unlock(&flowtable_lock);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_init);
@@ -425,20 +432,28 @@ static void nf_flow_table_do_cleanup(struct flow_offload 
*flow, void *data)
 }
 
 static void nf_flow_table_iterate_cleanup(struct nf_flowtable *flowtable,
- void *data)
+ struct net_device *dev)
 {
-   nf_flow_table_iterate(flowtable, nf_flow_table_do_cleanup, data);
+   nf_flow_table_iterate(flowtable, nf_flow_table_do_cleanup, dev);
	flush_delayed_work(&flowtable->gc_work);
 }
 
 void nf_flow_table_cleanup(struct net *net, struct net_device *dev)
 {
-   nft_flow_table_iterate(net, nf_flow_table_iterate_cleanup, dev);
+   struct nf_flowtable *flowtable;
+
+   mutex_lock(&flowtable_lock);
+   list_for_each_entry(flowtable, &flowtables, list)
+   nf_flow_table_iterate_cleanup(flowtable, dev);
+   mutex_unlock(&flowtable_lock);
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_cleanup);
 
 void nf_flow_table_free(struct nf_flowtable *flow_table)
 {
+   mutex_lock(&flowtable_lock);
+   list_del(&flow_table->list);
+   mutex_unlock(&flowtable_lock);
	cancel_delayed_work_sync(&flow_table->gc_work);
nf_flow_table_iterate(flow_table, nf_flow_table_do_cleanup, NULL);
WARN_ON(!nf_flow_offload_gc_step(flow_table));
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 517bb93c00fb..16b67f54b3d2 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5060,23 +5060,6 @@ static const struct nf_flowtable_type 
*nft_flowtable_type_get(u8 family)
return ERR_PTR(-ENOENT);
 }
 
-void nft_flow_table_iterate(struct net *net,
-   void (*iter)(struct nf_flowtable *flowtable, void *data),
-   void *data)
-{
-   struct nft_flowtable *flowtable;
-   const struct nft_table *table;
-
-   nfnl_lock(NFNL_SUBSYS_NFTABLES);
-   list_for_each_entry(table, &net->nft.tables, list) {
-   list_for_each_entry(flowtable, &table->flowtables, list) {
-   iter(&flowtable->data, data);
-   }
-   }
-   nfnl_unlock(NFNL_SUBSYS_NFTABLES);
-}
-EXPORT_SYMBOL_GPL(nft_flow_table_iterate);
-
 static void nft_unregister_flowtable_net_hooks(struct net *net,
   struct nft_flowtable *flowtable)
 {
-- 
2.11.0



[PATCH 28/51] netfilter: nf_tables: Simplify set backend selection

2018-05-06 Thread Pablo Neira Ayuso
From: Phil Sutter 

Drop nft_set_type's ability to act as a container of multiple backend
implementations it chooses from. Instead consolidate the whole selection
logic in nft_select_set_ops() and the actual backend provided estimate()
callback.

This turns nf_tables_set_types into a list containing all available
backends, which is traversed when selecting one matching the criteria
requested by userspace.

Also, this change allows embedding the nft_set_ops structure into
nft_set_type and pulling the flags field into the latter, as it's only
used during the selection phase.

A crucial part of this change is to make sure the new layout respects
hash backend constraints formerly enforced by nft_hash_select_ops()
function: This is achieved by introduction of a specific estimate()
callback for nft_hash_fast_ops which returns false for key lengths != 4.
In turn, nft_hash_estimate() is changed to return false for key lengths
== 4 so it won't be chosen by accident. Also, both callbacks must return
false for unbounded sets as their size estimate depends on a known
maximum element count.

Note that this patch partially reverts commit 4f2921ca21b71 ("netfilter:
nf_tables: meter: pick a set backend that supports updates") by making
nft_set_ops_candidate() not explicitly look for an update callback but
make NFT_SET_EVAL a regular backend feature flag which is checked along
with the others. This way all feature requirements are checked in one
go.

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  34 -
 net/netfilter/nf_tables_api.c |  25 +++
 net/netfilter/nft_set_bitmap.c|  34 -
 net/netfilter/nft_set_hash.c  | 153 +-
 net/netfilter/nft_set_rbtree.c|  36 -
 5 files changed, 139 insertions(+), 143 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index 123e82a2f8bb..de77d36e36b3 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -275,23 +275,6 @@ struct nft_set_estimate {
enum nft_set_class  space;
 };
 
-/**
- *  struct nft_set_type - nf_tables set type
- *
- *  @select_ops: function to select nft_set_ops
- *  @ops: default ops, used when no select_ops functions is present
- *  @list: used internally
- *  @owner: module reference
- */
-struct nft_set_type {
-   const struct nft_set_ops*(*select_ops)(const struct nft_ctx *,
-  const struct nft_set_desc *desc,
-  u32 flags);
-   const struct nft_set_ops*ops;
-   struct list_headlist;
-   struct module   *owner;
-};
-
 struct nft_set_ext;
 struct nft_expr;
 
@@ -310,7 +293,6 @@ struct nft_expr;
  * @init: initialize private data of new set instance
  * @destroy: destroy private data of set instance
  * @elemsize: element private size
- * @features: features supported by the implementation
  */
 struct nft_set_ops {
bool(*lookup)(const struct net *net,
@@ -361,9 +343,23 @@ struct nft_set_ops {
void(*destroy)(const struct nft_set *set);
 
unsigned intelemsize;
+};
+
+/**
+ *  struct nft_set_type - nf_tables set type
+ *
+ *  @ops: set ops for this type
+ *  @list: used internally
+ *  @owner: module reference
+ *  @features: features supported by the implementation
+ */
+struct nft_set_type {
+   const struct nft_set_opsops;
+   struct list_headlist;
+   struct module   *owner;
u32 features;
-   const struct nft_set_type   *type;
 };
+#define to_set_type(o) container_of(o, struct nft_set_type, ops)
 
 int nft_register_set(struct nft_set_type *type);
 void nft_unregister_set(struct nft_set_type *type);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2f14cadd9922..9ce35acf491d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2523,14 +2523,12 @@ void nft_unregister_set(struct nft_set_type *type)
 EXPORT_SYMBOL_GPL(nft_unregister_set);
 
 #define NFT_SET_FEATURES   (NFT_SET_INTERVAL | NFT_SET_MAP | \
-NFT_SET_TIMEOUT | NFT_SET_OBJECT)
+NFT_SET_TIMEOUT | NFT_SET_OBJECT | \
+NFT_SET_EVAL)
 
-static bool nft_set_ops_candidate(const struct nft_set_ops *ops, u32 flags)
+static bool nft_set_ops_candidate(const struct nft_set_type *type, u32 flags)
 {
-   if ((flags & NFT_SET_EVAL) && !ops->update)
-   return false;
-
-   return (flags & ops->features) == (flags & NFT_SET_FEATURES);
+   return (flags & type->features) == (flags & 

[PATCH 32/51] netfilter: ebtables: remove EBT_MATCH and EBT_NOMATCH

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

EBT_MATCH and EBT_NOMATCH are used to invert the return value.
The match functions (ebt_xxx.c) return false when the received frame does
not match and true when it does, but EBT_MATCH_ITERATE understands the
opposite convention. EBT_MATCH and EBT_NOMATCH are used to perform this
inversion, but we can simply use the '!' operator instead.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netfilter_bridge/ebtables.h | 4 
 net/bridge/netfilter/ebtables.c   | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/netfilter_bridge/ebtables.h 
b/include/linux/netfilter_bridge/ebtables.h
index 0773b5a032f1..c6935be7c6ca 100644
--- a/include/linux/netfilter_bridge/ebtables.h
+++ b/include/linux/netfilter_bridge/ebtables.h
@@ -17,10 +17,6 @@
 #include 
 #include 
 
-/* return values for match() functions */
-#define EBT_MATCH 0
-#define EBT_NOMATCH 1
-
 struct ebt_match {
struct list_head list;
const char name[EBT_FUNCTION_MAXNAMELEN];
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 355410b13316..7c07221369c0 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -101,7 +101,7 @@ ebt_do_match(struct ebt_entry_match *m, const struct 
sk_buff *skb,
 {
par->match = m->u.match;
par->matchinfo = m->data;
-   return m->u.match->match(skb, par) ? EBT_MATCH : EBT_NOMATCH;
+   return !m->u.match->match(skb, par);
 }
 
 static inline int
-- 
2.11.0



[PATCH 37/51] netfilter: nf_tables: always use an upper set size for dynsets

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

nft rejects rules that lack both a timeout and a size limit when they're
used to add elements from the packet path.

Pick a sane upper limit instead of rejecting outright.
The upper limit is visible to userspace, just as if it had been
given during set declaration.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_dynset.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index 04863fad05dd..5cc3509659c6 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -36,7 +36,7 @@ static void *nft_dynset_new(struct nft_set *set, const struct 
nft_expr *expr,
u64 timeout;
void *elem;
 
-   if (set->size && !atomic_add_unless(&set->nelems, 1, set->size))
+   if (!atomic_add_unless(&set->nelems, 1, set->size))
return NULL;
 
timeout = priv->timeout ? : set->timeout;
@@ -216,6 +216,9 @@ static int nft_dynset_init(const struct nft_ctx *ctx,
if (err < 0)
goto err1;
 
+   if (set->size == 0)
+   set->size = 0xffff;
+
priv->set = set;
return 0;
 
-- 
2.11.0



[PATCH 36/51] netfilter: nf_tables: support timeouts larger than 23 days

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

Marco De Benedetto says:
 I would like to use a timeout of 30 days for elements in a set, but it
 seems there is some kind of problem above 24d20h31m23s.

Fix this by using 'jiffies64' for timeout handling to get the same
behaviour on 32-bit and 64-bit systems.

nftables passes timeouts as u64 in milliseconds to the kernel,
but on the kernel side we used a mixture of 'long' and jiffies conversions
rather than u64 and jiffies64.

Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1237
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  4 ++--
 net/netfilter/nf_tables_api.c | 50 +--
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index de77d36e36b3..435c9e3b9181 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -585,7 +585,7 @@ static inline u64 *nft_set_ext_timeout(const struct 
nft_set_ext *ext)
return nft_set_ext(ext, NFT_SET_EXT_TIMEOUT);
 }
 
-static inline unsigned long *nft_set_ext_expiration(const struct nft_set_ext 
*ext)
+static inline u64 *nft_set_ext_expiration(const struct nft_set_ext *ext)
 {
return nft_set_ext(ext, NFT_SET_EXT_EXPIRATION);
 }
@@ -603,7 +603,7 @@ static inline struct nft_expr *nft_set_ext_expr(const 
struct nft_set_ext *ext)
 static inline bool nft_set_elem_expired(const struct nft_set_ext *ext)
 {
return nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION) &&
-  time_is_before_eq_jiffies(*nft_set_ext_expiration(ext));
+  time_is_before_eq_jiffies64(*nft_set_ext_expiration(ext));
 }
 
 static inline struct nft_set_ext *nft_set_elem_ext(const struct nft_set *set,
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 9ce35acf491d..d57aeea89a79 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2779,6 +2779,27 @@ static int nf_tables_set_alloc_name(struct nft_ctx *ctx, 
struct nft_set *set,
return 0;
 }
 
+static int nf_msecs_to_jiffies64(const struct nlattr *nla, u64 *result)
+{
+   u64 ms = be64_to_cpu(nla_get_be64(nla));
+   u64 max = (u64)(~((u64)0));
+
+   max = div_u64(max, NSEC_PER_MSEC);
+   if (ms >= max)
+   return -ERANGE;
+
+   ms *= NSEC_PER_MSEC;
+   *result = nsecs_to_jiffies64(ms);
+   return 0;
+}
+
+static u64 nf_jiffies64_to_msecs(u64 input)
+{
+   u64 ms = jiffies64_to_nsecs(input);
+
+   return cpu_to_be64(div_u64(ms, NSEC_PER_MSEC));
+}
+
 static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
  const struct nft_set *set, u16 event, u16 flags)
 {
@@ -2826,7 +2847,7 @@ static int nf_tables_fill_set(struct sk_buff *skb, const 
struct nft_ctx *ctx,
 
if (set->timeout &&
nla_put_be64(skb, NFTA_SET_TIMEOUT,
-cpu_to_be64(jiffies_to_msecs(set->timeout)),
+nf_jiffies64_to_msecs(set->timeout),
 NFTA_SET_PAD))
goto nla_put_failure;
if (set->gc_int &&
@@ -3122,8 +3143,10 @@ static int nf_tables_newset(struct net *net, struct sock 
*nlsk,
if (nla[NFTA_SET_TIMEOUT] != NULL) {
if (!(flags & NFT_SET_TIMEOUT))
return -EINVAL;
-   timeout = msecs_to_jiffies(be64_to_cpu(nla_get_be64(
-   nla[NFTA_SET_TIMEOUT])));
+
+   err = nf_msecs_to_jiffies64(nla[NFTA_SET_TIMEOUT], &timeout);
+   if (err)
+   return err;
}
gc_int = 0;
if (nla[NFTA_SET_GC_INTERVAL] != NULL) {
@@ -3387,8 +3410,8 @@ const struct nft_set_ext_type nft_set_ext_types[] = {
.align  = __alignof__(u64),
},
[NFT_SET_EXT_EXPIRATION]= {
-   .len= sizeof(unsigned long),
-   .align  = __alignof__(unsigned long),
+   .len= sizeof(u64),
+   .align  = __alignof__(u64),
},
[NFT_SET_EXT_USERDATA]  = {
.len= sizeof(struct nft_userdata),
@@ -3481,22 +3504,21 @@ static int nf_tables_fill_setelem(struct sk_buff *skb,
 
if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) &&
nla_put_be64(skb, NFTA_SET_ELEM_TIMEOUT,
-cpu_to_be64(jiffies_to_msecs(
-   *nft_set_ext_timeout(ext))),
+nf_jiffies64_to_msecs(*nft_set_ext_timeout(ext)),
 NFTA_SET_ELEM_PAD))
goto nla_put_failure;
 
if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION)) {
-   unsigned long expires, now = jiffies;
+   u64 expires, now = get_jiffies_64();
 
expires = 

[PATCH 35/51] netfilter: xtables: use ipt_get_target_c instead of ipt_get_target

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

ipt_get_target() is used to get a struct xt_entry_target
and ipt_get_target_c() is used to get a const struct xt_entry_target.
However, in ipt_do_table(), ipt_get_target() is used to get a
const struct xt_entry_target; it should be replaced by ipt_get_target_c().

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/ip_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 44b308d93ec2..444f125f3974 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -300,7 +300,7 @@ ipt_do_table(struct sk_buff *skb,
	counter = xt_get_this_cpu_counter(&e->counters);
ADD_COUNTER(*counter, skb->len, 1);
 
-   t = ipt_get_target(e);
+   t = ipt_get_target_c(e);
WARN_ON(!t->u.kernel.target);
 
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE)
-- 
2.11.0



[PATCH 40/51] netfilter: nf_tables: merge rt expression into nft core

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

before:
   textdata bss dec hex filename
   2657 844   03501 dad net/netfilter/nft_rt.ko
 1008262240 401  103467   1942b net/netfilter/nf_tables.ko
after:
   2657 844   03501 dad net/netfilter/nft_rt.ko
 1024562316 401  105173   19ad5 net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables_core.h |  1 +
 net/netfilter/Kconfig  |  6 --
 net/netfilter/Makefile |  3 +--
 net/netfilter/nf_tables_core.c |  1 +
 net/netfilter/nft_rt.c | 22 +-
 5 files changed, 4 insertions(+), 29 deletions(-)

diff --git a/include/net/netfilter/nf_tables_core.h 
b/include/net/netfilter/nf_tables_core.h
index 3339cce8f585..d6a358ae3749 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -11,6 +11,7 @@ extern struct nft_expr_type nft_payload_type;
 extern struct nft_expr_type nft_dynset_type;
 extern struct nft_expr_type nft_range_type;
 extern struct nft_expr_type nft_meta_type;
+extern struct nft_expr_type nft_rt_type;
 
 int nf_tables_core_module_init(void);
 void nf_tables_core_module_exit(void);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 29a13c7a5af2..771f1a4f3376 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -480,12 +480,6 @@ config NFT_EXTHDR
  This option adds the "exthdr" expression that you can use to match
  IPv6 extension headers and tcp options.
 
-config NFT_RT
-   tristate "Netfilter nf_tables routing module"
-   help
- This option adds the "rt" expression that you can use to match
- packet routing information such as the packet nexthop.
-
 config NFT_NUMGEN
tristate "Netfilter nf_tables number generator module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 89634c389fe7..128dbcfaa194 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,12 +76,11 @@ obj-$(CONFIG_NF_DUP_NETDEV) += nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
- nft_dynset.o nft_meta.o
+ nft_dynset.o nft_meta.o nft_rt.o
 
 obj-$(CONFIG_NF_TABLES)+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)   += nft_compat.o
 obj-$(CONFIG_NFT_EXTHDR)   += nft_exthdr.o
-obj-$(CONFIG_NFT_RT)   += nft_rt.o
 obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
 obj-$(CONFIG_NFT_FLOW_OFFLOAD) += nft_flow_offload.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index b67d6577f767..481ce2c0bbbf 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -252,6 +252,7 @@ static struct nft_expr_type *nft_basic_types[] = {
	&nft_dynset_type,
	&nft_range_type,
	&nft_meta_type,
+   &nft_rt_type,
 };
 
 int __init nf_tables_core_module_init(void)
diff --git a/net/netfilter/nft_rt.c b/net/netfilter/nft_rt.c
index 11a2071b6dd4..76dba9f6b6f6 100644
--- a/net/netfilter/nft_rt.c
+++ b/net/netfilter/nft_rt.c
@@ -7,8 +7,6 @@
  */
 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -179,7 +177,6 @@ static int nft_rt_validate(const struct nft_ctx *ctx, const 
struct nft_expr *exp
return nft_chain_validate_hooks(ctx->chain, hooks);
 }
 
-static struct nft_expr_type nft_rt_type;
 static const struct nft_expr_ops nft_rt_get_ops = {
	.type   = &nft_rt_type,
.size   = NFT_EXPR_SIZE(sizeof(struct nft_rt)),
@@ -189,27 +186,10 @@ static const struct nft_expr_ops nft_rt_get_ops = {
.validate   = nft_rt_validate,
 };
 
-static struct nft_expr_type nft_rt_type __read_mostly = {
+struct nft_expr_type nft_rt_type __read_mostly = {
.name   = "rt",
	.ops= &nft_rt_get_ops,
.policy = nft_rt_policy,
.maxattr= NFTA_RT_MAX,
.owner  = THIS_MODULE,
 };
-
-static int __init nft_rt_module_init(void)
-{
-   return nft_register_expr(&nft_rt_type);
-}
-
-static void __exit nft_rt_module_exit(void)
-{
-   nft_unregister_expr(&nft_rt_type);
-}
-
-module_init(nft_rt_module_init);
-module_exit(nft_rt_module_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Anders K. Pedersen ");
-MODULE_ALIAS_NFT_EXPR("rt");
-- 
2.11.0



[PATCH 42/51] ipvs: initialize tbl->entries after allocation

2018-05-06 Thread Pablo Neira Ayuso
From: Cong Wang 

tbl->entries is not initialized after kmalloc(); this causes an
uninit-value warning in ip_vs_lblc_check_expire(), as reported
by syzbot.

Reported-by: 
Cc: Simon Horman 
Cc: Julian Anastasov 
Cc: Pablo Neira Ayuso 
Signed-off-by: Cong Wang 
Acked-by: Julian Anastasov 
Acked-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 9b6a6c9e9cfa..542c4949937a 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -535,6 +535,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
tbl->counter = 1;
tbl->dead = false;
tbl->svc = svc;
+   atomic_set(&tbl->entries, 0);
 
/*
 *Hook periodic timer for garbage collection
-- 
2.11.0



[PATCH 34/51] netfilter: ebtables: add ebt_get_target and ebt_get_target_c

2018-05-06 Thread Pablo Neira Ayuso
From: Taehee Yoo 

ebt_get_target is similar to {ip/ip6/arp}t_get_target, and
ebt_get_target_c is similar to {ip/ip6/arp}t_get_target_c.

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter_bridge/ebtables.h |  6 ++
 net/bridge/netfilter/ebtables.c| 22 +-
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/netfilter_bridge/ebtables.h 
b/include/uapi/linux/netfilter_bridge/ebtables.h
index 0c7dc8315013..3b86c14ea49d 100644
--- a/include/uapi/linux/netfilter_bridge/ebtables.h
+++ b/include/uapi/linux/netfilter_bridge/ebtables.h
@@ -191,6 +191,12 @@ struct ebt_entry {
unsigned char elems[0] __attribute__ ((aligned (__alignof__(struct ebt_replace))));
 };
 
+static __inline__ struct ebt_entry_target *
+ebt_get_target(struct ebt_entry *e)
+{
+   return (void *)e + e->target_offset;
+}
+
 /* {g,s}etsockopt numbers */
 #define EBT_BASE_CTL128
 
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 7c07221369c0..9be240129448 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -177,6 +177,12 @@ struct ebt_entry *ebt_next_entry(const struct ebt_entry 
*entry)
return (void *)entry + entry->next_offset;
 }
 
+static inline const struct ebt_entry_target *
+ebt_get_target_c(const struct ebt_entry *e)
+{
+   return ebt_get_target((struct ebt_entry *)e);
+}
+
 /* Do some firewalling */
 unsigned int ebt_do_table(struct sk_buff *skb,
  const struct nf_hook_state *state,
@@ -230,8 +236,7 @@ unsigned int ebt_do_table(struct sk_buff *skb,
 */
EBT_WATCHER_ITERATE(point, ebt_do_watcher, skb, );
 
-   t = (struct ebt_entry_target *)
-  (((char *)point) + point->target_offset);
+   t = ebt_get_target_c(point);
/* standard target */
if (!t->u.target->target)
verdict = ((struct ebt_standard_target *)t)->verdict;
@@ -637,7 +642,7 @@ ebt_cleanup_entry(struct ebt_entry *e, struct net *net, 
unsigned int *cnt)
return 1;
EBT_WATCHER_ITERATE(e, ebt_cleanup_watcher, net, NULL);
EBT_MATCH_ITERATE(e, ebt_cleanup_match, net, NULL);
-   t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
+   t = ebt_get_target(e);
 
par.net  = net;
par.target   = t->u.target;
@@ -716,7 +721,7 @@ ebt_check_entry(struct ebt_entry *e, struct net *net,
ret = EBT_WATCHER_ITERATE(e, ebt_check_watcher, , );
if (ret != 0)
goto cleanup_watchers;
-   t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
+   t = ebt_get_target(e);
gap = e->next_offset - e->target_offset;
 
target = xt_request_find_target(NFPROTO_BRIDGE, t->u.name, 0);
@@ -789,8 +794,7 @@ static int check_chainloops(const struct ebt_entries 
*chain, struct ebt_cl_stack
if (pos == nentries)
continue;
}
-   t = (struct ebt_entry_target *)
-  (((char *)e) + e->target_offset);
+   t = ebt_get_target_c(e);
if (strcmp(t->u.name, EBT_STANDARD_TARGET))
goto letscontinue;
if (e->target_offset + sizeof(struct ebt_standard_target) >
@@ -1396,7 +1400,7 @@ static inline int ebt_entry_to_user(struct ebt_entry *e, 
const char *base,
return -EFAULT;
 
hlp = ubase + (((char *)e + e->target_offset) - base);
-   t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
+   t = ebt_get_target_c(e);
 
ret = EBT_MATCH_ITERATE(e, ebt_match_to_user, base, ubase);
if (ret != 0)
@@ -1737,7 +1741,7 @@ static int compat_copy_entry_to_user(struct ebt_entry *e, 
void __user **dstptr,
return ret;
target_offset = e->target_offset - (origsize - *size);
 
-   t = (struct ebt_entry_target *) ((char *) e + e->target_offset);
+   t = ebt_get_target(e);
 
ret = compat_target_to_user(t, dstptr, size);
if (ret)
@@ -1785,7 +1789,7 @@ static int compat_calc_entry(const struct ebt_entry *e,
EBT_MATCH_ITERATE(e, compat_calc_match, &off);
EBT_WATCHER_ITERATE(e, compat_calc_watcher, &off);
 
-   t = (const struct ebt_entry_target *) ((char *) e + e->target_offset);
+   t = ebt_get_target_c(e);
 
off += xt_compat_target_offset(t->u.target);
off += ebt_compat_entry_padsize();
-- 
2.11.0



[PATCH 39/51] netfilter: nf_tables: make meta expression builtin

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

size net/netfilter/nft_meta.ko
   text    data  bss     dec    hex filename
   5826     936    1    6763   1a6b net/netfilter/nft_meta.ko
  96407    2064  400   98871  18237 net/netfilter/nf_tables.ko

after:
 100826    2240  401  103467  1942b net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables_core.h |  1 +
 net/netfilter/Kconfig  |  6 --
 net/netfilter/Makefile |  3 +--
 net/netfilter/nf_tables_core.c |  1 +
 net/netfilter/nft_meta.c   | 22 +-
 5 files changed, 4 insertions(+), 29 deletions(-)

diff --git a/include/net/netfilter/nf_tables_core.h 
b/include/net/netfilter/nf_tables_core.h
index ea5aab568be8..3339cce8f585 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -10,6 +10,7 @@ extern struct nft_expr_type nft_byteorder_type;
 extern struct nft_expr_type nft_payload_type;
 extern struct nft_expr_type nft_dynset_type;
 extern struct nft_expr_type nft_range_type;
+extern struct nft_expr_type nft_meta_type;
 
 int nf_tables_core_module_init(void);
 void nf_tables_core_module_exit(void);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index d20664b02ae4..29a13c7a5af2 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -480,12 +480,6 @@ config NFT_EXTHDR
  This option adds the "exthdr" expression that you can use to match
  IPv6 extension headers and tcp options.
 
-config NFT_META
-   tristate "Netfilter nf_tables meta module"
-   help
- This option adds the "meta" expression that you can use to match and
- to set packet metainformation such as the packet mark.
-
 config NFT_RT
tristate "Netfilter nf_tables routing module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 3103ed1efe17..89634c389fe7 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,12 +76,11 @@ obj-$(CONFIG_NF_DUP_NETDEV) += nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
- nft_dynset.o
+ nft_dynset.o nft_meta.o
 
 obj-$(CONFIG_NF_TABLES)+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)   += nft_compat.o
 obj-$(CONFIG_NFT_EXTHDR)   += nft_exthdr.o
-obj-$(CONFIG_NFT_META) += nft_meta.o
 obj-$(CONFIG_NFT_RT)   += nft_rt.o
 obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index dfd0bf3810d2..b67d6577f767 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -251,6 +251,7 @@ static struct nft_expr_type *nft_basic_types[] = {
&nft_payload_type,
&nft_dynset_type,
&nft_range_type,
+   &nft_meta_type,
 };
 
 int __init nf_tables_core_module_init(void)
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 6c0b82628117..5348bd058c88 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -11,8 +11,6 @@
  */
 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -495,7 +493,6 @@ static void nft_meta_set_destroy(const struct nft_ctx *ctx,
static_branch_dec(&nft_trace_enabled);
 }
 
-static struct nft_expr_type nft_meta_type;
 static const struct nft_expr_ops nft_meta_get_ops = {
.type   = &nft_meta_type,
.size   = NFT_EXPR_SIZE(sizeof(struct nft_meta)),
@@ -534,27 +531,10 @@ nft_meta_select_ops(const struct nft_ctx *ctx,
return ERR_PTR(-EINVAL);
 }
 
-static struct nft_expr_type nft_meta_type __read_mostly = {
+struct nft_expr_type nft_meta_type __read_mostly = {
.name   = "meta",
.select_ops = nft_meta_select_ops,
.policy = nft_meta_policy,
.maxattr= NFTA_META_MAX,
.owner  = THIS_MODULE,
 };
-
-static int __init nft_meta_module_init(void)
-{
-   return nft_register_expr(&nft_meta_type);
-}
-
-static void __exit nft_meta_module_exit(void)
-{
-   nft_unregister_expr(&nft_meta_type);
-}
-
-module_init(nft_meta_module_init);
-module_exit(nft_meta_module_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Patrick McHardy ");
-MODULE_ALIAS_NFT_EXPR("meta");
-- 
2.11.0



[PATCH 41/51] netfilter: nf_tables: merge exthdr expression into nft core

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

before:
   text    data  bss     dec    hex filename
   5056     844    0    5900   170c net/netfilter/nft_exthdr.ko
 102456    2316  401  105173  19ad5 net/netfilter/nf_tables.ko

after:
 106410    2392  401  109203  1aa93 net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables_core.h |  1 +
 net/netfilter/Kconfig  |  6 --
 net/netfilter/Makefile |  3 +--
 net/netfilter/nf_tables_core.c |  1 +
 net/netfilter/nft_exthdr.c | 23 ++-
 5 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/include/net/netfilter/nf_tables_core.h 
b/include/net/netfilter/nf_tables_core.h
index d6a358ae3749..cd6915b6c054 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -12,6 +12,7 @@ extern struct nft_expr_type nft_dynset_type;
 extern struct nft_expr_type nft_range_type;
 extern struct nft_expr_type nft_meta_type;
 extern struct nft_expr_type nft_rt_type;
+extern struct nft_expr_type nft_exthdr_type;
 
 int nf_tables_core_module_init(void);
 void nf_tables_core_module_exit(void);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 771f1a4f3376..f66586fb41cd 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -474,12 +474,6 @@ config NF_TABLES_NETDEV
help
  This option enables support for the "netdev" table.
 
-config NFT_EXTHDR
-   tristate "Netfilter nf_tables exthdr module"
-   help
- This option adds the "exthdr" expression that you can use to match
- IPv6 extension headers and tcp options.
-
 config NFT_NUMGEN
tristate "Netfilter nf_tables number generator module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 128dbcfaa194..b37ce0bc9ab7 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,11 +76,10 @@ obj-$(CONFIG_NF_DUP_NETDEV) += nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
- nft_dynset.o nft_meta.o nft_rt.o
+ nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o
 
 obj-$(CONFIG_NF_TABLES)+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)   += nft_compat.o
-obj-$(CONFIG_NFT_EXTHDR)   += nft_exthdr.o
 obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
 obj-$(CONFIG_NFT_FLOW_OFFLOAD) += nft_flow_offload.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 481ce2c0bbbf..9cf47c4cb9d5 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -253,6 +253,7 @@ static struct nft_expr_type *nft_basic_types[] = {
&nft_range_type,
&nft_meta_type,
&nft_rt_type,
+   &nft_exthdr_type,
 };
 
 int __init nf_tables_core_module_init(void)
diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
index 47ec1046ad11..a940c9fd9045 100644
--- a/net/netfilter/nft_exthdr.c
+++ b/net/netfilter/nft_exthdr.c
@@ -10,11 +10,10 @@
 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -353,7 +352,6 @@ static int nft_exthdr_dump_set(struct sk_buff *skb, const 
struct nft_expr *expr)
return nft_exthdr_dump_common(skb, priv);
 }
 
-static struct nft_expr_type nft_exthdr_type;
 static const struct nft_expr_ops nft_exthdr_ipv6_ops = {
.type   = &nft_exthdr_type,
.size   = NFT_EXPR_SIZE(sizeof(struct nft_exthdr)),
@@ -407,27 +405,10 @@ nft_exthdr_select_ops(const struct nft_ctx *ctx,
return ERR_PTR(-EOPNOTSUPP);
 }
 
-static struct nft_expr_type nft_exthdr_type __read_mostly = {
+struct nft_expr_type nft_exthdr_type __read_mostly = {
.name   = "exthdr",
.select_ops = nft_exthdr_select_ops,
.policy = nft_exthdr_policy,
.maxattr= NFTA_EXTHDR_MAX,
.owner  = THIS_MODULE,
 };
-
-static int __init nft_exthdr_module_init(void)
-{
-   return nft_register_expr(&nft_exthdr_type);
-}
-
-static void __exit nft_exthdr_module_exit(void)
-{
-   nft_unregister_expr(&nft_exthdr_type);
-}
-
-module_init(nft_exthdr_module_init);
-module_exit(nft_exthdr_module_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Patrick McHardy ");
-MODULE_ALIAS_NFT_EXPR("exthdr");
-- 
2.11.0



[PATCH 38/51] netfilter: merge meta_bridge into nft_meta

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

It overcomplicates things for no reason.
nft_meta_bridge only offers retrieval of bridge port interface name.

Because of this being its own module, we had to export all nft_meta
functions, which we can then make static again (which even reduces
the size of nft_meta -- including bridge port retrieval...):

before:
   text    data  bss    dec    hex filename
   1838     832    0   2670    a6e net/bridge/netfilter/nft_meta_bridge.ko
   6147     936    1   7084   1bac net/netfilter/nft_meta.ko

after:
   5826     936    1   6763   1a6b net/netfilter/nft_meta.ko

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nft_meta.h   |  44 ---
 net/bridge/netfilter/Kconfig   |   7 --
 net/bridge/netfilter/Makefile  |   1 -
 net/bridge/netfilter/nft_meta_bridge.c | 135 -
 net/netfilter/nft_meta.c   |  90 ++
 5 files changed, 58 insertions(+), 219 deletions(-)
 delete mode 100644 include/net/netfilter/nft_meta.h
 delete mode 100644 net/bridge/netfilter/nft_meta_bridge.c

diff --git a/include/net/netfilter/nft_meta.h b/include/net/netfilter/nft_meta.h
deleted file mode 100644
index 5c69e9b09388..
--- a/include/net/netfilter/nft_meta.h
+++ /dev/null
@@ -1,44 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _NFT_META_H_
-#define _NFT_META_H_
-
-struct nft_meta {
-   enum nft_meta_keys  key:8;
-   union {
-   enum nft_registers  dreg:8;
-   enum nft_registers  sreg:8;
-   };
-};
-
-extern const struct nla_policy nft_meta_policy[];
-
-int nft_meta_get_init(const struct nft_ctx *ctx,
- const struct nft_expr *expr,
- const struct nlattr * const tb[]);
-
-int nft_meta_set_init(const struct nft_ctx *ctx,
- const struct nft_expr *expr,
- const struct nlattr * const tb[]);
-
-int nft_meta_get_dump(struct sk_buff *skb,
- const struct nft_expr *expr);
-
-int nft_meta_set_dump(struct sk_buff *skb,
- const struct nft_expr *expr);
-
-void nft_meta_get_eval(const struct nft_expr *expr,
-  struct nft_regs *regs,
-  const struct nft_pktinfo *pkt);
-
-void nft_meta_set_eval(const struct nft_expr *expr,
-  struct nft_regs *regs,
-  const struct nft_pktinfo *pkt);
-
-void nft_meta_set_destroy(const struct nft_ctx *ctx,
- const struct nft_expr *expr);
-
-int nft_meta_set_validate(const struct nft_ctx *ctx,
- const struct nft_expr *expr,
- const struct nft_data **data);
-
-#endif
diff --git a/net/bridge/netfilter/Kconfig b/net/bridge/netfilter/Kconfig
index f212447794bd..9a0159aebe1a 100644
--- a/net/bridge/netfilter/Kconfig
+++ b/net/bridge/netfilter/Kconfig
@@ -8,13 +8,6 @@ menuconfig NF_TABLES_BRIDGE
bool "Ethernet Bridge nf_tables support"
 
 if NF_TABLES_BRIDGE
-
-config NFT_BRIDGE_META
-   tristate "Netfilter nf_table bridge meta support"
-   depends on NFT_META
-   help
- Add support for bridge dedicated meta key.
-
 config NFT_BRIDGE_REJECT
tristate "Netfilter nf_tables bridge reject support"
depends on NFT_REJECT && NFT_REJECT_IPV4 && NFT_REJECT_IPV6
diff --git a/net/bridge/netfilter/Makefile b/net/bridge/netfilter/Makefile
index 4bc758dd4a8c..9b868861f21a 100644
--- a/net/bridge/netfilter/Makefile
+++ b/net/bridge/netfilter/Makefile
@@ -3,7 +3,6 @@
 # Makefile for the netfilter modules for Link Layer filtering on a bridge.
 #
 
-obj-$(CONFIG_NFT_BRIDGE_META)  += nft_meta_bridge.o
 obj-$(CONFIG_NFT_BRIDGE_REJECT)  += nft_reject_bridge.o
 
 # packet logging
diff --git a/net/bridge/netfilter/nft_meta_bridge.c 
b/net/bridge/netfilter/nft_meta_bridge.c
deleted file mode 100644
index bb63c9aed55d..
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ /dev/null
@@ -1,135 +0,0 @@
-/*
- * Copyright (c) 2014 Intel Corporation
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "../br_private.h"
-
-static void nft_meta_bridge_get_eval(const struct nft_expr *expr,
-struct nft_regs *regs,
-const struct nft_pktinfo *pkt)
-{
-   const struct nft_meta *priv = nft_expr_priv(expr);
-   const struct net_device *in = nft_in(pkt), *out = nft_out(pkt);
-   u32 *dest = &regs->data[priv->dreg];
-   const struct net_bridge_port *p;
-
-   switch (priv->key) {
-   case NFT_META_BRI_IIFNAME:
-   if 

[PATCH 24/51] netfilter: nf_flow_table: add missing condition for TCP state check

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Avoid looking at unrelated fields in UDP packets.

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_ip.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 692c75ef5cb7..82451b7e0acb 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -15,11 +15,14 @@
 #include 
 #include 
 
-static int nf_flow_tcp_state_check(struct flow_offload *flow,
-  struct sk_buff *skb, unsigned int thoff)
+static int nf_flow_state_check(struct flow_offload *flow, int proto,
+  struct sk_buff *skb, unsigned int thoff)
 {
struct tcphdr *tcph;
 
+   if (proto != IPPROTO_TCP)
+   return 0;
+
if (!pskb_may_pull(skb, thoff + sizeof(*tcph)))
return -1;
 
@@ -248,7 +251,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
return NF_DROP;
 
thoff = ip_hdr(skb)->ihl * 4;
-   if (nf_flow_tcp_state_check(flow, skb, thoff))
+   if (nf_flow_state_check(flow, ip_hdr(skb)->protocol, skb, thoff))
return NF_ACCEPT;
 
if (flow->flags & (FLOW_OFFLOAD_SNAT | FLOW_OFFLOAD_DNAT) &&
@@ -460,7 +463,8 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
return NF_ACCEPT;
 
-   if (nf_flow_tcp_state_check(flow, skb, sizeof(*ip6h)))
+   if (nf_flow_state_check(flow, ipv6_hdr(skb)->nexthdr, skb,
+   sizeof(*ip6h)))
return NF_ACCEPT;
 
if (skb_try_make_writable(skb, sizeof(*ip6h)))
-- 
2.11.0



[PATCH 46/51] netfilter: ip6t_srh: extend SRH matching for previous, next and last SID

2018-05-06 Thread Pablo Neira Ayuso
From: Ahmed Abdelsalam 

IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed
by an SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

When a firewall receives an SR encapsulated packet, it should be able
to identify which node previously processed the packet (previous SID),
which node is going to process the packet next (next SID), and which
node is the last to process the packet (last SID), which represents the
final destination of the packet in the case of inline SR mode.

An example use case of these features is a SID list that includes two
firewalls. When the second firewall receives a packet, it can check
whether the packet has been processed by the first firewall or not.
Based on that check, it decides to apply all rules, apply just a
subset of the rules, or skip all rules entirely and forward the packet
to the next SID.

This patch extends SRH match to support matching previous SID, next SID,
and last SID.

Signed-off-by: Ahmed Abdelsalam 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter_ipv6/ip6t_srh.h |  43 ++-
 net/ipv6/netfilter/ip6t_srh.c| 173 +--
 2 files changed, 205 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h 
b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
index f3cc0ef514a7..54ed83360dac 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
@@ -17,7 +17,10 @@
 #define IP6T_SRH_LAST_GT0x0100
 #define IP6T_SRH_LAST_LT0x0200
 #define IP6T_SRH_TAG0x0400
-#define IP6T_SRH_MASK   0x07FF
+#define IP6T_SRH_PSID   0x0800
+#define IP6T_SRH_NSID   0x1000
+#define IP6T_SRH_LSID   0x2000
+#define IP6T_SRH_MASK   0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR0x0001
@@ -31,7 +34,10 @@
 #define IP6T_SRH_INV_LAST_GT0x0100
 #define IP6T_SRH_INV_LAST_LT0x0200
 #define IP6T_SRH_INV_TAG0x0400
-#define IP6T_SRH_INV_MASK   0x07FF
+#define IP6T_SRH_INV_PSID   0x0800
+#define IP6T_SRH_INV_NSID   0x1000
+#define IP6T_SRH_INV_LSID   0x2000
+#define IP6T_SRH_INV_MASK   0x3FFF
 
 /**
  *  struct ip6t_srh - SRH match options
@@ -54,4 +60,37 @@ struct ip6t_srh {
__u16   mt_invflags;
 };
 
+/**
+ *  struct ip6t_srh1 - SRH match options (revision 1)
+ *  @ next_hdr: Next header field of SRH
+ *  @ hdr_len: Extension header length field of SRH
+ *  @ segs_left: Segments left field of SRH
+ *  @ last_entry: Last entry field of SRH
+ *  @ tag: Tag field of SRH
+ *  @ psid_addr: Address of previous SID in SRH SID list
+ *  @ nsid_addr: Address of next SID in SRH SID list
+ *  @ lsid_addr: Address of last SID in SRH SID list
+ *  @ psid_msk: Mask of previous SID in SRH SID list
+ *  @ nsid_msk: Mask of next SID in SRH SID list
+ *  @ lsid_msk: Mask of last SID in SRH SID list
+ *  @ mt_flags: match options
+ *  @ mt_invflags: Invert the sense of match options
+ */
+
+struct ip6t_srh1 {
+   __u8next_hdr;
+   __u8hdr_len;
+   __u8segs_left;
+   __u8last_entry;
+   __u16   tag;
+   struct in6_addr psid_addr;
+   struct in6_addr nsid_addr;
+   struct in6_addr lsid_addr;
+   struct in6_addr psid_msk;
+   struct in6_addr nsid_msk;
+   struct in6_addr lsid_msk;
+   __u16   mt_flags;
+   __u16   mt_invflags;
+};
+
 #endif /*_IP6T_SRH_H*/
diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
index 33719d5560c8..1059894a6f4c 100644
--- a/net/ipv6/netfilter/ip6t_srh.c
+++ b/net/ipv6/netfilter/ip6t_srh.c
@@ -117,6 +117,130 @@ static bool srh_mt6(const struct sk_buff *skb, struct 
xt_action_param *par)
return true;
 }
 
+static bool srh1_mt6(const struct sk_buff *skb, struct xt_action_param *par)
+{
+   int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
+   const struct ip6t_srh1 *srhinfo = par->matchinfo;
+   struct in6_addr *psid, *nsid, *lsid;
+   struct in6_addr _psid, _nsid, _lsid;
+   struct ipv6_sr_hdr *srh;
+   struct ipv6_sr_hdr _srh;
+
+   if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+   return false;
+   srh = skb_header_pointer(skb, srhoff, sizeof(_srh), &_srh);
+   if (!srh)
+   return false;
+
+   hdrlen = ipv6_optlen(srh);
+   if (skb->len - srhoff < hdrlen)
+   return false;
+
+   if (srh->type != IPV6_SRCRT_TYPE_4)
+   return false;
+
+   if (srh->segments_left > srh->first_segment)
+   return false;
+
+   /* Next Header matching */
+   if 

[PATCH 50/51] netfilter: ctnetlink: export nf_conntrack_max

2018-05-06 Thread Pablo Neira Ayuso
From: Florent Fourcot 

The IPCTNL_MSG_CT_GET_STATS netlink command allows monitoring the
current number of conntrack entries. However, if one wants to compare
it with the maximum (and detect exhaustion), the only current solution
is to read the sysctl value.

This patch adds the nf_conntrack_max value to the netlink message and
simplifies monitoring for applications built on the netlink API.

Signed-off-by: Florent Fourcot 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nfnetlink_conntrack.h | 1 +
 net/netfilter/nf_conntrack_core.c  | 1 +
 net/netfilter/nf_conntrack_netlink.c   | 3 +++
 3 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/netfilter/nfnetlink_conntrack.h 
b/include/uapi/linux/netfilter/nfnetlink_conntrack.h
index 77987111cab0..1d41810d17e2 100644
--- a/include/uapi/linux/netfilter/nfnetlink_conntrack.h
+++ b/include/uapi/linux/netfilter/nfnetlink_conntrack.h
@@ -262,6 +262,7 @@ enum ctattr_stats_cpu {
 enum ctattr_stats_global {
CTA_STATS_GLOBAL_UNSPEC,
CTA_STATS_GLOBAL_ENTRIES,
+   CTA_STATS_GLOBAL_MAX_ENTRIES,
__CTA_STATS_GLOBAL_MAX,
 };
 #define CTA_STATS_GLOBAL_MAX (__CTA_STATS_GLOBAL_MAX - 1)
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 41ff04ee2554..605441727008 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -186,6 +186,7 @@ unsigned int nf_conntrack_htable_size __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
 unsigned int nf_conntrack_max __read_mostly;
+EXPORT_SYMBOL_GPL(nf_conntrack_max);
 seqcount_t nf_conntrack_generation __read_mostly;
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index 4c1d0c5bc268..d807b8770be3 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -2205,6 +2205,9 @@ ctnetlink_stat_ct_fill_info(struct sk_buff *skb, u32 
portid, u32 seq, u32 type,
if (nla_put_be32(skb, CTA_STATS_GLOBAL_ENTRIES, htonl(nr_conntracks)))
goto nla_put_failure;
 
+   if (nla_put_be32(skb, CTA_STATS_GLOBAL_MAX_ENTRIES, 
htonl(nf_conntrack_max)))
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
return skb->len;
 
-- 
2.11.0



[PATCH 43/51] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()

2018-05-06 Thread Pablo Neira Ayuso
From: Cong Wang 

Similarly, tbl->entries is not initialized after kmalloc(), which
causes an uninit-value warning in ip_vs_lblc_check_expire(), as
reported by syzbot.

Reported-by: 
Cc: Simon Horman 
Cc: Julian Anastasov 
Cc: Pablo Neira Ayuso 
Signed-off-by: Cong Wang 
Acked-by: Julian Anastasov 
Acked-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/ipvs/ip_vs_lblc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 08147fc6400c..b9f375e6dc93 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -372,6 +372,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
tbl->counter = 1;
tbl->dead = false;
tbl->svc = svc;
+   atomic_set(&tbl->entries, 0);
 
/*
 *Hook periodic timer for garbage collection
-- 
2.11.0



[PATCH 47/51] netfilter: nf_nat: remove unused ct arg from lookup functions

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_nat_l3proto.h   | 24 
 net/ipv4/netfilter/iptable_nat.c |  3 +--
 net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 14 +-
 net/ipv4/netfilter/nft_chain_nat_ipv4.c  |  3 +--
 net/ipv6/netfilter/ip6table_nat.c|  3 +--
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 14 +-
 net/ipv6/netfilter/nft_chain_nat_ipv6.c  |  3 +--
 7 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/include/net/netfilter/nf_nat_l3proto.h 
b/include/net/netfilter/nf_nat_l3proto.h
index ac47098a61dc..8bad2560576f 100644
--- a/include/net/netfilter/nf_nat_l3proto.h
+++ b/include/net/netfilter/nf_nat_l3proto.h
@@ -48,30 +48,26 @@ unsigned int nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state,
 			unsigned int (*do_chain)(void *priv,
 						 struct sk_buff *skb,
-						 const struct nf_hook_state *state,
-						 struct nf_conn *ct));
+						 const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
 			 const struct nf_hook_state *state,
 			 unsigned int (*do_chain)(void *priv,
 						  struct sk_buff *skb,
-						  const struct nf_hook_state *state,
-						  struct nf_conn *ct));
+						  const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv4_local_fn(void *priv,
 			  struct sk_buff *skb,
 			  const struct nf_hook_state *state,
 			  unsigned int (*do_chain)(void *priv,
 						   struct sk_buff *skb,
-						   const struct nf_hook_state *state,
-						   struct nf_conn *ct));
+						   const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state,
 			unsigned int (*do_chain)(void *priv,
 						 struct sk_buff *skb,
-						 const struct nf_hook_state *state,
-						 struct nf_conn *ct));
+						 const struct nf_hook_state *state));
 
 int nf_nat_icmpv6_reply_translation(struct sk_buff *skb, struct nf_conn *ct,
 			enum ip_conntrack_info ctinfo,
@@ -81,29 +77,25 @@ unsigned int nf_nat_ipv6_in(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state,
 			unsigned int (*do_chain)(void *priv,
 						 struct sk_buff *skb,
-						 const struct nf_hook_state *state,
-						 struct nf_conn *ct));
+						 const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv6_out(void *priv, struct sk_buff *skb,
 			 const struct nf_hook_state *state,
 			 unsigned int (*do_chain)(void *priv,
 						  struct sk_buff *skb,
-						  const struct nf_hook_state *state,
-						  struct nf_conn *ct));
+						  const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv6_local_fn(void *priv,
 			  struct sk_buff *skb,
 			  const struct nf_hook_state *state,
 			  unsigned int (*do_chain)(void *priv,
 						   struct sk_buff *skb,
-						   const struct nf_hook_state *state,
-						   struct nf_conn *ct));
+						   const struct nf_hook_state *state));
 
 unsigned int nf_nat_ipv6_fn(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state,
 			unsigned int

[PATCH 48/51] netfilter: nf_tables: Provide NFT_{RT,CT}_MAX for userspace

2018-05-06 Thread Pablo Neira Ayuso
From: Phil Sutter 

These macros allow conveniently declaring arrays which use NFT_{RT,CT}_*
values as indexes.

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 5a5551a580f7..ce031cf72288 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -831,7 +831,9 @@ enum nft_rt_keys {
NFT_RT_NEXTHOP4,
NFT_RT_NEXTHOP6,
NFT_RT_TCPMSS,
+   __NFT_RT_MAX
 };
+#define NFT_RT_MAX (__NFT_RT_MAX - 1)
 
 /**
  * enum nft_hash_types - nf_tables hash expression types
@@ -949,7 +951,9 @@ enum nft_ct_keys {
NFT_CT_DST_IP,
NFT_CT_SRC_IP6,
NFT_CT_DST_IP6,
+   __NFT_CT_MAX
 };
+#define NFT_CT_MAX (__NFT_CT_MAX - 1)
 
 /**
  * enum nft_ct_attributes - nf_tables ct expression netlink attributes
-- 
2.11.0



[PATCH 51/51] netfilter: nft_dynset: fix timeout updates on 32bit

2018-05-06 Thread Pablo Neira Ayuso
From: Florian Westphal 

This must now use a 64bit jiffies value, else we set
a bogus timeout on 32bit.

Fixes: 8e1102d5a1596 ("netfilter: nf_tables: support timeouts larger than 23 
days")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_dynset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index 5cc3509659c6..b07a3fd9eeea 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -81,7 +81,7 @@ static void nft_dynset_eval(const struct nft_expr *expr,
if (priv->op == NFT_DYNSET_OP_UPDATE &&
nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION)) {
timeout = priv->timeout ? : set->timeout;
-   *nft_set_ext_expiration(ext) = jiffies + timeout;
+   *nft_set_ext_expiration(ext) = get_jiffies_64() + 
timeout;
}
 
if (sexpr != NULL)
-- 
2.11.0
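The bug being fixed is plain 32-bit truncation: on 32-bit platforms jiffies is an unsigned long, which at HZ=1000 wraps after roughly 49.7 days, so any larger timeout loses its high bits before the addition. A userspace sketch of the failure mode (assuming HZ=1000; illustrative, not kernel code):

```c
#include <stdint.h>

#define DEMO_HZ 1000ULL	/* assumed tick rate; not from the patch */

static uint64_t days_to_jiffies(uint64_t days)
{
	return days * 24 * 3600 * DEMO_HZ;
}

/* What the pre-fix code effectively did on 32-bit: the timeout is
 * truncated to 32 bits before the addition, producing a bogus
 * (far too early) expiration for timeouts beyond ~49.7 days. */
static uint64_t expire32(uint32_t now, uint64_t timeout)
{
	return now + (uint32_t)timeout;
}

/* The fixed variant works in 64 bits throughout, matching
 * get_jiffies_64() + timeout. */
static uint64_t expire64(uint64_t now, uint64_t timeout)
{
	return now + timeout;
}
```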



[PATCH 49/51] netfilter: extract Passive OS fingerprint infrastructure from xt_osf

2018-05-06 Thread Pablo Neira Ayuso
From: Fernando Fernandez Mancera 

Add nf_osf_ttl() and nf_osf_match() into nf_osf.c to prepare for
nf_tables support.

Signed-off-by: Fernando Fernandez Mancera 
Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netfilter/nf_osf.h  |  27 +
 include/uapi/linux/netfilter/nf_osf.h |  90 ++
 include/uapi/linux/netfilter/xt_osf.h | 106 +++--
 net/netfilter/Kconfig |   4 +
 net/netfilter/Makefile|   1 +
 net/netfilter/nf_osf.c| 218 ++
 net/netfilter/xt_osf.c| 202 +--
 7 files changed, 359 insertions(+), 289 deletions(-)
 create mode 100644 include/linux/netfilter/nf_osf.h
 create mode 100644 include/uapi/linux/netfilter/nf_osf.h
 create mode 100644 net/netfilter/nf_osf.c

diff --git a/include/linux/netfilter/nf_osf.h b/include/linux/netfilter/nf_osf.h
new file mode 100644
index ..a2b39602e87d
--- /dev/null
+++ b/include/linux/netfilter/nf_osf.h
@@ -0,0 +1,27 @@
+#include <uapi/linux/netfilter/nf_osf.h>
+
+/* Initial window size option state machine: multiple of mss, mtu or
+ * plain numeric value. Can also be made as plain numeric value which
+ * is not a multiple of specified value.
+ */
+enum nf_osf_window_size_options {
+   OSF_WSS_PLAIN   = 0,
+   OSF_WSS_MSS,
+   OSF_WSS_MTU,
+   OSF_WSS_MODULO,
+   OSF_WSS_MAX,
+};
+
+enum osf_fmatch_states {
+   /* Packet does not match the fingerprint */
+   FMATCH_WRONG = 0,
+   /* Packet matches the fingerprint */
+   FMATCH_OK,
+   /* Options do not match the fingerprint, but header does */
+   FMATCH_OPT_WRONG,
+};
+
+bool nf_osf_match(const struct sk_buff *skb, u_int8_t family,
+ int hooknum, struct net_device *in, struct net_device *out,
+ const struct nf_osf_info *info, struct net *net,
+ const struct list_head *nf_osf_fingers);
diff --git a/include/uapi/linux/netfilter/nf_osf.h 
b/include/uapi/linux/netfilter/nf_osf.h
new file mode 100644
index ..45376eae31ef
--- /dev/null
+++ b/include/uapi/linux/netfilter/nf_osf.h
@@ -0,0 +1,90 @@
+#ifndef _NF_OSF_H
+#define _NF_OSF_H
+
+#define MAXGENRELEN	32
+
+#define NF_OSF_GENRE   (1 << 0)
+#define NF_OSF_TTL (1 << 1)
+#define NF_OSF_LOG (1 << 2)
+#define NF_OSF_INVERT  (1 << 3)
+
+#define NF_OSF_LOGLEVEL_ALL		0	/* log all matched fingerprints */
+#define NF_OSF_LOGLEVEL_FIRST		1	/* log only the first matched fingerprint */
+#define NF_OSF_LOGLEVEL_ALL_KNOWN	2	/* do not log unknown packets */
+
+#define NF_OSF_TTL_TRUE			0	/* True ip and fingerprint TTL comparison */
+
+/* Do not compare ip and fingerprint TTL at all */
+#define NF_OSF_TTL_NOCHECK 2
+
+/* Wildcard MSS (kind of).
+ * It is used to implement a state machine for the different wildcard values
+ * of the MSS and window sizes.
+ */
+struct nf_osf_wc {
+   __u32   wc;
+   __u32   val;
+};
+
+/* This struct represents IANA options
+ * http://www.iana.org/assignments/tcp-parameters
+ */
+struct nf_osf_opt {
+   __u16   kind, length;
+	struct nf_osf_wc	wc;
+};
+
+struct nf_osf_info {
+   chargenre[MAXGENRELEN];
+   __u32   len;
+   __u32   flags;
+   __u32   loglevel;
+   __u32   ttl;
+};
+
+struct nf_osf_user_finger {
+	struct nf_osf_wc	wss;
+
+   __u8ttl, df;
+   __u16   ss, mss;
+   __u16   opt_num;
+
+   chargenre[MAXGENRELEN];
+   charversion[MAXGENRELEN];
+   charsubtype[MAXGENRELEN];
+
+   /* MAX_IPOPTLEN is maximum if all options are NOPs or EOLs */
+   struct nf_osf_opt   opt[MAX_IPOPTLEN];
+};
+
+struct nf_osf_finger {
+   struct rcu_head rcu_head;
+	struct list_head	finger_entry;
+   struct nf_osf_user_finger   finger;
+};
+
+struct nf_osf_nlmsg {
+   struct nf_osf_user_finger   f;
+	struct iphdr		ip;
+   struct tcphdr   tcp;
+};
+
+/* Defines for IANA option kinds */
+enum iana_options {
+   OSFOPT_EOL = 0, /* End of options */
+   OSFOPT_NOP, /* NOP */
+   OSFOPT_MSS, /* Maximum segment size */
+   OSFOPT_WSO, /* Window scale option */
+   OSFOPT_SACKP,   /* SACK permitted */
+   OSFOPT_SACK,/* SACK */
+   OSFOPT_ECHO,
+   OSFOPT_ECHOREPLY,
+   OSFOPT_TS,  /* Timestamp option */
+   OSFOPT_POCP,/* Partial Order Connection Permitted */
+   OSFOPT_POSP,/* Partial Order Service Profile */
+
+   /* Others are not used in the current OSF */
+   OSFOPT_EMPTY = 255,
+};
+
+#endif /* _NF_OSF_H */
diff --git a/include/uapi/linux/netfilter/xt_osf.h 
b/include/uapi/linux/netfilter/xt_osf.h

[PATCH 44/51] netfilter: nft_numgen: add map lookups for numgen statements

2018-05-06 Thread Pablo Neira Ayuso
From: Laura Garcia Liebana 

This patch includes a new attribute in the numgen structure to allow
the lookup of an element based on the number generator as a key.

For this purpose, different ops have been included to extend the
current numgen inc functions.

Currently, only supported for numgen incremental operations, but
it will be supported for random in a follow-up patch.

Signed-off-by: Laura Garcia Liebana 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  4 ++
 net/netfilter/nft_numgen.c   | 85 ++--
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 6a3d653d5b27..5a5551a580f7 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1450,6 +1450,8 @@ enum nft_trace_types {
  * @NFTA_NG_MODULUS: maximum counter value (NLA_U32)
  * @NFTA_NG_TYPE: operation type (NLA_U32)
  * @NFTA_NG_OFFSET: offset to be added to the counter (NLA_U32)
+ * @NFTA_NG_SET_NAME: name of the map to lookup (NLA_STRING)
+ * @NFTA_NG_SET_ID: id of the map (NLA_U32)
  */
 enum nft_ng_attributes {
NFTA_NG_UNSPEC,
@@ -1457,6 +1459,8 @@ enum nft_ng_attributes {
NFTA_NG_MODULUS,
NFTA_NG_TYPE,
NFTA_NG_OFFSET,
+   NFTA_NG_SET_NAME,
+   NFTA_NG_SET_ID,
__NFTA_NG_MAX
 };
 #define NFTA_NG_MAX		(__NFTA_NG_MAX - 1)
diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c
index 5a3a52c71545..8a64db8f2e69 100644
--- a/net/netfilter/nft_numgen.c
+++ b/net/netfilter/nft_numgen.c
@@ -24,13 +24,11 @@ struct nft_ng_inc {
u32 modulus;
	atomic_t		counter;
u32 offset;
+   struct nft_set  *map;
 };
 
-static void nft_ng_inc_eval(const struct nft_expr *expr,
-   struct nft_regs *regs,
-   const struct nft_pktinfo *pkt)
+static u32 nft_ng_inc_gen(struct nft_ng_inc *priv)
 {
-   struct nft_ng_inc *priv = nft_expr_priv(expr);
u32 nval, oval;
 
do {
@@ -38,7 +36,36 @@ static void nft_ng_inc_eval(const struct nft_expr *expr,
nval = (oval + 1 < priv->modulus) ? oval + 1 : 0;
	} while (atomic_cmpxchg(&priv->counter, oval, nval) != oval);
 
-   regs->data[priv->dreg] = nval + priv->offset;
+   return nval + priv->offset;
+}
+
+static void nft_ng_inc_eval(const struct nft_expr *expr,
+   struct nft_regs *regs,
+   const struct nft_pktinfo *pkt)
+{
+   struct nft_ng_inc *priv = nft_expr_priv(expr);
+
+   regs->data[priv->dreg] = nft_ng_inc_gen(priv);
+}
+
+static void nft_ng_inc_map_eval(const struct nft_expr *expr,
+   struct nft_regs *regs,
+   const struct nft_pktinfo *pkt)
+{
+   struct nft_ng_inc *priv = nft_expr_priv(expr);
+   const struct nft_set *map = priv->map;
+   const struct nft_set_ext *ext;
+   u32 result;
+   bool found;
+
+   result = nft_ng_inc_gen(priv);
+	found = map->ops->lookup(nft_net(pkt), map, &result, &ext);
+
+   if (!found)
+   return;
+
+	nft_data_copy(&regs->data[priv->dreg],
+ nft_set_ext_data(ext), map->dlen);
 }
 
 static const struct nla_policy nft_ng_policy[NFTA_NG_MAX + 1] = {
@@ -46,6 +73,9 @@ static const struct nla_policy nft_ng_policy[NFTA_NG_MAX + 1] 
= {
[NFTA_NG_MODULUS]   = { .type = NLA_U32 },
[NFTA_NG_TYPE]  = { .type = NLA_U32 },
[NFTA_NG_OFFSET]= { .type = NLA_U32 },
+   [NFTA_NG_SET_NAME]  = { .type = NLA_STRING,
+   .len = NFT_SET_MAXNAMELEN - 1 },
+   [NFTA_NG_SET_ID]= { .type = NLA_U32 },
 };
 
 static int nft_ng_inc_init(const struct nft_ctx *ctx,
@@ -71,6 +101,25 @@ static int nft_ng_inc_init(const struct nft_ctx *ctx,
   NFT_DATA_VALUE, sizeof(u32));
 }
 
+static int nft_ng_inc_map_init(const struct nft_ctx *ctx,
+  const struct nft_expr *expr,
+  const struct nlattr * const tb[])
+{
+   struct nft_ng_inc *priv = nft_expr_priv(expr);
+   u8 genmask = nft_genmask_next(ctx->net);
+
+   nft_ng_inc_init(ctx, expr, tb);
+
+   priv->map = nft_set_lookup_global(ctx->net, ctx->table,
+ tb[NFTA_NG_SET_NAME],
+ tb[NFTA_NG_SET_ID], genmask);
+
+   if (IS_ERR(priv->map))
+   return PTR_ERR(priv->map);
+
+   return 0;
+}
+
 static int nft_ng_dump(struct sk_buff *skb, enum nft_registers dreg,
   u32 modulus, enum nft_ng_types type, u32 offset)
 {
@@ -97,6 +146,22 @@ static int nft_ng_inc_dump(struct sk_buff *skb, 

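For reference, the nft_ng_inc_gen() loop refactored above is a lock-free modular counter built on compare-and-swap. A userspace sketch using C11 atomics in place of the kernel's atomic_cmpxchg() (names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free counter that wraps at `modulus`, mirroring the shape of
 * nft_ng_inc_gen(). Not kernel code. */
struct demo_ng_inc {
	uint32_t modulus;
	uint32_t offset;
	_Atomic uint32_t counter;
};

static uint32_t demo_ng_inc_gen(struct demo_ng_inc *priv)
{
	uint32_t oval, nval;

	do {
		oval = atomic_load(&priv->counter);
		nval = (oval + 1 < priv->modulus) ? oval + 1 : 0;
		/* retry if another thread advanced the counter meanwhile */
	} while (!atomic_compare_exchange_weak(&priv->counter, &oval, nval));

	return nval + priv->offset;
}
```

The CAS loop means concurrent callers never hand out the same counter value twice in a row, without taking a lock; the map variant then uses the result as a set lookup key.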
[PATCH 27/51] netfilter: nf_tables: initial support for extended ACK reporting

2018-05-06 Thread Pablo Neira Ayuso
Keep it simple to start with, just report attribute offsets that can be
useful to userspace when representating errors to users.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 299 +-
 1 file changed, 206 insertions(+), 93 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index f65e650b61aa..2f14cadd9922 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -582,8 +582,10 @@ static int nf_tables_gettable(struct net *net, struct sock 
*nlsk,
}
 
table = nft_table_lookup(net, nla[NFTA_TABLE_NAME], family, genmask);
-   if (IS_ERR(table))
+   if (IS_ERR(table)) {
+   NL_SET_BAD_ATTR(extack, nla[NFTA_TABLE_NAME]);
return PTR_ERR(table);
+   }
 
skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
if (!skb2)
@@ -699,21 +701,23 @@ static int nf_tables_newtable(struct net *net, struct 
sock *nlsk,
 {
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
-   const struct nlattr *name;
-   struct nft_table *table;
int family = nfmsg->nfgen_family;
+   const struct nlattr *attr;
+   struct nft_table *table;
u32 flags = 0;
struct nft_ctx ctx;
int err;
 
-   name = nla[NFTA_TABLE_NAME];
-   table = nft_table_lookup(net, name, family, genmask);
+   attr = nla[NFTA_TABLE_NAME];
+   table = nft_table_lookup(net, attr, family, genmask);
if (IS_ERR(table)) {
if (PTR_ERR(table) != -ENOENT)
return PTR_ERR(table);
} else {
-   if (nlh->nlmsg_flags & NLM_F_EXCL)
+   if (nlh->nlmsg_flags & NLM_F_EXCL) {
+   NL_SET_BAD_ATTR(extack, attr);
return -EEXIST;
+   }
if (nlh->nlmsg_flags & NLM_F_REPLACE)
return -EOPNOTSUPP;
 
@@ -732,7 +736,7 @@ static int nf_tables_newtable(struct net *net, struct sock 
*nlsk,
if (table == NULL)
goto err_kzalloc;
 
-   table->name = nla_strdup(name, GFP_KERNEL);
+   table->name = nla_strdup(attr, GFP_KERNEL);
if (table->name == NULL)
goto err_strdup;
 
@@ -855,8 +859,9 @@ static int nf_tables_deltable(struct net *net, struct sock 
*nlsk,
 {
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
-   struct nft_table *table;
int family = nfmsg->nfgen_family;
+   const struct nlattr *attr;
+   struct nft_table *table;
struct nft_ctx ctx;
 
	nft_ctx_init(&ctx, net, skb, nlh, 0, NULL, NULL, nla);
@@ -864,15 +869,18 @@ static int nf_tables_deltable(struct net *net, struct 
sock *nlsk,
(!nla[NFTA_TABLE_NAME] && !nla[NFTA_TABLE_HANDLE]))
return nft_flush(, family);
 
-   if (nla[NFTA_TABLE_HANDLE])
-   table = nft_table_lookup_byhandle(net, nla[NFTA_TABLE_HANDLE],
- genmask);
-   else
-   table = nft_table_lookup(net, nla[NFTA_TABLE_NAME], family,
-genmask);
+   if (nla[NFTA_TABLE_HANDLE]) {
+   attr = nla[NFTA_TABLE_HANDLE];
+   table = nft_table_lookup_byhandle(net, attr, genmask);
+   } else {
+   attr = nla[NFTA_TABLE_NAME];
+   table = nft_table_lookup(net, attr, family, genmask);
+   }
 
-   if (IS_ERR(table))
+   if (IS_ERR(table)) {
+   NL_SET_BAD_ATTR(extack, attr);
return PTR_ERR(table);
+   }
 
if (nlh->nlmsg_flags & NLM_F_NONREC &&
table->use > 0)
@@ -1164,12 +1172,16 @@ static int nf_tables_getchain(struct net *net, struct 
sock *nlsk,
}
 
table = nft_table_lookup(net, nla[NFTA_CHAIN_TABLE], family, genmask);
-   if (IS_ERR(table))
+   if (IS_ERR(table)) {
+   NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TABLE]);
return PTR_ERR(table);
+   }
 
chain = nft_chain_lookup(table, nla[NFTA_CHAIN_NAME], genmask);
-   if (IS_ERR(chain))
+   if (IS_ERR(chain)) {
+   NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_NAME]);
return PTR_ERR(chain);
+   }
 
skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
if (!skb2)
@@ -1531,9 +1543,9 @@ static int nf_tables_newchain(struct net *net, struct 
sock *nlsk,
  struct netlink_ext_ack *extack)
 {
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
-   const struct nlattr * uninitialized_var(name);
u8 genmask = nft_genmask_next(net);
int family = nfmsg->nfgen_family;
+   const struct nlattr *attr;
struct nft_table *table;
struct nft_chain *chain;
u8 policy = NF_ACCEPT;
@@ -1544,34 +1556,45 @@ static int 

[PATCH 45/51] netfilter: nft_numgen: enable hashing of one element

2018-05-06 Thread Pablo Neira Ayuso
From: Laura Garcia Liebana 

The modulus in the hash function was limited to values greater than 1,
as initially there was no point in hashing just one element.

Nevertheless, there are certain cases specially for load balancing
where this case needs to be addressed.

This patch fixes the following error.

Error: Could not process rule: Numerical result out of range
add rule ip nftlb lb01 dnat to jhash ip saddr mod 1 map { 0: 192.168.0.10 }
^^^

The solution is to force the hash to 0 when the modulus is 1.

Signed-off-by: Laura Garcia Liebana 
---
 net/netfilter/nft_hash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 24f2f7567ddb..e235c17f1b8b 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -97,7 +97,7 @@ static int nft_jhash_init(const struct nft_ctx *ctx,
priv->len = len;
 
priv->modulus = ntohl(nla_get_be32(tb[NFTA_HASH_MODULUS]));
-   if (priv->modulus <= 1)
+   if (priv->modulus < 1)
return -ERANGE;
 
if (priv->offset + priv->modulus - 1 < priv->offset)
-- 
2.11.0
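The reason modulus 1 can simply be allowed is that reducing any hash value modulo 1 always yields 0, so a one-bucket map sees a constant key (plus the configured offset). A trivial sketch of the reduction (illustrative only, not the kernel's jhash path):

```c
#include <stdint.h>

/* Reduce a hash value into [offset, offset + modulus). With
 * modulus 1 the result is always `offset` -- exactly the
 * single-element load-balancing case the patch enables. */
static uint32_t demo_hash_key(uint32_t hash, uint32_t modulus,
			      uint32_t offset)
{
	return (hash % modulus) + offset;
}
```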



[PATCH 19/51] netfilter: nf_flow_table: make flow_offload_dead inline

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

It is too trivial to keep as a separate exported function

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h | 5 -
 net/netfilter/nf_flow_table_core.c| 6 --
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index ab408adba688..5aa49524ebef 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -103,7 +103,10 @@ void nf_flow_table_cleanup(struct net *net, struct 
net_device *dev);
 int nf_flow_table_init(struct nf_flowtable *flow_table);
 void nf_flow_table_free(struct nf_flowtable *flow_table);
 
-void flow_offload_dead(struct flow_offload *flow);
+static inline void flow_offload_dead(struct flow_offload *flow)
+{
+   flow->flags |= FLOW_OFFLOAD_DYING;
+}
 
 int nf_flow_snat_port(const struct flow_offload *flow,
  struct sk_buff *skb, unsigned int thoff,
diff --git a/net/netfilter/nf_flow_table_core.c 
b/net/netfilter/nf_flow_table_core.c
index e761359b56a9..0d38f20fd226 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -113,12 +113,6 @@ void flow_offload_free(struct flow_offload *flow)
 }
 EXPORT_SYMBOL_GPL(flow_offload_free);
 
-void flow_offload_dead(struct flow_offload *flow)
-{
-   flow->flags |= FLOW_OFFLOAD_DYING;
-}
-EXPORT_SYMBOL_GPL(flow_offload_dead);
-
 static u32 flow_offload_hash(const void *data, u32 len, u32 seed)
 {
const struct flow_offload_tuple *tuple = data;
-- 
2.11.0



[PATCH 25/51] netfilter: nf_flow_table: fix offloading connections with SNAT+DNAT

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Pass all NAT types to the flow offload struct, otherwise parts of the
address/port pair do not get translated properly, causing connection
stalls

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_flow_table_core.c 
b/net/netfilter/nf_flow_table_core.c
index 0699981a8511..eb0d1658ac05 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -84,7 +84,7 @@ flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route 
*route)
 
if (ct->status & IPS_SRC_NAT)
flow->flags |= FLOW_OFFLOAD_SNAT;
-   else if (ct->status & IPS_DST_NAT)
+   if (ct->status & IPS_DST_NAT)
flow->flags |= FLOW_OFFLOAD_DNAT;
 
return flow;
-- 
2.11.0
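The one-word fix matters because SNAT and DNAT are independent bits in ct->status: with `else if`, a connection doing both only ever got the SNAT flag, so the reply direction stayed mistranslated. A minimal sketch of the corrected flag derivation (hypothetical stand-in names and values, not the kernel's):

```c
#include <stdint.h>

#define DEMO_STATUS_SRC_NAT	(1 << 0)
#define DEMO_STATUS_DST_NAT	(1 << 1)
#define DEMO_FLOW_SNAT		0x1
#define DEMO_FLOW_DNAT		0x2

/* Two independent `if` checks (the fix): a connection with both SNAT
 * and DNAT gets both offload flags. With `else if`, the DNAT flag was
 * silently dropped whenever SNAT was present. */
static uint32_t demo_flow_flags(uint32_t ct_status)
{
	uint32_t flags = 0;

	if (ct_status & DEMO_STATUS_SRC_NAT)
		flags |= DEMO_FLOW_SNAT;
	if (ct_status & DEMO_STATUS_DST_NAT)
		flags |= DEMO_FLOW_DNAT;
	return flags;
}
```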



[PATCH 23/51] netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Allow the slow path to handle the shutdown of the connection with proper
timeouts. The packet containing RST/FIN is also sent to the slow path
and the TCP conntrack module will update its state.

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_ip.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index dc570fb7641d..692c75ef5cb7 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -15,6 +15,23 @@
 #include 
 #include 
 
+static int nf_flow_tcp_state_check(struct flow_offload *flow,
+  struct sk_buff *skb, unsigned int thoff)
+{
+   struct tcphdr *tcph;
+
+   if (!pskb_may_pull(skb, thoff + sizeof(*tcph)))
+   return -1;
+
+   tcph = (void *)(skb_network_header(skb) + thoff);
+   if (unlikely(tcph->fin || tcph->rst)) {
+   flow_offload_teardown(flow);
+   return -1;
+   }
+
+   return 0;
+}
+
 static int nf_flow_nat_ip_tcp(struct sk_buff *skb, unsigned int thoff,
  __be32 addr, __be32 new_addr)
 {
@@ -119,10 +136,9 @@ static int nf_flow_dnat_ip(const struct flow_offload 
*flow, struct sk_buff *skb,
 }
 
 static int nf_flow_nat_ip(const struct flow_offload *flow, struct sk_buff *skb,
- enum flow_offload_tuple_dir dir)
+ unsigned int thoff, enum flow_offload_tuple_dir dir)
 {
struct iphdr *iph = ip_hdr(skb);
-   unsigned int thoff = iph->ihl * 4;
 
if (flow->flags & FLOW_OFFLOAD_SNAT &&
(nf_flow_snat_port(flow, skb, thoff, iph->protocol, dir) < 0 ||
@@ -202,6 +218,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
struct flow_offload *flow;
struct net_device *outdev;
const struct rtable *rt;
+   unsigned int thoff;
struct iphdr *iph;
__be32 nexthop;
 
@@ -230,8 +247,12 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
if (skb_try_make_writable(skb, sizeof(*iph)))
return NF_DROP;
 
+   thoff = ip_hdr(skb)->ihl * 4;
+   if (nf_flow_tcp_state_check(flow, skb, thoff))
+   return NF_ACCEPT;
+
if (flow->flags & (FLOW_OFFLOAD_SNAT | FLOW_OFFLOAD_DNAT) &&
-   nf_flow_nat_ip(flow, skb, dir) < 0)
+   nf_flow_nat_ip(flow, skb, thoff, dir) < 0)
return NF_DROP;
 
flow->timeout = (u32)jiffies + NF_FLOW_TIMEOUT;
@@ -439,6 +460,9 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
return NF_ACCEPT;
 
+   if (nf_flow_tcp_state_check(flow, skb, sizeof(*ip6h)))
+   return NF_ACCEPT;
+
if (skb_try_make_writable(skb, sizeof(*ip6h)))
return NF_DROP;
 
-- 
2.11.0



[PATCH 02/51] netfilter: ipvs: Keep latest weight of destination

2018-05-06 Thread Pablo Neira Ayuso
From: Inju Song 

The hashing table in scheduler such as source hash or maglev hash
should ignore the changed weight to 0 and allow changing the weight
from/to non-0 values. So, struct ip_vs_dest needs to keep the
latest non-0 weight.

Signed-off-by: Inju Song 
Signed-off-by: Julian Anastasov 
Signed-off-by: Simon Horman 
---
 include/net/ip_vs.h            | 1 +
 net/netfilter/ipvs/ip_vs_ctl.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index eb0bec043c96..0ac795b41ab8 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -668,6 +668,7 @@ struct ip_vs_dest {
volatile unsigned int   flags;  /* dest status flags */
	atomic_t		conn_flags;	/* flags to copy to conn */
	atomic_t		weight;		/* server weight */
+	atomic_t		last_weight;	/* server latest weight */
 
refcount_t  refcnt; /* reference counter */
struct ip_vs_stats  stats;  /* statistics */
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 5ebde4b15810..b91bb70ece92 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -821,6 +821,10 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct 
ip_vs_dest *dest,
if (add && udest->af != svc->af)
ipvs->mixed_address_family_dests++;
 
+   /* keep the last_weight with latest non-0 weight */
+   if (add || udest->weight != 0)
+		atomic_set(&dest->last_weight, udest->weight);
+
/* set the weight and the flags */
	atomic_set(&dest->weight, udest->weight);
conn_flags = udest->conn_flags & IP_VS_CONN_F_DEST_MASK;
-- 
2.11.0



[PATCH 05/51] ipvs: fix multiplicative hashing in sh/dh/lblc/lblcr algorithms

2018-05-06 Thread Pablo Neira Ayuso
From: Vincent Bernat 

The sh/dh/lblc/lblcr algorithms are using Knuth's multiplicative
hashing incorrectly. Replace its use by the hash_32() macro, which
correctly implements this algorithm. It doesn't use the same constant,
but it shouldn't matter.

Signed-off-by: Vincent Bernat 
Acked-by: Julian Anastasov 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/ip_vs_dh.c| 3 ++-
 net/netfilter/ipvs/ip_vs_lblc.c  | 3 ++-
 net/netfilter/ipvs/ip_vs_lblcr.c | 3 ++-
 net/netfilter/ipvs/ip_vs_sh.c| 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_dh.c b/net/netfilter/ipvs/ip_vs_dh.c
index 75f798f8e83b..07459e71d907 100644
--- a/net/netfilter/ipvs/ip_vs_dh.c
+++ b/net/netfilter/ipvs/ip_vs_dh.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/hash.h>
 
 #include 
 
@@ -81,7 +82,7 @@ static inline unsigned int ip_vs_dh_hashkey(int af, const 
union nf_inet_addr *ad
addr_fold = addr->ip6[0]^addr->ip6[1]^
addr->ip6[2]^addr->ip6[3];
 #endif
-   return (ntohl(addr_fold)*2654435761UL) & IP_VS_DH_TAB_MASK;
+   return hash_32(ntohl(addr_fold), IP_VS_DH_TAB_BITS);
 }
 
 
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 3057e453bf31..08147fc6400c 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include <linux/hash.h>
 
 /* for sysctl */
 #include 
@@ -160,7 +161,7 @@ ip_vs_lblc_hashkey(int af, const union nf_inet_addr *addr)
addr_fold = addr->ip6[0]^addr->ip6[1]^
addr->ip6[2]^addr->ip6[3];
 #endif
-   return (ntohl(addr_fold)*2654435761UL) & IP_VS_LBLC_TAB_MASK;
+   return hash_32(ntohl(addr_fold), IP_VS_LBLC_TAB_BITS);
 }
 
 
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 92adc04557ed..9b6a6c9e9cfa 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include <linux/hash.h>
 
 /* for sysctl */
 #include 
@@ -323,7 +324,7 @@ ip_vs_lblcr_hashkey(int af, const union nf_inet_addr *addr)
addr_fold = addr->ip6[0]^addr->ip6[1]^
addr->ip6[2]^addr->ip6[3];
 #endif
-   return (ntohl(addr_fold)*2654435761UL) & IP_VS_LBLCR_TAB_MASK;
+   return hash_32(ntohl(addr_fold), IP_VS_LBLCR_TAB_BITS);
 }
 
 
diff --git a/net/netfilter/ipvs/ip_vs_sh.c b/net/netfilter/ipvs/ip_vs_sh.c
index 16aaac6eedc9..1e01c782583a 100644
--- a/net/netfilter/ipvs/ip_vs_sh.c
+++ b/net/netfilter/ipvs/ip_vs_sh.c
@@ -96,7 +96,8 @@ ip_vs_sh_hashkey(int af, const union nf_inet_addr *addr,
addr_fold = addr->ip6[0]^addr->ip6[1]^
addr->ip6[2]^addr->ip6[3];
 #endif
-   return (offset + (ntohs(port) + ntohl(addr_fold))*2654435761UL) &
+   return (offset + hash_32(ntohs(port) + ntohl(addr_fold),
+IP_VS_SH_TAB_BITS)) &
IP_VS_SH_TAB_MASK;
 }
 
-- 
2.11.0
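The underlying bug: Knuth's multiplicative hash concentrates its mixing in the high bits of the product, but masking with TAB_MASK kept the low bits; hash_32() takes the top bits instead. A sketch of the two reductions side by side (this keeps the old constant 2654435761 in both variants purely to isolate the low-bits-vs-high-bits difference; the kernel's hash_32() uses a different golden-ratio constant, which, as the commit message notes, shouldn't matter):

```c
#include <stdint.h>

#define DEMO_TAB_BITS	8
#define DEMO_TAB_MASK	((1u << DEMO_TAB_BITS) - 1)
#define DEMO_GOLDEN	2654435761u	/* 0x9E3779B1, the old code's constant */

/* Old, incorrect reduction: keeps the LOW bits of the product,
 * which a multiplicative hash mixes poorly. */
static uint32_t demo_hash_old(uint32_t v)
{
	return (v * DEMO_GOLDEN) & DEMO_TAB_MASK;
}

/* hash_32()-style reduction: keeps the HIGH bits, where the
 * multiplicative hash concentrates its mixing. */
static uint32_t demo_hash_new(uint32_t v)
{
	return (v * DEMO_GOLDEN) >> (32 - DEMO_TAB_BITS);
}
```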



[PATCH 20/51] netfilter: nf_flow_table: add a new flow state for tearing down offloading

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

On cleanup, this will be treated differently from FLOW_OFFLOAD_DYING:

If FLOW_OFFLOAD_DYING is set, the connection is going away, so both the
offload state and the connection tracking entry will be deleted.

If FLOW_OFFLOAD_TEARDOWN is set, the connection remains alive, but
the offload state is torn down. This is useful for cases that require
more complex state tracking / timeout handling on TCP, or if the
connection has been idle for too long.

Support for sending flows back to the slow path will be implemented in
a following patch

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h |  2 ++
 net/netfilter/nf_flow_table_core.c| 22 ++
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index 5aa49524ebef..ba9fa4592f2b 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -68,6 +68,7 @@ struct flow_offload_tuple_rhash {
 #define FLOW_OFFLOAD_SNAT  0x1
 #define FLOW_OFFLOAD_DNAT  0x2
 #define FLOW_OFFLOAD_DYING 0x4
+#define FLOW_OFFLOAD_TEARDOWN  0x8
 
 struct flow_offload {
struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX];
@@ -103,6 +104,7 @@ void nf_flow_table_cleanup(struct net *net, struct 
net_device *dev);
 int nf_flow_table_init(struct nf_flowtable *flow_table);
 void nf_flow_table_free(struct nf_flowtable *flow_table);
 
+void flow_offload_teardown(struct flow_offload *flow);
 static inline void flow_offload_dead(struct flow_offload *flow)
 {
flow->flags |= FLOW_OFFLOAD_DYING;
diff --git a/net/netfilter/nf_flow_table_core.c 
b/net/netfilter/nf_flow_table_core.c
index 0d38f20fd226..5a81e4f771e9 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -174,6 +174,12 @@ static void flow_offload_del(struct nf_flowtable 
*flow_table,
flow_offload_free(flow);
 }
 
+void flow_offload_teardown(struct flow_offload *flow)
+{
+   flow->flags |= FLOW_OFFLOAD_TEARDOWN;
+}
+EXPORT_SYMBOL_GPL(flow_offload_teardown);
+
 struct flow_offload_tuple_rhash *
 flow_offload_lookup(struct nf_flowtable *flow_table,
struct flow_offload_tuple *tuple)
@@ -226,11 +232,6 @@ static inline bool nf_flow_has_expired(const struct 
flow_offload *flow)
return (__s32)(flow->timeout - (u32)jiffies) <= 0;
 }
 
-static inline bool nf_flow_is_dying(const struct flow_offload *flow)
-{
-   return flow->flags & FLOW_OFFLOAD_DYING;
-}
-
 static int nf_flow_offload_gc_step(struct nf_flowtable *flow_table)
 {
struct flow_offload_tuple_rhash *tuplehash;
@@ -258,7 +259,8 @@ static int nf_flow_offload_gc_step(struct nf_flowtable 
*flow_table)
flow = container_of(tuplehash, struct flow_offload, 
tuplehash[0]);
 
if (nf_flow_has_expired(flow) ||
-   nf_flow_is_dying(flow))
+   (flow->flags & (FLOW_OFFLOAD_DYING |
+   FLOW_OFFLOAD_TEARDOWN)))
flow_offload_del(flow_table, flow);
}
 out:
@@ -419,10 +421,14 @@ static void nf_flow_table_do_cleanup(struct flow_offload 
*flow, void *data)
 {
struct net_device *dev = data;
 
-   if (dev && flow->tuplehash[0].tuple.iifidx != dev->ifindex)
+   if (!dev) {
+   flow_offload_teardown(flow);
return;
+   }
 
-   flow_offload_dead(flow);
+   if (flow->tuplehash[0].tuple.iifidx == dev->ifindex ||
+   flow->tuplehash[1].tuple.iifidx == dev->ifindex)
+   flow_offload_dead(flow);
 }
 
 static void nf_flow_table_iterate_cleanup(struct nf_flowtable *flowtable,
-- 
2.11.0



[PATCH 22/51] netfilter: nf_flow_table: add support for sending flows back to the slow path

2018-05-06 Thread Pablo Neira Ayuso
From: Felix Fietkau 

Since conntrack hasn't seen any packets from the offloaded flow in a
while, and the timeout for offloaded flows is set to an extremely long
value, we need to fix up the state before we can send a flow back to the
slow path.

For TCP, reset td_maxwin in both directions, which makes it resync its
state on the next packets.

Use the regular timeout for TCP and UDP established connections.

This allows the slow path to take over again once the offload state has
been torn down

Signed-off-by: Felix Fietkau 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_flow_table_core.c | 50 +-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_flow_table_core.c 
b/net/netfilter/nf_flow_table_core.c
index ff5e17a15963..0699981a8511 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -100,6 +100,43 @@ flow_offload_alloc(struct nf_conn *ct, struct 
nf_flow_route *route)
 }
 EXPORT_SYMBOL_GPL(flow_offload_alloc);
 
+static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
+{
+   tcp->state = TCP_CONNTRACK_ESTABLISHED;
+   tcp->seen[0].td_maxwin = 0;
+   tcp->seen[1].td_maxwin = 0;
+}
+
+static void flow_offload_fixup_ct_state(struct nf_conn *ct)
+{
+   const struct nf_conntrack_l4proto *l4proto;
+   struct net *net = nf_ct_net(ct);
+   unsigned int *timeouts;
+   unsigned int timeout;
+   int l4num;
+
+   l4num = nf_ct_protonum(ct);
+   if (l4num == IPPROTO_TCP)
+		flow_offload_fixup_tcp(&ct->proto.tcp);
+
+   l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), l4num);
+   if (!l4proto)
+   return;
+
+   timeouts = l4proto->get_timeouts(net);
+   if (!timeouts)
+   return;
+
+   if (l4num == IPPROTO_TCP)
+   timeout = timeouts[TCP_CONNTRACK_ESTABLISHED];
+   else if (l4num == IPPROTO_UDP)
+   timeout = timeouts[UDP_CT_REPLIED];
+   else
+   return;
+
+   ct->timeout = nfct_time_stamp + timeout;
+}
+
 void flow_offload_free(struct flow_offload *flow)
 {
struct flow_offload_entry *e;
@@ -107,7 +144,8 @@ void flow_offload_free(struct flow_offload *flow)
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_cache);
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_cache);
e = container_of(flow, struct flow_offload_entry, flow);
-   nf_ct_delete(e->ct, 0, 0);
+   if (flow->flags & FLOW_OFFLOAD_DYING)
+   nf_ct_delete(e->ct, 0, 0);
nf_ct_put(e->ct);
kfree_rcu(e, rcu_head);
 }
@@ -164,6 +202,8 @@ EXPORT_SYMBOL_GPL(flow_offload_add);
 static void flow_offload_del(struct nf_flowtable *flow_table,
 struct flow_offload *flow)
 {
+   struct flow_offload_entry *e;
+
	rhashtable_remove_fast(&flow_table->rhashtable,
			       &flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].node,
			       nf_flow_offload_rhash_params);
@@ -171,12 +211,20 @@ static void flow_offload_del(struct nf_flowtable 
*flow_table,
			       &flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].node,
   nf_flow_offload_rhash_params);
 
+   e = container_of(flow, struct flow_offload_entry, flow);
+	clear_bit(IPS_OFFLOAD_BIT, &e->ct->status);
+
flow_offload_free(flow);
 }
 
 void flow_offload_teardown(struct flow_offload *flow)
 {
+   struct flow_offload_entry *e;
+
flow->flags |= FLOW_OFFLOAD_TEARDOWN;
+
+   e = container_of(flow, struct flow_offload_entry, flow);
+   flow_offload_fixup_ct_state(e->ct);
 }
 EXPORT_SYMBOL_GPL(flow_offload_teardown);
 
-- 
2.11.0



[PATCH 01/51] netfilter: ipvs: Fix space before '[' error.

2018-05-06 Thread Pablo Neira Ayuso
From: Arvind Yadav 

Fix checkpatch.pl error:
ERROR: space prohibited before open square bracket '['.

Signed-off-by: Arvind Yadav 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/ip_vs_proto_tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c 
b/net/netfilter/ipvs/ip_vs_proto_tcp.c
index bcd9b7bde4ee..569631d2b2a1 100644
--- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
@@ -436,7 +436,7 @@ static bool tcp_state_active(int state)
return tcp_state_active_table[state];
 }
 
-static struct tcp_states_t tcp_states [] = {
+static struct tcp_states_t tcp_states[] = {
 /* INPUT */
 /*sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA*/
 /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
@@ -459,7 +459,7 @@ static struct tcp_states_t tcp_states [] = {
 /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }},
 };
 
-static struct tcp_states_t tcp_states_dos [] = {
+static struct tcp_states_t tcp_states_dos[] = {
 /* INPUT */
 /*sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA*/
 /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSA }},
-- 
2.11.0



Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 10:17 AM, Willem de Bruijn
 wrote:
> On Sat, May 5, 2018 at 7:39 PM, Alexander Duyck
>  wrote:
>> On Sat, May 5, 2018 at 3:01 AM, Willem de Bruijn
>>  wrote:
>>> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
>>>  wrote:
 From: Alexander Duyck 

 This patch is meant to allow us to avoid having to recompute the checksum
 from scratch and have it passed as a parameter.

 Instead of taking that approach we can take advantage of the fact that the
 length that was used to compute the existing checksum is included in the
UDP header. If we cancel that out by adding the value XOR'd with 0xffff we
 can then just add the new length in and fold that into the new result.

 I think this may be fixing a checksum bug in the original code as well
 since the checksum that was passed included the UDP header in the checksum
 computation, but then excluded it for the adjustment on the last frame. I
 believe this may have an effect on things in the cases where the two differ
 by bits that would result in things crossing the byte boundaries.
>>>
>>> The replacement code, below, subtracts original payload size then adds
>>> the new payload size. mss here excludes the udp header size.
>>>
 /* last packet can be partial gso_size */
 -   if (!seg->next)
-   csum_replace2(&uh->check, htons(mss),
 - htons(seg->len - hdrlen - 
 sizeof(*uh)));
>>
>> That is my point. When you calculated your checksum you included the
>> UDP header in the calculation.
>>
>> -   return __udp_gso_segment(gso_skb, features,
>> -udp_v4_check(sizeof(struct udphdr) + mss,
>> - iph->saddr, iph->daddr, 0));
>>
>> Basically the problem is in one spot you are adding the sizeof(struct
>> udphdr) + mss and then in another you are cancelling it out as mss and
>> trying to account for it by also dropping the UDP header from the
>> payload length of the value you are adding. That works in the cases
>> where the effect doesn't cause any issues with the byte ordering,
>> however I think when mss + 8 crosses a byte boundary it can lead to
>> issues since the calculation is done on a byte swapped value.
>
> Do you mean that the issue is that the arithmetic operations
> on a __be16 in csum_replace2 may be incorrect if they exceed
> the least significant byte?
>
> csum_replace2 is used in many locations in the stack to adjust a network
> byte order csum when the payload length changes (e.g., iph->tot_len in
> inet_gro_complete).
>
> Or am I missing something specific about the udphdr calculations?

Actually it looks like the math I was applying to test this was off.

Basically the part I wasn't a fan of is the fact that we account for
the UDP header in the first calculation but not in the next. I guess
in the grand scheme of things though you are just dropping it from
both the value being removed and the value being added, so it works out
due to the fact that checksum addition is associative.

I guess I was just being too literal in my thinking. Still an
expensive way of doing this though. I'll update the patch description.

Thanks.

- Alex


Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-06 Thread NeilBrown
On Sun, May 06 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 08:00:49AM +1000, NeilBrown wrote:
>>
>> The insert function must (and does) take the lock on the bucket before
>> testing if there is a "next" table.
>> If one inserter finds that it has locked the "last" table (because there
>> is no next) and successfully inserts, then the other inserter cannot
>> have locked that table yet, else it would have inserted.  When it does,
>> it will find what the first inserter inserted. 
>
> If you release the lock to the first table then it may be deleted
> by the resize thread.  Hence the other inserter may not have even
> started from the same place.

This is true, but I don't see how it is relevant.
At some point, each thread will find that the table they have just
locked for their search key, has a NULL 'future_tbl' pointer.
At the point, the thread can know that the key is not in any table,
and that no other thread can add the key until the lock is released.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()

2018-05-06 Thread NeilBrown
On Sat, May 05 2018, Tom Herbert wrote:

> On Sat, May 5, 2018 at 2:43 AM, Herbert Xu  
> wrote:
>> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>>> rhashtable_walk_prev() returns the object returned by
>>> the previous rhashtable_walk_next(), providing it is still in the
>>> table (or was during this grace period).
>>> This works even if rhashtable_walk_stop() and rhashtable_walk_start()
>>> have been called since the last rhashtable_walk_next().
>>>
>>> If there have been no calls to rhashtable_walk_next(), or if the
>>> object is gone from the table, then NULL is returned.
>>>
>>> This can usefully be used in a seq_file ->start() function.
>>> If the pos is the same as was returned by the last ->next() call,
>>> then rhashtable_walk_prev() can be used to re-establish the
>>> current location in the table.  If it returns NULL, then
>>> rhashtable_walk_next() should be used.
>>>
>>> Signed-off-by: NeilBrown 
>>
>> I will ack this if Tom is OK with replacing peek with it.
>>
> I'm not following why this is any better than peek. The point of
> having rhashtable_walk_peek is to to allow the caller to get then
> current element not the next one. This is needed when table is read in
> multiple parts and we need to pick up with what was returned from the
> last call to rhashtable_walk_next (which apparently is what this patch
> is also trying to do).
>
> There is one significant difference in that peek will return the
> element in the table regardless of where the iterator is at (this is
> why peek can move the iterator) and only returns NULL at end of table.
> As mentioned above, rhashtable_walk_prev can return NULL, in which case
> the caller needs to fall back to rhashtable_walk_next to get the
> element. Doing a peek is a lot cleaner and a more straightforward API in
> this regard.

Thanks for the review.  I agree with a lot of what you say about the
behavior of the different implementations.
One important difference is the documentation.  The current
documentation for rhashtable_walk_peek() is wrong.   For example it says
that the function doesn't change the iterator, but sometimes it does.
The first rhashtable patch I submitted tried to fix this up, but it is
hard to document the function clearly because it really is doing one of
two different things.  It returns the previous element if it still
exists, or it returns the next one.  With my rhashtable_walk_prev(),
that can be done with
  rhashtable_walk_prev() ?: rhashtable_walk_next();

Both of these functions can easily be documented clearly.
We could combine the two as you have done, but "peek" does seem like a
good name.  "prev_or_next" is more honest, but somewhat clumsy.
Whether that is a good thing or not is partly a matter of taste, and we
seem to be on opposite sides of that fence.
There is a practical aspect to it though.

Lustre has a debugfs seq_file which shows all the cached pages of all
the cached object.  The objects are in a hashtable (which I want to
change to an rhashtable).  So the seq_file iterator has both an
rhashtable iterator an offset in the object.

When we restart a walk, we might be in the middle of some object - but
that object might have been removed from the cache, so we would need to
go on to the first page of the next object.
Using my interface I can do

 obj = rhashtable_walk_prev(_iter);
 offset = iter.offset;
 if (!obj) {
obj = rhashtable_walk_next(_iter)
offset = 0;
 }

I could achieve something similar with your interface if I kept an extra
copy of the previous object and compared with the value returned by
rhashtable_walk_peek(), but (to me) that seems like double handling.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [net-next PATCH v2 6/8] udp: Add support for software checksum and GSO_PARTIAL with GSO offload

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 2:50 PM, Willem de Bruijn
 wrote:
> On Sat, May 5, 2018 at 3:31 AM, Alexander Duyck
>  wrote:
>> From: Alexander Duyck 
>>
>> This patch adds support for a software provided checksum and GSO_PARTIAL
>> segmentation support. With this we can offload UDP segmentation on devices
>> that only have partial support for tunnels.
>>
>> Since we are no longer needing the hardware checksum we can drop the checks
>> in the segmentation code that were verifying if it was present.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>  net/ipv4/udp_offload.c |   28 ++--
>>  net/ipv6/udp_offload.c |   11 +--
>>  2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index 946d06d2aa0c..fd94bbb369b2 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -217,6 +217,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> return segs;
>> }
>>
>> +   /* GSO partial and frag_list segmentation only requires splitting
>> +* the frame into an MSS multiple and possibly a remainder, both
>> +* cases return a GSO skb. So update the mss now.
>> +*/
>> +   if (skb_is_gso(segs))
>> +   mss *= skb_shinfo(segs)->gso_segs;
>> +
>> seg = segs;
>> uh = udp_hdr(seg);
>>
>> @@ -237,6 +244,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> uh->len = newlen;
>> uh->check = check;
>>
>> +   if (seg->ip_summed == CHECKSUM_PARTIAL)
>> +   gso_reset_checksum(seg, ~check);
>> +   else
>> +   uh->check = gso_make_checksum(seg, ~check);
>
> Here and below, this needs
>
>   if (uh->check == 0)
> uh->check = CSUM_MANGLED_0;
>
> similar to __skb_udp_tunnel_segment?

Good call, though I think I might change this up a bit and do something like:
uh->check = gso_make_checksum(seg, ~check) ? : CSUM_MANGLED_0;

That way I can avoid the extra read.

Thanks.

- Alex


Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-06 Thread NeilBrown
On Sun, May 06 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 07:48:20AM +1000, NeilBrown wrote:
>>
>> The spinlock protects 2 or more buckets.  The nested table contains at
>> least 512 buckets, maybe more.
>> It is quite possible for two insertions into 2 different buckets to both
>> get their spinlock and both try to instantiate the same nested table.
>
> I think you missed the fact that when we use nested tables the spin
> lock table is limited to just a single page and hence corresponds
> to the first level in the nested table.  Therefore it's always safe.

Yes I had missed that - thanks for pointing it out.
In fact the lock table is limited to the number of nested_tables
in the second level.
And it is the same low-order bits that choose both the lock
and the set of nested tables.
So there isn't a bug here.  So we don't need this patch. (I still like
it though - it seems more obviously correct).

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [net-next PATCH v2 6/8] udp: Add support for software checksum and GSO_PARTIAL with GSO offload

2018-05-06 Thread Willem de Bruijn
On Sat, May 5, 2018 at 3:31 AM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> This patch adds support for a software provided checksum and GSO_PARTIAL
> segmentation support. With this we can offload UDP segmentation on devices
> that only have partial support for tunnels.
>
> Since we are no longer needing the hardware checksum we can drop the checks
> in the segmentation code that were verifying if it was present.
>
> Signed-off-by: Alexander Duyck 
> ---
>  net/ipv4/udp_offload.c |   28 ++--
>  net/ipv6/udp_offload.c |   11 +--
>  2 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 946d06d2aa0c..fd94bbb369b2 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -217,6 +217,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
> *gso_skb,
> return segs;
> }
>
> +   /* GSO partial and frag_list segmentation only requires splitting
> +* the frame into an MSS multiple and possibly a remainder, both
> +* cases return a GSO skb. So update the mss now.
> +*/
> +   if (skb_is_gso(segs))
> +   mss *= skb_shinfo(segs)->gso_segs;
> +
> seg = segs;
> uh = udp_hdr(seg);
>
> @@ -237,6 +244,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
> *gso_skb,
> uh->len = newlen;
> uh->check = check;
>
> +   if (seg->ip_summed == CHECKSUM_PARTIAL)
> +   gso_reset_checksum(seg, ~check);
> +   else
> +   uh->check = gso_make_checksum(seg, ~check);

Here and below, this needs

  if (uh->check == 0)
uh->check = CSUM_MANGLED_0;

similar to __skb_udp_tunnel_segment?


Re: BUG?: receiving on a packet socket with .sll_protocoll and bridging

2018-05-06 Thread Willem de Bruijn
>> > If now I add veth0 to a bridge (e.g.
>> >
>> > ip link add br0 type bridge
>> > ip link set dev veth0 master br0
>> >
>> > ) and continue to send on veth1 and receive on veth0 I don't receive
>> > the packets any more. The other direction (veth0 sending, veth1
>> > receiving) still works fine.
>> >
>> > Each of the following changes allow to
>> > receive again:
>> >
>> >  a) take veth0 out of the bridge
>> >  b) bind(2) the receiving socket to br0 instead of veth0
>> >  c) use .sll_protocol = htons(ETH_P_ALL) for bind(2)
>> >
>> > In the end only c) could be sensible (because I need to know the port
>> > the packet entered the stack and that might well be bridged), but I
>> > wonder why .sll_protocol = htons(ETH_P_MRP) seems to do the right thing
>> > for an unbridged device but not for a bridged one.
>> >
>> > Is this a bug or a feature I don't understand?
>>
>> Packets are redirected to the bridge device in __netif_receive_skb_core
>> at the rx_handler hook.
>
> OK, thanks for finding that place. It would have taken quite some of my
> time to find it.
>
>> This happens after packets are passed to packet types attached to
>> list ptype_all, which includes packet sockets with protocol ETH_P_ALL.
>> But before packets are passed to protocol specific packet types (and
>> sockets) attached to ptype_base[].
>
> Still I wonder if there is something to fix in the kernel or if this
> inconsistency is intended (or at least accepted).

It is established behavior.


Re: BUG?: receiving on a packet socket with .sll_protocoll and bridging

2018-05-06 Thread Uwe Kleine-König
Hello Willem,

On Sun, May 06, 2018 at 06:58:34PM +0200, Willem de Bruijn wrote:
> On Sat, May 5, 2018 at 10:57 AM, Uwe Kleine-König
>  wrote:
> > For testing purposes I created a veth device pair (veth0 veth1), open a
> > socket for each of the devices and send packets around between them. In
> > tcpdump a typical package looks as follows:
> >
> > 10:36:34.755208 ae:a9:da:50:48:db (oui Unknown) > 01:15:e4:00:00:01 (oui 
> > Unknown), ethertype Unknown (0x88e3), length 58:
> > 0x:  0001 0212 8000 aea9 da50 48db    .PH.
> > 0x0010:   0589 40f2 6574 6800     @.eth...
> > 0x0020:   0100 0a80 3d38 4c5e ..=8L^..
> >
> > The socket to receive these packages is opened using:
> >
> > #define ETH_P_MRP 0x88e3
> >
> > struct sockaddr_ll sa_ll = {
> > .sll_family = AF_PACKET,
> > .sll_protocol = htons(ETH_P_MRP),
> > .sll_ifindex = if_nametoindex("veth0")
> > };
> >
> > fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_MRP));
> > bind(fd, (struct sockaddr *)&sa_ll, sizeof(sa_ll));
> >
> > So far everything works fine and I can receive the packets I send.
> >
> > If now I add veth0 to a bridge (e.g.
> >
> > ip link add br0 type bridge
> > ip link set dev veth0 master br0
> >
> > ) and continue to send on veth1 and receive on veth0 I don't receive
> > the packets any more. The other direction (veth0 sending, veth1
> > receiving) still works fine.
> >
> > Each of the following changes allow to
> > receive again:
> >
> >  a) take veth0 out of the bridge
> >  b) bind(2) the receiving socket to br0 instead of veth0
> >  c) use .sll_protocol = htons(ETH_P_ALL) for bind(2)
> >
> > In the end only c) could be sensible (because I need to know the port
> > the packet entered the stack and that might well be bridged), but I
> > wonder why .sll_protocol = htons(ETH_P_MRP) seems to do the right thing
> > for an unbridged device but not for a bridged one.
> >
> > Is this a bug or a feature I don't understand?
> 
> Packets are redirected to the bridge device in __netif_receive_skb_core
> at the rx_handler hook.

OK, thanks for finding that place. It would have taken quite some of my
time to find it.

> This happens after packets are passed to packet types attached to
> list ptype_all, which includes packet sockets with protocol ETH_P_ALL.
> But before packets are passed to protocol specific packet types (and
> sockets) attached to ptype_base[].

Still I wonder if there is something to fix in the kernel or if this
inconsistency is intended (or at least accepted).

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |


Re: simplify procfs code for seq_file instances V2

2018-05-06 Thread Al Viro
On Sun, May 06, 2018 at 08:19:49PM +0300, Alexey Dobriyan wrote:
> +++ b/fs/proc/internal.h
> @@ -48,8 +48,8 @@ struct proc_dir_entry {
>   const struct seq_operations *seq_ops;
>   int (*single_show)(struct seq_file *, void *);
>   };
> - unsigned int state_size;
>   void *data;
> + unsigned int state_size;
>   unsigned int low_ino;
>   nlink_t nlink;
>   kuid_t uid;

Makes sense

> @@ -62,9 +62,9 @@ struct proc_dir_entry {
>   umode_t mode;
>   u8 namelen;
>  #ifdef CONFIG_64BIT
> -#define SIZEOF_PDE_INLINE_NAME   (192-139)
> +#define SIZEOF_PDE_INLINE_NAME   (192-155)
>  #else
> -#define SIZEOF_PDE_INLINE_NAME   (128-87)
> +#define SIZEOF_PDE_INLINE_NAME   (128-95)

>  #endif
>   char inline_name[SIZEOF_PDE_INLINE_NAME];
>  } __randomize_layout;

*UGH*

Both to the original state and that kind of "adjustments".
Incidentally, with __bugger_layout in there these expressions
are simply wrong.

If nothing else, I would suggest turning the last one into
char inline_name[];
in hope that layout won't get... randomized that much and
used

#ifdef CONFIG_64BIT
#define PDE_SIZE 192
#else
#define PDE_SIZE 128
#endif

union __proc_dir_entry {
char pad[PDE_SIZE];
struct proc_dir_entry real;
};

#define SIZEOF_PDE_INLINE_NAME (PDE_SIZE - offsetof(struct proc_dir_entry, inline_name))

for constants, adjusted sizeof and sizeof_field when creating
proc_dir_entry_cache and turned proc_root into

union __proc_dir_entry __proc_root = { .real = {
.low_ino= PROC_ROOT_INO,
.namelen= 5,
.mode   = S_IFDIR | S_IRUGO | S_IXUGO,
.nlink  = 2,
.refcnt = REFCOUNT_INIT(1),
.proc_iops  = &proc_root_inode_operations,
.proc_fops  = &proc_root_operations,
.parent = &__proc_root.real,
.subdir = RB_ROOT,
.name   = __proc_root.real.inline_name,
.inline_name= "/proc",
}};

#define proc_root __proc_root.real
(or actually used __proc_root.real in all of a 6 places where it remains).

> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index baf1994289ce..7d94fa005b0d 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -40,7 +40,7 @@ static struct net *get_proc_net(const struct inode *inode)
>  
>  static int seq_open_net(struct inode *inode, struct file *file)
>  {
> - size_t state_size = PDE(inode)->state_size;
> + unsigned int state_size = PDE(inode)->state_size;
>   struct seq_net_private *p;
>   struct net *net;


You and your "size_t is evil" crusade...


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: add PHYLINK support

2018-05-06 Thread Florian Fainelli
On May 6, 2018 10:26:37 AM PDT, Andrew Lunn  wrote:
>On Sat, May 05, 2018 at 12:04:23PM -0700, Florian Fainelli wrote:
>> From: Russell King 
>> 
>> Add rudimentary phylink support to mv88e6xxx. This allows the driver
>> using user ports with fixed links to keep operating normally. User
>ports
>> with normal PHYs are not affected since the switch automatically
>manages
>> their link parameters. User facing ports which use a SFP/SFF with a
>> non-fixed link mode might require a call to phylink_mac_change() to
>> operate properly.
>
>Hi Florian
>
>I have a regression with this patch on ZII devel B, and i think a fix.
>I'm running some more tests now. Once they pass, i will post a patch.

Hi Andrew,

Thanks for giving this a spin, let me know what the results are. Things worked
fine here with optical4 and a 1000basex media converter though that thing tends
to be finicky...
-- 
Florian


INFO: task hung in tls_push_record

2018-05-06 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:8fb11a9a8d51 net/ipv6: rename rt6_next to fib6_next
git tree:   net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=108e923780
kernel config:  https://syzkaller.appspot.com/x/.config?x=c416c61f3cd96be
dashboard link: https://syzkaller.appspot.com/bug?extid=4006516aae0b06e7050f
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4006516aae0b06e70...@syzkaller.appspotmail.com

INFO: task syz-executor7:20304 blocked for more than 120 seconds.
  Not tainted 4.17.0-rc3+ #33
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor7   D24680 20304   4547 0x0004
Call Trace:
 context_switch kernel/sched/core.c:2848 [inline]
 __schedule+0x801/0x1e30 kernel/sched/core.c:3490
 schedule+0xef/0x430 kernel/sched/core.c:3549
 schedule_timeout+0x1b5/0x240 kernel/time/timer.c:1777
 do_wait_for_common kernel/sched/completion.c:83 [inline]
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x3e7/0x870 kernel/sched/completion.c:136
 crypto_wait_req include/linux/crypto.h:512 [inline]
 tls_do_encryption net/tls/tls_sw.c:217 [inline]
 tls_push_record+0xedc/0x13e0 net/tls/tls_sw.c:248
 tls_sw_sendmsg+0x8d7/0x12b0 net/tls/tls_sw.c:440
 inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 sock_write_iter+0x35a/0x5a0 net/socket.c:908
 call_write_iter include/linux/fs.h:1784 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x64d/0x960 fs/read_write.c:487
 vfs_write+0x1f8/0x560 fs/read_write.c:549
 ksys_write+0xf9/0x250 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455979
RSP: 002b:7fad08582c68 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 7fad085836d4 RCX: 00455979
RDX: 0050 RSI: 2280 RDI: 0013
RBP: 0072bea0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 0713 R14: 006fea68 R15: 

Showing all locks held in the system:
2 locks held by khungtaskd/892:
 #0: 3f978916 (rcu_read_lock){}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
 #0: 3f978916 (rcu_read_lock){}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
 #1: a6e1e84d (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470

2 locks held by getty/4466:
 #0: bb90ee4c (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: 5c64e739 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4467:
 #0: a703ee54 (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: c6bc54dc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4468:
 #0: 7e39712e (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: 3afa8b0a (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4469:
 #0: 4a2f1f14 (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: a9bb6673 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4470:
 #0: 5c9ac5a5 (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: e940f7ee (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4471:
 #0: b0318201 (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: faa92852 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

2 locks held by getty/4472:
 #0: 2f556699 (&tty->ldisc_sem){}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
 #1: c5b4fb47 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131

1 lock held by syz-executor7/20304:
 #0: 1da4f4a9 (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1474 [inline]
 #0: 1da4f4a9 (sk_lock-AF_INET6){+.+.}, at:  
tls_sw_sendmsg+0x1b9/0x12b0 net/tls/tls_sw.c:384

1 lock held by syz-executor7/20375:
 #0: 286d2e23 (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1474 [inline]
 #0: 286d2e23 

Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: add PHYLINK support

2018-05-06 Thread Andrew Lunn
On Sat, May 05, 2018 at 12:04:23PM -0700, Florian Fainelli wrote:
> From: Russell King 
> 
> Add rudimentary phylink support to mv88e6xxx. This allows the driver
> using user ports with fixed links to keep operating normally. User ports
> with normal PHYs are not affected since the switch automatically manages
> their link parameters. User facing ports which use a SFP/SFF with a
> non-fixed link mode might require a call to phylink_mac_change() to
> operate properly.

Hi Florian

I have a regression with this patch on ZII devel B, and i think a fix.
I'm running some more tests now. Once they pass, i will post a patch.

Andrew


Re: simplify procfs code for seq_file instances V2

2018-05-06 Thread Alexey Dobriyan
On Wed, Apr 25, 2018 at 05:47:47PM +0200, Christoph Hellwig wrote:
> Changes since V1:
>  - open code proc_create_data to avoid setting not fully initialized
>entries live
>  - use unsigned int for state_size

Need this to maintain sizeof(struct proc_dir_entry):

Otherwise ACK fs/proc/ part.

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6d171485c45b..a318ae5b36b4 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -48,8 +48,8 @@ struct proc_dir_entry {
const struct seq_operations *seq_ops;
int (*single_show)(struct seq_file *, void *);
};
-   unsigned int state_size;
void *data;
+   unsigned int state_size;
unsigned int low_ino;
nlink_t nlink;
kuid_t uid;
@@ -62,9 +62,9 @@ struct proc_dir_entry {
umode_t mode;
u8 namelen;
 #ifdef CONFIG_64BIT
-#define SIZEOF_PDE_INLINE_NAME (192-139)
+#define SIZEOF_PDE_INLINE_NAME (192-155)
 #else
-#define SIZEOF_PDE_INLINE_NAME (128-87)
+#define SIZEOF_PDE_INLINE_NAME (128-95)
 #endif
char inline_name[SIZEOF_PDE_INLINE_NAME];
 } __randomize_layout;
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index baf1994289ce..7d94fa005b0d 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -40,7 +40,7 @@ static struct net *get_proc_net(const struct inode *inode)
 
 static int seq_open_net(struct inode *inode, struct file *file)
 {
-   size_t state_size = PDE(inode)->state_size;
+   unsigned int state_size = PDE(inode)->state_size;
struct seq_net_private *p;
struct net *net;
 


Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-06 Thread Willem de Bruijn
On Sat, May 5, 2018 at 7:39 PM, Alexander Duyck
 wrote:
> On Sat, May 5, 2018 at 3:01 AM, Willem de Bruijn
>  wrote:
>> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
>>  wrote:
>>> From: Alexander Duyck 
>>>
>>> This patch is meant to allow us to avoid having to recompute the checksum
>>> from scratch and have it passed as a parameter.
>>>
>>> Instead of taking that approach we can take advantage of the fact that the
>>> length that was used to compute the existing checksum is included in the
>>> UDP header. If we cancel that out by adding the value XOR'd with 0xffff we
>>> can then just add the new length in and fold that into the new result.
>>>
>>> I think this may be fixing a checksum bug in the original code as well
>>> since the checksum that was passed included the UDP header in the checksum
>>> computation, but then excluded it for the adjustment on the last frame. I
>>> believe this may have an effect on things in the cases where the two differ
>>> by bits that would result in things crossing the byte boundaries.
>>
>> The replacement code, below, subtracts original payload size then adds
>> the new payload size. mss here excludes the udp header size.
>>
>>> /* last packet can be partial gso_size */
>>> -   if (!seg->next)
>>> -   csum_replace2(&uh->check, htons(mss),
>>> - htons(seg->len - hdrlen - 
>>> sizeof(*uh)));
>
> That is my point. When you calculated your checksum you included the
> UDP header in the calculation.
>
> -   return __udp_gso_segment(gso_skb, features,
> -udp_v4_check(sizeof(struct udphdr) + mss,
> - iph->saddr, iph->daddr, 0));
>
> Basically the problem is in one spot you are adding the sizeof(struct
> udphdr) + mss and then in another you are cancelling it out as mss and
> trying to account for it by also dropping the UDP header from the
> payload length of the value you are adding. That works in the cases
> where the effect doesn't cause any issues with the byte ordering,
> however I think when mss + 8 crosses a byte boundary it can lead to
> issues since the calculation is done on a byte swapped value.

Do you mean that the issue is that the arithmetic operations
on a __be16 in csum_replace2 may be incorrect if they exceed
the least significant byte?

csum_replace2 is used in many locations in the stack to adjust a network
byte order csum when the payload length changes (e.g., iph->tot_len in
inet_gro_complete).

Or am I missing something specific about the udphdr calculations?


Re: BUG?: receiving on a packet socket with .sll_protocoll and bridging

2018-05-06 Thread Willem de Bruijn
On Sat, May 5, 2018 at 10:57 AM, Uwe Kleine-König
 wrote:
> Hello,
>
> my eventual goal is to implement MRP and for that I started to program a
> bit and stumbled over a problem I don't understand.
>
> For testing purposes I created a veth device pair (veth0 veth1), open a
> socket for each of the devices and send packets around between them. In
> tcpdump a typical package looks as follows:
>
> 10:36:34.755208 ae:a9:da:50:48:db (oui Unknown) > 01:15:e4:00:00:01 (oui 
> Unknown), ethertype Unknown (0x88e3), length 58:
> 0x:  0001 0212 8000 aea9 da50 48db    .PH.
> 0x0010:   0589 40f2 6574 6800     @.eth...
> 0x0020:   0100 0a80 3d38 4c5e ..=8L^..
>
> The socket to receive these packages is opened using:
>
> #define ETH_P_MRP 0x88e3
>
> struct sockaddr_ll sa_ll = {
> .sll_family = AF_PACKET,
> .sll_protocol = htons(ETH_P_MRP),
> .sll_ifindex = if_nametoindex("veth0")
> };
>
> fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_MRP));
> bind(fd, (struct sockaddr *)&sa_ll, sizeof(sa_ll));
>
> So far everything works fine and I can receive the packets I send.
>
> If now I add veth0 to a bridge (e.g.
>
> ip link add br0 type bridge
> ip link set dev veth0 master br0
>
> ) and continue to send on veth1 and receive on veth0 I don't receive
> the packets any more. The other direction (veth0 sending, veth1
> receiving) still works fine.
>
> Each of the following changes allows me to receive again:
>
>  a) take veth0 out of the bridge
>  b) bind(2) the receiving socket to br0 instead of veth0
>  c) use .sll_protocol = htons(ETH_P_ALL) for bind(2)
>
> In the end only c) could be sensible (because I need to know the port
> the packet entered the stack and that might well be bridged), but I
> wonder why .sll_protocol = htons(ETH_P_MRP) seems to do the right thing
> for an unbridged device but not for a bridged one.
>
> Is this a bug or a feature I don't understand?

Packets are redirected to the bridge device in __netif_receive_skb_core
at the rx_handler hook.

This happens after packets are passed to the packet types attached to
the ptype_all list, which includes packet sockets bound with protocol
ETH_P_ALL, but before packets are passed to protocol-specific packet
types (and sockets) attached to ptype_base[].


Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-05-06 Thread Richard Guy Briggs
On 2018-04-18 19:47, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  wrote:
> > Implement the proc fs write to set the audit container ID of a process,
> > emitting an AUDIT_CONTAINER record to document the event.
> >
> > This is a write from the container orchestrator task to a proc entry of
> > the form /proc/PID/containerid where PID is the process ID of the newly
> > created task that is to become the first task in a container, or an
> > additional task added to a container.
> >
> > The write expects up to a u64 value (unset: 18446744073709551615).
> >
> > This will produce a record such as this:
> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 
> > ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> >
> > The "op" field indicates an initial set.  The "pid" to "ses" fields
> > describe the orchestrator, while the "opid" field is the object's PID, the
> > process being "contained".  Old and new container ID values are given in
> > the "contid" fields, while "res" indicates success.
> >
> > It is not permitted to self-set, unset or re-set the container ID.  A
> > child inherits its parent's container ID, but then can be set only once
> > after.
> >
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >
> > Signed-off-by: Richard Guy Briggs 
> > ---
> >  fs/proc/base.c | 37 
> >  include/linux/audit.h  | 16 +
> >  include/linux/init_task.h  |  4 ++-
> >  include/linux/sched.h  |  1 +
> >  include/uapi/linux/audit.h |  2 ++
> >  kernel/auditsc.c   | 84 
> > ++
> >  6 files changed, 143 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 60316b5..6ce4fbe 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1299,6 +1299,41 @@ static ssize_t proc_sessionid_read(struct file * 
> > file, char __user * buf,
> > .read   = proc_sessionid_read,
> > .llseek = generic_file_llseek,
> >  };
> > +
> > +static ssize_t proc_containerid_write(struct file *file, const char __user 
> > *buf,
> > +  size_t count, loff_t *ppos)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   u64 containerid;
> > +   int rv;
> > +   struct task_struct *task = get_proc_task(inode);
> > +
> > +   if (!task)
> > +   return -ESRCH;
> > +   if (*ppos != 0) {
> > +   /* No partial writes. */
> > +   put_task_struct(task);
> > +   return -EINVAL;
> > +   }
> > +
> > +   rv = kstrtou64_from_user(buf, count, 10, &containerid);
> > +   if (rv < 0) {
> > +   put_task_struct(task);
> > +   return rv;
> > +   }
> > +
> > +   rv = audit_set_containerid(task, containerid);
> > +   put_task_struct(task);
> > +   if (rv < 0)
> > +   return rv;
> > +   return count;
> > +}
> > +
> > +static const struct file_operations proc_containerid_operations = {
> > +   .write  = proc_containerid_write,
> > +   .llseek = generic_file_llseek,
> > +};
> > +
> >  #endif
> >
> >  #ifdef CONFIG_FAULT_INJECTION
> > @@ -2961,6 +2996,7 @@ static int proc_pid_patch_state(struct seq_file *m, 
> > struct pid_namespace *ns,
> >  #ifdef CONFIG_AUDITSYSCALL
> > REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +   REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3355,6 +3391,7 @@ static int proc_tid_comm_permission(struct inode 
> > *inode, int mask)
> >  #ifdef CONFIG_AUDITSYSCALL
> > REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +   REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),

...

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d258826..1b82191 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -796,6 +796,7 @@ struct task_struct {
> >  #ifdef CONFIG_AUDITSYSCALL
> > kuid_t  loginuid;
> > unsigned intsessionid;
> > +   u64 containerid;
> 
> This one line addition to the task_struct scares me the most of
> anything in this patchset.  Why?  It's a field named "containerid" in
> a perhaps one of the most widely used core kernel structures; the
> possibilities for abuse are endless, and it's foolish to think we
> would 

Re: Locking in network code

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 6:43 AM, Jacob S. Moroni  wrote:
> Hello,
>
> I have a stupid question regarding which variant of spin_lock to use
> throughout the network stack, and inside RX handlers specifically.
>
> It's my understanding that skbuffs are normally passed into the stack
> from soft IRQ context if the device is using NAPI, and hard IRQ
> context if it's not using NAPI (and I guess process context too if the
> driver does its own workqueue thing).
>
> So, that means that handlers registered with netdev_rx_handler_register
> may end up being called from any context.

I am pretty sure the Rx handlers are all called from softirq context.
The hard IRQ will just call netif_rx, which queues the packet up to
be handled in the soft IRQ later.

> However, the RX handler in the macvlan code calls ip_check_defrag,
> which could eventually lead to a call to ip_defrag, which ends
> up taking a regular spin_lock around the call to ip_frag_queue.
>
> Is this a risk of deadlock, and if not, why?
>
> What if you're running a system with one CPU and a packet fragment
> arrives on a NAPI interface, then, while the spin_lock is held,
> another fragment somehow arrives on another interface which does
> its processing in hard IRQ context?
>
> --
>   Jacob S. Moroni
>   m...@jakemoroni.com

Take a look at the netif_rx code and it should answer most of your
questions. Basically everything is handed off from the hard IRQ to the
soft IRQ via a backlog queue and then handled in net_rx_action.

- Alex


[PATCH bpf-next v3 4/6] bpf: Split lwt inout verifier structures

2018-05-06 Thread Mathieu Xhonneux
The new bpf_lwt_push_encap helper should only be accessible within the
LWT BPF IN hook, and not the OUT one, as calling it there may leave the
skb in a state that leads to a kernel panic.

At the moment, both LWT BPF IN and OUT share the same list of helpers,
whose calls are authorized by the verifier. This patch separates the
verifier ops for the IN and OUT hooks, and allows the IN hook to call the
bpf_lwt_push_encap helper.

This patch is also the occasion to put all lwt_*_func_proto functions
together for clarity. At the moment, sock_ops_func_proto sits in the middle
of lwt_inout_func_proto and lwt_xmit_func_proto.

Signed-off-by: Mathieu Xhonneux 
Acked-by: David Lebrun 
---
 include/linux/bpf_types.h |  4 +--
 net/core/filter.c | 83 +--
 2 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b323082..cc9d7e031330 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -9,8 +9,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 5a0c03ec22ac..2aa83e0f40ce 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4448,33 +4448,6 @@ xdp_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
}
 }
 
-static const struct bpf_func_proto *
-lwt_inout_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
-{
-   switch (func_id) {
-   case BPF_FUNC_skb_load_bytes:
-   return &bpf_skb_load_bytes_proto;
-   case BPF_FUNC_skb_pull_data:
-   return &bpf_skb_pull_data_proto;
-   case BPF_FUNC_csum_diff:
-   return &bpf_csum_diff_proto;
-   case BPF_FUNC_get_cgroup_classid:
-   return &bpf_get_cgroup_classid_proto;
-   case BPF_FUNC_get_route_realm:
-   return &bpf_get_route_realm_proto;
-   case BPF_FUNC_get_hash_recalc:
-   return &bpf_get_hash_recalc_proto;
-   case BPF_FUNC_perf_event_output:
-   return &bpf_skb_event_output_proto;
-   case BPF_FUNC_get_smp_processor_id:
-   return &bpf_get_smp_processor_id_proto;
-   case BPF_FUNC_skb_under_cgroup:
-   return &bpf_skb_under_cgroup_proto;
-   default:
-   return bpf_base_func_proto(func_id);
-   }
-}
-
 static const struct bpf_func_proto *
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4534,6 +4507,44 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
}
 }
 
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   case BPF_FUNC_skb_pull_data:
+   return &bpf_skb_pull_data_proto;
+   case BPF_FUNC_csum_diff:
+   return &bpf_csum_diff_proto;
+   case BPF_FUNC_get_cgroup_classid:
+   return &bpf_get_cgroup_classid_proto;
+   case BPF_FUNC_get_route_realm:
+   return &bpf_get_route_realm_proto;
+   case BPF_FUNC_get_hash_recalc:
+   return &bpf_get_hash_recalc_proto;
+   case BPF_FUNC_perf_event_output:
+   return &bpf_skb_event_output_proto;
+   case BPF_FUNC_get_smp_processor_id:
+   return &bpf_get_smp_processor_id_proto;
+   case BPF_FUNC_skb_under_cgroup:
+   return &bpf_skb_under_cgroup_proto;
+   default:
+   return bpf_base_func_proto(func_id);
+   }
+}
+
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+   switch (func_id) {
+   case BPF_FUNC_lwt_push_encap:
+   return &bpf_lwt_push_encap_proto;
+   default:
+   return lwt_out_func_proto(func_id, prog);
+   }
+}
+
 static const struct bpf_func_proto *
 lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4565,7 +4576,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
case BPF_FUNC_set_hash_invalid:
return &bpf_set_hash_invalid_proto;
default:
-   return lwt_inout_func_proto(func_id, prog);
+   return lwt_out_func_proto(func_id, prog);
}
 }
 
@@ -6131,13 +6142,23 @@ const struct bpf_prog_ops cg_skb_prog_ops = {
.test_run   = bpf_prog_test_run_skb,
 };
 
-const struct bpf_verifier_ops 

[PATCH bpf-next v3 1/6] ipv6: sr: make seg6.h includable without IPv6

2018-05-06 Thread Mathieu Xhonneux
include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
   include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
'ipv6'; did you mean 'ipv4'?
 return net->ipv6.seg6_data;
 ^~~~
 ipv4

This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
allowing seg6.h to be included regardless of the configuration.

Signed-off-by: Mathieu Xhonneux 
---
 include/net/seg6.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 099bad59dc90..70b4cfac52d7 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -49,7 +49,11 @@ struct seg6_pernet_data {
 
 static inline struct seg6_pernet_data *seg6_pernet(struct net *net)
 {
+#if IS_ENABLED(CONFIG_IPV6)
return net->ipv6.seg6_data;
+#else
+   return NULL;
+#endif
 }
 
 extern int seg6_init(void);
-- 
2.16.1



[PATCH bpf-next v3 2/6] ipv6: sr: export function lookup_nexthop

2018-05-06 Thread Mathieu Xhonneux
The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper that can apply some of these
actions to the packet being processed, the helper must be able to call
this function, hence the need to make it public.

Moreover, if one argument is incorrect or if the next hop cannot be found,
an error should be returned by the BPF helper so that the BPF program can
adapt its processing of the packet (return an error, force the drop, ...).
This patch hence makes the function return dst->error to indicate a
possible error.

Signed-off-by: Mathieu Xhonneux 
Acked-by: David Lebrun 
---
 include/net/seg6.h   |  3 ++-
 include/net/seg6_local.h | 24 
 net/ipv6/seg6_local.c| 20 +++-
 3 files changed, 37 insertions(+), 10 deletions(-)
 create mode 100644 include/net/seg6_local.h

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 70b4cfac52d7..e029e301faa5 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -67,5 +67,6 @@ extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int 
len);
 extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh,
 int proto);
 extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
-
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+  u32 tbl_id);
 #endif
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
new file mode 100644
index ..57498b23085d
--- /dev/null
+++ b/include/net/seg6_local.h
@@ -0,0 +1,24 @@
+/*
+ *  SR-IPv6 implementation
+ *
+ *  Authors:
+ *  David Lebrun 
+ *  eBPF support: Mathieu Xhonneux 
+ *
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NET_SEG6_LOCAL_H
+#define _NET_SEG6_LOCAL_H
+
+#include 
+#include 
+
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+  u32 tbl_id);
+
+#endif
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 45722327375a..e9b23fb924ad 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -30,6 +30,7 @@
 #ifdef CONFIG_IPV6_SEG6_HMAC
 #include 
 #endif
+#include 
 #include 
 
 struct seg6_local_lwt;
@@ -140,8 +141,8 @@ static void advance_nextseg(struct ipv6_sr_hdr *srh, struct 
in6_addr *daddr)
*daddr = *addr;
 }
 
-static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
-  u32 tbl_id)
+int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+   u32 tbl_id)
 {
struct net *net = dev_net(skb->dev);
struct ipv6hdr *hdr = ipv6_hdr(skb);
@@ -187,6 +188,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct 
in6_addr *nhaddr,
 
skb_dst_drop(skb);
skb_dst_set(skb, dst);
+   return dst->error;
 }
 
 /* regular endpoint function */
@@ -200,7 +202,7 @@ static int input_action_end(struct sk_buff *skb, struct 
seg6_local_lwt *slwt)
 
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-   lookup_nexthop(skb, NULL, 0);
+   seg6_lookup_nexthop(skb, NULL, 0);
 
return dst_input(skb);
 
@@ -220,7 +222,7 @@ static int input_action_end_x(struct sk_buff *skb, struct 
seg6_local_lwt *slwt)
 
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-   lookup_nexthop(skb, &slwt->nh6, 0);
+   seg6_lookup_nexthop(skb, &slwt->nh6, 0);
 
return dst_input(skb);
 
@@ -239,7 +241,7 @@ static int input_action_end_t(struct sk_buff *skb, struct 
seg6_local_lwt *slwt)
 
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-   lookup_nexthop(skb, NULL, slwt->table);
+   seg6_lookup_nexthop(skb, NULL, slwt->table);
 
return dst_input(skb);
 
@@ -331,7 +333,7 @@ static int input_action_end_dx6(struct sk_buff *skb,
if (!ipv6_addr_any(&slwt->nh6))
nhaddr = &slwt->nh6;
 
-   lookup_nexthop(skb, nhaddr, 0);
+   seg6_lookup_nexthop(skb, nhaddr, 0);
 
return dst_input(skb);
 drop:
@@ -380,7 +382,7 @@ static int input_action_end_dt6(struct sk_buff *skb,
if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
goto drop;
 
-   lookup_nexthop(skb, NULL, slwt->table);
+   seg6_lookup_nexthop(skb, NULL, slwt->table);
 
return dst_input(skb);
 
@@ -406,7 +408,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct 
seg6_local_lwt *slwt)
ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-   lookup_nexthop(skb, NULL, 0);
+   seg6_lookup_nexthop(skb, NULL, 0);
 
return dst_input(skb);

[PATCH bpf-next v3 6/6] selftests/bpf: test for seg6local End.BPF action

2018-05-06 Thread Mathieu Xhonneux
Add a new test for the seg6local End.BPF action. The following helpers
are also tested :

- bpf_lwt_push_encap within the LWT BPF IN hook
- bpf_lwt_seg6_action
- bpf_lwt_seg6_adjust_srh
- bpf_lwt_seg6_store_bytes

A chain of End.BPF actions is built. The SRH is injected through a LWT
BPF IN hook before the chain. Each End.BPF action validates the previous
one, otherwise the packet is dropped.
The test succeeds if the last node in the chain receives the packet and
the UDP datagram it contains can be retrieved from userspace.

Signed-off-by: Mathieu Xhonneux 
---
 tools/include/uapi/linux/bpf.h|  97 -
 tools/testing/selftests/bpf/Makefile  |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++
 5 files changed, 689 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 83a95ae388dd..8c42297bf117 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
@@ -138,6 +139,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+   BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
@@ -1825,6 +1827,89 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * Description
+ * Encapsulate the packet associated to *skb* within a Layer 3
+ * protocol header. This header is provided in the buffer at
+ * address *hdr*, with *len* its size in bytes. *type* indicates
+ * the protocol of the header and can be one of:
+ *
+ * **BPF_LWT_ENCAP_SEG6**
+ * IPv6 encapsulation with Segment Routing Header
+ * (**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ * the IPv6 header is computed by the kernel.
+ * **BPF_LWT_ENCAP_SEG6_INLINE**
+ * Only works if *skb* contains an IPv6 packet. Insert a
+ * Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ * the IPv6 header.
+ *
+ * A call to this helper is susceptible to change the underlying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void 
*from, u32 len)
+ * Description
+ * Store *len* bytes from address *from* into the packet
+ * associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ * inside the outermost IPv6 Segment Routing Header can be
+ * modified through this helper.
+ *
+ * A call to this helper is susceptible to change the underlying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ * Description
+ * Adjust the size allocated to TLVs in the outermost IPv6
+ * Segment Routing Header contained in the packet associated to
+ * *skb*, at position *offset* by *delta* bytes. Only offsets
+ * after the segments are accepted. *delta* can be as well
+ * positive (growing) as negative (shrinking).
+ *
+ * A call to this helper is susceptible to change the underlying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 
param_len)
+ * Description
+ * 

[PATCH bpf-next v3 3/6] bpf: Add IPv6 Segment Routing helpers

2018-05-06 Thread Mathieu Xhonneux
The BPF seg6local hook should be powerful enough to enable users to
implement most of the use-cases one could think of. After some thinking,
we figured out that the following actions should be possible on a SRv6
packet, requiring 3 specific helpers:
- bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
- bpf_lwt_seg6_adjust_srh: Grow or shrink a SRH
   (to add/delete TLVs)
- bpf_lwt_seg6_action: Apply some SRv6 network programming actions
   (specifically End.X, End.T, End.B6 and
End.B6.Encap)

The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).

The non-sensitive fields of the SRH are the following: flags, tag and
TLVs. The other fields cannot be modified, to maintain the SRH
integrity. Flags, tag and TLVs can easily be modified as their validity
can be checked afterwards via seg6_validate_srh. It is not allowed to
modify the segments directly. If one wants to add segments on the path,
they should stack a new SRH using the End.B6 action via
bpf_lwt_seg6_action.

Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.

Storing the SRH len in bytes in the control block is mandatory when using
bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-byte
boundary). When adding/deleting TLVs within the BPF program, the SRH may
temporarily be in an invalid state where its length cannot be rounded to 8
bytes without remainder, hence the need to store the length in bytes
separately. The caller of the BPF program can then ensure that the SRH's
final length is valid using this value. Again, a final SRH modified by a
BPF program which doesn’t respect the 8-byte boundary will be discarded,
as it will be considered invalid.

Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper allows encapsulating a Segment Routing Header (either with
a new outer IPv6 header, or by inlining it directly in the existing IPv6
header) into a non-SRv6 packet. This helper is required if we want to
offer the possibility to dynamically encapsulate a SRH for non-SRv6 packets,
as the BPF seg6local hook only works on traffic already containing a SRH.
This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
the same purpose but with a static SRH per route.

Signed-off-by: Mathieu Xhonneux 
Acked-by: David Lebrun 
---
 include/net/seg6_local.h |   8 ++
 include/uapi/linux/bpf.h |  95 +++-
 net/core/filter.c| 282 +++
 net/ipv6/seg6_local.c|   2 +
 4 files changed, 363 insertions(+), 24 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 57498b23085d..661fd5b4d3e0 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -15,10 +15,18 @@
 #ifndef _NET_SEG6_LOCAL_H
 #define _NET_SEG6_LOCAL_H
 
+#include 
 #include 
 #include 
 
 extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
   u32 tbl_id);
 
+struct seg6_bpf_srh_state {
+   bool valid;
+   u16 hdrlen;
+};
+
+DECLARE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4eeec2a..df14a31500eb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1826,6 +1826,89 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * Description
+ * Encapsulate the packet associated to *skb* within a Layer 3
+ * protocol header. This header is provided in the buffer at
+ * address *hdr*, with *len* its size in bytes. *type* indicates
+ * the protocol of the header and can be one of:
+ *
+ * **BPF_LWT_ENCAP_SEG6**
+ * IPv6 encapsulation with Segment Routing Header
+ * (**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ * the IPv6 header is computed by the kernel.
+ * **BPF_LWT_ENCAP_SEG6_INLINE**
+ * Only works if *skb* contains an IPv6 packet. Insert a
+ * Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ * the IPv6 header.
+ *
+ * A call to this helper is susceptible to change the underlying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the 

[PATCH bpf-next v3 0/6] ipv6: sr: introduce seg6local End.BPF action

2018-05-06 Thread Mathieu Xhonneux
As of Linux 4.14, it is possible to define advanced local processing for
IPv6 packets with a Segment Routing Header through the seg6local LWT
infrastructure. This LWT implements the network programming principles
defined in the IETF “SRv6 Network Programming” draft.

The implemented operations are generic, and it would be very interesting to
be able to implement user-specific seg6local actions, without having to
modify the kernel directly. To do so, this patchset adds an End.BPF action
to seg6local, powered by some specific Segment Routing-related helpers,
which provide SR functionalities that can be applied on the packet. This
BPF hook would then allow implementing specific actions at native kernel
speed such as OAM features, advanced SR SDN policies, SRv6 actions like
Segment Routing Header (SRH) encapsulation depending on the content of
the packet, etc ... 

This patchset is divided in 6 patches, whose main features are :

- A new seg6local action End.BPF with the corresponding new BPF program
  type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
  passed to the LWT seg6local through netlink, the same way as the LWT
  BPF hook operates.
- 3 new BPF helpers for the seg6local BPF hook, to edit/grow/shrink a
  SRH and apply some of the generic SRv6 actions to the packet.
- 1 new BPF helper for the LWT BPF IN hook, to add a SRH through
  encapsulation (via IPv6 encapsulation, or inlining if the packet
  already contains an IPv6 header).

As this patchset adds a new LWT BPF hook, I took into account the result of
the discussions when the LWT BPF infrastructure got merged. Hence, the
seg6local BPF hook doesn’t allow write access to skb->data directly, only
the SRH can be modified through specific helpers, which ensures that the
integrity of the packet is maintained.
More details are available in the related patches messages.

The performances of this BPF hook have been assessed with the BPF JIT
enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
clocked at 2.53 GHz. No throughput losses are noted with the seg6local
BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
drops the throughput to 410kpps, and inlining a SRH via
bpf_lwt_seg6_action drops the throughput to 420kpps.
All throughputs are stable.

---
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
- fix kbuild bugs
- un-break BPF LWT out hook
- bpf_push_seg6_encap is now static
- preempt_enable is now called when the packet is dropped in
  input_action_end_bpf

Thanks.


Mathieu Xhonneux (6):
  ipv6: sr: make seg6.h includable without IPv6
  ipv6: sr: export function lookup_nexthop
  bpf: Add IPv6 Segment Routing helpers
  bpf: Split lwt inout verifier structures
  ipv6: sr: Add seg6local action End.BPF
  selftests/bpf: test for seg6local End.BPF action

 include/linux/bpf_types.h |   7 +-
 include/net/seg6.h|   7 +-
 include/net/seg6_local.h  |  32 ++
 include/uapi/linux/bpf.h  |  96 -
 include/uapi/linux/seg6_local.h   |   3 +
 kernel/bpf/verifier.c |   1 +
 net/core/filter.c | 390 ---
 net/ipv6/seg6_local.c | 180 -
 tools/include/uapi/linux/bpf.h|  97 -
 tools/testing/selftests/bpf/Makefile  |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++
 13 files changed, 1335 insertions(+), 73 deletions(-)
 create mode 100644 include/net/seg6_local.h
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

-- 
2.16.1



[PATCH bpf-next v3 5/6] ipv6: sr: Add seg6local action End.BPF

2018-05-06 Thread Mathieu Xhonneux
This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.

Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.

Performance profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed invalid, the packet
is dropped.

This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.

The BPF program may return three return codes:
- BPF_OK: the End.BPF action will look up the next destination through
  seg6_lookup_nexthop.
- BPF_REDIRECT: if an action has been executed through the
  bpf_lwt_seg6_action helper, the BPF program should return this
  value, as the skb's destination is already set and the default
  lookup should not be performed.
- BPF_DROP: the packet will be dropped.

Signed-off-by: Mathieu Xhonneux 
Acked-by: David Lebrun 
---
 include/linux/bpf_types.h   |   3 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/seg6_local.h |   3 +
 kernel/bpf/verifier.c   |   1 +
 net/core/filter.c   |  25 +++
 net/ipv6/seg6_local.c   | 158 +++-
 6 files changed, 188 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cc9d7e031330..5b732bfff8a3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -12,6 +12,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
+#ifdef CONFIG_IPV6_SEG6_LWTUNNEL
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_SEG6LOCAL, lwt_seg6local)
+#endif
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df14a31500eb..8c42297bf117 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -139,6 +139,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+   BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
index ef2d8c3e76c1..aadcc11fb918 100644
--- a/include/uapi/linux/seg6_local.h
+++ b/include/uapi/linux/seg6_local.h
@@ -25,6 +25,7 @@ enum {
SEG6_LOCAL_NH6,
SEG6_LOCAL_IIF,
SEG6_LOCAL_OIF,
+   SEG6_LOCAL_BPF,
__SEG6_LOCAL_MAX,
 };
 #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1)
@@ -59,6 +60,8 @@ enum {
SEG6_LOCAL_ACTION_END_AS= 13,
/* forward to SR-unaware VNF with masquerading */
SEG6_LOCAL_ACTION_END_AM= 14,
+   /* custom BPF action */
+   SEG6_LOCAL_ACTION_END_BPF   = 15,
 
__SEG6_LOCAL_ACTION_MAX,
 };
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d5e1a6c4165d..bb6e4a17ce3d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1262,6 +1262,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
switch (env->prog->type) {
case BPF_PROG_TYPE_LWT_IN:
case BPF_PROG_TYPE_LWT_OUT:
+   case BPF_PROG_TYPE_LWT_SEG6LOCAL:
/* dst_input() and dst_output() can't write for now */
if (t == BPF_WRITE)
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index 2aa83e0f40ce..592dec8c781c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4580,6 +4580,21 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
 }
 
+static const struct bpf_func_proto *
+lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+   switch (func_id) {
+   case BPF_FUNC_lwt_seg6_store_bytes:
+   return &bpf_lwt_seg6_store_bytes_proto;
+   case BPF_FUNC_lwt_seg6_action:
+   return &bpf_lwt_seg6_action_proto;
+   case BPF_FUNC_lwt_seg6_adjust_srh:
+   

Locking in network code

2018-05-06 Thread Jacob S. Moroni
Hello,

I have a stupid question regarding which variant of spin_lock to use
throughout the network stack, and inside RX handlers specifically.

It's my understanding that skbuffs are normally passed into the stack
from soft IRQ context if the device is using NAPI, and hard IRQ
context if it's not using NAPI (and I guess process context too if the
driver does its own workqueue thing).

So, that means that handlers registered with netdev_rx_handler_register
may end up being called from any context.

However, the RX handler in the macvlan code calls ip_check_defrag,
which could eventually lead to a call to ip_defrag, which ends
up taking a regular spin_lock around the call to ip_frag_queue.

Is this a risk of deadlock, and if not, why?

What if you're running a system with one CPU and a packet fragment
arrives on a NAPI interface, then, while the spin_lock is held,
another fragment somehow arrives on another interface which does
its processing in hard IRQ context?

-- 
  Jacob S. Moroni
  m...@jakemoroni.com


[PATCH 8/9] net: flow_dissector: fix typo 'can by' to 'can be'

2018-05-06 Thread Wolfram Sang
Signed-off-by: Wolfram Sang 
---
 include/net/flow_dissector.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 9a074776f70b66..d1fcf2442a423b 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -251,7 +251,7 @@ extern struct flow_dissector flow_keys_buf_dissector;
  * This structure is used to hold a digest of the full flow keys. This is a
  * larger "hash" of a flow to allow definitively matching specific flows where
  * the 32 bit skb->hash is not large enough. The size is limited to 16 bytes so
- * that it can by used in CB of skb (see sch_choke for an example).
+ * that it can be used in CB of skb (see sch_choke for an example).
  */
 #define FLOW_KEYS_DIGEST_LEN   16
 struct flow_keys_digest {
-- 
2.11.0



[PATCH 0/9] tree-wide: fix typo 'can by' to 'can be'

2018-05-06 Thread Wolfram Sang
I found this kind of typo when reading the documentation for device_remove().
So, I checked the tree for it.

CCing all the subsystems directly, and I'd think the leftover ones could be
picked up by the trivial tree. Or would it be more convenient if trivial would
pick up all? I don't mind.

Based on v4.17-rc3.

Wolfram Sang (9):
  dt-bindings: i2c: fix typo 'can by' to 'can be'
  powerpc/watchdog: fix typo 'can by' to 'can be'
  base: core: fix typo 'can by' to 'can be'
  hwmon: fschmd: fix typo 'can by' to 'can be'
  input: ati_remote2: fix typo 'can by' to 'can be'
  NTB: ntb_hw_idt: fix typo 'can by' to 'can be'
  reiserfs: journal: fix typo 'can by' to 'can be'
  net: flow_dissector: fix typo 'can by' to 'can be'
  objtool: fix typo 'can by' to 'can be'

 Documentation/devicetree/bindings/i2c/i2c-davinci.txt | 2 +-
 arch/powerpc/kernel/watchdog.c| 2 +-
 drivers/base/core.c   | 2 +-
 drivers/hwmon/fschmd.c| 2 +-
 drivers/input/misc/ati_remote2.c  | 2 +-
 drivers/ntb/hw/idt/ntb_hw_idt.c   | 2 +-
 fs/reiserfs/journal.c | 2 +-
 include/net/flow_dissector.h  | 2 +-
 tools/objtool/Documentation/stack-validation.txt  | 2 +-
 9 files changed, 9 insertions(+), 9 deletions(-)

-- 
2.11.0



[PATCH] mwifiex: delete unneeded include

2018-05-06 Thread Julia Lawall
Nothing that is defined in 11ac.h is referenced in cmdevt.c.

Signed-off-by: Julia Lawall 

---
 drivers/net/wireless/marvell/mwifiex/cmdevt.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/cmdevt.c b/drivers/net/wireless/marvell/mwifiex/cmdevt.c
index 7014f44..9cfcdf6 100644
--- a/drivers/net/wireless/marvell/mwifiex/cmdevt.c
+++ b/drivers/net/wireless/marvell/mwifiex/cmdevt.c
@@ -25,7 +25,6 @@
 #include "main.h"
 #include "wmm.h"
 #include "11n.h"
-#include "11ac.h"
 
 static void mwifiex_cancel_pending_ioctl(struct mwifiex_adapter *adapter);
 



Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function

2018-05-06 Thread Thomas Gleixner
On Sun, 6 May 2018, Thomas Gleixner wrote:
> On Sat, 5 May 2018, Guenter Roeck wrote:
> > > -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> > > - mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> > > -#else
> > > - mask = desc->irq_common_data.affinity;
> > > -#endif
> > > - return mask;
> > > + return desc->affinity_hint;
> 
> NAK.
> 
> Nothing in regular device drivers is supposed to ever fiddle with struct
> irq_desc. The existing code is already a violation of that rule and needs
> to be fixed, but not in that way.
> 
> The logic here is completely screwed. affinity_hint is set by the driver,
> so the driver already knows what it is. If the driver does not set it, then
> the thing is NULL.

And this completely insane fiddling with irq_desc is in MLX4 as
well. Dammit, why can't people respect subsystem boundaries and just fiddle
in everything just because they can? If there is something missing at the
core level then please talk to the maintainers instead of hacking utter
crap into your driver.

Yours grumpy

  tglx

