Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
> I'm not sure if I am really a fan of trying to solve this in this way.
> It seems like this is going to be optimizing the paths for one case at
> the detriment of others. Historically mapping and unmapping has always
> been expensive, especially in the case of IOMMU enabled environments.
> I would much rather see us focus on having swiotlb_dma_ops replaced
> with dma_direct_ops in the cases where the device can access all of
> physical memory.

I am definitely not a fan either, but IFF indirect calls are such an
overhead it makes sense to avoid them for the common and simple case.
And the direct mapping is a common case present on just about every
architecture, and it is a very simple fast path that just adds an
offset to the physical address. So if we want to speed something up,
this is it.

> > -	if (ops->unmap_page)
> > +	if (!dev->is_dma_direct && ops->unmap_page)
>
> If I understand correctly this is only needed for the swiotlb case and
> not the dma_direct case. It would make much more sense to just
> overwrite the dev->dma_ops pointer with dma_direct_ops to address all
> of the sync and unmap cases.

Yes.

> > +	if (dev->dma_ops == &dma_direct_ops ||
> > +	    (dev->dma_ops == &swiotlb_dma_ops &&
> > +	     mask == DMA_BIT_MASK(64)))
> > +		dev->is_dma_direct = true;
> > +	else
> > +		dev->is_dma_direct = false;
>
> So I am not sure this will work on x86. If I am not mistaken I believe
> dev->dma_ops is normally not set and instead the default dma
> operations are pulled via get_arch_dma_ops which returns the global
> dma_ops pointer.

True, for x86 we'd need to check get_arch_dma_ops().

> What you may want to consider as an alternative would be to look at
> modifying drivers that are using the swiotlb so that you could just
> overwrite the dev->dma_ops with the dma_direct_ops in the cases where
> the hardware can support accessing all of physical memory and where
> we aren't forcing the use of the bounce buffers in the swiotlb.
> Then for the above code you only have to worry about the map calls,
> and you could just do a check against the dma_direct_ops pointer
> instead of having to add a new flag.

That would be the long term plan IFF we go down this route. For now I
just wanted a quick hack for performance testing.
[PATCH bpf-next 07/10] [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be
changed as well (only "decreasing" the pointer's location is going to
be supported). Changing this pointer changes the packet's size. For
cavium's thunder driver we now just calculate the packet's length
unconditionally.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 707db3304396..7135db45927e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -538,9 +538,9 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	action = bpf_prog_run_xdp(prog, &xdp);
 	rcu_read_unlock();
 
+	len = xdp.data_end - xdp.data;
 	/* Check if XDP program has changed headers */
 	if (orig_data != xdp.data) {
-		len = xdp.data_end - xdp.data;
 		offset = orig_data - xdp.data;
 		dma_addr -= offset;
 	}
-- 
2.15.1
[PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper
Adding a new bpf helper which allows us to manipulate xdp's data_end
pointer, and so reduce a packet's size. Intended use case: generating
ICMP messages from XDP context, where such a message would contain a
truncated original packet.

Signed-off-by: Nikita V. Shirokov
---
 include/uapi/linux/bpf.h | 10 +-
 net/core/filter.c        | 29 -
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..9a2d1a04eb24 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
  *	@addr: pointer to struct sockaddr to bind socket to
  *	@addr_len: length of sockaddr structure
  *	Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ *	Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ *	size is supported.
+ *	@xdp_md: pointer to xdp_md
+ *	@delta: A negative integer to be added to xdp_md.data_end
+ *	Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -821,7 +828,8 @@ union bpf_attr {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(xdp_adjust_tail),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff93270d..6c8ac7b548d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2717,6 +2717,30 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
+{
+	/* only shrinking is allowed for now. */
+	if (unlikely(offset > 0))
+		return -EINVAL;
+
+	void *data_end = xdp->data_end + offset;
+
+	if (unlikely(data_end < xdp->data + ETH_HLEN))
+		return -EINVAL;
+
+	xdp->data_end = data_end;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
+	.func		= bpf_xdp_adjust_tail,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
 	void *meta = xdp->data_meta + offset;
@@ -3053,7 +3077,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_l4_csum_replace ||
 	    func == bpf_xdp_adjust_head ||
 	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data)
+	    func == bpf_msg_pull_data ||
+	    func == bpf_xdp_adjust_tail)
 		return true;
 
 	return false;
@@ -3867,6 +3892,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_xdp_adjust_tail:
+		return &bpf_xdp_adjust_tail_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.15.1
Re: [PATCH][next] iwlwifi: mvm: remove division by size of sizeof(struct ieee80211_wmm_rule)
On Wed, 2018-04-11 at 14:05 +0100, Colin King wrote:
> From: Colin Ian King
>
> The subtraction of two struct ieee80211_wmm_rule pointers leaves a
> result that is automatically scaled down by the size of the
> pointed-to type, hence the division by
> sizeof(struct ieee80211_wmm_rule) is bogus and should be removed.
>
> Detected by CoverityScan, CID#146 ("Extra sizeof expression")
>
> Fixes: 77e30e10ee28 ("iwlwifi: mvm: query regdb for wmm rule if needed")
> Signed-off-by: Colin Ian King
> ---

Thanks, Colin! I've pushed this to our internal tree for review and if
everything goes fine it will land upstream following our normal
upstreaming process.

-- 
Cheers,
Luca.
[PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space
A search of the dev-addr property is done in of_mdiobus_register. If
the property is found in the PHY node,
of_mdiobus_register_vend_spec_phy() is called. This is a wrapper
function for of_mdiobus_register_phy() which finds the device in
package based on dev-addr, and fills devices_addrs, which is a new
field added to phy_c45_device_ids. This new field will store the
dev-addr property on the same index where the device in package has
been found.

of_mdiobus_register_phy() now takes an extra parameter,
struct phy_c45_device_ids *c45_ids. If c45_ids is not NULL,
get_vend_spec_addr_phy_device() is called and c45_ids is propagated
all the way to get_phy_c45_ids(). With dev-addr stored in
devices_addrs, when get_phy_c45_ids() probes the identifiers, dev-addr
can be extracted from devices_addrs and probed if
devices_addrs[current_identifier] is not 0.

Signed-off-by: Vicentiu Galanopulo
---
 drivers/net/phy/phy_device.c |  49 +--
 drivers/of/of_mdio.c         | 113 +--
 include/linux/phy.h          |  14 ++
 3 files changed, 169 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ac23322..5c79fd8 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -457,7 +457,7 @@ static int get_phy_c45_devs_in_pkg(struct mii_bus *bus, int addr, int dev_addr,
 static int get_phy_c45_ids(struct mii_bus *bus, int addr, u32 *phy_id,
 			   struct phy_c45_device_ids *c45_ids) {
 	int phy_reg;
-	int i, reg_addr;
+	int i, reg_addr, dev_addr;
 	const int num_ids = ARRAY_SIZE(c45_ids->device_ids);
 	u32 *devs = &c45_ids->devices_in_package;
@@ -493,13 +493,23 @@ static int get_phy_c45_ids(struct mii_bus *bus, int addr, u32 *phy_id,
 		if (!(c45_ids->devices_in_package & (1 << i)))
 			continue;
 
-		reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID1;
+		/* if c45_ids->devices_addrs for the current id is not 0,
+		 * then dev-addr was defined in the PHY device tree node,
+		 * and the PHY has been seen as a valid device, and added
+		 * in the package. In this case we can use the
+		 * dev-addr (c45_ids->devices_addrs[i]) to do the MDIO
+		 * reading of the PHY ID.
+		 */
+		dev_addr = !!c45_ids->devices_addrs[i] ?
+			   c45_ids->devices_addrs[i] : i;
+
+		reg_addr = MII_ADDR_C45 | dev_addr << 16 | MII_PHYSID1;
 		phy_reg = mdiobus_read(bus, addr, reg_addr);
 		if (phy_reg < 0)
 			return -EIO;
 		c45_ids->device_ids[i] = (phy_reg & 0xffff) << 16;
 
-		reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID2;
+		reg_addr = MII_ADDR_C45 | dev_addr << 16 | MII_PHYSID2;
 		phy_reg = mdiobus_read(bus, addr, reg_addr);
 		if (phy_reg < 0)
 			return -EIO;
@@ -551,6 +561,39 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
 }
 
 /**
+ * get_vend_spec_addr_phy_device - reads the specified PHY device
+ *				   and returns its @phy_device struct
+ * @bus: the target MII bus
+ * @addr: PHY address on the MII bus
+ * @is_c45: If true the PHY uses the 802.3 clause 45 protocol
+ * @c45_ids: Query the c45_ids to see if a PHY with a vendor specific
+ *	     register address space was defined in the PHY device tree
+ *	     node by adding the "dev-addr" property to the node.
+ *	     Store the c45 ID information about the rest of the PHYs
+ *	     found on the MDIO bus during probing.
+ *
+ * Description: Reads the ID registers of the PHY at @addr on the
+ * @bus, then allocates and returns the phy_device to represent it.
+ */
+struct phy_device *get_vend_spec_addr_phy_device(struct mii_bus *bus,
+						 int addr, bool is_c45,
+						 struct phy_c45_device_ids *c45_ids)
+{
+	u32 phy_id = 0;
+	int r;
+
+	r = get_phy_id(bus, addr, &phy_id, is_c45, c45_ids);
+	if (r)
+		return ERR_PTR(r);
+
+	/* If the phy_id is mostly Fs, there is no device there */
+	if ((phy_id & 0x1fffffff) == 0x1fffffff)
+		return ERR_PTR(-ENODEV);
+
+	return phy_device_create(bus, addr, phy_id, is_c45, c45_ids);
+}
+
+/**
  * get_phy_device - reads the specified PHY device and returns its @phy_device
  *		    struct
  * @bus: the target MII bus
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 8c0c927..52e8bfb 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -45,7 +45,8 @@ static int of_get_phy_id(struct device_node *device, u32 *phy_id)
 }
 
 static int
[PATCH net-next 2/3] net: phy: Change the array size to 32 for device_ids
In the context of enabling the discovery of the PHYs which have the
C45 MDIO address space at a non-standard address: num_ids in
get_phy_c45_ids() has the value 8 (ARRAY_SIZE(c45_ids->device_ids)),
but the u32 *devs bitfield can flag 32 devices. If a device is stored
in *devs in bits 32 to 9 (bit counting in the lookup loop starts from
1), it will not be found.

Signed-off-by: Vicentiu Galanopulo
---
 include/linux/phy.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870..26aa320 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -360,7 +360,7 @@ enum phy_state {
  */
 struct phy_c45_device_ids {
 	u32 devices_in_package;
-	u32 device_ids[8];
+	u32 device_ids[32];
 };
 
 /* phy_device: An instance of a PHY
-- 
2.7.4
[PATCH net-next 0/3] net: phy: Enable C45 vendor specific MDIO register addr space
Enabling the discovery on the MDIO bus of PHYs which have a vendor
specific address space for accessing the C45 MDIO registers.

Vicentiu Galanopulo (3):
  net: phy: Add binding for vendor specific C45 MDIO address space
  net: phy: Change the array size to 32 for device_ids
  net: phy: Enable C45 PHYs with vendor specific address space

 Documentation/devicetree/bindings/net/phy.txt |   6 ++
 drivers/net/phy/phy_device.c                  |  49 ++-
 drivers/of/of_mdio.c                          | 113 +-
 include/linux/phy.h                           |  16 +++-
 4 files changed, 176 insertions(+), 8 deletions(-)

-- 
2.7.4
Re: [PATCH net-next 3/5] ipv4: support sport, dport and ip protocol in RTM_GETROUTE
On Mon, Apr 16, 2018 at 01:41:36PM -0700, Roopa Prabhu wrote:
> @@ -2757,6 +2796,12 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>  	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
>  	fl4.flowi4_mark = mark;
>  	fl4.flowi4_uid = uid;
> +	if (sport)
> +		fl4.fl4_sport = sport;
> +	if (dport)
> +		fl4.fl4_dport = dport;
> +	if (ip_proto)
> +		fl4.flowi4_proto = ip_proto;

Hi Roopa,

This info isn't set in the synthesized skb, but only in the flow info,
and is therefore not used for input routes. I see you added a test
case, but it's only for output routes. I believe an input route test
case will fail.

Also, note that the skb as synthesized now is invalid - iph->ihl is 0,
for example - so the flow dissector will spit it out. It effectively
means that route get is broken when L4 hashing is used. It also
affects output routes, because since commit 3765d35ed8b9 ("net: ipv4:
Convert inet_rtm_getroute to rcu versions of route lookup") the skb is
used to calculate the multipath hash.
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled because the information is on multiple lines and the function
> > is called in parallel, but it looks like I do have some seq > maxend + 1
> >
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200
> > then window decreased drastically e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
>
> scale=0 is suspect.
>
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Hm... it doesn't seem to be the case here:

14.364038 tcp_in_window: START
14.364065 tcp_in_window:
14.364090 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364129 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364158 tcp_in_window:
14.364185 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364210 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364237 tcp_in_window: I=1 II=1 III=1 IV=1
14.364262 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=0 maxend=29200 maxwin=0

looks like SYN packet

14.661682 tcp_in_window: START
14.661706 tcp_in_window:
14.661731 seq=1913287797 ack=0+(0) sack=0+(0) win=29200 end=1913287798
14.661828 tcp_in_window: sender end=0 maxend=29200 maxwin=0 scale=0 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.661867 tcp_in_window: START
14.661893 tcp_in_window:
14.661917 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.661942 tcp_in_window: START
14.661966 tcp_in_window:
14.661993 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662020 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662045 tcp_in_window:
14.662072 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.662097 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662125 tcp_in_window:
14.662150 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662175 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662202 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662226 tcp_in_window: I=1 II=1 III=1 IV=1
14.662251 tcp_in_window: I=1 II=1 III=1 IV=1
14.662277 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=1913287798 maxend=1913316998 maxwin=29200
14.662302 tcp_in_window: res=1 sender end=1025597635 maxend=1025635103 maxwin=354378 receiver end=1542862349 maxend=1543171341 maxwin=37504

SYNACK response and (dataless) ACK in the original direction, mixed
with an unrelated packet.

14.687411 tcp_in_window: START
14.687522 tcp_in_window:
14.687570 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687619 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687659 tcp_in_window:
14.687699 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687739 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687774 tcp_in_window: I=1 II=1 III=1 IV=1
14.687806 tcp_in_window: res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200

ACK in the reply direction (no data). We still have scale=7 in both
directions.

14.688706 tcp_in_window: START
14.688733 tcp_in_window:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window:
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688911 tcp_in_window: I=1 II=1 III=1 IV=1
14.688938 tcp_in_window: res=1 sender end=1913287819
[PATCH] net: change the comment of dev_mc_init
The comment of dev_mc_init() is wrong: it says dev_mc_flush instead of
dev_mc_init.

Signed-off-by: Lianwen Sun
---
 net/core/dev_addr_lists.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e3e6a3e2ca22..d884d8f5f0e5 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
 EXPORT_SYMBOL(dev_mc_flush);
 
 /**
- * dev_mc_flush - Init multicast address list
+ * dev_mc_init - Init multicast address list
  * @dev: device
  *
  * Init multicast address list.
-- 
2.17.0
[PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel
Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy
Signed-off-by: Ganesh Goudar
---
v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..76433d4fe483 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->name, sizeof(data->name), "%s_%s", cxgb4_driver_name,
+		 adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1
[PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes
Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensures that zero-filled data
can be avoided. This also ensures that the device dumps and the other
elf notes can be properly mmaped at page aligned addresses.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman
Signed-off-by: Rahul Lakkireddy
Signed-off-by: Ganesh Goudar
---
v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman.

 fs/proc/vmcore.c | 247 ++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7395462d2f86..ed1ebd85e14e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we
[PATCH] bpf: btf: fix semicolon.cocci warnings
From: Fengguang Wu

kernel/bpf/btf.c:353:2-3: Unneeded semicolon
kernel/bpf/btf.c:280:2-3: Unneeded semicolon
kernel/bpf/btf.c:663:2-3: Unneeded semicolon

Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: b22ac5b97dd9 ("bpf: btf: Validate type reference")
CC: Martin KaFai Lau
Signed-off-by: Fengguang Wu
---
 btf.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -277,7 +277,7 @@ static bool btf_type_is_modifier(const s
 	case BTF_KIND_CONST:
 	case BTF_KIND_RESTRICT:
 		return true;
-	};
+	}
 
 	return false;
 }
@@ -350,7 +350,7 @@ static bool btf_type_has_size(const stru
 	case BTF_KIND_UNION:
 	case BTF_KIND_ENUM:
 		return true;
-	};
+	}
 
 	return false;
 }
@@ -660,7 +660,7 @@ static bool env_type_is_resolve_sink(con
 		       !btf_type_is_struct(next_type);
 	default:
 		BUG_ON(1);
-	};
+	}
 }
 
 static bool env_type_is_resolved(const struct btf_verifier_env *env,
Re: [PATCH bpf-next v3 02/10] bpf: btf: Validate type reference
Hi Martin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url: https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/BTF-BPF-Type-Format/20180417-142247
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master

coccinelle warnings: (new ones prefixed by >>)

>> kernel/bpf/btf.c:353:2-3: Unneeded semicolon
   kernel/bpf/btf.c:280:2-3: Unneeded semicolon
   kernel/bpf/btf.c:663:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH 03/10] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
Hello!

On 4/17/2018 1:04 AM, Michael Schmitz wrote:

> From: John Paul Adrian Glaubitz
>
> This complements the fix in 82533ad9a1c that removed the free_irq

You also need to specify the commit's summary line enclosed in ("").

> call in the error path of probe, to also not call free_irq when remove
> is called to revert the effects of probe.
>
> Signed-off-by: Michael Karcher

[...]

MBR, Sergei
[PATCH net-next] liquidio: Enhanced ethtool stats
From: Intiyaz Basha

1. Added red_drops stats: inbound packets dropped by RED, buffer
   exhaustion.
2. Included fcs_err, jabber_err, l2_err and frame_err errors under
   rx_errors.
3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link
   and fw_err_drop under rx_dropped.
4. Included max_collision_fail, max_deferral_fail, total_collisions,
   fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under
   tx_dropped.
5. Counting dma mapping errors.
6. Added some firmware stats descriptions and removed some.

Signed-off-by: Intiyaz Basha
Acked-by: Derek Chickles
Acked-by: Satanand Burla
Signed-off-by: Felix Manlunas
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 54 ++--
 drivers/net/ethernet/cavium/liquidio/lio_main.c    |  2 +
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 75 ++
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  4 +-
 4 files changed, 86 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index a63ddf0..b40e8f5 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -96,11 +96,9 @@ enum {
 	"tx_packets",
 	"rx_bytes",
 	"tx_bytes",
-	"rx_errors",	/*jabber_err+l2_err+frame_err */
-	"tx_errors",	/*fw_err_pko+fw_err_link+fw_err_drop */
-	"rx_dropped",	/*st->fromwire.total_rcvd - st->fromwire.fw_total_rcvd +
-			 *st->fromwire.dmac_drop + st->fromwire.fw_err_drop
-			 */
+	"rx_errors",
+	"tx_errors",
+	"rx_dropped",
 	"tx_dropped",
 
 	"tx_total_sent",
@@ -119,7 +117,7 @@ enum {
 	"mac_tx_total_bytes",
 	"mac_tx_mcast_pkts",
 	"mac_tx_bcast_pkts",
-	"mac_tx_ctl_packets",	/*oct->link_stats.fromhost.ctl_sent */
+	"mac_tx_ctl_packets",
 	"mac_tx_total_collisions",
 	"mac_tx_one_collision",
 	"mac_tx_multi_collison",
@@ -170,17 +168,17 @@ enum {
 	"tx_packets",
 	"rx_bytes",
 	"tx_bytes",
-	"rx_errors",	/* jabber_err + l2_err+frame_err */
-	"tx_errors",	/* fw_err_pko + fw_err_link+fw_err_drop */
-	"rx_dropped",	/* total_rcvd - fw_total_rcvd + dmac_drop + fw_err_drop */
+	"rx_errors",
+	"tx_errors",
+	"rx_dropped",
 	"tx_dropped",
 	"link_state_changes",
 };
 
 /* statistics of host tx queue */
 static const char oct_iq_stats_strings[][ETH_GSTRING_LEN] = {
-	"packets",	/*oct->instr_queue[iq_no]->stats.tx_done*/
-	"bytes",	/*oct->instr_queue[iq_no]->stats.tx_tot_bytes*/
+	"packets",
+	"bytes",
 	"dropped",
 	"iq_busy",
 	"sgentry_sent",
@@ -197,13 +195,9 @@ enum {
 
 /* statistics of host rx queue */
 static const char oct_droq_stats_strings[][ETH_GSTRING_LEN] = {
-	"packets",	/*oct->droq[oq_no]->stats.rx_pkts_received */
-	"bytes",	/*oct->droq[oq_no]->stats.rx_bytes_received */
-	"dropped",	/*oct->droq[oq_no]->stats.rx_dropped+
-			 *oct->droq[oq_no]->stats.dropped_nodispatch+
-			 *oct->droq[oq_no]->stats.dropped_toomany+
-			 *oct->droq[oq_no]->stats.dropped_nomem
-			 */
+	"packets",
+	"bytes",
+	"dropped",
 	"dropped_nomem",
 	"dropped_toomany",
 	"fw_dropped",
@@ -1068,16 +1062,33 @@ static void lio_vf_set_msglevel(struct net_device *netdev, u32 msglvl)
 	data[i++] = CVM_CAST64(netstats->rx_bytes);
 	/*sum of oct->instr_queue[iq_no]->stats.tx_tot_bytes */
 	data[i++] = CVM_CAST64(netstats->tx_bytes);
-	data[i++] = CVM_CAST64(netstats->rx_errors);
+	data[i++] = CVM_CAST64(netstats->rx_errors +
+			       oct_dev->link_stats.fromwire.fcs_err +
+			       oct_dev->link_stats.fromwire.jabber_err +
+			       oct_dev->link_stats.fromwire.l2_err +
+			       oct_dev->link_stats.fromwire.frame_err);
 	data[i++] = CVM_CAST64(netstats->tx_errors);
 	/*sum of oct->droq[oq_no]->stats->rx_dropped +
 	 *oct->droq[oq_no]->stats->dropped_nodispatch +
 	 *oct->droq[oq_no]->stats->dropped_toomany +
 	 *oct->droq[oq_no]->stats->dropped_nomem
 	 */
-	data[i++] = CVM_CAST64(netstats->rx_dropped);
+	data[i++] = CVM_CAST64(netstats->rx_dropped +
+			       oct_dev->link_stats.fromwire.fifo_err +
+			       oct_dev->link_stats.fromwire.dmac_drop +
+			       oct_dev->link_stats.fromwire.red_drops +
[PATCH bpf-next 09/10] [bpf]: make tun compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as well (only "decreasing" the pointer's location will be supported). Changing this pointer changes the packet's size. For the tun driver we need to adjust XDP_PASS handling by recalculating the length of the packet before it is passed to the TCP/IP stack, in case the data_end pointer was adjusted by the XDP program.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 28583aa0c17d..0b488a958076 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1688,6 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		return NULL;
 	case XDP_PASS:
 		delta = orig_data - xdp.data;
+		len = xdp.data_end - xdp.data;
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -1708,7 +1709,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	}

 	skb_reserve(skb, pad - delta);
-	skb_put(skb, len + delta);
+	skb_put(skb, len);
 	get_page(alloc_frag->page);
 	alloc_frag->offset += buflen;
--
2.15.1
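The tun change is pure pointer arithmetic: the skb length must be derived from the adjusted data/data_end pair rather than from the pre-XDP length plus the head delta, because the old expression never sees a moved data_end. A userspace sketch of that arithmetic (the struct and all values are hypothetical, not the driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the XDP_PASS length handling in tun_build_skb(). "Head"
 * adjustments come from bpf_xdp_adjust_head(), "tail" adjustments from
 * bpf_xdp_adjust_tail(). */
struct xdp_sketch {
	unsigned char *data;     /* start of packet payload */
	unsigned char *data_end; /* one past the last payload byte */
};

/* Length the old code passed to skb_put(): original len plus the head
 * delta only -- a tail shrink is silently ignored. */
static size_t old_skb_len(size_t orig_len, ptrdiff_t delta)
{
	return orig_len + delta;
}

/* Length after the patch: recomputed directly from the pointers, so a
 * tail shrink is reflected too. */
static size_t new_skb_len(const struct xdp_sketch *xdp)
{
	return (size_t)(xdp->data_end - xdp->data);
}
```

With a 100-byte packet whose head was advanced 14 bytes (delta = -14) and whose tail was shrunk 20 bytes, the old expression yields 86 while the recomputed length is 66.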
[PATCH bpf-next 10/10] [bpf]: make virtio compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as well (only "decreasing" the pointer's location will be supported). Changing this pointer changes the packet's size. For the virtio driver we need to adjust XDP_PASS handling by recalculating the length of the packet before it is passed to the TCP/IP stack.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/virtio_net.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..115d85f7360a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,6 +604,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	case XDP_PASS:
 		/* Recalculate length in case bpf program changed it */
 		delta = orig_data - xdp.data;
+		len = xdp.data_end - xdp.data;
 		break;
 	case XDP_TX:
 		sent = __virtnet_xdp_xmit(vi, );
@@ -637,7 +638,7 @@
 		goto err;
 	}
 	skb_reserve(skb, headroom - delta);
-	skb_put(skb, len + delta);
+	skb_put(skb, len);
 	if (!delta) {
 		buf += header_offset;
 		memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
@@ -752,6 +753,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	offset = xdp.data - page_address(xdp_page) - vi->hdr_len;

+	/* recalculate len if xdp.data or xdp.data_end were
+	 * adjusted
+	 */
+	len = xdp.data_end - xdp.data;
 	/* We can only create skb based on xdp_page. */
 	if (unlikely(xdp_page != page)) {
 		rcu_read_unlock();
--
2.15.1
net: 4.9-stable regression in drivers/net/phy/micrel.c on 4.9.94
Hi We run into a NULL pointer dereference crash when booting 4.9.94 on our Artpec-6 board with stmmac ethernet and Micrel KSZ9031 phy. I traced this to the patch d7ba3c00047d ("net: phy: micrel: Restore led_mode and clk_sel on resume") that was added in 4.9.94. This patch makes kszphy_resume() depend on the kszphy_priv object having been created and this happens only for those Micrel PHYs that have a .probe callback assigned. This is not the case for KSZ9031. This is already fixed in later kernels by bfe72442578b ("net: phy: micrel: fix crash when statistic requested for KSZ9031 phy") thas assigns a probe function for all Micrel PHYs that depend on the kszphy_priv existing. Please consider applying this to the 4.9 stable tree. Crash dump splat: Unable to handle kernel NULL pointer dereference at virtual address 0008 pgd = bd8bc000 [0008] *pgd=3d98e831, *pte=, *ppte= Internal error: Oops: 17 [#1] PREEMPT SMP ARM Modules linked in: e1000e nvmem_artpec6_efuse nvmem_core artpec6_trng(O) artpec6_lcpu(O) CPU: 0 PID: 216 Comm: netd Tainted: G O4.9.94-axis5-devel #1 Hardware name: Axis ARTPEC-6 Platform task: bf344620 task.stack: bd10c000 PC is at kszphy_config_reset+0x14/0x148 LR is at kszphy_resume+0x1c/0x5c pc : [<804ad358>]lr : [<804ad608>]psr: 600c0113 sp : bd10dd00 ip : 8dc7 fp : bf393200 r10: r9 : 0002 r8 : r7 : bf3ad000 r6 : r5 : bf086000 r4 : bf3ad400 r3 : 0001 r2 : r1 : 00040003 r0 : bf3ad400 Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 3d8bc04a DAC: 0051 Process netd (pid: 216, stack limit = 0xbd10c210) Stack: (0xbd10dd00 to 0xbd10e000) dd00: bf3ad400 bf086000 bf3ad000 804ad608 bf3ad400 bf086000 dd20: 804ab404 bf3ad400 bf086000 804b345c 7ee94b94 dd40: 804ab5a4 bf3ad400 bf086000 804b345c 80509278 bf086000 bf086000 dd60: 0002 7ee94b94 804ae4a4 0002 0001 8014bc78 801647dc dd80: beb25cc0 beb634e0 be97c87c 00c3 800f0093 80682d1c dda0: 800f0093 beb634e0 80682d1c beb25cc0 802a6388 beb88444 ddc0: beb88450 00c3 0001 0001 0001 
801647dc 0001 beb88440 dde0: 0001 801414fc 0001 805de9f4 bd10dea0 0001 beae9b8c bf086000 de00: 0001 80743d58 bf086030 804b25b0 8064fd7c 801414fc fff2 bd10de64 de20: 000d 801414fc bf0864c0 bd10de64 000d 801414fc bf086000 bf086000 de40: 0001 80743d58 bf086030 7ee94b94 bf393200 80567b00 de60: bf086188 bf086000 80567da4 bf086000 0001 1003 1002 80567dcc de80: bf086000 1002 bf086148 7ee94b94 80567e9c bd10dec8 dea0: bf39320c 7ee94b94 805de4b0 bd10df00 8914 bf086000 dec0: 0014 bf39320c 30687465 1003 dee0: 8914 be643360 7ee94b94 be643340 7ee94b94 df00: 0008 80548420 7ee94b94 be643360 beae7ee0 0008 df20: 7ee94b94 80271884 beb0a700 be4f3360 df40: 0002 0023 beb0a708 76f216c4 8025ec18 8027db08 df60: beae7ee0 8027db08 beae7ee1 beae7ee0 8914 7ee94b94 0008 df80: 802721c4 01f1bcb0 76fadcf0 0001 0036 80108984 bd10c000 dfa0: 801087c0 01f1bcb0 76fadcf0 0008 8914 7ee94b94 01f1be48 dfc0: 01f1bcb0 76fadcf0 0001 0036 7ee94b94 0008 0004cd2c dfe0: 00063d60 7ee94b74 00027344 76b10b2c 600f0010 0008 7ee727f4 [<804ad358>] (kszphy_config_reset) from [<804ad608>] (kszphy_resume+0x1c/0x5c) [<804ad608>] (kszphy_resume) from [<804ab404>] (phy_attach_direct+0xbc/0x1c4) [<804ab404>] (phy_attach_direct) from [<804ab5a4>] (phy_connect_direct+0x1c/0x54) [<804ab5a4>] (phy_connect_direct) from [<80509278>] (of_phy_connect+0x40/0x68) [<80509278>] (of_phy_connect) from [<804ae4a4>] (stmmac_init_phy+0x50/0x1ec) [<804ae4a4>] (stmmac_init_phy) from [<804b25b0>] (stmmac_open+0x70/0xc90) [<804b25b0>] (stmmac_open) from [<80567b00>] (__dev_open+0xc4/0x140) [<80567b00>] (__dev_open) from [<80567dcc>] (__dev_change_flags+0x9c/0x14c) [<80567dcc>] (__dev_change_flags) from [<80567e9c>] (dev_change_flags+0x20/0x50) [<80567e9c>] (dev_change_flags) from [<805de4b0>] (devinet_ioctl+0x6d4/0x798) [<805de4b0>] (devinet_ioctl) from [<80548420>] (sock_ioctl+0x158/0x2e4) [<80548420>] (sock_ioctl) from [<80271884>] (do_vfs_ioctl+0xa8/0x974) [<80271884>] (do_vfs_ioctl) from [<802721c4>] (SyS_ioctl+0x74/0x84) [<802721c4>] 
(SyS_ioctl) from [<801087c0>] (ret_fast_syscall+0x0/0x48) Code: e52de004 e8bd4000 e1a04000 e59061d0 (e5d63008)
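The oops above is a plain NULL dereference: kszphy_config_reset() reads phydev->priv, which is only allocated by a .probe callback that the KSZ9031 entry lacked in 4.9.94. A hedged sketch of the failure mode and the defensive shape of the upstream fix (structure and field names are simplified stand-ins, not the actual micrel.c code; the real fix in bfe72442578b works by assigning a probe function so priv always exists):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the driver structures. */
struct kszphy_priv_sketch {
	int led_mode;
	int rmii_ref_clk_sel; /* clk_sel restored on resume */
};

struct phydev_sketch {
	struct kszphy_priv_sketch *priv; /* NULL when .probe never ran */
};

/* Shape of the 4.9.94 regression: resume unconditionally touches priv,
 * so PHYs without a probe callback (e.g. KSZ9031) oops on resume. */
static int resume_buggy(struct phydev_sketch *phydev)
{
	return phydev->priv->led_mode; /* NULL deref if probe never ran */
}

/* Shape of the fixed invariant: priv is guaranteed to exist before
 * resume touches it (modelled here by bailing out when it is missing). */
static int resume_fixed(struct phydev_sketch *phydev)
{
	if (!phydev->priv)
		return 0; /* nothing to restore */
	return phydev->priv->led_mode;
}
```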
[PATCH/RFC net-next 0/5] ravb: updates
Hi Sergei, this series is composed of otherwise unrelated RAVB patches from the R-Car BSP v3.6.2 which at a first pass seem worth considering for upstream. I would value your feedback on these patches so they can either proceed into net-next or remain local to the BSP. Thanks!

Kazuya Mizuguchi (4):
  ravb: fix ptp failure after suspend and resume
  ravb: do not write 1 to reserved bits
  ravb: remove undocumented processing
  ravb: remove tx buffer addr 4byte alignment restriction for R-Car Gen3

Masaru Nagai (1):
  ravb: fix inconsistent lock state at enabling tx timestamp

 drivers/net/ethernet/renesas/ravb.h      |  23 +++-
 drivers/net/ethernet/renesas/ravb_main.c | 192 ---
 drivers/net/ethernet/renesas/ravb_ptp.c  |   2 +-
 3 files changed, 117 insertions(+), 100 deletions(-)
--
2.11.0
[PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: Masaru Nagai[ 58.490829] = [ 58.495205] [ INFO: inconsistent lock state ] [ 58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted [ 58.505529] - [ 58.509904] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. [ 58.515939] swapper/0/0 [HC1[1]:SC1[1]:HE0:SE0] takes: [ 58.521099] (&(>lock)->rlock#2){?.-...}, at: [] skb_queue_tail+0x2c/0x68 {HARDIRQ-ON-W} state was registered at: [ 58.533654] [ 58.535155] [] mark_lock+0x1c4/0x718 [ 58.540318] [ 58.541814] [] __lock_acquire+0x660/0x1890 [ 58.547501] [ 58.548997] [] lock_acquire+0xd0/0x290 [ 58.554334] [ 58.555834] [] _raw_spin_lock_bh+0x50/0x90 [ 58.561520] [ 58.563018] [] first_packet_length+0x40/0x2b0 [ 58.568965] [ 58.570461] [] udp_ioctl+0x58/0x120 [ 58.575535] [ 58.577032] [] inet_ioctl+0x58/0x128 [ 58.582194] [ 58.583691] [] sock_do_ioctl+0x40/0x88 [ 58.589028] [ 58.590523] [] sock_ioctl+0x284/0x350 [ 58.595773] [ 58.597271] [] do_vfs_ioctl+0xb0/0x7c0 [ 58.602607] [ 58.604103] [] SyS_ioctl+0x94/0xa8 [ 58.609090] [ 58.610588] [] __sys_trace_return+0x0/0x4 [ 58.616187] irq event stamp: 335205 [ 58.619690] hardirqs last enabled at (335204): [] __do_softirq+0xdc/0x5c4 [ 58.628168] hardirqs last disabled at (335205): [] el1_irq+0x70/0x12c [ 58.636211] softirqs last enabled at (335202): [] _local_bh_enable+0x28/0x50 [ 58.644950] softirqs last disabled at (335203): [] irq_exit+0xd4/0x100 [ 58.653076] [ 58.653076] other info that might help us debug this: [ 58.659632] Possible unsafe locking scenario: [ 58.659632] [ 58.665577]CPU0 [ 58.668031] [ 58.670484] lock(&(>lock)->rlock#2); [ 58.674799] [ 58.677427] lock(&(>lock)->rlock#2); [ 58.681916] [ 58.681916] *** DEADLOCK *** [ 58.681916] [ 58.687863] 1 lock held by swapper/0/0: [ 58.691713] #0: (&(>lock)->rlock){-.-...}, at: [] ravb_multi_interrupt+0x28/0x98 [ 58.701456] [ 58.701456] stack backtrace: [ 58.705833] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-yocto-standard-7-g2ef7caf #57 [ 58.714396] Hardware name: Renesas Salvator-X board based on r8a7796 
(DT) [ 58.721214] Call trace: [ 58.723672] [] dump_backtrace+0x0/0x1d8 [ 58.729095] [] show_stack+0x24/0x30 [ 58.734170] [] dump_stack+0xb0/0xe8 [ 58.740285] [] print_usage_bug.part.24+0x264/0x27c [ 58.747697] [] mark_lock+0x150/0x718 [ 58.753892] [] __lock_acquire+0xc10/0x1890 [ 58.760602] [] lock_acquire+0xd0/0x290 [ 58.766956] [] _raw_spin_lock_irqsave+0x58/0x98 [ 58.774089] [] skb_queue_tail+0x2c/0x68 [ 58.780518] [] sock_queue_err_skb+0xc8/0x138 [ 58.787364] [] __skb_complete_tx_timestamp+0x8c/0xb8 [ 58.794888] [] __skb_tstamp_tx+0xd8/0x130 [ 58.801437] [] skb_tstamp_tx+0x30/0x40 [ 58.807723] [] ravb_timestamp_interrupt+0x164/0x1a8 [ 58.815144] [] ravb_multi_interrupt+0x88/0x98 [ 58.822043] [] __handle_irq_event_percpu+0x94/0x418 [ 58.829464] [] handle_irq_event_percpu+0x28/0x60 [ 58.836622] [] handle_irq_event+0x50/0x80 [ 58.843166] [] handle_fasteoi_irq+0xdc/0x1e0 [ 58.849968] [] generic_handle_irq+0x34/0x50 [ 58.856681] [] __handle_domain_irq+0x8c/0x100 [ 58.863568] [] gic_handle_irq+0x60/0xb8 [ 58.869930] Exception stack(0x80063b0f9de0 to 0x80063b0f9f10) [ 58.877348] 9de0: 80063b0f9e10 0001 80063b0f9f40 08081810 [ 58.886159] 9e00: 6145 08082f70 09194b00 00190f2c [ 58.894961] 9e20: 800632171000 000a 0003a4d0 [ 58.903767] 9e40: 0016 0023 091952f8 [ 58.912568] 9e60: 0040 34d5d91d [ 58.921363] 9e80: [ 58.930133] 9ea0: 0918 080d76e4 0052 [ 58.938897] 9ec0: 08d7 0008 0001 [ 58.947660] 9ee0: 80063a428000 09185000 0918 80063b0f9f40 [ 58.956430] 9f00: 0808180c 80063b0f9f40 [ 58.962253] [] el1_irq+0xb4/0x12c [ 58.968096] [] irq_exit+0xd4/0x100 [ 58.974025] [] __handle_domain_irq+0x90/0x100 [ 58.980916] [] gic_handle_irq+0x60/0xb8 [ 58.987281] Exception stack(0x09183d20 to 0x09183e50) [ 58.994708] 3d20: 09194b00 00190f2b 800632171000 8c6318c6318c6320 [ 59.003554] 3d40: 0003a4d0 0016 002a [ 59.012416] 3d60: 091952f8 1000 [ 59.021279] 3d80: 34d5d91d [ 59.030111] 3da0: 000d9e3b53c4 [ 59.038913] 3dc0:
[PATCH/RFC net-next 5/5] ravb: remove tx buffer addr 4byte alignment restriction for R-Car Gen3
From: Kazuya MizuguchiThis patch sets from two descriptor to one descriptor because R-Car Gen3 does not have the 4 bytes alignment restriction of the transmission buffer. Signed-off-by: Kazuya Mizuguchi Signed-off-by: Simon Horman --- drivers/net/ethernet/renesas/ravb.h | 6 +- drivers/net/ethernet/renesas/ravb_main.c | 131 +++ 2 files changed, 85 insertions(+), 52 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index fcd04dbc7dde..3d0985305c26 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -964,7 +964,10 @@ enum RAVB_QUEUE { #define RX_QUEUE_OFFSET4 #define NUM_RX_QUEUE 2 #define NUM_TX_QUEUE 2 -#define NUM_TX_DESC2 /* TX descriptors per packet */ + +/* TX descriptors per packet */ +#define NUM_TX_DESC_GEN2 2 +#define NUM_TX_DESC_GEN3 1 struct ravb_tstamp_skb { struct list_head list; @@ -1043,6 +1046,7 @@ struct ravb_private { unsigned no_avb_link:1; unsigned avb_link_active_low:1; unsigned wol_enabled:1; + int num_tx_desc;/* TX descriptors per packet */ }; static inline u32 ravb_read(struct net_device *ndev, enum ravb_reg reg) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 88056dd912ed..f137b62d5b52 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -189,12 +189,13 @@ static int ravb_tx_free(struct net_device *ndev, int q, bool free_txed_only) int free_num = 0; int entry; u32 size; + int num_tx_desc = priv->num_tx_desc; for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) { bool txed; entry = priv->dirty_tx[q] % (priv->num_tx_ring[q] * -NUM_TX_DESC); +num_tx_desc); desc = >tx_ring[q][entry]; txed = desc->die_dt == DT_FEMPTY; if (free_txed_only && !txed) @@ -203,12 +204,12 @@ static int ravb_tx_free(struct net_device *ndev, int q, bool free_txed_only) dma_rmb(); size = le16_to_cpu(desc->ds_tagl) & TX_DS; /* Free the original skb. 
*/ - if (priv->tx_skb[q][entry / NUM_TX_DESC]) { + if (priv->tx_skb[q][entry / num_tx_desc]) { dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr), size, DMA_TO_DEVICE); /* Last packet descriptor? */ - if (entry % NUM_TX_DESC == NUM_TX_DESC - 1) { - entry /= NUM_TX_DESC; + if (entry % num_tx_desc == num_tx_desc - 1) { + entry /= num_tx_desc; dev_kfree_skb_any(priv->tx_skb[q][entry]); priv->tx_skb[q][entry] = NULL; if (txed) @@ -229,6 +230,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) struct ravb_private *priv = netdev_priv(ndev); int ring_size; int i; + int num_tx_desc = priv->num_tx_desc; if (priv->rx_ring[q]) { for (i = 0; i < priv->num_rx_ring[q]; i++) { @@ -252,7 +254,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) ravb_tx_free(ndev, q, false); ring_size = sizeof(struct ravb_tx_desc) * - (priv->num_tx_ring[q] * NUM_TX_DESC + 1); + (priv->num_tx_ring[q] * num_tx_desc + 1); dma_free_coherent(ndev->dev.parent, ring_size, priv->tx_ring[q], priv->tx_desc_dma[q]); priv->tx_ring[q] = NULL; @@ -284,9 +286,10 @@ static void ravb_ring_format(struct net_device *ndev, int q) struct ravb_ex_rx_desc *rx_desc; struct ravb_tx_desc *tx_desc; struct ravb_desc *desc; + int num_tx_desc = priv->num_tx_desc; int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q]; int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q] * - NUM_TX_DESC; + num_tx_desc; dma_addr_t dma_addr; int i; @@ -321,8 +324,10 @@ static void ravb_ring_format(struct net_device *ndev, int q) for (i = 0, tx_desc = priv->tx_ring[q]; i < priv->num_tx_ring[q]; i++, tx_desc++) { tx_desc->die_dt = DT_EEMPTY; - tx_desc++; - tx_desc->die_dt = DT_EEMPTY; + if (num_tx_desc >= 2) { + tx_desc++; + tx_desc->die_dt = DT_EEMPTY; + }
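Most of this patch rewrites ring-index arithmetic from the constant NUM_TX_DESC to the per-chip num_tx_desc (2 on Gen2, 1 on Gen3). The two mappings ravb_tx_free() relies on can be sketched and checked in isolation (plain C harness, not driver code):

```c
#include <assert.h>

/* Descriptors consumed per packet, as in the patch. */
#define NUM_TX_DESC_GEN2 2
#define NUM_TX_DESC_GEN3 1

/* The tx_skb[] slot that a given descriptor entry belongs to,
 * mirroring "entry / num_tx_desc" in ravb_tx_free(). */
static int skb_index(int entry, int num_tx_desc)
{
	return entry / num_tx_desc;
}

/* True for the last descriptor of a packet -- only then may the skb be
 * freed, mirroring "entry % num_tx_desc == num_tx_desc - 1". With one
 * descriptor per packet (Gen3) every entry is the last one. */
static int is_last_desc(int entry, int num_tx_desc)
{
	return entry % num_tx_desc == num_tx_desc - 1;
}
```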
[PATCH/RFC net-next 4/5] ravb: remove undocumented processing
From: Kazuya Mizuguchi

Signed-off-by: Kazuya Mizuguchi
Signed-off-by: Simon Horman
---
 drivers/net/ethernet/renesas/ravb.h      |  5 -
 drivers/net/ethernet/renesas/ravb_main.c | 15 ---
 2 files changed, 20 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 57eea4a77826..fcd04dbc7dde 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -197,15 +197,11 @@ enum ravb_reg {
 	MAHR	= 0x05c0,
 	MALR	= 0x05c8,
 	TROCR	= 0x0700,	/* Undocumented? */
-	CDCR	= 0x0708,	/* Undocumented? */
-	LCCR	= 0x0710,	/* Undocumented? */
 	CEFCR	= 0x0740,
 	FRECR	= 0x0748,
 	TSFRCR	= 0x0750,
 	TLFRCR	= 0x0758,
 	RFCR	= 0x0760,
-	CERCR	= 0x0768,	/* Undocumented? */
-	CEECR	= 0x0770,	/* Undocumented? */
 	MAFCR	= 0x0778,
 };

@@ -223,7 +219,6 @@ enum CCC_BIT {
 	CCC_CSEL_HPB	= 0x0001,
 	CCC_CSEL_ETH_TX	= 0x0002,
 	CCC_CSEL_GMII_REF = 0x0003,
-	CCC_BOC		= 0x0010,	/* Undocumented? */
 	CCC_LBME	= 0x0100,
 };

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 736ca2f76a35..88056dd912ed 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -451,12 +451,6 @@ static int ravb_dmac_init(struct net_device *ndev)
 	ravb_ring_format(ndev, RAVB_BE);
 	ravb_ring_format(ndev, RAVB_NC);

-#if defined(__LITTLE_ENDIAN)
-	ravb_modify(ndev, CCC, CCC_BOC, 0);
-#else
-	ravb_modify(ndev, CCC, CCC_BOC, CCC_BOC);
-#endif
-
 	/* Set AVB RX */
 	ravb_write(ndev, RCR_EFFS | RCR_ENCF | RCR_ETS0 | RCR_ESF | 0x1800, RCR);

@@ -1660,15 +1654,6 @@ static struct net_device_stats *ravb_get_stats(struct net_device *ndev)
 	nstats->tx_dropped += ravb_read(ndev, TROCR);
 	ravb_write(ndev, 0, TROCR);	/* (write clear) */
-	nstats->collisions += ravb_read(ndev, CDCR);
-	ravb_write(ndev, 0, CDCR);	/* (write clear) */
-	nstats->tx_carrier_errors += ravb_read(ndev, LCCR);
-	ravb_write(ndev, 0, LCCR);	/* (write clear) */
-
-	nstats->tx_carrier_errors += ravb_read(ndev, CERCR);
-	ravb_write(ndev, 0, CERCR);	/* (write clear) */
-	nstats->tx_carrier_errors += ravb_read(ndev, CEECR);
-	ravb_write(ndev, 0, CEECR);	/* (write clear) */

 	nstats->rx_packets = stats0->rx_packets + stats1->rx_packets;
 	nstats->tx_packets = stats0->tx_packets + stats1->tx_packets;
--
2.11.0
[PATCH/RFC net-next 2/5] ravb: fix ptp failure after suspend and resume
From: Kazuya Mizuguchi

This patch fixes the problem that the ptp4l command does not work after suspend and resume. Stop the PTP clock in ravb_suspend() and re-initialize it in ravb_resume(), because otherwise ptp does not work after resume.

Fixes: a0d2f20650e8 ("Renesas Ethernet AVB PTP clock driver")
Signed-off-by: Kazuya Mizuguchi
Signed-off-by: Simon Horman
---
 drivers/net/ethernet/renesas/ravb_main.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index b311b1ac1286..dbde3d11458b 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2295,6 +2295,9 @@ static int __maybe_unused ravb_suspend(struct device *dev)
 	else
 		ret = ravb_close(ndev);

+	if (priv->chip_id != RCAR_GEN2)
+		ravb_ptp_stop(ndev);
+
 	return ret;
 }

@@ -2302,6 +2305,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
 {
 	struct net_device *ndev = dev_get_drvdata(dev);
 	struct ravb_private *priv = netdev_priv(ndev);
+	struct platform_device *pdev = priv->pdev;
 	int ret = 0;

 	/* If WoL is enabled set reset mode to rearm the WoL logic */
@@ -2330,6 +2334,9 @@ static int __maybe_unused ravb_resume(struct device *dev)
 	/* Restore descriptor base address table */
 	ravb_write(ndev, priv->desc_bat_dma, DBAT);

+	if (priv->chip_id != RCAR_GEN2)
+		ravb_ptp_init(ndev, pdev);
+
 	if (netif_running(ndev)) {
 		if (priv->wol_enabled) {
 			ret = ravb_wol_restore(ndev);
--
2.11.0
[PATCH/RFC net-next 3/5] ravb: do not write 1 to reserved bits
From: Kazuya MizuguchiThis patch corrects writing 1 to reserved bits. The write value should be 0. Signed-off-by: Kazuya Mizuguchi Signed-off-by: Simon Horman --- drivers/net/ethernet/renesas/ravb.h | 12 drivers/net/ethernet/renesas/ravb_main.c | 9 + drivers/net/ethernet/renesas/ravb_ptp.c | 2 +- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index b81f4faf7b10..57eea4a77826 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -433,6 +433,8 @@ enum EIS_BIT { EIS_QFS = 0x0001, }; +#define EIS_RESERVED_BITS (u32)(GENMASK(31, 17) | GENMASK(15, 11)) + /* RIC0 */ enum RIC0_BIT { RIC0_FRE0 = 0x0001, @@ -477,6 +479,8 @@ enum RIS0_BIT { RIS0_FRF17 = 0x0002, }; +#define RIS0_RESERVED_BITS (u32)GENMASK(31, 18) + /* RIC1 */ enum RIC1_BIT { RIC1_RFWE = 0x8000, @@ -533,6 +537,8 @@ enum RIS2_BIT { RIS2_RFFF = 0x8000, }; +#define RIS2_RESERVED_BITS (u32)GENMASK_ULL(30, 18) + /* TIC */ enum TIC_BIT { TIC_FTE0= 0x0001, /* Undocumented? */ @@ -549,6 +555,10 @@ enum TIS_BIT { TIS_TFWF= 0x0200, }; +#define TIS_RESERVED_BITS (u32)(GENMASK_ULL(31, 20) | \ + GENMASK_ULL(15, 12) | \ + GENMASK_ULL(7, 4)) + /* ISS */ enum ISS_BIT { ISS_FRS = 0x0001, /* Undocumented? 
*/ @@ -622,6 +632,8 @@ enum GIS_BIT { GIS_PTMF= 0x0004, }; +#define GIS_RESERVED_BITS (u32)GENMASK(15, 10) + /* GIE (R-Car Gen3 only) */ enum GIE_BIT { GIE_PTCS= 0x0001, diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index dbde3d11458b..736ca2f76a35 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -742,10 +742,11 @@ static void ravb_error_interrupt(struct net_device *ndev) u32 eis, ris2; eis = ravb_read(ndev, EIS); - ravb_write(ndev, ~EIS_QFS, EIS); + ravb_write(ndev, ~(EIS_QFS | EIS_RESERVED_BITS), EIS); if (eis & EIS_QFS) { ris2 = ravb_read(ndev, RIS2); - ravb_write(ndev, ~(RIS2_QFF0 | RIS2_RFFF), RIS2); + ravb_write(ndev, ~(RIS2_QFF0 | RIS2_RFFF | RIS2_RESERVED_BITS), + RIS2); /* Receive Descriptor Empty int */ if (ris2 & RIS2_QFF0) @@ -913,7 +914,7 @@ static int ravb_poll(struct napi_struct *napi, int budget) /* Processing RX Descriptor Ring */ if (ris0 & mask) { /* Clear RX interrupt */ - ravb_write(ndev, ~mask, RIS0); + ravb_write(ndev, ~(mask | RIS0_RESERVED_BITS), RIS0); if (ravb_rx(ndev, , q)) goto out; } @@ -925,7 +926,7 @@ static int ravb_poll(struct napi_struct *napi, int budget) spin_lock_irqsave(>lock, flags); /* Clear TX interrupt */ - ravb_write(ndev, ~mask, TIS); + ravb_write(ndev, ~(mask | TIS_RESERVED_BITS), TIS); ravb_tx_free(ndev, q, true); netif_wake_subqueue(ndev, q); mmiowb(); diff --git a/drivers/net/ethernet/renesas/ravb_ptp.c b/drivers/net/ethernet/renesas/ravb_ptp.c index eede70ec37f8..ba3017ca5577 100644 --- a/drivers/net/ethernet/renesas/ravb_ptp.c +++ b/drivers/net/ethernet/renesas/ravb_ptp.c @@ -319,7 +319,7 @@ void ravb_ptp_interrupt(struct net_device *ndev) } } - ravb_write(ndev, ~gis, GIS); + ravb_write(ndev, ~(gis | GIS_RESERVED_BITS), GIS); } void ravb_ptp_init(struct net_device *ndev, struct platform_device *pdev) -- 2.11.0
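These interrupt-status registers clear a bit when 0 is written to it, so the driver acknowledges "mask" by writing ~mask; the patch additionally forces the reserved bits of that write to 0 by OR-ing a reserved mask before inverting. A userspace sketch of the mask arithmetic, using a 32-bit stand-in for the kernel's GENMASK (the TIS bit ranges come from the patch; the helper names here are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l): bits h..l set. */
#define GENMASK_U32(h, l) \
	(((~UINT32_C(0)) << (l)) & ((~UINT32_C(0)) >> (31 - (h))))

/* Reserved bits of TIS, per the patch: 31..20, 15..12 and 7..4. */
#define TIS_RESERVED \
	(GENMASK_U32(31, 20) | GENMASK_U32(15, 12) | GENMASK_U32(7, 4))

/* Value written to acknowledge "mask" in a clear-on-write-0 register
 * while keeping every reserved bit at 0 instead of 1. */
static uint32_t ack_write(uint32_t mask)
{
	return ~(mask | TIS_RESERVED);
}
```

For example, acknowledging TIS_TFUF | TIS_TFWF (0x300) yields 0x000F0C0F: the acknowledged bits and all reserved bits are 0, everything else stays 1.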
[PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
On production servers running variety of workloads over time, kernel panic can happen sporadically after days or even months. It is important to collect as much debug logs as possible to root cause and fix the problem, that may not be easy to reproduce. Snapshot of underlying hardware/firmware state (like register dump, firmware logs, adapter memory, etc.), at the time of kernel panic will be very helpful while debugging the culprit device driver. This series of patches add new generic framework that enable device drivers to collect device specific snapshot of the hardware/firmware state of the underlying device in the crash recovery kernel. In crash recovery kernel, the collected logs are added as elf notes to /proc/vmcore, which is copied by user space scripts for post-analysis. The sequence of actions done by device drivers to append their device specific hardware/firmware logs to /proc/vmcore are as follows: 1. During probe (before hardware is initialized), device drivers register to the vmcore module (via vmcore_add_device_dump()), with callback function, along with buffer size and log name needed for firmware/hardware log collection. 2. vmcore module allocates the buffer with requested size. It adds an elf note and invokes the device driver's registered callback function. 3. Device driver collects all hardware/firmware logs into the buffer and returns control back to vmcore module. 
The device specific hardware/firmware logs can be seen as elf notes: # readelf -n /proc/vmcore Displaying notes found at file offset 0x1000 with length 0x04003288: Owner Data size Description VMCOREDD_cxgb4_:02:00.4 0x02000fd8Unknown note type: (0x0700) VMCOREDD_cxgb4_:04:00.4 0x02000fd8Unknown note type: (0x0700) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) VMCOREINFO 0x074f Unknown note type: (0x) Patch 1 adds API to vmcore module to allow drivers to register callback to collect the device specific hardware/firmware logs. The logs will be added to /proc/vmcore as elf notes. Patch 2 updates read and mmap logic to append device specific hardware/ firmware logs as elf notes. Patch 3 shows a cxgb4 driver example using the API to collect hardware/firmware logs in crash recovery kernel, before hardware is initialized. Thanks, Rahul RFC v1: https://lkml.org/lkml/2018/3/2/542 RFC v2: https://lkml.org/lkml/2018/3/16/326 --- v4: - Made __vmcore_add_device_dump() static. - Moved compile check to define vmcore_add_device_dump() to crash_dump.h to fix compilation when vmcore.c is not compiled in. - Convert ---help--- to help in Kconfig as indicated by checkpatch. - Rebased to tip. v3: - Dropped sysfs crashdd module. - Exported dumps as elf notes. Suggested by Eric Biederman. Added as patch 2 in this version. - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device dump support. - Moved logic related to adding dumps from crashdd to vmcore module. - Rename all crashdd* to vmcoredd*. - Updated comments. v2: - Added ABI Documentation for crashdd. - Directly use octal permission instead of macro. 
Changes since rfc v2: - Moved exporting crashdd from procfs to sysfs. Suggested by Stephen Hemminger - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory. - Replaced all proc API with sysfs API and updated comments. - Calling driver callback before creating the binary file under crashdd sysfs. - Changed binary dump file permission from S_IRUSR to S_IRUGO. - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP. rfc v2: - Collecting logs in 2nd kernel instead of during kernel panic. Suggested by Eric Biederman . - Added new crashdd module that exports /proc/crashdd/ containing driver's registered hardware/firmware logs in patch 1. - Replaced the API to allow drivers to register their hardware/firmware log collect routine in crash recovery kernel in patch 1. - Updated patch 2 to use the new API in patch 1. Rahul Lakkireddy (3): vmcore: add API to collect hardware dump in second kernel vmcore: append device dumps to vmcore as elf notes cxgb4: collect hardware dump in second kernel drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 4 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++ drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h | 3 +
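One detail patch 1 insists on is that each device-dump buffer size be rounded up to a whole number of pages, so the resulting elf-note region of /proc/vmcore can be mmaped. A sketch of that rounding (a 4096-byte page is assumed here, and the helper name is made up; the real allocation happens inside the vmcore module):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u /* assumed page size for the sketch */

/* Round a driver-requested dump size up to a whole number of pages so
 * the buffer backing the elf note can be exposed via mmap. */
static size_t vmcoredd_buf_size(size_t requested)
{
	return (requested + PAGE_SIZE_SKETCH - 1) &
	       ~(size_t)(PAGE_SIZE_SKETCH - 1);
}
```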
[PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
The sequence of actions done by device drivers to append their device specific hardware/firmware logs to /proc/vmcore are as follows: 1. During probe (before hardware is initialized), device drivers register to the vmcore module (via vmcore_add_device_dump()), with callback function, along with buffer size and log name needed for firmware/hardware log collection. 2. vmcore module allocates the buffer with requested size. It adds an Elf note and invokes the device driver's registered callback function. 3. Device driver collects all hardware/firmware logs into the buffer and returns control back to vmcore module. Ensure that the device dump buffer size is always aligned to page size so that it can be mmaped. Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more generic and reserve NT_VMCOREDD note type to indicate vmcore device dump. Suggested-by: Eric Biederman. Signed-off-by: Rahul Lakkireddy Signed-off-by: Ganesh Goudar --- v4: - Made __vmcore_add_device_dump() static. - Moved compile check to define vmcore_add_device_dump() to crash_dump.h to fix compilation when vmcore.c is not compiled in. - Convert ---help--- to help in Kconfig as indicated by checkpatch. - Rebased to tip. v3: - Dropped sysfs crashdd module. - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device dump support. - Moved logic related to adding dumps from crashdd to vmcore module. - Rename all crashdd* to vmcoredd*. v2: - Added ABI Documentation for crashdd. - Directly use octal permission instead of macro. Changes since rfc v2: - Moved exporting crashdd from procfs to sysfs. Suggested by Stephen Hemminger - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory. - Replaced all proc API with sysfs API and updated comments. - Calling driver callback before creating the binary file under crashdd sysfs. - Changed binary dump file permission from S_IRUSR to S_IRUGO. - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP. 
rfc v2: - Collecting logs in 2nd kernel instead of during kernel panic. Suggested by Eric Biederman . - Patch added in this series. fs/proc/Kconfig| 10 +++ fs/proc/vmcore.c | 152 ++--- include/linux/crash_core.h | 4 ++ include/linux/crash_dump.h | 17 + include/linux/kcore.h | 6 ++ include/uapi/linux/elf.h | 1 + 6 files changed, 181 insertions(+), 9 deletions(-) diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig index 1ade1206bb89..504c77a444bd 100644 --- a/fs/proc/Kconfig +++ b/fs/proc/Kconfig @@ -43,6 +43,16 @@ config PROC_VMCORE help Exports the dump image of crashed kernel in ELF format. +config PROC_VMCORE_DEVICE_DUMP + bool "Device Hardware/Firmware Log Collection" + depends on PROC_VMCORE + default y + help + Device drivers can collect the device specific snapshot of + their hardware or firmware before they are initialized in + crash recovery kernel. If you say Y here, the device dumps + will be added as ELF notes to /proc/vmcore + config PROC_SYSCTL bool "Sysctl support (/proc/sys)" if EXPERT depends on PROC_FS diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index a45f0af22a60..7395462d2f86 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -44,6 +45,12 @@ static u64 vmcore_size; static struct proc_dir_entry *proc_vmcore; +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP +/* Device Dump list and mutex to synchronize access to list */ +static LIST_HEAD(vmcoredd_list); +static DEFINE_MUTEX(vmcoredd_mutex); +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */ + /* * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error * The called function has to take care of module refcounting. 
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = { }; /** - * alloc_elfnotes_buf - allocate buffer for ELF note segment in - * vmalloc memory - * - * @notes_sz: size of buffer + * vmcore_alloc_buf - allocate buffer in vmalloc memory + * @sizez: size of buffer * * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap * the buffer to user-space by means of remap_vmalloc_range(). @@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = { * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is * disabled and there's no need to allow users to mmap the buffer. */ -static inline char *alloc_elfnotes_buf(size_t notes_sz) +static inline char *vmcore_alloc_buf(size_t size) { #ifdef CONFIG_MMU - return vmalloc_user(notes_sz); + return vmalloc_user(size); #else - return vzalloc(notes_sz); + return vzalloc(size); #endif } @@ -665,7 +670,7 @@
Re: [PATCH] VSOCK: make af_vsock.ko removable again
> On Apr 17, 2018, at 8:25 AM, Stefan Hajnoczi wrote:
>
> Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> vsock_init_tables()") introduced a module_init() function without a
> corresponding module_exit() function.
>
> Modules with an init function can only be removed if they also have an
> exit function. Therefore the vsock module was considered "permanent"
> and could not be removed.
>
> This patch adds an empty module_exit() function so that "rmmod vsock"
> works. No explicit cleanup is required because:
>
> 1. Transports call vsock_core_exit() upon exit and cannot be removed
>    while sockets are still alive.
> 2. vsock_diag.ko does not perform any action that requires cleanup by
>    vsock.ko.
>
> Reported-by: Xiumei Mu
> Cc: Cong Wang
> Cc: Jorgen Hansen
> Signed-off-by: Stefan Hajnoczi
> ---
> net/vmw_vsock/af_vsock.c | 6 ++
> 1 file changed, 6 insertions(+)
>
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index aac9b8f6552e..c1076c19b858 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -2018,7 +2018,13 @@ const struct vsock_transport
> *vsock_core_get_transport(void)
> }
> EXPORT_SYMBOL_GPL(vsock_core_get_transport);
>
> +static void __exit vsock_exit(void)
> +{
> +	/* Do nothing. This function makes this module removable. */
> +}
> +
> module_init(vsock_init_tables);
> +module_exit(vsock_exit);
>
> MODULE_AUTHOR("VMware, Inc.");
> MODULE_DESCRIPTION("VMware Virtual Socket Family");
> --
> 2.14.3
>

Looks good to me.

Reviewed-by: Jorgen Hansen
[PATCH net-next] vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also mean inheriting the inner protocol's ttl value, which is actually not implemented in vxlan yet. But we cannot treat ttl == 0 as "use the inner TTL", because that value is also used when the "ttl" option is not specified, and changing its meaning would be a behavior change, breaking real use cases. So add a different attribute IFLA_VXLAN_TTL_INHERIT for when "ttl inherit" is specified with the ip command. Reported-by: Jianlin Shi Suggested-by: Jiri Benc Signed-off-by: Hangbin Liu --- drivers/net/vxlan.c | 17 ++--- include/net/ip_tunnels.h | 11 +++ include/net/vxlan.h | 1 + include/uapi/linux/if_link.h | 1 + 4 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index aa5f034..209a840 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2085,9 +2085,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, local_ip = vxlan->cfg.saddr; dst_cache = &vxlan->dst_cache; md->gbp = skb->mark; - ttl = vxlan->cfg.ttl; - if (!ttl && vxlan_addr_multicast(dst)) - ttl = 1; + if (flags & VXLAN_F_TTL_INHERIT) { + ttl = ip_tunnel_get_ttl(old_iph, skb); + } else { + ttl = vxlan->cfg.ttl; + if (!ttl && vxlan_addr_multicast(dst)) + ttl = 1; + } tos = vxlan->cfg.tos; if (tos == 1) @@ -2709,6 +2713,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { [IFLA_VXLAN_GBP]= { .type = NLA_FLAG, }, [IFLA_VXLAN_GPE]= { .type = NLA_FLAG, }, [IFLA_VXLAN_REMCSUM_NOPARTIAL] = { .type = NLA_FLAG }, + [IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG }, }; static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[], @@ -3254,6 +3259,12 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[], if (data[IFLA_VXLAN_TTL]) conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]); + if (data[IFLA_VXLAN_TTL_INHERIT]) { + if (changelink) + return -EOPNOTSUPP; + conf->flags |= VXLAN_F_TTL_INHERIT; + } + if (data[IFLA_VXLAN_LABEL]) conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
IPV6_FLOWLABEL_MASK; diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index cbe5add..6c3c421 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -377,6 +377,17 @@ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph, return 0; } +static inline u8 ip_tunnel_get_ttl(const struct iphdr *iph, + const struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return iph->ttl; + else if (skb->protocol == htons(ETH_P_IPV6)) + return ((const struct ipv6hdr *)iph)->hop_limit; + else + return 0; +} + /* Propogate ECN bits out */ static inline u8 ip_tunnel_ecn_encap(u8 tos, const struct iphdr *iph, const struct sk_buff *skb) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index ad73d8b..b99a02ae 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -262,6 +262,7 @@ struct vxlan_dev { #define VXLAN_F_COLLECT_METADATA 0x2000 #define VXLAN_F_GPE0x4000 #define VXLAN_F_IPV6_LINKLOCAL 0x8000 +#define VXLAN_F_TTL_INHERIT0x1 /* Flags that are used in the receive path. These flags must match in * order for a socket to be shareable diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 11d0c0e..e771a63 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -516,6 +516,7 @@ enum { IFLA_VXLAN_COLLECT_METADATA, IFLA_VXLAN_LABEL, IFLA_VXLAN_GPE, + IFLA_VXLAN_TTL_INHERIT, __IFLA_VXLAN_MAX }; #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1) -- 2.5.5
[PATCH bpf-next 03/10] [bpf]: add bpf_xdp_adjust_tail sample prog
Adding a bpf sample program which uses the bpf_xdp_adjust_tail helper by generating an ICMPv4 "packet too big" message if the ingress packet's size is bigger than 600 bytes. Signed-off-by: Nikita V. Shirokov --- samples/bpf/Makefile | 4 + samples/bpf/xdp_adjust_tail_kern.c| 151 ++ samples/bpf/xdp_adjust_tail_user.c| 141 tools/testing/selftests/bpf/bpf_helpers.h | 2 + 4 files changed, 298 insertions(+) create mode 100644 samples/bpf/xdp_adjust_tail_kern.c create mode 100644 samples/bpf/xdp_adjust_tail_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 4d6a6edd4bf6..aa8c392e2e52 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp hostprogs-y += cpustat +hostprogs-y += xdp_adjust_tail # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o @@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o +xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o always += xdp2skb_meta_kern.o always += syscall_tp_kern.o always += cpustat_kern.o +always += xdp_adjust_tail_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_xdp_rxq_info += -lelf HOSTLOADLIBES_syscall_tp += -lelf HOSTLOADLIBES_cpustat += -lelf +HOSTLOADLIBES_xdp_adjust_tail += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c new file mode 100644
index ..17570559fd08 --- /dev/null +++ b/samples/bpf/xdp_adjust_tail_kern.c @@ -0,0 +1,151 @@ +/* Copyright (c) 2018 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program shows how to use bpf_xdp_adjust_tail() by + * generating ICMPv4 "packet too big" (unreachable/ df bit set frag needed + * to be more precise in case of v4) when receiving packets bigger than + * 600 bytes. + */ +#define KBUILD_MODNAME "foo" +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" + +#define DEFAULT_TTL 64 +#define MAX_PCKT_SIZE 600 +#define ICMP_TOOBIG_SIZE 98 +#define ICMP_TOOBIG_PAYLOAD_SIZE 92 + +struct bpf_map_def SEC("maps") icmpcnt = { + .type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(__u32), + .value_size = sizeof(__u64), + .max_entries = 1, +}; + +static __always_inline void count_icmp(void) +{ + u64 key = 0; + u64 *icmp_count; + + icmp_count = bpf_map_lookup_elem(&icmpcnt, &key); + if (icmp_count) + *icmp_count += 1; +} + +static __always_inline void swap_mac(void *data, struct ethhdr *orig_eth) +{ + struct ethhdr *eth; + + eth = data; + memcpy(eth->h_source, orig_eth->h_dest, ETH_ALEN); + memcpy(eth->h_dest, orig_eth->h_source, ETH_ALEN); + eth->h_proto = orig_eth->h_proto; +} + +static __always_inline __u16 csum_fold_helper(__u32 csum) +{ + return ~((csum & 0xffff) + (csum >> 16)); +} + +static __always_inline void ipv4_csum(void *data_start, int data_size, + __u32 *csum) +{ + *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum); + *csum = csum_fold_helper(*csum); +} + +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp) +{ + int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr); + + if (bpf_xdp_adjust_head(xdp, 0 - headroom)) + return XDP_DROP; + void *data = (void *)(long)xdp->data; + void *data_end = (void *)(long)xdp->data_end; + + if
(data + (ICMP_TOOBIG_SIZE + headroom) > data_end) + return XDP_DROP; + + struct iphdr *iph, *orig_iph; + struct icmphdr *icmp_hdr; + struct ethhdr *orig_eth; + __u32 csum = 0; + __u64 off = 0; + + orig_eth = data + headroom; + swap_mac(data, orig_eth); + off += sizeof(struct ethhdr); + iph = data + off; + off += sizeof(struct iphdr); + icmp_hdr = data + off; + off += sizeof(struct icmphdr); + orig_iph = data + off; + icmp_hdr->type = ICMP_DEST_UNREACH; + icmp_hdr->code = ICMP_FRAG_NEEDED; + icmp_hdr->un.frag.mtu =
[PATCH bpf-next 00/10] introduction of bpf_xdp_adjust_tail
In this patch series I'm adding a new bpf helper which allows manipulating xdp's data_end pointer. Right now only "shrinking" (reducing the packet's size by moving the pointer) is supported (and I see no use case for "growing"). The main use case for such a helper is to be able to generate control (ICMP) messages from XDP context. Such messages usually contain the first N bytes of the original packet as a payload, and this is exactly what this helper allows us to do (see patch 3 for a sample program, where we generate an ICMP "packet too big" message). This helper could be useful for load balancing applications where, after additional encapsulation, the resulting packet could be bigger than the interface MTU. Aside from the new helper, this patch series contains minor changes in the device drivers which require them, so that they recalculate the packet's length not only when the head pointer was adjusted but when the tail pointer was as well. Nikita V. Shirokov (10): [bpf]: adding bpf_xdp_adjust_tail helper [bpf]: adding tests for bpf_xdp_adjust_tail [bpf]: add bpf_xdp_adjust_tail sample prog [bpf]: make generic xdp compatible w/ bpf_xdp_adjust_tail [bpf]: make mlx4 compatible w/ bpf_xdp_adjust_tail [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail [bpf]: make tun compatible w/ bpf_xdp_adjust_tail [bpf]: make virtio compatible w/ bpf_xdp_adjust_tail drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- .../net/ethernet/netronome/nfp/nfp_net_common.c| 2 +- drivers/net/tun.c | 3 +- drivers/net/virtio_net.c | 7 +- include/uapi/linux/bpf.h | 10 +- net/bpf/test_run.c | 3 +- net/core/dev.c | 10 +- net/core/filter.c | 29 +++- samples/bpf/Makefile | 4 + samples/bpf/xdp_adjust_tail_kern.c | 151 + samples/bpf/xdp_adjust_tail_user.c | 141 +++ tools/include/uapi/linux/bpf.h | 11 +-
tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/bpf_helpers.h | 5 + tools/testing/selftests/bpf/test_adjust_tail.c | 29 tools/testing/selftests/bpf/test_progs.c | 32 + 18 files changed, 433 insertions(+), 12 deletions(-) create mode 100644 samples/bpf/xdp_adjust_tail_kern.c create mode 100644 samples/bpf/xdp_adjust_tail_user.c create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c -- 2.15.1
[PATCH bpf-next 02/10] [bpf]: adding tests for bpf_xdp_adjust_tail
Adding selftests for the bpf_xdp_adjust_tail helper. In this synthetic test we are testing that 1) if data_end < data the helper will return EINVAL and 2) for the normal use case the packet's length is reduced. Aside from adding new tests, I'm changing the behaviour of bpf_prog_test_run so that it also recalculates the packet's length if only the data_end pointer was changed. Signed-off-by: Nikita V. Shirokov --- net/bpf/test_run.c | 3 ++- tools/include/uapi/linux/bpf.h | 11 - tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/bpf_helpers.h | 3 +++ tools/testing/selftests/bpf/test_adjust_tail.c | 29 +++ tools/testing/selftests/bpf/test_progs.c | 32 ++ 6 files changed, 77 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 2ced48662c1f..68c3578343b4 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -170,7 +170,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, xdp.rxq = &rxqueue->xdp_rxq; retval = bpf_test_run(prog, &xdp, repeat, &duration); - if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN) + if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN || + xdp.data_end != xdp.data + size) size = xdp.data_end - xdp.data; ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration); kfree(data); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9d07465023a2..9a2d1a04eb24 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -755,6 +755,13 @@ union bpf_attr { * @addr: pointer to struct sockaddr to bind socket to * @addr_len: length of sockaddr structure * Return: 0 on success or negative error code + * + * int bpf_xdp_adjust_tail(xdp_md, delta) + * Adjust the xdp_md.data_end by delta. Only shrinking of packet's + * size is supported.
+ * @xdp_md: pointer to xdp_md + * @delta: A negative integer to be added to xdp_md.data_end + * Return: 0 on success or negative on error */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -821,7 +828,8 @@ union bpf_attr { FN(msg_apply_bytes),\ FN(msg_cork_bytes), \ FN(msg_pull_data), \ - FN(bind), + FN(bind), \ + FN(xdp_adjust_tail), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -864,6 +872,7 @@ enum bpf_func_id { /* BPF_FUNC_skb_set_tunnel_key flags. */ #define BPF_F_ZERO_CSUM_TX (1ULL << 1) #define BPF_F_DONT_FRAGMENT(1ULL << 2) +#define BPF_F_SEQ_NUMBER (1ULL << 3) /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and * BPF_FUNC_perf_event_read_value flags. diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 0a315ddabbf4..3e819dc70bee 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -31,7 +31,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \ test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \ sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \ - sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o + sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o # Order correspond to 'make run_tests' order TEST_PROGS := test_kmod.sh \ diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h index d8223d99f96d..50c607014b22 100644 --- a/tools/testing/selftests/bpf/bpf_helpers.h +++ b/tools/testing/selftests/bpf/bpf_helpers.h @@ -96,6 +96,9 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) = (void *) BPF_FUNC_msg_pull_data; static int (*bpf_bind)(void *ctx, void *addr, int addr_len) = (void *) BPF_FUNC_bind; +static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) = + (void *) 
BPF_FUNC_xdp_adjust_tail; + /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions diff --git a/tools/testing/selftests/bpf/test_adjust_tail.c b/tools/testing/selftests/bpf/test_adjust_tail.c new file mode 100644 index ..86239e792d6d --- /dev/null +++ b/tools/testing/selftests/bpf/test_adjust_tail.c @@ -0,0 +1,29 @@ +/* Copyright (c) 2016,2017 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#include
[PATCH bpf-next 05/10] [bpf]: make mlx4 compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the mlx4 driver we will just calculate the packet's length unconditionally (the same way as it's already being done in mlx5). Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 5c613c6663da..efc55feddc5c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -775,8 +775,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud act = bpf_prog_run_xdp(xdp_prog, &xdp); + length = xdp.data_end - xdp.data; if (xdp.data != orig_data) { - length = xdp.data_end - xdp.data; frags[0].page_offset = xdp.data - xdp.data_hard_start; va = xdp.data; -- 2.15.1
[PATCH bpf-next 04/10] [bpf]: make generic xdp compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For generic XDP we need to reflect this change in the packet's length by adjusting the skb's tail pointer. Signed-off-by: Nikita V. Shirokov --- net/core/dev.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 969462ebb296..11c789231a03 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3996,9 +3996,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, struct bpf_prog *xdp_prog) { struct netdev_rx_queue *rxqueue; + void *orig_data, *orig_data_end; u32 metalen, act = XDP_DROP; struct xdp_buff xdp; - void *orig_data; int hlen, off; u32 mac_len; @@ -4037,6 +4037,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, xdp.data_meta = xdp.data; xdp.data_end = xdp.data + hlen; xdp.data_hard_start = skb->data - skb_headroom(skb); + orig_data_end = xdp.data_end; orig_data = xdp.data; rxqueue = netif_get_rxqueue(skb); @@ -4051,6 +4052,13 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, __skb_push(skb, -off); skb->mac_header += off; + /* check if bpf_xdp_adjust_tail was used. it can only "shrink" +* pckt. +*/ + off = orig_data_end - xdp.data_end; + if (off != 0) + skb_set_tail_pointer(skb, xdp.data_end - xdp.data); + switch (act) { case XDP_REDIRECT: case XDP_TX: -- 2.15.1
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote: > > I turned pr_debug on in tcp_in_window() for another try and it's a bit > > mangled because the information on multiple lines and the function is > > called in parallel but it looks like I do have some seq > maxend +1 > > > > Although it's weird, the maxend was set WAY earlier apparently? > > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 > > receiver end=505004284 maxend=505033596 maxwin=29200 > > then window decreased drastically e.g. previous ack just before refusal: > > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 > > end=1913292311 > > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 > > receiver end=505020155 maxend=505033596 maxwin=39296 scale=7 > > scale=0 is suspect. > > Really if conntrack does not see SYN SYNACK packets, it should not > make any window check, since windows can be scaled up to 14 :/ Or maybe set the scaling to - TCP_MAX_WSCALE (14) by default - 0 when SYN or SYNACK without window scale option is seen - value of window scale option when SYN or SYNACK with it is seen Michal Kubecek
[PATCH bpf-next 06/10] [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the bnxt driver we will just calculate the packet's length unconditionally. Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 1389ab5e05df..1f0e872d0667 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -113,10 +113,10 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons, if (tx_avail != bp->tx_ring_size) *event &= ~BNXT_RX_EVENT; + *len = xdp.data_end - xdp.data; if (orig_data != xdp.data) { offset = xdp.data - xdp.data_hard_start; *data_ptr = xdp.data_hard_start + offset; - *len = xdp.data_end - xdp.data; } switch (act) { case XDP_PASS: -- 2.15.1
[PATCH bpf-next 08/10] [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the nfp driver we will just calculate the packet's length unconditionally. Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 1eb6549f2a54..d9111c077699 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) act = bpf_prog_run_xdp(xdp_prog, &xdp); - pkt_len -= xdp.data - orig_data; + pkt_len = xdp.data_end - xdp.data; pkt_off += xdp.data - orig_data; switch (act) { -- 2.15.1
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Mon, 16 Apr 2018 23:15:50 -0700 Christoph Hellwig wrote: > On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote: > > On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use > > x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that > > in below fix patch. > > x86_swiotlb_dma_ops should not exist any more, and x86 now uses > dma_direct_ops. Looks like you are applying it to an old kernel :) > > > Performance improved to 8.9 Mpps from approx 6.5 Mpps. > > > > (This was without my bulking for net_device->ndo_xdp_xmit, so that > > number should improve more). > > What is the number for the otherwise comparable setup without retpolines? Approx 12 Mpps. You forgot to handle the dma_direct_mapping_error() case, which still used the retpoline in the above 8.9 Mpps measurement; I fixed it up and performance increased to 9.6 Mpps. Notice, in this test there are still two retpolines/indirect calls left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF prog. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH net-next 1/3] net: phy: Add binding for vendor specific C45 MDIO address space
The extra property enables the discovery on the MDIO bus of the PHYs which have a vendor specific address space for accessing the C45 MDIO registers. Signed-off-by: Vicentiu Galanopulo --- Documentation/devicetree/bindings/net/phy.txt | 6 ++ 1 file changed, 6 insertions(+) diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt index d2169a5..82692e2 100644 --- a/Documentation/devicetree/bindings/net/phy.txt +++ b/Documentation/devicetree/bindings/net/phy.txt @@ -61,6 +61,11 @@ Optional Properties: - reset-deassert-us: Delay after the reset was deasserted in microseconds. If this property is missing the delay will be skipped. +- dev-addr: If set, it indicates the device address of the PHY to be used + when accessing the C45 PHY registers over MDIO. It is used for vendor specific + register space addresses that do not conform to the standard address for the MDIO + registers (e.g. MMD30) + Example: ethernet-phy@0 { @@ -72,4 +77,5 @@ ethernet-phy@0 { reset-gpios = < 4 GPIO_ACTIVE_LOW>; reset-assert-us = <1000>; reset-deassert-us = <2000>; + dev-addr = <0x1e>; }; -- 2.7.4
[PATCH net] vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
Syzkaller spotted an old bug which leads to reading skb beyond tail by 4 bytes on vlan tagged packets. This is caused because skb_vlan_tagged_multi() did not check skb_headlen. BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline] BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline] BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline] BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline] BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009 CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 eth_type_vlan include/linux/if_vlan.h:283 [inline] skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline] vlan_features_check include/linux/if_vlan.h:672 [inline] dflt_features_check net/core/dev.c:2949 [inline] netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009 validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084 __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590 packet_snd net/packet/af_packet.c:2944 [inline] packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] sock_write_iter+0x3b9/0x470 net/socket.c:909 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x43ffa9 RSP: 002b:7fff2cff3948 EFLAGS: 
0217 ORIG_RAX: 0014 RAX: ffda RBX: 004002c8 RCX: 0043ffa9 RDX: 0001 RSI: 2080 RDI: 0003 RBP: 006cb018 R08: R09: R10: R11: 0217 R12: 004018d0 R13: 00401960 R14: R15: Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234 sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085 packet_alloc_skb net/packet/af_packet.c:2803 [inline] packet_snd net/packet/af_packet.c:2894 [inline] packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] sock_write_iter+0x3b9/0x470 net/socket.c:909 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.") Reported-and-tested-by: syzbot+0bbe42c764feafa82...@syzkaller.appspotmail.com Signed-off-by: Toshiaki Makita--- include/linux/if_vlan.h | 7 +-- net/core/dev.c | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index d11f41d..78a5a90 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -663,7 +663,7 @@ static inline bool skb_vlan_tagged(const struct sk_buff *skb) * Returns true if the 
skb is tagged with multiple vlan headers, regardless * of whether it is hardware accelerated or not. */ -static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb) +static inline bool skb_vlan_tagged_multi(struct sk_buff *skb) { __be16 protocol = skb->protocol; @@ -673,6 +673,9 @@ static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb) if (likely(!eth_type_vlan(protocol))) return false; + if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN))) + return false; + veh = (struct vlan_ethhdr *)skb->data;
Re: [PATCH net 1/2] tipc: add policy for TIPC_NLA_NET_ADDR
On 04/16/2018 11:29 PM, Eric Dumazet wrote: > Before syzbot/KMSAN bites, add the missing policy for TIPC_NLA_NET_ADDR > > Fixes: 27c21416727a ("tipc: add net set to new netlink api") > Signed-off-by: Eric Dumazet> Cc: Jon Maloy > Cc: Ying Xue Acked-by: Ying Xue > --- > net/tipc/netlink.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c > index > b76f13f6fea10a53d00ed14a38cdf5cdf7afa44c..d4e0bbeee72793a060befaf8a9d0239731c0d48c > 100644 > --- a/net/tipc/netlink.c > +++ b/net/tipc/netlink.c > @@ -79,7 +79,8 @@ const struct nla_policy > tipc_nl_sock_policy[TIPC_NLA_SOCK_MAX + 1] = { > > const struct nla_policy tipc_nl_net_policy[TIPC_NLA_NET_MAX + 1] = { > [TIPC_NLA_NET_UNSPEC] = { .type = NLA_UNSPEC }, > - [TIPC_NLA_NET_ID] = { .type = NLA_U32 } > + [TIPC_NLA_NET_ID] = { .type = NLA_U32 }, > + [TIPC_NLA_NET_ADDR] = { .type = NLA_U32 }, > }; > > const struct nla_policy tipc_nl_link_policy[TIPC_NLA_LINK_MAX + 1] = { >
[PATCHv2 net-next] vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also mean inheriting the inner protocol's ttl value, which is actually not implemented in vxlan yet. But we cannot treat ttl == 0 as "use the inner TTL", because that value is also used when the "ttl" option is not specified, and changing its meaning would be a behavior change, breaking real use cases. So add a different attribute IFLA_VXLAN_TTL_INHERIT for when "ttl inherit" is specified. --- v2: As suggested by Stefano, clean up function ip_tunnel_get_ttl(). Suggested-by: Jiri Benc Signed-off-by: Hangbin Liu --- drivers/net/vxlan.c | 17 ++--- include/net/ip_tunnels.h | 12 include/net/vxlan.h | 1 + include/uapi/linux/if_link.h | 1 + 4 files changed, 28 insertions(+), 3 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index aa5f034..209a840 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2085,9 +2085,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, local_ip = vxlan->cfg.saddr; dst_cache = &vxlan->dst_cache; md->gbp = skb->mark; - ttl = vxlan->cfg.ttl; - if (!ttl && vxlan_addr_multicast(dst)) - ttl = 1; + if (flags & VXLAN_F_TTL_INHERIT) { + ttl = ip_tunnel_get_ttl(old_iph, skb); + } else { + ttl = vxlan->cfg.ttl; + if (!ttl && vxlan_addr_multicast(dst)) + ttl = 1; + } tos = vxlan->cfg.tos; if (tos == 1) @@ -2709,6 +2713,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { [IFLA_VXLAN_GBP]= { .type = NLA_FLAG, }, [IFLA_VXLAN_GPE]= { .type = NLA_FLAG, }, [IFLA_VXLAN_REMCSUM_NOPARTIAL] = { .type = NLA_FLAG }, + [IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG }, }; static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[], @@ -3254,6 +3259,12 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[], if (data[IFLA_VXLAN_TTL]) conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]); + if (data[IFLA_VXLAN_TTL_INHERIT]) { + if (changelink) + return -EOPNOTSUPP; + conf->flags |= VXLAN_F_TTL_INHERIT; + } + if (data[IFLA_VXLAN_LABEL]) conf->label =
nla_get_be32(data[IFLA_VXLAN_LABEL]) & IPV6_FLOWLABEL_MASK; diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index cbe5add..5a8ab9f 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -377,6 +377,18 @@ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph, return 0; } +static inline u8 ip_tunnel_get_ttl(const struct iphdr *iph, + const struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return iph->ttl; + + if (skb->protocol == htons(ETH_P_IPV6)) + return ((const struct ipv6hdr *)iph)->hop_limit; + + return 0; +} + /* Propogate ECN bits out */ static inline u8 ip_tunnel_ecn_encap(u8 tos, const struct iphdr *iph, const struct sk_buff *skb) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index ad73d8b..b99a02ae 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -262,6 +262,7 @@ struct vxlan_dev { #define VXLAN_F_COLLECT_METADATA 0x2000 #define VXLAN_F_GPE0x4000 #define VXLAN_F_IPV6_LINKLOCAL 0x8000 +#define VXLAN_F_TTL_INHERIT0x1 /* Flags that are used in the receive path. These flags must match in * order for a socket to be shareable diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 11d0c0e..e771a63 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -516,6 +516,7 @@ enum { IFLA_VXLAN_COLLECT_METADATA, IFLA_VXLAN_LABEL, IFLA_VXLAN_GPE, + IFLA_VXLAN_TTL_INHERIT, __IFLA_VXLAN_MAX }; #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1) -- 2.5.5
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
On 04/17/2018 04:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz

This should be:

From: Michael Karcher

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH v2 6/8] net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable
On 04/17/2018 04:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz

This should be:

From: Michael Karcher

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
于 2018年4月17日 GMT+08:00 下午7:59:38, Chen-Yu Tsai写到: >On Tue, Apr 17, 2018 at 7:52 PM, Maxime Ripard > wrote: >> On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: >>> On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >>> wrote: >>> > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: >>> >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng >wrote: >>> >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard > 写到: >>> >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: >>> >> >>> From: Chen-Yu Tsai >>> >> >>> >>> >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the >CCU >>> >> >>> address space; on the A64 SoC this register is in the SRAM >controller >>> >> >>> address space, and with a different offset. >>> >> >>> >>> >> >>> To access the register from another device and hide the >internal >>> >> >>> difference between the device, let it register a regmap named >>> >> >>> "emac-clock". We can then get the device from the phandle, >and >>> >> >>> retrieve the regmap with dev_get_regmap(); in this situation >the >>> >> >>> regmap_field will be set up to access the only register in >the >>> >> >>regmap. 
>>> >> >>> >>> >> >>> Signed-off-by: Chen-Yu Tsai >>> >> >>> [Icenowy: change to use regmaps with single register, change >commit >>> >> >>> message] >>> >> >>> Signed-off-by: Icenowy Zheng >>> >> >>> --- >>> >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 >>> >> >>++- >>> >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) >>> >> >>> >>> >> >>> diff --git >a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> index 1037f6c78bca..b61210c0d415 100644 >>> >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> @@ -85,6 +85,13 @@ const struct reg_field >old_syscon_reg_field = { >>> >> >>> .msb = 31, >>> >> >>> }; >>> >> >>> >>> >> >>> +/* Specially exported regmap which contains only EMAC >register */ >>> >> >>> +const struct reg_field single_reg_field = { >>> >> >>> +.reg = 0, >>> >> >>> +.lsb = 0, >>> >> >>> +.msb = 31, >>> >> >>> +}; >>> >> >>> + >>> >> >> >>> >> >>I'm not sure this would be wise. If we ever need some other >register >>> >> >>exported through the regmap, will have to change all the >calling sites >>> >> >>everywhere in the kernel, which will be a pain and will break >>> >> >>bisectability. >>> >> > >>> >> > In this situation the register can be exported as another >>> >> > regmap. Currently the code will access a regmap with name >>> >> > "emac-clock" for this register. >>> >> > >>> >> >> >>> >> >>Chen-Yu's (or was it yours?) initial solution with a custom >writeable >>> >> >>hook only allowing a single register seemed like a better one. >>> >> > >>> >> > But I remember you mentioned that you want it to hide the >>> >> > difference inside the device. >>> >> >>> >> The idea is that a device can export multiple regmaps. 
This one, >>> >> the one named "gmac" (in my soon to come v2) or "emac-clock" >here, >>> >> is but one of many possible regmaps, and it only exports the >register >>> >> needed by the GMAC/EMAC. >>> > >>> > I'm not sure this would be wise either. There's a single register >map, >>> > and as far as I know we don't have a binding to express this in >the >>> > DT. This means that the customer and provider would have to use >the >>> > same name, but without anything actually enforcing it aside from >>> > "someone in the community knows it". >>> > >>> > This is not a really good design, and I was actually preferring >your >>> > first option. We shouldn't rely on any undocumented rule. This >will be >>> > easy to break and hard to maintain. >>> >>> So, one regmap per device covering the whole register range, and the >>> consumer knows which register to poke by looking at its own >compatible. >>> >>> That sound right? >> >> Yep. And ideally, sending a single serie for both the A64 and the R40 >> cases, in order to provide the big picture. > >OK. I'll incorporate Icenowy's stuff into my series. In this situation maybe I should send newer revision of A64 drivers to you? > >ChenYu > >___ >linux-arm-kernel mailing list >linux-arm-ker...@lists.infradead.org >http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
[PATCH] net: qrtr: add MODULE_ALIAS_NETPROTO macro
To ensure that qrtr can be loaded automatically, when needed, if it is
compiled as a module.

Signed-off-by: Nicolas Dechesne
---
 net/qrtr/qrtr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index b33e5aeb4c06..2aa07b547b16 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -1135,3 +1135,4 @@ module_exit(qrtr_proto_fini);

 MODULE_DESCRIPTION("Qualcomm IPC-router driver");
 MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_NETPROTO(PF_QIPCRTR);
--
2.14.2
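For background on why this one-liner enables autoloading: `MODULE_ALIAS_NETPROTO(proto)` records a module alias of the form "net-pf-<N>" in the module, and the kernel requests exactly that string from modprobe when `socket()` is called with an unregistered protocol family. A minimal userspace sketch of the stringification involved, assuming `PF_QIPCRTR` is 42 as in the current uapi headers:

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification, as in the kernel's __stringify(): the
 * extra level lets PF_QIPCRTR expand to its numeric value first. */
#define __stringify_1(x) #x
#define __stringify(x)   __stringify_1(x)

/* Assumed value from <linux/socket.h>: AF_QIPCRTR/PF_QIPCRTR == 42. */
#define PF_QIPCRTR 42

/* Sketch of the alias string MODULE_ALIAS_NETPROTO() embeds, which
 * userspace module loading matches against "net-pf-42". */
static const char qrtr_alias[] = "net-pf-" __stringify(PF_QIPCRTR);
```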
[PATCH net] sfc: check RSS is active for filter insert
For some firmware variants - specifically 'capture packed stream' -
RSS filters are not valid. We must check if RSS is actually active
rather than merely enabled.

Fixes: 42356d9a137b ("sfc: support RSS spreading of ethtool ntuple filters")
Signed-off-by: Bert Kenward
---
 drivers/net/ethernet/sfc/ef10.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 36f24c7e553a..83ce229f4eb7 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -5264,7 +5264,7 @@ static int efx_ef10_filter_insert_addr_list(struct efx_nic *efx,
 		ids = vlan->uc;
 	}

-	filter_flags = efx_rss_enabled(efx) ? EFX_FILTER_FLAG_RX_RSS : 0;
+	filter_flags = efx_rss_active(&efx->rss_context) ? EFX_FILTER_FLAG_RX_RSS : 0;

 	/* Insert/renew filters */
 	for (i = 0; i < addr_count; i++) {
@@ -5333,7 +5333,7 @@ static int efx_ef10_filter_insert_def(struct efx_nic *efx,
 	int rc;
 	u16 *id;

-	filter_flags = efx_rss_enabled(efx) ? EFX_FILTER_FLAG_RX_RSS : 0;
+	filter_flags = efx_rss_active(&efx->rss_context) ? EFX_FILTER_FLAG_RX_RSS : 0;

 	efx_filter_init_rx(&spec, EFX_FILTER_PRI_AUTO, filter_flags, 0);
--
2.13.6
Re: [PATCH/RFC net-next 2/5] ravb: correct ptp does failure after suspend and resume
> @@ -2302,6 +2305,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>  {
>  	struct net_device *ndev = dev_get_drvdata(dev);
>  	struct ravb_private *priv = netdev_priv(ndev);
> +	struct platform_device *pdev = priv->pdev;

Minor nit: I'd save this line...

> +	if (priv->chip_id != RCAR_GEN2)
> +		ravb_ptp_init(ndev, pdev);

... and use

	ravb_ptp_init(ndev, priv->pdev);

here. But well, maybe just bike-shedding...
Re: [PATCH bpf-next 1/3] bpftool: Add missing prog types and attach types
2018-04-16 16:57 UTC-0700 ~ Andrey Ignatov> Jakub Kicinski [Mon, 2018-04-16 16:53 -0700]: >> On Mon, 16 Apr 2018 14:41:57 -0700, Andrey Ignatov wrote: >>> diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c >>> index cae32a6..8689916 100644 >>> --- a/tools/bpf/bpftool/cgroup.c >>> +++ b/tools/bpf/bpftool/cgroup.c >>> @@ -16,15 +16,28 @@ >>> #define HELP_SPEC_ATTACH_FLAGS >>> \ >>> "ATTACH_FLAGS := { multi | override }" >>> >>> -#define HELP_SPEC_ATTACH_TYPES >>> \ >>> - "ATTACH_TYPE := { ingress | egress | sock_create | sock_ops | device }" >>> +#define HELP_SPEC_ATTACH_TYPES >>>\ >>> + " ATTACH_TYPE := { ingress | egress | sock_create |\n" \ >>> + "sock_ops | stream_parser |\n" \ >>> + "stream_verdict | device | msg_verdict |\n"\ >>> + "bind4 | bind6 | connect4 | connect6 |\n" \ >>> + "post_bind4 | post_bind6 }" >>> >> >> Would you mind updating the man page in Documentation/ as well? > > Sure. Will update and send v2. > Hi Andrey, In addition to the Documentation, there would also be the bash completion to update. The patch below should make it, please feel free to incorporate it to your changes if it seems alright to you. Otherwise I'll submit it as a follow-up. 
Quentin --- diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 71cc5dec3685..dad9109c2800 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -374,7 +374,8 @@ _bpftool() ;; attach|detach) local ATTACH_TYPES='ingress egress sock_create sock_ops \ -device' +stream_parser stream_verdict device msg_verdict bind4 \ +bind6 connect4 connect6 post_bind4 post_bind6' local ATTACH_FLAGS='multi override' local PROG_TYPE='id pinned tag' case $prev in @@ -382,7 +383,9 @@ _bpftool() _filedir return 0 ;; -ingress|egress|sock_create|sock_ops|device) +ingress|egress|sock_create|sock_ops|stream_parser|\ +stream_verdict|device|msg_verdict|bind4|bind6|\ +connect4|connect6|post_bind4|post_bind6) COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ "$cur" ) ) return 0
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
On Tue, Apr 17, 2018 at 7:52 PM, Maxime Ripardwrote: > On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: >> On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >> wrote: >> > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: >> >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng wrote: >> >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard >> >> > 写到: >> >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: >> >> >>> From: Chen-Yu Tsai >> >> >>> >> >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU >> >> >>> address space; on the A64 SoC this register is in the SRAM controller >> >> >>> address space, and with a different offset. >> >> >>> >> >> >>> To access the register from another device and hide the internal >> >> >>> difference between the device, let it register a regmap named >> >> >>> "emac-clock". We can then get the device from the phandle, and >> >> >>> retrieve the regmap with dev_get_regmap(); in this situation the >> >> >>> regmap_field will be set up to access the only register in the >> >> >>regmap. 
>> >> >>> >> >> >>> Signed-off-by: Chen-Yu Tsai >> >> >>> [Icenowy: change to use regmaps with single register, change commit >> >> >>> message] >> >> >>> Signed-off-by: Icenowy Zheng >> >> >>> --- >> >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 >> >> >>++- >> >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) >> >> >>> >> >> >>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> index 1037f6c78bca..b61210c0d415 100644 >> >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = { >> >> >>> .msb = 31, >> >> >>> }; >> >> >>> >> >> >>> +/* Specially exported regmap which contains only EMAC register */ >> >> >>> +const struct reg_field single_reg_field = { >> >> >>> +.reg = 0, >> >> >>> +.lsb = 0, >> >> >>> +.msb = 31, >> >> >>> +}; >> >> >>> + >> >> >> >> >> >>I'm not sure this would be wise. If we ever need some other register >> >> >>exported through the regmap, will have to change all the calling sites >> >> >>everywhere in the kernel, which will be a pain and will break >> >> >>bisectability. >> >> > >> >> > In this situation the register can be exported as another >> >> > regmap. Currently the code will access a regmap with name >> >> > "emac-clock" for this register. >> >> > >> >> >> >> >> >>Chen-Yu's (or was it yours?) initial solution with a custom writeable >> >> >>hook only allowing a single register seemed like a better one. >> >> > >> >> > But I remember you mentioned that you want it to hide the >> >> > difference inside the device. >> >> >> >> The idea is that a device can export multiple regmaps. This one, >> >> the one named "gmac" (in my soon to come v2) or "emac-clock" here, >> >> is but one of many possible regmaps, and it only exports the register >> >> needed by the GMAC/EMAC. 
>> > >> > I'm not sure this would be wise either. There's a single register map, >> > and as far as I know we don't have a binding to express this in the >> > DT. This means that the customer and provider would have to use the >> > same name, but without anything actually enforcing it aside from >> > "someone in the community knows it". >> > >> > This is not a really good design, and I was actually preferring your >> > first option. We shouldn't rely on any undocumented rule. This will be >> > easy to break and hard to maintain. >> >> So, one regmap per device covering the whole register range, and the >> consumer knows which register to poke by looking at its own compatible. >> >> That sound right? > > Yep. And ideally, sending a single serie for both the A64 and the R40 > cases, in order to provide the big picture. OK. I'll incorporate Icenowy's stuff into my series. ChenYu
Re: tcp hang when socket fills up ?
Michal Kubecek wrote on Tue, Apr 17, 2018: > Data (21 bytes) packet in reply direction. And somewhere between the > first and second debugging print, we ended up with sender scale=0 and > that value is then preserved from now on. > > The only place between the two debug prints where we could change only > one of the td_sender values are the two calls to tcp_options() but > neither should be called now unless I missed something. I'll try to > think about it some more. Could it have something to do with the way I setup the connection? I don't think the "both remotes call connect() with carefully selected source/dest port" is a very common case.. If you look at the tcpdump outputs I attached the sequence usually is something like server > client SYN client > server SYN server > client SYNACK client > server ACK ultimately it IS a connection, but with an extra SYN packet in front of it (that first SYN opens up the conntrack of the nat so that the client's syn can come in, the client's conntrack will be that of a normal connection since its first SYN goes in directly after the server's (it didn't see the server's SYN)) Looking at my logs again, I'm seeing the same as you: This looks like the actual SYN/SYN/SYNACK/ACK: - 14.364090 seq=505004283 likely SYN coming out of server - 14.661731 seq=1913287797 on next line it says receiver end=505004284 so likely the matching SYN from client Which this time gets a proper SYNACK from server: 14.662020 seq=505004283 ack=1913287798 And following final dataless ACK: 14.687570 seq=1913287798 ack=505004284 Then as you point out some data ACK, where the scale poofs: 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 14.688824 tcp_in_window: 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.62 tcp_in_window: sender 
end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7

As you say, only tcp_options() will clear only one side of the scales.
We don't have sender->td_maxwin == 0 (printed), so I see no other way
than that we are in the last else if:
- we have after(end, sender->td_end) (end=1913287819 > sender
  end=1913287798)
- I assume the tcp state machine must be confused because of the
  SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, but
  since this is a data packet it doesn't have the tcp option for scale,
  thus the scale resets.

At least peeling through the logs myself helped me follow the process.
I'll sprinkle some carefully crafted logs tomorrow to check if this is
true, and will let you figure out what is best: trying to preserve the
scale if it was set before, setting a default of 14, or something else.

Thanks!
--
Dominique Martinet | Asmadeus
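To see why a lost scale factor matters here (illustrative arithmetic, not conntrack code): the 16-bit window field in the TCP header is shifted left by the negotiated window-scale option, so win=229 with scale=7 covers about 29 KB of in-flight data, while the same field with scale=0 covers only 229 bytes. Once more data than that is outstanding, conntrack judges packets out-of-window and the connection hangs, matching the symptom in the subject.

```c
#include <assert.h>
#include <stdint.h>

/* Effective receive window: the raw 16-bit window field shifted left
 * by the negotiated window-scale option (RFC 7323). */
static uint32_t effective_window(uint16_t win, uint8_t scale)
{
	return (uint32_t)win << scale;
}
```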
Re: [PATCH net-next 2/3] net: phy: Change the array size to 32 for device_ids
On Tue, Apr 17, 2018 at 04:02:32AM -0500, Vicentiu Galanopulo wrote:
> In the context of enabling the discovery of the PHYs
> which have the C45 MDIO address space in a non-standard
> address: num_ids in get_phy_c45_ids has the
> value 8 (ARRAY_SIZE(c45_ids->device_ids)), but the
> u32 *devs can store 32 devices in the bitfield.
>
> If a device is stored in *devs, in bits 32 to 9
> (bit counting in lookup loop starts from 1), it will
> not be found.

Reviewed-by: Andrew Lunn

	Andrew
Re: [PATCH net-next 1/3] net: phy: Add binding for vendor specific C45 MDIO address space
On Tue, Apr 17, 2018 at 04:02:31AM -0500, Vicentiu Galanopulo wrote:
> The extra property enables the discovery on the MDIO bus
> of the PHYs which have a vendor specific address space
> for accessing the C45 MDIO registers.
>
> Signed-off-by: Vicentiu Galanopulo

Hi Vicentiu

I think the binding is O.K., but the implementation needs work. So

Reviewed-by: Andrew Lunn

	Andrew
Re: [PATCH 08/10] net: ax88796: Make reset more robust on AX88796B
On Tue, Apr 17, 2018 at 07:18:10AM +0200, Michael Karcher wrote:
> [Andrew, sorry for the dup. I did hit reply-to-author instead of
> reply-to-all first.]
>
> Andrew Lunn schrieb:
> >> > This should really be fixed in the PHY driver, not the MAC.
> >>
> >> OK - do you want this separate, or as part of this series? Might have
> >> a few side effects on more commonly used hardware, perhaps?
> >
> > Hi Michael
> >
> > What PHY driver is used?
> The ax88796b comes with its own integrated (buggy) PHY needing this
> workaround. This PHY has its own ID which is not known by Linux, so it is
> using the genphy driver as fallback.
>
> > In the driver you can implement a .soft_reset
> > function which first does the dummy write, and then uses
> > genphy_soft_reset() to do the actual reset.
> We could do that - but I don't see the point in creating a PHY driver
> that is only ever used by this MAC driver, just to add a single line to
> the genphy driver. If the same PHY might be used with a different MAC,
> you definitely would have a point there, though.

Hi Michael

We try to keep the core code clean, and put all workarounds for buggy
hardware in drivers specific to them. It just helps keep the core code
maintainable.

I would prefer a driver specific to this PHY with the workaround. But
let's see what Florian says.

	Andrew
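The shape of the `.soft_reset` wrapper being suggested can be sketched as below. This is a hypothetical userspace model for illustration: `phy_write()`, `genphy_soft_reset()`, and the register chosen for the dummy write are stand-ins, not the real phylib API (the real hook takes a `struct phy_device *` and does MDIO bus I/O).

```c
#include <assert.h>

/* Userspace stand-ins for the phylib calls (assumptions). */
static int last_write_reg = -1;
static int genphy_resets;

static int phy_write(int reg, int val)
{
	last_write_reg = reg;
	(void)val;
	return 0;
}

static int genphy_soft_reset(void)
{
	genphy_resets++;
	return 0;
}

#define MII_DUMMY_REG 0x1f	/* hypothetical register for the dummy write */

/* The .soft_reset hook of a hypothetical ax88796b-specific PHY driver:
 * issue the dummy write first, then delegate to the generic reset. */
static int ax88796b_soft_reset(void)
{
	int ret = phy_write(MII_DUMMY_REG, 0);

	if (ret < 0)
		return ret;
	return genphy_soft_reset();
}
```

The point of this structure is that the workaround lives entirely in a driver matched on the buggy PHY's ID, so the generic genphy path stays clean.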
[PATCH net-next] cxgb4vf: display pause settings
Add support to display pause settings.

Signed-off-by: Ganesh Goudar
---
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 9a81b523..71f13bd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -1419,6 +1419,22 @@ static int cxgb4vf_get_link_ksettings(struct net_device *dev,
 		base->duplex = DUPLEX_UNKNOWN;
 	}

+	if (pi->link_cfg.fc & PAUSE_RX) {
+		if (pi->link_cfg.fc & PAUSE_TX) {
+			ethtool_link_ksettings_add_link_mode(link_ksettings,
+							     advertising,
+							     Pause);
+		} else {
+			ethtool_link_ksettings_add_link_mode(link_ksettings,
+							     advertising,
+							     Asym_Pause);
+		}
+	} else if (pi->link_cfg.fc & PAUSE_TX) {
+		ethtool_link_ksettings_add_link_mode(link_ksettings,
+						     advertising,
+						     Asym_Pause);
+	}
+
 	base->autoneg = pi->link_cfg.autoneg;
 	if (pi->link_cfg.pcaps & FW_PORT_CAP32_ANEG)
 		ethtool_link_ksettings_add_link_mode(link_ksettings,
--
2.1.0
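A compact sketch of the decision table the patch's if/else ladder implements: symmetric pause advertises Pause, a single direction advertises Asym_Pause. The bit values below are hypothetical stand-ins for illustration; the real driver uses the ethtool link-mode helpers shown in the diff.

```c
#include <assert.h>

/* Hypothetical flag values for the sketch (the driver's PAUSE_RX/TX). */
#define PAUSE_RX 0x1
#define PAUSE_TX 0x2

/* Advertised link-mode stand-ins. */
#define ADV_PAUSE      0x1
#define ADV_ASYM_PAUSE 0x2

/* Mirrors the patch: RX+TX -> Pause, RX only or TX only -> Asym_Pause,
 * neither -> nothing advertised. */
static int pause_modes(int fc)
{
	if (fc & PAUSE_RX)
		return (fc & PAUSE_TX) ? ADV_PAUSE : ADV_ASYM_PAUSE;
	if (fc & PAUSE_TX)
		return ADV_ASYM_PAUSE;
	return 0;
}
```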
[PATCH RESEND net-next] ipv6: provide Kconfig switch to disable accept_ra by default
Many distributions and users prefer to handle router advertisements in
userspace; one example is OpenWrt, which includes a combined RA and
DHCPv6 client. For such configurations, accept_ra should not be enabled
by default.

As setting net.ipv6.conf.default.accept_ra via sysctl.conf or similar
facilities may be too late to catch all interfaces, and common
sysctl.conf tools do not allow setting an option for all existing
interfaces, this patch provides a Kconfig option to control the default
value of default.accept_ra.

Using default.accept_ra is preferable to all.accept_ra for our usecase,
as disabling all.accept_ra would preclude users from explicitly enabling
accept_ra on individual interfaces.

Signed-off-by: Matthias Schiffer
---
 net/ipv6/Kconfig    | 12 ++++++++++++
 net/ipv6/addrconf.c |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 6794ddf0547c..0f453110f288 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -20,6 +20,18 @@ menuconfig IPV6

 if IPV6

+config IPV6_ACCEPT_RA_DEFAULT
+	bool "IPv6: Accept router advertisements by default"
+	default y
+	help
+	  The kernel can internally handle IPv6 router advertisements for
+	  stateless address autoconfiguration (SLAAC) and route configuration,
+	  which can be configured in detail and per-interface using a number of
+	  sysctl options. This option controls the default value of
+	  net.ipv6.conf.default.accept_ra.
+
+	  If unsure, say Y.
+
 config IPV6_ROUTER_PREF
 	bool "IPv6: Router Preference (RFC 4191) support"
 	---help---
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 62b97722722c..5fd6d2120a35 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -245,7 +245,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.forwarding		= 0,
 	.hop_limit		= IPV6_DEFAULT_HOPLIMIT,
 	.mtu6			= IPV6_MIN_MTU,
-	.accept_ra		= 1,
+	.accept_ra		= IS_ENABLED(CONFIG_IPV6_ACCEPT_RA_DEFAULT),
 	.accept_redirects	= 1,
 	.autoconf		= 1,
 	.force_mld_version	= 0,
--
2.17.0
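The `IS_ENABLED()` used in the addrconf hunk resolves at compile time to 1 or 0 depending on whether the Kconfig symbol is defined. A simplified, self-contained sketch of the preprocessor trick behind it (the same two-step expansion mechanism as the kernel's include/linux/kconfig.h, minus the module-symbol handling):

```c
#include <assert.h>

/* When a Kconfig option is =y, the build defines CONFIG_FOO as 1.
 * The trick: __ARG_PLACEHOLDER_1 expands to "0," so that a defined
 * option shifts the literal 1 into the second-argument slot, while an
 * undefined option leaves the fallback 0 there. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x)      ___is_defined(x)
#define ___is_defined(val)   ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

#define CONFIG_IPV6_ACCEPT_RA_DEFAULT 1	/* pretend the option is =y */
/* CONFIG_DISABLED_OPTION deliberately left undefined */

static const int ra_default      = __is_defined(CONFIG_IPV6_ACCEPT_RA_DEFAULT);
static const int disabled_option = __is_defined(CONFIG_DISABLED_OPTION);
```

This is why the one-line change to `.accept_ra` is enough: the initializer becomes a plain 1 or 0 constant with no runtime cost.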
Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
On Tue, Apr 17, 2018 at 10:50:26AM +0200, Simon Horman wrote:
> From: Masaru Nagai
>
> [   58.490829] =
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted
> [   58.505529] -
> [   58.509904] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [   58.515939] swapper/0/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
> [   58.521099] (&(>lock)->rlock#2){?.-...}, at: [] skb_queue_tail+0x2c/0x68
> {HARDIRQ-ON-W} state was registered at:

Maybe add a short text to the log to describe the approach of this fix?
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: > On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >wrote: > > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: > >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng wrote: > >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard > >> > 写到: > >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: > >> >>> From: Chen-Yu Tsai > >> >>> > >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU > >> >>> address space; on the A64 SoC this register is in the SRAM controller > >> >>> address space, and with a different offset. > >> >>> > >> >>> To access the register from another device and hide the internal > >> >>> difference between the device, let it register a regmap named > >> >>> "emac-clock". We can then get the device from the phandle, and > >> >>> retrieve the regmap with dev_get_regmap(); in this situation the > >> >>> regmap_field will be set up to access the only register in the > >> >>regmap. 
> >> >>> > >> >>> Signed-off-by: Chen-Yu Tsai > >> >>> [Icenowy: change to use regmaps with single register, change commit > >> >>> message] > >> >>> Signed-off-by: Icenowy Zheng > >> >>> --- > >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 > >> >>++- > >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) > >> >>> > >> >>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> index 1037f6c78bca..b61210c0d415 100644 > >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = { > >> >>> .msb = 31, > >> >>> }; > >> >>> > >> >>> +/* Specially exported regmap which contains only EMAC register */ > >> >>> +const struct reg_field single_reg_field = { > >> >>> +.reg = 0, > >> >>> +.lsb = 0, > >> >>> +.msb = 31, > >> >>> +}; > >> >>> + > >> >> > >> >>I'm not sure this would be wise. If we ever need some other register > >> >>exported through the regmap, will have to change all the calling sites > >> >>everywhere in the kernel, which will be a pain and will break > >> >>bisectability. > >> > > >> > In this situation the register can be exported as another > >> > regmap. Currently the code will access a regmap with name > >> > "emac-clock" for this register. > >> > > >> >> > >> >>Chen-Yu's (or was it yours?) initial solution with a custom writeable > >> >>hook only allowing a single register seemed like a better one. > >> > > >> > But I remember you mentioned that you want it to hide the > >> > difference inside the device. > >> > >> The idea is that a device can export multiple regmaps. This one, > >> the one named "gmac" (in my soon to come v2) or "emac-clock" here, > >> is but one of many possible regmaps, and it only exports the register > >> needed by the GMAC/EMAC. > > > > I'm not sure this would be wise either. 
There's a single register map, > > and as far as I know we don't have a binding to express this in the > > DT. This means that the customer and provider would have to use the > > same name, but without anything actually enforcing it aside from > > "someone in the community knows it". > > > > This is not a really good design, and I was actually preferring your > > first option. We shouldn't rely on any undocumented rule. This will be > > easy to break and hard to maintain. > > So, one regmap per device covering the whole register range, and the > consumer knows which register to poke by looking at its own compatible. > > That sound right? Yep. And ideally, sending a single serie for both the A64 and the R40 cases, in order to provide the big picture. Maxime -- Maxime Ripard, Bootlin (formerly Free Electrons) Embedded Linux and Kernel engineering https://bootlin.com signature.asc Description: PGP signature
Re: [PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space
On Tue, Apr 17, 2018 at 04:02:33AM -0500, Vicentiu Galanopulo wrote:
> A search of the dev-addr property is done in of_mdiobus_register.
> If the property is found in the PHY node, of_mdiobus_register_vend_spec_phy()
> is called. This is a wrapper function for of_mdiobus_register_phy()
> which finds the device in package based on dev-addr, and fills
> devices_addrs, which is a new field added to phy_c45_device_ids.
> This new field will store the dev-addr property on the same index
> where the device in package has been found.
>
> The of_mdiobus_register_phy() now contains an extra parameter,
> which is struct phy_c45_device_ids *c45_ids.
> If c45_ids is not NULL, get_vend_spec_addr_phy_device() is called
> and c45_ids are propagated all the way to get_phy_c45_ids().
>
> Having dev-addr stored in devices_addrs, in get_phy_c45_ids(),
> when probing the identifiers, dev-addr can be extracted from
> devices_addrs and probed if devices_addrs[current_identifier] is not 0.

This still needs work. But I don't want David to see the two
Reviewed-by tags and think the series is O.K. So let's make it clear:

NACK

More comments to follow.

	Andrew
Re: [PATCH/RFC net-next 4/5] ravb: remove undocumented processing
On Tue, Apr 17, 2018 at 10:50:29AM +0200, Simon Horman wrote:
> From: Kazuya Mizuguchi
>
> Signed-off-by: Kazuya Mizuguchi
> Signed-off-by: Simon Horman
> ---
>  drivers/net/ethernet/renesas/ravb.h      |  5 -----
>  drivers/net/ethernet/renesas/ravb_main.c | 15 ---------------
>  2 files changed, 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 57eea4a77826..fcd04dbc7dde 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -197,15 +197,11 @@ enum ravb_reg {
> 	MAHR	= 0x05c0,
> 	MALR	= 0x05c8,
> 	TROCR	= 0x0700,	/* Undocumented? */

Why not this, too? Maybe some background info from the HW team for the
commit message would be nice to have...
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote: > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote: > > > On 2018年04月13日 15:15, Tiwei Bie wrote: > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote: > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > [...] > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned > > > > > > int head, > > > > > > + void **ctx) > > > > > > +{ > > > > > > + struct vring_packed_desc *desc; > > > > > > + unsigned int i, j; > > > > > > + > > > > > > + /* Clear data ptr. */ > > > > > > + vq->desc_state[head].data = NULL; > > > > > > + > > > > > > + i = head; > > > > > > + > > > > > > + for (j = 0; j < vq->desc_state[head].num; j++) { > > > > > > + desc = >vring_packed.desc[i]; > > > > > > + vring_unmap_one_packed(vq, desc); > > > > > > + desc->flags = 0x0; > > > > > Looks like this is unnecessary. > > > > It's safer to zero it. If we don't zero it, after we > > > > call virtqueue_detach_unused_buf_packed() which calls > > > > this function, the desc is still available to the > > > > device. > > > > > > Well detach_unused_buf_packed() should be called after device is stopped, > > > otherwise even if you try to clear, there will still be a window that > > > device > > > may use it. > > > > This is not about whether the device has been stopped or > > not. We don't have other places to re-initialize the ring > > descriptors and wrap_counter. So they need to be set to > > the correct values when doing detach_unused_buf. > > > > Best regards, > > Tiwei Bie > > find vqs is the time to do it. The .find_vqs() will call .setup_vq() which will eventually call vring_create_virtqueue(). It's a different case. Here we're talking about re-initializing the descs and updating the wrap counter when detaching the unused descs (In this case, split ring just needs to decrease vring.avail->idx). Best regards, Tiwei Bie
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote: > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote: > > On 2018年04月13日 15:15, Tiwei Bie wrote: > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote: > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > [...] > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned > > > > > int head, > > > > > + void **ctx) > > > > > +{ > > > > > + struct vring_packed_desc *desc; > > > > > + unsigned int i, j; > > > > > + > > > > > + /* Clear data ptr. */ > > > > > + vq->desc_state[head].data = NULL; > > > > > + > > > > > + i = head; > > > > > + > > > > > + for (j = 0; j < vq->desc_state[head].num; j++) { > > > > > + desc = >vring_packed.desc[i]; > > > > > + vring_unmap_one_packed(vq, desc); > > > > > + desc->flags = 0x0; > > > > Looks like this is unnecessary. > > > It's safer to zero it. If we don't zero it, after we > > > call virtqueue_detach_unused_buf_packed() which calls > > > this function, the desc is still available to the > > > device. > > > > Well detach_unused_buf_packed() should be called after device is stopped, > > otherwise even if you try to clear, there will still be a window that device > > may use it. > > This is not about whether the device has been stopped or > not. We don't have other places to re-initialize the ring > descriptors and wrap_counter. So they need to be set to > the correct values when doing detach_unused_buf. > > Best regards, > Tiwei Bie find vqs is the time to do it. -- MST
Re: tcp hang when socket fills up ?
On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote: > Michal Kubecek wrote on Tue, Apr 17, 2018: > > Data (21 bytes) packet in reply direction. And somewhere between the > > first and second debugging print, we ended up with sender scale=0 and > > that value is then preserved from now on. > > > > The only place between the two debug prints where we could change only > > one of the td_sender values are the two calls to tcp_options() but > > neither should be called now unless I missed something. I'll try to > > think about it some more. > > Could it have something to do with the way I setup the connection? > I don't think the "both remotes call connect() with carefully selected > source/dest port" is a very common case.. > > If you look at the tcpdump outputs I attached the sequence usually is > something like > server > client SYN > client > server SYN > server > client SYNACK > client > server ACK This must be what nf_conntrack_proto_tcp.c calls "simultaneous open". > ultimately it IS a connection, but with an extra SYN packet in front of > it (that first SYN opens up the conntrack of the nat so that the > client's syn can come in, the client's conntrack will be that of a > normal connection since its first SYN goes in directly after the > server's (it didn't see the server's SYN)) > > > Looking at my logs again, I'm seeing the same as you: > > This looks like the actual SYN/SYN/SYNACK/ACK: > - 14.364090 seq=505004283 likely SYN coming out of server > - 14.661731 seq=1913287797 on next line it says receiver > end=505004284 so likely the matching SYN from client > Which this time gets a proper SYNACK from server: > 14.662020 seq=505004283 ack=1913287798 > And following final dataless ACK: > 14.687570 seq=1913287798 ack=505004284 > > Then as you point out some data ACK, where the scale poofs: > 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 > end=1913287819 > 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 
maxwin=29312 > scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 > 14.688824 tcp_in_window: > 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 > end=1913287819 > 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 > scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 > > As you say, only tcp_options() will clear only one side of the scales. > We don't have sender->td_maxwin == 0 (printed) so I see no other way > than we are in the last else if: > - we have after(end, sender->td_end) (end=1913287819 > sender > end=1913287798) > - I assume the tcp state machine must be confused because of the > SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, > but since this is a data packet it doesn't have the tcp option for scale > thus scale resets. I agree that sender->td_maxwin is not zero so that the handshake above probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some reason. I'll try to go through the code with the pattern you mentioned in mind. Michal Kubecek
[net-next V10 PATCH 16/16] xdp: transition into using xdp_frame for ndo_xdp_xmit
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync. This builds towards changing the API further to become a bulk API, because xdp_buff is not a queue-able object while xdp_frame is. V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP") V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 30 ++--- drivers/net/ethernet/intel/i40e/i40e_txrx.h |2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++- drivers/net/tun.c | 19 ++-- drivers/net/virtio_net.c | 24 include/linux/netdevice.h |4 ++- net/core/filter.c | 17 +- 7 files changed, 74 insertions(+), 46 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index c8bf4d35fdea..87fb27ab9c24 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring, #define I40E_XDP_CONSUMED 1 #define I40E_XDP_TX 2 -static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, +static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf, struct i40e_ring *xdp_ring); +static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, +struct i40e_ring *xdp_ring) +{ + struct xdp_frame *xdpf = convert_to_xdp_frame(xdp); + + if (unlikely(!xdpf)) + return I40E_XDP_CONSUMED; + + return i40e_xmit_xdp_ring(xdpf, xdp_ring); +} + /** * i40e_run_xdp - run an XDP program * @rx_ring: Rx ring being processed @@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring, break; case XDP_TX: xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index]; - result = i40e_xmit_xdp_ring(xdp, xdp_ring); + result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring); break; case XDP_REDIRECT: err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog); @@ -3480,21 +3491,14 @@ static inline int 
i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, * @xdp: data to transmit * @xdp_ring: XDP Tx ring **/ -static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, +static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf, struct i40e_ring *xdp_ring) { u16 i = xdp_ring->next_to_use; struct i40e_tx_buffer *tx_bi; struct i40e_tx_desc *tx_desc; - struct xdp_frame *xdpf; + u32 size = xdpf->len; dma_addr_t dma; - u32 size; - - xdpf = convert_to_xdp_frame(xdp); - if (unlikely(!xdpf)) - return I40E_XDP_CONSUMED; - - size = xdpf->len; if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) { xdp_ring->tx_stats.tx_busy++; @@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev) * * Returns Zero if sent, else an error code **/ -int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) +int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf) { struct i40e_netdev_priv *np = netdev_priv(dev); unsigned int queue_index = smp_processor_id(); @@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs) return -ENXIO; - err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]); + err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]); if (err != I40E_XDP_TX) return -ENOSPC; diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 857b1d743c8d..4bf318b8be85 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw); void i40e_detect_recover_hung(struct i40e_vsi *vsi); int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size); bool __i40e_chk_linearize(struct sk_buff *skb); -int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp); +int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf); void i40e_xdp_flush(struct net_device *dev); /** diff 
--git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 4f2864165723..51e7d82a5860 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2262,7 +2262,7 @@ static struct sk_buff
[net-next V10 PATCH 15/16] xdp: transition into using xdp_frame for return API
Changing the xdp_return_frame() API to take a struct xdp_frame as argument seems like a natural choice, but there are some subtle performance details here that need extra care. De-referencing the xdp_frame on a remote CPU during DMA-TX completion moves its cache-line into the "Shared" state. Later, when the page is reused for RX, this xdp_frame cache-line is written, which changes the state to "Modified". This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs; thus, de-referencing the xdp_frame is the only option there. Only the ixgbe driver had an optimization that let it avoid the de-reference of the xdp_frame. The driver already has a TX-ring queue, which (in case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing the xdp_frame. To compensate for losing this, a prefetchw is used for telling the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate in the ixgbe driver.
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address and offset in dma_sync call") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c |5 ++--- drivers/net/ethernet/intel/ixgbe/ixgbe.h|4 +--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 17 +++-- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |1 + drivers/net/tun.c |4 ++-- drivers/net/virtio_net.c|2 +- include/net/xdp.h |2 +- kernel/bpf/cpumap.c |6 +++--- net/core/xdp.c |4 +++- 9 files changed, 25 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 96c54cbfb1f9..c8bf4d35fdea 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring, if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB) kfree(tx_buffer->raw_buf); else if (ring_is_xdp(ring)) - xdp_return_frame(tx_buffer->xdpf->data, -_buffer->xdpf->mem); + xdp_return_frame(tx_buffer->xdpf); else dev_kfree_skb_any(tx_buffer->skb); if (dma_unmap_len(tx_buffer, len)) @@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, /* free the skb/XDP data */ if (ring_is_xdp(tx_ring)) - xdp_return_frame(tx_buf->xdpf->data, _buf->xdpf->mem); + xdp_return_frame(tx_buf->xdpf); else napi_consume_skb(tx_buf->skb, napi_budget); diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index abb5248e917e..7dd5038cfcc4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -241,8 +241,7 @@ struct ixgbe_tx_buffer { unsigned long time_stamp; union { struct sk_buff *skb; - /* XDP uses address ptr on irq_clean */ - void *data; + struct xdp_frame *xdpf; }; unsigned int bytecount; unsigned short gso_segs; @@ -250,7 +249,6 @@ struct ixgbe_tx_buffer { 
DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; - struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index f10904ec2172..4f2864165723 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); + xdp_return_frame(tx_buffer->xdpf); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, xdp.data_hard_start = xdp.data - ixgbe_rx_offset(rx_ring);
[net-next V10 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page cache, which uses an elevated refcnt so that an SKB put_page is "caught" instead of returning the page through the page allocator, and, at the same time, have pages returned to the page_pool from ndo_xdp_xmit DMA completion. The performance improvement for XDP_REDIRECT in this patch is really good, especially considering that (currently) the xdp_return_frame API and page_pool_put_page() do per-frame operations of both an rhashtable ID-lookup and a locked return into the (page_pool) ptr_ring. (The plan is to remove these per-frame operations in a followup patchset.) The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe, with xdp_redirect_map (using devmap). The target/maximum capability of ixgbe is 13Mpps (on this HW setup). Before this patch, XDP-redirected frames from mlx5 were returned via the page allocator. The single-flow performance was 6Mpps, and with two flows the collective performance dropped to 4Mpps, because we hit the page allocator lock (further negative scaling occurs). Two test scenarios need to be covered for the xdp_return_frame API: DMA-TX completion running on the same CPU, or cross-CPU free/return. Results were same-CPU=10Mpps and cross-CPU=12Mpps, which is very close to our 13Mpps max target. The reason the max target isn't reached in the cross-CPU test is likely RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe-to-ixgbe testing). It is also planned to remove this unnecessary DMA unmap in a later patchset. V2: Adjustments requested by Tariq - Changed page_pool_create return codes to not return NULL, only ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V5: Updated patch desc V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params") V9: - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V10: Req from Tariq - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ Signed-off-by: Jesper Dangaard BrouerReviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h |3 ++ drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 41 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 16 ++-- 3 files changed, 48 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 1a05d1072c5e..3317a4da87cb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -53,6 +53,8 @@ #include "mlx5_core.h" #include "en_stats.h" +struct page_pool; + #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v) #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) @@ -534,6 +536,7 @@ struct mlx5e_rq { unsigned int hw_mtu; struct mlx5e_xdpsq xdpsq; DECLARE_BITMAP(flags, 8); + struct page_pool *page_pool; /* control */ struct mlx5_wq_ctrlwq_ctrl; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 2dca0933dfd3..f10037419978 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "eswitch.h" #include "en.h" #include "en_tc.h" @@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, struct mlx5e_rq_param *rqp, struct mlx5e_rq *rq) { + struct 
page_pool_params pp_params = { 0 }; struct mlx5_core_dev *mdev = c->mdev; void *rqc = rqp->rqc; void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); - u32 byte_count; + u32 byte_count, pool_size; int npages; int wq_sz; int err; @@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params); + pool_size = 1 << params->log_rq_mtu_frames; switch (rq->wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: + + pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params); rq->post_wqes = mlx5e_post_rx_mpwqes; rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe; @@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } - /* This must only be activate for order-0
[net-next V10 PATCH 12/16] page_pool: refurbish version of page_pool code
Need a fast page recycle mechanism for the ndo_xdp_xmit API, for returning pages at DMA-TX completion time, which has good cross-CPU performance, given that DMA-TX completion can happen on a remote CPU. This refurbishes my page_pool code, which was presented[1] at MM-summit 2016. The page_pool code has been adapted to not depend on the page allocator or on integration into struct page. The DMA mapping feature is kept, even though it will not be activated/used in this patchset. [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf V2: Adjustments requested by Tariq - Changed page_pool_create return codes, don't return NULL, only ERR_PTR, as this simplifies err handling in drivers. V4: many small improvements and cleanups - Add DOC comment section, that can be used by kernel-doc - Improve fallback mode, to work better with refcnt based recycling e.g. remove a WARN as pointed out by Tariq e.g. quicker fallback if ptr_ring is empty. V5: Fixed SPDX license as pointed out by Alexei V6: Adjustments requested by Eric Dumazet - Adjust cacheline_aligned_in_smp usage/placement - Move rcu_head in struct page_pool - Free pages quicker on destroy, minimize resources delayed by an RCU period - Remove code for forward/backward compat ABI interface V8: Issues found by kbuild test robot - Address sparse should be static warnings - Only compile+link when a driver uses/selects page_pool, mlx5 selects CONFIG_PAGE_POOL, although it is first used in two patches Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig |1 include/net/page_pool.h | 129 + net/Kconfig |3 net/core/Makefile |1 net/core/page_pool.c| 317 +++ 5 files changed, 451 insertions(+) create mode 100644 include/net/page_pool.h create mode 100644 net/core/page_pool.c diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index c032319f1cb9..12257034131e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -30,6 +30,7 @@ config MLX5_CORE_EN bool "Mellanox Technologies ConnectX-4 Ethernet support" depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE depends on IPV6=y || IPV6=n || MLX5_CORE=m + select PAGE_POOL default n ---help--- Ethernet support in Mellanox Technologies ConnectX-4 NIC. diff --git a/include/net/page_pool.h b/include/net/page_pool.h new file mode 100644 index ..1fe77db59518 --- /dev/null +++ b/include/net/page_pool.h @@ -0,0 +1,129 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool.h + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc. + */ + +/** + * DOC: page_pool allocator + * + * This page_pool allocator is optimized for the XDP mode that + * uses one-frame-per-page, but have fallbacks that act like the + * regular page allocator APIs. + * + * Basic use involve replacing alloc_pages() calls with the + * page_pool_alloc_pages() call. Drivers should likely use + * page_pool_dev_alloc_pages() replacing dev_alloc_pages(). + * + * If page_pool handles DMA mapping (use page->private), then API user + * is responsible for invoking page_pool_put_page() once. In-case of + * elevated refcnt, the DMA state is released, assuming other users of + * the page will eventually call put_page(). + * + * If no DMA mapping is done, then it can act as shim-layer that + * fall-through to alloc_page. As no state is kept on the page, the + * regular put_page() call is sufficient. + */ +#ifndef _NET_PAGE_POOL_H +#define _NET_PAGE_POOL_H + +#include /* Needed by ptr_ring */ +#include +#include + +#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */ +#define PP_FLAG_ALLPP_FLAG_DMA_MAP + +/* + * Fast allocation side cache array/stack + * + * The cache size and refill watermark is related to the network + * use-case. The NAPI budget is 64 packets. 
After a NAPI poll the RX + * ring is usually refilled and the max consumed elements will be 64, + * thus a natural max size of objects needed in the cache. + * + * Keeping room for more objects, is due to XDP_DROP use-case. As + * XDP_DROP allows the opportunity to recycle objects directly into + * this array, as it shares the same softirq/NAPI protection. If + * cache is already full (or partly full) then the XDP_DROP recycles + * would have to take a slower code path. + */ +#define PP_ALLOC_CACHE_SIZE128 +#define PP_ALLOC_CACHE_REFILL 64 +struct pp_alloc_cache { + u32 count; + void *cache[PP_ALLOC_CACHE_SIZE]; +}; + +struct page_pool_params { + unsigned intflags; + unsigned intorder; + unsigned intpool_size; + int nid;
[net-next V10 PATCH 13/16] xdp: allow page_pool as an allocator type in xdp_return_frame
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage. The registered allocator page_pool pointer is not available directly from xdp_rxq_info, but it could be (if needed). For now, the driver should keep separate track of the page_pool pointer, which it should use for RX-ring page allocation. As suggested by Saeed, to maintain a symmetric API it is the drivers responsibility to allocate/create and free/destroy the page_pool. Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers responsibility to free the page_pool, but with a RCU free call. This is done easily via the page_pool helper page_pool_destroy() (which avoids touching any driver code during the RCU callback, which could happen after the driver have been unloaded). V8: address issues found by kbuild test robot - Address sparse should be static warnings - Allow xdp.o to be compiled without page_pool.o V9: Remove inline from .c file, compiler knows best Signed-off-by: Jesper Dangaard Brouer--- include/net/page_pool.h | 14 +++ include/net/xdp.h |3 ++ net/core/xdp.c | 60 ++- 3 files changed, 65 insertions(+), 12 deletions(-) diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 1fe77db59518..c79087153148 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -117,7 +117,12 @@ void __page_pool_put_page(struct page_pool *pool, static inline void page_pool_put_page(struct page_pool *pool, struct page *page) { + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't +* allow registering MEM_TYPE_PAGE_POOL, but shield linker. 
+*/ +#ifdef CONFIG_PAGE_POOL __page_pool_put_page(pool, page, false); +#endif } /* Very limited use-cases allow recycle direct */ static inline void page_pool_recycle_direct(struct page_pool *pool, @@ -126,4 +131,13 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, __page_pool_put_page(pool, page, true); } +static inline bool is_page_pool_compiled_in(void) +{ +#ifdef CONFIG_PAGE_POOL + return true; +#else + return false; +#endif +} + #endif /* _NET_PAGE_POOL_H */ diff --git a/include/net/xdp.h b/include/net/xdp.h index 5f67c62540aa..d0ee437753dc 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -36,6 +36,7 @@ enum xdp_mem_type { MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_PAGE_POOL, MEM_TYPE_MAX, }; @@ -44,6 +45,8 @@ struct xdp_mem_info { u32 id; }; +struct page_pool; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; diff --git a/net/core/xdp.c b/net/core/xdp.c index 8b2cb79b5de0..33e382afbd95 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -8,6 +8,7 @@ #include #include #include +#include #include @@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht; struct xdp_mem_allocator { struct xdp_mem_info mem; - void *allocator; + union { + void *allocator; + struct page_pool *page_pool; + }; struct rhash_head node; struct rcu_head rcu; }; @@ -74,7 +78,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu) /* Allow this ID to be reused */ ida_simple_remove(_id_pool, xa->mem.id); - /* TODO: Depending on allocator type/pointer free resources */ + /* Notice, driver is expected to free the *allocator, +* e.g. page_pool, and MUST also use RCU free. 
+*/ /* Poison memory */ xa->mem.id = 0x; @@ -225,6 +231,17 @@ static int __mem_id_cyclic_get(gfp_t gfp) return id; } +static bool __is_supported_mem_type(enum xdp_mem_type type) +{ + if (type == MEM_TYPE_PAGE_POOL) + return is_page_pool_compiled_in(); + + if (type >= MEM_TYPE_MAX) + return false; + + return true; +} + int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, enum xdp_mem_type type, void *allocator) { @@ -238,13 +255,16 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, return -EFAULT; } - if (type >= MEM_TYPE_MAX) - return -EINVAL; + if (!__is_supported_mem_type(type)) + return -EOPNOTSUPP; xdp_rxq->mem.type = type; - if (!allocator) + if (!allocator) { + if (type == MEM_TYPE_PAGE_POOL) + return -EINVAL; /* Setup time check page_pool req */ return 0; + } /* Delay init of rhashtable to save memory if feature isn't used */ if (!mem_id_init) { @@ -290,15 +310,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model); void xdp_return_frame(void *data, struct xdp_mem_info *mem) { - if (mem->type == MEM_TYPE_PAGE_SHARED) { + struct
[PATCH net-next] ipv6: send netlink notifications for manually configured addresses
Send a netlink notification when userspace adds a manually configured address, if DAD is enabled and the optimistic flag isn't set. Moreover, send RTM_DELADDR notifications for tentative addresses. Some userspace applications (e.g. NetworkManager) are interested in addr netlink events even while the address is still in the tentative state; however, events are not sent if the DAD process has not completed. If the address is added and immediately removed, userspace listeners are not notified. This behaviour can be easily reproduced by using veth interfaces: $ ip -b - <<EOF > link add dev vm1 type veth peer name vm2 > link set dev vm1 up > link set dev vm2 up > addr add 2001:db8:a:b:1:2:3:4/64 dev vm1 > addr del 2001:db8:a:b:1:2:3:4/64 dev vm1 EOF This patch reverts the behaviour introduced by the commit f784ad3d79e5 ("ipv6: do not send RTM_DELADDR for tentative addresses") Suggested-by: Thomas HallerSigned-off-by: Lorenzo Bianconi --- net/ipv6/addrconf.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 62b97722722c..b2c0175125db 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2901,6 +2901,11 @@ static int inet6_addr_add(struct net *net, int ifindex, expires, flags); } + /* Send a netlink notification if DAD is enabled and +* optimistic flag is not set +*/ + if (!(ifp->flags & (IFA_F_OPTIMISTIC | IFA_F_NODAD))) + ipv6_ifa_notify(0, ifp); /* * Note that section 3.1 of RFC 4429 indicates * that the Optimistic flag should not be set for @@ -5028,14 +5033,6 @@ static void inet6_ifa_notify(int event, struct inet6_ifaddr *ifa) struct net *net = dev_net(ifa->idev->dev); int err = -ENOBUFS; - /* Don't send DELADDR notification for TENTATIVE address, -* since NEWADDR notification is sent only after removing -* TENTATIVE flag, if DAD has not failed.
-*/ - if (ifa->flags & IFA_F_TENTATIVE && !(ifa->flags & IFA_F_DADFAILED) && - event == RTM_DELADDR) - return; - skb = nlmsg_new(inet6_ifaddr_msgsize(), GFP_ATOMIC); if (!skb) goto errout; -- 2.14.3
Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx
On 16/04/2018 20:22, Rob Herring wrote: > On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote: >> The Microchip LAN78XX family of devices are Ethernet controllers with >> a USB interface. Despite being discoverable devices it can be useful to >> be able to configure them from Device Tree, particularly in low-cost >> applications without an EEPROM or programmed OTP. >> >> Document the supported properties in a bindings file, adding it to >> MAINTAINERS at the same time. >> >> Signed-off-by: Phil Elwell>> --- >> .../devicetree/bindings/net/microchip,lan78xx.txt | 44 >> ++ >> MAINTAINERS| 1 + >> 2 files changed, 45 insertions(+) >> create mode 100644 >> Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> >> diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> new file mode 100644 >> index 000..e7d7850 >> --- /dev/null >> +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> @@ -0,0 +1,44 @@ >> +Microchip LAN78xx Gigabit Ethernet controller >> + >> +The LAN78XX devices are usually configured by programming their OTP or with >> +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have >> neither. >> + >> +Please refer to ethernet.txt for a description of common Ethernet bindings. >> + >> +Optional properties: >> +- microchip,eee-enabled: if present, enable Energy Efficient Ethernet >> support; > > I see we have some flags for broken EEE, but nothing already defined to > enable EEE. Seems like this should either be a user option (therefore > not in DT) or we should use the broken EEE properties if this is h/w > dependent. In the downstream Raspberry Pi kernel we use DT as a way of passing user settings to drivers - it's more powerful than the command line. I understand that this is not the done thing here so I'm withdrawing this element of the patch series. Apologies for the noise. 
>> +- microchip,led-modes: a two-element vector, with each element configuring >> + the operating mode of an LED. The values supported by the device are; >> + 0: Link/Activity >> + 1: Link1000/Activity >> + 2: Link100/Activity >> + 3: Link10/Activity >> + 4: Link100/1000/Activity >> + 5: Link10/1000/Activity >> + 6: Link10/100/Activity >> + 7: RESERVED >> + 8: Duplex/Collision >> + 9: Collision >> + 10: Activity >> + 11: RESERVED >> + 12: Auto-negotiation Fault >> + 13: RESERVED >> + 14: Off >> + 15: On >> +- microchip,tx-lpi-timer: the delay (in microseconds) between the TX fifo >> + becoming empty and invoking Low Power Idles (default 600). > > Needs a unit suffix as defined in property-units.txt. > >> + >> +Example: >> + >> +/* Standard configuration for a Raspberry Pi 3 B+ */ >> +ethernet: usbether@1 { >> +compatible = "usb424,7800"; >> +reg = <1>; >> +microchip,eee-enabled; >> +microchip,tx-lpi-timer = <600>; >> +/* >> + * led0 = 1:link1000/activity >> + * led1 = 6:link10/100/activity >> + */ >> +microchip,led-modes = <1 6>; >> +}; >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 2328eed..b637aad 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -14482,6 +14482,7 @@ M: Microchip Linux Driver Support >> >> L: netdev@vger.kernel.org >> S: Maintained >> F: drivers/net/usb/lan78xx.* >> +F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> >> USB MASS STORAGE DRIVER >> M: Alan Stern >> -- >> 2.7.4 >>
Re: [PATCH net 2/2] tipc: fix possible crash in __tipc_nl_net_set()
On 04/16/2018 11:29 PM, Eric Dumazet wrote: > syzbot reported a crash in __tipc_nl_net_set() caused by NULL dereference. > > We need to check that both TIPC_NLA_NET_NODEID and TIPC_NLA_NET_NODEID_W1 > are present. > > We also need to make sure userland provided u64 attributes. > > Fixes: d50ccc2d3909 ("tipc: add 128-bit node identifier") > Signed-off-by: Eric Dumazet> Cc: Jon Maloy > Cc: Ying Xue > Reported-by: syzbot Acked-by: Ying Xue > --- > net/tipc/net.c | 2 ++ > net/tipc/netlink.c | 2 ++ > 2 files changed, 4 insertions(+) > > diff --git a/net/tipc/net.c b/net/tipc/net.c > index > 856f9e97ea293210bea1d2003d2092482732ace9..4fbaa0464405370601cb2fd1dd3b03733836d342 > 100644 > --- a/net/tipc/net.c > +++ b/net/tipc/net.c > @@ -252,6 +252,8 @@ int __tipc_nl_net_set(struct sk_buff *skb, struct > genl_info *info) > u64 *w0 = (u64 *)_id[0]; > u64 *w1 = (u64 *)_id[8]; > > + if (!attrs[TIPC_NLA_NET_NODEID_W1]) > + return -EINVAL; > *w0 = nla_get_u64(attrs[TIPC_NLA_NET_NODEID]); > *w1 = nla_get_u64(attrs[TIPC_NLA_NET_NODEID_W1]); > tipc_net_init(net, node_id, 0); > diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c > index > d4e0bbeee72793a060befaf8a9d0239731c0d48c..6ff2254088f647d4f7410c3335ccdae2e68ec522 > 100644 > --- a/net/tipc/netlink.c > +++ b/net/tipc/netlink.c > @@ -81,6 +81,8 @@ const struct nla_policy tipc_nl_net_policy[TIPC_NLA_NET_MAX > + 1] = { > [TIPC_NLA_NET_UNSPEC] = { .type = NLA_UNSPEC }, > [TIPC_NLA_NET_ID] = { .type = NLA_U32 }, > [TIPC_NLA_NET_ADDR] = { .type = NLA_U32 }, > + [TIPC_NLA_NET_NODEID] = { .type = NLA_U64 }, > + [TIPC_NLA_NET_NODEID_W1]= { .type = NLA_U64 }, > }; > > const struct nla_policy tipc_nl_link_policy[TIPC_NLA_LINK_MAX + 1] = { >
[net-next V10 PATCH 04/16] xdp: move struct xdp_buff from filter.h to xdp.h
This is done to prepare for the next patch, and it is also nice to move this XDP related struct out of filter.h. Signed-off-by: Jesper Dangaard Brouer--- include/linux/filter.h | 24 +--- include/net/xdp.h | 22 ++ 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index fc4e8f91b03d..4da8b2308174 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -30,6 +30,7 @@ struct sock; struct seccomp_data; struct bpf_prog_aux; struct xdp_rxq_info; +struct xdp_buff; /* ArgX, context and stack frame pointer register positions. Note, * Arg1, Arg2, Arg3, etc are used as argument mappings of function @@ -500,14 +501,6 @@ struct bpf_skb_data_end { void *data_end; }; -struct xdp_buff { - void *data; - void *data_end; - void *data_meta; - void *data_hard_start; - struct xdp_rxq_info *rxq; -}; - struct sk_msg_buff { void *data; void *data_end; @@ -772,21 +765,6 @@ int xdp_do_redirect(struct net_device *dev, struct bpf_prog *prog); void xdp_do_flush_map(void); -/* Drivers not supporting XDP metadata can use this helper, which - * rejects any room expansion for metadata as a result. 
- */ -static __always_inline void -xdp_set_data_meta_invalid(struct xdp_buff *xdp) -{ - xdp->data_meta = xdp->data + 1; -} - -static __always_inline bool -xdp_data_meta_unsupported(const struct xdp_buff *xdp) -{ - return unlikely(xdp->data_meta > xdp->data); -} - void bpf_warn_invalid_xdp_action(u32 act); struct sock *do_sk_redirect_map(struct sk_buff *skb); diff --git a/include/net/xdp.h b/include/net/xdp.h index e4207699c410..15f8ade008b5 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -50,6 +50,13 @@ struct xdp_rxq_info { struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ +struct xdp_buff { + void *data; + void *data_end; + void *data_meta; + void *data_hard_start; + struct xdp_rxq_info *rxq; +}; static inline void xdp_return_frame(void *data, struct xdp_mem_info *mem) @@ -72,4 +79,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, enum xdp_mem_type type, void *allocator); +/* Drivers not supporting XDP metadata can use this helper, which + * rejects any room expansion for metadata as a result. + */ +static __always_inline void +xdp_set_data_meta_invalid(struct xdp_buff *xdp) +{ + xdp->data_meta = xdp->data + 1; +} + +static __always_inline bool +xdp_data_meta_unsupported(const struct xdp_buff *xdp) +{ + return unlikely(xdp->data_meta > xdp->data); +} + #endif /* __LINUX_NET_XDP_H__ */
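The helpers being moved rely on the xdp_buff pointer invariant data_hard_start <= data_meta <= data <= data_end, with data_meta > data serving as the "metadata unsupported" sentinel. A minimal userspace sketch of that convention (mirroring the helpers, not the kernel code itself):

```c
#include <assert.h>

/* Userspace stand-in for the xdp_buff pointer layout. */
struct xdp_buff_sketch {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
};

/* Mirrors xdp_set_data_meta_invalid(): point data_meta past data,
 * which no valid metadata region can do. */
static void set_data_meta_invalid(struct xdp_buff_sketch *xdp)
{
	xdp->data_meta = (char *)xdp->data + 1;
}

/* Mirrors xdp_data_meta_unsupported(). */
static int data_meta_unsupported(const struct xdp_buff_sketch *xdp)
{
	return (char *)xdp->data_meta > (char *)xdp->data;
}
```

Valid metadata grows downward from data (data_meta <= data), so encoding "unsupported" as data + 1 costs no extra struct field.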
[net-next V10 PATCH 06/16] tun: convert to use generic xdp_frame and xdp_return_frame API
From: Jesper Dangaard Brouer

The tuntap driver invented its own driver-specific way of queuing XDP packets, by storing the xdp_buff information in the top of the XDP frame data. Convert it over to use the more generic xdp_frame structure. The main problem with the in-driver method is that the xdp_rxq_info pointer cannot be trusted/used when dequeueing the frame. V3: Remove check based on feedback from Jason Signed-off-by: Jesper Dangaard Brouer --- drivers/net/tun.c | 43 --- drivers/vhost/net.c|7 --- include/linux/if_tun.h |4 ++-- 3 files changed, 26 insertions(+), 28 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 28583aa0c17d..2c85e5cac2a9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -248,11 +248,11 @@ struct veth { __be16 h_vlan_TCI; }; -bool tun_is_xdp_buff(void *ptr) +bool tun_is_xdp_frame(void *ptr) { return (unsigned long)ptr & TUN_XDP_FLAG; } -EXPORT_SYMBOL(tun_is_xdp_buff); +EXPORT_SYMBOL(tun_is_xdp_frame); void *tun_xdp_to_ptr(void *ptr) { @@ -660,10 +660,10 @@ void tun_ptr_free(void *ptr) { if (!ptr) return; - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - put_page(virt_to_head_page(xdp->data)); + xdp_return_frame(xdpf->data, &xdpf->mem); } else { __skb_array_destroy_skb(ptr); } @@ -1298,17 +1298,14 @@ static const struct net_device_ops tun_netdev_ops = { static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) { struct tun_struct *tun = netdev_priv(dev); - struct xdp_buff *buff = xdp->data_hard_start; - int headroom = xdp->data - xdp->data_hard_start; + struct xdp_frame *frame; struct tun_file *tfile; u32 numqueues; int ret = 0; - /* Assure headroom is available and buff is properly aligned */ - if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp))) - return -ENOSPC; - - *buff = *xdp; + frame = convert_to_xdp_frame(xdp); + if (unlikely(!frame)) + return -EOVERFLOW; rcu_read_lock(); @@ -1323,7 
+1320,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) /* Encode the XDP flag into lowest bit for consumer to differ * XDP buffer from sk_buff. */ - if (ptr_ring_produce(>tx_ring, tun_xdp_to_ptr(buff))) { + if (ptr_ring_produce(>tx_ring, tun_xdp_to_ptr(frame))) { this_cpu_inc(tun->pcpu_stats->tx_dropped); ret = -ENOSPC; } @@ -2001,11 +1998,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from) static ssize_t tun_put_user_xdp(struct tun_struct *tun, struct tun_file *tfile, - struct xdp_buff *xdp, + struct xdp_frame *xdp_frame, struct iov_iter *iter) { int vnet_hdr_sz = 0; - size_t size = xdp->data_end - xdp->data; + size_t size = xdp_frame->len; struct tun_pcpu_stats *stats; size_t ret; @@ -2021,7 +2018,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); } - ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz; + ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz; stats = get_cpu_ptr(tun->pcpu_stats); u64_stats_update_begin(>syncp); @@ -2189,11 +2186,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, return err; } - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - ret = tun_put_user_xdp(tun, tfile, xdp, to); - put_page(virt_to_head_page(xdp->data)); + ret = tun_put_user_xdp(tun, tfile, xdpf, to); + xdp_return_frame(xdpf->data, >mem); } else { struct sk_buff *skb = ptr; @@ -2432,10 +2429,10 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, static int tun_ptr_peek_len(void *ptr) { if (likely(ptr)) { - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - return xdp->data_end - xdp->data; + return xdpf->len; } return __skb_array_len_with_tag(ptr); } else { diff
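The conversion above keeps tun's pointer-tagging trick: the lowest bit of each ptr_ring entry (TUN_XDP_FLAG, 0x1UL in the driver) marks whether the entry is an xdp_frame or an sk_buff, which works because both objects are at least word-aligned. A minimal userspace sketch of that scheme (not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of tun's pointer tagging. TUN_XDP_FLAG mirrors the
 * driver's define; the functions mirror tun_is_xdp_frame(),
 * tun_xdp_to_ptr() and tun_ptr_to_xdp(). */
#define TUN_XDP_FLAG 0x1UL

static int is_xdp_frame(void *ptr)
{
	/* Low bit set => entry is a tagged xdp_frame, not an sk_buff */
	return (uintptr_t)ptr & TUN_XDP_FLAG;
}

static void *xdp_to_ptr(void *frame)
{
	/* Tag before producing into the ptr_ring */
	return (void *)((uintptr_t)frame | TUN_XDP_FLAG);
}

static void *ptr_to_xdp(void *ptr)
{
	/* Strip the tag on the consumer side */
	return (void *)((uintptr_t)ptr & ~TUN_XDP_FLAG);
}
```

This only round-trips cleanly for pointers whose low bit is naturally zero, which alignment guarantees for both struct types.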
[net-next V10 PATCH 07/16] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
The virtio_net driver assumes XDP frames are always released based on page refcnt (via put_page). Thus, it only queues the XDP data pointer address and uses virt_to_head_page() to retrieve struct page. Use the XDP return API to get away from such assumptions. Instead queue an xdp_frame, which allows us to use the xdp_return_frame API when releasing the frame. V8: Avoid endianness issues (found by kbuild test robot) V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter) Signed-off-by: Jesper Dangaard Brouer --- drivers/net/virtio_net.c | 54 +- 1 file changed, 29 insertions(+), 25 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 7b187ec7411e..f50e1ad81ad4 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -415,38 +415,48 @@ static void virtnet_xdp_flush(struct net_device *dev) virtqueue_kick(sq->vq); } -static bool __virtnet_xdp_xmit(struct virtnet_info *vi, - struct xdp_buff *xdp) +static int __virtnet_xdp_xmit(struct virtnet_info *vi, + struct xdp_buff *xdp) { struct virtio_net_hdr_mrg_rxbuf *hdr; - unsigned int len; + struct xdp_frame *xdpf, *xdpf_sent; struct send_queue *sq; + unsigned int len; unsigned int qp; - void *xdp_sent; int err; qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id(); sq = &vi->sq[qp]; /* Free up any pending old buffers before queueing new ones. 
*/ - while ((xdp_sent = virtqueue_get_buf(sq->vq, )) != NULL) { - struct page *sent_page = virt_to_head_page(xdp_sent); + while ((xdpf_sent = virtqueue_get_buf(sq->vq, )) != NULL) + xdp_return_frame(xdpf_sent->data, _sent->mem); - put_page(sent_page); - } + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + /* virtqueue want to use data area in-front of packet */ + if (unlikely(xdpf->metasize > 0)) + return -EOPNOTSUPP; - xdp->data -= vi->hdr_len; + if (unlikely(xdpf->headroom < vi->hdr_len)) + return -EOVERFLOW; + + /* Make room for virtqueue hdr (also change xdpf->headroom?) */ + xdpf->data -= vi->hdr_len; /* Zero header and leave csum up to XDP layers */ - hdr = xdp->data; + hdr = xdpf->data; memset(hdr, 0, vi->hdr_len); + xdpf->len += vi->hdr_len; - sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data); + sg_init_one(sq->sg, xdpf->data, xdpf->len); - err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC); + err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC); if (unlikely(err)) - return false; /* Caller handle free/refcnt */ + return -ENOSPC; /* Caller handle free/refcnt */ - return true; + return 0; } static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) @@ -454,7 +464,6 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) struct virtnet_info *vi = netdev_priv(dev); struct receive_queue *rq = vi->rq; struct bpf_prog *xdp_prog; - bool sent; /* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this * indicate XDP resources have been successfully allocated. 
@@ -463,10 +472,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) if (!xdp_prog) return -ENXIO; - sent = __virtnet_xdp_xmit(vi, xdp); - if (!sent) - return -ENOSPC; - return 0; + return __virtnet_xdp_xmit(vi, xdp); } static unsigned int virtnet_get_headroom(struct virtnet_info *vi) @@ -555,7 +561,6 @@ static struct sk_buff *receive_small(struct net_device *dev, struct page *page = virt_to_head_page(buf); unsigned int delta = 0; struct page *xdp_page; - bool sent; int err; len -= vi->hdr_len; @@ -606,8 +611,8 @@ static struct sk_buff *receive_small(struct net_device *dev, delta = orig_data - xdp.data; break; case XDP_TX: - sent = __virtnet_xdp_xmit(vi, ); - if (unlikely(!sent)) { + err = __virtnet_xdp_xmit(vi, ); + if (unlikely(err)) { trace_xdp_exception(vi->dev, xdp_prog, act); goto err_xdp; } @@ -690,7 +695,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, struct bpf_prog *xdp_prog; unsigned int truesize; unsigned int headroom = mergeable_ctx_to_headroom(ctx); - bool sent; int err; head_skb =
[net-next V10 PATCH 08/16] bpf: cpumap convert to use generic xdp_frame
The generic xdp_frame format, was inspired by the cpumap own internal xdp_pkt format. It is now time to convert it over to the generic xdp_frame format. The cpumap needs one extra field dev_rx. Signed-off-by: Jesper Dangaard Brouer--- include/net/xdp.h |1 + kernel/bpf/cpumap.c | 100 ++- 2 files changed, 29 insertions(+), 72 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index 756c42811e78..ea3773f94f65 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -67,6 +67,7 @@ struct xdp_frame { * while mem info is valid on remote CPU. */ struct xdp_mem_info mem; + struct net_device *dev_rx; /* used by cpumap */ }; /* Convert xdp_buff to xdp_frame */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 3e4bbcbe3e86..bcdc4dea5ce7 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work) kthread_stop(rcpu->kthread); } -/* For now, xdp_pkt is a cpumap internal data structure, with info - * carried between enqueue to dequeue. It is mapped into the top - * headroom of the packet, to avoid allocating separate mem. - */ -struct xdp_pkt { - void *data; - u16 len; - u16 headroom; - u16 metasize; - /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, -* while mem info is valid on remote CPU. -*/ - struct xdp_mem_info mem; - struct net_device *dev_rx; -}; - -/* Convert xdp_buff to xdp_pkt */ -static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) -{ - struct xdp_pkt *xdp_pkt; - int metasize; - int headroom; - - /* Assure headroom is available for storing info */ - headroom = xdp->data - xdp->data_hard_start; - metasize = xdp->data - xdp->data_meta; - metasize = metasize > 0 ? 
metasize : 0; - if (unlikely((headroom - metasize) < sizeof(*xdp_pkt))) - return NULL; - - /* Store info in top of packet */ - xdp_pkt = xdp->data_hard_start; - - xdp_pkt->data = xdp->data; - xdp_pkt->len = xdp->data_end - xdp->data; - xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); - xdp_pkt->metasize = metasize; - - /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ - xdp_pkt->mem = xdp->rxq->mem; - - return xdp_pkt; -} - static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, -struct xdp_pkt *xdp_pkt) +struct xdp_frame *xdpf) { unsigned int frame_size; void *pkt_data_start; @@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * would be preferred to set frame_size to 2048 or 4096 * depending on the driver. * frame_size = 2048; -* frame_len = frame_size - sizeof(*xdp_pkt); +* frame_len = frame_size - sizeof(*xdp_frame); * * Instead, with info avail, skb_shared_info in placed after * packet len. This, unfortunately fakes the truesize. @@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * is not at a fixed memory location, with mixed length * packets, which is bad for cache-line hotness. 
*/ - frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom + + frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - pkt_data_start = xdp_pkt->data - xdp_pkt->headroom; + pkt_data_start = xdpf->data - xdpf->headroom; skb = build_skb(pkt_data_start, frame_size); if (!skb) return NULL; - skb_reserve(skb, xdp_pkt->headroom); - __skb_put(skb, xdp_pkt->len); - if (xdp_pkt->metasize) - skb_metadata_set(skb, xdp_pkt->metasize); + skb_reserve(skb, xdpf->headroom); + __skb_put(skb, xdpf->len); + if (xdpf->metasize) + skb_metadata_set(skb, xdpf->metasize); /* Essential SKB info: protocol and skb->dev */ - skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx); + skb->protocol = eth_type_trans(skb, xdpf->dev_rx); /* Optional SKB info, currently missing: * - HW checksum info (skb->ip_summed) @@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring) * invoked cpu_map_kthread_stop(). Catch any broken behaviour * gracefully and warn once. */ - struct xdp_pkt *xdp_pkt; + struct xdp_frame *xdpf; - while ((xdp_pkt = ptr_ring_consume(ring))) - if (WARN_ON_ONCE(xdp_pkt)) - xdp_return_frame(xdp_pkt, _pkt->mem); + while ((xdpf = ptr_ring_consume(ring))) + if
[net-next V10 PATCH 11/16] xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure to get a cyclically increasing ID number, used for keeping track of each registered allocator per RX-queue xdp_rxq_info. Instead of the IDR infrastructure, which uses a radix tree, use a dynamic rhashtable for the ID-to-pointer lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info pointer (stored in xdp_buff) cannot be used directly, as its guaranteed lifetime is too short. The info is needed on a (potentially) remote CPU during DMA-TX completion time. An xdp_frame stores the xdp_mem_info it was given when converted from an xdp_buff, which is sufficient for the simple page-refcnt based recycle schemes. For more advanced allocators there is a need to store a pointer to the registered allocator. Thus, there is a need to guard the lifetime or validity of the allocator pointer, which is done through this rhashtable ID-to-pointer map. The removal and validity of the allocator and helper struct xdp_mem_allocator are guarded by RCU.

The allocator will be created by the driver, and registered with xdp_rxq_info_reg_mem_model(). It is up for debate who is responsible for freeing the allocator pointer or invoking the allocator destructor function. In any case, this must happen via RCU freeing.

V4: Per req of Jason Wang - Use xdp_rxq_info_reg_mem_model() in all drivers implementing XDP_REDIRECT, even though it's not strictly necessary when allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero). 
V6: Per req of Alex Duyck - Introduce rhashtable_lookup() call in later patch V8: Address sparse should be static warnings (from kbuild test robot) Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |9 + drivers/net/tun.c |6 + drivers/net/virtio_net.c |7 + include/net/xdp.h | 14 -- net/core/xdp.c| 223 - 5 files changed, 241 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 0bfe6cf2bf8b..f10904ec2172 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -6370,7 +6370,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, struct device *dev = rx_ring->dev; int orig_node = dev_to_node(dev); int ring_node = -1; - int size; + int size, err; size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count; @@ -6407,6 +6407,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, rx_ring->queue_index) < 0) goto err; + err = xdp_rxq_info_reg_mem_model(_ring->xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err) { + xdp_rxq_info_unreg(_ring->xdp_rxq); + goto err; + } + rx_ring->xdp_prog = adapter->xdp_prog; return 0; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2c85e5cac2a9..283bde85c455 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -854,6 +854,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file, tun->dev, tfile->queue_index); if (err < 0) goto out; + err = xdp_rxq_info_reg_mem_model(>xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err < 0) { + xdp_rxq_info_unreg(>xdp_rxq); + goto out; + } err = 0; } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index f50e1ad81ad4..42d338fe9a8d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1305,6 +1305,13 @@ static int virtnet_open(struct net_device *dev) if (err < 0) return err; + err = xdp_rxq_info_reg_mem_model(>rq[i].xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if 
(err < 0) { + xdp_rxq_info_unreg(>rq[i].xdp_rxq); + return err; + } + virtnet_napi_enable(vi->rq[i].vq, >rq[i].napi); virtnet_napi_tx_enable(vi, vi->sq[i].vq, >sq[i].napi); } diff --git a/include/net/xdp.h b/include/net/xdp.h index ea3773f94f65..5f67c62540aa 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -41,6 +41,7 @@ enum xdp_mem_type { struct xdp_mem_info { u32 type; /* enum xdp_mem_type, but known size type */ + u32 id; }; struct
[net-next V10 PATCH 10/16] mlx5: register a memory model when XDP is enabled
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame. This enable a different memory model, thus activating another code path in the xdp_return_frame API. V2: Fixed issues pointed out by Tariq. Signed-off-by: Jesper Dangaard BrouerReviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index b29c1d93f058..2dca0933dfd3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -512,6 +512,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } + /* This must only be activate for order-0 pages */ + if (rq->xdp_prog) { + err = xdp_rxq_info_reg_mem_model(>xdp_rxq, +MEM_TYPE_PAGE_ORDER0, NULL); + if (err) + goto err_rq_wq_destroy; + } + for (i = 0; i < wq_sz; i++) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(>wq, i);
[net-next V10 PATCH 09/16] i40e: convert to use generic xdp_frame and xdp_return_frame API
Also convert driver i40e, which very recently got XDP_REDIRECT support in commit d9314c474d4f ("i40e: add support for XDP_REDIRECT"). V7: This patch got added in V7 of this patchset. Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 20 +++- drivers/net/ethernet/intel/i40e/i40e_txrx.h |1 + 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index f174c72480ab..96c54cbfb1f9 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -638,7 +638,8 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring, if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB) kfree(tx_buffer->raw_buf); else if (ring_is_xdp(ring)) - page_frag_free(tx_buffer->raw_buf); + xdp_return_frame(tx_buffer->xdpf->data, +_buffer->xdpf->mem); else dev_kfree_skb_any(tx_buffer->skb); if (dma_unmap_len(tx_buffer, len)) @@ -841,7 +842,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, /* free the skb/XDP data */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buf->raw_buf); + xdp_return_frame(tx_buf->xdpf->data, _buf->xdpf->mem); else napi_consume_skb(tx_buf->skb, napi_budget); @@ -2225,6 +2226,8 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring, if (!xdp_prog) goto xdp_out; + prefetchw(xdp->data_hard_start); /* xdp_frame write */ + act = bpf_prog_run_xdp(xdp_prog, xdp); switch (act) { case XDP_PASS: @@ -3481,25 +3484,32 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring) { - u32 size = xdp->data_end - xdp->data; u16 i = xdp_ring->next_to_use; struct i40e_tx_buffer *tx_bi; struct i40e_tx_desc *tx_desc; + struct xdp_frame *xdpf; dma_addr_t dma; + u32 size; + + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return I40E_XDP_CONSUMED; + + size = xdpf->len; if 
(!unlikely(I40E_DESC_UNUSED(xdp_ring))) { xdp_ring->tx_stats.tx_busy++; return I40E_XDP_CONSUMED; } - dma = dma_map_single(xdp_ring->dev, xdp->data, size, DMA_TO_DEVICE); + dma = dma_map_single(xdp_ring->dev, xdpf->data, size, DMA_TO_DEVICE); if (dma_mapping_error(xdp_ring->dev, dma)) return I40E_XDP_CONSUMED; tx_bi = _ring->tx_bi[i]; tx_bi->bytecount = size; tx_bi->gso_segs = 1; - tx_bi->raw_buf = xdp->data; + tx_bi->xdpf = xdpf; /* record length, and DMA address */ dma_unmap_len_set(tx_bi, len, size); diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 3043483ec426..857b1d743c8d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -306,6 +306,7 @@ static inline unsigned int i40e_txd_use_count(unsigned int size) struct i40e_tx_buffer { struct i40e_tx_desc *next_to_watch; union { + struct xdp_frame *xdpf; struct sk_buff *skb; void *raw_buf; };
[net-next V10 PATCH 01/16] mlx5: basic XDP_REDIRECT forward support
This implements basic XDP redirect support in the mlx5 driver. Notice that ndo_xdp_xmit() is NOT implemented, because that API needs some changes that this patchset is working towards. The main purpose of this patch is to have different drivers doing XDP_REDIRECT, to show how different memory models behave in a cross-driver world. Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect, as the return API does not exist yet to keep this mapped. Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing, introduce xdpsq.db.redirect_flush boolean. V9: Adjust for commit 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") Signed-off-by: Jesper Dangaard Brouer Reviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h|1 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 27 --- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 30cad07be2b5..1a05d1072c5e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -392,6 +392,7 @@ struct mlx5e_xdpsq { struct { struct mlx5e_dma_info *di; bool doorbell; + bool redirect_flush; } db; /* read only */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 176645762e49..0e24be05907f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, return 0; } +static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, + struct mlx5e_dma_info *dma_info) +{ + dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), + rq->buff.map_dir); +} + void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, bool recycle) { if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info)) return; - 
dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), - rq->buff.map_dir); + mlx5e_page_dma_unmap(rq, dma_info); put_page(dma_info->page); } @@ -800,9 +806,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di, void *va, u16 *rx_headroom, u32 *len) { - const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); + struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); struct xdp_buff xdp; u32 act; + int err; if (!prog) return false; @@ -823,6 +830,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, ))) trace_xdp_exception(rq->netdev, prog, act); return true; + case XDP_REDIRECT: + /* When XDP enabled then page-refcnt==1 here */ + err = xdp_do_redirect(rq->netdev, , prog); + if (!err) { + __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); + rq->xdpsq.db.redirect_flush = true; + mlx5e_page_dma_unmap(rq, di); + } + return true; default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED: @@ -1140,6 +1156,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget) xdpsq->db.doorbell = false; } + if (xdpsq->db.redirect_flush) { + xdp_do_flush_map(); + xdpsq->db.redirect_flush = false; + } + mlx5_cqwq_update_db_record(>wq); /* ensure cq space is freed before enabling more cqes */
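The redirect_flush flag added here implements a batching pattern: each XDP_REDIRECT packet only marks the flag, and xdp_do_flush_map() runs once at the end of the NAPI poll cycle instead of once per packet. A userspace sketch of that pattern (counters stand in for the real map-flush work; this is not the driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the mlx5e_poll_rx_cq() redirect_flush batching. */
static bool redirect_flush;
static int flushes;

static void handle_redirect_sketch(void)
{
	/* Per-packet work: only mark that a flush is pending */
	redirect_flush = true;
}

static void poll_end_sketch(void)
{
	/* Per-poll work: flush at most once, then clear the flag
	 * (xdp_do_flush_map() in the kernel) */
	if (redirect_flush) {
		flushes++;
		redirect_flush = false;
	}
}
```

Deferring the flush amortizes the (potentially expensive) map flush across all redirected packets in one poll.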
[net-next V10 PATCH 02/16] xdp: introduce xdp_return_frame API and use in cpumap
Introduce an xdp_return_frame API, and convert over cpumap as the first user, given it have queued XDP frame structure to leverage. V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck. V6: Remove comment that id will be added later (Req by Alex Duyck) V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot) Signed-off-by: Jesper Dangaard Brouer--- include/net/xdp.h | 27 +++ kernel/bpf/cpumap.c | 60 +++ net/core/xdp.c | 18 +++ 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index b2362ddfa694..e4207699c410 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -33,16 +33,43 @@ * also mandatory during RX-ring setup. */ +enum xdp_mem_type { + MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ + MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_MAX, +}; + +struct xdp_mem_info { + u32 type; /* enum xdp_mem_type, but known size type */ +}; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; u32 reg_state; + struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ + +static inline +void xdp_return_frame(void *data, struct xdp_mem_info *mem) +{ + if (mem->type == MEM_TYPE_PAGE_SHARED) + page_frag_free(data); + + if (mem->type == MEM_TYPE_PAGE_ORDER0) { + struct page *page = virt_to_page(data); /* Assumes order0 page*/ + + put_page(page); + } +} + int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq); void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq); bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum xdp_mem_type type, void *allocator); #endif /* __LINUX_NET_XDP_H__ */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index a4bb0b34375a..3e4bbcbe3e86 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -19,6 +19,7 @@ #include #include 
#include +#include #include #include @@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) return ERR_PTR(err); } -static void __cpu_map_queue_destructor(void *ptr) -{ - /* The tear-down procedure should have made sure that queue is -* empty. See __cpu_map_entry_replace() and work-queue -* invoked cpu_map_kthread_stop(). Catch any broken behaviour -* gracefully and warn once. -*/ - if (WARN_ON_ONCE(ptr)) - page_frag_free(ptr); -} - -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) -{ - if (atomic_dec_and_test(>refcnt)) { - /* The queue should be empty at this point */ - ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor); - kfree(rcpu->queue); - kfree(rcpu); - } -} - static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) { atomic_inc(>refcnt); @@ -188,6 +168,10 @@ struct xdp_pkt { u16 len; u16 headroom; u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, +* while mem info is valid on remote CPU. +*/ + struct xdp_mem_info mem; struct net_device *dev_rx; }; @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); xdp_pkt->metasize = metasize; + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_pkt->mem = xdp->rxq->mem; + return xdp_pkt; } @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, return skb; } +static void __cpu_map_ring_cleanup(struct ptr_ring *ring) +{ + /* The tear-down procedure should have made sure that queue is +* empty. See __cpu_map_entry_replace() and work-queue +* invoked cpu_map_kthread_stop(). Catch any broken behaviour +* gracefully and warn once. 
+*/ + struct xdp_pkt *xdp_pkt; + + while ((xdp_pkt = ptr_ring_consume(ring))) + if (WARN_ON_ONCE(xdp_pkt)) + xdp_return_frame(xdp_pkt, _pkt->mem); +} + +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) +{ + if (atomic_dec_and_test(>refcnt)) { + /* The queue should be empty at this point */ + __cpu_map_ring_cleanup(rcpu->queue); + ptr_ring_cleanup(rcpu->queue, NULL); + kfree(rcpu->queue); + kfree(rcpu); + } +} + static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; @@
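The core of the new API is that xdp_return_frame() dispatches on xdp_mem_info.type, so each memory model gets its own release path. A userspace sketch of that dispatch, with counters standing in for page_frag_free()/put_page() (this is an illustration, not the kernel helper):

```c
#include <assert.h>

/* Sketch mirroring enum xdp_mem_type and the type-dispatched return. */
enum mem_type_sketch {
	PAGE_SHARED = 0,	/* split-page refcnt based model */
	PAGE_ORDER0,		/* original XDP full-page model */
};

struct mem_info_sketch {
	unsigned int type;
};

static int frag_frees, page_puts;

static void return_frame_sketch(void *data, const struct mem_info_sketch *mem)
{
	(void)data;
	if (mem->type == PAGE_SHARED)
		frag_frees++;	/* page_frag_free(data) in the kernel */
	else if (mem->type == PAGE_ORDER0)
		page_puts++;	/* put_page(virt_to_page(data)) in the kernel */
}
```

Because the mem info travels with the frame, the release can run on a remote CPU (e.g. at DMA-TX completion) without touching the short-lived xdp_rxq_info.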
[net-next V10 PATCH 03/16] ixgbe: use xdp_return_frame API
Extend struct ixgbe_tx_buffer to store the xdp_mem_info. Notice that this could be optimized further by putting this into a union in the struct ixgbe_tx_buffer, but this patchset works towards removing this again. Thus, this is not done. Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/ixgbe/ixgbe.h |1 + drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |6 -- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 4f08c712e58e..abb5248e917e 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -250,6 +250,7 @@ struct ixgbe_tx_buffer { DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; + struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index afadba99f7b8..0bfe6cf2bf8b 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -5797,7 +5797,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring) /* Free all the Tx ring sk_buffs */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); else dev_kfree_skb_any(tx_buffer->skb); @@ -8366,6 +8366,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, dma_unmap_len_set(tx_buffer, len, len); dma_unmap_addr_set(tx_buffer, dma, dma); tx_buffer->data = xdp->data; + tx_buffer->xdp_mem = xdp->rxq->mem; + tx_desc->read.buffer_addr = cpu_to_le64(dma); /* put descriptor type bits */
[net-next V10 PATCH 05/16] xdp: introduce a new xdp_frame type
This is needed to convert drivers tuntap and virtio_net.

This is a generalization of what is done inside cpumap, which will be
converted later.

Signed-off-by: Jesper Dangaard Brouer
---
 include/net/xdp.h | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 15f8ade008b5..756c42811e78 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -58,6 +58,46 @@ struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+	void *data;
+	u16 len;
+	u16 headroom;
+	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdp_frame;
+	int metasize;
+	int headroom;
+
+	/* Assure headroom is available for storing info */
+	headroom = xdp->data - xdp->data_hard_start;
+	metasize = xdp->data - xdp->data_meta;
+	metasize = metasize > 0 ? metasize : 0;
+	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+		return NULL;
+
+	/* Store info in top of packet */
+	xdp_frame = xdp->data_hard_start;
+
+	xdp_frame->data = xdp->data;
+	xdp_frame->len = xdp->data_end - xdp->data;
+	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+	xdp_frame->metasize = metasize;
+
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_frame->mem = xdp->rxq->mem;
+
+	return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
[net-next V10 PATCH 00/16] XDP redirect memory return API
Resubmit V10 against net-next, as it contains NIC driver changes.

This patchset works towards supporting different XDP RX-ring memory
allocators, as this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after the cpumap
xdp_pkt), and both ndo_xdp_xmit and the new xdp_return_frame end up
using this.

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return-allocator type
introduced. This will be an integration point for AF_XDP zero-copy.

The mlx5 driver evolves into using the page_pool, and sees a
performance increase (with ndo_xdp_xmit out the ixgbe driver) from
6 Mpps to 12 Mpps.

The patchset stops at 16 patches (one over the limit), but more API
changes are planned; specifically, extending the ndo_xdp_xmit and
xdp_return_frame APIs to support bulking, as this will address some
known limits.
V2: Updated according to Tariq's feedback
V3: Updated based on feedback from Jason Wang and Alex Duyck
V4: Updated based on feedback from Tariq and Jason
V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
V6: Updated based on feedback from Eric Dumazet and Alex Duyck
V7: Adapt to i40e that got XDP_REDIRECT support in-between
V8: Updated based on feedback from kbuild test robot, and adjust for mlx5
    changes; page_pool only compiled into kernel when drivers Kconfig
    'select' feature
V9: Remove some inline statements, let compiler decide what to inline;
    fix return value in virtio_net driver;
    adjust for mlx5 changes in-between submissions
V10: Minor adjust for mlx5 requested by Tariq;
     resubmit against net-next

---

Jesper Dangaard Brouer (16):
      mlx5: basic XDP_REDIRECT forward support
      xdp: introduce xdp_return_frame API and use in cpumap
      ixgbe: use xdp_return_frame API
      xdp: move struct xdp_buff from filter.h to xdp.h
      xdp: introduce a new xdp_frame type
      tun: convert to use generic xdp_frame and xdp_return_frame API
      virtio_net: convert to use generic xdp_frame and xdp_return_frame API
      bpf: cpumap convert to use generic xdp_frame
      i40e: convert to use generic xdp_frame and xdp_return_frame API
      mlx5: register a memory model when XDP is enabled
      xdp: rhashtable with allocator ID to pointer mapping
      page_pool: refurbish version of page_pool code
      xdp: allow page_pool as an allocator type in xdp_return_frame
      mlx5: use page_pool for xdp_return_frame call
      xdp: transition into using xdp_frame for return API
      xdp: transition into using xdp_frame for ndo_xdp_xmit

 drivers/net/ethernet/intel/i40e/i40e_txrx.c       |  33 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h       |   3 
 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |   3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  38 ++-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |   1 
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |   4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |  42 ++-
 drivers/net/tun.c         |  60 ++--
 drivers/net/virtio_net.c  |  67 +++-
 drivers/vhost/net.c       |   7 
 include/linux/filter.h    |  24 --
 include/linux/if_tun.h    |   4 
 include/linux/netdevice.h |   4 
 include/net/page_pool.h   | 143 +
 include/net/xdp.h         |  83 +
 kernel/bpf/cpumap.c       | 132 +++--
 net/Kconfig               |   3 
 net/core/Makefile         |   1 
 net/core/filter.c         |  17 +
 net/core/page_pool.c      | 317 +
 net/core/xdp.c            | 269 ++
 22 files changed, 1094 insertions(+), 198 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c
--
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use
> x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that
> in below fix patch.

x86_swiotlb_dma_ops should not exist any more, and x86 now uses
dma_direct_ops. Looks like you are applying it to an old kernel :)

> Performance improved to 8.9 Mpps from approx 6.5 Mpps.
>
> (This was without my bulking for net_device->ndo_xdp_xmit, so that
> number should improve more).

What is the number for the otherwise comparable setup without retpolines?
[PATCH] VSOCK: make af_vsock.ko removable again
Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
vsock_init_tables()") introduced a module_init() function without a
corresponding module_exit() function.

Modules with an init function can only be removed if they also have an
exit function. Therefore the vsock module was considered "permanent"
and could not be removed.

This patch adds an empty module_exit() function so that "rmmod vsock"
works. No explicit cleanup is required because:

1. Transports call vsock_core_exit() upon exit and cannot be removed
   while sockets are still alive.
2. vsock_diag.ko does not perform any action that requires cleanup by
   vsock.ko.

Reported-by: Xiumei Mu
Cc: Cong Wang
Cc: Jorgen Hansen
Signed-off-by: Stefan Hajnoczi
---
 net/vmw_vsock/af_vsock.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index aac9b8f6552e..c1076c19b858 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -2018,7 +2018,13 @@ const struct vsock_transport *vsock_core_get_transport(void)
 }
 EXPORT_SYMBOL_GPL(vsock_core_get_transport);
 
+static void __exit vsock_exit(void)
+{
+	/* Do nothing.  This function makes this module removable. */
+}
+
 module_init(vsock_init_tables);
+module_exit(vsock_exit);
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMware Virtual Socket Family");
-- 
2.14.3
Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
On 16/04/2018 7:51 PM, David Miller wrote:
> From: Zhu Yanjun
> Date: Sun, 15 Apr 2018 21:02:07 -0400
>
>> While a faulty cable is in use, or on an HCA firmware error, the HCA
>> device will go offline. When the driver accesses this offline device,
>> the following call trace pops out.
>> ...
>> In the above call trace, the function mlx4_cmd_poll calls the function
>> mlx4_cmd_post to access the HCA while the HCA is offline. Then
>> mlx4_cmd_post returns an error -EIO. On -EIO, the function
>> mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA, and the
>> above call trace pops out. This is not reasonable: since the HCA
>> device is offline when it is being accessed, it should not be reset
>> again.
>>
>> In this patch, since the HCA is offline, the function mlx4_cmd_post
>> returns an error -EINVAL. On -EINVAL, the function mlx4_cmd_poll
>> returns directly instead of resetting the HCA.
>>
>> CC: Srinivas Eeda
>> CC: Junxiao Bi
>> Suggested-by: Håkon Bugge
>> Signed-off-by: Zhu Yanjun
>
> Tariq, I'm assuming you'll take this in and send it to me later.
> Thanks.

Yes, I will review and send if all is OK.

Thanks,
Tariq
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Tue, Apr 17, 2018 at 09:07:01AM +0200, Jesper Dangaard Brouer wrote:
> > > number should improve more).
> >
> > What is the number for the otherwise comparable setup without retpolines?
>
> Approx 12 Mpps.
>
> You forgot to handle the dma_direct_mapping_error() case, which still
> used the retpoline in the above 8.9 Mpps measurement. I fixed it up and
> performance increased to 9.6 Mpps.
>
> Notice, in this test there are still two retpoline/indirect-calls
> left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF
> prog.

But that seems like a pretty clear indicator that we want the fast path
direct mapping. I'll try to find some time over the next weeks to do a
cleaner version of it.
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
Hi Michael, Adrian,

Thanks for your patch!

On Tue, Apr 17, 2018 at 4:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz
>
> This complements the fix in 82533ad9a1c that removed the free_irq

Please quote the commit's subject, too, like

    ... fix in commit 82533ad9a1c ("net: ethernet: ax88796: don't call
    free_irq without request_irq first")

BTW, I have a git alias for that:

    $ git help fixes
    `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
    $ git fixes 82533ad9a1c
    Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
    without request_irq first")

> call in the error path of probe, to also not call free_irq when
> remove is called to revert the effects of probe.
>
> Signed-off-by: Michael Karcher

The patch is authored by Adrian, but his SoB is missing?
Michael (Schmitz): as you took the patch, you should add your SoB, too.

For the actual patch contents:
Reviewed-by: Geert Uytterhoeven

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
RE: [PATCH 0/3] Receive Side Coalescing for macb driver
From: David Miller [mailto:da...@davemloft.net]
Sent: 16 kwietnia 2018 17:09

> From: Rafal Ozieblo
> Date: Sat, 14 Apr 2018 21:53:07 +0100
>
>> This patch series adds support for receive side coalescing for the
>> Cadence GEM driver. Receive segmentation coalescing is a mechanism to
>> reduce CPU overhead. This is done by coalescing received TCP message
>> segments together into a single large message. This means that when
>> the message is complete the CPU only has to process the single header
>> and act upon the one data payload.
>
> You're going to have to think more deeply about enabling this feature.
>
> If you can't adjust the receive buffer offset, then the various packet
> header fields will be unaligned.
>
> On certain architectures this will result in unaligned traps all over
> the networking stack as the packet is being processed.
>
> So enabling this by default will hurt performance on such systems a lot.
>
> The whole "skb_reserve(skb, NET_IP_ALIGN)" is not just for fun, it is
> absolutely essential.

I totally agree with you. But the issue is with IP cores which have this
feature implemented. Even when the user does not want to use the feature,
if he bought an IP configuration that supports RSC, he has to switch off
IP alignment. There is no IP alignment with RSC in the GEM:
"When the gem rsc define has been set the receive buffer offset cannot
be changed in the network configuration register."
If the IP supports RSC and the skb has 2 bytes reserved for alignment,
we end up with no packets received correctly (2 bytes missing in each
skb). We can either leave a few customers without support in the Linux
driver, or let them use the driver with decreased performance.
Re: [PATCH v2 2/8] net: ax88796: Attach MII bus only when open
On Tue, Apr 17, 2018 at 02:08:09PM +1200, Michael Schmitz wrote:
> From: Michael Karcher
>
> Call ax_mii_init in ax_open(), and unregister/remove mdiobus resources
> in ax_close().
>
> This is needed to be able to unload the module, as the module is busy
> while the MII bus is attached.
>
> Signed-off-by: Michael Karcher
> Signed-off-by: Michael Schmitz

Reviewed-by: Andrew Lunn

    Andrew
Re: [PATCH v2 8/8] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
On Tue, Apr 17, 2018 at 02:08:15PM +1200, Michael Schmitz wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.
>
> Signed-off-by: Michael Karcher
> Signed-off-by: Michael Schmitz
> ---
>  drivers/net/ethernet/8390/Kconfig    |  14 +-
>  drivers/net/ethernet/8390/Makefile   |   1 +
>  drivers/net/ethernet/8390/xsurf100.c | 411 ++++++++++++++++++++++++++++++++++
>  3 files changed, 425 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/net/ethernet/8390/xsurf100.c
>
> diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
> index fdc6734..0cadd45 100644
> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>
>  config AX88796
>  	tristate "ASIX AX88796 NE2000 clone support"
> -	depends on (ARM || MIPS || SUPERH)
> +	depends on (ARM || MIPS || SUPERH || AMIGA)

Hi Michael

Will it compile on other platforms? If so, it is a good idea to add
COMPILE_TEST as well.

    Andrew
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > [...]
> > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq,
> > > > > > > +			     unsigned int head,
> > > > > > > +			     void **ctx)
> > > > > > > +{
> > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > +	unsigned int i, j;
> > > > > > > +
> > > > > > > +	/* Clear data ptr. */
> > > > > > > +	vq->desc_state[head].data = NULL;
> > > > > > > +
> > > > > > > +	i = head;
> > > > > > > +
> > > > > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > +		desc->flags = 0x0;
> > > > > >
> > > > > > Looks like this is unnecessary.
> > > > >
> > > > > It's safer to zero it. If we don't zero it, after we
> > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > this function, the desc is still available to the
> > > > > device.
> > > >
> > > > Well detach_unused_buf_packed() should be called after device is
> > > > stopped, otherwise even if you try to clear, there will still be
> > > > a window that device may use it.
> > >
> > > This is not about whether the device has been stopped or
> > > not. We don't have other places to re-initialize the ring
> > > descriptors and wrap_counter. So they need to be set to
> > > the correct values when doing detach_unused_buf.
> > >
> > > Best regards,
> > > Tiwei Bie
> >
> > find vqs is the time to do it.
>
> The .find_vqs() will call .setup_vq() which will eventually
> call vring_create_virtqueue(). It's a different case.
> Here we're talking about re-initializing the descs and updating
> the wrap counter when detaching the unused descs (In this
> case, split ring just needs to decrease vring.avail->idx).
>
> Best regards,
> Tiwei Bie

There's no requirement that virtqueue_detach_unused_buf
re-initializes the descs. It happens on the cleanup path just before
drivers delete the vqs.

-- 
MST
[bpf-next PATCH] samples/bpf: fix xdp_monitor user output for tracepoint exception
The variable rec_i contains an XDP action code, not an error. Thus,
using err2str() was wrong; it should have been action2str().

Signed-off-by: Jesper Dangaard Brouer
---
 samples/bpf/xdp_monitor_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c
index eec14520d513..894bc64c2cac 100644
--- a/samples/bpf/xdp_monitor_user.c
+++ b/samples/bpf/xdp_monitor_user.c
@@ -330,7 +330,7 @@ static void stats_print(struct stats_record *stats_rec,
 			pps = calc_pps_u64(r, p, t);
 			if (pps > 0)
 				printf(fmt1, "Exception", i,
-				       0.0, pps, err2str(rec_i));
+				       0.0, pps, action2str(rec_i));
 		}
 		pps = calc_pps_u64(&rec->total, &prev->total, t);
 		if (pps > 0)
Re: [PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper
Hi Nikita,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Nikita-V-Shirokov/introduction-of-bpf_xdp_adjust_tail/20180417-211905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-s1-201815 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   net/core/filter.c: In function 'bpf_xdp_adjust_tail':
>> net/core/filter.c:2726:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     void *data_end = xdp->data_end + offset;
     ^~~~

vim +2726 net/core/filter.c

  2719
  2720	BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
  2721	{
  2722		/* only shrinking is allowed for now. */
  2723		if (unlikely(offset > 0))
  2724			return -EINVAL;
  2725
> 2726		void *data_end = xdp->data_end + offset;
  2727
  2728		if (unlikely(data_end < xdp->data + ETH_HLEN))
  2729			return -EINVAL;
  2730
  2731		xdp->data_end = data_end;
  2732
  2733		return 0;
  2734	}
  2735

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: Simon Horman
Date: Tue, 17 Apr 2018 10:50:26 +0200

> From: Masaru Nagai
>
> [   58.490829] =
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted
 ...
> Fixes: f51bdc236b6c ("ravb: Add dma queue interrupt support")
> Signed-off-by: Masaru Nagai
> Signed-off-by: Kazuya Mizuguchi
> Signed-off-by: Simon Horman

This really needs more than the lockdep dump in the commit message,
explaining what the problem is and how it was corrected.

Are the wrong interrupt types enabled? Are they handled improperly?
It definitely isn't clear from just reading the patch.
Re: One question about __tcp_select_window()
On 04/17/2018 06:53 AM, Wang Jian wrote:
> I tested the fix with 4.17.0-rc1+ and it seems to work.
>
> 1. iperf -c IP -i 20 -t 60 -w 1K
>    with-fix vs without-fix: 1.15 Gbits/sec vs 1.05 Gbits/sec
>    I also tried other windows and have similar results.
>
> 2. Used tcp probe to trace snd_wnd.
>    with-fix vs without-fix: 1245568 vs 1042816
>
> 3. I don't see extra retransmits/drops.

Unfortunately I have no idea what exact problem you had to solve.

Setting small windows is not exactly the path we are taking.
And I do not know how many side effects your change will have for
'standard' flows using autotuning or sane windows.
Re: One question about __tcp_select_window()
I tested the fix with 4.17.0-rc1+ and it seems to work.

1. iperf -c IP -i 20 -t 60 -w 1K
   with-fix vs without-fix: 1.15 Gbits/sec vs 1.05 Gbits/sec
   I also tried other windows and have similar results.

2. Used tcp probe to trace snd_wnd.
   with-fix vs without-fix: 1245568 vs 1042816

3. I don't see extra retransmits/drops.

On Sun, Apr 15, 2018 at 8:50 PM, Wang Jian wrote:
> Hi all,
>
> While I was reading the __tcp_select_window() code, I found that it may
> return a smaller window.
> Below is one scenario I thought of; it may not be right:
> In the function __tcp_select_window(), assume:
> full_space is 6*mss, free_space is 2*mss, tp->rcv_wnd is 3*mss.
> Also assume window scaling is disabled. Then
> window = tp->rcv_wnd > free_space && window > free_space,
> so it will round down free_space and return it.
>
> Is this expected behavior? The comment also says
> "Get the largest window that is a nice multiple of mss."
>
> Should we do something like below? Or am I missing something?
> I don't know how to verify it now.
>
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
>  	 * We also don't do any window rounding when the free space
>  	 * is too small.
>  	 */
> -	if (window <= free_space - mss || window > free_space)
> +	if (window <= free_space - mss)
>  		window = rounddown(free_space, mss);
> -	else if (mss == full_space &&
> +	else if (window <= free_space && mss == full_space &&
>  		 free_space > window + (full_space >> 1))
>  		window = free_space;
>  	}
>
> Thanks.
Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Jesper Dangaard Brouer
Date: Tue, 17 Apr 2018 14:58:52 +0200

> Resubmit V10 against net-next, as it contains NIC driver changes.

Series applied, thanks Jesper.
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
From: Geert Uytterhoeven
Date: Tue, 17 Apr 2018 10:20:25 +0200

> BTW, I have a git alias for that:
>
> $ git help fixes
> `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
> $ git fixes 82533ad9a1c
> Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
> without request_irq first")

Thanks for sharing :)
Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Alexei Starovoitov
Date: Tue, 17 Apr 2018 06:53:33 -0700

> looks like you forgot to include extra patch to fixup xdp_adjust_head()
> helper. Otherwise reused xdp_frame in the top of that packet is leaking
> kernel pointers into bpf program.
> Could you please respin with that change included?

Just in time, I was about to push this back out. :)

Jesper, please respin with Alexei's requested changes.