Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Toshiaki Makita
On 2018/11/30 11:40, Jason Wang wrote:
> On 2018/11/30 10:30 AM, Prashant Bhole wrote:
>> In tun.c skb->len was accessed while doing stats accounting after a
>> call to netif_receive_skb(). We cannot access the skb after this call
>> because buffers may be dropped.
>>
>> The fix for this bug would be to store skb->len in a local variable
>> and then use it after netif_receive_skb(). IMO using the xdp data size
>> for accounting bytes is better because the input for tun_xdp_one() is
>> an xdp_buff.
>>
>> Hence this patch:
>> - fixes a bug by removing skb access after netif_receive_skb()
>> - uses xdp data size for accounting bytes
...
>> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through
>> sendmsg()")
>> Reviewed-by: Toshiaki Makita 
>> Signed-off-by: Prashant Bhole 
>> ---
>>   drivers/net/tun.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index e244f5d7512a..6e388792c0a8 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>  struct tun_file *tfile,
>>  struct xdp_buff *xdp, int *flush)
>>   {
>> +    unsigned int datasize = xdp->data_end - xdp->data;
>>   struct tun_xdp_hdr *hdr = xdp->data_hard_start;
>>   struct virtio_net_hdr *gso = &hdr->gso;
>>   struct tun_pcpu_stats *stats;
>> @@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>   stats = get_cpu_ptr(tun->pcpu_stats);
>>   u64_stats_update_begin(&stats->syncp);
>>   stats->rx_packets++;
>> -    stats->rx_bytes += skb->len;
>> +    stats->rx_bytes += datasize;
>>   u64_stats_update_end(&stats->syncp);
>>   put_cpu_ptr(stats);
>>   
> 
> 
> Good catch, but you probably need to calculate the datasize after XDP
> processing since it may modify the packet length.

(+CC David Ahern who may be interested in this area.)

I'd rather think we should calculate it before XDP.
I checked several drivers' behavior. mlx5, bnxt and qede use hardware
counters for rx bytes, which means the size is calculated before XDP.
nfp calculates it in software, but before XDP. On the other hand, Intel
drivers use skb->len. So currently the bytes counters do not look
consistent, but I think calculating before XDP is more common.
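As an illustration, a minimal sketch of the "calculate before XDP" style;
the function and ring names (example_rx_xdp, example_rq) are hypothetical,
not taken from any of the drivers above:

/* Sketch: capture the byte count before the XDP program can shrink or
 * grow the frame, then account it regardless of the verdict.
 */
static u32 example_rx_xdp(struct example_rq *rq, struct bpf_prog *prog,
			  struct xdp_buff *xdp)
{
	unsigned int datasize = xdp->data_end - xdp->data; /* pre-XDP length */
	u32 act = bpf_prog_run_xdp(prog, xdp);

	u64_stats_update_begin(&rq->syncp);
	rq->rx_packets++;
	rq->rx_bytes += datasize;	/* counted even for DROP/TX/REDIRECT */
	u64_stats_update_end(&rq->syncp);

	return act;
}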

-- 
Toshiaki Makita



Re: consistency for statistics with XDP mode

2018-11-27 Thread Toshiaki Makita
On 2018/11/28 13:03, Jason Wang wrote:
> On 2018/11/27 3:04 PM, Toshiaki Makita wrote:
>> On 2018/11/26 10:37, Toshiaki Makita wrote:
>>> On 2018/11/23 1:43, David Ahern wrote:
>>>> On 11/21/18 5:53 PM, Toshiaki Makita wrote:
>>>>>> We really need consistency in the counters and at a minimum, users
>>>>>> should be able to track packet and byte counters for both Rx and Tx
>>>>>> including XDP.
>>>>>>
>>>>>> It seems to me the Rx and Tx packet, byte and dropped counters
>>>>>> returned
>>>>>> for the standard device stats (/proc/net/dev, ip -s li show, ...)
>>>>>> should
>>>>>> include all packets managed by the driver regardless of whether
>>>>>> they are
>>>>>> forwarded / dropped in XDP or go up the Linux stack. This also aligns
>>>>> Agreed. When I introduced virtio_net XDP counters, I just forgot to
>>>>> update tx packets/bytes counters on ndo_xdp_xmit. Probably I
>>>>> thought it
>>>>> is handled by free_old_xmit_skbs.
>>>> Do you have some time to look at adding the Tx counters to virtio_net?
>>> hoping I can make some time within a couple of days.
>> Hmm... It looks like free_old_xmit_skbs() calls dev_consume_skb_any()
>> for xdp_frame when napi_tx is enabled. I will fix this beforehand.
> 
> 
> Good catch. But the fix may require some thought. E.g. one idea is to not
> call free_old_xmit_skbs() for the XDP TX ring?

Yes, that's what I'm planning to do.

-- 
Toshiaki Makita



Re: consistency for statistics with XDP mode

2018-11-26 Thread Toshiaki Makita
On 2018/11/26 10:37, Toshiaki Makita wrote:
> On 2018/11/23 1:43, David Ahern wrote:
>> On 11/21/18 5:53 PM, Toshiaki Makita wrote:
>>>> We really need consistency in the counters and at a minimum, users
>>>> should be able to track packet and byte counters for both Rx and Tx
>>>> including XDP.
>>>>
>>>> It seems to me the Rx and Tx packet, byte and dropped counters returned
>>>> for the standard device stats (/proc/net/dev, ip -s li show, ...) should
>>>> include all packets managed by the driver regardless of whether they are
>>>> forwarded / dropped in XDP or go up the Linux stack. This also aligns
>>>
>>> Agreed. When I introduced virtio_net XDP counters, I just forgot to
>>> update tx packets/bytes counters on ndo_xdp_xmit. Probably I thought it
>>> is handled by free_old_xmit_skbs.
>>
>> Do you have some time to look at adding the Tx counters to virtio_net?
> 
> hoping I can make some time within a couple of days.

Hmm... It looks like free_old_xmit_skbs() calls dev_consume_skb_any()
for xdp_frame when napi_tx is enabled. I will fix this beforehand.

-- 
Toshiaki Makita



Re: consistency for statistics with XDP mode

2018-11-25 Thread Toshiaki Makita
On 2018/11/23 1:43, David Ahern wrote:
> On 11/21/18 5:53 PM, Toshiaki Makita wrote:
>>> We really need consistency in the counters and at a minimum, users
>>> should be able to track packet and byte counters for both Rx and Tx
>>> including XDP.
>>>
>>> It seems to me the Rx and Tx packet, byte and dropped counters returned
>>> for the standard device stats (/proc/net/dev, ip -s li show, ...) should
>>> include all packets managed by the driver regardless of whether they are
>>> forwarded / dropped in XDP or go up the Linux stack. This also aligns
>>
>> Agreed. When I introduced virtio_net XDP counters, I just forgot to
>> update tx packets/bytes counters on ndo_xdp_xmit. Probably I thought it
>> is handled by free_old_xmit_skbs.
> 
> Do you have some time to look at adding the Tx counters to virtio_net?

hoping I can make some time within a couple of days.

-- 
Toshiaki Makita



Re: consistency for statistics with XDP mode

2018-11-21 Thread Toshiaki Makita
On 2018/11/22 6:06, David Ahern wrote:
> Paweł ran some more XDP tests yesterday and from it found a couple of
> issues. One is a panic in the mlx5 driver unloading the bpf program
> (mlx5e_xdp_xmit); he will send a separate email for that problem.
> 
> The problem I wanted to discuss here is statistics for XDP context. The
> short of it is that we need consistency in the counters across NIC
> drivers and virtual devices. Right now stats are specific to a driver
> with no clear accounting for the packets and bytes handled in XDP.
> 
> For example virtio has some stats as device private data extracted via
> ethtool:
> $ ethtool -S eth2 | grep xdp
> ...
>  rx_queue_3_xdp_packets: 5291
>  rx_queue_3_xdp_tx: 0
>  rx_queue_3_xdp_redirects: 5163
>  rx_queue_3_xdp_drops: 0
> ...
>  tx_queue_3_xdp_tx: 5163
>  tx_queue_3_xdp_tx_drops: 0
> 
> And the standard counters appear to track bytes and packets for Rx, but
> not Tx if the packet is forwarded in XDP.
> 
> Similarly, mlx5 has some counters (thanks to Jesper and Toke for helping
> out here):
> 
> $ ethtool -S mlx5p1 | grep xdp
>  rx_xdp_drop: 86468350180
>  rx_xdp_redirect: 18860584
>  rx_xdp_tx_xmit: 0
>  rx_xdp_tx_full: 0
>  rx_xdp_tx_err: 0
>  rx_xdp_tx_cqe: 0
>  tx_xdp_xmit: 0
>  tx_xdp_full: 0
>  tx_xdp_err: 0
>  tx_xdp_cqes: 0
> ...
>  rx3_xdp_drop: 86468350180
>  rx3_xdp_redirect: 18860556
>  rx3_xdp_tx_xmit: 0
>  rx3_xdp_tx_full: 0
>  rx3_xdp_tx_err: 0
>  rx3_xdp_tx_cqes: 0
> ...
>  tx0_xdp_xmit: 0
>  tx0_xdp_full: 0
>  tx0_xdp_err: 0
>  tx0_xdp_cqes: 0
> ...
> 
> And no accounting in standard stats for packets handled in XDP.
> 
> And then if I understand Jesper's data correctly, the i40e driver does
> not have device specific data:
> 
> $ ethtool -S i40e1  | grep xdp
> [NOTHING]
> 
> 
> But rather bumps the standard counters:
> 
> sudo ./xdp_rxq_info --dev i40e1 --action XDP_DROP
> 
> Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats   CPU pps issue-pps
> XDP-RX CPU  1   36,156,872  0
> XDP-RX CPU  total   36,156,872
> 
> RXQ stats   RXQ:CPU pps issue-pps
> rx_queue_index1:1   36,156,878  0
> rx_queue_index1:sum 36,156,878
> 
> 
> $ ethtool_stats.pl --dev i40e1
> 
> Show adapter(s) (i40e1) statistics (ONLY that changed!)
> Ethtool(i40e1   ) stat:   2711292859 (  2,711,292,859) <= port.rx_bytes /sec
> Ethtool(i40e1   ) stat:  6274204 (  6,274,204) <= port.rx_dropped /sec
> Ethtool(i40e1   ) stat: 42363867 ( 42,363,867) <= port.rx_size_64 /sec
> Ethtool(i40e1   ) stat: 42363950 ( 42,363,950) <= port.rx_unicast /sec
> Ethtool(i40e1   ) stat:   2165051990 (  2,165,051,990) <= rx-1.bytes /sec
> Ethtool(i40e1   ) stat: 36084200 ( 36,084,200) <= rx-1.packets /sec
> Ethtool(i40e1   ) stat: 5385 (  5,385) <= rx_dropped /sec
> Ethtool(i40e1   ) stat: 36089727 ( 36,089,727) <= rx_unicast /sec
> 
> 
> We really need consistency in the counters and at a minimum, users
> should be able to track packet and byte counters for both Rx and Tx
> including XDP.
> 
> It seems to me the Rx and Tx packet, byte and dropped counters returned
> for the standard device stats (/proc/net/dev, ip -s li show, ...) should
> include all packets managed by the driver regardless of whether they are
> forwarded / dropped in XDP or go up the Linux stack. This also aligns

Agreed. When I introduced virtio_net XDP counters, I just forgot to
update tx packets/bytes counters on ndo_xdp_xmit. Probably I thought it
is handled by free_old_xmit_skbs.
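For illustration, a minimal sketch of what that missing accounting could
look like; the queue structure and helpers (example_txq, example_xmit_frame,
example_select_txq) are hypothetical, and this is not the actual virtio_net
change:

/* Sketch: bump standard tx packet/byte counters from .ndo_xdp_xmit so
 * XDP-transmitted frames show up in /proc/net/dev and "ip -s link".
 */
static int example_xdp_xmit(struct net_device *dev, int n,
			    struct xdp_frame **frames, u32 flags)
{
	struct example_txq *txq = example_select_txq(dev);
	unsigned int bytes = 0;
	int i, sent = 0;

	for (i = 0; i < n; i++) {
		if (example_xmit_frame(txq, frames[i]))
			break;		/* ring full */
		bytes += frames[i]->len;
		sent++;
	}

	u64_stats_update_begin(&txq->syncp);
	txq->tx_packets += sent;
	txq->tx_bytes += bytes;
	u64_stats_update_end(&txq->syncp);

	return sent;	/* frames actually queued */
}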

Toshiaki Makita

> with mlxsw and the stats it shows which are packets handled by the hardware.
> 
> From there the private stats can include XDP specifics as desired --
> like the drops and redirects but that those should be add-ons and even
> here some consistency makes life easier for users.
> 
> The same standards should be also be applied to virtual devices built on
> top of the ports -- e.g., vlans. I have an API now that allows bumping
> stats for vlan devices.
> 
> Keeping the basic xdp packets in the standard counters allows Paweł, for
> example, to continue to monitor /proc/net/dev.
> 
> Can we get agreement on this? And from there, get updates to the mlx5
> and virtio drivers?
> 
> David
> 
> 



Re: [PATCH net-next 1/3] veth: Account for packet drops in ndo_xdp_xmit

2018-10-13 Thread Toshiaki Makita

On 18/10/13 (Sat) 16:48, Jesper Dangaard Brouer wrote:

On Thu, 11 Oct 2018 18:36:48 +0900
Toshiaki Makita  wrote:


Use the existing atomic drop counter. Since the drop path is really an
exceptional case here, I'm thinking atomic ops would not hurt
performance.


Hmm... we try very hard not to add atomic ops to the XDP code path. The
XDP_DROP case is also considered hot-path.  In the code below, the
atomic64_add happens for a bulk of dropped packets (currently up to
16), so it might be okay.


Yes, this happens only once per bulk send.
Note that this drop does not include XDP_DROP. This drop is counted when
- the ndo_xdp_xmit "flags" arg is invalid
- the peer is detached
- XDP is not loaded on the peer
- the XDP ring (256 slots) overflows
So it is really exceptional. XDP_DROP is counted on a per-queue basis
(non-atomic) in patch 2/3.


Toshiaki Makita




XDP packets and bytes are not counted in ndo_xdp_xmit, but will be
accounted on the rx side by the following commit.

Signed-off-by: Toshiaki Makita 
---
  drivers/net/veth.c | 30 ++
  1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 224c56a..452193f2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,16 +308,20 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
  {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
+   int i, ret, drops = n;
unsigned int max_len;
struct veth_rq *rq;
-   int i, drops = 0;
  
-	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
-		return -EINVAL;
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK)) {
+   ret = -EINVAL;
+   goto drop;
+   }
  
  	rcv = rcu_dereference(priv->peer);

-   if (unlikely(!rcv))
-   return -ENXIO;
+   if (unlikely(!rcv)) {
+   ret = -ENXIO;
+   goto drop;
+   }
  
  	rcv_priv = netdev_priv(rcv);

rq = &rcv_priv->rq[veth_select_rxq(rcv)];
@@ -325,9 +329,12 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 * side. This means an XDP program is loaded on the peer and the peer
 * device is up.
 */
-   if (!rcu_access_pointer(rq->xdp_prog))
-   return -ENXIO;
+   if (!rcu_access_pointer(rq->xdp_prog)) {
+   ret = -ENXIO;
+   goto drop;
+   }
  
+	drops = 0;

max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
  
  	spin_lock(&rq->xdp_ring.producer_lock);

@@ -346,7 +353,14 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
if (flags & XDP_XMIT_FLUSH)
__veth_xdp_flush(rq);
  
-	return n - drops;
+   if (likely(!drops))
+   return n;
+
+   ret = n - drops;
+drop:
+   atomic64_add(drops, &priv->dropped);
+
+   return ret;
  }
  
  static void veth_xdp_flush(struct net_device *dev)


[PATCH net-next 3/3] veth: Add ethtool statistics support for XDP

2018-10-11 Thread Toshiaki Makita
Expose per-queue stats for ethtool -S.
As there are only rx queues, and rx queues are used only when XDP is
used, per-queue counters are only rx XDP ones.

Example:

$ ethtool -S veth0
NIC statistics:
 peer_ifindex: 11
 rx_queue_0_xdp_packets: 28601434
 rx_queue_0_xdp_bytes: 1716086040
 rx_queue_0_xdp_drops: 28601434
 rx_queue_1_xdp_packets: 17873050
 rx_queue_1_xdp_bytes: 1072383000
 rx_queue_1_xdp_drops: 17873050

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 48 ++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 68bb93d..890fa5b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -67,6 +67,21 @@ struct veth_priv {
  * ethtool interface
  */
 
+struct veth_q_stat_desc {
+   char	desc[ETH_GSTRING_LEN];
+   size_t  offset;
+};
+
+#define VETH_RQ_STAT(m)	offsetof(struct veth_rq_stats, m)
+
+static const struct veth_q_stat_desc veth_rq_stats_desc[] = {
+   { "xdp_packets",VETH_RQ_STAT(xdp_packets) },
+   { "xdp_bytes",  VETH_RQ_STAT(xdp_bytes) },
+   { "xdp_drops",  VETH_RQ_STAT(xdp_drops) },
+};
+
+#define VETH_RQ_STATS_LEN  ARRAY_SIZE(veth_rq_stats_desc)
+
 static struct {
const char string[ETH_GSTRING_LEN];
 } ethtool_stats_keys[] = {
@@ -91,9 +106,20 @@ static void veth_get_drvinfo(struct net_device *dev, struct 
ethtool_drvinfo *inf
 
 static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
 {
+   char *p = (char *)buf;
+   int i, j;
+
switch(stringset) {
case ETH_SS_STATS:
-   memcpy(buf, &ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   memcpy(p, &ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   p += sizeof(ethtool_stats_keys);
+   for (i = 0; i < dev->real_num_rx_queues; i++) {
+   for (j = 0; j < VETH_RQ_STATS_LEN; j++) {
+   snprintf(p, ETH_GSTRING_LEN, "rx_queue_%u_%s",
+i, veth_rq_stats_desc[j].desc);
+   p += ETH_GSTRING_LEN;
+   }
+   }
break;
}
 }
@@ -102,7 +128,8 @@ static int veth_get_sset_count(struct net_device *dev, int 
sset)
 {
switch (sset) {
case ETH_SS_STATS:
-   return ARRAY_SIZE(ethtool_stats_keys);
+   return ARRAY_SIZE(ethtool_stats_keys) +
+  VETH_RQ_STATS_LEN * dev->real_num_rx_queues;
default:
return -EOPNOTSUPP;
}
@@ -113,8 +140,25 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 {
struct veth_priv *priv = netdev_priv(dev);
struct net_device *peer = rtnl_dereference(priv->peer);
+   int i, j, idx;
 
data[0] = peer ? peer->ifindex : 0;
+   idx = 1;
+   for (i = 0; i < dev->real_num_rx_queues; i++) {
+   const struct veth_rq_stats *rq_stats = &priv->rq[i].stats;
+   const void *stats_base = (void *)rq_stats;
+   unsigned int start;
+   size_t offset;
+
+   do {
+   start = u64_stats_fetch_begin_irq(&rq_stats->syncp);
+   for (j = 0; j < VETH_RQ_STATS_LEN; j++) {
+   offset = veth_rq_stats_desc[j].offset;
+   data[idx + j] = *(u64 *)(stats_base + offset);
+   }
+   } while (u64_stats_fetch_retry_irq(&rq_stats->syncp, start));
+   idx += VETH_RQ_STATS_LEN;
+   }
 }
 
 static int veth_get_ts_info(struct net_device *dev,
-- 
1.8.3.1




[PATCH net-next 2/3] veth: Account for XDP packet statistics on rx side

2018-10-11 Thread Toshiaki Makita
On the XDP path veth has a NAPI handler, so we can collect statistics
on a per-queue basis for XDP.

By this change we can now collect the XDP_DROP drop count as well as
packets and bytes coming through ndo_xdp_xmit. Packet counters shown by
"ip -s link", sysfs stats or /proc/net/dev are now correct for XDP.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 97 --
 1 file changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 452193f2..68bb93d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -37,11 +37,19 @@
 #define VETH_XDP_TX	BIT(0)
 #define VETH_XDP_REDIR BIT(1)
 
+struct veth_rq_stats {
+   u64 xdp_packets;
+   u64 xdp_bytes;
+   u64 xdp_drops;
+   struct u64_stats_sync   syncp;
+};
+
 struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
struct xdp_mem_info xdp_mem;
+   struct veth_rq_stats	stats;
	bool			rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
@@ -211,12 +219,14 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
skb_tx_timestamp(skb);
if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
-   struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats);
+   if (!rcv_xdp) {
+   struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats);
 
-   u64_stats_update_begin(&stats->syncp);
-   stats->bytes += length;
-   stats->packets++;
-   u64_stats_update_end(&stats->syncp);
+   u64_stats_update_begin(&stats->syncp);
+   stats->bytes += length;
+   stats->packets++;
+   u64_stats_update_end(&stats->syncp);
+   }
} else {
 drop:
atomic64_inc(&priv->dropped);
@@ -230,7 +240,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NETDEV_TX_OK;
 }
 
-static u64 veth_stats_one(struct pcpu_lstats *result, struct net_device *dev)
+static u64 veth_stats_tx(struct pcpu_lstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
int cpu;
@@ -253,23 +263,58 @@ static u64 veth_stats_one(struct pcpu_lstats *result, 
struct net_device *dev)
return atomic64_read(&priv->dropped);
 }
 
+static void veth_stats_rx(struct veth_rq_stats *result, struct net_device *dev)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   int i;
+
+   result->xdp_packets = 0;
+   result->xdp_bytes = 0;
+   result->xdp_drops = 0;
+   for (i = 0; i < dev->num_rx_queues; i++) {
+   struct veth_rq_stats *stats = &priv->rq[i].stats;
+   u64 packets, bytes, drops;
+   unsigned int start;
+
+   do {
+   start = u64_stats_fetch_begin_irq(&stats->syncp);
+   packets = stats->xdp_packets;
+   bytes = stats->xdp_bytes;
+   drops = stats->xdp_drops;
+   } while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+   result->xdp_packets += packets;
+   result->xdp_bytes += bytes;
+   result->xdp_drops += drops;
+   }
+}
+
 static void veth_get_stats64(struct net_device *dev,
 struct rtnl_link_stats64 *tot)
 {
struct veth_priv *priv = netdev_priv(dev);
struct net_device *peer;
-   struct pcpu_lstats one;
+   struct veth_rq_stats rx;
+   struct pcpu_lstats tx;
+
+   tot->tx_dropped = veth_stats_tx(&tx, dev);
+   tot->tx_bytes = tx.bytes;
+   tot->tx_packets = tx.packets;
 
-   tot->tx_dropped = veth_stats_one(&one, dev);
-   tot->tx_bytes = one.bytes;
-   tot->tx_packets = one.packets;
+   veth_stats_rx(&rx, dev);
+   tot->rx_dropped = rx.xdp_drops;
+   tot->rx_bytes = rx.xdp_bytes;
+   tot->rx_packets = rx.xdp_packets;
 
rcu_read_lock();
peer = rcu_dereference(priv->peer);
if (peer) {
-   tot->rx_dropped = veth_stats_one(&one, peer);
-   tot->rx_bytes = one.bytes;
-   tot->rx_packets = one.packets;
+   tot->rx_dropped += veth_stats_tx(&tx, peer);
+   tot->rx_bytes += tx.bytes;
+   tot->rx_packets += tx.packets;
+
+   veth_stats_rx(&rx, peer);
+   tot->tx_bytes += rx.xdp_bytes;
+   tot->tx_packets += rx.xdp_packets;
}
rcu_read_unlock();
 }
@@ -609,28 +654,42 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq 
*rq, struct sk

[PATCH net-next 1/3] veth: Account for packet drops in ndo_xdp_xmit

2018-10-11 Thread Toshiaki Makita
Use the existing atomic drop counter. Since the drop path is really an
exceptional case here, I'm thinking atomic ops would not hurt
performance.
XDP packets and bytes are not counted in ndo_xdp_xmit, but will be
accounted on the rx side by the following commit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 224c56a..452193f2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,16 +308,20 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
+   int i, ret, drops = n;
unsigned int max_len;
struct veth_rq *rq;
-   int i, drops = 0;
 
-   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
-   return -EINVAL;
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK)) {
+   ret = -EINVAL;
+   goto drop;
+   }
 
rcv = rcu_dereference(priv->peer);
-   if (unlikely(!rcv))
-   return -ENXIO;
+   if (unlikely(!rcv)) {
+   ret = -ENXIO;
+   goto drop;
+   }
 
rcv_priv = netdev_priv(rcv);
rq = &rcv_priv->rq[veth_select_rxq(rcv)];
@@ -325,9 +329,12 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 * side. This means an XDP program is loaded on the peer and the peer
 * device is up.
 */
-   if (!rcu_access_pointer(rq->xdp_prog))
-   return -ENXIO;
+   if (!rcu_access_pointer(rq->xdp_prog)) {
+   ret = -ENXIO;
+   goto drop;
+   }
 
+   drops = 0;
max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
 
spin_lock(&rq->xdp_ring.producer_lock);
@@ -346,7 +353,14 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
if (flags & XDP_XMIT_FLUSH)
__veth_xdp_flush(rq);
 
-   return n - drops;
+   if (likely(!drops))
+   return n;
+
+   ret = n - drops;
+drop:
+   atomic64_add(drops, &priv->dropped);
+
+   return ret;
 }
 
 static void veth_xdp_flush(struct net_device *dev)
-- 
1.8.3.1




[PATCH net-next 0/3] veth: XDP stats improvement

2018-10-11 Thread Toshiaki Makita
ndo_xdp_xmit in veth did not update packet counters, as described in [1].
Also, the current implementation only updates counters on the tx side, so
rx side events like XDP_DROP were not collected.
This series implements the missing accounting as well as support for
ethtool per-queue stats in veth.

Patch 1: Update drop counter in ndo_xdp_xmit.
Patch 2: Update packet and byte counters for all XDP path, and drop
 counter on XDP_DROP.
Patch 3: Support per-queue ethtool stats for XDP counters.

Note that counters are maintained on a per-queue basis for XDP but not
otherwise (per-cpu and atomic as before). This is because 1) the tx path
in veth is essentially lockless so we cannot update per-queue stats on tx,
and 2) the rx path is a net core routine (process_backlog) which cannot
update per-queue stats when XDP is disabled. On the other hand there are
real rxqs and NAPI handlers for veth XDP, so we update per-queue stats on
rx for XDP packets, and use them to calculate the tx counters as well,
contrary to the existing non-XDP counters.

[1] https://patchwork.ozlabs.org/cover/953071/#1967449
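
The writer side of these per-queue counters would look roughly like the
sketch below; the helper name veth_rq_stats_add is hypothetical, but the
veth_rq/veth_rq_stats fields match patch 2/3:

/* Sketch: writer side of the per-queue XDP counters. Runs only in the
 * rxq's NAPI handler, so plain u64 updates under u64_stats_sync are
 * enough; readers use the fetch_begin/retry loop (see patch 3/3).
 */
static void veth_rq_stats_add(struct veth_rq *rq, u64 packets, u64 bytes,
			      u64 drops)
{
	struct veth_rq_stats *stats = &rq->stats;

	u64_stats_update_begin(&stats->syncp);
	stats->xdp_packets += packets;
	stats->xdp_bytes += bytes;
	stats->xdp_drops += drops;
	u64_stats_update_end(&stats->syncp);
}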

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (3):
  veth: Account for packet drops in ndo_xdp_xmit
  veth: Account for XDP packet statistics on rx side
  veth: Add ethtool statistics support for XDP

 drivers/net/veth.c | 175 -
 1 file changed, 147 insertions(+), 28 deletions(-)

-- 
1.8.3.1




[PATCH net] veth: Orphan skb before GRO

2018-09-13 Thread Toshiaki Makita
GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
passed socket-owned skbs to GRO. This corrupted sk_wmem_alloc.

Paolo Abeni reported the following splat:

[  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in 
iperf3[1644], uid/euid: 0/0
[  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 
refcount_error_report+0xa0/0xa4
[  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore 
intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf 
mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter 
pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm 
pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ 
#2025
[  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 
06/16/2016
[  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
[  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 
89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 
88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
[  362.219711] RSP: 0018:9ee6ff603c20 EFLAGS: 00010282
[  362.225538] RAX:  RBX: 9de83e10 RCX: 
[  362.233497] RDX: 0001 RSI: 9ee6ff6167d8 RDI: 9ee6ff6167d8
[  362.241457] RBP: 9ee6ff603d78 R08: 0490 R09: 0004
[  362.249416] R10:  R11: 9ee6ff603990 R12: 9ee664b94500
[  362.257377] R13:  R14: 0004 R15: 9de615f9
[  362.265337] FS:  7f1d22d28740() GS:9ee6ff60() 
knlGS:
[  362.274363] CS:  0010 DS:  ES:  CR0: 80050033
[  362.280773] CR2: 7f1d222f35d0 CR3: 001fddfec003 CR4: 001606f0
[  362.288733] Call Trace:
[  362.291459]  <IRQ>
[  362.293702]  ex_handler_refcount+0x4e/0x80
[  362.298269]  fixup_exception+0x35/0x40
[  362.302451]  do_trap+0x109/0x150
[  362.306048]  do_error_trap+0xd5/0x130
[  362.315766]  invalid_op+0x14/0x20
[  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
[  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 
85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 
80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
[  362.345465] RSP: 0018:9ee6ff603e20 EFLAGS: 00010a86
[  362.351291] RAX: 1100 RBX: 9ee65deec700 RCX: 9ee65e829244
[  362.359250] RDX: 0100 RSI: 9ee65e829100 RDI: 9ee65deec700
[  362.367210] RBP: 9ee65e829100 R08: 0002a380 R09: 
[  362.375169] R10: 0002 R11: f1a4bf77bb00 R12: c0754661d000
[  362.383130] R13: 9ee65deec200 R14: 9ee65f597000 R15: 00aa
[  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
[  362.399357]  veth_poll+0x4d/0x17a [veth]
[  362.403731]  net_rx_action+0x2af/0x3f0
[  362.407912]  __do_softirq+0xdd/0x29e
[  362.411897]  do_softirq_own_stack+0x2a/0x40
[  362.416561]  </IRQ>
[  362.418899]  do_softirq+0x4b/0x70
[  362.422594]  __local_bh_enable_ip+0x50/0x60
[  362.427258]  ip_finish_output2+0x16a/0x390
[  362.431824]  ip_output+0x71/0xe0
[  362.440670]  __tcp_transmit_skb+0x583/0xab0
[  362.445333]  tcp_write_xmit+0x247/0xfb0
[  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
[  362.454760]  tcp_sendmsg_locked+0x857/0xd30
[  362.459424]  tcp_sendmsg+0x27/0x40
[  362.463216]  sock_sendmsg+0x36/0x50
[  362.467104]  sock_write_iter+0x87/0x100
[  362.471382]  __vfs_write+0x112/0x1a0
[  362.475369]  vfs_write+0xad/0x1a0
[  362.479062]  ksys_write+0x52/0xc0
[  362.482759]  do_syscall_64+0x5b/0x180
[  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  362.492473] RIP: 0033:0x7f1d22293238
[  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 
0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  362.517409] RSP: 002b:7ffebaef8008 EFLAGS: 0246 ORIG_RAX: 
0001
[  362.525855] RAX: ffda RBX: 2800 RCX: 7f1d22293238
[  362.533816] RDX: 2800 RSI: 7f1d22d36000 RDI: 0005
[  362.541775] RBP: 7f1d22d36000 R08: 0002db777a30 R09: 562b70712b20
[  362.549734] R10:  R11: 0246 R12: 0005
[  362.557693] R13: 2800 R14: 7ffebaef8060 R15: 562b70712260

In order to avoid this, orphan the skb before entering GRO.

Fixes: 948d4f214fde ("veth: Add driver XDP")
Reported-by: Paolo Abeni 
Signed-off-by: Toshiaki Makita 
---
 drivers

Re: unexpected GRO/veth behavior

2018-09-13 Thread Toshiaki Makita
On 2018/09/11 20:07, Toshiaki Makita wrote:
> On 2018/09/11 19:27, Eric Dumazet wrote:
> ...
>> Fix would probably be :
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 
>> 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5
>>  100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, 
>> unsigned int *xdp_xmit)
>> skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
>> }
>>  
>> -   if (skb)
>> +   if (skb) {
>> +   skb_orphan(skb);
>> napi_gro_receive(&rq->xdp_napi, skb);
>> -
>> +   }
>> done++;
>> }
> 
> Considering commit 9c4c3252 ("skbuff: preserve sock reference when
> scrubbing the skb.") I'm not sure if we should unconditionally orphan
> the skb here.
> I was thinking I should call netif_receive_skb() for such packets
> instead of napi_gro_receive().

I tested TCP throughput within localhost with XDP enabled (with
skb_orphan() fix).

GRO off: 4.7 Gbps
GRO on : 6.7 Gbps

Since there is a not-so-small difference, I'm making a patch which
orphans the skb as Eric suggested (but in veth_xdp_rcv_skb() instead).

Thanks!

-- 
Toshiaki Makita



Re: unexpected GRO/veth behavior

2018-09-11 Thread Toshiaki Makita
On 2018/09/11 19:27, Eric Dumazet wrote:
...
> Fix would probably be :
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 
> 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5
>  100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, 
> unsigned int *xdp_xmit)
> skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
> }
>  
> -   if (skb)
> +   if (skb) {
> +   skb_orphan(skb);
> napi_gro_receive(&rq->xdp_napi, skb);
> -
> +   }
> done++;
> }

Considering commit 9c4c3252 ("skbuff: preserve sock reference when
scrubbing the skb.") I'm not sure if we should unconditionally orphan
the skb here.
I was thinking I should call netif_receive_skb() for such packets
instead of napi_gro_receive().

-- 
Toshiaki Makita



Re: [net-next, PATCH 2/2, v1] net: socionext: add AF_XDP support

2018-09-10 Thread Toshiaki Makita
On 2018/09/11 1:21, Ilias Apalodimas wrote:
>>> @@ -707,6 +731,26 @@ static int netsec_process_rx(struct netsec_priv *priv, 
>>> int budget)
>>> if (unlikely(!buf_addr))
>>> break;
>>>  
>>> +   if (xdp_prog) {
>>> +   xdp_result = netsec_run_xdp(desc, priv, xdp_prog,
>>> +   pkt_len);
>>> +   if (xdp_result != NETSEC_XDP_PASS) {
>>> +   xdp_flush |= xdp_result & NETSEC_XDP_REDIR;
>>> +
>>> +   dma_unmap_single_attrs(priv->dev,
>>> +  desc->dma_addr,
>>> +  desc->len, DMA_TO_DEVICE,
>>> +  DMA_ATTR_SKIP_CPU_SYNC);
>>> +
>>> +   desc->len = desc_len;
>>> +   desc->dma_addr = dma_handle;
>>> +   desc->addr = buf_addr;
>>> +   netsec_rx_fill(priv, idx, 1);
>>> +   nsetsec_adv_desc(&dring->tail);
>>> +   }
>>> +   continue;
>>
>> Continue even on XDP_PASS? Is this really correct?
>>
>> Also seems there is no handling of adjust_head/tail for XDP_PASS case.
>>
> A question on this. Should XDP related frames be allocated using 1 page
> per packet?

AFAIK there is no such constraint, e.g. i40e allocates 1 page per 2 packets.

-- 
Toshiaki Makita



Re: unexpected GRO/veth behavior

2018-09-10 Thread Toshiaki Makita
On 2018/09/10 23:56, Eric Dumazet wrote:
> On 09/10/2018 07:44 AM, Paolo Abeni wrote:
>> hi all,
>>
>> while testing some local patches I observed that the TCP tput in the
>> following scenario:
>>
>> # the following enable napi on veth0, so that we can trigger the
>> # GRO path with namespaces
>> ip netns add test
>> ip link add type veth
>> ip link set dev veth0 netns test
>> ip -n test link set lo up
>> ip -n test link set veth0 up
>> ip -n test addr add dev veth0 172.16.1.2/24
>> ip link set dev veth1 up
>> ip addr add dev veth1 172.16.1.1/24
>> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
>>
>> # 'xdp_pass' is a NO-OP XDP program that simply return XDP_PASS
>> ip netns exec test ./xdp_pass $IDX &
>> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
>> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
>>
>> is quite lower than expected (~800Mbps). 'perf' shows a weird topmost 
>> offender:
>>
> 
> 
> But... why GRO would even be needed in this scenario ?
> 
> GRO is really meant for physical devices, having to mess with skb->sk adds 
> extra cost
> in this already heavy cost engine.
> 
> Virtual devices should already be fed with TSO packets.

Because XDP does not have SG feature (GRO path in veth is used only when
XDP is enabled).

I have tested configuration like this:

NIC ---(XDP_REDIRECT)---> veth===veth (XDP_PASS)

GRO seems to work and improves TCP throughput in this case.


Now I noticed I did not test:

netperf -> veth===veth (XDP_PASS) -> netserver

which I think is the case where Paolo faces a problem.

I think this is not a case where XDP can improve performance. I think I
can disable GRO for packets with skb->sk != NULL in veth.
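
A minimal sketch of that idea, as it could look in veth_xdp_rcv()'s
completion path (an illustration of the suggestion, not a committed patch):

	/* Sketch: socket-owned skbs (local traffic) bypass GRO so their
	 * sk_wmem_alloc accounting stays intact; sk-less forwarded
	 * traffic still gets GRO.
	 */
	if (skb) {
		if (skb->sk)
			netif_receive_skb(skb);
		else
			napi_gro_receive(&rq->xdp_napi, skb);
	}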

-- 
Toshiaki Makita



Re: [net-next, PATCH 2/2, v1] net: socionext: add AF_XDP support

2018-09-10 Thread Toshiaki Makita
On 2018/09/10 17:24, Ilias Apalodimas wrote:
> Add basic AF_XDP support without zero-copy
> 
> Signed-off-by: Ilias Apalodimas 
> ---
...
> @@ -707,6 +731,26 @@ static int netsec_process_rx(struct netsec_priv *priv, 
> int budget)
>   if (unlikely(!buf_addr))
>   break;
>  
> + if (xdp_prog) {
> + xdp_result = netsec_run_xdp(desc, priv, xdp_prog,
> + pkt_len);
> + if (xdp_result != NETSEC_XDP_PASS) {
> + xdp_flush |= xdp_result & NETSEC_XDP_REDIR;
> +
> + dma_unmap_single_attrs(priv->dev,
> +desc->dma_addr,
> +desc->len, DMA_TO_DEVICE,
> +DMA_ATTR_SKIP_CPU_SYNC);
> +
> + desc->len = desc_len;
> + desc->dma_addr = dma_handle;
> + desc->addr = buf_addr;
> + netsec_rx_fill(priv, idx, 1);
> + nsetsec_adv_desc(&dring->tail);
> + }
> + continue;

Continue even on XDP_PASS? Is this really correct?

Also it seems there is no handling of adjust_head/tail for the XDP_PASS case.
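
As an illustration only, XDP_PASS handling that honors the adjusted
pointers could look like the sketch below; having "xdp" visible at this
point and "buf_size" (the full buffer size passed to build_skb) are
assumptions, not part of the posted patch:

	/* Sketch: rebuild the skb from the (possibly adjusted) xdp_buff so
	 * bpf_xdp_adjust_head()/bpf_xdp_adjust_tail() take effect.
	 */
	pkt_len = xdp.data_end - xdp.data;	/* tail may have moved */
	skb = build_skb(xdp.data_hard_start, buf_size);
	if (likely(skb)) {
		skb_reserve(skb, xdp.data - xdp.data_hard_start);
		skb_put(skb, pkt_len);
	}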

> + }
> +
>   skb = build_skb(desc->addr, desc->len);
>   if (unlikely(!skb)) {
>   dma_unmap_single(priv->dev, dma_handle, desc_len,
> @@ -740,6 +784,9 @@ static int netsec_process_rx(struct netsec_priv *priv, 
> int budget)
>   nsetsec_adv_desc(&dring->tail);
>   }
>  
> + if (xdp_flush & NETSEC_XDP_REDIR)
> + xdp_do_flush_map();
> +
>   return done;
>  }
...
> +static u32 netsec_run_xdp(struct netsec_desc *desc, struct netsec_priv *priv,
> +   struct bpf_prog *prog, u16 len)
> +
> +{
> + struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
> + struct xdp_buff xdp;
> + u32 ret = NETSEC_XDP_PASS;
> + int err;
> + u32 act;
> +
> + xdp.data_hard_start = desc->addr;
> + xdp.data = desc->addr;

There is no headroom. REDIRECT using devmap/cpumap will fail due to
this. Generally we need XDP_PACKET_HEADROOM.
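
For illustration, a sketch of an xdp_buff setup with that headroom; it
assumes the rx buffers are allocated and DMA-mapped with
XDP_PACKET_HEADROOM reserved in front:

	/* Sketch: reserve headroom so bpf_xdp_adjust_head() and redirect
	 * targets (devmap/cpumap may prepend headers) have room.
	 */
	xdp.data_hard_start = desc->addr;
	xdp.data = desc->addr + XDP_PACKET_HEADROOM;
	xdp.data_end = xdp.data + pkt_len;
	xdp_set_data_meta_invalid(&xdp);
	xdp.rxq = &dring->xdp_rxq;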

> + xdp_set_data_meta_invalid(&xdp);
> + xdp.data_end = xdp.data + len;
> + xdp.rxq = >xdp_rxq;
> +
> + rcu_read_lock();
> + act = bpf_prog_run_xdp(prog, &xdp);
> +
> + switch (act) {
> + case XDP_PASS:
> + ret = NETSEC_XDP_PASS;
> + break;
> + case XDP_TX:
> + ret = netsec_xmit_xdp(priv, &xdp, desc);
> + break;
> + case XDP_REDIRECT:
> + err = xdp_do_redirect(priv->ndev, &xdp, prog);
> + if (!err) {
> + ret = NETSEC_XDP_REDIR;
> + } else {
> + ret = NETSEC_XDP_CONSUMED;
> + xdp_return_buff(&xdp);
> + }
> + break;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + /* fall through */
> + case XDP_ABORTED:
> + trace_xdp_exception(priv->ndev, prog, act);
> + /* fall through -- handle aborts by dropping packet */
> + case XDP_DROP:
> + ret = NETSEC_XDP_CONSUMED;
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog *prog)
> +{
> + struct net_device *dev = priv->ndev;
> + struct bpf_prog *old_prog;
> +
> + /* For now just support only the usual MTU sized frames */
> + if (prog && dev->mtu > 1500) {
> + netdev_warn(dev, "Jumbo frames not yet supported with XDP\n");

Why not use extack?

> + return -EOPNOTSUPP;
> + }
> +

-- 
Toshiaki Makita



[PATCH v3 net-next] veth: Free queues on link delete

2018-08-15 Thread Toshiaki Makita
David Ahern reported a memory leak in veth.

===
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0x8800354d5c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<(ptrval)>] kmemleak_alloc+0x70/0x94
[<(ptrval)>] slab_post_alloc_hook+0x42/0x52
[<(ptrval)>] __kmalloc+0x101/0x142
[<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
[<(ptrval)>] veth_newlink+0x147/0x3ac [veth]
...
unreferenced object 0x88002e009c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<(ptrval)>] kmemleak_alloc+0x70/0x94
[<(ptrval)>] slab_post_alloc_hook+0x42/0x52
[<(ptrval)>] __kmalloc+0x101/0x142
[<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
[<(ptrval)>] veth_newlink+0x219/0x3ac [veth]
===

veth_rq allocated in veth_newlink() was not freed on dellink.

We need to free them after veth_close() so that no packets will
reference the queues afterwards. Thus free them in veth_dev_free() in
the same way as the stats structure (vstats) is freed.

Also move queues allocation to veth_dev_init() to be in line with stats
allocation.

Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
Reported-by: David Ahern 
Signed-off-by: Toshiaki Makita 
---
This is a fix for a bug which exists only in net-next.
Let me know if I should wait for net-next to be merged into net or for net-next to reopen.

 drivers/net/veth.c | 70 +-
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e3202af..8d679c8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -789,16 +789,48 @@ static int is_valid_veth_mtu(int mtu)
return mtu >= ETH_MIN_MTU && mtu <= ETH_MAX_MTU;
 }
 
+static int veth_alloc_queues(struct net_device *dev)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   int i;
+
+   priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+   if (!priv->rq)
+   return -ENOMEM;
+
+   for (i = 0; i < dev->num_rx_queues; i++)
+   priv->rq[i].dev = dev;
+
+   return 0;
+}
+
+static void veth_free_queues(struct net_device *dev)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+
+   kfree(priv->rq);
+}
+
 static int veth_dev_init(struct net_device *dev)
 {
+   int err;
+
dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
if (!dev->vstats)
return -ENOMEM;
+
+   err = veth_alloc_queues(dev);
+   if (err) {
+   free_percpu(dev->vstats);
+   return err;
+   }
+
return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+   veth_free_queues(dev);
free_percpu(dev->vstats);
 }
 
@@ -1040,31 +1072,13 @@ static int veth_validate(struct nlattr *tb[], struct 
nlattr *data[],
return 0;
 }
 
-static int veth_alloc_queues(struct net_device *dev)
-{
-   struct veth_priv *priv = netdev_priv(dev);
-
-   priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
-   if (!priv->rq)
-   return -ENOMEM;
-
-   return 0;
-}
-
-static void veth_free_queues(struct net_device *dev)
-{
-   struct veth_priv *priv = netdev_priv(dev);
-
-   kfree(priv->rq);
-}
-
 static struct rtnl_link_ops veth_link_ops;
 
 static int veth_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
 {
-   int err, i;
+   int err;
struct net_device *peer;
struct veth_priv *priv;
char ifname[IFNAMSIZ];
@@ -1117,12 +1131,6 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
return PTR_ERR(peer);
}
 
-   err = veth_alloc_queues(peer);
-   if (err) {
-   put_net(net);
-   goto err_peer_alloc_queues;
-   }
-
if (!ifmp || !tbp[IFLA_ADDRESS])
eth_hw_addr_random(peer);
 
@@ -1151,10 +1159,6 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
 * should be re-allocated
 */
 
-   err = veth_alloc_queues(dev);
-   if (err)
-   goto err_

Re: [PATCH v2 net-next] veth: Free queues on link delete

2018-08-14 Thread Toshiaki Makita
On 2018/08/15 10:29, David Ahern wrote:
> On 8/14/18 7:16 PM, Toshiaki Makita wrote:
>> Hmm, on second thought these queues need to be freed after veth_close()
>> to make sure no packet will reference them. That means we need to free
>> them in .ndo_uninit() or destructor.
>> (rtnl_delete_link() calls dellink() before unregister_netdevice_many()
>> which calls dev_close_many() through rollback_registered_many())
>>
>> Currently veth has destructor veth_dev_free() for vstats, so we can free
>> queues in the function.
>> To be in line with vstats, allocation also should be moved to
>> veth_dev_init().
> 
> given that, can you take care of the free in the proper location?

Sure, will cook a patch.
Thanks!

-- 
Toshiaki Makita



Re: [PATCH v2 net-next] veth: Free queues on link delete

2018-08-14 Thread Toshiaki Makita
On 2018/08/15 10:04, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> kmemleak reported new suspected memory leaks.
> $ cat /sys/kernel/debug/kmemleak
> unreferenced object 0x8800354d5c00 (size 1024):
>   comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<(ptrval)>] kmemleak_alloc+0x70/0x94
> [<(ptrval)>] slab_post_alloc_hook+0x42/0x52
> [<(ptrval)>] __kmalloc+0x101/0x142
> [<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
> [<(ptrval)>] veth_newlink+0x147/0x3ac [veth]
> ...
> unreferenced object 0x88002e009c00 (size 1024):
>   comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<(ptrval)>] kmemleak_alloc+0x70/0x94
> [<(ptrval)>] slab_post_alloc_hook+0x42/0x52
> [<(ptrval)>] __kmalloc+0x101/0x142
> [<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
> [<(ptrval)>] veth_newlink+0x219/0x3ac [veth]
> 
> The allocations in question are veth_alloc_queues for the dev and its peer.
> 
> Free the queues on a delete.
> 
> Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
> Signed-off-by: David Ahern 
> ---
> v2
> - free peer dev queues as well
> 
>  drivers/net/veth.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index e3202af72df5..2a3ce60631ef 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1205,6 +1205,7 @@ static void veth_dellink(struct net_device *dev, struct 
> list_head *head)
>   struct veth_priv *priv;
>   struct net_device *peer;
>  
> + veth_free_queues(dev);
>   priv = netdev_priv(dev);
>   peer = rtnl_dereference(priv->peer);
>  
> @@ -1216,6 +1217,7 @@ static void veth_dellink(struct net_device *dev, struct 
> list_head *head)
>   unregister_netdevice_queue(dev, head);
>  
>   if (peer) {
> + veth_free_queues(peer);
>   priv = netdev_priv(peer);
>   RCU_INIT_POINTER(priv->peer, NULL);
>   unregister_netdevice_queue(peer, head);

Hmm, on second thought these queues need to be freed after veth_close()
to make sure no packet will reference them. That means we need to free
them in .ndo_uninit() or destructor.
(rtnl_delete_link() calls dellink() before unregister_netdevice_many()
which calls dev_close_many() through rollback_registered_many())

Currently veth has destructor veth_dev_free() for vstats, so we can free
queues in the function.
To be in line with vstats, allocation also should be moved to
veth_dev_init().

-- 
Toshiaki Makita



Re: [PATCH net] veth: Free queues on link delete

2018-08-14 Thread Toshiaki Makita
On 2018/08/15 7:36, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> kmemleak reported new suspected memory leaks.
> $ cat /sys/kernel/debug/kmemleak
> unreferenced object 0x880130b6ec00 (size 1024):
>   comm "ip", pid 916, jiffies 4296194668 (age 7251.672s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<1ed37cc9>] kmemleak_alloc+0x70/0x94
> [<646dfdeb>] slab_post_alloc_hook+0x42/0x52
> [<04aba61b>] __kmalloc+0x101/0x142
> [<54d50e21>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
> [<8238855a>] veth_newlink+0x147/0x3ac [veth]
> ...
> 
> The allocation in question is veth_alloc_queues.
> 
> Free the queues on a delete.

Oops, thanks for catching this.

> Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
> Signed-off-by: David Ahern 
> ---
>  drivers/net/veth.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index e3202af72df5..bef7d212f04e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1205,6 +1205,7 @@ static void veth_dellink(struct net_device *dev, struct 
> list_head *head)
>   struct veth_priv *priv;
>   struct net_device *peer;
>  
> + veth_free_queues(dev);
>   priv = netdev_priv(dev);
>   peer = rtnl_dereference(priv->peer);

We need to free the peer's queues as well.
Also, isn't this for net-next, though it is now closed?

-- 
Toshiaki Makita



Re: [PATCH v8 bpf-next 02/10] veth: Add driver XDP

2018-08-07 Thread Toshiaki Makita
Hi Daniel,

Thank you for taking a look!

On 2018/08/07 23:26, Daniel Borkmann wrote:
> On 08/03/2018 09:58 AM, Toshiaki Makita wrote:
> [...]
>> +
>> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
>> +struct sk_buff *skb)
>> +{
>> +u32 pktlen, headroom, act, metalen;
>> +void *orig_data, *orig_data_end;
>> +struct bpf_prog *xdp_prog;
>> +int mac_len, delta, off;
>> +struct xdp_buff xdp;
>> +
>> +rcu_read_lock();
>> +xdp_prog = rcu_dereference(priv->xdp_prog);
>> +if (unlikely(!xdp_prog)) {
>> +rcu_read_unlock();
>> +goto out;
>> +}
>> +
>> +mac_len = skb->data - skb_mac_header(skb);
>> +pktlen = skb->len + mac_len;
>> +headroom = skb_headroom(skb) - mac_len;
>> +
>> +if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> +skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
> 
> Hmm, I think this is not fully correct. What happens if you have cloned
> skbs as e.g. the case with TCP? This would also need a full expensive
> unclone to make the data private as expected by XDP (this is basically
> a similar issue in generic XDP).

Well, cloned is checked in skb_head_is_locked() so TCP packets are
always uncloned here.
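
For reference, skb_head_is_locked() in include/linux/skbuff.h is:

static inline bool skb_head_is_locked(const struct sk_buff *skb)
{
	/* true if the head is kmalloced or the skb is cloned */
	return !skb->head_frag || skb_cloned(skb);
}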

> It may potentially be worth to also share the code here with generic XDP
> implementation given it's quite similar?

For now I'm not sharing the code because of two reasons.

One is that as you say generic XDP skips cloned packets. I traced the
reason and it seems it is to skip packets redirected by act_mirred.
https://patchwork.ozlabs.org/patch/750127/
The assumption that no one provides cloned skbs other than mirred breaks
when generic XDP added support for virtual devices, but it is still
valid that we should skip packets redirected by mirred if we want to be
in line with driver XDP. So I'm thinking generic XDP needs something
more than just uncloning cloned skbs.

The other reason is performance for REDIRECT. We can make use of bulk
redirection in driver XDP, but it requires xdp_frames, which require a
non-kmalloced skb head. This is different from generic XDP, which allows
a kmalloced skb head and uses kmalloc if head reallocation is needed.

-- 
Toshiaki Makita



Re: [PATCH v8 bpf-next 00/10] veth: Driver XDP

2018-08-03 Thread Toshiaki Makita

On 18/08/03 (Fri) 18:45, Jesper Dangaard Brouer wrote:

On Fri,  3 Aug 2018 16:58:08 +0900
Toshiaki Makita  wrote:


This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

   NIC ---> veth===veth
  (XDP) (redirect)(XDP)



I was playing with V7 on my testlab yesterday and I noticed one
fundamental issue.  You are not updating the "ifconfig" stats counters
when in XDP mode.  This makes receive or send via XDP invisible to
sysadmin/management tools.  This for sure is going to cause confusion...


Yes, I did not update stats on ndo_xdp_xmit. My intention was to make
another patch set to clean up the stats after this, but I did not state
that in the cover letter. Sorry about that.



I took a closer look at other drivers. The ixgbe driver is doing the
right thing.  Driver i40e has a bug where RX/TX stats are swapped
(strange!).  The mlx5 driver is not updating the regular RX/TX
counters, but A LOT of other ethtool stats counters (which are the ones
I usually monitor when testing).

So, given other drivers also didn't get this right, we need to have a
discussion outside your/this patchset.  Thus, I don't want to
stop/stall this patchset, but this is something we need to fixup in a
followup patchset to other drivers as well.


One of the reasons why I did not include the stats patches in this series
is that, as you say, stats in many drivers do not look correct, and I
thought correctness is not strictly required for now.
In fact I recently fixed virtio_net stats, which only updated the packets
counter but not the bytes counter on XDP_DROP.


Another reason is that it will hurt performance without a more
aggressive stats structure change. The drop counter is currently atomic,
so it would cause heavy cache contention in a multiqueue environment.
The plan is to make this per-cpu or per-queue first. Also I want to
introduce per-queue stats for ethtool, so the change would be relatively
big and probably would not fit in this series altogether.



Thus, I'm acking the patchset, but I request that we do a joint effort
of fixing this as followup patches.


Sure, at least for veth I'm going to make followup patches.


Acked-by: Jesper Dangaard Brouer 


Thank you for your thorough review!

Toshiaki Makita


[PATCH v8 bpf-next 09/10] veth: Add XDP TX and REDIRECT

2018-08-03 Thread Toshiaki Makita
This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)  (XDP) (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.
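
A minimal sketch of that bracketing in a NAPI poll handler; process_ring()
is a hypothetical placeholder for the ring-processing body:

/* Sketch: disable the napi_direct recycle path for the duration of
 * this NAPI poll, since xdp_return_frame() may run here for frames
 * whose xdp_mem_info belongs to another device's NAPI.
 */
static int example_poll(struct napi_struct *napi, int budget)
{
	int done;

	xdp_set_return_frame_no_direct();
	done = process_ring(napi, budget);	/* hypothetical helper */
	xdp_clear_return_frame_no_direct();

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}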

v8:
- Don't use xdp_frame pointer address for data_hard_start of xdp_buff.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index dbb693a..9b0a7b9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX	BIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,12 +322,44 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
void *hard_start = frame->data - frame->headroom;
void *head = hard_start - sizeof(struct xdp_frame);
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -346,6 +383,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = head;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = head;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -370,12 +430,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
- 

[PATCH v8 bpf-next 10/10] veth: Support per queue XDP ring

2018-08-03 Thread Toshiaki Makita
Move XDP and napi related fields from veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, the rxq is
selected by the current cpu.

When skbs are enqueued from the peer device, the rxq is a one-to-one
mapping of its peer txq. This way we have a restriction that the number
of rxqs must not be less than the number of peer txqs, but this leaves
the possibility to achieve bulk skb xmit in the future, because the txq
lock would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9b0a7b9..e3202af 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsigned		requested_headroom;
	bool			rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned int	requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 struct xdp_frame **frames, u32 flags)
 {
struct veth_pr

[PATCH v8 bpf-next 06/10] veth: Add ndo_xdp_xmit

2018-08-03 Thread Toshiaki Makita
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as one XDP program calling another XDP program
through REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.
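
To illustrate the envisioned source side, a minimal sketch of an XDP
program on the NIC that redirects everything to the veth; the ifindex
value and the helper header path are assumptions for illustration, not
part of this patch:

    /* SPDX-License-Identifier: GPL-2.0 */
    #include <linux/bpf.h>
    #include "bpf_helpers.h"        /* assumed sample helper macros */

    #define VETH_IFINDEX 4          /* assumption: the veth's ifindex */

    SEC("xdp")
    int xdp_to_veth(struct xdp_md *ctx)
    {
            /* frames reach the veth peer's ring via .ndo_xdp_xmit() */
            return bpf_redirect(VETH_IFINDEX, 0);
    }

    char _license[] SEC("license") = "GPL";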

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 89f3059..dbb693a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -769,6 +819,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
1.8.3.1




[PATCH v8 bpf-next 08/10] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-08-03 Thread Toshiaki Makita
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support for XDP_REDIRECT, it will redirect packets that
were redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used,
as well as helper functions to set, clear and test it.

A NAPI handler that wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
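
A condensed sketch of that calling convention; the poll handler and
receive routine names are illustrative, not part of this patch:

    static int example_poll(struct napi_struct *napi, int budget)
    {
            int done;

            xdp_set_return_frame_no_direct();

            /* may free frames via xdp_return_frame_rx_napi() */
            done = example_rcv_batch(napi, budget);

            xdp_do_flush_map();
            xdp_clear_return_frame_no_direct();

            if (done < budget)
                    napi_complete_done(napi, done);

            return done;
    }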

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8..2b072da 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index c013b83..efad5c0 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
1.8.3.1




[PATCH v8 bpf-next 07/10] bpf: Make redirect_info accessible from modules

2018-08-03 Thread Toshiaki Makita
We are going to add a kern_flags field in redirect_info for kernel
internal use.
In order to avoid a function call just to access the flags, make
redirect_info accessible from modules. Also, as it is now non-static,
add the prefix bpf_ to redirect_info.
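
With the per-CPU symbol exported, a module can then touch this state
directly; a minimal sketch (the wrapper name is illustrative only):

    #include <linux/filter.h>

    static bool example_redirect_pending(void)
    {
            struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

            return ri->map_to_flush != NULL;
    }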

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..4717af8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7509bb7..4754089 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2082,19 +2082,12 @@ static int __bpf_redirect(struct sk_buff *skb, struct 
net_device *dev,
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2107,7 +2100,7 @@ struct redirect_info {
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3200,7 +3193,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3245,7 +3238,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3285,7 +3278,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3317,7 +3310,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3368,7 +3361,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3399,7 +3392,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3423,7 +3416,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 BPF_CALL_4(bpf_xdp_redirect_

[PATCH v8 bpf-next 04/10] xdp: Helper function to clear kernel pointers in xdp_frame

2018-08-03 Thread Toshiaki Makita
xdp_frame has kernel pointers which should not be readable from bpf
programs. When we want to reuse the xdp_frame region but it may later
be read by bpf programs, we can use this helper to clear the kernel
pointers. This is more efficient than calling memset() for the entire
struct.
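
A sketch of the intended call site, mirroring the veth receive path
later in this series ('head', 'headroom', 'len' and 'dev' stand in for
the driver's locals): the xdp_frame still sits at the head of the skb
data area, so its kernel pointers are cleared before the skb leaves
the driver.

    skb = veth_build_skb(head, headroom, len, 0);
    if (!skb) {
            xdp_return_frame(frame);
            return NULL;
    }

    /* frame->data and frame->dev_rx are now NULL and cannot leak */
    xdp_scrub_frame(frame);
    skb->protocol = eth_type_trans(skb, dev);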

Signed-off-by: Toshiaki Makita 
---
 include/net/xdp.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f..76b9525 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -84,6 +84,13 @@ struct xdp_frame {
struct net_device *dev_rx; /* used by cpumap */
 };
 
+/* Clear kernel pointers in xdp_frame */
+static inline void xdp_scrub_frame(struct xdp_frame *frame)
+{
+   frame->data = NULL;
+   frame->dev_rx = NULL;
+}
+
 /* Convert xdp_buff to xdp_frame */
 static inline
 struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
-- 
1.8.3.1




[PATCH v8 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-03 Thread Toshiaki Makita
This is preparation for XDP TX and ndo_xdp_xmit.
This allows the napi handler to handle xdp_frames through the xdp ring
as well as sk_buffs.
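
The two pointer types are multiplexed on one ring by tagging bit 0,
which is always clear in a genuine kernel pointer since sk_buff and
xdp_frame allocations are at least word-aligned. A condensed sketch of
the consumer side added here (helper bodies are in the diff below;
producers of tagged pointers arrive with ndo_xdp_xmit):

    void *ptr = __ptr_ring_consume(&priv->xdp_ring);

    if (veth_is_xdp_frame(ptr))     /* bit 0 of the pointer is set */
            skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
    else                            /* plain sk_buff, untagged */
            skb = veth_xdp_rcv_skb(priv, ptr);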

v8:
- Don't use xdp_frame pointer address to calculate skb->head and
  headroom.

v7:
- Use xdp_scrub_frame() instead of memset().

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 89 +++---
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9edf104..89f3059 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,63 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   void *hard_start = frame->data - frame->headroom;
+   void *head = hard_start - sizeof(struct xdp_frame);
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = hard_start;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, );
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = sizeof(struct xdp_frame) + frame->headroom - delta;
+   skb = veth_build_skb(head, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   xdp_scrub_frame(frame);
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -359,12 +434,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(&priv->xdp_napi, skb);
@@ -417,7 +496,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(>x

[PATCH v8 bpf-next 02/10] veth: Add driver XDP

2018-08-03 Thread Toshiaki Makita
This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but it is a starting point for
implementing other useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate the NIC ring. The Tx
function enqueues packets to the ring and the peer's NAPI handler
drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in a multiqueue environment. This can be resolved later by
allocating rings on a per-queue basis.

Note that when XDP is not loaded, netif_rx is used instead of NAPI, so
this does not change the default behaviour.

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 374 -
 1 file changed, 367 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39..d3b9f10 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsigned                requested_headroom;
+   bool                    rx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static void veth_get_ethtool_stats(struct net_device *dev,
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,254 @@ static void veth_set_multicast_list(struct net_device 
*dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ 

[PATCH v8 bpf-next 03/10] veth: Avoid drops by oversized packets when XDP is enabled

2018-08-03 Thread Toshiaki Makita
Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.
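
Back-of-envelope for the cap computed in veth_xdp_set() below,
assuming a typical x86-64 build (the exact sizes vary by arch, config
and kernel version):

    /*
     *   max_mtu = PAGE_SIZE                  4096
     *           - VETH_XDP_HEADROOM        -  256  (256 + NET_IP_ALIGN)
     *           - peer->hard_header_len    -   14  (ETH_HLEN)
     *           - SKB_DATA_ALIGN(shinfo)   - ~320
     *                                      ------
     *                                      ~3506 bytes
     */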

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d3b9f10..9edf104 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -543,6 +543,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -572,6 +589,7 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -585,6 +603,15 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
+   NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -592,14 +619,29 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -644,6 +686,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
1.8.3.1




[PATCH v8 bpf-next 00/10] veth: Driver XDP

2018-08-03 Thread Toshiaki Makita
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with the redirect action of
another XDP program.

  NIC ---> veth===veth
 (XDP)  (redirect)  (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases
--------------------

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-6: Add ndo_xdp_xmit.
 - patch 7-9: Add XDP_TX and XDP_REDIRECT.
 - patch 10: Performance optimization for multi-queue env.


Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 of Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM, but the performance was not so
great due to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows    Gbps
===============================
external->veth        1    20.8
external->veth        2    23.5
external->veth      100    23.6
veth->external        1     9.0
veth->external        2    17.8
veth->external      100    22.9

Also tested doing ifup/down or loading/unloading an XDP program
repeatedly while processing XDP packets in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v8:
- Don't use xdp_frame pointer address to calculate skb->head, headroom,
  and xdp_buff.data_hard_start.

v7:
- Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
  use it instead of memset().

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (10):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  xdp: Helper function to clear kernel pointers in xdp_frame
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 750 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 include/net/xdp.h  |   7 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 7 files changed, 801 insertions(+), 30 deletions(-)

-- 
1.8.3.1




[PATCH v8 bpf-next 01/10] net: Export skb_headers_offset_update

2018-08-03 Thread Toshiaki Makita
This is needed for veth XDP, which does an skb_copy_expand()-like operation.
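
A sketch of the kind of caller veth has in mind, mirroring its skb
reallocation path ('nskb' and 'skb' are illustrative locals): after
relocating packet data into a buffer with different headroom, shift
every cached header offset by the headroom delta.

    int off = skb_headroom(nskb) - skb_headroom(skb);

    skb_copy_header(nskb, skb);
    skb_headers_offset_update(nskb, off);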

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b..f692968 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a912..4acd464 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t 
gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
1.8.3.1




Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita

On 18/08/02 (Thu) 22:53, Jesper Dangaard Brouer wrote:

On Thu, 2 Aug 2018 22:17:53 +0900
Toshiaki Makita  wrote:


On 18/08/02 (Thu) 20:45, Jesper Dangaard Brouer wrote:

On Thu,  2 Aug 2018 19:55:09 +0900
Toshiaki Makita  wrote:
   

+   headroom = frame->data - delta - (void *)frame;


Your calculation of headroom is still adding an assumption that
xdp_frame is located at the top of the data area, which is unnecessary.

The headroom can be calculated as:

   headroom = sizeof(struct xdp_frame) + frame->headroom - delta;


Thanks. But I'm not sure I get what you are requesting.


I'm simply requesting that you do not use the (void *)frame pointer
address to calculate the headroom, as it can be calculated in another way.


I don't see the difference, but OK, I can change this calculation
since you prefer a different way.



Supposing xdp_frame is not located at the top of the data area, what
ensures that the additional sizeof(struct xdp_frame) can be used?


The calculation in convert_to_xdp_frame() assures this.  If we later
add an xdp_frame that is not located in the top of data area, and want
to change the reserved headroom size, then we deal with it, and update
the code.


I just thought you were requesting the change so that we would not
need to change this code even when convert_to_xdp_frame() is changed.
Now I see my guess was wrong.


will send v8.

Thanks,
Toshiaki Makita


Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita

On 18/08/02 (Thu) 20:45, Jesper Dangaard Brouer wrote:

On Thu,  2 Aug 2018 19:55:09 +0900
Toshiaki Makita  wrote:


+   headroom = frame->data - delta - (void *)frame;


Your calculation of headroom is still adding an assumption that
xdp_frame is located at the top of the data area, which is unnecessary.

The headroom can be calculated as:

  headroom = sizeof(struct xdp_frame) + frame->headroom - delta;


Thanks. But I'm not sure I get what you are requesting.
Supposing xdp_frame is not located at the top of the data area, what
ensures that the additional sizeof(struct xdp_frame) can be used?


Toshiaki Makita


[PATCH v7 bpf-next 10/10] veth: Support per queue XDP ring

2018-08-02 Thread Toshiaki Makita
Move XDP and napi related fields from veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, the rxq is a one-to-one
mapping of its peer txq. This imposes the restriction that the number
of rxqs must not be less than the number of peer txqs, but it leaves
the possibility of bulk skb xmit in the future, because the txq lock
would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a2ba1c0..0bb409b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsigned                requested_headroom;
bool                    rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned int            requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 struct xdp_frame **frames, u32 flags)
 {
struct veth_pr

[PATCH v7 bpf-next 06/10] veth: Add ndo_xdp_xmit

2018-08-02 Thread Toshiaki Makita
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as one XDP program calling another XDP program
through REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9993878..3e1582a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -767,6 +817,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
1.8.3.1




[PATCH v7 bpf-next 07/10] bpf: Make redirect_info accessible from modules

2018-08-02 Thread Toshiaki Makita
We are going to add a kern_flags field in redirect_info for kernel
internal use.
In order to avoid a function call just to access the flags, make
redirect_info accessible from modules. Also, as it is now non-static,
add the prefix bpf_ to redirect_info.

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..4717af8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560..2766a55 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static int __bpf_redirect(struct sk_buff *skb, struct 
net_device *dev,
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ struct redirect_info {
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3397,7 +3390,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 BPF_CALL_4(bpf_xdp_redirect_

[PATCH v7 bpf-next 09/10] veth: Add XDP TX and REDIRECT

2018-08-02 Thread Toshiaki Makita
This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)     (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.
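
Condensed from the XDP_TX hunk below: the rebuilt xdp_buff inherits
the source device's xdp_mem_info, so the originating page pool still
recognizes the page when the frame is freed (a sketch, minus the
tracepoint and rcu unlock details):

    case XDP_TX:
            orig_frame = *frame;
            xdp.data_hard_start = frame;
            xdp.rxq->mem = frame->mem;      /* reuse source xdp_mem_info */
            if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
                    frame = &orig_frame;    /* restore before returning */
                    goto err_xdp;
            }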

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3e1582a..a2ba1c0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TXBIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsigned                requested_headroom;
bool                    rx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int *xdp_xmit)
 {
u32 pk

[PATCH v7 bpf-next 08/10] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-08-02 Thread Toshiaki Makita
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support for XDP_REDIRECT, it will redirect packets that
were redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used,
as well as helper functions to set, clear and test it.

A NAPI handler that wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8..2b072da 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 5728538..3dd99e1 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
1.8.3.1




[PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita
This is preparation for XDP TX and ndo_xdp_xmit.
This allows the napi handler to handle xdp_frames through the xdp ring
as well as sk_buffs.

v7:
- Use xdp_scrub_frame() instead of memset().

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9edf104..9993878 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, );
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   xdp_scrub_frame(frame);
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -359,12 +432,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(&priv->xdp_napi, skb);
@@ -417,7 +494,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
1.8.3.1




[PATCH v7 bpf-next 03/10] veth: Avoid drops by oversized packets when XDP is enabled

2018-08-02 Thread Toshiaki Makita
Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.
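
As a rough worked example (assuming x86_64 with 4KB pages, where
NET_IP_ALIGN is 0 and SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
is about 320 bytes for kernels of this generation):

        max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM - hard_header_len
                  - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
                = 4096 - 256 - 14 - 320
                = ~3500 bytes

so a standard 1500-byte MTU is unaffected, while jumbo-MTU setups must
lower the peer MTU before XDP can be attached.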

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d3b9f10..9edf104 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -543,6 +543,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -572,6 +589,7 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -585,6 +603,15 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
+   NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -592,14 +619,29 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -644,6 +686,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
1.8.3.1




[PATCH v7 bpf-next 04/10] xdp: Helper function to clear kernel pointers in xdp_frame

2018-08-02 Thread Toshiaki Makita
struct xdp_frame contains kernel pointers which should not be readable
from bpf programs. When we want to reuse the xdp_frame region but it may
be read by bpf programs later, we can use this helper to clear the
kernel pointers. This is more efficient than calling memset() for the
entire struct.
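
For illustration, the intended calling pattern (a condensed sketch of
what the veth patches in this series do after building an skb on top of
the frame):

        /* The buffer area that held struct xdp_frame becomes part of
         * the skb head, so clear the kernel pointers before a BPF
         * program can get direct access to that memory again.
         */
        skb = veth_build_skb(frame, headroom, len, 0);
        if (skb) {
                xdp_scrub_frame(frame);
                skb->protocol = eth_type_trans(skb, priv->dev);
        }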

Signed-off-by: Toshiaki Makita 
---
 include/net/xdp.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f..76b9525 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -84,6 +84,13 @@ struct xdp_frame {
struct net_device *dev_rx; /* used by cpumap */
 };
 
+/* Clear kernel pointers in xdp_frame */
+static inline void xdp_scrub_frame(struct xdp_frame *frame)
+{
+   frame->data = NULL;
+   frame->dev_rx = NULL;
+}
+
 /* Convert xdp_buff to xdp_frame */
 static inline
 struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
-- 
1.8.3.1




[PATCH v7 bpf-next 02/10] veth: Add driver XDP

2018-08-02 Thread Toshiaki Makita
This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate a NIC ring: the Tx
function enqueues packets to the ring and the peer's NAPI handler
drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in multiqueue environments. This can be resolved later by
allocating rings on a per-queue basis.

Note that when XDP is not loaded, netif_rx is used instead of NAPI, so
the default behaviour is unchanged.
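
In outline, the enqueue/drain pairing implemented by the diff below (a
condensed sketch, not additional code):

        /* Tx side (producer, veth_xmit path): */
        if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
                dev_kfree_skb_any(skb);         /* ring full: drop */
                return NET_RX_DROP;
        }
        __veth_xdp_flush(priv);                 /* kick peer NAPI */

        /* Rx side (consumer, peer NAPI handler): */
        for (i = 0; i < budget; i++) {
                struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);

                if (!skb)
                        break;
                skb = veth_xdp_rcv_skb(priv, skb);  /* run the XDP prog */
                if (skb)
                        napi_gro_receive(&priv->xdp_napi, skb);
        }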

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 374 -
 1 file changed, 367 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39..d3b9f10 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/skb_array.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION    "1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsigned requested_headroom;
+   bool rx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static void veth_get_ethtool_stats(struct net_device *dev,
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,254 @@ static void veth_set_multicast_list(struct net_device 
*dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ 

[PATCH v7 bpf-next 01/10] net: Export skb_headers_offset_update

2018-08-02 Thread Toshiaki Makita
This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b..f692968 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954..f5670e6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t 
gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
1.8.3.1




[PATCH v7 bpf-next 00/10] veth: Driver XDP

2018-08-02 Thread Toshiaki Makita
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC ---> veth===veth
 (XDP) (redirect)(XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases
====================

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-6: Add ndo_xdp_xmit.
 - patch 7-9: Add XDP_TX and XDP_REDIRECT.
 - patch 10: Performance optimization for multi-queue env.


Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used an i40e 25G NIC (XXV710) as the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0
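
For reference, the redirecting program on the NIC side is of this
general shape (a minimal sketch in the style of the kernel's
samples/bpf/xdp_redirect_map; the actual test program is not included
in this series, and the map layout is an assumption):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        /* slot 0 is populated from userspace with the veth ifindex */
        struct bpf_map_def SEC("maps") tx_port = {
                .type = BPF_MAP_TYPE_DEVMAP,
                .key_size = sizeof(int),
                .value_size = sizeof(int),
                .max_entries = 1,
        };

        SEC("xdp_redirect")
        int xdp_redirect_prog(struct xdp_md *ctx)
        {
                return bpf_redirect_map(&tx_port, 0, 0);
        }

        char _license[] SEC("license") = "GPL";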

I also measured netperf TCP_STREAM, but the performance was not so
great due to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

I also tested doing ifup/ifdown or loading/unloading an XDP program
repeatedly while processing XDP packets, in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v7:
- Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
  use it instead of memset().

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (10):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  xdp: Helper function to clear kernel pointers in xdp_frame
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 748 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 include/net/xdp.h  |   7 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 7 files changed, 799 insertions(+), 30 deletions(-)

-- 
1.8.3.1




Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring

2018-07-31 Thread Toshiaki Makita
On 2018/07/31 21:46, Jesper Dangaard Brouer wrote:
> On Tue, 31 Jul 2018 19:40:08 +0900
> Toshiaki Makita  wrote:
> 
>> On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:
>>>
>>> Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
>>>
>>> On Mon, 30 Jul 2018 19:43:44 +0900
>>> Toshiaki Makita  wrote:
>>>   
>>>> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>>>> +int buflen)
>>>> +{
>>>> +  struct sk_buff *skb;
>>>> +
>>>> +  if (!buflen) {
>>>> +  buflen = SKB_DATA_ALIGN(headroom + len) +
>>>> +   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +  }
>>>> +  skb = build_skb(head, buflen);
>>>> +  if (!skb)
>>>> +  return NULL;
>>>> +
>>>> +  skb_reserve(skb, headroom);
>>>> +  skb_put(skb, len);
>>>> +
>>>> +  return skb;
>>>> +}  
>>>
>>>
>>> On Mon, 30 Jul 2018 19:43:46 +0900
>>> Toshiaki Makita  wrote:
>>>   
>>>> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>>> +  struct xdp_frame *frame)
>>>> +{
>>>> +  int len = frame->len, delta = 0;
>>>> +  struct bpf_prog *xdp_prog;
>>>> +  unsigned int headroom;
>>>> +  struct sk_buff *skb;
>>>> +
>>>> +  rcu_read_lock();
>>>> +  xdp_prog = rcu_dereference(priv->xdp_prog);
>>>> +  if (likely(xdp_prog)) {
>>>> +  struct xdp_buff xdp;
>>>> +  u32 act;
>>>> +
>>>> +  xdp.data_hard_start = frame->data - frame->headroom;
>>>> +  xdp.data = frame->data;
>>>> +  xdp.data_end = frame->data + frame->len;
>>>> +  xdp.data_meta = frame->data - frame->metasize;
>>>> +  xdp.rxq = &priv->xdp_rxq;
>>>> +
>>>> +  act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>>> +
>>>> +  switch (act) {
>>>> +  case XDP_PASS:
>>>> +  delta = frame->data - xdp.data;
>>>> +  len = xdp.data_end - xdp.data;
>>>> +  break;
>>>> +  default:
>>>> +  bpf_warn_invalid_xdp_action(act);
>>>> +  case XDP_ABORTED:
>>>> +  trace_xdp_exception(priv->dev, xdp_prog, act);
>>>> +  case XDP_DROP:
>>>> +  goto err_xdp;
>>>> +  }
>>>> +  }
>>>> +  rcu_read_unlock();
>>>> +
>>>> +  headroom = frame->data - delta - (void *)frame;
>>>> +  skb = veth_build_skb(frame, headroom, len, 0);  
>>>
>>> Here you are adding an assumption that struct xdp_frame is always
>>> located in-the-top of the packet-data area.  I tried hard not to add
>>> such a dependency!  You can calculate the beginning of the frame from
>>> the xdp_frame->data pointer.
>>>
>>> Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
>>> make such an assumption.  
>>>
>>> Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
>>> the packet will get dropped when calling convert_to_xdp_frame(), but as
>>> the TODO comment indicated in convert_to_xdp_frame() this is not the
>>> end-goal. 
>>>
>>> The comment in convert_to_xdp_frame(), indicate we need a full
>>> alloc+copy, but that is actually not necessary, if we can just use
>>> another memory area for struct xdp_frame, and a pointer to data.  Thus,
>>> allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
>>> on the remote CPU.  
>>
>> Thanks for pointing this out.
>> Seems you are saying xdp_frame area is not reusable. That means we
>> reduce usable headroom on every REDIRECT. I wanted to avoid this but
>> actually it is impossible, right?
> 
> I'm not sure I understand fully...  has this something to do, with the
> below memset?

Sorry for not being so clear...
It has something to do with the memset as well, but mainly I was talking
about XDP_TX and REDIRECT introduced in patch 8. On REDIRECT,
dev_map_enqueue() calls convert_to_xdp_frame(), so we use the headroom
for struct xdp_frame on REDIRECT. If we don't reuse the xdp_frame region
of the original xdp packet, we reduce the usable headroom on every
REDIRECT.
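
(For scale: struct xdp_frame is a few dozen bytes on 64-bit builds, so
without reusing the region each XDP_TX/REDIRECT hop would permanently
consume roughly sizeof(struct xdp_frame) more bytes out of the 256-byte
XDP_PACKET_HEADROOM; the exact size is version-dependent, this is only
an illustration.)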

Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring

2018-07-31 Thread Toshiaki Makita
On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:
> 
> Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
> 
> On Mon, 30 Jul 2018 19:43:44 +0900
> Toshiaki Makita  wrote:
> 
>> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>> +  int buflen)
>> +{
>> +struct sk_buff *skb;
>> +
>> +if (!buflen) {
>> +buflen = SKB_DATA_ALIGN(headroom + len) +
>> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +}
>> +skb = build_skb(head, buflen);
>> +if (!skb)
>> +return NULL;
>> +
>> +skb_reserve(skb, headroom);
>> +skb_put(skb, len);
>> +
>> +return skb;
>> +}
> 
> 
> On Mon, 30 Jul 2018 19:43:46 +0900
> Toshiaki Makita  wrote:
> 
>> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>> +struct xdp_frame *frame)
>> +{
>> +int len = frame->len, delta = 0;
>> +struct bpf_prog *xdp_prog;
>> +unsigned int headroom;
>> +struct sk_buff *skb;
>> +
>> +rcu_read_lock();
>> +xdp_prog = rcu_dereference(priv->xdp_prog);
>> +if (likely(xdp_prog)) {
>> +struct xdp_buff xdp;
>> +u32 act;
>> +
>> +xdp.data_hard_start = frame->data - frame->headroom;
>> +xdp.data = frame->data;
>> +xdp.data_end = frame->data + frame->len;
>> +xdp.data_meta = frame->data - frame->metasize;
>> +xdp.rxq = &priv->xdp_rxq;
>> +
>> +act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +
>> +switch (act) {
>> +case XDP_PASS:
>> +delta = frame->data - xdp.data;
>> +len = xdp.data_end - xdp.data;
>> +break;
>> +default:
>> +bpf_warn_invalid_xdp_action(act);
>> +case XDP_ABORTED:
>> +trace_xdp_exception(priv->dev, xdp_prog, act);
>> +case XDP_DROP:
>> +goto err_xdp;
>> +}
>> +}
>> +rcu_read_unlock();
>> +
>> +headroom = frame->data - delta - (void *)frame;
>> +skb = veth_build_skb(frame, headroom, len, 0);
> 
> Here you are adding an assumption that struct xdp_frame is always
> located in-the-top of the packet-data area.  I tried hard not to add
> such a dependency!  You can calculate the beginning of the frame from
> the xdp_frame->data pointer.
> 
> Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
> make such an assumption.  
> 
> Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
> the packet will get dropped when calling convert_to_xdp_frame(), but as
> the TODO comment indicated in convert_to_xdp_frame() this is not the
> end-goal. 
> 
> The comment in convert_to_xdp_frame(), indicate we need a full
> alloc+copy, but that is actually not necessary, if we can just use
> another memory area for struct xdp_frame, and a pointer to data.  Thus,
> allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
> on the remote CPU.

Thanks for pointing this out.
Seems you are saying xdp_frame area is not reusable. That means we
reduce usable headroom on every REDIRECT. I wanted to avoid this but
actually it is impossible, right?

>> +if (!skb) {
>> +xdp_return_frame(frame);
>> +goto err;
>> +}
>> +
>> +memset(frame, 0, sizeof(*frame));
>> +skb->protocol = eth_type_trans(skb, priv->dev);
>> +err:
>> +return skb;
>> +err_xdp:
>> +rcu_read_unlock();
>> +xdp_return_frame(frame);
>> +
>> +return NULL;
>> +}
> 
> 

-- 
Toshiaki Makita



[PATCH v6 bpf-next 9/9] veth: Support per queue XDP ring

2018-07-30 Thread Toshiaki Makita
Move XDP and napi related fields in veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, the rxq is
selected by the current cpu.

When skbs are enqueued from the peer device, the rxq is a one-to-one
mapping of its peer txq. This imposes the restriction that the number
of rxqs must not be less than the number of peer txqs, but leaves the
possibility of achieving bulk skb xmit in the future, because the txq
lock would make it possible to remove the rxq ptr_ring lock.
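
For example (queue counts purely illustrative): with 4 txqs on the peer
and 4 local rxqs, an skb sent on peer txq 2 always lands in rq[2],
while an xdp_frame arriving via ndo_xdp_xmit or XDP_TX on cpu 5 lands
in rq[5 % real_num_rx_queues] = rq[1].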

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1ce3691..c276a72 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsigned requested_headroom;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned int requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 struct xdp_frame **frames, u32 flags)
 {
struct veth_pr

[PATCH v6 bpf-next 5/9] veth: Add ndo_xdp_xmit

2018-07-30 Thread Toshiaki Makita
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as one XDP program calling another XDP program
via REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.
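
For example, if ndo_xdp_xmit() is called with n = 16 frames and the
peer ring only has room for 14, the two overflowing frames are freed
via xdp_return_frame_rx_napi() and 14 is returned to the caller.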

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9de0e90..c13f7a4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include <net/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/xdp.h>
 #include <linux/veth.h>
 #include <linux/module.h>
 #include <linux/bpf.h>
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -767,6 +817,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
1.8.3.1




[PATCH v6 bpf-next 8/9] veth: Add XDP TX and REDIRECT

2018-07-30 Thread Toshiaki Makita
This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)  (XDP) (XDP)

The intermediate XDP program, redirecting packets from the NIC to the
other veth, reuses the NIC's xdp_mem_info so that the NIC's page
recycling works on the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct via xdp_set_return_frame_no_direct() during
the NAPI context.
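
As a concrete example of the hazard (cpu numbers are illustrative):
after a redirect, cpu 0 may be running the NIC's NAPI handler while
cpu 1 runs the destination veth's NAPI handler, and both can return
frames to the same page_pool; the lockless napi_direct recycling path
is only safe from the NAPI context that owns the pool, so it must be
disabled on cpu 1.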

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index c13f7a4..b1ce3691 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX    BIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsigned requested_headroom;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int *xdp_xmit)
 {
u32 pk

[PATCH v6 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-30 Thread Toshiaki Makita
Oversized packets, including GSO packets, can be dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d3b9f10..9edf104 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -543,6 +543,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -572,6 +589,7 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -585,6 +603,15 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -592,14 +619,29 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -644,6 +686,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
1.8.3.1




[PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring

2018-07-30 Thread Toshiaki Makita
This is preparation for XDP TX and ndo_xdp_xmit.
This allows the napi handler to handle xdp_frames through the xdp ring
as well as sk_buffs.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9edf104..9de0e90 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
-#include <linux/skb_array.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION    "1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   memset(frame, 0, sizeof(*frame));
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -359,12 +432,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(>xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(&priv->xdp_napi, skb);
@@ -417,7 +494,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
1.8.3.1




[PATCH v6 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-07-30 Thread Toshiaki Makita
We need some mechanism to disable napi_direct when calling
xdp_return_frame_rx_napi() from certain contexts.
When veth gets support for XDP_REDIRECT, it will redirect packets that
were themselves redirected from other devices. On redirection, veth
reuses the xdp_mem_info of the redirection source device to make
return_frame work. But in this case the .ndo_xdp_xmit() called from
veth redirection uses an xdp_mem_info which is not guarded by NAPI,
because that .ndo_xdp_xmit() is not called directly from the rxq which
owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used,
as well as helper functions to set and clear it.

A NAPI handler that wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map(), before
exiting the NAPI handler.
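
A minimal sketch of that calling pattern in a NAPI poll handler
(illustrative only; everything except the helpers named above is a
placeholder):

        static int example_poll(struct napi_struct *napi, int budget)
        {
                int done;

                xdp_set_return_frame_no_direct();
                done = example_rcv(napi, budget); /* may redirect frames */
                xdp_do_flush_map();
                xdp_clear_return_frame_no_direct();

                if (done < budget)
                        napi_complete_done(napi, done);
                return done;
        }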

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8..2b072da 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 5728538..3dd99e1 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
1.8.3.1




[PATCH v6 bpf-next 6/9] bpf: Make redirect_info accessible from modules

2018-07-30 Thread Toshiaki Makita
We are going to add a kern_flags field to redirect_info for
kernel-internal use.
In order to avoid a function call to access the flags, make
redirect_info accessible from modules. Also, as it is now non-static,
add the prefix bpf_ to redirect_info.

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..4717af8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560..2766a55 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static int __bpf_redirect(struct sk_buff *skb, struct 
net_device *dev,
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ struct redirect_info {
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3397,7 +3390,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 BPF_CALL_4(bpf_xdp_redirect_

[PATCH v6 bpf-next 2/9] veth: Add driver XDP

2018-07-30 Thread Toshiaki Makita
This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate a NIC ring: the Tx
function enqueues packets to the ring and the peer's NAPI handler
drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in multiqueue environments. This can be resolved later by
allocating rings on a per-queue basis.

Note that when XDP is not loaded, netif_rx is used instead of NAPI, so
the default behaviour is unchanged.

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 374 -
 1 file changed, 367 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39..d3b9f10 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/skb_array.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION    "1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsigned requested_headroom;
+   bool rx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static void veth_get_ethtool_stats(struct net_device *dev,
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,254 @@ static void veth_set_multicast_list(struct net_device 
*dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ 

[PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update

2018-07-30 Thread Toshiaki Makita
This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b..f692968 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954..f5670e6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t 
gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
1.8.3.1




[PATCH v6 bpf-next 0/9] veth: Driver XDP

2018-07-30 Thread Toshiaki Makita
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC ---> veth===veth
 (XDP) (redirect)(XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases
====================

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-8: Add XDP_TX and XDP_REDIRECT.
 - patch 9: Performance optimization for multi-queue env.


Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used an i40e 25G NIC (XXV710) as the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM, but the performance was not so
great due to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

I also tested doing ifup/ifdown or loading/unloading an XDP program
repeatedly while processing XDP packets, in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (9):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 748 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 6 files changed, 792 insertions(+), 30 deletions(-)

-- 
1.8.3.1




Re: [PATCH v5 bpf-next 2/9] veth: Add driver XDP

2018-07-26 Thread Toshiaki Makita
Hi John,

On 2018/07/27 12:02, John Fastabend wrote:
> On 07/26/2018 07:40 AM, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This is the basic implementation of veth driver XDP.
>>
>> Incoming packets are sent from the peer veth device in the form of skb,
>> so this is generally doing the same thing as generic XDP.
>>
>> This itself is not so useful, but it is a starting point for implementing
>> other useful veth XDP features like TX and REDIRECT.
>>
>> This introduces NAPI when XDP is enabled, because XDP now heavily
>> relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
>> enqueues packets to the ring and peer NAPI handler drains the ring.
>>
>> Currently only one ring is allocated for each veth device, so it does
>> not scale on multiqueue env. This can be resolved by allocating rings
>> on the per-queue basis later.
>>
>> Note that when XDP is not loaded, netif_rx is used instead of NAPI,
>> so this does not change the default behaviour.
>>
>> v3:
>> - Fix race on closing the device.
>> - Add extack messages in ndo_bpf.
>>
>> v2:
>> - Squashed with the patch adding NAPI.
>> - Implement adjust_tail.
>> - Don't acquire consumer lock because it is guarded by NAPI.
>> - Make poll_controller noop since it is unnecessary.
>> - Register rxq_info on enabling XDP rather than on opening the device.
>>
>> Signed-off-by: Toshiaki Makita 
>> ---
> 
> 
> [...]
> 
> One nit and one question.
> 
>> +
>> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
>> +struct sk_buff *skb)
>> +{
>> +u32 pktlen, headroom, act, metalen;
>> +void *orig_data, *orig_data_end;
>> +int size, mac_len, delta, off;
>> +struct bpf_prog *xdp_prog;
>> +struct xdp_buff xdp;
>> +
>> +rcu_read_lock();
>> +xdp_prog = rcu_dereference(priv->xdp_prog);
>> +if (unlikely(!xdp_prog)) {
>> +rcu_read_unlock();
>> +goto out;
>> +}
>> +
>> +mac_len = skb->data - skb_mac_header(skb);
>> +pktlen = skb->len + mac_len;
>> +size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
>> +   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (size > PAGE_SIZE)
>> +goto drop;
> 
> I'm not sure why it matters if size > PAGE_SIZE here. Why not
> just consume it and use the correct page order in alloc_page if
> it's not linear.

Indeed. We can allow such skbs here at least if we don't need
reallocation (which is highly unlikely though).

But I'm not sure we should allocate multiple pages in atomic context.
It tends to cause random allocation failures, which are IMO more
frustrating. We are now preventing such a situation via max_mtu and by
dropping features, which looks more robust to me.

>> +
>> +headroom = skb_headroom(skb) - mac_len;
>> +if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> +    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
>> +struct sk_buff *nskb;
>> +void *head, *start;
>> +struct page *page;
>> +int head_off;
>> +
>> +page = alloc_page(GFP_ATOMIC);
> 
> Should also have __GFP_NOWARN here as well; this can be triggered by
> external events, so we don't want a DDoS here to flood the system logs.

Sure, thanks!

-- 
Toshiaki Makita



Re: [PATCH v5 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-26 Thread Toshiaki Makita
On 2018/07/27 9:51, Jakub Kicinski wrote:
> On Thu, 26 Jul 2018 23:40:26 +0900, Toshiaki Makita wrote:
>> +max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
>> +  peer->hard_header_len -
>> +  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (peer->mtu > max_mtu) {
>> +NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
>> +err = -ERANGE;
>> +goto err;
>> +}
> 
> You need to add .ndo_change_mtu and check this condition there too.

I'm setting peer->max_mtu so no need to add .ndo_change_mtu.
Inappropriate MTU will be refused in dev_set_mtu().

-- 
Toshiaki Makita



[PATCH v5 bpf-next 9/9] veth: Support per queue XDP ring

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

Move the XDP and NAPI related fields in veth_priv to a newly created
veth_rq structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not be less than the number of peer txqs, but this leaves the
possibility of achieving bulk skb xmit in the future, because the txq lock
would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 60397a8ea2e9..3059b897ecea 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned intrequested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 struct xdp_frame **frames, u

[PATCH v5 bpf-next 6/9] bpf: Make redirect_info accessible from modules

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd7396886..4717af8b95e6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560946da..acf322296535 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static const struct bpf_func_proto 
bpf_clone_redirect_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3397,7 +3390,7 @@ EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto 
= {
 BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, i

[PATCH v5 bpf-next 8/9] veth: Add XDP TX and REDIRECT

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)     (XDP)        (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index acdb1c543f4b..60397a8ea2e9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TXBIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   uns

[PATCH v5 bpf-next 5/9] veth: Add ndo_xdp_xmit

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as calling another XDP program from an XDP program
using REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ef22d991f678..acdb1c543f4b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -766,6 +816,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3

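For reference, a caller of this bulk interface hands over an array of
frames in one call; XDP_XMIT_FLUSH asks the driver to kick the peer's NAPI
immediately. A hedged sketch (the real callers live in the XDP redirect
core, e.g. devmap; example_xmit_bulk is illustrative only):

/* Sketch only: push a batch of xdp_frames into a veth.  On success the
 * return value is the number of frames accepted; the rest were already
 * freed by the driver via xdp_return_frame_rx_napi().
 */
static int example_xmit_bulk(struct net_device *dev,
			     struct xdp_frame **frames, int n)
{
	if (!dev->netdev_ops->ndo_xdp_xmit)
		return -EOPNOTSUPP;

	return dev->netdev_ops->ndo_xdp_xmit(dev, n, frames, XDP_XMIT_FLUSH);
}
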


[PATCH v5 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1b4006d3df32..ef22d991f678 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   memset(frame, 0, sizeof(*frame));
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -358,12 +431,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(>xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(&priv->xdp_napi, skb);
@@ -416,7 +493,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
2.14.3

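The ptr_ring tagging above works because both sk_buff and xdp_frame
pointers are at least 2-byte aligned, leaving bit 0 free as a type tag. A
standalone sketch of the round trip (mirroring the helpers in the patch,
not new kernel code):

#include <assert.h>
#include <stdint.h>

#define VETH_XDP_FLAG 0x1UL

static void *xdp_to_ptr(void *frame)
{
	return (void *)((uintptr_t)frame | VETH_XDP_FLAG);
}

static int is_xdp_ptr(void *ptr)
{
	return (uintptr_t)ptr & VETH_XDP_FLAG;
}

static void *ptr_to_xdp(void *ptr)
{
	return (void *)((uintptr_t)ptr & ~VETH_XDP_FLAG);
}

int main(void)
{
	long storage;	/* stands in for a suitably aligned xdp_frame */
	void *tagged = xdp_to_ptr(&storage);

	assert(is_xdp_ptr(tagged));
	assert(ptr_to_xdp(tagged) == &storage);
	return 0;
}
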


[PATCH v5 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support for XDP_REDIRECT, it will redirect packets which
were redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.

A NAPI handler that wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8b95e6..2b072dab32c0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 57285383ed00..3dd99e1c04f5 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
2.14.3

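Shaped as code, the calling convention described in the changelog above
would look roughly like this NAPI poll function (names are illustrative;
the real user is veth's poll handler added later in the series):

/* Sketch only: example_rcv_packets() is a hypothetical routine that may
 * free frames with xdp_return_frame_rx_napi() and trigger redirects.
 */
static int example_poll(struct napi_struct *napi, int budget)
{
	int done;

	xdp_set_return_frame_no_direct();	/* before processing packets */

	done = example_rcv_packets(napi, budget);

	xdp_do_flush_map();			/* flush pending redirects */
	xdp_clear_return_frame_no_direct();	/* after the flush, before exit */

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}
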


[PATCH v5 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

All oversized packets including GSO packets are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 78fa08cb6e24..1b4006d3df32 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -571,6 +588,7 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -584,6 +602,15 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
+   NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -591,14 +618,29 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -643,6 +685,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
2.14.3

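As a concrete instance of the max_mtu formula in this patch (all constants
below are assumptions for a typical x86-64 build and can differ by
architecture and kernel version):

/* Back-of-the-envelope max_mtu on x86-64 with 4 KiB pages. */
int max_mtu = 4096	/* PAGE_SIZE */
	    - 256	/* VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM 256 + NET_IP_ALIGN 0) */
	    - 14	/* peer->hard_header_len for Ethernet */
	    - 320;	/* SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), assumed */
/* => 3506 bytes: the largest peer MTU allowed while XDP is attached */
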


[PATCH v5 bpf-next 1/9] net: Export skb_headers_offset_update

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..f6929688853a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954f763e..f5670e6ab40c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t 
gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
2.14.3



[PATCH v5 bpf-next 2/9] veth: Add driver XDP

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but it is a starting point for implementing
other useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.

Note that when XDP is not loaded, netif_rx is used instead of NAPI,
so this does not change the default behaviour.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 373 -
 1 file changed, 366 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..78fa08cb6e24 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsignedrequested_headroom;
+   boolrx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 drop:
atomic64_inc(>dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,253 @@ static void veth_set_multicast_list(struct net_device 
*dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+   struct sk_buff *skb;
+
+   if (!buflen) {
+   

[PATCH v5 bpf-next 0/9] veth: Driver XDP

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This patch set introduces driver XDP for veth.
Basically this is used in conjunction with the redirect action of another XDP
program.

  NIC ---------> veth===veth
 (XDP) (redirect)      (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases
--------------------

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-8: Add XDP_TX and XDP_REDIRECT.
 - patch 9: Performance optimization for multi-queue env.


Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action  Flows  Mpps
============================
DROP                 1  10.6
DROP                 2  21.2
DROP               100  36.0
TX                   1   5.0
TX                   2  10.0
TX                 100  31.0

I also measured netperf TCP_STREAM, but performance was not so great due
to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction       Flows   Gbps
============================
external->veth      1   20.8
external->veth      2   23.5
external->veth    100   23.6
veth->external      1    9.0
veth->external      2   17.8
veth->external    100   22.9

I also tested repeated ifup/down and loading/unloading of an XDP program
while XDP packets were being processed, in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (9):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 747 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 6 files changed, 791 insertions(+), 30 deletions(-)

-- 
2.14.3



Re: [PATCH v4 bpf-next 1/9] net: Export skb_headers_offset_update

2018-07-26 Thread Toshiaki Makita

On 18/07/26 (Thu) 23:25, Toshiaki Makita wrote:

From: Toshiaki Makita 

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 


Oops. SOBs are messed up. Please ignore this series.
Sorry for the noise.


[PATCH v4 bpf-next 5/9] veth: Add ndo_xdp_xmit

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as calling another XDP program from an XDP program
using REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ef22d991f678..acdb1c543f4b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -766,6 +816,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3



[PATCH v4 bpf-next 6/9] bpf: Make redirect_info accessible from modules

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd7396886..4717af8b95e6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560946da..acf322296535 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static const struct bpf_func_proto 
bpf_clone_redirect_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3397,7 +3390,7 @@ EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto 
= {
 BPF_CALL_4(bpf_xdp_redirect_

[PATCH v4 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1b4006d3df32..ef22d991f678 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   memset(frame, 0, sizeof(*frame));
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -358,12 +431,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(>xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(>xdp_napi, skb);
@@ -416,7 +493,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
2.14.3



[PATCH v4 bpf-next 9/9] veth: Support per queue XDP ring

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

Move the XDP and NAPI related fields in veth_priv to a newly created
veth_rq structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not be less than the number of peer txqs, but this leaves the
possibility of achieving bulk skb xmit in the future, because the txq lock
would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 60397a8ea2e9..3059b897ecea 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned intrequested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
struct xdp_frame **frames, u32 flags)

[PATCH v4 bpf-next 8/9] veth: Add XDP TX and REDIRECT

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index acdb1c543f4b..60397a8ea2e9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX    BIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsigned requested_headroom;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int *xdp_xmit)

[PATCH v4 bpf-next 2/9] veth: Add driver XDP

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate the NIC ring. The Tx
function enqueues packets to the ring and the peer NAPI handler drains
the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in a multiqueue environment. This can be resolved by
allocating rings on a per-queue basis later.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 373 -
 1 file changed, 366 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..78fa08cb6e24 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/skb_array.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION    "1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsigned requested_headroom;
+   bool rx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,253 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+   struct sk_buff *skb;
+
+   if (

[PATCH v4 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

All oversized packets including GSO packets are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap the MTU accordingly.
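
For a rough idea of the resulting cap, a worked example (assuming 4kB
pages and Ethernet on x86_64, where NET_IP_ALIGN is 0 and
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) comes to 320 bytes):

  max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM - hard_header_len
            - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
          = 4096 - 256 - 14 - 320
          = 3506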

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 78fa08cb6e24..1b4006d3df32 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -571,6 +588,7 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -584,6 +602,15 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
+   NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to 
set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -591,14 +618,29 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -643,6 +685,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
2.14.3



[PATCH v4 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support for XDP_REDIRECT, it will redirect packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used,
as well as helper functions to use it.

A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
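
In veth this ends up bracketing the NAPI poll function roughly as
follows (a condensed sketch; veth_xdp_rcv() and VETH_XDP_REDIR are the
names used elsewhere in this series):

	xdp_set_return_frame_no_direct();
	done = veth_xdp_rcv(priv, budget, &xdp_xmit);

	if (xdp_xmit & VETH_XDP_REDIR)
		xdp_do_flush_map();
	xdp_clear_return_frame_no_direct();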

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8b95e6..2b072dab32c0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 57285383ed00..3dd99e1c04f5 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
2.14.3



[PATCH v4 bpf-next 1/9] net: Export skb_headers_offset_update

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..f6929688853a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954f763e..f5670e6ab40c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
2.14.3



[PATCH v4 bpf-next 0/9] veth: Driver XDP

2018-07-26 Thread Toshiaki Makita
From: Toshiaki Makita 

This patch set introduces driver XDP for veth.
Basically this is used in conjunction with the redirect action of
another XDP program.

  NIC ---> veth===veth
 (XDP)   (redirect)   (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases


* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-8: Add XDP_TX and XDP_REDIRECT.
 - patch 9: Performance optimization for multi-queue env.


Tests and performance numbers
-

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used an i40e 25G NIC (XXV710) as the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0
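
The test program itself is not included in this posting; a minimal
redirecting program of the kind described could look like the sketch
below (EGRESS_IFINDEX is a hypothetical placeholder for the real egress
device; section and helper conventions are libbpf-style):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define EGRESS_IFINDEX 3	/* hypothetical: fill in the real ifindex */

  SEC("xdp")
  int xdp_redirect_prog(struct xdp_md *ctx)
  {
  	/* send every packet to the egress device */
  	return bpf_redirect(EGRESS_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";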

I also measured netperf TCP_STREAM, but performance was not so great due
to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction        Flows   Gbps
=============================
external->veth       1   20.8
external->veth       2   23.5
external->veth     100   23.6
veth->external       1    9.0
veth->external       2   17.8
veth->external     100   22.9

Also tested doing ifup/down or loading/unloading an XDP program
repeatedly while processing XDP packets, in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (9):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 747 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 6 files changed, 791 insertions(+), 30 deletions(-)

-- 
2.14.3



Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-24 Thread Toshiaki Makita
On 2018/07/25 4:10, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 18:39:09 +0900, Toshiaki Makita wrote:
>> On 2018/07/24 10:56, Toshiaki Makita wrote:
>>> On 2018/07/24 9:27, Jakub Kicinski wrote:  
>>>> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:  
>>>>> From: Toshiaki Makita 
>>>>>
>>>>> All oversized packets including GSO packets are dropped if XDP is
>>>>> enabled on receiver side, so don't send such packets from peer.
>>>>>
>>>>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>>>>> segment packets with XDP enabled. Also cap MTU accordingly.
>>>>>
>>>>> Signed-off-by: Toshiaki Makita   
>>>>
>>>> Is there any precedence for fixing up features and MTU like this?  Most
>>>> drivers just refuse to install the program if settings are incompatible.  
>>>
>>> I don't know any precedence. I can refuse the program on installing it
>>> when features and MTU are not appropriate. Is it preferred?
>>> Note that with current implementation wanted_features are not touched so
>>> features will be restored when the XDP program is removed. MTU will not
>>> be restored though, as I do not remember the original MTU.  
>>
>> I just recalled that virtio_net used to refused XDP when guest offload
>> features are incompatible but now it dynamically fixup them on
>> installing an XDP program.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f93522ffab2d46a36b57adf324a54e674fc9536
> 
> That's slightly different AFAIU, because the virtio features weren't
> really controllable at runtime at all.  I'm not dead set on leaving the
> features be, but I just want to make sure we think this through
> properly before we commit to any magic behaviour for ever...

To me it does not look so different. What the above virtio commit does
is almost disable LRO, so we could have added a feature flag to toggle
LRO instead, but it chose to disable it automatically. And this veth
commit is also almost equivalent to disabling LRO.
IMHO we should do this feature adjustment. It just avoids packet drops
and has no downside. It forces software segmentation on the peer veth.
Features will be restored when XDP is removed, so there would be no
surprise for users. It seems there is no benefit to not doing this.

> Taking a quick glance at the MTU side, it seems that today if someone
> decides to set MTU on one side of a veth pair the packets will simply
> get dropped.  So the MTU coupling for XDP doesn't seem in line with
> existing behaviour of veth, not only other XDP drivers.

It looks weird to allow such inconsistent MTU settings. But anyway,
changing the MTU can have a negative effect on users, such as causing
fragmentation or an EMSGSIZE error on UDP sendmsg(), or not restoring
the MTU. I think I should not adjust the MTU but just cap max_mtu.

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-24 Thread Toshiaki Makita
On 2018/07/24 10:56, Toshiaki Makita wrote:
> On 2018/07/24 9:27, Jakub Kicinski wrote:
>> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
>>> From: Toshiaki Makita 
>>>
>>> All oversized packets including GSO packets are dropped if XDP is
>>> enabled on receiver side, so don't send such packets from peer.
>>>
>>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>>> segment packets with XDP enabled. Also cap MTU accordingly.
>>>
>>> Signed-off-by: Toshiaki Makita 
>>
>> Is there any precedence for fixing up features and MTU like this?  Most
>> drivers just refuse to install the program if settings are incompatible.
> 
> I don't know any precedence. I can refuse the program on installing it
> when features and MTU are not appropriate. Is it preferred?
> Note that with current implementation wanted_features are not touched so
> features will be restored when the XDP program is removed. MTU will not
> be restored though, as I do not remember the original MTU.

I just recalled that virtio_net used to refuse XDP when guest offload
features are incompatible, but now it dynamically fixes them up on
installing an XDP program.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f93522ffab2d46a36b57adf324a54e674fc9536

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 12:38, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 11:43:11 +0900, Toshiaki Makita wrote:
>> On 2018/07/24 10:22, Jakub Kicinski wrote:
>>> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:  
>>>> From: Toshiaki Makita 
>>>>
>>>> We need some mechanism to disable napi_direct on calling
>>>> xdp_return_frame_rx_napi() from some context.
>>>> When veth gets support of XDP_REDIRECT, it will redirects packets which
>>>> are redirected from other devices. On redirection veth will reuse
>>>> xdp_mem_info of the redirection source device to make return_frame work.
>>>> But in this case .ndo_xdp_xmit() called from veth redirection uses
>>>> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
>>>> not called directly from the rxq which owns the xdp_mem_info.
>>>>
>>>> This approach introduces a flag in xdp_mem_info to indicate that
>>>> napi_direct should be disabled even when _rx_napi variant is used.
>>>>
>>>> Signed-off-by: Toshiaki Makita   
>>>
>>> To be clear - you will modify flags of the original source device if it
>>> ever redirected a frame to a software device like veth?  Seems a bit
>>> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
>>> error paths, but still..  Also as you note the original NAPI can run
>>> concurrently with your veth dest one, but also with NAPIs of other veth
>>> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
>>> makes me worried.  
>>
>> xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
>> field is local to the frame. Changing flags affects only the frame.
>> xdp.rxq is local to NAPI thread, so no worries about atomicity.
> 
> Ah, right!  mem_info used to be just 8B, now it would be 12B.
> Alternatively we could perhaps add this info to struct redirect_info,
> through xdp_do_redirect() to avoid the per-frame cost.  I'm not sure
> that's better.

OK, let me check if this works.

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:22, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> We need some mechanism to disable napi_direct on calling
>> xdp_return_frame_rx_napi() from some context.
>> When veth gets support of XDP_REDIRECT, it will redirects packets which
>> are redirected from other devices. On redirection veth will reuse
>> xdp_mem_info of the redirection source device to make return_frame work.
>> But in this case .ndo_xdp_xmit() called from veth redirection uses
>> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
>> not called directly from the rxq which owns the xdp_mem_info.
>>
>> This approach introduces a flag in xdp_mem_info to indicate that
>> napi_direct should be disabled even when _rx_napi variant is used.
>>
>> Signed-off-by: Toshiaki Makita 
> 
> To be clear - you will modify flags of the original source device if it
> ever redirected a frame to a software device like veth?  Seems a bit
> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
> error paths, but still..  Also as you note the original NAPI can run
> concurrently with your veth dest one, but also with NAPIs of other veth
> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
> makes me worried.

xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
field is local to the frame. Changing flags affects only the frame.
xdp.rxq is local to NAPI thread, so no worries about atomicity.

> Would you mind elaborating why not handle the RX completely in the NAPI
> context of the original device?

Originally it was difficult to implement the .ndo_xdp_xmit() and
.ndo_xdp_flush() model without creating NAPI in veth. That has changed
now, so I'm not sure how difficult it is at this point.
But in any case I want to avoid stack inflation by veth NAPI. (Imagine
some misconfiguration like calling XDP_TX on both sides of a veth
pair...)

> 
>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>> index fcb033f51d8c..1d1bc6553ff2 100644
>> --- a/include/net/xdp.h
>> +++ b/include/net/xdp.h
>> @@ -41,6 +41,9 @@ enum xdp_mem_type {
>>  MEM_TYPE_MAX,
>>  };
>>  
>> +/* XDP flags for xdp_mem_info */
>> +#define XDP_MEM_RF_NO_DIRECT    BIT(0)  /* don't use napi_direct */
>> +
>>  /* XDP flags for ndo_xdp_xmit */
>>  #define XDP_XMIT_FLUSH  (1U << 0)   /* doorbell signal consumer */
>>  #define XDP_XMIT_FLAGS_MASK XDP_XMIT_FLUSH
>> @@ -48,6 +51,7 @@ enum xdp_mem_type {
>>  struct xdp_mem_info {
>>  u32 type; /* enum xdp_mem_type, but known size type */
>>  u32 id;
>> +u32 flags;
>>  };
>>  
>>  struct page_pool;
>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>> index 57285383ed00..1426c608fd75 100644
>> --- a/net/core/xdp.c
>> +++ b/net/core/xdp.c
>> @@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
>>  /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
>>  xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>>  page = virt_to_head_page(data);
>> -if (xa)
>> +if (xa) {
>> +    napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
>>  page_pool_put_page(xa->page_pool, page, napi_direct);
>> -else
>> +} else {
>>  put_page(page);
>> +}
>>  rcu_read_unlock();
>>  break;
>>  case MEM_TYPE_PAGE_SHARED:
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought as calling another XDP program by XDP program using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
...
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> + struct xdp_frame **frames, u32 flags)
>> +{
>> +struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +struct net_device *rcv;
>> +int i, drops = 0;
>> +
>> +if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +return -EINVAL;
>> +
>> +rcv = rcu_dereference(priv->peer);
>> +if (unlikely(!rcv))
>> +return -ENXIO;
>> +
>> +rcv_priv = netdev_priv(rcv);
>> +/* xdp_ring is initialized on receive side? */
>> +if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +return -ENXIO;
>> +
>> +spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +for (i = 0; i < n; i++) {
>> +struct xdp_frame *frame = frames[i];
>> +void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> + __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  

Actually it is guarded by RCU. On closing the device rcv_priv->xdp_prog
is set to NULL, and synchronize_net() is called from within
netif_napi_del(). Then the ptr_ring is cleaned up.
xdp_ok_fwd_dev() is doing the same check as the non-XDP case, but it may
not be appropriate here because, as you say, the IFF_UP check is not
usable in this context.
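
A condensed sketch of the close-side ordering described above (using the
names from this series; not the literal close path):

	rcu_assign_pointer(priv->xdp_prog, NULL);
	napi_disable(&priv->xdp_napi);
	netif_napi_del(&priv->xdp_napi);	/* does synchronize_net() */
	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);

so any producer that observed a non-NULL xdp_prog under RCU has finished
before the ring is destroyed.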

> 
>> +xdp_return_frame_rx_napi(frame);
>> +drops++;
>> +}
>> +}
>> +spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +if (flags & XDP_XMIT_FLUSH)
>> +__veth_xdp_flush(rcv_priv);
>> +
>> +return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,
>> +.ndo_xdp_xmit   = veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought as calling another XDP program by XDP program using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
> 
> Often we can't redirect to devices which don't have an xdp program
> installed.  In your case we can't redirect unless the peer of the
> target doesn't have a program installed?  :(

Right. I tried to avoid this case by converting xdp_frames to skb but
realized that should not be done.
https://patchwork.ozlabs.org/patch/903536/

> Perhaps it is time to reconsider what Saeed once asked for, a flag or
> attribute to enable being the destination of a XDP_REDIRECT.

Yes, something will be necessary. Jesper said Tariq had some ideas to
implement it.

> 
>> v2:
>> - Drop the part converting xdp_frame into skb when XDP is not enabled.
>> - Implement bulk interface of ndo_xdp_xmit.
>> - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
>>
>> Signed-off-by: Toshiaki Makita 
>> ---
>>  drivers/net/veth.c | 45 +
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 4be75c58bc6a..57187e955fea 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -17,6 +17,7 @@
>>  #include <net/rtnetlink.h>
>>  #include <net/dst.h>
>>  #include <net/xfrm.h>
>> +#include <net/xdp.h>
>>  #include <linux/veth.h>
>>  #include <linux/module.h>
>>  #include <linux/bpf.h>
>> @@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
>>  return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
>>  }
>>  
>> +static void *veth_xdp_to_ptr(void *ptr)
>> +{
>> +return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
>> +}
>> +
>>  static void veth_ptr_free(void *ptr)
>>  {
>>  if (veth_is_xdp_frame(ptr))
>> @@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>>  return skb;
>>  }
>>  
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> + struct xdp_frame **frames, u32 flags)
>> +{
>> +struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +struct net_device *rcv;
>> +int i, drops = 0;
>> +
>> +if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +return -EINVAL;
>> +
>> +rcv = rcu_dereference(priv->peer);
>> +if (unlikely(!rcv))
>> +return -ENXIO;
>> +
>> +rcv_priv = netdev_priv(rcv);
>> +/* xdp_ring is initialized on receive side? */
>> +if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +return -ENXIO;
>> +
>> +spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +for (i = 0; i < n; i++) {
>> +struct xdp_frame *frame = frames[i];
>> +void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> + __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  
> 
>> +xdp_return_frame_rx_napi(frame);
>> +drops++;
>> +}
>> +}
>> +spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +if (flags & XDP_XMIT_FLUSH)
>> +__veth_xdp_flush(rcv_priv);
>> +
>> +return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,
>> +.ndo_xdp_xmit   = veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 9:19, kbuild test robot wrote:
> Hi Toshiaki,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on bpf-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 
> master
> config: i386-randconfig-x001-201829 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>In file included from include/linux/kernel.h:10:0,
> from include/linux/list.h:9,
> from include/linux/timer.h:5,
> from include/linux/netdevice.h:28,
> from drivers//net/veth.c:11:
>drivers//net/veth.c: In function 'veth_xdp_xmit':
>>> drivers//net/veth.c:300:16: error: implicit declaration of function 
>>> 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
>   if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||

This is because this series depends on commit d8d7218ad842 ("xdp:
XDP_REDIRECT should check IFF_UP and MTU") which is currently in DaveM's
net-next tree, as I noted in the cover letter.

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 9:27, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> All oversized packets including GSO packets are dropped if XDP is
>> enabled on receiver side, so don't send such packets from peer.
>>
>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>> segment packets with XDP enabled. Also cap MTU accordingly.
>>
>> Signed-off-by: Toshiaki Makita 
> 
> Is there any precedence for fixing up features and MTU like this?  Most
> drivers just refuse to install the program if settings are incompatible.

I don't know of any precedent. I can refuse the program on installing it
when features and MTU are not appropriate. Is that preferred?
Note that with the current implementation wanted_features are not
touched, so features will be restored when the XDP program is removed.
The MTU will not be restored though, as I do not remember the original
MTU.


>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 78fa08cb6e24..f5b72e937d9d 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
>>  return iflink;
>>  }
>>  
>> +static netdev_features_t veth_fix_features(struct net_device *dev,
>> +   netdev_features_t features)
>> +{
>> +struct veth_priv *priv = netdev_priv(dev);
>> +struct net_device *peer;
>> +
>> +peer = rtnl_dereference(priv->peer);
>> +if (peer) {
>> +struct veth_priv *peer_priv = netdev_priv(peer);
>> +
>> +if (peer_priv->_xdp_prog)
>> +features &= ~NETIF_F_GSO_SOFTWARE;
>> +}
>> +
>> +return features;
>> +}
>> +
>>  static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>>  {
>>  struct veth_priv *peer_priv, *priv = netdev_priv(dev);
>> @@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>>  goto err;
>>  }
>>  }
>> +
>> +if (!old_prog) {
>> +peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>> +peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
>> +peer->hard_header_len -
>> +SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (peer->mtu > peer->max_mtu)
>> +dev_set_mtu(peer, peer->max_mtu);
>> +}
>>  }
>>  
>>  if (old_prog) {
>> -if (!prog && dev->flags & IFF_UP)
>> -veth_disable_xdp(dev);
>> +if (!prog) {
>> +if (dev->flags & IFF_UP)
>> +veth_disable_xdp(dev);
>> +
>> +if (peer) {
>> +peer->hw_features |= NETIF_F_GSO_SOFTWARE;
>> +peer->max_mtu = ETH_MAX_MTU;
>> +}
>> +}
>>  bpf_prog_put(old_prog);
>>  }
>>  
>> +if ((!!old_prog ^ !!prog) && peer)
>> +netdev_update_features(peer);
>> +
>>  return 0;
>>  err:
>>  priv->_xdp_prog = old_prog;
>> @@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_poll_controller= veth_poll_controller,
>>  #endif
>>  .ndo_get_iflink = veth_get_iflink,
>> +.ndo_fix_features   = veth_fix_features,
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 2/8] veth: Add driver XDP

2018-07-23 Thread Toshiaki Makita
Hi Jakub,

Thanks for reviewing!

On 2018/07/24 9:23, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:02 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This is the basic implementation of veth driver XDP.
>>
>> Incoming packets are sent from the peer veth device in the form of skb,
>> so this is generally doing the same thing as generic XDP.
>>
>> This itself is not so useful, but a starting point to implement other
>> useful veth XDP features like TX and REDIRECT.
>>
>> This introduces NAPI when XDP is enabled, because XDP is now heavily
>> relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
>> enqueues packets to the ring and peer NAPI handler drains the ring.
>>
>> Currently only one ring is allocated for each veth device, so it does
>> not scale on multiqueue env. This can be resolved by allocating rings
>> on the per-queue basis later.
>>
>> Note that NAPI is not used but netif_rx is used when XDP is not loaded,
>> so this does not change the default behaviour.
>>
>> v3:
>> - Fix race on closing the device.
>> - Add extack messages in ndo_bpf.
>>
>> v2:
>> - Squashed with the patch adding NAPI.
>> - Implement adjust_tail.
>> - Don't acquire consumer lock because it is guarded by NAPI.
>> - Make poll_controller noop since it is unnecessary.
>> - Register rxq_info on enabling XDP rather than on opening the device.
>>
>> Signed-off-by: Toshiaki Makita 
> 
>> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
>> +struct sk_buff *skb)
>> +{
>> +u32 pktlen, headroom, act, metalen;
>> +void *orig_data, *orig_data_end;
>> +int size, mac_len, delta, off;
>> +struct bpf_prog *xdp_prog;
>> +struct xdp_buff xdp;
>> +
>> +rcu_read_lock();
>> +xdp_prog = rcu_dereference(priv->xdp_prog);
>> +if (unlikely(!xdp_prog)) {
>> +rcu_read_unlock();
>> +goto out;
>> +}
>> +
>> +mac_len = skb->data - skb_mac_header(skb);
>> +pktlen = skb->len + mac_len;
>> +size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
>> +   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (size > PAGE_SIZE)
>> +goto drop;
>> +
>> +headroom = skb_headroom(skb) - mac_len;
>> +if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> +skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
>> +struct sk_buff *nskb;
>> +void *head, *start;
>> +struct page *page;
>> +int head_off;
>> +
>> +page = alloc_page(GFP_ATOMIC);
>> +if (!page)
>> +goto drop;
>> +
>> +head = page_address(page);
>> +start = head + VETH_XDP_HEADROOM;
>> +if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
>> +page_frag_free(head);
>> +goto drop;
>> +}
>> +
>> +nskb = veth_build_skb(head,
>> +  VETH_XDP_HEADROOM + mac_len, skb->len,
>> +  PAGE_SIZE);
>> +if (!nskb) {
>> +page_frag_free(head);
>> +goto drop;
>> +}
> 
>> +static int veth_enable_xdp(struct net_device *dev)
>> +{
>> +struct veth_priv *priv = netdev_priv(dev);
>> +int err;
>> +
>> +if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
>> +err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
>> +if (err < 0)
>> +return err;
>> +
>> +err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
>> + MEM_TYPE_PAGE_SHARED, NULL);
> 
> nit: doesn't matter much but looks like a mix of MEM_TYPE_PAGE_SHARED
>  and MEM_TYPE_PAGE_ORDER0

Actually I'm not sure when to use MEM_TYPE_PAGE_ORDER0. It seems a page
allocated by alloc_page() can be freed by page_frag_free() and it is
more lightweight than put_page(), isn't it?
virtio_net is doing it in a similar way.

-- 
Toshiaki Makita



[PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct via the XDP_MEM_RF_NO_DIRECT flag.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 57187e955fea..0323a4ca74e2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX    BIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsigned requested_headroom;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
@@ -311,10 +316,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -338,6 +375,31 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -362,12 +424,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int *xdp_xmit)

[PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This is preparation for XDP TX and ndo_xdp_xmit.
This allows the napi handler to handle xdp_frames through the xdp ring
as well as sk_buffs.
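
Both pointer types share one ptr_ring by tagging bit 0 of the pointer,
which is always clear for skb and xdp_frame pointers due to their
alignment. The round trip looks like this (a sketch; veth_xdp_to_ptr()
is only added by a later patch in this series):

	void *ptr = veth_xdp_to_ptr(frame);	/* set VETH_XDP_FLAG (bit 0) */
	...
	if (veth_is_xdp_frame(ptr))
		frame = veth_ptr_to_xdp(ptr);	/* clear the tag again */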

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5b72e937d9d..4be75c58bc6a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
-#include <linux/skb_array.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION    "1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   memset(frame, 0, sizeof(*frame));
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -358,12 +431,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+   void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(>xdp_napi, skb);
@@ -416,7 +493,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(>xdp_napi);
netif_napi_del(>xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(>xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(>xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
2.14.3



[PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as calling another XDP program from an XDP
program using REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4be75c58bc6a..57187e955fea 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include <net/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/xdp.h>
 #include <linux/veth.h>
 #include <linux/module.h>
 #include <linux/bpf.h>
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3



[PATCH v3 bpf-next 0/8] veth: Driver XDP

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This patch set introduces driver XDP for veth.
Basically this is used in conjunction with the redirect action of
another XDP program.

  NIC ---> veth===veth
 (XDP)   (redirect)   (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.

Envisioned use-cases


* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.

Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-7: Add XDP_TX and XDP_REDIRECT.
 - patch 8: Performance optimization for multi-queue env.

Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).
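
The test program itself is not included here; a minimal stand-in for
the DROP case, counting packets in a per-CPU array map before dropping
(the map and function names are illustrative), might look like:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } pkt_cnt SEC(".maps");

  SEC("xdp")
  int xdp_drop_count(struct xdp_md *ctx)
  {
          __u32 key = 0;
          __u64 *val = bpf_map_lookup_elem(&pkt_cnt, &key);

          if (val)
                  (*val)++;
          return XDP_DROP; /* XDP_TX for the TX rows below */
  }

  char _license[] SEC("license") = "GPL";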

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM, but the performance was not so
great due to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

Also tested doing ifup/down or loading/unloading an XDP program
repeatedly while processing XDP packets in order to check that
enabling/disabling NAPI works as expected, and found no problems.

Note:
This patchset depends on commit d8d7218ad842 ("xdp: XDP_REDIRECT should
check IFF_UP and MTU") which is currently in DaveM's net-next tree.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (8):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  xdp: Add a flag for disabling napi_direct of xdp_return_frame in
xdp_mem_info
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c     | 735 ++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/skbuff.h |   1 +
 include/net/xdp.h      |   4 +
 net/core/skbuff.c      |   3 +-
 net/core/xdp.c         |   6 +-
 5 files changed, 737 insertions(+), 12 deletions(-)

-- 
2.14.3



[PATCH v3 bpf-next 2/8] veth: Add driver XDP

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This in itself is not so useful, but it is a starting point for
implementing other useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate the NIC ring. The Tx
function enqueues packets to the ring and the peer's NAPI handler
drains the ring.
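
As a rough sketch of this scheme (simplified from the actual patch
below; XDP program execution and the rx_notify_masked handshake are
omitted), the peer's NAPI handler just drains the ring:

  /* simplified sketch, not the patch code */
  static int veth_poll_sketch(struct napi_struct *napi, int budget)
  {
          struct veth_priv *priv =
                  container_of(napi, struct veth_priv, xdp_napi);
          int done = 0;

          while (done < budget) {
                  struct sk_buff *skb = ptr_ring_consume(&priv->xdp_ring);

                  if (!skb)
                          break;
                  /* the real handler runs the XDP program here first */
                  napi_gro_receive(napi, skb);
                  done++;
          }

          if (done < budget)
                  napi_complete_done(napi, done);

          return done;
  }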

Currently only one ring is allocated for each veth device, so it does
not scale in a multiqueue environment. This can be resolved later by
allocating rings on a per-queue basis.

Note that when XDP is not loaded, NAPI is not used and netif_rx is
used as before, so the default behaviour is unchanged.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 373 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 366 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..78fa08cb6e24 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/bpf_trace.h>
+#include <net/xdp.h>
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION "1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsignedrequested_headroom;
+   boolrx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,253 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+   struct sk_buff *skb;
+
+   if (!buflen) {
+   

[PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

All oversized packets including GSO packets are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.
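
As a rough worked example of the resulting cap (assuming x86_64 with
4 KiB pages, Ethernet, NET_IP_ALIGN == 0 and a skb_shared_info of
about 320 bytes after SKB_DATA_ALIGN):

  PAGE_SIZE                                           4096
  - VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM 256 + 0)  - 256
  - hard_header_len (ETH_HLEN)                        - 14
  - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))  ~- 320
  ------------------------------------------------------
  max_mtu                                           ~ 3506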

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 41 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 78fa08cb6e24..f5b72e937d9d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+   peer->hard_header_len -
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > peer->max_mtu)
+   dev_set_mtu(peer, peer->max_mtu);
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
2.14.3



[PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

We need some mechanism to disable napi_direct when calling
xdp_return_frame_rx_napi() from certain contexts.
When veth gets support for XDP_REDIRECT, it will redirect packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case the .ndo_xdp_xmit() called from veth redirection uses
an xdp_mem_info which is not guarded by NAPI, because .ndo_xdp_xmit()
is not called directly from the rxq which owns that xdp_mem_info.

This approach introduces a flag in xdp_mem_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used.
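
A hedged sketch of the intended use (the actual veth hook-up comes in
the XDP_TX/REDIRECT patch of this series; "frame" here is a
hypothetical xdp_frame being forwarded on behalf of another device's
rxq):

  struct xdp_mem_info mem = frame->mem; /* copied from the source rxq */

  mem.flags |= XDP_MEM_RF_NO_DIRECT; /* force the non-direct return path */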

Signed-off-by: Toshiaki Makita 
---
 include/net/xdp.h | 4 
 net/core/xdp.c| 6 --
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f51d8c..1d1bc6553ff2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,9 @@ enum xdp_mem_type {
MEM_TYPE_MAX,
 };
 
+/* XDP flags for xdp_mem_info */
+#define XDP_MEM_RF_NO_DIRECT   BIT(0)  /* don't use napi_direct */
+
 /* XDP flags for ndo_xdp_xmit */
 #define XDP_XMIT_FLUSH (1U << 0)   /* doorbell signal consumer */
 #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH
@@ -48,6 +51,7 @@ enum xdp_mem_type {
 struct xdp_mem_info {
u32 type; /* enum xdp_mem_type, but known size type */
u32 id;
+   u32 flags;
 };
 
 struct page_pool;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 57285383ed00..1426c608fd75 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
2.14.3



[PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..f6929688853a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c1a00672ba9..5366d1660e5b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
2.14.3



[PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring

2018-07-22 Thread Toshiaki Makita
From: Toshiaki Makita 

Move the XDP and NAPI related fields in veth_priv to the newly created
veth_rq structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, the rxq is
selected by the current cpu.

When skbs are enqueued from the peer device, the rxq is a one-to-one
mapping of its peer txq. This imposes the restriction that the number
of rxqs must not be less than the number of peer txqs, but it leaves
the possibility of achieving bulk skb xmit in the future, because the
txq lock would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0323a4ca74e2..84482d9901ec 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned intrequested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,11 +289,17 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
struct xdp_frame **frames, u32 flags)

[PATCH net] tun: Fix use-after-free on XDP_TX

2018-07-12 Thread Toshiaki Makita
On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a
negative value. A positive value indicates that the packet is
successfully enqueued to the ptr_ring, so freeing the page causes
use-after-free.
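
In other words, the return convention follows ndo_xdp_xmit(): a
negative value means the frame was not consumed and the caller must
clean up, while a non-negative count means the ring now owns the
buffer. A sketch of the resulting pattern:

  ret = tun_xdp_tx(tun->dev, &xdp);
  if (ret < 0)
          goto err_redirect; /* not queued; we still own the page */
  /* ret >= 0: the ptr_ring owns the buffer now; don't touch it again */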

Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Toshiaki Makita 
---
 drivers/net/tun.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a192a01..f5727ba 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1688,7 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
case XDP_TX:
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
-   if (tun_xdp_tx(tun->dev, &xdp))
+   if (tun_xdp_tx(tun->dev, &xdp) < 0)
goto err_redirect;
rcu_read_unlock();
local_bh_enable();
-- 
1.8.3.1



