Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-12-05 Thread Peter Oskolkov
FWIW, I find the patch really useful - I applied it to my local dev
repo (with minor changes) and use skb_dump() a lot now. It would be
great if it makes its way into net-next in some form.
On Fri, Nov 30, 2018 at 12:15 PM Saeed Mahameed  wrote:
>
> On Thu, 2018-11-22 at 17:45 -0800, Cong Wang wrote:
> > On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed 
> > wrote:
> > > On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > > > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang <
> > > > xiyou.wangc...@gmail.com>
> > > > wrote:
> > > > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <
> > > > > eric.duma...@gmail.com> wrote:
> > > > > >
> > > > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > > > Currently, we only dump a few selected skb fields in
> > > > > > > netdev_rx_csum_fault(). It is not sufficient for debugging checksum
> > > > > > > faults. This patch introduces skb_dump(), which dumps the skb mac
> > > > > > > header, network header and its whole skb->data too.
> > > > > > >
> > > > > > > Cc: Herbert Xu 
> > > > > > > Cc: Eric Dumazet 
> > > > > > > Cc: David Miller 
> > > > > > > Signed-off-by: Cong Wang 
> > > > > > > ---
> > > > > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > > > > +                skb->data, skb->len, false);
> > > > > >
> > > > > > As I mentioned to David, we want all the bytes that were
> > > > > > maybe
> > > > > > already pulled
> > > > > >
> > > > > > (skb->head starting point, not skb->data)
> > > > >
> > > > > Hmm, with mac header and network header, it is effectively from
> > > > > skb->head, no?
> > > > > Is there anything between skb->head and mac header?
> > > >
> > > > Oh, I guess we wanted a single hex dump, or we need some user
> > > > program
> > > > to be able to
> > > > rebuild from different memory zones the original
> > > > CHECKSUM_COMPLETE
> > > > value.
> > > >
> > >
> > > Normally the driver keeps some headroom @skb->head, so the actual
> > > mac
> > > header starts @ skb->head + driver_specific_headroom
> >
> > Good to know, but this headroom isn't covered by skb->csum, so
> > not useful here, right? The skb->csum for mlx5 only covers network
> > header and its payload.
>
> correct
>
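
A minimal sketch of the kind of dump discussed above, starting at skb->head so
that bytes already pulled into headers are included (illustrative only; the
print_hex_dump() arguments are assumed from the snippet quoted above, and this
is not the final upstream skb_dump()):

	static void skb_dump_from_head(const char *level, const struct sk_buff *skb)
	{
		/* Dump the headroom (already-pulled headers) plus the linear data,
		 * i.e. everything from skb->head up to skb->tail. Paged frags are
		 * not covered by this sketch.
		 */
		print_hex_dump(level, "skb head: ", DUMP_PREFIX_OFFSET, 16, 1,
			       skb->head, skb_tail_pointer(skb) - skb->head,
			       false);
	}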


Re: [PATCH net] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

2018-12-05 Thread Peter Oskolkov
On Wed, Dec 5, 2018 at 7:57 AM Jiri Wiesner  wrote:
>
> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
>
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
>
>  PID: 1073   TASK: 880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [880621563d48] _raw_spin_lock at 815f5480
>   #6 [880621563d48] inet_evict_bucket at 8158020b
>   #7 [880621563d80] inet_frags_exit_net at 8158051c
>   #8 [880621563db0] ops_exit_list at 814f5856
>   #9 [880621563dd8] cleanup_net at 814f67c0
>  #10 [880621563e38] process_one_work at 81096f14
>
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
>
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>
> Signed-off-by: Jiri Wiesner 
> Reported-by: Per Sundstrom 
> ---

Acked-by: Peter Oskolkov 

>  net/ipv4/ip_fragment.c  | 7 +++
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +++-
>  net/ipv6/reassembly.c   | 8 +++-
>  3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index d6ee343fdb86..aa0b22697998 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -515,6 +515,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
> *skb,
> struct rb_node *rbn;
> int len;
> int ihlen;
> +   int delta;
> int err;
> u8 ecn;
>
> @@ -556,10 +557,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
> *skb,
> if (len > 65535)
> goto out_oversize;
>
> +   delta = - head->truesize;
> +
> /* Head of list must not be cloned. */
> if (skb_unclone(head, GFP_ATOMIC))
> goto out_nomem;
>
> +   delta += head->truesize;
> +   if (delta)
> +   add_frag_mem_limit(qp->q.net, delta);
> +
> /* If the first fragment is fragmented itself, we split
>  * it to two chunks: the first with data and paged part
>  * and the second, holding only fragments. */
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
> b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index d219979c3e52..181da2c40f9a 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -341,7 +341,7 @@ static bool
>  nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct 
> net_device *dev)
>  {
> struct sk_buff *fp, *head = fq->q.fragments;
> -   int payload_len;
> +   int payload_len, delta;
> u8 ecn;
>
> inet_frag_kill(&fq->q);
> @@ -363,10 +363,16 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff 
> *prev,  struct net_devic
> return false;
> }
>
> +   delta = - head->truesize;
> +
> /* Head of list must not be cloned. */
> if (skb_unclone(head, GFP_ATOMIC))
> return false;
>
> +   delta += head->truesize;
> +   if (delta)
> +   add_frag_mem_limit(fq->q.net, delta);
> +
> /* If the first fragment is fragmented itself, we split
>  * it to two chunks: the first with data and paged part
>  * and the second, holding only fragments. */
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index 5c3c92713096..aa26c45486d9 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -281,7 +281,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct 
> sk_buff *prev,
>  {
> struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
> struct sk_buff *fp, *head = fq->q.fragments;
> -   intpayload_le
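
For reference, the accounting pattern that the three hunks in this patch
open-code can be sketched as a stand-alone helper (hypothetical name, not part
of the submitted patch):

	static int frag_unclone_and_account(struct sk_buff *head,
					    struct netns_frags *nf)
	{
		int delta = -head->truesize;

		/* skb_unclone() may reallocate head and grow head->truesize... */
		if (skb_unclone(head, GFP_ATOMIC))
			return -ENOMEM;

		/* ...so charge any growth back to the per-netns counter that was
		 * originally charged with the old truesize, keeping it balanced.
		 */
		delta += head->truesize;
		if (delta)
			add_frag_mem_limit(nf, delta);
		return 0;
	}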

Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
On Tue, Dec 4, 2018 at 11:11 AM Peter Oskolkov  wrote:
>
> Thanks, Stephen!
>
> I don't care much about braces either. David, do you want me to send a
> new patch with braces moved around?

Sent a v2 with style fixes, just in case.

>
> On Tue, Dec 4, 2018 at 9:56 AM Stephen Hemminger
>  wrote:
> >
> > I like this, it makes a lot of sense since packets are almost
> > always queued in order.
> >
> > Minor style stuff you might want to fix (but don't have to).
> >
> > > + if (!last ||
> > > +     t_last->time_to_send > last->time_to_send) {
> > > + last = t_last;
> > > + }
> >
> > I don't think you need braces here for single assignment.
> >
> > > +static void netem_erase_head(struct netem_sched_data *q, struct sk_buff 
> > > *skb)
> > > +{
> > > + if (skb == q->t_head) {
> > > + q->t_head = skb->next;
> > > + if (!q->t_head)
> > > + q->t_tail = NULL;
> > > + } else
> > > + rb_erase(&skb->rbnode, &q->t_root);
> >
> > Checkpatch wants both sides of if/else to have brackets.
> > Personally, don't care.
> >
> > Reviewed-by: Stephen Hemminger 


[PATCH v2 net-next 0/1] net: netem: use a list _and_ rbtree

2018-12-04 Thread Peter Oskolkov
v2: address style suggestions by Stephen Hemminger.

All changes are noop vs v1.

Peter Oskolkov (1):
  net: netem: use a list in addition to rbtree

 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)



[PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..84658f60a872 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root;
 
+   /* a linear queue; reduces rbtree rebalancing when jitter is low */
+   struct sk_buff  *t_head;
+   struct sk_buff  *t_tail;
+
/* optional qdisc for classful handling (NULL at netem init) */
struct Qdisc*qdisc;
 
@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
+
+   rtnl_kfree_skbs(q->t_head, q->t_tail);
+   q->t_head = NULL;
+   q->t_tail = NULL;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
u64 tnext = netem_skb_cb(nskb)->time_to_send;
-   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
-   while (*p) {
-   struct sk_buff *skb;
-
-   parent = *p;
-   skb = rb_to_skb(parent);
-   if (tnext >= netem_skb_cb(skb)->time_to_send)
-   p = &parent->rb_right;
+   if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+   if (q->t_tail)
+   q->t_tail->next = nskb;
else
-   p = &parent->rb_left;
+   q->t_head = nskb;
+   q->t_tail = nskb;
+   } else {
+   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+   while (*p) {
+   struct sk_buff *skb;
+
+   parent = *p;
+   skb = rb_to_skb(parent);
+   if (tnext >= netem_skb_cb(skb)->time_to_send)
+   p = &parent->rb_right;
+   else
+   p = &parent->rb_left;
+   }
+   rb_link_node(&nskb->rbnode, parent, p);
+   rb_insert_color(&nskb->rbnode, &q->t_root);
}
-   rb_link_node(&nskb->rbnode, parent, p);
-   rb_insert_color(&nskb->rbnode, &q->t_root);
sch->q.qlen++;
 }
 
@@ -530,9 +547,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
t_skb = skb_rb_last(&q->t_root);
t_last = netem_skb_cb(t_skb);
if (!last ||
-   t_last->time_to_send > last->time_to_send) {
+   t_last->time_to_send > last->time_to_send)
+   last = t_last;
+   }
+   if (q->t_tail) {
+   struct netem_skb_cb *t_last =
+   netem_skb_cb(q->t_tail);
+
+   if (!last ||
+   t_last->time_to_send > last->time_to_send)
last = t_last;
-   }
}
 
if (last) {
@@ -611,11 +635,38 @@ static void get_slot_next(struct netem_sched_data *q, u64 
now)
q->slot.bytes_left = q->slot_config.max_bytes;
 }
 
+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+   struct sk_buff *skb = skb_rb_first(&q->t_root);
+   u64 t1, t2;
+
+   if (!skb)
+   return q->t_head;
+   if (!q->t_head)
+   return skb;
+
+   t1 = netem_skb_cb(skb)->time_to_send;
+   t2 = netem_skb_cb(q->t_head)->time_to_send;
+   if (t1 < t2)
+   return skb;
+   return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+   if (skb == q->t_head) {
+   q->t_head = skb->next;
+   if (!q->t_head)
+   q->t_tail = NULL;
+   } else {
+   rb_erase(&skb->rbnode, &q->t_root);
+   }
+}
+
 static st
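
For orientation, a hedged sketch of how a dequeue path can consume from both
structures through the two helpers above (simplified; this is not the exact
body of netem_dequeue(), and it omits the qdisc's error and watchdog handling):

	static struct sk_buff *netem_dequeue_one(struct Qdisc *sch)
	{
		struct netem_sched_data *q = qdisc_priv(sch);
		struct sk_buff *skb = netem_peek(q);	/* earlier of rbtree head and list head */

		if (!skb || netem_skb_cb(skb)->time_to_send > ktime_get_ns())
			return NULL;			/* nothing due yet */

		netem_erase_head(q, skb);	/* unlink from whichever structure held it */
		skb->next = NULL;
		qdisc_qstats_backlog_dec(sch, skb);
		sch->q.qlen--;
		return skb;
	}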

Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
Thanks, Stephen!

I don't care much about braces either. David, do you want me to send a
new patch with braces moved around?

On Tue, Dec 4, 2018 at 9:56 AM Stephen Hemminger
 wrote:
>
> I like this, it makes a lot of sense since packets are almost
> always queued in order.
>
> Minor style stuff you might want to fix (but don't have to).
>
> > + if (!last ||
> > +     t_last->time_to_send > last->time_to_send) {
> > + last = t_last;
> > + }
>
> I don't think you need braces here for single assignment.
>
> > +static void netem_erase_head(struct netem_sched_data *q, struct sk_buff 
> > *skb)
> > +{
> > + if (skb == q->t_head) {
> > + q->t_head = skb->next;
> > + if (!q->t_head)
> > + q->t_tail = NULL;
> > + } else
> > + rb_erase(&skb->rbnode, &q->t_root);
>
> Checkpatch wants both sides of if/else to have brackets.
> Personally, don't care.
>
> Reviewed-by: Stephen Hemminger 


[PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-03 Thread Peter Oskolkov
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/sched/sch_netem.c | 86 ++-
 1 file changed, 68 insertions(+), 18 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..f1099486ecd3 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root;
 
+   /* a linear queue; reduces rbtree rebalancing when jitter is low */
+   struct sk_buff  *t_head;
+   struct sk_buff  *t_tail;
+
/* optional qdisc for classful handling (NULL at netem init) */
struct Qdisc*qdisc;
 
@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
+
+   rtnl_kfree_skbs(q->t_head, q->t_tail);
+   q->t_head = NULL;
+   q->t_tail = NULL;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
u64 tnext = netem_skb_cb(nskb)->time_to_send;
-   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
-   while (*p) {
-   struct sk_buff *skb;
-
-   parent = *p;
-   skb = rb_to_skb(parent);
-   if (tnext >= netem_skb_cb(skb)->time_to_send)
-   p = &parent->rb_right;
+   if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+   if (q->t_tail)
+   q->t_tail->next = nskb;
else
-   p = &parent->rb_left;
+   q->t_head = nskb;
+   q->t_tail = nskb;
+   } else {
+   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+   while (*p) {
+   struct sk_buff *skb;
+
+   parent = *p;
+   skb = rb_to_skb(parent);
+   if (tnext >= netem_skb_cb(skb)->time_to_send)
+   p = &parent->rb_right;
+   else
+   p = &parent->rb_left;
+   }
+   rb_link_node(&nskb->rbnode, parent, p);
+   rb_insert_color(&nskb->rbnode, &q->t_root);
}
-   rb_link_node(&nskb->rbnode, parent, p);
-   rb_insert_color(&nskb->rbnode, &q->t_root);
sch->q.qlen++;
 }
 
@@ -534,6 +551,15 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
last = t_last;
}
}
+   if (q->t_tail) {
+   struct netem_skb_cb *t_last =
+   netem_skb_cb(q->t_tail);
+
+   if (!last ||
+   t_last->time_to_send > last->time_to_send) {
+   last = t_last;
+   }
+   }
 
if (last) {
/*
@@ -611,11 +637,37 @@ static void get_slot_next(struct netem_sched_data *q, u64 
now)
q->slot.bytes_left = q->slot_config.max_bytes;
 }
 
+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+   struct sk_buff *skb = skb_rb_first(&q->t_root);
+   u64 t1, t2;
+
+   if (!skb)
+   return q->t_head;
+   if (!q->t_head)
+   return skb;
+
+   t1 = netem_skb_cb(skb)->time_to_send;
+   t2 = netem_skb_cb(q->t_head)->time_to_send;
+   if (t1 < t2)
+   return skb;
+   return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+   if (skb == q->t_head) {
+   q->t_head = skb->next;
+   if (!q->t_head)
+   q->t_tail = NULL;
+   } else
+   rb_erase(&skb->rbnode, &q->t_root);
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
-   struct rb_node *p;
 
 tfifo_dequeue:
skb = __qdisc_dequeue_head(&sch->q);
@@ -625,20 +677,18 @@ static struct sk_buff *netem_dequeue(struct 

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-12-03 Thread Peter Oskolkov
Our case is a bit different - it is more like using the SRH header in
IPv6 to route packets via a non-default intermediate hop. But I see
your point - I'll expand the patchset to cover IPv6 and the ingress
path.
On Mon, Dec 3, 2018 at 8:04 AM David Ahern  wrote:
>
> On 11/30/18 5:14 PM, Peter Oskolkov wrote:
> > On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
> >>
> >> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> >>> Thanks, David! This is for egress only, so I'll add an appropriate
> >>> check. I'll also address your other comments/concerns in a v2 version
> >>> of this patchset.
> >>
> >> Why are you limiting this to egress only?
> >
> > I'm focusing on egress because this is a use case that we have and
> > understand well, and would like to have a solution for sooner rather
> > than later.
> >
> > Without understanding why anybody would want to do lwt-bpf encap on
> > ingress, I don't know, for example, what a good test would look like;
> > but I'd be happy to look into a specific ingress use case if you have
> > one.
> >
>
> We cannot have a proliferation of helpers for a lot of one-off use cases.
> A little thought now makes this helper useful for more than just your one
> use case. And IPv6 parity should be a minimal requirement for helpers.
>
> Based on your route lookup, I presume your use case is capturing certain
> local traffic, pushing a custom header, and sending that packet elsewhere.
> The same could be done on the ingress path.


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
>
> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> > Thanks, David! This is for egress only, so I'll add an appropriate
> > check. I'll also address your other comments/concerns in a v2 version
> > of this patchset.
>
> Why are you limiting this to egress only?

I'm focusing on egress because this is a use case that we have and
understand well, and would like to have a solution for sooner rather
than later.

Without understanding why anybody would want to do lwt-bpf encap on
ingress, I don't know, for example, what a good test would look like;
but I'd be happy to look into a specific ingress use case if you have
one.


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
Thanks, David! This is for egress only, so I'll add an appropriate
check. I'll also address your other comments/concerns in a v2 version
of this patchset.
On Fri, Nov 30, 2018 at 12:08 PM David Ahern  wrote:
>
> On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> > On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
> >>
> >> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> >>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>> index bd0df75dc7b6..17f3c37218e5 100644
> >>> --- a/net/core/filter.c
> >>> +++ b/net/core/filter.c
> >>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff 
> >>> *skb, u32 type, void *hdr, u32 len
> >>>  }
> >>>  #endif /* CONFIG_IPV6_SEG6_BPF */
> >>>
> >>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> >>> +{
> >>> + struct dst_entry *dst;
> >>> + struct rtable *rt;
> >>> + struct iphdr *iph;
> >>> + struct net *net;
> >>> + int err;
> >>> +
> >>> + if (skb->protocol != htons(ETH_P_IP))
> >>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> >>> +
> >>> + iph = (struct iphdr *)hdr;
> >>> +
> >>> + if (unlikely(len < sizeof(struct iphdr) || len > 
> >>> LWTUNNEL_MAX_ENCAP_HSIZE))
> >>> + return -EINVAL;
> >>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> >>> + return -EINVAL;
> >>> +
> >>> + if (skb->sk)
> >>> + net = sock_net(skb->sk);
> >>> + else {
> >>> + net = dev_net(skb_dst(skb)->dev);
> >>> + }
> >>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
> >>
> >> That is a very limited use case. e.g., oif = 0 means you are not
> >> considering any kind of policy routing (e.g., VRF).
> >
> > Hi David! Could you be a bit more specific re: what you would like to
> > see here? Thanks!
> >
>
> Is the encap happening on ingress or egress? Seems like the current code
> does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
> BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
> you should be filling in a flow struct and doing a proper lookup.
>
> When the potentially custom encap header is pushed on, seems to me skb
> marks should still be honored for the route lookup. If not, you should
> handle that in the API.
>
> From there skb->dev at a minimum should be used as either iif (ingress)
> or oif (egress).
>
> The iph is already set so you have quick access to the tos.
>
> Also, this should implement IPv6 as well before going in.
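
A hedged sketch of the fuller lookup being asked for, honoring skb->mark, the
ToS of the pushed header, and the egress device (illustrative only, IPv4
egress case, reusing the net pointer computed earlier in the function):

	struct flowi4 fl4 = {
		.flowi4_mark  = skb->mark,
		.flowi4_tos   = RT_TOS(iph->tos),
		.flowi4_oif   = skb_dst(skb) ? skb_dst(skb)->dev->ifindex : 0,
		.flowi4_proto = iph->protocol,
		.daddr        = iph->daddr,
		.saddr        = iph->saddr,
	};
	struct rtable *rt = ip_route_output_key(net, &fl4);

	if (IS_ERR(rt) || rt->dst.error)
		return -EINVAL;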


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov
On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
>
> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index bd0df75dc7b6..17f3c37218e5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, 
> > u32 type, void *hdr, u32 len
> >  }
> >  #endif /* CONFIG_IPV6_SEG6_BPF */
> >
> > +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> > +{
> > + struct dst_entry *dst;
> > + struct rtable *rt;
> > + struct iphdr *iph;
> > + struct net *net;
> > + int err;
> > +
> > + if (skb->protocol != htons(ETH_P_IP))
> > + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> > +
> > + iph = (struct iphdr *)hdr;
> > +
> > + if (unlikely(len < sizeof(struct iphdr) || len > 
> > LWTUNNEL_MAX_ENCAP_HSIZE))
> > + return -EINVAL;
> > + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> > + return -EINVAL;
> > +
> > + if (skb->sk)
> > + net = sock_net(skb->sk);
> > + else {
> > + net = dev_net(skb_dst(skb)->dev);
> > + }
> > + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
>
> That is a very limited use case. e.g., oif = 0 means you are not
> considering any kind of policy routing (e.g., VRF).

Hi David! Could you be a bit more specific re: what you would like to
see here? Thanks!


[PATCH bpf-next 2/2] selftests/bpf: add test_lwt_ip_encap selftest

2018-11-28 Thread Peter Oskolkov
This patch adds a sample/selftest that covers BPF_LWT_ENCAP_IP option
added in the first patch in the series.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/bpf/Makefile  |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  65 ++
 .../selftests/bpf/test_lwt_ip_encap.sh| 114 ++
 3 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 73aa6d8f4a2f..044fcdbc9864 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -39,7 +39,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o 
\
-   xdp_dummy.o test_map_in_map.o
+   xdp_dummy.o test_map_in_map.o test_lwt_ip_encap.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -53,7 +53,8 @@ TEST_PROGS := test_kmod.sh \
test_lirc_mode2.sh \
test_skb_cgroup_id.sh \
test_flow_dissector.sh \
-   test_xdp_vlan.sh
+   test_xdp_vlan.sh \
+   test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh
 
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c 
b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index ..967db922dcc6
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define BPF_LWT_ENCAP_IP 2
+
+struct iphdr {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+   __u8ihl:4,
+   version:4;
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+   __u8version:4,
+   ihl:4;
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+   __u8tos;
+   __be16  tot_len;
+   __be16  id;
+   __be16  frag_off;
+   __u8ttl;
+   __u8protocol;
+   __sum16 check;
+   __be32  saddr;
+   __be32  daddr;
+};
+
+struct grehdr {
+   __be16 flags;
+   __be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+   char encap_header[24];
+   int err;
+   struct iphdr *iphdr = (struct iphdr *)encap_header;
+   struct grehdr *greh = (struct grehdr *)(encap_header + sizeof(struct 
iphdr));
+
+   memset(encap_header, 0, sizeof(encap_header));
+
+   iphdr->ihl = 5;
+   iphdr->version = 4;
+   iphdr->tos = 0;
+   iphdr->ttl = 0x40;
+   iphdr->protocol = 47;  /* IPPROTO_GRE */
+   iphdr->saddr = 0x640110ac;  /* 172.16.1.100 */
+   iphdr->daddr = 0x640310ac;  /* 172.16.3.100 */
+   iphdr->check = 0;
+   iphdr->tot_len = bpf_htons(skb->len + sizeof(encap_header));
+
+   greh->protocol = bpf_htons(0x800);
+
+   err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, (void *)encap_header,
+sizeof(encap_header));
+   if (err)
+   return BPF_DROP;
+
+   return BPF_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh 
b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index ..4c32b754bf96
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup:
+# - create VETH1/VETH2 veth
+# - VETH1 gets IP_SRC
+# - create netns NS
+# - move VETH2 to NS, add IP_DST
+# - in NS, create gre tunnel GREDEV, add IP_GRE
+# - in NS, configure GREDEV to route to IP_DST from IP_SRC
+# - configure route to IP_GRE via VETH1
+#   (note: there is no route to IP_DST from root/init ns)
+#
+# Test:
+# - listen on IP_DST
+# - send a packet to IP_DST: the listener does not get it
+# - add LWT_XMIT bpf to IP_DST that gre-encaps all packets to IP_GRE
+# - send a packet to IP_DST: the listener gets it
+
+
+# set -x  # debug ON
+set +x  # debug OFF
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+   echo "This script must be run as root"
+   echo "FAIL"
+   exit 1
+fi
+
+readonly NS="ns-ip-encap-$(mktemp -u XX)"
+readonly OUT=$(mktemp /tmp/test_lwt_ip_incap.XX)
+
+readonly NET_SRC="172.16.1.0"
+
+readonly IP_SRC="172.16.1.100"
+readonly IP_DST="172.16.2.100"
+readonly IP_GRE="172.16.3.100"
+
+readonly PORT=
+readonly MSG="foo_bar"
+
+PID1=0
+PID2=0
+
+setup() {
+   ip link add veth1 type veth peer 

[PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov
This patch enables BPF programs (specifically, of LWT_XMIT type)
to add IP encapsulation headers to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

A future patch will enable IPv6 encapping (and IPv4/IPv6 cross-routing).

Tested: see the second patch in the series.

Signed-off-by: Peter Oskolkov 
---
 include/net/lwtunnel.h   |  2 ++
 include/uapi/linux/bpf.h |  7 -
 net/core/filter.c| 58 
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..6a1c5c2f16d5 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -16,6 +16,8 @@
 #define LWTUNNEL_STATE_INPUT_REDIRECT  BIT(1)
 #define LWTUNNEL_STATE_XMIT_REDIRECT   BIT(2)
 
+#define LWTUNNEL_MAX_ENCAP_HSIZE   80
+
 enum {
LWTUNNEL_XMIT_DONE,
LWTUNNEL_XMIT_CONTINUE,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597afdbc1ab9..6f2efe2dca9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1998,6 +1998,10 @@ union bpf_attr {
  * Only works if *skb* contains an IPv6 packet. Insert a
  * Segment Routing Header (**struct ipv6_sr_hdr**) inside
  * the IPv6 header.
+ * **BPF_LWT_ENCAP_IP**
+ * IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ * must be IPv4, followed by zero, one, or more additional
+ * headers.
  *
  * A call to this helper is susceptible to change the underlaying
  * packet buffer. Therefore, at load time, all checks on pointers
@@ -2444,7 +2448,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_SEG6,
-   BPF_LWT_ENCAP_SEG6_INLINE
+   BPF_LWT_ENCAP_SEG6_INLINE,
+   BPF_LWT_ENCAP_IP,
 };
 
 /* user accessible mirror of in-kernel sk_buff.
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..17f3c37218e5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 
type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
+{
+   struct dst_entry *dst;
+   struct rtable *rt;
+   struct iphdr *iph;
+   struct net *net;
+   int err;
+
+   if (skb->protocol != htons(ETH_P_IP))
+   return -EINVAL;  /* ETH_P_IPV6 not yet supported */
+
+   iph = (struct iphdr *)hdr;
+
+   if (unlikely(len < sizeof(struct iphdr) || len > 
LWTUNNEL_MAX_ENCAP_HSIZE))
+   return -EINVAL;
+   if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
+   return -EINVAL;
+
+   if (skb->sk)
+   net = sock_net(skb->sk);
+   else {
+   net = dev_net(skb_dst(skb)->dev);
+   }
+   rt = ip_route_output(net, iph->daddr, 0, 0, 0);
+   if (IS_ERR(rt) || rt->dst.error)
+   return -EINVAL;
+   dst = &rt->dst;
+
+   skb_reset_inner_headers(skb);
+   skb->encapsulation = 1;
+
+   err = skb_cow_head(skb, len + LL_RESERVED_SPACE(dst->dev));
+   if (unlikely(err))
+   return err;
+
+   skb_push(skb, len);
+   skb_reset_network_header(skb);
+
+   iph = ip_hdr(skb);
+   memcpy(iph, hdr, len);
+
+   bpf_compute_data_pointers(skb);
+   if (iph->ihl * 4 < len)
+   skb_set_transport_header(skb, iph->ihl * 4);
+   skb->protocol = htons(ETH_P_IP);
+   if (!iph->check)
+   iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
+
+   skb_dst_drop(skb);
+   dst_hold(dst);
+   skb_dst_set(skb, dst);
+   return 0;
+}
+
 BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
   u32, len)
 {
@@ -4802,6 +4856,8 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, 
u32, type, void *, hdr,
case BPF_LWT_ENCAP_SEG6_INLINE:
return bpf_push_seg6_encap(skb, type, hdr, len);
 #endif
+   case BPF_LWT_ENCAP_IP:
+   return bpf_push_ip_encap(skb, hdr, len);
default:
return -EINVAL;
}
@@ -5687,6 +5743,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_l4_csum_replace_proto;
case BPF_FUNC_set_hash_invalid:
return &bpf_set_hash_invalid_proto;
+   case BPF_FUNC_lwt_push_encap

[PATCH net-next 2/3] net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net

2018-09-21 Thread Peter Oskolkov
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=900
 net.ipv4.ipfrag_high_thresh = 900

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 900

 # sysctl -w net.ipv6.ip6frag_high_thresh=900
 net.ipv6.ip6frag_high_thresh = 900

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 900

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=900
 net.ipv4.ipfrag_high_thresh = 900

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=900
 net.ipv6.ip6frag_high_thresh = 900

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304

Signed-off-by: Peter Oskolkov 
---
 net/ieee802154/6lowpan/reassembly.c | 1 -
 net/ipv4/ip_fragment.c  | 1 -
 net/ipv6/reassembly.c   | 1 -
 3 files changed, 3 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 09ffbf5ce8fa..d14226ecfde4 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -463,7 +463,6 @@ static int __net_init 
lowpan_frags_ns_sysctl_register(struct net *net)
 
table[0].data = &ieee802154_lowpan->frags.high_thresh;
table[0].extra1 = &ieee802154_lowpan->frags.low_thresh;
-   table[0].extra2 = &init_net.ieee802154_lowpan.frags.high_thresh;
table[1].data = &ieee802154_lowpan->frags.low_thresh;
table[1].extra2 = &ieee802154_lowpan->frags.high_thresh;
table[2].data = &ieee802154_lowpan->frags.timeout;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 13f4d189e12b..9b0158fa431f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -822,7 +822,6 @@ static int __net_init ip4_frags_ns_ctl_register(struct net 
*net)
 
table[0].data = &net->ipv4.frags.high_thresh;
table[0].extra1 = &net->ipv4.frags.low_thresh;
-   table[0].extra2 = &init_net.ipv4.frags.high_thresh;
table[1].data = &net->ipv4.frags.low_thresh;
table[1].extra2 = &net->ipv4.frags.high_thresh;
table[2].data = &net->ipv4.frags.timeout;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 536c1d172cba..5c3c92713096 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -554,7 +554,6 @@ static int __net_init ip6_frags_ns_sysctl_register(struct 
net *net)
 
table[0].data = &net->ipv6.frags.high_thresh;
table[0].extra1 = &net->ipv6.frags.low_thresh;
-   table[0].extra2 = &init_net.ipv6.frags.high_thresh;
table[1].data = &net->ipv6.frags.low_thresh;
table[1].extra2 = &net->ipv6.frags.high_thresh;
table[2].data = &net->ipv6.frags.timeout;
-- 
2.19.0.444.g18242da7ef-goog
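
With the extra2 assignments removed, the per-netns high_thresh rows end up
bounded only from below. A rough sketch of the resulting ipv4 entry (only the
fields visible in the hunks above; everything else is unchanged kernel code):

	/* ipfrag_high_thresh after this patch: no init_net upper bound, so a
	 * child namespace may set a value larger than the root namespace's.
	 */
	table[0].data   = &net->ipv4.frags.high_thresh;
	table[0].extra1 = &net->ipv4.frags.low_thresh;	/* lower bound only */
	table[0].extra2 = NULL;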



[PATCH net-next 1/3] ipv6: discard IP frag queue on more errors

2018-09-21 Thread Peter Oskolkov
This is similar to how ipv4 now behaves:
commit 0ff89efb5246 ("ip: fail fast on IP defrag errors").

Signed-off-by: Peter Oskolkov 
---
 net/ipv6/reassembly.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index f1b1ff30fe5b..536c1d172cba 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -145,7 +145,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
 */
if (end < fq->q.len ||
((fq->q.flags & INET_FRAG_LAST_IN) && end != fq->q.len))
-   goto err;
+   goto discard_fq;
fq->q.flags |= INET_FRAG_LAST_IN;
fq->q.len = end;
} else {
@@ -162,20 +162,20 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
if (end > fq->q.len) {
/* Some bits beyond end -> corruption. */
if (fq->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_fq;
fq->q.len = end;
}
}
 
if (end == offset)
-   goto err;
+   goto discard_fq;
 
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
-   goto err;
+   goto discard_fq;
 
if (pskb_trim_rcsum(skb, end - offset))
-   goto err;
+   goto discard_fq;
 
/* Find out which fragments are in front and at the back of us
 * in the chain of fragments so far.  We must know where to put
@@ -418,6 +418,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct 
sk_buff *prev,
rcu_read_lock();
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
rcu_read_unlock();
+   inet_frag_kill(&fq->q);
return -1;
 }
 
-- 
2.19.0.444.g18242da7ef-goog



[PATCH net-next 3/3] selftests/net: add ipv6 tests to ip_defrag selftest

2018-09-21 Thread Peter Oskolkov
This patch adds ipv6 defragmentation tests to ip_defrag selftest,
to complement existing ipv4 tests.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/net/ip_defrag.c  | 249 +++
 tools/testing/selftests/net/ip_defrag.sh |  39 ++--
 2 files changed, 190 insertions(+), 98 deletions(-)

diff --git a/tools/testing/selftests/net/ip_defrag.c 
b/tools/testing/selftests/net/ip_defrag.c
index 55fdcdc78eef..2366dc6bce71 100644
--- a/tools/testing/selftests/net/ip_defrag.c
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -23,21 +23,28 @@ static bool cfg_overlap;
 static unsigned short  cfg_port = 9000;
 
 const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) 
};
+const struct in6_addr addr6 = IN6ADDR_LOOPBACK_INIT;
 
 #define IP4_HLEN   (sizeof(struct iphdr))
 #define IP6_HLEN   (sizeof(struct ip6_hdr))
 #define UDP_HLEN   (sizeof(struct udphdr))
 
-static int msg_len;
+/* IPv6 fragment header length. */
+#define FRAG_HLEN  8
+
+static int payload_len;
 static int max_frag_len;
 
 #define MSG_LEN_MAX6   /* Max UDP payload length. */
 
 #define IP4_MF (1u << 13)  /* IPv4 MF flag. */
+#define IP6_MF (1)  /* IPv6 MF flag. */
+
+#define CSUM_MANGLED_0 (0xffff)
 
 static uint8_t udp_payload[MSG_LEN_MAX];
 static uint8_t ip_frame[IP_MAXPACKET];
-static uint16_t ip_id = 0xabcd;
+static uint32_t ip_id = 0xabcd;
 static int msg_counter;
 static int frag_counter;
 static unsigned int seed;
@@ -48,25 +55,25 @@ static void recv_validate_udp(int fd_udp)
ssize_t ret;
static uint8_t recv_buff[MSG_LEN_MAX];
 
-   ret = recv(fd_udp, recv_buff, msg_len, 0);
+   ret = recv(fd_udp, recv_buff, payload_len, 0);
msg_counter++;
 
if (cfg_overlap) {
if (ret != -1)
-   error(1, 0, "recv: expected timeout; got %d; seed = %u",
-   (int)ret, seed);
+   error(1, 0, "recv: expected timeout; got %d",
+   (int)ret);
if (errno != ETIMEDOUT && errno != EAGAIN)
-   error(1, errno, "recv: expected timeout: %d; seed = %u",
-errno, seed);
+   error(1, errno, "recv: expected timeout: %d",
+errno);
return;  /* OK */
}
 
if (ret == -1)
-   error(1, errno, "recv: msg_len = %d max_frag_len = %d",
-   msg_len, max_frag_len);
-   if (ret != msg_len)
-   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, msg_len);
-   if (memcmp(udp_payload, recv_buff, msg_len))
+   error(1, errno, "recv: payload_len = %d max_frag_len = %d",
+   payload_len, max_frag_len);
+   if (ret != payload_len)
+   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, 
payload_len);
+   if (memcmp(udp_payload, recv_buff, payload_len))
error(1, 0, "recv: wrong data");
 }
 
@@ -92,31 +99,95 @@ static uint32_t raw_checksum(uint8_t *buf, int len, 
uint32_t sum)
 static uint16_t udp_checksum(struct ip *iphdr, struct udphdr *udphdr)
 {
uint32_t sum = 0;
+   uint16_t res;
 
sum = raw_checksum((uint8_t *)&iphdr->ip_src, 2 * sizeof(iphdr->ip_src),
-   IPPROTO_UDP + (uint32_t)(UDP_HLEN + msg_len));
-   sum = raw_checksum((uint8_t *)udp_payload, msg_len, sum);
+   IPPROTO_UDP + (uint32_t)(UDP_HLEN + 
payload_len));
+   sum = raw_checksum((uint8_t *)udphdr, UDP_HLEN, sum);
+   sum = raw_checksum((uint8_t *)udp_payload, payload_len, sum);
+   res = 0xffff & ~sum;
+   if (res)
+   return htons(res);
+   else
+   return CSUM_MANGLED_0;
+}
+
+static uint16_t udp6_checksum(struct ip6_hdr *iphdr, struct udphdr *udphdr)
+{
+   uint32_t sum = 0;
+   uint16_t res;
+
+   sum = raw_checksum((uint8_t *)&iphdr->ip6_src, 2 * 
sizeof(iphdr->ip6_src),
+   IPPROTO_UDP);
+   sum = raw_checksum((uint8_t *)&udphdr->len, sizeof(udphdr->len), sum);
sum = raw_checksum((uint8_t *)udphdr, UDP_HLEN, sum);
-   return htons(0xffff & ~sum);
+   sum = raw_checksum((uint8_t *)udp_payload, payload_len, sum);
+   res = 0xffff & ~sum;
+   if (res)
+   return htons(res);
+   else
+   return CSUM_MANGLED_0;
 }
 
 static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
-   struct ip *iphdr, int offset)
+   int offset, bool ipv6)
 {
int frag_len;
int res;
+   int payload_offset = offset > 0 ? offset - UDP_HLEN : 0;
+   uint8_t *frag_start = ipv6 ? ip_frame + IP6_HLEN + F

Re: [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-17 Thread Peter Oskolkov
On Mon, Sep 17, 2018 at 9:59 AM David Ahern  wrote:
>
> On 9/17/18 9:11 AM, Peter Oskolkov wrote:
> > On Thu, Sep 13, 2018 at 9:11 PM David Ahern  wrote:
> >>
> >> On 9/13/18 1:38 PM, Peter Oskolkov wrote:
> >>
> >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> >>> index 3eed045c65a5..a3902f805305 100644
> >>> --- a/net/ipv6/route.c
> >>> +++ b/net/ipv6/route.c
> >>> @@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info 
> >>> *rt, struct fib6_info *ort)
> >>>
> >>>  static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
> >>>  {
> >>> - rt->dst.flags |= fib6_info_dst_flags(ort);
> >>> + rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
> >>
> >> I think my mistake is setting dst.flags in ip6_rt_init_dst. Flags
> >> argument is passed to ip6_dst_alloc which is always invoked before
> >> ip6_rt_copy_init is called which is the only caller of ip6_rt_init_dst.
> >
> > ip6_rt_cache_alloc calls ip6_dst_alloc with zero as flags; and only
> > one flag is copied later (DST_HOST) outside of ip6_rt_init_dst().
> > If the flag assignment is completely removed from ip6_rt_init_dst(),
> > then DST_NOPOLICY flag will be lost.
> >
> > Which may be OK, but is more than what this patch tries to solve (do not
> > copy DST_NOCOUNT flag).
>
> In the 4.17 kernel (prior to the fib6_info change), ip6_rt_cache_alloc
> calls __ip6_dst_alloc with 0 for flags so this is correct. The mistake
> is ip6_rt_copy_init -> ip6_rt_init_dst -> fib6_info_dst_flags.
>
> I believe the right fix is to drop the 'rt->dst.flags |=
> fib6_info_dst_flags(ort);' from ip6_rt_init_dst.

OK, I sent a v2 with the assignment removed. Thanks for the review!


[Patch net v2] net/ipv6: do not copy dst flags on rt init

2018-09-17 Thread Peter Oskolkov
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, the dst_ops::pcpuc_entries counter is incremented
in dst_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise the
dst_ops::pcpuc_entries counter can grow until it reaches ip6_rt_max_size
and no new routes can be allocated: entries counted at allocation time
never get the matching decrement, because the copied DST_NOCOUNT flag
suppresses it when gc-ed routes are destroyed.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern 
Acked-by: Wei Wang 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/route.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..480a79f47c52 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,8 +946,6 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, 
struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-   rt->dst.flags |= fib6_info_dst_flags(ort);
-
if (ort->fib6_flags & RTF_REJECT) {
ip6_rt_init_dst_reject(rt, ort);
return;
-- 
2.19.0.397.gdd90340f6a-goog
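
For context, the accounting asymmetry described in the changelog, condensed
from the dst allocation and destroy paths in net/core/dst.c (simplified
sketch, not verbatim kernel code):

	/* At allocation time, only entries flagged DST_NOCOUNT skip the increment: */
	if (!(flags & DST_NOCOUNT))
		dst_entries_add(ops, 1);

	/* At destroy time, the current flags decide about the decrement. If
	 * DST_NOCOUNT was copied in after allocation, a counted entry is never
	 * decremented and dst_ops::pcpuc_entries only grows:
	 */
	if (!(dst->flags & DST_NOCOUNT))
		dst_entries_add(dst->ops, -1);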



Re: [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-17 Thread Peter Oskolkov
On Thu, Sep 13, 2018 at 9:11 PM David Ahern  wrote:
>
> On 9/13/18 1:38 PM, Peter Oskolkov wrote:
>
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 3eed045c65a5..a3902f805305 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, 
> > struct fib6_info *ort)
> >
> >  static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
> >  {
> > - rt->dst.flags |= fib6_info_dst_flags(ort);
> > + rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
>
> I think my mistake is setting dst.flags in ip6_rt_init_dst. Flags
> argument is passed to ip6_dst_alloc which is always invoked before
> ip6_rt_copy_init is called which is the only caller of ip6_rt_init_dst.

ip6_rt_cache_alloc calls ip6_dst_alloc with zero as flags; and only
one flag is copied later (DST_HOST) outside of ip6_rt_init_dst().
If the flag assignment is completely removed from ip6_rt_init_dst(),
then DST_NOPOLICY flag will be lost.

Which may be OK, but is more than what this patch tries to solve (do not
copy DST_NOCOUNT flag).

>
> >
> >   if (ort->fib6_flags & RTF_REJECT) {
> >   ip6_rt_init_dst_reject(rt, ort);
> >
>


[PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-13 Thread Peter Oskolkov
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, the dst_ops::pcpuc_entries counter is incremented
in dst_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise the
dst_ops::pcpuc_entries counter can grow until it reaches ip6_rt_max_size
and no new routes can be allocated: entries counted at allocation time
never get the matching decrement, because the copied DST_NOCOUNT flag
suppresses it when gc-ed routes are destroyed.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern 
Acked-by: Wei Wang 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..a3902f805305 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, 
struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-   rt->dst.flags |= fib6_info_dst_flags(ort);
+   rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
 
if (ort->fib6_flags & RTF_REJECT) {
ip6_rt_init_dst_reject(rt, ort);
-- 
2.19.0.397.gdd90340f6a-goog



[PATCH net-next 1/2] ip: fail fast on IP defrag errors

2018-08-28 Thread Peter Oskolkov
The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected,
the whole frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.

Signed-off-by: Peter Oskolkov 
Reviewed-by: Eric Dumazet 
Reviewed-by: Willem de Bruijn 
---
 net/ipv4/ip_fragment.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..330f62353b11 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
if (end < qp->q.len ||
((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
-   goto err;
+   goto discard_qp;
qp->q.flags |= INET_FRAG_LAST_IN;
qp->q.len = end;
} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
if (end > qp->q.len) {
/* Some bits beyond end -> corruption. */
if (qp->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_qp;
qp->q.len = end;
}
}
if (end == offset)
-   goto err;
+   goto discard_qp;
 
err = -ENOMEM;
if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
-   goto err;
+   goto discard_qp;
 
err = pskb_trim_rcsum(skb, end - offset);
if (err)
-   goto err;
+   goto discard_qp;
 
/* Note : skb->rbnode and skb->dev share the same location. */
dev = skb->dev;
@@ -423,6 +423,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * We do the same here for IPv4 (and increment an snmp counter).
 */
 
+   err = -EINVAL;
/* Find out where to put this fragment.  */
prev_tail = qp->q.fragments_tail;
if (!prev_tail)
@@ -431,7 +432,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
/* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
-   goto discard_qp;
+   goto overlap;
if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
ip4_frag_append_to_last_run(&qp->q, skb);
else
@@ -450,7 +451,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
FRAG_CB(skb1)->frag_run_len)
rbn = &parent->rb_right;
else /* Found an overlap with skb1. */
-   goto discard_qp;
+   goto overlap;
} while (*rbn);
/* Here we have parent properly set, and rbn pointing to
 * one of its NULL left/right children. Insert skb.
@@ -487,16 +488,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
skb->_skb_refdst = 0UL;
err = ip_frag_reasm(qp, skb, prev_tail, dev);
skb->_skb_refdst = orefdst;
+   if (err)
+   inet_frag_kill(&qp->q);
return err;
}
 
skb_dst_drop(skb);
return -EINPROGRESS;
 
+overlap:
+   __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 discard_qp:
inet_frag_kill(&qp->q);
-   err = -EINVAL;
-   __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
kfree_skb(skb);
return err;
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog



[PATCH net-next 2/2] selftests/net: add ip_defrag selftest

2018-08-28 Thread Peter Oskolkov
This test creates a raw IPv4 socket, fragments a largish UDP
datagram and sends the fragments out of order.

Then repeats in a loop with different message and fragment lengths.

Then does the same with overlapping fragments (with overlapping
fragments the expectation is that the recv times out).

Tested:

root@# time ./ip_defrag.sh
ipv4 defrag
PASS
ipv4 defrag with overlaps
PASS

real1m7.679s
user0m0.628s
sys 0m2.242s

A similar test for IPv6 is to follow.

Signed-off-by: Peter Oskolkov 
Reviewed-by: Willem de Bruijn 
---
 tools/testing/selftests/net/.gitignore   |   2 +
 tools/testing/selftests/net/Makefile |   4 +-
 tools/testing/selftests/net/ip_defrag.c  | 313 +++
 tools/testing/selftests/net/ip_defrag.sh |  29 +++
 4 files changed, 346 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/ip_defrag.c
 create mode 100755 tools/testing/selftests/net/ip_defrag.sh

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 78b24cf76f40..2836e0cf2d81 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -14,3 +14,5 @@ udpgso_bench_rx
 udpgso_bench_tx
 tcp_inq
 tls
+ip_defrag
+
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 9cca68e440a0..cccdb2295567 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,13 +5,13 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd
-TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx
+TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
diff --git a/tools/testing/selftests/net/ip_defrag.c 
b/tools/testing/selftests/net/ip_defrag.c
new file mode 100644
index ..55fdcdc78eef
--- /dev/null
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static boolcfg_do_ipv4;
+static boolcfg_do_ipv6;
+static boolcfg_verbose;
+static boolcfg_overlap;
+static unsigned short  cfg_port = 9000;
+
+const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) 
};
+
+#define IP4_HLEN   (sizeof(struct iphdr))
+#define IP6_HLEN   (sizeof(struct ip6_hdr))
+#define UDP_HLEN   (sizeof(struct udphdr))
+
+static int msg_len;
+static int max_frag_len;
+
+#define MSG_LEN_MAX6   /* Max UDP payload length. */
+
+#define IP4_MF (1u << 13)  /* IPv4 MF flag. */
+
+static uint8_t udp_payload[MSG_LEN_MAX];
+static uint8_t ip_frame[IP_MAXPACKET];
+static uint16_t ip_id = 0xabcd;
+static int msg_counter;
+static int frag_counter;
+static unsigned int seed;
+
+/* Receive a UDP packet. Validate it matches udp_payload. */
+static void recv_validate_udp(int fd_udp)
+{
+   ssize_t ret;
+   static uint8_t recv_buff[MSG_LEN_MAX];
+
+   ret = recv(fd_udp, recv_buff, msg_len, 0);
+   msg_counter++;
+
+   if (cfg_overlap) {
+   if (ret != -1)
+   error(1, 0, "recv: expected timeout; got %d; seed = %u",
+   (int)ret, seed);
+   if (errno != ETIMEDOUT && errno != EAGAIN)
+   error(1, errno, "recv: expected timeout: %d; seed = %u",
+errno, seed);
+   return;  /* OK */
+   }
+
+   if (ret == -1)
+   error(1, errno, "recv: msg_len = %d max_frag_len = %d",
+   msg_len, max_frag_len);
+   if (ret != msg_len)
+   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, msg_len);
+   if (memcmp(udp_payload, recv_buff, msg_len))
+   error(1, 0, "recv: wrong data");
+}
+
+static uint32_t raw_checksum(uint8_t *buf, int len, uint32_t sum)
+{
+   int i;
+
+   for (i = 0; i < (len & ~1U); i += 2) {
+   sum += (u_int16_t)ntohs(*((u_int16_t *)(buf + i)));
+   if (sum > 0xffff)
+   sum -= 0xffff;
+   }
+
+   if (i < len) {
+   sum += buf[i] << 8;
+   

[PATCH net-next v2 0/2] ip: faster in-order IP fragments

2018-08-11 Thread Peter Oskolkov
Added "Signed-off-by" in v2.

Peter Oskolkov (2):
  ip: add helpers to process in-order fragments faster.
  ip: process in-order fragments efficiently

 include/net/inet_frag.h  |   6 ++
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 183 ++-
 3 files changed, 149 insertions(+), 42 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next v2 1/2] ip: add helpers to process in-order fragments faster.

2018-08-11 Thread Peter Oskolkov
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b86d14528188..1662cbc0b46b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
	if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parm    h;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+                &q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 u

[PATCH net-next v2 2/2] ip: process in-order fragments efficiently

2018-08-11 Thread Peter Oskolkov
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6d258a5669e7..bcb11f3a27c0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -146,7 +146,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(&qp->q, skb);
+   else
+   ip4_frag_create_run(&qp->q, skb);
} else {
-   /* Binary se

[PATCH net-next 1/2] ip: add helpers to process in-order fragments faster.

2018-08-10 Thread Peter Oskolkov
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b86d14528188..1662cbc0b46b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
	if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parm    h;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+                &q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = 

[PATCH net-next 2/2] ip: process in-order fragments efficiently

2018-08-10 Thread Peter Oskolkov
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6d258a5669e7..bcb11f3a27c0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -146,7 +146,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(&qp->q, skb);
+   else
+   ip4_frag_create_run(&qp->q, skb);
} else {
-   /* Binary search. Note that skb can become the first fragment, 
but
-* not the last (covered above). */
+   /* Binary search. Note that skb can become the first fragment,
+  

Re: [PATCH net-next] ipv4: frags: precedence bug in ip_expire()

2018-08-06 Thread Peter Oskolkov
Ack. Thanks, Dan!
On Mon, Aug 6, 2018 at 12:17 PM Dan Carpenter  wrote:
>
> We accidentally removed the parentheses here, but they are required
> because '!' has higher precedence than '&'.
>
> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
> Signed-off-by: Dan Carpenter 
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 0e8f8de77e71..7cb7ed761d8c 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -154,7 +154,7 @@ static void ip_expire(struct timer_list *t)
> __IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
> __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
>
> -   if (!qp->q.flags & INET_FRAG_FIRST_IN)
> +   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
> goto out;
>
> /* sk_buff::dev and sk_buff::rbnode are unionized. So we
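
For anyone skimming the thread, a tiny stand-alone demo of the precedence
point made above (the flag values below are made up for the example; only
their relative bit positions matter):

#include <stdio.h>

#define FIRST_IN	0x1	/* stands in for INET_FRAG_FIRST_IN */
#define LAST_IN		0x2	/* some other flag bit */

int main(void)
{
	unsigned int flags = LAST_IN;	/* FIRST_IN is not set */

	/* '!' binds tighter than '&': (!flags) & FIRST_IN == 0 & 1 == 0,
	 * so the "first fragment missing" branch would never be taken. */
	printf("without parentheses: %d\n", !flags & FIRST_IN);

	/* With the parentheses the test expresses the intent: prints 1. */
	printf("with parentheses:    %d\n", !(flags & FIRST_IN));
	return 0;
}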


Re: [PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-03 Thread Peter Oskolkov
On Fri, Aug 3, 2018 at 12:33 PM Josh Hunt  wrote:
>
> On Thu, Aug 2, 2018 at 4:34 PM, Peter Oskolkov  wrote:
>>
>> This patchset
>>  * changes IPv4 defrag behavior to match that of IPv6: overlapping
>>fragments now cause the whole IP datagram to be discarded (suggested
>>by David Miller): there are no legitimate use cases for overlapping
>>fragments;
>>  * changes IPv4 defrag queue from a list to a rb tree (suggested
>>by Eric Dumazet): this change removes a potential attack vector.
>>
>> Upcoming patches will contain similar changes for IPv6 frag queue,
>> as well as a comprehensive IP defrag self-test (temporarily delayed).
>>
>> Peter Oskolkov (3):
>>   ip: discard IPv4 datagrams with overlapping segments.
>>   net: modify skb_rbtree_purge to return the truesize of all purged
>> skbs.
>>   ip: use rb trees for IP frag queue.
>>
>>  include/linux/skbuff.h  |  11 +-
>>  include/net/inet_frag.h |   3 +-
>>  include/uapi/linux/snmp.h   |   1 +
>>  net/core/skbuff.c   |   6 +-
>>  net/ipv4/inet_fragment.c|  16 +-
>>  net/ipv4/ip_fragment.c  | 239 +++-
>>  net/ipv4/proc.c |   1 +
>>  net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
>>  net/ipv6/reassembly.c   |   1 +
>>  9 files changed, 139 insertions(+), 140 deletions(-)
>>
>> --
>> 2.18.0.597.ga71716f1ad-goog
>>
>
> Peter
>
> I just tested your patches along with Florian's on top of net-next. Things 
> look much better wrt this type of attack. Thanks for doing this. I'm 
> wondering if we want to put an optional mechanism in place to limit the size 
> of the tree in terms of skbs it can hold? Otherwise an attacker can send 
> ~1400 8 byte frags and consume all frag memory (default high thresh is 4M) 
> pretty easily and I believe also evict other frags which may have been 
> pending? I am guessing this is what Florian's min MTU patches are trying to 
> help with.
>
> --
> Josh

Hi Josh,

It will be really easy to limit the size of the queue/tree (e.g. based
on a sysctl parameter). I can send a follow-up patch if there is a
consensus that this behavior is needed/useful.

Thanks,
Peter


[PATCH v2 net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Tested: see the next patch in the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH v2 net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available upstream).
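
As an aside, a self-contained sketch of the overlap test the new code applies
when a fragment is about to be linked between its sorted neighbours. The
struct and function names below are local to this example, not taken from
the kernel:

#include <stdio.h>

struct frag { int offset, len; };

/* Return 1 if [offset, end) overlaps either sorted neighbour. */
static int overlaps(const struct frag *prev, const struct frag *next,
		    int offset, int end)
{
	if (prev && prev->offset + prev->len > offset)
		return 1;	/* previous fragment runs past our start */
	if (next && next->offset < end)
		return 1;	/* we run past the start of the next fragment */
	return 0;
}

int main(void)
{
	struct frag prev = { 0, 1200 }, next = { 2400, 1200 };

	printf("%d\n", overlaps(&prev, &next, 1200, 2400));	/* 0: fits exactly */
	printf("%d\n", overlaps(&prev, &next, 1000, 2200));	/* 1: overlaps prev */
	printf("%d\n", overlaps(&prev, &next, 1200, 2600));	/* 1: overlaps next */
	return 0;
}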

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..f80135e5feaa 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -56,6 +56,7 @@ enum
IPSTATS_MIB_ECT1PKTS,   /* InECT1Pkts */
IPSTATS_MIB_ECT0PKTS,   /* InECT0Pkts */
IPSTATS_MIB_CEPKTS, /* InCEPkts */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 

[PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH v2 net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).
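
To put a rough number on "inserted faster": a deliberately simplified
user-space estimate (not from the patch) that compares scanning a sorted
singly linked list from the head against descending a balanced tree, for a
pessimal pattern where every new fragment has to be placed near the end of
the list. The fragment count is an arbitrary example value:

#include <stdio.h>

int main(void)
{
	const unsigned long n = 1400;	/* arbitrary example fragment count */
	unsigned long list_cmps = 0, tree_cmps = 0, i, d;

	for (i = 1; i <= n; i++) {
		/* linked list: walk the i - 1 fragments already queued */
		list_cmps += i - 1;
		/* balanced tree: about log2(i) comparisons per insert */
		for (d = 0; (1UL << d) < i; d++)
			;
		tree_cmps += d;
	}
	printf("sorted linked list: %lu comparisons\n", list_cmps);
	printf("balanced rb-tree:  ~%lu comparisons\n", tree_cmps);
	return 0;
}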

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, 
and tcp stack */
struct list_headlist;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+   

[PATCH net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, 
and tcp stack */
struct list_headlist;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+   

[PATCH net-next 0/3] ip: Use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..da1a144f1a51 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -40,6 +40,7 @@ enum
IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
IPSTATS_MIB_REASMFAILS, /* ReasmFails */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
IPSTATS_MIB_FRAGOKS,/* FragOKs */
IPSTATS_MIB_FRAGFAILS,  /* FragFails */
IPSTATS_MIB_FRAGCREATES,/* FragCreates */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
d