RE: [PATCH] net: bond: skip vlan header when do layer 3+4 hash policy

2018-04-02 Thread liujian (CE)
> -Original Message-
> From: Nikolay Aleksandrov [mailto:niko...@cumulusnetworks.com]
> Sent: Saturday, March 31, 2018 5:23 PM
> To: liujian (CE) <liujia...@huawei.com>; da...@davemloft.net;
> j.vosbu...@gmail.com; vfal...@gmail.com; a...@greyhouse.net
> Cc: netdev@vger.kernel.org; weiyongjun (A) <weiyongj...@huawei.com>
> Subject: Re: [PATCH] net: bond: skip vlan header when do layer 3+4 hash policy
> 
> On 31/03/18 12:14, liujia...@huawei.com wrote:
> > From: liujian <liujia...@huawei.com>
> >
> > When the hash policy is BOND_XMIT_POLICY_LAYER34 mode and skb
> protocol
> > is 802.1q VLAN, the policy will be degenerated to LAYER2 mode; Now,
> > change it to get the next layer protocol to ensure that it worked in
> > BOND_XMIT_POLICY_LAYER34 mode.
> >
> > Signed-off-by: liujian <liujia...@huawei.com>
> > ---
> >   drivers/net/bonding/bond_main.c | 11 ---
> >   1 file changed, 8 insertions(+), 3 deletions(-)
> 
> 
> Nak
> Use BOND_XMIT_POLICY_ENCAP34 (encap3+4), that was one of the main
> reasons it was added.

Got it, thank you~


RE: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-08-31 Thread liujian (CE)



Best Regards,
liujian


> -Original Message-
> From: Michal Kubecek [mailto:mkube...@suse.cz]
> Sent: Friday, September 01, 2017 12:24 AM
> To: Jesper Dangaard Brouer
> Cc: liujian (CE); netdev@vger.kernel.org; Florian Westphal
> Subject: Re: [RFC PATCH] net: frag limit checks need to use
> percpu_counter_compare
> 
> On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:
> > To: Liujian can you please test this patch?
> >  I want to understand if using __percpu_counter_compare() solves  the
> > problem correctness wise (even-though this will be slower  than using
> > a simple atomic_t on your big system).

I have test the patch, it can work. 
1. make sure frag_mem_limit reach to thresh
  ===>FRAG: inuse 0 memory 0 frag_mem_limit 5386864
2. change NIC rx irq's affinity to a fixed CPU
3. iperf -u -c 9.83.1.41 -l 1 -i 1 -t 1000 -P 10 -b 20M
  And check /proc/net/snmp, there are no ReasmFails.

And I think it is a better way that adding some counter sync points as you said.

> > Fix bug in fragmentation codes use of the percpu_counter API, that
> > cause issues on systems with many CPUs.
> >
> > The frag_mem_limit() just reads the global counter (fbc->count),
> > without considering other CPUs can have upto batch size (130K) that
> > haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
> > this become dangerous at >=24 CPUs (3*1024*1024/13=24).
> >
> > The __percpu_counter_compare() does the right thing, and takes into
> > account the number of (online) CPUs and batch size, to account for
> > this and call __percpu_counter_sum() when needed.
> >
> > On systems with many CPUs this will unfortunately always result in the
> > heavier fully locked __percpu_counter_sum() which touch the
> > per_cpu_ptr of all (online) CPUs.
> >
> > On systems with a smaller number of CPUs this solution is also not
> > optimal, because __percpu_counter_compare()/__percpu_counter_sum()
> > doesn't help synchronize the global counter.
> >  Florian Westphal have an idea of adding some counter sync points,
> > which should help address this issue.
> > ---
> >  include/net/inet_frag.h  |   16 ++--
> >  net/ipv4/inet_fragment.c |6 +++---
> >  2 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index
> > 6fdcd2427776..b586e320783d 100644
> > --- a/include/net/inet_frag.h
> > +++ b/include/net/inet_frag.h
> > @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct
> inet_frag_queue *q)
> >   */
> >  static unsigned int frag_percpu_counter_batch = 13;
> >
> > -static inline int frag_mem_limit(struct netns_frags *nf)
> > +static inline bool frag_mem_over_limit(struct netns_frags *nf, int
> > +thresh)
> >  {
> > -   return percpu_counter_read(>mem);
> > +   /* When reading counter here, __percpu_counter_compare() call
> > +* will invoke __percpu_counter_sum() when needed.  Which
> > +* depend on num_online_cpus()*batch size, as each CPU can
> > +* potentential can hold a batch count.
> > +*
> > +* With many CPUs this heavier sum operation will
> > +* unfortunately always occur.
> > +*/
> > +   if (__percpu_counter_compare(>mem, thresh,
> > +frag_percpu_counter_batch) > 0)
> > +   return true;
> > +   else
> > +   return false;
> >  }
> >
> >  static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index
> > 96e95e83cc61..ee2cf56900e6 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct
> > inet_frags *f)  static bool inet_fragq_should_evict(const struct
> > inet_frag_queue *q)  {
> > return q->net->low_thresh == 0 ||
> > -  frag_mem_limit(q->net) >= q->net->low_thresh;
> > +   frag_mem_over_limit(q->net, q->net->low_thresh);
> >  }
> >
> >  static unsigned int
> > @@ -355,7 +355,7 @@ static struct inet_frag_queue
> > *inet_frag_alloc(struct netns_frags *nf,  {
> > struct inet_frag_queue *q;
> >
> > -   if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> > +   if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
> > inet_frag_schedule_worker(f);
> > return NULL;
> > }
> 
> If we go t

RE: Question about ip_defrag

2017-08-29 Thread liujian (CE)



Best Regards,
liujian


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Florian Westphal
> Sent: Tuesday, August 29, 2017 9:47 PM
> To: liujian (CE)
> Cc: Florian Westphal; Jesper Dangaard Brouer; netdev@vger.kernel.org;
> Wangkefeng (Kevin); weiyongjun (A)
> Subject: Re: Question about ip_defrag
> 
> liujian (CE) <liujia...@huawei.com> wrote:
> 
> [ trimming cc list ]
> 
> > Now, I have not the real environment.
> > I use iperf generate fragment packets; and I always change NIC rx
> > irq's affinity cpu, to make sure frag_mem_limit reach to thresh.
> > my test machine, CPU num is 384.
> 
> Oh well, that explains it.
> 
> > > > +   if (frag_mem_limit(nf) > nf->low_thresh) {
> > > > inet_frag_schedule_worker(f);
> > > > +   update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16);
> > > > +   }
> 
> You need to reduce this to a lower value.
> Your cpu count * batch_value needs to be less than low_thresh to avoid
> problems.
> 
> Wtih 384 cpus its close to 12 mbyte...
> 
> Perhaps do this:
> 
> update_frag_mem_limit(nf, 2 * 1024*1024 / NR_CPUS);
> 
> 
> However, I think its better to revert the percpu counter change and move back
> to a single atomic_t count.

Ok. 
Florian and Jesper, many thanks for this issue. 


RE: Question about ip_defrag

2017-08-29 Thread liujian (CE)



Best Regards,
liujian


> -Original Message-
> From: liujian (CE)
> Sent: Tuesday, August 29, 2017 3:39 PM
> To: 'Florian Westphal'
> Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru;
> yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com;
> netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A)
> Subject: RE: Question about ip_defrag
> 
> 
> 
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org
> > [mailto:netdev-ow...@vger.kernel.org]
> > On Behalf Of Florian Westphal
> > Sent: Monday, August 28, 2017 10:01 PM
> > To: liujian (CE)
> > Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru;
> > yoshf...@linux-ipv6.org; elena.reshet...@intel.com;
> > eduma...@google.com; netdev@vger.kernel.org; Wangkefeng (Kevin);
> > weiyongjun (A)
> > Subject: Re: Question about ip_defrag
> >
> > liujian (CE) <liujia...@huawei.com> wrote:
> > > Hi
> > >
> > > I checked our 3.10 kernel, we had backported all percpu_counter bug
> > > fix in
> > lib/percpu_counter.c and include/linux/percpu_counter.h.
> > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough.
> > >
> > > > > > > the issue:
> > > > > > > Ip_defrag fail caused by frag_mem_limit reached
> > 4M(frags.high_thresh).
> > > > > > > At this moment,sum_frag_mem_limit is about 10K.
> > >
> > > So should we change ipfrag high/low thresh to a reasonable value ?
> > > And if it is, is there a standard to change the value?
> >
> > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know
> > about so with 64 cpus that is ~8 mbyte.
> >
> > possible solutions:
> > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and
> > high thresh depend on NR_CPUS
> >
> Thank you for your reply.
> 
> > liujian, does this change help in any way?
> 
> I will have a try.

Now, I have not the real environment. 
I use iperf generate fragment packets; 
and I always change NIC rx irq's affinity cpu, to make sure frag_mem_limit 
reach to thresh.
my test machine, CPU num is 384.
As above , test the patch , seemingly , there is no improving...
Check /proc/net/snmp, there is no significant difference.
maybe we should find a good test method!

root@RH8100-V3:/proc/net# cat sockstat
sockets: used 1386
TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 1
UDP: inuse 44 mem 42
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 1 memory 34336, 3144424.
 

> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -123,6 +123,17 @@ static bool inet_fragq_should_evict(const struct
> > inet_frag_queue *q)
> >frag_mem_limit(q->net) >= q->net->low_thresh;  }
> >
> > +/* ->mem batch size is huge, this can cause severe discrepancies
> > + * between actual value (sum of pcpu values) and the global estimate.
> > + *
> > + * Use a smaller batch to give an opportunity for the global estimate
> > + * to more accurately reflect current state.
> > + */
> > +static void update_frag_mem_limit(struct netns_frags *nf, unsigned
> > +int
> > +batch) {
> > +percpu_counter_add_batch(>mem, 0, batch); }
> > +
> >  static unsigned int
> >  inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
> > { @@
> > -146,8 +157,12 @@ inet_evict_bucket(struct inet_frags *f, struct
> > inet_frag_bucket *hb)
> >
> > spin_unlock(>chain_lock);
> >
> > -   hlist_for_each_entry_safe(fq, n, , list_evictor)
> > +   hlist_for_each_entry_safe(fq, n, , list_evictor) {
> > +   struct netns_frags *nf = fq->net;
> > +
> > f->frag_expire((unsigned long) fq);
> > +   update_frag_mem_limit(nf, 1);
> 
> > +   }
> >
> > return evicted;
> >  }
> > @@ -396,8 +411,10 @@ struct inet_frag_queue *inet_frag_find(struct
> > netns_frags *nf,
> > struct inet_frag_queue *q;
> > int depth = 0;
> >
> > -   if (frag_mem_limit(nf) > nf->low_thresh)
> > +   if (frag_mem_limit(nf) > nf->low_thresh) {
> > inet_frag_schedule_worker(f);
> > +   update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16);
> > +   }
> >
> > hash &= (INETFRAGS_HASHSZ - 1);
> > hb = >hash[hash];
> > @@ -416,6 +433,8 @@ struct inet_frag_queue *inet_frag_find(struct
> > netns_frags *nf,
> > if (depth <= INETFRAGS_MAXDEPTH)
> > return inet_frag_create(nf, f, key);
> >
> > +   update_frag_mem_limit(nf, 1);
> > +
> > if (inet_frag_may_rebuild(f)) {
> > if (!f->rebuild)
> > f->rebuild = true;


RE: Question about ip_defrag

2017-08-29 Thread liujian (CE)

> -Original Message-
> From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
> Sent: Tuesday, August 29, 2017 3:20 PM
> To: Florian Westphal
> Cc: liujian (CE); da...@davemloft.net; kuz...@ms2.inr.ac.ru;
> yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com;
> netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A);
> bro...@redhat.com
> Subject: Re: Question about ip_defrag
> 
> On Mon, 28 Aug 2017 16:00:32 +0200
> Florian Westphal <f...@strlen.de> wrote:
> 
> > liujian (CE) <liujia...@huawei.com> wrote:
> > > Hi
> > >
> > > I checked our 3.10 kernel, we had backported all percpu_counter bug fix in
> lib/percpu_counter.c and include/linux/percpu_counter.h.
> > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough.
> > >
> > > > > > > the issue:
> > > > > > > Ip_defrag fail caused by frag_mem_limit reached
> 4M(frags.high_thresh).
> > > > > > > At this moment,sum_frag_mem_limit is about 10K.
> > >
> > > So should we change ipfrag high/low thresh to a reasonable value ?
> > > And if it is, is there a standard to change the value?
> >
> > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know
> > about so with 64 cpus that is ~8 mbyte.
> >
> > possible solutions:
> > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and
> > high thresh depend on NR_CPUS
> 
> To me it looks like we/I have been using the wrong API for comparing against
> percpu_counters.  I guess we should have used
> __percpu_counter_compare().

Are you means?
Change 
if (frag_mem_limit(nf) > nf->low_thresh)
to
__percpu_counter_compare(>mem, nf->low_thresh, frag_percpu_counter_batch)

> /*
>  * Compare counter against given value.
>  * Return 1 if greater, 0 if equal and -1 if less  */ int
> __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch) {
>   s64 count;
> 
>   count = percpu_counter_read(fbc);
>   /* Check to see if rough count will be sufficient for comparison */
>   if (abs(count - rhs) > (batch * num_online_cpus())) {
>   if (count > rhs)
>   return 1;
>   else
>   return -1;
>   }
>   /* Need to use precise count */
>   count = percpu_counter_sum(fbc);
>   if (count > rhs)
>   return 1;
>   else if (count < rhs)
>   return -1;
>   else
>   return 0;
> }
> EXPORT_SYMBOL(__percpu_counter_compare);
> 
> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


RE: Question about ip_defrag

2017-08-29 Thread liujian (CE)


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Florian Westphal
> Sent: Monday, August 28, 2017 10:01 PM
> To: liujian (CE)
> Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru;
> yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com;
> netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A)
> Subject: Re: Question about ip_defrag
> 
> liujian (CE) <liujia...@huawei.com> wrote:
> > Hi
> >
> > I checked our 3.10 kernel, we had backported all percpu_counter bug fix in
> lib/percpu_counter.c and include/linux/percpu_counter.h.
> > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough.
> >
> > > > > > the issue:
> > > > > > Ip_defrag fail caused by frag_mem_limit reached
> 4M(frags.high_thresh).
> > > > > > At this moment,sum_frag_mem_limit is about 10K.
> >
> > So should we change ipfrag high/low thresh to a reasonable value ?
> > And if it is, is there a standard to change the value?
> 
> Each cpu can have frag_percpu_counter_batch bytes rest doesn't know about
> so with 64 cpus that is ~8 mbyte.
> 
> possible solutions:
> 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and high
> thresh depend on NR_CPUS
> 
Thank you for your reply.
 
> liujian, does this change help in any way?

I will have a try.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -123,6 +123,17 @@ static bool inet_fragq_should_evict(const struct
> inet_frag_queue *q)
>  frag_mem_limit(q->net) >= q->net->low_thresh;  }
> 
> +/* ->mem batch size is huge, this can cause severe discrepancies
> + * between actual value (sum of pcpu values) and the global estimate.
> + *
> + * Use a smaller batch to give an opportunity for the global estimate
> + * to more accurately reflect current state.
> + */
> +static void update_frag_mem_limit(struct netns_frags *nf, unsigned int
> +batch) {
> +  percpu_counter_add_batch(>mem, 0, batch); }
> +
>  static unsigned int
>  inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)  { @@
> -146,8 +157,12 @@ inet_evict_bucket(struct inet_frags *f, struct
> inet_frag_bucket *hb)
> 
>   spin_unlock(>chain_lock);
> 
> - hlist_for_each_entry_safe(fq, n, , list_evictor)
> + hlist_for_each_entry_safe(fq, n, , list_evictor) {
> + struct netns_frags *nf = fq->net;
> +
>   f->frag_expire((unsigned long) fq);
> + update_frag_mem_limit(nf, 1);

> + }
> 
>   return evicted;
>  }
> @@ -396,8 +411,10 @@ struct inet_frag_queue *inet_frag_find(struct
> netns_frags *nf,
>   struct inet_frag_queue *q;
>   int depth = 0;
> 
> - if (frag_mem_limit(nf) > nf->low_thresh)
> + if (frag_mem_limit(nf) > nf->low_thresh) {
>   inet_frag_schedule_worker(f);
> + update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16); 
> + }
> 
>   hash &= (INETFRAGS_HASHSZ - 1);
>   hb = >hash[hash];
> @@ -416,6 +433,8 @@ struct inet_frag_queue *inet_frag_find(struct
> netns_frags *nf,
>   if (depth <= INETFRAGS_MAXDEPTH)
>   return inet_frag_create(nf, f, key);
> 
> + update_frag_mem_limit(nf, 1);
> +
>   if (inet_frag_may_rebuild(f)) {
>   if (!f->rebuild)
>   f->rebuild = true;


RE: Question about ip_defrag

2017-08-28 Thread liujian (CE)
Hi

I checked our 3.10 kernel, we had backported all percpu_counter bug fix in 
lib/percpu_counter.c and include/linux/percpu_counter.h.
And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough.

> > > > the issue:
> > > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh).
> > > > At this moment,sum_frag_mem_limit is about 10K.

So should we change ipfrag high/low thresh to a reasonable value ? 
And if it is, is there a standard to change the value?


root@RH8100-V3:/proc/net# cat sockstat
sockets: used 1485
TCP: inuse 4 orphan 0 tw 0 alloc 5 mem 1
UDP: inuse 203 mem 201
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 1 memory 16048, 3156696.
root@RH8100-V3:/proc/net#

In order to print frag_mem_limit, change the code as below:

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 43eb6567..38bfb20 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -73,7 +73,7 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
seq_printf(seq, "RAW: inuse %d\n",
   sock_prot_inuse_get(net, _prot));
frag_mem = ip_frag_mem(net);
-   seq_printf(seq,  "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem);
+   seq_printf(seq,  "FRAG: inuse %u memory %u, %u.\n", !!frag_mem, 
frag_mem, frag_mem_limit(>ipv4.frags));
return 0;
 }

Best Regards,
liujian


> -Original Message-
> From: liujian (CE)
> Sent: Friday, August 25, 2017 9:33 AM
> To: 'Jesper Dangaard Brouer'
> Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru; yoshf...@linux-ipv6.org;
> elena.reshet...@intel.com; eduma...@google.com; netdev@vger.kernel.org;
> Wangkefeng (Kevin); weiyongjun (A)
> Subject: RE: Question about ip_defrag
> 
> 
> > -Original Message-
> > From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
> > Sent: Friday, August 25, 2017 2:59 AM
> > To: liujian (CE)
> > Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru;
> > yoshf...@linux-ipv6.org; elena.reshet...@intel.com;
> > eduma...@google.com; netdev@vger.kernel.org; bro...@redhat.com
> > Subject: Re: Question about ip_defrag
> >
> >
> > On Thu, 24 Aug 2017 16:04:41 + "liujian (CE)"
> > <liujia...@huawei.com>
> > wrote:
> >
> > > >What kernel version have you seen this issue with?
> > >
> > > 3.10,with some backport.
> > >
> > >  >As far as I remember, this issue have been fixed before...
> > >
> > > which one patch? I didnot find out the patch:(
> >
> > AFAIK it was some bugs in the percpu_counter code.  If you need to
> > backport look at the git commits:
> >
> >  git log lib/percpu_counter.c include/linux/percpu_counter.h
> >
> > Are you maintaining your own 3.10 kernel?
> >
> > I know that for RHEL7 (also kernel 3.10) we backported the
> > percpu_counter fixes...
> >
> Could you tell me which one patch?  we have backported most of the two
> files's change.
> Thank you ~
> 
> 
> > --Jesper
> >
> >
> > > 发件人: Jesper Dangaard Brouer
> > > 收件人: liujian
> > (CE)<liujia...@huawei.com<mailto:liujia...@huawei.com>>
> > > 抄送:
> > >
> >
> da...@davemloft.net<mailto:da...@davemloft.net>;kuz...@ms2.inr.ac.ru
> >  > > ailto:kuz...@ms2.inr.ac.ru>;yoshf...@linux-ipv6.org<mailto:yoshfuji@
> > > li
> > > nux-ipv6.org>;elena.reshet...@intel.com<mailto:elena.reshetova@intel
> > > .c
> > >
> >
> om>;eduma...@google.com<mailto:eduma...@google.com>;netdev@vger.k
> > ernel
> > > .org<mailto:netdev@vger.kernel.org>;bro...@redhat.com<mailto:brouer
> @
> > > r
> > e
> > > dhat.com>
> > > 主题: Re: Question about ip_defrag
> > > 时间: 2017-08-24 21:53:17
> > >
> > >
> > > On Thu, 24 Aug 2017 13:15:33 + "liujian (CE)"
> > > <liujia...@huawei.com>
> > wrote:
> > > > Hello,
> > > >
> > > > With below patch we met one issue.
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > > /c ommit/?h=v4.13-rc6=6d7b857d541e
> > > >
> > > > the issue:
> > > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh).
> > > > At this moment,sum_frag_mem_limit is about 10K.
> > > > and my test machine's cpu num is 64.
> > > >
> > > > Can i only change frag_mem_limit to sum_ frag_mem_limit?
> > > >
> > > >
> > > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > 

RE: Question about ip_defrag

2017-08-24 Thread liujian (CE)

> -Original Message-
> From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
> Sent: Friday, August 25, 2017 2:59 AM
> To: liujian (CE)
> Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru; yoshf...@linux-ipv6.org;
> elena.reshet...@intel.com; eduma...@google.com; netdev@vger.kernel.org;
> bro...@redhat.com
> Subject: Re: Question about ip_defrag
> 
> 
> On Thu, 24 Aug 2017 16:04:41 + "liujian (CE)" <liujia...@huawei.com>
> wrote:
> 
> > >What kernel version have you seen this issue with?
> >
> > 3.10,with some backport.
> >
> >  >As far as I remember, this issue have been fixed before...
> >
> > which one patch? I didnot find out the patch:(
> 
> AFAIK it was some bugs in the percpu_counter code.  If you need to backport
> look at the git commits:
> 
>  git log lib/percpu_counter.c include/linux/percpu_counter.h
> 
> Are you maintaining your own 3.10 kernel?
> 
> I know that for RHEL7 (also kernel 3.10) we backported the percpu_counter
> fixes...
>
Could you tell me which one patch?  we have backported most of the two files's 
change. 
Thank you ~


> --Jesper
> 
> 
> > 发件人: Jesper Dangaard Brouer
> > 收件人: liujian
> (CE)<liujia...@huawei.com<mailto:liujia...@huawei.com>>
> > 抄送:
> >
> da...@davemloft.net<mailto:da...@davemloft.net>;kuz...@ms2.inr.ac.ru
>  > ailto:kuz...@ms2.inr.ac.ru>;yoshf...@linux-ipv6.org<mailto:yoshfuji@li
> > nux-ipv6.org>;elena.reshet...@intel.com<mailto:elena.reshetova@intel.c
> >
> om>;eduma...@google.com<mailto:eduma...@google.com>;netdev@vger.k
> ernel
> > .org<mailto:netdev@vger.kernel.org>;bro...@redhat.com<mailto:brouer@r
> e
> > dhat.com>
> > 主题: Re: Question about ip_defrag
> > 时间: 2017-08-24 21:53:17
> >
> >
> > On Thu, 24 Aug 2017 13:15:33 + "liujian (CE)" <liujia...@huawei.com>
> wrote:
> > > Hello,
> > >
> > > With below patch we met one issue.
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c
> > > ommit/?h=v4.13-rc6=6d7b857d541e
> > >
> > > the issue:
> > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh).
> > > At this moment,sum_frag_mem_limit is about 10K.
> > > and my test machine's cpu num is 64.
> > >
> > > Can i only change frag_mem_limit to sum_ frag_mem_limit?
> > >
> > >
> > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > > index 96e95e8..f09c00b 100644
> > > --- a/net/ipv4/inet_fragment.c
> > > +++ b/net/ipv4/inet_fragment.c
> > > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct
> > > inet_frags *f)  static bool inet_fragq_should_evict(const struct
> > > inet_frag_queue *q)  {
> > > return q->net->low_thresh == 0 ||
> > > -  frag_mem_limit(q->net) >= q->net->low_thresh;
> > > +  sum_frag_mem_limit(q->net) >= q->net->low_thresh;
> > >  }
> > >
> > >  static unsigned int
> > > @@ -355,7 +355,7 @@ static struct inet_frag_queue
> > > *inet_frag_alloc(struct netns_frags *nf,  {
> > > struct inet_frag_queue *q;
> > >
> > > -   if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> > > +   if (!nf->high_thresh || sum_frag_mem_limit(nf) >
> > > + nf->high_thresh) {
> > > inet_frag_schedule_worker(f);
> > > return NULL;
> > > }
> > > @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct
> netns_frags *nf,
> > > struct inet_frag_queue *q;
> > > int depth = 0;
> > >
> > > -   if (frag_mem_limit(nf) > nf->low_thresh)
> > > +   if (sum_frag_mem_limit(nf) > nf->low_thresh)
> > > inet_frag_schedule_worker(f);
> > >
> > > hash &= (INETFRAGS_HASHSZ - 1);
> > > --
> > >
> > > Thank you for your time.
> >
> > What kernel version have you seen this issue with?
> >
> > As far as I remember, this issue have been fixed before...
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> 
> 
> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


RE: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-07-24 Thread liujian (CE)
Hi Wang cong,

After apply the patch, I did not hit the issue again.
Thank you~


Best Regards,
liujian

> -Original Message-
> From: Dingtianhong
> Sent: Monday, July 24, 2017 9:29 AM
> To: Cong Wang; liujian (CE)
> Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com;
> da...@davemloft.net; eduma...@google.com; will...@google.com;
> dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired
> 
> 
> 
> On 2017/7/24 9:09, Ding Tianhong wrote:
> >
> >
> > On 2017/7/24 1:03, Cong Wang wrote:
> >> On Sun, Jul 23, 2017 at 5:48 AM, liujian (CE) <liujia...@huawei.com> wrote:
> >>> Hi
> >>>
> >>> I find it caused by below steps:
> >>> 1. set tp_version to TPACKET_V3 and req->tp_block_nr to 1 2. set
> >>> tp_block_nr to 0 Then pg_vec was freed, and we did not delete the
> >>> timer?
> >>
> >> Thanks for testing!
> >>
> >> Ah, I overlook the initialization case in my previous patch.
> >>
> >> How about the following one? Does it cover all the cases?
> >>
> >>
> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index
> >> 008bb34ee324..0615c2a950fa 100644
> >> --- a/net/packet/af_packet.c
> >> +++ b/net/packet/af_packet.c
> >> @@ -4329,7 +4329,7 @@ static int packet_set_ring(struct sock *sk,
> >> union tpacket_req_u *req_u,
> >> register_prot_hook(sk);
> >> }
> >> spin_unlock(>bind_lock);
> >> -   if (closing && (po->tp_version > TPACKET_V2)) {
> >> +   if (pg_vec && (po->tp_version > TPACKET_V2)) {
> >> /* Because we don't support block-based V3 on tx-ring
> */
> >> if (!tx_ring)
> >> prb_shutdown_retire_blk_timer(po,
> rb_queue);
> >>
> >> .
> >
> > Hi, Cong:
> >
> > It looks like could not cover the case: req->tp_block_nr = 2 ->
> reg->tp_block_nr = 1 .
> >
> 
> Oh, looks like this case would never happen, so I think your solution is ok.
> 
> > what about this way:
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -4331,13 +4331,17 @@ static int packet_set_ring(struct sock *sk, union
> tpacket_req_u *req_u,
> > register_prot_hook(sk);
> > }
> > spin_unlock(>bind_lock);
> > -   if (closing && (po->tp_version > TPACKET_V2)) {
> > +   if ((closing || (pg_vec && !reg->tp_block_nr))&&
> > + (po->tp_version > TPACKET_V2)) {
> > /* Because we don't support block-based V3 on tx-ring */
> > if (!tx_ring)
> > prb_shutdown_retire_blk_timer(po,
> rb_queue);
> >
> >
> 
> >>
> >
> >
> > .
> >



RE: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-07-23 Thread liujian (CE)
Hi 

I find it caused by below steps:
1. set tp_version to TPACKET_V3 and req->tp_block_nr to 1
2. set tp_block_nr to 0
Then pg_vec was freed, and we did not delete the timer?

Best Regards,
liujian


> -Original Message-
> From: liujian (CE)
> Sent: Sunday, July 23, 2017 5:47 PM
> To: liujian (CE); 'Cong Wang'; Dingtianhong
> Cc: 'Willem de Bruijn'; 'Dave Jones'; 'alexander.le...@verizon.com';
> 'da...@davemloft.net'; 'eduma...@google.com'; 'will...@google.com';
> 'dan...@iogearbox.net'; 'netdev@vger.kernel.org';
> 'linux-ker...@vger.kernel.org'
> Subject: RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
> 
> Hi,
> 
> Do we need delete the v3 ring, when tp_version changed from TPACKET_V3 to
> TPACKET_V1 ?
> 
> 
> Best Regards,
> liujian
> 
> 
> > -Original Message-
> > From: liujian (CE)
> > Sent: Sunday, July 23, 2017 4:21 PM
> > To: 'Cong Wang'; Dingtianhong
> > Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com;
> > da...@davemloft.net; eduma...@google.com; will...@google.com;
> > dan...@iogearbox.net; netdev@vger.kernel.org;
> > linux-ker...@vger.kernel.org
> > Subject: RE: af_packet: use after free in
> > prb_retire_rx_blk_timer_expired
> >
> > Hi Wang Cong,
> >
> > With this patch , the system was crashed when setsockopt.
> >
> > The call trace as below:
> >
> > crash> bt
> > PID: 3069   TASK: 8800afcc  CPU: 0   COMMAND: "trinity-main"
> >  #0 [8801bec03ce0] machine_kexec at 8105354b
> >  #1 [8801bec03d40] crash_kexec at 810f7e82
> >  #2 [8801bec03e10] panic at 81650058
> >  #3 [8801bec03e90] watchdog_timer_fn at 81122533
> >  #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2
> >  #5 [8801bec03f20] hrtimer_interrupt at 810ac450
> >  #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457
> >  #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0
> >  #8 [8801bec03fb0] apic_timer_interrupt at 8166931d
> > ---  ---
> >  #9 [8801b301fcb8] apic_timer_interrupt at 8166931d
> > [exception RIP: lock_timer_base+77]
> > RIP: 8108dced  RSP: 8801b301fd60  RFLAGS: 0246
> > RAX:   RBX: 8800afcc  RCX:
> > 0001
> > RDX: 8800afcc  RSI: 8801b301fd90  RDI: 8800b0d853c8
> > RBP: 8801b301fd80   R8: 8800afcc   R9:
> ea000268
> > R10: 003c  R11: 8801b301fb2e  R12:
> 8800afcc
> > R13: 8800afcc  R14:   R15:
> 83d1a340
> > ORIG_RAX: ff10  CS: 0010  SS: 0018
> > #10 [8801b301fd88] try_to_del_timer_sync at 8108f19f
> > #11 [8801b301fdb8] del_timer_sync at 8108f252
> > #12 [8801b301fdd0] packet_set_ring at 81635e60
> > #13 [8801b301fe98] packet_setsockopt at 81636760
> > #14 [8801b301ff38] sys_setsockopt at 81531860
> > #15 [8801b301ff80] tracesys at 816687ed (via system_call)
> > RIP: 7fcc78b03e3a  RSP: 7fff16f246b8  RFLAGS: 0202
> > RAX: ffda  RBX: 816687ed  RCX: 
> > RDX: 0005  RSI: 0107  RDI:
> > 0180
> > RBP: 0180   R8: 001c   R9:
> > 7fcc78dc7160
> > R10: 01fd6ba0  R11: 00000202  R12:
> > 
> > R13: 0011  R14: 01fd6b60  R15:
> > 01fd6b70
> > ORIG_RAX: 0036  CS: 0033  SS: 002b
> >
> >
> > Best Regards,
> > liujian
> >
> >
> > > -Original Message-
> > > From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> > > Sent: Sunday, July 23, 2017 1:59 PM
> > > To: Dingtianhong
> > > Cc: liujian (CE); Willem de Bruijn; Dave Jones;
> > > alexander.le...@verizon.com; da...@davemloft.net;
> > eduma...@google.com;
> > > will...@google.com; dan...@iogearbox.net; netdev@vger.kernel.org;
> > > linux-ker...@vger.kernel.org
> > > Subject: Re: af_packet: use after free in
> > > prb_retire_rx_blk_timer_expired
> > >
> > > On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong
> > > <dingtianh...@huawei.com>
> > > wrote:
> > > > Hi, Cong:
> > > >
> > > > Thanks for your quirk solution, but I still has some doubts about
> > > > it, it looks like fix the problem in the
&

RE: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-07-23 Thread liujian (CE)
Hi, 

Do we need delete the v3 ring, when tp_version changed from TPACKET_V3 to 
TPACKET_V1 ?


Best Regards,
liujian


> -Original Message-
> From: liujian (CE)
> Sent: Sunday, July 23, 2017 4:21 PM
> To: 'Cong Wang'; Dingtianhong
> Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com;
> da...@davemloft.net; eduma...@google.com; will...@google.com;
> dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
> 
> Hi Wang Cong,
> 
> With this patch , the system was crashed when setsockopt.
> 
> The call trace as below:
> 
> crash> bt
> PID: 3069   TASK: 8800afcc  CPU: 0   COMMAND: "trinity-main"
>  #0 [8801bec03ce0] machine_kexec at 8105354b
>  #1 [8801bec03d40] crash_kexec at 810f7e82
>  #2 [8801bec03e10] panic at 81650058
>  #3 [8801bec03e90] watchdog_timer_fn at 81122533
>  #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2
>  #5 [8801bec03f20] hrtimer_interrupt at 810ac450
>  #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457
>  #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0
>  #8 [8801bec03fb0] apic_timer_interrupt at 8166931d
> ---  ---
>  #9 [8801b301fcb8] apic_timer_interrupt at 8166931d
> [exception RIP: lock_timer_base+77]
> RIP: 8108dced  RSP: 8801b301fd60  RFLAGS: 0246
> RAX:   RBX: 8800afcc  RCX:
> 0001
> RDX: 8800afcc  RSI: 8801b301fd90  RDI: 8800b0d853c8
> RBP: 8801b301fd80   R8: 8800afcc   R9: ea000268
> R10: 003c  R11: 8801b301fb2e  R12: 8800afcc
> R13: 8800afcc  R14:   R15: 83d1a340
> ORIG_RAX: ff10  CS: 0010  SS: 0018
> #10 [8801b301fd88] try_to_del_timer_sync at 8108f19f
> #11 [8801b301fdb8] del_timer_sync at 8108f252
> #12 [8801b301fdd0] packet_set_ring at 81635e60
> #13 [8801b301fe98] packet_setsockopt at 81636760
> #14 [8801b301ff38] sys_setsockopt at 81531860
> #15 [8801b301ff80] tracesys at 816687ed (via system_call)
> RIP: 7fcc78b03e3a  RSP: 7fff16f246b8  RFLAGS: 0202
> RAX: ffda  RBX: 816687ed  RCX: 
> RDX: 0005  RSI: 0107  RDI:
> 0180
> RBP: 0180   R8: 001c   R9:
> 7fcc78dc7160
> R10: 01fd6ba0  R11: 0202  R12:
> 
> R13: 0011  R14: 01fd6b60  R15:
> 01fd6b70
> ORIG_RAX: 0036  CS: 0033  SS: 002b
> 
> 
> Best Regards,
> liujian
> 
> 
> > -Original Message-
> > From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> > Sent: Sunday, July 23, 2017 1:59 PM
> > To: Dingtianhong
> > Cc: liujian (CE); Willem de Bruijn; Dave Jones;
> > alexander.le...@verizon.com; da...@davemloft.net;
> eduma...@google.com;
> > will...@google.com; dan...@iogearbox.net; netdev@vger.kernel.org;
> > linux-ker...@vger.kernel.org
> > Subject: Re: af_packet: use after free in
> > prb_retire_rx_blk_timer_expired
> >
> > On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong
> > <dingtianh...@huawei.com>
> > wrote:
> > > Hi, Cong:
> > >
> > > Thanks for your quirk solution, but I still has some doubts about
> > > it, it looks like fix the problem in the
> > > packet_setsockopt->packet_set_ring processing, but when in
> > > packet_release processing, it may could not release the real pg_vec
> > > for the TPACKET_V3 ring, and then cause the mem leak, maybe I miss
> > > something here, nice to hear from your feedback. :)
> >
> > Yes you miss that packet_release() has memset()'s so we won't hit that
> > path. :)
> >
> > However, I missed the swap() in this messy function, actually I
> > believe the bug is that we modify tpacket_kbdq_core inside rx_ring in
> > non-closing case without actually stopping its timer. I feel more confident
> with the following patch:
> >
> >
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index
> > 008bb34ee324..267b181fef15 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -4263,6 +4263,7 @@ static int packet_set_ring(struct sock *sk,
> > union tpacket_req_u *req_u,
> > case TPACKET_V3:
> > /* Block transmit is not supported yet */
> > if (!tx_ring) {
> > +   prb_shutdown_retire_blk_timer(po,
> > + rb_queue);
> > init_prb_bdqc(po, rb, pg_vec, req_u);
> > } else {
> > struct tpacket_req3 *req3 =
> > _u->req3;


RE: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-07-23 Thread liujian (CE)
Hi Wang Cong,

With this patch , the system was crashed when setsockopt. 

The call trace as below:  

crash> bt
PID: 3069   TASK: 8800afcc  CPU: 0   COMMAND: "trinity-main"
 #0 [8801bec03ce0] machine_kexec at 8105354b
 #1 [8801bec03d40] crash_kexec at 810f7e82
 #2 [8801bec03e10] panic at 81650058
 #3 [8801bec03e90] watchdog_timer_fn at 81122533
 #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2
 #5 [8801bec03f20] hrtimer_interrupt at 810ac450
 #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457
 #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0
 #8 [8801bec03fb0] apic_timer_interrupt at 8166931d
---  ---
 #9 [8801b301fcb8] apic_timer_interrupt at 8166931d
[exception RIP: lock_timer_base+77]
RIP: 8108dced  RSP: 8801b301fd60  RFLAGS: 0246
RAX:   RBX: 8800afcc  RCX: 0001
RDX: 8800afcc  RSI: 8801b301fd90  RDI: 8800b0d853c8
RBP: 8801b301fd80   R8: 8800afcc   R9: ea000268
R10: 003c  R11: 8801b301fb2e  R12: 8800afcc
R13: 8800afcc  R14:   R15: 83d1a340
ORIG_RAX: ff10  CS: 0010  SS: 0018
#10 [8801b301fd88] try_to_del_timer_sync at 8108f19f
#11 [8801b301fdb8] del_timer_sync at 8108f252
#12 [8801b301fdd0] packet_set_ring at 81635e60
#13 [8801b301fe98] packet_setsockopt at 81636760
#14 [8801b301ff38] sys_setsockopt at 81531860
#15 [8801b301ff80] tracesys at 816687ed (via system_call)
RIP: 7fcc78b03e3a  RSP: 7fff16f246b8  RFLAGS: 0202
RAX: ffda  RBX: 816687ed  RCX: 
RDX: 0005  RSI: 0107  RDI: 0180
RBP: 0180   R8: 001c   R9: 7fcc78dc7160
R10: 01fd6ba0  R11: 0202  R12: 
R13: 0011  R14: 01fd6b60  R15: 01fd6b70
ORIG_RAX: 0036  CS: 0033  SS: 002b


Best Regards,
liujian


> -Original Message-
> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> Sent: Sunday, July 23, 2017 1:59 PM
> To: Dingtianhong
> Cc: liujian (CE); Willem de Bruijn; Dave Jones; alexander.le...@verizon.com;
> da...@davemloft.net; eduma...@google.com; will...@google.com;
> dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired
> 
> On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong <dingtianh...@huawei.com>
> wrote:
> > Hi, Cong:
> >
> > Thanks for your quirk solution, but I still has some doubts about it,
> > it looks like fix the problem in the
> > packet_setsockopt->packet_set_ring processing, but when in
> > packet_release processing, it may could not release the real pg_vec
> > for the TPACKET_V3 ring, and then cause the mem leak, maybe I miss
> > something here, nice to hear from your feedback. :)
> 
> Yes you miss that packet_release() has memset()'s so we won't hit that path. 
> :)
> 
> However, I missed the swap() in this messy function, actually I believe the 
> bug
> is that we modify tpacket_kbdq_core inside rx_ring in non-closing case without
> actually stopping its timer. I feel more confident with the following patch:
> 
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index
> 008bb34ee324..267b181fef15 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -4263,6 +4263,7 @@ static int packet_set_ring(struct sock *sk, union
> tpacket_req_u *req_u,
> case TPACKET_V3:
> /* Block transmit is not supported yet */
> if (!tx_ring) {
> +   prb_shutdown_retire_blk_timer(po,
> + rb_queue);
> init_prb_bdqc(po, rb, pg_vec, req_u);
> } else {
> struct tpacket_req3 *req3 =
> _u->req3;


RE: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-07-22 Thread liujian (CE)
I also hit this issue with trinity test:

The call trace:
  [exception RIP: prb_retire_rx_blk_timer_expired+70]
RIP: 81633be6  RSP: 8801bec03dc0  RFLAGS: 00010246
RAX:   RBX: 8801b49d0948  RCX: 
RDX: 8801b31057a0  RSI: a56b6b6b6b6b6b6b  RDI: 8801b49d09ec
RBP: 8801bec03dd8   R8: 0001   R9: 83e1bf80
R10: 0002  R11: 0005  R12: 8801b49d09ec
R13: 0100  R14: 81633ba0  R15: 8801b49d0948
ORIG_RAX:   CS: 0010  SS: 0018
 #7 [8801bec03de0] call_timer_fn at 8108cb76
 #8 [8801bec03e18] run_timer_softirq at 8108f87c
 #9 [8801bec03e90] __do_softirq at 8108629f
#10 [8801bec03f00] call_softirq at 8166a01c
#11 [8801bec03f18] do_softirq at 810172ad
#12 [8801bec03f30] irq_exit at 81086655
#13 [8801bec03f48] msa_irq_exit at 810b1ab3
#14 [8801bec03f88] smp_apic_timer_interrupt at 8166aeae
#15 [8801bec03fb0] apic_timer_interrupt at 816692dd
---  ---

And from vmcore, I can see the pointer GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); is 
a56b6b6b6b6b6b6b 


struct packet_ring_buffer  rx_ring = {
pg_vec = 0x0, 
head = 0x0, 
frames_per_block = 0x400, 
frame_size = 0x0, 
frame_max = 0x, 
pg_vec_order = 0x0, 
pg_vec_pages = 0x0, 
pg_vec_len = 0x0, 
pending_refcnt = 0x0, 
prb_bdqc = {
  pkbdq = 0x8801b31057a0, 
  feature_req_word = 0x1, 
  hdrlen = 0x44, 
  reset_pending_on_curr_blk = 0x1, 
  delete_blk_timer = 0x0, 
  kactive_blk_num = 0x0, 
  blk_sizeof_priv = 0x0, 
  last_kactive_blk_num = 0x0, 
  pkblk_start = 0x8800a700 struct: page excluded: kernel virtual 
address: 8800a700  type: "gdb_readmem_callback"
struct: page excluded: kernel virtual address: 8800a700  type: 
"gdb_readmem_callback"
, 
  pkblk_end = 0x8800a720 "\002", 
  kblk_size = 0x20, 
  max_frame_len = 0x1fffd0, 
  knum_blocks = 0x1, 
  knxt_seq_num = 0x2, 
  prev = 0x8800a730 struct: page excluded: kernel virtual address: 
8800a730  type: "gdb_readmem_callback"
struct: page excluded: kernel virtual address: 8800a730  type: 
"gdb_readmem_callback"
, 
  nxt_offset = 0x8800a730 struct: page excluded: kernel virtual 
address: 8800a730  type: "gdb_readmem_callback"
struct: page excluded: kernel virtual address: 8800a730  type: 
"gdb_readmem_callback"
, 
  skb = 0x0, 
  blk_fill_in_prog = {
counter = 0x0

crash> struct pgv 0x8801b31057a0
struct pgv {
  buffer = 0xa56b6b6b6b6b6b6b 
}


Best Regards,
liujian


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Willem de Bruijn
> Sent: Wednesday, April 12, 2017 7:23 AM
> To: Dave Jones; alexander.le...@verizon.com; da...@davemloft.net;
> eduma...@google.com; will...@google.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired
> 
> On Mon, Apr 10, 2017 at 3:23 PM, Dave Jones 
> wrote:
> > On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com
> wrote:
> >  > Hi all,
> >  >
> >  > I seem to be hitting this use-after-free on a -next kernel using trinity:
> >  >
> >  > [  531.036054] BUG: KASAN: use-after-free in
> > prb_retire_rx_blk_timer_expired (net/packet/af_packet.c:688)
> 
> The retire_blk_timer is called after the pg_vec struct for this ring was 
> freed.
> This should not happen. packet_set_ring stops the timer with del_timer_sync
> when tearing down the ring before freeing that
> struct:
> 
> if (closing && (po->tp_version > TPACKET_V2)) {
> /* Because we don't support block-based V3 on tx-ring */
> if (!tx_ring)
> prb_shutdown_retire_blk_timer(po, rb_queue);
> }
> 
> if (pg_vec)
> free_pg_vec(pg_vec, order, req->tp_block_nr);
> 
> This is a similar race to the use-after-free fixed by 84ac7260236a
> ("packet: fix race condition in packet_set_ring"). The previous race was
> triggered by a call to setsockopt PACKET_VERSION changing tp_version while
> the ring is active. It is not immediately obvious what is the cause now. I
> suppose trinity does not give a trace of such system calls on this file 
> descriptor?
> That would be helpful.
> 
> The bug report shows both a timer firing after the packet_set_ring call that
> freed the pg_vec, and later a CONFIG_DEBUG_OBJECTS_FREE warning that
> the timer is still active when the socket is closed on release of the last 
> file
> descriptor.