RE: [PATCH] net: bond: skip vlan header when do layer 3+4 hash policy
> -Original Message- > From: Nikolay Aleksandrov [mailto:niko...@cumulusnetworks.com] > Sent: Saturday, March 31, 2018 5:23 PM > To: liujian (CE) <liujia...@huawei.com>; da...@davemloft.net; > j.vosbu...@gmail.com; vfal...@gmail.com; a...@greyhouse.net > Cc: netdev@vger.kernel.org; weiyongjun (A) <weiyongj...@huawei.com> > Subject: Re: [PATCH] net: bond: skip vlan header when do layer 3+4 hash policy > > On 31/03/18 12:14, liujia...@huawei.com wrote: > > From: liujian <liujia...@huawei.com> > > > > When the hash policy is BOND_XMIT_POLICY_LAYER34 mode and skb > protocol > > is 802.1q VLAN, the policy will be degenerated to LAYER2 mode; Now, > > change it to get the next layer protocol to ensure that it worked in > > BOND_XMIT_POLICY_LAYER34 mode. > > > > Signed-off-by: liujian <liujia...@huawei.com> > > --- > > drivers/net/bonding/bond_main.c | 11 --- > > 1 file changed, 8 insertions(+), 3 deletions(-) > > > Nak > Use BOND_XMIT_POLICY_ENCAP34 (encap3+4), that was one of the main > reasons it was added. Got it, thank you~
RE: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare
Best Regards, liujian > -Original Message- > From: Michal Kubecek [mailto:mkube...@suse.cz] > Sent: Friday, September 01, 2017 12:24 AM > To: Jesper Dangaard Brouer > Cc: liujian (CE); netdev@vger.kernel.org; Florian Westphal > Subject: Re: [RFC PATCH] net: frag limit checks need to use > percpu_counter_compare > > On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote: > > To: Liujian can you please test this patch? > > I want to understand if using __percpu_counter_compare() solves the > > problem correctness wise (even-though this will be slower than using > > a simple atomic_t on your big system). I have test the patch, it can work. 1. make sure frag_mem_limit reach to thresh ===>FRAG: inuse 0 memory 0 frag_mem_limit 5386864 2. change NIC rx irq's affinity to a fixed CPU 3. iperf -u -c 9.83.1.41 -l 1 -i 1 -t 1000 -P 10 -b 20M And check /proc/net/snmp, there are no ReasmFails. And I think it is a better way that adding some counter sync points as you said. > > Fix bug in fragmentation codes use of the percpu_counter API, that > > cause issues on systems with many CPUs. > > > > The frag_mem_limit() just reads the global counter (fbc->count), > > without considering other CPUs can have upto batch size (130K) that > > haven't been subtracted yet. Due to the 3MBytes lower thresh limit, > > this become dangerous at >=24 CPUs (3*1024*1024/13=24). > > > > The __percpu_counter_compare() does the right thing, and takes into > > account the number of (online) CPUs and batch size, to account for > > this and call __percpu_counter_sum() when needed. > > > > On systems with many CPUs this will unfortunately always result in the > > heavier fully locked __percpu_counter_sum() which touch the > > per_cpu_ptr of all (online) CPUs. > > > > On systems with a smaller number of CPUs this solution is also not > > optimal, because __percpu_counter_compare()/__percpu_counter_sum() > > doesn't help synchronize the global counter. > > Florian Westphal have an idea of adding some counter sync points, > > which should help address this issue. > > --- > > include/net/inet_frag.h | 16 ++-- > > net/ipv4/inet_fragment.c |6 +++--- > > 2 files changed, 17 insertions(+), 5 deletions(-) > > > > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index > > 6fdcd2427776..b586e320783d 100644 > > --- a/include/net/inet_frag.h > > +++ b/include/net/inet_frag.h > > @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct > inet_frag_queue *q) > > */ > > static unsigned int frag_percpu_counter_batch = 13; > > > > -static inline int frag_mem_limit(struct netns_frags *nf) > > +static inline bool frag_mem_over_limit(struct netns_frags *nf, int > > +thresh) > > { > > - return percpu_counter_read(>mem); > > + /* When reading counter here, __percpu_counter_compare() call > > +* will invoke __percpu_counter_sum() when needed. Which > > +* depend on num_online_cpus()*batch size, as each CPU can > > +* potentential can hold a batch count. > > +* > > +* With many CPUs this heavier sum operation will > > +* unfortunately always occur. > > +*/ > > + if (__percpu_counter_compare(>mem, thresh, > > +frag_percpu_counter_batch) > 0) > > + return true; > > + else > > + return false; > > } > > > > static inline void sub_frag_mem_limit(struct netns_frags *nf, int i) > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index > > 96e95e83cc61..ee2cf56900e6 100644 > > --- a/net/ipv4/inet_fragment.c > > +++ b/net/ipv4/inet_fragment.c > > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct > > inet_frags *f) static bool inet_fragq_should_evict(const struct > > inet_frag_queue *q) { > > return q->net->low_thresh == 0 || > > - frag_mem_limit(q->net) >= q->net->low_thresh; > > + frag_mem_over_limit(q->net, q->net->low_thresh); > > } > > > > static unsigned int > > @@ -355,7 +355,7 @@ static struct inet_frag_queue > > *inet_frag_alloc(struct netns_frags *nf, { > > struct inet_frag_queue *q; > > > > - if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) { > > + if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) { > > inet_frag_schedule_worker(f); > > return NULL; > > } > > If we go t
RE: Question about ip_defrag
Best Regards, liujian > -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] > On Behalf Of Florian Westphal > Sent: Tuesday, August 29, 2017 9:47 PM > To: liujian (CE) > Cc: Florian Westphal; Jesper Dangaard Brouer; netdev@vger.kernel.org; > Wangkefeng (Kevin); weiyongjun (A) > Subject: Re: Question about ip_defrag > > liujian (CE) <liujia...@huawei.com> wrote: > > [ trimming cc list ] > > > Now, I have not the real environment. > > I use iperf generate fragment packets; and I always change NIC rx > > irq's affinity cpu, to make sure frag_mem_limit reach to thresh. > > my test machine, CPU num is 384. > > Oh well, that explains it. > > > > > + if (frag_mem_limit(nf) > nf->low_thresh) { > > > > inet_frag_schedule_worker(f); > > > > + update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16); > > > > + } > > You need to reduce this to a lower value. > Your cpu count * batch_value needs to be less than low_thresh to avoid > problems. > > Wtih 384 cpus its close to 12 mbyte... > > Perhaps do this: > > update_frag_mem_limit(nf, 2 * 1024*1024 / NR_CPUS); > > > However, I think its better to revert the percpu counter change and move back > to a single atomic_t count. Ok. Florian and Jesper, many thanks for this issue.
RE: Question about ip_defrag
Best Regards, liujian > -Original Message- > From: liujian (CE) > Sent: Tuesday, August 29, 2017 3:39 PM > To: 'Florian Westphal' > Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru; > yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com; > netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A) > Subject: RE: Question about ip_defrag > > > > > -Original Message- > > From: netdev-ow...@vger.kernel.org > > [mailto:netdev-ow...@vger.kernel.org] > > On Behalf Of Florian Westphal > > Sent: Monday, August 28, 2017 10:01 PM > > To: liujian (CE) > > Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru; > > yoshf...@linux-ipv6.org; elena.reshet...@intel.com; > > eduma...@google.com; netdev@vger.kernel.org; Wangkefeng (Kevin); > > weiyongjun (A) > > Subject: Re: Question about ip_defrag > > > > liujian (CE) <liujia...@huawei.com> wrote: > > > Hi > > > > > > I checked our 3.10 kernel, we had backported all percpu_counter bug > > > fix in > > lib/percpu_counter.c and include/linux/percpu_counter.h. > > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough. > > > > > > > > > > the issue: > > > > > > > Ip_defrag fail caused by frag_mem_limit reached > > 4M(frags.high_thresh). > > > > > > > At this moment,sum_frag_mem_limit is about 10K. > > > > > > So should we change ipfrag high/low thresh to a reasonable value ? > > > And if it is, is there a standard to change the value? > > > > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know > > about so with 64 cpus that is ~8 mbyte. > > > > possible solutions: > > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and > > high thresh depend on NR_CPUS > > > Thank you for your reply. > > > liujian, does this change help in any way? > > I will have a try. Now, I have not the real environment. I use iperf generate fragment packets; and I always change NIC rx irq's affinity cpu, to make sure frag_mem_limit reach to thresh. my test machine, CPU num is 384. As above , test the patch , seemingly , there is no improving... Check /proc/net/snmp, there is no significant difference. maybe we should find a good test method! root@RH8100-V3:/proc/net# cat sockstat sockets: used 1386 TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 1 UDP: inuse 44 mem 42 UDPLITE: inuse 0 RAW: inuse 0 FRAG: inuse 1 memory 34336, 3144424. > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > > --- a/net/ipv4/inet_fragment.c > > +++ b/net/ipv4/inet_fragment.c > > @@ -123,6 +123,17 @@ static bool inet_fragq_should_evict(const struct > > inet_frag_queue *q) > >frag_mem_limit(q->net) >= q->net->low_thresh; } > > > > +/* ->mem batch size is huge, this can cause severe discrepancies > > + * between actual value (sum of pcpu values) and the global estimate. > > + * > > + * Use a smaller batch to give an opportunity for the global estimate > > + * to more accurately reflect current state. > > + */ > > +static void update_frag_mem_limit(struct netns_frags *nf, unsigned > > +int > > +batch) { > > +percpu_counter_add_batch(>mem, 0, batch); } > > + > > static unsigned int > > inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb) > > { @@ > > -146,8 +157,12 @@ inet_evict_bucket(struct inet_frags *f, struct > > inet_frag_bucket *hb) > > > > spin_unlock(>chain_lock); > > > > - hlist_for_each_entry_safe(fq, n, , list_evictor) > > + hlist_for_each_entry_safe(fq, n, , list_evictor) { > > + struct netns_frags *nf = fq->net; > > + > > f->frag_expire((unsigned long) fq); > > + update_frag_mem_limit(nf, 1); > > > + } > > > > return evicted; > > } > > @@ -396,8 +411,10 @@ struct inet_frag_queue *inet_frag_find(struct > > netns_frags *nf, > > struct inet_frag_queue *q; > > int depth = 0; > > > > - if (frag_mem_limit(nf) > nf->low_thresh) > > + if (frag_mem_limit(nf) > nf->low_thresh) { > > inet_frag_schedule_worker(f); > > + update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16); > > + } > > > > hash &= (INETFRAGS_HASHSZ - 1); > > hb = >hash[hash]; > > @@ -416,6 +433,8 @@ struct inet_frag_queue *inet_frag_find(struct > > netns_frags *nf, > > if (depth <= INETFRAGS_MAXDEPTH) > > return inet_frag_create(nf, f, key); > > > > + update_frag_mem_limit(nf, 1); > > + > > if (inet_frag_may_rebuild(f)) { > > if (!f->rebuild) > > f->rebuild = true;
RE: Question about ip_defrag
> -Original Message- > From: Jesper Dangaard Brouer [mailto:bro...@redhat.com] > Sent: Tuesday, August 29, 2017 3:20 PM > To: Florian Westphal > Cc: liujian (CE); da...@davemloft.net; kuz...@ms2.inr.ac.ru; > yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com; > netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A); > bro...@redhat.com > Subject: Re: Question about ip_defrag > > On Mon, 28 Aug 2017 16:00:32 +0200 > Florian Westphal <f...@strlen.de> wrote: > > > liujian (CE) <liujia...@huawei.com> wrote: > > > Hi > > > > > > I checked our 3.10 kernel, we had backported all percpu_counter bug fix in > lib/percpu_counter.c and include/linux/percpu_counter.h. > > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough. > > > > > > > > > > the issue: > > > > > > > Ip_defrag fail caused by frag_mem_limit reached > 4M(frags.high_thresh). > > > > > > > At this moment,sum_frag_mem_limit is about 10K. > > > > > > So should we change ipfrag high/low thresh to a reasonable value ? > > > And if it is, is there a standard to change the value? > > > > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know > > about so with 64 cpus that is ~8 mbyte. > > > > possible solutions: > > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and > > high thresh depend on NR_CPUS > > To me it looks like we/I have been using the wrong API for comparing against > percpu_counters. I guess we should have used > __percpu_counter_compare(). Are you means? Change if (frag_mem_limit(nf) > nf->low_thresh) to __percpu_counter_compare(>mem, nf->low_thresh, frag_percpu_counter_batch) > /* > * Compare counter against given value. > * Return 1 if greater, 0 if equal and -1 if less */ int > __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch) { > s64 count; > > count = percpu_counter_read(fbc); > /* Check to see if rough count will be sufficient for comparison */ > if (abs(count - rhs) > (batch * num_online_cpus())) { > if (count > rhs) > return 1; > else > return -1; > } > /* Need to use precise count */ > count = percpu_counter_sum(fbc); > if (count > rhs) > return 1; > else if (count < rhs) > return -1; > else > return 0; > } > EXPORT_SYMBOL(__percpu_counter_compare); > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
RE: Question about ip_defrag
> -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] > On Behalf Of Florian Westphal > Sent: Monday, August 28, 2017 10:01 PM > To: liujian (CE) > Cc: Jesper Dangaard Brouer; da...@davemloft.net; kuz...@ms2.inr.ac.ru; > yoshf...@linux-ipv6.org; elena.reshet...@intel.com; eduma...@google.com; > netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A) > Subject: Re: Question about ip_defrag > > liujian (CE) <liujia...@huawei.com> wrote: > > Hi > > > > I checked our 3.10 kernel, we had backported all percpu_counter bug fix in > lib/percpu_counter.c and include/linux/percpu_counter.h. > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough. > > > > > > > > the issue: > > > > > > Ip_defrag fail caused by frag_mem_limit reached > 4M(frags.high_thresh). > > > > > > At this moment,sum_frag_mem_limit is about 10K. > > > > So should we change ipfrag high/low thresh to a reasonable value ? > > And if it is, is there a standard to change the value? > > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know about > so with 64 cpus that is ~8 mbyte. > > possible solutions: > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and high > thresh depend on NR_CPUS > Thank you for your reply. > liujian, does this change help in any way? I will have a try. > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > --- a/net/ipv4/inet_fragment.c > +++ b/net/ipv4/inet_fragment.c > @@ -123,6 +123,17 @@ static bool inet_fragq_should_evict(const struct > inet_frag_queue *q) > frag_mem_limit(q->net) >= q->net->low_thresh; } > > +/* ->mem batch size is huge, this can cause severe discrepancies > + * between actual value (sum of pcpu values) and the global estimate. > + * > + * Use a smaller batch to give an opportunity for the global estimate > + * to more accurately reflect current state. > + */ > +static void update_frag_mem_limit(struct netns_frags *nf, unsigned int > +batch) { > + percpu_counter_add_batch(>mem, 0, batch); } > + > static unsigned int > inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb) { @@ > -146,8 +157,12 @@ inet_evict_bucket(struct inet_frags *f, struct > inet_frag_bucket *hb) > > spin_unlock(>chain_lock); > > - hlist_for_each_entry_safe(fq, n, , list_evictor) > + hlist_for_each_entry_safe(fq, n, , list_evictor) { > + struct netns_frags *nf = fq->net; > + > f->frag_expire((unsigned long) fq); > + update_frag_mem_limit(nf, 1); > + } > > return evicted; > } > @@ -396,8 +411,10 @@ struct inet_frag_queue *inet_frag_find(struct > netns_frags *nf, > struct inet_frag_queue *q; > int depth = 0; > > - if (frag_mem_limit(nf) > nf->low_thresh) > + if (frag_mem_limit(nf) > nf->low_thresh) { > inet_frag_schedule_worker(f); > + update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16); > + } > > hash &= (INETFRAGS_HASHSZ - 1); > hb = >hash[hash]; > @@ -416,6 +433,8 @@ struct inet_frag_queue *inet_frag_find(struct > netns_frags *nf, > if (depth <= INETFRAGS_MAXDEPTH) > return inet_frag_create(nf, f, key); > > + update_frag_mem_limit(nf, 1); > + > if (inet_frag_may_rebuild(f)) { > if (!f->rebuild) > f->rebuild = true;
RE: Question about ip_defrag
Hi I checked our 3.10 kernel, we had backported all percpu_counter bug fix in lib/percpu_counter.c and include/linux/percpu_counter.h. And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough. > > > > the issue: > > > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh). > > > > At this moment,sum_frag_mem_limit is about 10K. So should we change ipfrag high/low thresh to a reasonable value ? And if it is, is there a standard to change the value? root@RH8100-V3:/proc/net# cat sockstat sockets: used 1485 TCP: inuse 4 orphan 0 tw 0 alloc 5 mem 1 UDP: inuse 203 mem 201 UDPLITE: inuse 0 RAW: inuse 0 FRAG: inuse 1 memory 16048, 3156696. root@RH8100-V3:/proc/net# In order to print frag_mem_limit, change the code as below: diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 43eb6567..38bfb20 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -73,7 +73,7 @@ static int sockstat_seq_show(struct seq_file *seq, void *v) seq_printf(seq, "RAW: inuse %d\n", sock_prot_inuse_get(net, _prot)); frag_mem = ip_frag_mem(net); - seq_printf(seq, "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem); + seq_printf(seq, "FRAG: inuse %u memory %u, %u.\n", !!frag_mem, frag_mem, frag_mem_limit(>ipv4.frags)); return 0; } Best Regards, liujian > -Original Message- > From: liujian (CE) > Sent: Friday, August 25, 2017 9:33 AM > To: 'Jesper Dangaard Brouer' > Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru; yoshf...@linux-ipv6.org; > elena.reshet...@intel.com; eduma...@google.com; netdev@vger.kernel.org; > Wangkefeng (Kevin); weiyongjun (A) > Subject: RE: Question about ip_defrag > > > > -Original Message- > > From: Jesper Dangaard Brouer [mailto:bro...@redhat.com] > > Sent: Friday, August 25, 2017 2:59 AM > > To: liujian (CE) > > Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru; > > yoshf...@linux-ipv6.org; elena.reshet...@intel.com; > > eduma...@google.com; netdev@vger.kernel.org; bro...@redhat.com > > Subject: Re: Question about ip_defrag > > > > > > On Thu, 24 Aug 2017 16:04:41 + "liujian (CE)" > > <liujia...@huawei.com> > > wrote: > > > > > >What kernel version have you seen this issue with? > > > > > > 3.10,with some backport. > > > > > > >As far as I remember, this issue have been fixed before... > > > > > > which one patch? I didnot find out the patch:( > > > > AFAIK it was some bugs in the percpu_counter code. If you need to > > backport look at the git commits: > > > > git log lib/percpu_counter.c include/linux/percpu_counter.h > > > > Are you maintaining your own 3.10 kernel? > > > > I know that for RHEL7 (also kernel 3.10) we backported the > > percpu_counter fixes... > > > Could you tell me which one patch? we have backported most of the two > files's change. > Thank you ~ > > > > --Jesper > > > > > > > 发件人: Jesper Dangaard Brouer > > > 收件人: liujian > > (CE)<liujia...@huawei.com<mailto:liujia...@huawei.com>> > > > 抄送: > > > > > > da...@davemloft.net<mailto:da...@davemloft.net>;kuz...@ms2.inr.ac.ru > > > > ailto:kuz...@ms2.inr.ac.ru>;yoshf...@linux-ipv6.org<mailto:yoshfuji@ > > > li > > > nux-ipv6.org>;elena.reshet...@intel.com<mailto:elena.reshetova@intel > > > .c > > > > > > om>;eduma...@google.com<mailto:eduma...@google.com>;netdev@vger.k > > ernel > > > .org<mailto:netdev@vger.kernel.org>;bro...@redhat.com<mailto:brouer > @ > > > r > > e > > > dhat.com> > > > 主题: Re: Question about ip_defrag > > > 时间: 2017-08-24 21:53:17 > > > > > > > > > On Thu, 24 Aug 2017 13:15:33 + "liujian (CE)" > > > <liujia...@huawei.com> > > wrote: > > > > Hello, > > > > > > > > With below patch we met one issue. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > > > /c ommit/?h=v4.13-rc6=6d7b857d541e > > > > > > > > the issue: > > > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh). > > > > At this moment,sum_frag_mem_limit is about 10K. > > > > and my test machine's cpu num is 64. > > > > > > > > Can i only change frag_mem_limit to sum_ frag_mem_limit? > > > > > > > > > > > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > >
RE: Question about ip_defrag
> -Original Message- > From: Jesper Dangaard Brouer [mailto:bro...@redhat.com] > Sent: Friday, August 25, 2017 2:59 AM > To: liujian (CE) > Cc: da...@davemloft.net; kuz...@ms2.inr.ac.ru; yoshf...@linux-ipv6.org; > elena.reshet...@intel.com; eduma...@google.com; netdev@vger.kernel.org; > bro...@redhat.com > Subject: Re: Question about ip_defrag > > > On Thu, 24 Aug 2017 16:04:41 + "liujian (CE)" <liujia...@huawei.com> > wrote: > > > >What kernel version have you seen this issue with? > > > > 3.10,with some backport. > > > > >As far as I remember, this issue have been fixed before... > > > > which one patch? I didnot find out the patch:( > > AFAIK it was some bugs in the percpu_counter code. If you need to backport > look at the git commits: > > git log lib/percpu_counter.c include/linux/percpu_counter.h > > Are you maintaining your own 3.10 kernel? > > I know that for RHEL7 (also kernel 3.10) we backported the percpu_counter > fixes... > Could you tell me which one patch? we have backported most of the two files's change. Thank you ~ > --Jesper > > > > 发件人: Jesper Dangaard Brouer > > 收件人: liujian > (CE)<liujia...@huawei.com<mailto:liujia...@huawei.com>> > > 抄送: > > > da...@davemloft.net<mailto:da...@davemloft.net>;kuz...@ms2.inr.ac.ru > > ailto:kuz...@ms2.inr.ac.ru>;yoshf...@linux-ipv6.org<mailto:yoshfuji@li > > nux-ipv6.org>;elena.reshet...@intel.com<mailto:elena.reshetova@intel.c > > > om>;eduma...@google.com<mailto:eduma...@google.com>;netdev@vger.k > ernel > > .org<mailto:netdev@vger.kernel.org>;bro...@redhat.com<mailto:brouer@r > e > > dhat.com> > > 主题: Re: Question about ip_defrag > > 时间: 2017-08-24 21:53:17 > > > > > > On Thu, 24 Aug 2017 13:15:33 + "liujian (CE)" <liujia...@huawei.com> > wrote: > > > Hello, > > > > > > With below patch we met one issue. > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c > > > ommit/?h=v4.13-rc6=6d7b857d541e > > > > > > the issue: > > > Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh). > > > At this moment,sum_frag_mem_limit is about 10K. > > > and my test machine's cpu num is 64. > > > > > > Can i only change frag_mem_limit to sum_ frag_mem_limit? > > > > > > > > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > > > index 96e95e8..f09c00b 100644 > > > --- a/net/ipv4/inet_fragment.c > > > +++ b/net/ipv4/inet_fragment.c > > > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct > > > inet_frags *f) static bool inet_fragq_should_evict(const struct > > > inet_frag_queue *q) { > > > return q->net->low_thresh == 0 || > > > - frag_mem_limit(q->net) >= q->net->low_thresh; > > > + sum_frag_mem_limit(q->net) >= q->net->low_thresh; > > > } > > > > > > static unsigned int > > > @@ -355,7 +355,7 @@ static struct inet_frag_queue > > > *inet_frag_alloc(struct netns_frags *nf, { > > > struct inet_frag_queue *q; > > > > > > - if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) { > > > + if (!nf->high_thresh || sum_frag_mem_limit(nf) > > > > + nf->high_thresh) { > > > inet_frag_schedule_worker(f); > > > return NULL; > > > } > > > @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct > netns_frags *nf, > > > struct inet_frag_queue *q; > > > int depth = 0; > > > > > > - if (frag_mem_limit(nf) > nf->low_thresh) > > > + if (sum_frag_mem_limit(nf) > nf->low_thresh) > > > inet_frag_schedule_worker(f); > > > > > > hash &= (INETFRAGS_HASHSZ - 1); > > > -- > > > > > > Thank you for your time. > > > > What kernel version have you seen this issue with? > > > > As far as I remember, this issue have been fixed before... > > > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
Hi Wang cong, After apply the patch, I did not hit the issue again. Thank you~ Best Regards, liujian > -Original Message- > From: Dingtianhong > Sent: Monday, July 24, 2017 9:29 AM > To: Cong Wang; liujian (CE) > Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com; > da...@davemloft.net; eduma...@google.com; will...@google.com; > dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired > > > > On 2017/7/24 9:09, Ding Tianhong wrote: > > > > > > On 2017/7/24 1:03, Cong Wang wrote: > >> On Sun, Jul 23, 2017 at 5:48 AM, liujian (CE) <liujia...@huawei.com> wrote: > >>> Hi > >>> > >>> I find it caused by below steps: > >>> 1. set tp_version to TPACKET_V3 and req->tp_block_nr to 1 2. set > >>> tp_block_nr to 0 Then pg_vec was freed, and we did not delete the > >>> timer? > >> > >> Thanks for testing! > >> > >> Ah, I overlook the initialization case in my previous patch. > >> > >> How about the following one? Does it cover all the cases? > >> > >> > >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index > >> 008bb34ee324..0615c2a950fa 100644 > >> --- a/net/packet/af_packet.c > >> +++ b/net/packet/af_packet.c > >> @@ -4329,7 +4329,7 @@ static int packet_set_ring(struct sock *sk, > >> union tpacket_req_u *req_u, > >> register_prot_hook(sk); > >> } > >> spin_unlock(>bind_lock); > >> - if (closing && (po->tp_version > TPACKET_V2)) { > >> + if (pg_vec && (po->tp_version > TPACKET_V2)) { > >> /* Because we don't support block-based V3 on tx-ring > */ > >> if (!tx_ring) > >> prb_shutdown_retire_blk_timer(po, > rb_queue); > >> > >> . > > > > Hi, Cong: > > > > It looks like could not cover the case: req->tp_block_nr = 2 -> > reg->tp_block_nr = 1 . > > > > Oh, looks like this case would never happen, so I think your solution is ok. > > > what about this way: > > --- a/net/packet/af_packet.c > > +++ b/net/packet/af_packet.c > > @@ -4331,13 +4331,17 @@ static int packet_set_ring(struct sock *sk, union > tpacket_req_u *req_u, > > register_prot_hook(sk); > > } > > spin_unlock(>bind_lock); > > - if (closing && (po->tp_version > TPACKET_V2)) { > > + if ((closing || (pg_vec && !reg->tp_block_nr))&& > > + (po->tp_version > TPACKET_V2)) { > > /* Because we don't support block-based V3 on tx-ring */ > > if (!tx_ring) > > prb_shutdown_retire_blk_timer(po, > rb_queue); > > > > > > >> > > > > > > . > >
RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
Hi I find it caused by below steps: 1. set tp_version to TPACKET_V3 and req->tp_block_nr to 1 2. set tp_block_nr to 0 Then pg_vec was freed, and we did not delete the timer? Best Regards, liujian > -Original Message- > From: liujian (CE) > Sent: Sunday, July 23, 2017 5:47 PM > To: liujian (CE); 'Cong Wang'; Dingtianhong > Cc: 'Willem de Bruijn'; 'Dave Jones'; 'alexander.le...@verizon.com'; > 'da...@davemloft.net'; 'eduma...@google.com'; 'will...@google.com'; > 'dan...@iogearbox.net'; 'netdev@vger.kernel.org'; > 'linux-ker...@vger.kernel.org' > Subject: RE: af_packet: use after free in prb_retire_rx_blk_timer_expired > > Hi, > > Do we need delete the v3 ring, when tp_version changed from TPACKET_V3 to > TPACKET_V1 ? > > > Best Regards, > liujian > > > > -Original Message- > > From: liujian (CE) > > Sent: Sunday, July 23, 2017 4:21 PM > > To: 'Cong Wang'; Dingtianhong > > Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com; > > da...@davemloft.net; eduma...@google.com; will...@google.com; > > dan...@iogearbox.net; netdev@vger.kernel.org; > > linux-ker...@vger.kernel.org > > Subject: RE: af_packet: use after free in > > prb_retire_rx_blk_timer_expired > > > > Hi Wang Cong, > > > > With this patch , the system was crashed when setsockopt. > > > > The call trace as below: > > > > crash> bt > > PID: 3069 TASK: 8800afcc CPU: 0 COMMAND: "trinity-main" > > #0 [8801bec03ce0] machine_kexec at 8105354b > > #1 [8801bec03d40] crash_kexec at 810f7e82 > > #2 [8801bec03e10] panic at 81650058 > > #3 [8801bec03e90] watchdog_timer_fn at 81122533 > > #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2 > > #5 [8801bec03f20] hrtimer_interrupt at 810ac450 > > #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457 > > #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0 > > #8 [8801bec03fb0] apic_timer_interrupt at 8166931d > > --- --- > > #9 [8801b301fcb8] apic_timer_interrupt at 8166931d > > [exception RIP: lock_timer_base+77] > > RIP: 8108dced RSP: 8801b301fd60 RFLAGS: 0246 > > RAX: RBX: 8800afcc RCX: > > 0001 > > RDX: 8800afcc RSI: 8801b301fd90 RDI: 8800b0d853c8 > > RBP: 8801b301fd80 R8: 8800afcc R9: > ea000268 > > R10: 003c R11: 8801b301fb2e R12: > 8800afcc > > R13: 8800afcc R14: R15: > 83d1a340 > > ORIG_RAX: ff10 CS: 0010 SS: 0018 > > #10 [8801b301fd88] try_to_del_timer_sync at 8108f19f > > #11 [8801b301fdb8] del_timer_sync at 8108f252 > > #12 [8801b301fdd0] packet_set_ring at 81635e60 > > #13 [8801b301fe98] packet_setsockopt at 81636760 > > #14 [8801b301ff38] sys_setsockopt at 81531860 > > #15 [8801b301ff80] tracesys at 816687ed (via system_call) > > RIP: 7fcc78b03e3a RSP: 7fff16f246b8 RFLAGS: 0202 > > RAX: ffda RBX: 816687ed RCX: > > RDX: 0005 RSI: 0107 RDI: > > 0180 > > RBP: 0180 R8: 001c R9: > > 7fcc78dc7160 > > R10: 01fd6ba0 R11: 00000202 R12: > > > > R13: 0011 R14: 01fd6b60 R15: > > 01fd6b70 > > ORIG_RAX: 0036 CS: 0033 SS: 002b > > > > > > Best Regards, > > liujian > > > > > > > -Original Message- > > > From: Cong Wang [mailto:xiyou.wangc...@gmail.com] > > > Sent: Sunday, July 23, 2017 1:59 PM > > > To: Dingtianhong > > > Cc: liujian (CE); Willem de Bruijn; Dave Jones; > > > alexander.le...@verizon.com; da...@davemloft.net; > > eduma...@google.com; > > > will...@google.com; dan...@iogearbox.net; netdev@vger.kernel.org; > > > linux-ker...@vger.kernel.org > > > Subject: Re: af_packet: use after free in > > > prb_retire_rx_blk_timer_expired > > > > > > On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong > > > <dingtianh...@huawei.com> > > > wrote: > > > > Hi, Cong: > > > > > > > > Thanks for your quirk solution, but I still has some doubts about > > > > it, it looks like fix the problem in the &
RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
Hi, Do we need delete the v3 ring, when tp_version changed from TPACKET_V3 to TPACKET_V1 ? Best Regards, liujian > -Original Message- > From: liujian (CE) > Sent: Sunday, July 23, 2017 4:21 PM > To: 'Cong Wang'; Dingtianhong > Cc: Willem de Bruijn; Dave Jones; alexander.le...@verizon.com; > da...@davemloft.net; eduma...@google.com; will...@google.com; > dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: RE: af_packet: use after free in prb_retire_rx_blk_timer_expired > > Hi Wang Cong, > > With this patch , the system was crashed when setsockopt. > > The call trace as below: > > crash> bt > PID: 3069 TASK: 8800afcc CPU: 0 COMMAND: "trinity-main" > #0 [8801bec03ce0] machine_kexec at 8105354b > #1 [8801bec03d40] crash_kexec at 810f7e82 > #2 [8801bec03e10] panic at 81650058 > #3 [8801bec03e90] watchdog_timer_fn at 81122533 > #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2 > #5 [8801bec03f20] hrtimer_interrupt at 810ac450 > #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457 > #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0 > #8 [8801bec03fb0] apic_timer_interrupt at 8166931d > --- --- > #9 [8801b301fcb8] apic_timer_interrupt at 8166931d > [exception RIP: lock_timer_base+77] > RIP: 8108dced RSP: 8801b301fd60 RFLAGS: 0246 > RAX: RBX: 8800afcc RCX: > 0001 > RDX: 8800afcc RSI: 8801b301fd90 RDI: 8800b0d853c8 > RBP: 8801b301fd80 R8: 8800afcc R9: ea000268 > R10: 003c R11: 8801b301fb2e R12: 8800afcc > R13: 8800afcc R14: R15: 83d1a340 > ORIG_RAX: ff10 CS: 0010 SS: 0018 > #10 [8801b301fd88] try_to_del_timer_sync at 8108f19f > #11 [8801b301fdb8] del_timer_sync at 8108f252 > #12 [8801b301fdd0] packet_set_ring at 81635e60 > #13 [8801b301fe98] packet_setsockopt at 81636760 > #14 [8801b301ff38] sys_setsockopt at 81531860 > #15 [8801b301ff80] tracesys at 816687ed (via system_call) > RIP: 7fcc78b03e3a RSP: 7fff16f246b8 RFLAGS: 0202 > RAX: ffda RBX: 816687ed RCX: > RDX: 0005 RSI: 0107 RDI: > 0180 > RBP: 0180 R8: 001c R9: > 7fcc78dc7160 > R10: 01fd6ba0 R11: 0202 R12: > > R13: 0011 R14: 01fd6b60 R15: > 01fd6b70 > ORIG_RAX: 0036 CS: 0033 SS: 002b > > > Best Regards, > liujian > > > > -Original Message- > > From: Cong Wang [mailto:xiyou.wangc...@gmail.com] > > Sent: Sunday, July 23, 2017 1:59 PM > > To: Dingtianhong > > Cc: liujian (CE); Willem de Bruijn; Dave Jones; > > alexander.le...@verizon.com; da...@davemloft.net; > eduma...@google.com; > > will...@google.com; dan...@iogearbox.net; netdev@vger.kernel.org; > > linux-ker...@vger.kernel.org > > Subject: Re: af_packet: use after free in > > prb_retire_rx_blk_timer_expired > > > > On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong > > <dingtianh...@huawei.com> > > wrote: > > > Hi, Cong: > > > > > > Thanks for your quirk solution, but I still has some doubts about > > > it, it looks like fix the problem in the > > > packet_setsockopt->packet_set_ring processing, but when in > > > packet_release processing, it may could not release the real pg_vec > > > for the TPACKET_V3 ring, and then cause the mem leak, maybe I miss > > > something here, nice to hear from your feedback. :) > > > > Yes you miss that packet_release() has memset()'s so we won't hit that > > path. :) > > > > However, I missed the swap() in this messy function, actually I > > believe the bug is that we modify tpacket_kbdq_core inside rx_ring in > > non-closing case without actually stopping its timer. I feel more confident > with the following patch: > > > > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index > > 008bb34ee324..267b181fef15 100644 > > --- a/net/packet/af_packet.c > > +++ b/net/packet/af_packet.c > > @@ -4263,6 +4263,7 @@ static int packet_set_ring(struct sock *sk, > > union tpacket_req_u *req_u, > > case TPACKET_V3: > > /* Block transmit is not supported yet */ > > if (!tx_ring) { > > + prb_shutdown_retire_blk_timer(po, > > + rb_queue); > > init_prb_bdqc(po, rb, pg_vec, req_u); > > } else { > > struct tpacket_req3 *req3 = > > _u->req3;
RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
Hi Wang Cong, With this patch , the system was crashed when setsockopt. The call trace as below: crash> bt PID: 3069 TASK: 8800afcc CPU: 0 COMMAND: "trinity-main" #0 [8801bec03ce0] machine_kexec at 8105354b #1 [8801bec03d40] crash_kexec at 810f7e82 #2 [8801bec03e10] panic at 81650058 #3 [8801bec03e90] watchdog_timer_fn at 81122533 #4 [8801bec03ec8] __hrtimer_run_queues at 810abeb2 #5 [8801bec03f20] hrtimer_interrupt at 810ac450 #6 [8801bec03f70] local_apic_timer_interrupt at 8104a457 #7 [8801bec03f88] smp_apic_timer_interrupt at 8166aed0 #8 [8801bec03fb0] apic_timer_interrupt at 8166931d --- --- #9 [8801b301fcb8] apic_timer_interrupt at 8166931d [exception RIP: lock_timer_base+77] RIP: 8108dced RSP: 8801b301fd60 RFLAGS: 0246 RAX: RBX: 8800afcc RCX: 0001 RDX: 8800afcc RSI: 8801b301fd90 RDI: 8800b0d853c8 RBP: 8801b301fd80 R8: 8800afcc R9: ea000268 R10: 003c R11: 8801b301fb2e R12: 8800afcc R13: 8800afcc R14: R15: 83d1a340 ORIG_RAX: ff10 CS: 0010 SS: 0018 #10 [8801b301fd88] try_to_del_timer_sync at 8108f19f #11 [8801b301fdb8] del_timer_sync at 8108f252 #12 [8801b301fdd0] packet_set_ring at 81635e60 #13 [8801b301fe98] packet_setsockopt at 81636760 #14 [8801b301ff38] sys_setsockopt at 81531860 #15 [8801b301ff80] tracesys at 816687ed (via system_call) RIP: 7fcc78b03e3a RSP: 7fff16f246b8 RFLAGS: 0202 RAX: ffda RBX: 816687ed RCX: RDX: 0005 RSI: 0107 RDI: 0180 RBP: 0180 R8: 001c R9: 7fcc78dc7160 R10: 01fd6ba0 R11: 0202 R12: R13: 0011 R14: 01fd6b60 R15: 01fd6b70 ORIG_RAX: 0036 CS: 0033 SS: 002b Best Regards, liujian > -Original Message- > From: Cong Wang [mailto:xiyou.wangc...@gmail.com] > Sent: Sunday, July 23, 2017 1:59 PM > To: Dingtianhong > Cc: liujian (CE); Willem de Bruijn; Dave Jones; alexander.le...@verizon.com; > da...@davemloft.net; eduma...@google.com; will...@google.com; > dan...@iogearbox.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired > > On Sat, Jul 22, 2017 at 8:40 PM, Ding Tianhong <dingtianh...@huawei.com> > wrote: > > Hi, Cong: > > > > Thanks for your quirk solution, but I still has some doubts about it, > > it looks like fix the problem in the > > packet_setsockopt->packet_set_ring processing, but when in > > packet_release processing, it may could not release the real pg_vec > > for the TPACKET_V3 ring, and then cause the mem leak, maybe I miss > > something here, nice to hear from your feedback. :) > > Yes you miss that packet_release() has memset()'s so we won't hit that path. > :) > > However, I missed the swap() in this messy function, actually I believe the > bug > is that we modify tpacket_kbdq_core inside rx_ring in non-closing case without > actually stopping its timer. I feel more confident with the following patch: > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index > 008bb34ee324..267b181fef15 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -4263,6 +4263,7 @@ static int packet_set_ring(struct sock *sk, union > tpacket_req_u *req_u, > case TPACKET_V3: > /* Block transmit is not supported yet */ > if (!tx_ring) { > + prb_shutdown_retire_blk_timer(po, > + rb_queue); > init_prb_bdqc(po, rb, pg_vec, req_u); > } else { > struct tpacket_req3 *req3 = > _u->req3;
RE: af_packet: use after free in prb_retire_rx_blk_timer_expired
I also hit this issue with trinity test: The call trace: [exception RIP: prb_retire_rx_blk_timer_expired+70] RIP: 81633be6 RSP: 8801bec03dc0 RFLAGS: 00010246 RAX: RBX: 8801b49d0948 RCX: RDX: 8801b31057a0 RSI: a56b6b6b6b6b6b6b RDI: 8801b49d09ec RBP: 8801bec03dd8 R8: 0001 R9: 83e1bf80 R10: 0002 R11: 0005 R12: 8801b49d09ec R13: 0100 R14: 81633ba0 R15: 8801b49d0948 ORIG_RAX: CS: 0010 SS: 0018 #7 [8801bec03de0] call_timer_fn at 8108cb76 #8 [8801bec03e18] run_timer_softirq at 8108f87c #9 [8801bec03e90] __do_softirq at 8108629f #10 [8801bec03f00] call_softirq at 8166a01c #11 [8801bec03f18] do_softirq at 810172ad #12 [8801bec03f30] irq_exit at 81086655 #13 [8801bec03f48] msa_irq_exit at 810b1ab3 #14 [8801bec03f88] smp_apic_timer_interrupt at 8166aeae #15 [8801bec03fb0] apic_timer_interrupt at 816692dd --- --- And from vmcore, I can see the pointer GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); is a56b6b6b6b6b6b6b struct packet_ring_buffer rx_ring = { pg_vec = 0x0, head = 0x0, frames_per_block = 0x400, frame_size = 0x0, frame_max = 0x, pg_vec_order = 0x0, pg_vec_pages = 0x0, pg_vec_len = 0x0, pending_refcnt = 0x0, prb_bdqc = { pkbdq = 0x8801b31057a0, feature_req_word = 0x1, hdrlen = 0x44, reset_pending_on_curr_blk = 0x1, delete_blk_timer = 0x0, kactive_blk_num = 0x0, blk_sizeof_priv = 0x0, last_kactive_blk_num = 0x0, pkblk_start = 0x8800a700 struct: page excluded: kernel virtual address: 8800a700 type: "gdb_readmem_callback" struct: page excluded: kernel virtual address: 8800a700 type: "gdb_readmem_callback" , pkblk_end = 0x8800a720 "\002", kblk_size = 0x20, max_frame_len = 0x1fffd0, knum_blocks = 0x1, knxt_seq_num = 0x2, prev = 0x8800a730 struct: page excluded: kernel virtual address: 8800a730 type: "gdb_readmem_callback" struct: page excluded: kernel virtual address: 8800a730 type: "gdb_readmem_callback" , nxt_offset = 0x8800a730 struct: page excluded: kernel virtual address: 8800a730 type: "gdb_readmem_callback" struct: page excluded: kernel virtual address: 8800a730 type: "gdb_readmem_callback" , skb = 0x0, blk_fill_in_prog = { counter = 0x0 crash> struct pgv 0x8801b31057a0 struct pgv { buffer = 0xa56b6b6b6b6b6b6b } Best Regards, liujian > -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] > On Behalf Of Willem de Bruijn > Sent: Wednesday, April 12, 2017 7:23 AM > To: Dave Jones; alexander.le...@verizon.com; da...@davemloft.net; > eduma...@google.com; will...@google.com; dan...@iogearbox.net; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: af_packet: use after free in prb_retire_rx_blk_timer_expired > > On Mon, Apr 10, 2017 at 3:23 PM, Dave Jones> wrote: > > On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com > wrote: > > > Hi all, > > > > > > I seem to be hitting this use-after-free on a -next kernel using trinity: > > > > > > [ 531.036054] BUG: KASAN: use-after-free in > > prb_retire_rx_blk_timer_expired (net/packet/af_packet.c:688) > > The retire_blk_timer is called after the pg_vec struct for this ring was > freed. > This should not happen. packet_set_ring stops the timer with del_timer_sync > when tearing down the ring before freeing that > struct: > > if (closing && (po->tp_version > TPACKET_V2)) { > /* Because we don't support block-based V3 on tx-ring */ > if (!tx_ring) > prb_shutdown_retire_blk_timer(po, rb_queue); > } > > if (pg_vec) > free_pg_vec(pg_vec, order, req->tp_block_nr); > > This is a similar race to the use-after-free fixed by 84ac7260236a > ("packet: fix race condition in packet_set_ring"). The previous race was > triggered by a call to setsockopt PACKET_VERSION changing tp_version while > the ring is active. It is not immediately obvious what is the cause now. I > suppose trinity does not give a trace of such system calls on this file > descriptor? > That would be helpful. > > The bug report shows both a timer firing after the packet_set_ring call that > freed the pg_vec, and later a CONFIG_DEBUG_OBJECTS_FREE warning that > the timer is still active when the socket is closed on release of the last > file > descriptor.