Re: rtfree(): "rt->rt_refcnt > 0" assertion

2021-10-02 Thread Vitaliy Makkoveev
On Sat, Oct 02, 2021 at 11:27:48AM +0200, Martin Pieuchot wrote:
> On 15/09/21(Wed) 01:23, Vitaliy Makkoveev wrote:
> > We have weird `rt_refcnt' check in rtfree():
> 
> > 
> > 497 rtfree(struct rtentry *rt)
> > 498 {
> > ...
> > 504 refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
> > 505 if (refcnt <= 0) {
> > 506 KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> > 507 KASSERT(!RT_ROOT(rt));
> > 508 atomic_dec_int();
> > 509 if (refcnt < 0) {
> > 510 printf("rtfree: %p not freed (neg refs)\n", rt);
> > 511 return;
> > 512 }  
> > 
> > We underflow `rt_refcnt' when we missed to get reference to this `rt' or
> > we did extra release. This is the bug which should be exposed. But
> > according current code this `rt' is just leaked. Also it's easy to miss
> > this error condition because we only print error message.
> 
> Yes, all of this is intentional.  If you screw up reference counting a
> leak is better than a crash that you can't debug.  At least people can
> report there is a leak.  
> 

Sorry, I don't assume the leak more helpful for debug. This error
message could be easily missed and not reported, but the panic could not
be ignored.

> > I propose to put "rt->rt_refcnt > 0" assertion before we decrement
> > `rt_refcnt'. This makes reference counting errors more notable when they
> > are. Also I changed `rt_refcnt' definition to unsigned integer.
> 
> Such assert isn't safe because the value can be dereferenced by another
> thread.  You can only assert for the value of a private variable or if
> you're under a critical section.
> 

May be I'm wrong but the sequence:

KASSERT(rt->rt_refcnt > 0);
if (atomic_dec_int_nv(>rt_refcnt) == 0) {

should not be reordered. This is the 'int' variable, so the 'read' (or
'load') and 'atomic decrement' commands could not interrupt each other.
We could have the case when many rtfree(9) threads hit this assertion,
but we could not have the situation when no one rtfree(9) thread hits
this. The critical section is not required here. We have the same logic
is other places, like closef():

1230 closef(struct file *fp, struct proc *p)
1231 {
1232 struct filedesc *fdp;
1233 
1234 if (fp == NULL)
1235 return (0);
1236 
1237 KASSERTMSG(fp->f_count >= 2, "count (%u) < 2", fp->f_count);
1238 
1239 atomic_dec_int(>f_count);

> > I didn't find any "rtfree: ... not freed (neg refs)" report, so it
> > looks like we can't hit this assertion, but I like to commit this
> > diff after release. But nothing stops to test it and provide feedback :)
> 
> We did hit it many many times in the past.  Please do not change this.
>

I searched "rtfree: ... not freed (neg refs)" by google but didn't find
anything. I propose to put this diff into the snaps before commit. We
should instantly receive reports if we have refcounting issues. Also we
should instantly receive reports if someone introduced them in the
future.

> > Index: sys/net/route.c
> > ===
> > RCS file: /cvs/src/sys/net/route.c,v
> > retrieving revision 1.399
> > diff -u -p -r1.399 route.c
> > --- sys/net/route.c 25 May 2021 22:45:09 -  1.399
> > +++ sys/net/route.c 14 Sep 2021 21:47:11 -
> > @@ -496,20 +496,15 @@ rtref(struct rtentry *rt)
> >  void
> >  rtfree(struct rtentry *rt)
> >  {
> > -   int  refcnt;
> > -
> > if (rt == NULL)
> > return;
> >  
> > -   refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
> > -   if (refcnt <= 0) {
> > +   KASSERT(rt->rt_refcnt > 0);
> > +
> > +   if (atomic_dec_int_nv(>rt_refcnt) == 0) {
> > KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> > KASSERT(!RT_ROOT(rt));
> > atomic_dec_int();
> > -   if (refcnt < 0) {
> > -   printf("rtfree: %p not freed (neg refs)\n", rt);
> > -   return;
> > -   }
> >  
> > KERNEL_LOCK();
> > rt_timer_remove_all(rt);
> > Index: sys/net/route.h
> > ===
> > RCS file: /cvs/src/sys/net/route.h,v
> > retrieving revision 1.185
> > diff -u -p -r1.185 route.h
> > --- sys/net/route.h 17 Mar 2021 09:05:42 -  1.185
> > +++ sys/net/route.h 14 Sep 2021 21:47:11 -
> > @@ -113,7 +113,7 @@ struct rtentry {
> > struct rt_kmetrics rt_rmx;  /* metrics used by rx'ing protocols */
> > unsigned int rt_ifidx;  /* the answer: interface to use */
> > unsigned int rt_flags;  /* up/down?, host/net */
> > -   int  rt_refcnt; /* # held references */
> > +   unsigned int rt_refcnt; /* # held references */
> > int  rt_plen;   /* prefix length */
> > uint16_t rt_labelid;/* route label ID */
> > uint8_t  rt_priority;   /* routing 

Re: rtfree(): "rt->rt_refcnt > 0" assertion

2021-10-02 Thread Martin Pieuchot
On 15/09/21(Wed) 01:23, Vitaliy Makkoveev wrote:
> We have weird `rt_refcnt' check in rtfree():

> 
> 497 rtfree(struct rtentry *rt)
> 498 {
>   ...
> 504 refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
> 505 if (refcnt <= 0) {
> 506 KASSERT(!ISSET(rt->rt_flags, RTF_UP));
> 507 KASSERT(!RT_ROOT(rt));
> 508 atomic_dec_int();
> 509 if (refcnt < 0) {
> 510 printf("rtfree: %p not freed (neg refs)\n", rt);
> 511 return;
> 512 }  
> 
> We underflow `rt_refcnt' when we missed to get reference to this `rt' or
> we did extra release. This is the bug which should be exposed. But
> according current code this `rt' is just leaked. Also it's easy to miss
> this error condition because we only print error message.

Yes, all of this is intentional.  If you screw up reference counting a
leak is better than a crash that you can't debug.  At least people can
report there is a leak.  

> I propose to put "rt->rt_refcnt > 0" assertion before we decrement
> `rt_refcnt'. This makes reference counting errors more notable when they
> are. Also I changed `rt_refcnt' definition to unsigned integer.

Such assert isn't safe because the value can be dereferenced by another
thread.  You can only assert for the value of a private variable or if
you're under a critical section.

> I didn't find any "rtfree: ... not freed (neg refs)" report, so it
> looks like we can't hit this assertion, but I like to commit this
> diff after release. But nothing stops to test it and provide feedback :)

We did hit it many many times in the past.  Please do not change this.

> Index: sys/net/route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.399
> diff -u -p -r1.399 route.c
> --- sys/net/route.c   25 May 2021 22:45:09 -  1.399
> +++ sys/net/route.c   14 Sep 2021 21:47:11 -
> @@ -496,20 +496,15 @@ rtref(struct rtentry *rt)
>  void
>  rtfree(struct rtentry *rt)
>  {
> - int  refcnt;
> -
>   if (rt == NULL)
>   return;
>  
> - refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
> - if (refcnt <= 0) {
> + KASSERT(rt->rt_refcnt > 0);
> +
> + if (atomic_dec_int_nv(>rt_refcnt) == 0) {
>   KASSERT(!ISSET(rt->rt_flags, RTF_UP));
>   KASSERT(!RT_ROOT(rt));
>   atomic_dec_int();
> - if (refcnt < 0) {
> - printf("rtfree: %p not freed (neg refs)\n", rt);
> - return;
> - }
>  
>   KERNEL_LOCK();
>   rt_timer_remove_all(rt);
> Index: sys/net/route.h
> ===
> RCS file: /cvs/src/sys/net/route.h,v
> retrieving revision 1.185
> diff -u -p -r1.185 route.h
> --- sys/net/route.h   17 Mar 2021 09:05:42 -  1.185
> +++ sys/net/route.h   14 Sep 2021 21:47:11 -
> @@ -113,7 +113,7 @@ struct rtentry {
>   struct rt_kmetrics rt_rmx;  /* metrics used by rx'ing protocols */
>   unsigned int rt_ifidx;  /* the answer: interface to use */
>   unsigned int rt_flags;  /* up/down?, host/net */
> - int  rt_refcnt; /* # held references */
> + unsigned int rt_refcnt; /* # held references */
>   int  rt_plen;   /* prefix length */
>   uint16_t rt_labelid;/* route label ID */
>   uint8_t  rt_priority;   /* routing priority to use */
> 



rtfree(): "rt->rt_refcnt > 0" assertion

2021-09-14 Thread Vitaliy Makkoveev
We have weird `rt_refcnt' check in rtfree():

497 rtfree(struct rtentry *rt)
498 {
...
504 refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
505 if (refcnt <= 0) {
506 KASSERT(!ISSET(rt->rt_flags, RTF_UP));
507 KASSERT(!RT_ROOT(rt));
508 atomic_dec_int();
509 if (refcnt < 0) {
510 printf("rtfree: %p not freed (neg refs)\n", rt);
511 return;
512 }  

We underflow `rt_refcnt' when we missed to get reference to this `rt' or
we did extra release. This is the bug which should be exposed. But
according current code this `rt' is just leaked. Also it's easy to miss
this error condition because we only print error message.

I propose to put "rt->rt_refcnt > 0" assertion before we decrement
`rt_refcnt'. This makes reference counting errors more notable when they
are. Also I changed `rt_refcnt' definition to unsigned integer.

I didn't find any "rtfree: ... not freed (neg refs)" report, so it
looks like we can't hit this assertion, but I like to commit this
diff after release. But nothing stops to test it and provide feedback :)

Index: sys/net/route.c
===
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.399
diff -u -p -r1.399 route.c
--- sys/net/route.c 25 May 2021 22:45:09 -  1.399
+++ sys/net/route.c 14 Sep 2021 21:47:11 -
@@ -496,20 +496,15 @@ rtref(struct rtentry *rt)
 void
 rtfree(struct rtentry *rt)
 {
-   int  refcnt;
-
if (rt == NULL)
return;
 
-   refcnt = (int)atomic_dec_int_nv(>rt_refcnt);
-   if (refcnt <= 0) {
+   KASSERT(rt->rt_refcnt > 0);
+
+   if (atomic_dec_int_nv(>rt_refcnt) == 0) {
KASSERT(!ISSET(rt->rt_flags, RTF_UP));
KASSERT(!RT_ROOT(rt));
atomic_dec_int();
-   if (refcnt < 0) {
-   printf("rtfree: %p not freed (neg refs)\n", rt);
-   return;
-   }
 
KERNEL_LOCK();
rt_timer_remove_all(rt);
Index: sys/net/route.h
===
RCS file: /cvs/src/sys/net/route.h,v
retrieving revision 1.185
diff -u -p -r1.185 route.h
--- sys/net/route.h 17 Mar 2021 09:05:42 -  1.185
+++ sys/net/route.h 14 Sep 2021 21:47:11 -
@@ -113,7 +113,7 @@ struct rtentry {
struct rt_kmetrics rt_rmx;  /* metrics used by rx'ing protocols */
unsigned int rt_ifidx;  /* the answer: interface to use */
unsigned int rt_flags;  /* up/down?, host/net */
-   int  rt_refcnt; /* # held references */
+   unsigned int rt_refcnt; /* # held references */
int  rt_plen;   /* prefix length */
uint16_t rt_labelid;/* route label ID */
uint8_t  rt_priority;   /* routing priority to use */