So... back to the original problem.

I'm glad non-reentrancy can be ruled out as a root cause, but that is 
always something that's good to tackle.

It is possible that there's a race when the system is loaded, but the 
bug is all too obvious -- we are using a pointer as a cache, and there 
is no way to invalidate it.

If there has been a race, I would really like to catch it.

My best guess is that the send ran before one of these:
 1. a FinderClient operation, which would invalidate the resolved sender 
transport, runs before the Xrl send.
     This is most likely.
 2. a transport layer close notification inside an XrlPFSender.
     But this is unlikely. XrlPFSTCP uses BufferedAsyncFileReader. Only 
the Win32 transport code in that file will dispatch an async close, 
because that's what Winsock does.

...where XRL is concerned, these are the places I'd look more carefully 
for races.

    Sorry to be a nazi about ref_ptr<T>. It's one of those things I wish 
would just go away. There is a need for a refcounted object/pointer 
type. ref_ptr<T> fills that gap now; it existed before Boost did.

    There are several places where its use is actually typedef'd away, 
making it harder to tell, at-a-glance, what's going on. One of these is 
in the FinderClient. I'm going to want to get rid of this; ref_ptr<T> 
caused no end of problems for me in OLSR.

    I did say that Boost needs to be introduced carefully and 
incrementally; I'll be sticking to that. I took a look at how Boost 
implements the weak_ptr<T> / shared_ptr<T> split. It turns out they both 
reference an embedded sp_counted_base instance, to implement the 
refcounts. They maintain a separate count of holders and watchers. It's 
thread safe, but it'll use atomic ops if you don't want to use threads; 
and looks pretty efficient.

In the meantime, to avoid using ref_ptr<T>, here's what I'd suggest.
    We can sanity check 's' before its use using an existing map, 
XrlRouter::_senders2, which seems to have been introduced to support 
batch XRL operations. We know the key, there is one match and one match 
only, so it should be quick enough.
    Given that we already maintain this map, I'm surprised we don't use 
it for XrlRouter::get_sender() anyway. But beware: that method name is 
overloaded.

Of course, it's likely the change can be picked up on later on. I am 
looking at things with the Thrift goggles on right now, and seeing 
exactly the same code pattern I saw emerging in OLSR.

thanks,
BMS

_______________________________________________
Xorp-hackers mailing list
[email protected]
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers

Reply via email to