So... back to the original problem.
I'm glad non-reentrancy can be ruled out as a root cause, but that is
always something that's good to tackle.
It is possible that there's a race when the system is loaded, but the
bug is all too obvious -- we are using a pointer as a cache, and there
is no way to invalidate it.
If there has been a race, I would really like to catch it.
My best guess is that the send ran before one of these:
1. a FinderClient operation, which would invalidate the resolved sender
transport, runs before the Xrl send.
This is most likely.
2. a transport layer close notification inside an XrlPFSender.
But this is unlikely. XrlPFSTCP uses BufferedAsyncFileReader. Only
the Win32 transport code in that file will dispatch an async close,
because that's what Winsock does.
...where XRL is concerned, these are the places I'd look more carefully
for races.
Sorry to be a nazi about ref_ptr<T>. It's one of those things I wish
would just go away. There is a need for a refcounted object/pointer
type. ref_ptr<T> fills that gap now; it existed before Boost did.
There are several places where its use is actually typedef'd away,
making it harder to tell, at-a-glance, what's going on. One of these is
in the FinderClient. I'm going to want to get rid of this; ref_ptr<T>
caused no end of problems for me in OLSR.
I did say that Boost needs to be introduced carefully and
incrementally; I'll be sticking to that. I took a look at how Boost
implements the weak_ptr<T> / shared_ptr<T> split. It turns out they both
reference an embedded sp_counted_base instance, to implement the
refcounts. They maintain a separate count of holders and watchers. It's
thread safe, but it'll use atomic ops if you don't want to use threads;
and looks pretty efficient.
In the meantime, to avoid using ref_ptr<T>, here's what I'd suggest.
We can sanity check 's' before its use using an existing map,
XrlRouter::_senders2, which seems to have been introduced to support
batch XRL operations. We know the key, there is one match and one match
only, so it should be quick enough.
Given that we already maintain this map, I'm surprised we don't use
it for XrlRouter::get_sender() anyway. But beware: that method name is
overloaded.
Of course, it's likely the change can be picked up on later on. I am
looking at things with the Thrift goggles on right now, and seeing
exactly the same code pattern I saw emerging in OLSR.
thanks,
BMS
_______________________________________________
Xorp-hackers mailing list
[email protected]
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers