So, after I got fea to quit crashing, I was able to do some tests with
dynamic interfaces & OSPF.

With 15 routers, the system load goes to ~20 and ospf cannot seem to
get out of init.

I notice that ospf and fea are taking large amounts of CPU during
and after xorpsh activity.

I did an strace on xorp_ospf2 and it looks like it is sitting in
a loop doing selects but not actually reading/writing the descriptors
for much of the time.

If I remember correctly, this was the same problem that xorpsh showed
before Pavlin put the fix in to sleep for 10ms.  Maybe a similar
fix is needed for ospf2?

Even then, it seems that we should not need to busy-spin even at 10ms.
We should be able to set a longer timeout and wait on select to tell
us when we have messages and/or can send data.

Here' strace from xorp_ospfv2:

clock_gettime(CLOCK_MONOTONIC, {19988, 604437813}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 604521828}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 604834293}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 604918543}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605048836}) = 0
time(NULL)                              = 1192851387
time(NULL)                              = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 605290834}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605397338}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 605731294}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605842029}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 606542801}) = 0
time(NULL)                              = 1192851387
time(NULL)                              = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 606813953}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 606927272}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 607256007}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 607339372}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 608347317}) = 0
time(NULL)                              = 1192851387
time(NULL)                              = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 608595070}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 608678070}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 608989096}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609073655}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609176878}) = 0
time(NULL)                              = 1192851387
time(NULL)                              = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 609418106}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609500577}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 609814101}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609934343}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 610732073}) = 0
time(NULL)                              = 1192851387
time(NULL)                              = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 610919469}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 610982308}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], 
out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 611307881}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 611391965}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 612422222}) = 0


fea is also taking up lots of CPU, but it seems like it might be doing
at least some real work.  It is running multiple selects before it actually
reads from the FD, so it still seems like it could be optimized.

clock_gettime(CLOCK_MONOTONIC, {20125, 745575383}) = 0
clock_gettime(CLOCK_MONOTONIC, {20125, 745683166}) = 0
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 
48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 
48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 
48 49 50 51 52 53 54 55 56 57 59], [], [], {3, 441222}) = 1 (in [57], left {3, 
441222})
clock_gettime(CLOCK_MONOTONIC, {20125, 746149812}) = 0
recvmsg(57, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("10.13.14.13")}, 
msg_iov(1)=[{"E\300\0P\253\314\0\0\1Y\24\252\n\r\16\r\340\0\0\5\2\4\0"..., 
65536}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=, ...}, 
msg_flags=0}, 0) = 80
time(NULL)                              = 1192851524
time(NULL)                              = 1192851524
clock_gettime(CLOCK_MONOTONIC, {20125, 746558226}) = 0
clock_gettime(CLOCK_MONOTONIC, {20125, 746641971}) = 0

At least part of fea's problem is that it is getting ~2000 packets per second 
for
interfaces not it's own.  I'm not sure at all why all the traffic...as under 
'normal'
load it was only getting about 100 per second.

Here's a snippet of a packet capture.  Looks like two routers have
gone insane and are flooding each other with LS Updates.


[EMAIL PROTECTED] local]# tshark -n -i 11.16.11
Capturing on 11.16.11
   0.000000    99.1.1.10 -> 224.0.0.5    OSPF LS Update
   0.000111    99.1.1.10 -> 224.0.0.5    OSPF LS Update
   0.000135    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
   0.000157    99.1.1.11 -> 224.0.0.5    OSPF LS Update
   0.000178    99.1.1.11 -> 224.0.0.5    OSPF LS Update
   0.000198    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
   0.000219    99.1.1.11 -> 224.0.0.5    OSPF LS Update
   0.000239    99.1.1.10 -> 224.0.0.5    OSPF LS Update
   0.000260    99.1.1.10 -> 224.0.0.5    OSPF LS Update
   0.000285    99.1.1.11 -> 224.0.0.5    OSPF LS Update
   0.000306    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
   0.000326    99.1.1.11 -> 224.0.0.5    OSPF LS Update
   0.000346    99.1.1.11 -> 224.0.0.5    OSPF LS Update

Thanks,
Ben


-- 
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc  http://www.candelatech.com

_______________________________________________
Xorp-hackers mailing list
[email protected]
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers

Reply via email to