Hi Paul, a quick response in the interest of time. though your analysis appear be detailed the direction of it and the scope puzzles me. It is only analyzing test with regression. I think you will find the answers yourself if you compare logs before the regression. Pay attention to the log line on east. I noticed this on Tuomo's system at the beginning of this bughunt. | inserting event EVENT_SA_EXPIRE, timeout in 0.000 seconds for #4
Another suggestion is run the same test with uniqueid=no And compare the logs. I don't understand the issue you are bringing up about short timers, 30seconds. shouldn't that work too? It worked before this commit. First rekey should work with 30second timer too? Second, we/Tuomo ran the code with 8hour timer and the issue are similar. What timer would you like to see? Our test run is 17 hours, could even create a 8hour test:) After my fix I created ikev2-ike-rekey-05 which I left running for longer to see the dynamics and no issues shown. also whack command is not very convincing to me, not on my wish list:) A side note, strongswan, has byte limit to rekey IPsec. Which I found it confusing so far. These tests are complicated it takes a while to follow them. Even if we have such whack command one would need timer based test too? Or are you suggesting no timer tests? I would like to bring your attention to the bug at hand instead of digressing into generic code questions and timer questions. I would also remind rekey code has limitations focusing one issue/improvement thing at a time is probably better. On Fri, Jun 15, 2018 at 05:52:20PM -0400, Paul Wouters wrote: > On Wed, 6 Jun 2018, Paul Wouters wrote: > > > On Tue, 5 Jun 2018, Antony Antony wrote: > > > > > I noticed a regression in test ikev2-ike-rekey-03. I was hunting an issue > > > Tuomo found. While trouble shooting it I noticed this regression. > > > > > > > > http://testing.libreswan.org/results/testing/v3.22-1500-g2efd341-master/ikev2-ike-rekey-03/ > > I looked at this test case. Clearly there is a bug now, but the > situation is complicated by the fact that the test case seems > flawed to begin with due to the extremely short rekey timers used. > > This is my observation on east: > > [initial setup of all tunnels] > > westnet-eastnet-ikev2c #1 IKE SA > westnet-eastnet-ikev2a #2 [192.0.2.0-192.0.2.255:0-65535 0] -> > [192.0.1.0-192.0.1.255:0-65535 0] ESP=>0x030eb7c4 <0xeb332c41 > westnet-eastnet-ikev2b #3 [192.0.200.0-192.0.200.255:0-65535 0] -> > [192.0.100.0-192.0.100.255:0-65535 0] ESP=>0xc8c58945 <0x71312be9 > westnet-eastnet-ikev2c #4 [192.0.201.0-192.0.201.255:0-65535 0] -> > [192.0.101.0-192.0.101.255:0-65535 0] ESP=>0x98a747b7 <0x4513d3f4 > > [rekey event as responder] > westnet-eastnet-ikev2a #5: negotiated new IPsec SA > [192.0.2.0-192.0.2.255:0-65535 0] -> [192.0.1.0-192.0.1.255:0-65535 0] > ESP=>0xd09a6b80 <0x13cdc164 > > [timer event unrelated to anything of remote peer kicks in] > | handling event EVENT_SA_EXPIRE for child state #4 > westnet-eastnet-ikev2c #4: deleting state (STATE_V2_IPSEC_R) > [this means 192.0.201.0/24 - 192.0.101.0/24 tunnel is now BROKEN] > > [Remote deletes old ipsec SA which was just rekeyed] > westnet-eastnet-ikev2c #1: received Delete SA payload: delete IPSEC State #2 > now > westnet-eastnet-ikev2a #2: deleting other state #2 connection > (STATE_V2_IPSEC_R) "westnet-eastnet-ikev2a" > > [remote rekey event] > "westnet-eastnet-ikev2b" #6: negotiated new IPsec SA > [192.0.200.0-192.0.200.255:0-65535 0] -> [192.0.100.0-192.0.100.255:0-65535 > 0] ESP=>0x304bf497 <0x970d9cc0 > > [remote delete ipsec sa which was just rekeyed] > "westnet-eastnet-ikev2c" #1: received Delete SA payload: delete IPSEC State > #3 now > > [next remote sends rekey ipsec sa for #4 which we already expired by timer!] > "westnet-eastnet-ikev2c" #1: CREATE_CHILD_SA no such IPsec SA to rekey > SA(0x98a747b7) Protocol PROTO_v2_ESP > "westnet-eastnet-ikev2c" #7: responding to CREATE_CHILD_SA message (ID 8) > from 192.1.2.45:500 with encrypted notification CHILD_SA_NOT_FOUND > > [remote then tries to delete the thing we told it we already don't know > anything about] > #1: received delete request for PROTO_v2_ESP SA(0x98a747b7) but corresponding > state not found > > [remote replaces #4 which we already deleted] > "westnet-eastnet-ikev2c" #8: negotiated new IPsec SA > [192.0.201.0-192.0.201.255:0-65535 0] -> [192.0.101.0-192.0.101.255:0-65535 > 0] ESP=>0x63221c98 <0x3d570d8c > [ this means 192.0.201.0/24 - 192.0.101.0/24 tunnel is now repaired] > > [remote rekeys #5 with #9, but our end then mistakenly deletes wrong state] > "westnet-eastnet-ikev2a" #9: negotiated new IPsec SA > [192.0.2.0-192.0.2.255:0-65535 0] -> [192.0.1.0-192.0.1.255:0-65535 0] > ESP=>0x10cb8a4c <0x9b32c05b > "westnet-eastnet-ikev2c" #8: deleting state (STATE_V2_IPSEC_R) > [note westnet-eastnet-ikev2a vs westnet-eastnet-ikev2c] > > [remote deletes the replaced ipsec sa] > "westnet-eastnet-ikev2c" #1: received Delete SA payload: delete IPSEC State > #5 now > > And it goes on. > > So the first problem is that #4 is deleted before it is rekeyed on east > and everything is a bit confused from there on. Why not compare what happens with uniqueid=no? Pay attention the log line mentioned below. > But I think the real issue is that we have code that doesn't fully > understand the concept of shared IKE SA. With the IKE SA being on > "westnet-eastnet-ikev2c", it seems that it deletes another conn's > IPsec SA (on westnet-eastnet-ikev2a). Maybe this is because it didn't > find one on westnet-eastnet-ikev2c, because a timer just deleted it. > > > I think that instead of this playing with timers, it is worth > introducing a new whack command that rekeys an IKEv2 SA, so that we can > do rekeying without using 30 second timers in our test cases. are you saying you can't wait 30s? I don't undertand what wrong with a test with timers. More tests are ofcourse good idea. I think all timers should work. That is the real case. I know we have corner case and deadlocks. But this bug I reported is not about those. It feels like you trying to get world peace:) > Looking at if the IPsec SA's properly migrate to the new IKE SA I am > observing the following: > > [first incoming IKE SA rekey for #1] > | "westnet-eastnet-ikev2c" #1 received IKE Rekey Request CREATE_CHILD_SA from > 192.1.2.45:500 Child "westnet-eastnet-ikev2c" #11 in STATE_V2_REKEY_IKE_R > will process it further > > We know that 3 IPsec SA's should move from #1 to #11 > > | #11 inherit #10 from parent #1 > | #11 inherit #9 from parent #1 > | #11 inherit #7 from parent #1 > > So that seems to work properly (unlike my expectation) since your initial hypothesis is , atleast in part, proven wrong and there clearly one commit that causing this regression I suggest you focus there. However not, while working on this bug I did find another one. If your further tests involve keep in mind of the patch 91af9fa707. > > But the next one confuses me. I would expcet a IKE REKEY Request for > #11, but the next one I see is actually: > > | "westnet-eastnet-ikev2c" #13 received IKE Rekey Request CREATE_CHILD_SA > from 192.1.2.45:500 Child "westnet-eastnet-ikev2c" #17 in > STATE_V2_REKEY_IKE_R will process it further > > So where did #13 come from? Why is that not #11 ? > > | switching from parent? #1 to child #11 in FSM processor this is not new. > > What does this mean? Surely it does not mean it switched from a parent > state to a child state because this involves none of the child IPsec > SA's. But the code indeed switches? Generally I don't seey why not switch. IKE SA is used to decrypting and rest will be on the Child state. However, while creating the response message, parent will be used also. > > struct state *cst = process_v2_child_ix(md, st); > if (cst == NULL) { > /* no go. Could improve the status code? */ > complete_v2_state_transition(mdp, STF_FAIL); > return; > } > > DBGF(DBG_CONTROL, > "switching from parent? #%lu to child #%lu in FSM > processor", > st->st_serialno, cst->st_serialno); > > /* switch from parent state to child state */ > st = cst; > } > > md->st = st; > > It even updated the state associated to the md? That seems wrong. I do't agree. Again we are digressing. I recollect some discussion about the choice of storing st in md and switching it. May be something discuss as a seperate topic, and not to hack this in haste. > This is a CREATE_CHILD_SA not involving any IPsec SA, yet we moved > state from IKE SA to IPsec SA? If that IPsec SA would have a different > IKE SA, the resulting IKE message would be undecryptable by the remote > endpoint. Also remember above we deleted #4, so what if there is no > IPsec SA state? The above code just fails out when cst == NULL. I think your assumtions are wrong:) Try the test without uniqueid I noticed this line, is the uniqueid code introduced in | inserting event EVENT_SA_EXPIRE, timeout in 0.000 seconds for #4 see the log https://swantest.libreswan.fi/results/blackswan/2018-06-11-swantest-3.22-1581-g1cbd37cf9-master/ikev2-ike-rekey-03/OUTPUT/east.pluto.log (this file should open in the browser). If you look before the regression we do not have such line. And I suspect it is caused by 9bd57bb https://swantest.libreswan.fi/results/blackswan/2018-05-15-swantest-3.22-1400-g39c2250c1-master/ikev2-ike-rekey-03/OUTPUT/east.pluto.log after things things are wrong. > Back to #11, it completes the new IKE SA. We now have two IKE SA's #1 > and #11. And we then receive a delete for #1. Note that I see: > > | STF_OK but no state object remains > > and processing stops. So we did not send an informational delete answer. > (will this cause any msgid deadlock? Maybe not because it was the last > message to be send/received over that IKE SA anyway) > > The next message is from the new IKE SA. > > "westnet-eastnet-ikev2c" #11: CREATE_CHILD_SA no such IPsec SA to rekey > SA(0x63221c98) Protocol PROTO_v2_ESP > > This is the state we (wrongly) deleted before. The remote assumed it migrated. > > So what about #11 vs #13? It seems we just derail even more: > > "westnet-eastnet-ikev2c" #11: received Delete SA payload: delete IPSEC State > #10 now > "westnet-eastnet-ikev2b" #10: deleting other state #10 connection > (STATE_V2_IPSEC_R) "westnet-eastnet-ikev2b" > "westnet-eastnet-ikev2b" #10: ESP traffic information: in=336B out=336B > "westnet-eastnet-ikev2c" #12: deleting other state #12 (STATE_CHILDSA_DEL) > "westnet-eastnet-ikev2c" #12: ERROR: netlink response for Del SA > [email protected] included errno 3: No such process > "westnet-eastnet-ikev2c" #12: ERROR: netlink response for Del SA > [email protected] included errno 3: No such process > "westnet-eastnet-ikev2c" #7: deleting other state #7 (STATE_CHILDSA_DEL) > "westnet-eastnet-ikev2c" #7: ERROR: netlink response for Del SA > [email protected] included errno 3: No such process > "westnet-eastnet-ikev2c" #7: ERROR: netlink response for Del SA > [email protected] included errno 3: No such process > "westnet-eastnet-ikev2c" #11: deleting state (STATE_IKESA_DEL) > packet from 192.1.2.45:500: proposal > 1:IKE:ENCR=AES_GCM_C_256;PRF=HMAC_SHA2_512;DH=MODP2048 chosen from remote > proposals > "westnet-eastnet-ikev2c" #13: STATE_PARENT_R1: received v2I1, sent v2R1 > {auth=IKEv2 cipher=aes_gcm_16_256 integ=n/a prf=sha2_512 group=MODP2048} > > Things come crashing down, and a new IKE SA is started from scratch, > with I presume fresh IPsec SA's to be generated as well. And I assume > we do another loop with similar results. > > I've stopped tracking it at this point. This specific test run has log > files updated to vault:/tmp/ikev2-ike-rekey-03 if someone wants to have > a look. But I think it would be better to not use short timers, and > invoke an IKE rekey via whack, and then observe what is happening. > > Paul _______________________________________________ Swan-dev mailing list [email protected] https://lists.libreswan.org/mailman/listinfo/swan-dev
