On Wed, 8 Apr 2020 at 02:29, Antony Antony <[email protected]> wrote:
> Hi, > > I am hunting a couple of corner cases, IKEv2 rekey initiator failures. > These issues appear when testing clones. Think of 100 IKEv2 Child SAs > under > one IKE SA and rekeying them all. In the test rekey margin and salife are > short. Short values do not matter, because looking back in Tuomo > production > systems logs, with larger rekey margin and saflifetime, I see the same > issues. > > First one is possibly fixed. > > > https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz > > ikev2: insert new v2D message to the tail of pending v2D requests It's certainly better: optimized to send out v2D message before v2_CREATE_CHILD_SA jobs. Delete would also get a quicker response from the other end, because, there are no crypto operations i.e. v2DH,KE,nonce to respond to a v2_INFORMATIONAL request. but is it needed? this queue may have v2_CREATE_CHILD_SA or/and v2_INFORMATIONAL (v2D). add new v2_INFORMATIONAL, v2D request to the tail of v2D requests, instead adding to the tail of the queue. I'm also puzzled by: if (IS_CHILD_SA_ESTABLISHED(st)) { /* * this is a continuation of delete message. * shortcut complete_v2_state_transition() * to call complete_v2_state_transition, need more work */ pending.cb(ike, st, NULL); v2_msgid_schedule_next_initiator(ike); The pluto log will show the following debugline and rekey response is > dropped. > > Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect > record'n'send with an out-of-order wrong packet response; discarding > packet: > ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1 > ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1 > this is due to delete messages jumping the queue and Message ID seems to > sync up again with next dpd or delete message. I am not sure that is RFC > complaint. > It isn't. > Second One seems to harder to trace. Here are the symptoms that I found so > far. Rekaying IPsec fails with the following message and the connections > disappears. If you have auto=route it may come back again. auto=route is a > bandaid. > > Apr 6 13:58:50.367487: | ikev2_child_sa_respond returned > STF_INTERNAL_ERROR > > Any ideas on what triggered the internal error? Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished > abort this exchange > > the internal error seems to caused EVENT_SA_EXPIRE, timeout in 0 seconds > which removed predecessor state before rekey send the message. > > Apr 6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0 > seconds for #3500 > > Apr 6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as > #3699 for IPSEC SA > Apr 6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state > #3500 > > It seems when replace is hit, code decided to rekey, and also scheduled an > EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case > it margin end as 0. > > Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it > manually for longer time. It took 58 minutes before this error to occured. > > WARNING: the following log is big 100s Mega bytes, careful when clicking > on > the link in a browser. It is better to download it using wget to read. > > > https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz > > I let the test run, it happend twice in about 6 hours. > > Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 > vanished abort this exchange > Apr 6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320 > vanished abort this exchange > > Looking back in older logs, Tuomo's log, I see the same sequence and > connrection restarting either due to keying tries or traffic. He is > running > with default rekey margin and fuzz, while the test has shorter rekey > margin. Note it is not just because of shorter rekey margin or salifetime. > > The oldest seems to hapeend in > > "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey > #912 vanished abort this exchange" > > I am wondering where is the bug? pluto's rekey margin calcuation, or some > other logic forcing the EXPIRE event. > > -antony > > _______________________________________________ > Swan-dev mailing list > [email protected] > https://lists.libreswan.org/mailman/listinfo/swan-dev >
_______________________________________________ Swan-dev mailing list [email protected] https://lists.libreswan.org/mailman/listinfo/swan-dev
