On Wed, 8 Apr 2020 at 13:16, Antony Antony <[email protected]> wrote:
> On Wed, Apr 08, 2020 at 10:07:43AM -0400, Andrew Cagney wrote: > > > > > > On Wed, 8 Apr 2020 at 02:29, Antony Antony <[email protected]> wrote: > > > > Hi, > > > > I am hunting a couple of corner cases, IKEv2 rekey initiator > failures. > > These issues appear when testing clones. Think of 100 IKEv2 Child > SAs under > > one IKE SA and rekeying them all. In the test rekey margin and > salife are > > short. Short values do not matter, because looking back in Tuomo > production > > systems logs, with larger rekey margin and saflifetime, I see the > same > > issues. > > > > First one is possibly fixed. > > > > https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ > > ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz > > > > > > > > ikev2: insert new v2D message to the tail of pending v2D requests > > > > It's certainly better: > > > > optimized to send out v2D message before v2_CREATE_CHILD_SA jobs. > > Delete would also get a quicker response from the other end, because, > > there are no crypto operations i.e. v2DH,KE,nonce to respond to > > a v2_INFORMATIONAL request. > > > > but is it needed? > > I noticed sometimes a several rekeys would get queued up. And delete would > stay in the queue for longer. I think it is best to prioritize v2D ahead > CREATE_CHILD_SA. Also can't think of any side effect of pritorizing v2D. > It re-orders the event queue, which will make debugging harder. > > this queue may have v2_CREATE_CHILD_SA or/and v2_INFORMATIONAL (v2D). > > add new v2_INFORMATIONAL, v2D request to the tail of v2D requests, > > instead adding to the tail of the queue. > > > > I'm also puzzled by: > > yes. It seems necessary for now. May be there could a way to call > complete_v2_state_transition(). > The processor is is most likely broken. > > if (IS_CHILD_SA_ESTABLISHED(st)) { > > /* > > * this is a continuation of delete message. > > * shortcut complete_v2_state_transition() > > * to call complete_v2_state_transition, need more work > > */ > > pending.cb(ike, st, NULL); > > > Initially I imagined calling complete_v2_state_transition() would be > cleaner, however it turned out to be harder. There are too many checks > inside complete_v2_state_transition() and success_v2_state_transition() > that > would fail, or need hacks or fixes. > > Initiatiing INFROMATIONAL v2D does not have an svm entry or fake md. I > am > not sure faking all that is worth it. > > Message ID is already advanced in send_delete() > As in: /* * XXX: The record 'n' send call shouldn't be * needed. Instead, as part of this * transition (live -> being-deleted) the * standard success_v2_transition() code path * should get to do the right thing. * * XXX: The record 'n' send call leads to an * RFC violation. The lack of a state * transition means there's nothing set up to * wait for the ack. And that in turn means * that the next packet will be sent before * this one has had a response. */ > Now I am thinking a lighter version complete_v2_state_transition() is > probably ideal this. > > when completeing an v2D all we really need is the next line. > > v2_msgid_schedule_next_initiator(ike); > > > > > Second One seems to harder to trace. Here are the symptoms that I > found so > > far. Rekaying IPsec fails with the following message and the > connections > > disappears. If you have auto=route it may come back again. > auto=route is a > > bandaid. > > > > Apr 6 13:58:50.367487: | ikev2_child_sa_respond returned > > STF_INTERNAL_ERROR > > > > > > > > Any ideas on what triggered the internal error? > > I know one step further. the line shown bellow "EVENT_SA_EXPIRE, timeout > in > 0 seconds" is the cause. However, I don't why pluto schedule this > EVENT_SA_EXPIRE. > > According to the below the internal-error triggered the expire; not the reverse? That's totally expected. Did you look a the log? > > > > > Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 > vanished > > abort this exchange > > > > the internal error seems to caused EVENT_SA_EXPIRE, timeout in 0 > seconds > > which removed predecessor state before rekey send the message. > > > So there's an STF_INTERNAL_ERROR, and then ... > Apr 6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in > 0 > > seconds for #3500 > > Apr 6 13:58:50.230304: | duplicating state object #3606 "west/100x0" > as > > #3699 for IPSEC SA > > Apr 6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child > state # > > 3500 > > > > It seems when replace is hit, code decided to rekey, and also > scheduled an > > EXPIRE. Usually margin is 70s or so Which is enough to rekey. In > this case > > it margin end as 0. > > > > Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I > ran it > > manually for longer time. It took 58 minutes before this error to > occured. > > > > WARNING: the following log is big 100s Mega bytes, careful when > clicking on > > the link in a browser. It is better to download it using wget to > read. > > > > > https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/ > > west.pluto.log.gz > > > > I let the test run, it happend twice in about 6 hours. > > > > Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 > vanished > > abort this exchange > > Apr 6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320 > > vanished abort this exchange > > > > Looking back in older logs, Tuomo's log, I see the same sequence and > > connrection restarting either due to keying tries or traffic. He is > running > > with default rekey margin and fuzz, while the test has shorter rekey > > margin. Note it is not just because of shorter rekey margin or > salifetime. > > > > The oldest seems to hapeend in > > > > "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to > rekey # > > 912 vanished abort this exchange" > > > > I am wondering where is the bug? pluto's rekey margin calcuation, or > some > > other logic forcing the EXPIRE event. >
_______________________________________________ Swan-dev mailing list [email protected] https://lists.libreswan.org/mailman/listinfo/swan-dev
