Re: [Swan-dev] pluto: When Child state fails, don't tear down IKE SA

Paul Wouters Mon, 05 Jul 2021 10:22:54 -0700

On Mon, 5 Jul 2021, Andrew Cagney wrote:

      But that is changing the meaning of FAIL vs FATAL.



(sarcasm) It's hard to change the meaning of something that was, well, 
meaningless.

This is what v3.23 had:

STF_INTERNAL_ERROR:
It just signals to the state machine to do nothing (as I discovered last week, 
it also signals that the state on the stack state may have been deleted)!
-> I think it should go; internal errors are either recoverable (just kill IKE 
family), or passert


This is supposed to core dump. We hit something that never is supposed
to happen.

STF_DROP:
delete current state; but "be vewy vewy quiet"
-> this is gone

STF_FATAL:
delete current state
(for a child it would delete sending a notify, then see the ike sa was 
childless and delete that sending a second notify; except when there were 
multiple
children)
-> it now completes message exchange and deletes the IKE family; it is after 
all fatal


That is wrong. Because a fatal child should not delete the IKE SA or
other exitsing Child SA states, just because of the children is
STF_FATAL. Eg this is now executed when two endpoints misconfigure
their subnets. It makes it impossible for an admin to add a new subnet
and then tell the other end to add it, since it now tears down the
house.

STF_FAIL:
do pretty much nothing
-> which is useless, the exchange failed so things are going nowhere
-> it now completes message exchange and deletes current state


In IKEv1, I think this can still be valid. In IKEv2, things are more
clear so it is either OK or FATAL.

STF_FAIL+v2N_....:
in the responder send v2N and then deleted a state (during IKE_SA_INIT that was 
IKE, else it was the child, but see alsofor the initiator this did nothing
-> this code path is gone


even in IKEv1?

STF_IGNORE:
it's been pretty consistent; pretend the message never happened

So STF_DROP, STF_FATAL and STF_FAIL+v2N could all potentially delete a state; 
and STF_FAIL on its own did nothing.


well, "nothing" meant it could retransmit the request, eg a Quick Mode
Request that got denied before. Eg retrying. It works because of the
unique msgids, you could just keep retransmitting such packets.

      FAIL used to mean "there is a non-fatal error, please keep trying this 
state"
      FATAL meant "there is no more hope, kill the state"

      Note that this applies to the _states_ and not the connection. That is,
      starting a new keying attempt after a timeout and/or revival are not
      related to these state changes.


If a state transition fails then it fails.  There is no "please keep trying this 
state".


That depends on the retransmit logic/arhitecture. Also, in IKEv1 you
could have Child SA states without having an IKE state. It was a
different world :)

The only remaining question is how much damage should the failure inflict?  
Delete the state or delete the IKE family (and yes any exchange can trigger a
delete family, a critical payload, bad payload lengths, ...).


I still don't agree with this statement. A new CREATE_CHILD_SA that
fails should not take down existing Child SA's or an IKE SA. Even an
IKE REKEY failure should not take down the existing IKE SA.

However, what I have found and fixed is code trying to ignore failure (failure 
is not not an option :-).  For instance, code would ignore unexpected
notifications in an IKE_AUTH response, and instead let things retransmit.  This 
is futile.  The response was protected, so no number of retransmits will
change its content.


Agreed, but IKEv1 might not be that clear.

      In that sense, a CREATE_CHILD_SA that received an answer that leads to
      an unfixable isse, like TS_UNACCEPTABLE, is a FATAL state change for
      the child sa. FAIL would mean we might try to retransmit, which is
      what is wrongly happening here. FAIL would also happen if we got an
      unauthenticated reply we didn't like, so we would still wait for a
      possible unauthenticated reply we like (to prevent attackers from
      locking out legitimate packets).

If a packet fails protection, it's STF_IGNOREd.  The exception is with 
IKE_SA_INIT bit it too is free to STF_IGNORE a message.
 

      The way IKEv2 works with serial MSGID, we really don't have STF_FAIL in
      that sense. We only have STF_OK or STF_FATAL. (well also STF_SUSPEND etc)
      The only STF_FAIL would be the if we get no reply, and we need to
      retransmit. I can't think of an STF_FAIL that would operate on the child
      SA state.


We're somewhat on the same page.  Any message can carry a notification that 
implies that the IKE family is dead.  For instance, INVALID_SYNTAX.  So the
exchange outcome is either Ok, or fatal.


See above. I don't agree that INVALID_SYNTAX should be interpreted to
tear down the entire house. Or at least we should be really careful
with sending INVALID_SYNTAX. I find these two bits of the RFC rather
conflicting:


   If a peer parsing a request notices that it is badly formatted (after
   it has passed the message authentication code checks and window
   checks) and it returns an INVALID_SYNTAX notification, then this
   error notification is considered fatal in both peers, meaning that
   the IKE SA is deleted without needing an explicit Delete payload.

vs

 INVALID_SYNTAX                            7
       Indicates the IKE message that was received was invalid because
       some type, length, or value was out of range or because the
       request was rejected for policy reasons.

Specifically the "for policy reasons".

But even so, INVALID_SYNTAX is rare and not the problem we are dealing
with now, which is things like TS_UNACCEPTABLE. That is the case now
where we tear down the house and we should not.

However with a CREATE_CHILD_SA exchange, there's three outcomes:
- exchange succeeds; child succeeds
- exchange succeeds; child fails (ike survives)
- exchange fails, implying that the ike family dies
hence the need for three status

Taking a step back, and ignoring our current code, which would you consider 
more serious: FAIL or FATAL?
Personally, I'd treat something fatal as more serious and assume it's wiping 
things out.


As you said, I think we are on the same page, but FATAL in a child
should not cause the IKE SA or other Child SAs to die (maybe the
exception to this is indeed INVALID_SYNTAX - I don't care too much
since we don't really see that case happening in real life, and I'm
mostly concerned now with fixing the mismatching subnet cases that
throw TS_UNACCEPTABLE)

The IKE_AUTH exchange is different.   The message is passed to the IKE SA, and 
then as a nested operation, it tries to create the child.  If the attempt
to create the child proves fatal, the IKE SA can proceed accordingly.


That is not the case yet, which is why I added _two_ test cases with
mismatched subnet. One of those mismatches the first subnet, thus
causing a failure in IKE_AUTH, which currently also/still tears
down the IKE SA (indirectly through STF_FATAL, even if your recent
childless code made the IKE SA survive the initial failure)

If CREATE_CHILD_SA followed this model (I call it nested states), then we could 
only need two codes.


Works for me, but we have to be careful not to break IKEv1 assumptions
and running code.

Paul
_______________________________________________
Swan-dev mailing list
[email protected]
https://lists.libreswan.org/mailman/listinfo/swan-dev

Re: [Swan-dev] pluto: When Child state fails, don't tear down IKE SA

Reply via email to