Re: [Lightning-dev] Error Codes for LN
Hi Rusty, Thanks for taking a look! I like the idea of being able to reference a specific message/field that your node finds problematic. Definitely makes the job simpler than trying to define every error evar. > erroneous_message is the message we're complaining about (including 2-byte type), which may be truncated (but must be at least 2 bytes) Do you think this could be reduced to just the 2-byte message type? If the reason for including the whole message is so that the error recipient can match it to the original message, we could possibly just include the field that we had a problem with as another TLV? That way we save on having to resend the whole message, and guarantee that we'd be delivering the problematic field value (it could be cut off if we truncate the message). With this approach, there's a chance that we can't match errors back to their original message if the same channel sends the same message with the same incorrect field more than once, but I don't see much practical need to exactly match error to original message in this case, since we know which field is broken. > (We should similarly add this TLV to warnings?) Sounds like a good idea to me! > But it's worth noting that with the exception of timeouts, these are all expressible in form "problem is this message, this field". Perhaps its worth having a special TLV case for timeouts in the message? If we're confident that the only error that we're going to need that doesn't follow the "problem/message" structure is timeouts, then a `timeout` tlv sounds reasonable. If we're also going to add these fields to warning messages, I'm wondering whether it could be useful to define an `error_code` enum, where `timeout` is one of the cases? Perhaps overkill if we're certain that the only value will ever be `timeout`. Putting that all together, an update on your proposed solution: 1. `tlv_stream`: `error_tlvs` 2. types: 1. type: 1 (`erroneous_msgtype`) 2. data: * [`u16`:`type`] 1. type: 3 (`erroneous_fieldnum`) 2. data: * [`tu64`:`fieldnum`] 1. type: 5 (`suggested_value`) 2. data: * [`...*byte`:`value`] 1. type: 7 (`erroneous_value`) 2. data: * [`...*byte`:`value`] 1. type: 9 (`error_code`) 2. data: * [`u16`:`code`] Cheers, - Carla On Wed, Jul 7, 2021 at 2:36 AM Rusty Russell wrote: > Carla Kirk-Cohen writes: > > Hi all, > > Hi Carla, > > I apologize for not responding to this earlier, but it was > raised again in the recent spec meeting > ( > https://lightningd.github.io/meetings/ln_spec_meeting/2021/ln_spec_meeting.2021-07-05-20.06.log.html > ). > > I love the idea of more specific error codes, BTW! > > Feedback interleaved: > > > Since we shouldn’t have non-ascii values in the error string itself, > > this can most easily be achieved by adding TLV fields after the > > data field. In terms of supporting nodes that have not upgraded, > > we could either include the error code in the data field to cover > > our bases, or introduce a feature bit so that we know whether > > to backfill the data field. This gives upgraded nodes an improved > > quality of life, while leaving older nodes unaffected. > > Older nodes should definitely ignore extra fields; it's in the spec and > we've relied on this to extend messages in the past, so this part is > easy. > > Technically, all defined types are now assumed to have an optional TLV > appended, since f068dd0d (Bolt 1: Specify that extensions to existing > messages must use TLV (#754)). > > > While we can’t enumerate every possible error, there are quite > > a few cases in the spec where we can introduce explicit error > > codes. For the sake of the skim-readers, I’ve left that list at > > the end of the email. > > > > Taking the example of our node receiving an invalid signature for > > a htlc, a new error would look like this: > > I think this is both too much, and not enough. > > Too much: > - Many of these errors are "your implementation is broken", which is > really not something actionable by the recipient. > - A lot of work to fill in all these error cases, which will (because > they're usually impossible) will be untested and broken. > > Not enough: > - Look at the proposal for channel_types, where you would object to the > channel_type if you don't like it. This would be grouped under > "Funding params unacceptable", which is actually 99% of errors at this > point and does not say what the problem is with specificity. > > I took a different approach with onion messages[1], where you (optionally) > specify the field number, even an optional suggested value: > > 1. type: 1 (`erroneous_field`) > 2. data: > * [`tu64`:`tlv_fieldnum`] > 1. type: 3 (`suggested_value`) > 2. data: > * [`...*byte`:`value`] > 1. type: 5 (`error`) > 2. data: > * [`...*utf8`:`msg`] > > In our case, we need to refer to which message (if any) caused the > error, and we hav
Re: [Lightning-dev] Error Codes for LN
Carla Kirk-Cohen writes: > Hi all, Hi Carla, I apologize for not responding to this earlier, but it was raised again in the recent spec meeting (https://lightningd.github.io/meetings/ln_spec_meeting/2021/ln_spec_meeting.2021-07-05-20.06.log.html). I love the idea of more specific error codes, BTW! Feedback interleaved: > Since we shouldn’t have non-ascii values in the error string itself, > this can most easily be achieved by adding TLV fields after the > data field. In terms of supporting nodes that have not upgraded, > we could either include the error code in the data field to cover > our bases, or introduce a feature bit so that we know whether > to backfill the data field. This gives upgraded nodes an improved > quality of life, while leaving older nodes unaffected. Older nodes should definitely ignore extra fields; it's in the spec and we've relied on this to extend messages in the past, so this part is easy. Technically, all defined types are now assumed to have an optional TLV appended, since f068dd0d (Bolt 1: Specify that extensions to existing messages must use TLV (#754)). > While we can’t enumerate every possible error, there are quite > a few cases in the spec where we can introduce explicit error > codes. For the sake of the skim-readers, I’ve left that list at > the end of the email. > > Taking the example of our node receiving an invalid signature for > a htlc, a new error would look like this: I think this is both too much, and not enough. Too much: - Many of these errors are "your implementation is broken", which is really not something actionable by the recipient. - A lot of work to fill in all these error cases, which will (because they're usually impossible) will be untested and broken. Not enough: - Look at the proposal for channel_types, where you would object to the channel_type if you don't like it. This would be grouped under "Funding params unacceptable", which is actually 99% of errors at this point and does not say what the problem is with specificity. I took a different approach with onion messages[1], where you (optionally) specify the field number, even an optional suggested value: 1. type: 1 (`erroneous_field`) 2. data: * [`tu64`:`tlv_fieldnum`] 1. type: 3 (`suggested_value`) 2. data: * [`...*byte`:`value`] 1. type: 5 (`error`) 2. data: * [`...*utf8`:`msg`] In our case, we need to refer to which message (if any) caused the error, and we have non-tlv fields, so it can't simply use the tlv field number. Here's my straw proposal: 1. `tlv_stream`: `error_tlvs` 2. types: 1. type: 1 (`erroneous_message`) 2. data: * [`...*byte`:`message`] 1. type: 3 (`erroneous_fieldnum`) 2. data: * [`tu64`:`fieldnum`] 1. type: 5 (`suggested_value`) 2. data: * [`...*byte`:`value`] erroneous_message is the message we're complaining about (including 2-byte type), which may be truncated (but must be at least 2 bytes). fieldnum is either the 0-based field number (for fixed fields), or the number of fixed fields + the tlv type (for tlv fields). suggested_value is the optional value if we have an idea if what we expected / prefer. > This new kind of error provides us with an error code that tells us > > exactly what has gone wrong, and metadata pointing to the htlc > > with an invalid sig. This information can be logged, or stored in a > > more permanent error store to help diagnose issues in the future. > > Right now, the spec is pretty strict on error handling [13], indicating > > that senders/recipients of errors `MUST` fail the channel referenced > > in the error. > This isn’t very practical, and I believe that the majority > of the impls don’t abide by this instruction. This was inevitable eventually, but c-lightning deliberately treated errors as fatal for a long time so people would *notice* and *report* these issues. To be fair, *LND* didn't treat them as fatal. As so naturally your engineers didn't *think* of them as a big deal (and testing didn't show it up), so it would send errors for cases which it clearly didn't want to close the channel (e.g. peer too slow to respond!). Hence this PR, which makes these less fatal, and adds warning support: https://github.com/lightningnetwork/lightning-rfc/pull/834 (We should similarly add this TLV to warnings?) > Candidates for error codes: The vast majority of these are "contact your developer, peer says we did something illegal". Which is always nice to have more information about, but not critital. The exceptions are: > Funding Process: > > * Funding process timeout [2] > > * Fees out of range [3] > > * Funding tx spent [11] > > * Funding params unacceptable (eg, channel too small) ... > Channel State Machine: > > * HTLC timeout [4] ... > Fee Updates > > * Update fee to low/high [9] ... And BTW this one is misguided: > * Feature bit required If Alice says a feature bit is even (compulsory), and
[Lightning-dev] Error Codes for LN
Hi all, I’d like to make a case for re-purposing the existing error message (17) in the spec to allow for more structured errors, and create a path for standardized, enriched errors going forward. As is: the error message contains an arbitrary string, and is used to communicate fatal errors to our peers. To (potentially) be: the error message contains an error code, and optional metadata which enriches the context of the error. There are a few benefits of upgrading these messages: * Better debugging information, and standardization across implementations * Clearer information for the end user * Reduced risk of leaking information in an arbitrary string * More fine-grained error handling based on error code Since we shouldn’t have non-ascii values in the error string itself, this can most easily be achieved by adding TLV fields after the data field. In terms of supporting nodes that have not upgraded, we could either include the error code in the data field to cover our bases, or introduce a feature bit so that we know whether to backfill the data field. This gives upgraded nodes an improved quality of life, while leaving older nodes unaffected. While we can’t enumerate every possible error, there are quite a few cases in the spec where we can introduce explicit error codes. For the sake of the skim-readers, I’ve left that list at the end of the email. Taking the example of our node receiving an invalid signature for a htlc, a new error would look like this: 1. type: 17 (`error`) 2. data [Channel_id:channel_id] [u16:len] [len*byte:data] 1. tlv_stream: `invalid_sig_tlvs` 2. Types i. type 0 (`error_code`) data: [u16: error_code] ii. type 1 (`htlc_id`) data: [u64: id] iii. type 2 (`state_number`) data: [u64: commitment_number] This new kind of error provides us with an error code that tells us exactly what has gone wrong, and metadata pointing to the htlc with an invalid sig. This information can be logged, or stored in a more permanent error store to help diagnose issues in the future. Right now, the spec is pretty strict on error handling [13], indicating that senders/recipients of errors `MUST` fail the channel referenced in the error. This isn’t very practical, and I believe that the majority of the impls don’t abide by this instruction. With the addition of error codes, we can pair each error code with a recommended action that is more appropriate for the error at hand - for example, still force closing if we get an invalid signature, but being more lenient if we get a low fee in update fee. While this may already be the case in practise, it’s messy for everybody to roll their own and then figure out what other impls are doing. With a standardized set of errors, and reasonable handling - rather than the current “close-all-da-chans” prescription in the spec - we can clear up some of the ambiguity around errors, and a spec that’s reasonable for the end user to follow. Thanks for reading! - Carla Candidates for error codes: Signature problems: * Incorrect Signature [1] [3] [7] [12] *Non-standard signature [1] [3 [7] Funding Process: * Funding process timeout [2] * Fees greater than base fee [3] * Fees out of range [3] * Funding tx spent [11] * Funding params unacceptable (eg, channel too small) Channel State Machine: * HTLC timeout [4] * Zero htlc add [5] * Htlc add less than minimum [5] * Htlc add can’t afford at current fee rate [5] * Maximum htlc count exceeded [5] * Maximum htlc in flight exceeded [5] * cltv > 5 [5] * Sub-msat values [5] * Fulfill/fail htlc id not found [6] * Incorrect commitment number [10] * Incorrect preimage [6] * Incorrect per commit secret [8] Fee Updates * Update fee to low/high [9] * Unexpected update fee [9] * Cannot afford update fee [9] Connection Level * Disconnecting * Feature bit required Gossip * Incorrect gossip short channel ID [12] [1] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#the-funding_created-message [2] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#the-funding_locked-message [3] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#closing-negotiation-closing_signed [4] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-8 [5] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-9 [6] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-10 [7] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-11 [8] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-12 [9] https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-13 [10] https://github.com/lightningnetwork/lightning-rfc/blob/mas