Re: [Lightning-dev] Error Codes for LN

2021-07-28 Thread Carla Kirk-Cohen
Hi Rusty,

Thanks for taking a look!

I like the idea of being able to reference a specific message/field
that your node finds problematic. Definitely makes the job simpler
than trying to define every error evar.

> erroneous_message is the message we're complaining about (including
2-byte type), which may be truncated (but must be at least 2 bytes)

Do you think this could be reduced to just the 2-byte message type?
If the reason for including the whole message is so that the error
recipient can match it to the original message, we could possibly just
include the field that we had a problem with as another TLV? That way
we save on having to resend the whole message, and guarantee that we'd
be delivering the problematic field value (it could be cut off if we
truncate the message).

With this approach, there's a chance that we can't match errors back
to their original message if the same channel sends the same message
with the same incorrect field more than once, but I don't see much
practical need to exactly match error to original message in this case,
since we know which field is broken.

> (We should similarly add this TLV to warnings?)

Sounds like a good idea to me!

> But it's worth noting that with the exception of timeouts, these are
all expressible in form "problem is this message, this field". Perhaps
its worth having a special TLV case for timeouts in the message?

If we're confident that the only error that we're going to need that
doesn't follow the "problem/message" structure is timeouts, then a
`timeout` tlv sounds reasonable. If we're also going to add these
fields to warning messages, I'm wondering whether it could be useful
to define an `error_code` enum, where `timeout` is one of the cases?
Perhaps overkill if we're certain that the only value will ever be
`timeout`.

Putting that all together, an update on your proposed solution:
1. `tlv_stream`: `error_tlvs`
2. types:
1. type: 1 (`erroneous_msgtype`)
2. data:
* [`u16`:`type`]
1. type: 3 (`erroneous_fieldnum`)
2. data:
* [`tu64`:`fieldnum`]
1. type: 5 (`suggested_value`)
2. data:
* [`...*byte`:`value`]
1. type: 7 (`erroneous_value`)
2. data:
* [`...*byte`:`value`]
1. type: 9 (`error_code`)
2. data:
* [`u16`:`code`]

Cheers,
  - Carla

On Wed, Jul 7, 2021 at 2:36 AM Rusty Russell  wrote:

> Carla Kirk-Cohen  writes:
> > Hi all,
>
> Hi Carla,
>
> I apologize for not responding to this earlier, but it was
> raised again in the recent spec meeting
> (
> https://lightningd.github.io/meetings/ln_spec_meeting/2021/ln_spec_meeting.2021-07-05-20.06.log.html
> ).
>
> I love the idea of more specific error codes, BTW!
>
> Feedback interleaved:
>
> > Since we shouldn’t have non-ascii values in the error string itself,
> > this can most easily be achieved by adding TLV fields after the
> > data field. In terms of supporting nodes that have not upgraded,
> > we could either include the error code in the data field to cover
> > our bases, or introduce a feature bit so that we know whether
> > to backfill the data field. This gives upgraded nodes an improved
> > quality of life, while leaving older nodes unaffected.
>
> Older nodes should definitely ignore extra fields; it's in the spec and
> we've relied on this to extend messages in the past, so this part is
> easy.
>
> Technically, all defined types are now assumed to have an optional TLV
> appended, since f068dd0d (Bolt 1: Specify that extensions to existing
> messages must use TLV (#754)).
>
> > While we can’t enumerate every possible error, there are quite
> > a few cases in the spec where we can introduce explicit error
> > codes. For the sake of the skim-readers, I’ve left that list at
> > the end of the email.
> >
> > Taking the example of our node receiving an invalid signature for
> > a htlc, a new error would look like this:
>
> I think this is both too much, and not enough.
>
> Too much:
> - Many of these errors are "your implementation is broken", which is
>   really not something actionable by the recipient.
> - A lot of work to fill in all these error cases, which will (because
>   they're usually impossible) will be untested and broken.
>
> Not enough:
> - Look at the proposal for channel_types, where you would object to the
>   channel_type if you don't like it.  This would be grouped under
>   "Funding params unacceptable", which is actually 99% of errors at this
>   point and does not say what the problem is with specificity.
>
> I took a different approach with onion messages[1], where you (optionally)
> specify the field number, even an optional suggested value:
>
> 1. type: 1 (`erroneous_field`)
> 2. data:
> * [`tu64`:`tlv_fieldnum`]
> 1. type: 3 (`suggested_value`)
> 2. data:
> * [`...*byte`:`value`]
> 1. type: 5 (`error`)
> 2. data:
> * [`...*utf8`:`msg`]
>
> In our case, we need to refer to which message (if any) caused the
> error, and we hav

Re: [Lightning-dev] Error Codes for LN

2021-07-06 Thread Rusty Russell
Carla Kirk-Cohen  writes:
> Hi all,

Hi Carla,

I apologize for not responding to this earlier, but it was
raised again in the recent spec meeting
(https://lightningd.github.io/meetings/ln_spec_meeting/2021/ln_spec_meeting.2021-07-05-20.06.log.html).

I love the idea of more specific error codes, BTW!

Feedback interleaved:

> Since we shouldn’t have non-ascii values in the error string itself,
> this can most easily be achieved by adding TLV fields after the
> data field. In terms of supporting nodes that have not upgraded,
> we could either include the error code in the data field to cover
> our bases, or introduce a feature bit so that we know whether
> to backfill the data field. This gives upgraded nodes an improved
> quality of life, while leaving older nodes unaffected.

Older nodes should definitely ignore extra fields; it's in the spec and
we've relied on this to extend messages in the past, so this part is
easy.

Technically, all defined types are now assumed to have an optional TLV
appended, since f068dd0d (Bolt 1: Specify that extensions to existing
messages must use TLV (#754)).

> While we can’t enumerate every possible error, there are quite
> a few cases in the spec where we can introduce explicit error
> codes. For the sake of the skim-readers, I’ve left that list at
> the end of the email.
>
> Taking the example of our node receiving an invalid signature for
> a htlc, a new error would look like this:

I think this is both too much, and not enough.

Too much:
- Many of these errors are "your implementation is broken", which is
  really not something actionable by the recipient.
- A lot of work to fill in all these error cases, which will (because
  they're usually impossible) will be untested and broken.

Not enough:
- Look at the proposal for channel_types, where you would object to the
  channel_type if you don't like it.  This would be grouped under
  "Funding params unacceptable", which is actually 99% of errors at this
  point and does not say what the problem is with specificity.

I took a different approach with onion messages[1], where you (optionally)
specify the field number, even an optional suggested value:

1. type: 1 (`erroneous_field`)
2. data:
* [`tu64`:`tlv_fieldnum`]
1. type: 3 (`suggested_value`)
2. data:
* [`...*byte`:`value`]
1. type: 5 (`error`)
2. data:
* [`...*utf8`:`msg`]

In our case, we need to refer to which message (if any) caused the
error, and we have non-tlv fields, so it can't simply use the tlv field
number.

Here's my straw proposal:

1. `tlv_stream`: `error_tlvs`
2. types:
1. type: 1 (`erroneous_message`)
2. data:
* [`...*byte`:`message`]
1. type: 3 (`erroneous_fieldnum`)
2. data:
* [`tu64`:`fieldnum`]
1. type: 5 (`suggested_value`)
2. data:
* [`...*byte`:`value`]

erroneous_message is the message we're complaining about (including
2-byte type), which may be truncated (but must be at least 2 bytes).

fieldnum is either the 0-based field number (for fixed fields), or the
number of fixed fields + the tlv type (for tlv fields).

suggested_value is the optional value if we have an idea if what we
expected / prefer.

> This new kind of error provides us with an error code that tells us
>
> exactly what has gone wrong, and metadata pointing to the htlc
>
> with an invalid sig. This information can be logged, or stored in a
>
> more permanent error store to help diagnose issues in the future.
>
> Right now, the spec is pretty strict on error handling [13], indicating
>
> that senders/recipients of errors `MUST` fail the channel referenced
>
> in the error.

> This isn’t very practical, and I believe that the majority
> of the impls don’t abide by this instruction.

This was inevitable eventually, but c-lightning deliberately treated
errors as fatal for a long time so people would *notice* and *report*
these issues.

To be fair, *LND* didn't treat them as fatal.  As so naturally your
engineers didn't *think* of them as a big deal (and testing didn't show
it up), so it would send errors for cases which it clearly didn't want
to close the channel (e.g. peer too slow to respond!).

Hence this PR, which makes these less fatal, and adds warning support:

https://github.com/lightningnetwork/lightning-rfc/pull/834

(We should similarly add this TLV to warnings?)

> Candidates for error codes:

The vast majority of these are "contact your developer, peer says we did
something illegal".  Which is always nice to have more information
about, but not critital.

The exceptions are:

> Funding Process:
>
> * Funding process timeout [2]
>
> * Fees out of range [3]
>
> * Funding tx spent [11]
>
> * Funding params unacceptable (eg, channel too small)
...
> Channel State Machine:
> 
> * HTLC timeout [4]
...
> Fee Updates
>
> * Update fee to low/high [9]
...

And BTW this one is misguided:

> * Feature bit required

If Alice says a feature bit is even (compulsory), and 

[Lightning-dev] Error Codes for LN

2021-02-11 Thread Carla Kirk-Cohen
Hi all,

I’d like to make a case for re-purposing the existing error message

(17) in the spec to allow for more structured errors, and create a

path for standardized, enriched errors going forward.

As is: the error message contains an arbitrary string, and is used to

communicate fatal errors to our peers.

To (potentially) be: the error message contains an error code, and

optional metadata which enriches the context of the error.

There are a few benefits of upgrading these messages:

* Better debugging information, and standardization across

   implementations

* Clearer information for the end user

* Reduced risk of leaking information in an arbitrary string

* More fine-grained error handling based on error code

Since we shouldn’t have non-ascii values in the error string itself,

this can most easily be achieved by adding TLV fields after the

data field. In terms of supporting nodes that have not upgraded,

we could either include the error code in the data field to cover

our bases, or introduce a feature bit so that we know whether

to backfill the data field. This gives upgraded nodes an improved

quality of life, while leaving older nodes unaffected.

While we can’t enumerate every possible error, there are quite

a few cases in the spec where we can introduce explicit error

codes. For the sake of the skim-readers, I’ve left that list at

the end of the email.

Taking the example of our node receiving an invalid signature for

a htlc, a new error would look like this:

   1.

   type: 17 (`error`)
   2.

   data

[Channel_id:channel_id]

[u16:len]

[len*byte:data]

   1.

   tlv_stream: `invalid_sig_tlvs`
   2.

   Types

i. type 0 (`error_code`)

data: [u16: error_code]

ii. type 1 (`htlc_id`)

data: [u64: id]

iii. type 2 (`state_number`)

data: [u64: commitment_number]

This new kind of error provides us with an error code that tells us

exactly what has gone wrong, and metadata pointing to the htlc

with an invalid sig. This information can be logged, or stored in a

more permanent error store to help diagnose issues in the future.

Right now, the spec is pretty strict on error handling [13], indicating

that senders/recipients of errors `MUST` fail the channel referenced

in the error. This isn’t very practical, and I believe that the majority

of the impls don’t abide by this instruction. With the addition of error

codes, we can pair each error code with a recommended action

that is more appropriate for the error at hand - for example, still force

closing if we get an invalid signature, but being more lenient if we

get a low fee in update fee. While this may already be the case in

practise, it’s messy for everybody to roll their own and then figure

out what other impls are doing. With a standardized set of errors,

and reasonable handling - rather than the current “close-all-da-chans”

prescription in the spec - we can clear up some of the ambiguity

around errors, and a spec that’s reasonable for the end user to

follow.

Thanks for reading!

- Carla

Candidates for error codes:

Signature problems:

* Incorrect Signature [1] [3] [7] [12]

*Non-standard signature [1] [3 [7]

Funding Process:

* Funding process timeout [2]

* Fees greater than base fee [3]

* Fees out of range [3]

* Funding tx spent [11]

* Funding params unacceptable (eg, channel too small)

Channel State Machine:

* HTLC timeout [4]

* Zero htlc add [5]

* Htlc add less than minimum [5]

* Htlc add can’t afford at current fee rate [5]

* Maximum htlc count exceeded [5]

* Maximum htlc in flight exceeded [5]

* cltv > 5 [5]

* Sub-msat values [5]

* Fulfill/fail htlc id not found [6]

* Incorrect commitment number [10]

* Incorrect preimage [6]

* Incorrect per commit secret [8]

Fee Updates

* Update fee to low/high [9]

* Unexpected update fee [9]

* Cannot afford update fee [9]

Connection Level

* Disconnecting

* Feature bit required

Gossip

* Incorrect gossip short channel ID [12]


[1]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#the-funding_created-message

[2]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#the-funding_locked-message

[3]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#closing-negotiation-closing_signed

[4]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-8

[5]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-9

[6]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-10

[7]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-11

[8]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-12

[9]
https://github.com/lightningnetwork/lightning-rfc/blob/master/02-peer-protocol.md#requirements-13

[10]
https://github.com/lightningnetwork/lightning-rfc/blob/mas