Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution
Hi ZmnSCPxj,

> Since B cannot report the `update_fail_htlc` immediately, its timer should
> still be running. Suppose we counted only up to `update_fail_htlc` and not
> up to the `revoke_and_ack`. If C sends `update_fail_htlc` immediately, then
> the `update_add_htlc` -> `update_fail_htlc` time reported by B would be
> fast. But if C then does not send `revoke_and_ack`, B cannot safely
> propagate `update_fail_htlc` to A, so the time reported by A will be slow.
> This sudden transition of time from A to B will be blamed on A and B,
> while C is unpunished.
>
> That is why, for failures, we ***must*** wait for `revoke_and_ack`. The
> node must report the time when it can safely propagate the error report
> upstream, not the time it receives the error report. For payment
> fulfillment, `update_fulfill_htlc` is fine without waiting for
> `revoke_and_ack` since it is always reported immediately upstream anyway.

Yes, that is a good point. C hasn't completed its duty until it sends `revoke_and_ack` indeed.

> > I think we could indeed do more with the information that we currently
> > have and gather some more by probing. But in the end we would still be
> > sampling a noisy signal. More scenarios to take into account, less
> > accurate results and probably more non-ideal payment attempts. Failed,
> > slow or stuck payments degrade the user experience of lightning, while
> > "fat errors" arguably don't impact the user in a noticeable way.
>
> Fat errors just give you more information when a problem happens for a
> "real" payment. But the problem still occurs on the "real" payment and
> user experience is still degraded.
>
> Background probing gives you the same information **before** problems
> happen for "real" payments.

With probing, I was thinking about probing right before making the actual payment, not a process that is continuously running in the background. I am not sure how continuous probing would scale if everyone (assuming mass adoption) probed the whole network.

Also, for private channels, nodes may put rate limits in place or close channels that are the source of many failures. End points with only private channels, like a mobile app, then wouldn't be able to probe effectively anymore.

I do think probes are useful, but I would only use them sparingly. Sending a probe before the real payment surely helps to mitigate certain risks. But then I'd still prefer to also have fat errors, to get maximum value out of the probe and minimize the number of probes required. Functionally speaking, I don't see why you wouldn't want that extra information.

Joost

_______________________________________________
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution
Good morning Joost,

> Yes that is accurate, although using the time difference between receiving
> the `update_add_htlc` and sending back the `update_fail_htlc` would work
> too. It would then include the node's processing time.

It would not work safely. A node can only propagate an `update_fail_htlc` if the downstream `update_fail_htlc` has been irrevocably committed by `revoke_and_ack`; see the BOLT spec about this.

Suppose we have a route A -> B -> C. C sends `update_fail_htlc` immediately, but dallies on `revoke_and_ack`. B cannot send `update_fail_htlc` to A yet, because C can still drop the previous B-C channel state onchain (it is not yet revoked; that is what the `revoke_and_ack` will later do). If B sends `update_fail_htlc` to A as soon as it receives `update_fail_htlc` from C, A can put the new A-B channel state onchain while at the same time C drops the previous B-C channel state onchain. The new A-B channel state returns the HTLC to A, while the previous B-C channel state has the HTLC still claimable by C, causing B to lose funds.

For `update_fulfill_htlc`, B can immediately propagate it to A (without waiting for `revoke_and_ack` from C) since C is already claiming the money.

Since B cannot report the `update_fail_htlc` immediately, its timer should still be running. Suppose we counted only up to `update_fail_htlc` and not up to the `revoke_and_ack`. If C sends `update_fail_htlc` immediately, then the `update_add_htlc` -> `update_fail_htlc` time reported by B would be fast. But if C then does not send `revoke_and_ack`, B cannot safely propagate `update_fail_htlc` to A, so the time reported by A will be slow. This sudden transition of time from A to B will be blamed on A and B, while C is unpunished.

That is why, for failures, we ***must*** wait for `revoke_and_ack`. The node must report the time when it can safely propagate the error report upstream, not the time it receives the error report.
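The reporting rule could be sketched as a small event-driven timer; all handler names here are illustrative, not from the BOLT specs:

```python
import time

class HtlcHoldTimer:
    """Sketch of when a forwarding node's reported hold time should stop:
    at `update_fulfill_htlc` for successes, but only at the irrevocable
    `revoke_and_ack` for failures."""

    def __init__(self):
        self.start = None
        self.end = None
        self.failed = False

    def on_update_add_htlc(self):
        self.start = time.monotonic()

    def on_update_fulfill_htlc(self):
        # A fulfill can be propagated upstream immediately.
        self.end = time.monotonic()

    def on_update_fail_htlc(self):
        # Not safe to propagate yet; keep the timer running.
        self.failed = True

    def on_revoke_and_ack(self):
        # The failure is now irrevocably committed downstream.
        if self.failed and self.end is None:
            self.end = time.monotonic()

    def held_ms(self):
        return int((self.end - self.start) * 1000)
```

With this rule, a downstream node that dallies on `revoke_and_ack` lengthens the reported times of every upstream node consistently, so the blame lands on the pair that contains the dallying node.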
For payment fulfillment, `update_fulfill_htlc` is fine without waiting for `revoke_and_ack` since it is always reported immediately upstream anyway. See my discussion about "fast forwards": https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-April/001986.html

> I think we could indeed do more with the information that we currently
> have and gather some more by probing. But in the end we would still be
> sampling a noisy signal. More scenarios to take into account, less
> accurate results and probably more non-ideal payment attempts. Failed,
> slow or stuck payments degrade the user experience of lightning, while
> "fat errors" arguably don't impact the user in a noticeable way.

Fat errors just give you more information when a problem happens for a "real" payment. But the problem still occurs on the "real" payment and user experience is still degraded.

Background probing gives you the same information **before** problems happen for "real" payments.

Regards,
ZmnSCPxj
Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution
Hi Bastien,

> What about having each node add some padding along the way? The erring
> node's padding should be bigger than intermediate nodes' padding (ideally
> using a deterministic construction as you suggest), so details need to be
> fleshed out, but it could mitigate even more the possibility of
> intermediate nodes figuring out their approximate position.
> That also mitigates the risk that a network observer correlates error
> messages between hops (because in the variable-length message that you
> propose, a network observer can easily track an error message across the
> whole payment path).

Yes, we could also do that. Then even if the same person has two different nodes in the path, they can't know for sure how many hops were in between. It would be nice if there were a way to add padding such that hops don't learn anything about their position, but I am not sure if that is possible.

Having the error node add padding with a random length between 0 and 20 block sizes (a block being the number of bytes a hop would add on the backward path) does reveal an upper bound for the distance to the error node. For example: a node receives a failure with a padding of 3 blocks. That means that the distance to the error node is between 0 and 3 hops.

Joost
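The deterministic construction mentioned above might look roughly like this; the key label, block size, and function name are illustrative assumptions, not a concrete proposal:

```python
import hashlib
import hmac

BLOCK_SIZE = 48       # assumed bytes each hop adds on the backward path
MAX_PAD_BLOCKS = 20

def padding_length(shared_secret: bytes) -> int:
    """Derive the error source's padding length from the shared secret.
    Because the sender knows the shared secret too, it can verify that
    exactly the mandated padding was added, so the erring node cannot
    simply omit it."""
    digest = hmac.new(shared_secret, b"padding", hashlib.sha256).digest()
    blocks = int.from_bytes(digest[:4], "big") % (MAX_PAD_BLOCKS + 1)
    return blocks * BLOCK_SIZE
```

Note that, as in the example above, a hop that sees only a few blocks of accumulated data still learns an upper bound on its distance to the error source.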
Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution
Hi ZmnSCPxj,

> > That is definitely a concern. It is up to senders how to interpret the
> > received timestamps. They can decide to tolerate slight variations. Or
> > they could just look at the difference between the in and out timestamp,
> > abandoning the synchronization requirement altogether (a node could also
> > just report that difference instead of two timestamps). The held
> > duration is enough to identify a pair of nodes from which one of the
> > nodes is responsible for the delay.
> >
> > Example (held durations between parentheses):
> >
> > A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)
> >
> > In this case either B or C is delaying the payment. We'd penalize the
> > channel between B and C.
>
> This seems better.
> If B is at fault, it could lie and reduce its reported delta time, but
> that simply means it will be punished with A.
> If C is at fault, it could lie and increase its reported delta time, but
> that simply means it will be punished with D.
>
> I presume that the delta time is the time difference from when it sends
> `update_add_htlc` and when it receives `update_fulfill_htlc`, or when it
> gets an irrevocably committed `update_fail_htlc` + `revoke_and_ack`.
> Is that accurate?

Yes, that is accurate, although using the time difference between receiving the `update_add_htlc` and sending back the `update_fail_htlc` would work too. It would then include the node's processing time.

> Unit should probably be milliseconds.

Yes, we probably want sub-second resolution for this.

> An alternative that comes to mind is to use active probing and tracking
> persistent data per node.
>
> For each node we record two pieces of information:
>
> 1. Total imposed delay.
> 2. Number of attempts.
>
> Suppose a probe or payment takes N milliseconds on a route with M nodes
> to fulfill or irrevocably fail at the payer. For each node on the route,
> we increase Total imposed delay by N / M rounded up, and increment
> Number of attempts.
>
> For error reports we can shorten the route if we get an error response
> that points to a specific failing node, or penalize the entire route in
> case of a completely undecodable error response.
>
> When finding a route for a "real" payment, we adjust the cost of
> traversing a node by the ratio Total imposed delay / Number of attempts
> (we can avoid undefined math by starting both fields at 1). For probes
> we can probably ignore this factor in order to give nodes that happened
> to be borked by a different slow node on the trial route another chance
> to exonerate their apparent slowness.
>
> This does not need changes in the current spec.

I think we could indeed do more with the information that we currently have, and gather some more by probing. But in the end we would still be sampling a noisy signal: more scenarios to take into account, less accurate results and probably more non-ideal payment attempts. Failed, slow or stuck payments degrade the user experience of lightning, while "fat errors" arguably don't impact the user in a noticeable way.

Joost
Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution
Good morning Joost,

> That is definitely a concern. It is up to senders how to interpret the
> received timestamps. They can decide to tolerate slight variations. Or
> they could just look at the difference between the in and out timestamp,
> abandoning the synchronization requirement altogether (a node could also
> just report that difference instead of two timestamps). The held duration
> is enough to identify a pair of nodes from which one of the nodes is
> responsible for the delay.
>
> Example (held durations between parentheses):
>
> A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)
>
> In this case either B or C is delaying the payment. We'd penalize the
> channel between B and C.

This seems better. If B is at fault, it could lie and reduce its reported delta time, but that simply means it will be punished with A. If C is at fault, it could lie and increase its reported delta time, but that simply means it will be punished with D.

I presume that the delta time is the time difference from when it sends `update_add_htlc` and when it receives `update_fulfill_htlc`, or when it gets an irrevocably committed `update_fail_htlc` + `revoke_and_ack`. Is that accurate?

Unit should probably be milliseconds.

--

An alternative that comes to mind is to use active probing and track persistent data per node.

For each node we record two pieces of information:

1. Total imposed delay.
2. Number of attempts.

Suppose a probe or payment takes N milliseconds on a route with M nodes to fulfill or irrevocably fail at the payer. For each node on the route, we increase Total imposed delay by N / M rounded up, and increment Number of attempts.

For error reports we can shorten the route if we get an error response that points to a specific failing node, or penalize the entire route in case of a completely undecodable error response.

When finding a route for a "real" payment, we adjust the cost of traversing a node by the ratio Total imposed delay / Number of attempts (we can avoid undefined math by starting both fields at 1). For probes we can probably ignore this factor in order to give nodes that happened to be borked by a different slow node on the trial route another chance to exonerate their apparent slowness.

This does not need changes in the current spec.

Regards,
ZmnSCPxj
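The bookkeeping described above can be sketched as follows (class and field names are mine, not from any implementation):

```python
import math
from collections import defaultdict

class NodeDelayTracker:
    """Per-node 'Total imposed delay' and 'Number of attempts', both
    starting at 1 to avoid division by zero, as proposed above."""

    def __init__(self):
        self.total_delay_ms = defaultdict(lambda: 1)
        self.attempts = defaultdict(lambda: 1)

    def record(self, route, duration_ms):
        # A probe or payment took `duration_ms` over `route` (M nodes):
        # charge each node an equal share, rounded up.
        share = math.ceil(duration_ms / len(route))
        for node in route:
            self.total_delay_ms[node] += share
            self.attempts[node] += 1

    def cost_factor(self, node):
        # Ratio used to scale the routing cost of traversing `node`.
        return self.total_delay_ms[node] / self.attempts[node]
```

A node never seen before gets a factor of 1 (both fields at their initial value), so unknown nodes are neither favored nor penalized.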
[Lightning-dev] Improve Lightning payment reliability through better error attribution
Hi ZmnSCPxj,

> Before proceeding with discussing HMACs and preventing nodes from putting
> words in the mouths of other nodes, perhaps we should consider how we can
> ensure that nodes can be forced to be accurate about what happened.
>
> For instance, a proposal is for nodes to put timestamps for certain
> events. Does this imply that all nodes along the route **MUST** have
> their clocks strongly synchronized to some global clock? If a node along
> the route happens to be 15 seconds early or 15 seconds late, can it be
> erroneously "dinged" for this when a later hop delays a successful
> payment for 20 seconds?
>
> If it requires that hop nodes have strong synchrony with some global
> clock service, why would I want to run a hop node then? What happens if
> some global clock service is attacked in order to convince nodes to route
> to particular nodes (using a competing global clock service) on the
> Lightning network?

That is definitely a concern. It is up to senders how to interpret the received timestamps. They can decide to tolerate slight variations. Or they could just look at the difference between the in and out timestamp, abandoning the synchronization requirement altogether (a node could also just report that difference instead of two timestamps). The held duration is enough to identify a pair of nodes from which one of the nodes is responsible for the delay.

Example (held durations between parentheses):

A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)

In this case either B or C is delaying the payment. We'd penalize the channel between B and C.

Joost
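The pair identification in the example above can be expressed directly; `blame_pair` is an illustrative helper, not part of any spec:

```python
def blame_pair(route, held_secs):
    """Given each hop's reported held duration (non-increasing along the
    route), return the adjacent pair with the largest drop between their
    reports, i.e. the pair containing the node responsible for the delay."""
    deltas = [held_secs[i] - held_secs[i + 1] for i in range(len(route) - 1)]
    worst = max(range(len(deltas)), key=deltas.__getitem__)
    return route[worst], route[worst + 1]
```

For the example route, `blame_pair(["A", "B", "C", "D"], [15, 14, 3, 2])` returns `("B", "C")`: the 11-second drop between B's and C's reports singles out that channel.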
[Lightning-dev] Improve Lightning payment reliability through better error attribution
Hello list,

In Lightning, the reliability of payments is dependent on the reliability of the chosen route. Information about previous payment attempts helps to select better routes and improve the payment experience. Therefore implementations usually track the past performance of nodes and channels. This can be as simple as a black list that contains previously failed channels.

In order for this mechanism to be most effective, it is important to know which node is to blame for a non-ideal payment attempt. Non-ideal payment attempts are not only failed payment attempts (either instantly failed or failed after a delay), but also successful payments for which it took a long time to receive the `htlc_fulfill` message.

For non-ideal payment attempts, we are currently not always able to determine the node that should be penalized. In particular:

* If an attempt takes long to complete (settle or fail), we have no information that points us to the source of the delay.
* Nodes can return a corrupt failure message. When this message arrives at the sender after a number of encryption rounds, the sender is no longer able to pinpoint the node that failed the payment.

A potential solution is to change the failure message such that every hop along the backward path adds an hmac to the message (currently only the error source authenticates the message). This allows the source of a corruption to be narrowed down to a pair of nodes, which is enough to properly apply a penalty.

In addition to that, all hops could add two timestamps to the failure message: the htlc add time and the htlc fail time. Using this information, the sender of the payment can identify the source of the delay down to, again, a pair of nodes. Those timestamps could be added to the settle message as well, to also allow diagnostics on slow settled payments.

The challenge here is to design the failure message format in such a way that hops cannot learn their position in the path.
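A naive version of this per-hop construction, with illustrative field sizes (8-byte timestamps, 32-byte sha256 hmacs keyed with the hop's shared secret), might look like:

```python
import hashlib
import hmac
import struct

def add_hop_data(message: bytes, shared_secret: bytes,
                 add_time_ms: int, fail_time_ms: int) -> bytes:
    """Each hop on the backward path appends its htlc add/fail timestamps
    and an hmac covering everything it passes upstream."""
    payload = message + struct.pack(">QQ", add_time_ms, fail_time_ms)
    mac = hmac.new(shared_secret, payload, hashlib.sha256).digest()
    return payload + mac

def peel_hop_data(message: bytes, shared_secret: bytes) -> bytes:
    """Sender side: verify the outermost hop's hmac and strip it, leaving
    the inner message plus that hop's timestamps. A verification failure
    narrows corruption down to this hop or the one before it."""
    payload, mac = message[:-32], message[-32:]
    expected = hmac.new(shared_secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("corrupt failure message at this hop")
    return payload
```

Note that the message grows by a fixed 48 bytes per hop, so its length alone reveals the distance to the error source.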
Just appending timestamps and hmacs to a variable-length message would reveal the distance between a node and the error source.

A fixed-length message in which hops shift some previous (unused) data out of the message, to create space for their own data, does not seem to work either. What happens is that the earlier nodes calculate their hmac over data that is shifted out and cannot be recovered anymore by the sender, so the sender has no way to verify the hmac. Regenerating the shifted-out data (similar to deterministic padding on the forward path) isn't a solution either, because a node may modify that (unused) data before passing the message on. This would invalidate all hmacs, preventing the sender from locating the responsible node.

One space-inefficient solution is to have every hop add hmacs for every possible (real) message length, but this would require n^2 hmacs in total (20*20*32 bytes). Half of these could be discarded along the way, but it would still leave 10*20*32 = 6.4 kB of hmacs.

Another direction might be to use a variable-length message, but have the error source add a seemingly random-length padding. The actual length could be deterministically derived from the shared secret, so that the erring node cannot simply omit the padding. This obfuscates the distance to the error source somewhat, but still reveals a bit of information. If one knows that the padding length is somewhere between 0 and 20 blocks' worth of bytes, a message length of, say, 25 blocks would reveal that the error source is at least 5 hops away. It could be a fair trade-off though.

An alternative to all of this is to try to locate bad nodes by probing with different route lengths and coming from different angles. This will however require more attempts and more complex processing of the outcomes. There is also a level of indirectness because not all information is gathered in a single roundtrip.
And in addition to that, a malicious node may act differently if it manages to recognize the probes.

I'd be interested to hear your opinions about the importance of being able to locate bad nodes irrefutably, as well as ideas around the failure message format.

Joost