Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-15 Thread Joost Jager
Hi ZmnSCPxj,

> Since B cannot report the `update_fail_htlc` immediately, its timer should
> still be running.
> Suppose we counted only up to `update_fail_htlc` and not on the
> `revoke_and_ack`.
> If C sends `update_fail_htlc` immediately, then the
> `update_add_htlc`->`update_fail_htlc` time reported by B would be fast.
> But if C then does not send `revoke_and_ack`, B cannot safely propagate
> `update_fail_htlc` to A, so the time reported by A will be slow.
> This sudden transition of time from A to B will be blamed on A and B,
> while C is unpunished.
>
> That is why, for failures, we ***must*** wait for `revoke_and_ack`.
> The node must report the time when it can safely propagate the error
> report upstream, not the time it receives the error report.
> For payment fulfillment, `update_fulfill_htlc` is fine without waiting for
> `revoke_and_ack` since it is always reported immediately upstream anyway.
>

Yes, that is a good point. Indeed, C hasn't completed its duty until it sends
`revoke_and_ack`.


> > I think we could indeed do more with the information that we currently
> > have and gather some more by probing. But in the end we would still be
> > sampling a noisy signal. More scenarios to take into account, less accurate
> > results and probably more non-ideal payment attempts. Failed, slow or stuck
> > payments degrade the user experience of lightning, while "fat errors"
> > arguably don't impact the user in a noticeable way.
>
> Fat errors just give you more information when a problem happens for a
> "real" payment.
> But the problem still occurs on the "real" payment and user experience is
> still degraded.
>
> Background probing gives you the same information **before** problems
> happen for "real" payments.
>

With probing, I was thinking of probing right before making the actual
payment, not a process that runs continuously in the background. I am not
sure how that would scale with everyone (assuming mass adoption) probing the
whole network. Also, for private channels, nodes may put rate limits in
place or close channels that are the source of many failures. Endpoints with
only private channels, such as a mobile app, would then no longer be able to
probe effectively.

I do think probes are useful, but would only use them sparingly. Sending a
probe before the real payment surely helps to mitigate certain risks, but
then I'd still prefer to also have fat errors, to get maximum value out of
the probe and minimize the number of probes required. Functionally speaking,
I don't see why you wouldn't want that extra information.

 Joost
___
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev


Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-14 Thread ZmnSCPxj via Lightning-dev
Good morning Joost,

> Yes that is accurate, although using the time difference between receiving 
> the `update_add_htlc` and sending back the `update_fail_htlc` would work too. 
> It would then include the node's processing time.

It would not work safely.
A node can only propagate an `update_fail_htlc` if the downstream 
`update_fail_htlc` has been irrevocably committed by `revoke_and_ack`.
See BOLT spec about this.

Suppose we have a route A -> B -> C.
C sends `update_fail_htlc` immediately, but dallies on `revoke_and_ack`.
B cannot send `update_fail_htlc` to A yet, because C can still drop the 
previous B-C channel state onchain (it is not yet revoked, that is what the 
`revoke_and_ack` will later do).
If B sends `update_fail_htlc` to A as soon as it receives `update_fail_htlc`
from C, A can use the new A-B channel state onchain, while at the same time C
drops the previous B-C channel state onchain.
The new A-B channel state returns the HTLC to A, while the previous B-C channel
state still has the HTLC claimable by C, causing B to lose funds.

For `update_fulfill_htlc`, B can immediately propagate to A (without waiting
for `revoke_and_ack` from C) since C is already claiming the money.

Since B cannot report the `update_fail_htlc` immediately, its timer should 
still be running.
Suppose we counted only up to `update_fail_htlc` and not on the 
`revoke_and_ack`.
If C sends `update_fail_htlc` immediately, then the 
`update_add_htlc`->`update_fail_htlc` time reported by B would be fast.
But if C then does not send `revoke_and_ack`, B cannot safely propagate 
`update_fail_htlc` to A, so the time reported by A will be slow.
This sudden transition of time from A to B will be blamed on A and B, while C 
is unpunished.

That is why, for failures, we ***must*** wait for `revoke_and_ack`.
The node must report the time when it can safely propagate the error report 
upstream, not the time it receives the error report.
For payment fulfillment, `update_fulfill_htlc` is fine without waiting for 
`revoke_and_ack` since it is always reported immediately upstream anyway.
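
A minimal sketch of this timing rule (the event-handler names are
hypothetical, not any implementation's actual interface): the timer stops at
`update_fulfill_htlc` for successes, but for failures only at the
`revoke_and_ack` that makes the failure safe to propagate upstream.

```python
import time

class HtlcTimer:
    """Tracks how long a forwarded HTLC was held. Per the rule above,
    the timer stops at `update_fulfill_htlc` for successes, but for
    failures only at the `revoke_and_ack` that makes the downstream
    `update_fail_htlc` irrevocable."""

    def __init__(self):
        self.start = None
        self.failed = False
        self.held_ms = None

    def on_update_add_htlc(self):
        self.start = time.monotonic()

    def on_update_fulfill_htlc(self):
        # Fulfills are propagated upstream immediately, so stop here.
        self.held_ms = (time.monotonic() - self.start) * 1000

    def on_update_fail_htlc(self):
        # Not yet safe to propagate: the old downstream state is unrevoked.
        self.failed = True

    def on_revoke_and_ack(self):
        # The failure is now irrevocably committed downstream.
        if self.failed and self.held_ms is None:
            self.held_ms = (time.monotonic() - self.start) * 1000
```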

See my discussion about "fast forwards": 
https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-April/001986.html

> I think we could indeed do more with the information that we currently have 
> and gather some more by probing. But in the end we would still be sampling a 
> noisy signal. More scenarios to take into account, less accurate results and 
> probably more non-ideal payment attempts. Failed, slow or stuck payments 
> degrade the user experience of lightning, while "fat errors" arguably don't 
> impact the user in a noticeable way.

Fat errors just give you more information when a problem happens for a "real" 
payment.
But the problem still occurs on the "real" payment and user experience is still 
degraded.

Background probing gives you the same information **before** problems happen 
for "real" payments.

Regards,
ZmnSCPxj


Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-14 Thread Joost Jager
Hi Bastien,


> What about having each node add some padding along the way? The erring
> node's padding should be bigger than intermediate nodes' padding (ideally
> using a deterministic construction as you suggest) so details need to be
> fleshed out, but it could mitigate even more the possibility of
> intermediate nodes figuring out their approximate position.
> That also mitigates the risk that a network observer correlates error
> messages between hops (because in the variable-length message that you
> propose, a network observer can easily track an error message across the
> whole payment path).
>

Yes, we could also do that. Then even if the same person has two different
nodes in the path, they can't know for sure how many hops are in between.

It would be nice if there were a way to add padding such that hops don't
learn anything about their position, but I am not sure that is possible.
Having the error node add padding with a random length between 0 and 20
block sizes (a block being the number of bytes a hop adds on the backward
path) does reveal an upper bound for the distance to the error node. For
example: a node receives a failure with a padding of 3 blocks. That means
the distance to the error node is between 0 and 3 hops.

Joost


Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-14 Thread Joost Jager
Hi ZmnSCPxj,


> > That is definitely a concern. It is up to senders how to interpret the
> > received timestamps. They can decide to tolerate slight variations. Or they
> > could just look at the difference between the in and out timestamp,
> > abandoning the synchronization requirement altogether (a node could also
> > just report that difference instead of two timestamps). The held duration
> > is enough to identify a pair of nodes from which one of the nodes is
> > responsible for the delay.
> >
> > Example (held durations between parenthesis):
> >
> > A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)
> >
> > In this case either B or C is delaying the payment. We'd penalize the
> > channel between B and C.
>
> This seems better.
> If B is at fault, it could lie and reduce its reported delta time, but
> that simply means it will be punished with A.
> If C is at fault, it could lie and increase its reported delta time, but
> that simply means it will be punished with D.
>
> I presume that the delta time is the time difference from when it sends
> `update_add_htlc` and when it receives `update_fulfill_htlc`, or when it
> gets an irrevocably committed `update_fail_htlc` + `revoke_and_ack`.
> Is that accurate?
>

Yes, that is accurate, although using the time difference between receiving
the `update_add_htlc` and sending back the `update_fail_htlc` would work
too. It would then include the node's processing time.


> Unit should probably be milliseconds.
>

Yes, we probably want sub-second resolution for this.

> An alternative that comes to mind is to use active probing and tracking
> persistent data per node.
>
> For each node we record two pieces of information:
>
> 1.  Total imposed delay.
> 2.  Number of attempts.
>
> Suppose a probe or payment takes N milliseconds on a route with M nodes to
> fulfill or irrevocably fail at the payer.
> For each node on the route, we increase Total imposed delay by N / M
> rounded up, and increment Number of attempts.
> For error reports we can shorten the route if we get an error response
> that points to a specific failing node, or penalize the entire route in
> case of a completely undecodable error response.
>
> When finding a route for a "real" payment, we adjust the cost of
> traversing a node by the ratio Total imposed delay / Number of attempts (we
> can avoid undefined math by starting both fields at 1).
> For probes we can probably ignore this factor in order to give nodes that
> happened to be borked by a different slow node on the trial route another
> chance to exonerate their apparent slowness.
>
> This does not need changes in the current spec.
>

I think we could indeed do more with the information that we currently have
and gather some more by probing. But in the end we would still be sampling
a noisy signal. More scenarios to take into account, less accurate results
and probably more non-ideal payment attempts. Failed, slow or stuck
payments degrade the user experience of lightning, while "fat errors"
arguably don't impact the user in a noticeable way.

Joost


Re: [Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-14 Thread ZmnSCPxj via Lightning-dev
Good morning Joost,

> That is definitely a concern. It is up to senders how to interpret the 
> received timestamps. They can decide to tolerate slight variations. Or they 
> could just look at the difference between the in and out timestamp, 
> abandoning the synchronization requirement altogether (a node could also just 
> report that difference instead of two timestamps). The held duration is 
> enough to identify a pair of nodes from which one of the nodes is responsible 
> for the delay.
>
> Example (held durations between parenthesis):
>
> A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)
>
> In this case either B or C is delaying the payment. We'd penalize the channel 
> between B and C.

This seems better.
If B is at fault, it could lie and reduce its reported delta time, but that 
simply means it will be punished with A.
If C is at fault, it could lie and increase its reported delta time, but that 
simply means it will be punished with D.

I presume that the delta time is the time difference from when it sends 
`update_add_htlc` and when it receives `update_fulfill_htlc`, or when it gets 
an irrevocably committed `update_fail_htlc` + `revoke_and_ack`.
Is that accurate?

Unit should probably be milliseconds.

--

An alternative that comes to mind is to use active probing and tracking 
persistent data per node.

For each node we record two pieces of information:

1.  Total imposed delay.
2.  Number of attempts.

Suppose a probe or payment takes N milliseconds on a route with M nodes to 
fulfill or irrevocably fail at the payer.
For each node on the route, we increase Total imposed delay by N / M rounded 
up, and increment Number of attempts.
For error reports we can shorten the route if we get an error response that 
points to a specific failing node, or penalize the entire route in case of a 
completely undecodable error response.

When finding a route for a "real" payment, we adjust the cost of traversing a 
node by the ratio Total imposed delay / Number of attempts (we can avoid 
undefined math by starting both fields at 1).
For probes we can probably ignore this factor in order to give nodes that 
happened to be borked by a different slow node on the trial route another 
chance to exonerate their apparent slowness.

This does not need changes in the current spec.
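
The bookkeeping above could be sketched as follows (class and method names
are illustrative only; both counters start at 1 to avoid undefined math):

```python
from collections import defaultdict

class NodeScores:
    """Per-node delay statistics as described above: total imposed
    delay and number of attempts, both initialized to 1."""

    def __init__(self):
        self.total_delay_ms = defaultdict(lambda: 1)
        self.attempts = defaultdict(lambda: 1)

    def record_attempt(self, route, total_ms):
        # Spread the observed completion time evenly over the route's
        # nodes, rounding up (ceiling division).
        share = -(-total_ms // len(route))
        for node in route:
            self.total_delay_ms[node] += share
            self.attempts[node] += 1

    def cost_factor(self, node):
        # Multiplier on the node's routing cost for "real" payments.
        return self.total_delay_ms[node] / self.attempts[node]
```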

Regards,
ZmnSCPxj


[Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-14 Thread Joost Jager
Hi ZmnSCPxj,

> Before proceeding with discussing HMACs and preventing nodes from putting
> words in the mouths of other nodes, perhaps we should consider, how we can
> ensure that nodes can be forced to be accurate about what happened.
>
> For instance, a proposal is for nodes to put timestamps for certain events.
> Does this imply that all nodes along the route **MUST** have their clocks
> strongly synchronized to some global clock?
> If a node along the route happens to be 15 seconds early or 15 seconds
> late, can it be erroneously "dinged" for this when a later hop delays a
> successful payment for 20 seconds?
>
> If it requires that hop nodes have strong synchrony with some global clock
> service, why would I want to run a hop node then?
> What happens if some global clock service is attacked in order to convince
> nodes to route to particular nodes (using a competing global clock service)
> on the Lightning network?
>

That is definitely a concern. It is up to senders how to interpret the
received timestamps. They can decide to tolerate slight variations. Or they
could just look at the difference between the in and out timestamp,
abandoning the synchronization requirement altogether (a node could also
just report that difference instead of two timestamps). The held duration
is enough to identify a pair of nodes from which one of the nodes is
responsible for the delay.

Example (held durations between parenthesis):

A (15 secs) -> B (14 secs) -> C (3 secs) -> D (2 secs)

In this case either B or C is delaying the payment. We'd penalize the
channel between B and C.
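
The pair selection in the example can be sketched as follows (`slow_pair`
is a hypothetical helper; the sender would in practice also tolerate some
expected per-hop latency):

```python
def slow_pair(held_durations):
    """Given (node, held_seconds) pairs reported along the route, sender
    side first, return the adjacent pair with the largest drop in held
    duration: one of those two nodes is responsible for the delay."""
    drops = [
        (a_held - b_held, (a, b))
        for (a, a_held), (b, b_held) in zip(held_durations, held_durations[1:])
    ]
    # max() compares the drop first, so the steepest drop wins.
    return max(drops)[1]
```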

Joost


[Lightning-dev] Improve Lightning payment reliability through better error attribution

2019-06-12 Thread Joost Jager
Hello list,

In Lightning, the reliability of payments is dependent on the reliability
of the chosen route. Information about previous payment attempts helps to
select better routes and improve the payment experience. Therefore,
implementations usually track the past performance of nodes and channels.
This can be as simple as a blacklist of previously failed channels.

In order for this mechanism to be most effective, it is important to know
which node is to blame for a non-ideal payment attempt.

Non-ideal payment attempts are not only failed payment attempts (either
instantly failed or after a delay), but also successful payments for which
it took a long time to receive the `update_fulfill_htlc` message.

For non-ideal payment attempts, we are currently not always able to
determine the node that should be penalized. In particular:
* If an attempt takes long to complete (settle or fail), we have no
information that points us to the source of the delay.
* Nodes can return a corrupt failure message. When this message arrives at
the sender after a number of encryption rounds, the sender is no longer
able to pinpoint the node that failed the payment.

A potential solution is to change the failure message such that every hop
along the backward path adds an hmac to the message (currently only the
error source authenticates the message). This allows the source of a
corruption to be narrowed down to a pair of nodes, which is enough to
properly apply a penalty.
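
One way the sender-side check could look, under the simplifying assumption
that each hop's hmac covers exactly the payload it forwarded and that the
sender can reconstruct both (`hop_hmac` and the traversal order are
illustrative, not a concrete message format):

```python
import hashlib
import hmac

def hop_hmac(shared_secret, payload):
    # Illustrative per-hop authentication of the failure payload.
    return hmac.new(shared_secret, payload, hashlib.sha256).digest()

def locate_corruption(payloads, macs, shared_secrets):
    """Sender-side check. payloads[i]/macs[i] are what hop i (closest to
    the sender first) claims to have forwarded; the sender knows every
    hop's shared secret from route construction. The first hmac that
    fails to verify narrows the corruption down to hop i or the hop
    before it (index -1 meaning the sender's own peer link)."""
    for i, (payload, mac, secret) in enumerate(zip(payloads, macs, shared_secrets)):
        if not hmac.compare_digest(hop_hmac(secret, payload), mac):
            return (i - 1, i)  # the pair of suspect hops
    return None  # every hmac verified
```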

In addition to that, all hops could add two timestamps to the failure
message: the htlc add time and the htlc fail time. Using this information,
the sender of the payment can identify the source of the delay down to,
again, a pair of nodes. Those timestamps could be added to the settle
message as well, to also allow diagnostics on slow settled payments.

The challenge here is to design the failure message format in such a way
that hops cannot learn their position in the path. Just appending
timestamps and hmacs to a variable length message would reveal the distance
between a node and the error source.

A fixed length message in which hops shift some previous (unused) data out
from the message to create space to add their own data does not seem to
work. What happens is that the earlier nodes calculate their hmac over data
that is shifted out and cannot be recovered anymore by the sender. The
sender has no way to verify the hmac in that case. Regenerating the shifted
out data (similar to deterministic padding on the forward path) isn't a
solution either, because a node may modify that (unused) data before
passing the message on. This would invalidate all hmacs, preventing the
sender from locating the responsible node.

One space-inefficient solution is to have every hop add hmacs for every
possible (real) message length, but this would require n^2 hmacs in total
(20*20*32 bytes). Half of these could be discarded along the way, but it
would still leave 10*20*32=6.4kb of hmacs.

Another direction might be to use a variable length message, but have the
error source add a seemingly random length padding. The actual length could
be deterministically derived from the shared secret, so that the erring
node cannot just not add padding. This obfuscates the distance to the error
source somewhat, but still reveals a bit of information. If one knows that
the padding length is somewhere between 0 and 20 blocks worth of bytes, a
message length of, say, 25 blocks would reveal that the error source is at
least 5 hops away. It could be a fair trade-off though.
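
Deriving the padding length deterministically from the shared secret might
look like this (the key-derivation label `b"pad_length"` and the block-count
range are assumptions, not a concrete proposal):

```python
import hashlib
import hmac

MAX_PAD_BLOCKS = 20  # assumed maximum route length, in backward-path blocks

def padding_blocks(shared_secret: bytes) -> int:
    # Both the error source and the sender can compute this value, so
    # the erring node cannot simply skip or choose its own padding.
    digest = hmac.new(shared_secret, b"pad_length", hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (MAX_PAD_BLOCKS + 1)
```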

An alternative to all of this is to try to locate bad nodes by probing with
different route lengths and coming from different angles. This will however
require more attempts and more complex processing of the outcomes. There is
also a level of indirectness because not all information is gathered in a
single roundtrip. And in addition to that, a malicious node may somehow act
differently if it manages to recognize the probes.

I'd be interested to hear your opinions about the importance of being able
to locate bad nodes irrefutably, as well as ideas around the failure
message format.

Joost