Re: [tcp] 9d9b1ee0b2: packetdrill.packetdrill/gtests/net/tcp/user_timeout/user-timeout-probe_ipv4-mapped-v6.fail

2021-02-19 Thread Neal Cardwell
On Thu, Feb 18, 2021 at 8:33 PM kernel test robot  wrote:
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 9d9b1ee0b2d1c9e02b2338c4a4b0a062d2d3edac ("tcp: fix TCP_USER_TIMEOUT 
> with zero window")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

I have pushed to the packetdrill repo a commit that should fix this test:

094da3bc77e5 (HEAD, packetdrill/master) net-test: update TCP tests for
USER_TIMEOUT ZWP fix
https://github.com/google/packetdrill/commit/094da3bc77e518d820ebc0ef8b94a5b4cf707a39

Can someone please pull that commit into the repo used by the test
bot, if needed? (Or does it automatically use the latest packetdrill
master branch?)

thanks,
neal


Re: [PATCH net] tcp: make TCP_USER_TIMEOUT accurate for zero window probes

2021-01-23 Thread Neal Cardwell
On Fri, Jan 22, 2021 at 9:45 PM Enke Chen  wrote:
>
> Hi, Jakub:
>
> On Fri, Jan 22, 2021 at 06:34:24PM -0800, Jakub Kicinski wrote:
> > On Fri, 22 Jan 2021 18:28:23 -0800 Enke Chen wrote:
> > > Hi, Jakub:
> > >
> > > In terms of backporting, this patch should go together with:
> > >
> > > 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> >
> > As in it:
> >
> > Fixes: 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> >
> > or does it further fix the same issue, so:
> >
> > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> >
> > ?
>
> Let me clarify:
>
> 1) 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
>
>fixes the bug and makes it work.
>
> 2) The current patch makes the TCP_USER_TIMEOUT accurate for 0-window probes.
>It's independent.

Patch (2) ("tcp: make TCP_USER_TIMEOUT accurate for zero window
probes") is indeed conceptually independent of (1) but its
implementation depends on the icsk_probes_tstamp field defined in (1),
so AFAICT (2) cannot be backported further back than (1).

Patch (1) fixes a bug in 5.1:
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")

So probably (1) and (2) should be backported as a pair, and only back
as far as 5.1. (That covers 2 LTS kernels, 5.4 and 5.10, so hopefully
that is good enough.)

neal


Re: [PATCH] Revert "tcp: simplify window probe aborting on USER_TIMEOUT"

2021-01-11 Thread Neal Cardwell
On Fri, Jan 8, 2021 at 11:38 PM Enke Chen  wrote:
>
> From: Enke Chen 
>
> This reverts commit 9721e709fa68ef9b860c322b474cfbd1f8285b0f.
>
> With the commit 9721e709fa68 ("tcp: simplify window probe aborting
> on USER_TIMEOUT"), the TCP session does not terminate with
> TCP_USER_TIMEOUT when data remain untransmitted due to zero window.
>
> The number of unanswered zero-window probes (tcp_probes_out) is
> reset to zero with incoming acks irrespective of the window size,
> as described in tcp_probe_timer():
>
> RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> as long as the receiver continues to respond probes. We support
> this by default and reset icsk_probes_out with incoming ACKs.
>
> This counter, however, is the wrong one to be used in calculating the
> duration that the window remains closed and data remain untransmitted.
> Thanks to Jonathan Maxwell  for diagnosing the
> actual issue.
>
> Cc: sta...@vger.kernel.org
> Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> Reported-by: William McCall 
> Signed-off-by: Enke Chen 
> ---

I ran this revert commit through our packetdrill TCP tests, and it's
causing failures in a ZWP/USER_TIMEOUT test due to interactions with
this Jan 2019 patch:

7f12422c4873e9b274bc151ea59cb0cdf9415cf1
tcp: always timestamp on every skb transmission

The issue seems to be that after 7f12422c4873 the skb->skb_mstamp_ns
is set on every transmit attempt. That means that even skbs that are
not successfully transmitted have a non-zero skb_mstamp_ns. That means
that if ZWPs are repeatedly failing to be sent due to severe local
qdisc congestion, then at this point in the code the start_ts is
always only 500ms in the past (from TCP_RESOURCE_PROBE_INTERVAL =
500ms). That means that if there is severe local qdisc congestion a
USER_TIMEOUT above 500ms is a NOP, and the socket can live far past
the USER_TIMEOUT.

It seems we need a slightly different approach than the revert in this commit.

neal


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-03 Thread Neal Cardwell
On Wed, Jun 3, 2020 at 9:55 AM Eric Dumazet  wrote:
>
> On Wed, Jun 3, 2020 at 5:02 AM Neal Cardwell  wrote:
> >
> > On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet  wrote:
> > >
> > > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing  
> > > wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm still trying to understand what you're saying before. Would this
> > > > be better as following:
> > > > 1) discard the tcp_internal_pacing() function.
> > > > 2) remove where the tcp_internal_pacing() is called in the
> > > > __tcp_transmit_skb() function.
> > > >
> > > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > > should we introduce the tcp_wstamp_ns socket field as commit
> > > > (864e5c090749) does?
> > > >
> > >
> > > Please do not top-post on netdev mailing list.
> > >
> > >
> > > I basically suggested double-checking which point in TCP could end up
> > > calling tcp_internal_pacing()
> > > while the timer was already armed.
> > >
> > > I guess this is mtu probing.
> >
> > Perhaps this could also happen from some of the retransmission code
> > paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> > tcp_retransmit_timer() (RTO) and  tcp_send_loss_probe() TLP? It seems
> > they could indirectly cause a call to __tcp_transmit_skb() and thus
> > tcp_internal_pacing() without first checking if the pacing timer was
> > already armed?
>
> I feared this, (see recent commits about very low pacing rates) :/
>
> I am not sure we need to properly fix all these points for old
> kernels, since EDT model got rid of these problems.

Agreed.

> Maybe we can try to extend the timer.

Sounds good.

> Something like :
>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> cc4ba42052c21b206850594db6751810d8fc72b4..626b9f4f500f7e5270d8d59e6eb16dbfa3efbc7c
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
>
>  static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
>  {
> +   struct tcp_sock *tp = tcp_sk(sk);
> +   ktime_t expire, now;
> u64 len_ns;
> u32 rate;
>
> @@ -977,12 +979,29 @@ static void tcp_internal_pacing(struct sock *sk,
> const struct sk_buff *skb)
>
> len_ns = (u64)skb->len * NSEC_PER_SEC;
> do_div(len_ns, rate);
> -   hrtimer_start(_sk(sk)->pacing_timer,
> - ktime_add_ns(ktime_get(), len_ns),
> +
> +   now = ktime_get();
> +   /* If hrtimer is already armed, then our caller has not
> +* used tcp_pacing_check().
> +*/
> +   if (unlikely(hrtimer_is_queued(>pacing_timer))) {
> +   expire = hrtimer_get_softexpires(>pacing_timer);
> +   if (ktime_after(expire, now))
> +   now = expire;
> +   if (hrtimer_try_to_cancel(>pacing_timer) == 1)
> +   __sock_put(sk);
> +   }
> +   hrtimer_start(>pacing_timer, ktime_add_ns(now, len_ns),
>   HRTIMER_MODE_ABS_PINNED_SOFT);
> sock_hold(sk);
>  }
>
> +static bool tcp_pacing_check(const struct sock *sk)
> +{
> +   return tcp_needs_internal_pacing(sk) &&
> +  hrtimer_is_queued(_sk(sk)->pacing_timer);
> +}
> +
>  static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff 
> *skb)
>  {
> skb->skb_mstamp = tp->tcp_mstamp;
> @@ -2117,6 +2136,9 @@ static int tcp_mtu_probe(struct sock *sk)
> if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
> return -1;
>
> +   if (tcp_pacing_check(sk))
> +   return -1;
> +
> /* We're allowed to probe.  Build it now. */
> nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
> if (!nskb)
> @@ -2190,11 +2212,6 @@ static int tcp_mtu_probe(struct sock *sk)
> return -1;
>  }
>
> -static bool tcp_pacing_check(const struct sock *sk)
> -{
> -   return tcp_needs_internal_pacing(sk) &&
> -  hrtimer_is_queued(_sk(sk)->pacing_timer);
> -}
>
>  /* TCP Small Queues :
>   * Control number of packets in qdisc/devices to two packets / or ~1 ms.

Thanks for your fix, Eric. This fix looks good to me! I agree that
this fix is good enough for older kernels.

thanks,
neal


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-03 Thread Neal Cardwell
On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet  wrote:
>
> On Tue, Jun 2, 2020 at 10:05 PM Jason Xing  wrote:
> >
> > Hi Eric,
> >
> > I'm still trying to understand what you're saying before. Would this
> > be better as following:
> > 1) discard the tcp_internal_pacing() function.
> > 2) remove where the tcp_internal_pacing() is called in the
> > __tcp_transmit_skb() function.
> >
> > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > should we introduce the tcp_wstamp_ns socket field as commit
> > (864e5c090749) does?
> >
>
> Please do not top-post on netdev mailing list.
>
>
> I basically suggested double-checking which point in TCP could end up
> calling tcp_internal_pacing()
> while the timer was already armed.
>
> I guess this is mtu probing.

Perhaps this could also happen from some of the retransmission code
paths that don't use tcp_xmit_retransmit_queue()? Perhaps
tcp_retransmit_timer() (RTO) and  tcp_send_loss_probe() TLP? It seems
they could indirectly cause a call to __tcp_transmit_skb() and thus
tcp_internal_pacing() without first checking if the pacing timer was
already armed?

neal


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Neal Cardwell
On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
 wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run
with a lower or higher cwnd than CUBIC, Reno, or other loss-based
congestion control algorithms. Whether BBR's cwnd will be lower or
higher depends on the BDP of the path, the amount of buffering in the
bottleneck, and the number of flows. BBR tries to match the amount of
in-flight data to the BDP based on the available bandwidth and the
two-way propagation delay. This will usually produce an amount of data
in flight that is smaller than CUBIC/Reno (yielding lower latency) if
the path has deep buffers (bufferbloat), but can be larger than
CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
the traffic is suffering burst losses.

>
>>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>>> transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval   Transfer Bitrate Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec0204 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec0204 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec0204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval   Transfer Bitrate Retr  Cwnd
> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec0512 KBytes
> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec0512 KBytes
> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec0512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

Can you please clarify if this is over bare metal or between VM
guests? It sounds like Oleksandr's initial report was between KVM VMs,
so the virtualization may be an ingredient here.

thanks,
neal


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Neal Cardwell
On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
 wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run
with a lower or higher cwnd than CUBIC, Reno, or other loss-based
congestion control algorithms. Whether BBR's cwnd will be lower or
higher depends on the BDP of the path, the amount of buffering in the
bottleneck, and the number of flows. BBR tries to match the amount of
in-flight data to the BDP based on the available bandwidth and the
two-way propagation delay. This will usually produce an amount of data
in flight that is smaller than CUBIC/Reno (yielding lower latency) if
the path has deep buffers (bufferbloat), but can be larger than
CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
the traffic is suffering burst losses.

>
>>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>>> transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval   Transfer Bitrate Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec0204 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec0204 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec0204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval   Transfer Bitrate Retr  Cwnd
> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec0512 KBytes
> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec0512 KBytes
> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec0512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

Can you please clarify if this is over bare metal or between VM
guests? It sounds like Oleksandr's initial report was between KVM VMs,
so the virtualization may be an ingredient here.

thanks,
neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 3:08 PM, Greg KH <gre...@linuxfoundation.org> wrote:
> On Tue, Jan 02, 2018 at 02:11:25PM -0500, Neal Cardwell wrote:
...
>> Thanks, Greg and David. Looks like these 2 patches will cherry-pick
>> cleanly if cherry-picked in the following sequence, on top of
>> 4.9.74-rc1, which already has 6c9e73ef9aa7 ("tcp_bbr: record "full bw
>> reached" decision in new full_bw_reached bit"):
>>
>> $ git checkout linux-stable-rc/linux-4.9.y
>>
>> $ git cherry-pick 2f6c498e4f15
>> Performing inexact rename detection: 100% (17803152/17803152), done.
>> [detached HEAD 0982234c57e1] tcp_bbr: reset full pipe detection on
>> loss recovery undo
>>  Date: Thu Dec 7 12:43:31 2017 -0500
>>  1 file changed, 4 insertions(+)
>>
>> $ git cherry-pick 600647d467c6
>> Performing inexact rename detection: 100% (17803152/17803152), done.
>> [detached HEAD 7e866eccd083] tcp_bbr: reset long-term bandwidth
>> sampling on loss recovery undo
>>  Date: Thu Dec 7 12:43:32 2017 -0500
>>  1 file changed, 1 insertion(+)
>>
>> $ git log --oneline --decorate | head -3
>> 7e866eccd083 (HEAD) tcp_bbr: reset long-term bandwidth sampling on
>> loss recovery undo
>> 0982234c57e1 tcp_bbr: reset full pipe detection on loss recovery undo
>> 79070be7f1ae (linux-stable-rc/linux-4.9.y) Linux 4.9.74-rc1
>>
>> I verified that this compiles without warnings, and boots, and BBR works.
>>
>> Shall I prepare another version of these 2 patches, or do we think
>> this recipe will be sufficient? (Sorry I am not more familiar with the
>> backport-to-stable process.)
>
> That works, those two patches are now queued up for the next stable
> release, thanks!
>
> greg k-h

Great. Thank you, Greg and David!

neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 3:08 PM, Greg KH  wrote:
> On Tue, Jan 02, 2018 at 02:11:25PM -0500, Neal Cardwell wrote:
...
>> Thanks, Greg and David. Looks like these 2 patches will cherry-pick
>> cleanly if cherry-picked in the following sequence, on top of
>> 4.9.74-rc1, which already has 6c9e73ef9aa7 ("tcp_bbr: record "full bw
>> reached" decision in new full_bw_reached bit"):
>>
>> $ git checkout linux-stable-rc/linux-4.9.y
>>
>> $ git cherry-pick 2f6c498e4f15
>> Performing inexact rename detection: 100% (17803152/17803152), done.
>> [detached HEAD 0982234c57e1] tcp_bbr: reset full pipe detection on
>> loss recovery undo
>>  Date: Thu Dec 7 12:43:31 2017 -0500
>>  1 file changed, 4 insertions(+)
>>
>> $ git cherry-pick 600647d467c6
>> Performing inexact rename detection: 100% (17803152/17803152), done.
>> [detached HEAD 7e866eccd083] tcp_bbr: reset long-term bandwidth
>> sampling on loss recovery undo
>>  Date: Thu Dec 7 12:43:32 2017 -0500
>>  1 file changed, 1 insertion(+)
>>
>> $ git log --oneline --decorate | head -3
>> 7e866eccd083 (HEAD) tcp_bbr: reset long-term bandwidth sampling on
>> loss recovery undo
>> 0982234c57e1 tcp_bbr: reset full pipe detection on loss recovery undo
>> 79070be7f1ae (linux-stable-rc/linux-4.9.y) Linux 4.9.74-rc1
>>
>> I verified that this compiles without warnings, and boots, and BBR works.
>>
>> Shall I prepare another version of these 2 patches, or do we think
>> this recipe will be sufficient? (Sorry I am not more familiar with the
>> backport-to-stable process.)
>
> That works, those two patches are now queued up for the next stable
> release, thanks!
>
> greg k-h

Great. Thank you, Greg and David!

neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 1:32 PM, David Miller <da...@davemloft.net> wrote:
> From: Neal Cardwell <ncardw...@google.com>
> Date: Tue, 2 Jan 2018 11:57:59 -0500
>
>> On Mon, Jan 1, 2018 at 9:31 AM, Greg Kroah-Hartman
>> <gre...@linuxfoundation.org> wrote:
>>> This is the start of the stable review cycle for the 4.9.74 release.
>>> There are 75 patches in this series, all will be posted as a response
>>> to this one.  If anyone has any issues with these being applied, please
>>> let me know.
>>>
>>> Responses should be made by Wed Jan  3 14:00:03 UTC 2018.
>>> Anything received after that time might be too late.
>>>
>>> The whole patch series can be found in one patch at:
>>> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.74-rc1.gz
>>> or in the git tree and branch at:
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
>>> linux-4.9.y
>>> and the diffstat can be found below.
>>
>> Hi Greg,
>>
>> In looking at the 4.9 and 4.14 patches yesterday, I noticed there were
>> two TCP BBR fixes that made it into 4.14 but not 4.9. Doing an
>> inventory of the TCP BBR fixes, AFAICT we have:
>>
>> c589e69b508d tcp_bbr: record "full bw reached" decision in new
>> full_bw_reached bit
>>  - in 4.9 and 4.14 (great)
>>
>> 2f6c498e4f15 tcp_bbr: reset full pipe detection on loss recovery undo
>>   - in 4.14 (but not 4.9)
>>
>> 600647d467c6 tcp_bbr: reset long-term bandwidth sampling on loss recovery 
>> undo
>>   - in 4.14 (but not 4.9)
>>
>> Lacking the second and third patches in 4.9 will not cause any new
>> problems, but it will miss out on some nice fixes. If it's possible to
>> get  2f6c498e4f15 and 600647d467c6 either into 4.9.74 or 4.9.75, I
>> would be very grateful.
>
> These were not straight-forward to backport and I felt the risk outweighed
> the gains.
>
> If you want to do the backport yourself and you feel confident in it,
> feel free.

Thanks, Greg and David. Looks like these 2 patches will cherry-pick
cleanly if cherry-picked in the following sequence, on top of
4.9.74-rc1, which already has 6c9e73ef9aa7 ("tcp_bbr: record "full bw
reached" decision in new full_bw_reached bit"):

$ git checkout linux-stable-rc/linux-4.9.y

$ git cherry-pick 2f6c498e4f15
Performing inexact rename detection: 100% (17803152/17803152), done.
[detached HEAD 0982234c57e1] tcp_bbr: reset full pipe detection on
loss recovery undo
 Date: Thu Dec 7 12:43:31 2017 -0500
 1 file changed, 4 insertions(+)

$ git cherry-pick 600647d467c6
Performing inexact rename detection: 100% (17803152/17803152), done.
[detached HEAD 7e866eccd083] tcp_bbr: reset long-term bandwidth
sampling on loss recovery undo
 Date: Thu Dec 7 12:43:32 2017 -0500
 1 file changed, 1 insertion(+)

$ git log --oneline --decorate | head -3
7e866eccd083 (HEAD) tcp_bbr: reset long-term bandwidth sampling on
loss recovery undo
0982234c57e1 tcp_bbr: reset full pipe detection on loss recovery undo
79070be7f1ae (linux-stable-rc/linux-4.9.y) Linux 4.9.74-rc1

I verified that this compiles without warnings, and boots, and BBR works.

Shall I prepare another version of these 2 patches, or do we think
this recipe will be sufficient? (Sorry I am not more familiar with the
backport-to-stable process.)

Thanks!
neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 1:32 PM, David Miller  wrote:
> From: Neal Cardwell 
> Date: Tue, 2 Jan 2018 11:57:59 -0500
>
>> On Mon, Jan 1, 2018 at 9:31 AM, Greg Kroah-Hartman
>>  wrote:
>>> This is the start of the stable review cycle for the 4.9.74 release.
>>> There are 75 patches in this series, all will be posted as a response
>>> to this one.  If anyone has any issues with these being applied, please
>>> let me know.
>>>
>>> Responses should be made by Wed Jan  3 14:00:03 UTC 2018.
>>> Anything received after that time might be too late.
>>>
>>> The whole patch series can be found in one patch at:
>>> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.74-rc1.gz
>>> or in the git tree and branch at:
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
>>> linux-4.9.y
>>> and the diffstat can be found below.
>>
>> Hi Greg,
>>
>> In looking at the 4.9 and 4.14 patches yesterday, I noticed there were
>> two TCP BBR fixes that made it into 4.14 but not 4.9. Doing an
>> inventory of the TCP BBR fixes, AFAICT we have:
>>
>> c589e69b508d tcp_bbr: record "full bw reached" decision in new
>> full_bw_reached bit
>>  - in 4.9 and 4.14 (great)
>>
>> 2f6c498e4f15 tcp_bbr: reset full pipe detection on loss recovery undo
>>   - in 4.14 (but not 4.9)
>>
>> 600647d467c6 tcp_bbr: reset long-term bandwidth sampling on loss recovery 
>> undo
>>   - in 4.14 (but not 4.9)
>>
>> Lacking the second and third patches in 4.9 will not cause any new
>> problems, but it will miss out on some nice fixes. If it's possible to
>> get  2f6c498e4f15 and 600647d467c6 either into 4.9.74 or 4.9.75, I
>> would be very grateful.
>
> These were not straight-forward to backport and I felt the risk outweighed
> the gains.
>
> If you want to do the backport yourself and you feel confident in it,
> feel free.

Thanks, Greg and David. Looks like these 2 patches will cherry-pick
cleanly if cherry-picked in the following sequence, on top of
4.9.74-rc1, which already has 6c9e73ef9aa7 ("tcp_bbr: record "full bw
reached" decision in new full_bw_reached bit"):

$ git checkout linux-stable-rc/linux-4.9.y

$ git cherry-pick 2f6c498e4f15
Performing inexact rename detection: 100% (17803152/17803152), done.
[detached HEAD 0982234c57e1] tcp_bbr: reset full pipe detection on
loss recovery undo
 Date: Thu Dec 7 12:43:31 2017 -0500
 1 file changed, 4 insertions(+)

$ git cherry-pick 600647d467c6
Performing inexact rename detection: 100% (17803152/17803152), done.
[detached HEAD 7e866eccd083] tcp_bbr: reset long-term bandwidth
sampling on loss recovery undo
 Date: Thu Dec 7 12:43:32 2017 -0500
 1 file changed, 1 insertion(+)

$ git log --oneline --decorate | head -3
7e866eccd083 (HEAD) tcp_bbr: reset long-term bandwidth sampling on
loss recovery undo
0982234c57e1 tcp_bbr: reset full pipe detection on loss recovery undo
79070be7f1ae (linux-stable-rc/linux-4.9.y) Linux 4.9.74-rc1

I verified that this compiles without warnings, and boots, and BBR works.

Shall I prepare another version of these 2 patches, or do we think
this recipe will be sufficient? (Sorry I am not more familiar with the
backport-to-stable process.)

Thanks!
neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Mon, Jan 1, 2018 at 9:31 AM, Greg Kroah-Hartman
 wrote:
> This is the start of the stable review cycle for the 4.9.74 release.
> There are 75 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Wed Jan  3 14:00:03 UTC 2018.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.74-rc1.gz
> or in the git tree and branch at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.9.y
> and the diffstat can be found below.

Hi Greg,

In looking at the 4.9 and 4.14 patches yesterday, I noticed there were
two TCP BBR fixes that made it into 4.14 but not 4.9. Doing an
inventory of the TCP BBR fixes, AFAICT we have:

c589e69b508d tcp_bbr: record "full bw reached" decision in new
full_bw_reached bit
 - in 4.9 and 4.14 (great)

2f6c498e4f15 tcp_bbr: reset full pipe detection on loss recovery undo
  - in 4.14 (but not 4.9)

600647d467c6 tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
  - in 4.14 (but not 4.9)

Lacking the second and third patches in 4.9 will not cause any new
problems, but it will miss out on some nice fixes. If it's possible to
get  2f6c498e4f15 and 600647d467c6 either into 4.9.74 or 4.9.75, I
would be very grateful.

Thanks!
neal


Re: [PATCH 4.9 00/75] 4.9.74-stable review

2018-01-02 Thread Neal Cardwell
On Mon, Jan 1, 2018 at 9:31 AM, Greg Kroah-Hartman
 wrote:
> This is the start of the stable review cycle for the 4.9.74 release.
> There are 75 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Wed Jan  3 14:00:03 UTC 2018.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.74-rc1.gz
> or in the git tree and branch at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.9.y
> and the diffstat can be found below.

Hi Greg,

In looking at the 4.9 and 4.14 patches yesterday, I noticed there were
two TCP BBR fixes that made it into 4.14 but not 4.9. Doing an
inventory of the TCP BBR fixes, AFAICT we have:

c589e69b508d tcp_bbr: record "full bw reached" decision in new
full_bw_reached bit
 - in 4.9 and 4.14 (great)

2f6c498e4f15 tcp_bbr: reset full pipe detection on loss recovery undo
  - in 4.14 (but not 4.9)

600647d467c6 tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
  - in 4.14 (but not 4.9)

Lacking the second and third patches in 4.9 will not cause any new
problems, but it will miss out on some nice fixes. If it's possible to
get  2f6c498e4f15 and 600647d467c6 either into 4.9.74 or 4.9.75, I
would be very grateful.

Thanks!
neal


Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile

2017-08-15 Thread Neal Cardwell
On Tue, Aug 15, 2017 at 9:08 AM, mohamedalrshah
 wrote:
> This commit implements a new TCP congestion control algorithm, namely 
> Agile-SD.

Also, please use a summary line for your patch that is more in keeping
with Linux style, using spaces rather than dashes, and leading with
tcp: or tcp_agile_sd:. For example:
  tcp: add Agile SD TCP congestion control module

And please read:
  https://www.kernel.org/doc/html/latest/process/submitting-patches.html

Thanks,
neal


Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile

2017-08-15 Thread Neal Cardwell
On Tue, Aug 15, 2017 at 9:08 AM, mohamedalrshah
 wrote:
> This commit implements a new TCP congestion control algorithm, namely 
> Agile-SD.

Also, please use a summary line for your patch that is more in keeping
with Linux style, using spaces rather than dashes, and leading with
tcp: or tcp_agile_sd:. For example:
  tcp: add Agile SD TCP congestion control module

And please read:
  https://www.kernel.org/doc/html/latest/process/submitting-patches.html

Thanks,
neal


Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile

2017-08-15 Thread Neal Cardwell
On Tue, Aug 15, 2017 at 9:08 AM, mohamedalrshah
 wrote:
> +
> +/* Agile-SD Parameters */
> +struct agilesdtcp {
> +   u32 loss_cwnd; /* congestion window at last loss.*/

Please rebase your change on top of the latest net-next changes and
update this module to use the latest approach from the recent commit:

  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=f1722a1be19dc38e0a4b282d4e6e6ec5e1b11a67
  tcp: consolidate congestion control undo functions

Specifically:

- reference tp->prior_cwnd instead of ca->loss_cwnd
- remove the  ca->loss_cwnd field
- have the .undo_cwnd field reference tcp_reno_undo_cwnd

> +   u32 frac_tracer;   /* This is to trace the fractions of 
> the increment.*/
> +   u32 degraded_loss_cwnd;/* loss_cwnd after degradation.*/
> +   enumdystate{SS=0, CA=1} agilesd_tcp_status;

Per Linux style, please define the enum separately before declaring
the variable of that type, and format the enum using Linux style. Also
please use a longer, more specific name, to avoid name collisions. I'd
suggest:

enum dystate {
AGILE_SD_SS = 0,  /* comment ... */
AGILE_SD_CA = 1,  /* comment ... */
};


> +};
> +
> +/* To reset the parameters if needed*/
> +static inline void agilesdtcp_reset(struct sock *sk)
> +{
> +
> +}
> +
> +/* This function is called after the first acknowledgment is received and 
> before the congestion
> + * control algorithm will be called for the first time. If the congestion 
> control algorithm has
> + * private data, it should initialize its private date here. */

Multi-line comments should end with the trailing */ on a line by
itself. Here and elsewhere.

Please read:
  https://www.kernel.org/doc/html/v4.10/process/coding-style.html

Please check the style of patches before submitting with the following
script in the Linux kernel tree:
  scripts/checkpatch.pl

> +static void agilesdtcp_init(struct sock *sk)
> +{
> +   struct agilesdtcp *ca = inet_csk_ca(sk);
> +
> +   /* The value of initial_ssthresh parameter is not used here, thus, 
> snd_ssthresh is initialized by a large value.*/
> +   tcp_sk(sk)->snd_ssthresh = 0x7fff;
> +
> +   ca->loss_cwnd   = 0;
> +   ca->frac_tracer = 0;
> +   ca->agilesd_tcp_status  = SS;
> +}
> +
> +/* This function is called whenever an ack is received and the congestion 
> window can be increased.
> + * This is equivalent to opencwnd in tcp.cc.
> + * ack is the number of bytes that are acknowledged in the latest 
> acknowledgment;
> + * rtt is the the rtt measured by the latest acknowledgment;
> + * in_flight is the packet in flight before the latest acknowledgment;
> + * good_ack is an indicator whether the current situation is normal (no 
> duplicate ack, no loss and no SACK). */
> +static void agilesdtcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
> +{
> +   struct tcp_sock *tp = tcp_sk(sk);
> +   struct agilesdtcp *ca = inet_csk_ca(sk);
> +   u32 inc_factor;
> +   u32 ca_inc;
> +   u32 current_gap, total_gap;

For coding style, please order local variable declarations from
longest to shortest line, also know as Reverse Christmas Tree Format.

> +   /* The value of inc_factor is limited by lower_fl and upper_fl.
> +* The lower_fl must be always = 1. The greater the upper_fl the 
> higher the aggressiveness.
> +* But, if upper_fl set to 1, Agile-SD will work exactly as newreno.
> +* We have already designed an equation to calculate the optimum 
> upper_fl based on the given beta.
> +* This equation will be revealed once its article is published*/
> +   u32 lower_fl = 1 * SCALE;
> +   u32 upper_fl = 3 * SCALE;
> +
> +   if (!tcp_is_cwnd_limited(sk)) return;

Please put this return (or any if/else body) on a line by itself.

> +
> +   if (tp->snd_cwnd < tp->snd_ssthresh){

Need a space between ) and {.

> +   ca->agilesd_tcp_status = SS;
> +   tcp_slow_start(tp, in_flight);
> +   }
> +   else {

The else needs to go on the same line as the closing brace.


> +   inc_factor = min(max(((upper_fl * current_gap) / total_gap), 
> lower_fl), upper_fl);

Please use the existing kernel helper macro for this:

#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)


> +
> +   ca_inc = ((inc_factor * SCALE) / tp->snd_cwnd);   /* SCALE is 
> used to avoid fractions*/
> +
> +   ca->frac_tracer += ca_inc;/* This in 
> order to take the fraction increase into account */
> +   if (ca->frac_tracer >= Double_SCALE)  /* To take 
> factor scale into account */
> +   {

The opening brace goes on the previous line.

> +/* This function is called when the TCP flow detects a loss.
> + * It returns the slow start threshold of a flow, after a packet loss is 
> detected. */

Trailing */ 

Re: [PATCH] Adding-Agile-SD-TCP-module-and-modifying-Kconfig-and-makefile

2017-08-15 Thread Neal Cardwell
On Tue, Aug 15, 2017 at 9:08 AM, mohamedalrshah
 wrote:
> +
> +/* Agile-SD Parameters */
> +struct agilesdtcp {
> +   u32 loss_cwnd; /* congestion window at last loss.*/

Please rebase your change on top of the latest net-next changes and
update this module to use the latest approach from the recent commit:

  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=f1722a1be19dc38e0a4b282d4e6e6ec5e1b11a67
  tcp: consolidate congestion control undo functions

Specifically:

- reference tp->prior_cwnd instead of ca->loss_cwnd
- remove the  ca->loss_cwnd field
- have the .undo_cwnd field reference tcp_reno_undo_cwnd

> +   u32 frac_tracer;   /* This is to trace the fractions of 
> the increment.*/
> +   u32 degraded_loss_cwnd;/* loss_cwnd after degradation.*/
> +   enumdystate{SS=0, CA=1} agilesd_tcp_status;

Per Linux style, please define the enum separately before declaring
the variable of that type, and format the enum using Linux style. Also
please use a longer, more specific name, to avoid name collisions. I'd
suggest:

enum dystate {
AGILE_SD_SS = 0,  /* comment ... */
AGILE_SD_CA = 1,  /* comment ... */
};


> +};
> +
> +/* To reset the parameters if needed*/
> +static inline void agilesdtcp_reset(struct sock *sk)
> +{
> +
> +}
> +
> +/* This function is called after the first acknowledgment is received and 
> before the congestion
> + * control algorithm will be called for the first time. If the congestion 
> control algorithm has
> + * private data, it should initialize its private date here. */

Multi-line comments should end with the trailing */ on a line by
itself. Here and elsewhere.

Please read:
  https://www.kernel.org/doc/html/v4.10/process/coding-style.html

Please check the style of patches before submitting with the following
script in the Linux kernel tree:
  scripts/checkpatch.pl

> +static void agilesdtcp_init(struct sock *sk)
> +{
> +   struct agilesdtcp *ca = inet_csk_ca(sk);
> +
> +   /* The value of initial_ssthresh parameter is not used here, thus, 
> snd_ssthresh is initialized by a large value.*/
> +   tcp_sk(sk)->snd_ssthresh = 0x7fff;
> +
> +   ca->loss_cwnd   = 0;
> +   ca->frac_tracer = 0;
> +   ca->agilesd_tcp_status  = SS;
> +}
> +
> +/* This function is called whenever an ack is received and the congestion 
> window can be increased.
> + * This is equivalent to opencwnd in tcp.cc.
> + * ack is the number of bytes that are acknowledged in the latest 
> acknowledgment;
> + * rtt is the the rtt measured by the latest acknowledgment;
> + * in_flight is the packet in flight before the latest acknowledgment;
> + * good_ack is an indicator whether the current situation is normal (no 
> duplicate ack, no loss and no SACK). */
> +static void agilesdtcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
> +{
> +   struct tcp_sock *tp = tcp_sk(sk);
> +   struct agilesdtcp *ca = inet_csk_ca(sk);
> +   u32 inc_factor;
> +   u32 ca_inc;
> +   u32 current_gap, total_gap;

For coding style, please order local variable declarations from
longest to shortest line, also know as Reverse Christmas Tree Format.

> +   /* The value of inc_factor is limited by lower_fl and upper_fl.
> +* The lower_fl must be always = 1. The greater the upper_fl the 
> higher the aggressiveness.
> +* But, if upper_fl set to 1, Agile-SD will work exactly as newreno.
> +* We have already designed an equation to calculate the optimum 
> upper_fl based on the given beta.
> +* This equation will be revealed once its article is published*/
> +   u32 lower_fl = 1 * SCALE;
> +   u32 upper_fl = 3 * SCALE;
> +
> +   if (!tcp_is_cwnd_limited(sk)) return;

Please put this return (or any if/else body) on a line by itself.

> +
> +   if (tp->snd_cwnd < tp->snd_ssthresh){

Need a space between ) and {.

> +   ca->agilesd_tcp_status = SS;
> +   tcp_slow_start(tp, in_flight);
> +   }
> +   else {

The else needs to go on the same line as the closing brace.


> +   inc_factor = min(max(((upper_fl * current_gap) / total_gap), 
> lower_fl), upper_fl);

Please use the existing kernel helper macro for this:

#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)


> +
> +   ca_inc = ((inc_factor * SCALE) / tp->snd_cwnd);   /* SCALE is 
> used to avoid fractions*/
> +
> +   ca->frac_tracer += ca_inc;/* This in 
> order to take the fraction increase into account */
> +   if (ca->frac_tracer >= Double_SCALE)  /* To take 
> factor scale into account */
> +   {

The opening brace goes on the previous line.

> +/* This function is called when the TCP flow detects a loss.
> + * It returns the slow start threshold of a flow, after a packet loss is 
> detected. */

Trailing */ style issue again.

> 

Re: linux-next: manual merge of the net-next tree with the net tree

2017-08-06 Thread Neal Cardwell
On Sun, Aug 6, 2017 at 10:01 PM, Stephen Rothwell  wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   net/ipv4/tcp_output.c
>
> between commit:
>
>   a2815817ffa6 ("tcp: enable xmit timer fix by having TLP use time when RTO 
> should fire")
>
> from the net tree and commit:
>
>   bb4d991a28cc ("tcp: adjust tail loss probe timeout")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Sorry about that. Will try to follow that procedure in the future.

thanks,
neal


Re: linux-next: manual merge of the net-next tree with the net tree

2017-08-06 Thread Neal Cardwell
On Sun, Aug 6, 2017 at 10:01 PM, Stephen Rothwell  wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   net/ipv4/tcp_output.c
>
> between commit:
>
>   a2815817ffa6 ("tcp: enable xmit timer fix by having TLP use time when RTO 
> should fire")
>
> from the net tree and commit:
>
>   bb4d991a28cc ("tcp: adjust tail loss probe timeout")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Sorry about that. Will try to follow that procedure in the future.

thanks,
neal


Re: [PATCH] ipv6: initialize treq->txhash in cookie_v6_check()

2017-07-14 Thread Neal Cardwell
On Fri, Jul 14, 2017 at 1:35 PM, Alexander Potapenko <gli...@google.com> wrote:
> On Fri, Jul 14, 2017 at 7:04 PM, Neal Cardwell <ncardw...@google.com> wrote:
>> On Fri, Jul 14, 2017 at 12:54 PM, Alexander Potapenko <gli...@google.com> 
>> wrote:
>>> KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
>>> which originated from the TCP request socket created in
>>> cookie_v6_check():
>> ...
>>> --- a/net/ipv6/syncookies.c
>>> +++ b/net/ipv6/syncookies.c
>>> @@ -216,6 +216,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct 
>>> sk_buff *skb)
>>> treq->rcv_isn = ntohl(th->seq) - 1;
>>> treq->snt_isn = cookie;
>>> treq->ts_off = 0;
>>> +   treq->txhash = 0;
>>>
>>> /*
>>>  * We need to lookup the dst_entry to get the correct window size.
>>
>> I would have thought that the same fix is needed in the corresponding
>> line in cookie_v4_check() in net/ipv4/syncookies.c? (I do not see
>> txhash being initialized for the IPv4 side.) If it's not needed for
>> some reason, then it would be worth a comment in the commit
>> description to explain why not.
> Most certainly it is needed. I haven't seen reports for that in the
> wild and couldn't forge a repro triggering the bug in IPv4, but I'll
> give it another shot.

If you force syncookies to be used, with:

  sysctl -q net.ipv4.tcp_syncookies=2

then every passive IPv4 TCP connection that is accepted by a TCP
server app should be using that uninitialized memory, I would think
(and thus should be liable to fire the KMSAN use of uninitialized
memory warning).

Does that work?

neal


Re: [PATCH] ipv6: initialize treq->txhash in cookie_v6_check()

2017-07-14 Thread Neal Cardwell
On Fri, Jul 14, 2017 at 1:35 PM, Alexander Potapenko  wrote:
> On Fri, Jul 14, 2017 at 7:04 PM, Neal Cardwell  wrote:
>> On Fri, Jul 14, 2017 at 12:54 PM, Alexander Potapenko  
>> wrote:
>>> KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
>>> which originated from the TCP request socket created in
>>> cookie_v6_check():
>> ...
>>> --- a/net/ipv6/syncookies.c
>>> +++ b/net/ipv6/syncookies.c
>>> @@ -216,6 +216,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct 
>>> sk_buff *skb)
>>> treq->rcv_isn = ntohl(th->seq) - 1;
>>> treq->snt_isn = cookie;
>>> treq->ts_off = 0;
>>> +   treq->txhash = 0;
>>>
>>> /*
>>>  * We need to lookup the dst_entry to get the correct window size.
>>
>> I would have thought that the same fix is needed in the corresponding
>> line in cookie_v4_check() in net/ipv4/syncookies.c? (I do not see
>> txhash being initialized for the IPv4 side.) If it's not needed for
>> some reason, then it would be worth a comment in the commit
>> description to explain why not.
> Most certainly it is needed. I haven't seen reports for that in the
> wild and couldn't forge a repro triggering the bug in IPv4, but I'll
> give it another shot.

If you force syncookies to be used, with:

  sysctl -q net.ipv4.tcp_syncookies=2

then every passive IPv4 TCP connection that is accepted by a TCP
server app should be using that uninitialized memory, I would think
(and thus should be liable to fire the KMSAN use of uninitialized
memory warning).

Does that work?

neal


Re: [PATCH] ipv6: initialize treq->txhash in cookie_v6_check()

2017-07-14 Thread Neal Cardwell
On Fri, Jul 14, 2017 at 12:54 PM, Alexander Potapenko  wrote:
> KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
> which originated from the TCP request socket created in
> cookie_v6_check():
...
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -216,6 +216,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct 
> sk_buff *skb)
> treq->rcv_isn = ntohl(th->seq) - 1;
> treq->snt_isn = cookie;
> treq->ts_off = 0;
> +   treq->txhash = 0;
>
> /*
>  * We need to lookup the dst_entry to get the correct window size.

I would have thought that the same fix is needed in the corresponding
line in cookie_v4_check() in net/ipv4/syncookies.c? (I do not see
txhash being initialized for the IPv4 side.) If it's not needed for
some reason, then it would be worth a comment in the commit
description to explain why not.

thanks,
neal


Re: [PATCH] ipv6: initialize treq->txhash in cookie_v6_check()

2017-07-14 Thread Neal Cardwell
On Fri, Jul 14, 2017 at 12:54 PM, Alexander Potapenko  wrote:
> KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
> which originated from the TCP request socket created in
> cookie_v6_check():
...
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -216,6 +216,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct 
> sk_buff *skb)
> treq->rcv_isn = ntohl(th->seq) - 1;
> treq->snt_isn = cookie;
> treq->ts_off = 0;
> +   treq->txhash = 0;
>
> /*
>  * We need to lookup the dst_entry to get the correct window size.

I would have thought that the same fix is needed in the corresponding
line in cookie_v4_check() in net/ipv4/syncookies.c? (I do not see
txhash being initialized for the IPv4 side.) If it's not needed for
some reason, then it would be worth a comment in the commit
description to explain why not.

thanks,
neal


Re: Get amount of fast retransmissions from TCP info

2017-05-03 Thread Neal Cardwell
On Wed, May 3, 2017 at 3:47 PM, Lars Erik Storbukås
 wrote:
> I also want to count the amount of ECN signals received. Do anyone
> have any input on where to place an ECN signal count?
>
> Is any of these locations a logical place to increase the ECN counter
> (which I've created in tcp_sock)? Both locations are in the
> tcp_input.c.
>
> /* In tcp_fastretrans_alert() */
> if (flag & FLAG_ECE) {
> tp->prior_ssthresh = 0;
> tp->ecn_count += 1; // ECN counter
> }

This approach sounds good to me.

> or
>
> /* In tcp_enter_recovery() */
> if (!tcp_in_cwnd_reduction(sk)) {
> if (!ece_ack)
> tp->prior_ssthresh = tcp_current_ssthresh(sk);
> else
>tp->ecn_count += 1; // ECN counter
> tcp_init_cwnd_reduction(sk);
> }
> tcp_set_ca_state(sk, TCP_CA_Recovery);

This location would only count ECE marks we happened to get at the
moment we enter loss recovery.

neal


Re: Get amount of fast retransmissions from TCP info

2017-05-03 Thread Neal Cardwell
On Wed, May 3, 2017 at 3:47 PM, Lars Erik Storbukås
 wrote:
> I also want to count the amount of ECN signals received. Do anyone
> have any input on where to place an ECN signal count?
>
> Is any of these locations a logical place to increase the ECN counter
> (which I've created in tcp_sock)? Both locations are in the
> tcp_input.c.
>
> /* In tcp_fastretrans_alert() */
> if (flag & FLAG_ECE) {
> tp->prior_ssthresh = 0;
> tp->ecn_count += 1; // ECN counter
> }

This approach sounds good to me.

> or
>
> /* In tcp_enter_recovery() */
> if (!tcp_in_cwnd_reduction(sk)) {
> if (!ece_ack)
> tp->prior_ssthresh = tcp_current_ssthresh(sk);
> else
>tp->ecn_count += 1; // ECN counter
> tcp_init_cwnd_reduction(sk);
> }
> tcp_set_ca_state(sk, TCP_CA_Recovery);

This location would only count ECE marks we happened to get at the
moment we enter loss recovery.

neal


Re: Get amount of fast retransmissions from TCP info

2017-04-24 Thread Neal Cardwell
"

On Mon, Apr 24, 2017 at 4:20 PM, Lars Erik Storbukås
<storbukas@gmail.com> wrote:
> 2017-04-24 21:42 GMT+02:00 Neal Cardwell <ncardw...@google.com>:
>> On Mon, Apr 24, 2017 at 3:11 PM, Lars Erik Storbukås
>> <storbukas@gmail.com> wrote:
>>> I'm trying to get amount of congestion events in TCP caused by
>>> DUPACK's (fast retransmissions), and can't seem to find any variable
>>> in the TCP info struct which hold that value. There are three
>>> variables in the TCP info struct that seem to hold similar congestion
>>> values: __u8 tcpi_retransmits;__u32 tcpi_retrans; __u32
>>> tcpi_total_retrans;
>>>
>>> Does anyone have any pointers on how to find this value in the TCP code?
>>>
>>> Please CC me personally if answering this question. Any help is
>>> greatly appreciated.
>>
>> [I'm cc-ing the netdev list.]
>>
>> Do you need this per-socket? On a per-socket basis, I do not think
>> there are separate totals for fast retransmits and timeout
>> retransmits.
>>
>> If a global number is good enough, then you can get that number from
>> the global network statistics. In "nstat" output they look like:
>>
>>   TcpExtTCPFastRetrans = packets sent in fast retransmit / fast recovery
>>
>>   TcpExtTCPSlowStartRetrans = packets sent in timeout recovery
>>
>> It sounds like TcpExtTCPFastRetrans is what you are after.
>>
>> Hope that helps,
>> neal
>
> Thanks for your answer Neal.
>
> Yes, I need this information per-socket. What would be the most
> appropriate place to update this value?

Is this for a custom kernel you are building? Or are you proposing
this for upstream?

IMHO the best place to add this for your custom kernel would be in
_tcp_retransmit_skb() around the spot with the comment "Update global
and local TCP statistics". Something like:

  /* Update global and local TCP statistics. */
...
  tp->total_retrans += segs;
  if (icsk->icsk_ca_state == TCP_CA_Loss)
tp->slow_retrans += segs;
  else
tp->fast_retrans += segs;

> If none of the variables (mentioned above) contain any value in
> regards to fast retransmits, what does the different values represent?

tcpi_retransmits: consecutive retransmits of lowest-sequence outstanding packet

tcpi_retrans: retransmitted packets estimated to be in-flight in the network now

tcpi_total_retrans: total number of retransmitted packets over the
life of the connection

Can you sketch out why you need to have separate counts for fast
retransmits and timeout/slow-start retransmits?

neal


Re: Get amount of fast retransmissions from TCP info

2017-04-24 Thread Neal Cardwell
"

On Mon, Apr 24, 2017 at 4:20 PM, Lars Erik Storbukås
 wrote:
> 2017-04-24 21:42 GMT+02:00 Neal Cardwell :
>> On Mon, Apr 24, 2017 at 3:11 PM, Lars Erik Storbukås
>>  wrote:
>>> I'm trying to get amount of congestion events in TCP caused by
>>> DUPACK's (fast retransmissions), and can't seem to find any variable
>>> in the TCP info struct which hold that value. There are three
>>> variables in the TCP info struct that seem to hold similar congestion
>>> values: __u8 tcpi_retransmits;__u32 tcpi_retrans; __u32
>>> tcpi_total_retrans;
>>>
>>> Does anyone have any pointers on how to find this value in the TCP code?
>>>
>>> Please CC me personally if answering this question. Any help is
>>> greatly appreciated.
>>
>> [I'm cc-ing the netdev list.]
>>
>> Do you need this per-socket? On a per-socket basis, I do not think
>> there are separate totals for fast retransmits and timeout
>> retransmits.
>>
>> If a global number is good enough, then you can get that number from
>> the global network statistics. In "nstat" output they look like:
>>
>>   TcpExtTCPFastRetrans = packets sent in fast retransmit / fast recovery
>>
>>   TcpExtTCPSlowStartRetrans = packets sent in timeout recovery
>>
>> It sounds like TcpExtTCPFastRetrans is what you are after.
>>
>> Hope that helps,
>> neal
>
> Thanks for your answer Neal.
>
> Yes, I need this information per-socket. What would be the most
> appropriate place to update this value?

Is this for a custom kernel you are building? Or are you proposing
this for upstream?

IMHO the best place to add this for your custom kernel would be in
_tcp_retransmit_skb() around the spot with the comment "Update global
and local TCP statistics". Something like:

  /* Update global and local TCP statistics. */
...
  tp->total_retrans += segs;
  if (icsk->icsk_ca_state == TCP_CA_Loss)
tp->slow_retrans += segs;
  else
tp->fast_retrans += segs;

> If none of the variables (mentioned above) contain any value in
> regards to fast retransmits, what does the different values represent?

tcpi_retransmits: consecutive retransmits of lowest-sequence outstanding packet

tcpi_retrans: retransmitted packets estimated to be in-flight in the network now

tcpi_total_retrans: total number of retransmitted packets over the
life of the connection

Can you sketch out why you need to have separate counts for fast
retransmits and timeout/slow-start retransmits?

neal


Re: Get amount of fast retransmissions from TCP info

2017-04-24 Thread Neal Cardwell
On Mon, Apr 24, 2017 at 3:11 PM, Lars Erik Storbukås
 wrote:
> I'm trying to get amount of congestion events in TCP caused by
> DUPACK's (fast retransmissions), and can't seem to find any variable
> in the TCP info struct which hold that value. There are three
> variables in the TCP info struct that seem to hold similar congestion
> values: __u8 tcpi_retransmits;__u32 tcpi_retrans; __u32
> tcpi_total_retrans;
>
> Does anyone have any pointers on how to find this value in the TCP code?
>
> Please CC me personally if answering this question. Any help is
> greatly appreciated.

[I'm cc-ing the netdev list.]

Do you need this per-socket? On a per-socket basis, I do not think
there are separate totals for fast retransmits and timeout
retransmits.

If a global number is good enough, then you can get that number from
the global network statistics. In "nstat" output they look like:

  TcpExtTCPFastRetrans = packets sent in fast retransmit / fast recovery

  TcpExtTCPSlowStartRetrans = packets sent in timeout recovery

It sounds like TcpExtTCPFastRetrans is what you are after.

Hope that helps,
neal


Re: Get amount of fast retransmissions from TCP info

2017-04-24 Thread Neal Cardwell
On Mon, Apr 24, 2017 at 3:11 PM, Lars Erik Storbukås
 wrote:
> I'm trying to get amount of congestion events in TCP caused by
> DUPACK's (fast retransmissions), and can't seem to find any variable
> in the TCP info struct which hold that value. There are three
> variables in the TCP info struct that seem to hold similar congestion
> values: __u8 tcpi_retransmits;__u32 tcpi_retrans; __u32
> tcpi_total_retrans;
>
> Does anyone have any pointers on how to find this value in the TCP code?
>
> Please CC me personally if answering this question. Any help is
> greatly appreciated.

[I'm cc-ing the netdev list.]

Do you need this per-socket? On a per-socket basis, I do not think
there are separate totals for fast retransmits and timeout
retransmits.

If a global number is good enough, then you can get that number from
the global network statistics. In "nstat" output they look like:

  TcpExtTCPFastRetrans = packets sent in fast retransmit / fast recovery

  TcpExtTCPSlowStartRetrans = packets sent in timeout recovery

It sounds like TcpExtTCPFastRetrans is what you are after.

Hope that helps,
neal


Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-10-26 Thread Neal Cardwell
On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
 wrote:
>@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
...
> +   case TCP_RDB:
> +   if (val < 0 || val > 1) {
> +   err = -EINVAL;
> +   } else {
> +   tp->rdb = val;
> +   tp->nonagle = val;

The semantics of the tp->nonagle bits are already a bit complex. My
sense is that having a setsockopt of TCP_RDB transparently modify the
nagle behavior is going to add more extra complexity and unanticipated
behavior than is warranted given the slight possible gain in
convenience to the app writer. What about a model where the
application user just needs to remember to call
setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
sensible? I see your nice tests at

   
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b

are already doing that. And my sense is that likewise most
well-engineered "thin stream" apps will already be using
setsockopt(TCP_NODELAY). Is that workable?

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-10-26 Thread Neal Cardwell
On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
 wrote:
>@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
...
> +   case TCP_RDB:
> +   if (val < 0 || val > 1) {
> +   err = -EINVAL;
> +   } else {
> +   tp->rdb = val;
> +   tp->nonagle = val;

The semantics of the tp->nonagle bits are already a bit complex. My
sense is that having a setsockopt of TCP_RDB transparently modify the
nagle behavior is going to add more extra complexity and unanticipated
behavior than is warranted given the slight possible gain in
convenience to the app writer. What about a model where the
application user just needs to remember to call
setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
sensible? I see your nice tests at

   
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b

are already doing that. And my sense is that likewise most
well-engineered "thin stream" apps will already be using
setsockopt(TCP_NODELAY). Is that workable?

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] tcp: Restore RFC5961-compliant behavior for SYN packets

2014-11-21 Thread Neal Cardwell
On Thu, Nov 20, 2014 at 6:09 PM, Calvin Owens  wrote:
> Commit c3ae62af8e755 ("tcp: should drop incoming frames without ACK
> flag set") was created to mitigate a security vulnerability in which a
> local attacker is able to inject data into locally-opened sockets by
> using TCP protocol statistics in procfs to quickly find the correct
> sequence number.
>
> This broke the RFC5961 requirement to send a challenge ACK in response
> to spurious RST packets, which was subsequently fixed by commit
> 7b514a886ba50 ("tcp: accept RST without ACK flag").
>
> Unfortunately, the RFC5961 requirement that spurious SYN packets be
> handled in a similar manner remains broken.
>
> RFC5961 section 4 states that:
>
>... the handling of the SYN in the synchronized state SHOULD be
>performed as follows:
>
>1) If the SYN bit is set, irrespective of the sequence number, TCP
>   MUST send an ACK (also referred to as challenge ACK) to the remote
>   peer:
>
>   
>
>   After sending the acknowledgment, TCP MUST drop the unacceptable
>   segment and stop processing further.
>
>By sending an ACK, the remote peer is challenged to confirm the loss
>of the previous connection and the request to start a new connection.
>A legitimate peer, after restart, would not have a TCB in the
>synchronized state.  Thus, when the ACK arrives, the peer should send
>a RST segment back with the sequence number derived from the ACK
>field that caused the RST.
>
>This RST will confirm that the remote peer has indeed closed the
>previous connection.  Upon receipt of a valid RST, the local TCP
>endpoint MUST terminate its connection.  The local TCP endpoint
>should then rely on SYN retransmission from the remote end to
>re-establish the connection.
>
> This patch lets SYN packets through the discard added in c3ae62af8e755,
> so that spurious SYN packets are properly dealt with as per the RFC.
>
> The challenge ACK is sent unconditionally and is rate-limited, so the
> original vulnerability is not reintroduced by this patch.
>
> Signed-off-by: Calvin Owens 

Acked-by: Neal Cardwell 

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] tcp: Restore RFC5961-compliant behavior for SYN packets

2014-11-21 Thread Neal Cardwell
On Thu, Nov 20, 2014 at 6:09 PM, Calvin Owens calvinow...@fb.com wrote:
 Commit c3ae62af8e755 (tcp: should drop incoming frames without ACK
 flag set) was created to mitigate a security vulnerability in which a
 local attacker is able to inject data into locally-opened sockets by
 using TCP protocol statistics in procfs to quickly find the correct
 sequence number.

 This broke the RFC5961 requirement to send a challenge ACK in response
 to spurious RST packets, which was subsequently fixed by commit
 7b514a886ba50 (tcp: accept RST without ACK flag).

 Unfortunately, the RFC5961 requirement that spurious SYN packets be
 handled in a similar manner remains broken.

 RFC5961 section 4 states that:

... the handling of the SYN in the synchronized state SHOULD be
performed as follows:

1) If the SYN bit is set, irrespective of the sequence number, TCP
   MUST send an ACK (also referred to as challenge ACK) to the remote
   peer:

   SEQ=SND.NXTACK=RCV.NXTCTL=ACK

   After sending the acknowledgment, TCP MUST drop the unacceptable
   segment and stop processing further.

By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state.  Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.

This RST will confirm that the remote peer has indeed closed the
previous connection.  Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection.  The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.

 This patch lets SYN packets through the discard added in c3ae62af8e755,
 so that spurious SYN packets are properly dealt with as per the RFC.

 The challenge ACK is sent unconditionally and is rate-limited, so the
 original vulnerability is not reintroduced by this patch.

 Signed-off-by: Calvin Owens calvinow...@fb.com

Acked-by: Neal Cardwell ncardw...@google.com

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner

2014-04-18 Thread Neal Cardwell
On Wed, Apr 16, 2014 at 12:25 PM, Kenjiro Nakayama
 wrote:
> This patch adds a TCP_FASTOPEN socket option to get a max backlog on its
> listener to getsockopt().
>
> Signed-off-by: Kenjiro Nakayama 
> ---
>  net/ipv4/tcp.c | 8 
>  1 file changed, 8 insertions(+)

Acked-by: Neal Cardwell 

Seems fine to me. Thanks, Kenjiro.

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner

2014-04-18 Thread Neal Cardwell
On Wed, Apr 16, 2014 at 12:25 PM, Kenjiro Nakayama
nakayamakenj...@gmail.com wrote:
 This patch adds a TCP_FASTOPEN socket option to get a max backlog on its
 listener to getsockopt().

 Signed-off-by: Kenjiro Nakayama nakayamakenj...@gmail.com
 ---
  net/ipv4/tcp.c | 8 
  1 file changed, 8 insertions(+)

Acked-by: Neal Cardwell ncardw...@google.com

Seems fine to me. Thanks, Kenjiro.

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] ipv4: Add option to get TCP_FASTOPEN to getsockopt()

2014-04-15 Thread Neal Cardwell
On Tue, Apr 15, 2014 at 8:35 AM, Kenjiro Nakayama
 wrote:
> Hi,

Please omit the "Hi," at the top of the patch description, so the top
of the email can be used directly as a git commit description.

> TCP_FASTOPEN option can be set via setsockopt(), but the value cannot be
> gotten via getsockopt(). This patch adds the option to getsockopt().
>
> Sighned-off-by: Kenjiro Nakayama 

Typo: this should be "Signed-off-by:"

> Add option to get TCP_FASTOPEN to getsockopt()

Likewise, please omit the extra descriptive text below the
"Signed-off-by:" footer.

(And, of course, please note Eric's remarks as well...)

thanks,
neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] ipv4: Add option to get TCP_FASTOPEN to getsockopt()

2014-04-15 Thread Neal Cardwell
On Tue, Apr 15, 2014 at 8:35 AM, Kenjiro Nakayama
nakayamakenj...@gmail.com wrote:
 Hi,

Please omit the Hi, at the top of the patch description, so the top
of the email can be used directly as a git commit description.

 TCP_FASTOPEN option can be set via setsockopt(), but the value cannot be
 gotten via getsockopt(). This patch adds the option to getsockopt().

 Sighned-off-by: Kenjiro Nakayama nakayamakenj...@gmail.com

Typo: this should be Signed-off-by:

 Add option to get TCP_FASTOPEN to getsockopt()

Likewise, please omit the extra descriptive text below the
Signed-off-by: footer.

(And, of course, please note Eric's remarks as well...)

thanks,
neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipv4: Add option to get TCP_FASTOPEN to getsockopt()

2014-04-14 Thread Neal Cardwell
On Mon, Apr 14, 2014 at 1:39 PM, Kenjiro Nakayama
 wrote:
> TCP_FASTOPEN option can be set via setsockopt(), but the value cannot be
> gotten via getsockopt(). This patch adds the option to getsockopt().
>
> Sighned-off-by: Kenjiro Nakayama 
>
> Add option to get TCP_FASTOPEN to getsockopt()
> ---
>  net/ipv4/tcp.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4bd6d52..6ccdbcc 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2916,6 +2916,11 @@ static int do_tcp_getsockopt(struct sock *sk, int 
> level,
> case TCP_USER_TIMEOUT:
> val = jiffies_to_msecs(icsk->icsk_user_timeout);
> break;
> +
> +   case TCP_FASTOPEN:
> +   val = sysctl_tcp_fastopen;
> +   break;
> +
> case TCP_TIMESTAMP:
> val = tcp_time_stamp + tp->tsoffset;
> break;
> --
> 1.9.0

Hi,

Please cc the Linux network development mailing list at
net...@vger.kernel.org (cc-ed) for Linux networking patches like this.

I don't think this is the implementation we'd want for a getsockopt()
for TCP_FASTOPEN. I imagine that since the setsockopt(TCP_FASTOPEN)
enables Fast Open on a listening socket with a given backlog, if we
add a corresponding getsockopt() then it should return that Fast Open
backlog if Fast Open is enabled on the listening socket, or 0
otherwise. That is, basically it would be querying whether
fastopen_init_queue() has been run and returning fastopenq->max_qlen
if so. That way setsockopt(TCP_FASTOPEN, getsockopt(TCP_FASTOPEN))
will be a NOP in most cases, as one would expect.

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipv4: Add option to get TCP_FASTOPEN to getsockopt()

2014-04-14 Thread Neal Cardwell
On Mon, Apr 14, 2014 at 1:39 PM, Kenjiro Nakayama
nakayamakenj...@gmail.com wrote:
 TCP_FASTOPEN option can be set via setsockopt(), but the value cannot be
 gotten via getsockopt(). This patch adds the option to getsockopt().

 Sighned-off-by: Kenjiro Nakayama nakayamakenj...@gmail.com

 Add option to get TCP_FASTOPEN to getsockopt()
 ---
  net/ipv4/tcp.c | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index 4bd6d52..6ccdbcc 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -2916,6 +2916,11 @@ static int do_tcp_getsockopt(struct sock *sk, int 
 level,
 case TCP_USER_TIMEOUT:
 val = jiffies_to_msecs(icsk-icsk_user_timeout);
 break;
 +
 +   case TCP_FASTOPEN:
 +   val = sysctl_tcp_fastopen;
 +   break;
 +
 case TCP_TIMESTAMP:
 val = tcp_time_stamp + tp-tsoffset;
 break;
 --
 1.9.0

Hi,

Please cc the Linux network development mailing list at
net...@vger.kernel.org (cc-ed) for Linux networking patches like this.

I don't think this is the implementation we'd want for a getsockopt()
for TCP_FASTOPEN. I imagine that since the setsockopt(TCP_FASTOPEN)
enables Fast Open on a listening socket with a given backlog, if we
add a corresponding getsockopt() then it should return that Fast Open
backlog if Fast Open is enabled on the listening socket, or 0
otherwise. That is, basically it would be querying whether
fastopen_init_queue() has been run and returning fastopenq-max_qlen
if so. That way setsockopt(TCP_FASTOPEN, getsockopt(TCP_FASTOPEN))
will be a NOP in most cases, as one would expect.

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/4] net: make tcp_cleanup_rbuf private

2014-01-09 Thread Neal Cardwell
On Thu, Jan 9, 2014 at 3:16 PM, Dan Williams  wrote:
> net_dma was the only external user so this can become local to tcp.c
> again.
...
> -void tcp_cleanup_rbuf(struct sock *sk, int copied)
> +static void cleanup_rbuf(struct sock *sk, int copied)

I would vote to keep the tcp_ prefix. In the TCP code base that is the
more common idiom, even for internal/static TCP functions, and
personally I find it easier to read and work with in stack traces,
etc. My 2 cents.

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/4] net: make tcp_cleanup_rbuf private

2014-01-09 Thread Neal Cardwell
On Thu, Jan 9, 2014 at 3:16 PM, Dan Williams dan.j.willi...@intel.com wrote:
 net_dma was the only external user so this can become local to tcp.c
 again.
...
 -void tcp_cleanup_rbuf(struct sock *sk, int copied)
 +static void cleanup_rbuf(struct sock *sk, int copied)

I would vote to keep the tcp_ prefix. In the TCP code base that is the
more common idiom, even for internal/static TCP functions, and
personally I find it easier to read and work with in stack traces,
etc. My 2 cents.

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: IPv6 TCP-Connections resetting

2013-04-05 Thread Neal Cardwell
On Fri, Apr 5, 2013 at 11:48 AM, Tetja Rediske  wrote:
> I tracked it down with 'git bisect' to commit:
>
>   093d04d42fa094f6740bb188f0ad0c215ff61e2c
...

Thanks for the  detailed report!

> 11:52:04.634656 IP6 fe80::92e2:baff:fe00:c120 > 2a00:1828:1000:1102::2:
> ICMP6, redirect, 2a00:1828:0:1::10 to 2a00:1828:0:1::10, length 136

Would you be able to re-run your tests with a tcpdump command line like:

  tcpdump -v -n -X -s 1600 icmp6

And report the full dump of this first ICMP6 packet in the exchange?

It seems that perhaps the parsing/validation of this packet is failing
somehow, so it would be nice to know exactly what that packet looked
like.

Also, are both sides of your test running 3.9.0-rc5+?

And can you please cc net...@vger.kernel.org if you follow up with more details?

Thanks!
neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: IPv6 TCP-Connections resetting

2013-04-05 Thread Neal Cardwell
On Fri, Apr 5, 2013 at 11:48 AM, Tetja Rediske te...@tetja.de wrote:
 I tracked it down with 'git bisect' to commit:

   093d04d42fa094f6740bb188f0ad0c215ff61e2c
...

Thanks for the  detailed report!

 11:52:04.634656 IP6 fe80::92e2:baff:fe00:c120  2a00:1828:1000:1102::2:
 ICMP6, redirect, 2a00:1828:0:1::10 to 2a00:1828:0:1::10, length 136

Would you be able to re-run your tests with a tcpdump command line like:

  tcpdump -v -n -X -s 1600 icmp6

And report the full dump of this first ICMP6 packet in the exchange?

It seems that perhaps the parsing/validation of this packet is failing
somehow, so it would be nice to know exactly what that packet looked
like.

Also, are both sides of your test running 3.9.0-rc5+?

And can you please cc net...@vger.kernel.org if you follow up with more details?

Thanks!
neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] tcp: add ability to set a timestamp offset

2013-01-22 Thread Neal Cardwell
On Tue, Jan 22, 2013 at 3:52 PM, Andrey Vagin  wrote:
> If a TCP socket will get live-migrated from one box to another the
> timestamps (which are typically ON) will get screwed up -- the new
> kernel will generate TS values that has nothing to do with what they
> were on dump. The solution is to yet again fix the kernel and put a
> "timestamp offset" on a socket.

One serious issue with this patch is that outgoing timestamp values
will no longer correspond to tcp_time_stamp, so echoed timestamp
values will also no longer have a meaningful relationship to
tcp_time_stamp. That violates assumptions made in several places in
the code, which assumes that we can compare echoed timestamp values to
tcp_time_stamp; for example, there are several places where we do
things like subtracting:
   tcp_time_stamp - tp->rx_opt.rcv_tsecr
to find the estimated RTT for a segment.

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] tcp: add ability to set a timestamp offset

2013-01-22 Thread Neal Cardwell
On Tue, Jan 22, 2013 at 3:52 PM, Andrey Vagin ava...@openvz.org wrote:
 If a TCP socket will get live-migrated from one box to another the
 timestamps (which are typically ON) will get screwed up -- the new
 kernel will generate TS values that has nothing to do with what they
 were on dump. The solution is to yet again fix the kernel and put a
 timestamp offset on a socket.

One serious issue with this patch is that outgoing timestamp values
will no longer correspond to tcp_time_stamp, so echoed timestamp
values will also no longer have a meaningful relationship to
tcp_time_stamp. That violates assumptions made in several places in
the code, which assumes that we can compare echoed timestamp values to
tcp_time_stamp; for example, there are several places where we do
things like subtracting:
   tcp_time_stamp - tp-rx_opt.rcv_tsecr
to find the estimated RTT for a segment.

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.8-rc2/rc3 write() blocked on CLOSE_WAIT TCP socket

2013-01-10 Thread Neal Cardwell
On Thu, Jan 10, 2013 at 9:18 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> On Thu, 2013-01-10 at 18:01 -0800, Eric Dumazet wrote:
>
>> Hmm, it might be commit c3ae62af8e755ea68380fb5ce682e60079a4c388
>> tcp: should drop incoming frames without ACK flag set
>>
>> It seems RST should be allowed to not have ACK set.
>>
>> I'll send a fix, thanks !
>
> Yes, thats definitely the problem, sorry for that.
>
>
> [PATCH] tcp: accept RST without ACK flag
>
> commit c3ae62af8e755 (tcp: should drop incoming frames without ACK flag
> set) added a regression on the handling of RST messages.
>
> RST should be allowed to come even without ACK bit set. We validate
> the RST by checking the exact sequence, as requested by RFC 793 and
> 5961 3.2, in tcp_validate_incoming()
>
> Reported-by: Eric Wong 
> Signed-off-by: Eric Dumazet 

Acked-by: Neal Cardwell 

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.8-rc2/rc3 write() blocked on CLOSE_WAIT TCP socket

2013-01-10 Thread Neal Cardwell
On Thu, Jan 10, 2013 at 9:18 PM, Eric Dumazet eric.duma...@gmail.com wrote:
 From: Eric Dumazet eduma...@google.com

 On Thu, 2013-01-10 at 18:01 -0800, Eric Dumazet wrote:

 Hmm, it might be commit c3ae62af8e755ea68380fb5ce682e60079a4c388
 tcp: should drop incoming frames without ACK flag set

 It seems RST should be allowed to not have ACK set.

 I'll send a fix, thanks !

 Yes, thats definitely the problem, sorry for that.


 [PATCH] tcp: accept RST without ACK flag

 commit c3ae62af8e755 (tcp: should drop incoming frames without ACK flag
 set) added a regression on the handling of RST messages.

 RST should be allowed to come even without ACK bit set. We validate
 the RST by checking the exact sequence, as requested by RFC 793 and
 5961 3.2, in tcp_validate_incoming()

 Reported-by: Eric Wong normalper...@yhbt.net
 Signed-off-by: Eric Dumazet eduma...@google.com

Acked-by: Neal Cardwell ncardw...@google.com

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] tun: don't zeroize sock->file on detach

2012-08-21 Thread Neal Cardwell
On Tue, Aug 21, 2012 at 12:04 PM, Stanislav Kinsbursky
 wrote:
> 10.08.2012 03:16, David Miller пишет:
>
>> From: Stanislav Kinsbursky 
>> Date: Thu, 09 Aug 2012 16:50:40 +0400
>>
>>> This is a fix for bug, introduced in 3.4 kernel by commit
>>> 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d, which, among other things,
>>> replaced
>>> simple sock_put() by sk_release_kernel(). Below is sequence, which leads
>>> to
>>> oops for non-persistent devices:
>>>
>>> tun_chr_close()
>>> tun_detach()<== tun->socket.file = NULL
>>> tun_free_netdev()
>>> sk_release_sock()
>>> sock_release(sock->file == NULL)
>>> iput(SOCK_INODE(sock))  <== dereference on NULL pointer
>>>
>>> This patch just removes zeroing of socket's file from __tun_detach().
>>> sock_release() will do this.
>>>
>>> Cc: sta...@vger.kernel.org
>>> Reported-by: Ruan Zhijie 
>>> Tested-by: Ruan Zhijie 
>>> Acked-by: Al Viro 
>>> Acked-by: Eric Dumazet 
>>> Acked-by: Yuchung Cheng 
>>> Signed-off-by: Stanislav Kinsbursky 
>>
>>
>> Applied, thanks.
>>
>
> Hi, David.
> I found out, that this commit: b09e786bd1dd66418b69348cb110f3a64764626a
> was previous attempt to fix the problem.
> I believe this commit have to be dropped.

Have you tried testing with that commit reverted? AFAICT from reading
the code, if you revert b09e786bd1dd66418b69348cb110f3a64764626a then
the sockets_in_use count becomes incorrect, because sock_release()
will be calling this_cpu_sub() for each tun socket teardown when there
was no corresponding this_cpu_add() for the tun socket (because the
tun socket is not allocated with sock_alloc()).

Can you sketch in more detail why that commit should be dropped?

neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: unable to handle kernel paging request at 00010016

2012-08-21 Thread Neal Cardwell
On Mon, Aug 20, 2012 at 8:05 PM, Dave Haywood  wrote:
> Bisected to:
>
> 5d299f3d3c8a2fbc732b1bf03af36333ccec3130 is the first bad commit
>
> commit 5d299f3d3c8a2fbc732b1bf03af36333ccec3130
>
> Author: Eric Dumazet 
>
> Date:   Mon Aug 6 05:09:33 2012 +
>
> net: ipv6: fix TCP early demux

Yes, this is expected. There was a fix checked into the "net" tree yesterday:

http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commitdiff;h=fae6ef87faeb8853896920c68ee703d715799d28

Please let us know if that doesn't fix the crashes for you.

thanks,
neal
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: unable to handle kernel paging request at 00010016

2012-08-21 Thread Neal Cardwell
On Mon, Aug 20, 2012 at 8:05 PM, Dave Haywood t...@oak.selfip.net wrote:
 Bisected to:

 5d299f3d3c8a2fbc732b1bf03af36333ccec3130 is the first bad commit

 commit 5d299f3d3c8a2fbc732b1bf03af36333ccec3130

 Author: Eric Dumazet eduma...@google.com

 Date:   Mon Aug 6 05:09:33 2012 +

 net: ipv6: fix TCP early demux

Yes, this is expected. There was a fix checked into the net tree yesterday:

http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commitdiff;h=fae6ef87faeb8853896920c68ee703d715799d28

Please let us know if that doesn't fix the crashes for you.

thanks,
neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] tun: don't zeroize sock-file on detach

2012-08-21 Thread Neal Cardwell
On Tue, Aug 21, 2012 at 12:04 PM, Stanislav Kinsbursky
skinsbur...@parallels.com wrote:
 10.08.2012 03:16, David Miller пишет:

 From: Stanislav Kinsbursky skinsbur...@parallels.com
 Date: Thu, 09 Aug 2012 16:50:40 +0400

 This is a fix for bug, introduced in 3.4 kernel by commit
 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d, which, among other things,
 replaced
 simple sock_put() by sk_release_kernel(). Below is sequence, which leads
 to
 oops for non-persistent devices:

 tun_chr_close()
 tun_detach()== tun-socket.file = NULL
 tun_free_netdev()
 sk_release_sock()
 sock_release(sock-file == NULL)
 iput(SOCK_INODE(sock))  == dereference on NULL pointer

 This patch just removes zeroing of socket's file from __tun_detach().
 sock_release() will do this.

 Cc: sta...@vger.kernel.org
 Reported-by: Ruan Zhijie ruanzhi...@hotmail.com
 Tested-by: Ruan Zhijie ruanzhi...@hotmail.com
 Acked-by: Al Viro v...@zeniv.linux.org.uk
 Acked-by: Eric Dumazet eduma...@google.com
 Acked-by: Yuchung Cheng ych...@google.com
 Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com


 Applied, thanks.


 Hi, David.
 I found out, that this commit: b09e786bd1dd66418b69348cb110f3a64764626a
 was previous attempt to fix the problem.
 I believe this commit have to be dropped.

Have you tried testing with that commit reverted? AFAICT from reading
the code, if you revert b09e786bd1dd66418b69348cb110f3a64764626a then
the sockets_in_use count becomes incorrect, because sock_release()
will be calling this_cpu_sub() for each tun socket teardown when there
was no corresponding this_cpu_add() for the tun socket (because the
tun socket is not allocated with sock_alloc()).

Can you sketch in more detail why that commit should be dropped?

neal
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/