Re: [PATCH] [PATCH net-next] openvswitch: Use correct reply values in datapath and vport ops

2018-10-17 Thread Koichiro Den
On Sat, 2018-09-29 at 11:44 -0700, David Miller wrote:
> From: Yifeng Sun 
> Date: Wed, 26 Sep 2018 11:40:14 -0700
> 
> > This patch fixes the bug that all datapath and vport ops are returning
> > wrong values (OVS_FLOW_CMD_NEW or OVS_DP_CMD_NEW) in their replies.
> > 
> > Signed-off-by: Yifeng Sun 
> 
> Applied.

Not directly related to ovs, but as a similar existing behaviour, we have a
somewhat odd response for CTRL_CMD_GETFAMILY. I mean this part:

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 25eeb6d2a75a..17428d5d1434 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -789,7 +789,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)

if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
   cb->nlh->nlmsg_seq, NLM_F_MULTI,
-  skb, CTRL_CMD_NEWFAMILY) < 0) {
+  skb, CTRL_CMD_GETFAMILY) < 0) {
n--;
break;
}
@@ -886,7 +886,7 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
}

msg = ctrl_build_family_msg(res, info->snd_portid, info->snd_seq,
-   CTRL_CMD_NEWFAMILY);
+   CTRL_CMD_GETFAMILY);
if (IS_ERR(msg))
return PTR_ERR(msg);

David, do you think we should retain this existing behaviour, given that
changing it would be a relatively large breaking change for userland apps?
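
For illustration, dispatching on the reply's cmd is exactly what typical
userland parsers do; a hypothetical sketch of the breakage (not taken from
any particular library -- the function name and error handling are made up):

	#include <linux/genetlink.h>
	#include <linux/netlink.h>

	/* Hypothetical userland-side check: apps written against today's
	 * kernels expect CTRL_CMD_NEWFAMILY in replies to CTRL_CMD_GETFAMILY,
	 * so a kernel answering with CTRL_CMD_GETFAMILY would fail here.
	 */
	static int handle_ctrl_reply(struct nlmsghdr *nlh)
	{
		struct genlmsghdr *ghdr = NLMSG_DATA(nlh);

		if (ghdr->cmd != CTRL_CMD_NEWFAMILY)
			return -1;	/* unexpected cmd: family lookup fails */
		/* ... parse CTRL_ATTR_FAMILY_ID and friends ... */
		return 0;
	}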



Re: [net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-22 Thread Koichiro Den
On Sun, 2017-10-22 at 21:31 -0700, Eric Dumazet wrote:
> On Mon, 2017-10-23 at 13:28 +0900, Koichiro Den wrote:
> 
> > Indeed, meaning that the tcp_clean_rtx_queue implementation never takes
> > such samples. But it seems to me there is some possibility that the RACK
> > algorithm will take them.
> 
> As long as tp->tcp_mstamp is monotonically increasing (and it is ;) ),
> RACK will behave properly.
> 
> I am not sure we need to discuss this forever frankly ;)
>  
> 
> 
Yeah, sure, thanks for your time and helpful comments.



Re: [net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-22 Thread Koichiro Den
On Sun, 2017-10-22 at 20:40 -0700, Eric Dumazet wrote:
> On Mon, 2017-10-23 at 12:26 +0900, Koichiro Den wrote:
> 
> > Now I wonder whether this is more of a theoretical fix rather than a patch
> > for one specific bug.
> 
> 
> Note that I said that your patch was fine and I added a 'Reviewed-by:'
> tag.
Sure, sorry about my confusing comment.
> 
> 
> What I meant is that it has no direct effect on correctness of TCP
> stack. I could not cook a packetdrill test that shows the difference
> before and after your patch.
> 
> BTW, in the following sequence :
> 
> A)   Fetch high-res timestamp and store in X
> B)   Use X
> 
> B) Can use a quite old value of X, depending on scheduling (preempt
> kernels or interrupt handling)
> 
> TCP really does not care of how accurate X is, it is a best effort.
> 
Agreed. In, e.g., a hard interrupt storm, this extra refresh just makes
the expected delay smaller under the same conditions.
> For RTX packets, it is even more the case, since TCP does not take RTT
> samples from packets that were retransmitted.
Indeed, meaning that the tcp_clean_rtx_queue implementation never takes such
samples. But it seems to me there is some possibility that the RACK algorithm
will take them.
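
To make the RACK point concrete: tcp_rack_advance() measures elapsed time
against tp->tcp_mstamp, so a stale value directly shrinks the computed RTT.
A rough sketch of the relevant logic, paraphrased from memory of
net/ipv4/tcp_recovery.c of that era (not verbatim kernel code):

	/* Paraphrased sketch: RACK takes the delta between the socket's
	 * cached clock and the skb's (re)transmit time.  If tp->tcp_mstamp
	 * is stale, rtt_us comes out too small and a retransmit's sample
	 * may wrongly pass (or fail) the min-RTT sanity check below.
	 */
	rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, xmit_time);
	if ((sacked & TCPCB_RETRANS) && rtt_us < tcp_min_rtt(tp))
		return;	/* ambiguous: original or retransmission? */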



Re: [net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-22 Thread Koichiro Den
On Sun, 2017-10-22 at 10:11 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 09:49 -0700, Eric Dumazet wrote:
> 
> > Here is the test I cooked.
> > 
> > Note that it actually passes just fine, so I am wondering if your patch
> > is needed after all..
> > 
> 
> Oh this is because tcp_mstamp_refresh() is called from tcp_write_xmit()
> and tcp_write_xmit() is directly called from tcp_tsq_handler()

Thanks for all your help. I guess we should focus on the retransmit call path,
which comes just before that, shouldn't we?
> 
> So it looks your patch has no observable effect.
> 
> (Even with usec TS as we use here at Google, it has no effect since TCP
> stack does not consider RTT samples being valid if skb was ever
> retransmitted)
> 
> 

Is that point specific to the tcp_rack_advance part, or general?

Btw, perf probe makes the original point described in my commit message apparent
as follows:

$ perf probe 'T1=tcp_write_xmit:25 tm=+1464(%bx):u64'
$ perf probe 'T2=tcp_xmit_retransmit_queue+0
  l_o=+1664(%di):u32  # tp->lost_out
  r_o=+1520(%di):u32  # tp->retrans_out
  tm=+1464(%di):u64'  # skb_mstamp will be set to this.
$ perf trace -aT --no-syscalls --call-graph dwarf \
  -e probe:T1/max-stack=4/ \
  -e probe:T2/max-stack=1/ |& \
  sed 's/..kernel.kallsyms..//;s/probe://' # to make it a bit more readable :p

$ # run the fq-pacing-tsq-rtx-ts.pkt script Eric provided:

[before patch applied]
-8<--8<-
2560581.604 T2:(817da5e0) l_o=25 r_o=0 tm=2560581404)
   tcp_xmit_retransmit_queue.part.33
   tcp_xmit_recovery.part.53
2560581.614 T1:(817d7e74) tm=2560581612)  # <--- (1)
   tcp_write_xmit
   __tcp_push_pending_frames
   tcp_rcv_established
   tcp_v4_do_rcv

/*
 * ^
 * | could be unexpectedly long, e.g. due to BLOCK_SOFTIRQ
 * v
 */
2560582.215 T2:(817da5e0) l_o=25 r_o=5 tm=2560581612)
/* <- same as (1) above */
   tcp_xmit_retransmit_queue.part.33
   tcp_tasklet_func
->8--->8-


[after patch applied]
-8<--8<-
 97846.150 T2:(817da5e0) l_o=25 r_o=0 tm=97846132)
   tcp_xmit_retransmit_queue.part.33
   tcp_xmit_recovery.part.53
 97846.159 T1:(817d7e74) tm=97846158)# <--- (1)
   tcp_write_xmit
   __tcp_push_pending_frames
   tcp_rcv_established
   tcp_v4_do_rcv
/*
 * ^
 * | could be unexpectedly long, e.g. due to BLOCK_SOFTIRQ
 * v
 */
 97847.144 T2:(817da5e0) l_o=25 r_o=5 tm=97847134)
/* <- not the same as (1).
 * never affected now that we ensure it's refreshed */
   tcp_xmit_retransmit_queue.part.33
   tcp_tasklet_func
->8--->8-

Now I wonder whether this is more of a theoretical fix rather than a patch for
one specific bug.

Re: [net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-22 Thread Koichiro Den
On Sat, 2017-10-21 at 22:21 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 13:10 +0900, Koichiro Den wrote:
> > On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> > > On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > > > When retransmission on the TSQ handler was introduced in commit
> > > > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > > > skbs' timestamps were updated on the actual transmission. In the later
> > > > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), this
> > > > stopped being done. That commit's message says "We try to refresh
> > > > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > > > tcp_v4_mtu_reduced fall into this category. The latter is okay since
> > > > it is rare enough.
> > > > 
> > > > As for the former, even though the possible retransmissions on the
> > > > tasklet come just after the destructor runs in NET_RX softirq handling,
> > > > the time between them could be non-negligibly large, to the extent that
> > > > tcp_rack_advance or RTO rearming could be affected if the other
> > > > (remaining) RX, BLOCK and (preceding) TASKLET softirq handlers are
> > > > unexpectedly heavy.
> > > > 
> > > > So, in the same way as tcp_write_timer_handler does, doing
> > > > tcp_mstamp_refresh ensures the accuracy of the algorithms relying on it.
> > > > 
> > > > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > > > ---
> > > >  net/ipv4/tcp_output.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > Very nice catch, thanks a lot Koichiro.
> > > 
> > > This IMO would target net tree, since it is a bug fix.
> > > 
> > > Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
> > 
> > Ok I will submit it to net tree. Thanks!
> > > 
> > > Thanks !
> > > 
> > > We should have caught that in our regression packetdrill tests...
> > 
> > I agree it could be caught in its "remote" mode of testing, which does not
> > rely on tun xmit. If it would be better for me to write, test and attach
> > the script, please let me know.
> 
> Packetdrill in the normal (local) mode will do it.
> 
> Push fq packet scheduler on tun0, and packets will be held long enough
> in FQ that TSQ will be effective.
Ah yes, I missed it, thank you.
> 
> Adding TCP TS support on folloginw packetdrill test should demonstrate
> the issue and if your patch cures it.
Thanks for the demo script. I thought making the issue described in my commit
message obvious with a packetdrill script on its own was a bit difficult
without a heavy load on the test bed (or intentionally injecting
time-consuming code).

IIUC, checking whether the TCP TS val reflects a slightly later timestamp than
that of the last reception, which triggered the TSQ handler, sounds much
better. Though it is still like "with some probability, a one-millisecond
delay is reflected in the TS val, so the packetdrill test passes; otherwise
the test fails." Am I correct? To be honest, I am still wondering what the
best way is to make the issue obvious.
> // Test if TSQ applies to retransmits.
> 
> `tc qdisc replace dev tun0 root fq quantum 1514 initial_quantum 1514
> flow_limit 5
> sysctl -e -q net.ipv4.tcp_min_tso_segs=1`
> 
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
> 
>    +0 < S 0:0(0) win 32792 
>    +0 > S. 0:0(0) ack 1 
>  +.01 < . 1:1(0) ack 1 win 1500
>    +0 accept(3, ..., ...) = 4
> 
>    +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
> // send one frame per ms
>    +0 setsockopt(4, SOL_SOCKET, SO_MAX_PACING_RATE, [1514000], 4) = 0
>    +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [100], 4) = 0
> 
>    +0 write(4, ..., 12) = 12
> 
>    +0 > .  1:1461(1460) ack 1
> +.001 > .  1461:2921(1460) ack 1
> +.001 > .  2921:4381(1460) ack 1
> +.001 > .  4381:5841(1460) ack 1
> +.001 > .  5841:7301(1460) ack 1
> +.001 > .  7301:8761(1460) ack 1
> +.001 > .  8761:10221(1460) ack 1
> +.001 > .  10221:11681(1460) ack 1
> +.001 > .  11681:13141(1460) ack 1
> +.001 > .  13141:14601(1460) ack 1
>  +.01 < . 1:1(0) ack 14601 win 1500
> 
> +.001 > .  14601:16061(1460) ack 1
> +.001 > .  16061:17521(1460) ack 1
> +.001 > .  17521:18981(1460) ack 1
> +.001 > .  18981:20441(1460) ack 1
> +.001 > .  20441:

[net] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-21 Thread Koichiro Den
When retransmission on the TSQ handler was introduced in commit
f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
skbs' timestamps were updated on the actual transmission. In the later
commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), this
stopped being done. That commit's message says "We try to refresh
tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
tcp_v4_mtu_reduced fall into this category. The latter is okay since it
is rare enough.

As for the former, even though the possible retransmissions on the tasklet
come just after the destructor runs in NET_RX softirq handling, the time
between them could be non-negligibly large, to the extent that
tcp_rack_advance or RTO rearming could be affected if the other (remaining)
RX, BLOCK and (preceding) TASKLET softirq handlers are unexpectedly heavy.

So, in the same way as tcp_write_timer_handler does, doing
tcp_mstamp_refresh ensures the accuracy of the algorithms relying on it.

Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Koichiro Den <d...@klaipeden.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bc9e46a5369..973befc36fd4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -739,8 +739,10 @@ static void tcp_tsq_handler(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
 
if (tp->lost_out > tp->retrans_out &&
-   tp->snd_cwnd > tcp_packets_in_flight(tp))
+   tp->snd_cwnd > tcp_packets_in_flight(tp)) {
+   tcp_mstamp_refresh(tp);
tcp_xmit_retransmit_queue(sk);
+   }
 
tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
   0, GFP_ATOMIC);
-- 
2.9.4
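
For context, tcp_mstamp_refresh() itself is tiny; from memory of the v4.13
sources it looks roughly like this (a paraphrase, details may differ by
kernel version):

	/* Paraphrased from memory, not verbatim: re-read the clock and keep
	 * tp->tcp_mstamp monotonically non-decreasing, so code running later
	 * in the same softirq sees a current timestamp.
	 */
	static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
	{
		u64 val = tcp_clock_us();

		if (val > tp->tcp_mstamp)
			tp->tcp_mstamp = val;
	}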




Re: [net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-21 Thread Koichiro Den
On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > When retransmission on the TSQ handler was introduced in commit
> > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > skbs' timestamps were updated on the actual transmission. In the later
> > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), this
> > stopped being done. That commit's message says "We try to refresh
> > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > tcp_v4_mtu_reduced fall into this category. The latter is okay since
> > it is rare enough.
> > 
> > As for the former, even though the possible retransmissions on the
> > tasklet come just after the destructor runs in NET_RX softirq handling,
> > the time between them could be non-negligibly large, to the extent that
> > tcp_rack_advance or RTO rearming could be affected if the other
> > (remaining) RX, BLOCK and (preceding) TASKLET softirq handlers are
> > unexpectedly heavy.
> > 
> > So, in the same way as tcp_write_timer_handler does, doing
> > tcp_mstamp_refresh ensures the accuracy of the algorithms relying on it.
> > 
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Very nice catch, thanks a lot Koichiro.
> 
> This IMO would target net tree, since it is a bug fix.
> 
> Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Ok I will submit it to net tree. Thanks!
> 
> Thanks !
> 
> We should have caught that in our regression packetdrill tests...
In its "remote" mode testing not relying on tun xmit, I agree it could be
caught. If it's better to write, test and attach the script, please let me know.

Thank you.



[net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

2017-10-21 Thread Koichiro Den
When retransmission on the TSQ handler was introduced in commit
f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
skbs' timestamps were updated on the actual transmission. In the later
commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), this
stopped being done. That commit's message says "We try to refresh
tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
tcp_v4_mtu_reduced fall into this category. The latter is okay since it
is rare enough.

As for the former, even though the possible retransmissions on the tasklet
come just after the destructor runs in NET_RX softirq handling, the time
between them could be non-negligibly large, to the extent that
tcp_rack_advance or RTO rearming could be affected if the other (remaining)
RX, BLOCK and (preceding) TASKLET softirq handlers are unexpectedly heavy.

So, in the same way as tcp_write_timer_handler does, doing
tcp_mstamp_refresh ensures the accuracy of the algorithms relying on it.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 988733f289c8..7a4d9b719b47 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -743,8 +743,10 @@ static void tcp_tsq_handler(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
 
if (tp->lost_out > tp->retrans_out &&
-   tp->snd_cwnd > tcp_packets_in_flight(tp))
+   tp->snd_cwnd > tcp_packets_in_flight(tp)) {
+   tcp_mstamp_refresh(tp);
tcp_xmit_retransmit_queue(sk);
+   }
 
tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
   0, GFP_ATOMIC);
-- 
2.9.4




Re: [net-next 3/3] tcp: keep tcp_collapse controllable even after processing starts

2017-10-14 Thread Koichiro Den
On Sat, 2017-10-14 at 08:46 -0700, Eric Dumazet wrote:
> On Sat, 2017-10-14 at 16:27 +0900, Koichiro Den wrote:
> > By combining the actual collapsing with the reasoning used to decide the
> > starting point, we can apply the logic consistently and avoid collapsing
> > that is costly yet not very useful. When collapsing is triggered, it is
> > not rare that most of the skbs in the receive or ooo queue are large ones
> > without much metadata overhead. This also simplifies the code and makes
> > it easier to apply the logic fairly.
> > 
> > Subtle subsidiary changes included:
> > - When the end_seq of the skb we are trying to collapse was larger than
> >   the 'end' argument provided, we would end up copying up to 'end' even
> >   though we could not collapse the original skb. Current users of
> >   tcp_collapse do not require such a reserve, so redefine 'end' as the
> >   point beyond which skbs whose seq passes it are guaranteed not to be
> >   collapsed.
> > - Naturally, tcp_collapse_ofo_queue shapes up and we no longer need the
> >   'tail' argument.
> 
> 
> I am not inclined to review such a large change, without you providing
> actual numbers.
> 
> We have a problem in TCP right now, that receiver announces a too big
> window, and that is the main reason we trigger collapsing.
> 
> I would rather fix the root cause.
> 
> 
Ok I got it, thank you.



Re: [net-next 2/3] tcp: do not tcp_collapse once SYN or FIN found

2017-10-14 Thread Koichiro Den
On Sat, 2017-10-14 at 08:43 -0700, Eric Dumazet wrote:
> On Sat, 2017-10-14 at 16:27 +0900, Koichiro Den wrote:
> > Since 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") was
> > applied, we no longer need to continue searching for the starting point
> > once we encounter a FIN packet. The same reasoning holds for SYN packets
> > since commit 9d691539eea2d ("tcp: do not enqueue skb with SYN flag"); we
> > would already have got an error message at actual receive time.
> 
> Very confusing changelog or patch.
> 
> What exact problem do you want to solve ?
> 
> 
What I thought was an unnecessary search for the starting point. I am going
to re-read it all to correct my misunderstanding.
Thank you.



Re: [net-next 1/3] tcp: avoid useless copying and collapsing of just one skb

2017-10-14 Thread Koichiro Den
On Sat, 2017-10-14 at 08:42 -0700, Eric Dumazet wrote:
> On Sat, 2017-10-14 at 16:27 +0900, Koichiro Den wrote:
> > Depending on the starting point chosen, it is possible that just one skb
> > remains within the range provided, leading to copying and re-insertion of
> > an rb node, which is useless with respect to the rcv buf measurement.
> > This is rather probable in the ooo queue case, in which non-contiguous
> > bloated packets have been queued up.
> > 
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >  net/ipv4/tcp_input.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index d0682ce2a5d6..1d785b5bf62d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4807,7 +4807,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
> >     start = TCP_SKB_CB(skb)->end_seq;
> >     }
> >     if (end_of_skbs ||
> > -   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
> > +   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
> > +   (TCP_SKB_CB(skb)->seq == start && TCP_SKB_CB(skb)->end_seq ==
> > end))
> >     return;
> >  
> >     __skb_queue_head_init(&tmp);
> 
> 
> What do you mean by useless ?
> 
> Surely if this skb contains 17 segments (some USB drivers allocate 8KB
> per frame), we want to collapse them to save memory.
> 
> So I do not agree with this patch.
> 
> 
I missed that, and sorry for the bother with all of these; I totally
misunderstood. Thank you for all the comments.
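
To put Eric's 17-segment example in numbers, here is a hedged illustration of
the bloat test tcp_collapse relies on (the figures are made up for the sake
of the example):

	/* Illustration only: an skb whose frags come from 8KB-per-frame
	 * allocations holds ~1460 bytes of payload per 8KB of truesize.
	 * With 17 such segments merged into one skb:
	 *
	 *   skb->len      ~= 17 * 1460 =  24820 bytes of payload
	 *   skb->truesize ~= 17 * 8192 = 139264 bytes of memory
	 *
	 * tcp_win_from_space(skb->truesize) > skb->len then flags the skb
	 * as bloated, and collapsing it into compactly-allocated skbs
	 * reclaims most of that memory even though it is "just one skb".
	 */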



[net-next 0/3] optimisations and reorganisations of tcp_collapse

2017-10-14 Thread Koichiro Den
This patch series removes possibly useless copying and collapsing while not
missing the chances where it is worth the effort. It also reorganizes the
code and does some cleanups.

Koichiro Den (3):
  tcp: avoid useless copying and collapsing of just one skb
  tcp: do not tcp_collapse once SYN or FIN found
  tcp: keep tcp_collapse controllable even after processing starts

 net/ipv4/tcp_input.c | 193 ---
 1 file changed, 90 insertions(+), 103 deletions(-)

-- 
2.9.4




[net-next 1/3] tcp: avoid useless copying and collapsing of just one skb

2017-10-14 Thread Koichiro Den
Depending on the starting point chosen, it is possible that just one skb
remains within the range provided, leading to copying and re-insertion of an
rb node, which is useless with respect to the rcv buf measurement. This is
rather probable in the ooo queue case, in which non-contiguous bloated
packets have been queued up.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d0682ce2a5d6..1d785b5bf62d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4807,7 +4807,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
start = TCP_SKB_CB(skb)->end_seq;
}
if (end_of_skbs ||
-   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+   (TCP_SKB_CB(skb)->seq == start && TCP_SKB_CB(skb)->end_seq == end))
return;
 
__skb_queue_head_init(&tmp);
-- 
2.9.4




[net-next 3/3] tcp: keep tcp_collapse controllable even after processing starts

2017-10-14 Thread Koichiro Den
By combining the actual collapsing with the reasoning used to decide the
starting point, we can apply the logic consistently and avoid collapsing that
is costly yet not very useful. When collapsing is triggered, it is not rare
that most of the skbs in the receive or ooo queue are large ones without much
metadata overhead. This also simplifies the code and makes it easier to apply
the logic fairly.

Subtle subsidiary changes included:
- When the end_seq of the skb we are trying to collapse was larger than
  the 'end' argument provided, we would end up copying up to 'end' even
  though we could not collapse the original skb. Current users of
  tcp_collapse do not require such a reserve, so redefine 'end' as the
  point beyond which skbs whose seq passes it are guaranteed not to be
  collapsed.
- Naturally, tcp_collapse_ofo_queue shapes up and we no longer need the
  'tail' argument.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/ipv4/tcp_input.c | 197 +++
 1 file changed, 90 insertions(+), 107 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a74457db0f3..5fb90cc0ae95 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4756,108 +4756,117 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
 
 /* Collapse contiguous sequence of skbs head..tail with
  * sequence numbers start..end.
- *
- * If tail is NULL, this means until the end of the queue.
- *
- * Segments with FIN/SYN are not collapsed (only because this
- * simplifies code)
  */
 static void
 tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
-struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
+struct sk_buff *head, u32 start, u32 *end)
 {
-   struct sk_buff *skb = head, *n;
+   struct sk_buff *skb = head, *n, *nskb = NULL;
+   int copy = 0, offset, size;
struct sk_buff_head tmp;
-   bool end_of_skbs;
 
-   /* First, check that queue is collapsible and find
-* the point where collapsing can be useful.
-*/
-restart:
-   for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
-   /* If list is ooo queue, it will get purged when
-* this FIN will get moved to sk_receive_queue.
-* SYN packet is not expected here. We will get
-* error message when actual receiving.
-*/
-   if (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_FIN | TCPHDR_SYN))
-   return;
+   if (!list)
+   __skb_queue_head_init(&tmp); /* To defer rb tree insertion */
 
-   n = tcp_skb_next(skb, list);
-
-   /* No new bits? It is possible on ofo queue. */
+   while (skb) {
if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
skb = tcp_collapse_one(sk, skb, list, root);
-   if (!skb)
-   break;
-   goto restart;
+   continue;
}
+   n = tcp_skb_next(skb, list);
 
-   /* The first skb to collapse is:
-* - bloated or contains data before "start" or
-*   overlaps to the next one.
-*/
-   if (tcp_win_from_space(skb->truesize) > skb->len ||
-   before(TCP_SKB_CB(skb)->seq, start)) {
-   end_of_skbs = false;
+examine:
+   /* Nothing beneficial to expect any more if SYN/FIN or last. */
+   if (!n ||
+   TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_FIN | TCPHDR_SYN))
break;
+
+   /* If hole found, skip to the next. */
+   if (after(TCP_SKB_CB(n)->seq, TCP_SKB_CB(skb)->end_seq)) {
+   skb = n;
+   continue;
}
 
-   if (n && n != tail &&
-   TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
-   end_of_skbs = false;
-   break;
+   /* If the 2nd skb has no newer bits than the current one,
+* just collapse and advance it to re-examine.
+*/
+   if (!after(TCP_SKB_CB(n)->end_seq, TCP_SKB_CB(skb)->end_seq)) {
+   n = tcp_collapse_one(sk, skb, list, root);
+   if (!n)
+   break;
+   goto examine;
}
 
-   /* Decided to skip this, advance start seq. */
-   start = TCP_SKB_CB(skb)->end_seq;
-   }
-   if (end_of_skbs ||
-   (TCP_SKB_CB(skb)->seq == start && TCP_SKB_CB(skb)->end_seq == end))
-   return;
+   /* If the next skb passes the end hint, finish. */
+   if (

[net-next 2/3] tcp: do not tcp_collapse once SYN or FIN found

2017-10-14 Thread Koichiro Den
Since 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") was applied,
we no longer need to continue searching for the starting point once we
encounter a FIN packet. The same reasoning holds for SYN packets since commit
9d691539eea2d ("tcp: do not enqueue skb with SYN flag"); we would already
have got an error message at actual receive time.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/ipv4/tcp_input.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1d785b5bf62d..1a74457db0f3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4775,6 +4775,14 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 */
 restart:
for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
+   /* If list is ooo queue, it will get purged when
+* this FIN will get moved to sk_receive_queue.
+* SYN packet is not expected here. We will get
+* error message when actual receiving.
+*/
+   if (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_FIN | TCPHDR_SYN))
+   return;
+
n = tcp_skb_next(skb, list);
 
/* No new bits? It is possible on ofo queue. */
@@ -4786,13 +4794,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
}
 
/* The first skb to collapse is:
-* - not SYN/FIN and
 * - bloated or contains data before "start" or
 *   overlaps to the next one.
 */
-   if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
-   (tcp_win_from_space(skb->truesize) > skb->len ||
-before(TCP_SKB_CB(skb)->seq, start))) {
+   if (tcp_win_from_space(skb->truesize) > skb->len ||
+   before(TCP_SKB_CB(skb)->seq, start)) {
end_of_skbs = false;
break;
}
@@ -4807,7 +4813,6 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
start = TCP_SKB_CB(skb)->end_seq;
}
if (end_of_skbs ||
-   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
(TCP_SKB_CB(skb)->seq == start && TCP_SKB_CB(skb)->end_seq == end))
return;
 
@@ -4845,9 +4850,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
}
if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
skb = tcp_collapse_one(sk, skb, list, root);
-   if (!skb ||
-   skb == tail ||
-   (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+   if (!skb || skb == tail)
goto end;
}
}
-- 
2.9.4




Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-23 Thread Koichiro Den
On Wed, 2017-08-23 at 23:28 +0900, Koichiro Den wrote:
> On Tue, 2017-08-22 at 20:55 +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > > > Perhaps the descriptor pool should also be
> > > > revised to allow out of order completions. Then there is no need to
> > > > copy zerocopy packets whenever they may experience delay.
> > > 
> > > Yes, but as replied in the referenced thread, windows driver may treat out
> > > of order completion as a bug.
> > 
> > That would be a windows driver bug then, but I don't think it makes this
> > assumption. What the referenced thread
> > (https://patchwork.kernel.org/patch/3787671/) is saying is that host
> > must use any buffers made available on a tx vq within a reasonable
> > timeframe otherwise windows guests panic.
> > 
> > Ideally we would detect that a packet is actually experiencing delay and
> > trigger the copy at that point e.g. by calling skb_linearize. But it
> > isn't easy to track these packets though and even harder to do a data
> > copy without races.
> > 
> > Which reminds me that skb_linearize in net core seems to be
> > fundamentally racy - I suspect that if skb is cloned, and someone is
> > trying to use the shared frags while another thread calls skb_linearize,
> > we get some use after free bugs which likely mostly go undetected
> > because the corrupted packets mostly go on wire and get dropped
> > by checksum code.
> > 
> 
> Please let me make sure I understand this correctly:
> * Always doing a copy with skb_orphan_frags_rx, as Willem mentioned in the
> earlier post, before xmit_skb (as opposed to my original patch), is safe but
> too costly, so it cannot be adopted.
> * As a generic solution, if we were to somehow overcome the safety issue,
> tracking the delay and doing the copy once some threshold is reached could
> be an answer, but that is hard for now.
> * So something like the current vhost-net implementation, which decides
> whether or not to do zerocopy beforehand by referring to the zerocopy tx
> error ratio, is a point of practical compromise.
<- I forgot to mention the max pend checking part.
> 
> Thanks.



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-23 Thread Koichiro Den
On Tue, 2017-08-22 at 20:55 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > Yes, but as replied in the referenced thread, windows driver may treat out
> > of order completion as a bug.
> 
> That would be a windows driver bug then, but I don't think it makes this
> assumption. What the referenced thread
> (https://patchwork.kernel.org/patch/3787671/) is saying is that host
> must use any buffers made available on a tx vq within a reasonable
> timeframe otherwise windows guests panic.
> 
> Ideally we would detect that a packet is actually experiencing delay and
> trigger the copy at that point e.g. by calling skb_linearize. But it
> isn't easy to track these packets though and even harder to do a data
> copy without races.
> 
> Which reminds me that skb_linearize in net core seems to be
> fundamentally racy - I suspect that if skb is cloned, and someone is
> trying to use the shared frags while another thread calls skb_linearize,
> we get some use after free bugs which likely mostly go undetected
> because the corrupted packets mostly go on wire and get dropped
> by checksum code.
> 
Please let me make sure I understand this correctly:
* Always doing a copy with skb_orphan_frags_rx, as Willem mentioned in the
earlier post, before xmit_skb (as opposed to my original patch), is safe but
too costly, so it cannot be adopted.
* As a generic solution, if we were to somehow overcome the safety issue,
tracking the delay and doing the copy once some threshold is reached could be
an answer, but that is hard for now.
* So something like the current vhost-net implementation, which decides
whether or not to do zerocopy beforehand by referring to the zerocopy tx
error ratio, is a point of practical compromise.

Thanks.
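
The "error ratio" compromise in the last bullet refers to vhost-net's
heuristic for choosing copy vs. zerocopy per packet. Paraphrased from memory
of drivers/vhost/net.c of that era (not verbatim, details may differ):

	/* Paraphrased sketch: prefer zerocopy only while fewer than roughly
	 * 1 in 64 recent transmissions ended in a zerocopy error; otherwise
	 * fall back to copying.
	 */
	static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
	{
		/* TX flush waits for outstanding DMAs to be done;
		 * don't start new ones meanwhile.
		 */
		return !net->tx_flush &&
			net->tx_packets / 64 >= net->tx_zcopy_err;
	}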



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-23 Thread Koichiro Den
On Tue, 2017-08-22 at 13:19 -0400, Willem de Bruijn wrote:
> > > > > /* Don't wait up for transmitted skbs to be freed. */
> > > > > if (!use_napi) {
> > > > > +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > > > > +   struct ubuf_info *uarg;
> > > > > +   uarg = skb_shinfo(skb)->destructor_arg;
> > > > > +   if (uarg->callback)
> > > > > +   uarg->callback(uarg, true);
> > > > > +   skb_shinfo(skb)->destructor_arg = NULL;
> > > > > +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > > > > +   }
> > > > 
> > > > Instead of open coding, this can use skb_zcopy_clear.
> > > 
> > > It is not correct to only send the zerocopy completion here, as
> > > the skb will still be shared with the nic. The only safe approach
> > > to notifying early is to create a copy with skb_orphan_frags_rx.
> > > That will call skb_zcopy_clear(skb, false) internally. But the
> > > copy will be very expensive.
> > 
> > I noticed this email after my last post. I cannot imagine how it could be
> > unsafe in the virtio case. Sorry, could you explain a bit more?
> 
> A guest process sends a packet with MSG_ZEROCOPY to the
> virtio-net device. As soon as the process receives the completion
> notification, it is allowed to reuse the memory backing the packet.
> 
> A call to skb_zcopy_clear in virtio-net start_xmit will notify the
> process that it is allowed to reuse the memory. But the user pages
> are still linked into the skb frags and are about to be shared with
> the host os.
> 
> > > Is the deadlock you refer to the case where tx interrupts are
> > > disabled, so transmit completions are only handled in start_xmit
> > > and somehow the socket(s) are unable to send new data? The
> > > question is what is blocking them. If it is zerocopy, it may be a
> > > too low optmem_max or hitting the per-user locked pages limit.
> > > We may have to keep interrupts enabled when zerocopy skbs
> > > are in flight.
> > 
> > Even if we keep interrupts enabled, we miss the completion without
> > start_xmit. And it's also likely that the next start_xmit depends on the
> > completion itself, as I described in my last post.
> > 
Thanks for the explanation. I misinterpreted the "nic" part; now it's clear.
Sorry for the bother.
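
For reference, the safe-but-costly alternative Willem describes would look
roughly like the sketch below in start_xmit (an illustration assuming the
skb_orphan_frags_rx helper from net-next of that time; not a proposed patch):

	/* Sketch only: copy user frags before the skb is shared with the
	 * host, so the zerocopy completion can be signalled safely early.
	 * skb_orphan_frags_rx() returns 0 on success and performs
	 * skb_zcopy_clear(skb, false) internally after copying.
	 */
	if (!use_napi && unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) {
		dev_kfree_skb_any(skb);	/* copy failed under memory pressure */
		return NETDEV_TX_OK;
	}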



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-23 Thread Koichiro Den
On Tue, 2017-08-22 at 13:16 -0400, Willem de Bruijn wrote:
> > > > An issue of the referenced patch is that sndbuf could be smaller than
> > > > low watermark.
> > 
> > We cannot determine the low watermark properly, not only because of the
> > sndbuf size issue but also because the upper vhost-net cannot directly see
> > how many descriptors are currently available in the virtio-net tx queue.
> > It depends on the multiqueue settings or on other senders which are also
> > using the same tx queue. Note that in the latter case, if they are
> > constantly transmitting, the deadlock cannot occur(*); however, if they
> > have just temporarily filled some portion of the pool in the meantime,
> > then the low watermark cannot help.
> > (*: That is because it's reliable enough in the sense I mention below.)
> > 
> > Keeping this in mind, let me briefly describe the possible deadlock I
> > mentioned:
> > (1) vhost-net on the L1 guest cannot do sendmsg until the upper layer sets
> > up new descriptors, which depends only on the vhost-net zerocopy callback
> > and on adding newly used descriptors.
> > (2) the vhost-net callback depends on the skb freeing on the xmit path
> > only.
> > (3) the xmit path depends (possibly only) on the vhost-net sendmsg.
> > As you can see, for the situation above to arise it is enough that L1
> > virtio-net reaches its limit earlier than the L0 host processing does. The
> > vhost-net pool could be almost full or empty, it does not matter.
> 
> Thanks for the context. This issue is very similar to the one that used to
> exist when running out of transmit descriptors, before the removal of
> the timer and introduction of skb_orphan in start_xmit.
> 
> To make sure that I understand correctly, let me paraphrase:
> 
> A. guest socket cannot send because it exhausted its sk budget (sndbuf, tsq,
> ..)
> 
> B. budget is not freed up until guest receives tx completion for this flow
> 
> C. tx completion is held back on the host side in vhost_zerocopy_signal_used
>    behind the completion for an unrelated skb
> 
> D. unrelated packet is delayed somewhere in the host stack of zerocopy
>    completions, e.g., netem
> 
> The issue that is specific to vhost-net zerocopy is that (C) enforces strict
> ordering of transmit completions causing head of line blocking behind
> vhost-net zerocopy callbacks.
> 
> This is a different problem from
> 
> C1. tx completion is delayed until guest sends another packet and
>    triggers free_old_xmit_skb
> 
> Both in host and guest, zerocopy packets should never be able to loop
> to a receive path where they can cause unbounded delay.
> 
> The obvious cases of latency are queueing, like netem. That leads
> to poor performance for unrelated flows, but I don't see how this
> could cause deadlock.

Thanks for the wrap-up. I see all the points now and also that C1 should not
cause deadlock.



Re: [PATCH] vhost: fix end of range for access_ok

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 22:45 +0300, Michael S. Tsirkin wrote:
> During access_ok checks, addr increases as we iterate over the data
> structure, thus addr + len - 1 will point beyond the end of region we
> are translating.  Harmless since we then verify that the region covers
> addr, but let's not waste cpu cycles.
> 
> Reported-by: Koichiro Den <d...@klaipeden.com>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
> 
> Lightly tested, would appreciate an ack from reporter.
> 
>  drivers/vhost/vhost.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e4613a3..ecd70e4 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1176,7 +1176,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
>  {
>   const struct vhost_umem_node *node;
>   struct vhost_umem *umem = vq->iotlb;
> - u64 s = 0, size, orig_addr = addr;
> + u64 s = 0, size, orig_addr = addr, last = addr + len - 1;
>  
>   if (vhost_vq_meta_fetch(vq, addr, len, type))
>   return true;
> @@ -1184,7 +1184,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
>   while (len > s) {
> - node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
>      addr,
> -    addr + len - 1);
> +    last);
>   if (node == NULL || node->start > addr) {
>       vhost_iotlb_miss(vq, addr, access);
>   return false;

Michael, thank you for this one.

Acked-by: Koichiro Den <d...@klaipeden.com>



Re: [PATCH 1/2] vhost: remove the possible fruitless search on iotlb prefetch

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 22:45 +0300, Michael S. Tsirkin wrote:
> On Sat, Aug 19, 2017 at 03:41:14PM +0900, Koichiro Den wrote:
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index e4613a3c362d..93e909afc1c3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1184,7 +1184,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
> >     while (len > s) {
> >     node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> >        addr,
> > -      addr + len - 1);
> > +      addr + len - s - 1);
> >     if (node == NULL || node->start > addr) {
> >     vhost_iotlb_miss(vq, addr, access);
> >     return false;
> 
> This works but it probably makes sense to just refactor the code to make
> end of range a variable. I posted a patch like this, pls take a look.
> 
> > -- 
> > 2.9.4
> > 
Ok, thanks.



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Koichiro Den
On Tue, 2017-08-22 at 08:11 -0400, Willem de Bruijn wrote:
> On Sun, Aug 20, 2017 at 4:49 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
> > On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <d...@klaipeden.com> wrote:
> > > Facing the possible unbounded delay of relying on freeing on the xmit
> > > path, we had also better invoke and clear the upper layer zerocopy
> > > callback beforehand, to keep it from waiting for an unbounded duration
> > > in vain.
> > 
> > Good point.
> > 
> > > For instance, this removes the possible deadlock in the case where the
> > > upper layer is a zerocopy-enabled vhost-net.
> > > This does not apply if napi_tx is enabled, since the callback will then
> > > be invoked in reasonable time.
> > 
> > Indeed. Btw, I am gathering data to eventually make napi the default
> > mode. But that is taking some time.
> > 
> > > 
> > > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > > ---
> > >  drivers/net/virtio_net.c | 8 
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 4302f313d9a7..f7deaa5b7b50 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > 
> > > /* Don't wait up for transmitted skbs to be freed. */
> > > if (!use_napi) {
> > > +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > > +   struct ubuf_info *uarg;
> > > +   uarg = skb_shinfo(skb)->destructor_arg;
> > > +   if (uarg->callback)
> > > +   uarg->callback(uarg, true);
> > > +   skb_shinfo(skb)->destructor_arg = NULL;
> > > +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > > +   }
> > 
> > Instead of open coding, this can use skb_zcopy_clear.
> 
> It is not correct to only send the zerocopy completion here, as
> the skb will still be shared with the nic. The only safe approach
> to notifying early is to create a copy with skb_orphan_frags_rx.
> That will call skb_zcopy_clear(skb, false) internally. But the
> copy will be very expensive.
I noticed this email after my last post. I cannot imagine how it could be
unsafe in the virtio case. Sorry, could you explain a bit more?
> 
> Is the deadlock you refer to the case where tx interrupts are
> disabled, so transmit completions are only handled in start_xmit
> and somehow the socket(s) are unable to send new data? The
> question is what is blocking them. If it is zerocopy, it may be a
> too low optmem_max or hitting the per-user locked pages limit.
> We may have to keep interrupts enabled when zerocopy skbs
> are in flight.
Even if we keep interrupts enabled, we miss the completion without
start_xmit. And it's also likely that the next start_xmit depends on the
completion itself, as I described in my last post.



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 23:10 -0400, Willem de Bruijn wrote:
> > > > Interesting, deadlock could be treated as a radical case of the
> > > > discussion here https://patchwork.kernel.org/patch/3787671/.
> > > > 
> > > > git grep shows more similar skb_orphan() cases. Do we need to change
> > > > them all (or part)?
> > > 
> > > Most skb_orphan calls are not relevant to the issue of transmit delay.
> > 
> > 
> > Yes, but at least we should audit the ones in drivers/net.
> 
> Do you mean other virtual device driver transmit paths, like xen,
> specifically?
> 
> > > > Actually, we may meet similar issues at many other places (e.g netem).
> > > 
> > > Netem is an interesting case. Because it is intended to mimic network
> > > delay, at least in the case where it calls skb_orphan, it may make
> > > sense to release all references, including calling skb_zcopy_clear.
> > > 
> > > In general, zerocopy reverts to copy on all paths that may cause
> > > unbounded delay due to another process. Guarding against delay
> > > induced by the administrator is infeasible. It is always possible to
> > > just pause the nic. Netem is one instance of that, and not unbounded.
> > 
> > 
> > The problem is, admin may only delay the traffic in e.g one interface, but
> > it actually delay or stall all traffic inside a VM.
> 
> Understood. Ideally we can remove the HoL blocking cause of this,
> itself.
> 
> > > > Need
> > > > to consider a complete solution for this. Figuring out all places that
> > > > could
> > > > delay a packet is a method.
> > > 
> > > The issue described in the referenced patch seems like head of line
> > > blocking between two flows. If one flow delays zerocopy descriptor
> > > release from the vhost-net pool, it blocks all subsequent descriptors
> > > in that pool from being released, also delaying other flows that use
> > > the same descriptor pool. If the pool is empty, all transmission stopped.
> > > 
> > > Reverting to copy tx when the pool reaches a low watermark, as the
> > > patch does, fixes this.
> > 
> > 
> > An issue of the referenced patch is that sndbuf could be smaller than low
> > watermark.
We cannot determine the low watermark properly, not only because of the
sndbuf size issue but also because the upper vhost-net cannot directly see
how many descriptors are currently available in the virtio-net tx queue. It
depends on the multiqueue settings or on other senders which are also using
the same tx queue. Note that in the latter case, if they are constantly
transmitting, the deadlock cannot occur(*); however, if they have just
temporarily filled some portion of the pool in the meantime, then the low
watermark cannot help.
(*: That is because it's reliable enough in the sense I mention below.)

Keeping this in mind, let me briefly describe the possible deadlock I
mentioned:
(1) vhost-net on the L1 guest cannot do sendmsg until the upper layer sets up
new descriptors, which depends only on the vhost-net zerocopy callback and on
adding newly used descriptors.
(2) the vhost-net callback depends on the skb freeing on the xmit path only.
(3) the xmit path depends (possibly only) on the vhost-net sendmsg.
As you can see, for the situation above to arise it is enough that L1
virtio-net reaches its limit earlier than the L0 host processing does. The
vhost-net pool could be almost full or empty, it does not matter.

Also note that relaxing restriction (2) (and thus removing the dependency
(3) -> (1)) seems quite natural, but it needs to be reliable enough never to
miss the last trigger if it is based on interrupts. We might need to make
some modifications to the virtio interrupt mitigation in the non-tx-napi
case, but that is too costly and it is not fair to have to cover an
upper-layer peculiarity.

So I think drivers which hold characteristics such as (2) are worth checking
to see whether we should revise them.
> > 
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > 
> > Yes, but as replied in the referenced thread, windows driver may treat out
> > of order completion as a bug.
> 
> Interesting. I missed that. Perhaps the zerocopy optimization
> could be gated on guest support for out of order completions.
> 
> > > On the point of counting copy vs zerocopy: the new msg_zerocopy
> > > variant of ubuf_info has a field to record whether a deep copy was
> > > made. This can be used with vhost-net zerocopy, too.
> > 
> > 
> > Just to make sure I understand. It's still not clear to me how to reuse this
> > for vhost-net, e.g zerocopy flag is in a union which is not used by
> > vhost_net.
> 
> True, but that is not set in stone. I went back and forth on that when
> preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
> with msg_zerocopy"). The field can be moved outside the union and
> initialized in the other zerocopy paths.



Re: [PATCH 2/2] vhost-net: revert vhost_exceeds_maxpend logic to its original

2017-08-21 Thread Koichiro Den
On Mon, 2017-08-21 at 20:40 +0800, Jason Wang wrote:
> 
> On 2017-08-21 11:06, Jason Wang wrote:
> > 
> > 
> > On 2017-08-19 14:41, Koichiro Den wrote:
> > > Depending on vq.num and the usage of VHOST_MAX_PEND is not succinct
> > > and in some cases unexpected, so revert its logic part only.
> > 
> > Hi:
> > 
> > Could you explain a little bit more on the case where it was not
> > sufficient?
> > 
> > Thanks
> 
> Just have another thought.
> 
> I wonder whether or not just using ulimit(memlock) is better here. It
> looks more flexible.
> 
> Thanks
That sounds like a better approach, though at present I have no idea what
the proper proportion should be.



Re: [PATCH 2/2] vhost-net: revert vhost_exceeds_maxpend logic to its original

2017-08-21 Thread Koichiro Den
On Mon, 2017-08-21 at 11:06 +0800, Jason Wang wrote:
> 
> On 2017-08-19 14:41, Koichiro Den wrote:
> > Depending on vq.num and the usage of VHOST_MAX_PEND is not succinct
> > and in some cases unexpected, so revert its logic part only.
> 
> Hi:
> 
> Could you explain a little bit more on the case where it was not sufficient?
It's just that vq.num could theoretically be around VHOST_MAX_PEND, which
could lead to an unexpected situation. Plus, I thought its name and usage are
not intuitive. Its actual functioning in the normal vq.num == 256 case is, I
think, good enough. Thanks.
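
A worked example of the pending-count arithmetic, with illustrative values
(UIO_MAXIOV = 1024, VHOST_MAX_PEND = 128), may help here:

	/* Illustration only.  Suppose the indices have wrapped around the
	 * UIO_MAXIOV-sized ring: upend_idx = 10, done_idx = 900.
	 *
	 *   num_pends = upend_idx + UIO_MAXIOV - done_idx
	 *             = 10 + 1024 - 900 = 134
	 *
	 * 134 > 128, so zerocopy is throttled.  The count depends only on
	 * the two indices, whereas the old expression mixed vq.num into the
	 * modular arithmetic, so its trigger point shifted with the ring
	 * size and became surprising when vq.num was near VHOST_MAX_PEND.
	 */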
> 
> Thanks
> 
> > 
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >   drivers/vhost/net.c | 8 ++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 06d044862e58..99cf99b308a7 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -433,11 +433,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net
> > *net,
> >   
> >   static bool vhost_exceeds_maxpend(struct vhost_net *net)
> >   {
> > +   int num_pends;
> >     struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> >     struct vhost_virtqueue *vq = &nvq->vq;
> >   
> > -   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
> > -   == nvq->done_idx;
> > +   num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
> > +   (nvq->upend_idx - nvq->done_idx) :
> > +   (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx);
> > +
> > +   return num_pends > VHOST_MAX_PEND;
> >   }
> >   
> >   /* Expects to be always run from workqueue - which acts as
> 
> 



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-21 Thread Koichiro Den
On Mon, 2017-08-21 at 20:33 +0800, Jason Wang wrote:
> 
> On 2017-08-19 14:38, Koichiro Den wrote:
> > Facing the possible unbounded delay of relying on freeing on the xmit
> > path, we had also better invoke and clear the upper layer zerocopy
> > callback beforehand, to keep it from waiting for an unbounded duration in
> > vain. For instance, this removes the possible deadlock in the case where
> > the upper layer is a zerocopy-enabled vhost-net.
> > This does not apply if napi_tx is enabled, since the callback will then
> > be invoked in reasonable time.
> > 
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >   drivers/net/virtio_net.c | 8 
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 4302f313d9a7..f7deaa5b7b50 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   
> >     /* Don't wait up for transmitted skbs to be freed. */
> >     if (!use_napi) {
> > +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > +   struct ubuf_info *uarg;
> > +   uarg = skb_shinfo(skb)->destructor_arg;
> > +   if (uarg->callback)
> > +   uarg->callback(uarg, true);
> > +   skb_shinfo(skb)->destructor_arg = NULL;
> > +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > +   }
> >     skb_orphan(skb);
> >     nf_reset(skb);
> >     }
> 
> 
> Interesting, deadlock could be treated as a radical case of the
> discussion here https://patchwork.kernel.org/patch/3787671/.
Thanks for letting me know this one. I am going to read it.
> 
> git grep shows more similar skb_orphan() cases. Do we need to change
> them all (or part)?
Maybe, even though the other cases seem less likely to hit it than the one I
described as an example, i.e. virtio-net sandwiched between vhost-nets.
> 
> Actually, we may meet similar issues at many other places (e.g netem). 
> Need to consider a complete solution for this. Figuring out all places 
> that could delay a packet is a method.
Okay I will do it and post the result and a suggestion if possible. Thanks.
> 
> Thanks



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-21 Thread Koichiro Den
On Sun, 2017-08-20 at 16:49 -0400, Willem de Bruijn wrote:
> On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <d...@klaipeden.com> wrote:
> > Facing the possible unbounded delay of relying on freeing on the xmit
> > path, we had also better invoke and clear the upper layer zerocopy
> > callback beforehand, to keep it from waiting for an unbounded duration in
> > vain.
> 
> Good point.
> 
> > For instance, this removes the possible deadlock in the case where the
> > upper layer is a zerocopy-enabled vhost-net.
> > This does not apply if napi_tx is enabled, since the callback will then
> > be invoked in reasonable time.
> 
> Indeed. Btw, I am gathering data to eventually make napi the default
> mode. But that is taking some time.
I'm looking forward to it. Btw, I am just guessing that somewhat delayed napi
disabling could relieve much of the interrupt cost, though it might need to
be stateful, which sounds a bit invasive. That is to say, not-yet-used ones
are not necessarily going to be used in the near future, so we would have to
limit the delay with some sort of counter. Just guessing as a virtio novice,
so please take it with a grain of salt ;)
> 
> > 
> > Signed-off-by: Koichiro Den <d...@klaipeden.com>
> > ---
> >  drivers/net/virtio_net.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 4302f313d9a7..f7deaa5b7b50 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > 
> > /* Don't wait up for transmitted skbs to be freed. */
> > if (!use_napi) {
> > +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > +   struct ubuf_info *uarg;
> > +   uarg = skb_shinfo(skb)->destructor_arg;
> > +   if (uarg->callback)
> > +   uarg->callback(uarg, true);
> > +   skb_shinfo(skb)->destructor_arg = NULL;
> > +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > +   }
> 
> Instead of open coding, this can use skb_zcopy_clear.
I hadn't noticed it; it seems necessary and sufficient. Please let me repost.
Thank you.
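
For reference, skb_zcopy_clear() in net-next at the time performed, roughly,
the callback-plus-clear sequence open-coded in the patch quoted above; a
paraphrase from memory (not verbatim, flag and helper names may differ):

	/* Paraphrased sketch of the helper Willem suggests: signal the
	 * zerocopy completion, if any, and drop the zerocopy state from
	 * the skb so the callback cannot fire twice.
	 */
	static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy)
	{
		struct ubuf_info *uarg = skb_zcopy(skb);

		if (uarg) {
			uarg->callback(uarg, zerocopy);
			skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
		}
	}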
> 
> > skb_orphan(skb);
> > nf_reset(skb);
> > }
> > --
> > 2.9.4
> > 
> > 

Re: [PATCH net-next] virtio-net: make napi_tx param easier to grasp

2017-08-21 Thread Koichiro Den
On Sun, 2017-08-20 at 16:30 -0400, Willem de Bruijn wrote:
> On Sat, Aug 19, 2017 at 2:37 AM, Koichiro Den <d...@klaipeden.com> wrote:
> > The module param napi_tx need not be writable for now since we do
> > not have any means of activating/deactivating it online,
> 
> A virtio_net device inherits its napi tx mode from the global napi_tx flag
> on device up. It is possible to change the parameter and bring a device
> down/up to change the device mode.
> 
> > @@ -1179,13 +1172,19 @@ static int virtnet_open(struct net_device *dev)
> > struct virtnet_info *vi = netdev_priv(dev);
> > int i;
> > 
> > +   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
> > +* enable the feature if this is likely affine with the transmit path.
> > +*/
> > +   if (!vi->affinity_hint_set)
> > +   napi_tx = false;
> > +
> 
> This disables napi globally if a specific device lacks affinity.
Now I see this is not appropriate, since the flag just represents whether TX
napi is available, not whether it is turned on or off on a particular device.
Thank you.

To be honest, I hoped to make it easy to see which mode a device is currently
running in, but I guess it is not nice to accomplish that with net sysfs or
ethtool or whatever, because it seems not to be a generic enough matter.

Thanks.



[PATCH 2/2] vhost-net: revert vhost_exceeds_maxpend logic to its original

2017-08-19 Thread Koichiro Den
Depending on vq.num and the usage of VHOST_MAX_PEND is not succinct
and in some cases unexpected, so revert its logic part only.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 drivers/vhost/net.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 06d044862e58..99cf99b308a7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -433,11 +433,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 
 static bool vhost_exceeds_maxpend(struct vhost_net *net)
 {
+   int num_pends;
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
 
-   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
-   == nvq->done_idx;
+   num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
+   (nvq->upend_idx - nvq->done_idx) :
+   (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx);
+
+   return num_pends > VHOST_MAX_PEND;
 }
 
 /* Expects to be always run from workqueue - which acts as
-- 
2.9.4
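
A standalone sketch of the reverted arithmetic, using hypothetical index
values (UIO_MAXIOV and VHOST_MAX_PEND as in the kernel sources):

#include <stdio.h>

#define UIO_MAXIOV	1024
#define VHOST_MAX_PEND	128

/* Distance from done_idx to upend_idx in a ring of UIO_MAXIOV slots,
 * i.e. the number of zerocopy buffers still pending completion. */
static int num_pends(int upend_idx, int done_idx)
{
	return upend_idx >= done_idx ?
		upend_idx - done_idx :
		upend_idx + UIO_MAXIOV - done_idx;
}

int main(void)
{
	/* Wrapped case: producer at 4, consumer at 1020 -> 8 pending,
	 * well under VHOST_MAX_PEND, so no throttling. */
	int n = num_pends(4, 1020);

	printf("%d pending, exceeds maxpend: %d\n", n, n > VHOST_MAX_PEND);
	return 0;
}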




[PATCH 1/2] vhost: remove the possible fruitless search on iotlb prefetch

2017-08-19 Thread Koichiro Den
Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e4613a3c362d..93e909afc1c3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1184,7 +1184,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
while (len > s) {
node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
   addr,
-  addr + len - 1);
+  addr + len - s - 1);
if (node == NULL || node->start > addr) {
vhost_iotlb_miss(vq, addr, access);
return false;
-- 
2.9.4
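
To spell out the arithmetic behind the one-liner (my reading, assuming the
surrounding loop advances addr and s together by the size of each
translated node):

/* With orig_addr the start of the whole range, addr == orig_addr + s
 * holds at each iteration, so:
 *
 *   addr + len - s - 1 == orig_addr + len - 1      (the true last byte)
 *   addr + len - 1     == orig_addr + len - 1 + s  (overshoots by s)
 *
 * Worked example with hypothetical values: orig_addr = 0x1000 and
 * len = 0x3000; after 0x1000 bytes have been translated we have
 * addr = 0x2000 and s = 0x1000, so:
 *
 *   old bound: 0x2000 + 0x3000 - 1          = 0x4fff  (0x1000 too far)
 *   new bound: 0x2000 + 0x3000 - 0x1000 - 1 = 0x3fff  (correct)
 *
 * The overshoot is what made the interval tree search possibly
 * fruitless beyond the remaining range. */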




[PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-19 Thread Koichiro Den
Facing the possibly unbounded delay of relying on freeing on the xmit path,
we had better also invoke and clear the upper layer zerocopy callback
beforehand to keep it from waiting in vain for an unbounded duration.
For instance, this removes a possible deadlock in the case that the
upper layer is a zerocopy-enabled vhost-net.
This does not apply if napi_tx is enabled, since the callback will be
invoked in reasonable time.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 drivers/net/virtio_net.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4302f313d9a7..f7deaa5b7b50 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 
/* Don't wait up for transmitted skbs to be freed. */
if (!use_napi) {
+   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
+   struct ubuf_info *uarg;
+   uarg = skb_shinfo(skb)->destructor_arg;
+   if (uarg->callback)
+   uarg->callback(uarg, true);
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   }
skb_orphan(skb);
nf_reset(skb);
}
-- 
2.9.4




[PATCH net-next] virtio-net: make napi_tx param easier to grasp

2017-08-19 Thread Koichiro Den
The module param napi_tx need not be writable for now since we do
not have any means of activating/deactivating it online, which seems to
be a low priority. Also make it apparent when napi_tx has been
dynamically disabled behind the scenes.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 drivers/net/virtio_net.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4302f313d9a7..ea4e7ddcd377 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -37,7 +37,7 @@ module_param(napi_weight, int, 0444);
 static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
-module_param(napi_tx, bool, 0644);
+module_param(napi_tx, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -1026,20 +1026,13 @@ static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
local_bh_enable();
 }
 
-static void virtnet_napi_tx_enable(struct virtnet_info *vi,
-  struct virtqueue *vq,
-  struct napi_struct *napi)
+static void virtnet_napi_tx_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
-   if (!napi->weight)
-   return;
-
-   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
-* enable the feature if this is likely affine with the transmit path.
-*/
-   if (!vi->affinity_hint_set) {
+   if (!napi_tx)
napi->weight = 0;
+
+   if (!napi->weight)
return;
-   }
 
return virtnet_napi_enable(vq, napi);
 }
@@ -1179,13 +1172,19 @@ static int virtnet_open(struct net_device *dev)
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
+   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+* enable the feature if this is likely affine with the transmit path.
+*/
+   if (!vi->affinity_hint_set)
+   napi_tx = false;
+
for (i = 0; i < vi->max_queue_pairs; i++) {
if (i < vi->curr_queue_pairs)
/* Make sure we have some buffers: if oom use wq. */
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
-   virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
+   virtnet_napi_tx_enable(vi->sq[i].vq, &vi->sq[i].napi);
}
 
return 0;
@@ -1890,6 +1889,12 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 
virtio_device_ready(vdev);
 
+   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+* enable the feature if this is likely affine with the transmit path.
+*/
+   if (!vi->affinity_hint_set)
+   napi_tx = false;
+
if (netif_running(vi->dev)) {
for (i = 0; i < vi->curr_queue_pairs; i++)
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
@@ -1897,8 +1902,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 
for (i = 0; i < vi->max_queue_pairs; i++) {
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
-   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
-  &vi->sq[i].napi);
+   virtnet_napi_tx_enable(vi->sq[i].vq, &vi->sq[i].napi);
}
}
 
-- 
2.9.4




Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-08-13 Thread Koichiro Den
Sorry, I mistakenly focused on the NET case; please pass it over. I will make
any relevant suggestions in patch form. Thanks.

On Sun, 2017-08-13 at 23:11 +0900, Koichiro Den wrote:
> Thanks for your comments, Michael and Jason. And I'm sorry about my late
> response.
> To be honest, I am on a summer vacation until next Tuesday.
> 
> I noticed that what I wrote was not sufficient. Regardless of whether the
> caching mechanism exists, the "event" could legitimately be at any point
> outside the latest interval that vhost_notify checks it against, meaning
> that if it is outside the interval we cannot distinguish whether it lags
> behind or leads. And that seems to conform to the spec. Thanks for pointing
> me to the spec. The corner
> case I point out here is:
> (0) event idx feature turned on + TX napi turned off
> -> (1) guest-side TX traffic bursting occurs and delayed callback set
> -> (2) some interruption triggers skb_xmit_done
> -> (3) guest-side modest traffic makes the interval proceed to unbounded
> extent
> without updating "event" since NO_INTERRUPT continues to be set on its shadow
> flag.
> 
> IMHO, if you plan to make TX napi the only choice, doing this sort of
> optimisation beforehand seems likely to be in vain.
> 
> So, if the non-TX-napi case continues to coexist, what I would like to
> suggest is not just the sort of thing in my last email, but something like
> making the maximum staleness of "event" less than or equal to vq.num, and
> then the caching suggestion on top of that. Otherwise, I guess I should not
> repost my last email, since it would make matters too complicated even
> though it would soon be removed once TX napi becomes the only choice.
> 
> Thanks!
> 
> 
> On Wed, 2017-08-09 at 07:37 +0300, Michael S. Tsirkin wrote:
> > On Wed, Aug 09, 2017 at 10:38:10AM +0800, Jason Wang wrote:
> > > I don't think the current code can work well if vq.num is greater than
> > > 2^15, since all cached idx values are u16. This looks like a bug which
> > > needs to be fixed.
> > 
> > That's a limitation of virtio 1.0.
> > 
> > > > * else if the interval of vq.num is [2^15, 2^16):
> > > > the logic in the original patch (809ecb9bca6a9) suffices
> > > > * else (= less than 2^15) (optional):
> > > > checking only (vring_need_event(vq->last_used_event, new + vq->num, new)
> > > > would suffice.
> > > > 
> > > > Am I missing something, or is this irrelevant?
> > 
> > Could you pls repost the suggestion copying virtio-dev mailing list
> > (subscriber only, sorry about that, but host/guest ABI discussions
> > need to copy that list)?
> > 
> > > Looks not, I think this may work. Let me do some test.
> > > 
> > > Thanks
> > 
> > I think that at this point it's prudent to add a feature bit
> > as the virtio spec does not require that the event index never move back.
> > 



Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-08-13 Thread Koichiro Den
Thanks for your comments, Michael and Jason. And I'm sorry about my late response.
To be honest, I am on a summer vacation until next Tuesday.

I noticed that what I wrote was not sufficient. Regardless of whether the
caching mechanism exists, the "event" could legitimately be at any point
outside the latest interval that vhost_notify checks it against, meaning that
if it is outside the interval we cannot distinguish whether it lags behind or
leads. And that seems to conform to the spec. Thanks for pointing me to the
spec. The corner
case I point out here is:
(0) event idx feature turned on + TX napi turned off
-> (1) guest-side TX traffic bursting occurs and delayed callback set
-> (2) some interruption triggers skb_xmit_done
-> (3) guest-side modest traffic makes the interval proceed to unbounded extent
without updating "event" since NO_INTERRUPT continues to be set on its shadow
flag.

IMHO, if you plan to make TX napi the only choice, doing this sort of
optimisation beforehand seems likely to be in vain.

So, if the non-TX-napi case continues to coexist, what I would like to suggest
is not just the sort of thing in my last email, but something like making the
maximum staleness of "event" less than or equal to vq.num, and then the
caching suggestion on top of that. Otherwise, I guess I should not repost my
last email, since it would make matters too complicated even though it would
soon be removed once TX napi becomes the only choice.

Thanks!


On Wed, 2017-08-09 at 07:37 +0300, Michael S. Tsirkin wrote:
> On Wed, Aug 09, 2017 at 10:38:10AM +0800, Jason Wang wrote:
> > I don't think the current code can work well if vq.num is greater than
> > 2^15, since all cached idx values are u16. This looks like a bug which
> > needs to be fixed.
> 
> That's a limitation of virtio 1.0.
> 
> > > * else if the interval of vq.num is [2^15, 2^16):
> > > the logic in the original patch (809ecb9bca6a9) suffices
> > > * else (= less than 2^15) (optional):
> > > checking only (vring_need_event(vq->last_used_event, new + vq->num, new)
> > > would suffice.
> > > 
> > > Am I missing something, or is this irrelevant?
> 
> Could you pls repost the suggestion copying virtio-dev mailing list
> (subscriber only, sorry about that, but host/guest ABI discussions
> need to copy that list)?
> 
> > Looks not, I think this may work. Let me do some test.
> > 
> > Thanks
> 
> I think that at this point it's prudent to add a feature bit
> as the virtio spec does not require that the event index never move back.
> 
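
For reference, a sketch mirroring the vring_need_event() test from
include/uapi/linux/virtio_ring.h that this discussion revolves around (the
staleness remark in the comment is my reading of the thread, not the spec):

#include <stdint.h>

/* Ring indices are free-running u16 counters, so all distances here
 * wrap modulo 2^16.  Once the cached event goes stale by an unbounded
 * amount, as in step (3) above, a large lag becomes indistinguishable
 * from a lead, and the test below can give the wrong answer. */
static inline int vring_need_event(uint16_t event, uint16_t new_idx,
				   uint16_t old)
{
	return (uint16_t)(new_idx - event - 1) < (uint16_t)(new_idx - old);
}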



[PATCH net] xfrm: fix null pointer dereference on state and tmpl sort

2017-08-01 Thread Koichiro Den
Creating a sub policy that matches the same outer flow as the main policy
leads to a null pointer dereference if the outer mode's family is ipv4.
For userspace compatibility, this patch just eliminates the crash, i.e.,
it does not introduce any new sorting rule, which would fruitlessly affect
all but the aforementioned case.

Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/xfrm/xfrm_state.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 6c0956d10db6..a792effdb0b5 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1620,6 +1620,7 @@ int
 xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
   unsigned short family, struct net *net)
 {
+   int i;
int err = 0;
struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
if (!afinfo)
@@ -1628,6 +1629,9 @@ xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
spin_lock_bh(&net->xfrm.xfrm_state_lock); /*FIXME*/
if (afinfo->tmpl_sort)
err = afinfo->tmpl_sort(dst, src, n);
+   else
+   for (i = 0; i < n; i++)
+   dst[i] = src[i];
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
rcu_read_unlock();
return err;
@@ -1638,6 +1642,7 @@ int
 xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
unsigned short family)
 {
+   int i;
int err = 0;
struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
struct net *net = xs_net(*src);
@@ -1648,6 +1653,9 @@ xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
spin_lock_bh(&net->xfrm.xfrm_state_lock);
if (afinfo->state_sort)
err = afinfo->state_sort(dst, src, n);
+   else
+   for (i = 0; i < n; i++)
+   dst[i] = src[i];
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
rcu_read_unlock();
return err;
-- 
2.9.4
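
A toy illustration of the shape of the fix (a standalone sketch, not kernel
code): when the address family provides no sort callback, the output array
must still be populated, because callers read dst unconditionally.

#include <stdio.h>

typedef int (*sort_fn)(int **dst, int **src, int n);

/* Mirrors the fixed xfrm_tmpl_sort()/xfrm_state_sort() shape: fall
 * back to an identity copy when there is no family-specific sort,
 * instead of leaving dst untouched (the source of the NULL deref). */
static int sort_or_copy(int **dst, int **src, int n, sort_fn sort)
{
	int i;

	if (sort)
		return sort(dst, src, n);
	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return 0;
}

int main(void)
{
	int a = 1, b = 2;
	int *src[2] = { &a, &b };
	int *dst[2] = { NULL, NULL };

	sort_or_copy(dst, src, 2, NULL);	/* ipv4-like: no callback */
	printf("%d %d\n", *dst[0], *dst[1]);
	return 0;
}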




[PATCH net 2/2] gue: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDP

2017-07-31 Thread Koichiro Den
In the case that GRO is turned on and the original received packet is
CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last
csum-unnecessary point, which for instance could occur if the packet
comes from another Linux guest on the same Linux host, we have to do
either remcsum_adjust or set up CHECKSUM_PARTIAL again with its
csum_start properly reset, taking RCO into account.

However, since b7fe10e5ebac ("gro: Fix remcsum offload to deal with frags
in GRO"), that barrier can be skipped in such a case when GRO is turned
on; hence we pass over it and the inner L4 validation mistakenly reckons
the packet has a bad csum.

This patch makes remcsum_offload be reset at the same time as the GRO
remcsum cleanup, so that this case works as it did before.

Fixes: b7fe10e5ebac ("gro: Fix remcsum offload to deal with frags in GRO")
Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 net/ipv4/fou.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 8e0257d01200..1540db65241a 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -450,6 +450,7 @@ static struct sk_buff **gue_gro_receive(struct sock *sk,
 out:
NAPI_GRO_CB(skb)->flush |= flush;
skb_gro_remcsum_cleanup(skb, &grc);
+   skb->remcsum_offload = 0;
 
return pp;
 }
-- 
2.9.4




[PATCH net 1/2] vxlan: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDP

2017-07-31 Thread Koichiro Den
In the case that GRO is turned on and the original received packet is
CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last
csum-unnecessary point, which for instance could occur if the packet
comes from another Linux guest on the same Linux host, we have to do
either remcsum_adjust or set up CHECKSUM_PARTIAL again with its
csum_start properly reset, taking RCO into account.

However, since b7fe10e5ebac ("gro: Fix remcsum offload to deal with frags
in GRO"), that barrier can be skipped in such a case when GRO is turned
on; hence we pass over it and the inner L4 validation mistakenly reckons
the packet has a bad csum.

This patch makes remcsum_offload be reset at the same time as the GRO
remcsum cleanup, so that this case works as it did before.

Fixes: b7fe10e5ebac ("gro: Fix remcsum offload to deal with frags in GRO")
Signed-off-by: Koichiro Den <d...@klaipeden.com>
---
 drivers/net/vxlan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 96aa7e6cf214..e17baac70f43 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -623,6 +623,7 @@ static struct sk_buff **vxlan_gro_receive(struct sock *sk,
 
 out:
skb_gro_remcsum_cleanup(skb, &grc);
+   skb->remcsum_offload = 0;
NAPI_GRO_CB(skb)->flush |= flush;
 
return pp;
-- 
2.9.4