Re: [net-next 7/7] net/mlx5e: Improve ethtool private-flags code structure

2018-12-05 Thread Cong Wang
Hello, Saeed

On Tue, Dec 4, 2018 at 10:27 PM Saeed Mahameed  wrote:
>  static int mlx5e_handle_pflag(struct net_device *netdev,
>   u32 wanted_flags,
> - enum mlx5e_priv_flag flag,
> - mlx5e_pflag_handler pflag_handler)
> + enum mlx5e_priv_flag flag)
>  {
...
> if (err) {
> netdev_err(netdev, "%s private flag 0x%x failed err %d\n",
>enable ? "Enable" : "Disable", flag, err);

As flag is now a true enum, you probably don't want to print it out as hex
any more.

Thanks.


Re: [net-next 6/7] net/mlx5e: ethtool, Support user configuration for RX hash fields

2018-12-05 Thread Cong Wang
Hello, Saeed


On Tue, Dec 4, 2018 at 10:26 PM Saeed Mahameed  wrote:
> +static int mlx5e_get_rss_hash_opt(struct mlx5e_priv *priv,
> + struct ethtool_rxnfc *nfc)
...
> +   tt = flow_type_to_traffic_type(nfc->flow_type);
> +
> +   if (tt == MLX5E_NUM_INDIR_TIRS)
> +   return -EINVAL;

The blank line above is unnecessary.

Thanks.


Re: [net-next 4/7] net/mlx5e: Refactor TIR configuration function

2018-12-05 Thread Cong Wang
On Tue, Dec 4, 2018 at 10:26 PM Saeed Mahameed  wrote:
> +static const struct mlx5e_tirc_config
> +tirc_default_config[MLX5E_NUM_INDIR_TIRS] = {

Is it okay to define an array in a header??? No link error???

I must be dumb...


Re: [net-next 7/7] net/mlx5e: Improve ethtool private-flags code structure

2018-12-05 Thread Cong Wang
On Tue, Dec 4, 2018 at 10:27 PM Saeed Mahameed  wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index a429553002a6..49e90ac5dc8b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
...
>  #define MLX5E_SET_PFLAG(params, pflag, enable) \
> do {\
> if (enable) \
> -   (params)->pflags |= (pflag);\
> +   (params)->pflags |= BIT(pflag); \
> else\
> -   (params)->pflags &= ~(pflag);   \
> +   (params)->pflags &= ~(BIT(pflag));  \
> } while (0)
>

Please #include  explicitly.

Thanks.


Re: [PATCH net 2/2] net/mlx4_en: Fix build break when CONFIG_INET is off

2018-12-04 Thread Cong Wang
On Sun, Dec 2, 2018 at 4:37 AM Tariq Toukan  wrote:
>
> From: Saeed Mahameed 
>
> MLX4_EN depends on NETDEVICES, ETHERNET and INET Kconfigs.
> Make sure they are listed in MLX4_EN Kconfig dependencies.
>
> This fixes the following build break:
>
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: ‘struct iphdr’ 
> declared inside parameter list [enabled by default]
> struct iphdr *iph)
> ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only 
> this definition or declaration, which is probably not what you want [enabled 
> by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function ‘get_fixed_ipv4_csum’:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing 
> pointer to incomplete type
> _u8 ipproto = iph->protocol;

I am confused with this build error. It complains about the
compiler doesn't know struct iphdr, but struct iphdr is defined
merely in include/uapi/linux/ip.h, without any dependency with
CONFIG_INET.

How could changing Kconfig fix this?

BTW, the patch itself is probably correct as mlx4 does rely on
INET to function, only the build failure confuses me.

Thanks.


Re: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 9:06 AM Eric Dumazet  wrote:
>
>
>
> On 12/04/2018 08:59 AM, David Laight wrote:
> > Where does 68 come from?
>
> Min IPv4 MTU per RFC791
>

What's wrong with using IPV4_MIN_MTU instead of 68
even in just comment?


Re: [PATCH mlx5-next 4/4] net/mlx5: Move flow counters data structures from flow steering header

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 6:05 PM Saeed Mahameed  wrote:
>
> After the following flow counters API refactoring:
> ("net/mlx5: Use flow counter IDs and not the wrapping cache object")
> flow counters private data structures mlx5_fc_cache and mlx5_fc are
> redundantly exposed in fs_core.h, they have nothing to do with flow
> steering core and they are private to fs_counter.c, this patch moves them
> to where they belong and reduces their exposure in the driver.
>
> Signed-off-by: Saeed Mahameed 
> ---
>  .../net/ethernet/mellanox/mlx5/core/fs_core.h | 23 ---
>  .../ethernet/mellanox/mlx5/core/fs_counters.c | 23 +++

The #include  can be moved together too.


Re: [PATCH mlx5-next 2/4] net/mlx5: Use helper to get CQE opcode

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 6:06 PM Saeed Mahameed  wrote:
>
> +static inline u8 get_cqe_opcode(struct mlx5_cqe64 *cqe)

Make it const please.


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-12-04 Thread Cong Wang
On Tue, Nov 27, 2018 at 10:10 PM Cong Wang  wrote:
>
> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are ususally zero's, which don't affect
> checksum. However, we have a switch which pads non-zero octets, this
> causes kernel hardware checksum fault repeatedly.
>
> Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE 
> are friends"),
> skb checksum is forced to be CHECKSUM_NONE when padding is detected.
> After it, we need to keep skb->csum updated, like what we do for FCS.
>
> The logic is a bit complicated when dealing with both FCS and padding,
> so I wrap it up in a helper function mlx5e_csum_padding().
>
> I tested this patch with RXFCS on and off, it works fine without any
> warning in both cases.
>
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends"),
> Cc: Saeed Mahameed 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

Nacked-by: Cong Wang 

This patch has no value for upstream. Let's discard it. Please don't
use or rework on this patch for any purpose.

Thanks!


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 4:59 PM Saeed Mahameed  wrote:
>
> Cong, you are missing some important details, hardware can only parse a
> handful of protocols IP/TCP/UDP some tunneling like vxlan,GRE, etc..
> but for complicated protocols and new generic tunneling protocols,
> which the hardware wasn't built to understand, the checksum complete
> comes to the rescue.
>
> not only that, imagine you are doing some proprietary tunneling and you
> want to encapsulate the rx traffic and send it back to wire, how would
> you want to recalculate the csum before txing ? manually on the whole
> new packet or use the csum complete and just figure out the differences
> ? i am sure you gonna want to use the checksum complete of the original
> packet.
>
> So again csum complete is not only for validation, it is more powerful
> than that.


OK, so how do you detect this? Clearly not by just checking skb->len,
right?

Let me show you why your arguments are non-sense.

1. From your description of how mlx5 hardware works, the driver logic
must be something like this:

if (hardware can checksum L4)
use CHECKSUM_UNNECESSARY
else
use CHECKSUM_COMPLETE

This obviously makes sense. Does the current code base match this?
No. The current code base only checks for IP/IPv6 and sets
CHECKSUM_COMPLETE (unless SCTP), far from matching what
you are trying to sell.

2. What you and Eric suggest to change to:

if (the frame is short)
use CHECKSUM_UNNECESSARY
else
use CHECKSUM_COMPLETE

Does this match what you describe? No, because short frame does not
necessarily mean hardware can't check its L4 checksum. It only misleads
people to believe so, it helps nothing.


3. What my patch actually does:

if (the frame is short)
use CHECKSUM_NONE
else
use CHECKSUM_COMPLETE

Why this makes sense? Because when skb->csum is not correct,
treating it as none is reasonable. And, checking skb->len doesn't
have indication of whether hardware can checksum L4 checksum
or not. This is _good_ when the rest code remains unchanged,
we have no clue whether hardware has done L4 checksum validation
or not, therefore choose to pass it to software to validate.

Of course you can eventually change all the code to match what
you are describing, but _until_ that happens CHECKSUM_NONE
makes more sense than CHECKSUM_UNNECESSARY here.



>
> >
> > > So i agree with Eric, let's jump to checksum_unnecessary for short
> > > packets.
> >
> > This is odd, if Eric is right, then we should completely get rid of
> > CHECKSUM_COMPLETE. Short packets are not exceptions.
> >
>
> short packets are exception,
> 1. because of the small packet padding issue

Why padding issue makes it CHECKSUM_UNNECESSARY not
CHECKSUM_NONE? Isn't the hardware's capability what makes the
difference?


> 2. because they are short enough not to feel the performance hit of
> recalculating their whole csum (if necessary) .

With CHECKSUM_UNNECESSARY, checksum is not recalculated
or validated at L4.


>
> Again still it is nice to report csum unnecessary for those packets, if
> possible.


Please define "if possible". You are hiding a lot of things behind
"if possible" to favor your own argument.


>
> for large packets csum complete is favorable because of the above
> reasons.


Only if hardware can checksum all large packets.


>
>
> if you don't like checksum complete you have a way to totally disable
> it in mlx5 via ethtool private flags.

Thanks but we are very happy to carry the fix locally.

This is why I stop arguing with this non-sense. Why not we all stop
here and agree to an disagreement? Why are we wasting our time?


Thanks!


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 5:06 PM Saeed Mahameed  wrote:
>
> Hi Cong, sorry to hear that, i will take your patch evaluate and test
> personally, will do the needed changes and post it later.

Please don't. I am withdrawing my SoB too. To avoid any legal
issues, please speak to a legal expert before you taking any action
on my self-Nacked patch.


>
> for now i feel that you don't want csum complete in your servers and we
> already have the tool for that, just do:

Thanks but we are happy to just carry the fix locally.


Re: [Patch net] mlx5: check for malformed packets

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 5:16 PM Saeed Mahameed  wrote:
> Please understand that RX data path is really sensitive and we are
> trying to find the optimal fix of any issue here, sorry for any
> inconvenience.

I am sorry for sending out this patch, I am certain that it wastes
a lot of your time.

The best we can do is to just ignore it.

Thanks!


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 1:16 PM Eric Dumazet  wrote:
>
> Erm I never suggested to get rid of CHECKSUM_COMPLETE...
> My suggestion was to reorder the mlx5 logic to match mlx4 one.
>
> CHECKSUM_COMPLETE is very nice _when_/_if_ the NIC is unable to
> fully dissect a packet and validate L4, as a fallback.

Quote from Eric:

"For native IP+TCP or IP+UDP, the NIC has the ability to fully
understand the packet and fully validate the checksum."

Therefore CHECKSUM_COMPLETE is not nice here.

>
> I am pretty sure for example that IP reassembly can benefit from 
> CHECKSUM_COMPLETE.
> (Although for some reason mlx4 code does not handle IPv6 fragments in its 
> CHECKSUM_COMPLETE path)
>

I am pretty sure mlx5 driver code doesn't check for IP fragments.


My patch is dropped, let's keep the current code as it is and pretend
everything just works fine.

Thanks a lot!


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 11:21 AM Saeed Mahameed
 wrote:
> we are still working on it.

No, I give up.

Sorry for wasting time. Let's save everyone's time by discarding it!! :)


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-04 Thread Cong Wang
On Mon, Dec 3, 2018 at 10:14 PM Cong Wang  wrote:
>
> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are ususally zero's, which don't affect
> checksum. However, we have a switch which pads non-zero octets, this
> causes kernel hardware checksum fault repeatedly.
>
> Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE 
> are friends"),
> skb checksum was forced to be CHECKSUM_NONE when padding is detected.
> After it, we need to keep skb->csum updated, like what we do for RXFCS.
> However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
> headers, it is not worthy the effort as the packets are so small that
> CHECKSUM_COMPLETE can't save anything.
>
> I tested this patch with RXFCS on and off, it works fine without any
> warning in both cases.
>
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends"),
> Cc: Saeed Mahameed 
> Cc: Eric Dumazet 
> Cc: Tariq Toukan 
> Signed-off-by: Cong Wang 

Nacked-by: Cong Wang 

This patch has no value for upstream. Let's discard it.

Thanks!


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 11:17 AM Saeed Mahameed
 wrote:
>
> On Mon, Dec 3, 2018 at 11:52 PM Eric Dumazet  wrote:
> >
> > On Mon, Dec 3, 2018 at 11:30 PM Cong Wang  wrote:
> > >
> > > On Mon, Dec 3, 2018 at 11:08 PM Eric Dumazet  wrote:
> > > >
> > > > The hardware has probably validated the L3 & L4 checksum just fine.
> > > >
> > > > Note that if ip_summed is CHECKSUM_UNNECESSARY, the padding bytes (if 
> > > > any)
> > > > have no impact on the csum that has been verified by the NIC.
> > >
> > >
> > > Why? Why does the hardware validates L3/L4 checksum when it
> > > supplies a full-packet checksum? What's its point here?
> >
> > The point is that the driver author can decide what is best.
> >
> > For native IP+TCP or IP+UDP, the NIC has the ability to fully
> > understand the packet and fully validate the checksum.
>
> Also for Native IP4 and IP6 plain L3 packets.
> The Hardware validates the csum when it can, and always provides
> checksum complete for all packets.
> One of the reason to validate is that sometimes we want to skip
> checksum complete, but still leverage the hw validation,
> like in your patch :), or LRO case, or many other cases in other
> kernels/OSes/drivers.

This sounds wrong to me too.

If the HW already validates it, the software doesn't need to do it,
therefore must skip hw csum for performance gain.,


>
> So i agree with Eric, let's jump to checksum_unnecessary for short packets.

This is odd, if Eric is right, then we should completely get rid of
CHECKSUM_COMPLETE. Short packets are not exceptions.

I still don't understand why people including Eric kept fixing this
thing which could be just removed from the very beginning.
Sounds like nobody even looked into it until my patch.

Thanks.


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-04 Thread Cong Wang
On Mon, Dec 3, 2018 at 11:51 PM Eric Dumazet  wrote:
> > If it really validates L3/L4 checksum, then a full-packet checksum
> > is not needed.
>
> Yes, this is exactly what CHECKSUM_UNNECESSARY means.
> linux stack does not have to perform the check another time.

So you are suggesting to get rid of CHECKSUM_COMPLETE, right?
Like below:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 7cb988640c9c..b9626c8b6b91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -764,32 +764,6 @@ static inline void mlx5e_handle_csum(struct
net_device *netdev,
return;
}

-   if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, >state)))
-   goto csum_unnecessary;
-
-   if (likely(is_last_ethertype_ip(skb, _depth, ))) {
-   if (unlikely(get_ip_proto(skb, network_depth, proto)
== IPPROTO_SCTP))
-   goto csum_unnecessary;
-
-   skb->ip_summed = CHECKSUM_COMPLETE;
-   skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
-   if (network_depth > ETH_HLEN)
-   /* CQE csum is calculated from the IP header and does
-* not cover VLAN headers (if present). This will add
-* the checksum manually.
-*/
-   skb->csum = csum_partial(skb->data + ETH_HLEN,
-network_depth - ETH_HLEN,
-skb->csum);
-   if (unlikely(netdev->features & NETIF_F_RXFCS))
-   skb->csum = csum_block_add(skb->csum,
-  (__force
__wsum)mlx5e_get_fcs(skb),
-  skb->len - ETH_FCS_LEN);
-   stats->csum_complete++;
-   return;
-   }
-
-csum_unnecessary:


>
> For example, no call to csum_partial() is needed, even for IPv6+TCP or 
> IPv6+UDP

Sure, if you mean to get rid of CHECKSUM_COMPLETE.

Remember you fixed CHECKSUM_COMPLETE too when you touch it,
see d48051c5b837 ("net/mlx5e: fix csum adjustments caused by RXFCS").
So why didn't you remove CHECKSUM_COMPLETE but fixed it instead?

Thanks.


Re: [Patch net] mlx5: check for malformed packets

2018-12-04 Thread Cong Wang
On Sat, Dec 1, 2018 at 12:38 PM Cong Wang  wrote:
>
> is_last_ethertype_ip() is used to check IP/IPv6 protocol before
> parsing IP/IPv6 headers.
>
> But __vlan_get_protocol() is only bound to skb->len, a malicious
> packet could exhaust all skb->len by inserting sufficient ETH_P_8021AD
> headers, and it may not even contain an IP/IPv6 header at all, so we
> have to check if we are still safe to continue to parse IP/IPv6 header.
> If not, treat it as non-IP packet.
>
> This should not cause any crash as we stil have tail room in skb,
> but we can't just rely on it either.
>
> Cc: Tariq Toukan 
> Cc: Saeed Mahameed 
> Signed-off-by: Cong Wang 

NAcked-by: Cong Wang 

This patch has no value for upstream. Let's discard it.

Thanks!


Re: [Patch net] mlx5: check for malformed packets

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 11:33 AM Saeed Mahameed
 wrote:
>
> On Sat, Dec 1, 2018 at 12:38 PM Cong Wang  wrote:
> >
> > is_last_ethertype_ip() is used to check IP/IPv6 protocol before
> > parsing IP/IPv6 headers.
> >
> > But __vlan_get_protocol() is only bound to skb->len, a malicious
> > packet could exhaust all skb->len by inserting sufficient ETH_P_8021AD
> > headers, and it may not even contain an IP/IPv6 header at all, so we
> > have to check if we are still safe to continue to parse IP/IPv6 header.
> > If not, treat it as non-IP packet.
> >
> > This should not cause any crash as we stil have tail room in skb,
> > but we can't just rely on it either.
>
> Hi Cong, is this reproducible or just a theory ? which part of the
> code you think will cause the invalid access or crash ?

Since you don't even read into my changelog, here it is:

"This should not cause any crash as we stil have tail room in skb,
but we can't just rely on it either."

As I already explained to you in a private email, when we
reference whatever field in struct iphdr, we have to make sure
the offset of that field is within skb->len.


> do you have steps to reproduce this?
>

Again, you really have to read the changelog I wrote:


"a malicious
packet could exhaust all skb->len by inserting sufficient ETH_P_8021AD
headers, and it may not even contain an IP/IPv6 header at all, "


> I would like to investigate this myself, it will take a couple of days
> if that's ok with you ..

Sure, take your time. I am sending the patch only for showing
the problem, NOT to merge.


Let's discard it anyway. I am wasting my time.

Thanks.


Re: [Patch net] mlx5: check for malformed packets

2018-12-04 Thread Cong Wang
On Tue, Dec 4, 2018 at 11:28 AM Saeed Mahameed
 wrote:
>
> We are against having inline keywords in c file,  so better if you
> move this function to a header file an force the inline keyword there.


Two points:

1. The existing code without my patch has inline keyword. Why
not move it from the beginning?

2. As I already mentioned, inline doesn't work with my patch,
__always_inline does.

Therefore, moving inline to a header doesn't make any difference.


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-03 Thread Cong Wang
On Mon, Dec 3, 2018 at 11:08 PM Eric Dumazet  wrote:
>
> The hardware has probably validated the L3 & L4 checksum just fine.
>
> Note that if ip_summed is CHECKSUM_UNNECESSARY, the padding bytes (if any)
> have no impact on the csum that has been verified by the NIC.


Why? Why does the hardware validates L3/L4 checksum when it
supplies a full-packet checksum? What's its point here?

If it really validates L3/L4 checksum, then a full-packet checksum
is not needed.

If a full-packet checksum is supplied, the software is able to use
it to validate L3/L4 checksum, then the hardware doesn't need to
validate it.

I see no reason it provides both at the same time. If it really does,
then all CHECKSUM_COMPLETE code here could be just removed
and would be faster.

Something must be wrong with your argument.

>
> Sorry I do not get your point.


I don't get your point either.


Thanks.


Re: [Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-03 Thread Cong Wang
On Mon, Dec 3, 2018 at 10:34 PM Eric Dumazet  wrote:
>
> On Mon, Dec 3, 2018 at 10:14 PM Cong Wang  wrote:
> >
> > When an ethernet frame is padded to meet the minimum ethernet frame
> > size, the padding octets are not covered by the hardware checksum.
> > Fortunately the padding octets are ususally zero's, which don't affect
> > checksum. However, we have a switch which pads non-zero octets, this
> > causes kernel hardware checksum fault repeatedly.
> >
> > Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE 
> > are friends"),
> > skb checksum was forced to be CHECKSUM_NONE when padding is detected.
> > After it, we need to keep skb->csum updated, like what we do for RXFCS.
> > However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
> > headers, it is not worthy the effort as the packets are so small that
> > CHECKSUM_COMPLETE can't save anything.
> >
> > I tested this patch with RXFCS on and off, it works fine without any
> > warning in both cases.
> >
> > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> > friends"),
> > Cc: Saeed Mahameed 
> > Cc: Eric Dumazet 
> > Cc: Tariq Toukan 
> > Signed-off-by: Cong Wang 
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 22 ++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 624eed345b5d..1c153b8091da 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -732,6 +732,13 @@ static u8 get_ip_proto(struct sk_buff *skb, int 
> > network_depth, __be16 proto)
> > ((struct ipv6hdr 
> > *)ip_p)->nexthdr;
> >  }
> >
> > +static bool is_short_frame(struct sk_buff *skb, bool has_fcs)
> > +{
> > +   u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
> > +
> > +   return frame_len <= ETH_ZLEN;
> > +}
> > +
> >  static inline void mlx5e_handle_csum(struct net_device *netdev,
> >  struct mlx5_cqe64 *cqe,
> >  struct mlx5e_rq *rq,
> > @@ -755,9 +762,22 @@ static inline void mlx5e_handle_csum(struct net_device 
> > *netdev,
> > goto csum_unnecessary;
> >
> > if (likely(is_last_ethertype_ip(skb, _depth, ))) {
> > +   bool has_fcs = !!(netdev->features & NETIF_F_RXFCS);
> > +
> > if (unlikely(get_ip_proto(skb, network_depth, proto) == 
> > IPPROTO_SCTP))
> > goto csum_unnecessary;
> >
> > +   /* CQE csum doesn't cover padding octets in short ethernet
> > +* frames. And the pad field is appended prior to 
> > calculating
> > +* and appending the FCS field.
> > +*
> > +* Detecting these padded frames requires to verify and 
> > parse
> > +* IP headers, so we simply force all those small frames to 
> > be
> > +* CHECKSUM_NONE even if they are not padded.
> > +*/
> > +   if (unlikely(is_short_frame(skb, has_fcs)))
> > +   goto csum_none;
>
> Should not this go to csum_unnecessary instead ?

I don't see why we don't even want to validate the protocol checksum
here.

Any reason you are suggesting so?


>
> Probably not a big deal, but small UDP frames might hit this code path,
> so ethtool -S would show a lot of csum_none which could confuse mlx5 owners.

Why it is confusing? We intentionally bypass hardware checksum
and let protocol layer validate it.


>
> BTW,
> It looks like mlx5 prefers delivering skbs with CHECKSUM_COMPLETE instead of
> CHECKSUM_UNNECESSARY but at least for ipv6 CHECKSUM_UNNECESSARY
> would be slightly faster, by avoiding  various csum_partial() costs
> when headers are parsed.

Sure, it is certainly faster if you don't want to validate L4 checksum.
The only question is why we don't either validate hardware checksum
or L4 checksum?

Thanks.


[Patch net v3] mlx5: force CHECKSUM_NONE for short ethernet frames

2018-12-03 Thread Cong Wang
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are ususally zero's, which don't affect
checksum. However, we have a switch which pads non-zero octets, this
causes kernel hardware checksum fault repeatedly.

Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
skb checksum was forced to be CHECKSUM_NONE when padding is detected.
After it, we need to keep skb->csum updated, like what we do for RXFCS.
However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
headers, it is not worthy the effort as the packets are so small that
CHECKSUM_COMPLETE can't save anything.

I tested this patch with RXFCS on and off, it works fine without any
warning in both cases.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
Cc: Saeed Mahameed 
Cc: Eric Dumazet 
Cc: Tariq Toukan 
Signed-off-by: Cong Wang 
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 22 ++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 624eed345b5d..1c153b8091da 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,13 @@ static u8 get_ip_proto(struct sk_buff *skb, int 
network_depth, __be16 proto)
((struct ipv6hdr *)ip_p)->nexthdr;
 }
 
+static bool is_short_frame(struct sk_buff *skb, bool has_fcs)
+{
+   u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
+
+   return frame_len <= ETH_ZLEN;
+}
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
 struct mlx5_cqe64 *cqe,
 struct mlx5e_rq *rq,
@@ -755,9 +762,22 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
goto csum_unnecessary;
 
if (likely(is_last_ethertype_ip(skb, _depth, ))) {
+   bool has_fcs = !!(netdev->features & NETIF_F_RXFCS);
+
if (unlikely(get_ip_proto(skb, network_depth, proto) == 
IPPROTO_SCTP))
goto csum_unnecessary;
 
+   /* CQE csum doesn't cover padding octets in short ethernet
+* frames. And the pad field is appended prior to calculating
+* and appending the FCS field.
+*
+* Detecting these padded frames requires to verify and parse
+* IP headers, so we simply force all those small frames to be
+* CHECKSUM_NONE even if they are not padded.
+*/
+   if (unlikely(is_short_frame(skb, has_fcs)))
+   goto csum_none;
+
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
if (network_depth > ETH_HLEN)
@@ -768,7 +788,7 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
skb->csum = csum_partial(skb->data + ETH_HLEN,
 network_depth - ETH_HLEN,
 skb->csum);
-   if (unlikely(netdev->features & NETIF_F_RXFCS))
+   if (unlikely(has_fcs))
skb->csum = csum_block_add(skb->csum,
   (__force 
__wsum)mlx5e_get_fcs(skb),
   skb->len - ETH_FCS_LEN);
-- 
2.19.1



Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-12-03 Thread Cong Wang
On Mon, Dec 3, 2018 at 3:17 PM David Miller  wrote:
>
> Saeed, are you going to take care of this?

David, I will send v3 to switch to CHECKSUM_NONE for these short
frames, which also could avoid parsing IP headers.

Thanks.


Re: [Patch net] mlx5: check for malformed packets

2018-12-03 Thread Cong Wang
On Sat, Dec 1, 2018 at 12:38 PM Cong Wang  wrote:
>
> is_last_ethertype_ip() is used to check IP/IPv6 protocol before
> parsing IP/IPv6 headers.

One thing I noticed while reviewing the assembly code is that
is_last_ethertype_ip() is no longer inlined after this patch.

I think I should keep it inlined by using __always_inline, as it is on
a hot path.

What do you think, Tariq?


Re: [Patch net] mlx5: check for malformed packets

2018-12-03 Thread Cong Wang
On Sun, Dec 2, 2018 at 9:11 PM Cong Wang  wrote:
>
> On Sun, Dec 2, 2018 at 12:56 AM Tariq Toukan  wrote:
> >
> > > + } else if (*proto == htons(ETH_P_IPV6)) {
> >
> > No need for an else here, the first if block always returns.
>
>
> Yeah, but not sure if this makes a difference on the generated
> asm code. I will give it a try anyway.

I just tried, there is no difference on the generated assembly code.

The gcc I use is:

$ gcc --version
gcc (GCC) 8.2.1 20181011 (Red Hat 8.2.1-4)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


Are you okay with the current code now? :)


Re: [Patch net] mlx5: check for malformed packets

2018-12-02 Thread Cong Wang
On Sun, Dec 2, 2018 at 12:56 AM Tariq Toukan  wrote:
>
>
>
> On 01/12/2018 10:38 PM, Cong Wang wrote:
> > + if (*proto == htons(ETH_P_IP)) {
> > + if (unlikely(*network_depth > skb->len - sizeof(struct 
> > iphdr)))
> > + return false;
> > + return true;
>
> Or just do the following?
> return *network_depth <= skb->len - sizeof(struct iphdr));
>
> We'll lose the compiler hint though, so I'm not sure which is better.


It is very important to keep this unlikely(), as it is on a hot path.


>
> > + } else if (*proto == htons(ETH_P_IPV6)) {
>
> No need for an else here, the first if block always returns.


Yeah, but not sure if this makes a difference on the generated
asm code. I will give it a try anyway.

Thanks.


[Patch net] mlx5: check for malformed packets

2018-12-01 Thread Cong Wang
is_last_ethertype_ip() is used to check IP/IPv6 protocol before
parsing IP/IPv6 headers.

But __vlan_get_protocol() is only bound to skb->len, a malicious
packet could exhaust all skb->len by inserting sufficient ETH_P_8021AD
headers, and it may not even contain an IP/IPv6 header at all, so we
have to check if we are still safe to continue to parse IP/IPv6 header.
If not, treat it as non-IP packet.

This should not cause any crash as we stil have tail room in skb,
but we can't just rely on it either.

Cc: Tariq Toukan 
Cc: Saeed Mahameed 
Signed-off-by: Cong Wang 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 624eed345b5d..1e505013ebfd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -693,7 +693,18 @@ static inline bool is_last_ethertype_ip(struct sk_buff 
*skb, int *network_depth,
 {
*proto = ((struct ethhdr *)skb->data)->h_proto;
*proto = __vlan_get_protocol(skb, *proto, network_depth);
-   return (*proto == htons(ETH_P_IP) || *proto == htons(ETH_P_IPV6));
+
+   if (*proto == htons(ETH_P_IP)) {
+   if (unlikely(*network_depth > skb->len - sizeof(struct iphdr)))
+   return false;
+   return true;
+   } else if (*proto == htons(ETH_P_IPV6)) {
+   if (unlikely(*network_depth > skb->len - sizeof(struct 
ipv6hdr)))
+   return false;
+   return true;
+   }
+
+   return false;
 }
 
 static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb)
-- 
2.19.1



Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Cong Wang
On Thu, Nov 29, 2018 at 1:46 PM Eric Dumazet  wrote:
>
>
>
> On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:
>
> > Instead of just checking for the max it might make more sense to do a
> > check using skb_is_gso, and then if true use gso_segs, otherwise just
> > default to 1.
> >
> > Also your bytes are going to be totally messed up as well since the
> > headers are trimmed in the GSO frames. It might be worthwhile to just
> > have a branch based on skb_is_gso that sets the packets and bytes based
> > on the GSO values, and one that sets it for the default case.
> >
> Note that __dev_queue_xmit() is already doing that, no need
> to re-implement in each driver.
>
> It calls qdisc_pkt_len_init(), meaning that drivers can use
> qdisc_skb_cb(skb)->pkt_len instead of skb->len
>
> (Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )

It should not modify, but, on the other hand, it is neither supposed
to be used for non-qdisc layer. At very least, you have to audit
each ->ndo_start_xmit() to see if it uses its own skb->cb
first.


Re: [PATCH net] net/sched: act_police: fix memory leak in case of invalid control action

2018-11-29 Thread Cong Wang
On Wed, Nov 28, 2018 at 9:44 AM Davide Caratti  wrote:
>
> when users set an invalid control action, kmemleak complains as follows:
...
> change tcf_police_init() to avoid leaking 'new' in case TCA_POLICE_RESULT
> contains TC_ACT_GOTO_CHAIN extended action.
>
> Fixes: c08f5ed5d625 ("net/sched: act_police: disallow 'goto chain' on 
> fallback control action")
> Reported-by: Dan Carpenter 
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang
On Wed, Nov 28, 2018 at 7:50 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
> > >
> > > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
> >
> > A quick test shows this is not the case for mlx5.
> >
> > With the trafgen script you gave to me, with tot_len==40, the dest host
> > could receive all the packets. Changing tot_len to 80, tcpdump could no
> > longer see any packet. (Both sender and receiver are mlx5.)
> >
> > So, packets with tot_len > skb->len are clearly dropped before tcpdump
> > could see it, that is likely by mlx5 hardware.
>
> Or a router, or a switch.
>
> Are your two hosts connected back to back ?

Both should be plugged into a same switch. I fail to see why a
switch could parse IP header as the packet is nothing of interest,
like a IGMP snooping.


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang
On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
>
> A NIC is supposed to deliver frames, even the ones that 'seem' bad.

A quick test shows this is not the case for mlx5.

With the trafgen script you gave to me, with tot_len==40, the dest host
could receive all the packets. Changing tot_len to 80, tcpdump could no
longer see any packet. (Both sender and receiver are mlx5.)

So, packets with tot_len > skb->len are clearly dropped before tcpdump
could see it, that is likely by mlx5 hardware.


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang
On Wed, Nov 28, 2018 at 3:57 PM Cong Wang  wrote:
> But again, I kinda feel the hardware already does the sanity check,
> otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
> which parses into TCP header.
>

While we are on this page, mlx5e_lro_update_hdr() incorrectly assumes
TCP header is located right after struct iphdr, which is wrong if we could
have IP options on this path.

It could the hardware which already verified this corner case though.


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang
On Wed, Nov 28, 2018 at 3:50 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 2:16 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 7:00 AM Eric Dumazet  wrote:
> > >
> > > Nice packet of death alert.
> > >
> > > pad_len can be 0xFF67  here, if frame_len is smaller than pad_offset.
> >
> > Unless IP header is malformed, how could it be?
>
> This is totally something an attacker can forge.

Of course, as in the email I sent to mellanox guys,__vlan_get_protocol()
could _literately_ exhaust all skb->len. If no sufficient skb tail room,
we could even possibly crash.

But again, I kinda feel the hardware already does the sanity check,
otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
which parses into TCP header.

Thanks.


[Patch net] mlx5: fix get_ip_proto()

2018-11-28 Thread Cong Wang
IP header is not necessarily located right after struct ethhdr,
there could be multiple 802.1Q headers in between, this is why
we call __vlan_get_protocol().

Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
Cc: Alaa Hleihel 
Cc: Or Gerlitz 
Cc: Saeed Mahameed 
Signed-off-by: Cong Wang 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 9b6bd2b51556..f7c5dbb0ffcd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -724,9 +724,9 @@ static u32 mlx5e_get_fcs(const struct sk_buff *skb)
return __get_unaligned_cpu32(fcs_bytes);
 }
 
-static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
+static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
 {
-   void *ip_p = skb->data + sizeof(struct ethhdr);
+   void *ip_p = skb->data + network_depth;
 
return (proto == htons(ETH_P_IP)) ? ((struct iphdr *)ip_p)->protocol :
((struct ipv6hdr *)ip_p)->nexthdr;
@@ -786,7 +786,7 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
goto csum_unnecessary;
 
if (likely(is_last_ethertype_ip(skb, _depth, ))) {
-   if (unlikely(get_ip_proto(skb, proto) == IPPROTO_SCTP))
+   if (unlikely(get_ip_proto(skb, network_depth, proto) == 
IPPROTO_SCTP))
goto csum_unnecessary;
 
skb->ip_summed = CHECKSUM_COMPLETE;
-- 
2.19.1



Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang
On Wed, Nov 28, 2018 at 7:00 AM Eric Dumazet  wrote:
>
> Nice packet of death alert.
>
> pad_len can be 0xFF67  here, if frame_len is smaller than pad_offset.

Unless IP header is malformed, how could it be?

Speaking of IP header sanity, I am totally aware of it, I don't check it because
I know get_ip_proto() doesn't check either, it must be hardware which verifies
the sanity.


>
> Really I suggest you set ip_summed to CHECKSUM_NONE, then remove the
> initial test ( if (likely(frame_len > ETH_ZLEN)) ...)
>
> Until the firmware is fixed.

Hmm, why setting to CHECKSUM_NONE could get rid of the minimum ethernet
frame check? I am lost, there is no bug for packet > ETH_ZLEN _for me_, what
needs to fix here?

Overall, you keep pushing me to fix a bug I don't observe. I don't understand
why. If you see it, please come up with your own patch? Why do I have to fix
the problem you see??


>
> Otherwise frames with a wrong checksum and some non zero padding could
> potentially
> be seen as correct frames. (Probability of 1/65536)
>
> Do not focus on your immediate problem (small packets being padded by
> a non malicious entity)

Again, why _I_ should fix a problem I never observe? Why is it not you who
fix the problem you find during code review? No to mention I have no environment
to test it even if I really want to fix. I can' take such a risk.

Thanks.


[Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-27 Thread Cong Wang
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are ususally zero's, which don't affect
checksum. However, we have a switch which pads non-zero octets, this
causes kernel hardware checksum fault repeatedly.

Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
skb checksum is forced to be CHECKSUM_NONE when padding is detected.
After it, we need to keep skb->csum updated, like what we do for FCS.

The logic is a bit complicated when dealing with both FCS and padding,
so I wrap it up in a helper function mlx5e_csum_padding().

I tested this patch with RXFCS on and off, it works fine without any
warning in both cases.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
Cc: Saeed Mahameed 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 16985ca3248d..9b6bd2b51556 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,37 @@ static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
((struct ipv6hdr *)ip_p)->nexthdr;
 }
 
+static void mlx5e_csum_padding(struct sk_buff *skb, int network_depth,
+  __be16 proto, bool has_fcs)
+{
+   u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
+   u32 pad_offset, pad_len;
+   void *pad, *ip_p;
+
+   /* We only handle short frames here, and we know they are linear. */
+   if (likely(frame_len > ETH_ZLEN))
+   return;
+
+   ip_p = skb->data + network_depth;
+   if (proto == htons(ETH_P_IP)) {
+   struct iphdr *ipv4 = ip_p;
+
+   pad_offset =  network_depth + be16_to_cpu(ipv4->tot_len);
+   } else { /* Either IPv4 or IPv6 here. */
+   struct ipv6hdr *ipv6 = ip_p;
+
+   pad_offset = network_depth + sizeof(struct ipv6hdr) +
+be16_to_cpu(ipv6->payload_len);
+   }
+   pad_len = frame_len - pad_offset;
+   if (pad_len == 0)
+   return;
+
+   pad = skb->data + pad_offset;
+   skb->csum = csum_block_add(skb->csum, csum_partial(pad, pad_len, 0),
+  pad_offset);
+}
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
 struct mlx5_cqe64 *cqe,
 struct mlx5e_rq *rq,
@@ -772,6 +803,14 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
skb->csum = csum_block_add(skb->csum,
   (__force 
__wsum)mlx5e_get_fcs(skb),
   skb->len - ETH_FCS_LEN);
+
+   /* CQE csum doesn't cover padding octets in short ethernet
+* frames. And the pad field is appended prior to calculating
+* and appending the FCS field.
+*/
+   mlx5e_csum_padding(skb, network_depth, proto,
+  !!(netdev->features & NETIF_F_RXFCS));
+
stats->csum_complete++;
return;
}
-- 
2.19.1



Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
On Tue, Nov 27, 2018 at 5:49 PM Eric Dumazet  wrote:
>
>
> Sure this patch handles short frames, but I am saying the issue is
> more generic than that.

Yeah. Let's fix one problem in one patch.

I will clarify that I am only fixing short frames in v2, as I need to update
it anyway.

Thanks!


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
On Tue, Nov 27, 2018 at 5:25 PM Eric Dumazet  wrote:
>
>
>
> On 11/27/2018 04:07 PM, Cong Wang wrote:
> > On Tue, Nov 27, 2018 at 3:48 PM Eric Dumazet  wrote:
>
> >>
> >> But the padding might be added on normal packets (say 1000 bytes + 3 bytes 
> >> of padding) ?
> >
> > I never see other padding cases than ETH_ZLEN. Does ethernet standard
> > require padding for other cases? I only read the section "3.2.8 Pad field" 
> > in
> > the standard.
> >
>
> Padding can be done by senders, eg using AF_PACKET,
> added at the tail of a regular IP/IP6 frame of 1000 or 6000 bytes.


I tried the trafgen script you provided, it doesn't trigger any checksum fault.
So, it is probably a different case.

>
> Note that mlx5 will presumably set CHECKSUM_UNNECESSARY for standard 
> protocols,
> so if you want to reproduce the issue, you might need to find an IP frame that
> mlx5 is not able to checksum validate.

This warning is 100% reproducible with a TCP RST packet (no data)
here, after I find the right switch which pads non-zero's. This is
also how I verified this patch.

Thanks.


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
On Tue, Nov 27, 2018 at 5:11 PM Saeed Mahameed  wrote:
>
> On Tue, 2018-11-27 at 16:07 -0800, Cong Wang wrote:
> > On Tue, Nov 27, 2018 at 3:48 PM Eric Dumazet 
> > wrote:
> > > The bug here is that mlx5 csum only includes the data in IP frame.
> > >
> > > I would simply force skb->ip_summed to CHECKSUM_NONE if any padding
> > > is detected.
> > >
>
> Totally agree, let's just skip csum completes for very small packets
> ( < ETH_ZLEN ).

That is my first idea in my mind, but you know we still have to parse IP header
to detect packets < ETH_ZLEN. Essentially same work, only saves a few adds
in csum_add(). ;)


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
On Tue, Nov 27, 2018 at 3:48 PM Eric Dumazet  wrote:
>
>
>
> On 11/27/2018 03:21 PM, Cong Wang wrote:
> > When an ethernet frame is padded to meet the minimum ethernet frame
> > size, the padding octets are not covered by the hardware checksum.
> > Fortunately the padding octets are ususally zero's, which don't affect
> > checksum. However, we have a switch which pads non-zero octets, this
> > causes kernel hardware checksum fault repeatedly.
> >
> > Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE 
> > are friends"),
> > skb checksum is forced to be CHECKSUM_NONE when padding is detected.
> > After it, we have to keep skb->csum updated, like what we do for FCS.
> >
> > The logic is a bit complicated when dealing with both FCS and padding,
> > so I wrap it up in a helper function mlx5e_csum_padding().
> >
> > I tested this patch with RXFCS on and off, it works fine without any
> > warning in both cases.
> >
> > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> > friends"),
> > Cc: Saeed Mahameed 
> > Cc: Eric Dumazet 
> > Signed-off-by: Cong Wang 
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 36 +++
> >  1 file changed, 36 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 16985ca3248d..93c18647ca74 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -732,6 +732,35 @@ static u8 get_ip_proto(struct sk_buff *skb, __be16 
> > proto)
> >   ((struct ipv6hdr *)ip_p)->nexthdr;
> >  }
> >
> > +static void mlx5e_csum_padding(struct sk_buff *skb, int network_depth,
> > +__be16 proto, bool has_fcs)
> > +{
> > + u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
> > + void *ip_p = skb->data + network_depth;
> > + u32 pad_offset, pad_len;
> > + void *pad;
> > +
> > + if (likely(frame_len > ETH_ZLEN))
> > + return;
> > +
>
>
> But the padding might be added on normal packets (say 1000 bytes + 3 bytes of 
> padding) ?

I never see other padding cases than ETH_ZLEN. Does ethernet standard
require padding for other cases? I only read the section "3.2.8 Pad field" in
the standard.

>
> The bug here is that mlx5 csum only includes the data in IP frame.
>
> I would simply force skb->ip_summed to CHECKSUM_NONE if any padding is 
> detected.
>
> Otherwise, your patch needs more work when multiple frags are used (ie 
> num_frags > 1 )

You mean using skb_header_pointer()? Yeah if we need to handle cases
larger than ETH_ZLEN.

Thanks.


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
On Tue, Nov 27, 2018 at 3:21 PM Cong Wang  wrote:
> +   if (proto == htons(ETH_P_IP)) {
> +   struct iphdr *ipv4 = ip_p;
> +
> +   pad_offset =  network_depth + be16_to_cpu(ipv4->tot_len);
> +   } else if (proto == htons(ETH_P_IPV6)) {
> +   struct ipv6hdr *ipv6 = ip_p;
> +
> +   pad_offset = network_depth + sizeof(struct ipv6hdr) +
> +be16_to_cpu(ipv6->payload_len);
> +   }
> +

Should return here for other protocols... Will fix this in v2.


[Patch net] mlx5: fixup checksum for ethernet padding

2018-11-27 Thread Cong Wang
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are ususally zero's, which don't affect
checksum. However, we have a switch which pads non-zero octets, this
causes kernel hardware checksum fault repeatedly.

Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
skb checksum is forced to be CHECKSUM_NONE when padding is detected.
After it, we have to keep skb->csum updated, like what we do for FCS.

The logic is a bit complicated when dealing with both FCS and padding,
so I wrap it up in a helper function mlx5e_csum_padding().

I tested this patch with RXFCS on and off, it works fine without any
warning in both cases.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
friends"),
Cc: Saeed Mahameed 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 36 +++
 1 file changed, 36 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 16985ca3248d..93c18647ca74 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,35 @@ static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
((struct ipv6hdr *)ip_p)->nexthdr;
 }
 
+static void mlx5e_csum_padding(struct sk_buff *skb, int network_depth,
+  __be16 proto, bool has_fcs)
+{
+   u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
+   void *ip_p = skb->data + network_depth;
+   u32 pad_offset, pad_len;
+   void *pad;
+
+   if (likely(frame_len > ETH_ZLEN))
+   return;
+
+   if (proto == htons(ETH_P_IP)) {
+   struct iphdr *ipv4 = ip_p;
+
+   pad_offset =  network_depth + be16_to_cpu(ipv4->tot_len);
+   } else if (proto == htons(ETH_P_IPV6)) {
+   struct ipv6hdr *ipv6 = ip_p;
+
+   pad_offset = network_depth + sizeof(struct ipv6hdr) +
+be16_to_cpu(ipv6->payload_len);
+   }
+
+   pad = skb->data + pad_offset;
+   pad_len = frame_len - pad_offset;
+
+   skb->csum = csum_block_add(skb->csum, csum_partial(pad, pad_len, 0),
+  pad_offset);
+}
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
 struct mlx5_cqe64 *cqe,
 struct mlx5e_rq *rq,
@@ -772,6 +801,13 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
skb->csum = csum_block_add(skb->csum,
   (__force 
__wsum)mlx5e_get_fcs(skb),
   skb->len - ETH_FCS_LEN);
+
+   /* CQE csum doesn't cover padding octets. And the padding is
+* appended prior to calculating and appending the FCS field.
+*/
+   mlx5e_csum_padding(skb, network_depth, proto,
+  !!(netdev->features & NETIF_F_RXFCS));
+
stats->csum_complete++;
return;
}
-- 
2.19.1



[Patch net-next] net: explain __skb_checksum_complete() with comments

2018-11-26 Thread Cong Wang
Cc: Herbert Xu 
Signed-off-by: Cong Wang 
---
 net/core/dev.c|  1 +
 net/core/skbuff.c | 18 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 368dc3b49dc0..ee0a4ac0bbb6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5787,6 +5787,7 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
 
/* NAPI_GRO_CB(skb)->csum holds pseudo checksum */
sum = csum_fold(csum_add(NAPI_GRO_CB(skb)->csum, wsum));
+   /* See comments in __skb_checksum_complete(). */
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6f2ea0f0fb75..530097df328f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2637,6 +2637,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, 
int len)
__sum16 sum;
 
sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
+   /* See comments in __skb_checksum_complete(). */
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
@@ -2648,6 +2649,15 @@ __sum16 __skb_checksum_complete_head(struct sk_buff 
*skb, int len)
 }
 EXPORT_SYMBOL(__skb_checksum_complete_head);
 
+/* This function assumes skb->csum already holds pseudo header's checksum,
+ * which has been changed from the hardware checksum, for example, by
+ * __skb_checksum_validate_complete(). And, the original skb->csum must
+ * have been validated unsuccessfully for CHECKSUM_COMPLETE case.
+ *
+ * It returns non-zero if the recomputed checksum is still invalid, otherwise
+ * zero. The new checksum is stored back into skb->csum unless the skb is
+ * shared.
+ */
 __sum16 __skb_checksum_complete(struct sk_buff *skb)
 {
__wsum csum;
@@ -2655,8 +2665,14 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
 
csum = skb_checksum(skb, 0, skb->len, 0);
 
-   /* skb->csum holds pseudo checksum */
sum = csum_fold(csum_add(skb->csum, csum));
+   /* This check is inverted, because we already knew the hardware
+* checksum is invalid before calling this function. So, if the
+* re-computed checksum is valid instead, then we have a mismatch
+* between the original skb->csum and skb_checksum(). This means either
+* the original hardware checksum is incorrect or we screw up skb->csum
+* when moving skb->data around.
+*/
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-- 
2.19.1



Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-22 Thread Cong Wang
On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed  wrote:
>
> On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang 
> > wrote:
> > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <
> > > eric.duma...@gmail.com> wrote:
> > > >
> > > >
> > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > Currently, we only dump a few selected skb fields in
> > > > > netdev_rx_csum_fault(). It is not suffient for debugging
> > > > > checksum
> > > > > fault. This patch introduces skb_dump() which dumps skb mac
> > > > > header,
> > > > > network header and its whole skb->data too.
> > > > >
> > > > > Cc: Herbert Xu 
> > > > > Cc: Eric Dumazet 
> > > > > Cc: David Miller 
> > > > > Signed-off-by: Cong Wang 
> > > > > ---
> > > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET,
> > > > > 16, 1,
> > > > > +skb->data, skb->len, false);
> > > >
> > > > As I mentioned to David, we want all the bytes that were maybe
> > > > already pulled
> > > >
> > > > (skb->head starting point, not skb->data)
> > >
> > > Hmm, with mac header and network header, it is effectively from
> > > skb->head, no?
> > > Is there anything between skb->head and mac header?
> >
> > Oh, I guess we wanted a single hex dump, or we need some user program
> > to be able to
> > rebuild from different memory zones the original CHECKSUM_COMPLETE
> > value.
> >
>
> Normally the driver keeps some headroom @skb->head, so the actual mac
> header starts @ skb->head + driver_specific_headroom

Good to know, but this headroom isn't covered by skb->csum, so
not useful here, right? The skb->csum for mlx5 only covers network
header and its payload.


Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-22 Thread Cong Wang
On Wed, Nov 21, 2018 at 10:26 AM Eric Dumazet  wrote:
>
> On Wed, Nov 21, 2018 at 10:17 AM Cong Wang  wrote:
> >
> > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet  wrote:
> > >
> > >
> > >
> > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > Currently, we only dump a few selected skb fields in
> > > > netdev_rx_csum_fault(). It is not suffient for debugging checksum
> > > > fault. This patch introduces skb_dump() which dumps skb mac header,
> > > > network header and its whole skb->data too.
> > > >
> > > > Cc: Herbert Xu 
> > > > Cc: Eric Dumazet 
> > > > Cc: David Miller 
> > > > Signed-off-by: Cong Wang 
> > > > ---
> > >
> > >
> > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > +skb->data, skb->len, false);
> > >
> > > As I mentioned to David, we want all the bytes that were maybe already 
> > > pulled
> > >
> > > (skb->head starting point, not skb->data)
> >
> > Hmm, with mac header and network header, it is effectively from skb->head, 
> > no?
> > Is there anything between skb->head and mac header?
>
> Oh, I guess we wanted a single hex dump, or we need some user program
> to be able to
> rebuild from different memory zones the original CHECKSUM_COMPLETE value.


Yeah, I can remove the prefix and dump the complete packet as
one single block. This means I also need to check where
skb->data points to.

>
> >
> > >
> > > Also we will miss the trimmed bytes if there were padding data.
> > > And it seems the various bugs we have are all tied to the pulled or 
> > > trimmed bytes.
> > >
> >
> > Unless I miss something, the tailing padding data should be in range
> > [iphdr->tot_len, skb->len]. No?
>
>
> Not after we did the pskb_trim_rcsum() call, since it has effectively
> reduced skb->len by the number of padding bytes.

Sure, this patch can't change where netdev_rx_csum_fault() gets
called. We either need to move the checksum validation earlier,
or move the trimming later, none of them belongs to this patch.

Thanks.


Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-21 Thread Cong Wang
On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet  wrote:
>
>
>
> On 11/20/2018 06:13 PM, Cong Wang wrote:
> > Currently, we only dump a few selected skb fields in
> > netdev_rx_csum_fault(). It is not suffient for debugging checksum
> > fault. This patch introduces skb_dump() which dumps skb mac header,
> > network header and its whole skb->data too.
> >
> > Cc: Herbert Xu 
> > Cc: Eric Dumazet 
> > Cc: David Miller 
> > Signed-off-by: Cong Wang 
> > ---
>
>
> > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > +skb->data, skb->len, false);
>
> As I mentioned to David, we want all the bytes that were maybe already pulled
>
> (skb->head starting point, not skb->data)

Hmm, with mac header and network header, it is effectively from skb->head, no?
Is there anything between skb->head and mac header?


>
> Also we will miss the trimmed bytes if there were padding data.
> And it seems the various bugs we have are all tied to the pulled or trimmed 
> bytes.
>

Unless I miss something, the tailing padding data should be in range
[iphdr->tot_len, skb->len]. No?


Thanks


Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-21 Thread Cong Wang
On Tue, Nov 20, 2018 at 1:19 PM Davide Caratti  wrote:
> @@ -257,25 +261,28 @@ static int tcf_police_act(struct sk_buff *skb, const 
> struct tc_action *a,
> }
>
> now = ktime_get_ns();
> -   toks = min_t(s64, now - p->tcfp_t_c, p->tcfp_burst);
> +   spin_lock_bh(>tcfp_lock);
> +   toks = min_t(s64, now - police->tcfp_t_c, p->tcfp_burst);


I don't think you need to disable BH here.


Re: [PATCH net] net/sched: act_police: add missing spinlock initialization

2018-11-21 Thread Cong Wang
On Wed, Nov 21, 2018 at 9:24 AM Davide Caratti  wrote:
>
> commit f2cbd4852820 ("net/sched: act_police: fix race condition on state
> variables") introduces a new spinlock, but forgets its initialization.
> Ensure that tcf_police_init() initializes 'tcfp_lock' every time a 'police'
> action is newly created, to avoid the following lockdep splat:

Acked-by: Cong Wang 


Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread Cong Wang
On Tue, Nov 20, 2018 at 3:30 PM Eric Dumazet  wrote:
>
> On Tue, Nov 20, 2018 at 3:28 PM David Miller  wrote:
> >
> > Applied.
>
> We need a fix to make lockdep happy, as reported by Cong.
>
> Cong, do you want to handle this ?
>

I hope Davide could send a followup fix and really test it
with LOCKDEP enabled.


[Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-20 Thread Cong Wang
Currently, we only dump a few selected skb fields in
netdev_rx_csum_fault(). It is not suffient for debugging checksum
fault. This patch introduces skb_dump() which dumps skb mac header,
network header and its whole skb->data too.

Cc: Herbert Xu 
Cc: Eric Dumazet 
Cc: David Miller 
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h |  5 +
 net/core/dev.c |  6 +-
 net/core/skbuff.c  | 49 ++
 3 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index afddb5c17ce5..844c0a7ff52f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4218,5 +4218,10 @@ static inline __wsum lco_csum(struct sk_buff *skb)
return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
 }
 
+#ifdef CONFIG_BUG
+void skb_dump(const char *level, const struct sk_buff *skb, bool dump_header,
+ bool dump_mac_header, bool dump_network_header);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index f2bfd2eda7b2..dc54c89fb4b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3097,11 +3097,7 @@ void netdev_rx_csum_fault(struct net_device *dev, struct 
sk_buff *skb)
pr_err("%s: hw csum failure\n", dev ? dev->name : "");
if (dev)
pr_err("dev features: %pNF\n", >features);
-   pr_err("skb len=%u data_len=%u pkt_type=%u gso_size=%u 
gso_type=%u nr_frags=%u ip_summed=%u csum=%x csum_complete_sw=%d csum_valid=%d 
csum_level=%u\n",
-  skb->len, skb->data_len, skb->pkt_type,
-  skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
-  skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
-  skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
+   skb_dump(KERN_ERR, skb, true, true, true);
dump_stack();
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6ba923e7dc7..21aaef3f6a4a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -,3 +,52 @@ void skb_condense(struct sk_buff *skb)
 */
skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 }
+
+#ifdef CONFIG_BUG
+void skb_dump(const char *level, const struct sk_buff *skb, bool dump_header,
+ bool dump_mac_header, bool dump_network_header)
+{
+   struct sk_buff *frag_iter;
+   int i;
+
+   if (dump_header)
+   printk("%sskb len=%u data_len=%u pkt_type=%u gso_size=%u 
gso_type=%u nr_frags=%u ip_summed=%u csum=%x csum_complete_sw=%d csum_valid=%d 
csum_level=%u\n",
+  level, skb->len, skb->data_len, skb->pkt_type,
+  skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
+  skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
+  skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
+
+   if (dump_mac_header && skb_mac_header_was_set(skb))
+   print_hex_dump(level, "mac header: ", DUMP_PREFIX_OFFSET, 16, 1,
+  skb_mac_header(skb), skb_mac_header_len(skb),
+  false);
+
+   if (dump_network_header && skb_network_header_was_set(skb))
+   print_hex_dump(level, "network header: ", DUMP_PREFIX_OFFSET,
+  16, 1, skb_network_header(skb),
+  skb_network_header_len(skb), false);
+
+   print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
+  skb->data, skb->len, false);
+
+   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+   skb_frag_t *frag = _shinfo(skb)->frags[i];
+   u32 p_off, p_len, copied;
+   struct page *p;
+   u8 *vaddr;
+
+   skb_frag_foreach_page(frag, frag->page_offset, 
skb_frag_size(frag),
+ p, p_off, p_len, copied) {
+   vaddr = kmap_atomic(p);
+   print_hex_dump(level, "skb frag: ", DUMP_PREFIX_OFFSET,
+  16, 1, vaddr + p_off, p_len, false);
+   kunmap_atomic(vaddr);
+   }
+   }
+
+   if (skb_has_frag_list(skb))
+   printk("%sskb frags list:\n", level);
+   skb_walk_frags(skb, frag_iter)
+   skb_dump(level, frag_iter, false, false, false);
+}
+#endif
-- 
2.19.1



[Patch net-next 1/2] net: introduce skb_network_header_was_set()

2018-11-20 Thread Cong Wang
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h | 5 +
 net/core/skbuff.c  | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a2e8297a5b00..afddb5c17ce5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2444,6 +2444,11 @@ static inline u32 skb_network_header_len(const struct 
sk_buff *skb)
return skb->transport_header - skb->network_header;
 }
 
+static inline int skb_network_header_was_set(const struct sk_buff *skb)
+{
+   return skb->network_header != (typeof(skb->network_header))~0U;
+}
+
 static inline u32 skb_inner_network_header_len(const struct sk_buff *skb)
 {
return skb->inner_transport_header - skb->inner_network_header;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9a8a72cefe9b..b6ba923e7dc7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -227,6 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
gfp_mask,
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
+   skb->network_header = (typeof(skb->network_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
 
/* make sure we initialize shinfo sequentially */
@@ -292,6 +293,7 @@ struct sk_buff *__build_skb(void *data, unsigned int 
frag_size)
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
+   skb->network_header = (typeof(skb->network_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
 
/* make sure we initialize shinfo sequentially */
-- 
2.19.1



Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread Cong Wang
On Tue, Nov 20, 2018 at 1:19 PM Davide Caratti  wrote:
>
> after 'police' configuration parameters were converted to use RCU instead
> of spinlock, the state variables used to compute the traffic rate (namely
> 'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
> the traffic path without any protection.
>
> Use a dedicated spinlock to avoid race conditions on these variables, and
> ensure proper cache-line alignment. In this way, 'police' is still faster
> than what we observed when 'tcf_lock' was used in the traffic path _ i.e.
> reverting commit 2d550dbad83c ("net/sched: act_police: don't use spinlock
> in the data path"). Moreover, we preserve the throughput improvement that
> was obtained after 'police' started using per-cpu counters, when 'avrate'
> is used instead of 'rate'.
>
> Changes since v1 (thanks to Eric Dumazet):
> - call ktime_get_ns() before acquiring the lock in the traffic path
> - use a dedicated spinlock instead of tcf_lock

Do you initialize this dedicated spinlock?


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-19 Thread Cong Wang
On Fri, Nov 16, 2018 at 12:15 PM Cong Wang  wrote:
>
> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet  wrote:
> >
> > You could use trafgen to cook such a frame and confirm the theory.
> >
> > Something like :
>
> I will try it.

I just tried it, it doesn't make much difference, the warning only
shows up once after I ran the trafgen script for many times,
it could be triggered by other daemons running on the host too.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-19 Thread Cong Wang
On Sun, Nov 18, 2018 at 8:02 PM Herbert Xu  wrote:
>
> On Fri, Nov 16, 2018 at 01:32:50PM -0800, Cong Wang wrote:
> >
> > This is true only when there is a skb_checksum_init*() or
> > skb_checksum_validate*() prior to it, it seems not true for
> > nf_ip_checksum() where skb->csum is correctly set to pesudo header
> > checksum but there is no validation of the original skb->csum.
> > So this check should be still inverted there??
> >
> > Or am I still missing anything here?
>
> What do you mean? My copy of nf_ip_checksum seems to be doing the
> right thing as far as verifying CHECKSUM_COMPLETED goes.

Hmm, it calls csum_tcpudp_magic() directly instead of
__skb_checksum_validate_complete(), but it also sets ip_summed
to CHECKSUM_UNNECESSARY:

 20 if ((protocol == 0 && !csum_fold(skb->csum)) ||
 21 !csum_tcpudp_magic(iph->saddr, iph->daddr,
 22skb->len - dataoff, protocol,
 23skb->csum)) {
 24 skb->ip_summed = CHECKSUM_UNNECESSARY;
 25 break;
 26 }

which means the rx checksum fault won't be triggered.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Fri, Nov 16, 2018 at 12:06 PM Cong Wang  wrote:
>
> Hmm, now I see how it works. Actually it uses the differences between
> these two check's as the difference between hardware checksum with
> skb_checksum().
>

Well...

This is true only when there is a skb_checksum_init*() or
skb_checksum_validate*() prior to it, it seems not true for
nf_ip_checksum() where skb->csum is correctly set to pesudo header
checksum but there is no validation of the original skb->csum.
So this check should be still inverted there??

Or am I still missing anything here?


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet  wrote:
>
> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> case non zero trailer bytes were added by a buggy switch (or host)
>
> Saeed can comment/confirm, but the theory is that the NIC does header 
> analysis and
> computes a checksum only on the bytes of the IP frame, not including the tail 
> bytes
> that were added by a switch.


This theory seems can't explain why Pawel saw this warning so often,
which is beyond the probability of a buggy switch. I don't know.


>
> You could use trafgen to cook such a frame and confirm the theory.
>
> Something like :

I will try it.

Thanks.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:59 PM Herbert Xu  wrote:
>
> On Thu, Nov 15, 2018 at 08:52:23PM -0800, Eric Dumazet wrote:
> >
> > It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> > case non zero trailer bytes were added by a buggy switch (or host)
>
> We should probably change netdev_rx_csum_fault to print out at
> least one complete packet plus the hardware-generated checksum.
>
> That would make debugging these rare hardware faults much easier.

I have a patch as a starter:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7fe50ac83f4319c18ed7c634d85cad16bd0bf509

Let me know if you want to add more information there.

Dumping the hex of an skb data?

Thanks.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu  wrote:
>
> On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
> >
> > > Normally if the hardware's partial checksum is valid then we just
> > > trust it and send the packet along.  However, if the partial
> > > checksum is invalid we don't trust it and we will compute the
> > > whole checksum manually which is what ends up in sum.
> >
> > Not sure if I understand partial checksum here, but it is the
> > CHECKSUM_COMPLETE case which I am trying to fix, not
> > CHECKSUM_PARTIAL.
>
> What I meant by partial checksum is the checksum produced by the
> hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
> CHECKSUM_PARTIAL is the absence of the substantial part of the
> checksum which is something we use in the kernel primarily for TX.
>
> Yes the names are confusing :)

Yeah, understood. The hardware provides skb->csum in this case, but
we keep adjusting it each time when we change skb->data.


>
> > So, in other word, a checksum *match* is the intended to detect
> > this HW RX checksum fault?
>
> Correct.  Or more likely it's probably a bug in either the driver
> or if there are overlaying code such as VLAN then in that code.
>
> Basically if the RX checksum is buggy, it's much more likely to
> cause a valid packet to be rejected than to cause an invalid packet
> to be accepted, because we still verify that checksum against the
> pseudoheader.  So we only attempt to catch buggy hardware/drivers
> by doing a second manual verification for the case where the packet
> is flagged as invalid.

Hmm, now I see how it works. Actually it uses the differences between
these two check's as the difference between hardware checksum with
skb_checksum().

I will send a patch to add a comment there to avoid confusion.


>
> > Sure, my case is nearly same with Pawel's, except I have no vlan:
> > https://marc.info/?l=linux-netdev=154086647601721=2
>
> Can you please provide your backtrace?

I already did:
https://marc.info/?l=linux-netdev=154092211305599=2

Note, the offending commit has been backported to 4.14, which
is why I saw this warning. I have no idea why it is backported
from the beginning, it is just an optimization, doesn't fix any bug,
IMHO.

Also, it is much harder for me to reproduce it than Pawel who
saw the warning every second. Sometimes I need 1 hour to trigger
it, sometimes other people here needs 10+ hours to trigger it.

Let me see if I can add vlan on my side to make it more reproducible,
it seems hard as our switch doesn't use vlan either.

We have warnings with conntrack involved too, I can provide it too
if you are interested.

I tend to revert it for -stable, at least that is what I plan to do
on my side unless there is a fix coming soon.

Thanks.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-15 Thread Cong Wang
On Thu, Nov 15, 2018 at 5:52 PM Herbert Xu  wrote:
>
> On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote:
> > The following evidences indicate this check is likely wrong:
> >
> > 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid 
> > checksum.
> >
> > 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
> >only when it returns non-zero. So non-zero indicates a failure.
> >
> > 3. In __skb_checksum_validate_complete(), we have a nearly same check, where
> >zero is considered as success.
> >
> > 4. csum_fold() already does the one’s complement, this indicates 0 should
> >be considered as a successful validation.
> >
> > 5. We have triggered this fault for many times, but InCsumErrors field in
> >/proc/net/snmp remains 0.
> >
> > Base on the above, non-zero should be used as a checksum mismatch.
> >
> > I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.
> >
> > Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
> > Cc: Herbert Xu 
> > Cc: Tom Herbert 
> > Cc: Eric Dumazet 
> > Signed-off-by: Cong Wang 
> > ---
> >  net/core/datagram.c | 4 ++--
> >  net/core/dev.c  | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 57f3a6fcfc1e..e542a9a212a7 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff 
> > *skb, int len)
> >   __sum16 sum;
> >
> >   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
> > - if (likely(!sum)) {
> > + if (unlikely(sum)) {
> >   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >   !skb->csum_complete_sw)
> >   netdev_rx_csum_fault(skb->dev);
>
> Normally if the hardware's partial checksum is valid then we just
> trust it and send the packet along.  However, if the partial
> checksum is invalid we don't trust it and we will compute the
> whole checksum manually which is what ends up in sum.

Not sure if I understand partial checksum here, but it is the
CHECKSUM_COMPLETE case which I am trying to fix, not
CHECKSUM_PARTIAL.

Or you mean the checksum returned by skb_checksum(), that is,
checksum from skb->data to skb->data+skb->len.

If neither, I am confused.

>
> netdev_rx_csum_fault is meant to warn about the situation where
> a packet with a valid checksum (i.e., sum == 0) was given to us
> by the hardware with a partial checksum that was invalid.
>
> So changing it to sum here is wrong.
>

So, in other word, a checksum *match* is the intended to detect
this HW RX checksum fault?

What has been changed in between skb_checksum_init() and
tcp_checksum_complete() so that the logic is inverted?

Looks like I miss something too obvious to understand the logic. :-/



> Can you give more information as to how you got the warnings with
> mlx5? It sounds like there may be a real bug there because if you
> are getting the warning then it means that a packet with an invalid
> hardware-computed partial checksum passed the manual check and
> was actually valid.  This implies that either the hardware or the
> driver is broken.

Sure, my case is nearly same with Pawel's, except I have no vlan:
https://marc.info/?l=linux-netdev=154086647601721=2

None of us has RXFCS, if you are curious whether Eric's fix works
for us.

There are also a few other reports with conntrack involved:
https://marc.info/?l=linux-netdev=154134983130200=2
https://marc.info/?l=linux-netdev=154070099731902=2

Thanks.


[Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-15 Thread Cong Wang
The following evidences indicate this check is likely wrong:

1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid 
checksum.

2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
   only when it returns non-zero. So non-zero indicates a failure.

3. In __skb_checksum_validate_complete(), we have a nearly same check, where
   zero is considered as success.

4. csum_fold() already does the one’s complement, this indicates 0 should
   be considered as a successful validation.

5. We have triggered this fault for many times, but InCsumErrors field in
   /proc/net/snmp remains 0.

Base on the above, non-zero should be used as a checksum mismatch.

I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.

Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
Cc: Herbert Xu 
Cc: Tom Herbert 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/core/datagram.c | 4 ++--
 net/core/dev.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..e542a9a212a7 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, 
int len)
__sum16 sum;
 
sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
-   if (likely(!sum)) {
+   if (unlikely(sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
netdev_rx_csum_fault(skb->dev);
@@ -753,7 +753,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
 
/* skb->csum holds pseudo checksum */
sum = csum_fold(csum_add(skb->csum, csum));
-   if (likely(!sum)) {
+   if (unlikely(sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
netdev_rx_csum_fault(skb->dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ffcbdd55fa9..c76dee329844 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5776,7 +5776,7 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
 
/* NAPI_GRO_CB(skb)->csum holds pseudo checksum */
sum = csum_fold(csum_add(NAPI_GRO_CB(skb)->csum, wsum));
-   if (likely(!sum)) {
+   if (unlikely(sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
netdev_rx_csum_fault(skb->dev);
-- 
2.19.1



Re: [PATCH net] net/sched: act_pedit: fix memory leak when IDR allocation fails

2018-11-14 Thread Cong Wang
(Cc'ing Jamal)

On Wed, Nov 14, 2018 at 3:26 AM Davide Caratti  wrote:
>
> tcf_idr_check_alloc() can return a negative value, on allocation failures
> (-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.

I think the comments above tcf_idr_check_alloc() need to improve too,
they imply tcf_idr_check_alloc() returns either 0 or 1.

Of course, this can be done with a separated patch.

>
> Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
> Signed-off-by: Davide Caratti 

I think your patch is correct.

Acked-by: Cong Wang 


Re: [Patch net-next v2] net: dump more useful information in netdev_rx_csum_fault()

2018-11-13 Thread Cong Wang
On Tue, Nov 13, 2018 at 10:01 AM Willem de Bruijn
 wrote:
>
> On Mon, Nov 12, 2018 at 2:49 PM Cong Wang  wrote:
> >
> > Currently netdev_rx_csum_fault() only shows a device name,
> > we need more information about the skb for debugging csum
> > failures.
> >
> > Sample output:
> >
> >  ens3: hw csum failure
> >  dev features: 0x00014b89
> >  skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 
> > ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0
>
> Recent issues were protocol dependent, including whether vlan headers
> were present. Perhaps also print skb vlan fields and even the first N
> bytes of data to inspect protocol headers? Also skb_iif, esp. if this
> differs from dev->ifindex.

Pawel's case seems to be vlan related, however, as I mentioned,
my case is neither vlan nor RXFCS related.

Ideally, we should dump the whole packet in order to verify the
correctness of the checksum. :) It is not easy to do so given
how complex an skb is now. This is why I only select a few skb
fields to dump. I am pretty sure this can't cover all cases, you
can always add more for your need in the future.

Thanks.


[Patch net-next] net: remove unused skb_send_sock()

2018-11-12 Thread Cong Wang
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h |  1 -
 net/core/skbuff.c  | 13 -
 2 files changed, 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7dcfb5591dc3..9f5bd97a26bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3327,7 +3327,6 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, 
unsigned int offset,
unsigned int flags);
 int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 int len);
-int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fcb1155a00ec..cfe5a03fbdf3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2366,19 +2366,6 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff 
*skb, int offset,
 }
 EXPORT_SYMBOL_GPL(skb_send_sock_locked);
 
-/* Send skb data on a socket. */
-int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
-{
-   int ret = 0;
-
-   lock_sock(sk);
-   ret = skb_send_sock_locked(sk, skb, offset, len);
-   release_sock(sk);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(skb_send_sock);
-
 /**
  * skb_store_bits - store bits from kernel buffer to skb
  * @skb: destination buffer
-- 
2.19.1



[Patch net-next] net: get rid of __tcp_checksum_complete()

2018-11-12 Thread Cong Wang
__tcp_checksum_complete() is 100% same with __skb_checksum_complete()
and there is no other caller except tcp_checksum_complete().
So, just use __skb_checksum_complete() there.

Signed-off-by: Cong Wang 
---
 include/net/tcp.h | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4743836bed2e..b84b694e8b3d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1315,15 +1315,10 @@ static inline __sum16 tcp_v4_check(int len, __be32 
saddr,
return csum_tcpudp_magic(saddr,daddr,len,IPPROTO_TCP,base);
 }
 
-static inline __sum16 __tcp_checksum_complete(struct sk_buff *skb)
-{
-   return __skb_checksum_complete(skb);
-}
-
 static inline bool tcp_checksum_complete(struct sk_buff *skb)
 {
return !skb_csum_unnecessary(skb) &&
-   __tcp_checksum_complete(skb);
+   __skb_checksum_complete(skb);
 }
 
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb);
-- 
2.19.1



[Patch net-next v2] net: dump more useful information in netdev_rx_csum_fault()

2018-11-12 Thread Cong Wang
Currently netdev_rx_csum_fault() only shows a device name,
we need more information about the skb for debugging csum
failures.

Sample output:

 ens3: hw csum failure
 dev features: 0x00014b89
 skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 ip_summed=0 
csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0

Note, I use pr_err() just to be consistent with the existing one.

Signed-off-by: Cong Wang 
---
 include/linux/netdevice.h |  5 +++--
 net/core/datagram.c   |  2 +-
 net/core/dev.c| 11 +--
 net/core/skbuff.c |  4 ++--
 net/sunrpc/socklib.c  |  2 +-
 5 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 487fa5e0e165..5a97ffbff932 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4342,9 +4342,10 @@ static inline bool 
can_checksum_protocol(netdev_features_t features,
 }
 
 #ifdef CONFIG_BUG
-void netdev_rx_csum_fault(struct net_device *dev);
+void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb);
 #else
-static inline void netdev_rx_csum_fault(struct net_device *dev)
+static inline void netdev_rx_csum_fault(struct net_device *dev,
+   struct sk_buff *skb)
 {
 }
 #endif
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 07983b90d2bd..4bf62b1afa3b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -767,7 +767,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(NULL);
+   netdev_rx_csum_fault(NULL, skb);
}
return 0;
 fault:
diff --git a/net/core/dev.c b/net/core/dev.c
index bf7e0a471186..5927f6a7c301 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3091,10 +3091,17 @@ EXPORT_SYMBOL(__skb_gso_segment);
 
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
-void netdev_rx_csum_fault(struct net_device *dev)
+void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
 {
if (net_ratelimit()) {
pr_err("%s: hw csum failure\n", dev ? dev->name : "");
+   if (dev)
+   pr_err("dev features: %pNF\n", >features);
+   pr_err("skb len=%u data_len=%u pkt_type=%u gso_size=%u 
gso_type=%u nr_frags=%u ip_summed=%u csum=%x csum_complete_sw=%d csum_valid=%d 
csum_level=%u\n",
+  skb->len, skb->data_len, skb->pkt_type,
+  skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
+  skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
+  skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
dump_stack();
}
 }
@@ -5781,7 +5788,7 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
 
NAPI_GRO_CB(skb)->csum = wsum;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 396fcb3baad0..fcb1155a00ec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2653,7 +2653,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, 
int len)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
if (!skb_shared(skb))
skb->csum_valid = !sum;
@@ -2673,7 +2673,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
 
if (!skb_shared(skb)) {
diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index 9062967575c4..7e55cfc69697 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -175,7 +175,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct 
sk_buff *skb)
return -1;
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
return 0;
 no_checksum:
if (xdr_partial_copy_from_skb(xdr, 0, , xdr_skb_read_bits) < 0)
-- 
2.19.1



Re: [Patch net-next] net: dump more useful information in netdev_rx_csum_fault()

2018-11-12 Thread Cong Wang
On Fri, Nov 9, 2018 at 8:16 PM David Miller  wrote:
>
> Didn't you move this function into net/core/skbuff.c? :-)

At the time when I created this patch, that patch didn't hit net-next yet. :)

>
> Please respin.

Sure. I will send v2, as I need to cleanup the mixed commas and
spaces too.

Thanks.


Re: [PATCH] add an initial version of snmp_counter.rst

2018-11-09 Thread Cong Wang
(Cc Randy)

On Fri, Nov 9, 2018 at 10:13 AM yupeng  wrote:
>
> The snmp_counter.rst run a set of simple experiments, explains the
> meaning of snmp counters depend on the experiments' results. This is
> an initial version, only covers a small part of the snmp counters.


I don't look into much details, so just a few high-level reviews:

1. Please try to group those counters by protocol, it would be easier
to search.

2. For many counters you provide a link to RFC, do you just copy
and paste them? Please try to expand.

3. _I think_ you don't need to show, for example, how to run a ping
command. It's safe to assume readers already know this. Therefore,
just explaining those counters is okay.


Thanks.

>
> Signed-off-by: yupeng 
> ---
>  Documentation/networking/index.rst|   1 +
>  Documentation/networking/snmp_counter.rst | 963 ++
>  2 files changed, 964 insertions(+)
>  create mode 100644 Documentation/networking/snmp_counter.rst
>
> diff --git a/Documentation/networking/index.rst 
> b/Documentation/networking/index.rst
> index bd89dae8d578..6a47629ef8ed 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -31,6 +31,7 @@ Contents:
> net_failover
> alias
> bridge
> +   snmp_counter
>
>  .. only::  subproject
>
> diff --git a/Documentation/networking/snmp_counter.rst 
> b/Documentation/networking/snmp_counter.rst
> new file mode 100644
> index ..2939c5acf675
> --- /dev/null
> +++ b/Documentation/networking/snmp_counter.rst
> @@ -0,0 +1,963 @@
> +
> +snmp counter tutorial
> +
> +
> +This document explains the meaning of snmp counters. For understanding
> +their meanings better, this document doesn't explain the counters one
> +by one, but creates a set of experiments, and explains the counters
> +depend on the experiments' results. The experiments are on one or two
> +virtual machines. Except for the test commands we use in the experiments,
> +the virtual machines have no other network traffic. We use the 'nstat'
> +command to get the values of snmp counters, before every test, we run
> +'nstat -n' to update the history, so the 'nstat' output would only
> +show the changes of the snmp counters. For more information about
> +nstat, please refer:
> +
> +http://man7.org/linux/man-pages/man8/nstat.8.html
> +
> +icmp ping
> +
> +
> +Run the ping command against the public dns server 8.8.8.8::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
> +  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
> +
> +  --- 8.8.8.8 ping statistics ---
> +  1 packets transmitted, 1 received, 0% packet loss, time 0ms
> +  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
> +
> +The nstayt result::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives1  0.0
> +  IpInDelivers1  0.0
> +  IpOutRequests   1  0.0
> +  IcmpInMsgs  1  0.0
> +  IcmpInEchoReps  1  0.0
> +  IcmpOutMsgs 1  0.0
> +  IcmpOutEchos1  0.0
> +  IcmpMsgInType0  1  0.0
> +  IcmpMsgOutType8 1  0.0
> +  IpExtInOctets   84 0.0
> +  IpExtOutOctets  84 0.0
> +  IpExtInNoECTPkts1  0.0
> +
> +The nstat output could be divided into two part: one with the 'Ext'
> +keyword, another without the 'Ext' keyword. If the counter name
> +doesn't have 'Ext', it is defined by one of snmp rfc, if it has 'Ext',
> +it is a kernel extent counter. Below we explain them one by one.
> +
> +The rfc defined counters
> +--
> +
> +* IpInReceives
> +The total number of input datagrams received from interfaces,
> +including those received in error.
> +
> +https://tools.ietf.org/html/rfc1213#page-26
> +
> +* IpInDelivers
> +The total number of input datagrams successfully delivered to IP
> +user-protocols (including ICMP).
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IpOutRequests
> +The total number of IP datagrams which local IP user-protocols
> +(including ICMP) supplied to IP in requests for transmission.  Note
> +that this counter does not include any datagrams counted in
> +ipForwDatagrams.
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IcmpInMsgs
> +The total number of ICMP messages which the entity received.  Note
> +that this counter includes all those counted by icmpInErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-41
> +
> +* IcmpInEchoReps
> +The number of ICMP Echo Reply messages received.
> +
> +https://tools.ietf.org/html/rfc1213#page-42
> +
> +* IcmpOutMsgs
> +The total number of ICMP messages which this entity attempted 

Re: [Patch net-next] net: dump more useful information in netdev_rx_csum_fault()

2018-11-09 Thread Cong Wang
On Fri, Nov 9, 2018 at 6:02 PM Yunsheng Lin  wrote:
>
> On 2018/11/10 9:42, Cong Wang wrote:
> > On Fri, Nov 9, 2018 at 5:39 PM Yunsheng Lin  wrote:
> >>
> >> On 2018/11/10 3:43, Cong Wang wrote:
> >>> Currently netdev_rx_csum_fault() only shows a device name,
> >>> we need more information about the skb for debugging.
> >>>
> >>> Sample output:
> >>>
> >>>  ens3: hw csum failure
> >>>  dev features: 0x00014b89
> >>>  skb len=84 data_len=0 gso_size=0 gso_type=0 ip_summed=0 csum=0, 
> >>> csum_complete_sw=0, csum_valid=0
> >>>
> >>> Signed-off-by: Cong Wang 
> >>> ---
> >>>  include/linux/netdevice.h |  5 +++--
> >>>  net/core/datagram.c   |  6 +++---
> >>>  net/core/dev.c| 10 --
> >>>  net/sunrpc/socklib.c  |  2 +-
> >>>  4 files changed, 15 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>> index 857f8abf7b91..fabcd9fa6cf7 100644
> >>> --- a/include/linux/netdevice.h
> >>> +++ b/include/linux/netdevice.h
> >>> @@ -4332,9 +4332,10 @@ static inline bool 
> >>> can_checksum_protocol(netdev_features_t features,
> >>>  }
> >>>
> >>>  #ifdef CONFIG_BUG
> >>> -void netdev_rx_csum_fault(struct net_device *dev);
> >>> +void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb);
> >>>  #else
> >>> -static inline void netdev_rx_csum_fault(struct net_device *dev)
> >>> +static inline void netdev_rx_csum_fault(struct net_device *dev,
> >>> + struct sk_buff *skb)
> >>>  {
> >>>  }
> >>>  #endif
> >>> diff --git a/net/core/datagram.c b/net/core/datagram.c
> >>> index 57f3a6fcfc1e..d8f4d55cd6c5 100644
> >>> --- a/net/core/datagram.c
> >>> +++ b/net/core/datagram.c
> >>> @@ -736,7 +736,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff 
> >>> *skb, int len)
> >>>   if (likely(!sum)) {
> >>>   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >>>   !skb->csum_complete_sw)
> >>> - netdev_rx_csum_fault(skb->dev);
> >>> + netdev_rx_csum_fault(skb->dev, skb);
> >>>   }
> >>>   if (!skb_shared(skb))
> >>>   skb->csum_valid = !sum;
> >>> @@ -756,7 +756,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
> >>>   if (likely(!sum)) {
> >>>   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >>>   !skb->csum_complete_sw)
> >>> - netdev_rx_csum_fault(skb->dev);
> >>> + netdev_rx_csum_fault(skb->dev, skb);
> >>>   }
> >>>
> >>>   if (!skb_shared(skb)) {
> >>> @@ -810,7 +810,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff 
> >>> *skb,
> >>>
> >>>   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >>>   !skb->csum_complete_sw)
> >>> - netdev_rx_csum_fault(NULL);
> >>> + netdev_rx_csum_fault(NULL, skb);
> >>>   }
> >>>   return 0;
> >>>  fault:
> >>> diff --git a/net/core/dev.c b/net/core/dev.c
> >>> index 0ffcbdd55fa9..2b337df26117 100644
> >>> --- a/net/core/dev.c
> >>> +++ b/net/core/dev.c
> >>> @@ -3091,10 +3091,16 @@ EXPORT_SYMBOL(__skb_gso_segment);
> >>>
> >>>  /* Take action when hardware reception checksum errors are detected. */
> >>>  #ifdef CONFIG_BUG
> >>> -void netdev_rx_csum_fault(struct net_device *dev)
> >>> +void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
> >>>  {
> >>>   if (net_ratelimit()) {
> >>>   pr_err("%s: hw csum failure\n", dev ? dev->name : 
> >>> "");
> >>> + if (dev)
> >>> + pr_err("dev features: %pNF\n", >features);
> >>> + pr_err("skb len=%d data_len=%d gso_size=%d gso_type=%d 
> >>> ip_summed=%d csum=%x, csum_complete_sw=%d, csum_valid=%d\n",
> >>> +skb->len, skb->data_len, skb_shinfo(skb)->gso_size,
> >>> +skb_shinfo(skb)->gso_type, skb->ip_summed, skb->csum,
> >>> +skb->csum_complete_sw, skb->csum_valid);
> >>
> >>
> >> This function also have the netdev available, use netdev_err to log the 
> >> error?
> >
> > It is apparently not me who picked pr_err() from the beginning,
> > I just follow that pr_err(). If you are not happy with it, please send
> > a followup.
>
> Yes, but perhaps it is something to improve.


Sure, no one stops you from improving it in a followup patch. :)


> When using the netdev, then maybe it does not have to check if dev is null, 
> because
> netdev_err has handled the netdev being NULL case.
> Maybe I missed something that netdev can not be used here?
> If not, maybe I can send a followup.
>

Maybe. Again, my patch intends to add a few debugging logs,
not to convert pr_err() to whatever else, they are totally different
goals. I choose pr_err() only because I follow the existing one,
not to say which one is better than the other.

Thanks.


Re: [Patch net-next] net: dump more useful information in netdev_rx_csum_fault()

2018-11-09 Thread Cong Wang
On Fri, Nov 9, 2018 at 5:39 PM Yunsheng Lin  wrote:
>
> On 2018/11/10 3:43, Cong Wang wrote:
> > Currently netdev_rx_csum_fault() only shows a device name,
> > we need more information about the skb for debugging.
> >
> > Sample output:
> >
> >  ens3: hw csum failure
> >  dev features: 0x00014b89
> >  skb len=84 data_len=0 gso_size=0 gso_type=0 ip_summed=0 csum=0, 
> > csum_complete_sw=0, csum_valid=0
> >
> > Signed-off-by: Cong Wang 
> > ---
> >  include/linux/netdevice.h |  5 +++--
> >  net/core/datagram.c   |  6 +++---
> >  net/core/dev.c| 10 --
> >  net/sunrpc/socklib.c  |  2 +-
> >  4 files changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 857f8abf7b91..fabcd9fa6cf7 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -4332,9 +4332,10 @@ static inline bool 
> > can_checksum_protocol(netdev_features_t features,
> >  }
> >
> >  #ifdef CONFIG_BUG
> > -void netdev_rx_csum_fault(struct net_device *dev);
> > +void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb);
> >  #else
> > -static inline void netdev_rx_csum_fault(struct net_device *dev)
> > +static inline void netdev_rx_csum_fault(struct net_device *dev,
> > + struct sk_buff *skb)
> >  {
> >  }
> >  #endif
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 57f3a6fcfc1e..d8f4d55cd6c5 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -736,7 +736,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff 
> > *skb, int len)
> >   if (likely(!sum)) {
> >   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >   !skb->csum_complete_sw)
> > - netdev_rx_csum_fault(skb->dev);
> > + netdev_rx_csum_fault(skb->dev, skb);
> >   }
> >   if (!skb_shared(skb))
> >   skb->csum_valid = !sum;
> > @@ -756,7 +756,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
> >   if (likely(!sum)) {
> >   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >   !skb->csum_complete_sw)
> > - netdev_rx_csum_fault(skb->dev);
> > + netdev_rx_csum_fault(skb->dev, skb);
> >   }
> >
> >   if (!skb_shared(skb)) {
> > @@ -810,7 +810,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
> >
> >   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
> >   !skb->csum_complete_sw)
> > - netdev_rx_csum_fault(NULL);
> > + netdev_rx_csum_fault(NULL, skb);
> >   }
> >   return 0;
> >  fault:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ffcbdd55fa9..2b337df26117 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3091,10 +3091,16 @@ EXPORT_SYMBOL(__skb_gso_segment);
> >
> >  /* Take action when hardware reception checksum errors are detected. */
> >  #ifdef CONFIG_BUG
> > -void netdev_rx_csum_fault(struct net_device *dev)
> > +void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
> >  {
> >   if (net_ratelimit()) {
> >   pr_err("%s: hw csum failure\n", dev ? dev->name : 
> > "");
> > + if (dev)
> > + pr_err("dev features: %pNF\n", >features);
> > + pr_err("skb len=%d data_len=%d gso_size=%d gso_type=%d 
> > ip_summed=%d csum=%x, csum_complete_sw=%d, csum_valid=%d\n",
> > +skb->len, skb->data_len, skb_shinfo(skb)->gso_size,
> > +skb_shinfo(skb)->gso_type, skb->ip_summed, skb->csum,
> > +skb->csum_complete_sw, skb->csum_valid);
>
>
> This function also have the netdev available, use netdev_err to log the error?

It is apparently not me who picked pr_err() from the beginning,
I just follow that pr_err(). If you are not happy with it, please send
a followup.


>
> Also, dev->features was dumped before this patch, why remove it?

Seriously? Where do I remove it? Please be specific. :)


[Patch net-next] net: dump more useful information in netdev_rx_csum_fault()

2018-11-09 Thread Cong Wang
Currently netdev_rx_csum_fault() only shows a device name,
we need more information about the skb for debugging.

Sample output:

 ens3: hw csum failure
 dev features: 0x00014b89
 skb len=84 data_len=0 gso_size=0 gso_type=0 ip_summed=0 csum=0, 
csum_complete_sw=0, csum_valid=0

Signed-off-by: Cong Wang 
---
 include/linux/netdevice.h |  5 +++--
 net/core/datagram.c   |  6 +++---
 net/core/dev.c| 10 --
 net/sunrpc/socklib.c  |  2 +-
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8abf7b91..fabcd9fa6cf7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4332,9 +4332,10 @@ static inline bool 
can_checksum_protocol(netdev_features_t features,
 }
 
 #ifdef CONFIG_BUG
-void netdev_rx_csum_fault(struct net_device *dev);
+void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb);
 #else
-static inline void netdev_rx_csum_fault(struct net_device *dev)
+static inline void netdev_rx_csum_fault(struct net_device *dev,
+   struct sk_buff *skb)
 {
 }
 #endif
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..d8f4d55cd6c5 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -736,7 +736,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, 
int len)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
if (!skb_shared(skb))
skb->csum_valid = !sum;
@@ -756,7 +756,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
 
if (!skb_shared(skb)) {
@@ -810,7 +810,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(NULL);
+   netdev_rx_csum_fault(NULL, skb);
}
return 0;
 fault:
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ffcbdd55fa9..2b337df26117 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3091,10 +3091,16 @@ EXPORT_SYMBOL(__skb_gso_segment);
 
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
-void netdev_rx_csum_fault(struct net_device *dev)
+void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
 {
if (net_ratelimit()) {
pr_err("%s: hw csum failure\n", dev ? dev->name : "");
+   if (dev)
+   pr_err("dev features: %pNF\n", >features);
+   pr_err("skb len=%d data_len=%d gso_size=%d gso_type=%d 
ip_summed=%d csum=%x, csum_complete_sw=%d, csum_valid=%d\n",
+  skb->len, skb->data_len, skb_shinfo(skb)->gso_size,
+  skb_shinfo(skb)->gso_type, skb->ip_summed, skb->csum,
+  skb->csum_complete_sw, skb->csum_valid);
dump_stack();
}
 }
@@ -5779,7 +5785,7 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
if (likely(!sum)) {
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
}
 
NAPI_GRO_CB(skb)->csum = wsum;
diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index 9062967575c4..7e55cfc69697 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -175,7 +175,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct 
sk_buff *skb)
return -1;
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
!skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
+   netdev_rx_csum_fault(skb->dev, skb);
return 0;
 no_checksum:
if (xdr_partial_copy_from_skb(xdr, 0, , xdr_skb_read_bits) < 0)
-- 
2.19.1



[Patch net] ip: fix csum_sub() with csum_block_sub()

2018-11-08 Thread Cong Wang
When subtracting the checksum of a block of data,
csum_block_sub() must be used to respect the offset.

We learned this lesson from both commit d55bef5059dd
("net: fix pskb_trim_rcsum_slow() with odd trim offset") and
commit d48051c5b837 ("net/mlx5e: fix csum adjustments caused by RXFCS").

Fixes: ca4ef4574f1e ("ip: fix IP_CHECKSUM handling")
Cc: Paolo Abeni 
Cc: Eric Dumazet 
Cc: Willem de Bruijn 
Signed-off-by: Cong Wang 
---
 net/ipv4/ip_sockglue.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index fffcc130900e..0d69f0823c08 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -122,7 +122,10 @@ static void ip_cmsg_recv_checksum(struct msghdr *msg, 
struct sk_buff *skb,
 
if (offset != 0) {
int tend_off = skb_transport_offset(skb) + tlen;
-   csum = csum_sub(csum, skb_checksum(skb, tend_off, offset, 0));
+
+   csum = csum_block_sub(csum,
+ skb_checksum(skb, tend_off, offset, 0),
+ tend_off);
}
 
put_cmsg(msg, SOL_IP, IP_CHECKSUM, sizeof(__wsum), );
-- 
2.19.1



Re: [Patch net] ip: fix csum_sub() with csum_block_sub()

2018-11-08 Thread Cong Wang
On Thu, Nov 8, 2018 at 2:17 PM Eric Dumazet  wrote:
>
> On Thu, Nov 8, 2018 at 2:04 PM Cong Wang  wrote:
> >
> > When subtracting the checksum of a block of data,
> > csum_block_sub() must be used to respect the offset.
> >
> > We learned this lesson from both commit d55bef5059dd
> > ("net: fix pskb_trim_rcsum_slow() with odd trim offset") and
> > commit d48051c5b837 ("net/mlx5e: fix csum adjustments caused by RXFCS").
> >
> > Fixes: ca4ef4574f1e ("ip: fix IP_CHECKSUM handling")
>
> I do not believe you fix any bug here...
>
> I do not know of any inet protocol having odd header sizes.

That offset is payload offset, but yeah, the payload offset must be
aligned some way. Good point!

Let's drop it.


[Patch net-next v2] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Cong Wang
__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.

Cc: Stefano Brivio 
Signed-off-by: Cong Wang 
---
 net/core/datagram.c | 43 ---
 net/core/skbuff.c   | 43 +++
 2 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..07983b90d2bd 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -728,49 +728,6 @@ static int skb_copy_and_csum_datagram(const struct sk_buff 
*skb, int offset,
return -EFAULT;
 }
 
-__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
-{
-   __sum16 sum;
-
-   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
-   if (likely(!sum)) {
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-   !skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
-   }
-   if (!skb_shared(skb))
-   skb->csum_valid = !sum;
-   return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete_head);
-
-__sum16 __skb_checksum_complete(struct sk_buff *skb)
-{
-   __wsum csum;
-   __sum16 sum;
-
-   csum = skb_checksum(skb, 0, skb->len, 0);
-
-   /* skb->csum holds pseudo checksum */
-   sum = csum_fold(csum_add(skb->csum, csum));
-   if (likely(!sum)) {
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-   !skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
-   }
-
-   if (!skb_shared(skb)) {
-   /* Save full packet checksum */
-   skb->csum = csum;
-   skb->ip_summed = CHECKSUM_COMPLETE;
-   skb->csum_complete_sw = 1;
-   skb->csum_valid = !sum;
-   }
-
-   return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete);
-
 /**
  * skb_copy_and_csum_datagram_msg - Copy and checksum skb to user iovec.
  * @skb: skbuff
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b4ee5c8b928f..5cb4b3440153 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2645,6 +2645,49 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, 
int offset,
 }
 EXPORT_SYMBOL(skb_copy_and_csum_bits);
 
+__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
+{
+   __sum16 sum;
+
+   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
+   if (likely(!sum)) {
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(skb->dev);
+   }
+   if (!skb_shared(skb))
+   skb->csum_valid = !sum;
+   return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete_head);
+
+__sum16 __skb_checksum_complete(struct sk_buff *skb)
+{
+   __wsum csum;
+   __sum16 sum;
+
+   csum = skb_checksum(skb, 0, skb->len, 0);
+
+   /* skb->csum holds pseudo checksum */
+   sum = csum_fold(csum_add(skb->csum, csum));
+   if (likely(!sum)) {
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(skb->dev);
+   }
+
+   if (!skb_shared(skb)) {
+   /* Save full packet checksum */
+   skb->csum = csum;
+   skb->ip_summed = CHECKSUM_COMPLETE;
+   skb->csum_complete_sw = 1;
+   skb->csum_valid = !sum;
+   }
+
+   return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete);
+
 static __wsum warn_crc32c_csum_update(const void *buff, int len, __wsum sum)
 {
net_warn_ratelimited(
-- 
2.19.1



Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Cong Wang
On Thu, Nov 8, 2018 at 12:14 PM Stefano Brivio  wrote:
>
> On Thu, 8 Nov 2018 12:02:00 -0800
> Cong Wang  wrote:
>
> > On Thu, Nov 8, 2018 at 11:54 AM Stefano Brivio  wrote:
> > >
> > > Hi,
> > >
> > > On Thu,  8 Nov 2018 11:49:49 -0800
> > > Cong Wang  wrote:
> > >
> > > > +EXPORT_SYMBOL(__skb_checksum_complete);
> > > > +
> > > >  /* Both of above in one bottle. */
> > >
> > > Maybe you should also update/drop this comment now?
> >
> > I have no idea what that comment means. Do you?
>
> I think it refers to the fact that skb_copy_and_csum_bits() implements
> both skb_checksum() and skb_copy_bits(), that were just above it at
> 1da177e4c3f4.
>
> Then more stuff was moved in between, and the comment was never updated
> or dropped.

I don't want to touch what I don't understand. So, let me just move
these two after skb_copy_and_csum_bits().


Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Cong Wang
On Thu, Nov 8, 2018 at 11:54 AM Stefano Brivio  wrote:
>
> Hi,
>
> On Thu,  8 Nov 2018 11:49:49 -0800
> Cong Wang  wrote:
>
> > +EXPORT_SYMBOL(__skb_checksum_complete);
> > +
> >  /* Both of above in one bottle. */
>
> Maybe you should also update/drop this comment now?

I have no idea what that comment means. Do you?

It dates back to pre-git history.

Thanks.


[Patch net-next] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Cong Wang
__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.

Signed-off-by: Cong Wang 
---
 net/core/datagram.c | 43 ---
 net/core/skbuff.c   | 43 +++
 2 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..07983b90d2bd 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -728,49 +728,6 @@ static int skb_copy_and_csum_datagram(const struct sk_buff 
*skb, int offset,
return -EFAULT;
 }
 
-__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
-{
-   __sum16 sum;
-
-   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
-   if (likely(!sum)) {
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-   !skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
-   }
-   if (!skb_shared(skb))
-   skb->csum_valid = !sum;
-   return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete_head);
-
-__sum16 __skb_checksum_complete(struct sk_buff *skb)
-{
-   __wsum csum;
-   __sum16 sum;
-
-   csum = skb_checksum(skb, 0, skb->len, 0);
-
-   /* skb->csum holds pseudo checksum */
-   sum = csum_fold(csum_add(skb->csum, csum));
-   if (likely(!sum)) {
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-   !skb->csum_complete_sw)
-   netdev_rx_csum_fault(skb->dev);
-   }
-
-   if (!skb_shared(skb)) {
-   /* Save full packet checksum */
-   skb->csum = csum;
-   skb->ip_summed = CHECKSUM_COMPLETE;
-   skb->csum_complete_sw = 1;
-   skb->csum_valid = !sum;
-   }
-
-   return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete);
-
 /**
  * skb_copy_and_csum_datagram_msg - Copy and checksum skb to user iovec.
  * @skb: skbuff
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b4ee5c8b928f..7db6c520ed92 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2559,6 +2559,49 @@ __wsum skb_checksum(const struct sk_buff *skb, int 
offset,
 }
 EXPORT_SYMBOL(skb_checksum);
 
+__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
+{
+   __sum16 sum;
+
+   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
+   if (likely(!sum)) {
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(skb->dev);
+   }
+   if (!skb_shared(skb))
+   skb->csum_valid = !sum;
+   return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete_head);
+
+__sum16 __skb_checksum_complete(struct sk_buff *skb)
+{
+   __wsum csum;
+   __sum16 sum;
+
+   csum = skb_checksum(skb, 0, skb->len, 0);
+
+   /* skb->csum holds pseudo checksum */
+   sum = csum_fold(csum_add(skb->csum, csum));
+   if (likely(!sum)) {
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(skb->dev);
+   }
+
+   if (!skb_shared(skb)) {
+   /* Save full packet checksum */
+   skb->csum = csum;
+   skb->ip_summed = CHECKSUM_COMPLETE;
+   skb->csum_complete_sw = 1;
+   skb->csum_valid = !sum;
+   }
+
+   return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete);
+
 /* Both of above in one bottle. */
 
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
-- 
2.19.1



Re: Latest net-next kernel 4.19.0+

2018-11-08 Thread Cong Wang
On Thu, Nov 1, 2018 at 3:59 PM Paweł Staszewski  wrote:
>
>
>
> W dniu 31.10.2018 o 22:17, Cong Wang pisze:
> > On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed  wrote:
> >> Cong, How often does this happen ? can you some how verify if the
> >> problematic packet has extra end padding after the ip payload ?
> > For us, we need 10+ hours to get one warning. This is also
> > why we never capture the packet that causes this warning.
> >
> >
> >> It would be cool if we had a feature in kernel to store such SKB in
> >> memory when such issue occurs, and let the user dump it later (via
> >> tcpdump) and send the dump to the vendor for debug so we could just
> >> replay and see what happens.
> >>
> > Yeah, the warning kinda sucks, it tells almost nothing, the SKB
> > should be dumped up on this warning.
> >
>
> So another vlan and same hw csum - this time this vlan have less traffic
> so i catch traffic with tcpdump
> Nov  1 23:46:22 kernel: vlan2805: hw csum failure
> but the problem is there is about 1986 frames in that second
> Will tcpdump output helps ?

Looks like you don't have any IP fragments.

Do you try Eric's debugging patch? Does it make a difference?

Also, if doable, can you try to remove vlan from your setup to see if
the warning will be gone?

Thanks!


Re: a propose of snmp counter document

2018-11-08 Thread Cong Wang
On Thu, Nov 8, 2018 at 12:10 AM peng yu  wrote:
>
> I'm planing to write a document which explains the meaning of the
> kernel snmp counters, and combine the explanations with some tests,
> because I found lots of the 'TcpExt' and 'IpExt' counters are not
> explained in any document. Here is a draft:
> https://github.com/yupeng0921/iproute2_learning/blob/master/nstat.md
> It is still on going. I think it might be useful. Besides put it on my
> git repo, could someone have any suggestion about any place I
> could contribute this document to?

Good work! It has been in my todo list for a long time.

I believe we have enough room in Documentation/networking/
for it. You can follow the normal patch submission process to
contribute to upstream:

https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html


Thanks!


[Patch net] net: drop skb on failure in ip_check_defrag()

2018-11-01 Thread Cong Wang
Most callers of pskb_trim_rcsum() simply drop the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments are defrag'ed on success. So, dropping the skb when we
can't defrag it is reasonable.

Note, prior to commit 88078d98d1bb, this is not a big problem as
checksum will be fixed up anyway. After it, the checksum is not
correct on failure.

Found this during code review.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/ipv4/ip_fragment.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9b0158fa431f..d6ee343fdb86 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -722,10 +722,14 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
if (ip_is_fragment()) {
skb = skb_share_check(skb, GFP_ATOMIC);
if (skb) {
-   if (!pskb_may_pull(skb, netoff + iph.ihl * 4))
-   return skb;
-   if (pskb_trim_rcsum(skb, netoff + len))
-   return skb;
+   if (!pskb_may_pull(skb, netoff + iph.ihl * 4)) {
+   kfree_skb(skb);
+   return NULL;
+   }
+   if (pskb_trim_rcsum(skb, netoff + len)) {
+   kfree_skb(skb);
+   return NULL;
+   }
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
if (ip_defrag(net, skb, user))
return NULL;
-- 
2.14.5



Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Cong Wang
On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed  wrote:
>
> Cong, How often does this happen ? can you some how verify if the
> problematic packet has extra end padding after the ip payload ?

For us, we need 10+ hours to get one warning. This is also
why we never capture the packet that causes this warning.


>
> It would be cool if we had a feature in kernel to store such SKB in
> memory when such issue occurs, and let the user dump it later (via
> tcpdump) and send the dump to the vendor for debug so we could just
> replay and see what happens.
>

Yeah, the warning kinda sucks, it tells almost nothing, the SKB
should be dumped up on this warning.


Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 11:42 AM Eric Dumazet  wrote:
>
> On Tue, Oct 30, 2018 at 11:09 AM Cong Wang  wrote:
>
> > At least skb_header_pointer() is marked as __must_check, I don't see
> > you check its return value here.
>
> This can not fail here.
>
> skb->length must be above 14+4 at this point.

Never say it is wrong, just saying what compiler thinks.

>
> My compiler seems to be okay with that.

I wonder how compiler recognizes it as "never fail" when marked with
__must_check.


Re: [Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-30 Thread Cong Wang
On Mon, Oct 29, 2018 at 8:08 PM Eric Dumazet  wrote:
>
>
>
> On 10/29/2018 07:41 PM, Cong Wang wrote:
> > On Mon, Oct 29, 2018 at 7:25 PM Eric Dumazet  wrote:
> >>
> >>
> >>
> >> On 10/29/2018 07:21 PM, Cong Wang wrote:
> >>> On Mon, Oct 29, 2018 at 7:14 PM Eric Dumazet  
> >>> wrote:
> >>>>
> >>>> Would not it be simpler to set ip_summed to CHECKSUM_NONE (no need to 
> >>>> save old_csum) ?
> >>>
> >>> For !CHECKSUM_COMPLETE, ip_summed should be untouched, right?
> >>>
> >>> If you mean only setting to CHECKSUM_NONE for CHECKSUM_COMPLETE case,
> >>> the end result may not be simpler.
> >>
> >> I meant to reinstate what was there before my patch in this error case
> >>
> >>if (skb->ip_summed == CHECKSUM_COMPLETE)
> >>skb->ip_summed = CHECKSUM_NONE;
> >>
> >> That would only be run in error (quite unlikely) path, instead of saving 
> >> old_csum in all cases.
> >
> > I know your point, however, I am not sure that is a desired behavior.
> >
> > On failure, I think the whole skb should be restored to its previous state
> > before entering this function, changing it to CHECKSUM_NONE on failure
> > is inconsistent with success case.
> >
>
> Before my patch, we were changing skb->ip_summed to CHECKSUM_NONE,
> so why suddenly we need to be consistent ?

That is because setting it to CHECKSUM_NONE _was_ how the success
case works and nothing _was_ needed for failure case.

You changed how we handle checksum for success case, it is why we need
to change for the failure case too.


>
> In any case, ip_check_defrag() should really drop this skb, as for other 
> allocation
> failures (like skb_share_check()), if really we want consistency.

I have the same feeling, just not brave enough to change the logic of
ip_check_defrag() where pskb_may_pull() failure is treated in a same way.


Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 12:57 AM Eric Dumazet  wrote:
> -static __be32 mlx5e_get_fcs(struct sk_buff *skb)
> +static u32 mlx5e_get_fcs(const struct sk_buff *skb)
>  {
> -   int last_frag_sz, bytes_in_prev, nr_frags;
> -   u8 *fcs_p1, *fcs_p2;
> -   skb_frag_t *last_frag;
> -   __be32 fcs_bytes;
> +   const void *fcs_bytes;
> +   u32 _fcs_bytes;
>
> -   if (!skb_is_nonlinear(skb))
> -   return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);
> +   fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
> +  ETH_FCS_LEN, &_fcs_bytes);

At least skb_header_pointer() is marked as __must_check, I don't see
you check its return value here.


Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 10:50 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 10:32 AM, Cong Wang wrote:
>
> > Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
> > it much less frequently than Pawel.
> >
> > Also, it is probably not specific to mlx5, as there is another report which
> > is probably a non-mlx5 driver.
>
> Not sure if you provided a stack trace ?

I said it is the same with Pawel's. Here it is anyway:

[ 3731.075989] eth0: hw csum failure
[ 3731.079316] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 4.14.74.x86_64 #1
[ 3731.086703] Hardware name: Wiwynn F4WW/Y 300-0284/F4WW MAIN BOARD,
BIOS F4WWP02 10/19/2018
[ 3731.094961] Call Trace:
[ 3731.097408]  
[ 3731.099432]  dump_stack+0x46/0x59
[ 3731.102751]  __skb_checksum_complete+0xb8/0xd0
[ 3731.107194]  tcp_v4_rcv+0x116/0xa30
[ 3731.110688]  ip_local_deliver_finish+0x5d/0x1f0
[ 3731.115218]  ip_local_deliver+0x6b/0xe0
[ 3731.119056]  ? ip_rcv_finish+0x400/0x400
[ 3731.122973]  ip_rcv+0x287/0x360
[ 3731.126112]  ? inet_del_offload+0x40/0x40
[ 3731.130124]  __netif_receive_skb_core+0x404/0xc10
[ 3731.134831]  ? netif_receive_skb_internal+0x34/0xd0
[ 3731.139709]  netif_receive_skb_internal+0x34/0xd0
[ 3731.144415]  napi_gro_receive+0xb8/0xe0
[ 3731.148271]  mlx5e_handle_rx_cqe_mpwrq+0x4e3/0x7f0 [mlx5_core]
[ 3731.154099]  ? enqueue_entity+0x103/0x7f0
[ 3731.158114]  mlx5e_poll_rx_cq+0xba/0x850 [mlx5_core]
[ 3731.163080]  mlx5e_napi_poll+0x91/0x290 [mlx5_core]
[ 3731.167955]  net_rx_action+0x14a/0x3e0
[ 3731.171707]  ? credit_entropy_bits+0x23d/0x260
[ 3731.176153]  __do_softirq+0xe2/0x2c3
[ 3731.179734]  irq_exit+0xbc/0xd0
[ 3731.182878]  do_IRQ+0x89/0xd0
[ 3731.185851]  common_interrupt+0x7a/0x7a
[ 3731.189690]  
[ 3731.191799] RIP: 0010:cpuidle_enter_state+0xa6/0x2d0
[ 3731.196761] RSP: 0018:bb950c6f7eb0 EFLAGS: 0246 ORIG_RAX:
ff60
[ 3731.204328] RAX: 9fe25fbe14c0 RBX: 0364b57553af RCX: 001f
[ 3731.211459] RDX: 20c49ba5e353f7cf RSI: 68294248f469 RDI: 
[ 3731.218583] RBP: db7d003c3300 R08: c3be R09: 8612
[ 3731.225709] R10: bb950c6f7e98 R11: c3be R12: 0003
[ 3731.232841] R13: 912c9d18 R14:  R15: 0364b396207a
[ 3731.239968]  do_idle+0x166/0x1a0
[ 3731.243199]  cpu_startup_entry+0x6f/0x80
[ 3731.247128]  start_secondary+0x19c/0x1f0
[ 3731.251052]  secondary_startup_64+0xa5/0xb0



>
> Have you tried IPv6 frags maybe ?
>

We have no IPv6 traffic. I asked people to try to generate IPv4 fragment
traffic to see if it would be more reproducible, no progress yet.


Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> >
> >
> > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:
> >>
> >> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
> >>
> >>> Indeed this is a bug. I would expect it to produce frequent errors
> >>> though as many odd-length
> >>> packets would trigger it. Do you have RXFCS? Regardless, how
> >>> frequently do you see the problem?
> >>>
> >> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
> >> CHECKSUM_NONE
> >>
> >> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the 
> >> bug you fixed.
> >>
> >> So we now need to also fix mlx5.
> >>
> >> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
> >> earlier,
> >> plus __get_unaligned_cpu32() as you hinted.
> >>
> >>
> >>
> >>
> >
> > No RXFCS


Same with Pawel, RXFCS is disabled by default.


> >
> > And this trace is rly frequently like once per 3/4 seconds
> > like below:
> > [28965.776864] vlan1490: hw csum failure
>
> Might be vlan related.

Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
it much less frequently than Pawel.

Also, it is probably not specific to mlx5, as there is another report which
is probably a non-mlx5 driver.

Thanks.


Re: 4.19 - tons of hw csum failure errors

2018-10-29 Thread Cong Wang
(Cc'ing Eric)

On Sat, Oct 27, 2018 at 12:47 PM Nikola Ciprich
 wrote:
>
> Hi,
>
> just wanted to report, thet after switching to 4.19 (fro 4.14.x, so maybe
> the problem appeared somewhere between), I'm getting tons of similar
> messages:
>
> Oct 27 09:06:27 xxx kernel: br501: hw csum failure
> Oct 27 09:06:27 xxx kernel: CPU: 8 PID: 0 Comm: swapper/8 Tainted: G  
>   E 4.19.0lb7.00_01_PRE04 #1
> Oct 27 09:06:27 xxx kernel: Hardware name: Supermicro Super Server/X11DDW-NT, 
> BIOS 2.0b 03/07/2018
> Oct 27 09:06:27 xxx kernel: Call Trace:
> Oct 27 09:06:27 xxx kernel:  
> Oct 27 09:06:27 xxx kernel:  dump_stack+0x5a/0x73
> Oct 27 09:06:27 xxx kernel:  __skb_checksum_complete+0xba/0xc0
> Oct 27 09:06:27 xxx kernel:  tcp_error+0x108/0x180 [nf_conntrack]
> Oct 27 09:06:27 xxx kernel:  nf_conntrack_in+0xd2/0x4b0 [nf_conntrack]
> Oct 27 09:06:27 xxx kernel:  ? csum_partial+0xd/0x20
> Oct 27 09:06:27 xxx kernel:  nf_hook_slow+0x3d/0xb0
> Oct 27 09:06:27 xxx kernel:  ip_rcv+0xb5/0xd0
> Oct 27 09:06:27 xxx kernel:  ? ip_rcv_finish_core.isra.12+0x370/0x370
> Oct 27 09:06:27 xxx kernel:  __netif_receive_skb_one_core+0x52/0x70
> Oct 27 09:06:27 xxx kernel:  process_backlog+0xa3/0x150
> Oct 27 09:06:27 xxx kernel:  net_rx_action+0x2af/0x3f0
> Oct 27 09:06:27 xxx kernel:  __do_softirq+0xd1/0x28c
> Oct 27 09:06:27 xxx kernel:  irq_exit+0xde/0xf0
> Oct 27 09:06:27 xxx kernel:  do_IRQ+0x54/0xe0
> Oct 27 09:06:27 xxx kernel:  common_interrupt+0xf/0xf
> Oct 27 09:06:27 xxx kernel:  


We got the same warning (but a different backtrace) with mlx5 driver.

It seems you are using a different driver. Do you have any clue to reproduce
it?

If you do, try to tcpdump the packets triggering this warning, it could
be useful for debugging.

As I explained in other thread, it is likely that commit 88078d98d1bb
introduces more troubles than the one fixed by d55bef5059dd057bd.
You can try to play with these two commits to see if you get the same
conclusion.

BTW, the offending commit has been backported to 4.14 too. ;)

Thanks!


Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Cong Wang
(Adding Eric and Dimitris into Cc)

On Mon, Oct 29, 2018 at 7:27 PM Cong Wang  wrote:
>
> Hi,
>
> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
> wrote:
> >
> > Sorry not complete - followed by hw csum:
> >
> > [  342.190831] vlan1490: hw csum failure
> > [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
> > [  342.190836] Call Trace:
> > [  342.190839]  
> > [  342.190849]  dump_stack+0x46/0x5b
> > [  342.190856]  __skb_checksum_complete+0x9a/0xa0
> > [  342.190859]  tcp_v4_rcv+0xef/0x960
> > [  342.190864]  ip_local_deliver_finish+0x49/0xd0
> > [  342.190866]  ip_local_deliver+0x5e/0xe0
> > [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
> > [  342.190870]  ip_rcv+0x41/0xc0
> > [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
> > [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
> > [  342.190879]  napi_gro_receive+0xb7/0xe0
> > [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
> > [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
> > [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>
>
> We got exactly the same backtrace in our data center. However,
> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>
> If you do, try to tcpdump the packets triggering this warning, it could
> be useful for debugging.
>
> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
> occurs. We tried to revert the offending commit 88078d98d1bb, it
> disappears. So it is likely that commit 88078d98d1bb introduces
> more troubles than the one fixed by d55bef5059dd057bd.


Re: [Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-29 Thread Cong Wang
On Mon, Oct 29, 2018 at 7:25 PM Eric Dumazet  wrote:
>
>
>
> On 10/29/2018 07:21 PM, Cong Wang wrote:
> > On Mon, Oct 29, 2018 at 7:14 PM Eric Dumazet  wrote:
> >>
> >> Would not it be simpler to set ip_summed to CHECKSUM_NONE (no need to save 
> >> old_csum) ?
> >
> > For !CHECKSUM_COMPLETE, ip_summed should be untouched, right?
> >
> > If you mean only setting to CHECKSUM_NONE for CHECKSUM_COMPLETE case,
> > the end result may not be simpler.
>
> I meant to reinstate what was there before my patch in this error case
>
>if (skb->ip_summed == CHECKSUM_COMPLETE)
>skb->ip_summed = CHECKSUM_NONE;
>
> That would only be run in error (quite unlikely) path, instead of saving 
> old_csum in all cases.

I know your point, however, I am not sure that is a desired behavior.

On failure, I think the whole skb should be restored to its previous state
before entering this function, changing it to CHECKSUM_NONE on failure
is inconsistent with success case.


Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Cong Wang
Hi,

On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  wrote:
>
> Sorry not complete - followed by hw csum:
>
> [  342.190831] vlan1490: hw csum failure
> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
> [  342.190836] Call Trace:
> [  342.190839]  
> [  342.190849]  dump_stack+0x46/0x5b
> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
> [  342.190859]  tcp_v4_rcv+0xef/0x960
> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
> [  342.190866]  ip_local_deliver+0x5e/0xe0
> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
> [  342.190870]  ip_rcv+0x41/0xc0
> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
> [  342.190879]  napi_gro_receive+0xb7/0xe0
> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
> [  342.190888]  mlx5e_napi_poll+0xab/0xc90


We got exactly the same backtrace in our data center. However,
it is not easy for us to reproduce it, do you have any clue to reproduce it?

If you do, try to tcpdump the packets triggering this warning, it could
be useful for debugging.

Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
occurs. We tried to revert the offending commit 88078d98d1bb, it
disappears. So it is likely that commit 88078d98d1bb introduces
more troubles than the one fixed by d55bef5059dd057bd.


Re: [Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-29 Thread Cong Wang
On Mon, Oct 29, 2018 at 7:14 PM Eric Dumazet  wrote:
>
> Would not it be simpler to set ip_summed to CHECKSUM_NONE (no need to save 
> old_csum) ?

For !CHECKSUM_COMPLETE, ip_summed should be untouched, right?

If you mean only setting to CHECKSUM_NONE for CHECKSUM_COMPLETE case,
the end result may not be simpler.


[Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-29 Thread Cong Wang
Most callers of pskb_trim_rcsum() simply drops the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. In that case, we should restore its previous
csum if __pskb_trim() fails.

Found this during code review.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/core/skbuff.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 946de0e24c87..5decd6e6d2b6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1843,6 +1843,9 @@ EXPORT_SYMBOL(___pskb_trim);
  */
 int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
 {
+   __wsum old_csum = skb->csum;
+   int ret;
+
if (skb->ip_summed == CHECKSUM_COMPLETE) {
int delta = skb->len - len;
 
@@ -1850,7 +1853,10 @@ int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned 
int len)
   skb_checksum(skb, len, delta, 0),
   len);
}
-   return __pskb_trim(skb, len);
+   ret = __pskb_trim(skb, len);
+   if (unlikely(ret))
+   skb->csum = old_csum;
+   return ret;
 }
 EXPORT_SYMBOL(pskb_trim_rcsum_slow);
 
-- 
2.16.4



Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy

2018-10-26 Thread Cong Wang
On Fri, Oct 26, 2018 at 4:35 AM Marco Berizzi  wrote:
> Apologies for bothering you again.
> I applied your patch to 4.19, but after issuing this
> command:
>
> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> root@Calimero:~# ping 10.81.104.1
> PING 10.81.104.1 (10.81.104.1) 56(84) bytes of data.
> ^C
> --- 10.81.104.1 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 1001ms
>
> I'm losing ipv4 connectivity.
> If I remove the qdisc everything is going to work again:

Did this really work before?

You specify a default class without adding it, so the packets are dropped.

How would you expect this to work? :)


Re: [PATCH net 2/4] net/sched: act_police: disallow 'goto chain' on fallback control action

2018-10-22 Thread Cong Wang
On Sat, Oct 20, 2018 at 2:33 PM Davide Caratti  wrote:
>
> in the following command:
>
>  # tc action add action police rate  burst  conform-exceed /
>
> 'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
> crash with NULL pointer dereference, since TC core doesn't initialize the
> chain handle.
>
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


Re: [PATCH net 1/4] net/sched: act_gact: disallow 'goto chain' on fallback control action

2018-10-22 Thread Cong Wang
On Sat, Oct 20, 2018 at 2:33 PM Davide Caratti  wrote:
>
> in the following command:
>
>  # tc action add action  random   
>
> 'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
> crash with NULL pointer dereference, since TC core doesn't initialize the
> chain handle.
>
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


Re: [PATCH net v2] net/sched: act_gact: properly init 'goto chain'

2018-10-19 Thread Cong Wang
On Thu, Oct 18, 2018 at 8:30 AM Davide Caratti  wrote:
> The alternative is, we systematically forbid usage of 'goto chain' in
> tcfg_paction, so that:
>
> # tc f a dev v0 egress matchall action  random determ goto chain 4 5
>
> is systematically rejected with -EINVAL. This comand never worked, so we
> are not breaking anything in userspace.

This is exactly why I asked you if we really need to support it. :)

If no one finds it useful, disallowing it is a good solution here, as
we don't need
to introduce any additional code to handle filter chains.


Re: bond: take rcu lock in netpoll_send_skb_on_dev

2018-10-17 Thread Cong Wang
On Mon, Oct 15, 2018 at 4:36 AM Eran Ben Elisha  wrote:
> Hi,
>
> This suggested fix introduced a regression while using netconsole module
> with mlx5_core module loaded.

It is already reported here:
https://marc.info/?l=linux-kernel=153917359528669=2


>
> During irq handling, we hit a warning that this rcu_read_lock_bh cannot
> be taken inside an IRQ.

Yes, I mentioned the same even before this patch was sent out:
https://marc.info/?l=linux-netdev=153816136624679=2

Thanks.


Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-10-17 Thread Cong Wang
On Wed, Oct 17, 2018 at 9:42 PM David Miller  wrote:
>
> From: Amritha Nambiar 
> Date: Fri, 12 Oct 2018 06:53:30 -0700
>
> > Added support in tc flower for filtering based on port ranges.
> > This is a rework of the RFC patch at:
> > https://patchwork.ozlabs.org/patch/969595/
>
> You never addressed Cong's feedback asking you to explain why this
> can't be simply built using existing generic filtering facilities that
> exist already.
>
> I appreciate that you addressed Jiri's feedback, but Cong's feedback is
> just as, if not more, important.
>

My objection is against introducing a new filter just for port range, now
it is built on top of flower filter, so it is much better now.

u32 filter can do the nearly same, but requires a power-of-two, so it is
not completely duplicated.

Therefore, I think the idea of building it on top of flower is fine. But I don't
read into any code, only the description.

Thanks!


Re: [PATCH net] net/sched: properly init chain in case of multiple control actions

2018-10-17 Thread Cong Wang
On Tue, Oct 16, 2018 at 10:38 AM Davide Caratti  wrote:
>
> On Mon, 2018-10-15 at 11:31 -0700, Cong Wang wrote:
> > On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti  wrote:
> > >
> > > On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > > > Why not just validate the fallback action in each action init()?
> > > > For example, checking tcfg_paction in tcf_gact_init().
> > > >
> > > > I don't see the need of making it generic.
> ...
> > > A (legal?) trick  is to let tcf_action store the fallback action when it
> > > contains a 'goto chain' command, I just posted a proposal for gact. If you
> > > think it's ok, I will test and post the same for act_police.
> >
> > Do we really need to support TC_ACT_GOTO_CHAIN for
> > gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
> > completeness?
> >
> > IF we don't need to support it, we can just make it invalid without needing
> > to initialize it in ->init() at all.
> >
> > If we do, however, we really need to move it into each ->init(), because
> > we have to lock each action if we are modifying an existing one. With
> > your patch, tcf_action_goto_chain_init() is still called without the 
> > per-action
> > lock.
> >
> > What's more, if we support two different actions in gact, that is, 
> > tcfg_paction
> > and tcf_action, how could you still only have one a->goto_chain pointer?
> > There should be two pointers for each of them. :)
>
> whatever fixes the NULL dereference is OK for me.
> I thought that the proposal made with
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg251933.html
>
> (i.e., letting init() copy tcfg_paction to tcf_action in case it contained
> 'goto chain x') was smart enough to preserve the current behavior, and
> also let 'goto chain' work in case it was configured  *only* for the
> fallback action.
> When the action is modified, the change to tcfg_paction is done with the
> same spinlock as tcf_action, so I didn't notice anything worse than the
> current locking layout.
>
> (well, after some more thinking I looked again at that patch and yes, it
> lacked the most important thing:)

Hmm, as I said, I am not sure if the logic is correct, if we have two different
goto actions, we must have two pointers.

I will re-think about it tomorrow. (I am at a conference so don't have much
time on reviewing this.)

Thanks.


Re: [PATCH net] net/sched: properly init chain in case of multiple control actions

2018-10-15 Thread Cong Wang
On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti  wrote:
>
> On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > Why not just validate the fallback action in each action init()?
> > For example, checking tcfg_paction in tcf_gact_init().
> >
> > I don't see the need of making it generic.
>
> hello Cong, once again thanks for looking at this.
>
> what you say is doable, and I evaluated doing it before proposing this
> patch.
>
> But I felt unconfortable, because I needed to pass struct tcf_proto *tp in
> tcf_gact_init() to initialize a->goto_chain with the chain_idx encoded in
> the fallback action. So, I would have changed all the init() functions in
> all TC actions, just to fix two of them.
>
> A (legal?) trick  is to let tcf_action store the fallback action when it
> contains a 'goto chain' command, I just posted a proposal for gact. If you
> think it's ok, I will test and post the same for act_police.

Do we really need to support TC_ACT_GOTO_CHAIN for
gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
completeness?

IF we don't need to support it, we can just make it invalid without needing
to initialize it in ->init() at all.

If we do, however, we really need to move it into each ->init(), because
we have to lock each action if we are modifying an existing one. With
your patch, tcf_action_goto_chain_init() is still called without the per-action
lock.

What's more, if we support two different actions in gact, that is, tcfg_paction
and tcf_action, how could you still only have one a->goto_chain pointer?
There should be two pointers for each of them. :)

Thanks.


  1   2   3   4   5   6   7   8   9   10   >