Re: Silently dropped UDP packets on kernel 4.14

2018-05-03 Thread Kristian Evensen
Hi Michal,

Thanks for providing a nice summary of your experience when dealing
with this problem. Always nice to know that I am not alone :)

On Thu, May 3, 2018 at 11:42 AM, Michal Kubecek  wrote:
> One of the ideas I had was this:
>
>   - keep also unconfirmed conntracks in some data structure
>   - check new packets also against unconfirmed conntracks
>   - if it matches an unconfirmed conntrack, defer its processing
> until that conntrack is either inserted or discarded

I was thinking about something along the same lines and came to the
same conclusion: it is a lot of hassle and work for a very special
case. I think that replacing the conntrack entry is a good compromise;
it improves on the current situation and allows for the creation of
"perfect" solutions in user-space. For example, a user can keep track
of seen UDP flows and only release new packets belonging to the same
flow once the conntrack entry has been created.
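That user-space scheme could be sketched like this (a minimal toy
model in Python; the FlowGate class and its method names are purely
hypothetical illustration, not a real libnetfilter_queue API):

```python
# Toy model of a user-space "gate" in front of NFQUEUE verdicts: only the
# first packet of a new UDP flow is released immediately; later packets of
# the same flow are held back until the flow's conntrack entry is known to
# be confirmed, so they cannot race the insert and be dropped as a clash.

class FlowGate:
    def __init__(self):
        self.pending = {}     # flow tuple -> packets waiting for confirmation
        self.confirmed = set()

    def on_packet(self, flow, packet):
        """Return the list of packets that are safe to verdict now."""
        if flow in self.confirmed:
            return [packet]
        if flow in self.pending:
            # A first packet is already in flight; hold this one back.
            self.pending[flow].append(packet)
            return []
        # First packet of the flow: let it through and start tracking.
        self.pending[flow] = []
        return [packet]

    def on_confirmed(self, flow):
        """Called once the flow's conntrack entry is known to be inserted."""
        self.confirmed.add(flow)
        return self.pending.pop(flow, [])

gate = FlowGate()
flow = ("10.0.0.3", 42397, "1.1.1.1", 53)
assert gate.on_packet(flow, "pkt1") == ["pkt1"]   # released immediately
assert gate.on_packet(flow, "pkt2") == []         # held back
assert gate.on_confirmed(flow) == ["pkt2"]        # released on confirm
```

How user space learns of the confirmation (e.g. via conntrack events)
is left out; the point is only the hold-and-release bookkeeping.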

BR,
Kristian


Re: Silently dropped UDP packets on kernel 4.14

2018-05-03 Thread Michal Kubecek
On Thu, May 03, 2018 at 07:03:45AM +0200, Florian Westphal wrote:
> Kristian Evensen  wrote:
> > I went for the early-insert approach and have patched
> 
> I'm sorry for suggesting that.
> 
> It doesn't work, because of NAT.
> NAT rewrites packet content and changes the reply tuple, but the tuples
> determine the hash insertion location.
> 
> I don't know how to solve this problem.

It's an old problem which surfaces from time to time when some special
conditions make it more visible. When I was facing it in 2015, I found
this thread from as early as 2009:

  https://www.spinics.net/lists/linux-net/msg16712.html

In our case, the customer was using IPVS in "one packet scheduling"
mode (it drops the conntrack entry after each packet), which increased
the probability of insert collisions significantly.

We were lucky, though, as it turned out the only reason the customer
needed connection tracking was to make sure fragments of long UDP
datagrams were not sent to different real servers. For newer kernels
with commit 6aafeef03b9d ("netfilter: push reasm skb through instead
of original frag skbs"), this was no longer necessary, so they could
disable connection tracking for these packets.

For older kernels without this change, I tried several ideas, each of
which didn't work for some reason. We ended up with a rather hacky
workaround: not dropping the packet on collision (so that its conntrack
wasn't inserted into the table and was dropped once the packet was
sent). It worked fine for our customer but, like the early-insert
approach, it wouldn't work with NAT.

One of the ideas I had was this:

  - keep also unconfirmed conntracks in some data structure
  - check new packets also against unconfirmed conntracks
  - if it matches an unconfirmed conntrack, defer its processing
until that conntrack is either inserted or discarded

But as it would be rather complicated to implement without races or
harming performance, I didn't want to actually try it until I ran out
of other ideas. With NAT coming into play, there don't seem to be many
other options.
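The deferral idea above can be modeled in a few lines (a toy Python
sketch, not kernel code; the class and dict names are invented for
illustration):

```python
# Toy model of the three steps above: unconfirmed conntracks are tracked in
# a side table, a new packet matching one is deferred rather than given a
# clashing entry of its own, and the deferred packets are replayed once the
# first conntrack is either inserted or discarded.

class ConntrackModel:
    def __init__(self):
        self.confirmed = {}     # tuple -> inserted conntrack entry
        self.unconfirmed = {}   # tuple -> entry not yet inserted
        self.deferred = {}      # tuple -> packets parked on the outcome

    def lookup_or_create(self, tup, packet):
        if tup in self.confirmed:
            return ("use", self.confirmed[tup])
        if tup in self.unconfirmed:
            # Step 3: defer processing instead of creating a second entry.
            self.deferred.setdefault(tup, []).append(packet)
            return ("defer", None)
        ct = {"tuple": tup}
        self.unconfirmed[tup] = ct    # step 1: track unconfirmed entries
        return ("new", ct)

    def confirm(self, tup):
        """Insert the entry; return deferred packets to re-process."""
        ct = self.unconfirmed.pop(tup)
        self.confirmed[tup] = ct
        return self.deferred.pop(tup, [])

    def discard(self, tup):
        """Drop the entry; deferred packets must start over."""
        self.unconfirmed.pop(tup, None)
        return self.deferred.pop(tup, [])

m = ConntrackModel()
t = ("10.0.0.3", 42397, "1.1.1.1", 53)
status, ct = m.lookup_or_create(t, "p1")
assert status == "new"
status2, _ = m.lookup_or_create(t, "p2")
assert status2 == "defer"               # no clashing entry is created
assert m.confirm(t) == ["p2"]           # p2 is replayed after the insert
```

The hard parts the model hides are exactly the ones mentioned: doing
this lock-free on the hot path, and bounding the deferred queue.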

Michal Kubecek


Re: Silently dropped UDP packets on kernel 4.14

2018-05-03 Thread Kristian Evensen
Hi Florian,

On Thu, May 3, 2018 at 7:03 AM, Florian Westphal  wrote:
> I'm sorry for suggesting that.
>
> It doesn't work, because of NAT.
> NAT rewrites packet content and changes the reply tuple, but the tuples
> determine the hash insertion location.
>
> I don't know how to solve this problem.

No problem. This has served as a good exercise for getting more
familiar with the conntrack/nat code in the kernel anyway. I did some
more tests and I see that on my router (or routers, actually), just
replacing the ct solves the issue. While not a perfect solution, it
improves the situation considerably. Do you think a patch where the ct
is replaced would be acceptable, or would upstream rather wait for a
"proper" fix to this problem? When replacing the ct, it is at least
possible to work around the problem in user-space, while without
replacing the ct we are stuck with the original entry.

BR,
Kristian


Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Florian Westphal
Kristian Evensen  wrote:
> I went for the early-insert approach and have patched

I'm sorry for suggesting that.

It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.

I don't know how to solve this problem.
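The failure mode can be sketched with a toy model (illustrative Python
only; the bucket function and Table class are invented, not the kernel
data structures):

```python
# Why early insert breaks under NAT: a conntrack entry is reachable via the
# hash buckets of BOTH its tuples. If the entry is inserted before the NAT
# hook rewrites the reply tuple, replies arrive with a tuple that was never
# hashed into the table.

def bucket(tup, size=256):
    # Stand-in for the kernel's tuple hash; only determinism matters here.
    return hash(tup) % size

class Table:
    def __init__(self):
        self.slots = {}

    def insert(self, ct):
        for tup in (ct["orig"], ct["reply"]):
            self.slots.setdefault(bucket(tup), []).append(ct)

    def find(self, tup):
        return [ct for ct in self.slots.get(bucket(tup), [])
                if tup in (ct["orig"], ct["reply"])]

table = Table()
orig = ("10.0.0.3", 42397, "1.1.1.1", 53)
# Early insert happens BEFORE NAT runs: the reply tuple is still just the
# inverse of the original tuple.
ct = {"orig": orig, "reply": ("1.1.1.1", 53, "10.0.0.3", 42397)}
table.insert(ct)

# SNAT then rewrites the source address (192.0.2.1 is a made-up masquerade
# address), so real replies carry this tuple, which the table never hashed:
nat_reply = ("1.1.1.1", 53, "192.0.2.1", 42397)
assert table.find(nat_reply) == []        # reply lookup misses
assert table.find(ct["reply"]) == [ct]    # only the stale pre-NAT tuple hits
```

Rehashing the entry after NAT would amount to re-doing the insert,
which is essentially what the normal late confirm step already does.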


Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Kristian Evensen
Hello,

On Wed, May 2, 2018 at 12:42 AM, Kristian Evensen
 wrote:
> My knowledge of the conntrack/nat subsystem is not that great, and I
> don't know the implications of what I am about to suggest. However,
> considering that the two packets represent the same flow, wouldn't it
> be possible to apply the existing nat-mapping to the second packet,
> and then let the second packet pass?

I have spent the day today trying to solve my problem and I think I am
almost there. I have attached my work in progress patch to this email
if anyone wants to take a look (for kernel 4.14).

I went for the early-insert approach and have patched
nfnetlink_queue to perform an insert if no conntrack entry is found
when the verdict is passed from user-space. If a conntrack entry is
found, then I replace the ct attached to the skb with the existing
conntrack entry. I have verified that my approach works by
artificially delaying the verdict from my application for the second
packet (so that the first packet has passed all netfilter hooks).
After replacing the ct, the second packet is handled correctly and
sent over the wire.
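The verdict-time logic described above reduces to something like this
(a toy Python model of the patch's idea, not the actual
nfnetlink_queue code; skbs and cts are plain dicts here):

```python
# Toy model of verdict-time handling: when the verdict arrives from user
# space, either insert the packet's conntrack early (no entry exists yet),
# or replace the skb's conntrack with the entry another packet of the same
# flow already inserted, so the packet is handled against the right state.

def on_verdict(table, skb):
    tup = skb["ct"]["tuple"]
    existing = table.get(tup)
    if existing is None:
        table[tup] = skb["ct"]        # early insert, before remaining hooks
    else:
        skb["ct"] = existing          # replace ct with the existing entry

table = {}
flow = ("10.0.0.3", 42397, "1.1.1.1", 53)
skb1 = {"ct": {"tuple": flow}}
skb2 = {"ct": {"tuple": flow}}        # second packet, delayed verdict
on_verdict(table, skb1)
on_verdict(table, skb2)
assert skb2["ct"] is skb1["ct"]       # both packets now share one entry
```

The model ignores reference counting and locking, which is where the
real patch gets hairy (and, as noted later in the thread, the early
insert half breaks once NAT rewrites the tuples).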

However, something goes wrong after the conntrack entry has been
inserted (using nf_conntrack_hash_check_insert()). At some random (but
very short) time after I see a couple of "early insert ..."/"early
insert confirmed", I get an RCU-stall. The trace for the stall looks
as follows:

[  105.420024] INFO: rcu_sched self-detected stall on CPU
[  105.425191]  2-...: (5999 ticks this GP) idle=12a/141/0
softirq=2674/2674 fqs=2543
[  105.433845]   (t=6001 jiffies g=587 c=586 q=5896)
[  105.438545] CPU: 2 PID: 3632 Comm: dlb Not tainted 4.14.36 #0
[  105.444261] Stack :     805c7ada
0031  8052c588
[  105.452610] 8fd48ff4 805668e7 804f5a2c 0002 0e30
0001 8fc11d20 0007
[  105.460957]   805c 7448 
016e 0007 
[  105.469303]  8057 0006b111  8000
 8059 804608e4
[  105.477650] 8056c2c0 8056408c 00e0 8056 
8027c418 0008 805c0008
[  105.485996] ...
[  105.488437] Call Trace:
[  105.490909] [<800103f8>] show_stack+0x58/0x100
[  105.495363] [<8043c1dc>] dump_stack+0x9c/0xe0
[  105.499707] [<8000d938>] arch_trigger_cpumask_backtrace+0x50/0x78
[  105.505785] [<80083540>] rcu_dump_cpu_stacks+0xc4/0x134
[  105.510991] [<800829c4>] rcu_check_callbacks+0x310/0x814
[  105.516295] [<80085f04>] update_process_times+0x34/0x70
[  105.521522] [<8009691c>] tick_handle_periodic+0x34/0xd0
[  105.526749] [<802f6e98>] gic_compare_interrupt+0x48/0x58
[  105.532047] [<80076450>] handle_percpu_devid_irq+0xbc/0x1a8
[  105.537619] [<800707c0>] generic_handle_irq+0x40/0x58
[  105.542679] [<8023a274>] gic_handle_local_int+0x84/0xd0
[  105.547886] [<8023a434>] gic_irq_dispatch+0x10/0x20
[  105.552747] [<800707c0>] generic_handle_irq+0x40/0x58
[  105.557806] [<804591b4>] do_IRQ+0x1c/0x2c
[  105.561805] [<802394bc>] plat_irq_dispatch+0xfc/0x138
[  105.566839] [<8000b508>] except_vec_vi_end+0xb8/0xc4
[  105.571904] [<8f200ca4>] nf_conntrack_lock+0x28c/0x440 [nf_conntrack]
[  105.578341] [ cut here ]
[  105.582948] WARNING: CPU: 2 PID: 3632 at kernel/smp.c:416
smp_call_function_many+0xc8/0x3bc
[  105.591257] Modules linked in: rt2800pci rt2800mmio rt2800lib
qcserial ppp_async option usb_wwan rt2x00pci rt2x00mmio rt2x00lib
rndis_host qmi_wwan ppp_generic nf_nat_pptp nf_conntrack_pptp
nf_conntrack_ipv6p
[  105.662308]  nf_nat_snmp_basic nf_nat_sip nf_nat_redirect
nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4
nf_nat_ipv4 nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_log_ipv4
nf_flow_tablt
[  105.733288]  ip_set_hash_netiface ip_set_hash_netport
ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet
ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip
ip_set_hash_ipport ip_set_hash_ipmarm
[  105.804724]  ohci_hcd ehci_platform sd_mod scsi_mod ehci_hcd
gpio_button_hotplug usbcore nls_base usb_common mii
[  105.814899] CPU: 2 PID: 3632 Comm: dlb Not tainted 4.14.36 #0
[  105.820615] Stack :     805c7ada
0031  8052c588
[  105.828961] 8fd48ff4 805668e7 804f5a2c 0002 0e30
0001 8fc11c88 0007
[  105.837308]   805c 8590 
018d 0007 
[  105.845654]  8057 000c6f33  8000
 8059 8009e304
[  105.854001] 0009 01a0 8000d0fc  
8027c418 0008 805c0008
[  105.862348] ...
[  105.864786] Call Trace:
[  105.867230] [<800103f8>] show_stack+0x58/0x100
[  105.871662] [<8043c1dc>] dump_stack+0x9c/0xe0
[  105.876009] [<8002e190>] __warn+0xe0/0x114
[  105.880091] [<8002e254>] warn_slowpath_null+0x1c/0x28
[  105.885124] [<8009e304>] smp_call_function_many+0xc8/0x3bc
[  105.890591] [<8000d950>] arch_trigger_cpum

Re: Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Kristian Evensen
Hi,

Thanks for your quick and detailed reply!

On Wed, May 2, 2018 at 12:24 AM, Florian Westphal  wrote:
> I'm not sure what the best way to solve this is, we either need
> to insert earlier in nfqueue case, or do conflict resolution in nfqueue
> case (and perhaps also nat undo? not sure).

My knowledge of the conntrack/nat subsystem is not that great, and I
don't know the implications of what I am about to suggest. However,
considering that the two packets represent the same flow, wouldn't it
be possible to apply the existing nat-mapping to the second packet,
and then let the second packet pass?

BR,
Kristian


Re: Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Florian Westphal
Kristian Evensen  wrote:
> On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
>  wrote:
> > Does anyone have any idea of what could be wrong, where I should look
> > or other things I can try? I tried to space the requests out a bit in
> > time (I inserted a sleep 1 between them), and then the problem went
> > away.
> 
> I should learn to always go through everything one last time before
> sending an email. First of all, I see that both requests are treated
> as new.

Normal when nfqueue is in use.

> Second, on my router, new requests are sent to user space for
> marking, which explains the large delay in processing. When removing
> the NFQUEUE-rule + handling and marking statically, my problem goes
> away and I get an answer for both packets.

Yes, because the 2nd packet finds the conntrack entry created by the
first one.

> However, I do have one question. In my application, both packets are
> assigned the same mark. Shouldn't they then match the same conntrack
> entry, or am I missing something since that seems not to be the case?

The 2nd packet creates a new conntrack entry, because the conntrack
entry created by the first one is not inserted into the global table
yet.

This happens as the last step, after the packet has traversed all
chains. When nfqueue is used, this gets delayed, and the 2nd packet
will be dropped, as the insertion step finds that another packet has
already created the same flow.
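That timeline can be made concrete with a toy model (illustrative
Python, not kernel code; the class and method names are invented):

```python
# Toy timeline of the clash: the confirm (insert) step runs only after ALL
# hooks have finished, so with NFQUEUE in the path it is delayed past the
# user-space verdict. A second packet of the same flow therefore gets its
# own unconfirmed conntrack, and is dropped when its confirm step finds the
# flow already inserted by the first packet.

class Conntrack:
    def __init__(self):
        self.table = set()            # confirmed flows

    def new_packet(self, tup):
        if tup in self.table:
            return {"tuple": tup, "confirmed": True}
        # No entry visible yet: each packet gets its own unconfirmed ct.
        return {"tuple": tup, "confirmed": False}

    def confirm(self, ct):
        """Runs after all hooks (and any NFQUEUE verdict) have finished."""
        if ct["confirmed"]:
            return "pass"
        if ct["tuple"] in self.table:
            return "drop"             # insert collision: another packet won
        self.table.add(ct["tuple"])
        ct["confirmed"] = True
        return "pass"

nf = Conntrack()
flow = ("10.0.0.3", 42397, "1.1.1.1", 53)
ct1 = nf.new_packet(flow)   # first DNS request, verdict still pending
ct2 = nf.new_packet(flow)   # second request arrives before ct1's confirm
assert nf.confirm(ct1) == "pass"
assert nf.confirm(ct2) == "drop"   # silently lost, as seen in the report
```

Without NFQUEUE the first packet's confirm runs before the second
packet enters conntrack, so the second packet simply finds the entry.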

I'm not sure what the best way to solve this is, we either need
to insert earlier in nfqueue case, or do conflict resolution in nfqueue
case (and perhaps also nat undo? not sure).


Re: Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Kristian Evensen
On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
 wrote:
> Does anyone have any idea of what could be wrong, where I should look
> or other things I can try? I tried to space the requests out a bit in
> time (I inserted a sleep 1 between them), and then the problem went
> away.

I should learn to always go through everything one last time before
sending an email. First of all, I see that both requests are treated
as new. Second, on my router, new requests are sent to user space for
marking, which explains the large delay in processing. When removing
the NFQUEUE-rule + handling and marking statically, my problem goes
away and I get an answer for both packets.

However, I do have one question. In my application, both packets are
assigned the same mark. Shouldn't they then match the same conntrack
entry, or am I missing something since that seems not to be the case?

BR,
Kristian


Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Kristian Evensen
Hello,

I am currently debugging an issue where it looks like UDP packets are
silently dropped. My setup is a client with a tool that sends two UDP
packets (DNS requests) "simultaneously" using the same socket, and
then a router running the latest OpenWrt (with kernel 4.14.37). What I am
seeing is that the first request goes through, while the second seems
to be lost somewhere. I can see the second packet in my firewall logs
on the router (all the way past the POSTROUTING chain in the
nat-table), but the packet is not sent over the wire (checked with
tcpdump).

In order to figure out what was going on, I added some LOG rules to
the firewall. I inserted the rules in the PREROUTING chains of the
raw, mangle and nat tables, in filter FORWARD, and in nat and mangle
POSTROUTING. Here are the logs for one set of requests (with some
notes):

First DNS request (verified by packet length):
[723269.968256] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723269.984892] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723270.001769] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 -> this LOG-rule only matches when ctstate is NEW.

Second DNS request (verified by packet length):
[723270.019033] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.035734] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
-> No mangle-pre #2, so the conntrack entry is found

Processing first request:
[723270.036213] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200 -> I use policy routing, therefore mark. Routing
works correctly.
[723270.036366] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036419] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036457] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200

Processing second request:
[723270.117394] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.136644] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.154716] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.173501] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.187787] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200

When dumping conntrack after the second request has been processed, I
get the following:
udp  17 55 src=10.0.0.3 dst=1.1.1.1 sport=42397 dport=53 packets=1
bytes=56 src=1.1.1.1 dst=193.213.155.210 sport=53 dport=42397
packets=1 bytes=72 mark=67109120 delta-time=4 use=1

This looks a bit strange to me; it seems that only the first request
has been counted.

Does anyone have any idea of what could be wrong, where I should look
or other things I can try? I tried to space the requests out a bit in
time (I inserted a sleep 1 between them), and then the problem went
away.

Thanks in advance for any help,
Kristian