Any interest in more details for radar events?

2018-11-30 Thread Ben Greear

Hello,

I notice that ath10k can report PRI and some other details for radar events.

Any interest in being able to propagate that up to the netlink radar event
so that 'iw event' and similar could report the details?

In addition, how about allowing an 'ignored' flag so we can pass up a radar
event that will propagate to user-space, but should be ignored by the kernel
as far as making any changes to the regulatory logic.  User-space could
ignore it as desired.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Issue with STBC capability and forcing radio to 1x1 mode.

2018-11-28 Thread Ben Greear

Hello,

I notice some weird things while debugging an issue on a 4.16+ kernel with 
ath10k
radio.

It seems that the 'iw phy info' does not remove the TX-STBC info
when changing the antenna mask _until_ some vdev is brought up.

This makes it difficult to properly create hostapd config files.

[root@lf0313-63e7 lanforge]# iw phy wiphy1 info|grep STB
TX STBC
RX STBC 1-stream
TX STBC
[root@lf0313-63e7 lanforge]# ./local/sbin/iw phy wiphy1 set antenna 1 1
[root@lf0313-63e7 lanforge]# iw phy wiphy1 info|grep STB
TX STBC
RX STBC 1-stream
TX STBC
[root@lf0313-63e7 lanforge]# iw phy wiphy1 info|grep STB
TX STBC
RX STBC 1-stream
TX STBC
[root@lf0313-63e7 lanforge]# ifconfig wlan1 up
[root@lf0313-63e7 lanforge]# iw phy wiphy1 info|grep STB
RX STBC 1-stream

Any idea on this?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC 0/3] cfg80211/nl80211/iw: add basic AMPDU/AMSDU controls

2018-11-05 Thread Ben Greear

On 11/05/2018 02:49 PM, Igor Mitsyanko wrote:

On 11/05/2018 12:45 PM, Ben Greear wrote:


I see you don't implement it this way in the driver, but wouldn't it
make more sense to have this as a per-STA (RA) setting? That's really
the granularity it can be done on, I think?

Arguably even per-RA/TID, though that seems a little excessive?


I like the idea of providing this API per peer/tid.  And, just allow
peer == -1, tid == -1 or similar to mean 'all' so that you can still set
the entire device
to one particular setting w/out having to iterate through all peers if you
don't want to iterate...


Maye I'm wrong, but isn't the setting we're discussing are for the
device itself, not for its peers? I mean, disabling AMSDU, AMPDU implies
we need to update capabilities advertised in our information elements,
which are common for all devices, and it affects both Tx and Rx.

And per-node/per-TID aggregation settings are a separate configuration
option related to rate adaptation on Tx path only..


You can advertise your maximum capabilities, but just because you advertise
that you can do large AMPDU chains doesn't mean you are required to send
them.

So, to advertise stuff, it is per vdev (not per radio), but once you associate
a peer, you might decide to configure it so that you always send no more than 5
frames in an AMPDU chain, for instance.

And, you might decide that BE gets up to 32 AMPDU chain, but VI should be 
limitted to 16
to decrease latency just a bit.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC 0/3] cfg80211/nl80211/iw: add basic AMPDU/AMSDU controls

2018-11-05 Thread Ben Greear

On 11/05/2018 12:32 PM, Johannes Berg wrote:

Hi,


Here are several RFC patches providing simple high-level controls of AMSDU/AMPDU
aggregation. The primary purpose of this functionality is an attempt to fill
missing gaps in nl80211 interface for basic WFA certification tests.


I see you don't implement it this way in the driver, but wouldn't it
make more sense to have this as a per-STA (RA) setting? That's really
the granularity it can be done on, I think?

Arguably even per-RA/TID, though that seems a little excessive?


I like the idea of providing this API per peer/tid.  And, just allow
peer == -1, tid == -1 or similar to mean 'all' so that you can still set the 
entire device
to one particular setting w/out having to iterate through all peers if you
don't want to iterate...


We experimented with QCA sigma-dut tool: https://github.com/qca/sigma-dut.
The purpose was to cover basic HT/VHT WFA STA tests for cfg80211 driver
controlled by wpa_supplicant  w/o adding any vendor specific commands.
Multiple WFA test parameters (e.g. STBC, NSS, SGI, LDPC) can be configured
by overriding HT/VHT capabilities in wpa_supplicant and applying them on
connect in cfg80211_connect callback. Others (e.g. RTS params) can be
configured using iw tool or NL80211_CMD_SET_WIPHY directly. These patches
implement simpe high-level switches for AMSDU/AMPDU aggregation.

It would be interesting to collect comments/concerns regarding this approach.
Does it make sense to enhance nl80211 in order to cover all the missing pieces
required for WFA certification tests ? Or maybe it makes sense to use
NL80211_TESTMODE subcommands for this kind of testing.


Honestly, I don't really know.

I think other tests, e.g. noack, we used to just have debugfs files for,
and then we got NL80211_CMD_SET_NOACK_MAP (which *is* per TID, btw)

Perhaps if we really don't see any value beyond the testing, keeping it
in debugfs would make sense, until we have more use cases and understand
the granularity we should support?

Clearly this is enough for the testing you refer to, but other use cases
might actually need more fine-grained control (in the future), possibly
down to the RA/TID level?


At least with ath10k, it seems to be a common struggle for AP manufacturers
to do regulatory testing.  I would image that is true for other chipsets
as well, so it seems like it might be worth adding to the official API.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Bad WARN_ON spam in mac80211 wake_queue

2018-10-26 Thread Ben Greear

I made the mistake of configuring 64 ath10k vdevs on an un-modified
4.19-rc5 mac80211, and the (OpenWRT) system mostly killed itself.  I'm not sure
if this is root cause or not, but one of the issues is that this code
hits the WARN_ON in wake_queue() over and over again.  Please make it 
WARN_ON_ONCE
when you get a chance...my tree is too dirty to make a good patch
in this area.

static void __ieee80211_wake_queue(struct ieee80211_hw *hw, int queue,
   enum queue_stop_reason reason,
   bool refcounted)
{
struct ieee80211_local *local = hw_to_local(hw);

trace_wake_queue(local, queue, reason);

if (WARN_ON(queue >= hw->queues))
return;


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Should we check netif_running in cfg80211_calculate_bi_data?

2018-10-23 Thread Ben Greear

On 10/23/2018 12:53 PM, Johannes Berg wrote:

On Tue, 2018-10-23 at 12:19 -0700, Ben Greear wrote:


Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: Assigning beacon-in-gcd to: 
240 from wdev: vap39
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: beacon-int-diff, 
beacon-int-gcd: 240  new-beacon-int: 100


This new-beacon-int 100 seems strange and suspicious. Why is it even
trying to look at this? Hmm.


Maybe we need to clear beacon-interval back to 0 on admin down of the wifi dev?


We should be doing this, we should end up in __cfg80211_stop_ap() and
that does clear it? Hmm... perhaps this _fails_ somehow, and we don't
clear it in the error path? I suppose we really should make that
unconditional because there's nothing we can do to recover from that
error ...


I am suspicious about this...  that is not the same memory location as 
wdev->beacon_interval,
so maybe this is the thing that is not properly cleared?

if (sdata->vif.type == NL80211_IFTYPE_AP ||
sdata->vif.type == NL80211_IFTYPE_MESH_POINT) {
/*
 * always passing this is harmless, since it'll be the
 * same value that cfg80211 finds if it finds the same
 * interface ... and that's always allowed
 */
params.new_beacon_int = sdata->vif.bss_conf.beacon_int;
}

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Should we check netif_running in cfg80211_calculate_bi_data?

2018-10-23 Thread Ben Greear

On 10/23/2018 02:32 AM, Johannes Berg wrote:

On Mon, 2018-10-22 at 13:27 -0700, Ben Greear wrote:

I was testing on my 4.16 kernel with a bunch of VAPs and I had two configured
for 100 beacon interval, and more at 240.  This failed for reasons I figured out
(gcd logic), but even once I tried to configure the vaps for 240 interval they
could not come up.  I am thinking maybe it was because I could only re-configure
admin-up interfaces, and they couldn't come up due the gcd thing?

Anyway, while poking, I thought maybe the patch below would be helpful since
we shouldn't care about admin-down interfaces in this case?

diff --git a/net/wireless/util.c b/net/wireless/util.c
index fbc880e..56d7583 100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -1541,6 +1541,9 @@ static void cfg80211_calculate_bi_data(struct wiphy 
*wiphy, u32 new_beacon_int,
 if (wdev->beacon_interval == *beacon_int_gcd)
 continue;

+   if (!netif_running(wdev->netdev))
+   continue;
+


I don't think we'd ever get to this check, since those interfaces will
always have wdev->beacon_interval == 0, checked a few lines before this
code at the beginning of the loop? At least they should have, but a
quick check suggests that is true.


Ok, that netif_running thing was just dumb luck that it started working for me.
It doesn't really fix the problem.

Here is how I reproduce:

Create 32 vdevs (I'm quite sure fewer would work).  Have one with beacon-int 
100,
have others with beacon int-240.

Try to start all of them.  Assume the 240 ones start first, so they win and 
start.
The 100 one fails correctly.

Now, reconfigure the 100 one to have 240 and restart its hostapd.  It will fail
to come up like this (I added WARN-ON and other kernel debugging to track down 
the
failure path):

Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: Assigning beacon-in-gcd to: 
240 from wdev: vap39
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: beacon-int-diff, 
beacon-int-gcd: 240  new-beacon-int: 100
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: WARNING: CPU: 3 PID: 28533 at /home/greearb/git/linux-4.16.dev.y/net/wireless/util.c:1564 
cfg80211_iter_combinations+0x2ff/0x3b0 [cfg80211]
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: Modules linked in: nf_conntrack_netlink nfnetlink nf_conntrack_ipv4 iptable_raw xt_CT nf_conntrack bridge 
wanlink(O) mac80211_hwsim ath10k_pci ath10k_core ath5k ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nf_defrag_ipv4 libcrc32c vrf 8021q garp mrp stp llc 
fuse macvlan pktgen rpcsec_gss_krb5 nfsv4 nfs fscache snd_hda_codec_hdmi coretemp intel_rapl x86_pkg_temp_thermal snd_hda_codec_realtek intel_powerclamp 
snd_hda_codec_generic snd_hda_intel snd_hda_codec kvm iTCO_wdt snd_hda_core iTCO_vendor_support snd_hwdep irqbypass snd_seq snd_seq_device hwmon i2c_i801 
snd_pcm snd_timer lpc_ich snd shpchp soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel serio_raw i915 i2c_algo_bit drm_kms_helper e1000e drm 
i2c_core video ipv6 crc_ccitt [last unloaded: nfnetlink]

Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: CPU: 3 PID: 28533 Comm: 
hostapd Tainted: GW  O 4.16.18+ #50
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: Hardware name: To be filled 
by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 06/07/2013
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: RIP: 
0010:cfg80211_iter_combinations+0x2ff/0x3b0 [cfg80211]
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: RSP: 0018:c9000327ba90 
EFLAGS: 00010286
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: RAX: 0039 RBX: 
0064 RCX: 
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: RDX: 88021e39ea80 RSI: 
88021e396698 RDI: 88021e396698
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: RBP: 88020c618700 R08: 
 R09: 89fa
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: R10: c9000327bb88 R11: 
0001 R12: 00f0
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: R13: 88020c6186f0 R14: 
c9000327bb10 R15: 88020c6182e0
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: FS:  7f6d8b5db800() 
GS:88021e38() knlGS:
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: CS:  0010 DS:  ES:  
CR0: 80050033
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: CR2: 7fbee7b3e280 CR3: 
00020bf08002 CR4: 001606e0
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel: Call Trace:
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel:  ? 
cfg80211_netdev_notifier_call+0x65/0x610 [cfg80211]
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel:  ? skb_queue_tail+0x16/0x40
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel:  ? 
ieee80211_chandef_to_operating_class+0x1d0/0x1d0 [cfg80211]
Oct 23 12:11:05 ben-ota-2.candelatech.com kernel:  
cfg80211_check_combinations+0x1e/0x60 [cfg80211]
Oct 23 12:11:05 

Should we check netif_running in cfg80211_calculate_bi_data?

2018-10-22 Thread Ben Greear

I was testing on my 4.16 kernel with a bunch of VAPs and I had two configured
for 100 beacon interval, and more at 240.  This failed for reasons I figured out
(gcd logic), but even once I tried to configure the vaps for 240 interval they
could not come up.  I am thinking maybe it was because I could only re-configure
admin-up interfaces, and they couldn't come up due the gcd thing?

Anyway, while poking, I thought maybe the patch below would be helpful since
we shouldn't care about admin-down interfaces in this case?

diff --git a/net/wireless/util.c b/net/wireless/util.c
index fbc880e..56d7583 100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -1541,6 +1541,9 @@ static void cfg80211_calculate_bi_data(struct wiphy 
*wiphy, u32 new_beacon_int,
if (wdev->beacon_interval == *beacon_int_gcd)
continue;

+   if (!netif_running(wdev->netdev))
+   continue;
+
*beacon_int_different = true;
*beacon_int_gcd = gcd(*beacon_int_gcd, wdev->beacon_interval);
}

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Where to report mpdus tx vs failed?

2018-10-22 Thread Ben Greear

On 10/22/2018 05:07 AM, Johannes Berg wrote:

On Mon, 2018-10-22 at 14:06 +0200, Johannes Berg wrote:

On Fri, 2018-10-19 at 11:32 -0700, Ben Greear wrote:


I was hoping I could fit it into some existing stat.  It is sort of like
retries, so that will be my first attempt.

By investigating an RF sniff, I notice the 9880 ath10k (with my fw
and driver, at least), will retransmit about 30% of the frames when running
at least one of my test cases (small udp frame transmit to AP that can only do 
about mcs5 or mcs7
reliably at 3x3 nss).

I'd like to report this stat in my wifi test tool if nothing else, but
likely other people would make use of it as well.



We have NL80211_STA_INFO_TX_RETRIES, shouldn't you be able to capture it
there?


Or, per TID, NL80211_TID_STATS_TX_MSDU_RETRIES


I added this code (rate is struct ieee80211_tx_rate)

if (tx_done->mpdus_failed) {
/* Maybe there is a better way to report this tried vs failed 
stat up the stack? */
rate->count = tx_done->mpdus_failed + 1;
}
else {
rate->count = 1;
}

I think this is mostly providing useful info up the stack and we are trying to 
verify
the counts against a packet sniff today.  The driver is only reporting
this number for the first packet in the ampdu chain, so all retries are counted 
against the
first frame (the rest will have mpdus_failed == 0 regardless of how many 
retries they have).

And, on total retry failure, packets should be reported up the stack as 
tx-no-ack, so I think
that is working OK too.

I guess mac80211 could use this tx-status to report per-tid if it cared to.  I 
don't see any
other way to pass such tx failure details up the stack to mac80211.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Rate-ctrl experience gained with ath10k

2018-10-20 Thread Ben Greear

I now have the 9984 ath10k working pretty good I think, at least as far as
rate-ctrl is going.  It tries to keep retransmit < 10% and throughput is
fine compared to using MCS with higher retransmit (and higher encoding rates).

But, I tried to put similar changes (dodge high retransmit MCS if possible),
and for the wave-1 9880 chip, this decreased throughput.

For instance, current rate-ctrl was retransmitting about 30% of the frames
when allowed to choose its MCS.  In this case, I saw around 225-230Mbps 
throughput.

But, if I forced it down to MCS-5, 3x3, then throughput was about 220Mbps, and
retry percent was closer to 2%.

So, question is...should I still tweak its ratectrl to choose lower MCS when
retransmit % is high, or should I just let it retransmit.  Maybe lower MCS and 
lower
retransmit would be better over-all if there are lots of devices on the network?

I'll run some tests to experiment with that when I have time, but curious if 
others
have already done similar...

Thanks,
Ben


On 10/16/2018 04:31 PM, Ben Greear wrote:

So, I've been trying to understand why the ath10k-rate ctrl was acting
so poorly in my 20-station UDP tx case.  I wrote that wifi-diag tool,
and it gave some clues, but only in retrospect were they obvious :)

The problem in general was that a single station could upload, at say 500Mbps,
but 20 stations could only do about 325Mbps.  If we fixed the MCS at 3x3 MCS7,
then the 20 station upload test ran about 490Mbps.

There were several optimizations I made in the ath10k firmware rate-ctrl.  One
was to penalize rates with higher PER (packet error rate) more than the old 
algorithm did.  This helped
a small bit, but not enough (around 350-370Mbps total throughput).

Then, after adding logs I noticed that each station was probing once about 
every 5 ampdu
chains, or almost 20% of the time!  These probe AMPDUs are limited to a max of 
4 AMSDUs,
and that explains why my 'bad' case showed 17% of the AMPDU chains were 4 in 
length in
my wifi-diag output.

I changed the firmware to take number of active stations into account, and 
increase the
75ms probe interval by up to a factor of 5.  I also probe less often when the 
probed
rate has PER > 5 or the last probe had a dropped frame.

And, I allow probing with AMPDU chains of up to 16 AMSDU instead of just 4.

Together this gets me up to about 460Mbps in this test case.   Still not quite 
as good
as fixing the MCS at 7, but good enough I think.

Here's hoping this email will let other folks poking at rate-ctrl have a faster
time at fixing things than I did.

Thanks,
Ben




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Where to report mpdus tx vs failed?

2018-10-19 Thread Ben Greear

On 10/19/2018 11:25 AM, Steve deRosier wrote:

On Fri, Oct 19, 2018 at 10:34 AM Ben Greear  wrote:


While debugging rate-ctrl in ath10k, I found the amount of mpdus transmitted vs 
failed
ratio useful.  Probably more useful than retries since retries could count an 
attempt at
80Mhz followed by HW trying a 40Mhz rate (afaik).

Is there a good way to report this up the stack in a useful manner?  I 
currently only
get this stat for the first frame in an transmitted ampdu.



debugfs? Is it useful for someone working on it, like a sysadmin or
kernel programmer?  Or is it useful for programs to know?  In the
first case I'd say debugfs, in the second case I'd suggest it goes in
some tx stats structure that is reported by netlink.

Also, is it something that is given by (or should be given by) all
drivers, or is it very driver-specific?


I was hoping I could fit it into some existing stat.  It is sort of like
retries, so that will be my first attempt.

By investigating an RF sniff, I notice the 9880 ath10k (with my fw
and driver, at least), will retransmit about 30% of the frames when running
at least one of my test cases (small udp frame transmit to AP that can only do 
about mcs5 or mcs7
reliably at 3x3 nss).

I'd like to report this stat in my wifi test tool if nothing else, but
likely other people would make use of it as well.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Where to report mpdus tx vs failed?

2018-10-19 Thread Ben Greear

While debugging rate-ctrl in ath10k, I found the amount of mpdus transmitted vs 
failed
ratio useful.  Probably more useful than retries since retries could count an 
attempt at
80Mhz followed by HW trying a 40Mhz rate (afaik).

Is there a good way to report this up the stack in a useful manner?  I 
currently only
get this stat for the first frame in an transmitted ampdu.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Rate-ctrl experience gained with ath10k

2018-10-16 Thread Ben Greear

So, I've been trying to understand why the ath10k-rate ctrl was acting
so poorly in my 20-station UDP tx case.  I wrote that wifi-diag tool,
and it gave some clues, but only in retrospect were they obvious :)

The problem in general was that a single station could upload, at say 500Mbps,
but 20 stations could only do about 325Mbps.  If we fixed the MCS at 3x3 MCS7,
then the 20 station upload test ran about 490Mbps.

There were several optimizations I made in the ath10k firmware rate-ctrl.  One
was to penalize rates with higher PER (packet error rate) more than the old 
algorithm did.  This helped
a small bit, but not enough (around 350-370Mbps total throughput).

Then, after adding logs I noticed that each station was probing once about 
every 5 ampdu
chains, or almost 20% of the time!  These probe AMPDUs are limited to a max of 
4 AMSDUs,
and that explains why my 'bad' case showed 17% of the AMPDU chains were 4 in 
length in
my wifi-diag output.

I changed the firmware to take number of active stations into account, and 
increase the
75ms probe interval by up to a factor of 5.  I also probe less often when the 
probed
rate has PER > 5 or the last probe had a dropped frame.

And, I allow probing with AMPDU chains of up to 16 AMSDU instead of just 4.

Together this gets me up to about 460Mbps in this test case.   Still not quite 
as good
as fixing the MCS at 7, but good enough I think.

Here's hoping this email will let other folks poking at rate-ctrl have a faster
time at fixing things than I did.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Tool to debug wifi pkt sniffs?

2018-10-10 Thread Ben Greear

On 10/10/2018 12:13 PM, Dave Taht wrote:

On Wed, Oct 10, 2018 at 10:10 AM Ben Greear  wrote:


On 10/03/2018 01:29 PM, Dave Taht wrote:

On Wed, Oct 3, 2018 at 1:16 PM Toke Høiland-Jørgensen  wrote:


Ben Greear  writes:


Hello,

I often find myself wanting to figure out what equipment is to blame (and why)
in a wifi environment.

I am thinking writing a tool that would parse a pcap file and look at frames
in enough detail to flag block-ack bugs, rate-ctrl bugs, guess at the sniffer's
capture ability, etc.

Does anyone have anything already written that they would like to share, or know
of projects that might already do some of this?


Not sure if this fits your criteria, but Sven's tool to create airtime
charts from packet sniffing data immediately came to mind:

https://github.com/cloudtrax/airtime-pie-chart


I have used that. Oy, it's a PITA. Some of kathie's code over here
(example: https://github.com/pollere/pping ) uses the slightly less
painful http://libtins.github.io/ library for parsing packets.


I couldn't find anything that did what I wanted, so I wrote my own.

The (perl) code is in the wifi-diag directory of this public repo:

https://github.com/greearb/lanforge-scripts

The rest of the scripts in that repo are not related to the wifi-diag script, 
so just ignore those.

Here is example output for what I have so far:

https://www.candelatech.com/oss/wifi-diag/netgear-up-5s/index.html


I *miss* writing in perl. :)

My guess from looking at that output that that was a udp flood test.
Do I win the internets?


Yes, UDP upload test with 20 emulated stations, sending ~500 byte UDP frames.
One thing we notice in the case we are debugging, is
that the average time from transmitter station device receiving BA from the AP
to the transmitter station device putting the next AMPDU frame on air
is 0.728ms for the problem AP, and 0.448ms for the good AP.

I checked that the wmm config in the beacons for the two APs is the same.

I am at a loss as to what could cause this delay, other than possibly the 
problem
AP has a funky transmitter than puts a bit of extra noise on the air after it
is done transmitting a frame?

The problem AP also has 5% retransmits vs about 2% for the good AP, and problem 
AP
is typically using MCS8 instead of MCS9, but even so, I do not see how that 
would explain
the extra BA to AMPDU delay.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Tool to debug wifi pkt sniffs?

2018-10-10 Thread Ben Greear

On 10/03/2018 01:29 PM, Dave Taht wrote:

On Wed, Oct 3, 2018 at 1:16 PM Toke Høiland-Jørgensen  wrote:


Ben Greear  writes:


Hello,

I often find myself wanting to figure out what equipment is to blame (and why)
in a wifi environment.

I am thinking writing a tool that would parse a pcap file and look at frames
in enough detail to flag block-ack bugs, rate-ctrl bugs, guess at the sniffer's
capture ability, etc.

Does anyone have anything already written that they would like to share, or know
of projects that might already do some of this?


Not sure if this fits your criteria, but Sven's tool to create airtime
charts from packet sniffing data immediately came to mind:

https://github.com/cloudtrax/airtime-pie-chart


I have used that. Oy, it's a PITA. Some of kathie's code over here
(example: https://github.com/pollere/pping ) uses the slightly less
painful http://libtins.github.io/ library for parsing packets.


I couldn't find anything that did what I wanted, so I wrote my own.

The (perl) code is in the wifi-diag directory of this public repo:

https://github.com/greearb/lanforge-scripts

The rest of the scripts in that repo are not related to the wifi-diag script, 
so just ignore those.

Here is example output for what I have so far:

https://www.candelatech.com/oss/wifi-diag/netgear-up-5s/index.html

The general idea is to get a performance test going, and then use tshark or 
similar
to grab a short sample (my script is slow, it can process only about 400 
packets per second
on my desktop, so a 5 sec capture at full speed takes around 5 minutes to 
process),
and then pipe that decoded pcap into my script.

It tries to pay attention to latencies between block-ack and next AMPDU frame,
MCS distributions, packet-type distributions, retries, and other
such things.  I'm guessing tweaking wmm settings (or changing QoS in the 
generated traffic)
would be visible in this kind of metric, for instance.

The goal is to be able to answer the question of why one AP is faster or slower 
than another
when running the same test case.

Feedback (and even patches) is welcome...what other things can I report that 
would
be helpful?


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k and 16 VAP interfaces?

2018-10-04 Thread Ben Greear

Hello,

I'm finally getting around to trying this.

One thing, the second patch won't compile due to missing ath9k_set_moredata 
method.

I see it is implemented here:

http://git.fem.tu-ilmenau.de/?p=fem-wlan.git;a=blob;f=package/kernel/mac80211/patches/310-ath9k-fix-moredata-bit-in-PS-buffered-frame-release.patch;h=b2a74ccbacb29115bede4806006f3265f54cb31c;hb=refs/heads/femwlan_stage2.kernel4.9

I guess I should apply this one...maybe more?

Any idea if any of these patches should be submitted upstream?

Thanks,
Ben

On 08/05/2018 11:18 AM, michael-dev wrote:

Hi,

I'm using [1]. I was seeing some trouble with powersave + broadcasts, so I 
added [2], so that the queue would get empty faster.

Regards,
M. Braun


[1]
http://git.fem.tu-ilmenau.de/?p=fem-wlan.git;a=blob;f=package/kernel/mac80211/patches/911-more-ap-interfaces.patch;h=873a6cb7bdb8a7462ffd07c1a50fc87c580f223d;hb=refs/heads/femwlan_stage2.kernel4.9

[2]
http://git.fem.tu-ilmenau.de/?p=fem-wlan.git;a=blob;f=package/kernel/mac80211/patches/915-use-more-time-for-multicast.patch;h=05d7fb6c2567e1abdef63a4d01f02df7f8501630;hb=refs/heads/femwlan_stage2.kernel4.9

Am 30.07.2018 16:34, schrieb Ben Greear:

On 07/30/2018 04:13 AM, Matthias May wrote:

On 30/07/18 11:40, michael-...@fami-braun.de wrote:

Do you mean AP interfaces as required for multiple BSS/SSIDs?

I'm running a patched ath9k to have hostapd run >8 BSS on a single radio.
It works fine since years.


Yes, I'd love to see any patches you can share on this as well.

Thanks,
Ben



Regards,
M. Braun


Am 27. Juli 2018 15:35:28 MESZ schrieb Ben Greear :

Hello,

Has anyone tried making ath9k able to support 16 vAP interfaces on a
single
radio?  I seem to recall that there were limitations regarding beacon
timers and such, and that is why the current limit is 8?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com




Are these patches available somewhere?
I'm interested to play with them :)

BR
Matthias






--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Tool to debug wifi pkt sniffs?

2018-10-03 Thread Ben Greear

Hello,

I often find myself wanting to figure out what equipment is to blame (and why)
in a wifi environment.

I am thinking writing a tool that would parse a pcap file and look at frames
in enough detail to flag block-ack bugs, rate-ctrl bugs, guess at the sniffer's
capture ability, etc.

Does anyone have anything already written that they would like to share, or know
of projects that might already do some of this?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Problem with sending pkt on a monitor port

2018-09-28 Thread Ben Greear




On 09/28/2018 12:14 AM, Johannes Berg wrote:

Sorry, I'm a bit behind things ...


It's actually created by mac80211, but only once, and not directly
mapped to each vif seen by userspace - it's an internal construction.


I'm not sure it matters, but ath10k firmware can also create a monitor vdev
itself for certain reasons.  (Maybe offchannel tx on some FW, but I haven't 
looked at
that code lately).


Yeah and I think it may actually do for active monitor, but I believe
those get their own MAC address anyway? That might get used in the end
as the vif to the driver too.


The monitor port has the same mac as the wlanX in ath10k, ie the 'radio's mac'.


However, thinking about it, that also breaks userspace in other ways -
for example if you do injection this way you actually get encryption and
other nice things if you use the local address that matches an existing
interface.


I'm not entirely sure of a useful use-case for this feature in user-space.


Which feature?


radio-tap send on a monitor vdev.



At least ancient versions of hostapd would rely on this, but clearly
that's no longer super relevant. I don't know if anyone else relies on
it, but in a way that is the problem. If I knew, then I could think
about alternatives or how to keep that working if we change anything
here.


I am using it just to test sending some test frames to debug some firmware
features.  I think another user sent hand-crafted specialized beacons in this 
manner
using my 10.1 ath10k firmware & driver.  For whatever reason, I didn't realize 
monitor
vdevs were not directly used when I added that support..maybe I just got lucky
before I had to dig closely.


They may be used if they were active monitor? I don't know ath10k well.
But then they shouldn't have had the same MAC address to start with,
IIRC.


The code I quoted at the first of this thread make sure the monitor vdev is
not used if possible.  But, maybe the driver has or had some ways to force 
certain
frames out the monitor port.  That is my recollection.  I added code to the 
firmware
to allow this to work, including bug fixes to crashes, so I am pretty sure 
there is *some*
way for that tx path to happen, at least in wave-1 firmware.


If I make the code in my original email be skipped, so that sdata remains the
monitor vdev, then it fails a check later in that method because there is no
chanctxt for the monitor sdata object.

I guess that changing the source MAC to something unique would cause the same
issue and no frame would be sent towards the driver.


Hmm. This *should* work in one way or the other? But again, maybe ath10k
has something special here?

You skipped *just* that loop?


Yes...because the monitor vdev chanctx was null and that method checks a bit 
later for it.

Maybe there is a way to create/configure the monitor vdev so that it has
a chanctx?

Thanks,
Ben



johannes



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: How many null-data probes on connection loss?

2018-09-28 Thread Ben Greear




On 09/28/2018 12:19 AM, Johannes Berg wrote:

On Thu, 2018-09-27 at 08:32 -0700, Ben Greear wrote:


It seems though that if there's some noise or so on the channel you
wouldn't be transmitting, so what kind of "network glitches" might
affect this? AP going away unexpectedly for some time?


I am thinking that if the 'timeout' is 500ms, and the number of probes is 2
(the default values), then it should probe at 0ms, and at 250ms, and then 
finally
fail at 500ms if nothing was received.  In otherwords, X probes, x/timeout 
apart.


That seems reasonable I guess. Although I'm not sure - perhaps once we
know it failed we *do* want to try a bit quicker again? Otherwise we
have a totally dead period there in the meantime, no?


I don't think the tx path stops just because the mlme decides probes
are timing out, so if AP is really functional, you won't have a dead period.

If it is dead, then the sooner you probe and discover timeout and disconnect,
the sooner you can re-connect to some better AP (assuming one exists).  A
false disconnect due to missing a few probes would be disruptive though, and
probing very often in idle situations would use more airtime, so
of course there is a trade-off.

While watching the sniffer, it seems the initial mlme probe happens about
3 seconds after I admin down the AP, and then disconnect is about 1s after
that (I have my timeout set to 1000ms instead of default 500ms, and I have retry
count set to 5 instead of 2.

Did you see that patch I posted?  It looks good in the sniffer and logs,
as far as I can tell

Thanks,
Ben



johannes



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: How many null-data probes on connection loss?

2018-09-27 Thread Ben Greear




On 09/27/2018 12:12 AM, Johannes Berg wrote:

On Wed, 2018-09-26 at 15:21 -0700, Ben Greear wrote:


Now only 4 retries per frame, but it seems mac80211 is all 5 of its
null-data probes within a few miliseconds.  Is that expected, or should
there be a bit more pause between each of the probe requests to better
weather periodic network glitches?


Hmm. That's a good point, but it seems we've never considered this
before. This must be a consequence of retrying immediately on lost ACK,
but I guess I could see that delaying it for a little bit would make
sense.

We do delay it if there's no reliable ACK reporting, IIRC, but if we
know it failed for sure ...

It seems though that if there's some noise or so on the channel you
wouldn't be transmitting, so what kind of "network glitches" might
affect this? AP going away unexpectedly for some time?


I am thinking that if the 'timeout' is 500ms, and the number of probes is 2
(the default values), then it should probe at 0ms, and at 250ms, and then 
finally
fail at 500ms if nothing was received.  In otherwords, X probes, x/timeout 
apart.

I should have a patch to make it work more like I think it should work
later today for discussion.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: How many null-data probes on connection loss?

2018-09-26 Thread Ben Greear

On 09/26/2018 11:48 AM, Johannes Berg wrote:

On Wed, 2018-09-26 at 11:47 -0700, Ben Greear wrote:


I think I'll start by making sure the firmware does not do software retransmits
for frames from the driver (self-gen frames are OK to be retransmitted I guess).


You do want it to be doing retries for frames from the driver, since you
want it to recover from temporary collisions with a microwave and
whatnot ... just not *that many*, I guess.


 From what I can tell so far, my firmware has this sort of logic:

frame from stack to the driver
   -> send to firmware
   -> in firmware, hardware will do up to X retries (maybe 16 or so, need to 
check)
   -> On failure, the firmware may re-queue the packet (firmware-software retry)
   -> back to hardware retries (~32 frames on air at this point)
   ...
   Eventually tx-fail notification is sent back to the driver one way or 
another.

I am thinking it would be best to have the software retry in the firmware
disabled.

Then, when mac80211 sends a null-data frame, you would see at most about
16 of them on air, every 500ms or so until it recovers or considers the
connection lost.


Yes, that seems reasonable. In fact, I'd argue that such software-retry
should just be disabled completely - it's better to lose the occasional
frame than to keep using airtime for it forever ...

Toke is probably getting nightmares reading this - sweet dreams ;-)


I fixed the firmware

Now only 4 retries per frame, but it seems mac80211 is all 5 of its
null-data probes within a few miliseconds.  Is that expected, or should
there be a bit more pause between each of the probe requests to better
weather periodic network glitches?

Thanks,
Ben



johannes




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: How many null-data probes on connection loss?

2018-09-26 Thread Ben Greear

On 09/26/2018 11:48 AM, Johannes Berg wrote:

On Wed, 2018-09-26 at 11:47 -0700, Ben Greear wrote:


I think I'll start by making sure the firmware does not do software retransmits
for frames from the driver (self-gen frames are OK to be retransmitted I guess).


You do want it to be doing retries for frames from the driver, since you
want it to recover from temporary collisions with a microwave and
whatnot ... just not *that many*, I guess.


 From what I can tell so far, my firmware has this sort of logic:

frame from stack to the driver
   -> send to firmware
   -> in firmware, hardware will do up to X retries (maybe 16 or so, need to 
check)
   -> On failure, the firmware may re-queue the packet (firmware-software retry)
   -> back to hardware retries (~32 frames on air at this point)
   ...
   Eventually tx-fail notification is sent back to the driver one way or 
another.

I am thinking it would be best to have the software retry in the firmware
disabled.

Then, when mac80211 sends a null-data frame, you would see at most about
16 of them on air, every 500ms or so until it recovers or considers the
connection lost.


Yes, that seems reasonable. In fact, I'd argue that such software-retry
should just be disabled completely - it's better to lose the occasional
frame than to keep using airtime for it forever ...

Toke is probably getting nightmares reading this - sweet dreams ;-)


I *think* this software-retry does not apply to frames on a block-ack enabled
TID at least...that appeared to be the case with wave-2 firmware I just got 
through
modifying similar logic.

Just maybe that will help him sleep a bit better :)

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: How many null-data probes on connection loss?

2018-09-26 Thread Ben Greear

On 09/26/2018 11:26 AM, Johannes Berg wrote:

On Wed, 2018-09-26 at 11:04 -0700, Ben Greear wrote:


I have been running with mac80211/mlme.c's max_nullfunc_tries set to 5 for many 
years.
Long ago it helped with connectivity issues with lots of vdevs and and/orloaded 
APs
if I recall correctly.


That's different, that's the number of distinct frames mac80211 will
send.

I thought you were asking about *retries*.


Well, it retries the probe action 5 times in my case.

I am also asking about total amount of retried frames on the air.


In fact, I see 62 frames captured on air all with the same sequence number
in the test I just did, and subsequent frames with the next seq-no are sent
immediately after the first one.  The frames are all right after each other, so
I guess this is probably firmware doing lots of HW retransmits and then *also*
doing software retransmits in the firmware (my reading of mlme.c indicates it 
should
only probe every 500ms).


Yes.


I think I'll start by making sure the firmware does not do software retransmits
for frames from the driver (self-gen frames are OK to be retransmitted I guess).


You do want it to be doing retries for frames from the driver, since you
want it to recover from temporary collisions with a microwave and
whatnot ... just not *that many*, I guess.


From what I can tell so far, my firmware has this sort of logic:

frame from stack to the driver
  -> send to firmware
  -> in firmware, hardware will do up to X retries (maybe 16 or so, need to 
check)
  -> On failure, the firmware may re-queue the packet (firmware-software retry)
  -> back to hardware retries (~32 frames on air at this point)
  ...
  Eventually tx-fail notification is sent back to the driver one way or another.

I am thinking it would be best to have the software retry in the firmware
disabled.

Then, when mac80211 sends a null-data frame, you would see at most about
16 of them on air, every 500ms or so until it recovers or considers the
connection lost.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: How many null-data probes on connection loss?

2018-09-26 Thread Ben Greear

On 09/26/2018 01:38 AM, Johannes Berg wrote:

On Tue, 2018-09-25 at 16:12 -0700, Ben Greear wrote:

While testing out some other issue, I noticed that my ath10k system creates
several hundred null-data probes when I abruptly down the AP the station
is connected to.

I guess this is because I use the mac80211 stack to handle the probes, and
the firmware then retries each mac80211 probe many times.

So, in the case where mac80211 is sending a null-data probe, is the assumption
that the driver will try each frame exactly once?


Not really, it should be treated like any other management frame.


Or is several hundred frames expected?  I'm guessing the former, but before I go
hacking firmware, I thought I would ask...


Certainly not several hundred, but maybe a dozen? I think iwlwifi uses
16, and minstrel would set up max_rate_tries, which drivers set to
somewhere between 1 and 18? One seems s a bit low, mt76?

johannes



I have been running with mac80211/mlme.c's max_nullfunc_tries set to 5 for many 
years.
Long ago it helped with connectivity issues with lots of vdevs and and/orloaded 
APs
if I recall correctly.

In fact, I see 62 frames captured on air all with the same sequence number
in the test I just did, and subsequent frames with the next seq-no are sent
immediately after the first one.  The frames are all right after each other, so
I guess this is probably firmware doing lots of HW retransmits and then *also*
doing software retransmits in the firmware (my reading of mlme.c indicates it 
should
only probe every 500ms).

I think I'll start by making sure the firmware does not do software retransmits
for frames from the driver (self-gen frames are OK to be retransmitted I guess).

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



How many null-data probes on connection loss?

2018-09-25 Thread Ben Greear

While testing out some other issue, I noticed that my ath10k system creates
several hundred null-data probes when I abruptly down the AP the station
is connected to.

I guess this is because I use the mac80211 stack to handle the probes, and
the firmware then retries each mac80211 probe many times.

So, in the case where mac80211 is sending a null-data probe, is the assumption
that the driver will try each frame exactly once?

Or is several hundred frames expected?  I'm guessing the former, but before I go
hacking firmware, I thought I would ask...

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Netgear R7800 - ath10k QCA 9984 firmware crash

2018-09-21 Thread Ben Greear




On 09/19/2018 01:56 PM, Bruno Rosset wrote:

Hi,

Done, thanks
Le mer. 19 sept. 2018 à 22:45, Ben Greear  a écrit :


On 09/19/2018 01:37 PM, Bruno Rosset wrote:

Hi all,




Various tests with others firmware from candelatech , stock LEDE 17.01
or dd-wrt give me firmware crash also


The crash in my firmware decodes to something that looks probably like
the calibration data is wrong.

This is probably some issue with how the board data is selected, or maybe
the wrong board data is installed on the file system.

Thanks,
Ben



Please open a bug for the candela firmware crash, I might be able to figure
out why it crashes:

https://github.com/greearb/ath10k-ct/issues

https://www.candelatech.com/ath10k-bugs.php

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: Problem with sending pkt on a monitor port

2018-09-20 Thread Ben Greear




On 09/20/2018 12:31 AM, Johannes Berg wrote:

On Wed, 2018-09-19 at 13:47 -0700, Ben Greear wrote:


For one, the driver has no concept of the original vif, since monitor
vifs aren't added to it.


ath10k does create a monitor vif, but maybe it is not mapped directly
to mac80211.


It's actually created by mac80211, but only once, and not directly
mapped to each vif seen by userspace - it's an internal construction.


I'm not sure it matters, but ath10k firmware can also create a monitor vdev
itself for certain reasons.  (Maybe offchannel tx on some FW, but I haven't 
looked at
that code lately).


Secondly, the old hostapd code before nl80211 injects frames that way,
and they need to go there.


Ok, I agree we should not break backwards compat then.  I'll poke some more
to see if I can get it working.


We might get away with doing this only for cooked monitor mode, which it
used there ...

However, thinking about it, that also breaks userspace in other ways -
for example if you do injection this way you actually get encryption and
other nice things if you use the local address that matches an existing
interface.


I'm not entirely sure of a useful use-case for this feature in user-space.
I am using it just to test sending some test frames to debug some firmware
features.  I think another user sent hand-crafted specialized beacons in this 
manner
using my 10.1 ath10k firmware & driver.  For whatever reason, I didn't realize 
monitor
vdevs were not directly used when I added that support..maybe I just got lucky
before I had to dig closely.



Perhaps you should just use a different address, and then nothing of the
sort would happen?


If I make the code in my original email be skipped, so that sdata remains the
monitor vdev, then it fails a check later in that method because there is no
chanctxt for the monitor sdata object.

I guess that changing the source MAC to something unique would cause the same
issue and no frame would be sent towards the driver.

Thanks,
Ben



johannes



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: Problem with sending pkt on a monitor port

2018-09-19 Thread Ben Greear

On 09/19/2018 01:35 PM, Johannes Berg wrote:

On Wed, 2018-09-19 at 13:33 -0700, Ben Greear wrote:

This is with a modified 4.16.18+ kernel, though the code in question
is from 2011, so this is not new...

I am attempting to use radiotap packet injection on a monitor port.

In the ieee80211_monitor_start_xmit method, before this code below
runs, sdata is 'moni6a', my monitor port.  But, since I have a
station wlan1 with the same MAC address, then when this code is
completed, stdata becomes wlan1.

Ath10k has all sorts of issues transmitting raw frames, and sending on
the wrong vdev only makes it even more broken!

If user-space binds a socket to a monitor vdev and transmits a frame,
why would we want to change the vdev here?


For one, the driver has no concept of the original vif, since monitor
vifs aren't added to it.


ath10k does create a monitor vif, but maybe it is not mapped directly
to mac80211.


Secondly, the old hostapd code before nl80211 injects frames that way,
and they need to go there.


Ok, I agree we should not break backwards compat then.  I'll poke some more
to see if I can get it working.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Netgear R7800 - ath10k QCA 9984 firmware crash

2018-09-19 Thread Ben Greear

On 09/19/2018 01:37 PM, Bruno Rosset wrote:

Hi all,




Various tests with others firmware from candelatech , stock LEDE 17.01
or dd-wrt give me firmware crash also


Please open a bug for the candela firmware crash, I might be able to figure
out why it crashes:

https://github.com/greearb/ath10k-ct/issues

https://www.candelatech.com/ath10k-bugs.php

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Problem with sending pkt on a monitor port

2018-09-19 Thread Ben Greear

This is with a modified 4.16.18+ kernel, though the code in question
is from 2011, so this is not new...

I am attempting to use radiotap packet injection on a monitor port.

In the ieee80211_monitor_start_xmit method, before this code below
runs, sdata is 'moni6a', my monitor port.  But, since I have a
station wlan1 with the same MAC address, then when this code is
completed, stdata becomes wlan1.

Ath10k has all sorts of issues transmitting raw frames, and sending on
the wrong vdev only makes it even more broken!

If user-space binds a socket to a monitor vdev and transmits a frame,
why would we want to change the vdev here?

list_for_each_entry_rcu(tmp_sdata, >interfaces, list) {
if (!ieee80211_sdata_running(tmp_sdata))
continue;
if (tmp_sdata->vif.type == NL80211_IFTYPE_MONITOR ||
tmp_sdata->vif.type == NL80211_IFTYPE_AP_VLAN ||
tmp_sdata->vif.type == NL80211_IFTYPE_WDS)
continue;
if (ether_addr_equal(tmp_sdata->vif.addr, hdr->addr2)) {
sdata = tmp_sdata;
break;
}
}

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Crash in stock Fedora 4.17 kernel in ieee80211_set_wmm_default

2018-08-17 Thread Ben Greear

On 08/17/2018 03:49 PM, Ben Greear wrote:

I have been running some tests on un-modified Fedora 27, with a stock QCA 
firmware-2.bin.


I realized a bit later than NetworkManager (or wpa_supplicant service?) was 
already doing something with this
interface.  When I disabled wpa_supplicant service, the crashes stopped.

Still a bug, but maybe that will help someone understand why and/or reproduce.

Thanks,
Ben



[root@lf0350-0a0e ~]# uname -a
Linux lf0350-0a0e 4.17.14-102.fc27.x86_64 #1 SMP Wed Aug 15 12:26:40 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux

[root@lf0350-0a0e network-scripts]# ethtool -i wlp5s0
driver: ath10k_pci
version: 4.17.14-102.fc27.x86_64
firmware-version: 10.1.467.3-1
expansion-rom-version:
bus-info: :05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

[root@lf0350-0a0e network-scripts]# cat /root/run_sup.sh
#!/bin/bash

WLAN=wlp5s0
CONF=/root/bagel.conf

wpa_supplicant -g /var/run/wpa_supplicant_if_wiphy1 -B -P 
/tmp/wpa_supplicant-wiphy1.pid -t -f /tmp/wpa_supplicant_log_wiphy1.txt -dd -K 
-Dnl80211 -i $WLAN -c
$CONF


# cat /root/bagel.conf
ctrl_interface=/var/run/wpa_supplicant
fast_reauth=1
p2p_disabled=1
bss_max_count=2000

network={
ssid="HOME-C9EC-2.4"

proto=RSN
key_mgmt=WPA-PSK WPA-PSK-SHA256
psk="XXX"
#psk=xxx
pairwise=TKIP CCMP
group=TKIP CCMP
proactive_key_caching=0

}



The kernel reliably crashes when I start up supplicant with the script above.

Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: authenticate with ec:aa:a0:f6:e3:98
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: send auth to ec:aa:a0:f6:e3:98 (try 
1/3)
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: authenticated
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: associate with ec:aa:a0:f6:e3:98 
(try 1/3)
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: RX AssocResp from ec:aa:a0:f6:e3:98 
(capab=0x431 status=0 aid=1)
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: associated
Aug 17 15:37:28 lf0350-0a0e wpa_supplicant[974]: wlp5s0: No network 
configuration found for the current AP
Aug 17 15:37:28 lf0350-0a0e kernel: wlp5s0: deauthenticating from 
ec:aa:a0:f6:e3:98 by local choice (Reason: 3=DEAUTH_LEAVING)
Aug 17 15:37:28 lf0350-0a0e kernel: general protection fault:  [#1] SMP 
NOPTI
Aug 17 15:37:28 lf0350-0a0e kernel: Modules linked in: ip6t_rpfilter 
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat 
ebtable_broute bridge
stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw 
iptable_security ebtable_filter ebtables ip6table_filter ip6_tables arc4 sunrpc 
ath10k_pci
ath10k_core mac80211 amd64_edac_mod edac_mce_amd kvm_amd kvm ath irqbypass 
crct10dif_pclmul crc32_pclmul sdhci_pci cfg80211 leds_apu ghash_clmulni_intel 
cqhci
sdhci igb fam15h_power sp5100_tco mmc_core i2c_piix4 k10temp rfkill 
i2c_algo_bit dca ccp shpchp pcc_cpufreq acpi_cpufreq crc32c_intel
Aug 17 15:37:28 lf0350-0a0e kernel: CPU: 3 PID: 974 Comm: wpa_supplicant Not 
tainted 4.17.14-102.fc27.x86_64 #1
Aug 17 15:37:28 lf0350-0a0e kernel: Hardware name: PC Engines APU2/APU2, BIOS 
4.0.7 02/28/2017
Aug 17 15:37:28 lf0350-0a0e kernel: RIP: 
0010:ieee80211_regulatory_limit_wmm_params.part.19+0x63/0xe0 [mac80211]
Aug 17 15:37:28 lf0350-0a0e kernel: RSP: 0018:9d7fc1177940 EFLAGS: 00010297
Aug 17 15:37:28 lf0350-0a0e kernel: RAX: 2c5ff8c7f2828f00 RBX: 8bf9983ee8c0 
RCX: 2c5ff8c7f2828f00
Aug 17 15:37:28 lf0350-0a0e kernel: RDX: 0025b840 RSI: 00259130 
RDI: 8bf997fa5020
Aug 17 15:37:28 lf0350-0a0e kernel: RBP:  R08:  
R09: 
Aug 17 15:37:28 lf0350-0a0e kernel: R10: 001f R11: 03ff 
R12: 9d7fc117797e
Aug 17 15:37:28 lf0350-0a0e kernel: R13: 8bf9983ee8c0 R14:  
R15: 8bf996a50760
Aug 17 15:37:28 lf0350-0a0e kernel: FS:  7f8bc46f8300() 
GS:8bf99ed8() knlGS:
Aug 17 15:37:28 lf0350-0a0e kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Aug 17 15:37:28 lf0350-0a0e kernel: CR2: 55efb3ce6728 CR3: 000102666000 
CR4: 000406e0
Aug 17 15:37:28 lf0350-0a0e kernel: Call Trace:
Aug 17 15:37:28 lf0350-0a0e kernel:  ieee80211_set_wmm_default+0x2f7/0x3a0 
[mac80211]
Aug 17 15:37:28 lf0350-0a0e kernel:  ieee80211_set_disassoc+0x21b/0x5b0 
[mac80211]
Aug 17 15:37:28 lf0350-0a0e kernel:  ? avtab_search_node+0xb1/0x100
Aug 17 15:37:28 lf0350-0a0e kernel:  ieee80211_mgd_deauth+0x113/0x230 [mac80211]
Aug 17 15:37:28 lf0350-0a0e kernel:  cfg80211_mlme_deauth+0xaf/0x1c0 [cfg80211]
Aug 17 15:37:28 lf0350-0a0e kernel:  ? startup_64+0x3/0x30
Aug 17 15:37:28 lf0350-0a0e kernel:  nl80211_deauthenticate+0x11f/0x160 
[cfg80211]
Aug 17 

Crash in stock Fedora 4.17 kernel in ieee80211_set_wmm_default

2018-08-17 Thread Ben Greear
15:37:28 lf0350-0a0e kernel:  ? __switch_to+0x16f/0x4c0
Aug 17 15:37:28 lf0350-0a0e kernel:  genl_rcv_msg+0x47/0x90
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __kmalloc_node_track_caller+0x1f9/0x2a0
Aug 17 15:37:28 lf0350-0a0e kernel:  ? genl_family_rcv_msg+0x390/0x390
Aug 17 15:37:28 lf0350-0a0e kernel:  netlink_rcv_skb+0x4d/0x130
Aug 17 15:37:28 lf0350-0a0e kernel:  genl_rcv+0x24/0x40
Aug 17 15:37:28 lf0350-0a0e kernel:  netlink_unicast+0x1a3/0x250
Aug 17 15:37:28 lf0350-0a0e kernel:  netlink_sendmsg+0x2c1/0x3c0
Aug 17 15:37:28 lf0350-0a0e kernel:  sock_sendmsg+0x36/0x40
Aug 17 15:37:28 lf0350-0a0e kernel:  ___sys_sendmsg+0x2a0/0x2f0
Aug 17 15:37:28 lf0350-0a0e kernel:  ? unix_dgram_sendmsg+0x35e/0x6f0
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x34/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to+0x16f/0x4c0
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __switch_to_asm+0x40/0x70
Aug 17 15:37:28 lf0350-0a0e kernel:  ? __sys_sendmsg+0x5e/0xa0
Aug 17 15:37:28 lf0350-0a0e kernel:  __sys_sendmsg+0x5e/0xa0
Aug 17 15:37:28 lf0350-0a0e kernel:  do_syscall_64+0x5b/0x160
Aug 17 15:37:28 lf0350-0a0e kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 17 15:37:28 lf0350-0a0e kernel: RIP: 0033:0x7f8bc2e40387
Aug 17 15:37:28 lf0350-0a0e kernel: RSP: 002b:7ffe48b99778 EFLAGS: 0246 
ORIG_RAX: 002e
Aug 17 15:37:28 lf0350-0a0e kernel: RAX: ffda RBX: 555c886749a0 
RCX: 7f8bc2e40387
Aug 17 15:37:28 lf0350-0a0e kernel: RDX:  RSI: 7ffe48b997b0 
RDI: 0007
Aug 17 15:37:28 lf0350-0a0e kernel: RBP: 555c886cdc20 R08:  
R09: 000d
Aug 17 15:37:28 lf0350-0a0e kernel: R10: 555c8866a010 R11: 0246 
R12: 555c886748b0
Aug 17 15:37:28 lf0350-0a0e kernel: R13: 7ffe48b997b0 R14:  
R15: 7ffe48b99b80
Aug 17 15:37:28 lf0350-0a0e kernel: Code: 2d ce ff 48 85 c0 74 7a 48 3d 00 f0 ff ff 77 72 48 8b 40 18 48 85 c0 74 69 83 bb d0 0a 00 00 03 48 8d 0c e8 75 05 48 
8d 4c e8 20 <0f> b7 01 41 0f b7 7c 24 02 41 0f b7 14 24 66 39 c7 0f 47 c7 41

Aug 17 15:37:28 lf0350-0a0e kernel: RIP: 
ieee80211_regulatory_limit_wmm_params.part.19+0x63/0xe0 [mac80211] RSP: 
9d7fc1177940
Aug 17 15:37:28 lf0350-0a0e kernel: ---[ end trace 28cadc83f715e641 ]---

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] mac80211: Run TXQ teardown code before de-registering interfaces

2018-08-13 Thread Ben Greear

On 08/13/2018 11:25 AM, Arend van Spriel wrote:

On 8/13/2018 2:16 PM, Toke Høiland-Jørgensen wrote:

The TXQ teardown code can reference the vif data structures that are
stored in the netdev private memory area if there are still packets on
the queue when it is being freed. Since the TXQ teardown code is run
after the netdevs are freed, this can lead to a use-after-free. Fix this
by moving the TXQ teardown code to earlier in ieee80211_unregister_hw().


Just off the bat, but from reading the above I am wondering whether the 
use-after-free could also happen upon removing an interface?


At least in practice, it does not seem to happen.  Some of our test cases bring 
up and down
netdevs very often, and those doe not seem to trigger this bug.

But, could be luck, of course.

Crashing ath10k firmware under tx load, and unloading modules under tx load
seems to be the main trigger.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

2018-08-12 Thread Ben Greear

On 08/02/2018 01:20 PM, Toke Høiland-Jørgensen wrote:

Ben Greear  writes:


On 08/02/2018 12:45 PM, Toke Høiland-Jørgensen wrote:

Ben Greear  writes:


This is from my hacked kernel, could be my fault. I thought the fq
guys might want to know however...


Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
packet from the queue; it only has two memory derefs, to fq->lock and
flow->queue. Don't see why either of those should be freed at this
point.

Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
flow->tin reference could be the problem, if the txq_info struct was
already freed; did you change anything around the handling of TXQs?


I have worked on some stuff to fix other leaks and corruptions in ath10k related
to txqs, maybe that is part of this problem.  My full tree is here:

https://github.com/greearb/linux-ct-4.16

This bug in question is fairly repeatable on my current setup, which
is high speed tx + rx on a 9984 NIC, with buggy firmware that crashes
often in the tx path. I think the crash only happens when I rmmod the
driver under load, but possibly some of the fw crash cleanup logic
that ran previously is also involved.


Yeah, if it happens under load that is consistent with packets being
queued.

It seems that mac80211 frees the netdevs of an interface before flushing
the TXQs, which may be the cause of the bug you are seeing. Could you
try the patch below and see if that fixes the issue?


I've run with this for a few days, and it seems to at least not cause
any extra problems.  I mostly fixed the firmware crashing I was seeing
before, so not certain it fixes the root cause of the crashes I
saw before.  I'm going to roll this into my 4.16 ct kernel for wider
testing.

Thanks,
Ben



-Toke


diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index e65c2abb2a54..d21ef14d327d 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -1213,6 +1213,7 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
  #if IS_ENABLED(CONFIG_IPV6)
unregister_inet6addr_notifier(>ifa6_notifier);
  #endif
+   ieee80211_txq_teardown_flows(local);

rtnl_lock();

@@ -1241,7 +1242,6 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
skb_queue_purge(>skb_queue);
skb_queue_purge(>skb_queue_unreliable);
skb_queue_purge(>skb_queue_tdls_chsw);
-   ieee80211_txq_teardown_flows(local);

destroy_workqueue(local->workqueue);
wiphy_unregister(local->hw.wiphy);




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: ath10k SWBA overrun / tx credit starvation

2018-08-06 Thread Ben Greear

On 08/06/2018 05:15 AM, Martin Willi wrote:

Hi Ben,

Thanks for your help.


If you use the -ct firmware and the -ct driver, you can configure
more than 2 tx-credits.


Unfortunately, this didn't help, either. I hit these issues even sooner
with any 10.1-based firmware (including CT), which implies that at
least some of them have been addressed with 10.2/10.2.4.


Out of curiosity, how soon could you hit it with -ct firmware?  We often
see these around once per day in some of our test cases, rarely more often
than that.


I am not sure it resolves everything and a buggy firmware would still
cause issues no matter.


As a work-around, I'm experimenting with handling timeout conditions in
ath10k_wmi_cmd_send() caused by missing credits. Given that we can't do
any TX-flush or warm-restart over WMI under these conditions, I just
issue a hardware restart (patch below).

Some initial tests show that this in fact recovers the module from its
bad state with just a small connectivity gap; certainly much better
than that unpredictable behavior we've seen previously.


Yes, I have been using a similar patch for years with good results.

Thanks,
Ben



I'll do some more testing with this approach before considering to
upstream it.

Regards
Martin

---

From fd9e90d0294450c093d243ee4f1eb1e07b1cd73a Mon Sep 17 00:00:00 2001
From: Martin Willi 
Date: Fri, 3 Aug 2018 14:23:30 +0200
Subject: [PATCH] ath10k: Schedule hardware restart if WMI command times out

If the TX queue gets stuck for some reason, we run out of tx credits and
are unable to send any commands over WMI. To recover from this situation,
issue a hard hardware restart. This implies a connectivity outage of about
1.4s in AP mode, but brings back the interface to a usable state.

Signed-off-by: Martin Willi 
---
 drivers/net/wireless/ath/ath10k/wmi.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 38a97086708b..d39a983f4a1f 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1852,6 +1852,12 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct 
sk_buff *skb, u32 cmd_id)
if (ret)
dev_kfree_skb_any(skb);

+   if (ret == -EAGAIN) {
+   ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
+   cmd_id);
+   queue_work(ar->workqueue, >restart_work);
+   }
+
return ret;
 }





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

2018-08-02 Thread Ben Greear

On 08/02/2018 12:45 PM, Toke Høiland-Jørgensen wrote:

Ben Greear  writes:


This is from my hacked kernel, could be my fault. I thought the fq
guys might want to know however...


Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
packet from the queue; it only has two memory derefs, to fq->lock and
flow->queue. Don't see why either of those should be freed at this
point.

Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
flow->tin reference could be the problem, if the txq_info struct was
already freed; did you change anything around the handling of TXQs?


I have worked on some stuff to fix other leaks and corruptions in ath10k related
to txqs, maybe that is part of this problem.  My full tree is here:

https://github.com/greearb/linux-ct-4.16

This bug in question is fairly repeatable on my current setup, which is high 
speed
tx + rx on a 9984 NIC, with buggy firmware that crashes often in the tx
path.  I think the crash only happens when I rmmod the driver under
load, but possibly some of the fw crash cleanup logic that ran previously
is also involved.

I'll get the FW fixed sooner or later and quite reloading modules, and then this
problem will probably go away.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

2018-08-01 Thread Ben Greear

This is from my hacked kernel, could be my fault.  I thought the fq guys might
want to know however...

==
BUG: KASAN: use-after-free in fq_flow_dequeue+0x353/0x3c0 [mac80211]
Read of size 4 at addr 88013d92a700 by task rmmod/813

audit: type=1130 audit(1533153605.287:233): pid=1 uid=0 auid=4294967295 ses=4294967295 
msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/s'
CPU: 0 PID: 813 Comm: rmmod Tainted: GW4.16.18+ #24
Hardware name: _ _/, BIOS 5.11 08/26/2016
Call Trace:
 dump_stack+0x7c/0xbf
 print_address_description+0x70/0x280
audit: type=1131 audit(1533153605.287:234): pid=1 uid=0 auid=4294967295 ses=4294967295 
msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/s'
 ? fq_flow_dequeue+0x353/0x3c0 [mac80211]
 kasan_report+0x25c/0x350
 fq_flow_dequeue+0x353/0x3c0 [mac80211]
 fq_flow_reset.constprop.56+0x2b/0x2d0 [mac80211]
 fq_reset.constprop.53+0x79/0x110 [mac80211]
 ieee80211_txq_teardown_flows+0xc2/0x100 [mac80211]
 ieee80211_unregister_hw+0x17b/0x260 [mac80211]
 ath10k_mac_unregister+0x35/0x1a0 [ath10k_core]
 ath10k_core_unregister+0x60/0x160 [ath10k_core]
 ath10k_pci_remove+0x53/0x100 [ath10k_pci]
 pci_device_remove+0x97/0x1d0
 device_release_driver_internal+0x26f/0x520
 driver_detach+0x9d/0x140
 bus_remove_driver+0xde/0x2c0
 pci_unregister_driver+0x28/0x1a0
 ath10k_pci_exit+0xc/0x14 [ath10k_pci]
 SyS_delete_module+0x39a/0x4a0
 ? free_module+0x7d0/0x7d0
 ? exit_to_usermode_loop+0x75/0xf0
 ? free_module+0x7d0/0x7d0
 do_syscall_64+0x193/0x5e0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f65a31ac5e7
RSP: 002b:7ffd0781e9a8 EFLAGS: 0206 ORIG_RAX: 00b0
RAX: ffda RBX: 7ffd0781e9f8 RCX: 7f65a31ac5e7
RDX: 000a RSI: 0800 RDI: 55e08a426248
RBP: 55e08a4261e0 R08: 000a R09: 1999
R10: 7f65a321c1a0 R11: 0206 R12: 7ffd0781ebc0
R13: 7ffd07820643 R14:  R15: 55e08a4261e0

The buggy address belongs to the page:
page:ea0004f64a80 count:0 mapcount:0 mapping: 
index:0x88013d92a640
flags: 0x5fff80()
raw: 005fff80  88013d92a640 
raw:  dead0200 88014c02a600 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88013d92a600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 88013d92a680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>88013d92a700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ^
 88013d92a780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 88013d92a800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
======


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v4 3/3] mac80211: Fix PTK rekey freezes and cleartext leaks

2018-08-01 Thread Ben Greear
etter than current stable
kernels. Of course we can handle that like in the V2 version of the
patch. I kind of like the idea...Is that something I shall add in the
next version?


I think if mac80211 will stop sending frames, then a driver *should* be able
to implement flush one way or another.  But, having the driver itself
try to flush in a key_replace() method while the rest of the system might
still be sending it frames is probably going to be more complicated.

You could handle it on a per-driver basis by having the driver somehow notify
your logic whether a flush is desired or not?


Someone could also tweak supplicant to somewhat randomize the rekey timers,
but that would take a while to propagate to all of the station devices out
there.


It may not even be a realistic problem. Just trying to think of worst
cases:-) The reconnections will for sure be spread a few ms, at least.
What is the longest time we have to keep the old key programmed in HW?


If you are trying to transmit a frame with the wrong key, probably it will
be retransmitted a lot since the receiver should dispose of it, so I guess
a few seconds max?

ath10k firmware will only dispose of old keys once all packets referencing
it are removed.

Either way, this is a fixable problem, and a flush should certainly fix it
if there is no better way.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v4 3/3] mac80211: Fix PTK rekey freezes and cleartext leaks

2018-08-01 Thread Ben Greear

On 08/01/2018 10:01 AM, Alexander Wetzel wrote:

Hallo,


Rekeying a pairwise key with encryption offload and only keyid 0 had
multiple races. Two of them could freeze the wlan connection till
rekeyed again and the third could send out packets in clear which should
have been encrypted.

 1) Freeze caused by incoming packets:
If the local STA installed the key prior to the remote STA we still
had the old key active in the hardware while mac80211 was already
operating on the new key.
The card could still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number
and tricking the local replay detection to later drop all packets
really sent with the new key.

 2) Freeze caused by outgoing packets:
If mac80211 was providing the PN (IV) and handed over a cleartext
packets for encryption to the hardware prior to a key change the
driver/card could have processed the queued packets after switching
to the new key.
This immediately bumped the PN (IV) value on the remote STA to
an incorrect high number, which also froze the connection.

 3) Clear text leak:
Removing encryption offload from the card cleared the encryption
offload flag only after the card had removed the key. Packets handed
over between that were send out unencrypted.

To prevent those issues we now stop queuing packets to the driver while
replacing the key, replace the key first in the HW and stop/block new
aggregation sessions during the rekey.


I guess the replace_key logic will have to flush the tx hardware/firmware
queues and any other packets currently owned by the driver before it can
replace the key?


That is one of the options, but looks like we can also avoid it with
some efforts. Question will be what's the best approach per driver and
how complex we want the code to become...
My current ath9k fix depends on flush - since it's simple - but it
definitely looks like it can be done without. (It looks like the
userspace updates are more urgent - assuming we can even agree on this
overall solution - and I've not done anything on it for weeks.)

There are two ways to establish the boundary we need here:
1) Make sure all old packets have been send/dropped or 2) continue to
use the old key for them.

Iwlwifi e.g. is handing over the key with the packet to the HW and the
driver simply uses the key it was told per packet. Implementing
replace_key is trivial, we can just call the existing set_key function
for deletion and addition from replace_key if we want. (I've tested that
quite extensive and it works flawless with my dvm card.)

>

Drivers uploading the keys to the HW and then just tell the HW to use
key ID X for encryption - like ath9, the only other driver I drilled
deeper so far - are more tricky. But replace_key is still allowed to
change the HW key ID for the new key if desired by the driver.

The driver can therefore only "deprecate" a key but keep it programmed
in HW, assign and program a new key and return to mac80211.
Any queued packets will still lookup and use the old key while new
packets will tell the HW to use the new one. Problem solved, if we
ignore the headache when and how to really free the old key id and
remove it from the HW. In the worst case we wait the max time any packet
can be stuck in the queue. Of course this may cause other issues in at
least some special cases. The one I can think of would be:

A busy AP with many clients connected and rekey enabled gets rebooted
during work hours. All clients reconnect at basically the same time.
Thus all of those will also rekey at nearly the same time and therefore
all need two PTK key slots in the HW for some seconds, potentially
exceeding the available ones. So you may get some strange effects after
the rekey interval expired, long after the reboot...


The ath10k firmware already sort of handles stale keys as far as I can tell,
and early on in my testing, which often has this issue of all
stations rekeying at once, I was running out of firmware key objects.  I
increased the multiplier in the driver and made the firmware smarter about
recycling key objects, and that fixed the problem for me.

So, maybe this would just work with ath10k, but if there were any issues,
probably you'd have to fall back to flushing everything.  And, that might
be a way to work somewhat generically with drivers that don't have special
replace-key logic?

Someone could also tweak supplicant to somewhat randomize the rekey timers,
but that would take a while to propagate to all of the station devices out
there.

Thanks,
Ben



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v4 3/3] mac80211: Fix PTK rekey freezes and cleartext leaks

2018-07-31 Thread Ben Greear

On 07/31/2018 01:10 PM, Alexander Wetzel wrote:

Rekeying a pairwise key with encryption offload and only keyid 0 had
multiple races. Two of them could freeze the wlan connection till
rekeyed again and the third could send out packets in clear which should
have been encrypted.

 1) Freeze caused by incoming packets:
If the local STA installed the key prior to the remote STA we still
had the old key active in the hardware while mac80211 was already
operating on the new key.
The card could still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number
and tricking the local replay detection to later drop all packets
really sent with the new key.

 2) Freeze caused by outgoing packets:
If mac80211 was providing the PN (IV) and handed over a cleartext
packets for encryption to the hardware prior to a key change the
driver/card could have processed the queued packets after switching
to the new key.
This immediately bumped the PN (IV) value on the remote STA to
an incorrect high number, which also froze the connection.

 3) Clear text leak:
Removing encryption offload from the card cleared the encryption
offload flag only after the card had removed the key. Packets handed
over between that were send out unencrypted.

To prevent those issues we now stop queuing packets to the driver while
replacing the key, replace the key first in the HW and stop/block new
aggregation sessions during the rekey.


I guess the replace_key logic will have to flush the tx hardware/firmware
queues and any other packets currently owned by the driver before it can
replace the key?


..


+   /* Key is beeing removed */


Looks like 'being' is mis-spelled?


Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: ath10k SWBA overrun / tx credit starvation

2018-07-30 Thread Ben Greear

On 07/30/2018 01:12 AM, Martin Willi wrote:

Hi,

We are experiencing some issues when running ath10k in AP mode.
Unfortunately, I didn't manage to reproduce the issue in the lab, but
in the field we see it roughly once a day on one out of fifty devices.

The symptoms are the logged "SWBA overruns" followed by a kernel
WARNING when removing a station (see below), followed by many more
"SWBA overruns". It seems that the firmware and kernel get out of sync
about the associated stations. The module does not recover, but the
whole networking stack gets very sluggish, probably due to a lock held
for many seconds. Bringing down the affected network interface takes
some extra seconds, but then allows recovering from that issue.

We are running 4.14-stable, and tried many firmware versions, including
 10.2.4.70-2, 10.2.4-1.0-00040, 10.2.4.70.61-2, 10.2.4.70.67 and
firmware-2-ct-full-community-20, but the issue remains. Hardware is
QCA9882 on a WLE600VX.

I stumbled over a some years old discussion at [1] about tx credit
starvation. Is this still the same issue we are seeing? Given that the
mentioned newer firmware versions did not help here, is there anything
else we can try?


If you use the -ct firmware and the -ct driver, you can configure more than
2 tx-credits.  Search for 'TARGET_HTC_MAX_TX_CREDITS_CT' in the -ct driver
and change it, maybe to 4 or 6.

The support was added in this patch and it has some comments that explain
it a bit:

https://github.com/greearb/linux-ct-4.16/commit/59acbd0481e8fc2028373ed01a0ec5212990b330

I don't know that it will fix your problem, but it might, and is something
to try.

The -ct driver also has several other patches that attempt to improve
tx credits issues, but I am not sure it resolves everything and a buggy
firmware would still cause issues no matter.

Thanks,
Ben



Thanks!
Martin

[1] https://lists.infradead.org/pipermail/ath10k/2015-June/005340.html

---

15:27:39 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:41 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:41 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 [ cut here ]
15:27:44 WARNING: CPU: 0 PID: 150 at net/mac80211/sta_info.c:976 
__sta_info_destroy_part2+0x170/0x174
15:27:44 Modules linked in: xt_comment xt_cluster xt_u32 esp4 xfrm6_mode_tunnel 
xfrm4_mode_tunnel ebtable_filter ebtables bridge stp llc xt_policy xt_connmark 
xt_mark xt_set ip_set_hash_ipport ip_set_hash_netnet ip_set iptable_mangle nfnet
15:27:44 CPU: 0 PID: 150 Comm: hostapd Not tainted 4.14.55 #2
15:27:44 Hardware name: Marvell Armada 380/385 (Device Tree)
15:27:44 [] (unwind_backtrace) from [] 
(show_stack+0x10/0x14)
15:27:44 [] (show_stack) from [] (dump_stack+0x88/0x9c)
15:27:44 [] (dump_stack) from [] (__warn+0xe8/0x100)
15:27:44 [] (__warn) from [] (warn_slowpath_null+0x20/0x28)
15:27:44 [] (warn_slowpath_null) from [] 
(__sta_info_destroy_part2+0x170/0x174)
15:27:44 [] (__sta_info_destroy_part2) from [] 
(__sta_info_destroy+0x20/0x28)
15:27:44 [] (__sta_info_destroy) from [] 
(sta_info_destroy_addr_bss+0x2c/0x44)
15:27:44 [] (sta_info_destroy_addr_bss) from [] 
(nl80211_del_station+0xc8/0x100)
15:27:44 [] (nl80211_del_station) from [] 
(genl_rcv_msg+0x2f8/0x3c8)
15:27:44 [] (genl_rcv_msg) from [] 
(netlink_rcv_skb+0xac/0x104)
15:27:44 [] (netlink_rcv_skb) from [] (genl_rcv+0x24/0x34)
15:27:44 [] (genl_rcv) from [] (netlink_unicast+0x184/0x21c)
15:27:44 [] (netlink_unicast) from [] 
(netlink_sendmsg+0x334/0x374)
15:27:44 [] (netlink_sendmsg) from [] 
(sock_sendmsg+0x14/0x24)
15:27:44 [] (sock_sendmsg) from [] 
(___sys_sendmsg+0x214/0x228)
15:27:44 [] (___sys_sendmsg) from [] 
(__sys_sendmsg+0x40/0x6c)
15:27:44 [] (__sys_sendmsg) from [] 
(ret_fast_syscall+0x0/0x54)
15:27:44 ---[ end trace 036b835c84274321 ]---
15:27:44 ath10k_warn: 41 callbacks suppressed
15:27:44 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:45 ath10k_pci :01:00.0: SWBA overrun on vdev 0, skipped old beacon




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k and 16 VAP interfaces?

2018-07-30 Thread Ben Greear




On 07/30/2018 04:13 AM, Matthias May wrote:

On 30/07/18 11:40, michael-...@fami-braun.de wrote:

Do you mean AP interfaces as required for multiple BSS/SSIDs?

I'm running a patched ath9k to have hostapd run >8 BSS on a single radio.
It works fine since years.


Yes, I'd love to see any patches you can share on this as well.

Thanks,
Ben



Regards,
M. Braun


Am 27. Juli 2018 15:35:28 MESZ schrieb Ben Greear :

Hello,

Has anyone tried making ath9k able to support 16 vAP interfaces on a
single
radio?  I seem to recall that there were limitations regarding beacon
timers and such, and that is why the current limit is 8?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com




Are these patches available somewhere?
I'm interested to play with them :)

BR
Matthias



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [RFC] ath10k: Support configuring firmware tx-retry counter.

2018-07-27 Thread Ben Greear

On 07/16/2018 05:48 PM, gree...@candelatech.com wrote:

From: Ben Greear 

Firmware already supports this, so just enable the path
to configure the vdev parameter.

One potential issue:  The firmware (10.1 at least) defaults to 2,
and the mac80211 stack defaults to 4.  So applying this patch would
change the firmware behaviour regardless of user settings.

I verified this tweaked the proper firmware setting when I
ran this command:


So, please don't apply this.  At least wave-1 10.1 firmware is going to
crash and/or corrupt its rate-ctrl memory, because it has some logic that
is just full of bugs when there can be more than 2 retries.

I plan to fix this in my firmware, but I doubt upstream FW is fixed
or will be fixed (a quick code inspection of 10.2 fw src makes me think
the problem exists there as well).

Thanks,
Ben



iw phy wiphy0 set retry short 2 long 2

Signed-off-by: Ben Greear 
---
 drivers/net/wireless/ath/ath10k/mac.c | 31 +++
 drivers/net/wireless/ath/ath10k/wmi.c |  2 +-
 drivers/net/wireless/ath/ath10k/wmi.h |  2 ++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index c90c8bb..b458e5b 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2027,6 +2027,16 @@ static int ath10k_mac_vif_setup_ps(struct ath10k_vif 
*arvif)
return 0;
 }

+static int ath10k_mac_vif_config_retry_limit(struct ath10k_vif *arvif, int 
limit)
+{
+   struct ath10k *ar = arvif->ar;
+   int vdev_param = ar->wmi.vdev_param->rc_num_retries;
+
+   lockdep_assert_held(>ar->conf_mutex);
+
+   return ath10k_wmi_vdev_set_param(ar, arvif->vdev_id, vdev_param, limit);
+}
+
 static int ath10k_mac_vif_disable_keepalive(struct ath10k_vif *arvif)
 {
struct ath10k *ar = arvif->ar;
@@ -5516,6 +5526,24 @@ static int ath10k_config_ps(struct ath10k *ar)
return ret;
 }

+static int ath10k_config_retry_limit(struct ath10k *ar, int limit)
+{
+   struct ath10k_vif *arvif;
+   int ret = 0;
+
+   lockdep_assert_held(>conf_mutex);
+
+   list_for_each_entry(arvif, >arvifs, list) {
+   ret = ath10k_mac_vif_config_retry_limit(arvif, limit);
+   if (ret) {
+   ath10k_warn(ar, "failed to setup retry-limit: %d\n", 
ret);
+   break;
+   }
+   }
+
+   return ret;
+}
+
 static int ath10k_mac_txpower_setup(struct ath10k *ar, int txpower)
 {
int ret;
@@ -5592,6 +5620,9 @@ static int ath10k_config(struct ieee80211_hw *hw, u32 
changed)
ath10k_warn(ar, "failed to recalc monitor: %d\n", ret);
}

+   if (changed & IEEE80211_CONF_CHANGE_RETRY_LIMITS)
+   ret = ath10k_config_retry_limit(ar, 
conf->long_frame_max_tx_count);
+
mutex_unlock(>conf_mutex);
return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 6aaa439..816883d 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -874,7 +874,7 @@ static struct wmi_vdev_param_map wmi_10x_vdev_param_map = {
.tx_encap_type = WMI_VDEV_PARAM_UNSUPPORTED,
.ap_detect_out_of_sync_sleeping_sta_time_secs =
WMI_10X_VDEV_PARAM_AP_DETECT_OUT_OF_SYNC_SLEEPING_STA_TIME_SECS,
-   .rc_num_retries = WMI_VDEV_PARAM_UNSUPPORTED,
+   .rc_num_retries = WMI_10X_VDEV_PARAM_RC_NUM_RETRIES,
.cabq_maxdur = WMI_VDEV_PARAM_UNSUPPORTED,
.mfptest_set = WMI_VDEV_PARAM_UNSUPPORTED,
.rts_fixed_rate = WMI_VDEV_PARAM_UNSUPPORTED,
diff --git a/drivers/net/wireless/ath/ath10k/wmi.h 
b/drivers/net/wireless/ath/ath10k/wmi.h
index 95ff280..b01778e 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -5369,6 +5369,8 @@ enum wmi_10x_vdev_param {
WMI_10X_VDEV_PARAM_MCAST2UCAST_SET,
/* Enable/Disable RTS-CTS */
WMI_10X_VDEV_PARAM_ENABLE_RTSCTS,
+   /** Total number of HW retries */
+   WMI_10X_VDEV_PARAM_RC_NUM_RETRIES,

WMI_10X_VDEV_PARAM_AP_DETECT_OUT_OF_SYNC_SLEEPING_STA_TIME_SECS,





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



ath9k and 16 VAP interfaces?

2018-07-27 Thread Ben Greear

Hello,

Has anyone tried making ath9k able to support 16 vAP interfaces on a single
radio?  I seem to recall that there were limitations regarding beacon
timers and such, and that is why the current limit is 8?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: IBSS timeouts

2018-07-17 Thread Ben Greear

On 07/17/2018 10:44 AM, Nicolas Cavallari wrote:

On 17/07/2018 17:02, Ben Greear wrote:

On 07/17/2018 12:57 AM, Nicolas Cavallari wrote:

In IBSS mode, all stations are required to send beacons. The protocol
is a bit
complex to arrange that, every 102.4ms, a station is chosen to emit
the beacon.


I'm not sure this part is correct.  In practice, it seems to often
happen this way,
but last time I read the spec on this it seems like it is supposed to
sort of negotiate
and not have all ibss stations beaconning.


Well, in 802.11-2016, section 11.1.3.5 is clear: "All members of the
IBSS participate in beacon generation."

On each TBTT, each station must wait for a random delay before
transmitting, and the first one to transmit wins.

In theory...



See 11.1.3.5, section d and e.  I think this means that if it receives a beacon 
while
waiting to transmit one, then it cancels its intent to transmit the beacon.  So
while they 'participate', they may not actually generate beacons?

I could be wrong about my interpretation of this however, and I am not sure how 
it
is actually implemented.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: IBSS timeouts

2018-07-17 Thread Ben Greear




On 07/17/2018 12:57 AM, Nicolas Cavallari wrote:

On 16/07/2018 21:31, James Prestwood wrote:

Hello,

I am a developer for IWD and trying to implement IBSS networks. The
initial IBSS_JOIN, 4-way, and setting the keys all works and I am able
to connect two stations. The problem is that I am hitting a timeout in
the kernel once the connection has succeeded and there are no more
frames flowing between stations. I dug around in the kernel and saw
there is a 60 second inactivity timeout which is precicely what is
happening. After setting the keys 60 seconds go by and I recieve a
DEL_STATION (log attached).

My question is: is this timeout expected after the station has been
added and the keys are set? If so how does one reset this timeout so
the connection can remain alive even if no data is being sent?
heartbeat of some kind?


This is not normal, but it isn't your fault. It's more a problem with your card
firmware/driver. What card/driver do you have ?

In IBSS mode, all stations are required to send beacons. The protocol is a bit
complex to arrange that, every 102.4ms, a station is chosen to emit the beacon.


I'm not sure this part is correct.  In practice, it seems to often happen this 
way,
but last time I read the spec on this it seems like it is supposed to sort of 
negotiate
and not have all ibss stations beaconning.

Thanks,
Ben



Receiving beacons from a station is enough to reset its timer, so with a
properly functioning station, this timer does not expire.

Unfortunately, in the wild, nobody test that IBSS beacons are generated, because
without them, an open IBSS still appear to work...



--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


lockdep circular lock related to ath10k fw crash

2018-06-19 Thread Ben Greear
/0x4c0
 ? dev_set_allmulti+0x10/0x10
 dev_change_flags+0x75/0x150
 do_setlink+0x929/0x2be0
 ? validate_linkmsg+0x670/0x670
 ? debug_check_no_locks_freed+0x290/0x290
 ? __read_once_size_nocheck.constprop.8+0x10/0x10
 ? unwind_next_frame+0xfe9/0x19e0
 ? __lock_acquire_lockdep+0xb4d/0x3de0
 ? is_bpf_text_address+0x5c/0xe0
 ? debug_check_no_locks_freed+0x290/0x290
 ? __read_once_size_nocheck.constprop.8+0x10/0x10
 ? is_bpf_text_address+0x79/0xe0
 ? memset+0x1f/0x40
 rtnl_newlink+0xc33/0x12e0
 ? rtnl_newlink+0x715/0x12e0
 ? rtnl_link_unregister+0x200/0x200
 ? unwind_next_frame+0xfe9/0x19e0
 ? is_bpf_text_address+0x5c/0xe0
 ? debug_check_no_locks_freed+0x290/0x290
 ? rtnetlink_rcv_msg+0x273/0x840
 ? lock_downgrade+0x580/0x580
 rtnetlink_rcv_msg+0x2e6/0x840
 ? rtnl_fdb_del+0x7c0/0x7c0
 ? lock_downgrade+0x580/0x580
 netlink_rcv_skb+0x263/0x3b0
 ? rtnl_fdb_del+0x7c0/0x7c0
 ? netlink_ack+0x7f0/0x7f0
 netlink_unicast+0x3d4/0x560
 ? netlink_attachskb+0x630/0x630
 ? dup_iter+0x2a0/0x2a0
 ? __check_object_size+0xfd/0x2b0
 netlink_sendmsg+0x73f/0xae0
 ? copy_msghdr_from_user+0x1f8/0x2f0
 ? netlink_unicast+0x560/0x560
 ? SYSC_sendto+0x2c0/0x2c0
 ? netlink_unicast+0x560/0x560
 sock_sendmsg+0xac/0xe0
 ___sys_sendmsg+0x744/0x8f0
 ? copy_msghdr_from_user+0x2f0/0x2f0
 ? debug_check_no_locks_freed+0x290/0x290
 ? debug_check_no_locks_freed+0x290/0x290
 ? rcu_read_lock_sched_held+0x9e/0x120
 ? __alloc_pages_nodemask+0x4b1/0x590
 ? __handle_mm_fault+0xd7e/0x2bc0
 ? __audit_syscall_entry+0x2f5/0x5f0
 ? lock_downgrade+0x580/0x580
 ? lock_acquire+0x114/0x330
 ? __audit_syscall_entry+0x2f5/0x5f0
 ? __sys_sendmsg+0xab/0x120
 __sys_sendmsg+0xab/0x120
 ? SyS_shutdown+0x150/0x150
 ? __audit_syscall_entry+0x2f5/0x5f0
 ? lock_downgrade+0x580/0x580
 ? syscall_trace_enter+0x51a/0xbf0
 ? do_syscall_64+0x3e/0x5e0
 ? __sys_sendmsg+0x120/0x120
 do_syscall_64+0x193/0x5e0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7fcd271d0150
RSP: 002b:7ffe8d317258 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 5b296a2f RCX: 7fcd271d0150
RDX:  RSI: 7ffe8d3172d0 RDI: 0004
RBP: 7ffe8d3172d0 R08: 0001 R09: 
R10: 05e7 R11: 0246 R12: 7ffe8d317310
R13: 0066b460 R14: 00007ffe8d31f380 R15: 
ath10k_pci :04:00.0: boot hif power up

Suggestions are welcome.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] cfg80211: Add cumulative channel survey dump support.

2018-06-18 Thread Ben Greear

On 06/18/2018 01:46 PM, Johannes Berg wrote:

On Thu, 2018-05-31 at 08:52 -0700, Ben Greear wrote:


On 05/31/2018 02:06 AM, vnara...@codeaurora.org wrote:

On 2018-05-23 15:24, Johannes Berg wrote:

On Tue, 2018-05-22 at 14:45 +0530, Venkateswara Naralasetty wrote:

This patch provides support to send accumulated survey data to
user if low level drivers provides non-accumulated survey data.


I think the commit log should say what you need this for?

It's simultaneously a new flag, and a lot of code, but it's not clear
what the point is?

johannes


I will sent next version of patch with updated commit log.

Providing you the earlier discussion of this patch to give quick overview about 
this patch.
https://patchwork.kernel.org/patch/9701459/


It is simple to fix the firmware, it just has several bugs related to clearing 
the
accumulator as well as the 'real' values.  If you can find a QCA firmware 
engineer
that will accept patches I can show them how to fix this easily.  I recently
fixed this in my wave-1 firmware.  I posted patches to the ath10k driver to
support this a week or so ago, but not sure if Kalle will apply them.


That would be nicer even, but I guess we still have to worry about older
firmware? Perhaps not for survey support? Dunno.

Obviously less code is better, but I can't really say how easy it would
be to get firmware updates rolled out to people who'd need them ...


And, it is likely that even if you don't use the 'clear' option, you are going 
to
get wraps in the firmware and that will just be other harder to debug bugs.


You have to use clear if you don't accumulate completely in fw, no?


At least some firmware is broken in numerous ways, and there is no clear way to
tell which firmware is broken how as far as I can tell.  It does appear to be 
fixed in
recent 10.4 (wave-2) firmware, so maybe they will fix 10.2 as well.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-11 Thread Ben Greear

On 06/10/2018 10:10 AM, Michał Kazior wrote:

Ben,

The patch is symptomatic. fq_tin_dequeue() already checks if the list
is empty before it tries to access first entry. I see no point in
using the _or_null() + WARN_ON.

The 0x3c deref is likely an offset off of NULL base pointer. Did you
check gdb/addr2line of the ieee80211_tx_dequeue+0xfb? Where did it
point to?


gdb pointed to one line above the flow dereference, which is why I was
going to put some debugging in there.



I suspect there's not enough synchronization between quescing the
device/ath10k after fw crashes and performing mac80211's reconfig
procedure.


I am already running this patch which helps with some of that.  That
patch never made it upstream, but it fixed problems for me earlier.

https://patchwork.kernel.org/patch/9457639/

Could easily be there are some more issues in that logic.

Someone else posted a patch to disable mac-80211 tx when FW crashes,
I think...I have not tried to backport that.

https://patchwork.kernel.org/patch/10411967/

Thanks,
Ben





Michał

On 8 June 2018 at 23:40, Arend van Spriel  wrote:

On 6/8/2018 5:17 PM, Ben Greear wrote:

I recalled an email from Michał leaving tieto so adding his alternate email
he provided back then.

Gr. AvS



On 06/07/2018 04:59 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 4:48 PM,   wrote:


diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}

-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow,
flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;



This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.



Hello Michal,

git blame shows you as the author of the fq_impl.h code.

I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
kernel.  There was an apparent
mostly-null deref in the fq_tin_dequeue method.  According to gdb, it
was within
1 line of the dereference of 'flow'.

My hack above is probably not that useful.  Cong thinks maybe the
locking is bad.

If you get a chance, please review this thread and see if you have any
ideas for
a better fix (or better debugging code).

As always, if you would like me to generate you a buggy firmware that
will crash
in the tx path and cause all sorts of mayhem in the ath10k driver and
wifi stack,
I will be happy to do so.

https://www.mail-archive.com/netdev@vger.kernel.org/msg239738.html

Thanks,
Ben







--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: How to let devcoredump know data has been read?

2018-06-06 Thread Ben Greear

On 06/05/2018 03:53 PM, Brian Norris wrote:

Hi,

On Tue, Jun 05, 2018 at 03:27:28PM -0700, Ben Greear wrote:

I have been testing ath10k on 4.16, which uses the devcoredump API
to notify about dumps.

I am able to see the binary crash dump at /sys/class/devcoredump/devcd2/data,
for instance, but if I do another crash quickly, I get no new uevent sent
and no new crash.

I see there is a 5 minute timer on the coredump data, but it also seems to 
indicate
that if I read the crash, the data should be cleared and ready to be
recreated again?  How do you notify the system that the crash data has
been read?

I tried 'cat' on the data file, but that did not seem to clear anything.


Try *writing* to it?

https://elixir.bootlin.com/linux/v4.16/source/drivers/base/devcoredump.c#L91

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=833c95456a70826d1384883b73fd23aff24d366f


The dumped data will be readable in sysfs in the virtual device's
data file under /sys/class/devcoredump/devcd*/. Writing to it will
free the data and remove the device, as does a 5-minute timeout.


e.g.:

# ls -l /sys/class/devcoredump/devcd1
lrwxrwxrwx. 1 root root 0 Jun  5 15:49 /sys/class/devcoredump/devcd1 -> 
../../devices/virtual/devcoredump/devcd1
# echo 1 > /sys/class/devcoredump/devcd1/data
# ls -l /sys/class/devcoredump/devcd1
ls: cannot access '/sys/class/devcoredump/devcd1': No such file or directory


Thanks to all who responded.  That indeed works just fine.

Just in case you know a quick answer:  I opened a netlink socket to listen
to uevents so I would know when FW crashed.  It receives the kernel messages,
but also receives 'libudev' events which seem to have some binary header in
them (which I could not google any info about how to decode it).  Is there
an easy way to tell the socket to not send the libudev events?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



How to let devcoredump know data has been read?

2018-06-05 Thread Ben Greear

I have been testing ath10k on 4.16, which uses the devcoredump API
to notify about dumps.

I am able to see the binary crash dump at /sys/class/devcoredump/devcd2/data,
for instance, but if I do another crash quickly, I get no new uevent sent
and no new crash.

I see there is a 5 minute timer on the coredump data, but it also seems to 
indicate
that if I read the crash, the data should be cleared and ready to be
recreated again?  How do you notify the system that the crash data has
been read?

I tried 'cat' on the data file, but that did not seem to clear anything.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] cfg80211: Add cumulative channel survey dump support.

2018-05-31 Thread Ben Greear




On 05/31/2018 02:06 AM, vnara...@codeaurora.org wrote:

On 2018-05-23 15:24, Johannes Berg wrote:

On Tue, 2018-05-22 at 14:45 +0530, Venkateswara Naralasetty wrote:

This patch provides support to send accumulated survey data to
user if low level drivers provides non-accumulated survey data.


I think the commit log should say what you need this for?

It's simultaneously a new flag, and a lot of code, but it's not clear
what the point is?

johannes


I will sent next version of patch with updated commit log.

Providing you the earlier discussion of this patch to give quick overview about 
this patch.
https://patchwork.kernel.org/patch/9701459/


It is simple to fix the firmware, it just has several bugs related to clearing 
the
accumulator as well as the 'real' values.  If you can find a QCA firmware 
engineer
that will accept patches I can show them how to fix this easily.  I recently
fixed this in my wave-1 firmware.  I posted patches to the ath10k driver to
support this a week or so ago, but not sure if Kalle will apply them.

And, it is likely that even if you don't use the 'clear' option, you are going 
to
get wraps in the firmware and that will just be other harder to debug bugs.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v1] ath10k: fix band_center_freq handling for VHT160 in recent firmwares

2018-05-28 Thread Ben Greear




On 05/28/2018 10:54 AM, Sebastian Gottschall wrote:



Am 28.05.2018 um 16:43 schrieb Ben Greear:



On 05/27/2018 03:25 PM, Sebastian Gottschall wrote:



Am 25.05.2018 um 16:52 schrieb Ben Greear:

On 05/25/2018 07:44 AM, Kalle Valo wrote:

Sebastian Gottschall  writes:


Am 26.04.2018 um 15:44 schrieb Ben Greear:



On 04/26/2018 02:43 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall 

starting with firmware 10.4.3.4.x series QCA changed the handling
of the channel property band_center_freq1 and band_center_freq2 in
vht160 operation mode
likelly for backward compatiblity with vht80 only capable clients.
this patch adjusts the handling to get vht160 to work again with
official qca firmwares newer than 3.3
consider that this patch will not work with older firmwares
anymore. to avoid undefined behaviour this we disable vht160
capability for outdated firmwares


We should be able to use a feature-flag or otherwise determine if
the firmware needs the old or new
API and make the driver able to handle both.


the new firmware must be used as is and it works. the old firmware can
be detected on the missing vht cap flag.
but thats not my task. i can only use feature flags if they are
included within the qca firmwares. but they arent
the old pre 3.3 firmwares should be treated as obsolete. they are more
than 2 years old and do not announce vht160 capability
even if it works with some ignorance, but on the other side the it has
backward incompatiblies with older vht80 only clients.
this is why the new way was introduced


I was told ath10k could check for WMI_SERVICE_EXTENDED_NSS_SUPPORT flag.
Can someone test and verify that?



I do see that my firmware source based on older upstream QCA FW does not 
advertise
that flag, and newer QCA firmware does, so it would appear that test might
work.

but your sourcebase is new enough that this new handling is required.  if i 
remember correct this handling is required starting with 10.4-3.4 source base
if a 3.4 original firmware is not providing that flag, it cannot be used for 
correct handling and yours is 3.4 based


With my current driver and my current (older) firmware source, 160Mhz works 
fine.

I think my driver changes related to 160Mhz are upstream, so probably stock 
driver
works with my firmware at 160Mhz as well.

If you change the driver, then it will likely break older firmware.  So, just 
change
the driver behaviour if the WMI_SERVICE_EXTENDED_NSS_SUPPORT is enabled.

only pre 3.4 which isnt important. but without my change everything from 3.4 
upwards wont work anymore. so my change fixes vht160 for firmwares released 
within the last 2 years.
WMI_SERVICE_EXTENDED_NSS_SUPPORT condition cannot work since you explained that 
your driver base which is 3.4 doesnt provide this flag but 3.4 stock does 
require this change


I'll let you and Kalle work this out then.  I can fix my driver to work whatever
you do, and if I can ever get patches upstream, then I'll deal with it then.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v1] ath10k: fix band_center_freq handling for VHT160 in recent firmwares

2018-05-28 Thread Ben Greear



On 05/27/2018 03:25 PM, Sebastian Gottschall wrote:



Am 25.05.2018 um 16:52 schrieb Ben Greear:

On 05/25/2018 07:44 AM, Kalle Valo wrote:

Sebastian Gottschall <s.gottsch...@dd-wrt.com> writes:


Am 26.04.2018 um 15:44 schrieb Ben Greear:



On 04/26/2018 02:43 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

starting with firmware 10.4.3.4.x series QCA changed the handling
of the channel property band_center_freq1 and band_center_freq2 in
vht160 operation mode
likelly for backward compatiblity with vht80 only capable clients.
this patch adjusts the handling to get vht160 to work again with
official qca firmwares newer than 3.3
consider that this patch will not work with older firmwares
anymore. to avoid undefined behaviour this we disable vht160
capability for outdated firmwares


We should be able to use a feature-flag or otherwise determine if
the firmware needs the old or new
API and make the driver able to handle both.


the new firmware must be used as is and it works. the old firmware can
be detected on the missing vht cap flag.
but thats not my task. i can only use feature flags if they are
included within the qca firmwares. but they arent
the old pre 3.3 firmwares should be treated as obsolete. they are more
than 2 years old and do not announce vht160 capability
even if it works with some ignorance, but on the other side the it has
backward incompatiblies with older vht80 only clients.
this is why the new way was introduced


I was told ath10k could check for WMI_SERVICE_EXTENDED_NSS_SUPPORT flag.
Can someone test and verify that?



I do see that my firmware source based on older upstream QCA FW does not 
advertise
that flag, and newer QCA firmware does, so it would appear that test might
work.

but your sourcebase is new enough that this new handling is required.  if i 
remember correct this handling is required starting with 10.4-3.4 source base
if a 3.4 original firmware is not providing that flag, it cannot be used for 
correct handling and yours is 3.4 based


With my current driver and my current (older) firmware source, 160Mhz works 
fine.

I think my driver changes related to 160Mhz are upstream, so probably stock 
driver
works with my firmware at 160Mhz as well.

If you change the driver, then it will likely break older firmware.  So, just 
change
the driver behaviour if the WMI_SERVICE_EXTENDED_NSS_SUPPORT is enabled.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v1] ath10k: fix band_center_freq handling for VHT160 in recent firmwares

2018-05-25 Thread Ben Greear

On 05/25/2018 07:44 AM, Kalle Valo wrote:

Sebastian Gottschall <s.gottsch...@dd-wrt.com> writes:


Am 26.04.2018 um 15:44 schrieb Ben Greear:



On 04/26/2018 02:43 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

starting with firmware 10.4.3.4.x series QCA changed the handling
of the channel property band_center_freq1 and band_center_freq2 in
vht160 operation mode
likelly for backward compatiblity with vht80 only capable clients.
this patch adjusts the handling to get vht160 to work again with
official qca firmwares newer than 3.3
consider that this patch will not work with older firmwares
anymore. to avoid undefined behaviour this we disable vht160
capability for outdated firmwares


We should be able to use a feature-flag or otherwise determine if
the firmware needs the old or new
API and make the driver able to handle both.


the new firmware must be used as is and it works. the old firmware can
be detected on the missing vht cap flag.
but thats not my task. i can only use feature flags if they are
included within the qca firmwares. but they arent
the old pre 3.3 firmwares should be treated as obsolete. they are more
than 2 years old and do not announce vht160 capability
even if it works with some ignorance, but on the other side the it has
backward incompatiblies with older vht80 only clients.
this is why the new way was introduced


I was told ath10k could check for WMI_SERVICE_EXTENDED_NSS_SUPPORT flag.
Can someone test and verify that?



I do see that my firmware source based on older upstream QCA FW does not 
advertise
that flag, and newer QCA firmware does, so it would appear that test might
work.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath10k wake_tx_queue issues

2018-05-15 Thread Ben Greear
: ath10k_htt_tx_dec_pending: 
num_pen: 13
  ksdioirqd/mmc2-139   [002] ...174.720292: ath10k_htt_tx_dec_pending: 
num_pen: 12
  ksdioirqd/mmc2-139   [002] ...174.720304: ath10k_htt_tx_dec_pending: 
num_pen: 11
  ksdioirqd/mmc2-139   [002] ...174.720316: ath10k_htt_tx_dec_pending: 
num_pen: 10
  ksdioirqd/mmc2-139   [002] ...174.720396: ath10k_htt_tx_dec_pending: 
num_pen: 9
  ksdioirqd/mmc2-139   [002] ...174.720503: ath10k_htt_tx_dec_pending: 
num_pen: 8
  ksdioirqd/mmc2-139   [002] ...174.720527: ath10k_htt_tx_dec_pending: 
num_pen: 7
  ksdioirqd/mmc2-139   [002] ...174.720540: ath10k_htt_tx_dec_pending: 
num_pen: 6
  ksdioirqd/mmc2-139   [002] ...174.720552: ath10k_htt_tx_dec_pending: 
num_pen: 5
  ksdioirqd/mmc2-139   [002] ...174.720645: ath10k_htt_tx_dec_pending: 
num_pen: 4
  ksdioirqd/mmc2-139   [002] ...174.720668: ath10k_htt_tx_dec_pending: 
num_pen: 3
  ksdioirqd/mmc2-139   [002] ...174.720681: ath10k_htt_tx_dec_pending: 
num_pen: 2
  ksdioirqd/mmc2-139   [002] ...174.720694: ath10k_htt_tx_dec_pending: 
num_pen: 1
  ksdioirqd/mmc2-139   [002] ...174.720709: ath10k_htt_tx_dec_pending: 
num_pen: 0

ath10k just finished transmission of pending frames.

Half a second later, the send buffer is full, and we start seeing errors
in iperf.

   iperf-181   [001] 75.191606: tcp_sendmsg_locked: err: -11
   iperf-181   [001] 75.701511: tcp_sendmsg_locked: err: -11
   iperf-181   [001] 76.211648: tcp_sendmsg_locked: err: -11



Sure, the regular way ath10k_mac_op_wake_tx_queue is called is via
ieee80211_subif_start_xmit => __ieee80211_subif_start_xmit => 
ieee80211_xmit_fast
=> ieee80211_queue_skb => drv_wake_tx_queue.

But I was expecting the call to ieee80211_wake_queue to somehow trigger a call
to ath10k_mac_op_wake_tx_queue, since there is still data in the send buffer/
in the ieee80211_txq that needs to be sent, to allow more data to be written to
the socket. But obviously the callback never comes.
Or how else is this supposed to work?


Note that setting ops->wake_tx_queue = NULL; works around the problem
(i.e. let mac80211 use ath10k's tx callback instead of wake_tx_queue callback).
But then there might still be a bug for other drivers to stumble upon.

However, since the only wireless drivers using wake_tx_queue are: ath9k, ath10k,
and mt76, perhaps it is not such a bad idea to use the tx callback instead of
the wake_tx_queue callback.


Kind regards,
Niklas



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] ath10k: support for multicast rate control

2018-05-09 Thread Ben Greear



On 05/08/2018 11:57 PM, Sebastian Gottschall wrote:



Am 09.05.2018 um 08:15 schrieb Sven Eckelmann:

On Montag, 7. Mai 2018 18:34:51 CEST Pradeep Kumar Chitrapu wrote:

Issues wmi command to firmwre when multicast rate change is received
with the new BSS_CHANGED_MCAST_RATE flag.
Also fixes the incorrect fixed_rate setting for CCK rates which got
introduced with addition of ath10k_rates_rev2 enum.

Tested on QCA9984 with firmware ver 10.4-3.6-00104
Signed-off-by: Pradeep Kumar Chitrapu <prade...@codeaurora.org>
---
v2:
  - fixed the CCK rates setting for mcast_rate and fixed_rate paths.
  - set broadcast rate along with multicast rate setting.

These things are only modified in ath10k_bss_info_changed and not saved/
restored. What happens when the HW needs to be reset (there are are couple of
firmware crashes which can be seen in the wild and are handled by ath10k)? I
would guess that not all of them automatically cause an .bss_info_changed.


Yes, that sounds like a good idea to me.


i have never seen a card properly recovering after a firmware crash, all 
firmware crashes i have seen
are caused by bugs in firmware handling or the firmware itself. so if it 
recovers it crashes immediatly again
ending in a endless crashloop since the cause for the crash hasnt been fixed. 
they are often triggered by extern clients for instance which still remain in 
the field.
i think firmware crashes should not be hidden by recovering. this would also 
force users to report problems more frequently, than they do
since they dont even take notice of such problems


We see recovery work often.  If you see repeatable crashes, post them
to the list.

Do you know of any particular external clients that cause these crashes?

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v7] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-05-01 Thread Ben Greear



On 04/30/2018 03:14 PM, Sebastian Gottschall wrote:



+/* only 4x4 configuration do support 2x2 for VHT160, everything else must 
use 1x1 */
+if (ar->cfg_rx_chainmask == 15)
+nss160 = arg->peer_num_spatial_streams <= 2 ? 
arg->peer_num_spatial_streams : 2;


If peer nss == 3, then nss160 must be 1x1.  That is why I previously suggested 
the code that set nss160 to equal nss / 2
(with special case to bump nss160 to 1x1 if nss == 1.

btw. it doesnt matter if the peer sends with 3x3 or even 4x4, i still can 
receive with 2x2. thats no conflict. switching back to 1x1 of the peer sends 
vht160 with 3x3 makes no real sense
i dont have to turn off a chain, if i'm able todo 2x2, no matter what the peer 
does. i just have to limit the maximum


If the local system is nss 4x4 and the remote is nss 3x3, then the 
peer_num_spatial_streams will be 3 and the nss160 should be 1.

At least for ath10k chips, it requires 2 chains to receive a 1x1 signal at 
160Mhz, so that is why
a nss 3x3 cannot receive at 2x2 160Mhz.

If you still think this makes no sense, think about it a bit before responding!

Thanks,
Ben





A 9984 peer with chainmask configured to 0x7 would hit this case I think.

Overall this looks better than previous patches though.

Thanks,
Ben


+
+/* in case if peer is connected with vht160 or vht80+80, we need to 
properly adjust rxnss parameters otherwise firmware will raise a assert */
+switch(arg->peer_phymode) {
+case MODE_11AC_VHT80_80:
+arg->peer_bw_rxnss_override = BW_NSS_FWCONF_80_80(nss160);
+/* fall through */
+case MODE_11AC_VHT160:
+arg->peer_bw_rxnss_override |= BW_NSS_FWCONF_160(nss160);
+break;
+default:
+break;
  }
+
+ath10k_dbg(ar, ATH10K_DBG_MAC, "mac vht peer %pM max_mpdu %d flags 0x%x 
peer_bw_rxnss_override 0x%x\n",
+   sta->addr, arg->peer_max_mpdu, arg->peer_flags, 
arg->peer_bw_rxnss_override);
  }

  static void ath10k_peer_assoc_h_qos(struct ath10k *ar,
@@ -2696,9 +2700,9 @@ static int ath10k_peer_assoc_prepare(struct ath10k *ar,
  ath10k_peer_assoc_h_crypto(ar, vif, sta, arg);
  ath10k_peer_assoc_h_rates(ar, vif, sta, arg);
  ath10k_peer_assoc_h_ht(ar, vif, sta, arg);
+ath10k_peer_assoc_h_phymode(ar, vif, sta, arg);
  ath10k_peer_assoc_h_vht(ar, vif, sta, arg);
  ath10k_peer_assoc_h_qos(ar, vif, sta, arg);
-ath10k_peer_assoc_h_phymode(ar, vif, sta, arg);

  return 0;
  }
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 2c36256a441d..3797dca317ff 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -7211,12 +7211,7 @@ ath10k_wmi_peer_assoc_fill_10_4(struct ath10k *ar, void 
*buf,
  struct wmi_10_4_peer_assoc_complete_cmd *cmd = buf;

  ath10k_wmi_peer_assoc_fill_10_2(ar, buf, arg);
-if (arg->peer_bw_rxnss_override)
-cmd->peer_bw_rxnss_override =
-__cpu_to_le32((arg->peer_bw_rxnss_override - 1) |
-  BIT(PEER_BW_RXNSS_OVERRIDE_OFFSET));
-else
-cmd->peer_bw_rxnss_override = 0;
+cmd->peer_bw_rxnss_override = __cpu_to_le32(arg->peer_bw_rxnss_override);
  }

  static int
diff --git a/drivers/net/wireless/ath/ath10k/wmi.h 
b/drivers/net/wireless/ath/ath10k/wmi.h
index 46ae19bb2c92..1fe0aa5523a6 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -6380,7 +6380,19 @@ struct wmi_10_2_peer_assoc_complete_cmd {
  __le32 info0; /* WMI_PEER_ASSOC_INFO0_ */
  } __packed;

-#define PEER_BW_RXNSS_OVERRIDE_OFFSET  31
+#define BW_NSS_FWCONF_MAP_ENABLE (1 << 31)
+#define BW_NSS_FWCONF_MAP_160MHZ_S   (0)
+#define BW_NSS_FWCONF_MAP_160MHZ_M   (0x0007)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_S (3)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_M (0x0038)
+#define BW_NSS_FWCONF_MAP_M  (0x003F)
+
+#define GET_BW_NSS_FWCONF_160(x) x) & BW_NSS_FWCONF_MAP_160MHZ_M) 
>> BW_NSS_FWCONF_MAP_160MHZ_S) + 1)
+#define GET_BW_NSS_FWCONF_80_80(x)   x) & BW_NSS_FWCONF_MAP_80_80MHZ_M) 
>> BW_NSS_FWCONF_MAP_80_80MHZ_S) + 1)
+
+/* Values defined to set 160 MHz Bandwidth NSS Mapping into FW*/
+#define BW_NSS_FWCONF_160(x)  (BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_160MHZ_S) & BW_NSS_FWCONF_MAP_160MHZ_M))
+#define BW_NSS_FWCONF_80_80(x)(BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_80_80MHZ_S) & BW_NSS_FWCONF_MAP_80_80MHZ_M))

  struct wmi_10_4_peer_assoc_complete_cmd {
  struct wmi_10_2_peer_assoc_complete_cmd cmd;








--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v7] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-30 Thread Ben Greear
th/ath10k/wmi.c
index 2c36256a441d..3797dca317ff 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -7211,12 +7211,7 @@ ath10k_wmi_peer_assoc_fill_10_4(struct ath10k *ar, void 
*buf,
struct wmi_10_4_peer_assoc_complete_cmd *cmd = buf;

ath10k_wmi_peer_assoc_fill_10_2(ar, buf, arg);
-   if (arg->peer_bw_rxnss_override)
-   cmd->peer_bw_rxnss_override =
-   __cpu_to_le32((arg->peer_bw_rxnss_override - 1) |
- BIT(PEER_BW_RXNSS_OVERRIDE_OFFSET));
-   else
-   cmd->peer_bw_rxnss_override = 0;
+   cmd->peer_bw_rxnss_override = 
__cpu_to_le32(arg->peer_bw_rxnss_override);
  }

  static int
diff --git a/drivers/net/wireless/ath/ath10k/wmi.h 
b/drivers/net/wireless/ath/ath10k/wmi.h
index 46ae19bb2c92..1fe0aa5523a6 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -6380,7 +6380,19 @@ struct wmi_10_2_peer_assoc_complete_cmd {
__le32 info0; /* WMI_PEER_ASSOC_INFO0_ */
  } __packed;

-#define PEER_BW_RXNSS_OVERRIDE_OFFSET  31
+#define BW_NSS_FWCONF_MAP_ENABLE (1 << 31)
+#define BW_NSS_FWCONF_MAP_160MHZ_S   (0)
+#define BW_NSS_FWCONF_MAP_160MHZ_M   (0x0007)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_S (3)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_M (0x0038)
+#define BW_NSS_FWCONF_MAP_M  (0x003F)
+
+#define GET_BW_NSS_FWCONF_160(x) x) & BW_NSS_FWCONF_MAP_160MHZ_M) 
>> BW_NSS_FWCONF_MAP_160MHZ_S) + 1)
+#define GET_BW_NSS_FWCONF_80_80(x)   x) & BW_NSS_FWCONF_MAP_80_80MHZ_M) 
>> BW_NSS_FWCONF_MAP_80_80MHZ_S) + 1)
+
+/* Values defined to set 160 MHz Bandwidth NSS Mapping into FW*/
+#define BW_NSS_FWCONF_160(x)  (BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_160MHZ_S) & BW_NSS_FWCONF_MAP_160MHZ_M))
+#define BW_NSS_FWCONF_80_80(x)(BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_80_80MHZ_S) & BW_NSS_FWCONF_MAP_80_80MHZ_M))

  struct wmi_10_4_peer_assoc_complete_cmd {
struct wmi_10_2_peer_assoc_complete_cmd cmd;




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v4] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-28 Thread Ben Greear



On 04/28/2018 08:26 AM, Sebastian Gottschall wrote:

Am 28.04.2018 um 17:01 schrieb Ben Greear:



On 04/27/2018 05:47 PM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

current handling of peer_bw_rxnss_override parameter is based on guessing the 
VHT160/8080 capability by rx rate. this is wrong and may lead
to a non initialized peer_bw_rxnss_override parameter which is required since 
VHT160 operation mode only supports 2x2 chainmasks in addition the original code
initialized the parameter with wrong masked values.
This patch uses the peer phymode and peer nss information for correct 
initialisation of the peer_bw_rxnss_override parameter.
if this peer information is not available, we initialize the parameter by 
minimum nss which is suggested by QCA as temporary workaround according
to the QCA sourcecodes.

Signed-off-by: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

v2: remove debug messages
v3: apply some cosmetics, update documentation
v4: fix compile warning and truncate nss to maximum of 2x2 since current 
chipsets only support 2x2 at vht160
---
 drivers/net/wireless/ath/ath10k/mac.c | 44 ++-
 drivers/net/wireless/ath/ath10k/wmi.c |  7 +
 drivers/net/wireless/ath/ath10k/wmi.h | 14 -
 3 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..de8a099c9f5a 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2528,23 +2528,37 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 __le16_to_cpu(vht_cap->vht_mcs.tx_highest);
 arg->peer_vht_rates.tx_mcs_set = ath10k_peer_assoc_h_vht_limit(
 __le16_to_cpu(vht_cap->vht_mcs.tx_mcs_map), vht_mcs_mask);
+arg->peer_bw_rxnss_override = 0;
+
+if (arg->peer_num_spatial_streams) {


Check for the 80+80 or 160 instead of num-spatial-streams, if if something has
nss == 0, then act like it is 1 instead since it is obviously configured 
incorrect.

this check is bellow. see the switch statement for the VHT160, 80_80 case.
if nss 0, it must fail here since we cannot handle nss = 0, in that case just 
bit 31 is masked




+int nss160 = arg->peer_num_spatial_streams;


I think this line should be:
int nss160 = arg->peer_num_spatial_streams / 2;

no. makes no sense. 2 would get 1 and 1 turns to zero. makes no sense
read the macro BW_NSS_FWCONF_160 first. it already does nss160-1
this would lead to -1 if spatial_streams is 1


Well, nss160 should be the number of streams we can use at 160 Mhz.  The macro
can convert to zero-based api that the firmware wants.  A 9984 will have nss == 
4,
but you want nss160 to be 2.  A 4019 will have nss == 2, but nss160 should be 1.



if (nss160 == 0) { nss160 = 1; } /* deal with mis-configured 
nss */

no
if peer_num_spatial_stream is 0 (so nss160 too), the code is not triggered. but 
another code. same as above


If NSS was 1, and you do 1/2, you get zero.



if (!arg->peer_num_spatial_streams && (arg->peer_phymode == MODE_11AC_VHT80_80 || 
arg->peer_phymode == MODE_11AC_VHT160)) {
   arg->peer_bw_rxnss_override = BW_NSS_FWCONF_MAP_ENABLE;
 }


If you have a 4019 client, it will have arg->peer_num_spatial_streams == 2, but
it can only do 1x1 at 160Mhz.

peer_num_spatial_streams is calculated from mcs rate set. so why should 4019 
announce 2x2 if it just can to 1x1.
have you checked the mcs set at assoc management frame?


You are confused about nss vs the 160nss.  They are different.  I do not know 
how to explain
this better.  Dig through the firmware source and look at how they are used.  
Basically, if
rate-ctrl decides 80Mhz is best, it can send at the normal 80Mhz NSS (so, 4 for 
9984, 2 for 4019),
but if rate-ctrl uses 160, it uses the different nss max (2 for 9984, one for 
4019).





+/* truncate vht160 nss value to 2x2 since all known chipsets do not 
support more than 2x2 in vht160 modes */
+if (nss160 > 2)
+nss160 = 2;
+/* in case if peer is connected with vht160 or vht80+80, we need to 
properly adjust rxnss parameters */
+switch(arg->peer_phymode) {
+case MODE_11AC_VHT80_80:
+arg->peer_bw_rxnss_override = BW_NSS_FWCONF_80_80(nss160);
+/* fall through */
+case MODE_11AC_VHT160:
+arg->peer_bw_rxnss_override |= BW_NSS_FWCONF_160(nss160);


From looking at firmware, it will just ignore the bits it does not need, so you 
do not need
to special case adding the 80_80 fields, you can do more like my patch and just 
always add
those bits.

i'm following the reference code. in case 80_80 only is configure vht160 nd 
80_80 must be masked. for everything else vht160 only must be masked.
thats the meaning. there are 2 override masks. independend for vht160 and 
80_80. for sure i ca

Re: [PATCH v2] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-28 Thread Ben Greear



On 04/27/2018 05:24 PM, Sebastian Gottschall wrote:

Am 27.04.2018 um 23:57 schrieb Ben Greear:

On 04/27/2018 11:54 AM, Sebastian Gottschall wrote:

Am 27.04.2018 um 18:07 schrieb Ben Greear:

On 04/26/2018 09:40 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 22:35 schrieb Ben Greear:

On 04/26/2018 01:21 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 17:30 schrieb Ben Greear:

On 04/26/2018 02:28 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

current handling of peer_bw_rxnss_override parameter is based on guessing the 
VHT160/8080 capability by rx rate. this is wrong and may lead
to a non initialized peer_bw_rxnss_override parameter which is required since 
VHT160 operation mode only supports 2x2 chainmasks in addition the original
code
initialized the parameter with wrong masked values.
This patch uses the peer phymode and peer nss information for correct 
initialisation of the peer_bw_rxnss_override parameter.
if this peer information is not available, we initialize the parameter by 
minimum nss which is suggested by QCA as temporary workaround according
to the QCA sourcecodes.

Signed-off-by: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

v2: remove debug messages
---
 drivers/net/wireless/ath/ath10k/mac.c | 36 +++
 drivers/net/wireless/ath/ath10k/wmi.c |  4 +---
 drivers/net/wireless/ath/ath10k/wmi.h | 14 +-
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..df79af89ee71 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2505,8 +2505,9 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 if (sta->bandwidth == IEEE80211_STA_RX_BW_80)
 arg->peer_flags |= ar->wmi.peer_flags->bw80;

-if (sta->bandwidth == IEEE80211_STA_RX_BW_160)
+if (sta->bandwidth == IEEE80211_STA_RX_BW_160) {
 arg->peer_flags |= ar->wmi.peer_flags->bw160;
+}

 /* Calculate peer NSS capability from VHT capabilities if STA
  * supports VHT.
@@ -2529,22 +2530,25 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 arg->peer_vht_rates.tx_mcs_set = ath10k_peer_assoc_h_vht_limit(
__le16_to_cpu(vht_cap->vht_mcs.tx_mcs_map), vht_mcs_mask);

-ath10k_dbg(ar, ATH10K_DBG_MAC, "mac vht peer %pM max_mpdu %d flags 0x%x\n",
-   sta->addr, arg->peer_max_mpdu, arg->peer_flags);
+if (arg->peer_num_spatial_streams && (arg->peer_phymode == MODE_11AC_VHT160 
|| arg->peer_phymode == MODE_11AC_VHT80_80)) {
+arg->peer_bw_rxnss_override = 
BW_NSS_FWCONF_160(arg->peer_num_spatial_streams);
+}


So, an 80Mhz peer could be 4x4 and could connect to a VHT-160 AP. From what I 
can tell,
the VHT-160 AP can talk 4x4 @ 80Mhz to the peer in this case, but if peer is 
VHT-160,
then of course it can only talk at 2x2.

So, I don't think you can just look at the peer_num_spatial_streams here.

no? rxnss_override is only taked if peer phymode is vht160 or vht80_80. vht80 
is not considered in that code PEER phy_mode, not HOST phy_mode
this parameter is a assoc per peer parameter as far as i can say here.
consider that the firmware will accept anything except 0 for 
peer_bw_rxnss_override in vht160 operation mode
if a peer is 80 mhz, the code will be skipped here.


From what I can tell, this feature is supposed to configure the rate-ctrl in 
the firmware to know that
it can only send 1x1 or 2x2 when sending at 160Mhz, but that it can send at 
higher NSS if it sends
at 80Mhz or lower.

right. but thats exactly what it should does in case that a peer is connecting 
with vht160 / 80_80
and the peer itself does also send his own nss capabilities which is used if 
available. if not ,it uses the fallback.


If a 2x2 peer connects to the AP, will it have peer_num_spatial_streams set to 
2?

yes. i had some debug code in my initial early versions. the peer does transmit 
his own nss capabilities.

If so,
then your code will configure the peer_bw_rxnss_override to indicate it can 
send at 160Mhz
2x2 as well, right?  And if so, then that is incorrect.

no. since nss_override is only set if peer phymode is 160 mhz as well


The peer can run at 2x2 80Mhz and 1x1 160Mhz at the same time, so it can 
advertise
2x2 nss and phy-mode of 160Mhz when associating to the 160Mhz AP, but the peer 
can only do 1x1 at 160Mhz.  There
is no standard way I know of for the peer to tell you specifically that it can 
only do
1x1 at 160Mhz and also 2x2 at 80Mhz in this case.

per specification the peer is able to provide max nss to the ap. the rx_nss 
property is calculated
from the mcs rateset provided by the peer by mac80211. this is mcs set provided 
on on assoc is mandatory.
so there is a way the peer can tell you what it supports and this is what is 
used.
if a peer does no

Re: [PATCH v4] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-28 Thread Ben Greear
truct ath10k *ar,
@@ -2696,9 +2710,9 @@ static int ath10k_peer_assoc_prepare(struct ath10k *ar,
ath10k_peer_assoc_h_crypto(ar, vif, sta, arg);
ath10k_peer_assoc_h_rates(ar, vif, sta, arg);
ath10k_peer_assoc_h_ht(ar, vif, sta, arg);
+   ath10k_peer_assoc_h_phymode(ar, vif, sta, arg);
ath10k_peer_assoc_h_vht(ar, vif, sta, arg);
ath10k_peer_assoc_h_qos(ar, vif, sta, arg);
-   ath10k_peer_assoc_h_phymode(ar, vif, sta, arg);

return 0;
 }
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 2c36256a441d..3797dca317ff 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -7211,12 +7211,7 @@ ath10k_wmi_peer_assoc_fill_10_4(struct ath10k *ar, void 
*buf,
struct wmi_10_4_peer_assoc_complete_cmd *cmd = buf;

ath10k_wmi_peer_assoc_fill_10_2(ar, buf, arg);
-   if (arg->peer_bw_rxnss_override)
-   cmd->peer_bw_rxnss_override =
-   __cpu_to_le32((arg->peer_bw_rxnss_override - 1) |
- BIT(PEER_BW_RXNSS_OVERRIDE_OFFSET));
-   else
-   cmd->peer_bw_rxnss_override = 0;
+   cmd->peer_bw_rxnss_override = 
__cpu_to_le32(arg->peer_bw_rxnss_override);
 }

 static int
diff --git a/drivers/net/wireless/ath/ath10k/wmi.h 
b/drivers/net/wireless/ath/ath10k/wmi.h
index 46ae19bb2c92..1fe0aa5523a6 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -6380,7 +6380,19 @@ struct wmi_10_2_peer_assoc_complete_cmd {
__le32 info0; /* WMI_PEER_ASSOC_INFO0_ */
 } __packed;

-#define PEER_BW_RXNSS_OVERRIDE_OFFSET  31
+#define BW_NSS_FWCONF_MAP_ENABLE (1 << 31)
+#define BW_NSS_FWCONF_MAP_160MHZ_S   (0)
+#define BW_NSS_FWCONF_MAP_160MHZ_M   (0x0007)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_S (3)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_M (0x0038)
+#define BW_NSS_FWCONF_MAP_M  (0x003F)
+
+#define GET_BW_NSS_FWCONF_160(x) x) & BW_NSS_FWCONF_MAP_160MHZ_M) 
>> BW_NSS_FWCONF_MAP_160MHZ_S) + 1)
+#define GET_BW_NSS_FWCONF_80_80(x)   x) & BW_NSS_FWCONF_MAP_80_80MHZ_M) 
>> BW_NSS_FWCONF_MAP_80_80MHZ_S) + 1)
+
+/* Values defined to set 160 MHz Bandwidth NSS Mapping into FW*/
+#define BW_NSS_FWCONF_160(x)  (BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_160MHZ_S) & BW_NSS_FWCONF_MAP_160MHZ_M))
+#define BW_NSS_FWCONF_80_80(x)(BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_80_80MHZ_S) & BW_NSS_FWCONF_MAP_80_80MHZ_M))

 struct wmi_10_4_peer_assoc_complete_cmd {
struct wmi_10_2_peer_assoc_complete_cmd cmd;



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-27 Thread Ben Greear

On 04/27/2018 11:54 AM, Sebastian Gottschall wrote:

Am 27.04.2018 um 18:07 schrieb Ben Greear:

On 04/26/2018 09:40 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 22:35 schrieb Ben Greear:

On 04/26/2018 01:21 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 17:30 schrieb Ben Greear:

On 04/26/2018 02:28 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

current handling of peer_bw_rxnss_override parameter is based on guessing the 
VHT160/8080 capability by rx rate. this is wrong and may lead
to a non initialized peer_bw_rxnss_override parameter which is required since 
VHT160 operation mode only supports 2x2 chainmasks in addition the original
code
initialized the parameter with wrong masked values.
This patch uses the peer phymode and peer nss information for correct 
initialisation of the peer_bw_rxnss_override parameter.
if this peer information is not available, we initialize the parameter by 
minimum nss which is suggested by QCA as temporary workaround according
to the QCA sourcecodes.

Signed-off-by: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

v2: remove debug messages
---
 drivers/net/wireless/ath/ath10k/mac.c | 36 +++
 drivers/net/wireless/ath/ath10k/wmi.c |  4 +---
 drivers/net/wireless/ath/ath10k/wmi.h | 14 +-
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..df79af89ee71 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2505,8 +2505,9 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 if (sta->bandwidth == IEEE80211_STA_RX_BW_80)
 arg->peer_flags |= ar->wmi.peer_flags->bw80;

-if (sta->bandwidth == IEEE80211_STA_RX_BW_160)
+if (sta->bandwidth == IEEE80211_STA_RX_BW_160) {
 arg->peer_flags |= ar->wmi.peer_flags->bw160;
+}

 /* Calculate peer NSS capability from VHT capabilities if STA
  * supports VHT.
@@ -2529,22 +2530,25 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 arg->peer_vht_rates.tx_mcs_set = ath10k_peer_assoc_h_vht_limit(
 __le16_to_cpu(vht_cap->vht_mcs.tx_mcs_map), vht_mcs_mask);

-ath10k_dbg(ar, ATH10K_DBG_MAC, "mac vht peer %pM max_mpdu %d flags 0x%x\n",
-   sta->addr, arg->peer_max_mpdu, arg->peer_flags);
+if (arg->peer_num_spatial_streams && (arg->peer_phymode == MODE_11AC_VHT160 
|| arg->peer_phymode == MODE_11AC_VHT80_80)) {
+arg->peer_bw_rxnss_override = 
BW_NSS_FWCONF_160(arg->peer_num_spatial_streams);
+}


So, an 80Mhz peer could be 4x4 and could connect to a VHT-160 AP. From what I 
can tell,
the VHT-160 AP can talk 4x4 @ 80Mhz to the peer in this case, but if peer is 
VHT-160,
then of course it can only talk at 2x2.

So, I don't think you can just look at the peer_num_spatial_streams here.

no? rxnss_override is only taked if peer phymode is vht160 or vht80_80. vht80 
is not considered in that code PEER phy_mode, not HOST phy_mode
this parameter is a assoc per peer parameter as far as i can say here.
consider that the firmware will accept anything except 0 for 
peer_bw_rxnss_override in vht160 operation mode
if a peer is 80 mhz, the code will be skipped here.


From what I can tell, this feature is supposed to configure the rate-ctrl in 
the firmware to know that
it can only send 1x1 or 2x2 when sending at 160Mhz, but that it can send at 
higher NSS if it sends
at 80Mhz or lower.

right. but thats exactly what it should does in case that a peer is connecting 
with vht160 / 80_80
and the peer itself does also send his own nss capabilities which is used if 
available. if not ,it uses the fallback.


If a 2x2 peer connects to the AP, will it have peer_num_spatial_streams set to 
2?

yes. i had some debug code in my initial early versions. the peer does transmit 
his own nss capabilities.

If so,
then your code will configure the peer_bw_rxnss_override to indicate it can 
send at 160Mhz
2x2 as well, right?  And if so, then that is incorrect.

no. since nss_override is only set if peer phymode is 160 mhz as well


The peer can run at 2x2 80Mhz and 1x1 160Mhz at the same time, so it can 
advertise
2x2 nss and phy-mode of 160Mhz when associating to the 160Mhz AP, but the peer 
can only do 1x1 at 160Mhz.  There
is no standard way I know of for the peer to tell you specifically that it can 
only do
1x1 at 160Mhz and also 2x2 at 80Mhz in this case.

per specification the peer is able to provide max nss to the ap. the rx_nss 
property is calculated
from the mcs rateset provided by the peer by mac80211. this is mcs set provided 
on on assoc is mandatory.
so there is a way the peer can tell you what it supports and this is what is 
used.
if a peer does not provide the mcs rateset (which i have seen for a single 
marvell client)
the fallback

Re: [PATCH v2] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-27 Thread Ben Greear

On 04/26/2018 09:40 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 22:35 schrieb Ben Greear:

On 04/26/2018 01:21 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 17:30 schrieb Ben Greear:

On 04/26/2018 02:28 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

current handling of peer_bw_rxnss_override parameter is based on guessing the 
VHT160/8080 capability by rx rate. this is wrong and may lead
to a non initialized peer_bw_rxnss_override parameter which is required since 
VHT160 operation mode only supports 2x2 chainmasks in addition the original code
initialized the parameter with wrong masked values.
This patch uses the peer phymode and peer nss information for correct 
initialisation of the peer_bw_rxnss_override parameter.
if this peer information is not available, we initialize the parameter by 
minimum nss which is suggested by QCA as temporary workaround according
to the QCA sourcecodes.

Signed-off-by: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

v2: remove debug messages
---
 drivers/net/wireless/ath/ath10k/mac.c | 36 +++
 drivers/net/wireless/ath/ath10k/wmi.c |  4 +---
 drivers/net/wireless/ath/ath10k/wmi.h | 14 +-
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..df79af89ee71 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2505,8 +2505,9 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 if (sta->bandwidth == IEEE80211_STA_RX_BW_80)
 arg->peer_flags |= ar->wmi.peer_flags->bw80;

-if (sta->bandwidth == IEEE80211_STA_RX_BW_160)
+if (sta->bandwidth == IEEE80211_STA_RX_BW_160) {
 arg->peer_flags |= ar->wmi.peer_flags->bw160;
+}

 /* Calculate peer NSS capability from VHT capabilities if STA
  * supports VHT.
@@ -2529,22 +2530,25 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 arg->peer_vht_rates.tx_mcs_set = ath10k_peer_assoc_h_vht_limit(
 __le16_to_cpu(vht_cap->vht_mcs.tx_mcs_map), vht_mcs_mask);

-ath10k_dbg(ar, ATH10K_DBG_MAC, "mac vht peer %pM max_mpdu %d flags 0x%x\n",
-   sta->addr, arg->peer_max_mpdu, arg->peer_flags);
+if (arg->peer_num_spatial_streams && (arg->peer_phymode == MODE_11AC_VHT160 
|| arg->peer_phymode == MODE_11AC_VHT80_80)) {
+arg->peer_bw_rxnss_override = 
BW_NSS_FWCONF_160(arg->peer_num_spatial_streams);
+}


So, an 80Mhz peer could be 4x4 and could connect to a VHT-160 AP. From what I 
can tell,
the VHT-160 AP can talk 4x4 @ 80Mhz to the peer in this case, but if peer is 
VHT-160,
then of course it can only talk at 2x2.

So, I don't think you can just look at the peer_num_spatial_streams here.

no? rxnss_override is only taked if peer phymode is vht160 or vht80_80. vht80 
is not considered in that code PEER phy_mode, not HOST phy_mode
this parameter is a assoc per peer parameter as far as i can say here.
consider that the firmware will accept anything except 0 for 
peer_bw_rxnss_override in vht160 operation mode
if a peer is 80 mhz, the code will be skipped here.


From what I can tell, this feature is supposed to configure the rate-ctrl in 
the firmware to know that
it can only send 1x1 or 2x2 when sending at 160Mhz, but that it can send at 
higher NSS if it sends
at 80Mhz or lower.

right. but thats exactly what it should does in case that a peer is connecting 
with vht160 / 80_80
and the peer itself does also send his own nss capabilities which is used if 
available. if not ,it uses the fallback.


If a 2x2 peer connects to the AP, will it have peer_num_spatial_streams set to 
2?

yes. i had some debug code in my initial early versions. the peer does transmit 
his own nss capabilities.

If so,
then your code will configure the peer_bw_rxnss_override to indicate it can 
send at 160Mhz
2x2 as well, right?  And if so, then that is incorrect.

no. since nss_override is only set if peer phymode is 160 mhz as well


The peer can run at 2x2 80Mhz and 1x1 160Mhz at the same time, so it can 
advertise
2x2 nss and phy-mode of 160Mhz when associating to the 160Mhz AP, but the peer 
can only do 1x1 at 160Mhz.  There
is no standard way I know of for the peer to tell you specifically that it can 
only do
1x1 at 160Mhz and also 2x2 at 80Mhz in this case.

That is why this rxnns_override exists, to hack around this problem.

Your patch is going to break in this case as far as I can tell.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v2] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-26 Thread Ben Greear

On 04/26/2018 01:21 PM, Sebastian Gottschall wrote:

Am 26.04.2018 um 17:30 schrieb Ben Greear:

On 04/26/2018 02:28 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

current handling of peer_bw_rxnss_override parameter is based on guessing the 
VHT160/8080 capability by rx rate. this is wrong and may lead
to a non initialized peer_bw_rxnss_override parameter which is required since 
VHT160 operation mode only supports 2x2 chainmasks in addition the original code
initialized the parameter with wrong masked values.
This patch uses the peer phymode and peer nss information for correct 
initialisation of the peer_bw_rxnss_override parameter.
if this peer information is not available, we initialize the parameter by 
minimum nss which is suggested by QCA as temporary workaround according
to the QCA sourcecodes.

Signed-off-by: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

v2: remove debug messages
---
 drivers/net/wireless/ath/ath10k/mac.c | 36 +++
 drivers/net/wireless/ath/ath10k/wmi.c |  4 +---
 drivers/net/wireless/ath/ath10k/wmi.h | 14 +-
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..df79af89ee71 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2505,8 +2505,9 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 if (sta->bandwidth == IEEE80211_STA_RX_BW_80)
 arg->peer_flags |= ar->wmi.peer_flags->bw80;

-if (sta->bandwidth == IEEE80211_STA_RX_BW_160)
+if (sta->bandwidth == IEEE80211_STA_RX_BW_160) {
 arg->peer_flags |= ar->wmi.peer_flags->bw160;
+}

 /* Calculate peer NSS capability from VHT capabilities if STA
  * supports VHT.
@@ -2529,22 +2530,25 @@ static void ath10k_peer_assoc_h_vht(struct ath10k *ar,
 arg->peer_vht_rates.tx_mcs_set = ath10k_peer_assoc_h_vht_limit(
 __le16_to_cpu(vht_cap->vht_mcs.tx_mcs_map), vht_mcs_mask);

-ath10k_dbg(ar, ATH10K_DBG_MAC, "mac vht peer %pM max_mpdu %d flags 0x%x\n",
-   sta->addr, arg->peer_max_mpdu, arg->peer_flags);
+if (arg->peer_num_spatial_streams && (arg->peer_phymode == MODE_11AC_VHT160 
|| arg->peer_phymode == MODE_11AC_VHT80_80)) {
+arg->peer_bw_rxnss_override = 
BW_NSS_FWCONF_160(arg->peer_num_spatial_streams);
+}


So, an 80Mhz peer could be 4x4 and could connect to a VHT-160 AP. From what I 
can tell,
the VHT-160 AP can talk 4x4 @ 80Mhz to the peer in this case, but if peer is 
VHT-160,
then of course it can only talk at 2x2.

So, I don't think you can just look at the peer_num_spatial_streams here.

no? rxnss_override is only taked if peer phymode is vht160 or vht80_80. vht80 
is not considered in that code PEER phy_mode, not HOST phy_mode
this parameter is a assoc per peer parameter as far as i can say here.
consider that the firmware will accept anything except 0 for 
peer_bw_rxnss_override in vht160 operation mode
if a peer is 80 mhz, the code will be skipped here.


From what I can tell, this feature is supposed to configure the rate-ctrl in 
the firmware to know that
it can only send 1x1 or 2x2 when sending at 160Mhz, but that it can send at 
higher NSS if it sends
at 80Mhz or lower.

If a 2x2 peer connects to the AP, will it have peer_num_spatial_streams set to 
2?  If so,
then your code will configure the peer_bw_rxnss_override to indicate it can 
send at 160Mhz
2x2 as well, right?  And if so, then that is incorrect.

Probably if you connected something like a IPQ4019 station to a 9984 AP 
configured for 160Mhz,
the peer_bw_rxnss_override would be set to 2x2 instead of 1x1?



-if (arg->peer_vht_rates.rx_max_rate &&
-(sta->vht_cap.cap & IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_MASK)) {
-switch (arg->peer_vht_rates.rx_max_rate) {
-case 1560:
-/* Must be 2x2 at 160Mhz is all it can do. */
-arg->peer_bw_rxnss_override = 2;
-break;
-case 780:
-/* Can only do 1x1 at 160Mhz (Long Guard Interval) */
-arg->peer_bw_rxnss_override = 1;
-break;
-}


This old code does look wrong, the firmware is using zero-based, so override-0 
== nss-1, override-1 == nss-2.

0 = crash

and 1 and 2 is wrong.


Yes, it should be 0 and 1.  The old code | in the (1<<31) later.



+#define BW_NSS_FWCONF_MAP_ENABLE (1 << 31)
+#define BW_NSS_FWCONF_MAP_160MHZ_S   (0)
+#define BW_NSS_FWCONF_MAP_160MHZ_M   (0x0007)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_S (3)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_M (0x0038)


This is confusing enough that it deserves a comment in the code I think

the removal doesnt deserve a comment. i dont know how to explain that its 
simply wro

Re: [PATCH v2] ath10k: fix crash in recent 3.5.3 9984 firmware due wrong handling of peer_bw_rxnss_override parameter

2018-04-26 Thread Ben Greear
peer_assoc_h_phymode(ar, vif, sta, arg);

return 0;
 }
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 2c36256a441d..7d72fdc703c8 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -7212,9 +7212,7 @@ ath10k_wmi_peer_assoc_fill_10_4(struct ath10k *ar, void 
*buf,

ath10k_wmi_peer_assoc_fill_10_2(ar, buf, arg);
if (arg->peer_bw_rxnss_override)
-   cmd->peer_bw_rxnss_override =
-   __cpu_to_le32((arg->peer_bw_rxnss_override - 1) |
- BIT(PEER_BW_RXNSS_OVERRIDE_OFFSET));
+   cmd->peer_bw_rxnss_override = 
__cpu_to_le32(arg->peer_bw_rxnss_override);
else
cmd->peer_bw_rxnss_override = 0;
 }
diff --git a/drivers/net/wireless/ath/ath10k/wmi.h 
b/drivers/net/wireless/ath/ath10k/wmi.h
index 46ae19bb2c92..1fe0aa5523a6 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -6380,7 +6380,19 @@ struct wmi_10_2_peer_assoc_complete_cmd {
__le32 info0; /* WMI_PEER_ASSOC_INFO0_ */
 } __packed;

-#define PEER_BW_RXNSS_OVERRIDE_OFFSET  31
+#define BW_NSS_FWCONF_MAP_ENABLE (1 << 31)
+#define BW_NSS_FWCONF_MAP_160MHZ_S   (0)
+#define BW_NSS_FWCONF_MAP_160MHZ_M   (0x0007)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_S (3)
+#define BW_NSS_FWCONF_MAP_80_80MHZ_M (0x0038)


Older FW does not pay attention to the 80_80 bits.  It uses masks so it is
backwards-compat, but maybe worth a comment.


+#define BW_NSS_FWCONF_MAP_M  (0x003F)
+
+#define GET_BW_NSS_FWCONF_160(x) x) & BW_NSS_FWCONF_MAP_160MHZ_M) 
>> BW_NSS_FWCONF_MAP_160MHZ_S) + 1)
+#define GET_BW_NSS_FWCONF_80_80(x)   x) & BW_NSS_FWCONF_MAP_80_80MHZ_M) 
>> BW_NSS_FWCONF_MAP_80_80MHZ_S) + 1)
+
+/* Values defined to set 160 MHz Bandwidth NSS Mapping into FW*/
+#define BW_NSS_FWCONF_160(x)  (BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_160MHZ_S) & BW_NSS_FWCONF_MAP_160MHZ_M))
+#define BW_NSS_FWCONF_80_80(x)        (BW_NSS_FWCONF_MAP_ENABLE | (((x - 1) << 
BW_NSS_FWCONF_MAP_80_80MHZ_S) & BW_NSS_FWCONF_MAP_80_80MHZ_M))

 struct wmi_10_4_peer_assoc_complete_cmd {
struct wmi_10_2_peer_assoc_complete_cmd cmd;



Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH v1] ath10k: fix band_center_freq handling for VHT160 in recent firmwares

2018-04-26 Thread Ben Greear



On 04/26/2018 02:43 AM, s.gottsch...@dd-wrt.com wrote:

From: Sebastian Gottschall <s.gottsch...@dd-wrt.com>

starting with firmware 10.4.3.4.x series QCA changed the handling of the 
channel property band_center_freq1 and band_center_freq2 in vht160 operation 
mode
likelly for backward compatiblity with vht80 only capable clients.
this patch adjusts the handling to get vht160 to work again with official qca 
firmwares newer than 3.3
consider that this patch will not work with older firmwares anymore. to avoid 
undefined behaviour this we disable vht160 capability for outdated firmwares


We should be able to use a feature-flag or otherwise determine if the firmware 
needs the old or new
API and make the driver able to handle both.

Thanks,
Ben


---
 drivers/net/wireless/ath/ath10k/mac.c |  7 ---
 drivers/net/wireless/ath/ath10k/wmi.c | 11 ---
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
b/drivers/net/wireless/ath/ath10k/mac.c
index 5be6386ede8f..d1239d40ac19 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -4449,13 +4449,6 @@ static struct ieee80211_sta_vht_cap 
ath10k_create_vht_cap(struct ath10k *ar)
vht_cap.cap |= val;
}

-   /* Currently the firmware seems to be buggy, don't enable 80+80
-* mode until that's resolved.
-*/
-   if ((ar->vht_cap_info & IEEE80211_VHT_CAP_SHORT_GI_160) &&
-   (ar->vht_cap_info & IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_MASK) == 0)
-   vht_cap.cap |= IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_160MHZ;
-
mcs_map = 0;
for (i = 0; i < 8; i++) {
if ((i < ar->num_rf_chains) && (ar->cfg_tx_chainmask & BIT(i)))
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
b/drivers/net/wireless/ath/ath10k/wmi.c
index 2c36256a441d..d78b8857a513 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1671,13 +1671,18 @@ void ath10k_wmi_put_wmi_channel(struct wmi_channel *ch,
flags |= WMI_CHAN_FLAG_HT40_PLUS;
if (arg->chan_radar)
flags |= WMI_CHAN_FLAG_DFS;
-
+   ch->band_center_freq2 = 0;
ch->mhz = __cpu_to_le32(arg->freq);
ch->band_center_freq1 = __cpu_to_le32(arg->band_center_freq1);
if (arg->mode == MODE_11AC_VHT80_80)
ch->band_center_freq2 = __cpu_to_le32(arg->band_center_freq2);
-   else
-   ch->band_center_freq2 = 0;
+   if (arg->mode == MODE_11AC_VHT160)  {
+   if (arg->freq < arg->band_center_freq1)
+   ch->band_center_freq1 = 
__cpu_to_le32(arg->band_center_freq1 - 40);
+   else
+   ch->band_center_freq1 = 
__cpu_to_le32(arg->band_center_freq1 + 40);   
+   ch->band_center_freq2 = __cpu_to_le32(arg->band_center_freq1);
+   }
ch->min_power = arg->min_power;
ch->max_power = arg->max_power;
ch->reg_power = arg->max_reg_power;



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 02:15 PM, Roopa Prabhu wrote:

On Sun, Apr 22, 2018 at 11:54 AM, David Miller <da...@davemloft.net> wrote:

From: Johannes Berg <johan...@sipsolutions.net>
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we beter
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.



+1.

Also, the RTM_GETSTATS api was added to improve stats query efficiency
(with filters).
 we should look at it  to see if this fits there. Keeping all stats
queries in one place will help.


I like the ethtool API, so I'll be sticking with that for now.

Thanks,
Ben



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-23 Thread Ben Greear

On 04/22/2018 11:54 AM, David Miller wrote:

From: Johannes Berg <johan...@sipsolutions.net>
Date: Thu, 19 Apr 2018 17:26:57 +0200


On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)


I guess you'll have to ask davem. :)


Well, first of all, I really don't like this.

The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.

If foo was not extensible enough such that we needed foo2, we beter
design the new thing with explicitly better extensibility in mind.

Furthermore, what you want here is a specific filter.  Someone else
will want to filter on another criteria, and the next person will
want yet another.

This needs to be properly generalized.

And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.


Well, since there are un-defined flags, it would be simple enough to
extend the API further in the future (flag (1<<31) could mean expect
more input members, etc.  And, adding up to 30 more flags to filter on different
things won't change the API and should be backwards compatible.

But, if you don't want it, that is OK by me, I agree it is a fairly
obscure feature.  It would have saved me time if you had said you didn't
want it at the first RFC patch though...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-19 Thread Ben Greear



On 04/18/2018 11:38 PM, Johannes Berg wrote:

On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote:


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.


Right. Come to think of it though,


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in  rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.
+ *  Same number of stats will be returned, but some of them might
+ *  not be as accurate/refreshed.  This is to allow not querying
+ *  firmware or other expensive-to-read stats, for instance.


"skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to
either really skip and not return the non-refreshed ones (also helps
with the identifying), or rename the flag.


In order to efficiently parse lots of stats over and over again, I probe
the stat names once on startup, map them to the variable I am trying to use
(since different drivers may have different names for the same basic stat),
and then I store the stat index.

On subsequent stat reads, I just grab stats and go right to the index to
store the stat.

If the stats indexes change, that will complicate my logic quite a bit.

Maybe the flag could be called:  ETHTOOL_GS2_NO_REFRESH_FW ?



Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
write the spatch and just add the flags argument to "get_ethtool_stats"
instead of adding a separate method - internally to the kernel it's not
that hard to change.


Maybe this could be in followup patches?  It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-18 Thread Ben Greear

On 04/18/2018 02:26 PM, Johannes Berg wrote:

On Tue, 2018-04-17 at 18:49 -0700, gree...@candelatech.com wrote:


+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in  rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.


It'd be pretty hard to know which flags are firmware stats?


Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.

In my case, I have lots of virtual stations (or APs), and I want stats
for them as well as for the 'radio', so I would probe the first vdev with
flags of 'skip-none' to get all stats, including radio (firmware) stats.

And then the rest I would just probe the non-firmware stats.

To be honest, I was slightly amused that anyone expressed interest in
this patch originally, but maybe other people have similar use case
and/or drivers with slow-to-acquire stats.


Anyway, there's no way I'm going to take this patch, so you need to
float it on netdev first (best CC us here) and get it applied there
before we can do anything on the wifi side.


I posted the patches to netdev, ath10k and linux-wireless.  If I had only
posted them individually to different lists I figure I'd be hearing about how
the netdev patch is useless because it has no driver support, etc.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: second wifi card enforce CN reg dom

2018-04-12 Thread Ben Greear

On 04/12/2018 10:25 AM, solsTiCe d'Hiver wrote:

It's the second time that you (Ben and Steve) are implying that I
might break the law.

But why are you saying that ? I am not gonna repeat myself again.


If you force the NIC to use a different regulatory domain that what it
originally was tested with, then it might generate out-of-spec RF
noise on the newly available channels, and to generate invalid RF noise is
against the law in many places.

*Probably* it would be OK, but you cannot know that for certain
without some specialized RF analyzer equipment and some very detailed
testing.



And for the patch, it is also implied that I am able to write one.


Unfortunately, my opinion is that if you are unable to write one, then
you should not be mucking with the regulatory domain stuff at all.

Thanks,
Ben



2018-04-12 19:11 GMT+02:00 Ben Greear <gree...@candelatech.com>:

On 04/12/2018 10:05 AM, solsTiCe d'Hiver wrote:


Hi.

I thought I made myself clear.
I leave in France. My system(s) is/are set up to use FR as default
regulatory domain.

But when I plug in that tp-link card, I am restricted to use CN
regulatory domain. Why am I the only one to see this as a problem ?

I know that one can only have one regdom defined on the system. I have
set it up myself. So why is it changed behind my back by some card or
whatever ?
Like I said, I am left with the option, to disable crda, or to use 2
systems, one for each card !

Or may be try Windows when this is not messed up like that ??? Well,
it's not on Windows that I will be able to use monitor mode, anyway.



You can hack the ath9k-htc driver to allow over-riding the regdom
of the NIC, but that requires an out of tree patch and is probably
against the law in your country since the NIC may then not be able to
pass the regulatory requirements.

Thanks,
Ben




Never mind.

2018-04-12 17:52 GMT+02:00 Dan Williams <d...@redhat.com>:


On Thu, 2018-04-12 at 08:18 -0700, Steve deRosier wrote:


On Thu, Apr 12, 2018 at 3:51 AM, Arend van Spriel
<arend.vanspr...@broadcom.com> wrote:


On 4/12/2018 10:42 AM, solsTiCe d'Hiver wrote:



Hi.

This is beyond my comprehension that you could assert this is a
non issue.




Well. I am just saying that it is by design. There is no way for
the
regulatory code to determine where you and your hardware actually
reside so
instead it takes a conservative approach.



To say it another way: mixing regulatory domains on your host system
should result in a _smaller_ set of channels - ie only those channels
at the intersection of the two.

And another wrinkle to consider - one of the 802.11 amendments (can't
remember which one) actually causes the radio to listen to the



802.11d I believe, from the early 2000s.

Dan


beacons
around it, determine what the local regulatory domain is based on the
beacons it hears, and then lock to that regulatory domain. It's
possible for that information to be propagated up to the card's host
and the regulatory domain then would affect both cards. That's how
it's supposed to work, though I don't factually know Linux does this
in all cases.  Could it be you're somewhere where CN is the local
regulatory domain and the TL-WN722N has this feature?

In any case, as Arend points out, despite the hand-wringing that
regulatory domains cause users trying to do something particular,
between certain rules and regulations and certain manufacturers bad
interpretations and implementations around it, there's little that
can
be done about it. Fact is, your radio must comply to whatever
regulatory domain you are in, otherwise it's breaking the rules. And
people breaking the regulatory rules is part of what's gotten
governments to pass even worse (for us OSS guys) laws that tighten
those rules down further.

You asked who to contact. Its not the LKML - it's your relevant
government body. And certain manufacturers who improperly interpret
said rules because it's easier for them.

- Steve

--
Steve deRosier
Cal-Sierra Consulting LLC
https://www.cal-sierra.com/






--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com






--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: second wifi card enforce CN reg dom

2018-04-12 Thread Ben Greear

On 04/12/2018 10:05 AM, solsTiCe d'Hiver wrote:

Hi.

I thought I made myself clear.
I leave in France. My system(s) is/are set up to use FR as default
regulatory domain.

But when I plug in that tp-link card, I am restricted to use CN
regulatory domain. Why am I the only one to see this as a problem ?

I know that one can only have one regdom defined on the system. I have
set it up myself. So why is it changed behind my back by some card or
whatever ?
Like I said, I am left with the option, to disable crda, or to use 2
systems, one for each card !

Or may be try Windows when this is not messed up like that ??? Well,
it's not on Windows that I will be able to use monitor mode, anyway.


You can hack the ath9k-htc driver to allow over-riding the regdom
of the NIC, but that requires an out of tree patch and is probably
against the law in your country since the NIC may then not be able to
pass the regulatory requirements.

Thanks,
Ben



Never mind.

2018-04-12 17:52 GMT+02:00 Dan Williams <d...@redhat.com>:

On Thu, 2018-04-12 at 08:18 -0700, Steve deRosier wrote:

On Thu, Apr 12, 2018 at 3:51 AM, Arend van Spriel
<arend.vanspr...@broadcom.com> wrote:

On 4/12/2018 10:42 AM, solsTiCe d'Hiver wrote:


Hi.

This is beyond my comprehension that you could assert this is a
non issue.



Well. I am just saying that it is by design. There is no way for
the
regulatory code to determine where you and your hardware actually
reside so
instead it takes a conservative approach.



To say it another way: mixing regulatory domains on your host system
should result in a _smaller_ set of channels - ie only those channels
at the intersection of the two.

And another wrinkle to consider - one of the 802.11 amendments (can't
remember which one) actually causes the radio to listen to the


802.11d I believe, from the early 2000s.

Dan


beacons
around it, determine what the local regulatory domain is based on the
beacons it hears, and then lock to that regulatory domain. It's
possible for that information to be propagated up to the card's host
and the regulatory domain then would affect both cards. That's how
it's supposed to work, though I don't factually know Linux does this
in all cases.  Could it be you're somewhere where CN is the local
regulatory domain and the TL-WN722N has this feature?

In any case, as Arend points out, despite the hand-wringing that
regulatory domains cause users trying to do something particular,
between certain rules and regulations and certain manufacturers bad
interpretations and implementations around it, there's little that
can
be done about it. Fact is, your radio must comply to whatever
regulatory domain you are in, otherwise it's breaking the rules. And
people breaking the regulatory rules is part of what's gotten
governments to pass even worse (for us OSS guys) laws that tighten
those rules down further.

You asked who to contact. Its not the LKML - it's your relevant
government body. And certain manufacturers who improperly interpret
said rules because it's easier for them.

- Steve

--
Steve deRosier
Cal-Sierra Consulting LLC
https://www.cal-sierra.com/





--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

2018-03-25 Thread Ben Greear



On 03/25/2018 12:45 PM, Alexander Wetzel wrote:



What will happen to drivers like ath10k that cannot do software

encrypt/decrypt?


ath10k can support multiple key-ids as far as I can tell,
so maybe it would just never hit this code?


Still learning how that all fits together, but I'm sure any card using
mac80211 will also use ieee80211_key_replace, including ath10k.

We are in a race with the remote station there is no chance that we can
switch over exactly at the same time. If we can't fall pack to software
encryption we'll just have to drop some more packets.

I'm pretty sure mac80211 will just encrypt a frame in software and
send it to ath10 for processing once we have removed the key from the hw
in the same way as for any other card.


I don't think ath10k can handle sending already-encrypted data packets,
but possibly it works with newer upstream firmware/driver.

Either way, as long as it does not fundamentally break something (like
a non-recoverable data stall), then maybe your patch is fine anyway
and ath10k may just drop a few extra frames.


My expectation here would be, that the driver detects and drops the
pre-encrypted frames it no longer has a hw key for.

Unfortunately this is just an assumption, since I haven't found the code
handling this case in ath10k. And even if true this could well cause
some undesired warning messages.

I guess we should therefore make sure we do not send out any packets in
the critical time window.

Now stopping and flushing the queues seems to be bad idea which could
cause a real performance impact for on a busy AP with many stations and
rekeys enabled...
Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
old key to make sure we stop sending packets till the rekey is done.

That should cause ieee80211_tx_h_select_key to drop all packets without
a new per-packet check and also should cover potential undesired side
effects, isn't it?


I get lost in the weeds when trying to understand all of this, and some
previous attempts of mine to fix some of this evidently wasn't correct
enough to accept upstream:

https://www.spinics.net/lists/hostap/msg03677.html

So I really don't know enough to properly review
your patch.  Just be aware that ath10k is weird about sw-crypt, maybe make
sure your patch is tested on it to make sure it doesn't out-right break 
something.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

2018-03-24 Thread Ben Greear



On 03/24/2018 03:29 AM, Alexander Wetzel wrote:

Rekeying a pairwise key with encryption offload and only keyid 0 has two
potential races which can freeze the wlan conection till rekeyed again:

 1) For incomming packets:
If the local STA installs the key prior to the remote STA we still
have the old key active in the hardware for a short time after
mac80211 switched to the new key.
The card can still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number and
tricking the local replay detection to drop all packets really sent
with the new key.

 2) For outgoing packets:
If mac80211 is providing the PN (IV) and hands over the cleartext
packets for encryption to the hardware immediately prior to a key
change the driver/card may process the queued packets after
switching to the new key.
This will immediatelly bump the PN (IV) value on the remote STA to
an incorrect high number, also freezing the connection.

Both issues can be prevented by deleting the key from the hardware prior
to switching to the new key in mac80211, falling back to software
encryption/decryption till the switch to the new key is completed.


What will happen to drivers like ath10k that cannot do software encrypt/decrypt?

ath10k can support multiple key-ids as far as I can tell,
so maybe it would just never hit this code?

Thanks,
Ben



Signed-off-by: Alexander Wetzel <alexander.wet...@web.de>
---
 net/mac80211/key.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index aee05ec3f7ea..266ea0b507e7 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -332,10 +332,15 @@ static void ieee80211_key_replace(struct 
ieee80211_sub_if_data *sdata,

WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);

-   if (old)
+   if (old) {
idx = old->conf.keyidx;
-   else
+   /* Make sure the card can't encrypt/decrypt packets with
+* the old key prior to switching to new key in mac80211.
+*/
+   ieee80211_key_disable_hw_accel(old);
+   } else {
idx = new->conf.keyidx;
+   }

if (sta) {
        if (pairwise) {



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] ath9k: Protect queue draining by rcu_read_lock()

2018-03-23 Thread Ben Greear

On 02/02/2018 02:36 AM, Toke Høiland-Jørgensen wrote:

When ath9k was switched over to use the mac80211 intermediate queues,
node cleanup now drains the mac80211 queues. However, this call path is
not protected by rcu_read_lock() as it was previously entirely internal
to the driver which uses its own locking.


As far as I can tell, this is not currently in Linus' tree.

Was this dropped on purpose?

Thanks,
Ben



This leads to a possible rcu_dereference() without holding
rcu_read_lock(); but only if a station is cleaned up while having
packets queued on the TXQ. Fix this by adding the rcu_read_lock() to the
caller in ath9k.

Fixes: 50f08edf9809 ("ath9k: Switch to using mac80211 intermediate software 
queues.")
Cc: sta...@vger.kernel.org
Reported-by: Ben Greear <gree...@candelatech.com>
Signed-off-by: Toke Høiland-Jørgensen <t...@toke.dk>
---
 drivers/net/wireless/ath/ath9k/xmit.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/wireless/ath/ath9k/xmit.c 
b/drivers/net/wireless/ath/ath9k/xmit.c
index 396bf05c6bf6..d8b041f48ca8 100644
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -2892,6 +2892,8 @@ void ath_tx_node_cleanup(struct ath_softc *sc, struct 
ath_node *an)
struct ath_txq *txq;
int tidno;

+   rcu_read_lock();
+
for (tidno = 0; tidno < IEEE80211_NUM_TIDS; tidno++) {
tid = ath_node_to_tid(an, tidno);
txq = tid->txq;
@@ -2909,6 +2911,8 @@ void ath_tx_node_cleanup(struct ath_softc *sc, struct 
ath_node *an)
if (!an->sta)
break; /* just one multicast ath_atx_tid */
}
+
+   rcu_read_unlock();
 }

 #ifdef CONFIG_ATH9K_TX99




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [ath10k throughput] low throughput in multi-user mode

2018-03-21 Thread Ben Greear

On 03/20/2018 06:44 PM, gary wrote:


Hi all,
I have run the throughput test on veriwave.(from Ethernet to wireless)
My AP suports 4*4, 11ac, mu-mimo.,wireless chip is QCA9984.
At first, the throughput is about 80Mbps only with 16 users, so I force to
return true in function ath10k_mac_tx_can_push().
As a result, when the user number is 1~16, the udp throughput is about
1.2Gbps ~1.4Gbps.
But when the user number is 32, the udp throughput is only 520Mbps.

So I try to substitute ath10k-firmware.
With firmware 10.4-3.4-0082 and above, the throughput is around 520Mbps.
But with firmware 10.4-3.4-0072, the throughput gets 1.1Gbps.

Do I miss something in configuration?
Any comment is appreciated.


What kernel version (or backports, if you are using that?) are you using?

And, I'd be curious how the latest ath10k-CT (beta) firmware compares
if you have time to try that:

http://www.candelatech.com/ath10k-10.4.php

At least in upload testing, we get better throughput with lots of
virtual stations if we tune the number of tx descriptors to be larger
as that will generate larger AMPDUs on air.

You might also check your rate-ctrl logic to make sure all of your
stations are using higher MCS rates.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 11:24 AM, Michal Kubecek wrote:

On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps a use fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.


I don't think so. What I mean is:

(a) driver implements ->get_ethtool_stats2() callback; then we use it
for GSTATS2
(b) driver does not implement get_ethtool_stats2() but implements
->get_ethtool_stats(); then we call for GSTATS2 if level is zero,
otherwise GSTATS2 returns -EINVAL

and GSTATS is always translated to GSTATS2 with level 0, either by
defining ethtool_get_stats() as a wrapper or by fall-through in the
switch statement.

This way, most drivers could be left untouched and only those which
would implement non-default levels would provide ->get_ethtool_stats2()
callback instead of ->get_ethtool_stats().


OK, that makes sense.  I'll wait on feedback from the flags or #defined levels
and re-spin the patch accordingly.

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 09:11 AM, Steve deRosier wrote:

On Tue, Mar 20, 2018 at 8:39 AM, Ben Greear <gree...@candelatech.com> wrote:

On 03/20/2018 03:37 AM, Michal Kubecek wrote:


On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:


From: Ben Greear <gree...@candelatech.com>

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear <gree...@candelatech.com>
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.





Yes, but that would require changing all drivers at once, and would make
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature
would
make it upstream, so I didn't want to propose any large changes up front.



Hi Ben,

I find the feature OK, but I'm not thrilled with the arbitrary scale
of "level". Maybe there could be some named values, either on a
spectrum as level already is, similar to the kernel log DEBUG, WARN,
INFO  type levels. Or named bit flags like the way the ath drivers
do their debug flags for granular results.  Thoughts?


Yes, that would be easier to code too.  If there are any other drivers
out there that might take advantage of this, maybe they could chime in with
what levels and/or bit-fields they would like to see.

For instance a bit that says 'refresh-stats-from-firmware' would be great for 
ath10k,
but maybe useless for everyone else

Thanks,
Ben



- Steve




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Ben Greear

On 03/20/2018 03:37 AM, Michal Kubecek wrote:

On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

This is similar to ETHTOOL_GSTATS, but it allows you to specify
a 'level'.  This level can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear <gree...@candelatech.com>
---

NOTE:  I know to make it upstream I would need to split the patch and
remove the #define for 'backporting' that I added.  But, is the
feature in general wanted?  If so, I'll do the patch split and
other tweaks that might be suggested.


I'm not familiar enough with the technical background of stats
collecting to comment on usefulness and desirability of this feature.
Adding a new command just to add a numeric parameter certainly doesn't
feel right but it's how the ioctl interface works. I take it as
a reminder to find some time to get back to the netlink interface.


diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 674b6c9..d3b709f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
return ret;
 }

+static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
+{
+   struct ethtool_stats stats;
+   const struct ethtool_ops *ops = dev->ethtool_ops;
+   u64 *data;
+   int ret, n_stats;
+   u32 stats_level = 0;
+
+   if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
+   return -EOPNOTSUPP;
+
+   n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
+   if (n_stats < 0)
+   return n_stats;
+   if (n_stats > S32_MAX / sizeof(u64))
+   return -ENOMEM;
+   WARN_ON_ONCE(!n_stats);
+   if (copy_from_user(, useraddr, sizeof(stats)))
+   return -EFAULT;
+
+   /* User can specify the level of stats to query.  How the
+* level value is used is up to the driver, but in general,
+* 0 means 'all', 1 means least, and higher means more.
+* The idea is that some stats may be expensive to query, so user
+* space could just ask for the cheap ones...
+*/
+   stats_level = stats.n_stats;
+
+   stats.n_stats = n_stats;
+   data = vzalloc(n_stats * sizeof(u64));
+   if (n_stats && !data)
+   return -ENOMEM;
+
+   ops->get_ethtool_stats2(dev, , data, stats_level);
+
+   ret = -EFAULT;
+   if (copy_to_user(useraddr, , sizeof(stats)))
+   goto out;
+   useraddr += sizeof(stats);
+   if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
+   goto out;
+   ret = 0;
+
+ out:
+   vfree(data);
+   return ret;
+}
+
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_stats stats;


IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps a use fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.


Yes, but that would require changing all drivers at once, and would make 
backporting
and out-of-tree drivers harder to manage.  I had low hopes that this feature 
would
make it upstream, so I didn't want to propose any large changes up front.

Thanks,
Ben



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: fixed bit rate 9Mbps set as 24Mbps in g mode(legacy)

2018-02-28 Thread Ben Greear

On 02/27/2018 08:03 PM, KAVITA MATHUR wrote:




On Tue, 27 Feb 2018 08:41:50 -0800, Ben Greear wrote

On 02/27/2018 12:49 AM, KAVITA MATHUR wrote:

Hi,

I have configured AP in g mode and tested legacy rates.All basic rates and 
supported
rates(1,2,5.5,11,6,9,12,18,24,36,48,54) are set except 9Mbps.

When fixed bit rate 9Mbps is set, it gets set as 24 Mbps. This issue has been 
observed
in backports-4.4.2-1 as well as latest git version of ath10k driver.


Is anybody aware of this issue? How can I fix this rate 9Mbps?


You should specify which firmware/chipset you are using, since this might
easily be a firmware issue.



Chipset : QCA988x
ath10k version : git 16 oct 2017(v4.14-rc2-1-24-gc6db471)
firmware version : firmware-5.bin_10.2.4.70.66
Kernel version : 3.12


And, how are you determining that 'it gets set to 24Mbps'.  Is that the
encoding you see in a wifi sniff, or is the driver being set to 24Mbps
improperly when you use iw (or whatever) to try configuring the rate?


I have used command "iw wlan0 station dump" to see set data rate as well as I 
have
sniffer. Measured data throughput is also as per 24Mbps.


I use more recent (and hacked) kernels/drivers, and my own wave-1 firmware.  I 
did have
a bug in some code I wrote because the 48Mbps rate code is '0x0', and some of
my code was ignoring it thinking it was not properly set.  I did see correct
on-air 48Mbps encoding rates though.

Possibly you have similar problems with rate-code 0x0.

Have you tried doing a capture with an ath9k or some other non-ath10k driver
to see if it is really 24Mbps on air?






Recent upstream mac80211 has broken the ability to set a single tx rate as
well, last I checked.


Other rates are set properly except 9Mbps.By default it sets as 6Mbps, when data
transfer starts, then it set to the mentioned fixed data rate.


Setting the tx rate through normal API will not effect management frames, and 
at the beginning of
the connection, perhaps no data frames have been sent/received yet, so you would
see 6Mbps reported.

Thanks,
Ben



Thanks & Regards,
Kavita

Thanks,
Ben




Thanks & Regards,
Kavita


--
To unsubscribe from this list: send the line "unsubscribe backports" in



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Thanks & Regards,
कविता माथुर Kavita Mathur
वरिष्ठ अनुसंधान अभियंता  Senior Research Engineer
सी-डॉट  C-DOT
इलैक्ट्रॉनिक्स सिटी फेज़ I Electronics City Phase I
होसूर रोड, बेंगलूरु   Hosur Road, Bengaluru – 560100
फोन  Ph 080-28529896
Disclaimer:
--
This email and any files transmitted with it




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC] ath10k: Attempt to work around napi_synchronize hang.

2018-02-28 Thread Ben Greear

On 02/28/2018 09:31 AM, Michał Kazior wrote:

On 28 February 2018 at 02:22,  <gree...@candelatech.com> wrote:
[...]

@@ -2086,8 +2087,28 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
ath10k_pci_irq_disable(ar);
ath10k_pci_irq_sync(ar);
ath10k_pci_flush(ar);
-   napi_synchronize(>napi);
-   napi_disable(>napi);
+
+   /* Calling napi_disable twice in a row (w/out starting it and/or without
+* having NAPI active leads to deadlock because napi_disable sets
+* NAPI_STATE_SCHED and NAPI_STATE_NPSVC when it returns, as far as I
+* can tell.  So, guard this call to napi_disable.  I believe the
+* failure case is something like this:
+* rmmod ath10k_pci ath10k_core
+*   Firmware crashes before hif_stop is called by the rmmod path
+*   The crash handling logic calls hif_stop
+ *   Then rmmod gets around to calling hif_stop, but spins endlessly
+*   in napi_synchronize.
+*
+*  I think one way this could happen is that ath10k_stop checks
+*  for state != ATH10K_STATE_OFF, but STATE_RESTARTING is also
+*  a possibility.  That might be how we can have hif_stop called twice
+*  without a hif_start in between. --Ben
+*/
+   if (ar->napi_enabled) {
+   napi_synchronize(>napi);
+   napi_disable(>napi);
+   ar->napi_enabled = false;
+   }


Looking at the code and your comment the described fail case seems
legit. I would consider tuning ath10k_stop() so that it calls
ath10k_halt() only if ar->state != OFF & ar->state != RESTARTING
though. Or did you try already?


I did not try tuning ath10k_stop().  The code in this area is quite complex,
and in my opinion trying to keep the start/stop calls exactly matched without
individual 'has_started' flags seems ripe for bugs.


While your approach will probably work it won't prevent other non-NAPI
bad things from happening. Even if there's nothing today it might
creep up in the future. And you'd need to update ahb.c too.


I'll update ahb.c to match.

Thanks,
Ben




Michał




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Deadlock debugging help.

2018-02-27 Thread Ben Greear

On 02/27/2018 01:42 PM, Ben Greear wrote:

On 02/27/2018 12:49 PM, Ben Greear wrote:

I notice I can reliably lock up the kernel if I rmmod ath10k while it is under
heavy tx/rx traffic.  First, this causes the firmware to crash, and then right
after (or possibly during?) the related kernel threads deadlock.

This is with my hacked driver and hacked firmware.  In particular, the
ath10k_debug_nop_dwork is something I added, though it is pretty trivial,
it does take the ar->conf_mutex.  It appears blocked trying to get it.

It appears something is holding the ar->conf_mutex, but it is not clear to
me from the lockdep output what process actually holds it.
Anyone see a clue they could share?


Changing how I start/stop the nop_dwork stuff seems to have made the
problem go away, so I guess maybe that was the issue.


Ok, so problem still remains.  The 'rmmod' process appears to be the
one that is really not making progress.  Unfortunately, decoding
ath10k_pci_hif_stop+0x6f leads to some bitops.h inline, which doesn't
let me know where it is actually stuck...  Off to do more debugging


[ 4037.220992] rmmod   D0 20267   3050 0x0080
[ 4037.220995] Call Trace:
[ 4037.220997]  __schedule+0x407/0xb70
[ 4037.220999]  ? _raw_spin_unlock_irqrestore+0x4e/0x70
[ 4037.221003]  schedule+0x38/0x90
[ 4037.221005]  schedule_timeout+0x224/0x580
[ 4037.221007]  ? retint_kernel+0x2d/0x2d
[ 4037.221010]  ? call_timer_fn+0x370/0x370
[ 4037.221015]  msleep+0x34/0x40
[ 4037.221017]  ? msleep+0x34/0x40
[ 4037.221021]  ath10k_pci_hif_stop+0x6f/0xd0 [ath10k_pci]
[ 4037.221032]  ath10k_core_stop+0x4d/0x90 [ath10k_core]
[ 4037.221038]  ath10k_halt+0x14b/0x1f0 [ath10k_core]
[ 4037.221044]  ath10k_stop+0x36/0x80 [ath10k_core]
[ 4037.221059]  drv_stop+0x58/0x2d0 [mac80211]
[ 4037.221075]  ieee80211_stop_device+0x3e/0x50 [mac80211]
[ 4037.221088]  ieee80211_do_stop+0x501/0x880 [mac80211]
[ 4037.221092]  ? dev_deactivate_many+0x2b2/0x2f0
[ 4037.221105]  ieee80211_stop+0x15/0x20 [mac80211]
[ 4037.221107]  __dev_close_many+0x93/0xe0
[ 4037.221110]  dev_close_many+0x7d/0x120
[ 4037.221114]  dev_close.part.85+0x36/0x50
[ 4037.221116]  dev_close+0x15/0x20
[ 4037.221155]  cfg80211_shutdown_all_interfaces+0x44/0xc0 [cfg80211]
[ 4037.221168]  ieee80211_remove_interfaces+0x42/0x1c0 [mac80211]
[ 4037.221180]  ieee80211_unregister_hw+0x45/0x130 [mac80211]
[ 4037.221187]  ath10k_mac_unregister+0x14/0x60 [ath10k_core]
[ 4037.221193]  ath10k_core_unregister+0x3a/0xa0 [ath10k_core]
[ 4037.221197]  ath10k_pci_remove+0x2d/0x70 [ath10k_pci]
[ 4037.221200]  pci_device_remove+0x34/0xb0
[ 4037.221203]  device_release_driver_internal+0x158/0x210
[ 4037.221206]  driver_detach+0x3b/0x80
[ 4037.221208]  bus_remove_driver+0x53/0xd0
[ 4037.221210]  driver_unregister+0x27/0x40
[ 4037.221213]  pci_unregister_driver+0x24/0x90
[ 4037.221216]  ath10k_pci_exit+0x10/0x6ee [ath10k_pci]
[ 4037.221218]  SyS_delete_module+0x1e1/0x2a0
[ 4037.221222]  do_syscall_64+0x64/0x140
[ 4037.221225]  entry_SYSCALL64_slow_path+0x25/0x25

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: Deadlock debugging help.

2018-02-27 Thread Ben Greear

On 02/27/2018 12:49 PM, Ben Greear wrote:

I notice I can reliably lock up the kernel if I rmmod ath10k while it is under
heavy tx/rx traffic.  First, this causes the firmware to crash, and then right
after (or possibly during?) the related kernel threads deadlock.

This is with my hacked driver and hacked firmware.  In particular, the
ath10k_debug_nop_dwork is something I added, though it is pretty trivial,
it does take the ar->conf_mutex.  It appears blocked trying to get it.

It appears something is holding the ar->conf_mutex, but it is not clear to
me from the lockdep output what process actually holds it.
Anyone see a clue they could share?


Changing how I start/stop the nop_dwork stuff seems to have made the
problem go away, so I guess maybe that was the issue.

Thanks,
Ben



Thanks,
Ben


[ 1618.914057] sysrq: SysRq : Show Regs
[ 1618.916617] CPU: 2 PID: 0 Comm: swapper/2 Tainted: GW  O4.13.16+ 
#2
[ 1618.916618] Hardware name: _ _/ , BIOS 5.11 08/26/2016
[ 1618.916620] task: 88016d3ea740 task.stack: c96c8000
[ 1618.916624] RIP: 0010:cpuidle_enter_state+0x130/0x3b0
[ 1618.916625] RSP: 0018:c96cbe78 EFLAGS: 0202 ORIG_RAX: 
ff3e
[ 1618.916628] RAX: 88016d3ea740 RBX: 0008 RCX: 001f
[ 1618.916629] RDX: 0001 RSI: 0001 RDI: 88016d3ea740
[ 1618.916630] RBP: c96cbeb0 R08: 52ca R09: 
[ 1618.916631] R10:  R11:  R12: 0008
[ 1618.916633] R13: 880172d26c00 R14: 81ec66b8 R15: 0178ee494b98
[ 1618.916634] FS:  () GS:880172d0() 
knlGS:
[ 1618.916636] CS:  0010 DS:  ES:  CR0: 80050033
[ 1618.916637] CR2: 7ffdda2194b8 CR3: 01e0f000 CR4: 003406e0
[ 1618.916638] DR0:  DR1:  DR2: 
[ 1618.916639] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1618.916640] Call Trace:
[ 1618.916645]  cpuidle_enter+0x12/0x20
[ 1618.916647]  call_cpuidle+0x1e/0x40
[ 1618.916649]  do_idle+0x184/0x1e0
[ 1618.916651]  cpu_startup_entry+0x5f/0x70
[ 1618.916654]  start_secondary+0x150/0x180
[ 1618.916657]  secondary_startup_64+0x9f/0x9f
[ 1618.916662] Code: b7 df 95 ff 48 89 45 d0 0f 1f 44 00 00 31 ff e8 97 6c 97 ff 8b 
45 c8 85 c0 0f 85 83 01 00 00 e8 57 5b 98 ff fb 66 0f 1f 44 00 00 <48> 8b 5d
d0 48 ba cf f7 53 e3 a5 9b c4 20 4c 29 fb 48 89 d8 48

[ 1618.916724] CPU#2: ctrl:   0007000f
[ 1618.916726] CPU#2: status: 
[ 1618.916727] CPU#2: overflow:   
[ 1618.916728] CPU#2: fixed:  00b0
[ 1618.916729] CPU#2: pebs:   
[ 1618.916730] CPU#2: debugctl:   
[ 1618.916731] CPU#2: active: 0002
[ 1618.916732] CPU#2:   gen-PMC0 ctrl:  
[ 1618.916733] CPU#2:   gen-PMC0 count: 
[ 1618.916734] CPU#2:   gen-PMC0 left:  
[ 1618.916735] CPU#2:   gen-PMC1 ctrl:  
[ 1618.916736] CPU#2:   gen-PMC1 count: 
[ 1618.916737] CPU#2:   gen-PMC1 left:  
[ 1618.916738] CPU#2:   gen-PMC2 ctrl:  
[ 1618.916739] CPU#2:   gen-PMC2 count: 
[ 1618.916740] CPU#2:   gen-PMC2 left:  
[ 1618.916742] CPU#2:   gen-PMC3 ctrl:  
[ 1618.916743] CPU#2:   gen-PMC3 count: 
[ 1618.916744] CPU#2:   gen-PMC3 left:  
[ 1618.916745] CPU#2: fixed-PMC0 count: 
[ 1618.916746] CPU#2: fixed-PMC1 count: fffed3945191
[ 1618.916747] CPU#2: fixed-PMC2 count: 
[ 1623.035449] INFO: task kworker/u8:1:273 blocked for more than 180 seconds.
[ 1623.041155]   Tainted: GW  O4.13.16+ #2
[ 1623.045173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1623.051966] kworker/u8:1D0   273  2 0x
[ 1623.052010] Workqueue: cfg80211 cfg80211_dfs_channels_update_work [cfg80211]
[ 1623.052026] Call Trace:
[ 1623.052032]  __schedule+0x407/0xb70
[ 1623.052037]  schedule+0x38/0x90
[ 1623.052039]  schedule_preempt_disabled+0x10/0x20
[ 1623.052041]  __mutex_lock+0x3d3/0x930
[ 1623.052044]  ? rtnl_lock+0x12/0x20
[ 1623.052050]  mutex_lock_nested+0x16/0x20
[ 1623.052052]  ? mutex_lock_nested+0x16/0x20
[ 1623.052054]  rtnl_lock+0x12/0x20
[ 1623.052066]  cfg80211_dfs_channels_update_work+0x3c/0x1c0 [cfg80211]
[ 1623.052069]  ? process_one_work+0x14f/0x6a0
[ 1623.052074]  process_one_work+0x1ce/0x6a0
[ 1623.052078]  worker_thread+0x46/0x440
[ 1623.052083]  kthread+0x10f/0x150
[ 1623.052084]  ? process_one_work+0x6a0/0x6a0
[ 1623.052086]  ? kthread_create_on_node+0x40/0x40
[ 1623.052088]  ? kthread_create_on_node+0x40/0x40
[ 1623.052091]  ret_from_fork+0x2a/0x40
[ 1623.052098] INFO: task kworker/u8:2:281 blocked for more than 180 seconds.
[ 1623.057700]

Deadlock debugging help.

2018-02-27 Thread Ben Greear
16738]  #0:  (tasklist_lock){.+.+..}, at: [] 
debug_show_all_locks+0x3d/0x1a0
[ 1623.116747] 3 locks held by kworker/u8:1/273:
[ 1623.116749]  #0:  ("cfg80211"){.+.+.+}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116755]  #1:  ((&(>dfs_update_channels_wk)->work)){+.+...}, at: 
[] process_one_work+0x14f/0x6a0
[ 1623.116761]  #2:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x12/0x20
[ 1623.116768] 3 locks held by kworker/u8:2/281:
[ 1623.116769]  #0:  ("%s""ath10k_wq"){.+.+.+}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116775]  #1:  ((&(>debug.nop_dwork)->work)){+.+...}, at: 
[] process_one_work+0x14f/0x6a0
[ 1623.116781]  #2:  (>conf_mutex){+.+.+.}, at: [] 
ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
[ 1623.116792] 3 locks held by kworker/0:3/462:
[ 1623.116793]  #0:  ("events"){.+.+.+}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116799]  #1:  ((linkwatch_work).work){+.+.+.}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116805]  #2:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x12/0x20
[ 1623.116812] 3 locks held by kworker/1:2/562:
[ 1623.116813]  #0:  ("events_freezable"){.+.+..}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116819]  #1:  ((>restart_work)){+.+...}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116825]  #2:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x12/0x20
[ 1623.116845] 2 locks held by bash/1896:
[ 1623.116846]  #0:  (>ldisc_sem){.+}, at: [] 
ldsem_down_read+0x2d/0x40
[ 1623.116852]  #1:  (>atomic_read_lock){+.+...}, at: 
[] n_tty_read+0xa9/0x8d0
[ 1623.116860] 1 lock held by btserver/3871:
[ 1623.116861]  #0:  (rtnl_mutex){+.+.+.}, at: [] 
rtnetlink_rcv+0x16/0x30
[ 1623.116867] 3 locks held by kworker/2:0/4308:
[ 1623.116868]  #0:  ("%s"("ipv6_addrconf")){.+.+..}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116874]  #1:  ((addr_chk_work).work){+.+...}, at: [] 
process_one_work+0x14f/0x6a0
[ 1623.116880]  #2:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x12/0x20
[ 1623.116887] 4 locks held by rmmod/7551:
[ 1623.116888]  #0:  (>mutex){..}, at: [] 
device_release_driver_internal+0x26/0x210
[ 1623.116894]  #1:  (>mutex){..}, at: [] 
device_release_driver_internal+0x34/0x210
[ 1623.116899]  #2:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x12/0x20
[ 1623.116905]  #3:  (>conf_mutex){+.+.+.}, at: [] 
ath10k_stop+0x24/0x80 [ath10k_core]
[ 1623.116915] 2 locks held by bash/7559:
[ 1623.116916]  #0:  (>ldisc_sem){++++.+}, at: [] 
ldsem_down_read+0x2d/0x40
[ 1623.116922]  #1:  (>atomic_read_lock){+.+...}, at: 
[] n_tty_read+0xa9/0x8d0

[ 1623.116929] =


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: fixed bit rate 9Mbps set as 24Mbps in g mode(legacy)

2018-02-27 Thread Ben Greear

On 02/27/2018 12:49 AM, KAVITA MATHUR wrote:

Hi,

I have configured AP in g mode and tested legacy rates.All basic rates and 
supported
rates(1,2,5.5,11,6,9,12,18,24,36,48,54) are set except 9Mbps.

When fixed bit rate 9Mbps is set, it gets set as 24 Mbps. This issue has been 
observed
in backports-4.4.2-1 as well as latest git version of ath10k driver.


Is anybody aware of this issue? How can I fix this rate 9Mbps?


You should specify which firmware/chipset you are using, since this might
easily be a firmware issue.

And, how are you determining that 'it gets set to 24Mbps'.  Is that the encoding
you see in a wifi sniff, or is the driver being set to 24Mbps improperly when
you use iw (or whatever) to try configuring the rate?

Recent upstream mac80211 has broken the ability to set a single tx rate as well,
last I checked.

Thanks,
Ben




Thanks & Regards,
Kavita


--
To unsubscribe from this list: send the line "unsubscribe backports" in




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] ath9k: break out of irq handler after 5 jiffies

2018-02-26 Thread Ben Greear

On 02/26/2018 02:08 PM, Arend van Spriel wrote:

On 2/26/2018 10:39 PM, Ben Greear wrote:

On 02/07/2018 07:39 AM, Ben Greear wrote:



On 02/07/2018 02:55 AM, Johannes Berg wrote:

On Wed, 2018-02-07 at 10:16 +0100, Felix Fietkau wrote:

On 2018-02-07 00:05, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

In case where the system is sluggish, we should probably break out
early.  Maybe this will fix issues where the OS thinks the IRQ handler
is not responding and disables the IRQ because 'nobody cared'

Signed-off-by: Ben Greear <gree...@candelatech.com>


5 jiffies as a hardcoded value is a bad idea, since it produces
different behavior based on CONFIG_HZ.


I figured that was a benefit since it would run shorter duration on
systems with
a faster HZ clock.



Also, err, NAPI? Or is something else is going on here?


I don't really know, but part of my test was running traffic while
creating
1200 stations, so likely there were lots of higher-level lock
contention that
slowed down sending pkts up the stack.

I got a bunch of errors about IRQs being ignored because nobody
cared.  I noticed
that the ath9k loop could handle up to 500 or so frames, and that
seemed like too
many for my particular test case.

Once I put in this patch, I did not see the 'nobody cared' error again.

There could easily be a better fix.  If you all want me to use a fixed
time instead
of HZ, then please suggest a value.  I was testing with HZ of 1000, btw.


Hello,

I don't mind changing this patch, but I could use some guidance as to what
values you all want me to use.

Should I use a millisecond based clock instead of jiffies?

What time duration do you want if 5 Jiffies (or 5ms) is not desired?


Hi Ben,

Instead of using some time unit you could consider breaking out after handing 
'x' number of frames and make 'x' configurable through debugfs.


I don't see why you would care about number of pkts...it is just a proxy for 
time, right?

So, in that case, then using jiffies (or some other fast timer) seems the most 
useful.

Thanks,
Ben



Regards,
Arend




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] ath9k: break out of irq handler after 5 jiffies

2018-02-26 Thread Ben Greear

On 02/07/2018 07:39 AM, Ben Greear wrote:



On 02/07/2018 02:55 AM, Johannes Berg wrote:

On Wed, 2018-02-07 at 10:16 +0100, Felix Fietkau wrote:

On 2018-02-07 00:05, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

In case where the system is sluggish, we should probably break out
early.  Maybe this will fix issues where the OS thinks the IRQ handler
is not responding and disables the IRQ because 'nobody cared'

Signed-off-by: Ben Greear <gree...@candelatech.com>


5 jiffies as a hardcoded value is a bad idea, since it produces
different behavior based on CONFIG_HZ.


I figured that was a benefit since it would run shorter duration on systems with
a faster HZ clock.



Also, err, NAPI? Or is something else is going on here?


I don't really know, but part of my test was running traffic while creating
1200 stations, so likely there were lots of higher-level lock contention that
slowed down sending pkts up the stack.

I got a bunch of errors about IRQs being ignored because nobody cared.  I 
noticed
that the ath9k loop could handle up to 500 or so frames, and that seemed like 
too
many for my particular test case.

Once I put in this patch, I did not see the 'nobody cared' error again.

There could easily be a better fix.  If you all want me to use a fixed time 
instead
of HZ, then please suggest a value.  I was testing with HZ of 1000, btw.


Hello,

I don't mind changing this patch, but I could use some guidance as to what
values you all want me to use.

Should I use a millisecond based clock instead of jiffies?

What time duration do you want if 5 Jiffies (or 5ms) is not desired?

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] Revert "ath10k: send (re)assoc peer command when NSS changed"

2018-02-26 Thread Ben Greear

On 02/25/2018 10:16 PM, Karthikeyan Periyasamy wrote:

This reverts commit 55884c045d31a29cf69db8332d1064a1b61dd159.

When Ath10k is in AP mode and an unassociated STA sends a VHT action frame
(Operating Mode Notification for the NSS change) periodically to AP this causes
ath10k to call ath10k_station_assoc() which sends WMI_PEER_ASSOC_CMDID during
NSS update. Over the time (with a certain client it can happen within 15 mins
when there are over 500 of these VHT action frames) continuous calls of
WMI_PEER_ASSOC_CMDID cause firmware to assert due to resource exhaust.


Can you share exactly which resource the firmware ran out of?  It would seem to
be a FW bug if it is leaking, so maybe it can be fixed as well...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] ath9k: break out of irq handler after 5 jiffies

2018-02-07 Thread Ben Greear



On 02/07/2018 02:55 AM, Johannes Berg wrote:

On Wed, 2018-02-07 at 10:16 +0100, Felix Fietkau wrote:

On 2018-02-07 00:05, gree...@candelatech.com wrote:

From: Ben Greear <gree...@candelatech.com>

In case where the system is sluggish, we should probably break out
early.  Maybe this will fix issues where the OS thinks the IRQ handler
is not responding and disables the IRQ because 'nobody cared'

Signed-off-by: Ben Greear <gree...@candelatech.com>


5 jiffies as a hardcoded value is a bad idea, since it produces
different behavior based on CONFIG_HZ.


I figured that was a benefit since it would run shorter duration on systems with
a faster HZ clock.



Also, err, NAPI? Or is something else is going on here?


I don't really know, but part of my test was running traffic while creating
1200 stations, so likely there were lots of higher-level lock contention that
slowed down sending pkts up the stack.

I got a bunch of errors about IRQs being ignored because nobody cared.  I 
noticed
that the ath9k loop could handle up to 500 or so frames, and that seemed like 
too
many for my particular test case.

Once I put in this patch, I did not see the 'nobody cared' error again.

There could easily be a better fix.  If you all want me to use a fixed time 
instead
of HZ, then please suggest a value.  I was testing with HZ of 1000, btw.

Thanks,
Ben



johannes



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


Re: lockdep warning in mac80211/tx.c, 4.13.16+ kernel, ath9k related

2018-02-01 Thread Ben Greear

On 02/01/2018 02:47 PM, Johannes Berg wrote:

On Thu, 2018-02-01 at 23:40 +0100, Johannes Berg wrote:


The code does a plain rcu_dereference(), no locking other than
rcu_read_lock() involved.

On second thought though, I'm not convinced that your modifications
caused the problem.

Given your call stack, we'd expect rcu_read_lock() somewhere around
ath_tid_dequeue (or its caller(s)), since ieee80211_tx_dequeue clearly
requires it.

Normally, ieee80211_tx_dequeue() is called from various places that
probably come from mac80211 and already hold the rcu_read_lock(), e.g.
the wake_tx_queue op.

In this case, you're coming from drv_sta_state, so not sure why the
driver thinks it's OK to call the dequeue there.


Just to clarify - it could just be that in the "normal" case, when a
station dies, there's nothing on the queues - so the dequeue just
doesn't do anything and never goes into the code path where the
rcu_dereference() is, hence it might be a bug in mainline that just
never triggers in ordinary scenarios.


It looks like the code in ath9k has not been changed in that area for
some time.

Would adding rcu_read_lock in drv_sta_state() be a possibility?

Thanks,
Ben



johannes




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: lockdep warning in mac80211/tx.c, 4.13.16+ kernel, ath9k related

2018-02-01 Thread Ben Greear

On 02/01/2018 02:23 PM, Johannes Berg wrote:

On Wed, 2018-01-31 at 16:53 -0800, Ben Greear wrote:


Any idea why this might be complaining?



  30917 Jan 31 15:21:01 2u-6n kernel: =
  30918 Jan 31 15:21:01 2u-6n kernel: WARNING: suspicious RCU usage


well, that's why? :-)


All of RCU is suspicious as far as my feeble brain is concerned...I was 
wondering *why*
it was suspicious.  I see you answered below... :)




  30933   stack backtrace:
  30934 Jan 31 15:21:01 2u-6n kernel: CPU: 1 PID: 19628 Comm: ip Tainted: G 
   W  O4.13.16+ #2
  30935 Jan 31 15:21:01 2u-6n kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
  30936 Jan 31 15:21:01 2u-6n kernel: Call Trace:
  30937 Jan 31 15:21:01 2u-6n kernel:  dump_stack+0x85/0xc7
  30938 Jan 31 15:21:01 2u-6n kernel:  lockdep_rcu_suspicious+0xc5/0x100
  30939 Jan 31 15:21:01 2u-6n kernel:  ieee80211_tx_h_select_key+0x1c9/0x4e0 
[mac80211]
  30940 Jan 31 15:21:01 2u-6n kernel:  ieee80211_tx_dequeue+0x376/0xca0 
[mac80211]
  30941 Jan 31 15:21:01 2u-6n kernel:  ath_tid_dequeue+0x9c/0x110 [ath9k]
  30942 Jan 31 15:21:01 2u-6n kernel:  ath_tx_node_cleanup+0xa4/0x160 [ath9k]
  30943 Jan 31 15:21:01 2u-6n kernel:  ath9k_sta_state+0x6b/0x1e0 [ath9k]
  30944 Jan 31 15:21:01 2u-6n kernel:  drv_sta_state+0xad/0xa80 [mac80211]


I'd argue your hacks are at fault - somewhere in your modifications
around this call stack you lost the necessary rcu_read_lock()
protection.


Ok, I'll go looking for bugs like that.

Thanks!
--Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



lockdep warning in mac80211/tx.c, 4.13.16+ kernel, ath9k related

2018-01-31 Thread Ben Greear
17
 31021 Jan 31 15:21:01 2u-6n kernel: Call Trace:
 31022 Jan 31 15:21:01 2u-6n kernel:  dump_stack+0x85/0xc7
 31023 Jan 31 15:21:01 2u-6n kernel:  lockdep_rcu_suspicious+0xc5/0x100
 31024 Jan 31 15:21:01 2u-6n kernel:  ieee80211_tx_h_select_key+0x431/0x4e0 
[mac80211]
 31025 Jan 31 15:21:01 2u-6n kernel:  ieee80211_tx_dequeue+0x376/0xca0 
[mac80211]
 31026 Jan 31 15:21:01 2u-6n kernel:  ath_tid_dequeue+0x9c/0x110 [ath9k]
 31027 Jan 31 15:21:01 2u-6n kernel:  ath_tx_node_cleanup+0xa4/0x160 [ath9k]
 31028 Jan 31 15:21:01 2u-6n kernel:  ath9k_sta_state+0x6b/0x1e0 [ath9k]
 31029 Jan 31 15:21:01 2u-6n kernel:  drv_sta_state+0xad/0xa80 [mac80211]
 31030 Jan 31 15:21:01 2u-6n kernel:  __sta_info_destroy_part2+0x178/0x1d0 
[mac80211]
31031 Jan 31 15:21:01 2u-6n kernel:  __sta_info_flush+0xef/0x170 [mac80211]
 31032 Jan 31 15:21:01 2u-6n kernel:  ieee80211_set_disassoc+0xc6/0x400 
[mac80211]
 31033 Jan 31 15:21:01 2u-6n kernel:  ieee80211_mgd_deauth+0x2f6/0x830 
[mac80211]
 31034 Jan 31 15:21:01 2u-6n kernel:  ieee80211_deauth+0x13/0x20 [mac80211]
 31035 Jan 31 15:21:01 2u-6n kernel:  cfg80211_mlme_deauth+0x16b/0x3e0 
[cfg80211]
 31036 Jan 31 15:21:01 2u-6n kernel:  cfg80211_mlme_down+0x6d/0xa0 [cfg80211]
 31037 Jan 31 15:21:01 2u-6n kernel:  cfg80211_disconnect+0x2f4/0x3f0 [cfg80211]
 31038 Jan 31 15:21:01 2u-6n kernel:  ? kfree+0x24d/0x2b0
 31039 Jan 31 15:21:01 2u-6n kernel:  __cfg80211_leave+0x134/0x190 [cfg80211]
 31040 Jan 31 15:21:01 2u-6n kernel:  cfg80211_leave+0x28/0x40 [cfg80211]
 31041 Jan 31 15:21:01 2u-6n kernel:  cfg80211_netdev_notifier_call+0x49c/0x820 
[cfg80211]
 31042 Jan 31 15:21:01 2u-6n kernel:  ? lockdep_rtnl_is_held+0x15/0x20
 31043 Jan 31 15:21:01 2u-6n kernel:  ? addrconf_notify+0x6b/0xcc0 [ipv6]
 31044 Jan 31 15:21:01 2u-6n kernel:  ? packet_notifier+0xee/0x2a0
 31045 Jan 31 15:21:01 2u-6n kernel:  notifier_call_chain+0x45/0x70
 31046 Jan 31 15:21:01 2u-6n kernel:  raw_notifier_call_chain+0x11/0x20
 31047 Jan 31 15:21:01 2u-6n kernel:  call_netdevice_notifiers_info+0x30/0x60
 31048 Jan 31 15:21:01 2u-6n kernel:  __dev_close_many+0x54/0xe0
 31049 Jan 31 15:21:01 2u-6n kernel:  __dev_close+0x31/0x50
 31050 Jan 31 15:21:01 2u-6n kernel:  __dev_change_flags+0x98/0x160
 31051 Jan 31 15:21:01 2u-6n kernel:  dev_change_flags+0x24/0x60
 31052 Jan 31 15:21:01 2u-6n kernel:  do_setlink+0x367/0xcd0
 31053 Jan 31 15:21:01 2u-6n kernel:  ? mark_held_locks+0x6f/0xa0
 31054 Jan 31 15:21:01 2u-6n kernel:  ? get_page_from_freelist+0x1fd/0xc10
 31055 Jan 31 15:21:01 2u-6n kernel:  ? trace_hardirqs_on_caller+0x11f/0x190
 31056 Jan 31 15:21:01 2u-6n kernel:  ? nla_parse+0x36/0x150
 31057 Jan 31 15:21:01 2u-6n kernel:  rtnl_newlink+0x776/0x8e0
 31058 Jan 31 15:21:01 2u-6n kernel:  ? __wake_up+0x1e/0x50
 31059 Jan 31 15:21:01 2u-6n kernel:  ? ns_capable_common+0x75/0x90
 31060 Jan 31 15:21:01 2u-6n kernel:  ? ns_capable+0xe/0x10
 31061 Jan 31 15:21:01 2u-6n kernel:  rtnetlink_rcv_msg+0x85/0x1f0
 31062 Jan 31 15:21:01 2u-6n kernel:  ? lock_acquire+0xac/0x200
 31063 Jan 31 15:21:01 2u-6n kernel:  ? rtnl_newlink+0x8e0/0x8e0
 31064 Jan 31 15:21:01 2u-6n kernel:  netlink_rcv_skb+0xe2/0x110
 31065 Jan 31 15:21:01 2u-6n kernel:  rtnetlink_rcv+0x25/0x30
 31066 Jan 31 15:21:01 2u-6n kernel:  netlink_unicast+0x1cb/0x2e0
 31067 Jan 31 15:21:01 2u-6n kernel:  netlink_sendmsg+0x2c6/0x3a0
 31068 Jan 31 15:21:01 2u-6n kernel:  sock_sendmsg+0x33/0x40
 31069 Jan 31 15:21:01 2u-6n kernel:  ___sys_sendmsg+0x2f3/0x300
 31070 Jan 31 15:21:01 2u-6n kernel:  ? lock_acquire+0xac/0x200
 31071 Jan 31 15:21:01 2u-6n kernel:  ? __handle_mm_fault+0x5e8/0xfb0
 31072 Jan 31 15:21:01 2u-6n kernel:  ? getnstimeofday64+0x9/0x20
 31073 Jan 31 15:21:01 2u-6n kernel:  ? trace_hardirqs_on_caller+0x11f/0x190
 31074 Jan 31 15:21:01 2u-6n kernel:  __sys_sendmsg+0x40/0x70
 31075 Jan 31 15:21:01 2u-6n kernel:  ? __sys_sendmsg+0x40/0x70
 31076 Jan 31 15:21:01 2u-6n kernel:  SyS_sendmsg+0xd/0x20
 31077 Jan 31 15:21:01 2u-6n kernel:  do_syscall_64+0x64/0x140
 31078 Jan 31 15:21:01 2u-6n kernel:  entry_SYSCALL64_slow_path+0x25/0x25
 31079 Jan 31 15:21:01 2u-6n kernel: RIP: 0033:0x7f1e8363a150
 31080 Jan 31 15:21:01 2u-6n kernel: RSP: 002b:7ffee8bb8bf8 EFLAGS: 
0246 ORIG_RAX: 002e
 31081 Jan 31 15:21:01 2u-6n kernel: RAX: ffda RBX: 
5a724f5d RCX: 7f1e8363a150
 31082 Jan 31 15:21:01 2u-6n kernel: RDX:  RSI: 
7ffee8bb8c70 RDI: 0004
 31083 Jan 31 15:21:01 2u-6n kernel: RBP: 7ffee8bb8c70 R08: 
0001 R09: 
 31084 Jan 31 15:21:01 2u-6n kernel: R10: 05e7 R11: 
0246 R12: 7ffee8bb8cb0
 31085 Jan 31 15:21:01 2u-6n kernel: R13: 0066b460 R14: 
7ffee8bc0d20 R15: 


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k will not tx packets sometimes.

2018-01-30 Thread Ben Greear

On 01/30/2018 04:07 AM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


Maybe there is some way for the scheduler to get stuck and not
schedule anything?


It would appear so, yeah. Do you do anything special other than
associate a whole bunch of stations to trigger this? I can try to see if
I can script something that works equivalently on my setup.


Ok, and I'll give you my user-space software package to easily set up
this test case if you want to test with that.  Contact me off-list if
you want help setting this up.



I'm actually working on reworking that whole scheduler logic, and move
some of it into mac80211. Could you test this (WiP) patch and see if
that has the same problem?


It had some serious conflicts in ath10k, due to my local changes, so
I did not actually test this.

But, a revert of the atf patches (a6e56d749 and 63fefa050) appear to have 
resolved the
issue.  I'll test more with these reverted, and maybe will have time to
work more on actually fixing upstream code next time I move to a newer
kernel (and/or after your pending changes get in).

Thanks,
Ben


--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k will not tx packets sometimes.

2018-01-29 Thread Ben Greear

On 01/29/2018 02:35 PM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


On 01/29/2018 01:47 PM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


On 01/27/2018 05:11 AM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


I'm doing a test with 200 virtual stations on each of 6 ath9k radios.

When I configure stations for DHCP, I see cases where stations on a particular
radio will not transmit anything sometimes.  I see no 'XMIT' logs that show 
indication of
frames being received in the driver from the upper stack, but if I use 'tshark' 
on
a station interface, it shows frames being 'transmitted'.

I do, however, see this, which looks like it might show
an issue.  It looks like whatever 'aqm' is, it has an ever expanding number
of backlog packets:


The aqm is the intermediate queues in mac80211. So this indicates that
the driver is not pulling packets for transmission.

With that many stations, I wonder whether it is due to the airtime
fairness scheduler throttling the station? What is the contents of
debug/ieee80211/wiphy2/netdev\:sta30194/stations/00\:0e\:8e\:69\:b8\:f7/airtime
while the station is not transmitting? And is it all stations on that
particular radio, or only some of them?


Here is the output of airtime and aqm on a hung station:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/airtime
RX: 83706 us
TX: 4202 us
Deficit: VO: 198 us VI: 300 us BE: -8306 us BK: 300 us


Right. This looks like incoming traffic is depleting the airtime quantum
faster than it can be replenished by the scheduler, which means that the
station gets completely starved.

Could you try turning off the airtime scheduler?

echo 0 > /sys/kernel/debug/ieee80211/wiphy0/ath9k/airtime_flags

and see if the problem goes away.

If it does, please check if the problem persists when setting
airtime_flags to 1 (which means only include TX airtime).

-Toke



That did not seem to help:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10058/stations/00\:0e\:8e\:50\:74\:8a/node_aggr
Max-AMPDU: 65535
MPDU Density: 8


TID  SEQ_START  SEQ_NEXT  BAW_SIZE  BAW_HEAD  BAW_TAIL  BAR_IDX SCHED HAS-QUED
0  0 064 0 0   -1 11


Hmm, SCHED and HAS-QUED are both set, so it should be scheduled. Is the
scheduler maybe simply taking too long to get round to scheduling that
station again?

What happens if you don't kill things after 30 seconds? Is it hanging
forever, or just long enough for your tools to lose patience?

If you have 200 stations all requesting DHCP addresses I could see how
things might take a while...


I bring them up in groups of 30 or so.  I typically see 1-10 of them get
DHCP address, and then it seems that no data frames ever are tx'd again on
any interface on the radio...or at least tx is very rare.  Sometimes, all 200 
will come
up and pass traffic, but not reliably.  Once the system gets in this state,
down/up of the affected station interfaces does not fix it.  I have not tried
bouncing all of them at once yet.

I never even see dhcp discovers on the air when sniffing on another machine,
from any interface once it is hung, so it should not be a simple over-busy
network issue.

Maybe there is some way for the scheduler to get stuck and not schedule 
anything?

Thanks,
Ben



-Toke




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k will not tx packets sometimes.

2018-01-29 Thread Ben Greear

On 01/29/2018 01:47 PM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


On 01/27/2018 05:11 AM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


I'm doing a test with 200 virtual stations on each of 6 ath9k radios.

When I configure stations for DHCP, I see cases where stations on a particular
radio will not transmit anything sometimes.  I see no 'XMIT' logs that show 
indication of
frames being received in the driver from the upper stack, but if I use 'tshark' 
on
a station interface, it shows frames being 'transmitted'.

I do, however, see this, which looks like it might show
an issue.  It looks like whatever 'aqm' is, it has an ever expanding number
of backlog packets:


The aqm is the intermediate queues in mac80211. So this indicates that
the driver is not pulling packets for transmission.

With that many stations, I wonder whether it is due to the airtime
fairness scheduler throttling the station? What is the contents of
debug/ieee80211/wiphy2/netdev\:sta30194/stations/00\:0e\:8e\:69\:b8\:f7/airtime
while the station is not transmitting? And is it all stations on that
particular radio, or only some of them?


Here is the output of airtime and aqm on a hung station:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/airtime
RX: 83706 us
TX: 4202 us
Deficit: VO: 198 us VI: 300 us BE: -8306 us BK: 300 us


Right. This looks like incoming traffic is depleting the airtime quantum
faster than it can be replenished by the scheduler, which means that the
station gets completely starved.

Could you try turning off the airtime scheduler?

echo 0 > /sys/kernel/debug/ieee80211/wiphy0/ath9k/airtime_flags

and see if the problem goes away.

If it does, please check if the problem persists when setting
airtime_flags to 1 (which means only include TX airtime).

-Toke



That did not seem to help:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10058/stations/00\:0e\:8e\:50\:74\:8a/node_aggr
Max-AMPDU: 65535
MPDU Density: 8


TID  SEQ_START  SEQ_NEXT  BAW_SIZE  BAW_HEAD  BAW_TAIL  BAR_IDX SCHED HAS-QUED
  0  0 064 0 0   -1 11

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10058/stations/00\:0e\:8e\:50\:74\:8a/airtime
RX: 0 us
TX: 4682 us
Deficit: VO: 300 us VI: 300 us BE: 300 us BK: 300 us

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10058/stations/00\:0e\:8e\:50\:74\:8a/aqm
target 4us interval 29us ecn no
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions 
tx-bytes tx-packets flags
0 2 2406 11 3 0 0 0 7 0 0 0x6(RUN AMPDU NO-AMSDU)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
7 0 0 0 12 0 0 0 13 312 13 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)


Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k will not tx packets sometimes.

2018-01-29 Thread Ben Greear

On 01/27/2018 05:11 AM, Toke Høiland-Jørgensen wrote:

Ben Greear <gree...@candelatech.com> writes:


I'm doing a test with 200 virtual stations on each of 6 ath9k radios.

When I configure stations for DHCP, I see cases where stations on a particular
radio will not transmit anything sometimes.  I see no 'XMIT' logs that show 
indication of
frames being received in the driver from the upper stack, but if I use 'tshark' 
on
a station interface, it shows frames being 'transmitted'.

I do, however, see this, which looks like it might show
an issue.  It looks like whatever 'aqm' is, it has an ever expanding number
of backlog packets:


The aqm is the intermediate queues in mac80211. So this indicates that
the driver is not pulling packets for transmission.

With that many stations, I wonder whether it is due to the airtime
fairness scheduler throttling the station? What is the contents of
debug/ieee80211/wiphy2/netdev\:sta30194/stations/00\:0e\:8e\:69\:b8\:f7/airtime
while the station is not transmitting? And is it all stations on that
particular radio, or only some of them?


Here is the output of airtime and aqm on a hung station:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/airtime
RX: 83706 us
TX: 4202 us
Deficit: VO: 198 us VI: 300 us BE: -8306 us BK: 300 us

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/aqm
target 4us interval 29us ecn no
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions 
tx-bytes tx-packets flags
0 2 3476 14 6 0 0 0 8 326 3 0x6(RUN AMPDU NO-AMSDU)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
7 0 0 0 4 0 0 0 6 168 7 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)


My tool will effectively down/up the interface after 30 seconds if DHCP has not
been acquired...here is another set of debug after that has happened:

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/airtime
RX: 0 us
TX: 3946 us
Deficit: VO: 254 us VI: 300 us BE: 300 us BK: 300 us

# cat 
/debug/ieee80211/wiphy0/netdev\:sta10057/stations/00\:0e\:8e\:50\:74\:8a/aqm
target 4us interval 29us ecn no
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions 
tx-bytes tx-packets flags
0 2 1376 6 2 0 0 0 4 0 0 0x6(RUN AMPDU NO-AMSDU)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
7 0 0 0 12 0 0 0 13 312 13 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)

I was able to reproduce this on a system by configuring "only" 200 stations on
a single ath9k radio, so probably the dedicated individual could reproduce
this on their own system as well.  Stock kernels are less optimized for this, 
but at least
for ath9k, it should function with multiple virtual station devices...

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] ath9k: Print has_queued in debugfs.

2018-01-27 Thread Ben Greear



On 01/27/2018 05:29 AM, Toke Høiland-Jørgensen wrote:

gree...@candelatech.com writes:


From: Ben Greear <gree...@candelatech.com>

The PAUSED field was never printed per tid.  Replace that
with has_queued, which might help someone track down strange
bugs related to aqm.

And, make tx-queue debug info show peer BSSID as well as vdev
MAC to aid debugging with multiple stations connected to the
same peer.

Signed-off-by: Ben Greear <gree...@candelatech.com>
---
 drivers/net/wireless/ath/ath9k/debug_sta.c | 10 ++
 drivers/net/wireless/ath/ath9k/xmit.c  |  5 +++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/debug_sta.c 
b/drivers/net/wireless/ath/ath9k/debug_sta.c
index efc692e..a45f1f5 100644
--- a/drivers/net/wireless/ath/ath9k/debug_sta.c
+++ b/drivers/net/wireless/ath/ath9k/debug_sta.c
@@ -48,9 +48,10 @@ static ssize_t read_file_node_aggr(struct file *file, char 
__user *user_buf,
 an->mpdudensity);

len += scnprintf(buf + len, size - len,
-"\n%3s%11s%10s%10s%10s%10s%9s%6s%8s\n",
+"\n%3s%11s%10s%10s%10s%10s%9s%6s%9s\n",
 "TID", "SEQ_START", "SEQ_NEXT", "BAW_SIZE",
-"BAW_HEAD", "BAW_TAIL", "BAR_IDX", "SCHED", "PAUSED");
+"BAW_HEAD", "BAW_TAIL", "BAR_IDX", "SCHED",
+"HAS-QUED");

for (tidno = 0; tidno < IEEE80211_NUM_TIDS; tidno++) {
tid = ath_node_to_tid(an, tidno);
@@ -58,7 +59,7 @@ static ssize_t read_file_node_aggr(struct file *file, char 
__user *user_buf,
ath_txq_lock(sc, txq);
if (tid->active) {
len += scnprintf(buf + len, size - len,
-"%3d%11d%10d%10d%10d%10d%9d%6d\n",
+"%3d%11d%10d%10d%10d%10d%9d%6d%9d\n",
 tid->tidno,
 tid->seq_start,
 tid->seq_next,
@@ -66,7 +67,8 @@ static ssize_t read_file_node_aggr(struct file *file, char 
__user *user_buf,
 tid->baw_head,
 tid->baw_tail,
 tid->bar_index,
-!list_empty(>list));
+!list_empty(>list),
+tid->has_queued);


Would it perhaps be useful to print the length of tid->retry_q instead
of / in addition to has_queued? Planning to get rid of the has_queued
variable entirely...


I'm just getting started looking at this.  4.9 kernel works fine, so it seems
to be a regression in the aqm stuff since that is the big change between 4.9
and 4.13.

From a brief look yesterday, it seems that has_queued is not adequately 
protected
by locks, but also, it was '1', which indicates it wants to send pkts (?), so I 
guess
it was not hitting a locking race.

Can you point me at the scheduler code (method name?).  I'll be looking at that
as soon as I get time.

Thanks,
Ben



-Toke



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


ath9k will not tx packets sometimes.

2018-01-26 Thread Ben Greear

I'm doing a test with 200 virtual stations on each of 6 ath9k radios.

When I configure stations for DHCP, I see cases where stations on a particular
radio will not transmit anything sometimes.  I see no 'XMIT' logs that show 
indication of
frames being received in the driver from the upper stack, but if I use 'tshark' 
on
a station interface, it shows frames being 'transmitted'.

I do, however, see this, which looks like it might show
an issue.  It looks like whatever 'aqm' is, it has an ever expanding number
of backlog packets:

[root@2u-6n lanforge]# cat 
/debug/ieee80211/wiphy2/netdev\:sta30194/stations/00\:0e\:8e\:69\:b8\:f7/aqm
target 4us interval 29us ecn no
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions 
tx-bytes tx-packets
0 2 27616 440 9 0 0 0 428 998 7
1 3 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0
3 2 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0
6 0 0 0 1 0 0 0 0 390 1
7 0 0 0 637 0 0 0 792 22072 903
8 2 0 0 0 0 0 0 0 0 0
9 3 0 0 0 0 0 0 0 0 0
10 3 0 0 0 0 0 0 0 0 0
11 2 0 0 0 0 0 0 0 0 0
12 1 0 0 0 0 0 0 0 0 0
13 1 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0


Anyone have any pointers as to whether this might be a real issue or not?  I'll 
go
digging in the meantime

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



Re: ath9k, irq 16: nobody cared, 4.13.16+ kernel

2018-01-26 Thread Ben Greear

On 01/26/2018 10:04 AM, Ben Greear wrote:

Hello,

I'm testing on a heavily loaded system.  It has 6 ath9k radios, each of which 
have
200 stations on them.  When I try to start some slow-speed UDP traffic on each
virtual station, I see this splat below.  Is this just expected on a heavily
loaded system?  Or, any ideas how to make it better (I'll investigate this 
irqpoll
thing it suggests...)


Maybe a related question...  The ath9k driver has a napi budget of 512.

I added code and verified that at least sometimes it would take more than
5 jiffies in this handler.

Maybe it is a good idea to have a jiffies timeout breakout as well as a budget
to stop the while loop and exit this handler?

If so, is 5 jiffies a good number do you think?

And, from looking at the IRQ code, it seems that it will re-enable the IRQ
later even if I see the splat below, so the system should recover OK?

Thanks,
Ben




[ 2833.533351] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 2833.539111] CPU: 4 PID: 0 Comm: swapper/4 Tainted: GW  O4.13.16+ 
#36
[ 2833.539112] Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 
2.0b 05/02/2017
[ 2833.539113] Call Trace:
[ 2833.539114]  
[ 2833.539119]  dump_stack+0x63/0x87
[ 2833.539122]  __report_bad_irq+0x2e/0xc0
[ 2833.539123]  note_interrupt+0x227/0x270
[ 2833.539124]  handle_irq_event_percpu+0x40/0x50
[ 2833.539126]  handle_irq_event+0x34/0x60
[ 2833.539127]  handle_fasteoi_irq+0x78/0x140
[ 2833.539128]  handle_irq+0x18/0x30
[ 2833.539129]  do_IRQ+0x41/0xc0
[ 2833.539130]  common_interrupt+0x89/0x89
[ 2833.539133] RIP: 0010:fib_rules_lookup+0x3e/0x1a0
[ 2833.539134] RSP: 0018:88087fd03828 EFLAGS: 0283 ORIG_RAX: 
ffac
[ 2833.539135] RAX: 0353 RBX: 8807eb10c300 RCX: 88087fd03868
[ 2833.539135] RDX:  RSI: 88087fd038e0 RDI: 88085ab55b40
[ 2833.539136] RBP: 88087fd03858 R08:  R09: 0001
[ 2833.539136] R10: 81efd500 R11: 88081ae6ac00 R12: 88085ab55b40
[ 2833.539137] R13: 88087fd03868 R14: 88085ab55bc0 R15: 88087fd038e0
[ 2833.539139]  ? fib_rules_lookup+0x2a/0x1a0
[ 2833.539141]  __fib_lookup+0x4f/0x80
[ 2833.539143]  ip_route_input_rcu+0xa0d/0xc20
[ 2833.539144]  ip_route_input_noref+0x3e/0x60
[ 2833.539147]  ? ioapic_ir_ack_level+0x29/0x30
[ 2833.539147]  ? handle_fasteoi_irq+0xff/0x140
[ 2833.539149]  arp_process+0x486/0x820
[ 2833.539150]  ? do_IRQ+0x4a/0xc0
[ 2833.539151]  ? common_interrupt+0x89/0x89
[ 2833.539152]  arp_rcv+0x112/0x1b0
[ 2833.539155]  __netif_receive_skb_core+0x689/0xa90
[ 2833.539156]  __netif_receive_skb+0x13/0x60
[ 2833.539156]  ? __netif_receive_skb+0x13/0x60
[ 2833.539158]  netif_receive_skb_internal+0x1c6/0x440
[ 2833.539159]  netif_receive_skb+0x17/0x80
[ 2833.539178]  ieee80211_deliver_skb+0x165/0x1d0 [mac80211]
[ 2833.539187]  ieee80211_rx_handlers+0xe50/0x26f0 [mac80211]
[ 2833.539188]  ? skb_copy_bits+0x5f/0x260
[ 2833.539189]  ? copy_skb_header+0x12/0x90
[ 2833.539197]  ieee80211_prepare_and_rx_handle+0x638/0x1130 [mac80211]
[ 2833.539198]  ? ioapic_ir_ack_level+0x29/0x30
[ 2833.539206]  ieee80211_rx_napi+0x87d/0xa70 [mac80211]
[ 2833.539210]  ath_rx_tasklet+0xa99/0x11a0 [ath9k]
[ 2833.539213]  ath9k_tasklet+0x113/0x2e0 [ath9k]
[ 2833.539215]  tasklet_action+0x10c/0x120
[ 2833.539216]  __do_softirq+0xc1/0x2a0
[ 2833.539217]  irq_exit+0x9b/0xa0
[ 2833.539218]  do_IRQ+0x4a/0xc0
[ 2833.539219]  common_interrupt+0x89/0x89
[ 2833.539221] RIP: 0010:cpuidle_enter_state+0xf8/0x2a0
[ 2833.539221] RSP: 0018:c900031fbe78 EFLAGS: 0246 ORIG_RAX: 
ffac
[ 2833.539222] RAX: 88087fd1c000 RBX: 88087fd25f00 RCX: 001f
[ 2833.539223] RDX:  RSI: e691b77ea649 RDI: 
[ 2833.539223] RBP: c900031fbeb0 R08: c442b282beb1 R09: 02eb
[ 2833.539224] R10: 035c R11: 88087fd18ee4 R12: 0004
[ 2833.539224] R13: 0004 R14: 0004 R15: 0293855eb8ba
[ 2833.539225]  
[ 2833.539226]  ? cpuidle_enter_state+0xd4/0x2a0
[ 2833.539227]  cpuidle_enter+0x12/0x20
[ 2833.539229]  call_cpuidle+0x1e/0x40
[ 2833.539229]  do_idle+0x17f/0x1d0
[ 2833.539230]  cpu_startup_entry+0x5f/0x70
[ 2833.539231]  start_secondary+0x14b/0x180
[ 2833.539233]  secondary_startup_64+0x9f/0x9f
[ 2833.539233] handlers:
[ 2833.540606] [] ath_isr [ath9k]
[ 2833.544278] Disabling IRQ #16

Thanks,
Ben




--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



  1   2   3   4   5   6   >