Re: [PATCH] wifi: mt76: mt7603: add debugfs attr for disabling frames buffering

2024-05-25 Thread Rafał Miłecki

On 26.03.2024 00:11, Rafał Miłecki wrote:

I'll provide some iperf logs from my ThinkPad with 8086:3166 Intel
Wireless-AC 3165 working as 2 GHz client of MT7603EN on Netgear R6220.


I run seven 1-hour iperf sessions overnight using ThinkPad + Xiaomi
Mi Router 4C (MT7628AN Wi-Fi SoC) with buffering DISABLED.

Session 1 (1 hour) with 51.7 Mbps avg:
All GOOD

Session 2 (1 hour) with 40.5 Mbps avg:
* 2 seconds traffic stall (STA diconnected with 34=DISASSOC_LOW_ACK)
* 718 seconds traffic stall (nothing in STA logs)

Session 3 (1 hour) with 51.6 Mbps avg:
* 5 seconds traffic stall (STA diconnected with 34=DISASSOC_LOW_ACK)

Session 4 (1 hour) with 51.8 Mbps avg:
All GOOD

Session 5 (1 hour) with 51.8 Mbps avg:
All GOOD

Session 6 (1 hour) with 51.8 Mbps avg:
All GOOD

Session 7 (1 hour) with 51.1 Mbps avg:
All GOOD

I find MT7628AN without frames buffering pretty usable.



Now, with frames buffering enabled, things look quite worse. There are
multiple traffic stalls and often TCP session dies. See below.

pon, 25 mar 2024, 23:23:51 CET

Client connecting to 192.168.27.1, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 192.168.1.181 port 58838 connected with 192.168.27.1 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0- 1.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3]  1.0- 2.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3]  2.0- 3.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3]  3.0- 4.0 sec  5.25 MBytes  44.0 Mbits/sec
[  3]  4.0- 5.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3]  5.0- 6.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3]  6.0- 7.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3]  7.0- 8.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3]  8.0- 9.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3]  9.0-10.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 10.0-11.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 11.0-12.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 12.0-13.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 13.0-14.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3] 14.0-15.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 15.0-16.0 sec  5.38 MBytes  45.1 Mbits/sec
[  3] 16.0-17.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 17.0-18.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 18.0-19.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 19.0-20.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 20.0-21.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 21.0-22.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 22.0-23.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 23.0-24.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 24.0-25.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 25.0-26.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 26.0-27.0 sec  3.62 MBytes  30.4 Mbits/sec
[  3] 27.0-28.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 28.0-29.0 sec  2.62 MBytes  22.0 Mbits/sec
[  3] 29.0-30.0 sec  1.75 MBytes  14.7 Mbits/sec
[  3] 30.0-31.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 31.0-32.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 32.0-33.0 sec  3.50 MBytes  29.4 Mbits/sec
[  3] 33.0-34.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 34.0-35.0 sec  5.62 MBytes  47.2 Mbits/sec
[  3] 35.0-36.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 36.0-37.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 37.0-38.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 38.0-39.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 39.0-40.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 40.0-41.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 41.0-42.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 42.0-43.0 sec  3.88 MBytes  32.5 Mbits/sec
[  3] 43.0-44.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 44.0-45.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 45.0-46.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 46.0-47.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 47.0-48.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 48.0-49.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 49.0-50.0 sec  6.25 MBytes  52.4 Mbits/sec
[  3] 50.0-51.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 51.0-52.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 52.0-53.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 53.0-54.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 54.0-55.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 55.0-56.0 sec  3.75 MBytes  31.5 Mbits/sec
[  3] 56.0-57.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 57.0-58.0 sec  2.38 MBytes  19.9 Mbits/sec
[  3] 58.0-59.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 59.0-60.0 sec  2.12 MBytes  17.8 Mbits/sec
[  3] 60.0-61.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 61.0-62.0 sec  2.12 MBytes  17.8 Mbits/sec
[  3] 62.0-63.0 sec  2.12 MBytes  17.8 Mbits/sec
[  3] 63.0-64.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3] 64.0-65.0 sec  2.25 MBytes  18.9 Mbits/sec
[  3] 65.0-66.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3] 66.0-67.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 67.0-68.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 68.0-69.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 69.0-70.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 70.0-71.0 sec  3.75 MBytes  31.5 Mbits/sec
[  3] 71.0-72.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 72.0-73.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 73.0-74.0 sec  0.00 Bytes  0.00 bits/sec
[  

[PATCH] wifi: mt76: mt7603: add debugfs attr for disabling frames buffering

2024-05-25 Thread Rafał Miłecki
From: Rafał Miłecki 

MT7603EN and MT7628AN were reported by multiple users to be unstable
under high traffic. Transmission of packets could stop for seconds often
leading to disconnections.

Long research & debugging revelaed a close relation between MCU
interrupts of type PKT_TYPE_TXS and slowdowns / stalls. That led to
questioning frames buffering feature.

It turns out that disabling SKBs loopback code makes mt7603 devices much
more stable under load. There are still some traffic hiccups but those
happen like once every an hour and end up in recovery in most cases.

Add a debugfs option for disabling frames buffering so users can give it
a try. If this solution yields a success we can make this feature
disabled by default.

This change was successfully tested using 2 GHz AP interface on:
1. Netgear R6220 - MT7621ST (SoC) + MT7603EN (WiFi) + MT7612EN (WiFi)
2. Xiaomi Mi Router 4C - MT7628AN (Wi-Fi SoC)

Link: 
https://lore.kernel.org/linux-wireless/7c96d5ee-86c1-8068-1b58-40db6087a...@gmail.com/
Closes: https://github.com/openwrt/mt76/issues/865
Fixes: c8846e101502 ("mt76: add driver for MT7603E and MT7628/7688")
Signed-off-by: Rafał Miłecki 
---
 drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c | 2 ++
 drivers/net/wireless/mediatek/mt76/mt7603/dma.c | 3 +++
 drivers/net/wireless/mediatek/mt76/mt7603/init.c| 1 +
 drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h  | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c 
b/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
index 3967f2f05774..c80baba7a402 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
@@ -115,4 +115,6 @@ void mt7603_init_debugfs(struct mt7603_dev *dev)
>sensitivity_limit);
debugfs_create_bool("dynamic_sensitivity", 0600, dir,
>dynamic_sensitivity);
+   debugfs_create_bool("frames_buffering", 0600, dir,
+   >frames_buffering);
 }
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/dma.c 
b/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
index 7a2f5d38562b..f5ab729dec31 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
@@ -27,6 +27,9 @@ mt7603_rx_loopback_skb(struct mt7603_dev *dev, struct sk_buff 
*skb)
u32 val;
u8 tid = 0;
 
+   if (!dev->frames_buffering)
+   goto free;
+
if (skb->len < MT_TXD_SIZE + sizeof(struct ieee80211_hdr))
goto free;
 
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/init.c 
b/drivers/net/wireless/mediatek/mt76/mt7603/init.c
index 6c55c72f28a2..5abc2618ec8b 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/init.c
@@ -515,6 +515,7 @@ int mt7603_register_device(struct mt7603_dev *dev)
dev->slottime = 9;
dev->sensitivity_limit = 28;
dev->dynamic_sensitivity = true;
+   dev->frames_buffering = true;
 
ret = mt7603_init_hardware(dev);
if (ret)
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h 
b/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
index 9e58df7042ad..02c88334cdc0 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
@@ -155,6 +155,8 @@ struct mt7603_dev {
u32 reset_test;
 
unsigned int reset_cause[__RESET_CAUSE_MAX];
+
+   bool frames_buffering;
 };
 
 extern const struct mt76_driver_ops mt7603_drv_ops;
-- 
2.35.3


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


[PATCH] dt-bindings: leds: add FUNCTION defines for per-band WLANs

2024-05-25 Thread Rafał Miłecki
From: Rafał Miłecki 

Most wireless routers and access points can operate in multiple bands
simultaneously. Vendors often equip their devices with per-band LEDs.

Add defines for those very common functions to allow cleaner & clearer
bindings.

Signed-off-by: Rafał Miłecki 
---
 include/dt-bindings/leds/common.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/dt-bindings/leds/common.h 
b/include/dt-bindings/leds/common.h
index 9a0d33d027ff..c56785bb9c9c 100644
--- a/include/dt-bindings/leds/common.h
+++ b/include/dt-bindings/leds/common.h
@@ -101,6 +101,9 @@
 #define LED_FUNCTION_USB "usb"
 #define LED_FUNCTION_WAN "wan"
 #define LED_FUNCTION_WLAN "wlan"
+#define LED_FUNCTION_WLAN_2GHZ "wlan-2ghz"
+#define LED_FUNCTION_WLAN_5GHZ "wlan-5ghz"
+#define LED_FUNCTION_WLAN_6GHZ "wlan-6ghz"
 #define LED_FUNCTION_WPS "wps"
 
 #endif /* __DT_BINDINGS_LEDS_H */
-- 
2.35.3


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH] wifi: mt76: mt7603: add debugfs attr for disabling frames buffering

2024-05-25 Thread Rafał Miłecki

On 2.04.2024 18:54, Shengyu Qu wrote:

> Maybe we could disable frames buffering by default until it is fixed?

Please check commit description:

"If this solution yields a success we can make this feature disabled by 
default."


> Also, maybe we could do more tests on newer models such as mt7986/81 to
> make this patch benefit more models?

Can you point me to the mt7915 driver buffering code? I clearly missed that.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Rafał Miłecki

On 9.01.2024 20:10, Krzysztof Kozlowski wrote:

On 09/01/2024 17:38, Rafał Miłecki wrote:

On 9.01.2024 10:02, Krzysztof Kozlowski wrote:

On 09/01/2024 09:23, Rafał Miłecki wrote:

From: Rafał Miłecki 

OpenWrt project provides downstream support for thousands of embedded
home network devices. Its custom requirement for DT is to provide info
about LEDs roles. Currently it does it by using custom non-documented
aliases. While formally valid (aliases.yaml doesn't limit names or
purposes of aliases) it's quite a loose solution.

Document 4 precise "chosen" biding properties with clearly documented
OpenWrt usage. This will allow upstreaming tons of DTS files that noone


typo: none


typo: no one ;)


cared about so far as those would need to be patched downstream anyway.


  From all this description I understand why you want to add it, but I
don't understand what are you exactly adding and how it is being used by
the users/OS.


In OpenWrt we have user space script that reads LED full path:
cat /proc/device-tree/aliases/led-

And then sets LED to state matching current boot stage:
echo 1 > /sys/class/leds//brightness


OK, it's nowhere close to a hardware property. You now instruct OS what
to do. It's software or software policy.


That's the reason I targeted "chosen" node which seemed the best option
given it does not describe real hardware.



Signed-off-by: Rafał Miłecki 
---
A few weeks ago I was seeking for a help regarding OpenWrt's need for
specifing LEDs roles in DT, see:

Describing LEDs roles in device tree?
https://lore.kernel.org/linux-devicetree/ee912a89-4fd7-43c3-a79b-16659a035...@gmail.com/T/#u

I DON'T think OpenWrt's current solution with aliases is good enough:
* It's not clearly documented
* It may vary from other projects usa case
* It may be refused by random maintainers I think

I decided to suggest 4 OpenWrt-prefixed properties for "chosen". I'm
hoping this small custom binding is sth we could go with. I'm really
looking forward to upstreaming OpenWrt's downstream DTS files so other
projects (e.g. Buildroot) can use them.

If you have any better fitting solution in mind please let me know. I
should be fine with anything that lets me solve this downstream mess
situation.

   dtschema/schemas/chosen.yaml | 9 +
   1 file changed, 9 insertions(+)

diff --git a/dtschema/schemas/chosen.yaml b/dtschema/schemas/chosen.yaml
index 6d5c3f1..96d0db7 100644
--- a/dtschema/schemas/chosen.yaml
+++ b/dtschema/schemas/chosen.yaml
@@ -264,4 +264,13 @@ properties:
   patternProperties:
 "^framebuffer": true
   
+  "^openwrt,led-(boot|failsafe|running|upgrade)$":

+$ref: types.yaml#/definitions/string
+description:
+  OpenWrt choice of LED for a given role.


Neither this explains it. What is "OpenWrt choice"? Choice like what?
What are the valid choices?


Value must be a full path (encoded
+  as a string) to a relevant LED node.


What do you mean by full path? DT node path? Then no, use phandles.

Anyway, we have existing properties for this - LED functions. Just
choose LED with given function (from sysfs) and you are done. If
function is missing in the header: feel free to go, least for me.


In "Describing LEDs roles in device tree?" e-mail I wrote:

"
Please note that "function" on its own is not sufficient as multiple
LEDs my share the same function name but its meaning may vary depending
on color.
"

Let me elaborate here.

Values of "function" property match LEDs descriptions (usually it's a
physical label). That is OK and makes sense but doesn't allow OpenWrt to
automatically pick proper LED to use during given boot stage.

Some devices may have multiple color of a the same LED function. OpenWrt
developer needs to choose which color to use for failsafe more and which
boot fully booted state (and others too).

Devices also differ in available LEDs by their functions. Some may have
LED_FUNCTION_POWER that OpenWrt needs to use. Some may have
LED_FUNCTION_STATUS. Or both. There are some more (less common)
functions like LED_FUNCTION_ACTIVITY.

We failed at OpenWrt to develop a single generic script to rule all
devices and all their LEDs combinations. We ended up with developers
*choosing* what LED to use for a given system state.


So this all is because you want to write easier OS. That's abuse of
Devicetree. You can define which LEDs have different meaning, e.g.
physical label or icon attached to them. That's a hardware property.

You can also define how pieces of hardware are wired together and create
entire system, e.g. connect one LED to disk activity.

However what you are proposing here is to dynamically configure one,
given OS. I don't think it is suitable.

The problem of OS to nicely figure out which LED to blink when, is not a
problem of Devicetree. It is a problem of OS and its configuration.


I'd say it's a thin line. Or just a grey idea as Geert said.

What is a LED "function" after all? How exactly are:
LED_FUNCTION_STATUS
LED_FUNCTION_ACTIVITY

Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Rafał Miłecki

On 9.01.2024 10:02, Krzysztof Kozlowski wrote:

On 09/01/2024 09:23, Rafał Miłecki wrote:

From: Rafał Miłecki 

OpenWrt project provides downstream support for thousands of embedded
home network devices. Its custom requirement for DT is to provide info
about LEDs roles. Currently it does it by using custom non-documented
aliases. While formally valid (aliases.yaml doesn't limit names or
purposes of aliases) it's quite a loose solution.

Document 4 precise "chosen" biding properties with clearly documented
OpenWrt usage. This will allow upstreaming tons of DTS files that noone


typo: none


typo: no one ;)


cared about so far as those would need to be patched downstream anyway.


 From all this description I understand why you want to add it, but I
don't understand what are you exactly adding and how it is being used by
the users/OS.


In OpenWrt we have user space script that reads LED full path:
cat /proc/device-tree/aliases/led-

And then sets LED to state matching current boot stage:
echo 1 > /sys/class/leds//brightness



Signed-off-by: Rafał Miłecki 
---
A few weeks ago I was seeking for a help regarding OpenWrt's need for
specifing LEDs roles in DT, see:

Describing LEDs roles in device tree?
https://lore.kernel.org/linux-devicetree/ee912a89-4fd7-43c3-a79b-16659a035...@gmail.com/T/#u

I DON'T think OpenWrt's current solution with aliases is good enough:
* It's not clearly documented
* It may vary from other projects usa case
* It may be refused by random maintainers I think

I decided to suggest 4 OpenWrt-prefixed properties for "chosen". I'm
hoping this small custom binding is sth we could go with. I'm really
looking forward to upstreaming OpenWrt's downstream DTS files so other
projects (e.g. Buildroot) can use them.

If you have any better fitting solution in mind please let me know. I
should be fine with anything that lets me solve this downstream mess
situation.

  dtschema/schemas/chosen.yaml | 9 +
  1 file changed, 9 insertions(+)

diff --git a/dtschema/schemas/chosen.yaml b/dtschema/schemas/chosen.yaml
index 6d5c3f1..96d0db7 100644
--- a/dtschema/schemas/chosen.yaml
+++ b/dtschema/schemas/chosen.yaml
@@ -264,4 +264,13 @@ properties:
  patternProperties:
"^framebuffer": true
  
+  "^openwrt,led-(boot|failsafe|running|upgrade)$":

+$ref: types.yaml#/definitions/string
+description:
+  OpenWrt choice of LED for a given role.


Neither this explains it. What is "OpenWrt choice"? Choice like what?
What are the valid choices?


Value must be a full path (encoded
+  as a string) to a relevant LED node.


What do you mean by full path? DT node path? Then no, use phandles.

Anyway, we have existing properties for this - LED functions. Just
choose LED with given function (from sysfs) and you are done. If
function is missing in the header: feel free to go, least for me.


In "Describing LEDs roles in device tree?" e-mail I wrote:

"
Please note that "function" on its own is not sufficient as multiple
LEDs my share the same function name but its meaning may vary depending
on color.
"

Let me elaborate here.

Values of "function" property match LEDs descriptions (usually it's a
physical label). That is OK and makes sense but doesn't allow OpenWrt to
automatically pick proper LED to use during given boot stage.

Some devices may have multiple color of a the same LED function. OpenWrt
developer needs to choose which color to use for failsafe more and which
boot fully booted state (and others too).

Devices also differ in available LEDs by their functions. Some may have
LED_FUNCTION_POWER that OpenWrt needs to use. Some may have
LED_FUNCTION_STATUS. Or both. There are some more (less common)
functions like LED_FUNCTION_ACTIVITY.

We failed at OpenWrt to develop a single generic script to rule all
devices and all their LEDs combinations. We ended up with developers
*choosing* what LED to use for a given system state.



Also: please Cc LED maintainers on all future submissions of this.

Included them (apart from linux-leds@ already present) now, thanks.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes

2024-05-25 Thread Rafał Miłecki

I made a second attempt on debugging some longstanding stability issues
affecting BCM53753 SoCs. Those are single CPU core ARM Cortex-A7 boards
with a pretty slow arch timer running at 36,8 kHz.

After 0 to 20 minutes of close to zero activity I experience hangs and I
need to wait a minute for watchdog to kick in and reboot device.

First debugging attempt:
https://lore.kernel.org/netdev/0f9d0cd6-d344-7915-7bc1-7a090b830...@gmail.com/T/ 
("ARM board lockups/hangs triggered by locks and mutexes")

After a lot of bisecting, testing & hacking I believe there are 3 types
of kernel aspects that affect BCM53573 stability. I'd like to describe
them below to document my debugging work. I'm clueless at this point.
Maybe someone can come up with an idea of actual issue & ideally a
solution.

#

1. Locking

During my first bisecting attempts I found multiple locking-related
commit that regressed stability.

Bisected commits:

131287ff833d ("once: add DO_ONCE_SLOW() for sleepable contexts").

and a following group:

d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} 
definitions")
0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL 
code")
9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into 
the  header")
04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount 
values")

I don't believe there is actually anything wrong about above changes.
Maybe it's some tiny timing thing that my board just doesn't like?

#

2. Clock (arm,armv7-timer)

While comparing main clock in Broadcom's SDK with upstream one I noticed
a tiny difference: mask value. I don't know it it makes any sense but
switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
arm_arch_timer.c (to match SDK) increases average uptime (time before a
hang/lockup happens) from 4 minutes to 36 minutes.

#

3. Random code changes

During my bisecting attempts I found one commit that regressed kernel
stability but actual changes were meaningless in context of locking. It
was commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

I thought that maybe it was all about making add_mtd_device() bigger and
changing addresses of a lot of symbols (looking at System.map). So I
reverted that mtd commit and developed a dummy change relocating as few
symbols (System.map) as possible while still breaking stability:

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,6 +94,21 @@ void __cpuidle default_idle_call(void)
arch_cpu_idle();
start_critical_timings();
}
+
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
 }

 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,

Above dummy change didn't relocate thousands of symbols but only about
20 of them. They happened to be lock symbols however. Does it make any
sense for above diff to regress kernel stability for me and cause
hangs/lockups?

--- System.map.good
+++ System.map.bad
@@ -22214,36 +22214,36 @@
 c062e7e0 T __cpuidle_text_start
 c062e7e0 t cpu_idle_poll
 c062e860 T default_idle_call
-c062e884 T __cpuidle_text_end
-c062e888 T __lock_text_start
-c062e8a0 T _raw_spin_unlock_irqrestore
-c062e8c0 T _raw_spin_trylock
-c062e900 T _raw_write_unlock_irqrestore
-c062e920 T _raw_read_trylock
-c062e960 T _raw_write_trylock
-c062e9a0 T _raw_spin_lock_bh
-c062ea00 T _raw_read_lock_bh
-c062ea40 T _raw_write_lock_bh
-c062ea80 T _raw_spin_trylock_bh
-c062eb00 T _raw_spin_unlock_bh
-c062eb40 T _raw_write_unlock_bh
-c062eb80 T _raw_read_unlock_bh
-c062ebc0 T _raw_read_unlock_irqrestore
-c062ec00 T _raw_write_lock
-c062ec40 T _raw_write_lock_irq
-c062ec80 T _raw_write_lock_irqsave
-c062ecc0 T _raw_read_lock
-c062ed00 T _raw_spin_lock
-c062ed40 T _raw_read_lock_irq
-c062ed80 T _raw_spin_lock_irq
-c062ede0 T _raw_spin_lock_irqsave
-c062ee40 T _raw_read_lock_irqsave
-c062ee70 T __hyp_text_end
-c062ee70 T __hyp_text_start
-c062ee70 T __kprobes_text_end
-c062ee70 T __kprobes_text_start
-c062ee70 T __lock_text_end
-c062ee70 T _etext
+c062e954 T 

Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 18.08.2023 22:23, Rafał Miłecki wrote:

On 14.08.2023 11:04, Geert Uytterhoeven wrote:

Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


I had similar experiences on other ARM platforms many years ago:
bisection lead to something completely bogus, and it turned out
adding a single line of innocent code made the system lock-up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change changing alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...


I already experiented some weird performance variations on Broadcom's
Northstar platform that was related to symbols layout & cache hit/miss
ratio. For that reason I use -falign-functions=32 for that whole
OpenWrt's "bcm53xx" target (it covers Northstar and BCM53573). So
this aspect should be ruled out already in my case.


Relevant OpenWrt commit with some description and links: b54ef39e0b91 ("bcm53xx: use 
-falign-functions=32 for kernel compilation"):

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=b54ef39e0b910a4b8eaca0497fe9b63e8392262a

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 14.08.2023 11:04, Geert Uytterhoeven wrote:

Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


I had similar experiences on other ARM platforms many years ago:
bisection lead to something completely bogus, and it turned out
adding a single line of innocent code made the system lock-up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change changing alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...


I already experiented some weird performance variations on Broadcom's
Northstar platform that was related to symbols layout & cache hit/miss
ratio. For that reason I use -falign-functions=32 for that whole
OpenWrt's "bcm53xx" target (it covers Northstar and BCM53573). So
this aspect should be ruled out already in my case.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 7.08.2023 20:34, Florian Fainelli wrote:

On 8/7/23 04:10, Rafał Miłecki wrote:

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.


In the config.gz file you attached in your first email, both CONFIG_MTD_* and 
CONFIG_NVMEM_* so it is not like we are reaching into module space for code 
and/or data and need veneers or anything, it is part of the kernel image so we 
can assert the maximum distance between instructions etc.

Now is it just that specific mutex that is an issue, or do other mutexes 
through the system do cause problems as well?


If you mean mtd mutex, I'm quite sure it's not the one to blame. It just
happened modified function was using a mutex. Could be any other.



Do we suspect the toolchain to be possibly problematic?


Maybe, I really don't know much such low level stuff.




Following dump change brings back lockups/hangs:

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ee437af41..0a24dec55 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, 
u_int cmd, u_long arg)
  {
  int ret;

+    if (!file)
+    pr_info("Missing\n");
+    WARN_ON(!file);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+    WARN_ON(cmd == 1234);
+    WARN_ON(cmd == 5678);
+
  mutex_lock(_mutex);
  ret = mtdchar_ioctl(file, cmd, arg);
  mutex_unlock(_mutex);






___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 4.08.2023 13:07, Rafał Miłecki wrote:

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.

Following dump change brings back lockups/hangs:

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ee437af41..0a24dec55 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, 
u_int cmd, u_long arg)
 {
int ret;

+   if (!file)
+   pr_info("Missing\n");
+   WARN_ON(!file);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+   WARN_ON(cmd == 1234);
+   WARN_ON(cmd == 5678);
+
mutex_lock(_mutex);
ret = mtdchar_ioctl(file, cmd, arg);
mutex_unlock(_mutex);


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:10, Rafał Miłecki wrote:

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make locksup/hangs go away. I tried for few hours.


I decided to find out why enabling CONFIG_DEBUG_MUTEXES "fixes" kernel /
device stability for me. I tried enabling manually code that normally
hides behind the #ifdev CONFIG_DEBUG_MUTEXES.

Attached to this e-mail is a small patch that is enough to make my
kernel stable (mutex-fix-bcm53573.diff).

#

It's not what's the most interesting thought. What really doesn't make
sense anymore is that below diff (on top of attached one) brings back
hangs/lockups.

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 4fe40910f..c440222a4 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -34,6 +34,8 @@ void debug_mutex_lock_common(struct mutex *lock, struct 
mutex_waiter *waiter)
INIT_LIST_HEAD(>list);
 }

+/* Dropping below function brings back hangs/lockups & reboots */
+#if 0
 void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
 {
lockdep_assert_held(>wait_lock);
@@ -41,6 +43,7 @@ void debug_mutex_wake_waiter(struct mutex *lock, struct 
mutex_waiter *waiter)
DEBUG_LOCKS_WARN_ON(waiter->magic != waiter);
DEBUG_LOCKS_WARN_ON(list_empty(>list));
 }
+#endif

 void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 {
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 479bc96c3..15bd4691b 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -57,9 +57,7 @@ struct mutex {
 	struct optimistic_spin_queue osq; /* Spinner MCS lock */
 #endif
 	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
-#endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
@@ -73,12 +71,10 @@ struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
 	struct ww_acquire_ctx	*ww_ctx;
-#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
-#endif
 };
 
-#ifdef CONFIG_DEBUG_MUTEXES
+#if 1 //def CONFIG_DEBUG_MUTEXES
 
 #define __DEBUG_MUTEX_INITIALIZER(lockname)\
 	, .magic = 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0e639497..8fef4485e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -958,10 +958,8 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	/* Mutex deadlock detection: */
 	struct mutex_waiter		*blocked_on;
-#endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	intnon_block_count;
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 45452facf..b22e6ecd8 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -12,7 +12,7 @@ CFLAGS_REMOVE_mutex-debug.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_rtmutex-debug.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-y += mutex-debug.o
 obj-$(CONFIG_LOCKDEP) += lockdep.o
 ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b02fff282..6dc3f80a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -946,9 +946,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 
 	might_sleep();
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-#endif
 
 	ww = container_of(lock, struct ww_mutex, base);
 	if (ww_ctx) {
@@ -1417,9 +1415,7 @@ int __sched mutex_trylock(struct mutex *lock)
 {
 	bool locked;
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-#endif
 
 	locked = __mutex_trylock(lock);
 	if (locked)
___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:10, Rafał Miłecki wrote:

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.


I switched back to OpenWrt's kernel 5.4 and applied all those reverts I
found. Nothing. I was still getting hangs / lockups + reboots.

After more bisecting and I found out it's because OpenWrt backported
commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

It didn't make any sense to me. That patch does nothing on my device and
its code is only executed when booting.

It makes even less sense to me. Why such changes that should not affect
anything actually break stability for BCM53573?

I narrowed above patch even furher. It's actually enough to apply below
diff to break kernel stability:

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index f69c5b94e..f10dd3af1 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -590,6 +590,25 @@ static int mtd_nvmem_add(struct mtd_info *mtd)
return 0;
 }

+static void mtd_check_of_node(struct mtd_info *mtd)
+{
+   struct device_node *partitions, *parent_dn;
+   struct mtd_info *parent;
+
+   /* Check if MTD already has a device node */
+   if (dev_of_node(>dev))
+   return;
+
+   /* Check if a partitions node exist */
+   parent = mtd_get_master(mtd);
+   parent_dn = dev_of_node(>dev);
+   pr_info("[%s] mtd->name:%s parent_dn:%pOF\n", __func__, mtd->name, 
parent_dn);
+   if (!parent_dn)
+   return;
+
+   of_node_put(parent_dn);
+}
+
 /**
  * add_mtd_device - register an MTD device
  * @mtd: pointer to new MTD device info structure
@@ -673,6 +692,7 @@ int add_mtd_device(struct mtd_info *mtd)
mtd->dev.devt = MTD_DEVT(i);
dev_set_name(>dev, "mtd%d", i);
dev_set_drvdata(>dev, mtd);
+   mtd_check_of_node(mtd);
of_node_get(mtd_get_of_node(mtd));
error = device_register(>dev);
if (error) {


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki
Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

#

I took that great chance and decided to track the regression.

Linux 5.4 stable branch worked stable up to the release v5.4.197.
Starting with v5.4.198 I started experiencing those stability issues. I
bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=4460066eb2480b9e203c73755e12e2efc820a27e

With above commit reverted I was able to use stable 5.4 branch up to the
release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the  header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check above commits individually).

Reverting above locking/refcount commits worked fine for few releases:
up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.

#

I'm clueless at this point. Is that possible kernel has some locking bug
I can hit only using this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36,8 kHz.

I tried compiling kernel with:
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
but it didn't change or report anything.

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make locksup/hangs go away. I tried for few hours.

Sadly I don't have access to JTAG or any low level debugging interface.

Does looking at commits I reported above give anyone a hint on what may
be going on maybe?

-- 
Rafał
[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 5.4.238 (ubuntu@nat) (gcc version 8.4.0 (OpenWrt 
GCC 8.4.0 r15234+1-d89a7f0120)) #0 SMP Fri Jul 14 12:56:51 2023
[0.00] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[0.00] CPU: div instructions available: patching division code
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[0.00] OF: fdt: Machine model: Tenda AC9
[0.00] earlycon: ns16550a0 at MMIO 0x18000300 (options '115200n8')
[0.00] printk: bootconsole [ns16550a0] enabled
[0.00] Memory policy: Data cache writealloc
[0.00] Hit pending asynchronous external abort (FSR=0x0c06) during 

Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:21, Russell King (Oracle) wrote:

On Wed, Aug 02, 2023 at 12:10:24AM +0200, Rafał Miłecki wrote:

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

(...)

I'm clueless at this point. Is that possible kernel has some locking bug
I can hit only using this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36,8 kHz.

I tried compiling kernel with:
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
but it didn't change or report anything.

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make locksup/hangs go away. I tried for few hours.

Sadly I don't have access to JTAG or any low level debugging interface.

Does looking at commits I reported above give anyone a hint on what may
be going on maybe?


If you suspect locking issues, make sure you have lockdep enabled which
will detect locking errors. You will want CONFIG_PROVE_LOCKING enabled.

I will say that I use IPv6, and I run 32-bit kernels here both on real
ARMv7 hardware (Armada 388 and iMX6 based stuff) and also in KVM based
VMs, and these have run virtually every release of the kernel (not
stable kernels though) and I haven't ever seen the behaviour that you
describe.

If it is specific to stable kernels, then that would be rather
disappointing.


I wrote above that with any of: CONFIG_DEBUG_RT_MUTEXES,
CONFIG_DEBUG_SPINLOCK or CONFIG_DEBUG_MUTEXES enabled I can't reproduce
the issue anymore. Right? Well I swear it was true for some random 5.4
release I tested before.

With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
and this time on 1 of my BCM53573 devices I got something very
interesting on the first boot.

FWIW following error:
Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
is caused by invalid DT I sent fixes for just recently.

Please scroll through the first booting lines for the WARNING:

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 5.4.238 (ubuntu@nat) (gcc version 8.4.0 (OpenWrt 
GCC 8.4.0 r15234+1-d89a7f0120)) #0 SMP Fri Jul 14 12:56:51 2023
[0.00] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[0.00] CPU: div instructions available: patching division code
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[0.00] OF: fdt: Machine model: Tenda AC9
[0.00] earlycon: ns16550a0 at MMIO 0x18000300 (options '115200n8')
[0.00] printk: bootconsole [ns16550a0] enabled
[0.00] Memory policy: Data cache writealloc
[0.00] Hit pending asynchronous external abort (FSR=0x0c06) during 
first unmask, this is most likely caused by a firmware/bootloader bug.
[0.00] percpu: Embedded 14 pages/cpu s27944 r8192 d21208 u57344
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 32480
[0.00] Kernel command line: console=ttyS0,115200 earlycon
[0.00] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, 
linear)
[0.00] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, 
linear)
[0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
[0.00] Memory: 118164K/131072K available (5531K kernel code, 201K 
rwdata, 1960K rodata, 1024K init, 2106K bss, 12908K reserved, 0K cma-reserved, 
0K highmem)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] rcu: Hierarchical RCU implementation.
[0.00] rcu: RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[0.00] rcu: RCU calculated value of scheduler-enlistment delay is 10 
jiffies.
[0.00] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[0.00] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[0.00] arch_timer: cp15 timer(s) running at 0.03MHz (virt).
[0.00] clocksource: arch_sys_counter: mask: 0xff 
max_cycles: 0x10eb00226, max_idle_ns: 56421785894076 ns
[0.27] sched_clock: 56 bits at 35kHz, resolution 27918ns, wraps every 
70368744165810ns
[0.008654] Ignoring delay timer arch_delay_timer, which has insufficient 
resolution of 27918ns
[0.017951] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., 
Ingo Molnar
[0.025936] ... 

Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 00:25, Florian Fainelli wrote:

Hi Rafal,

On 8/1/23 15:10, Rafał Miłecki wrote:

Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

#

I took that great chance and decided to track the regression.

Linux 5.4 stable branch worked stable up to the release v5.4.197.
Starting with v5.4.198 I started experiencing those stability issues. I
bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=4460066eb2480b9e203c73755e12e2efc820a27e

With above commit reverted I was able to use stable 5.4 branch up to the
release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the  header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check above commits individually).

Reverting above locking/refcount commits worked fine for few releases:
up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.

#

I'm clueless at this point. Is that possible kernel has some locking bug
I can hit only using this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36,8 kHz.


 From the look of it, it seems like the CPU might have bugs with atomics?

Your log indicates that your Cortex-A7 is r0p5 which is described to be 
susceptible to ARM_ERRATA_814220, do you have it enabled by any chance, if not, 
can you enable it and see if makes any difference?


I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Rafał Miłecki

On 2.08.2023 09:00, Rafał Miłecki wrote:

With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
and this time on 1 of my BCM53573 devices I got something very
interesting on the first boot.

FWIW following error:
Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
is caused by invalid DT I sent fixes for just recently.

Please scroll through the first booting lines for the WARNING:

(...)
[    1.167234] bgmac_bcma bcma0:5: Found PHY addr: 30 (NOREGS)
[    1.173655] [ cut here ]
[    1.178374] WARNING: CPU: 0 PID: 1 at kernel/locking/mutex.c:950 
__mutex_lock+0x6b4/0x8a0
[    1.186721] DEBUG_LOCKS_WARN_ON(lock->magic != lock)


Ah, that mutex WARNING comes from my Tenda AC9 device which happens to
use a hacky OpenWrt downstream b53 driver. That driver uses wrong API
(it behaves as PHY driver instead of MDIO driver). It results in probing
against PHY device which isn't properly initialized.

Long story short: above WARNING is just a noise. Ignore it please. Sorry
for that.

Kernel compiled with CONFIG_PROVE_LOCKING still works fine on other
devices and on Tenda AC9 after fixing PHY<->MDIO thing. That kernel
option hides actual bug whatever it is.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


[PATCH] ath79: replace "mac-address-ascii" with "mac-base"

2024-05-25 Thread Rafał Miłecki
From: Rafał Miłecki 

With upstream accepted "mac-base" binding there is no need for a
downstream "mac-address-ascii" workaround anymore.

Signed-off-by: Rafał Miłecki 
---
 .../ath79/dts/ar7161_dlink_dir-825-b1.dts | 57 +++
 .../linux/ath79/dts/ar9331_hiwifi_hc6361.dts  | 37 ++--
 .../linux/ath79/dts/ar9342_zyxel_nwa11xx.dtsi | 28 +
 .../linux/ath79/dts/ar9344_dlink_dir-8x5.dtsi | 42 --
 .../ath79/dts/qca9558_dlink_dir-629-a1.dts| 32 +++
 .../ath79/dts/qca9563_dlink_dir-8x9-a1.dtsi   | 31 ++
 6 files changed, 134 insertions(+), 93 deletions(-)

diff --git a/target/linux/ath79/dts/ar7161_dlink_dir-825-b1.dts 
b/target/linux/ath79/dts/ar7161_dlink_dir-825-b1.dts
index 0e39be7d0b..bdb678298d 100644
--- a/target/linux/ath79/dts/ar7161_dlink_dir-825-b1.dts
+++ b/target/linux/ath79/dts/ar7161_dlink_dir-825-b1.dts
@@ -139,8 +139,8 @@
ath9k0: wifi@0,11 {
compatible = "pci168c,0029";
reg = <0x8800 0 0 0 0>;
-   nvmem-cells = <_lan>, <_art_1000>;
-   nvmem-cell-names = "mac-address-ascii", "calibration";
+   nvmem-cells = <_lan 0>, <_art_1000>;
+   nvmem-cell-names = "mac-address", "calibration";
#gpio-cells = <2>;
gpio-controller;
};
@@ -148,9 +148,8 @@
ath9k1: wifi@0,12 {
compatible = "pci168c,0029";
reg = <0x9000 0 0 0 0>;
-   nvmem-cells = <_wan>, <_art_5000>;
-   nvmem-cell-names = "mac-address-ascii", "calibration";
-   mac-address-increment = <1>;
+   nvmem-cells = <_wan 1>, <_art_5000>;
+   nvmem-cell-names = "mac-address", "calibration";
#gpio-cells = <2>;
gpio-controller;
};
@@ -191,23 +190,31 @@
label = "caldata";
reg = <0x66 0x01>;
read-only;
-   #address-cells = <1>;
-   #size-cells = <1>;
 
-   cal_art_1000: cal@1000 {
-   reg = <0x1000 0xeb8>;
-   };
-
-   cal_art_5000: cal@5000 {
-   reg = <0x5000 0xeb8>;
-   };
-
-   macaddr_lan: macaddr@ffa0 {
-   reg = <0xffa0 0x11>;
-   };
-
-   macaddr_wan: macaddr@ffb4 {
-   reg = <0xffb4 0x11>;
+   nvmem-layout {
+   compatible = "fixed-layout";
+   #address-cells = <1>;
+   #size-cells = <1>;
+
+   cal_art_1000: cal@1000 {
+   reg = <0x1000 0xeb8>;
+   };
+
+   cal_art_5000: cal@5000 {
+   reg = <0x5000 0xeb8>;
+   };
+
+   macaddr_lan: macaddr@ffa0 {
+   compatible = "mac-base";
+   reg = <0xffa0 0x11>;
+   #nvmem-cell-cells = <1>;
+   };
+
+   macaddr_wan: macaddr@ffb4 {
+   compatible = "mac-base";
+   reg = <0xffb4 0x11>;
+   #nvmem-cell-cells = <1>;
+   };
};
};
 
@@ -224,8 +231,8 @@
 
pll-data = <0x 0x1099 0x00991099>;
 
-   nvmem-cells = <_lan>;
-   nvmem-cell-names = "mac-address-ascii";
+   nvmem-cells = <_lan 0>;
+   nvmem-cell-names = "mac-address";
 
fixed-link {
speed = <1000>;
@@ -238,8 +245,8 @@
 
pll-data = <0x 0x1099 0x00991099>;
 
-   nvmem-cells = <_wan>;
-   nvmem-cell-names = "mac-address-ascii";
+   nvmem-cells = <_wan 0>;
+   nvmem-cell-names = "mac-address";
 
phy-handle = <>;
 };
diff --git a/target/linux/ath79/dts/ar9331_hiwifi_hc6361.dts 
b/target/linux/ath79/dts/ar9331_hiwifi_hc6361.dts
index 05d3f6730e..fa000ab90c 100644
--- a/target/linux/ath79/dts/ar9331_hiwifi_hc6361.dts
+++ b/target/linux/ath79/dts/ar9331_hiwifi_hc6361.dts
@@ -77,9 +77,22 @@
};
 
bdinfo: partition@1 {
+   compatible = "nvmem-cells";
  

Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes

2024-05-25 Thread Rafał Miłecki

Hi,

it's a late reply but I didn't find enough determination earlier.

On 8.09.2023 10:10, Linus Walleij wrote:

On Mon, Sep 4, 2023 at 10:34 AM Rafał Miłecki  wrote:


I'm clueless at this point.
Maybe someone can come up with an idea of actual issue & ideally a
solution.


Damn this is frustrating.


2. Clock (arm,armv7-timer)

While comparing main clock in Broadcom's SDK with upstream one I noticed
a tiny difference: mask value. I don't know it it makes any sense but
switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
arm_arch_timer.c (to match SDK) increases average uptime (time before a
hang/lockup happens) from 4 minutes to 36 minutes.


This could be related to how often the system goes to idle.


+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 5678)
+   arch_cpu_idle();
+   if (cpu_idle_force_poll == 1234)
+   arch_cpu_idle();


Idle again.

I would have tried to see what arch_cpu_idle() is doing.

arm_pm_idle() or cpu_do_idle()?


In my case arm_pm_idle is NULL.



What happens if you just put return in arch_cpu_idle()
so it does nothing?


Doesn't help. I also tried putting:
udelay(10);
and
udelay(1000);
at the arch_cpu_idle() beginning. None helped.


Here comes more interesting experiment though. Putting there:

if (!(foo++ % 1)) {
pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}

doesn't seem to help.


Putting following however seems to make kernel/device stable:

if (!(foo++ % 100)) {
pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}


I think I'm just going to assume those chipsets are simply hw broken.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH] brcm: add CLM BLOB files for Luxul devices

2024-05-25 Thread Rafał Miłecki

Hi Josh,

I'm OpenWrt developer and those CLM BLOBs have significant meaning for
this project. They allow OpenWrt users to use their devices with full
WiFi support (not limited to some generic fallback setup level).

For that let me speak up regarding this PATCH case.

On 12.06.2023 18:57, Josh Boyer wrote:

On Mon, Jun 12, 2023 at 11:05 AM Rip Route  wrote:


The files were compiled from source obtained through Luxul's contract 
manufacturer of these devices.


OK, it's not clear to me whether we can accept these.  We have no idea
what your contract specified, what sources were used, what the license
on them is, or if you had authority to release them at all.  I'm not
questioning your integrity by any means, but we have very little way
to know Broadcom is agreeable to these being included in the
linux-firmware repo.  If Broadcom wants to submit them or provide a
Signed-off-by statement, we might be able to take them then.


I fully understand your concerns. Those are binary files and it's hard
to tell what they contain at first sight.

It'd be perfect to have Broadcom simply approve them but I'm afraid it's
unlikely. I see linux-wireless@ and Broadcom lists were cc-ed but we
didn't get any feedback. I know some firmware contributions took
Broadcom months or years in the past.

So I'd try to analyze what we have here and try to review it.


First of all: CLM (BLOB) is a binary formatted data. It doesn't contain
any actual CPU-executable code. It gets uploaded to FullMAC (closed
source) firmware and gets parsed by it.

So what we have here is not actual firmware (executable binary) built
from proprietary C sources full of logic and possibly some patent. We
have binary files containing (literally) numbers and strings. They just
use a binary undocumented format.


I actually took a moment to reverse engineer those files and write basic
tools for them.

BLOB is a simple container format with header containing offsets, sizes
and CRC32 sums. I developed "bcmblob" tool for parsing BLOBs and
extracting from them:
https://git.openwrt.org/?p=project/firmware-utils.git;a=commitdiff;h=f730ad2fa0b4728e2bd66771d22cf642909e020e

CLM BLOB is a BLOB file containing CLM and chipset version. I also
developed absolutely basic tool "bcmclm" for parsing CLM files:
https://git.openwrt.org/?p=project/firmware-utils.git;a=commitdiff;h=916633160dc92ccae6a0ad8fd900f2d75b5b5ff0


Given that, in my I-am-not-a-lawyer opinion, those files are safe to
accept and distribute.

If I were to compare them to something it'd be an XLS stylesheet file.
It's a binary undocumented format but what it actually contains are
sets of values.

I don't think it raises any concerns to use a custom binary format.
We're clearly fine sharing .xls, .exe or .rar files.

I also don't think you can stop anyone from sharing a list of 2 GHz,
5 GHz or 6 GHz channels. Or their rates. Or flags (like radar etc.).
Or allowed TX power levels. Just numbers in general.


Do you think that with my basic technical summary and Luxul-provided
info you can accept those files?

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


[PATCH] net: gro: respect nf_conntrack_checksum for skipping csum verification

2024-05-25 Thread Rafał Miłecki
From: Rafał Miłecki 

Netfilter allows disabling checksum verification of incoming packets by
setting nf_conntrack_checksum variable. That feature is very useful for
home routers which:
1. Most of the time just /forward/ network traffic
2. Have slow CPU(s) and csum calculation is a challenge

Some projects like OpenWrt set nf_conntrack_checksum to 0 by default.

It would be nice to allow similar optimization in the GRO code paths.
This patch simply reuses nf_conntrack_checksum variable to skip
skb_gro_checksum_validate() calls if applicable.

Signed-off-by: Rafał Miłecki 
---
Hi guys,

I'm not very familiar with net subsystem, please let me know if there is
a better way of implementing such a feature.
---
 net/ipv4/tcp_offload.c   | 3 +++
 net/ipv6/tcpv6_offload.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 30abde86db45..734a3c0f3d4a 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -311,6 +311,9 @@ struct sk_buff *tcp4_gro_receive(struct list_head *head, 
struct sk_buff *skb)
 {
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   dev_net(skb->dev)->ct.sysctl_checksum &&
+#endif
skb_gro_checksum_validate(skb, IPPROTO_TCP,
  inet_gro_compute_pseudo)) {
NAPI_GRO_CB(skb)->flush = 1;
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index 39db5a226855..2144afa56fa3 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -18,6 +18,9 @@ struct sk_buff *tcp6_gro_receive(struct list_head *head, 
struct sk_buff *skb)
 {
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   dev_net(skb->dev)->ct.sysctl_checksum &&
+#endif
skb_gro_checksum_validate(skb, IPPROTO_TCP,
  ip6_gro_compute_pseudo)) {
NAPI_GRO_CB(skb)->flush = 1;
-- 
2.34.1


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH] wifi: mt76: mt7603: add debugfs attr for disabling frames buffering

2024-05-25 Thread Shengyu Qu

Hi Rafal,

Maybe we could disable frames buffering by default until it is fixed?
Also, maybe we could do more tests on newer models such as mt7986/81 to
make this patch benefit more models?

Best regards,
Shengyu


在 2024/3/26 6:33, Rafał Miłecki 写道:

From: Rafał Miłecki 

MT7603EN and MT7628AN were reported by multiple users to be unstable
under high traffic. Transmission of packets could stop for seconds often
leading to disconnections.

Long research & debugging revelaed a close relation between MCU
interrupts of type PKT_TYPE_TXS and slowdowns / stalls. That led to
questioning frames buffering feature.

It turns out that disabling SKBs loopback code makes mt7603 devices much
more stable under load. There are still some traffic hiccups but those
happen like once every an hour and end up in recovery in most cases.

Add a debugfs option for disabling frames buffering so users can give it
a try. If this solution yields a success we can make this feature
disabled by default.

This change was successfully tested using 2 GHz AP interface on:
1. Netgear R6220 - MT7621ST (SoC) + MT7603EN (WiFi) + MT7612EN (WiFi)
2. Xiaomi Mi Router 4C - MT7628AN (Wi-Fi SoC)

Link: 
https://lore.kernel.org/linux-wireless/7c96d5ee-86c1-8068-1b58-40db6087a...@gmail.com/
Closes: https://github.com/openwrt/mt76/issues/865
Fixes: c8846e101502 ("mt76: add driver for MT7603E and MT7628/7688")
Signed-off-by: Rafał Miłecki 
---
  drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c | 2 ++
  drivers/net/wireless/mediatek/mt76/mt7603/dma.c | 3 +++
  drivers/net/wireless/mediatek/mt76/mt7603/init.c| 1 +
  drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h  | 2 ++
  4 files changed, 8 insertions(+)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c 
b/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
index 3967f2f05774..c80baba7a402 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/debugfs.c
@@ -115,4 +115,6 @@ void mt7603_init_debugfs(struct mt7603_dev *dev)
>sensitivity_limit);
debugfs_create_bool("dynamic_sensitivity", 0600, dir,
>dynamic_sensitivity);
+   debugfs_create_bool("frames_buffering", 0600, dir,
+   >frames_buffering);
  }
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/dma.c 
b/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
index 7a2f5d38562b..f5ab729dec31 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/dma.c
@@ -27,6 +27,9 @@ mt7603_rx_loopback_skb(struct mt7603_dev *dev, struct sk_buff 
*skb)
u32 val;
u8 tid = 0;
  
+	if (!dev->frames_buffering)

+   goto free;
+
if (skb->len < MT_TXD_SIZE + sizeof(struct ieee80211_hdr))
goto free;
  
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/init.c b/drivers/net/wireless/mediatek/mt76/mt7603/init.c

index 6c55c72f28a2..5abc2618ec8b 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/init.c
@@ -515,6 +515,7 @@ int mt7603_register_device(struct mt7603_dev *dev)
dev->slottime = 9;
dev->sensitivity_limit = 28;
dev->dynamic_sensitivity = true;
+   dev->frames_buffering = true;
  
  	ret = mt7603_init_hardware(dev);

if (ret)
diff --git a/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h 
b/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
index 9e58df7042ad..02c88334cdc0 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
+++ b/drivers/net/wireless/mediatek/mt76/mt7603/mt7603.h
@@ -155,6 +155,8 @@ struct mt7603_dev {
u32 reset_test;
  
  	unsigned int reset_cause[__RESET_CAUSE_MAX];

+
+   bool frames_buffering;
  };
  
  extern const struct mt76_driver_ops mt7603_drv_ops;


OpenPGP_0xE3520CC91929C8E7.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature
___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


[PATCH v2] imx: add imx8m support

2024-05-25 Thread Tim Harvey
Add imx8m support:
 - add a cortexa53 subtarget to imx
 - move ARCH and KERNELNAME to subtargets

No device-specific targets or firmware images are created yet but all
imx8m* dtbs will be built.

enabling CONFIG_TARGET_ROOTFS_INITRAMFS results in
openwrt-imx-cortexa53-imx8m-initramfs-kernel.bin which has been
successfully booted on an imx8mm-evk using the following:

u-boot=> tftpboot $fdt_addr_r image-imx8mm-evk.dtb && \
tftpboot $kernel_addr_r openwrt-imx-cortexa53-imx8m-initramfs-kernel.bin && \
booti $kernel_addr_r - $fdt_addr_r

Signed-off-by: Tim Harvey 
---
v2:
 - fix build failure regarding ubifs image generation
 - remove kernel patches (these all go away when we move to 6.1
   and I've already submitted a series to add 6.1 support to imx)
 - add a generic imx8m device that picks up imx8m dtb's
 - show example of how I'm booting this in commit log
---
 target/linux/imx/Makefile |  5 +-
 target/linux/imx/cortexa53/config-default | 96 +++
 target/linux/imx/cortexa53/target.mk  |  8 ++
 target/linux/imx/cortexa7/target.mk   |  2 +
 target/linux/imx/cortexa9/target.mk   |  2 +
 target/linux/imx/image/cortexa53.mk   | 16 
 6 files changed, 125 insertions(+), 4 deletions(-)
 create mode 100644 target/linux/imx/cortexa53/config-default
 create mode 100644 target/linux/imx/cortexa53/target.mk
 create mode 100644 target/linux/imx/image/cortexa53.mk

diff --git a/target/linux/imx/Makefile b/target/linux/imx/Makefile
index 5f7f9db589..53c1ccce0c 100644
--- a/target/linux/imx/Makefile
+++ b/target/linux/imx/Makefile
@@ -4,19 +4,16 @@
 
 include $(TOPDIR)/rules.mk
 
-ARCH:=arm
 BOARD:=imx
 BOARDNAME:=NXP i.MX
 FEATURES:=audio display fpu gpio pcie rtc usb usbgadget squashfs targz nand 
ubifs boot-part rootfs-part
-SUBTARGETS:=cortexa7 cortexa9
+SUBTARGETS:=cortexa7 cortexa9 cortexa53
 
 KERNEL_PATCHVER:=5.15
 KERNEL_TESTING_PATCHVER:=6.1
 
 include $(INCLUDE_DIR)/target.mk
 
-KERNELNAME:=zImage dtbs
-
 DEFAULT_PACKAGES += uboot-envtools mkf2fs e2fsprogs blkid
 
 $(eval $(call BuildTarget))
diff --git a/target/linux/imx/cortexa53/config-default 
b/target/linux/imx/cortexa53/config-default
new file mode 100644
index 00..de52260489
--- /dev/null
+++ b/target/linux/imx/cortexa53/config-default
@@ -0,0 +1,96 @@
+CONFIG_64BIT=y
+CONFIG_ARM64=y
+CONFIG_ARM64_4K_PAGES=y
+CONFIG_ARM64_VA_BITS_39=y
+CONFIG_ARM64_CRYPTO=y
+CONFIG_ARCH_NXP=y
+CONFIG_CRYPTO_AES_ARM64=y
+CONFIG_CRYPTO_AES_ARM64_CE=y
+CONFIG_CRYPTO_AES_ARM64_CE_BLK=y
+CONFIG_CRYPTO_AES_ARM64_CE_CCM=y
+CONFIG_CRYPTO_BLAKE2S=y
+CONFIG_CRYPTO_GHASH_ARM64_CE=y
+CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
+CONFIG_CRYPTO_SHA1_ARM64_CE=y
+CONFIG_CRYPTO_SHA256_ARM64=y
+CONFIG_CRYPTO_SHA2_ARM64_CE=y
+CONFIG_CRYPTO_SHA512_ARM64=y
+CONFIG_CRYPTO_SHA512_ARM64_CE=y
+CONFIG_CRYPTO_ZSTD=y
+CONFIG_UNMAP_KERNEL_AT_EL0=y
+CONFIG_RODATA_FULL_DEFAULT_ENABLED=y
+CONFIG_ARM64_TAGGED_ADDR_ABI=y
+CONFIG_ARCH_MMAP_RND_BITS=18
+CONFIG_VMAP_STACK=y
+CONFIG_MEMORY_ISOLATION=y
+CONFIG_CMA=y
+CONFIG_CMA_AREAS=7
+# CONFIG_CMA_DEBUG is not set
+# CONFIG_CMA_DEBUGFS is not set
+# CONFIG_CMA_SYSFS is not set
+CONFIG_CONSOLE_TRANSLATIONS=y
+CONFIG_CONTIG_ALLOC=y
+CONFIG_ZONE_DMA32=y
+CONFIG_ARM_IMX_CPUFREQ_DT=y
+CONFIG_ARM64_CRYPTO=y
+CONFIG_EXTRA_FIRMWARE="imx/sdma/sdma-imx7d.bin"
+CONFIG_EXTRA_FIRMWARE_DIR="firmware"
+CONFIG_PCI=y
+CONFIG_PCIEAER=y
+CONFIG_PCIEPORTBUS=y
+CONFIG_PCIE_PME=y
+CONFIG_PHY_FSL_IMX8M_PCIE=y
+CONFIG_PCIEAER=y
+CONFIG_PCIEPORTBUS=y
+CONFIG_PCIE_DW=y
+CONFIG_PCIE_DW_HOST=y
+CONFIG_PCIE_PME=y
+CONFIG_PCI_DOMAINS=y
+CONFIG_PCI_DOMAINS_GENERIC=y
+CONFIG_PCI_IMX6=y
+CONFIG_PCI_MSI=y
+CONFIG_PCI_MSI_IRQ_DOMAIN=y
+CONFIG_PINCTRL_IMX=y
+CONFIG_DWMAC_DWC_QOS_ETH=y
+CONFIG_PINCTRL=y
+CONFIG_PINCTRL_SINGLE=y
+CONFIG_PINCTRL_IMX=y
+CONFIG_PINCTRL_IMX8MM=y
+CONFIG_PINCTRL_IMX8MN=y
+CONFIG_PINCTRL_IMX8MP=y
+CONFIG_PINCTRL_IMX8MQ=y
+CONFIG_THERMAL=y
+CONFIG_IMX8MM_THERMAL=y
+CONFIG_REGULATOR_MP5416=y
+CONFIG_REGULATOR_PCA9450=y
+CONFIG_USB_CONN_GPIO=y
+CONFIG_NOP_USB_XCEIV=y
+CONFIG_USB_OTG=y
+CONFIG_USB_DWC3=y
+CONFIG_USB_DWC3_DUAL_ROLE=y
+# CONFIG_USB_DWC3_GADGET is not set
+# CONFIG_USB_DWC3_HOST is not set
+CONFIG_USB_DWC3_IMX8MP=y
+# CONFIG_MMC_SDHCI_PCI is not set
+CONFIG_CLK_IMX8MM=y
+CONFIG_CLK_IMX8MN=y
+CONFIG_CLK_IMX8MP=y
+CONFIG_CLK_IMX8MQ=y
+CONFIG_CLK_IMX8QXP=y
+CONFIG_SOC_IMX8M=y
+# CONFIG_IMX_DSP is not set
+# CONFIG_IMX_SCU is not set
+CONFIG_EXTCON_USB_GPIO=y
+CONFIG_PHY_FSL_IMX8MQ_USB=y
+# CONFIG_PHY_MIXEL_LVDS_PHY is not set
+CONFIG_RESET_IMX7=y
+CONFIG_INTERCONNECT=y
+CONFIG_INTERCONNECT_IMX=y
+CONFIG_INTERCONNECT_IMX8MM=y
+CONFIG_INTERCONNECT_IMX8MN=y
+CONFIG_INTERCONNECT_IMX8MQ=y
+CONFIG_INTERCONNECT_IMX8MP=y
+# CONFIG_DMA_CMA is not set
+CONFIG_VT=y
+CONFIG_VT_CONSOLE=y
+CONFIG_VT_HW_CONSOLE_BINDING=y
diff --git a/target/linux/imx/cortexa53/target.mk 
b/target/linux/imx/cortexa53/target.mk
new file mode 100644
index 00..b9b32d1829
--- /dev/null
+++ b/target/linux/imx/cortexa53/target.mk
@@ -0,0 

Re: [PATCH] dt-bindings: leds: add FUNCTION defines for per-band WLANs

2024-05-25 Thread Rob Herring

On Wed, 17 Jan 2024 16:17:36 +0100, Rafał Miłecki wrote:
> From: Rafał Miłecki 
> 
> Most wireless routers and access points can operate in multiple bands
> simultaneously. Vendors often equip their devices with per-band LEDs.
> 
> Add defines for those very common functions to allow cleaner & clearer
> bindings.
> 
> Signed-off-by: Rafał Miłecki 
> ---
>  include/dt-bindings/leds/common.h | 3 +++
>  1 file changed, 3 insertions(+)
> 

Acked-by: Rob Herring 


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: (subset) [PATCH] dt-bindings: leds: add FUNCTION defines for per-band WLANs

2024-05-25 Thread Lee Jones
On Wed, 17 Jan 2024 16:17:36 +0100, Rafał Miłecki wrote:
> Most wireless routers and access points can operate in multiple bands
> simultaneously. Vendors often equip their devices with per-band LEDs.
> 
> Add defines for those very common functions to allow cleaner & clearer
> bindings.
> 
> 
> [...]

Applied, thanks!

[1/1] dt-bindings: leds: add FUNCTION defines for per-band WLANs
  commit: 89d9d3eedc8804e06a770e3cf1279f9131b785f1

--
Lee Jones [李琼斯]


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Krzysztof Kozlowski
On 09/01/2024 17:38, Rafał Miłecki wrote:
> On 9.01.2024 10:02, Krzysztof Kozlowski wrote:
>> On 09/01/2024 09:23, Rafał Miłecki wrote:
>>> From: Rafał Miłecki 
>>>
>>> OpenWrt project provides downstream support for thousands of embedded
>>> home network devices. Its custom requirement for DT is to provide info
>>> about LEDs roles. Currently it does it by using custom non-documented
>>> aliases. While formally valid (aliases.yaml doesn't limit names or
>>> purposes of aliases) it's quite a loose solution.
>>>
>>> Document 4 precise "chosen" biding properties with clearly documented
>>> OpenWrt usage. This will allow upstreaming tons of DTS files that noone
>>
>> typo: none
> 
> typo: no one ;)
> 
>>> cared about so far as those would need to be patched downstream anyway.
>>
>>  From all this description I understand why you want to add it, but I
>> don't understand what are you exactly adding and how it is being used by
>> the users/OS.
> 
> In OpenWrt we have user space script that reads LED full path:
> cat /proc/device-tree/aliases/led-
> 
> And then sets LED to state matching current boot stage:
> echo 1 > /sys/class/leds//brightness

OK, it's nowhere close to a hardware property. You now instruct OS what
to do. It's software or software policy.

> 
> 
>>> Signed-off-by: Rafał Miłecki 
>>> ---
>>> A few weeks ago I was seeking for a help regarding OpenWrt's need for
>>> specifing LEDs roles in DT, see:
>>>
>>> Describing LEDs roles in device tree?
>>> https://lore.kernel.org/linux-devicetree/ee912a89-4fd7-43c3-a79b-16659a035...@gmail.com/T/#u
>>>
>>> I DON'T think OpenWrt's current solution with aliases is good enough:
>>> * It's not clearly documented
>>> * It may vary from other projects usa case
>>> * It may be refused by random maintainers I think
>>>
>>> I decided to suggest 4 OpenWrt-prefixed properties for "chosen". I'm
>>> hoping this small custom binding is sth we could go with. I'm really
>>> looking forward to upstreaming OpenWrt's downstream DTS files so other
>>> projects (e.g. Buildroot) can use them.
>>>
>>> If you have any better fitting solution in mind please let me know. I
>>> should be fine with anything that lets me solve this downstream mess
>>> situation.
>>>
>>>   dtschema/schemas/chosen.yaml | 9 +
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/dtschema/schemas/chosen.yaml b/dtschema/schemas/chosen.yaml
>>> index 6d5c3f1..96d0db7 100644
>>> --- a/dtschema/schemas/chosen.yaml
>>> +++ b/dtschema/schemas/chosen.yaml
>>> @@ -264,4 +264,13 @@ properties:
>>>   patternProperties:
>>> "^framebuffer": true
>>>   
>>> +  "^openwrt,led-(boot|failsafe|running|upgrade)$":
>>> +$ref: types.yaml#/definitions/string
>>> +description:
>>> +  OpenWrt choice of LED for a given role.
>>
>> Neither this explains it. What is "OpenWrt choice"? Choice like what?
>> What are the valid choices?
>>
>>> Value must be a full path (encoded
>>> +  as a string) to a relevant LED node.
>>
>> What do you mean by full path? DT node path? Then no, use phandles.
>>
>> Anyway, we have existing properties for this - LED functions. Just
>> choose LED with given function (from sysfs) and you are done. If
>> function is missing in the header: feel free to go, least for me.
> 
> In "Describing LEDs roles in device tree?" e-mail I wrote:
> 
> "
> Please note that "function" on its own is not sufficient as multiple
> LEDs my share the same function name but its meaning may vary depending
> on color.
> "
> 
> Let me elaborate here.
> 
> Values of "function" property match LEDs descriptions (usually it's a
> physical label). That is OK and makes sense but doesn't allow OpenWrt to
> automatically pick proper LED to use during given boot stage.
> 
> Some devices may have multiple color of a the same LED function. OpenWrt
> developer needs to choose which color to use for failsafe more and which
> boot fully booted state (and others too).
> 
> Devices also differ in available LEDs by their functions. Some may have
> LED_FUNCTION_POWER that OpenWrt needs to use. Some may have
> LED_FUNCTION_STATUS. Or both. There are some more (less common)
> functions like LED_FUNCTION_ACTIVITY.
> 
> We failed at OpenWrt to develop a single generic script to rule all
> devices and all their LEDs combinations. We ended up with developers
> *choosing* what LED to use for a given system state.

So this all is because you want to write easier OS. That's abuse of
Devicetree. You can define which LEDs have different meaning, e.g.
physical label or icon attached to them. That's a hardware property.

You can also define how pieces of hardware are wired together and create
entire system, e.g. connect one LED to disk activity.

However what you are proposing here is to dynamically configure one,
given OS. I don't think it is suitable.

The problem of OS to nicely figure out which LED to blink when, is not a
problem of Devicetree. It is a problem of OS and its configuration.

Best regards,
Krzysztof



Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Krzysztof Kozlowski
On 09/01/2024 22:08, Geert Uytterhoeven wrote:
> Hi Krzysztof,
> 
> On Tue, Jan 9, 2024 at 8:10 PM Krzysztof Kozlowski
>  wrote:
>> On 09/01/2024 17:38, Rafał Miłecki wrote:
>>> On 9.01.2024 10:02, Krzysztof Kozlowski wrote:
 On 09/01/2024 09:23, Rafał Miłecki wrote:
> From: Rafał Miłecki 
>
> OpenWrt project provides downstream support for thousands of embedded
> home network devices. Its custom requirement for DT is to provide info
> about LEDs roles. Currently it does it by using custom non-documented
> aliases. While formally valid (aliases.yaml doesn't limit names or
> purposes of aliases) it's quite a loose solution.
>
> Document 4 precise "chosen" biding properties with clearly documented
> OpenWrt usage. This will allow upstreaming tons of DTS files that noone

 typo: none
>>>
>>> typo: no one ;)
>>>
> cared about so far as those would need to be patched downstream anyway.

  From all this description I understand why you want to add it, but I
 don't understand what are you exactly adding and how it is being used by
 the users/OS.
>>>
>>> In OpenWrt we have user space script that reads LED full path:
>>> cat /proc/device-tree/aliases/led-
>>>
>>> And then sets LED to state matching current boot stage:
>>> echo 1 > /sys/class/leds//brightness
>>
>> OK, it's nowhere close to a hardware property. You now instruct OS what
>> to do. It's software or software policy.
> 
 Anyway, we have existing properties for this - LED functions. Just
 choose LED with given function (from sysfs) and you are done. If
 function is missing in the header: feel free to go, least for me.
>>>
>>> In "Describing LEDs roles in device tree?" e-mail I wrote:
>>>
>>> "
>>> Please note that "function" on its own is not sufficient as multiple
>>> LEDs my share the same function name but its meaning may vary depending
>>> on color.
>>> "
>>>
>>> Let me elaborate here.
>>>
>>> Values of "function" property match LEDs descriptions (usually it's a
>>> physical label). That is OK and makes sense but doesn't allow OpenWrt to
>>> automatically pick proper LED to use during given boot stage.
>>>
>>> Some devices may have multiple color of a the same LED function. OpenWrt
>>> developer needs to choose which color to use for failsafe more and which
>>> boot fully booted state (and others too).
>>>
>>> Devices also differ in available LEDs by their functions. Some may have
>>> LED_FUNCTION_POWER that OpenWrt needs to use. Some may have
>>> LED_FUNCTION_STATUS. Or both. There are some more (less common)
>>> functions like LED_FUNCTION_ACTIVITY.
>>>
>>> We failed at OpenWrt to develop a single generic script to rule all
>>> devices and all their LEDs combinations. We ended up with developers
>>> *choosing* what LED to use for a given system state.
>>
>> So this all is because you want to write easier OS. That's abuse of
>> Devicetree. You can define which LEDs have different meaning, e.g.
>> physical label or icon attached to them. That's a hardware property.
>>
>> You can also define how pieces of hardware are wired together and create
>> entire system, e.g. connect one LED to disk activity.
>>
>> However what you are proposing here is to dynamically configure one,
>> given OS. I don't think it is suitable.
>>
>> The problem of OS to nicely figure out which LED to blink when, is not a
>> problem of Devicetree. It is a problem of OS and its configuration.
> 
> I'd say it's a grey area...
> 
> What if the colors are printed on the case, next to the LED?
> Like these multi-color LEDs indicating port speed on network switches?
> Then it suddenly becomes hardware description, just like
> "aliases/serial2 = &...;" referring to serial port 2.
> 
> Next, what if the colors are not printed on the case, but documented
> in the product's manual?
> What if there is no paper product manual, but just the OpenWRT online
> documentation?

Mapping between colors and logical meanings is subjective. I don't think
it is the same case as LED with a radio-signal or power-plug label.

Anyway I also agree that this distinction is a bit blurred.

Best regards,
Krzysztof


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Krzysztof Kozlowski
On 09/01/2024 22:48, Rafał Miłecki wrote:
>>
>> You can also define how pieces of hardware are wired together and create
>> entire system, e.g. connect one LED to disk activity.
>>
>> However what you are proposing here is to dynamically configure one,
>> given OS. I don't think it is suitable.
>>
>> The problem of OS to nicely figure out which LED to blink when, is not a
>> problem of Devicetree. It is a problem of OS and its configuration.
> 
> I'd say it's a thin line. Or just a grey idea as Geert said.
> 
> What is a LED "function" after all? How exactly are:
> LED_FUNCTION_STATUS
> LED_FUNCTION_ACTIVITY
> LED_FUNCTION_BOOT
> LED_FUNCTION_HEARTBEAT
> different from each other?
> 
> I can imagine OpenWrt seeing a different role for LED_FUNCTION_ACTIVITY
> or LED_FUNCTION_BOOT than other projects.

...which is not a problem. The meaning of these, except quite obvious
heartbeat, is defined by the OS or system configurators.

> 
> Proposed properties "openwrt,led-" don't exactly describe hardware
> per se but are still designed to deal with hardware differences.
> 
>  From a practical point of view it's much easier to put such OS
> configuration info in DT since it's closely related to LEDs defined
> there and it helps a lot with maintenance. If at some point we change

I agree, however this is an abuse of DT and therefore it is not an
argument to put something into DT. And this was told many, many times on
the lists: just because it is easier to instantiate each Linux struct
device from DT (with 1-1 mapping between devices and device nodes), does
not mean you should do it.

Same here. Just because it is easier for OpenWRT, does not mean this is
the solution.

This is the most frequent argument used in all of such DT abuses.
Another example: I want to boot some virtual machine and doing ACPI is
too difficult, so I will just use DT as way to pass from host to guest.
There were several examples of this. I understand why DT is the easiest
for the job...

> DT due to previous mistake (e.g. we fix LED color from amber to red)
> that would mean breaking user space of Linux system (changing LED name).
> Having DT binding for LEDs roles would prevent that.

I can argue that LEDs "label" can be un-deprecated and used for that
purpose as well. It will provide you stable sysfs entry, regardless of
the "color" property.

In your case you could also use to solve the actual problem: just label
each LED accordingly, e.g. "phase:boot", "phase:upgrade". It might be
not the best solution though, because we put one's OS expectations
inside DT device node...

> I was hoping that vendor prefixed "chosen" properties may somehow get
> accepted as a reasonable solution for dealing with hardware differences
> even if they don't strictly describe hardware itself.

It's actually not the worst idea considering above "OS expectations
inside DT device node" when using "label"...

> 
> Is there any other DT solution you think would be better and could be
> accepted?
> Given my hesitation about "function" meaning would something like
> openwrt,function = "(boot|failsafe|running|upgrade)"
> be any better?

Your problem is not really that specific to OpenWRT - several embedded
systems want to do the same, including Android. Some of the LEDs must be
active before the user-space comes up, so it is the job for kernel
and/or DT. Therefore let's go with generic solutions?

I still wonder why we cannot define new LED FUNCTION constants and use
them? You need them only for the pre-userspace phase, so do you expect
one LED would have two functions? But if you do not have user-space how
this aliases are being handled? By how?

If you have user-space, then it's not a job for kernel.

Best regards,
Krzysztof


___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [PATCH dt-schema] schemas: chosen: Add OpenWrt LEDs properties for system states

2024-05-25 Thread Geert Uytterhoeven
Hi Krzysztof,

On Tue, Jan 9, 2024 at 8:10 PM Krzysztof Kozlowski
 wrote:
> On 09/01/2024 17:38, Rafał Miłecki wrote:
> > On 9.01.2024 10:02, Krzysztof Kozlowski wrote:
> >> On 09/01/2024 09:23, Rafał Miłecki wrote:
> >>> From: Rafał Miłecki 
> >>>
> >>> OpenWrt project provides downstream support for thousands of embedded
> >>> home network devices. Its custom requirement for DT is to provide info
> >>> about LEDs roles. Currently it does it by using custom non-documented
> >>> aliases. While formally valid (aliases.yaml doesn't limit names or
> >>> purposes of aliases) it's quite a loose solution.
> >>>
> >>> Document 4 precise "chosen" biding properties with clearly documented
> >>> OpenWrt usage. This will allow upstreaming tons of DTS files that noone
> >>
> >> typo: none
> >
> > typo: no one ;)
> >
> >>> cared about so far as those would need to be patched downstream anyway.
> >>
> >>  From all this description I understand why you want to add it, but I
> >> don't understand what are you exactly adding and how it is being used by
> >> the users/OS.
> >
> > In OpenWrt we have user space script that reads LED full path:
> > cat /proc/device-tree/aliases/led-
> >
> > And then sets LED to state matching current boot stage:
> > echo 1 > /sys/class/leds//brightness
>
> OK, it's nowhere close to a hardware property. You now instruct OS what
> to do. It's software or software policy.

> >> Anyway, we have existing properties for this - LED functions. Just
> >> choose LED with given function (from sysfs) and you are done. If
> >> function is missing in the header: feel free to go, least for me.
> >
> > In "Describing LEDs roles in device tree?" e-mail I wrote:
> >
> > "
> > Please note that "function" on its own is not sufficient as multiple
> > LEDs my share the same function name but its meaning may vary depending
> > on color.
> > "
> >
> > Let me elaborate here.
> >
> > Values of "function" property match LEDs descriptions (usually it's a
> > physical label). That is OK and makes sense but doesn't allow OpenWrt to
> > automatically pick proper LED to use during given boot stage.
> >
> > Some devices may have multiple color of a the same LED function. OpenWrt
> > developer needs to choose which color to use for failsafe more and which
> > boot fully booted state (and others too).
> >
> > Devices also differ in available LEDs by their functions. Some may have
> > LED_FUNCTION_POWER that OpenWrt needs to use. Some may have
> > LED_FUNCTION_STATUS. Or both. There are some more (less common)
> > functions like LED_FUNCTION_ACTIVITY.
> >
> > We failed at OpenWrt to develop a single generic script to rule all
> > devices and all their LEDs combinations. We ended up with developers
> > *choosing* what LED to use for a given system state.
>
> So this all is because you want to write easier OS. That's abuse of
> Devicetree. You can define which LEDs have different meaning, e.g.
> physical label or icon attached to them. That's a hardware property.
>
> You can also define how pieces of hardware are wired together and create
> entire system, e.g. connect one LED to disk activity.
>
> However what you are proposing here is to dynamically configure one,
> given OS. I don't think it is suitable.
>
> The problem of OS to nicely figure out which LED to blink when, is not a
> problem of Devicetree. It is a problem of OS and its configuration.

I'd say it's a grey area...

What if the colors are printed on the case, next to the LED?
Like these multi-color LEDs indicating port speed on network switches?
Then it suddenly becomes hardware description, just like
"aliases/serial2 = &...;" referring to serial port 2.

Next, what if the colors are not printed on the case, but documented
in the product's manual?
What if there is no paper product manual, but just the OpenWRT online
documentation?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM board lockups/hangs triggered by locks and mutexes

2024-05-25 Thread Geert Uytterhoeven
Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki  wrote:
> On 4.08.2023 13:07, Rafał Miłecki wrote:
> > I triple checked that. Dropping a single unused function breaks kernel /
> > device stability on BCM53573!
> >
> > AFAIK the only thing below diff actually affects is location of symbols
> > (I actually verified that by comparing System.map before and after -
> > over 22'000 of relocated symbols).
> >
> > Can some unfortunate location of symbols cause those hangs/lockups?
>
> I performed another experiment. First I dropped mtd_check_of_node() to
> bring kernel back to the stable state.
>
> Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
> ended up adding just enough to make sure all post-mtd symbols in
> System.map got the same offset as in case of backporting
> mtd_check_of_node().
>
> I started experiencing lockups/hangs again.
>
> I repeated the same test with adding dumb code to the brcm_nvram_probe()
> and verifying symbols offsets following brcm_nvram_probe one.
>
> I believe this confirms that this problem is about offset or alignment
> of some specific symbol(s). The remaining question is what symbols and
> how to fix or workaround that.

I had similar experiences on other ARM platforms many years ago:
bisection lead to something completely bogus, and it turned out
adding a single line of innocent code made the system lock-up or crash
unexpectedly.  It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change changing alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes

2024-05-25 Thread Geert Uytterhoeven
Hi Rafał,

On Mon, Sep 4, 2023 at 10:35 AM Rafał Miłecki  wrote:
> 2. Clock (arm,armv7-timer)
>
> While comparing main clock in Broadcom's SDK with upstream one I noticed
> a tiny difference: mask value. I don't know it it makes any sense but
> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
> arm_arch_timer.c (to match SDK) increases average uptime (time before a
> hang/lockup happens) from 4 minutes to 36 minutes.

That code path is used only for type != ARCH_TIMER_TYPE_CP15,
but your kernel log

arch_timer: cp15 timer(s) running at 0.03MHz (virt).

suggest that type == ARCH_TIMER_TYPE_CP15?!?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel