[PATCH] tracing: Add __string_src() helper to help compilers not to get confused

2024-03-14 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The __string() helper macro of the TRACE_EVENT() macro is used to
determine how much of the ring buffer needs to be allocated to fit the
given source string. Some trace events have a string that is dependent on
another variable that could be NULL, and in those cases the string is
passed in to be NULL.

The __string() macro can handle being passed in a NULL pointer for which
it will turn it into "(null)". It does that with:

  strlen((src) ? (const char *)(src) : "(null)") + 1

But if src itself has the same conditional type it can confuse the
compiler. That is:

  __string(r ? dev(r)->name : NULL)

Would turn into:

 strlen((r ? dev(r)->name : NULL) ? (r ? dev(r)->name : NULL) : "(null)" + 1

For which the compiler thinks that NULL is being passed to strlen() and
gives this kind of warning:

./include/trace/stages/stage5_get_offsets.h:50:21: warning: argument 1 null 
where non-null expected [-Wnonnull]
   50 | strlen((src) ? (const char *)(src) : "(null)") + 1)

Instead, create a static inline function that takes the src string and
will return the string if it is not NULL and will return "(null)" if it
is. This will then make the strlen() line:

 strlen(__string_src(src)) + 1

Where the compiler can see that strlen() will not end up with NULL and
does not warn about it.

Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix
tracepoints that save qdisc_dev() as a string") being applied, as passing
the qdisc_dev() into __string_src() will give an error.

Link: https://lore.kernel.org/all/ZfNmfCmgCs4Nc+EH@aschofie-mobl2/

Reported-by: Alison Schofield 
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage5_get_offsets.h | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/trace/stages/stage5_get_offsets.h 
b/include/trace/stages/stage5_get_offsets.h
index e6b96608f452..c6a62dfb18ef 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -9,6 +9,16 @@
 #undef __entry
 #define __entry entry
 
+#ifndef __STAGE5_STRING_SRC_H
+#define __STAGE5_STRING_SRC_H
+static inline const char *__string_src(const char *str)
+{
+   if (!str)
+  return EVENT_NULL_STR;
+   return str;
+}
+#endif /* __STAGE5_STRING_SRC_H */
+
 /*
  * Fields should never declare an array: i.e. __field(int, arr[5])
  * If they do, it will cause issues in parsing and possibly corrupt the
@@ -47,7 +57,7 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,
\
-   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
+   strlen(__string_src(src)) + 1)  \
__data_offsets->item##_ptr_ = src;
 
 #undef __string_len
@@ -70,7 +80,7 @@
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item,
\
-   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
+   strlen(__string_src(src)) + 1)  \
__data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
-- 
2.43.0




Re: [RFC PATCH 0/2] net: provides dim profile fine-tuning channels

2024-03-14 Thread Heng Qi




在 2024/3/15 上午2:49, Jakub Kicinski 写道:

On Thu, 14 Mar 2024 21:09:31 +0800 Heng Qi wrote:

The NetDIM library provides excellent acceleration for many modern
network cards. However, the default profiles of DIM limits its maximum
capabilities for different NICs, so providing a channel through which
the NIC can be custom configured is necessary.

Given that DIM is currently enabled and disable via ethtool
why are you putting the API is sysfs and ops in ndo?


Hi Jakub,

Thank you for reaching out. We're flexible regarding configuration 
methods and are
open to using either sysfs or ethtool, depending on what's most 
appropriate for the task at hand.


If the ethtool is favored, I am happy to proceed with it!

Best regards,
Heng




Re: [PATCH v2] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-14 Thread Jinghao Jia
On 3/14/24 10:17, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Read from an unsafe address with copy_from_kernel_nofault() in
> arch_adjust_kprobe_addr() because this function is used before checking
> the address is in text or not. Syzcaller bot found a bug and reported
> the case if user specifies inaccessible data area,
> arch_adjust_kprobe_addr() will cause a kernel panic.

IMHO there is a check on the address in kallsyms_lookup_size_offset to see
if it is a kernel text address before arch_adjust_kprobe_addr is invoked.

The call chain is:

register_kprobe()
  _kprobe_addr()
kallsyms_lookup_size_offset() <- check on addr is here
arch_adjust_kprobe_addr()

I wonder why this check was not able to capture the problem in this bug
report (I cannot reproduce it locally).

Thanks,
--Jinghao

> 
> Reported-by: Qiang Zhang 
> Closes: 
> https://urldefense.com/v3/__https://lore.kernel.org/all/cakhosas2rof6vqvbw_lg_j3qnku0canzr2qmy4et7r5lo8m...@mail.gmail.com/__;!!DZ3fjg!_C9Dn6-GBlkyS2z34bDUBsEXkTkgWp45MNrd4Rl5I5slz2A3SrurXOxKupsxLMqxC2BMiySCTfB2-5fPhkLP1g$
>  
> Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v2:
>   - Fix to return NULL if failed to access the address.
> ---
>  arch/x86/kernel/kprobes/core.c |   11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index a0ce46c0a2d8..95e4ebe5d514 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -335,7 +335,16 @@ static int can_probe(unsigned long paddr)
>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
> offset,
>bool *on_func_entry)
>  {
> - if (is_endbr(*(u32 *)addr)) {
> + u32 insn;
> +
> + /*
> +  * Since addr is not guaranteed to be safely accessed yet, use
> +  * copy_from_kernel_nofault() to get the instruction.
> +  */
> + if (copy_from_kernel_nofault(, (void *)addr, sizeof(u32)))
> + return NULL;
> +
> + if (is_endbr(insn)) {
>   *on_func_entry = !offset || offset == 4;
>   if (*on_func_entry)
>   offset = 4;
> 


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [GIT PULL] OpenRISC updates for 6.9

2024-03-14 Thread pr-tracker-bot
The pull request you sent on Wed, 13 Mar 2024 16:39:39 +:

> https://github.com/openrisc/linux.git tags/for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/29da654bd20842d4c1e17c6d4dc1b12642ca16ac

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-14 Thread Björn Töpel
Björn Töpel  writes:

> Puranjay Mohan  writes:
>
>> Björn Töpel  writes:
>>
>>>
>>> Hmm, depending on RISC-V's CMODX path, the pro/cons CALL_OPS vs dynamic
>>> trampolines changes quite a bit.
>>>
>>> The more I look at the pains of patching two instruction ("split
>>> immediates"), the better "patch data" + one insn patching look.
>>
>> I was looking at how dynamic trampolines would be implemented for RISC-V.
>>
>> With CALL-OPS we need to patch the auipc+jalr at function entry only, the
>> ops pointer above the function can be patched atomically.
>>
>> With a dynamic trampoline we need a auipc+jalr pair at function entry to jump
>> to the trampoline and then another auipc+jalr pair to jump from trampoline to
>> ops->func. When the ops->func is modified, we would need to update the
>> auipc+jalr at in the trampoline.
>>
>> So, I am not sure how to move forward here, CALL-OPS or Dynamic trampolines?
>
> Yeah. Honestly, we need to figure out the patching story prior
> choosing the path, so let's start there.
>
> After reading Mark's reply, and discussing with OpenJDK folks (who does
> the most crazy text patching on all platforms), having to patch multiple
> instructions (where the address materialization is split over multiple
> instructions) is a no-go. It's just a too big can of worms. So, if we
> can only patch one insn, it's CALL_OPS.
>
> A couple of options (in addition to Andy's), and all require a
> per-function landing address ala CALL_OPS) tweaking what Mark is doing
> on Arm (given the poor branch range).
>
> ...and maybe we'll get RISC-V rainbows/unicorns in the future getting
> better reach (full 64b! ;-)).
>
> A) Use auipc/jalr, only patch jalr to take us to a common
>dispatcher/trampoline
>   
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   aupic
>  |   nop <=> jalr # Text patch point -> common_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | common_dispatch:
>  |   load  based on ra
>  |   jalr
>  |   ...
>
> The auipc is never touched, and will be overhead. Also, we need a mv to
> store ra in a scratch register as well -- like Arm. We'll have two insn
> per-caller overhead for a disabled caller.
>
> B) Use jal, which can only take us +/-1M, and requires multiple
>dispatchers (and tracking which one to use, and properly distribute
>them. Ick.)
>
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | within_1M_to_func_dispatch:
>  |   load  based on ra
>  |   jalr
>
> C) Use jal, which can only take us +/-1M, and use a per-function
>trampoline requires multiple dispatchers (and tracking which one to
>use). Blows up text size A LOT.
>
>  |  # somewhere, but probably on a different 
> cacheline than the .text to avoid ping-ongs
>  | ...
>  | per_func_dispatch
>  |   load  based on ra
>  |   jalr
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> per_func_dispatch
>  |   ACTUAL_FUNC

Brendan proposed yet another option, "in-function dispatch":

D) 
  |  # idk somewhere
  | ...
  | func:
  |mv tmp1, ra
  |aupic tmp2, 
  |mv tmp3, zero <=> ld tmp3, 
tmp2
  |nop <=> jalr ra, tmp3
  |ACTUAL_FUNC

There are 4 CMODX possiblities:
   mv, nop:  fully disabled, no problems
   mv, jalr: We will jump to zero. We would need to have the inst
 page/access fault handler take care of this case. Especially
 if we align the instructions so that they can be patched
 together, being interrupted in the middle and taking this
 path will be rare.
  ld, nop:   no problems
  ld, jalr:  fully enabled, no problems

Patching is a 64b store/sd, and we only need a fence.i at the end, since
we can handle all 4 possibilities.

For the disabled case we'll have:
A) mv, aupic, nop
D) mv, aupic, mv, nop.

Puranjay, I've flipped. Let's go Mark's CALL_OPS together with a new
text patch mechanism w/o stop_machine().


Björn



[GIT PULL] NVDIMM//DAX changes for 6.9

2024-03-14 Thread Dave Jiang
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-for-6.9

... to get updates to the nvdimm tree. They are a number of updates to 
interfaces used by nvdimm/dax and a documentation fix.

Doc fixes:
ACPI_NFIT Kconfig documetation fix

Updates:
Make nvdimm_bus_type and dax_bus_type const
Remove SLAB_MEM_SPREAD flag usage in DAX

They have all been in the -next for more than a week with no reported issues.

---

The following changes since commit 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478:

  Linux 6.8-rc3 (2024-02-04 12:20:36 +)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
tags/libnvdimm-for-6.9

for you to fetch changes up to d9212b35da52109361247b66010802d43c6b1f0d:

  dax: remove SLAB_MEM_SPREAD flag usage (2024-03-04 09:10:37 -0700)


libnvdimm updates for v6.9

- ACPI_NFIT Kconfig documetation fix
- Make nvdimm_bus_type const
- Make dax_bus_type const
- remove SLAB_MEM_SPREAD flag usage in DAX


Chengming Zhou (1):
  dax: remove SLAB_MEM_SPREAD flag usage

Peter Robinson (1):
  libnvdimm: Fix ACPI_NFIT in BLK_DEV_PMEM help

Ricardo B. Marliere (2):
  nvdimm: make nvdimm_bus_type const
  device-dax: make dax_bus_type const

 drivers/dax/bus.c  | 2 +-
 drivers/dax/super.c| 2 +-
 drivers/nvdimm/Kconfig | 2 +-
 drivers/nvdimm/bus.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)



[PATCH v2 1/2] dt-bindings: arm: qcom: Add Samsung Galaxy Note 3

2024-03-14 Thread Luca Weiss
Add the compatible for this Samsung smartphone ("phablet" as it was
named in that era).

Acked-by: Krzysztof Kozlowski 
Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml 
b/Documentation/devicetree/bindings/arm/qcom.yaml
index 1a5fb889a444..57182bfa27ee 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -214,6 +214,7 @@ properties:
   - items:
   - enum:
   - lge,hammerhead
+  - samsung,hlte
   - sony,xperia-amami
   - sony,xperia-honami
   - const: qcom,msm8974

-- 
2.44.0




[PATCH v2 2/2] ARM: dts: qcom: msm8974: Add Samsung Galaxy Note 3

2024-03-14 Thread Luca Weiss
From: Adam Honse 

Add the devicetree for this "phablet" using the Snapdragon 800 SoC.

Signed-off-by: Adam Honse 
[l...@z3ntu.xyz: clean up, prepare for upstream]
Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/Makefile|   1 +
 .../boot/dts/qcom/qcom-msm8974-samsung-hlte.dts| 401 +
 2 files changed, 402 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
index 9cc1e14e6cd0..845af12d15a2 100644
--- a/arch/arm/boot/dts/qcom/Makefile
+++ b/arch/arm/boot/dts/qcom/Makefile
@@ -39,6 +39,7 @@ dtb-$(CONFIG_ARCH_QCOM) += \
qcom-msm8960-cdp.dtb \
qcom-msm8960-samsung-expressatt.dtb \
qcom-msm8974-lge-nexus5-hammerhead.dtb \
+   qcom-msm8974-samsung-hlte.dtb \
qcom-msm8974-sony-xperia-rhine-amami.dtb \
qcom-msm8974-sony-xperia-rhine-honami.dtb \
qcom-msm8974pro-fairphone-fp2.dtb \
diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts
new file mode 100644
index ..903bb4d12513
--- /dev/null
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts
@@ -0,0 +1,401 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "qcom-msm8974.dtsi"
+#include "pm8841.dtsi"
+#include "pm8941.dtsi"
+#include 
+#include 
+#include 
+
+/ {
+   model = "Samsung Galaxy Note 3";
+   compatible = "samsung,hlte", "qcom,msm8974";
+   chassis-type = "handset";
+
+   aliases {
+   mmc0 = _1; /* SDC1 eMMC slot */
+   mmc1 = _3; /* SDC3 SD card slot */
+   serial0 = _uart1;
+   };
+
+   chosen {
+   stdout-path = "serial0:115200n8";
+   };
+
+   gpio-keys {
+   compatible = "gpio-keys";
+
+   pinctrl-0 = <_keys_pin_a>;
+   pinctrl-names = "default";
+
+   key-home {
+   label = "Home Key";
+   gpios = <_gpios 3 GPIO_ACTIVE_LOW>;
+   linux,code = ;
+   wakeup-source;
+   debounce-interval = <15>;
+   };
+
+   key-volume-down {
+   label = "Volume Down";
+   gpios = <_gpios 2 GPIO_ACTIVE_LOW>;
+   linux,code = ;
+   debounce-interval = <15>;
+   };
+
+   key-volume-up {
+   label = "Volume Up";
+   gpios = <_gpios 5 GPIO_ACTIVE_LOW>;
+   linux,code = ;
+   debounce-interval = <15>;
+   };
+   };
+
+   touch_ldo: regulator-touch {
+   compatible = "regulator-fixed";
+   regulator-name = "touch-ldo";
+
+   gpio = <_gpios 9 GPIO_ACTIVE_HIGH>;
+   enable-active-high;
+   regulator-boot-on;
+
+   pinctrl-0 = <_ldo_pin>;
+   pinctrl-names = "default";
+   };
+};
+
+_i2c2 {
+   status = "okay";
+
+   touchscreen@20 {
+   compatible = "syna,rmi4-i2c";
+   reg = <0x20>;
+
+   interrupt-parent = <_gpios>;
+   interrupts = <30 IRQ_TYPE_EDGE_FALLING>;
+
+   vdd-supply = <_l10>;
+   vio-supply = <_ldo>;
+
+   pinctrl-0 = <_pin>;
+   pinctrl-names = "default";
+
+   syna,startup-delay-ms = <100>;
+
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   rmi4-f01@1 {
+   reg = <0x1>;
+   syna,nosleep-mode = <1>;
+   };
+
+   rmi4-f12@12 {
+   reg = <0x12>;
+   syna,sensor-type = <1>;
+   };
+   };
+};
+
+_i2c6 {
+   status = "okay";
+
+   fuelgauge@36 {
+   compatible = "maxim,max17048";
+   reg = <0x36>;
+
+   maxim,double-soc;
+   maxim,rcomp = /bits/ 8 <0x56>;
+
+   interrupt-parent = <_gpios>;
+   interrupts = <26 IRQ_TYPE_EDGE_FALLING>;
+
+   pinctrl-0 = <_pin>;
+   pinctrl-names = "default";
+   };
+};
+
+_uart2 {
+   status = "okay";
+};
+
+_gpios {
+   gpio_keys_pin_a: gpio-keys-active-state {
+   pins = "gpio2", "gpio3", "gpio5";
+   function = "normal";
+   bias-pull-up;
+   power-source = ;
+   };
+
+   fuelgauge_pin: fuelgauge-int-state {
+   pins = "gpio26";
+   function = "normal";
+   bias-disable;
+   input-enable;
+   power-source = ;
+   };
+
+   touch_pin: touchscreen-int-state {
+   pins = "gpio30";
+   function = "normal";
+   bias-disable;
+   input-enable;
+   power-source = ;
+   };
+
+   touch_ldo_pin: 

[PATCH v2 0/2] Add Samsung Galaxy Note 3 support

2024-03-14 Thread Luca Weiss
Add the dts for "hlte" which is a phablet from 2013.

Signed-off-by: Luca Weiss 
---
Changes in v2:
- Correct property order (Konrad)
- Pick up tags
- Link to v1: 
https://lore.kernel.org/r/20240310-samsung-hlte-v1-0-e9b55bf98...@z3ntu.xyz

---
Adam Honse (1):
  ARM: dts: qcom: msm8974: Add Samsung Galaxy Note 3

Luca Weiss (1):
  dt-bindings: arm: qcom: Add Samsung Galaxy Note 3

 Documentation/devicetree/bindings/arm/qcom.yaml|   1 +
 arch/arm/boot/dts/qcom/Makefile|   1 +
 .../boot/dts/qcom/qcom-msm8974-samsung-hlte.dts| 401 +
 3 files changed, 403 insertions(+)
---
base-commit: 90d35da658da8cff0d4ecbb5113f5fac9d00eb72
change-id: 20240310-samsung-hlte-78d1a287b0a8

Best regards,
-- 
Luca Weiss 




[PATCH v2 3/3] ARM: dts: qcom: Add Sony Xperia Z3 smartphone

2024-03-14 Thread Luca Weiss
Add the dts for the Xperia Z3 smartphone which is based on Sony's
shinano platform, so at the moment there's little device-specific dts to
add on top of the common parts.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/Makefile|  1 +
 .../qcom-msm8974pro-sony-xperia-shinano-leo.dts| 44 ++
 2 files changed, 45 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
index 9cc1e14e6cd0..92eca505a4ab 100644
--- a/arch/arm/boot/dts/qcom/Makefile
+++ b/arch/arm/boot/dts/qcom/Makefile
@@ -45,6 +45,7 @@ dtb-$(CONFIG_ARCH_QCOM) += \
qcom-msm8974pro-oneplus-bacon.dtb \
qcom-msm8974pro-samsung-klte.dtb \
qcom-msm8974pro-sony-xperia-shinano-castor.dtb \
+   qcom-msm8974pro-sony-xperia-shinano-leo.dtb \
qcom-mdm9615-wp8548-mangoh-green.dtb \
qcom-sdx55-mtp.dtb \
qcom-sdx55-t55.dtb \
diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-leo.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-leo.dts
new file mode 100644
index ..1ed6e1cc21d5
--- /dev/null
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-leo.dts
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "qcom-msm8974pro-sony-xperia-shinano-common.dtsi"
+
+/ {
+   model = "Sony Xperia Z3";
+   compatible = "sony,xperia-leo", "qcom,msm8974pro", "qcom,msm8974";
+   chassis-type = "handset";
+
+   gpio-keys {
+   key-camera-snapshot {
+   label = "camera_snapshot";
+   gpios = <_gpios 3 GPIO_ACTIVE_LOW>;
+   linux,code = ;
+   debounce-interval = <15>;
+   };
+
+   key-camera-focus {
+   label = "camera_focus";
+   gpios = <_gpios 4 GPIO_ACTIVE_LOW>;
+   linux,code = ;
+   debounce-interval = <15>;
+   };
+   };
+};
+
+_keys_pin_a {
+   pins = "gpio2", "gpio3", "gpio4", "gpio5";
+};
+
+ {
+   usb-charge-current-limit = <150>;
+   qcom,fast-charge-safe-current = <300>;
+   qcom,fast-charge-current-limit = <215>;
+   qcom,fast-charge-safe-voltage = <440>;
+   qcom,fast-charge-high-threshold-voltage = <435>;
+   qcom,auto-recharge-threshold-voltage = <428>;
+   qcom,minimum-input-voltage = <420>;
+
+   status = "okay";
+};
+
+_touchscreen {
+   vio-supply = <_s3>;
+};

-- 
2.44.0




[PATCH v2 2/3] dt-bindings: arm: qcom: Add Sony Xperia Z3

2024-03-14 Thread Luca Weiss
Add the compatible for this Sony smartphone.

Acked-by: Krzysztof Kozlowski 
Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml 
b/Documentation/devicetree/bindings/arm/qcom.yaml
index 1a5fb889a444..d6a7ee5e1d91 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -224,6 +224,7 @@ properties:
   - oneplus,bacon
   - samsung,klte
   - sony,xperia-castor
+  - sony,xperia-leo
   - const: qcom,msm8974pro
   - const: qcom,msm8974
 

-- 
2.44.0




[PATCH v2 1/3] ARM: dts: qcom: msm8974-sony-castor: Split into shinano-common

2024-03-14 Thread Luca Weiss
In preparation for adding the Sony Xperia Z3 smartphone, split the
common parts into shinano-common.dtsi.

No functional change intended.

Reviewed-by: Konrad Dybcio 
Signed-off-by: Luca Weiss 
---
 .../qcom-msm8974pro-sony-xperia-shinano-castor.dts | 863 +
 ...com-msm8974pro-sony-xperia-shinano-common.dtsi} | 155 +---
 2 files changed, 179 insertions(+), 839 deletions(-)

diff --git 
a/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-castor.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-castor.dts
dissimilarity index 74%
index 20f98a9e49ea..409d1798de34 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-castor.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974pro-sony-xperia-shinano-castor.dts
@@ -1,686 +1,177 @@
-// SPDX-License-Identifier: GPL-2.0
-#include "qcom-msm8974pro.dtsi"
-#include "pm8841.dtsi"
-#include "pm8941.dtsi"
-#include 
-#include 
-#include 
-
-/ {
-   model = "Sony Xperia Z2 Tablet";
-   compatible = "sony,xperia-castor", "qcom,msm8974pro", "qcom,msm8974";
-   chassis-type = "tablet";
-
-   aliases {
-   mmc0 = _1;
-   mmc1 = _2;
-   serial0 = _uart2;
-   serial1 = _uart1;
-   };
-
-   chosen {
-   stdout-path = "serial0:115200n8";
-   };
-
-   gpio-keys {
-   compatible = "gpio-keys";
-
-   pinctrl-0 = <_keys_pin_a>;
-   pinctrl-names = "default";
-
-   key-volume-down {
-   label = "volume_down";
-   gpios = <_gpios 2 GPIO_ACTIVE_LOW>;
-   linux,code = ;
-   debounce-interval = <15>;
-   };
-
-   key-volume-up {
-   label = "volume_up";
-   gpios = <_gpios 5 GPIO_ACTIVE_LOW>;
-   linux,code = ;
-   debounce-interval = <15>;
-   };
-   };
-
-   vreg_bl_vddio: lcd-backlight-vddio {
-   compatible = "regulator-fixed";
-   regulator-name = "vreg_bl_vddio";
-   regulator-min-microvolt = <315>;
-   regulator-max-microvolt = <315>;
-
-   gpio = < 69 0>;
-   enable-active-high;
-
-   vin-supply = <_s3>;
-   startup-delay-us = <7>;
-
-   pinctrl-0 = <_backlight_en_pin_a>;
-   pinctrl-names = "default";
-   };
-
-   vreg_vsp: lcd-dcdc-regulator {
-   compatible = "regulator-fixed";
-   regulator-name = "vreg_vsp";
-   regulator-min-microvolt = <560>;
-   regulator-max-microvolt = <560>;
-
-   gpio = <_gpios 20 GPIO_ACTIVE_HIGH>;
-   enable-active-high;
-
-   pinctrl-0 = <_dcdc_en_pin_a>;
-   pinctrl-names = "default";
-   };
-
-   vreg_boost: vreg-boost {
-   compatible = "regulator-fixed";
-
-   regulator-name = "vreg-boost";
-   regulator-min-microvolt = <315>;
-   regulator-max-microvolt = <315>;
-
-   regulator-always-on;
-   regulator-boot-on;
-
-   gpio = <_gpios 21 GPIO_ACTIVE_HIGH>;
-   enable-active-high;
-
-   pinctrl-names = "default";
-   pinctrl-0 = <_bypass_n_pin>;
-   };
-
-   vreg_vph_pwr: vreg-vph-pwr {
-   compatible = "regulator-fixed";
-   regulator-name = "vph-pwr";
-
-   regulator-min-microvolt = <360>;
-   regulator-max-microvolt = <360>;
-
-   regulator-always-on;
-   };
-
-   vreg_wlan: wlan-regulator {
-   compatible = "regulator-fixed";
-
-   regulator-name = "wl-reg";
-   regulator-min-microvolt = <330>;
-   regulator-max-microvolt = <330>;
-
-   gpio = <_gpios 18 GPIO_ACTIVE_HIGH>;
-   enable-active-high;
-
-   pinctrl-0 = <_regulator_pin>;
-   pinctrl-names = "default";
-   };
-};
-
-_uart2 {
-   status = "okay";
-};
-
-_i2c2 {
-   clock-frequency = <355000>;
-
-   status = "okay";
-
-   synaptics@2c {
-   compatible = "syna,rmi4-i2c";
-   reg = <0x2c>;
-
-   interrupt-parent = <>;
-   interrupts = <86 IRQ_TYPE_EDGE_FALLING>;
-
-   #address-cells = <1>;
-   #size-cells = <0>;
-
-   vdd-supply = <_l22>;
-   vio-supply = <_lvs3>;
-
-   pinctrl-0 = <_int_pin>;
-   pinctrl-names = "default";
-
-   syna,startup-delay-ms = <100>;
-
-   rmi4-f01@1 {
-   reg = <0x1>;
-   syna,nosleep-mode = <1>;
-   };
-
-   rmi4-f11@11 {
-   reg = <0x11>;
-   

[PATCH v2 0/3] Split sony-castor into shinano-common and add Sony Xperia Z3

2024-03-14 Thread Luca Weiss
Prepare for adding sony-leo dts by splitting common parts into a
separate dtsi file.

Then add the dts for Sony Xperia Z3.

Depends on:
https://lore.kernel.org/linux-arm-msm/20240306-castor-changes-v1-0-2286eaf85...@z3ntu.xyz/T/

Signed-off-by: Luca Weiss 
---
Changes in v2:
- Add leo dtb to Makefile
- Add newlines between the subnodes in the backlight node
- Pick up tags
- Link to v1: 
https://lore.kernel.org/r/20240310-shinano-common-v1-0-d64cd322e...@z3ntu.xyz

---
Luca Weiss (3):
  ARM: dts: qcom: msm8974-sony-castor: Split into shinano-common
  dt-bindings: arm: qcom: Add Sony Xperia Z3
  ARM: dts: qcom: Add Sony Xperia Z3 smartphone

 Documentation/devicetree/bindings/arm/qcom.yaml|   1 +
 arch/arm/boot/dts/qcom/Makefile|   1 +
 .../qcom-msm8974pro-sony-xperia-shinano-castor.dts | 551 +
 ...qcom-msm8974pro-sony-xperia-shinano-common.dtsi | 535 
 .../qcom-msm8974pro-sony-xperia-shinano-leo.dts|  44 ++
 5 files changed, 602 insertions(+), 530 deletions(-)
---
base-commit: bee52eeb37d8124a07711657d1650bf3b467e7dd
change-id: 20240310-shinano-common-093fe25fe3a1

Best regards,
-- 
Luca Weiss 




Re: [RFC PATCH 0/2] net: provides dim profile fine-tuning channels

2024-03-14 Thread Jakub Kicinski
On Thu, 14 Mar 2024 21:09:31 +0800 Heng Qi wrote:
> The NetDIM library provides excellent acceleration for many modern
> network cards. However, the default profiles of DIM limits its maximum
> capabilities for different NICs, so providing a channel through which
> the NIC can be custom configured is necessary.

Given that DIM is currently enabled and disable via ethtool
why are you putting the API is sysfs and ops in ndo?



Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2024 09:57:57 -0700
Alison Schofield  wrote:

> On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >This is a treewide change. I will likely re-create this patch again in
> >the second week of the merge window of v6.9 and submit it then. Hoping
> >to keep the conflicts that it will cause to a minimum.
> > ]

Note, change of plans. I plan on sending this in the next merge window, as
this merge window I have this patch:

  
https://lore.kernel.org/linux-trace-kernel/20240312113002.00031...@gandalf.local.home/

That will warn if the source string of __string() is different than the
source string of __assign_str(). I want to make sure they are identical
before just dropping one of them.


> 
> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index bdf117a33744..07ba4e033347 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h  
> 
> snip to poison
> 
> > @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> > ),
> >  
> > TP_fast_assign(
> > -   __assign_str(memdev, dev_name(>dev));
> > -   __assign_str(host, dev_name(cxlmd->dev.parent));
> > +   __assign_str(memdev);
> > +   __assign_str(host);  
> 
> I think I get that the above changes work because the TP_STRUCT__entry for
> these did:
>   __string(memdev, dev_name(>dev))
>   __string(host, dev_name(cxlmd->dev.parent))

That's the point. They have to be identical or you will likely bug.

The __string(name, src) is used to find the string length of src which
allocates the necessary length on the ring buffer. The __assign_str(name, src)
will copy src into the ring buffer.

Similar to:

len = strlen(src);
buf = malloc(len);
strcpy(buf, str);

Where __string() is strlen() and __assign_str() is strcpy(). It doesn't
make sense to use two different strings, and if you did, it would likely be
a bug.

But the magic behind __string() does much more than just get the length of
the string, and it could easily save the pointer to the string (along with
its length) and have it copy that in the __assign_str() call, making the
src parameter of __assign_str() useless.

> 
> > __entry->serial = cxlmd->cxlds->serial;
> > __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> > __entry->dpa = cxl_poison_record_dpa(record);
> > @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> > __entry->trace_type = trace_type;
> > __entry->flags = flags;
> > if (region) {
> > -   __assign_str(region, dev_name(>dev));
> > +   __assign_str(region);
> > memcpy(__entry->uuid, >params.uuid, 16);
> > __entry->hpa = cxl_trace_hpa(region, cxlmd,
> >  __entry->dpa);
> > } else {
> > -   __assign_str(region, "");
> > +   __assign_str(region);
> > memset(__entry->uuid, 0, 16);
> > __entry->hpa = ULLONG_MAX;  
> 
> For the above 2, there was no helper in TP_STRUCT__entry. A recently
> posted patch is fixing that up to be __string(region, NULL) See [1],
> with the actual assignment still happening in TP_fast_assign.

__string(region, NULL) doesn't make sense. It's like:

len = strlen(NULL);
buf = malloc(len);
strcpy(buf, NULL);

??

I'll reply to that email.

-- Steve

> 
> Does that assign logic need to move to the TP_STRUCT__entry definition
> when you merge these changes? I'm not clear how much logic is able to be
> included, ie like 'C' style code in the TP_STRUCT__entry.
> 
> [1]
> https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofi...@intel.com/



Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-03-14 Thread Alison Schofield
On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> [
>This is a treewide change. I will likely re-create this patch again in
>the second week of the merge window of v6.9 and submit it then. Hoping
>to keep the conflicts that it will cause to a minimum.
> ]
> 
> With the rework of how the __string() handles dynamic strings where it
> saves off the source string in field in the helper structure[1], the
> assignment of that value to the trace event field is stored in the helper
> value and does not need to be passed in again.
> 
> This means that with:
> 
>   __string(field, mystring)
> 
> Which use to be assigned with __assign_str(field, mystring), no longer
> needs the second parameter and it is unused. With this, __assign_str()
> will now only get a single parameter.
> 
> There's over 700 users of __assign_str() and because coccinelle does not
> handle the TRACE_EVENT() macro I ended up using the following sed script:
> 
>   git grep -l __assign_str | while read a ; do
>   sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
>   mv /tmp/test-file $a;
>   done
> 
> I then searched for __assign_str() that did not end with ';' as those
> were multi line assignments that the sed script above would fail to catch.
> 
> Note, the same updates will need to be done for:
> 
>   __assign_str_len()
>   __assign_rel_str()
>   __assign_rel_str_len()
>   __assign_bitmask()
>   __assign_rel_bitmask()
>   __assign_cpumask()
>   __assign_rel_cpumask()
> 
> [1] 
> https://lore.kernel.org/linux-trace-kernel/2024011442.634192...@goodmis.org/
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  arch/arm64/kernel/trace-events-emulation.h|   2 +-
>  arch/powerpc/include/asm/trace.h  |   4 +-
>  arch/x86/kvm/trace.h  |   2 +-
>  drivers/base/regmap/trace.h   |  18 +--
>  drivers/base/trace.h  |   2 +-
>  drivers/block/rnbd/rnbd-srv-trace.h   |  12 +-
>  drivers/cxl/core/trace.h  |  24 ++--

snip to CXL


> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index bdf117a33744..07ba4e033347 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h

snip to poison

> @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
>   ),
>  
>   TP_fast_assign(
> - __assign_str(memdev, dev_name(>dev));
> - __assign_str(host, dev_name(cxlmd->dev.parent));
> + __assign_str(memdev);
> + __assign_str(host);

I think I get that the above changes work because the TP_STRUCT__entry for
these did:
__string(memdev, dev_name(>dev))
__string(host, dev_name(cxlmd->dev.parent))

>   __entry->serial = cxlmd->cxlds->serial;
>   __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
>   __entry->dpa = cxl_poison_record_dpa(record);
> @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
>   __entry->trace_type = trace_type;
>   __entry->flags = flags;
>   if (region) {
> - __assign_str(region, dev_name(>dev));
> + __assign_str(region);
>   memcpy(__entry->uuid, >params.uuid, 16);
>   __entry->hpa = cxl_trace_hpa(region, cxlmd,
>__entry->dpa);
>   } else {
> - __assign_str(region, "");
> + __assign_str(region);
>   memset(__entry->uuid, 0, 16);
>   __entry->hpa = ULLONG_MAX;

For the above 2, there was no helper in TP_STRUCT__entry. A recently
posted patch is fixing that up to be __string(region, NULL) See [1],
with the actual assignment still happening in TP_fast_assign.

Does that assign logic need to move to the TP_STRUCT__entry definition
when you merge these changes? I'm not clear how much logic is able to be
included, ie like 'C' style code in the TP_STRUCT__entry.

[1]
https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofi...@intel.com/

Thanks for helping,
Alison


>   }






Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-14 Thread Mathieu Poirier
On Thu, 14 Mar 2024 at 08:59, Sudeep Holla  wrote:
>
> On Thu, Mar 14, 2024 at 08:52:59AM -0600, Mathieu Poirier wrote:
> > On Wed, Mar 13, 2024 at 05:17:56PM +, Abdellatif El Khlifi wrote:
> > > Hi Mathieu,
> > >
> > > On Wed, Mar 13, 2024 at 10:25:32AM -0600, Mathieu Poirier wrote:
> > > > On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> > > > > Hi Mathieu,
> > > > >
> > > > > On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > > > > > This is an initial patchset for allowing to turn on and off the 
> > > > > > > remote processor.
> > > > > > > The FW is already loaded before the Corstone-1000 SoC is powered 
> > > > > > > on and this
> > > > > > > is done through the FPGA board bootloader in case of the FPGA 
> > > > > > > target. Or by the Corstone-1000 FVP model
> > > > > > > (emulator).
> > > > > > >
> > > > > > >From the above I take it that booting with a preloaded firmware is 
> > > > > > >a
> > > > > > scenario that needs to be supported and not just a temporary stage.
> > > > >
> > > > > The current status of the Corstone-1000 SoC requires that there is
> > > > > a preloaded firmware for the external core. Preloading is done 
> > > > > externally
> > > > > either through the FPGA bootloader or the emulator (FVP) before 
> > > > > powering
> > > > > on the SoC.
> > > > >
> > > >
> > > > Ok
> > > >
> > > > > Corstone-1000 will be upgraded in a way that the A core running Linux 
> > > > > is able
> > > > > to share memory with the remote core and also being able to access 
> > > > > the remote
> > > > > core memory so Linux can copy the firmware to. This HW changes are 
> > > > > still
> > > > > This is why this patchset is relying on a preloaded firmware. And 
> > > > > it's the step 1
> > > > > of adding remoteproc support for Corstone.
> > > > >
> > > >
> > > > Ok, so there is a HW problem where A core and M core can't see each 
> > > > other's
> > > > memory, preventing the A core from copying the firmware image to the 
> > > > proper
> > > > location.
> > > >
> > > > When the HW is fixed, will there be a need to support scenarios where 
> > > > the
> > > > firmware image has been preloaded into memory?
> > >
> > > No, this scenario won't apply when we get the HW upgrade. No need for an
> > > external entity anymore. The firmware(s) will all be files in the linux 
> > > filesystem.
> > >
> >
> > Very well.  I am willing to continue with this driver but it does so little 
> > that
> > I wonder if it wouldn't simply be better to move forward with upstreaming 
> > when
> > the HW is fixed.  The choice is yours.
> >
>
> I think Robin has raised few points that need clarification. I think it was
> done as part of DT binding patch. I share those concerns and I wanted to
> reaching to the same concerns by starting the questions I asked on corstone
> device tree changes.
>

I also agree with Robin's point of view.  Proceeding with an initial
driver with minimal functionality doesn't preclude having complete
bindings.  But that said and as I pointed out, it might be better to
wait for the HW to be fixed before moving forward.

> --
> Regards,
> Sudeep



Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2024 15:39:28 +0100
Paolo Abeni  wrote:

> On Wed, 2024-03-13 at 09:34 -0400, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >Note, I need to take this patch through my tree, so I'm looking for 
> > acks.  
> 
> Note that this device driver is changing quite rapidly, so I expect
> some conflicts here later. I guess Liuns will have to handle them ;)

Well, it merges fine with linux-next and linus's current master. ;-)

> 
> >This causes the build to fail when I add the __assign_str() check, which
> >I was about to push to Linus, but it breaks allmodconfig due to this 
> > error.
> > ]
> > 
> > The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
> > are going through some optimizations where only the source string of
> > __string() will be used and the __assign_str() source will be ignored and
> > later removed.
> > 
> > To make sure that there's no issues, a new check is added between the
> > __string() src argument and the __assign_str() src argument that does a
> > strcmp() to make sure they are the same string.
> > 
> > The hclgevf trace events have:
> > 
> >   __assign_str(devname, >nic.kinfo.netdev->name);
> > 
> > Which triggers the warning:
> > 
> > hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from 
> > incompatible pointer type [-Werror=incompatible-pointer-types]
> >34 | __assign_str(devname, 
> > >nic.kinfo.netdev->name);
> >  [..]
> > arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
> > argument is of type ‘char (*)[16]’
> >75 | int strcmp(const char *cs, const char *ct);
> >   |^~
> > 
> > 
> > Because __assign_str() now has:
> > 
> > WARN_ON_ONCE(__builtin_constant_p(src) ?\
> >  strcmp((src), __data_offsets.dst##_ptr_) : \
> >  (src) != __data_offsets.dst##_ptr_);   \
> > 
> > The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
> > that name is:
> > 
> > charname[IFNAMSIZ]
> > 
> > Where passing an address '&' of a char array is not compatible with 
> > strcmp().
> > 
> > The '&' is not necessary, remove it.
> > 
> > Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF 
> > mailbox")  
> 
> checkpactch in strict mode complains the hash is not 12 char long.

Hmm, I wonder why my git blame gives me 13 characters in the sha. (I cut
and pasted it from git blame). My git config has:

[core]  
abbrev = 12


> 
> > Signed-off-by: Steven Rostedt (Google)   
> 
> FWIW
> 
> Acked-by: Paolo Abeni 

Thanks!

-- Steve



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-14 Thread Sudeep Holla
On Thu, Mar 14, 2024 at 03:16:53PM +, Abdellatif El Khlifi wrote:
> Hi Sudeep,
>
> On Thu, Mar 14, 2024 at 02:59:20PM +, Sudeep Holla wrote:
> >
> > I think Robin has raised few points that need clarification. I think it was
> > done as part of DT binding patch. I share those concerns and I wanted to
> > reaching to the same concerns by starting the questions I asked on corstone
> > device tree changes.
>
> Please have a look at my answer to Robin with clarifications [1].
>

Yes I did take a look at the response but am not convinced yet. I have
responded to you email with other details which I want to be explored.
I don't think we need to rush to add the *foundation driver* as you call
before the bindings are defined well.

--
Regards,
Sudeep



Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-14 Thread Abdellatif El Khlifi
Hi Krzysztof,

On Thu, Mar 14, 2024 at 02:56:53PM +0100, Krzysztof Kozlowski wrote:
> On 14/03/2024 14:49, Abdellatif El Khlifi wrote:
> >> Frankly at the moment I'd be inclined to say it isn't even a remoteproc
> >> binding (or driver) at all, it's a reset controller. Bindings are a 
> >> contract
> >> for describing the hardware, not the current state of Linux driver support 
> >> -
> >> if this thing still needs mailboxes, shared memory, a reset vector 
> >> register,
> >> or whatever else to actually be useful, those should be in the binding from
> >> day 1 so that a) people can write and deploy correct DTs now, such that
> >> functionality becomes available on their systems as soon as driver support
> >> catches up, and b) the community has any hope of being able to review
> >> whether the binding is appropriately designed and specified for the purpose
> >> it intends to serve.
> > 
> > This is an initial patchset for allowing to turn on and off the remote 
> > processor.
> > The FW is already loaded before the Corstone-1000 SoC is powered on and this
> > is done through the FPGA board bootloader in case of the FPGA target.
> > Or by the Corstone-1000 FVP model (emulator).
> > 
> > The plan for the driver is as follows:
> > 
> > Step 1: provide a foundation driver capable of turning the core on/off
> > Step 2: provide mailbox support for comms
> > Step 3: provide FW reload capability
> > 
> > Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) 
> > can
> > share memory with the remote core.
> > 
> > So, when memory sharing becomes available in the FPGA and FVP the
> > DT binding will be upgraded with:
> > 
> > - mboxes property specifying the RX/TX mailboxes (based on MHU v2)
> > - memory-region property describing the virtio vrings
> > 
> > Currently the mailbox controller does exist in the HW but is not
> > usable via virtio (no memory sharing available).
> > 
> > Do you recommend I add the mboxes property even currently we can't do the 
> > comms ?
> 
> Bindings should be complete, regardless whether Linux driver supports it
> or not. Please see writing bindings document for explanation on this and
> other rules.
> 
> So yes: please describe as much as possible/reasonable.

I'll do thanks.

Cheers,
Abdellatif



Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-14 Thread Sudeep Holla
On Thu, Mar 14, 2024 at 01:49:28PM +, Abdellatif El Khlifi wrote:
> Hi Robin,
> 
> > > +  firmware-name:
> > > +description: |
> > > +  Default name of the firmware to load to the remote processor.
> > 
> > So... is loading the firmware image achieved by somehow bitbanging it
> > through the one reset register, maybe? I find it hard to believe this is a
> > complete and functional binding.
> > 
> > Frankly at the moment I'd be inclined to say it isn't even a remoteproc
> > binding (or driver) at all, it's a reset controller. Bindings are a contract
> > for describing the hardware, not the current state of Linux driver support -
> > if this thing still needs mailboxes, shared memory, a reset vector register,
> > or whatever else to actually be useful, those should be in the binding from
> > day 1 so that a) people can write and deploy correct DTs now, such that
> > functionality becomes available on their systems as soon as driver support
> > catches up, and b) the community has any hope of being able to review
> > whether the binding is appropriately designed and specified for the purpose
> > it intends to serve.
> 
> This is an initial patchset for allowing to turn on and off the remote 
> processor.
> The FW is already loaded before the Corstone-1000 SoC is powered on and this
> is done through the FPGA board bootloader in case of the FPGA target.
> Or by the Corstone-1000 FVP model (emulator).
>

If this driver does the loading of the firmware, it will get reloaded. Do
you need any issues there ? The correctness matters here in the upstream
driver, it may not be optimised for you usecase now, but that is fine IMO.

> The plan for the driver is as follows:
>
> Step 1: provide a foundation driver capable of turning the core on/off
> Step 2: provide mailbox support for comms
> Step 3: provide FW reload capability
>
> Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) can
> share memory with the remote core.
>

Honestly, I would prefer to know the overall design before pushing any partial
solution. If you know the final complete solution, present the same with
the complete device tree binding for better understanding and review.

> So, when memory sharing becomes available in the FPGA and FVP the
> DT binding will be upgraded with:
>
> - mboxes property specifying the RX/TX mailboxes (based on MHU v2)
> - memory-region property describing the virtio vrings
>

Looks like you have the information now, please define the complete
bindings now.

> Currently the mailbox controller does exist in the HW but is not
> usable via virtio (no memory sharing available).
>

That is fine, if the plan is to use them, then include them in the design
of your DT bindings.

> Do you recommend I add the mboxes property even currently we can't do the 
> comms ?
>

Yes for sure IMO.

> > For instance right now it seems somewhat tenuous to describe two consecutive
> > 32-bit registers as separate "reg" entries, but *maybe* it's OK if that's
> > all there ever is. However if it's actually going to end up needing several
> > more additional MMIO and/or memory regions for other functionality, then
> > describing each register and location individually is liable to get
> > unmanageable really fast, and a higher-level functional grouping (e.g. these
> > reset-related registers together as a single 8-byte region) would likely be
> > a better design.
>

I completely agree with the above and this is what I was meant(I must admit
didn't put it forward with above clarity).

> Currently the HW provides only 2 registers to control the remote processors:
>
> The reset control and status registers.
>

Agreed, but it is part of a bigger block with other functionality in place.
MFD/syscon might be better way to use these registers. You never know in
future you might want to use another set of 2-4 registers with a different
functionality in another driver.

> It makes sense to me to use a mapped region of 8 bytes for both registers 
> rather
> than individual registers (since they are consecutive).

Not exactly. Are you sure, Linux will not have to use another other registers
in that block ? Will you keep creating such (random if I may call it so)
bindings for a smaller sets of register under "Host Base System Control
registers".

I would see if it makes sense to put together a single binding for
this "Host Base System Control" register(not sure what exactly that means).
Use MFD/regmap you access parts of this block. The remoteproc driver can
then be semi-generic(meaning applicable to group of similar platforms)
based on the platform compatible and use this regmap to provide the
functionality needed.

-- 
Regards,
Sudeep



[PATCH v2] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-14 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.

Reported-by: Qiang Zhang 
Closes: 
https://lore.kernel.org/all/cakhosas2rof6vqvbw_lg_j3qnku0canzr2qmy4et7r5lo8m...@mail.gmail.com/
Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Fix to return NULL if failed to access the address.
---
 arch/x86/kernel/kprobes/core.c |   11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index a0ce46c0a2d8..95e4ebe5d514 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -335,7 +335,16 @@ static int can_probe(unsigned long paddr)
 kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
offset,
 bool *on_func_entry)
 {
-   if (is_endbr(*(u32 *)addr)) {
+   u32 insn;
+
+   /*
+* Since addr is not guaranteed to be safely accessed yet, use
+* copy_from_kernel_nofault() to get the instruction.
+*/
+   if (copy_from_kernel_nofault(, (void *)addr, sizeof(u32)))
+   return NULL;
+
+   if (is_endbr(insn)) {
*on_func_entry = !offset || offset == 4;
if (*on_func_entry)
offset = 4;




Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-14 Thread Abdellatif El Khlifi
Hi Sudeep,

On Thu, Mar 14, 2024 at 02:59:20PM +, Sudeep Holla wrote:
> On Thu, Mar 14, 2024 at 08:52:59AM -0600, Mathieu Poirier wrote:
> > On Wed, Mar 13, 2024 at 05:17:56PM +, Abdellatif El Khlifi wrote:
> > > Hi Mathieu,
> > >
> > > On Wed, Mar 13, 2024 at 10:25:32AM -0600, Mathieu Poirier wrote:
> > > > On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> > > > > Hi Mathieu,
> > > > >
> > > > > On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > > > > > This is an initial patchset for allowing to turn on and off the 
> > > > > > > remote processor.
> > > > > > > The FW is already loaded before the Corstone-1000 SoC is powered 
> > > > > > > on and this
> > > > > > > is done through the FPGA board bootloader in case of the FPGA 
> > > > > > > target. Or by the Corstone-1000 FVP model
> > > > > > > (emulator).
> > > > > > >
> > > > > > >From the above I take it that booting with a preloaded firmware is 
> > > > > > >a
> > > > > > scenario that needs to be supported and not just a temporary stage.
> > > > >
> > > > > The current status of the Corstone-1000 SoC requires that there is
> > > > > a preloaded firmware for the external core. Preloading is done 
> > > > > externally
> > > > > either through the FPGA bootloader or the emulator (FVP) before 
> > > > > powering
> > > > > on the SoC.
> > > > >
> > > >
> > > > Ok
> > > >
> > > > > Corstone-1000 will be upgraded in a way that the A core running Linux 
> > > > > is able
> > > > > to share memory with the remote core and also being able to access 
> > > > > the remote
> > > > > core memory so Linux can copy the firmware to. This HW changes are 
> > > > > still
> > > > > This is why this patchset is relying on a preloaded firmware. And 
> > > > > it's the step 1
> > > > > of adding remoteproc support for Corstone.
> > > > >
> > > >
> > > > Ok, so there is a HW problem where A core and M core can't see each 
> > > > other's
> > > > memory, preventing the A core from copying the firmware image to the 
> > > > proper
> > > > location.
> > > >
> > > > When the HW is fixed, will there be a need to support scenarios where 
> > > > the
> > > > firmware image has been preloaded into memory?
> > >
> > > No, this scenario won't apply when we get the HW upgrade. No need for an
> > > external entity anymore. The firmware(s) will all be files in the linux 
> > > filesystem.
> > >
> >
> > Very well.  I am willing to continue with this driver but it does so little 
> > that
> > I wonder if it wouldn't simply be better to move forward with upstreaming 
> > when
> > the HW is fixed.  The choice is yours.
> >
> 
> I think Robin has raised few points that need clarification. I think it was
> done as part of DT binding patch. I share those concerns and I wanted to
> reaching to the same concerns by starting the questions I asked on corstone
> device tree changes.

Please have a look at my answer to Robin with clarifications [1].

Apart from mapping the register area rather than using the reg property
I'll also add the mboxes property as Krzysztof confirmed.

[1]: https://lore.kernel.org/all/20240314134928.ga27...@e130802.arm.com/

Cheers,
Abdellatif



Re: [PATCH] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-14 Thread Google
On Fri, 15 Mar 2024 00:12:30 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Read from an unsafe address with copy_from_kernel_nofault() in
> arch_adjust_kprobe_addr() because this function is used before checking
> the address is in text or not. Syzcaller bot found a bug and reported
> the case if user specifies inaccessible data area,
> arch_adjust_kprobe_addr() will cause a kernel panic.
> 
> 
> Reported-by: Qiang Zhang 
> Closes: 
> https://lore.kernel.org/all/cakhosas2rof6vqvbw_lg_j3qnku0canzr2qmy4et7r5lo8m...@mail.gmail.com/
> Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  arch/x86/kernel/kprobes/core.c |   11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index a0ce46c0a2d8..a885eea3bd34 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -335,7 +335,16 @@ static int can_probe(unsigned long paddr)
>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
> offset,
>bool *on_func_entry)
>  {
> - if (is_endbr(*(u32 *)addr)) {
> + u32 insn;
> +
> + /*
> +  * Since addr is not guaranteed to be safely accessed yet, use
> +  * copy_from_kernel_nofault() to get the instruction.
> +  */
> + if (copy_from_kernel_nofault(, (void *)addr, sizeof(u32)))
> + return 0;

Oops, it should be NULL.

> +
> + if (is_endbr(insn)) {
>   *on_func_entry = !offset || offset == 4;
>   if (*on_func_entry)
>   offset = 4;
> 


-- 
Masami Hiramatsu (Google) 



[PATCH] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-14 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.


Reported-by: Qiang Zhang 
Closes: 
https://lore.kernel.org/all/cakhosas2rof6vqvbw_lg_j3qnku0canzr2qmy4et7r5lo8m...@mail.gmail.com/
Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
Signed-off-by: Masami Hiramatsu (Google) 
---
 arch/x86/kernel/kprobes/core.c |   11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index a0ce46c0a2d8..a885eea3bd34 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -335,7 +335,16 @@ static int can_probe(unsigned long paddr)
 kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
offset,
 bool *on_func_entry)
 {
-   if (is_endbr(*(u32 *)addr)) {
+   u32 insn;
+
+   /*
+* Since addr is not guaranteed to be safely accessed yet, use
+* copy_from_kernel_nofault() to get the instruction.
+*/
+   if (copy_from_kernel_nofault(, (void *)addr, sizeof(u32)))
+   return 0;
+
+   if (is_endbr(insn)) {
*on_func_entry = !offset || offset == 4;
if (*on_func_entry)
offset = 4;




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 12:46:54PM +0100, Tobias Huschle wrote:
> On Tue, Mar 12, 2024 at 09:45:57AM +, Luis Machado wrote:
> > On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > > 
> > > Are we going anywhere with this btw?
> > > 
> > >
> > 
> > I think Tobias had a couple other threads related to this, with other 
> > potential fixes:
> > 
> > https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/
> > 
> > https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/
> > 
> 
> Sorry, Michael, should have provided those threads here as well.
> 
> The more I look into this issue, the more things to ponder upon I find.
> It seems like this issue can (maybe) be fixed on the scheduler side after all.
> 
> The root cause of this regression remains that the mentioned kworker gets
> a negative lag value and is therefore not elligible to run on wake up.
> This negative lag is potentially assigned incorrectly. But I'm not sure yet.
> 
> Anytime I find something that can address the symptom, there is a potential
> root cause on another level, and I would like to avoid to just address a
> symptom to fix the issue, wheras it would be better to find the actual
> root cause.
> 
> I would nevertheless still argue, that vhost relies rather heavily on the fact
> that the kworker gets scheduled on wake up everytime. But I don't have a 
> proposal at hand that accounts for potential side effects if opting for
> explicitly initiating a schedule.
> Maybe the assumption, that said kworker should always be selected on wake 
> up is valid. In that case the explicit schedule would merely be a safety 
> net.
> 
> I will let you know if something comes up on the scheduler side. There are
> some more ideas on my side how this could be approached.

Thanks a lot! To clarify it is not that I am opposed to changing vhost.
I would like however for some documentation to exist saying that if you
do abc then call API xyz. Then I hope we can feel a bit safer that
future scheduler changes will not break vhost (though as usual, nothing
is for sure).  Right now we are going by the documentation and that says
cond_resched so we do that.

-- 
MST




Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-14 Thread Björn Töpel
Puranjay Mohan  writes:

> Björn Töpel  writes:
>
>>
>> Hmm, depending on RISC-V's CMODX path, the pro/cons CALL_OPS vs dynamic
>> trampolines changes quite a bit.
>>
>> The more I look at the pains of patching two instruction ("split
>> immediates"), the better "patch data" + one insn patching look.
>
> I was looking at how dynamic trampolines would be implemented for RISC-V.
>
> With CALL-OPS we need to patch the auipc+jalr at function entry only, the
> ops pointer above the function can be patched atomically.
>
> With a dynamic trampoline we need a auipc+jalr pair at function entry to jump
> to the trampoline and then another auipc+jalr pair to jump from trampoline to
> ops->func. When the ops->func is modified, we would need to update the
> auipc+jalr at in the trampoline.
>
> So, I am not sure how to move forward here, CALL-OPS or Dynamic trampolines?

Yeah. Honestly, we need to figure out the patching story prior
choosing the path, so let's start there.

After reading Mark's reply, and discussing with OpenJDK folks (who does
the most crazy text patching on all platforms), having to patch multiple
instructions (where the address materialization is split over multiple
instructions) is a no-go. It's just a too big can of worms. So, if we
can only patch one insn, it's CALL_OPS.

A couple of options (in addition to Andy's), and all require a
per-function landing address ala CALL_OPS) tweaking what Mark is doing
on Arm (given the poor branch range).

...and maybe we'll get RISC-V rainbows/unicorns in the future getting
better reach (full 64b! ;-)).

A) Use auipc/jalr, only patch jalr to take us to a common
   dispatcher/trampoline
  
 |  # probably on a data cache-line != func .text to 
avoid ping-pong
 | ...
 | func:
 |   ...make sure ra isn't messed up...
 |   aupic
 |   nop <=> jalr # Text patch point -> common_dispatch
 |   ACTUAL_FUNC
 | 
 | common_dispatch:
 |   load  based on ra
 |   jalr
 |   ...

The auipc is never touched, and will be overhead. Also, we need a mv to
store ra in a scratch register as well -- like Arm. We'll have two insn
per-caller overhead for a disabled caller.

B) Use jal, which can only take us +/-1M, and requires multiple
   dispatchers (and tracking which one to use, and properly distribute
   them. Ick.)

 |  # probably on a data cache-line != func .text to 
avoid ping-pong
 | ...
 | func:
 |   ...make sure ra isn't messed up...
 |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
 |   ACTUAL_FUNC
 | 
 | within_1M_to_func_dispatch:
 |   load  based on ra
 |   jalr

C) Use jal, which can only take us +/-1M, and use a per-function
   trampoline requires multiple dispatchers (and tracking which one to
   use). Blows up text size A LOT.

 |  # somewhere, but probably on a different 
cacheline than the .text to avoid ping-ongs
 | ...
 | per_func_dispatch
 |   load  based on ra
 |   jalr
 | func:
 |   ...make sure ra isn't messed up...
 |   nop <=> jal # Text patch point -> per_func_dispatch
 |   ACTUAL_FUNC

It's a bit sad that we'll always have to have a dispatcher/trampoline,
but it's still better than stop_machine(). (And we'll need a fencei IPI
as well, but only one. ;-))

Today, I'm leaning towards A (which is what Mark suggested, and also
Robbin).. Any other options?

[Now how do we implement OPTPROBES? I'm kidding. ;-)]


Björn



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-14 Thread Sudeep Holla
On Thu, Mar 14, 2024 at 08:52:59AM -0600, Mathieu Poirier wrote:
> On Wed, Mar 13, 2024 at 05:17:56PM +, Abdellatif El Khlifi wrote:
> > Hi Mathieu,
> >
> > On Wed, Mar 13, 2024 at 10:25:32AM -0600, Mathieu Poirier wrote:
> > > On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> > > > Hi Mathieu,
> > > >
> > > > On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > > > > This is an initial patchset for allowing to turn on and off the 
> > > > > > remote processor.
> > > > > > The FW is already loaded before the Corstone-1000 SoC is powered on 
> > > > > > and this
> > > > > > is done through the FPGA board bootloader in case of the FPGA 
> > > > > > target. Or by the Corstone-1000 FVP model
> > > > > > (emulator).
> > > > > >
> > > > > >From the above I take it that booting with a preloaded firmware is a
> > > > > scenario that needs to be supported and not just a temporary stage.
> > > >
> > > > The current status of the Corstone-1000 SoC requires that there is
> > > > a preloaded firmware for the external core. Preloading is done 
> > > > externally
> > > > either through the FPGA bootloader or the emulator (FVP) before powering
> > > > on the SoC.
> > > >
> > >
> > > Ok
> > >
> > > > Corstone-1000 will be upgraded in a way that the A core running Linux 
> > > > is able
> > > > to share memory with the remote core and also being able to access the 
> > > > remote
> > > > core memory so Linux can copy the firmware to. This HW changes are still
> > > > This is why this patchset is relying on a preloaded firmware. And it's 
> > > > the step 1
> > > > of adding remoteproc support for Corstone.
> > > >
> > >
> > > Ok, so there is a HW problem where A core and M core can't see each 
> > > other's
> > > memory, preventing the A core from copying the firmware image to the 
> > > proper
> > > location.
> > >
> > > When the HW is fixed, will there be a need to support scenarios where the
> > > firmware image has been preloaded into memory?
> >
> > No, this scenario won't apply when we get the HW upgrade. No need for an
> > external entity anymore. The firmware(s) will all be files in the linux 
> > filesystem.
> >
>
> Very well.  I am willing to continue with this driver but it does so little 
> that
> I wonder if it wouldn't simply be better to move forward with upstreaming when
> the HW is fixed.  The choice is yours.
>

I think Robin has raised few points that need clarification. I think it was
done as part of DT binding patch. I share those concerns and I wanted to
reaching to the same concerns by starting the questions I asked on corstone
device tree changes.

--
Regards,
Sudeep



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-14 Thread Mathieu Poirier
On Wed, Mar 13, 2024 at 05:17:56PM +, Abdellatif El Khlifi wrote:
> Hi Mathieu,
> 
> On Wed, Mar 13, 2024 at 10:25:32AM -0600, Mathieu Poirier wrote:
> > On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> > > Hi Mathieu,
> > > 
> > > On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > > > This is an initial patchset for allowing to turn on and off the 
> > > > > remote processor.
> > > > > The FW is already loaded before the Corstone-1000 SoC is powered on 
> > > > > and this
> > > > > is done through the FPGA board bootloader in case of the FPGA target. 
> > > > > Or by the Corstone-1000 FVP model
> > > > > (emulator).
> > > > >
> > > > >From the above I take it that booting with a preloaded firmware is a
> > > > scenario that needs to be supported and not just a temporary stage.
> > > 
> > > The current status of the Corstone-1000 SoC requires that there is
> > > a preloaded firmware for the external core. Preloading is done externally
> > > either through the FPGA bootloader or the emulator (FVP) before powering
> > > on the SoC.
> > > 
> > 
> > Ok
> > 
> > > Corstone-1000 will be upgraded in a way that the A core running Linux is 
> > > able
> > > to share memory with the remote core and also being able to access the 
> > > remote
> > > core memory so Linux can copy the firmware to. This HW changes are still
> > > This is why this patchset is relying on a preloaded firmware. And it's 
> > > the step 1
> > > of adding remoteproc support for Corstone.
> > >
> > 
> > Ok, so there is a HW problem where A core and M core can't see each other's
> > memory, preventing the A core from copying the firmware image to the proper
> > location.
> > 
> > When the HW is fixed, will there be a need to support scenarios where the
> > firmware image has been preloaded into memory?
> 
> No, this scenario won't apply when we get the HW upgrade. No need for an
> external entity anymore. The firmware(s) will all be files in the linux 
> filesystem.
>

Very well.  I am willing to continue with this driver but it does so little that
I wonder if it wouldn't simply be better to move forward with upstreaming when
the HW is fixed.  The choice is yours. 

Thanks,
Mathieu



Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-14 Thread Paolo Abeni
On Wed, 2024-03-13 at 09:34 -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> [
>Note, I need to take this patch through my tree, so I'm looking for acks.

Note that this device driver is changing quite rapidly, so I expect
some conflicts here later. I guess Liuns will have to handle them ;)

>This causes the build to fail when I add the __assign_str() check, which
>I was about to push to Linus, but it breaks allmodconfig due to this error.
> ]
> 
> The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
> are going through some optimizations where only the source string of
> __string() will be used and the __assign_str() source will be ignored and
> later removed.
> 
> To make sure that there's no issues, a new check is added between the
> __string() src argument and the __assign_str() src argument that does a
> strcmp() to make sure they are the same string.
> 
> The hclgevf trace events have:
> 
>   __assign_str(devname, >nic.kinfo.netdev->name);
> 
> Which triggers the warning:
> 
> hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from 
> incompatible pointer type [-Werror=incompatible-pointer-types]
>34 | __assign_str(devname, >nic.kinfo.netdev->name);
>  [..]
> arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
> argument is of type ‘char (*)[16]’
>75 | int strcmp(const char *cs, const char *ct);
>   |^~
> 
> 
> Because __assign_str() now has:
> 
>   WARN_ON_ONCE(__builtin_constant_p(src) ?\
>strcmp((src), __data_offsets.dst##_ptr_) : \
>(src) != __data_offsets.dst##_ptr_);   \
> 
> The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
> that name is:
> 
>   charname[IFNAMSIZ]
> 
> Where passing an address '&' of a char array is not compatible with strcmp().
> 
> The '&' is not necessary, remove it.
> 
> Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF mailbox")

checkpactch in strict mode complains the hash is not 12 char long.

> Signed-off-by: Steven Rostedt (Google) 

FWIW

Acked-by: Paolo Abeni 




Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-14 Thread David Woodhouse
On 14 March 2024 11:13:37 CET, Peter Hilber  
wrote:
>> To a certain extent, as long as the virtio-rtc device is designed to expose 
>> time precisely and unambiguously, it's less important if the Linux kernel 
>> *today* can use that. Although of course we should strive for that. Let's 
>> be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
>> that though.
>> 
>
>As Virtio is extensible (unlike hardware), my approach is to mostly specify
>only what also has a PoC user and a use case.

If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on 
day one. Otherwise, as I said in my first response, I can go do that as a 
separate device and decide that virtio_rtc doesn't meet our needs (especially 
for maintaining accuracy over LM).

My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that 
it's extensible but we don't want a v1.0 of the spec, implemented by various 
hypervisors, which still leaves guests not knowing what the actual time is. 
That would not be good. And even UTC without a leap second indicator has that 
problem.



Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-14 Thread Puranjay Mohan
Björn Töpel  writes:

>
> Hmm, depending on RISC-V's CMODX path, the pro/cons CALL_OPS vs dynamic
> trampolines changes quite a bit.
>
> The more I look at the pains of patching two instruction ("split
> immediates"), the better "patch data" + one insn patching look.

I was looking at how dynamic trampolines would be implemented for RISC-V.

With CALL-OPS we need to patch the auipc+jalr at function entry only, the
ops pointer above the function can be patched atomically.

With a dynamic trampoline we need a auipc+jalr pair at function entry to jump
to the trampoline and then another auipc+jalr pair to jump from trampoline to
ops->func. When the ops->func is modified, we would need to update the
auipc+jalr at in the trampoline.

So, I am not sure how to move forward here, CALL-OPS or Dynamic trampolines?

Thanks,
Puranjay



Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-14 Thread Krzysztof Kozlowski
On 14/03/2024 14:49, Abdellatif El Khlifi wrote:
>> Frankly at the moment I'd be inclined to say it isn't even a remoteproc
>> binding (or driver) at all, it's a reset controller. Bindings are a contract
>> for describing the hardware, not the current state of Linux driver support -
>> if this thing still needs mailboxes, shared memory, a reset vector register,
>> or whatever else to actually be useful, those should be in the binding from
>> day 1 so that a) people can write and deploy correct DTs now, such that
>> functionality becomes available on their systems as soon as driver support
>> catches up, and b) the community has any hope of being able to review
>> whether the binding is appropriately designed and specified for the purpose
>> it intends to serve.
> 
> This is an initial patchset for allowing to turn on and off the remote 
> processor.
> The FW is already loaded before the Corstone-1000 SoC is powered on and this
> is done through the FPGA board bootloader in case of the FPGA target.
> Or by the Corstone-1000 FVP model (emulator).
> 
> The plan for the driver is as follows:
> 
> Step 1: provide a foundation driver capable of turning the core on/off
> Step 2: provide mailbox support for comms
> Step 3: provide FW reload capability
> 
> Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) can
> share memory with the remote core.
> 
> So, when memory sharing becomes available in the FPGA and FVP the
> DT binding will be upgraded with:
> 
> - mboxes property specifying the RX/TX mailboxes (based on MHU v2)
> - memory-region property describing the virtio vrings
> 
> Currently the mailbox controller does exist in the HW but is not
> usable via virtio (no memory sharing available).
> 
> Do you recommend I add the mboxes property even currently we can't do the 
> comms ?

Bindings should be complete, regardless whether Linux driver supports it
or not. Please see writing bindings document for explanation on this and
other rules.

So yes: please describe as much as possible/reasonable.


Best regards,
Krzysztof




Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-14 Thread Abdellatif El Khlifi
Hi Robin,

> > +  firmware-name:
> > +description: |
> > +  Default name of the firmware to load to the remote processor.
> 
> So... is loading the firmware image achieved by somehow bitbanging it
> through the one reset register, maybe? I find it hard to believe this is a
> complete and functional binding.
> 
> Frankly at the moment I'd be inclined to say it isn't even a remoteproc
> binding (or driver) at all, it's a reset controller. Bindings are a contract
> for describing the hardware, not the current state of Linux driver support -
> if this thing still needs mailboxes, shared memory, a reset vector register,
> or whatever else to actually be useful, those should be in the binding from
> day 1 so that a) people can write and deploy correct DTs now, such that
> functionality becomes available on their systems as soon as driver support
> catches up, and b) the community has any hope of being able to review
> whether the binding is appropriately designed and specified for the purpose
> it intends to serve.

This is an initial patchset for allowing to turn on and off the remote 
processor.
The FW is already loaded before the Corstone-1000 SoC is powered on and this
is done through the FPGA board bootloader in case of the FPGA target.
Or by the Corstone-1000 FVP model (emulator).

The plan for the driver is as follows:

Step 1: provide a foundation driver capable of turning the core on/off
Step 2: provide mailbox support for comms
Step 3: provide FW reload capability

Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) can
share memory with the remote core.

So, when memory sharing becomes available in the FPGA and FVP the
DT binding will be upgraded with:

- mboxes property specifying the RX/TX mailboxes (based on MHU v2)
- memory-region property describing the virtio vrings

Currently the mailbox controller does exist in the HW but is not
usable via virtio (no memory sharing available).

Do you recommend I add the mboxes property even currently we can't do the comms 
?

> For instance right now it seems somewhat tenuous to describe two consecutive
> 32-bit registers as separate "reg" entries, but *maybe* it's OK if that's
> all there ever is. However if it's actually going to end up needing several
> more additional MMIO and/or memory regions for other functionality, then
> describing each register and location individually is liable to get
> unmanageable really fast, and a higher-level functional grouping (e.g. these
> reset-related registers together as a single 8-byte region) would likely be
> a better design.

Currently the HW provides only 2 registers to control the remote processors:

The reset control and status registers.

It makes sense to me to use a mapped region of 8 bytes for both registers rather
than individual registers (since they are consecutive).
I'll update that, thanks for the suggestion.

Abdellatif,
Cheers



[RFC PATCH 1/2] net: add sysfs attributes for customized dim profile management

2024-03-14 Thread Heng Qi
The NetDIM library, currently leveraged by an array of NICs, delivers
excellent acceleration benefits. Nevertheless, NICs vary significantly
in their dim profile list prerequisites.

Specifically, virtio-net backends may present diverse sw or hw device
implementation, making a one-size-fits-all parameter list impractical.
On Alibaba Cloud, the virtio DPU's performance under the default DIM
profile falls short of expectations, partly due to a mismatch in
parameter configuration.

I also noticed that ice/idpf/ena and other NICs have customized
profilelist or placed some restrictions on dim capabilities.

Motivated by this, I tried adding new sysfs attributes that provides
a per-device control to modify and access a device's interrupt parameters.

Usage

1. Query the currently customized list of the device

$ cat dim_profs
The profiles of (RX, EQE):
{.usec =   1, .pkts = 256, .comps =   0,},
{.usec =   8, .pkts = 256, .comps =   0,},
{.usec =  64, .pkts = 256, .comps =   0,},
{.usec = 128, .pkts = 256, .comps =   0,},
{.usec = 256, .pkts = 256, .comps =   0,}
The profiles of (TX, EQE):
{.usec =   1, .pkts = 256, .comps =   0,},
{.usec =   2, .pkts = 256, .comps =   0,},
{.usec =   3, .pkts = 256, .comps =   0,},
{.usec =   4, .pkts = 256, .comps =   0,},
{.usec =   5, .pkts = 256, .comps =   0,}

2. Tune

$ echo "RX EQE 8,8,0 16,16,0 32,32,0 64,64,0 128,128,0" > dim_profs
$ echo "  TX  EQE 0,2,0   1,3,0 2,4,0   3,5,0  4,6,0   " > dim_profs
$ cat dim_profs
The profiles of (RX, EQE):
{.usec =   8, .pkts =   8, .comps =   0,},
{.usec =  16, .pkts =  16, .comps =   0,},
{.usec =  32, .pkts =  32, .comps =   0,},
{.usec =  64, .pkts =  64, .comps =   0,},
{.usec = 128, .pkts = 128, .comps =   0,}
The profiles of (TX, EQE):
{.usec =   0, .pkts =   2, .comps =   0,},
{.usec =   1, .pkts =   3, .comps =   0,},
{.usec =   2, .pkts =   4, .comps =   0,},
{.usec =   3, .pkts =   5, .comps =   0,},
{.usec =   4, .pkts =   6, .comps =   0,}

3. Warn
If the device does not support .ndo_dim_moder_{set, get},
the following warning will response:
"Profile is default and not customized by the device."

Signed-off-by: Heng Qi 
---
 Documentation/ABI/testing/sysfs-class-net |  17 +++
 include/linux/dim.h   |   7 ++
 include/linux/netdevice.h |  35 ++
 lib/dim/net_dim.c |   6 --
 net/core/net-sysfs.c  | 172 ++
 5 files changed, 231 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net 
b/Documentation/ABI/testing/sysfs-class-net
index ebf21be..1e4faa8 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -352,3 +352,20 @@ Description:
0  threaded mode disabled for this dev
1  threaded mode enabled for this dev
== ==
+
+What:  /sys/class/net//dim_profs
+Date:  Mar 2024
+KernelVersion: 6.8
+Contact:   net...@vger.kernel.org
+Description:
+   String value to control the profile list of DIM per device. 
User could
+   set this value to tune the profile list for RX/TX direction and 
EQE/CQE
+   mode respectively.
+
+   Possible values:
+    
==
+   RX EQE 1,1,0  2,2,0   3,3,0   4,4,05,5,0 tune RX + EQE 
profile list
+   RX CQE 8,8,0  16,16,0 32,32,0 64,64,0  128,128,0 tune RX + CQE 
profile list
+   TX EQE 16,8,0 2,16,0  16,8,0  32,64,0  128,64,0  tune TX + EQE 
profile list
+   TX CQE 8,5,0  8,16,0  32,12,0 128,64,0 256,128,0 tune TX + CQE 
profile list
+    
==
diff --git a/include/linux/dim.h b/include/linux/dim.h
index f343bc9..43398f5 100644
--- a/include/linux/dim.h
+++ b/include/linux/dim.h
@@ -10,6 +10,13 @@
 #include 
 #include 
 
+/* Number of DIM profiles and period mode. */
+#define NET_DIM_PARAMS_NUM_PROFILES 5
+#define NET_DIM_DEFAULT_RX_CQ_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_PKTS_FROM_EQE 128
+#define NET_DIM_DEF_PROFILE_CQE 1
+#define NET_DIM_DEF_PROFILE_EQE 1
+
 /*
  * Number of events between DIM iterations.
  * Causes a moderation of the algorithm run.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6f6ac7..bc2f3ac 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -998,6 +999,27 @@ struct netdev_net_notifier {
struct notifier_block *nb;
 };
 
+enum dim_direction {
+   DIM_RX_DIRECTION = 0x0,
+   DIM_TX_DIRECTION = 0x1,
+   DIM_NUM_DIRECTIONS
+};
+/**
+ * struct dim_profs_list - Structure for dim sysfs configuration.
+ * Used to exchange profile list between the sysfs and the 

[RFC PATCH 0/2] net: provides dim profile fine-tuning channels

2024-03-14 Thread Heng Qi
The NetDIM library provides excellent acceleration for many modern
network cards. However, the default profiles of DIM limits its maximum
capabilities for different NICs, so providing a channel through which
the NIC can be custom configured is necessary.

Please review, thank you very much!

Heng Qi (2):
  net: add sysfs attributes for customized dim profile management
  virtio-net: support net sysfs to fine-tune dim profile

 Documentation/ABI/testing/sysfs-class-net |  17 +++
 drivers/net/virtio_net.c  |  64 ++-
 include/linux/dim.h   |   7 ++
 include/linux/netdevice.h |  35 ++
 lib/dim/net_dim.c |   6 --
 net/core/net-sysfs.c  | 172 ++
 6 files changed, 294 insertions(+), 7 deletions(-)

-- 
1.8.3.1




[RFC PATCH 2/2] virtio-net: support net sysfs to fine-tune dim profile

2024-03-14 Thread Heng Qi
Virtio-net has different types of back-end device
implementations. In order to effectively optimize
the dim library's gains for different device
implementations, let's use the interface provided
by net-sysfs to fine-tune the profile list.

Signed-off-by: Heng Qi 
---
 drivers/net/virtio_net.c | 64 +++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e709d44..7fae737 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -57,6 +57,16 @@
 
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
+/* This is copied from NET_DIM_RX_EQE_PROFILES in DIM library */
+#define VIRTNET_DIM_RX_PKTS 256
+static struct dim_cq_moder rx_itr_conf[] = {
+   {.usec = 1,   .pkts = VIRTNET_DIM_RX_PKTS,},
+   {.usec = 8,   .pkts = VIRTNET_DIM_RX_PKTS,},
+   {.usec = 64,  .pkts = VIRTNET_DIM_RX_PKTS,},
+   {.usec = 128, .pkts = VIRTNET_DIM_RX_PKTS,},
+   {.usec = 256, .pkts = VIRTNET_DIM_RX_PKTS,}
+};
+
 static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_TSO4,
VIRTIO_NET_F_GUEST_TSO6,
@@ -3584,7 +3594,10 @@ static void virtnet_rx_dim_work(struct work_struct *work)
if (!rq->dim_enabled)
continue;
 
-   update_moder = net_dim_get_rx_moderation(dim->mode, 
dim->profile_ix);
+   if (dim->profile_ix >= ARRAY_SIZE(rx_itr_conf))
+   dim->profile_ix = ARRAY_SIZE(rx_itr_conf) - 1;
+
+   update_moder = rx_itr_conf[dim->profile_ix];
if (update_moder.usec != rq->intr_coal.max_usecs ||
update_moder.pkts != rq->intr_coal.max_packets) {
err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
@@ -4170,6 +4183,53 @@ static void virtnet_tx_timeout(struct net_device *dev, 
unsigned int txqueue)
   jiffies_to_usecs(jiffies - READ_ONCE(txq->trans_start)));
 }
 
+static int virtnet_dim_moder_valid(struct net_device *dev, struct 
dim_profs_list *list)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
+   return -EOPNOTSUPP;
+
+   if (!list || list->direction != DIM_RX_DIRECTION ||
+   list->num != NET_DIM_PARAMS_NUM_PROFILES ||
+   list->mode != DIM_CQ_PERIOD_MODE_START_FROM_EQE) {
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int virtnet_dim_moder_get(struct net_device *dev, struct dim_profs_list 
*list)
+{
+   int ret;
+
+   ret = virtnet_dim_moder_valid(dev, list);
+   if (ret)
+   return ret;
+
+   memcpy(list->profs, rx_itr_conf, sizeof(*list->profs) * list->num);
+
+   return 0;
+}
+
+static int virtnet_dim_moder_set(struct net_device *dev, struct dim_profs_list 
*list)
+{
+   int i, ret;
+
+   ret = virtnet_dim_moder_valid(dev, list);
+   if (ret)
+   return ret;
+
+   for (i = 0; i < list->num; i++) {
+   rx_itr_conf[i].usec = list->profs[i].usec;
+   rx_itr_conf[i].pkts = list->profs[i].pkts;
+   if (list->profs[i].comps)
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
.ndo_open= virtnet_open,
.ndo_stop= virtnet_close,
@@ -4186,6 +4246,8 @@ static void virtnet_tx_timeout(struct net_device *dev, 
unsigned int txqueue)
.ndo_get_phys_port_name = virtnet_get_phys_port_name,
.ndo_set_features   = virtnet_set_features,
.ndo_tx_timeout = virtnet_tx_timeout,
+   .ndo_dim_moder_get  = virtnet_dim_moder_get,
+   .ndo_dim_moder_set  = virtnet_dim_moder_set,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
-- 
1.8.3.1




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > 
> > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > We observe a error message spew from VM and then soft-lockup
> > > > > report. The error message indicates the data associated with
> > > > > the descriptor (index: 135) has been released, and the queue
> > > > > is marked as broken. It eventually leads to the endless effort
> > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > ring and published to the used ring by vhost, meaning we have
> > > > > disordred write to the available ring element and available index.
> > > > > 
> > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64   
> > > > >\
> > > > > -accel kvm -machine virt,gic-version=host 
> > > > >\
> > > > >:  
> > > > >\
> > > > > -netdev tap,id=vnet0,vhost=on 
> > > > >\
> > > > > -device 
> > > > > virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > 
> > > > > [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > 
> > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > > > ARM64. It should work for other architectures, but performance loss is
> > > > > expected.
> > > > > 
> > > > > Cc: sta...@vger.kernel.org
> > > > > Reported-by: Yihuang Yu 
> > > > > Signed-off-by: Gavin Shan 
> > > > > ---
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensuere the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > 
> > > > 
> > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > enough besides "it does not work"?
> > > > 
> > > 
> > > The change is replacing instruction "dmb" with "dsb". "dsb" is stronger 
> > > barrier
> > > than "dmb" because "dsb" ensures that all memory accesses raised before 
> > > this
> > > instruction is completed when the 'dsb' instruction completes. However, 
> > > "dmb"
> > > doesn't guarantee the order of completion of the memory accesses.
> > > 
> > > So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
> > > vq->split.avail_idx_shadow)'
> > > can be completed before 'vq->split.vring.avail->ring[avail] = 
> > > cpu_to_virtio16(_vq->vdev, head)'.
> > 
> > Completed as observed by which CPU?
> > We have 2 writes that we want observed by another CPU in order.
> > So if CPU observes a new value of idx we want it to see
> > new value in ring.
> > This is standard use of smp_wmb()
> > How are these 2 writes different?
> > 
> > What DMB does, is that is seems to ensure that effects
> > of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
> > vq->split.avail_idx_shadow)'
> > are observed after effects of
> > 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > 
> > 
> 
> Completed as observed by the CPU where vhost worker 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Gavin Shan

On 3/14/24 21:50, Michael S. Tsirkin wrote:

On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:

On 3/14/24 18:05, Michael S. Tsirkin wrote:

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:

The issue is reported by Yihuang Yu who have 'netperf' test on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in the VM hosted by grace-hopper machine,
while the 'netperf' server is running on grace-grace machine.

The VM is started with virtio-net and vhost has been enabled.
We observe a error message spew from VM and then soft-lockup
report. The error message indicates the data associated with
the descriptor (index: 135) has been released, and the queue
is marked as broken. It eventually leads to the endless effort
to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning we have
disordred write to the available ring element and available index.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
-accel kvm -machine virt,gic-version=host\
   : \
-netdev tap,id=vnet0,vhost=on\
-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
ARM64. It should work for other architectures, but performance loss is
expected.

Cc: sta...@vger.kernel.org
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
   drivers/virtio/virtio_ring.c | 12 +---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-   /* Descriptors and available array need to be set before we expose the
-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);



I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?



The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
than "dmb" because "dsb" ensures that all memory accesses raised before this
instruction is completed when the 'dsb' instruction completes. However, "dmb"
doesn't guarantee the order of completion of the memory accesses.

So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
vq->split.avail_idx_shadow)'
can be completed before 'vq->split.vring.avail->ring[avail] = 
cpu_to_virtio16(_vq->vdev, head)'.


Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if CPU observes a new value of idx we want it to see
new value in ring.
This is standard use of smp_wmb()
How are these 2 writes different?

What DMB does, is that is seems to ensure that effects
of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
vq->split.avail_idx_shadow)'
are observed after effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.




Completed as observed by the CPU where vhost worker is running. I don't think 
DMB
does the work here. If I'm understanding correctly, DMB ensures the order of 
these
two writes from the local CPU's standpoint. The written data can be stored in 
local
CPU's cache, not flushed to DRAM and propogated to the cache of the far CPU 
where
vhost worker is running. So DMB isn't ensuring the write data is observed from 
the
far CPU.

DSB ensures that the written data is observable from the far CPU immediately.
 




The stronger barrier 'dsb' ensures the completion order as we expected.

 virtio_wmb(true) virt_mb(false)
   virt_wmb mb
 __smp_wmb   __mb
   dmb(ishst)  dsb(sy)


First, why would you want a non smp barrier when you are
synchronizing with another CPU? This is what smp_ means ...



Extraced from ARMv9 specificaton

The DMB 

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-14 Thread Tobias Huschle
On Tue, Mar 12, 2024 at 09:45:57AM +, Luis Machado wrote:
> On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > 
> > Are we going anywhere with this btw?
> > 
> >
> 
> I think Tobias had a couple other threads related to this, with other 
> potential fixes:
> 
> https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/
> 
> https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/
> 

Sorry, Michael, should have provided those threads here as well.

The more I look into this issue, the more things to ponder upon I find.
It seems like this issue can (maybe) be fixed on the scheduler side after all.

The root cause of this regression remains that the mentioned kworker gets
a negative lag value and is therefore not elligible to run on wake up.
This negative lag is potentially assigned incorrectly. But I'm not sure yet.

Anytime I find something that can address the symptom, there is a potential
root cause on another level, and I would like to avoid to just address a
symptom to fix the issue, wheras it would be better to find the actual
root cause.

I would nevertheless still argue, that vhost relies rather heavily on the fact
that the kworker gets scheduled on wake up everytime. But I don't have a 
proposal at hand that accounts for potential side effects if opting for
explicitly initiating a schedule.
Maybe the assumption, that said kworker should always be selected on wake 
up is valid. In that case the explicit schedule would merely be a safety 
net.

I will let you know if something comes up on the scheduler side. There are
some more ideas on my side how this could be approached.



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensuere the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > 
> > 
> > I don't get what is going on here. Any explanation why virtio_wmb is not
> > enough besides "it does not work"?
> > 
> 
> The change is replacing instruction "dmb" with "dsb". "dsb" is stronger 
> barrier
> than "dmb" because "dsb" ensures that all memory accesses raised before this
> instruction is completed when the 'dsb' instruction completes. However, "dmb"
> doesn't guarantee the order of completion of the memory accesses.
>
> So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
> vq->split.avail_idx_shadow)'
> can be completed before 'vq->split.vring.avail->ring[avail] = 
> cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if CPU observes a new value of idx we want it to see
new value in ring.
This is standard use of smp_wmb()
How are these 2 writes different?

What DMB does, is that is seems to ensure that effects
of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
vq->split.avail_idx_shadow)'
are observed after effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.




> The stronger barrier 'dsb' ensures the completion order as we expected.
> 
> virtio_wmb(true) virt_mb(false)
>   virt_wmb mb
> __smp_wmb   __mb
>   dmb(ishst)  dsb(sy)

First, why would you want a non smp barrier when you are
synchronizing with another CPU? This is what smp_ means ...


> Extraced from ARMv9 specificaton
> 
> The DMB instruction is a memory barrier instruction that ensures the relative
> order of memory accesses before the barrier with memory accesses after the
> barrier. The DMB instruction _does not_ ensure the completion of any of the
> memory accesses for which 

Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-14 Thread Jijie Shao

Reviewed-by: Jijie Shao

on 2024/3/13 21:34, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

[
Note, I need to take this patch through my tree, so I'm looking for acks.
This causes the build to fail when I add the __assign_str() check, which
I was about to push to Linus, but it breaks allmodconfig due to this error.
]

The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
are going through some optimizations where only the source string of
__string() will be used and the __assign_str() source will be ignored and
later removed.

To make sure that there's no issues, a new check is added between the
__string() src argument and the __assign_str() src argument that does a
strcmp() to make sure they are the same string.

The hclgevf trace events have:

   __assign_str(devname, >nic.kinfo.netdev->name);

Which triggers the warning:

hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from incompatible 
pointer type [-Werror=incompatible-pointer-types]
34 | __assign_str(devname, >nic.kinfo.netdev->name);
  [..]
arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
argument is of type ‘char (*)[16]’
75 | int strcmp(const char *cs, const char *ct);
   |^~


Because __assign_str() now has:

WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \

The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
that name is:

charname[IFNAMSIZ]

Where passing an address '&' of a char array is not compatible with strcmp().

The '&' is not necessary, remove it.

Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF mailbox")
Signed-off-by: Steven Rostedt (Google) 
---
  drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h  | 8 
  .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h| 8 
  2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
index 8510b88d4982..f3cd5a376eca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
@@ -24,7 +24,7 @@ TRACE_EVENT(hclge_pf_mbx_get,
__field(u8, code)
__field(u8, subcode)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, >vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_GET_MBX_LEN)
),
  
@@ -33,7 +33,7 @@ TRACE_EVENT(hclge_pf_mbx_get,

__entry->code = req->msg.code;
__entry->subcode = req->msg.subcode;
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, >vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_vf_to_pf_cmd));
),
@@ -56,7 +56,7 @@ TRACE_EVENT(hclge_pf_mbx_send,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, >vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_SEND_MBX_LEN)
),
  
@@ -64,7 +64,7 @@ TRACE_EVENT(hclge_pf_mbx_send,

__entry->vfid = req->dest_vfid;
__entry->code = le16_to_cpu(req->msg.code);
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, >vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_pf_to_vf_cmd));
),
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
index 5d4895bb57a1..b259e95dd53c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
@@ -23,7 +23,7 @@ TRACE_EVENT(hclge_vf_mbx_get,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, >nic.kinfo.netdev->name)
+   __string(devname, hdev->nic.kinfo.netdev->name)
__array(u32, mbx_data, VF_GET_MBX_LEN)
),
  
@@ -31,7 +31,7 @@ TRACE_EVENT(hclge_vf_mbx_get,

__entry->vfid = req->dest_vfid;
__entry->code = le16_to_cpu(req->msg.code);
__assign_str(pciname, 

Re: [PATCH] tracing/osnoise: init osnoise_instances list before registering tracer

2024-03-14 Thread Jerome Marchand

On 14/03/2024 11:49, Jerome Marchand wrote:

The kernel panic during the initialization of the osnoise tracer when
booting with "ftrace=osnoise" or "ftrace=timerlat" option.


BTW, while this fixes this issue for osnoise and timerlat, another issue,
remains with timerlat which prevent to boot with "ftrace=timerlat" option.
It apparently related to hrtimers:

[1.489168] Starting tracer 'timerlat'
[1.506835] BUG: kernel NULL pointer dereference, address: 
[1.507355] Loading compiled-in X.509 certificates
[1.507652] #PF: supervisor write access in kernel mode
[1.509639] #PF: error_code(0x0002) - not-present page
[1.510425] PGD 0 P4D 0
[1.510888] Oops: 0002 [#1] PREEMPT SMP NOPTI
[1.511582] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
6.8.0-63.osnoiseinit.fc41.x86_64 #1
[1.512819] Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[1.514175] RIP: 0010:rb_erase+0x18b/0x3a0
[1.514871] Code: 48 83 c0 01 48 89 01 e9 b3 30 03 00 e9 ae 30 03 00 48 89 46 10 
e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 
01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[1.517486] RSP: 0018:99403e08 EFLAGS: 00010046
[1.521008] RAX: 95e06fa2de49 RBX: 95e06fa25050 RCX: 95e06fa2de48
[1.522067] RDX:  RSI: 95e06fa2de48 RDI: 95e06fa25050
[1.523111] RBP: 95e06fa24b20 R08: 95e06fa24b20 R09: 
[1.524151] R10:  R11: 99464880 R12: 0001
[1.525195] R13: 95e06fa24ac0 R14: 95e06fa24b00 R15: 00024ac0
[1.526230] FS:  () GS:95e06fa0() 
knlGS:
[1.527433] CS:  0010 DS:  ES:  CR0: 80050033
[1.528295] CR2:  CR3: 00014a422000 CR4: 00350ef0
[1.529331] Call Trace:
[1.529783]  
[1.530192]  ? __die+0x23/0x70
[1.530736]  ? page_fault_oops+0x174/0x530
[1.531394]  ? srso_return_thunk+0x5/0x5f
[1.532040]  ? srso_return_thunk+0x5/0x5f
[1.532681]  ? ttwu_queue_wakelist+0xf1/0x110
[1.533377]  ? exc_page_fault+0x7f/0x180
[1.534007]  ? asm_exc_page_fault+0x26/0x30
[1.534677]  ? rb_erase+0x18b/0x3a0
[1.535251]  ? __schedule+0x3e9/0x1530
[1.535863]  timerqueue_del+0x2e/0x50
[1.536457]  __remove_hrtimer+0x39/0x90
[1.537095]  hrtimer_start_range_ns+0x2a6/0x350
[1.537918]  tick_nohz_idle_stop_tick+0x69/0xd0
[1.538646]  do_idle+0x221/0x270
[1.539195]  cpu_startup_entry+0x29/0x30
[1.539829]  rest_init+0xcf/0xd0
[1.540374]  arch_call_rest_init+0xe/0x30
[1.541031]  start_kernel+0x717/0xa70
[1.541623]  x86_64_start_reservations+0x18/0x30
[1.542352]  x86_64_start_kernel+0x94/0xa0
[1.543007]  secondary_startup_64_no_verify+0x184/0x18b
[1.543813]  
[1.544217] Modules linked in:
[1.544751] CR2: 
[1.545300] ---[ end trace  ]---
[1.546015] RIP: 0010:rb_erase+0x18b/0x3a0
[1.546674] Code: 48 83 c0 01 48 89 01 e9 b3 30 03 00 e9 ae 30 03 00 48 89 46 10 
e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 
01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[1.549271] RSP: 0018:99403e08 EFLAGS: 00010046
[1.550063] RAX: 95e06fa2de49 RBX: 95e06fa25050 RCX: 95e06fa2de48
[1.551091] RDX:  RSI: 95e06fa2de48 RDI: 95e06fa25050
[1.552125] RBP: 95e06fa24b20 R08: 95e06fa24b20 R09: 
[1.553157] R10:  R11: 99464880 R12: 0001
[1.554191] R13: 95e06fa24ac0 R14: 95e06fa24b00 R15: 00024ac0
[1.555226] FS:  () GS:95e06fa0() 
knlGS:
[1.556433] CS:  0010 DS:  ES:  CR0: 80050033
[1.557291] CR2:  CR3: 00014a422000 CR4: 00350ef0
[1.558322] Kernel panic - not syncing: Attempted to kill the idle task!
[1.559602] Kernel Offset: 0x1600 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[1.561164] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---

Maybe a similar initialization issue that's easily fixed, or maybe the
timerlat option should be ignoreded at boot time, as mmiotrace is.

Regards,
Jerome Marchand



[1.505256] Starting tracer 'osnoise'
[1.510214] BUG: kernel NULL pointer dereference, address: 0010
[1.511189] #PF: supervisor read access in kernel mode
[1.511938] #PF: error_code(0x) - not-present page
[1.512676] PGD 0 P4D 0
[1.513105] Oops:  [#1] PREEMPT SMP NOPTI
[1.513763] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 6.8.0-63.fc41.x86_64 #1
[1.515485] Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[1.516765] RIP: 0010:osnoise_tracer_start+0x21/0x80
[  

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Gavin Shan

On 3/14/24 18:05, Michael S. Tsirkin wrote:

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:

The issue is reported by Yihuang Yu who have 'netperf' test on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in the VM hosted by grace-hopper machine,
while the 'netperf' server is running on grace-grace machine.

The VM is started with virtio-net and vhost has been enabled.
We observe a error message spew from VM and then soft-lockup
report. The error message indicates the data associated with
the descriptor (index: 135) has been released, and the queue
is marked as broken. It eventually leads to the endless effort
to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning we have
disordred write to the available ring element and available index.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host\
  : \
   -netdev tap,id=vnet0,vhost=on\
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
ARM64. It should work for other architectures, but performance loss is
expected.

Cc: sta...@vger.kernel.org
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/virtio/virtio_ring.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
  
-	/* Descriptors and available array need to be set before we expose the

-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);



I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?



The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
than "dmb" because "dsb" ensures that all memory accesses raised before this
instruction is completed when the 'dsb' instruction completes. However, "dmb"
doesn't guarantee the order of completion of the memory accesses.

So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, 
vq->split.avail_idx_shadow)'
can be completed before 'vq->split.vring.avail->ring[avail] = 
cpu_to_virtio16(_vq->vdev, head)'.
The stronger barrier 'dsb' ensures the completion order as we expected.

virtio_wmb(true) virt_mb(false)
  virt_wmb mb
__smp_wmb   __mb
  dmb(ishst)  dsb(sy)
  


Extraced from ARMv9 specificaton

The DMB instruction is a memory barrier instruction that ensures the relative
order of memory accesses before the barrier with memory accesses after the
barrier. The DMB instruction _does not_ ensure the completion of any of the
memory accesses for which it ensures relative order.

A DSB instruction is a memory barrier that ensures that memory accesses that
occur before the DSB instruction have __completed__ before the completion of
the DSB instruction. In doing this, it acts as a stronger barrier than a DMB
and all ordering that is created by a DMB with specific options is also 
generated
by a DSB with the same options.


vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
--
2.44.0




Thanks,
Gavin




Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-14 Thread Peter Hilber
Now CC'ing the previous commenters to the virtio-rtc spec draft, since
this discussion is mostly about the spec, and the Virtio mailing lists
still seem to be in a migration hiatus...

On 13.03.24 19:18, David Woodhouse wrote:
> On 13 March 2024 17:50:48 GMT, Peter Hilber  
> wrote:
>> On 13.03.24 13:45, David Woodhouse wrote:
>>> Surely the whole point of this effort is to provide guests with precise
>>> and *unambiguous* knowledge of what the time is? 
>>
>> I would say, a fundamental point of this effort is to enable such
>> implementations, and to detect if a device is promising to support this.
>>
>> Where we might differ is as to whether the Virtio clock *for every
>> implementation* has to be *continuously* accurate w.r.t. a time standard,
>> or whether *for some implementations* it could be enough that all guests in
>> the local system have the same, precise local notion of time, which might
>> be off from the actual time standard.
> 
> That makes sense, but remember I don't just want {X, Y, Z} but *also* the 
> error bounds of ±deltaY and ±deltaZ too.
> 
> So your example just boils down to "I'm calling it UTC, and it's really 
> precise, but we make no promises about its *accuracy*". And that's fine.
> 
>> Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...
> 
> KVM is not an exemplar of good time practices. 
> Not in *any* respect :)
> 
>> With your described use case the UTC_SMEARED clock should of course not be
>> used. The UTC_SMEARED clock would get a distinct name through udev, like
>> /dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>> detected.
> 
> As long as it's clear to all concerned that this is fundamentally not usable 
> as an accurate time source, and is only for the local-sync case you 
> described, sure.
> 
>>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>>> leap second the guest can't know know *which* occurrence of that leap
>>> second it is, so it might be wrong by a second. To resolve that
>>> ambiguity needs a leap indicator and/or tai_offset field.
>>
>> I agree that virtio-rtc should communicate this. The question is, what
>> exactly, and for which clock read request?
> 
> Are we now conflating software architecture (and Linux in particular) with 
> "hardware" design?
> 
> To a certain extent, as long as the virtio-rtc device is designed to expose 
> time precisely and unambiguously, it's less important if the Linux kernel 
> *today* can use that. Although of course we should strive for that. Let's 
> be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
> that though.
> 

As Virtio is extensible (unlike hardware), my approach is to mostly specify
only what also has a PoC user and a use case.

>> As for PTP clocks:
>>
>> - It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>>
>> - The clock_adjtime(2) tai_offset and return value could be set (if
>>  upstream will accept this). Would this help? As discussed, user space
>>  would need to interpret this (and currently no dynamic POSIX clock sets
>>  this).
> 
> Hm, maybe?
> 
> 
 I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
 boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
 so the device might not even know which vCPUs there are. E.g. there is even
 interest to make virtio-rtc work as part of the virtio-net device (which
 might be implemented in hardware).
>>>
>>> Sure, but those implementations aren't going to offer the TSC pairing
>>> at all, are they?
>>>
>>
>> They could offer an Intel ART pairing (some physical PTP NICs are already
>> doing this, look for the convert_art_to_tsc() users).
> 
> Right, but isn't that software's problem? The time pairing is defined against 
> the ART in that case.

My point was that such a device would then not necessarily have an idea
what vCPU 0 is. But let's just say that this will be phrased as a SHOULD
best-effort requirement anyway.

Thanks for the comments,

Peter



Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Add Samsung Galaxy Note 3

2024-03-14 Thread Konrad Dybcio




On 3/10/24 15:13, Luca Weiss wrote:

From: Adam Honse 

Add the devicetree for this "phablet" using the Snapdragon 800 SoC.

Signed-off-by: Adam Honse 
[l...@z3ntu.xyz: clean up, prepare for upstream]
Signed-off-by: Luca Weiss 
---
  arch/arm/boot/dts/qcom/Makefile|   1 +
  .../boot/dts/qcom/qcom-msm8974-samsung-hlte.dts| 403 +
  2 files changed, 404 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
index 9cc1e14e6cd0..845af12d15a2 100644
--- a/arch/arm/boot/dts/qcom/Makefile
+++ b/arch/arm/boot/dts/qcom/Makefile
@@ -39,6 +39,7 @@ dtb-$(CONFIG_ARCH_QCOM) += \
qcom-msm8960-cdp.dtb \
qcom-msm8960-samsung-expressatt.dtb \
qcom-msm8974-lge-nexus5-hammerhead.dtb \
+   qcom-msm8974-samsung-hlte.dtb \
qcom-msm8974-sony-xperia-rhine-amami.dtb \
qcom-msm8974-sony-xperia-rhine-honami.dtb \
qcom-msm8974pro-fairphone-fp2.dtb \
diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts
new file mode 100644
index ..e03227a49b67
--- /dev/null
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-samsung-hlte.dts
@@ -0,0 +1,403 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "qcom-msm8974.dtsi"
+#include "pm8841.dtsi"
+#include "pm8941.dtsi"
+#include 
+#include 
+#include 
+
+/ {
+   model = "Samsung Galaxy Note 3";
+   compatible = "samsung,hlte", "qcom,msm8974";
+   chassis-type = "handset";
+
+   aliases {
+   mmc0 = _1; /* SDC1 eMMC slot */
+   mmc1 = _3; /* SDC3 SD card slot */
+   serial0 = _uart1;
+   };
+
+   chosen {
+   stdout-path = "serial0:115200n8";
+   };
+
+   gpio-keys {
+   compatible = "gpio-keys";
+
+   pinctrl-names = "default";
+   pinctrl-0 = <_keys_pin_a>;


property-n
property-names


+
+   key-home {
+   label = "home_key";


"Home Key" etc.


+
+   pm8941_l20: l20 {
+   regulator-min-microvolt = <295>;
+   regulator-max-microvolt = <295>;
+
+   regulator-allow-set-load;
+   regulator-system-load = <20>;


regulator-min-microvolt = <295>;
regulator-max-microvolt = <295>;
regulator-system-load = <20>;
regulator-allow-set-load;


[...]

Konrad



[PATCH] vdpa: Convert sprintf/snprintf to sysfs_emit

2024-03-14 Thread Li Zhijian
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.

coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().

sprintf() will be converted as weel if they have.

Generally, this patch is generated by
make coccicheck M= MODE=patch \
COCCI=scripts/coccinelle/api/device_attr_show.cocci

No functional change intended

CC: "Michael S. Tsirkin" 
CC: Jason Wang 
CC: Xuan Zhuo 
CC: virtualizat...@lists.linux.dev
Signed-off-by: Li Zhijian 
---
This is a part of the work "Fix coccicheck device_attr_show warnings"[1]
Split them per subsystem so that the maintainer can review it easily
[1] https://lore.kernel.org/lkml/20240116041129.3937800-1-lizhij...@fujitsu.com/
---
 drivers/vdpa/vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index d0695680b282..7b82b9aa91e9 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -98,7 +98,7 @@ static ssize_t driver_override_show(struct device *dev,
ssize_t len;
 
device_lock(dev);
-   len = snprintf(buf, PAGE_SIZE, "%s\n", vdev->driver_override);
+   len = sysfs_emit(buf, "%s\n", vdev->driver_override);
device_unlock(dev);
 
return len;
-- 
2.29.2




Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-14 Thread Peter Hilber
On 13.03.24 21:12, Andrew Lunn wrote:
>> As long as it doesn't behave differently from the other RTC, I'm fine
>> with this. This is important because I don't want to carry any special
>> infrastructure for this driver or to have to special case this driver
>> later on because it is incompatible with some evolution of the
>> subsystem.
> 
> Maybe deliberately throw away all the sub-second accuracy when
> accessing the time via the RTC API? That helps to make it look like an
> RTC. And when doing the rounding, add a constant offset of 10ms to
> emulate the virtual i2c bus it is hanging off, like most RTCs?
> 
> Andrew

The truncating to whole seconds is already done. As to the offset, I do not
understand how this would help. I can read out my CMOS RTC in ~50 us.

Thanks for the comment,

Peter



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who have 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe a error message spew from VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordred write to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> ARM64. It should work for other architectures, but performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensuere the order theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?

>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);
> -- 
> 2.44.0




Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-14 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * Abstract common functions into `find_alloc_memory_type()`
>>
>> We should move kmem_put_memory_types() (renamed to
>> mt_put_memory_types()?) too.  This can be put in a separate patch.
>>
>
> Will do! Thanks,
>
>
>>
>> > Since different memory devices require finding or allocating a memory type,
>> > these common steps are abstracted into a single function,
>> > `find_alloc_memory_type()`, enhancing code scalability and conciseness.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Change adist calculation code to use another new lock, mt_perf_lock.
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
>> > `default_dram_perf`. This approach not only avoids deadlock but also
>> > prevents holding a large lock simultaneously.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  drivers/acpi/numa/hmat.c | 11 ++
>> >  drivers/dax/kmem.c   | 13 +--
>> >  include/linux/acpi.h |  6 
>> >  include/linux/memory-tiers.h |  8 +
>> >  mm/memory-tiers.c| 70 +---
>> >  5 files changed, 92 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> > index d6b85f0f6082..28812ec2c793 100644
>> > --- a/drivers/acpi/numa/hmat.c
>> > +++ b/drivers/acpi/numa/hmat.c
>> > @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
>> >  static LIST_HEAD(initiators);
>> >  static LIST_HEAD(localities);
>> >
>> > +static LIST_HEAD(hmat_memory_types);
>> > +
>>
>> HMAT isn't a device driver for some memory devices.  So I don't think we
>> should manage memory types in HMAT.
>
> I can put it back in memory-tier.c. How about the list? Do we still
> need to keep a separate list for storing late_inited memory nodes?
> And how about the list name if we need to remove the prefix "hmat_"?

I don't think we need a separate list for memory-less nodes.  Just
iterate all memory-less nodes.

>
>> Instead, if the memory_type of a
>> node isn't set by the driver, we should manage it in memory-tier.c as
>> fallback.
>>
>
> Do you mean some device drivers may init memory tiers between
> memory_tier_init() and late_initcall(memory_tier_late_init);?
> And this is the reason why you mention to exclude
> "node_memory_types[nid].memtype != NULL" in memory_tier_late_init().
> Is my understanding correct?

Yes.

>> >  static DEFINE_MUTEX(target_lock);
>> >
>> >  /*
>> > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
>> >  }
>> >  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
>> >
>> > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
>> > +{
>> > + return find_alloc_memory_type(adist, _memory_types);
>> > +}
>> > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
>> > +
>> >  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
>> >  {
>> >   struct memory_initiator *initiator;
>> > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
>> >   if (!hmat_set_default_dram_perf())
>> >   register_mt_adistance_algorithm(_adist_nb);
>> >
>> > + /* Post-create CPUless memory tiers after getting HMAT info */
>> > + memory_tier_late_init();
>> > +
>>
>> This should be called in memory-tier.c via
>>
>> late_initcall(memory_tier_late_init);
>>
>> Then, we don't need hmat to call it.
>>
>
> Thanks. Learned!
>
>
>> >   return 0;
>> >  out_put:
>> >   hmat_free_structures();
>> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> > index 42ee360cf4e3..aee17ab59f4f 100644
>> > --- a/drivers/dax/kmem.c
>> > +++ b/drivers/dax/kmem.c
>> > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>> >
>> >  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>> >  {
>> > - bool found = false;
>> > 

[PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Gavin Shan
The issue is reported by Yihuang Yu who have 'netperf' test on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in the VM hosted by grace-hopper machine,
while the 'netperf' server is running on grace-grace machine.

The VM is started with virtio-net and vhost has been enabled.
We observe a error message spew from VM and then soft-lockup
report. The error message indicates the data associated with
the descriptor (index: 135) has been released, and the queue
is marked as broken. It eventually leads to the endless effort
to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning we have
disordred write to the available ring element and available index.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host\
 : \
  -netdev tap,id=vnet0,vhost=on\
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

  [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
ARM64. It should work for other architectures, but performance loss is
expected.

Cc: sta...@vger.kernel.org
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
 drivers/virtio/virtio_ring.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
 
-   /* Descriptors and available array need to be set before we expose the
-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
-- 
2.44.0