Public bug reported:
[IMPACT]
On the embedded ARM cores of Bluefield-2 and Bluefield-3 (Ubuntu 22.04 Jammy),
ptp4l randomly loses synchronization during long-running operation (~24 hours)
and logs the error message:
"ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)"
Debugging traces show that the failure occurs in the network stack's
sendto() system call when ptp4l attempts to send DelayReq messages: the
call fails with error code -6 (ENXIO, "No such device or address"). This
issue affects PTP synchronization reliability on Bluefield hardware. It
was reproduced consistently on BF2 and BF3 systems but not on CX6 DX
hardware.
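For readers matching the trace value against errno names, the -6 above follows
the kernel convention of returning a negated errno. A minimal Python sketch of
that mapping, assuming a Linux host; the helper name decode_kernel_rc is
hypothetical and is not part of ptp4l or linuxptp:

```python
import errno
import os
from typing import Tuple


def decode_kernel_rc(rc: int) -> Tuple[str, str]:
    """Map a kernel-style negative return code to (errno name, message).

    Hypothetical helper for reading traces; not part of any ptp4l tooling.
    """
    if rc >= 0:
        raise ValueError("expected a negative kernel return code")
    code = -rc
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN({})".format(code))
    # os.strerror returns the platform's message text for the errno value.
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = decode_kernel_rc(-6)
    print("{}: {}".format(name, message))
```

The message text comes from the host C library, so it may differ on non-Linux
systems even where the numeric value is the same.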
The root cause is corrupted FPSIMD (floating point SIMD) register state
during kernel mode context switches. When the kernel uses NEON/FPSIMD
instructions for network operations or cryptographic functions, the
register state can be lost or corrupted if a context switch occurs,
leading to unpredictable behavior in subsequent operations including
network socket calls. This corruption manifests as the observed sendto()
failures that disrupt PTP synchronization.
[FIX]
Backporting the upstream commit:
aefbab8e77eb16b56e18f24b85a09ebf4dc60e93 ("arm64: fpsimd: Preserve/restore
kernel mode NEON at context switch")
This commit introduces proper preservation and restoration of FPSIMD
register state during context switches when kernel code is using
NEON/FPSIMD instructions. It adds a new thread flag TIF_KERNEL_FPSTATE
to track when tasks are using FPSIMD in kernel mode, and modifies the
context switch hook to save/restore the kernel FPSIMD state to/from
struct thread_struct. This prevents FPSIMD register corruption that can
affect network operations and other kernel code that executes NEON/FPSIMD
instructions.
The backport adapts the upstream changes to the linux-bluefield-5.15
kernel structure and ensures compatibility with the existing FPSIMD
handling infrastructure in the Jammy kernel base.
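The save/restore behaviour described in this section can be illustrated with a
toy user-space model. This is only a sketch of the idea as the report describes
it, not the kernel code: the Cpu and Task classes, the four-entry register
file, and the begin_kernel_neon/end_kernel_neon method names are all
hypothetical. The kernel_fpstate field stands in for TIF_KERNEL_FPSTATE and
saved_regs for the save area in struct thread_struct.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_REGS = 4  # toy register file; purely illustrative size


@dataclass
class Task:
    name: str
    kernel_fpstate: bool = False            # stand-in for TIF_KERNEL_FPSTATE
    saved_regs: Optional[List[int]] = None  # stand-in for the thread_struct save area


class Cpu:
    def __init__(self, preserve: bool) -> None:
        self.regs = [0] * NUM_REGS  # the single, shared FPSIMD register file
        self.preserve = preserve    # True models the behaviour with the fix
        self.current: Optional[Task] = None

    def begin_kernel_neon(self, task: Task) -> None:
        # Mark the task as owning kernel-mode FPSIMD state.
        self.current = task
        task.kernel_fpstate = True

    def end_kernel_neon(self, task: Task) -> None:
        task.kernel_fpstate = False
        task.saved_regs = None

    def switch_to(self, nxt: Task) -> None:
        prev = self.current
        # Save the outgoing task's registers if it is mid kernel-mode NEON.
        if self.preserve and prev is not None and prev.kernel_fpstate:
            prev.saved_regs = list(self.regs)
        self.current = nxt
        # Restore the incoming task's registers if it was interrupted earlier.
        if self.preserve and nxt.kernel_fpstate and nxt.saved_regs is not None:
            self.regs = list(nxt.saved_regs)


def run_scenario(preserve: bool) -> List[int]:
    """Task A is switched out mid-operation while task B clobbers the registers."""
    cpu = Cpu(preserve)
    a, b = Task("A"), Task("B")
    cpu.begin_kernel_neon(a)
    cpu.regs = [1, 2, 3, 4]   # A's in-flight values
    cpu.switch_to(b)          # context switch in the middle of A's work
    cpu.begin_kernel_neon(b)
    cpu.regs = [9, 9, 9, 9]   # B overwrites the shared register file
    cpu.end_kernel_neon(b)
    cpu.switch_to(a)          # A resumes
    return list(cpu.regs)     # what A now sees


if __name__ == "__main__":
    print(run_scenario(preserve=False))  # [9, 9, 9, 9]: A's state was lost
    print(run_scenario(preserve=True))   # [1, 2, 3, 4]: A's state survived
```

The model keeps one flag and one save slot per task, which is the minimum
needed to show why the context switch hook has to know whether the outgoing
task was using FPSIMD in kernel mode.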
[TEST CASE]
Compile tested on linux-bluefield-5.15 on the master-next branch.
Functional testing involved reproducing the original ptp4l synchronization
issue on BF2/BF3 hardware by running extended PTP operations for multiple days.
After applying the patch, the system was tested for 7 consecutive days under
the same conditions that previously triggered the issue within 24 hours. No
ptp4l synchronization failures or ENXIO errors from sendto() calls were
observed during the extended test period.
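A long soak test like this is easier to evaluate with a small log scanner. The
following Python sketch is one possible way to count transitions into FAULTY;
it is not the tooling used for the test above. The regular expression is
derived only from the single log line quoted in [IMPACT], and the names
find_faults and Transition are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern derived from the log line quoted in the report:
# ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
TRANSITION_RE = re.compile(
    r"ptp4l\[(?P<ts>\d+\.\d+)\]: port (?P<port>\d+): "
    r"(?P<old>\w+) to (?P<new>\w+) on (?P<event>\w+)"
)


class Transition(NamedTuple):
    timestamp: float
    port: int
    old_state: str
    new_state: str
    event: str


def find_faults(lines: Iterable[str]) -> List[Transition]:
    """Return every port state transition whose new state is FAULTY."""
    faults = []
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match and match.group("new") == "FAULTY":
            faults.append(
                Transition(
                    timestamp=float(match.group("ts")),
                    port=int(match.group("port")),
                    old_state=match.group("old"),
                    new_state=match.group("new"),
                    event=match.group("event"),
                )
            )
    return faults


if __name__ == "__main__":
    sample = [
        "ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
    ]
    for fault in find_faults(sample):
        print(fault)
```

An empty result over the full test window would correspond to the "no
synchronization failures" outcome reported here; where ptp4l writes its log
depends on how it is launched on the system under test.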
[REGRESSION POTENTIAL]
The backport introduces new code paths for FPSIMD state management during
context switches. Potential regression areas include context switch performance
overhead and compatibility with existing kernel FPSIMD users. The 7-day test
run on BF2/BF3 described under [TEST CASE] showed no failures, which provides
some confidence in the stability of the backport.
** Affects: linux-bluefield (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2119457
Title:
Ubuntu 22.04: ptp4l randomly goes out of sync on BF2/BF3 with ENXIO
errors from sendto() calls