On 22.12.21 at 15:11, Alex Deucher wrote:
On Wed, Dec 22, 2021 at 3:18 AM Deng, Emily wrote:
[AMD Official Use Only]
Currently, this issue has only been found on Ampere, but it is hard to detect
an Ampere board, especially in an Arm passthrough environment.
Isn't this already handled in drm_arch_can_wc_memor
On 22.12.21 at 21:53, Daniel Vetter wrote:
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
[SNIP]
Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random
On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
wrote:
>
> [Public]
>
> > -Original Message-
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds ; Imre Deak
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter ; Kai-Heng Feng
> >
> > S
[AMD Official Use Only]
Reviewed-by: Evan Quan
From: Nikolic, Marina
Sent: Wednesday, December 22, 2021 7:25 PM
To: Quan, Evan ; Russell, Kent ;
amd-gfx@lists.freedesktop.org
Cc: Mitrovic, Milan ; Kitchen, Greg
Subject: Re: [PATCH] amdgpu/pm: Modify sysfs to have only read permission in
SRI
Sorry for the typo in my previous email. Please read Adrian Reber*
On 12/22/2021 8:49 PM, Bhardwaj, Rajneesh wrote:
Adding Adrian Rebel who is the CRIU maintainer and CRIU list
On 12/22/2021 3:53 PM, Daniel Vetter wrote:
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
On
Adding Adrian Rebel who is the CRIU maintainer and CRIU list
On 12/22/2021 3:53 PM, Daniel Vetter wrote:
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
On 12/20/2021 4:29 AM, Daniel Vetter wrote:
On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
On 09.12.21
In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also setup xnack mode and free up the resources.
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 10 +
drivers/gpu
- Add rock-rel_defconfig for release builds.
Signed-off-by: Rajneesh Bhardwaj
---
arch/x86/configs/rock-rel_defconfig | 4927 +++
1 file changed, 4927 insertions(+)
create mode 100644 arch/x86/configs/rock-rel_defconfig
diff --git a/arch/x86/configs/rock-rel_defconfig
From: David Yat Sin
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
..
Currently the SVM ranges use actual_gpu_id, but with Checkpoint Restore
support it is possible that the SVM ranges are resumed on another node
where the actual_gpu_id may not be the same as the original (user_gpu_id)
gpu id. So modify the svm code to use user_gpu_id.
Signed-off-by: Rajneesh Bhardwaj
---
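The translation between the two ids described above can be pictured with a small user-space sketch. This is not the driver's code: the struct gpu_id_entry type and the lookup_actual_gpu_id() helper are hypothetical names chosen for illustration, and a real implementation would walk the per-process device data instead of a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process table entry; field names are illustrative only. */
struct gpu_id_entry {
    uint32_t user_gpu_id;   /* id the application saw at checkpoint time */
    uint32_t actual_gpu_id; /* id of the backing device on the current node */
};

/*
 * Translate a user-visible gpu id into the id of the device that backs it
 * on the current node. Returns 0 on success, -1 if the id is unknown.
 */
static int lookup_actual_gpu_id(const struct gpu_id_entry *map, size_t n,
                                uint32_t user_gpu_id, uint32_t *actual)
{
    assert(map != NULL && actual != NULL);

    for (size_t i = 0; i < n; i++) {
        if (map[i].user_gpu_id == user_gpu_id) {
            *actual = map[i].actual_gpu_id;
            return 0;
        }
    }
    return -1;
}
```

Keeping user_gpu_id as the key means that anything recorded at checkpoint time stays valid after a restore on a different node; only the table contents change.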
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_ch
During checkpoint stage, save the shared virtual memory ranges and
attributes for the target process. A process may contain a number of svm
ranges and each range might contain a number of attributes. While not
all attributes may be applicable for a given prange but during
checkpoint we store all po
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same sdma id value used during CRIU dump.
Signed-off-by: David Yat Sin
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 48 ++-
.../drm/amd/amdkfd/kfd_device_queue_manager.h | 3 +-
Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.
Signed-off-by: Rajneesh Bhardwaj
--
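A minimal sketch of the distinction drawn above, assuming a set-or-query interface where a negative argument only queries. The struct proc_state type and all three helpers are hypothetical and only model the idea that a checkpoint stores a plain 0 or 1 and a restore rejects negative values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process state that holds the xnack mode. */
struct proc_state {
    bool xnack_enabled;
};

/* Set-or-query: a negative mode only reports the current setting. */
static int set_xnack_mode(struct proc_state *p, int mode)
{
    assert(p != NULL);
    if (mode >= 0)
        p->xnack_enabled = (mode != 0);
    return p->xnack_enabled ? 1 : 0;
}

/* Checkpoint side: the saved value is always 0 or 1, never a query value. */
static int checkpoint_xnack(const struct proc_state *p)
{
    assert(p != NULL);
    return p->xnack_enabled ? 1 : 0;
}

/* Restore side: negative (query-only) values are rejected. */
static int restore_xnack(struct proc_state *p, int saved)
{
    assert(p != NULL);
    if (saved < 0)
        return -1;
    p->xnack_enabled = (saved != 0);
    return 0;
}
```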
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same doorbell id value used during CRIU dump.
Signed-off-by: David Yat Sin
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 60 +--
1 file changed, 41 insertions(+), 19 deletions(-)
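Restoring a queue with the id it had at dump time amounts to an allocator that can either pick any free slot or claim one specific slot. The sketch below shows that idea in user space; struct id_pool, MAX_IDS and id_pool_alloc() are hypothetical names, and the pool size is arbitrary rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS 64 /* illustrative pool size, not a hardware limit */

struct id_pool {
    bool used[MAX_IDS];
};

/*
 * Allocate an id from the pool.
 * restore_id == NULL: normal path, return the first free id.
 * restore_id != NULL: restore path, claim exactly *restore_id.
 * Returns the id, or -1 if no id (or not the requested id) is available.
 */
static int id_pool_alloc(struct id_pool *pool, const unsigned int *restore_id)
{
    assert(pool != NULL);

    if (restore_id != NULL) {
        if (*restore_id >= MAX_IDS || pool->used[*restore_id])
            return -1;
        pool->used[*restore_id] = true;
        return (int)*restore_id;
    }

    for (unsigned int i = 0; i < MAX_IDS; i++) {
        if (!pool->used[i]) {
            pool->used[i] = true;
            return (int)i;
        }
    }
    return -1;
}
```

Failing the restore when the requested id is already taken, rather than silently handing out another one, keeps the restored process consistent with whatever ids it recorded before the dump.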
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct
from current but for a Checkpoint or Restore operation, the current->mm
will fetch the mm for the CRIU master process. So modify these helpers to
accept the task mm for a target kfd process to support Checkpoint
Restore.
Signed-o
A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by
From: David Yat Sin
Introduce an UNPAUSE op. After the CRIU amdgpu plugin performs a PROCESS_INFO
op, the queues will stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.
Signed-off-by: David Yat Sin
---
drivers/gpu/
KFD buffer objects do not associate a GEM handle with them so cannot
directly be used with libdrm to initiate a system dma (sDMA) operation
to speed up the checkpoint and restore operation so export them as dmabuf
objects and use with libdrm helper (amdgpu_bo_import) to further process
the sdma comm
From: David Yat Sin
Add support to the existing CRIU ioctls to save and restore events during
CRIU checkpoint and restore.
Signed-off-by: David Yat Sin
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 70 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 272 ---
drivers/gpu/dr
This implements the KFD CRIU Restore ioctl that lays the basic
foundation for the CRIU restore operation. It provides support to
create the buffer objects corresponding to Non-Paged system memory
mapped for GPU and/or CPU access and lays basic foundation for the
userptrs buffer objects which will b
From: David Yat Sin
When doing a restore on a different node, the gpu_ids on the restore
node may be different, but the user space application will still
use the original gpu_ids in the ioctl calls. Add code to create a
gpu id mapping so that kfd can determine actual gpu_id during the
This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.
From: David Yat Sin
Checkpoint contents of queue MQDs on CRIU dump and restore them during
CRIU restore.
Signed-off-by: David Yat Sin
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
.../drm/amd/amdkfd/kfd_device_queue_manage
This IOCTL is expected to be called as a precursor to the actual
Checkpoint operation. This does the basic discovery into the target
process seized by CRIU and relays the information to the userspace that
utilizes it to start the Checkpoint operation via another dedicated
IOCTL.
The process_info I
From: David Yat Sin
Add support to the existing CRIU ioctls to save the number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.
Signed-off-by: David Yat Sin
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 110 -
drivers/gpu/drm/amd/amdkfd/kf
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.
Signed-off-by: Rajneesh Bhardwaj
Signed-off-by: David Yat Sin
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgde
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can
snapshot a running process and later restore it on the same or a remote
machine but expects the processes that have a device file (e.g. GPU)
associated with them, provide necessary driver support to assist CRIU
and its extensible plugin
This adds support to create userptr BOs on restore and introduces a new
ioctl to restart memory notifiers for the restored userptr BOs.
When doing a CRIU restore, MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore proc
- Update debug config for Checkpoint-Restore (CR) support
- Also include necessary options for CR with docker containers.
Signed-off-by: Rajneesh Bhardwaj
---
arch/x86/configs/rock-dbg_defconfig | 53 ++---
1 file changed, 34 insertions(+), 19 deletions(-)
diff --git a
CRIU is a user space tool which is very popular for container live
migration in datacentres. It can checkpoint a running application, save
its complete state, memory contents and all system resources to images
on disk which can be migrated to another machine and restored later.
More information on
Since all GPU resets are now serialized, there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file chang
Since we serialize all resets, there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
drivers/gpu/drm/amd/amdgpu/amdgpu_xg
Since FLR work is now serialized against GPU resets,
there is no need for this.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
2 files changed, 22 deletions(-)
diff --git a/drivers/gpu/drm/amd/
No need to trigger another work queue inside the work queue.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 +--
3 files changed
[AMD Official Use Only]
Hi all,
This week this patchset was tested on the following systems:
Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following display
types: eDP 1080p 60hz, 4k 60hz (via USB-C to DP/HDMI), 1440p 144hz (via USB-C
to DP/HDMI), 1680*1050 60hz (via USB-C to D
Restrict job resubmission to the suspend case
only, since the schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/g
Use the reset domain wq also for non-TDR GPU recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to qeue
Before we initialize the schedulers we must know which reset
domain we are in. For a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/d
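The per-device versus per-hive choice described above can be sketched as a simple selection function. The types below (struct reset_domain, struct sketch_hive, struct sketch_device) and the helper device_reset_domain() are hypothetical stand-ins, not the amdgpu structures; the reset_domain here only represents the object that would own the ordered workqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a reset domain and the single ordered wq it would own. */
struct reset_domain {
    int id;
};

/* Hypothetical XGMI hive: all member devices share one reset domain. */
struct sketch_hive {
    struct reset_domain domain;
};

/* Hypothetical device: hive is NULL when the device is not part of a hive. */
struct sketch_device {
    struct reset_domain own_domain;
    struct sketch_hive *hive;
};

/* Pick the domain whose queue must serialize resets for this device. */
static struct reset_domain *device_reset_domain(struct sketch_device *dev)
{
    assert(dev != NULL);
    return dev->hive ? &dev->hive->domain : &dev->own_domain;
}
```

Two devices in the same hive resolve to the same domain, so their resets queue behind each other, while a standalone device keeps a domain of its own.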
From: Nicholas Kazlauskas
[Why]
To maintain compatibility with firmware older than 4.0.11.
Such firmware may have intermittent hangs with RDCSPIPE or the PHY,
but we shouldn't regress its previous behavior.
[How]
Use the new path if firmware is development or 4.0.11 or newer. Use the
legacy
This patchset is based on earlier work by Boris[1] that made it possible to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. On top of that I also serialized
any GPU reset we trigger from within amdgpu code to also go through the same
o
Define a reset_domain struct such that
all the entities that go through reset
together will be serialized against one
another. Do it for both the single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Daniel Vetter
Suggested-by: Christian König
Reviewed-by: Christian Kön
From: Mikita Lipski
[why]
We want to know if new crtc state is enabling MPO configuration before
enabling it.
[how]
Detect if both primary and overlay planes are enabled on the same CRTC.
Reviewed-by: Bhawanpreet Lakha
Acked-by: Rodrigo Siqueira
Signed-off-by: Mikita Lipski
---
drivers/gpu/d
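The detection described in the [how] section above boils down to checking whether a primary and an overlay plane are both attached to the same CRTC in the new state. A minimal user-space sketch, assuming a flat array of plane states; enum plane_type, struct plane_state and crtc_enables_mpo() are hypothetical names and not the DRM or amdgpu_dm API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum plane_type { PLANE_PRIMARY, PLANE_OVERLAY, PLANE_CURSOR };

/* Hypothetical new-state snapshot of one plane. */
struct plane_state {
    enum plane_type type;
    int crtc_index; /* CRTC the plane is attached to, -1 when disabled */
};

/* True when both a primary and an overlay plane are enabled on crtc_index. */
static bool crtc_enables_mpo(const struct plane_state *planes, size_t n,
                             int crtc_index)
{
    bool primary = false;
    bool overlay = false;

    assert(planes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (planes[i].crtc_index != crtc_index)
            continue;
        if (planes[i].type == PLANE_PRIMARY)
            primary = true;
        else if (planes[i].type == PLANE_OVERLAY)
            overlay = true;
    }
    return primary && overlay;
}
```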
From: Charlene Liu
[why]
driver missed the check.
[how]
add the check.
add min display clock = 100mhz check based on dccg doc.
[note]
add SetPhyclkVoltageByFreq as confirmed with smu, but not enabled in
this change.
Reviewed-by: Dmytro Laktyushkin
Acked-by: Rodrigo Siqueira
Signed-off-by: Ch
Hi,
This is the last DC upstream of this year. As a result, it is a very
tiny one with a few bug fixes.
Just out of curiosity, I decided to calculate how many patches we upstream
via this weekly process in 2021, and it was approximately 740 patches
where Daniel Wheeler tested each patchset. Thanks t
From: Nicholas Kazlauskas
[Why]
PSP will suspend and resume DMCUB. Driver should just wait for DMCUB to
finish the auto load before continuing instead of placing it into
reset, wiping its firmware state and reinitializing.
If we don't let DMCUB fully finish initializing for S0ix then some stat
From: Wenjing Liu
[why]
1. Current code hard codes the link to PHY mapping at dc link level per asic
per revision.
This is not scalable. In the long term the mapping will be obtained from
DMUB and stored in the dc resource.
2. Depending on DCN revision and endpoint type, the definition of
dio_output_idx dio_
From: Yi-Ling Chen
[Why]
Depending on res_pool->res_cap->num_timing_generator to query timing
generator information would cause underflow in the fused display
pipes case.
Due to the res_pool->res_cap->num_timing_generator records default
timing generator resource built in driver, not the current
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>
> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> > On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> > > On 09.12.21 at 19:28, Felix Kuehling wrote:
> > > > On 2021-12-09 at 10:30 a.m., Christian König wrote:
>
When runtime pm kicks in and the device goes into runtime
suspend, we often see random calls (small rendering calls,
etc.) into the driver which cause the device to runtime
resume. On resume we issue a hotplug event in case any
displays were changed during suspend, however, these events
cause the
Applied. Thanks!
Alex
On Fri, Dec 17, 2021 at 11:22 PM Yizhuo Zhai wrote:
>
> In function enable_stream_features(), the variable "old_downspread.raw"
> could be uninitialized if core_link_read_dpcd() fails, however, it is
> used in the later if statement, and further, core_link_write_dpcd()
> m
On Tue, Dec 21, 2021 at 10:13 PM Quan, Evan wrote:
>
> [AMD Official Use Only]
>
>
>
> > -Original Message-
> > From: amd-gfx On Behalf Of Alex
> > Deucher
> > Sent: Tuesday, December 21, 2021 10:59 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander
> > Subject: [PATCH]
amdgpu_umc_poison_handler for UMC RAS consumption gets
called in KFD queue reset, but it needs to return early when
the RAS context is NULL. This guarantees that lower-level accesses
to the RAS context, such as in amdgpu_umc_do_page_retirement, are
safe. Also improve the coding style in amdgpu_umc_poison_handler.
Signed-off-by: Guch
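The early-return pattern described above can be illustrated with a small user-space sketch. The types and functions here (struct ras_context, struct device_ctx, do_page_retirement(), poison_handler()) are hypothetical stand-ins rather than the amdgpu code, and the -1 return value is an assumption; the source does not say what the real handler returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical RAS context; the counter only makes the effect observable. */
struct ras_context {
    int retired_pages;
};

/* Hypothetical device; ras is NULL when RAS is not set up. */
struct device_ctx {
    struct ras_context *ras;
};

/* Lower-level helper that assumes a valid context. */
static int do_page_retirement(struct ras_context *ras)
{
    assert(ras != NULL);
    ras->retired_pages++;
    return 0;
}

/* Handler: bail out before anything touches a missing RAS context. */
static int poison_handler(struct device_ctx *dev)
{
    if (dev == NULL || dev->ras == NULL)
        return -1; /* assumed error value, for illustration only */

    return do_page_retirement(dev->ras);
}
```

Checking once at the entry point lets the helpers below it rely on a valid context instead of repeating the check.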
From: Ira Weiny
kmap() is being deprecated and these usages are all local to the thread
so there is no reason kmap_local_page() can't be used.
Replace kmap() calls with kmap_local_page().
Signed-off-by: Ira Weiny
---
NOTE: I'm sending as a follow on to the V1 patch. Please let me know if you
On Wed, Dec 22, 2021 at 3:18 AM Deng, Emily wrote:
>
> [AMD Official Use Only]
>
> Currently, this issue has only been found on Ampere, but it is hard to detect
> an Ampere board, especially in an Arm passthrough environment.
Isn't this already handled in drm_arch_can_wc_memory()?
Alex
>
> Best wishes
> Emily D
On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote:
>
>
>
> - Original Message -
> > From: "Alex Deucher"
> > To: "Yann Dirson"
> > Cc: "Christian König" , "amd-gfx list"
> >
> > Sent: Tuesday, 21 December 2021 23:31:01
> > Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to
On 12/22/2021 4:55 PM, Nikolic, Marina wrote:
[AMD Official Use Only]
[AMD Official Use Only]
From a6512c0897aa58ccac9e5483d31193d83fb590b2 Mon Sep 17 00:00:00 2001
From: Marina Nikolic
Date: Tue, 14 Dec 2021 20:57:53 +0800
Subject: [PATCH] amdgpu/pm: Modify sysfs to have only read permi
On 12/22/2021 10:53 AM, Quan, Evan wrote:
[AMD Official Use Only]
-Original Message-
From: Lazar, Lijo
Sent: Tuesday, December 21, 2021 2:22 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander
Subject: Re: [PATCH V5 13/16] drm/amd/pm: relocate the power relat
[AMD Official Use Only]
From a6512c0897aa58ccac9e5483d31193d83fb590b2 Mon Sep 17 00:00:00 2001
From: Marina Nikolic
Date: Tue, 14 Dec 2021 20:57:53 +0800
Subject: [PATCH] amdgpu/pm: Modify sysfs to have only read permission in
SRIOV/ONEVF mode
== Description ==
Setting through sysfs should not
[AMD Official Use Only]
Currently, this issue has only been found on Ampere, but it is hard to detect
an Ampere board, especially in an Arm passthrough environment.
Best wishes
Emily Deng
>-Original Message-
>From: amd-gfx On Behalf Of
>Christian König
>Sent: Wednesday, December 22, 2021 4:11 PM
>To:
On 22.12.21 at 06:51, Victor Zhao wrote:
Some Arm-based platforms have a hardware issue which may
generate incorrect addresses when receiving writes from the CPU
with a discontiguous set of byte enables. This affects writes
with the write-combine property.
Can you point out which arm platforms are
61 matches