Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-13 Thread Andrey Grodzovsky



On 11/13/2017 10:27 AM, Christian König wrote:

On 13.11.2017 15:57, Andrey Grodzovsky wrote:

On 11/13/2017 07:39 AM, Christian König wrote:


On 13.11.2017 12:32, Michel Dänzer wrote:

On 12/11/17 10:35 AM, Christian König wrote:

A few comments on the code:


+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device
*adev,
+  unsigned long size, u32 domain)

Drop the inline keyword and the second _bo_ in the name here.


+{
+struct ttm_mem_type_manager *man = NULL;
+
+if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+if (man && size < (man->size << PAGE_SHIFT))

Drop the extra check that man is not NULL. We get the pointer to an
array element, that can't be NULL.


+return true;

Mhm, domain is a bitmask of allowed domains.

So we should check all valid domains if the size fits, not just the first one.

Assuming VRAM <-> system migration of BOs larger than the GTT domain
works, I'd say we should only require that the BO can fit in any of the
allowed domains. Otherwise it must also always fit in GTT.
Good point, and yes VRAM <-> system migration of BOs larger than the 
GTT domain works now.


I can agree that VRAM should probably be optional; otherwise we can't 
allocate anything large when the driver uses only a very low amount of 
stolen VRAM on APUs.


But I think when userspace requests VRAM and GTT at the same time we 
should still be able to fall back to GTT.


Attached is the V2 patch. I still don't understand why I get a SIGSEGV 
in the tester when the check fails and the IOCTL returns ENOMEM.




Reviewed-by: Christian König  for this one, 
but please use git send-email to send out patches.


I will update the libdrm test to correctly handle the memory allocation 
failure; it segfaults at the moment.


Sounds like it just tries to use the BO for VM or CPU mapping while 
the underlying function has failed (or we have another bug somewhere).


Yes, the segfault happens because I am using gpu_mem_alloc, which continues 
executing after amdgpu_bo_alloc has failed; the segfault is in amdgpu_bo_va_op.




Please commit the kernel patch and leave me a note so that I can push 
the libdrm patches. 


Are you going to push patches 1-3 from the original series, and should I 
then resend patch 4 to fix the segfault?



BTW: Do you have the link where you requested an account at hand? I want 
to ping the admins once more.


https://bugs.freedesktop.org/show_bug.cgi?id=103566

Thanks,
Andrey



Regards,
Christian.




Thanks,
Andrey


Regards,
Christian.






___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-13 Thread Christian König

On 13.11.2017 15:57, Andrey Grodzovsky wrote:

On 11/13/2017 07:39 AM, Christian König wrote:


On 13.11.2017 12:32, Michel Dänzer wrote:

On 12/11/17 10:35 AM, Christian König wrote:

A few comments on the code:


+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device
*adev,
+  unsigned long size, u32 domain)

Drop the inline keyword and the second _bo_ in the name here.


+{
+    struct ttm_mem_type_manager *man = NULL;
+
+    if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+    man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+    if (man && size < (man->size << PAGE_SHIFT))

Drop the extra check that man is not NULL. We get the pointer to an
array element, that can't be NULL.


+    return true;

Mhm, domain is a bitmask of allowed domains.

So we should check all valid domains if the size fits, not just the first one.

Assuming VRAM <-> system migration of BOs larger than the GTT domain
works, I'd say we should only require that the BO can fit in any of the
allowed domains. Otherwise it must also always fit in GTT.
Good point, and yes VRAM <-> system migration of BOs larger than the 
GTT domain works now.


I can agree that VRAM should probably be optional; otherwise we can't 
allocate anything large when the driver uses only a very low amount of 
stolen VRAM on APUs.


But I think when userspace requests VRAM and GTT at the same time we 
should still be able to fall back to GTT.


Attached is the V2 patch. I still don't understand why I get a SIGSEGV 
in the tester when the check fails and the IOCTL returns ENOMEM.




Reviewed-by: Christian König  for this one, 
but please use git send-email to send out patches.


I will update the libdrm test to correctly handle the memory allocation 
failure; it segfaults at the moment.


Sounds like it just tries to use the BO for VM or CPU mapping while the 
underlying function has failed (or we have another bug somewhere).


Please commit the kernel patch and leave me a note so that I can push 
the libdrm patches. BTW: Do you have the link where you requested an 
account at hand? I want to ping the admins once more.


Regards,
Christian.




Thanks,
Andrey


Regards,
Christian.






Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-13 Thread Andrey Grodzovsky

On 11/13/2017 07:39 AM, Christian König wrote:


On 13.11.2017 12:32, Michel Dänzer wrote:

On 12/11/17 10:35 AM, Christian König wrote:

A few comments on the code:


+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device
*adev,
+  unsigned long size, u32 domain)

Drop the inline keyword and the second _bo_ in the name here.


+{
+struct ttm_mem_type_manager *man = NULL;
+
+if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+if (man && size < (man->size << PAGE_SHIFT))

Drop the extra check that man is not NULL. We get the pointer to an
array element, that can't be NULL.


+return true;

Mhm, domain is a bitmask of allowed domains.

So we should check all valid domains if the size fits, not just the first one.

Assuming VRAM <-> system migration of BOs larger than the GTT domain
works, I'd say we should only require that the BO can fit in any of the
allowed domains. Otherwise it must also always fit in GTT.
Good point, and yes VRAM <-> system migration of BOs larger than the 
GTT domain works now.


I can agree that VRAM should probably be optional; otherwise we can't 
allocate anything large when the driver uses only a very low amount of 
stolen VRAM on APUs.


But I think when userspace requests VRAM and GTT at the same time we 
should still be able to fall back to GTT.


Attached is the V2 patch. I still don't understand why I get a SIGSEGV 
in the tester when the check fails and the IOCTL returns ENOMEM.


I will update the libdrm test to correctly handle the memory allocation 
failure; it segfaults at the moment.



Thanks,
Andrey


Regards,
Christian.


From 37369b3a58027dedf7e9dc65fd9bc0f9e86d0d80 Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky 
Date: Fri, 10 Nov 2017 18:35:56 -0500
Subject: drm/amdgpu: Implement BO size validation V2

Validate the BO size against each requested domain's total memory.

v2:
Make the GTT size check mandatory to allow falling back to GTT.
Remove the redundant NULL check.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index a937c49..5acf20c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -281,6 +281,44 @@ void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
 		*cpu_addr = NULL;
 }
 
+/* Validate that the BO size fits within the requested domains */
+static bool amdgpu_bo_validate_size(struct amdgpu_device *adev,
+	  unsigned long size, u32 domain)
+{
+	struct ttm_mem_type_manager *man = NULL;
+
+	/*
+	 * If GTT is part of requested domains the check must succeed to
+	 * allow fall back to GTT
+	 */
+	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
+		man = &adev->mman.bdev.man[TTM_PL_TT];
+
+		if (size < (man->size << PAGE_SHIFT))
+			return true;
+		else
+			goto fail;
+	}
+
+	if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+		man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+		if (size < (man->size << PAGE_SHIFT))
+			return true;
+		else
+			goto fail;
+	}
+
+
+	/* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU */
+	return true;
+
+fail:
+	DRM_ERROR("BO size %lu > total memory in domain: %llu\n", size,
+	  man->size << PAGE_SHIFT);
+	return false;
+}
+
 static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 			   unsigned long size, int byte_align,
 			   bool kernel, u32 domain, u64 flags,
@@ -299,6 +337,9 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 	page_align = roundup(byte_align, PAGE_SIZE) >> PAGE_SHIFT;
 	size = ALIGN(size, PAGE_SIZE);
 
+	if (!amdgpu_bo_validate_size(adev, size, domain))
+		return -ENOMEM;
+
 	if (kernel) {
 		type = ttm_bo_type_kernel;
 	} else if (sg) {
-- 
2.7.4



Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-13 Thread Christian König

On 13.11.2017 12:32, Michel Dänzer wrote:

On 12/11/17 10:35 AM, Christian König wrote:

A few comments on the code:


+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device
*adev,
+                      unsigned long size, u32 domain)

Drop the inline keyword and the second _bo_ in the name here.


+{
+    struct ttm_mem_type_manager *man = NULL;
+
+    if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+        man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+        if (man && size < (man->size << PAGE_SHIFT))

Drop the extra check that man is not NULL. We get the pointer to an
array element, that can't be NULL.


+            return true;

Mhm, domain is a bitmask of allowed domains.

So we should check all valid domains if the size fits, not just the first one.

Assuming VRAM <-> system migration of BOs larger than the GTT domain
works, I'd say we should only require that the BO can fit in any of the
allowed domains. Otherwise it must also always fit in GTT.
Good point, and yes VRAM <-> system migration of BOs larger than the GTT 
domain works now.


I can agree that VRAM should probably be optional; otherwise we can't 
allocate anything large when the driver uses only a very low amount of 
stolen VRAM on APUs.


But I think when userspace requests VRAM and GTT at the same time we 
should still be able to fall back to GTT.


Regards,
Christian.


Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-13 Thread Michel Dänzer
On 12/11/17 10:35 AM, Christian König wrote:
> A few comments on the code:
> 
>> +/* Validate bo size is bit bigger then the request domain */
>> +static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device
>> *adev,
>> +                      unsigned long size, u32 domain)
> Drop the inline keyword and the second _bo_ in the name here.
> 
>> +{
>> +    struct ttm_mem_type_manager *man = NULL;
>> +
>> +    if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
>> +        man = &adev->mman.bdev.man[TTM_PL_VRAM];
>> +
>> +        if (man && size < (man->size << PAGE_SHIFT))
> 
> Drop the extra check that man is not NULL. We get the pointer to an
> array element, that can't be NULL.
> 
>> +            return true;
> Mhm, domain is a bitmask of allowed domains.
> 
> So we should check all valid domains if the size fits, not just the first one.

Assuming VRAM <-> system migration of BOs larger than the GTT domain
works, I'd say we should only require that the BO can fit in any of the
allowed domains. Otherwise it must also always fit in GTT.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-12 Thread Christian König

A few comments on the code:


+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device *adev,
+                      unsigned long size, u32 domain)

Drop the inline keyword and the second _bo_ in the name here.


+{
+    struct ttm_mem_type_manager *man = NULL;
+
+    if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+        man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+        if (man && size < (man->size << PAGE_SHIFT))


Drop the extra check that man is not NULL. We get the pointer to an 
array element, that can't be NULL.



+            return true;

Mhm, domain is a bitmask of allowed domains.

So we should check all valid domains if the size fits, not just the first one.



+    }
+
+    if (domain & AMDGPU_GEM_DOMAIN_GTT) {
+        man = &adev->mman.bdev.man[TTM_PL_TT];
+
+        if (man && size < (man->size << PAGE_SHIFT))
+            return true;
+    }
+
+    if (domain & AMDGPU_GEM_DOMAIN_CPU) {
+        man = &adev->mman.bdev.man[TTM_PL_SYSTEM];
+
+        if (man && size < (man->size << PAGE_SHIFT))
+            return true;
+    }
IIRC we never actually set a size on the system domain, so that check 
can't succeed.


(Maybe we should set a size and actually enforce it, that would solve 
our eviction problem as well)


Regards,
Christian.

On 11.11.2017 00:43, Andrey Grodzovsky wrote:



On 11/10/2017 10:48 AM, Christian König wrote:

On 10.11.2017 16:36, Andrey Grodzovsky wrote:



On 11/10/2017 07:17 AM, Christian König wrote:

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but 
rather be reacted immediately.


Maybe we should add a second test which does incremental 1GB allocations, 
but still keep this test? With this test I get a call stack as below plus 
a crash of the test suite with a general protection fault. As normal 
behavior I would have expected just an errno returned from 
amdgpu_bo_alloc, which we could check in the test.


Yeah, totally agree. And when this works correctly we should really 
enable this test case by default as well.


When I implemented scattered eviction I completely removed the check 
which limited the BO size. That was probably a bit too extreme.


We still need to check the size here so that we don't create a BO 
larger than what makes sense for the domain it should be stored in.


Patch attached, tested with the DRM tester. The call stack is gone, but I 
still get a SIGSEGV and the tester crashes; attaching a debugger shows the 
SIGSEGV is received while the tester is still in the IOCTL -

r = drmCommandWriteRead(dev->fd, DRM_AMDGPU_GEM_CREATE,
    &args, sizeof(args));

dmesg

[  104.608664 <   16.227791>] [drm:amdgpu_bo_do_create [amdgpu]] 
*ERROR* BO size 1 > total memory in domain: 1073741824
[  104.608911 <    0.000247>] [drm:amdgpu_bo_do_create [amdgpu]] 
*ERROR* BO size 1 > total memory in domain: 3221225472
[  104.609168 <    0.000257>] [drm:amdgpu_gem_object_create [amdgpu]] 
*ERROR* Failed to allocate GEM object (1, 6, 4096, -12)
[  104.609301 <    0.000133>] traps: lt-amdgpu_test[1142] general 
protection ip:7f21c9ed6007 sp:7ffe08ae1e30 error:0 in 
libdrm_amdgpu.so.1.0.0[7f21c9ed2000+b000]



Thanks,
Andrey



Regards,
Christian.



Thanks,
Andrey

[169053.128981 <72032.811683>] [ cut here ]
[169053.129006 <    0.25>] WARNING: CPU: 0 PID: 22883 at 
mm/page_alloc.c:3883 __alloc_pages_slowpath+0xf03/0x14e0
[169053.129007 <    0.01>] Modules linked in: amdgpu chash ttm 
drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect 
sysimgblt edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul 
crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel 
snd_hda_codec_generic pcbc snd_hda_codec_hdmi snd_hda_intel 
aesni_intel snd_hda_codec aes_x86_64 snd_hda_core crypto_simd 
glue_helper snd_hwdep rfkill_gpio cryptd snd_pcm snd_seq_midi 
snd_seq_midi_event serio_raw snd_rawmidi snd_seq cdc_ether usbnet 
snd_seq_device joydev fam15h_power k10temp r8152 snd_timer mii 
i2c_piix4 rtsx_pci_ms snd memstick soundcore shpchp 8250_dw 
i2c_designware_platform i2c_designware_core mac_hid binfmt_misc nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport 
autofs4 rtsx_pci_sdmmc psmouse rtsx_pci sdhci_pci ahci sdhci libahci

[169053.129084 <    0.77>]  video i2c_hid hid_generic usbhid hid
[169053.129096 <    0.12>] CPU: 0 PID: 22883 Comm: 
lt-amdgpu_test Tainted: G    W   4.14.0-rc3+ #1
[169053.129097 <    0.01>] Hardware name: AMD Gardenia/Gardenia, 
BIOS RGA1101C 07/20/2015
[169053.129099 <    0.02>] task: 880048803d80 task.stack: 
880064688000
[169053.129103 <    0.04>] RIP: 
0010:__alloc_pages_slowpath+0xf03/0x14e0
[169053.129105 <    0.02>] RSP: 0018:88006468f108 EFLAGS: 
00010246
[169053.129108 <    

Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-10 Thread Andrey Grodzovsky



On 11/10/2017 10:48 AM, Christian König wrote:

On 10.11.2017 16:36, Andrey Grodzovsky wrote:



On 11/10/2017 07:17 AM, Christian König wrote:

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but 
rather be reacted immediately.


Maybe we should add a second test which does incremental 1GB allocations, 
but still keep this test? With this test I get a call stack as below plus 
a crash of the test suite with a general protection fault. As normal 
behavior I would have expected just an errno returned from 
amdgpu_bo_alloc, which we could check in the test.


Yeah, totally agree. And when this works correctly we should really 
enable this test case by default as well.


When I implemented scattered eviction I completely removed the check 
which limited the BO size. That was probably a bit too extreme.


We still need to check the size here so that we don't create a BO 
larger than what makes sense for the domain it should be stored in.


Patch attached, tested with the DRM tester. The call stack is gone, but I 
still get a SIGSEGV and the tester crashes; attaching a debugger shows the 
SIGSEGV is received while the tester is still in the IOCTL -

r = drmCommandWriteRead(dev->fd, DRM_AMDGPU_GEM_CREATE,
&args, sizeof(args));

dmesg

[  104.608664 <   16.227791>] [drm:amdgpu_bo_do_create [amdgpu]] *ERROR* 
BO size 1 > total memory in domain: 1073741824
[  104.608911 <0.000247>] [drm:amdgpu_bo_do_create [amdgpu]] *ERROR* 
BO size 1 > total memory in domain: 3221225472
[  104.609168 <0.000257>] [drm:amdgpu_gem_object_create [amdgpu]] 
*ERROR* Failed to allocate GEM object (1, 6, 4096, -12)
[  104.609301 <0.000133>] traps: lt-amdgpu_test[1142] general 
protection ip:7f21c9ed6007 sp:7ffe08ae1e30 error:0 in 
libdrm_amdgpu.so.1.0.0[7f21c9ed2000+b000]



Thanks,
Andrey



Regards,
Christian.



Thanks,
Andrey

[169053.128981 <72032.811683>] [ cut here ]
[169053.129006 <0.25>] WARNING: CPU: 0 PID: 22883 at 
mm/page_alloc.c:3883 __alloc_pages_slowpath+0xf03/0x14e0
[169053.129007 <0.01>] Modules linked in: amdgpu chash ttm 
drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect 
sysimgblt edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul 
crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel 
snd_hda_codec_generic pcbc snd_hda_codec_hdmi snd_hda_intel 
aesni_intel snd_hda_codec aes_x86_64 snd_hda_core crypto_simd 
glue_helper snd_hwdep rfkill_gpio cryptd snd_pcm snd_seq_midi 
snd_seq_midi_event serio_raw snd_rawmidi snd_seq cdc_ether usbnet 
snd_seq_device joydev fam15h_power k10temp r8152 snd_timer mii 
i2c_piix4 rtsx_pci_ms snd memstick soundcore shpchp 8250_dw 
i2c_designware_platform i2c_designware_core mac_hid binfmt_misc nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport 
autofs4 rtsx_pci_sdmmc psmouse rtsx_pci sdhci_pci ahci sdhci libahci

[169053.129084 <0.77>]  video i2c_hid hid_generic usbhid hid
[169053.129096 <0.12>] CPU: 0 PID: 22883 Comm: lt-amdgpu_test 
Tainted: GW   4.14.0-rc3+ #1
[169053.129097 <0.01>] Hardware name: AMD Gardenia/Gardenia, 
BIOS RGA1101C 07/20/2015
[169053.129099 <0.02>] task: 880048803d80 task.stack: 
880064688000
[169053.129103 <0.04>] RIP: 
0010:__alloc_pages_slowpath+0xf03/0x14e0
[169053.129105 <0.02>] RSP: 0018:88006468f108 EFLAGS: 
00010246
[169053.129108 <0.03>] RAX:  RBX: 
014000c0 RCX: 81279065
[169053.129109 <0.01>] RDX: dc00 RSI: 
000f RDI: 82609000
[169053.129111 <0.02>] RBP: 88006468f328 R08: 
 R09: 8576
[169053.129113 <0.02>] R10: 5c2044e7 R11: 
 R12: 88006468f3d8
[169053.129114 <0.01>] R13: 880048803d80 R14: 
0140c0c0 R15: 000f
[169053.129117 <0.03>] FS:  7f707863b700() 
GS:88006ce0() knlGS:
[169053.129119 <0.02>] CS:  0010 DS:  ES:  CR0: 
80050033
[169053.129120 <0.01>] CR2: 0125 CR3: 
644cf000 CR4: 001406f0

[169053.129122 <0.02>] Call Trace:
[169053.129131 <0.09>]  ? __module_address+0x145/0x190
[169053.129135 <0.04>]  ? is_bpf_text_address+0xe/0x20
[169053.129140 <0.05>]  ? __kernel_text_address+0x12/0x40
[169053.129144 <0.04>]  ? unwind_get_return_address+0x36/0x50
[169053.129150 <0.06>]  ? memcmp+0x5b/0x90
[169053.129152 <0.02>]  ? warn_alloc+0x250/0x250
[169053.129156 <0.04>]  ? get_page_from_freelist+0x147/0x10f0
[169053.129160 <0.04>]  ? save_stack_trace+0x1b/0x20
[169053.129164 <0.04>]  ? kasan_kmalloc+0xad/0xe0
[169053.129186 

Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-10 Thread Christian König

On 10.11.2017 16:36, Andrey Grodzovsky wrote:



On 11/10/2017 07:17 AM, Christian König wrote:

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but 
rather be reacted immediately.


Maybe we should add a second test which does incremental 1GB allocations, 
but still keep this test? With this test I get a call stack as below plus 
a crash of the test suite with a general protection fault. As normal 
behavior I would have expected just an errno returned from 
amdgpu_bo_alloc, which we could check in the test.


Yeah, totally agree. And when this works correctly we should really 
enable this test case by default as well.


When I implemented scattered eviction I completely removed the check 
which limited the BO size. That was probably a bit too extreme.


We still need to check the size here so that we don't create a BO larger 
than what makes sense for the domain it should be stored in.


Regards,
Christian.



Thanks,
Andrey

[169053.128981 <72032.811683>] [ cut here ]
[169053.129006 <    0.25>] WARNING: CPU: 0 PID: 22883 at 
mm/page_alloc.c:3883 __alloc_pages_slowpath+0xf03/0x14e0
[169053.129007 <    0.01>] Modules linked in: amdgpu chash ttm 
drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect 
sysimgblt edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul 
crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel 
snd_hda_codec_generic pcbc snd_hda_codec_hdmi snd_hda_intel 
aesni_intel snd_hda_codec aes_x86_64 snd_hda_core crypto_simd 
glue_helper snd_hwdep rfkill_gpio cryptd snd_pcm snd_seq_midi 
snd_seq_midi_event serio_raw snd_rawmidi snd_seq cdc_ether usbnet 
snd_seq_device joydev fam15h_power k10temp r8152 snd_timer mii 
i2c_piix4 rtsx_pci_ms snd memstick soundcore shpchp 8250_dw 
i2c_designware_platform i2c_designware_core mac_hid binfmt_misc nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport 
autofs4 rtsx_pci_sdmmc psmouse rtsx_pci sdhci_pci ahci sdhci libahci

[169053.129084 <    0.77>]  video i2c_hid hid_generic usbhid hid
[169053.129096 <    0.12>] CPU: 0 PID: 22883 Comm: lt-amdgpu_test 
Tainted: G    W   4.14.0-rc3+ #1
[169053.129097 <    0.01>] Hardware name: AMD Gardenia/Gardenia, 
BIOS RGA1101C 07/20/2015
[169053.129099 <    0.02>] task: 880048803d80 task.stack: 
880064688000
[169053.129103 <    0.04>] RIP: 
0010:__alloc_pages_slowpath+0xf03/0x14e0
[169053.129105 <    0.02>] RSP: 0018:88006468f108 EFLAGS: 
00010246
[169053.129108 <    0.03>] RAX:  RBX: 
014000c0 RCX: 81279065
[169053.129109 <    0.01>] RDX: dc00 RSI: 
000f RDI: 82609000
[169053.129111 <    0.02>] RBP: 88006468f328 R08: 
 R09: 8576
[169053.129113 <    0.02>] R10: 5c2044e7 R11: 
 R12: 88006468f3d8
[169053.129114 <    0.01>] R13: 880048803d80 R14: 
0140c0c0 R15: 000f
[169053.129117 <    0.03>] FS:  7f707863b700() 
GS:88006ce0() knlGS:
[169053.129119 <    0.02>] CS:  0010 DS:  ES:  CR0: 
80050033
[169053.129120 <    0.01>] CR2: 0125 CR3: 
644cf000 CR4: 001406f0

[169053.129122 <    0.02>] Call Trace:
[169053.129131 <    0.09>]  ? __module_address+0x145/0x190
[169053.129135 <    0.04>]  ? is_bpf_text_address+0xe/0x20
[169053.129140 <    0.05>]  ? __kernel_text_address+0x12/0x40
[169053.129144 <    0.04>]  ? unwind_get_return_address+0x36/0x50
[169053.129150 <    0.06>]  ? memcmp+0x5b/0x90
[169053.129152 <    0.02>]  ? warn_alloc+0x250/0x250
[169053.129156 <    0.04>]  ? get_page_from_freelist+0x147/0x10f0
[169053.129160 <    0.04>]  ? save_stack_trace+0x1b/0x20
[169053.129164 <    0.04>]  ? kasan_kmalloc+0xad/0xe0
[169053.129186 <    0.22>]  ? ttm_bo_mem_space+0x79/0x6b0 [ttm]
[169053.129196 <    0.10>]  ? ttm_bo_validate+0x178/0x220 [ttm]
[169053.129200 <    0.04>] __alloc_pages_nodemask+0x3c4/0x400
[169053.129203 <    0.03>]  ? __alloc_pages_slowpath+0x14e0/0x14e0
[169053.129205 <    0.02>]  ? __save_stack_trace+0x66/0xd0
[169053.129209 <    0.04>]  ? rb_insert_color+0x32/0x3e0
[169053.129213 <    0.04>]  ? do_syscall_64+0xea/0x280
[169053.129217 <    0.04>] alloc_pages_current+0x75/0x110
[169053.129221 <    0.04>]  kmalloc_order+0x1f/0x80
[169053.129223 <    0.02>] kmalloc_order_trace+0x24/0xa0
[169053.129226 <    0.03>]  __kmalloc+0x264/0x280
[169053.129383 <    0.000157>] amdgpu_vram_mgr_new+0x11b/0x3b0 [amdgpu]
[169053.129391 <    0.08>]  ? 
reservation_object_reserve_shared+0x64/0xf0

[169053.129401 <    0.10>]  ttm_bo_mem_space+0x196/0x6b0 [ttm]
[169053.129478 <    0.77>]  

Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-10 Thread Andrey Grodzovsky



On 11/10/2017 07:17 AM, Christian König wrote:

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but 
rather be reacted immediately.


Maybe we should add a second test which does incremental 1GB allocations, 
but still keep this test? With this test I get a call stack as below plus 
a crash of the test suite with a general protection fault. As normal 
behavior I would have expected just an errno returned from 
amdgpu_bo_alloc, which we could check in the test.


Thanks,
Andrey

[169053.128981 <72032.811683>] [ cut here ]
[169053.129006 <0.25>] WARNING: CPU: 0 PID: 22883 at 
mm/page_alloc.c:3883 __alloc_pages_slowpath+0xf03/0x14e0
[169053.129007 <0.01>] Modules linked in: amdgpu chash ttm 
drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect 
sysimgblt edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul 
crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel 
snd_hda_codec_generic pcbc snd_hda_codec_hdmi snd_hda_intel aesni_intel 
snd_hda_codec aes_x86_64 snd_hda_core crypto_simd glue_helper snd_hwdep 
rfkill_gpio cryptd snd_pcm snd_seq_midi snd_seq_midi_event serio_raw 
snd_rawmidi snd_seq cdc_ether usbnet snd_seq_device joydev fam15h_power 
k10temp r8152 snd_timer mii i2c_piix4 rtsx_pci_ms snd memstick soundcore 
shpchp 8250_dw i2c_designware_platform i2c_designware_core mac_hid 
binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev 
lp parport autofs4 rtsx_pci_sdmmc psmouse rtsx_pci sdhci_pci ahci sdhci 
libahci

[169053.129084 <0.77>]  video i2c_hid hid_generic usbhid hid
[169053.129096 <0.12>] CPU: 0 PID: 22883 Comm: lt-amdgpu_test 
Tainted: GW   4.14.0-rc3+ #1
[169053.129097 <0.01>] Hardware name: AMD Gardenia/Gardenia, 
BIOS RGA1101C 07/20/2015
[169053.129099 <0.02>] task: 880048803d80 task.stack: 
880064688000

[169053.129103 <0.04>] RIP: 0010:__alloc_pages_slowpath+0xf03/0x14e0
[169053.129105 <0.02>] RSP: 0018:88006468f108 EFLAGS: 00010246
[169053.129108 <0.03>] RAX:  RBX: 
014000c0 RCX: 81279065
[169053.129109 <0.01>] RDX: dc00 RSI: 
000f RDI: 82609000
[169053.129111 <0.02>] RBP: 88006468f328 R08: 
 R09: 8576
[169053.129113 <0.02>] R10: 5c2044e7 R11: 
 R12: 88006468f3d8
[169053.129114 <0.01>] R13: 880048803d80 R14: 
0140c0c0 R15: 000f
[169053.129117 <0.03>] FS:  7f707863b700() 
GS:88006ce0() knlGS:
[169053.129119 <0.02>] CS:  0010 DS:  ES:  CR0: 
80050033
[169053.129120 <0.01>] CR2: 0125 CR3: 
644cf000 CR4: 001406f0

[169053.129122 <0.02>] Call Trace:
[169053.129131 <0.09>]  ? __module_address+0x145/0x190
[169053.129135 <0.04>]  ? is_bpf_text_address+0xe/0x20
[169053.129140 <0.05>]  ? __kernel_text_address+0x12/0x40
[169053.129144 <0.04>]  ? unwind_get_return_address+0x36/0x50
[169053.129150 <0.06>]  ? memcmp+0x5b/0x90
[169053.129152 <0.02>]  ? warn_alloc+0x250/0x250
[169053.129156 <0.04>]  ? get_page_from_freelist+0x147/0x10f0
[169053.129160 <0.04>]  ? save_stack_trace+0x1b/0x20
[169053.129164 <0.04>]  ? kasan_kmalloc+0xad/0xe0
[169053.129186 <0.22>]  ? ttm_bo_mem_space+0x79/0x6b0 [ttm]
[169053.129196 <0.10>]  ? ttm_bo_validate+0x178/0x220 [ttm]
[169053.129200 <0.04>] __alloc_pages_nodemask+0x3c4/0x400
[169053.129203 <0.03>]  ? __alloc_pages_slowpath+0x14e0/0x14e0
[169053.129205 <0.02>]  ? __save_stack_trace+0x66/0xd0
[169053.129209 <0.04>]  ? rb_insert_color+0x32/0x3e0
[169053.129213 <0.04>]  ? do_syscall_64+0xea/0x280
[169053.129217 <0.04>]  alloc_pages_current+0x75/0x110
[169053.129221 <0.04>]  kmalloc_order+0x1f/0x80
[169053.129223 <0.02>]  kmalloc_order_trace+0x24/0xa0
[169053.129226 <0.03>]  __kmalloc+0x264/0x280
[169053.129383 <0.000157>] amdgpu_vram_mgr_new+0x11b/0x3b0 [amdgpu]
[169053.129391 <0.08>]  ? 
reservation_object_reserve_shared+0x64/0xf0

[169053.129401 <0.10>]  ttm_bo_mem_space+0x196/0x6b0 [ttm]
[169053.129478 <0.77>]  ? add_hole+0x20a/0x220 [drm]
[169053.129489 <0.11>]  ttm_bo_validate+0x178/0x220 [ttm]
[169053.129498 <0.09>]  ? ttm_bo_evict_mm+0x70/0x70 [ttm]
[169053.129508 <0.10>]  ? ttm_check_swapping+0xf6/0x110 [ttm]
[169053.129541 <0.33>]  ? drm_vma_offset_add+0x5b/0x80 [drm]
[169053.129572 <0.31>]  ? drm_vma_offset_add+0x68/0x80 [drm]
[169053.129584 <0.12>] ttm_bo_init_reserved+0x546/0x630 [ttm]
[169053.129716 <

Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-10 Thread Christian König

On 10.11.2017 13:17, Christian König wrote:

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but 
rather be reacted immediately.


Typo: "rejected immediately".

Christian.



Instead I expected that we would need to do multiple 1GB allocations to 
trigger the next problem: that our TTM code doesn't impose a global limit.


Regards,
Christian.

On 10.11.2017 05:29, Andrey Grodzovsky wrote:

The following patch series introduces dynamic test disabling/enabling
in the amdgpu tester using the CUnit API. Today, test suites that
don't apply to specific HW just return success w/o executing, while
single tests that can't be executed properly are commented out.

Suites are disabled based on the hooks they provide (e.g. incompatible
ASIC or missing blocks), while single tests are disabled explicitly, 
since this is usually due to some bug preventing the tester or the 
system from handling the test w/o crashing or killing the tester.

This series also includes a minor cleanup and a new test for memory 
over-allocation.


Andrey Grodzovsky (4):
   amdgpu: Add functions to disable suites and tests.
   amdgpu: Use new suite/test disabling functionality.
   amdgpu: Move memory alloc tests in bo suite.
   amdgpu: Add memory over allocation test.

  tests/amdgpu/amdgpu_test.c    | 169 +-
  tests/amdgpu/amdgpu_test.h    |  46 
  tests/amdgpu/basic_tests.c    |  49 
  tests/amdgpu/bo_tests.c   |  69 +
  tests/amdgpu/deadlock_tests.c |   8 +-
  tests/amdgpu/uvd_enc_tests.c  |  81 
  tests/amdgpu/vce_tests.c  |  65 
  tests/amdgpu/vcn_tests.c  |  74 --
  8 files changed, 363 insertions(+), 198 deletions(-)





Re: [PATCH libdrm 0/4] Dynamically disable suites and tests.

2017-11-10 Thread Christian König

Series is Acked-by: Christian König .

Please note that I think your OOM killer test shows quite a bug we 
currently have in the kernel driver.


A single allocation of 1TB shouldn't trigger the OOM killer, but rather 
be reacted immediately.


Instead I expected that we would need to do multiple 1GB allocations to 
trigger the next problem: that our TTM code doesn't impose a global limit.


Regards,
Christian.

On 10.11.2017 05:29, Andrey Grodzovsky wrote:

The following patch series introduces dynamic test disabling/enabling
in the amdgpu tester using the CUnit API. Today, test suites that
don't apply to specific HW just return success w/o executing, while
single tests that can't be executed properly are commented out.

Suites are disabled based on the hooks they provide (e.g. incompatible
ASIC or missing blocks), while single tests are disabled explicitly, since this is
usually due to some bug preventing the tester or the system from handling
the test w/o crashing or killing the tester.

This series also includes a minor cleanup and a new test for memory over-allocation.

Andrey Grodzovsky (4):
   amdgpu: Add functions to disable suites and tests.
   amdgpu: Use new suite/test disabling functionality.
   amdgpu: Move memory alloc tests in bo suite.
   amdgpu: Add memory over allocation test.

  tests/amdgpu/amdgpu_test.c| 169 +-
  tests/amdgpu/amdgpu_test.h|  46 
  tests/amdgpu/basic_tests.c|  49 
  tests/amdgpu/bo_tests.c   |  69 +
  tests/amdgpu/deadlock_tests.c |   8 +-
  tests/amdgpu/uvd_enc_tests.c  |  81 
  tests/amdgpu/vce_tests.c  |  65 
  tests/amdgpu/vcn_tests.c  |  74 --
  8 files changed, 363 insertions(+), 198 deletions(-)


