Re: [PATCH] drm/amdgpu: Fix memory leak in amdgpu_fence_emit

2019-10-21 Thread Koenig, Christian
On 21.10.19 at 20:09, Navid Emamdoost wrote:
> In the implementation of amdgpu_fence_emit(), if dma_fence_wait() fails
> and returns an errno, the memory allocated for the fence should be
> released before returning.
>
> Fixes: 3d2aca8c8620 ("drm/amdgpu: fix old fence check in amdgpu_fence_emit")
> Signed-off-by: Navid Emamdoost 

Reviewed-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 23085b352cf2..2f59c9270a7e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -166,8 +166,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
> dma_fence **f,
>   if (old) {
>   r = dma_fence_wait(old, false);
>   dma_fence_put(old);
> - if (r)
> + if (r) {
> + dma_fence_put(fence);
>   return r;
> + }
>   }
>   }
>   



Re: [PATCH v4 06/11] drm/ttm: factor out ttm_bo_mmap_vma_setup

2019-10-17 Thread Koenig, Christian
On 16.10.19 at 14:18, Christian König wrote:
> On 16.10.19 at 13:51, Gerd Hoffmann wrote:
>> Factor out ttm vma setup to a new function.
>> Reduces code duplication a bit.
>>
>> v2: don't change vm_flags (moved to separate patch).
>> v4: make ttm_bo_mmap_vma_setup static.
>>
>> Signed-off-by: Gerd Hoffmann 
>
> Reviewed-by: Christian König  for this one 
> and #7 in the series.

Any objections to adding these two to my drm-ttm-next pull request, or 
did you want to merge them through some other tree?

Thanks,
Christian.

>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 46 +
>>   1 file changed, 24 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 4aa007edffb0..53345c0854d5 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -426,6 +426,28 @@ static struct ttm_buffer_object 
>> *ttm_bo_vm_lookup(struct ttm_bo_device *bdev,
>>   return bo;
>>   }
>>   +static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, 
>> struct vm_area_struct *vma)
>> +{
>> +    vma->vm_ops = &ttm_bo_vm_ops;
>> +
>> +    /*
>> + * Note: We're transferring the bo reference to
>> + * vma->vm_private_data here.
>> + */
>> +
>> +    vma->vm_private_data = bo;
>> +
>> +    /*
>> + * We'd like to use VM_PFNMAP on shared mappings, where
>> + * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>> + * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>> + * bad for performance. Until that has been sorted out, use
>> + * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>> + */
>> +    vma->vm_flags |= VM_MIXEDMAP;
>> +    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>> +}
>> +
>>   int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>>   struct ttm_bo_device *bdev)
>>   {
>> @@ -449,24 +471,7 @@ int ttm_bo_mmap(struct file *filp, struct 
>> vm_area_struct *vma,
>>   if (unlikely(ret != 0))
>>   goto out_unref;
>>   -    vma->vm_ops = &ttm_bo_vm_ops;
>> -
>> -    /*
>> - * Note: We're transferring the bo reference to
>> - * vma->vm_private_data here.
>> - */
>> -
>> -    vma->vm_private_data = bo;
>> -
>> -    /*
>> - * We'd like to use VM_PFNMAP on shared mappings, where
>> - * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>> - * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>> - * bad for performance. Until that has been sorted out, use
>> - * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>> - */
>> -    vma->vm_flags |= VM_MIXEDMAP;
>> -    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>> +    ttm_bo_mmap_vma_setup(bo, vma);
>>   return 0;
>>   out_unref:
>>   ttm_bo_put(bo);
>> @@ -481,10 +486,7 @@ int ttm_fbdev_mmap(struct vm_area_struct *vma, 
>> struct ttm_buffer_object *bo)
>>     ttm_bo_get(bo);
>>   -    vma->vm_ops = &ttm_bo_vm_ops;
>> -    vma->vm_private_data = bo;
>> -    vma->vm_flags |= VM_MIXEDMAP;
>> -    vma->vm_flags |= VM_IO | VM_DONTEXPAND;
>> +    ttm_bo_mmap_vma_setup(bo, vma);
>>   return 0;
>>   }
>>   EXPORT_SYMBOL(ttm_fbdev_mmap);
>



Re: [PATCH v3] drm/amdgpu: fix multiple memory leaks in acp_hw_init

2019-10-01 Thread Koenig, Christian
On 30.09.19 at 23:26, Navid Emamdoost wrote:
> In acp_hw_init there are some allocations that need to be released in
> case of failure:
>
> 1- adev->acp.acp_genpd should be released if any allocation attempt for
> adev->acp.acp_cell, adev->acp.acp_res or i2s_pdata fails.
> 2- all of those allocations should be released if
> mfd_add_hotplug_devices or pm_genpd_add_device fail.
> 3- a release is also needed when the ACP reset timeouts expire.
>
> Signed-off-by: Navid Emamdoost 
> ---
> Changes in v2:
>   -- moved the releases under goto
>
> Changes in v3:
>   -- fixed multiple goto issue
>   -- added goto for 3 other failure cases: one when
> mfd_add_hotplug_devices fails, and two when time out values expires.
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c | 41 -
>   1 file changed, 27 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c
> index eba42c752bca..7809745ec0f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c
> @@ -184,12 +184,12 @@ static struct device *get_mfd_cell_dev(const char 
> *device_name, int r)
>*/
>   static int acp_hw_init(void *handle)
>   {
> - int r, i;
> + int r, i, ret;

Please don't add another "ret" variable, instead always use "r" here.
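
I.e. roughly like this (a sketch of the requested change, reusing "r" 
for all the error paths instead of adding "ret"):

	if (adev->acp.acp_cell == NULL) {
		r = -ENOMEM;
		goto failure;
	}
	[...]
failure:
	kfree(i2s_pdata);
	kfree(adev->acp.acp_res);
	kfree(adev->acp.acp_cell);
	kfree(adev->acp.acp_genpd);
	return r;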

Apart from that looks good to me,
Christian.

>   uint64_t acp_base;
>   u32 val = 0;
>   u32 count = 0;
>   struct device *dev;
> - struct i2s_platform_data *i2s_pdata;
> + struct i2s_platform_data *i2s_pdata = NULL;
>   
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
> @@ -231,20 +231,21 @@ static int acp_hw_init(void *handle)
>   adev->acp.acp_cell = kcalloc(ACP_DEVS, sizeof(struct mfd_cell),
>   GFP_KERNEL);
>   
> - if (adev->acp.acp_cell == NULL)
> - return -ENOMEM;
> + if (adev->acp.acp_cell == NULL) {
> + ret = -ENOMEM;
> + goto failure;
> + }
>   
>   adev->acp.acp_res = kcalloc(5, sizeof(struct resource), GFP_KERNEL);
>   if (adev->acp.acp_res == NULL) {
> - kfree(adev->acp.acp_cell);
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto failure;
>   }
>   
>   i2s_pdata = kcalloc(3, sizeof(struct i2s_platform_data), GFP_KERNEL);
>   if (i2s_pdata == NULL) {
> - kfree(adev->acp.acp_res);
> - kfree(adev->acp.acp_cell);
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto failure;
>   }
>   
>   switch (adev->asic_type) {
> @@ -340,15 +341,18 @@ static int acp_hw_init(void *handle)
>   
>   r = mfd_add_hotplug_devices(adev->acp.parent, adev->acp.acp_cell,
>   ACP_DEVS);
> - if (r)
> - return r;
> + if (r) {
> + ret = r;
> + goto failure;
> + }
>   
>   for (i = 0; i < ACP_DEVS ; i++) {
>   dev = get_mfd_cell_dev(adev->acp.acp_cell[i].name, i);
>   r = pm_genpd_add_device(&adev->acp.acp_genpd->gpd, dev);
>   if (r) {
>   dev_err(dev, "Failed to add dev to genpd\n");
> - return r;
> + ret = r;
> + goto failure;
>   }
>   }
>   
> @@ -367,7 +371,8 @@ static int acp_hw_init(void *handle)
>   break;
>   if (--count == 0) {
>   dev_err(&adev->pdev->dev, "Failed to reset ACP\n");
> - return -ETIMEDOUT;
> + ret = -ETIMEDOUT;
> + goto failure;
>   }
>   udelay(100);
>   }
> @@ -384,7 +389,8 @@ static int acp_hw_init(void *handle)
>   break;
>   if (--count == 0) {
>   dev_err(&adev->pdev->dev, "Failed to reset ACP\n");
> - return -ETIMEDOUT;
> + ret = -ETIMEDOUT;
> + goto failure;
>   }
>   udelay(100);
>   }
> @@ -393,6 +399,13 @@ static int acp_hw_init(void *handle)
>   val &= ~ACP_SOFT_RESET__SoftResetAud_MASK;
>   cgs_write_register(adev->acp.cgs_device, mmACP_SOFT_RESET, val);
>   return 0;
> +
> +failure:
> + kfree(i2s_pdata);
> + kfree(adev->acp.acp_res);
> + kfree(adev->acp.acp_cell);
> + kfree(adev->acp.acp_genpd);
> + return ret;
>   }
>   
>   /**



Re: TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit

2019-09-24 Thread Koenig, Christian
On 11.09.19 at 17:08, Thomas Hellström (VMware) wrote:
> On 9/11/19 4:06 PM, Koenig, Christian wrote:
>> On 11.09.19 at 12:10, Thomas Hellström (VMware) wrote:
>> [SNIP]
>>>>> The problem seen in TTM is that we want to be able to change the
>>>>> vm_page_prot from the fault handler, but it's problematic since we
>>>>> have the mmap_sem typically only in read mode. Hence the fake vma
>>>>> hack. From what I can tell it's reasonably well-behaved, since
>>>>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>>>>> work OK. I think split_huge_pmd may run into trouble, but we don't
>>>>> support it (yet) with TTM.
>>>> Ah! I actually ran into this while implementing huge page support for
>>>> TTM and never figured out why that doesn't work. Dropped CPU huge page
>>>> support because of this.
>>> By incident, I got slightly sidetracked the other day and started
>>> looking at this as well. Got to the point where I figured out all the
>>> hairy alignment issues and actually got huge_fault() calls, but never
>>> implemented the handler. I think that's definitely something worth
>>> having. Not sure it will work for IO memory, though, (split_huge_pmd
>>> will just skip non-page-backed memory) but if we only support
>>> VM_SHARED (non COW) vmas there's no reason to split the huge pmds
>>> anyway. Definitely something we should have IMO.
>> Well our primary use case would be IO memory, because system memory is
>> only optionally allocated as huge pages, but we nearly always allocate VRAM
>> in chunks of at least 2MB because we otherwise get a huge performance
>> penalty.
>
> But that system memory option is on by default, right? In any case, a 
> request for a huge_fault
> would probably need to check that there is actually an underlying 
> huge page and otherwise fall back to ordinary faults.
>
> Another requirement would be for IO memory allocations to be 
> PMD_PAGE_SIZE aligned in the mappable aperture, to avoid fallbacks to 
> ordinary faults. Probably increasing fragmentation somewhat. (Seems 
> like pmd entries can only point to PMD_PAGE_SIZE aligned physical 
> addresses) Would that work for you?

Yeah, we do it this way anyway.
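
For reference, the alignment check in such a huge_fault handler could 
look roughly like this (a hypothetical sketch, not the actual TTM code; 
bo_fault_phys() is a made-up helper):

	static vm_fault_t bo_huge_fault(struct vm_fault *vmf)
	{
		/* made-up helper returning the physical address that
		 * backs vmf->address */
		phys_addr_t paddr = bo_fault_phys(vmf);

		/*
		 * A pmd entry can only point at a PMD_SIZE-aligned
		 * physical range, so the virtual and physical offsets
		 * within the 2MB range must match; otherwise fall back
		 * to ordinary PTE faults.
		 */
		if ((vmf->address & ~PMD_MASK) != (paddr & ~PMD_MASK))
			return VM_FAULT_FALLBACK;

		return vmf_insert_pfn_pmd(vmf,
					  phys_to_pfn_t(paddr & PMD_MASK, PFN_DEV),
					  vmf->flags & FAULT_FLAG_WRITE);
	}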

Regards,
Christian.


Re: [PATCH v2 07/11] drm/ttm: drop VM_DONTDUMP

2019-09-17 Thread Koenig, Christian
On 17.09.19 at 11:24, Gerd Hoffmann wrote:
> Not obvious why this is needed.  According to Daniel Vetter this is most
> likely a historic artefact dating back to the days when drm drivers
> exposed hardware registers as mmap'able gem objects, to keep core dumps
> from touching those registers.

Clearly a NAK.

We still have that and really don't want to try dumping any CPU 
inaccessible VRAM content even if it is mapped into the address space 
somewhere.

Regards,
Christian.

>
> Signed-off-by: Gerd Hoffmann 
> ---
>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 7c0e85c10e0e..4dc77a66aaf6 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -445,7 +445,7 @@ void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, 
> struct vm_area_struct *
>* VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>*/
>   vma->vm_flags |= VM_MIXEDMAP;
> - vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags |= VM_IO | VM_DONTEXPAND;
>   }
>   EXPORT_SYMBOL(ttm_bo_mmap_vma_setup);
>   



Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit

2019-09-11 Thread Koenig, Christian
On 10.09.19 at 21:26, Thomas Hellström (VMware) wrote:
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
>>
>>> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig  
>>> wrote:
>>>
 On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) 
 wrote:
> On 9/5/19 4:15 PM, Dave Hansen wrote:
> Hi Thomas,
>
> Thanks for the second batch of patches!  These look much improved 
> on all
> fronts.
 Yes, although the TTM functionality isn't in yet. Hopefully we 
 won't have to
 bother you with those though, since this assumes TTM will be using 
 the dma
 API.
>>> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
>>> branch:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>>>
>>> they should allow to fault dma api pages in the page fault handler.  
>>> But
>>> this is totally hot off the press and not actually tested for the last
>>> few patches.  Note that I've also included your two patches from this
>>> series to handle SEV.
>> I read that patch, and it seems like you’ve built in the assumption 
>> that all pages in the mapping use identical protection or, if not, 
>> that the same fake vma hack that TTM already has is used to fudge 
>> around it.  Could it be reworked slightly to avoid this?
>>
>> I wonder if it’s a mistake to put the encryption bits in vm_page_prot 
>> at all.
>
> From my POV, the encryption bits behave quite similarly to the caching 
> mode bits, and they're also in vm_page_prot. They're the 
> reason TTM needs to modify the page protection in the fault handler in 
> the first place.
>
> The problem seen in TTM is that we want to be able to change the 
> vm_page_prot from the fault handler, but it's problematic since we 
> have the mmap_sem typically only in read mode. Hence the fake vma 
> hack. From what I can tell it's reasonably well-behaved, since 
> pte_modify() skips the bits TTM updates, so mprotect() and mremap() 
> work OK. I think split_huge_pmd may run into trouble, but we don't 
> support it (yet) with TTM.

Ah! I actually ran into this while implementing huge page support for 
TTM and never figured out why that doesn't work. Dropped CPU huge page 
support because of this.

>
> We could probably get away with a WRITE_ONCE() update of the 
> vm_page_prot before taking the page table lock since
>
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an 
> address space lock ("buffer object reservation")
> c) When we need to update there are no valid page table entries in the 
> vma, since it only happens directly after mmap(), or after an 
> unmap_mapping_range() with the same address space lock. When another 
> reader (for example split_huge_pmd()) sees a valid page table entry, 
> it also sees the new page protection and things are fine.

Yeah, that's exactly why I always wondered why we need this hack with 
the vma copy on the stack.

>
> But that would really be a special case. To solve this properly we'd 
> probably need an additional lock to protect the vm_flags and 
> vm_page_prot, taken after mmap_sem and i_mmap_lock.

Well we already have a special lock for this: The reservation object. So 
memory barriers etc should be in place and I also think we can just 
update the vm_page_prot on the fly.
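
A minimal sketch of what I mean (hypothetical code, assuming the caller 
holds the buffer's reservation object as described in b) above):

	/* Per c) above there are no valid page table entries in the vma
	 * at this point, so publishing the new protection is enough;
	 * WRITE_ONCE() just keeps the compiler from tearing the update
	 * for concurrent readers like split_huge_pmd(). */
	static void bo_publish_page_prot(struct vm_area_struct *vma,
					 pgprot_t prot)
	{
		WRITE_ONCE(vma->vm_page_prot, prot);
	}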

Christian.

>
> /Thomas
>
>
>
>



Re: [PATCH 00/14] PCI/P2PDMA: Support transactions that hit the host bridge

2019-07-23 Thread Koenig, Christian
On 23.07.19 at 01:08, Logan Gunthorpe wrote:
> As discussed on the list previously, in order to fully support the
> whitelist Christian added with the IOMMU, we must ensure that we
> map any buffer going through the IOMMU with an appropriate dma_map
> call. This patchset accomplishes this by cleaning up the output of
> upstream_bridge_distance() to better indicate the mapping requirements,
> caching these requirements in an xarray, then looking them up at map
> time and applying the appropriate mapping method.
>
> After this patchset, it's possible to use the NVMe-of P2P support to
> transfer between devices without a switch on the whitelisted root
> complexes. A couple of Intel devices I have tested this on have also
> been added to the whitelist.
>
> Most of the changes are contained within the p2pdma.c, but there are
> a few minor touches to other subsystems, mostly to add support
> to call an unmap function.
>
> The final patch in this series demonstrates a possible
> pci_p2pdma_map_resource() function that I expect Christian will need
> but does not have any users at this time so I don't intend for it to be
> considered for merging.
>
> This patchset is based on 5.3-rc1 and a git branch is available here:
>
> https://github.com/sbates130272/linux-p2pmem/ p2pdma_rc_map_v1

I reviewed patches #1-#3 and #14.

Feel free to stick an Acked-by: Christian König 
 to the rest, but I'm not really deep into the 
NVMe P2P handling here.

Regards,
Christian.


>
> --
>
> Logan Gunthorpe (14):
>PCI/P2PDMA: Add constants for not-supported result
>  upstream_bridge_distance()
>PCI/P2PDMA: Factor out __upstream_bridge_distance()
>PCI/P2PDMA: Apply host bridge white list for ACS
>PCI/P2PDMA: Cache the result of upstream_bridge_distance()
>PCI/P2PDMA: Factor out host_bridge_whitelist()
>PCI/P2PDMA: Add whitelist support for Intel Host Bridges
>PCI/P2PDMA: Add the provider's pci_dev to the dev_pgmap struct
>PCI/P2PDMA: Add attrs argument to pci_p2pdma_map_sg()
>PCI/P2PDMA: Introduce pci_p2pdma_unmap_sg()
>PCI/P2PDMA: Factor out __pci_p2pdma_map_sg()
>PCI/P2PDMA: dma_map P2PDMA map requests that traverse the host bridge
>PCI/P2PDMA: No longer require no-mmu for host bridge whitelist
>PCI/P2PDMA: Update documentation for pci_p2pdma_distance_many()
>PCI/P2PDMA: Introduce pci_p2pdma_[un]map_resource()
>
>   drivers/infiniband/core/rw.c |   6 +-
>   drivers/nvme/host/pci.c  |  10 +-
>   drivers/pci/p2pdma.c | 400 +++
>   include/linux/memremap.h |   1 +
>   include/linux/pci-p2pdma.h   |  28 ++-
>   5 files changed, 341 insertions(+), 104 deletions(-)
>
> --
> 2.20.1



Re: [PATCH 14/14] PCI/P2PDMA: Introduce pci_p2pdma_[un]map_resource()

2019-07-23 Thread Koenig, Christian
On 23.07.19 at 01:08, Logan Gunthorpe wrote:
> pci_p2pdma_[un]map_resource() can be used to map a resource given
> its physical address and the backing pci_dev. The functions will call
> dma_[un]map_resource() when appropriate.
>
> This is for demonstration purposes only as there are no users of this
> function at this time. Thus, this patch should not be merged at
> this time.
>
> Signed-off-by: Logan Gunthorpe 

Not sure if pci_p2pdma_phys_to_bus actually needs to be exported. But 
apart from that looks fine to me.

Reviewed-by: Christian König 

Christian.
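
For reference, a caller would presumably look something like this 
(hypothetical, since the patch intentionally has no users yet; 
provider, dev, bar_phys and len are assumed to come from the caller's 
context):

	dma_addr_t addr;

	addr = pci_p2pdma_map_resource(provider, dev, bar_phys, len,
				       DMA_FROM_DEVICE, 0);
	if (addr == DMA_MAPPING_ERROR)
		return -EINVAL;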

> ---
>   drivers/pci/p2pdma.c | 85 
>   1 file changed, 85 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index baf476039396..20c834cfd2d3 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -874,6 +874,91 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, 
> struct scatterlist *sg,
>   }
>   EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>   
> +static pci_bus_addr_t pci_p2pdma_phys_to_bus(struct pci_dev *dev,
> + phys_addr_t start, size_t size)
> +{
> + struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> + phys_addr_t end = start + size;
> + struct resource_entry *window;
> +
> + resource_list_for_each_entry(window, &bridge->windows) {
> + if (window->res->start <= start && window->res->end >= end)
> + return start - window->offset;
> + }
> +
> + return DMA_MAPPING_ERROR;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_phys_to_bus);
> +
> +/**
> + * pci_p2pdma_map_resource - map a PCI peer-to-peer physical address for DMA
> + * @provider: pci device that provides the memory backed by phys_addr
> + * @dma_dev: device doing the DMA request
> + * @phys_addr: physical address of the memory to map
> + * @size: size of the memory to map
> + * @dir: DMA direction
> + * @attrs: dma attributes passed to dma_map_resource() (if called)
> + *
> + * Maps a BAR physical address for programming a DMA engine.
> + *
> + * Returns the dma_addr_t to map or DMA_MAPPING_ERROR on failure
> + */
> +dma_addr_t pci_p2pdma_map_resource(struct pci_dev *provider,
> + struct device *dma_dev, phys_addr_t phys_addr, size_t size,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + struct pci_dev *client;
> + int dist;
> +
> + client = find_parent_pci_dev(dma_dev);
> + if (!client)
> + return DMA_MAPPING_ERROR;
> +
> + dist = upstream_bridge_distance(provider, client, NULL);
> + if (dist & P2PDMA_NOT_SUPPORTED)
> + return DMA_MAPPING_ERROR;
> +
> + if (dist & P2PDMA_THRU_HOST_BRIDGE)
> + return dma_map_resource(dma_dev, phys_addr, size, dir, attrs);
> + else
> + return pci_p2pdma_phys_to_bus(provider, phys_addr, size);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_map_resource);
> +
> +/**
> + * pci_p2pdma_unmap_resource - unmap a resource mapped with
> + *   pci_p2pdma_map_resource()
> + * @provider: pci device that provides the memory backed by phys_addr
> + * @dma_dev: device doing the DMA request
> + * @addr: dma address returned by pci_p2pdma_map_resource()
> + * @size: size of the memory to map
> + * @dir: DMA direction
> + * @attrs: dma attributes passed to dma_unmap_resource() (if called)
> + *
> + * Unmaps a BAR physical address that was previously mapped with
> + * pci_p2pdma_map_resource() for programming a DMA engine.
> + */
> +void pci_p2pdma_unmap_resource(struct pci_dev *provider,
> + struct device *dma_dev, dma_addr_t addr, size_t size,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + struct pci_dev *client;
> + int dist;
> +
> + client = find_parent_pci_dev(dma_dev);
> + if (!client)
> + return;
> +
> + dist = upstream_bridge_distance(provider, client, NULL);
> + if (dist & P2PDMA_NOT_SUPPORTED)
> + return;
> +
> + if (dist & P2PDMA_THRU_HOST_BRIDGE)
> + dma_unmap_resource(dma_dev, addr, size, dir, attrs);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_resource);
> +
>   /**
>* pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
>*  to enable p2pdma



Re: [PATCH 03/14] PCI/P2PDMA: Apply host bridge white list for ACS

2019-07-23 Thread Koenig, Christian
On 23.07.19 at 01:08, Logan Gunthorpe wrote:
> When a P2PDMA transfer is rejected due to ACS being set, we
> can also check the whitelist and allow the transactions.
>
> Do this by pushing the whitelist check into the
> upstream_bridge_distance() function.
>
> Signed-off-by: Logan Gunthorpe 

This one and patch #2 are Reviewed-by: Christian König 
.

But I actually think the two patches could be merged.

Christian.

> ---
>   drivers/pci/p2pdma.c | 25 ++---
>   1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 289d03a31e7d..d5034e28d1e1 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -324,15 +324,7 @@ static int __upstream_bridge_distance(struct pci_dev 
> *provider,
>   dist_a++;
>   }
>   
> - /*
> -  * Allow the connection if both devices are on a whitelisted root
> -  * complex, but add an arbitrary large value to the distance.
> -  */
> - if (root_complex_whitelist(provider) &&
> - root_complex_whitelist(client))
> - return (dist_a + dist_b) | P2PDMA_THRU_HOST_BRIDGE;
> -
> - return (dist_a + dist_b) | P2PDMA_NOT_SUPPORTED;
> + return (dist_a + dist_b) | P2PDMA_THRU_HOST_BRIDGE;
>   
>   check_b_path_acs:
>   bb = b;
> @@ -350,7 +342,8 @@ static int __upstream_bridge_distance(struct pci_dev 
> *provider,
>   }
>   
>   if (acs_cnt)
> - return P2PDMA_NOT_SUPPORTED | P2PDMA_ACS_FORCES_UPSTREAM;
> + return (dist_a + dist_b) | P2PDMA_ACS_FORCES_UPSTREAM |
> + P2PDMA_THRU_HOST_BRIDGE;
>   
>   return dist_a + dist_b;
>   }
> @@ -397,7 +390,17 @@ static int upstream_bridge_distance(struct pci_dev 
> *provider,
>   struct pci_dev *client,
>   struct seq_buf *acs_list)
>   {
> - return __upstream_bridge_distance(provider, client, acs_list);
> + int dist;
> +
> + dist = __upstream_bridge_distance(provider, client, acs_list);
> +
> + if (!(dist & P2PDMA_THRU_HOST_BRIDGE))
> + return dist;
> +
> + if (root_complex_whitelist(provider) && root_complex_whitelist(client))
> + return dist;
> +
> + return dist | P2PDMA_NOT_SUPPORTED;
>   }
>   
>   static int upstream_bridge_distance_warn(struct pci_dev *provider,



Re: [PATCH 01/14] PCI/P2PDMA: Add constants for not-supported result upstream_bridge_distance()

2019-07-23 Thread Koenig, Christian
On 23.07.19 at 01:08, Logan Gunthorpe wrote:
> Add constant flags to indicate that two devices are not supported or that
> the data path goes through the host bridge, instead of using the negative
> values -1 and -2.
>
> This helps annotate the code better, but the main reason is so we
> can cache the result in an xarray which does not allow us to store
> negative values.
>
> Signed-off-by: Logan Gunthorpe 

Reviewed-by: Christian König 
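
For context, the xarray mentioned above can only cache non-negative 
values when they are stored as tagged values; roughly (a hypothetical 
sketch, the cache key scheme here is made up):

	/* xa_mk_value() can only encode values in [0, LONG_MAX], hence
	 * the flag-based encoding instead of -1/-2 */
	static DEFINE_XARRAY(p2pdma_distance_cache);

	static void cache_distance(struct pci_dev *client, int dist)
	{
		xa_store(&p2pdma_distance_cache, pci_dev_id(client),
			 xa_mk_value(dist), GFP_KERNEL);
	}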

> ---
>   drivers/pci/p2pdma.c | 52 ++--
>   1 file changed, 31 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 234476226529..e8ec86e1dd00 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -273,6 +273,20 @@ static bool root_complex_whitelist(struct pci_dev *dev)
>   return false;
>   }
>   
> +enum {
> + /*
> +  * These arbitrary offsets are OR'd onto the upstream distance
> +  * calculation for the following conditions:
> +  */
> +
> + /* The data path includes the host-bridge */
> + P2PDMA_THRU_HOST_BRIDGE = 0x0200,
> + /* The data path is forced through the host-bridge due to ACS */
> + P2PDMA_ACS_FORCES_UPSTREAM  = 0x0400,
> + /* The data path is not supported by P2PDMA */
> + P2PDMA_NOT_SUPPORTED= 0x0800,
> +};
> +
>   /*
>* Find the distance through the nearest common upstream bridge between
>* two PCI devices.
> @@ -297,22 +311,17 @@ static bool root_complex_whitelist(struct pci_dev *dev)
>* port of the switch, to the common upstream port, back up to the second
>* downstream port and then to Device B.
>*
> - * Any two devices that don't have a common upstream bridge will return -1.
> - * In this way devices on separate PCIe root ports will be rejected, which
> - * is what we want for peer-to-peer seeing each PCIe root port defines a
> - * separate hierarchy domain and there's no way to determine whether the root
> - * complex supports forwarding between them.
> + * Any two devices that cannot communicate using p2pdma will return the 
> distance
> + * with the flag P2PDMA_NOT_SUPPORTED.
>*
> - * In the case where two devices are connected to different PCIe switches,
> - * this function will still return a positive distance as long as both
> - * switches eventually have a common upstream bridge. Note this covers
> - * the case of using multiple PCIe switches to achieve a desired level of
> - * fan-out from a root port. The exact distance will be a function of the
> - * number of switches between Device A and Device B.
> + * Any two devices that have a data path that goes through the host bridge
> + * will consult a whitelist. If the host bridges are on the whitelist,
> + * then the distance will be returned with the flag P2PDMA_THRU_HOST_BRIDGE 
> set.
> + * If either bridge is not on the whitelist, the flag P2PDMA_NOT_SUPPORTED 
> will
> + * be set.
>*
>* If a bridge which has any ACS redirection bits set is in the path
> - * then this functions will return -2. This is so we reject any
> - * cases where the TLPs are forwarded up into the root complex.
> + * then this function will flag the result with P2PDMA_ACS_FORCES_UPSTREAM.
>* In this case, a list of all infringing bridge addresses will be
>* populated in acs_list (assuming it's non-null) for printk purposes.
>*/
> @@ -359,9 +368,9 @@ static int upstream_bridge_distance(struct pci_dev 
> *provider,
>*/
>   if (root_complex_whitelist(provider) &&
>   root_complex_whitelist(client))
> - return 0x1000 + dist_a + dist_b;
> + return (dist_a + dist_b) | P2PDMA_THRU_HOST_BRIDGE;
>   
> - return -1;
> + return (dist_a + dist_b) | P2PDMA_NOT_SUPPORTED;
>   
>   check_b_path_acs:
>   bb = b;
> @@ -379,7 +388,7 @@ static int upstream_bridge_distance(struct pci_dev 
> *provider,
>   }
>   
>   if (acs_cnt)
> - return -2;
> + return P2PDMA_NOT_SUPPORTED | P2PDMA_ACS_FORCES_UPSTREAM;
>   
>   return dist_a + dist_b;
>   }
> @@ -395,16 +404,17 @@ static int upstream_bridge_distance_warn(struct pci_dev 
> *provider,
>   return -ENOMEM;
>   
>   ret = upstream_bridge_distance(provider, client, &acs_list);
> - if (ret == -2) {
> - pci_warn(client, "cannot be used for peer-to-peer DMA as ACS 
> redirect is set between the client and provider (%s)\n",
> + if (ret & P2PDMA_ACS_FORCES_UPSTREAM) {
> + pci_warn(client, "ACS redirect is set between the client and 
> provider (%s)\n",
>pci_name(provider));
>   /* Drop final semicolon */
>   acs_list.buffer[acs_list.len-1] = 0;
>   pci_warn(client, "to disable ACS redirect for this path, add 
> the kernel parameter: pci=disable_acs_redir=%s\n",
>acs_list.buffer);
> + }
>   
> - } else if (ret < 0) {
> - pci_warn(client,

Re: [PATCH] uaccess: add noop untagged_addr definition

2019-06-04 Thread Koenig, Christian
On 04.06.19 at 13:48, Andrey Konovalov wrote:
> On Tue, Jun 4, 2019 at 1:46 PM Koenig, Christian
>  wrote:
>> On 04.06.19 at 13:44, Andrey Konovalov wrote:
>>> Architectures that support memory tagging have a need to perform untagging
>>> (stripping the tag) in various parts of the kernel. This patch adds an
>>> untagged_addr() macro, which is defined as a no-op for architectures that do
>>> not support memory tagging. The upcoming patch series will define it at
>>> least for sparc64 and arm64.
>>>
>>> Acked-by: Catalin Marinas 
>>> Reviewed-by: Khalid Aziz 
>>> Signed-off-by: Andrey Konovalov 
>>> ---
>>>include/linux/mm.h | 4 
>>>1 file changed, 4 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 0e8834ac32b7..949d43e9c0b6 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
>>>#include 
>>>#include 
>>>
>>> +#ifndef untagged_addr
>>> +#define untagged_addr(addr) (addr)
>>> +#endif
>>> +
>> Maybe add a comment explaining what tagging actually is? Because that is
>> not really obvious from the context.
> Hi,
>
> Do you mean a comment in the code or an explanation in the patch description?

The code; the patch description actually sounds good to me.

Christian.

>
> Thanks!
>
>> Christian.
>>
>>>#ifndef __pa_symbol
>>>#define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
>>>#endif



Re: [PATCH] uaccess: add noop untagged_addr definition

2019-06-04 Thread Koenig, Christian
On 04.06.19 at 13:44, Andrey Konovalov wrote:
> Architectures that support memory tagging have a need to perform untagging
> (stripping the tag) in various parts of the kernel. This patch adds an
> untagged_addr() macro, which is defined as a no-op for architectures that do
> not support memory tagging. The upcoming patch series will define it at
> least for sparc64 and arm64.
>
> Acked-by: Catalin Marinas 
> Reviewed-by: Khalid Aziz 
> Signed-off-by: Andrey Konovalov 
> ---
>   include/linux/mm.h | 4 
>   1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0e8834ac32b7..949d43e9c0b6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
>   #include 
>   #include 
>   
> +#ifndef untagged_addr
> +#define untagged_addr(addr) (addr)
> +#endif
> +

Maybe add a comment explaining what tagging actually is? Because that is 
not really obvious from the context.
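
E.g. something along these lines (just a suggested wording, based on 
the arm64 and sparc64 use cases the commit message mentions):

	/*
	 * Architectures that support memory tagging (e.g. sparc64 ADI or
	 * arm64 TBI) can store metadata in otherwise unused top bits of a
	 * userspace pointer. untagged_addr() strips that tag so the
	 * address can be used for access checks and page table walks; on
	 * all other architectures it is a no-op.
	 */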

Christian.

>   #ifndef __pa_symbol
>   #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
>   #endif



Re: Confusing lockdep message

2019-05-20 Thread Koenig, Christian
Please ignore this mail,

I've fixed the double unlock and lockdep is still complaining about the 
nested locking, so I'm actually facing multiple issues here.

Sorry to waste your time,
Christian.

On 20.05.19 at 13:19, Christian König wrote:
> Hi guys,
>
> writing the usual suspects about locking/lockdep stuff and also Daniel 
> in CC because he might have stumbled over this as well.
>
> It took me a while to figure out what the heck lockdep was 
> complaining about. The relevant dmesg was the following:
>> [  145.623005] ==
>> [  145.623094] WARNING: Nested lock was not taken
>> [  145.623184] 5.0.0-rc1+ #144 Not tainted
>> [  145.623261] --
>> [  145.623351] amdgpu_test/1411 is trying to lock:
>> [  145.623442] 98a1c4d3 (reservation_ww_class_mutex){+.+.}, 
>> at: ttm_eu_reserve_buffers+0x46e/0x910 [ttm]
>> [  145.623651]
>>    but this task is not holding:
>> [  145.623758] reservation_ww_class_acquire
>> [  145.623836]
>>    stack backtrace:
>> [  145.623924] CPU: 4 PID: 1411 Comm: amdgpu_test Not tainted 
>> 5.0.0-rc1+ #144
>> [  145.624058] Hardware name: System manufacturer System Product 
>> Name/PRIME X399-A, BIOS 0808 10/12/2018
>> [  145.624234] Call Trace:
>> ...
>
> The problem is now that the message is very confusing because the 
> issue was *not* that I tried to acquire a lock, but rather that I 
> accidentally released a lock twice.
>
> Now releasing a lock twice is a rather common mistake and I'm really 
> surprised that I didn't get that pointed out by lockdep immediately.
>
> In addition to that I'm pretty sure that this used to work correctly 
> at some point in the past, so I'm either hitting a rare corner case or 
> this broke just recently.
>
> Anyway, can somebody take a look? I can try to provide a test case if 
> required.
>
> Thanks in advance,
> Christian.



Confusing lockdep message

2019-05-20 Thread Koenig, Christian
Hi guys,

writing the usual suspects about locking/lockdep stuff and also Daniel 
in CC because he might have stumbled over this as well.

It took me a while to figure out what the heck lockdep was complaining 
about. The relevant dmesg was the following:
> [  145.623005] ==
> [  145.623094] WARNING: Nested lock was not taken
> [  145.623184] 5.0.0-rc1+ #144 Not tainted
> [  145.623261] --
> [  145.623351] amdgpu_test/1411 is trying to lock:
> [  145.623442] 98a1c4d3 (reservation_ww_class_mutex){+.+.}, 
> at: ttm_eu_reserve_buffers+0x46e/0x910 [ttm]
> [  145.623651]
>    but this task is not holding:
> [  145.623758] reservation_ww_class_acquire
> [  145.623836]
>    stack backtrace:
> [  145.623924] CPU: 4 PID: 1411 Comm: amdgpu_test Not tainted 
> 5.0.0-rc1+ #144
> [  145.624058] Hardware name: System manufacturer System Product 
> Name/PRIME X399-A, BIOS 0808 10/12/2018
> [  145.624234] Call Trace:
> ...

The problem is now that the message is very confusing because the issue 
was *not* that I tried to acquire a lock, but rather that I accidentally 
released a lock twice.

Now releasing a lock twice is a rather common mistake and I'm really 
surprised that I didn't get that pointed out by lockdep immediately.

In addition to that I'm pretty sure that this used to work correctly 
at some point in the past, so I'm either hitting a rare corner case or this 
broke just recently.

Anyway, can somebody take a look? I can try to provide a test case if 
required.
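
A minimal test case would probably boil down to something like this (a 
hypothetical sketch using the ww_mutex API directly, not the actual 
amdgpu_test code):

	static DEFINE_WW_CLASS(test_ww_class);

	static void double_unlock_test(void)
	{
		struct ww_acquire_ctx ctx;
		struct ww_mutex lock;

		ww_mutex_init(&lock, &test_ww_class);
		ww_acquire_init(&ctx, &test_ww_class);

		if (ww_mutex_lock(&lock, &ctx))
			return;
		ww_acquire_done(&ctx);

		ww_mutex_unlock(&lock);
		ww_mutex_unlock(&lock);	/* bug: released twice */

		ww_acquire_fini(&ctx);
	}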

Thanks in advance,
Christian.


Re: [PATCH 04/12] dma-buf: add optional invalidate_mappings callback v5

2019-04-18 Thread Koenig, Christian
On 18.04.19 at 10:08, Daniel Vetter wrote:
> On Wed, Apr 17, 2019 at 09:13:22PM +0200, Christian König wrote:
>> On 17.04.19 at 21:07, Daniel Vetter wrote:
>>> On Tue, Apr 16, 2019 at 08:38:33PM +0200, Christian König wrote:
 Each importer can now provide an invalidate_mappings callback.

 This allows the exporter to provide the mappings without the need to pin
 the backing store.

 v2: don't try to invalidate mappings when the callback is NULL,
   lock the reservation obj while using the attachments,
   add helper to set the callback
 v3: move flag for invalidation support into the DMA-buf,
   use new attach_info structure to set the callback
 v4: use importer_priv field instead of mangling exporter priv.
 v5: drop invalidation_supported flag

 Signed-off-by: Christian König 
 ---
drivers/dma-buf/dma-buf.c | 37 +
include/linux/dma-buf.h   | 33 +++--
2 files changed, 68 insertions(+), 2 deletions(-)

 diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
 index 83c92bfd964c..a3738fab3927 100644
 --- a/drivers/dma-buf/dma-buf.c
 +++ b/drivers/dma-buf/dma-buf.c
 @@ -563,6 +563,8 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
 dma_buf_attach_info *info
attach->dev = info->dev;
attach->dmabuf = dmabuf;
 +  attach->importer_priv = info->importer_priv;
 +  attach->invalidate = info->invalidate;
mutex_lock(&dmabuf->lock);
 @@ -571,7 +573,9 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
 dma_buf_attach_info *info
if (ret)
goto err_attach;
}
 +  reservation_object_lock(dmabuf->resv, NULL);
list_add(&attach->node, &dmabuf->attachments);
 +  reservation_object_unlock(dmabuf->resv);
mutex_unlock(&dmabuf->lock);
 @@ -615,7 +619,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct 
 dma_buf_attachment *attach)
   DMA_BIDIRECTIONAL);
mutex_lock(&dmabuf->lock);
 +  reservation_object_lock(dmabuf->resv, NULL);
list_del(&attach->node);
 +  reservation_object_unlock(dmabuf->resv);
if (dmabuf->ops->detach)
dmabuf->ops->detach(dmabuf, attach);
 @@ -653,7 +659,16 @@ dma_buf_map_attachment_locked(struct 
 dma_buf_attachment *attach,
if (attach->sgt)
return attach->sgt;
 +  /*
 +   * Mapping a DMA-buf can trigger its invalidation, prevent sending this
 +   * event to the caller by temporary removing this attachment from the
 +   * list.
 +   */
 +  if (attach->invalidate)
 +  list_del(&attach->node);
>>> Just noticed this: Why do we need this? invalidate needs the reservation
>>> lock, as does map_attachment. It should be impossible to have someone else
>>> sneak in here.
>> I was having problems with self-triggered invalidations.
>>
>> E.g. client A tries to map an attachment, which in turn causes the buffer to
>> move to a new place and client A is informed about that movement with an
>> invalidation.
> Uh, that sounds like a bug in ttm or somewhere else in the exporter. If
> you evict the bo that you're trying to map, that's bad.
>
> Or maybe it's a framework bug, and we need to track whether an attachment
> has a map or not. That would make more sense ...

Well, neither. As far as I can see this is perfectly normal behavior.

We just don't want any invalidation sent to a driver which is currently 
making a mapping.

If you want I can do this in the driver as well, but at least off hand it 
looks like a good idea to have that in common code.

Tracking the mappings could work as well, but the problem here is that I 
actually want the lifetime of old and new mappings to overlap for 
pipelining.

Regards,
Christian.

> -Daniel



Re: [PATCH] drm/amdgpu: fix several indentation issues

2019-02-12 Thread Koenig, Christian
On 12.02.19 at 15:05, Colin King wrote:
> From: Colin Ian King 
>
> There are several statements that are incorrectly indented. Fix these.
>
> Signed-off-by: Colin Ian King 

Reviewed-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/dce_v6_0.c| 2 +-
>   drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c   | 2 +-
>   drivers/gpu/drm/amd/amdgpu/si.c  | 2 +-
>   drivers/gpu/drm/amd/powerplay/hwmgr/smu7_powertune.c | 2 +-
>   drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c   | 2 +-
>   6 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index bc62bf41b7e9..b65e18101108 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -207,7 +207,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, 
> unsigned long flags)
>   if (!r) {
>   acpi_status = amdgpu_acpi_init(adev);
>   if (acpi_status)
> - dev_dbg(&dev->pdev->dev,
> + dev_dbg(&dev->pdev->dev,
>   "Error during ACPI methods call\n");
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c 
> b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
> index db443ec53d3a..bea32f076b91 100644
> --- a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
> @@ -2980,7 +2980,7 @@ static int dce_v6_0_pageflip_irq(struct amdgpu_device 
> *adev,
>struct amdgpu_irq_src *source,
>struct amdgpu_iv_entry *entry)
>   {
> - unsigned long flags;
> + unsigned long flags;
>   unsigned crtc_id;
>   struct amdgpu_crtc *amdgpu_crtc;
>   struct amdgpu_flip_work *works;
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 221f26e50322..c69d51598cfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -32,7 +32,7 @@
>   
>   static u32 nbio_v7_4_get_rev_id(struct amdgpu_device *adev)
>   {
> -u32 tmp = RREG32_SOC15(NBIO, 0, mmRCC_DEV0_EPF0_STRAP0);
> + u32 tmp = RREG32_SOC15(NBIO, 0, mmRCC_DEV0_EPF0_STRAP0);
>   
>   tmp &= RCC_DEV0_EPF0_STRAP0__STRAP_ATI_REV_ID_DEV0_F0_MASK;
>   tmp >>= RCC_DEV0_EPF0_STRAP0__STRAP_ATI_REV_ID_DEV0_F0__SHIFT;
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 79c1a9bbcc21..9d8df68893b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -1436,7 +1436,7 @@ static int si_common_early_init(void *handle)
>   AMD_CG_SUPPORT_UVD_MGCG |
>   AMD_CG_SUPPORT_HDP_LS |
>   AMD_CG_SUPPORT_HDP_MGCG;
> - adev->pg_flags = 0;
> + adev->pg_flags = 0;
>   adev->external_rev_id = (adev->rev_id == 0) ? 1 :
>   (adev->rev_id == 1) ? 5 : 6;
>   break;
> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_powertune.c 
> b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_powertune.c
> index d138ddae563d..58f5589aaf12 100644
> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_powertune.c
> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_powertune.c
> @@ -1211,7 +1211,7 @@ int smu7_power_control_set_level(struct pp_hwmgr *hwmgr)
>   hwmgr->platform_descriptor.TDPAdjustment :
>   (-1 * hwmgr->platform_descriptor.TDPAdjustment);
>   
> -  if (hwmgr->chip_id > CHIP_TONGA)
> + if (hwmgr->chip_id > CHIP_TONGA)
>   target_tdp = ((100 + adjust_percent) * 
> (int)(cac_table->usTDP * 256)) / 100;
>   else
>   target_tdp = ((100 + adjust_percent) * 
> (int)(cac_table->usConfigurableTDP * 256)) / 100;
> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> index 0769b1ec562b..aad79affb081 100644
> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> @@ -3456,7 +3456,7 @@ static int vega20_apply_clocks_adjust_rules(struct 
> pp_hwmgr *hwmgr)
>   disable_mclk_switching = ((1 < hwmgr->display_config->num_display) &&
>  !hwmgr->display_config->multi_monitor_in_sync) ||
>   vblank_too_short;
> -latency = hwmgr->display_config->dce_tolerable_mclk_in_active_latency;
> + latency = hwmgr->display_config->dce_tolerable_mclk_in_active_latency;
>   
>   /* gfxclk */
>   dpm_table = &(data->dpm_table.gfx_table);



Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Koenig, Christian
On 30.01.19 at 09:02, Christoph Hellwig wrote:
> On Tue, Jan 29, 2019 at 08:58:35PM +, Jason Gunthorpe wrote:
>> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
>>
>>> implement the mapping. And I don't think we should have 'special' vma's
>>> for this (though we may need something to ensure we don't get mapping
>>> requests mixed with different types of pages...).
>> I think Jerome explained the point here is to have a 'special vma'
>> rather than a 'special struct page' as, really, we don't need a
>> struct page at all to make this work.
>>
>> If I recall your earlier attempts at adding struct page for BAR
>> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
> it work.  Without struct page none of the above can work at all.  That
> is why we use struct page for backing BARs in the existing P2P code.
> Not that I'm a particular fan of creating struct page for this device
> memory, but without major invasive surgery to large parts of the kernel
> it is the only way to make it work.

The problem seems to be that struct page does two things:

1. Memory management for system memory.
2. The object to work with in the I/O layer.

This was done because a good part of that stuff overlaps, like reference 
counting how often a page is used.  The problem now is that this doesn't 
work very well for device memory in some cases.

For example on GPUs you usually have a large amount of memory which is 
not even accessible by the CPU. In other words you can't easily create a 
struct page for it because you can't reference it with a physical CPU 
address.

Maybe struct page should be split up into smaller structures? I mean 
it's really overloaded with data.

Christian.




Re: [PATCH] drm: enable uncached DMA optimization for ARM and arm64

2019-01-24 Thread Koenig, Christian
On 24.01.19 at 13:06, Ard Biesheuvel wrote:
> The DRM driver stack is designed to work with cache coherent devices
> only, but permits an optimization to be enabled in some cases, where
> for some buffers, both the CPU and the GPU use uncached mappings,
> removing the need for DMA snooping and allocation in the CPU caches.
>
> The use of uncached GPU mappings relies on the correct implementation
> of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
> will use cached mappings nonetheless. On x86 platforms, this does not
> seem to matter, as uncached CPU mappings will snoop the caches in any
> case. However, on ARM and arm64, enabling this optimization on a
> platform where NoSnoop is ignored results in loss of coherency, which
> breaks correct operation of the device. Since we have no way of
> detecting whether NoSnoop works or not, just disable this
> optimization entirely for ARM and arm64.
>
> Cc: Christian Koenig 
> Cc: Alex Deucher 
> Cc: David Zhou 
> Cc: Huang Rui 
> Cc: Junwei Zhang 
> Cc: Michel Daenzer 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: Maarten Lankhorst 
> Cc: Maxime Ripard 
> Cc: Sean Paul 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Will Deacon 
> Cc: Christoph Hellwig 
> Cc: Robin Murphy 
> Cc: amd-gfx list 
> Cc: dri-devel 
> Reported-by: Carsten Haitzler 
> Signed-off-by: Ard Biesheuvel 

The subject line should probably read "disable uncached...".

With that fixed the patch is Reviewed-by: Christian König 
.

Regards,
Christian.

> ---
>   include/drm/drm_cache.h | 18 ++
>   1 file changed, 18 insertions(+)
>
> diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
> index bfe1639df02d..97fc498dc767 100644
> --- a/include/drm/drm_cache.h
> +++ b/include/drm/drm_cache.h
> @@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
>   return false;
>   #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
>   return false;
> +#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
> + /*
> +  * The DRM driver stack is designed to work with cache coherent devices
> +  * only, but permits an optimization to be enabled in some cases, where
> +  * for some buffers, both the CPU and the GPU use uncached mappings,
> +  * removing the need for DMA snooping and allocation in the CPU caches.
> +  *
> +  * The use of uncached GPU mappings relies on the correct implementation
> +  * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
> +  * will use cached mappings nonetheless. On x86 platforms, this does not
> +  * seem to matter, as uncached CPU mappings will snoop the caches in any
> +  * case. However, on ARM and arm64, enabling this optimization on a
> +  * platform where NoSnoop is ignored results in loss of coherency, which
> +  * breaks correct operation of the device. Since we have no way of
> +  * detecting whether NoSnoop works or not, just disable this
> +  * optimization entirely for ARM and arm64.
> +  */
> + return false;
>   #else
>   return true;
>   #endif



Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Koenig, Christian
On 24.01.19 at 12:26, Ard Biesheuvel wrote:
> On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
>  wrote:
>> On 24.01.19 at 10:59, Ard Biesheuvel wrote:
>>> [SNIP]
>>> This is *exactly* my point the whole time.
>>>
>>> The current code has
>>>
>>> static inline bool drm_arch_can_wc_memory(void)
>>> {
>>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>>>  return false;
>>>
>>> which means the optimization is disabled *unless the system is
>>> non-cache coherent*
>>>
>>> So if you have reports that the optimization works on some PowerPC, it
>>> must be non-cache coherent PowerPC, because that is the only place
>>> where it is enabled in the first place.
>>>
>>>> The only problematic one here actually seems to be ARM, so you should
>>>> probably just add an "#ifdef .._ARM return false;".
>>>>
>>> ARM/arm64 does not have a Kconfig symbol like
>>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
>>> there are non-coherent ARM systems that are currently working in the
>>> same way as those non-coherent PowerPC systems, we will break them by
>>> doing this.
>> Summing up what I've read so far for ARM, I actually think it
>> depends on a runtime configuration and not on a compile-time one.
>>
>> So the whole idea of providing the device to the drm_*_can_wc_memory()
>> function isn't so far fetched.
>>
> Thank you.
>
>> But for now I do prefer a working but slightly slower system over a
>> broken one, so I think we should just disable this on ARM for now.
>>
> Again, this is not about non-cache coherent being slower without the
> optimization, it is about non-cache coherent likely not working *at
> all* unless the optimization is enabled.

As Michel tried to explain, this CAN'T happen. The optimization is a 
purely optional request from userspace.

> Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> while the non-cache coherent device uses uncached attributes, breaking
> coherency.

Again this is mandated by the userspace APIs anyway. E.g. we can't 
vmap() pages in any other way or our userspace APIs would break.

Regards,
Christian.


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Koenig, Christian
On 24.01.19 at 10:59, Ard Biesheuvel wrote:
> [SNIP]
> This is *exactly* my point the whole time.
>
> The current code has
>
> static inline bool drm_arch_can_wc_memory(void)
> {
> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> return false;
>
> which means the optimization is disabled *unless the system is
> non-cache coherent*
>
> So if you have reports that the optimization works on some PowerPC, it
> must be non-cache coherent PowerPC, because that is the only place
> where it is enabled in the first place.
>
>> The only problematic one here actually seems to be ARM, so you should
>> probably just add an "#ifdef .._ARM return false;".
>>
> ARM/arm64 does not have a Kconfig symbol like
> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> there are non-coherent ARM systems that are currently working in the
> same way as those non-coherent PowerPC systems, we will break them by
> doing this.

Summing up what I've read so far for ARM, I actually think it 
depends on a runtime configuration and not on a compile-time one.

So the whole idea of providing the device to the drm_*_can_wc_memory() 
function isn't so far fetched.

But for now I do prefer a working but slightly slower system over a 
broken one, so I think we should just disable this on ARM for now.

Christian.


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Koenig, Christian
On 24.01.19 at 10:28, Ard Biesheuvel wrote:
> On Thu, 24 Jan 2019 at 10:25, Koenig, Christian
>  wrote:
>> On 24.01.19 at 10:13, Christoph Hellwig wrote:
>>> On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
>>>> But my concern is that it seems likely that non-cache coherent
>>>> implementations are relying on this hack as well. There must be a
>>>> reason that this hack is only disabled for PowerPC platforms if they
>>>> are cache coherent, for instance, and I suspect that that reason is
>>>> that the hack is the only thing ensuring that the CPU mapping
>>>> attributes match the device ones used for these buffers (the vmap()ed
>>>> ones), whereas the rings and other consistent data structures are
>>>> using the DMA API as intended, and thus getting uncached attributes in
>>>> the correct way.
>>> Dave, who added that commit is on Cc together with just about everyone
>>> involved in the review chain.  Based on the previous explanation
>>> that idea what we might want an uncached mapping for some non-coherent
>>> architectures for this to work at all makes sense, but then again
>>> the way to create those mappings is entirely architecture specific,
>>> and also need a cache flushing before creating the mapping to work
>>> properly.  So my working theory is that this code never properly
>>> worked on architectures without DMA coherent for PCIe at all, but
>>> I'd love to be corrected by concrete examples including an explanation
>>> of how it actually ends up working.
>> Cache coherency is mandatory for modern GPU operation.
>>
>> Otherwise you can't implement a bunch of the requirements of the
>> userspace APIs.
>>
>> In other words the application doesn't inform the driver that the GPU
>> or the CPU is accessing data, it just does it and assumes that it works.
>>
> Wonderful!
>
> In that case, do you have any objections to the patch proposed by
> Christoph above?

Yeah, Christoph's patch actually goes way too far, because we have 
reports that this works on a bunch of other architectures.

E.g. X86 64bit, PowerPC (under some conditions) and some MIPS.

The only problematic one here actually seems to be ARM, so you should 
probably just add an "#ifdef .._ARM return false;".

Regards,
Christian.


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Koenig, Christian
On 24.01.19 at 10:13, Christoph Hellwig wrote:
> On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
>> But my concern is that it seems likely that non-cache coherent
>> implementations are relying on this hack as well. There must be a
>> reason that this hack is only disabled for PowerPC platforms if they
>> are cache coherent, for instance, and I suspect that that reason is
>> that the hack is the only thing ensuring that the CPU mapping
>> attributes match the device ones used for these buffers (the vmap()ed
>> ones), whereas the rings and other consistent data structures are
>> using the DMA API as intended, and thus getting uncached attributes in
>> the correct way.
> Dave, who added that commit is on Cc together with just about everyone
> involved in the review chain.  Based on the previous explanation
> the idea that we might want an uncached mapping for some non-coherent
> architectures for this to work at all makes sense, but then again
> the way to create those mappings is entirely architecture specific,
> and also need a cache flushing before creating the mapping to work
> properly.  So my working theory is that this code never properly
> worked on architectures without DMA coherence for PCIe at all, but
> I'd love to be corrected by concrete examples including an explanation
> of how it actually ends up working.

Cache coherency is mandatory for modern GPU operation.

Otherwise you can't implement a bunch of the requirements of the 
userspace APIs.

In other words the application doesn't inform the driver that the GPU 
or the CPU is accessing data, it just does it and assumes that it works.

Regards,
Christian.


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-21 Thread Koenig, Christian
On 21.01.19 at 11:06, Ard Biesheuvel wrote:
> Currently, the DRM code assumes that PCI devices are always cache
> coherent for DMA, and that this can be selectively overridden for
> some buffers using non-cached mappings on the CPU side and PCIe
> NoSnoop transactions on the bus side.
>
> Whether the NoSnoop part is implemented correctly is highly platform
> specific. Whether it /matters/ if NoSnoop is implemented correctly or
> not is architecture specific: on x86, such transactions are coherent
> with the CPU whether the NoSnoop attribute is honored or not. On other
> architectures, it depends on whether such transactions may allocate in
> caches that are non-coherent with the CPU's uncached mappings.
>
> Bottom line is that we should not rely on this optimization to work
> correctly for cache coherent devices in the general case. On the
> other hand, disabling this optimization for non-coherent devices
> is likely to cause breakage as well, since the driver will assume
> cache coherent PCIe if this optimization is turned off.
>
> So rename drm_arch_can_wc_memory() to drm_device_can_wc_memory(), and
> pass the drm_device pointer into it so we can base the return value
> on whether the device is cache coherent or not if not running on
> X86.
>
> Cc: Christian Koenig 
> Cc: Alex Deucher 
> Cc: David Zhou 
> Cc: Huang Rui 
> Cc: Junwei Zhang 
> Cc: Michel Daenzer 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: Maarten Lankhorst 
> Cc: Maxime Ripard 
> Cc: Sean Paul 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Will Deacon 
> Reported-by: Carsten Haitzler 
> Signed-off-by: Ard Biesheuvel 

Looks sane to me, but I can't fully judge whether the check is actually 
correct or not.

So Acked-by: Christian König 

> ---
> This is a followup to '[RFC PATCH] drm/ttm: force cached mappings for system
> RAM on ARM'
>
> https://lore.kernel.org/linux-arm-kernel/20190110072841.3283-1-ard.biesheu...@linaro.org/
>
> Without t
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  2 +-
>   drivers/gpu/drm/radeon/radeon_object.c |  2 +-
>   include/drm/drm_cache.h| 19 +++
>   3 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 728e15e5d68a..777fa251838f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -480,7 +480,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>   /* For architectures that don't support WC memory,
>* mask out the WC flag from the BO
>*/
> - if (!drm_arch_can_wc_memory())
> + if (!drm_device_can_wc_memory(adev->ddev))
>   bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
>   #endif
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_object.c 
> b/drivers/gpu/drm/radeon/radeon_object.c
> index 833e909706a9..610889bf6ab5 100644
> --- a/drivers/gpu/drm/radeon/radeon_object.c
> +++ b/drivers/gpu/drm/radeon/radeon_object.c
> @@ -249,7 +249,7 @@ int radeon_bo_create(struct radeon_device *rdev,
>   /* For architectures that don't support WC memory,
>* mask out the WC flag from the BO
>*/
> - if (!drm_arch_can_wc_memory())
> + if (!drm_device_can_wc_memory(rdev->ddev))
>   bo->flags &= ~RADEON_GEM_GTT_WC;
>   #endif
>   
> diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
> index bfe1639df02d..ced63b1207a3 100644
> --- a/include/drm/drm_cache.h
> +++ b/include/drm/drm_cache.h
> @@ -33,6 +33,8 @@
>   #ifndef _DRM_CACHE_H_
>   #define _DRM_CACHE_H_
>   
> +#include 
> +#include 
>   #include 
>   
>   void drm_clflush_pages(struct page *pages[], unsigned long num_pages);
> @@ -41,15 +43,16 @@ void drm_clflush_virt_range(void *addr, unsigned long 
> length);
>   u64 drm_get_max_iomem(void);
>   
>   
> -static inline bool drm_arch_can_wc_memory(void)
> +static inline bool drm_device_can_wc_memory(struct drm_device *ddev)
>   {
> -#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> - return false;
> -#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
> - return false;
> -#else
> - return true;
> -#endif
> + if (IS_ENABLED(CONFIG_PPC))
> + return IS_ENABLED(CONFIG_NOT_COHERENT_CACHE);
> + else if (IS_ENABLED(CONFIG_MIPS))
> + return !IS_ENABLED(CONFIG_CPU_LOONGSON3);
> + else if (IS_ENABLED(CONFIG_X86))
> + return true;
> +
> + return !dev_is_dma_coherent(ddev->dev);
>   }
>   
>   #endif



Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-15 Thread Koenig, Christian
Am 16.01.19 um 01:33 schrieb Benjamin Herrenschmidt:
> On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote:
 As far as I know Power doesn't really support un-cached memory at all,
 except for a very very old and odd configuration with AGP.
>>> Hopefully Michael/Ben can elaborate here, but I was under the (possibly
>>> mistaken) impression that mismatched attributes could cause a machine-check
>>> on Power.
>> That's what I've always been told, but I can't actually find where it's
>> documented, I'll keep searching.
>>
>> But you're right that mixing cached / uncached is not really supported,
>> and probably results in a machine check or worse.
>   .. or worse :) It could checkstop.

Not sure if that would be so bad; it would at least give us a clear 
indicator that something is wrong instead of silently corrupting data.

> It's also my understanding that on ARM v7 and above, it's technically
> forbidden to map the same physical page with both cached and non-cached
> mappings, since the cached one could prefetch (or speculatively load),
> thus creating collisions and inconsistencies. Am I wrong here ?

No, but that answers the wrong question.

See, we don't want different cached and non-cached mappings on the CPU 
side; rather, we want to know whether a snooped DMA from PCIe counts as 
a cached access as well.

As far as I know on x86 it doesn't, so when you have an un-cached page 
you can still access it with a snooping DMA read/write operation and 
not cause trouble.

> The old hack of using non-cached mapping to avoid snoop cost in AGP and
> others is just that ... an ugly and horrible hack that should have
> never eventuated, when the search for performance pushes HW people into
> utter insanity :)

Well, I agree that un-cached system memory makes things much more 
complicated for a questionable gain.

But the fact is we now have to deal with the mess, so there is no point 
in complaining about it too much :)

Cheers,
Christian.

>
> Cheers,
> Ben.
>
>
>



Re: [PATCH] lib/scatterlist: Provide a DMA page iterator

2019-01-15 Thread Koenig, Christian
Am 16.01.19 um 08:09 schrieb Thomas Hellstrom:
> On Tue, 2019-01-15 at 21:58 +0100, h...@lst.de wrote:
>> On Tue, Jan 15, 2019 at 07:13:11PM +0000, Koenig, Christian wrote:
>>> Thomas is correct that the interface you propose here doesn't work
>>> at all for GPUs.
>>>
>>> The kernel driver is not informed of flush/sync, but rather just
>>> sets up coherent mappings between system memory and devices.
>>>
>>> In other words you have an array of struct pages and need to map
>>> that to
>>> a specific device and so create dma_addresses for the mappings.
>> If you want a coherent mapping you need to use dma_alloc_coherent
>> and dma_mmap_coherent and you are done, that is not the problem.
>> That actually is one of the vmgfx modes, so I don't understand what
>> problem we are trying to solve if you don't actually want a non-
>> coherent mapping.
> For vmwgfx, not making dma_alloc_coherent default has a couple of
> reasons:
> 1) Memory is associated with a struct device. It has not been clear
> that it is exportable to other devices.
> 2) There seems to be restrictions in the system pages allowable. GPUs
> generally prefer highmem pages but dma_alloc_coherent returns a virtual
> address implying GFP_KERNEL? While not used by vmwgfx, TTM typically
> prefers HIGHMEM pages to facilitate caching mode switching without
> having to touch the kernel map.
> 3) Historically we had APIs to allow coherent access to user-space
> defined pages. That has gone away not but the infrastructure was built
> around it.
>
> dma_mmap_coherent isn't use because as the data moves between system
> memory, swap and VRAM, PTEs of user-space mappings are adjusted
> accordingly, meaning user-space doesn't have to unmap when an operation
> is initiated that might mean the data is moved.

To summarize once more: We have an array of struct pages and want to 
coherently map that to a device.

If that is not possible for whatever reason, we want to get an error 
code or even not load the driver in the first place.

>
>
>> Although last time I had that discussion with Daniel Vetter
>> I was under the impressions that GPUs really wanted non-coherent
>> mappings.
> Intel historically has done things a bit differently. And it's also
> possible that embedded platforms and ARM prefer this mode of operation,
> but I haven't caught up on that discussion.
>
>> But if you want a coherent mapping you can't go to a struct page,
>> because on many systems you can't just map arbitrary memory as
>> uncachable.  It might either come from very special limited pools,
>> or might need other magic applied to it so that it is not visible
>> in the normal direct mapping, or at least not access through it.
>
> The TTM subsystem has been relied on to provide coherent memory with
> the option to switch caching mode of pages. But only on selected and
> well tested platforms. On other platforms we simply do not load, and
> that's fine for now.
>
> But as mentioned multiple times, to make GPU drivers more compliant,
> we'd really want that
>
> bool dma_streaming_is_coherent(const struct device *)
>
> API to help us decide when to load or not.

Yes, please.
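
A rough sketch of what such a helper could look like — the function is 
hypothetical (it was never merged in this form), implementing it on top 
of dev_is_dma_coherent() is an assumption, and the const qualifier from 
the proposal is dropped to match that API:

#include <linux/dma-noncoherent.h>

/* Hypothetical API as proposed above, not an existing kernel function. */
static inline bool dma_streaming_is_coherent(struct device *dev)
{
	return dev_is_dma_coherent(dev);
}

/* A driver could then simply refuse to load on unsupported platforms: */
static int example_gpu_probe(struct device *dev)
{
	if (!dma_streaming_is_coherent(dev))
		return -ENODEV;
	/* ... continue with normal driver setup ... */
	return 0;
}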

Christian.

>
> Thanks,
> Thomas
>
>
>
>
>
>
>



Re: [PATCH] lib/scatterlist: Provide a DMA page iterator

2019-01-15 Thread Koenig, Christian
Am 15.01.19 um 19:31 schrieb h...@lst.de:
> On Tue, Jan 15, 2019 at 06:03:39PM +, Thomas Hellstrom wrote:
>> In the graphics case, it's probably because it doesn't fit the graphics
>> use-cases:
>>
>> 1) Memory typically needs to be mappable by another device. (the "dma-
>> buf" interface)
> And there is nothing preventing dma-buf sharing of these buffers.
> Unlike the get_sgtable mess it can actually work reliably on
> architectures that have virtually tagged caches and/or don't
> guarantee cache coherency with mixed attribute mappings.
>
>> 2) DMA buffers are exported to user-space and is sub-allocated by it.
>> Mostly there are no GPU user-space kernel interfaces to sync / flush
>> subregions and these syncs may happen on a smaller-than-cache-line
>> granularity.
> I know of no architectures that can do cache maintainance on a less
> than cache line basis.  Either the instructions require you to
> specifcy cache lines, or they do sometimes more, sometimes less
> intelligent rounding up.
>
> Note that as long dma non-coherent buffers are devices owned it
> is up to the device and the user space driver to take care of
> coherency, the kernel very much is out of the picture.

Thomas is correct that the interface you propose here doesn't work at 
all for GPUs.

The kernel driver is not informed of flush/sync, but rather just sets up 
coherent mappings between system memory and devices.

In other words you have an array of struct pages and need to map that to 
a specific device and so create dma_addresses for the mappings.
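
As a rough illustration of that model (a sketch only — the helper name 
is made up, the scatterlist/DMA calls are the standard kernel ones):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Sketch: turn an array of struct pages into device-visible
 * dma_addresses. There is no later flush/sync call; the mapping is
 * simply assumed to stay coherent for the lifetime of the buffer. */
static int map_pages_to_device(struct device *dev, struct page **pages,
			       unsigned int npages, struct sg_table *sgt)
{
	int ret;

	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
					(unsigned long)npages << PAGE_SHIFT,
					GFP_KERNEL);
	if (ret)
		return ret;

	/* This is where the dma_addresses are created. */
	if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
		sg_free_table(sgt);
		return -ENOMEM;
	}
	return 0;
}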

Regards,
Christian.


Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-14 Thread Koenig, Christian
Am 14.01.19 um 20:13 schrieb Will Deacon:
> On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote:
>> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:
>>  - The reason remapping the CPU side as cacheable does work 
>> (which I
>>  did test) is because the GPU's uncacheable accesses (which I 
>> assume
>>  are made using the NoSnoop PCIe transaction attribute) are 
>> actually
>>  emitted as cacheable in some cases.
>> . On my AMD Seattle, with or without SMMU (which is stage 2 
>> only), I
>>  must use cacheable accesses from the CPU side or things are 
>> broken.
>>  This might be a h/w flaw, though.
>> . On systems with stage 1+2 SMMUs, the driver uses stage 1
>>  translations which always override the memory attributes to 
>> cacheable
>>  for DMA coherent devices. This is what is affecting the Cavium
>>  ThunderX2 (although it appears the attributes emitted by the RC 
>> may be
>>  incorrect as well.)
>>
>>  The latter issue is a shortcoming in the SMMU driver that we 
>> have to
>>  fix, i.e., it should take care not to modify the incoming 
>> attributes
>>  of DMA coherent PCIe devices for NoSnoop to be able to work.
>>
>>  So in summary, the mismatch appears to be between the CPU 
>> accessing
>>  the vmap region with non-cacheable attributes and the GPU 
>> accessing
>>  the same memory with cacheable attributes, resulting in a loss 
>> of
>>  coherency and lots of visible corruption.
>>
>>  Actually it is the other way around. The CPU thinks some data is in 
>> the
>>  cache and the GPU only updates the system memory version because the
>>  snoop flag is not set.
>>
>>
>>  That doesn't seem to be what is happening. As far as we can tell from
>>  our experiments, all inbound transactions are always cacheable, and so
>>  the only way to make things work is to ensure that the CPU uses the
>>  same attributes.
>>
>>
>> Ok, that doesn't make any sense. Whether inbound transactions are cacheable
>> or not is irrelevant when the CPU always uses uncached accesses.
>>
>> See on the PCIe side you have the snoop bit in the read/write transactions
>> which tells the root hub if the device wants to snoop caches or not.
>>
>> When the CPU accesses some memory as cached then devices need to snoop the
>> cache for coherent accesses.
>>
>> When the CPU accesses some memory as uncached then devices can disable 
>> snooping
>> to improve performance, but when they don't do this it is mandated by the 
>> spec
>> that this still works.
> Which spec?

The PCIe spec. The snoop bit (or rather the NoSnoop bit) in the 
transaction is perfectly optional IIRC.

> The Arm architecture (and others including Power afaiu) doesn't
> guarantee coherency when memory is accessed using mismatched cacheability
> attributes.

Well what exactly goes wrong on ARM?

As far as I know Power doesn't really support un-cached memory at all, 
except for a very very old and odd configuration with AGP.

I mean in theory I agree that devices should use matching cacheability 
attributes, but in practice I know of quite a bunch of devices/engines 
which fail to do this correctly.

Regards,
Christian.

>
> Will



Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-14 Thread Koenig, Christian
Am 14.01.19 um 11:53 schrieb Ard Biesheuvel:
> On Thu, 10 Jan 2019 at 10:34, Michel Dänzer  wrote:
>> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
>>> ARM systems do not permit the use of anything other than cached
>>> mappings for system memory, since that memory may be mapped in the
>>> linear region as well, and the architecture does not permit aliases
>>> with mismatched attributes.
>>>
>>> So short-circuit the evaluation in ttm_io_prot() if the flags include
>>> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
>>> attributes immediately.
>>>
>>> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
>>> Without this change, amdgpu does not start at all, and radeon only
>>> produces corrupt display output.
>>>
>>> Cc: Christian Koenig 
>>> Cc: Huang Rui 
>>> Cc: Junwei Zhang 
>>> Cc: David Airlie 
>>> Reported-by: Carsten Haitzler 
>>> Signed-off-by: Ard Biesheuvel 
>>> ---
>>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
>>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> index 046a6dda690a..0c1eef5f7ae3 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t 
>>> tmp)
>>>if (caching_flags & TTM_PL_FLAG_CACHED)
>>>return tmp;
>>>
>>> +#if defined(__arm__) || defined(__aarch64__)
>>> + /* ARM only permits cached mappings of system memory */
>>> + if (caching_flags & TTM_PL_SYSTEM)
>>> + return tmp;
>>> +#endif
>>>   #if defined(__i386__) || defined(__x86_64__)
>>>if (caching_flags & TTM_PL_FLAG_WC)
>>>tmp = pgprot_writecombine(tmp);
>>>
>> Apart from Christian's concerns, I think this is the wrong place for
>> this, because other TTM / driver code will still consider the memory
>> uncacheable. E.g. the amdgpu driver will program the GPU to treat the
>> memory as uncacheable, so it won't participate in cache coherency
>> protocol for it, which is unlikely to work as expected in general if the
>> CPU treats the memory as cacheable.
>>
> Will and I have spent some time digging into this, so allow me to
> share some preliminary findings while we carry on and try to fix this
> properly.
>
> - The patch above is flawed, i.e., it doesn't do what it intends to
> since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies
> for that.
> - The existence of a linear region mapping with mismatched attributes
> is likely not the culprit here. (We do something similar with
> non-cache coherent DMA in other places).

This is still rather problematic.

The issue is that we often don't create a vmap for a page, but rather 
access the page directly using the linear mapping.

So we would use the wrong access type here.

> - The reason remapping the CPU side as cacheable does work (which I
> did test) is because the GPU's uncacheable accesses (which I assume
> are made using the NoSnoop PCIe transaction attribute) are actually
> emitted as cacheable in some cases.
>. On my AMD Seattle, with or without SMMU (which is stage 2 only), I
> must use cacheable accesses from the CPU side or things are broken.
> This might be a h/w flaw, though.
>. On systems with stage 1+2 SMMUs, the driver uses stage 1
> translations which always override the memory attributes to cacheable
> for DMA coherent devices. This is what is affecting the Cavium
> ThunderX2 (although it appears the attributes emitted by the RC may be
> incorrect as well.)
>
> The latter issue is a shortcoming in the SMMU driver that we have to
> fix, i.e., it should take care not to modify the incoming attributes
> of DMA coherent PCIe devices for NoSnoop to be able to work.
>
> So in summary, the mismatch appears to be between the CPU accessing
> the vmap region with non-cacheable attributes and the GPU accessing
> the same memory with cacheable attributes, resulting in a loss of
> coherency and lots of visible corruption.

Actually it is the other way around. The CPU thinks some data is in the 
cache and the GPU only updates the system memory version because the 
snoop flag is not set.

> To be able to debug this further, could you elaborate a bit on
> - How does the hardware emit those uncached/wc inbound accesses? Do
> they rely on NoSnoop?

The GPU has a separate page walker in the MC, and the page tables there 
have bits saying whether the access should go to the PCIe bus and, if 
so, whether the snoop bit should be set.
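
For illustration, the relevant PTE flags in amdgpu look roughly like 
this (modeled on the AMDGPU_PTE_* definitions; treat the exact bit 
positions as an assumption):

/* Illustrative GPU page table bits; exact positions are an assumption. */
#define AMDGPU_PTE_VALID	(1ULL << 0)
#define AMDGPU_PTE_SYSTEM	(1ULL << 1)	/* access goes to system memory over PCIe */
#define AMDGPU_PTE_SNOOPED	(1ULL << 2)	/* set the snoop bit on those transactions */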

> - Christian pointed out that some accesses must be uncached even when
> not using WC. What kind of accesses are those? And do they access
> system RAM?

On some hardware generations we have a buggy engine which fails to 
forward the snoop bit, and because of this the system memory pages used 
by this engine must be uncached. But this only applies if you use ROCm 
in a specific configuration.

Regards,
Christian.


Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-10 Thread Koenig, Christian
Hi Ard,

thanks a lot for this! At least somebody who can explain why this 
doesn't work as expected.

The problem is that the hardware actually needs a few pages to be 
uncached in a couple of cases to work correctly. So we could still run 
into issues with that solution.

For now we have blocked userspace/kernel from extensively using write 
combine mappings by adjusting drm_arch_can_wc_memory(), but that is 
probably degrading performance quite a bit.

What can be done to improve the solution? At least on x86 we solve this 
by marking the write-combined pages as uncached in the kernel linear 
mapping as well. Would this be doable on ARM too?
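
For reference, a minimal sketch of the x86 side, assuming the set_memory 
helpers TTM already uses there — switching a page to write-combined 
fixes up its attribute in the kernel linear mapping as well, so no 
cached alias remains:

#include <asm/set_memory.h>

/* Sketch: hand a page out as write-combined on x86. */
static int make_page_wc(struct page *page)
{
	/* Also changes the page's attribute in the linear mapping. */
	return set_pages_array_wc(&page, 1);
}

static void make_page_cached_again(struct page *page)
{
	set_pages_array_wb(&page, 1);
}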

Thanks,
Christian.

Am 10.01.19 um 08:28 schrieb Ard Biesheuvel:
> ARM systems do not permit the use of anything other than cached
> mappings for system memory, since that memory may be mapped in the
> linear region as well, and the architecture does not permit aliases
> with mismatched attributes.
>
> So short-circuit the evaluation in ttm_io_prot() if the flags include
> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> attributes immediately.
>
> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> Without this change, amdgpu does not start at all, and radeon only
> produces corrupt display output.
>
> Cc: Christian Koenig 
> Cc: Huang Rui 
> Cc: Junwei Zhang 
> Cc: David Airlie 
> Reported-by: Carsten Haitzler 
> Signed-off-by: Ard Biesheuvel 
> ---
>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
> b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 046a6dda690a..0c1eef5f7ae3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t 
> tmp)
>   if (caching_flags & TTM_PL_FLAG_CACHED)
>   return tmp;
>   
> +#if defined(__arm__) || defined(__aarch64__)
> + /* ARM only permits cached mappings of system memory */
> + if (caching_flags & TTM_PL_SYSTEM)
> + return tmp;
> +#endif
>   #if defined(__i386__) || defined(__x86_64__)
>   if (caching_flags & TTM_PL_FLAG_WC)
>   tmp = pgprot_writecombine(tmp);



Re: [PATCH 2/2] drm: Revert syncobj timeline changes.

2018-12-21 Thread Koenig, Christian
Am 21.12.18 um 19:35 schrieb Dmitry Osipenko:
> On 21.12.2018 21:27, Christian König wrote:
>> Am 19.12.18 um 18:53 schrieb Dmitry Osipenko:
>>> [SNIP]
 @@ -931,9 +718,6 @@ static signed long 
 drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
      if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT) {
    for (i = 0; i < count; ++i) {
 -    if (entries[i].fence)
 -    continue;
 -
    drm_syncobj_fence_get_or_add_callback(syncobjs[i],
      &entries[i].fence,
      &entries[i].syncobj_cb,
>>> Hello,
>>>
>>> The above three removed lines we added in commit 337fe9f5c1e7de 
>>> ("drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set") that fixed a 
>>> memleak. Removal of the lines returns the memleak because of disbalanced 
>>> fence refcounting and it looks like they were removed unintentionally in 
>>> this patch.
>> That was already fixed with 61a98b1b9a8c7 drm/syncobj: remove drm_syncobj_cb 
>> and cleanup.
>>
>> This cleanup removed all the duplicate checking and now adds the 
>> callback only once.
> Okay, though that commit is not in linux-next. I assume it will show up 
> eventually.

Need to double-check; that could indeed be a problem.

Christian.


Re: [PATCH 1/4] mm: Check if mmu notifier callbacks are allowed to fail

2018-12-10 Thread Koenig, Christian
Patches #1 and #3 are Reviewed-by: Christian König 


Patch #2 is Acked-by: Christian König  because 
I can't judge if adding the counter in the thread structure is actually 
a good idea.

In patch #4 I honestly don't understand at all how this stuff works, so 
no comment from my side on this.

Christian.

Am 10.12.18 um 11:36 schrieb Daniel Vetter:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
>
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
>
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
>
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
>
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Signed-off-by: Daniel Vetter 
> ---
>   mm/mmu_notifier.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 5119ff846769..ccc22f21b735 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -190,6 +190,9 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   pr_info("%pS callback failed with %d in %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
>   !blockable ? "non-" : "");
> + if (blockable)
> + pr_warn("%pS callback failure not allowed\n",
> + mn->ops->invalidate_range_start);
>   ret = _ret;
>   }
>   }



Re: [PATCH] drm/sched: Always trace the dependencies we wait on, to fix a race.

2018-12-07 Thread Koenig, Christian
Am 07.12.18 um 20:16 schrieb Eric Anholt:
> The entity->dependency can go away completely once we've called
> drm_sched_entity_add_dependency_cb() (if the cb is called before we
> get around to tracing).  The tracepoint is more useful if we trace
> every dependency instead of just ones that get callbacks installed,
> anyway, so just do that.
>
> Fixes any easy-to-produce OOPS when tracing the scheduler on V3D with
> "perf record -a -e gpu_scheduler:.\* glxgears" and DEBUG_SLAB enabled.
>
> Signed-off-by: Eric Anholt 

Reviewed-by: Christian König 

Going to pick that up for upstream and will add with a CC: stable.

Thanks,
Christian.

> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
>   1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 4463d3826ecb..e2942c9a11a7 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -440,13 +440,10 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct 
> drm_sched_entity *entity)
>   
>   while ((entity->dependency =
>   sched->ops->dependency(sched_job, entity))) {
> + trace_drm_sched_job_wait_dep(sched_job, entity->dependency);
>   
> - if (drm_sched_entity_add_dependency_cb(entity)) {
> -
> - trace_drm_sched_job_wait_dep(sched_job,
> -  entity->dependency);
> + if (drm_sched_entity_add_dependency_cb(entity))
>   return NULL;
> - }
>   }
>   
>   /* skip jobs from entity that marked guilty */



Re: linux-next: build failure after merge of the drm-misc tree

2018-12-07 Thread Koenig, Christian
Hi Stephen,

Yeah, that is a known problem. I missed the change during the rebase of 
the revert.

Please see patch "2312f9842854 drm/v3d: fix broken build" which is 
already in drm-misc-next and fixes the issue.

Christian.

Am 06.12.18 um 03:32 schrieb Stephen Rothwell:
> Hi all,
>
> After merging the drm-misc tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
>
> drivers/gpu/drm/v3d/v3d_gem.c: In function 'v3d_submit_tfu_ioctl':
> drivers/gpu/drm/v3d/v3d_gem.c:719:3: error: too many arguments to function 'drm_syncobj_replace_fence'
> drm_syncobj_replace_fence(sync_out, 0, sched_done_fence);
> ^
> In file included from drivers/gpu/drm/v3d/v3d_gem.c:5:
> include/drm/drm_syncobj.h:134:6: note: declared here
>   void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
>^
>
> Caused by commit
>
>    0b258ed1a219 ("drm: revert "expand replace_fence to support timeline point v2"")
>
> interacting with commit
>
>1584f16ca96e ("drm/v3d: Add support for submitting jobs to the TFU")
>
> I have used the drm-misc tree from next-20181205 for today.
>



Re: [PATCH] dma-buf: fix debugfs versus rcu and fence dumping v2

2018-12-06 Thread Koenig, Christian
Am 06.12.18 um 17:58 schrieb Chris Wilson:
> Quoting jgli...@redhat.com (2018-12-06 15:47:04)
>> From: Jérôme Glisse 
>>
>> The debugfs take reference on fence without dropping them. Also the
>> rcu section are not well balance. Fix all that ...
> Wouldn't the code be a lot simpler (and a consistent snapshot) if it used
> reservation_object_get_fences_rcu()?

Yeah, thought about that as well.

Or even better, move that code into reservation_object.c as 
reservation_object_show_fences() or something like that.
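
A sketch of what such a helper could look like — the function itself is 
hypothetical, but building it on reservation_object_get_fences_rcu() 
gives a consistent snapshot and keeps the reference counting balanced:

/* Hypothetical helper for reservation_object.c: dump all fences of a
 * reservation object to a seq_file. */
static void reservation_object_show_fences(struct reservation_object *robj,
					   struct seq_file *s)
{
	struct dma_fence *excl, **shared;
	unsigned i, count;

	if (reservation_object_get_fences_rcu(robj, &excl, &count, &shared))
		return;

	if (excl) {
		seq_printf(s, "\tExclusive fence: %s %s %ssignalled\n",
			   excl->ops->get_driver_name(excl),
			   excl->ops->get_timeline_name(excl),
			   dma_fence_is_signaled(excl) ? "" : "un");
		dma_fence_put(excl);
	}
	for (i = 0; i < count; i++) {
		seq_printf(s, "\tShared fence: %s %s %ssignalled\n",
			   shared[i]->ops->get_driver_name(shared[i]),
			   shared[i]->ops->get_timeline_name(shared[i]),
			   dma_fence_is_signaled(shared[i]) ? "" : "un");
		dma_fence_put(shared[i]);
	}
	kfree(shared);
}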

Christian.

> -Chris



Re: [PATCH] dma-buf: balance refcount inbalance

2018-12-06 Thread Koenig, Christian
Am 06.12.18 um 17:18 schrieb jgli...@redhat.com:
> From: Jérôme Glisse 
>
> The debugfs take reference on fence without dropping them.
>
> Signed-off-by: Jérôme Glisse 
> Cc: Christian König 
> Cc: Daniel Vetter 
> Cc: Sumit Semwal 
> Cc: linux-me...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: linaro-mm-...@lists.linaro.org
> Cc: Stéphane Marchesin 
> Cc: sta...@vger.kernel.org

Reviewed-by: Christian König 

> ---
>   drivers/dma-buf/dma-buf.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 13884474d158..69842145c223 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1069,6 +1069,7 @@ static int dma_buf_debug_show(struct seq_file *s, void 
> *unused)
>  fence->ops->get_driver_name(fence),
>  fence->ops->get_timeline_name(fence),
>  dma_fence_is_signaled(fence) ? "" : "un");
> + dma_fence_put(fence);
>   }
>   rcu_read_unlock();
>   



Re: [PATCH] dma-buf: fix debugfs versus rcu and fence dumping

2018-12-06 Thread Koenig, Christian
Am 06.12.18 um 16:21 schrieb Jerome Glisse:
> On Thu, Dec 06, 2018 at 08:09:28AM +0000, Koenig, Christian wrote:
>> Am 06.12.18 um 02:41 schrieb jgli...@redhat.com:
>>> From: Jérôme Glisse 
>>>
>>> The debugfs take reference on fence without dropping them. Also the
>>> rcu section are not well balance. Fix all that ...
>>>
>>> Signed-off-by: Jérôme Glisse 
>>> Cc: Christian König 
>>> Cc: Daniel Vetter 
>>> Cc: Sumit Semwal 
>>> Cc: linux-me...@vger.kernel.org
>>> Cc: dri-de...@lists.freedesktop.org
>>> Cc: linaro-mm-...@lists.linaro.org
>>> Cc: Stéphane Marchesin 
>>> Cc: sta...@vger.kernel.org
>> Well NAK, you are now taking the RCU lock twice, and dropping the RCU lock
>> while still accessing fobj has a huge potential for accessing freed-up memory.
>>
>> The only correct thing I can see here is to grab a reference to the
>> fence before printing any info on it,
>> Christian.
> Hu ? That is exactly what i am doing, take reference under rcu,
> rcu_unlock print the fence info, drop the fence reference, rcu
> lock rinse and repeat ...
>
> Note that the fobj in _existing_ code is access outside the rcu
> end that there is an rcu imbalance in that code ie a lonlely
> rcu_unlock after the for loop.
>
> So that the existing code is broken.

No, the existing code is perfectly fine.

Please note the break in the loop before the rcu_read_unlock():
>   if (!read_seqcount_retry(&robj->seq, seq))
>   break; <- HERE!
>   rcu_read_unlock();
>   }

So your patch breaks that and takes the RCU read lock twice.
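
For reference, the shape of the existing loop — the break is taken while 
the RCU read lock is still held, so the rcu_read_unlock() after the 
print statements pairs with the final, successful iteration:

	while (true) {
		seq = read_seqcount_begin(&robj->seq);
		rcu_read_lock();
		/* snapshot fobj, shared_count and the exclusive fence */
		if (!read_seqcount_retry(&robj->seq, seq))
			break;		/* success: leave with the lock held */
		rcu_read_unlock();	/* retry: drop the lock and try again */
	}
	/* ... print the snapshot ... */
	rcu_read_unlock();		/* pairs with the final iteration */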

Regards,
Christian.

>
>>> ---
>>>drivers/dma-buf/dma-buf.c | 11 +--
>>>1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 13884474d158..f6f4de42ac49 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1051,24 +1051,31 @@ static int dma_buf_debug_show(struct seq_file *s, 
>>> void *unused)
>>> fobj = rcu_dereference(robj->fence);
>>> shared_count = fobj ? fobj->shared_count : 0;
>>> fence = rcu_dereference(robj->fence_excl);
>>> +   fence = dma_fence_get_rcu(fence);
>>> if (!read_seqcount_retry(&robj->seq, seq))
>>> break;
>>> rcu_read_unlock();
>>> }
>>> -
>>> -   if (fence)
>>> +   if (fence) {
>>> seq_printf(s, "\tExclusive fence: %s %s %ssignalled\n",
>>>fence->ops->get_driver_name(fence),
>>>fence->ops->get_timeline_name(fence),
>>>dma_fence_is_signaled(fence) ? "" : "un");
>>> +   dma_fence_put(fence);
>>> +   }
>>> +
>>> +   rcu_read_lock();
>>> for (i = 0; i < shared_count; i++) {
>>> fence = rcu_dereference(fobj->shared[i]);
>>> if (!dma_fence_get_rcu(fence))
>>> continue;
>>> +   rcu_read_unlock();
>>> seq_printf(s, "\tShared fence: %s %s %ssignalled\n",
>>>fence->ops->get_driver_name(fence),
>>>fence->ops->get_timeline_name(fence),
>>>dma_fence_is_signaled(fence) ? "" : "un");
>>> +   dma_fence_put(fence);
>>> +   rcu_read_lock();
>>> }
>>> rcu_read_unlock();
>>>



Re: [PATCH] dma-buf: fix debugfs versus rcu and fence dumping

2018-12-06 Thread Koenig, Christian
Am 06.12.18 um 02:41 schrieb jgli...@redhat.com:
> From: Jérôme Glisse 
>
> The debugfs take reference on fence without dropping them. Also the
> rcu section are not well balance. Fix all that ...
>
> Signed-off-by: Jérôme Glisse 
> Cc: Christian König 
> Cc: Daniel Vetter 
> Cc: Sumit Semwal 
> Cc: linux-me...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: linaro-mm-...@lists.linaro.org
> Cc: Stéphane Marchesin 
> Cc: sta...@vger.kernel.org

Well NAK, you are now taking the RCU lock twice, and dropping the RCU 
lock while still accessing fobj has a huge potential for accessing 
freed-up memory.

The only correct thing I can see here is to grab a reference to the 
fence before printing any info on it,
Christian.

> ---
>   drivers/dma-buf/dma-buf.c | 11 +--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 13884474d158..f6f4de42ac49 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1051,24 +1051,31 @@ static int dma_buf_debug_show(struct seq_file *s, 
> void *unused)
>   fobj = rcu_dereference(robj->fence);
>   shared_count = fobj ? fobj->shared_count : 0;
>   fence = rcu_dereference(robj->fence_excl);
> + fence = dma_fence_get_rcu(fence);
>   if (!read_seqcount_retry(&robj->seq, seq))
>   break;
>   rcu_read_unlock();
>   }
> -
> - if (fence)
> + if (fence) {
>   seq_printf(s, "\tExclusive fence: %s %s %ssignalled\n",
>  fence->ops->get_driver_name(fence),
>  fence->ops->get_timeline_name(fence),
>  dma_fence_is_signaled(fence) ? "" : "un");
> + dma_fence_put(fence);
> + }
> +
> + rcu_read_lock();
>   for (i = 0; i < shared_count; i++) {
>   fence = rcu_dereference(fobj->shared[i]);
>   if (!dma_fence_get_rcu(fence))
>   continue;
> + rcu_read_unlock();
>   seq_printf(s, "\tShared fence: %s %s %ssignalled\n",
>  fence->ops->get_driver_name(fence),
>  fence->ops->get_timeline_name(fence),
>  dma_fence_is_signaled(fence) ? "" : "un");
> + dma_fence_put(fence);
> + rcu_read_lock();
>   }
>   rcu_read_unlock();
>   



Re: [PATCH] drm/sched: Fix a use-after-free when tracing the scheduler.

2018-12-03 Thread Koenig, Christian
Am 03.12.18 um 21:14 schrieb Eric Anholt:
> With DEBUG_SLAB (poisoning on free) enabled, I could quickly produce
> an oops when tracing V3D.

Good catch, but the solution is a clear NAK.

drm_sched_entity_add_dependency_cb() can result in setting 
entity->dependency to NULL. That in turn can lead to a memory leak, 
because we then call the _put with a NULL fence and never drop the 
reference the _get took.

Instead we should rather call trace_drm_sched_job_wait_dep() before even 
calling drm_sched_entity_add_dependency_cb(). This is also cleaner 
because we want to trace which dependencies the driver gave to the 
scheduler and not which we actually needed to add a callback to.
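
In code, the suggested ordering is simply (this matches Eric's follow-up 
patch "drm/sched: Always trace the dependencies we wait on, to fix a 
race."):

	while ((entity->dependency =
			sched->ops->dependency(sched_job, entity))) {
		trace_drm_sched_job_wait_dep(sched_job, entity->dependency);

		if (drm_sched_entity_add_dependency_cb(entity))
			return NULL;
	}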

Regards,
Christian.

>
> Signed-off-by: Eric Anholt 
> ---
>
> I think this patch is correct (though maybe a bigger refactor could
> avoid the extra get/put?), but I've still got this with "vblank_mode=0
> perf record -a -e v3d:.\* -e gpu_scheduler:.\* glxgears".  Any ideas?
>
> [  139.842191] Unable to handle kernel NULL pointer dereference at virtual 
> address 0020
> [  139.850413] pgd = eab7bb57
> [  139.854424] [0020] *pgd=8040004003, *pmd=
> [  139.860523] Internal error: Oops: 206 [#1] SMP ARM
> [  139.865340] Modules linked in:
> [  139.868404] CPU: 0 PID: 1161 Comm: v3d_render Not tainted 4.20.0-rc4+ #552
> [  139.875287] Hardware name: Broadcom STB (Flattened Device Tree)
> [  139.881228] PC is at perf_trace_drm_sched_job_wait_dep+0xa8/0xf4
> [  139.887243] LR is at 0xe9790274
> [  139.890388] pc : []lr : []psr: a0050013
> [  139.896662] sp : ed21dec0  ip : ed21dec0  fp : ed21df04
> [  139.901893] r10: ed267478  r9 :   r8 : ff7bde04
> [  139.907123] r7 :   r6 : 0063  r5 :   r4 : c1208448
> [  139.913659] r3 : c1265690  r2 : ff7bf660  r1 : 0034  r0 : ff7bf660
> [  139.920196] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
> user
> [  139.927339] Control: 30c5383d  Table: 68fa3b40  DAC: fffd
> [  139.933095] Process v3d_render (pid: 1161, stack limit = 0xb3c84b1b)
> [  139.939457] Stack: (0xed21dec0 to 0xed21e000)
> [  139.943821] dec0: 20050013  eb9700cc  ec0e8e80  
> eb9700cc e9790274
> [  139.952009] dee0:  e2f59345 eb970078 eba8f680 c12ae00c c1208478 
>  e8c2b048
> [  139.960197] df00: eb9700cc c06e92e4 c06e8f04  80050013 ed267478 
> eb970078 
> [  139.968385] df20: ed267578 c0e45ae0 e9093080 c06e831c ed267630 c06e8120 
> c06e77d4 c1208448
> [  139.976573] df40: ee2e8acc 0001  ee2e8640 c0272ab4 ed21df54 
> ed21df54 e2f59345
> [  139.984762] df60: ed21c000 ed1b4800 ed2d7840  ed21c000 ed267478 
> c06e8084 ee935cb0
> [  139.992950] df80: ed1b4838 c0249b44 ed21c000 ed2d7840 c02499e4  
>  
> [  140.001138] dfa0:    c02010ac   
>  
> [  140.009326] dfc0:       
>  
> [  140.017514] dfe0:     0013  
>  
> [  140.025707] [] (perf_trace_drm_sched_job_wait_dep) from 
> [] (drm_sched_entity_pop_job+0x394/0x438)
> [  140.036332] [] (drm_sched_entity_pop_job) from [] 
> (drm_sched_main+0x9c/0x298)
> [  140.045221] [] (drm_sched_main) from [] 
> (kthread+0x160/0x168)
> [  140.052716] [] (kthread) from [] 
> (ret_from_fork+0x14/0x28)
>
>   drivers/gpu/drm/scheduler/sched_entity.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 4463d3826ecb..0d4fc86089cb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -440,13 +440,15 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct 
> drm_sched_entity *entity)
>   
>   while ((entity->dependency =
>   sched->ops->dependency(sched_job, entity))) {
> -
> + dma_fence_get(entity->dependency);
>   if (drm_sched_entity_add_dependency_cb(entity)) {
>   
>   trace_drm_sched_job_wait_dep(sched_job,
>entity->dependency);
> + dma_fence_put(entity->dependency);
>   return NULL;
>   }
> + dma_fence_put(entity->dependency);
>   }
>   
>   /* skip jobs from entity that marked guilty */



Re: [PATCH 1/2] Revert "drm/sched: fix timeout handling v2"

2018-11-08 Thread Koenig, Christian
Am 08.11.18 um 17:19 schrieb Eric Anholt:
> "Koenig, Christian"  writes:
>
>> Am 08.11.18 um 17:04 schrieb Eric Anholt:
>>> This reverts commit 0efd2d2f68cd5dbddf4ecd974c33133257d16a8e.  Fixes
>>> this failure in V3D GPU reset:
>>>
>>> [ 1418.227796] Unable to handle kernel NULL pointer dereference at virtual 
>>> address 0018
>>> [ 1418.235947] pgd = dc4c55ca
>>> [ 1418.238695] [0018] *pgd=8040004003, *pmd=
>>> [ 1418.244132] Internal error: Oops: 206 [#1] SMP ARM
>>> [ 1418.248934] Modules linked in:
>>> [ 1418.252001] CPU: 0 PID: 10253 Comm: kworker/0:0 Not tainted 4.19.0-rc6+ 
>>> #486
>>> [ 1418.259058] Hardware name: Broadcom STB (Flattened Device Tree)
>>> [ 1418.265002] Workqueue: events drm_sched_job_timedout
>>> [ 1418.269986] PC is at dma_fence_remove_callback+0x8/0x50
>>> [ 1418.275218] LR is at drm_sched_job_timedout+0x4c/0x118
>>> ...
>>> [ 1418.415891] [] (dma_fence_remove_callback) from [] 
>>> (drm_sched_job_timedout+0x4c/0x118)
>>> [ 1418.425571] [] (drm_sched_job_timedout) from [] 
>>> (process_one_work+0x2c8/0x7bc)
>>> [ 1418.434552] [] (process_one_work) from [] 
>>> (worker_thread+0x44/0x590)
>>> [ 1418.442663] [] (worker_thread) from [] 
>>> (kthread+0x160/0x168)
>>> [ 1418.450076] [] (kthread) from [] 
>>> (ret_from_fork+0x14/0x28)
>>>
>>> Cc: Christian König 
>>> Cc: Nayan Deshmukh 
>>> Cc: Alex Deucher 
>>> Signed-off-by: Eric Anholt 
>> Well NAK. The problem here is that fence->parent is NULL which is most
>> likely caused by an issue somewhere else.
>>
>> We could easily work around that with an extra NULL check, but reverting
>> the patch would break GPU recovery again.
> My GPU recovery works with the revert and reliably doesn't work without
> it, so my idea of "break GPU recovery" is the opposite of yours.  Can
> you help figure out what in this change broke my driver?

The problem is here:

> - list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> - struct drm_sched_fence *fence = job->s_fence;
> -
> - if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> - goto already_signaled;

dma_fence_remove_callback() will fault if fence->parent is NULL. A 
simple "if (!fence->parent) continue;" should be enough to work around that.

But I'm not sure how exactly fence->parent became NULL in the first place.

Going to double check the code once more,
Christian.