Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen Bates" wrote:
> > >Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use 
> > it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, ie each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device's page table, and those physical addresses are un-translated
from the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.
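
To make the setup-time check concrete, here is a minimal sketch of what
such validation could look like; every name in it (struct gpu_device,
gpu_allowed_range_lookup(), the fields) is hypothetical and not taken
from any real driver:

#include <linux/types.h>

/* Hypothetical sketch -- all names are made up for illustration. */
struct gpu_device {
        u64 vram_base;
        u64 vram_size;
};

/* Hypothetical allow-list, built by the kernel driver at init time. */
bool gpu_allowed_range_lookup(struct gpu_device *gdev, u64 paddr);

/* Sanity check a physical address before it is written into a GPU
 * page table entry at setup time. */
static bool gpu_pte_paddr_is_valid(struct gpu_device *gdev, u64 paddr)
{
        /* Local VRAM is always fine. */
        if (paddr >= gdev->vram_base &&
            paddr < gdev->vram_base + gdev->vram_size)
                return true;

        /* System memory or a peer BAR: must have been vetted before. */
        return gpu_allowed_range_lookup(gdev, paddr);
}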

> 
> > >Also the IOMMU isolation do matter a lot to us. Think someone using 
> > > this
> > >peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each single GPU (in the configurations I am aware of) also has a
PCIe link to the CPU/main memory. So from that point of view they very
much behave like regular PCIe devices. It is just that each GPU in the
mesh can access the other GPUs' memory through the high bandwidth
interconnect.

I am not sure how much is public beyond that. I will ask NVidia to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> On 10.05.2018 at 16:20, Stephen Bates wrote:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are 
> > still no users of intel_svm_bind_mm() but I see the AMD version used in 
> > that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers for SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which 
> > DMA events require ATS translations and which do not. I am assuming the 
> > driver implements someway of indicating that and its not just a global ON 
> > or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to 
> > support ATS what would need to be added in the NVMe spec above and beyond 
> > what we have in PCI ATS to support efficient use of ATS (for example would 
> > we need a flag in the submission queue entries to indicate a particular 
> > IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this, a 50-line primer on GPUs:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPUs run programs very much like CPUs, except a program will have many
thousands of threads running concurrently. There is a hierarchy of groups
for a given program, ie threads are grouped together; the lowest hierarchy
level has a group size of <= 64 threads on most GPUs.

Those programs (called shaders for graphics programs, think OpenGL or
Vulkan, or compute kernels for GPGPU, think OpenCL or CUDA) are submitted
by userspace against a given address space. In the "old" days (a couple of
years back, when dinosaurs were still roaming the earth) this address
space was specific to the GPU, and each userspace program could create
multiple GPU address spaces. All memory operations done by the program
were against this address space. Hence every PCIe transaction is spawned
from a program + address space.

GPUs use a page table + window aperture (the window aperture is going
away, so you can focus on the page table) to translate a GPU virtual
address into a physical address. The physical address can point to GPU
local memory, to system memory, or to another PCIe device's memory (ie
some PCIe BAR).

So all PCIe transactions are spawned through this process of GPUVA to
GPUPA translation; the GPUPA is then handled by the GPU MMU unit, which
either spawns a PCIe transaction for a non-local GPUPA or accesses local
memory otherwise.
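
In pseudo-C the flow looks like the sketch below; every name in it is
made up, since real hardware implements this in its MMU unit:

#include <linux/types.h>

/* Conceptual pseudo-C of a GPU memory access; all names are made up. */
struct gpu_page_table;
u64  gpu_page_table_walk(struct gpu_page_table *pt, u64 gpuva);
bool gpupa_is_local_vram(u64 gpupa);
void access_local_memory(u64 gpupa);
void issue_pcie_transaction(u64 gpupa);

void gpu_access(struct gpu_page_table *pt, u64 gpuva)
{
        u64 gpupa = gpu_page_table_walk(pt, gpuva); /* GPUVA -> GPUPA */

        if (gpupa_is_local_vram(gpupa))
                access_local_memory(gpupa);    /* never leaves the GPU */
        else
                issue_pcie_transaction(gpupa); /* system memory or a
                                                * peer device BAR */
}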


So per se, the kernel driver does not configure which transactions use ATS
or peer to peer. The userspace program creates a GPU virtual address space
and binds objects into it. An object can be system memory or some other
PCIe device's memory, in which case we would do a peer to peer transaction.


So you won't find any such logic in the kernel. What you will find is code
for creating virtual address spaces and binding objects.


Above I talked about the old days; nowadays we want the GPU virtual
address space to be exactly the same as the CPU virtual address space of
the process which initiated the GPU program. This is where we use PASID
and ATS. Here userspace creates a special "GPU context" which says that
the GPU virtual address space will be the same as that of the program that
created the GPU context. A process ID is then allocated and the mm_struct
is bound to this process ID in the IOMMU driver. All programs executed on
the GPU then use that process ID to identify the address space against
which they are running.


In all of the above I did not talk about the DMA engines which sit on the
"side" of the GPU to copy memory around. GPUs have multiple DMA engines
with different capabilities; some of those DMA engines use the same GPU
address space as described above, others use GPUPA directly.


Hope this helps in understanding the big picture. I oversimplified things
and the devil is in the details.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 02:16:25PM +0000, Stephen Bates wrote:
> Hi Christian
> 
> > Why would a switch not identify that as a peer address? We use the PASID 
> >together with ATS to identify the address space which a transaction 
> >should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS 
> then it will issue a TR TLP to obtain a translated address from the IOMMU. 
> This TR TLP will be addressed to the RP and so regardless of ACS it is going 
> up to the Root Port. When it gets the response it gets the physical address 
> and can use that with the TA bit set for the p2pdma. In the case of ATS 
> support we also have more control over ACS as we can disable it just for TA 
> addresses (as per 7.7.7.7.2 of the spec).
> 
> > If I'm not completely mistaken when you disable ACS it is perfectly
> > possible that a bridge identifies a transaction as belonging to a peer
> > address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU 
> at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I 
> think we have to disable the IOMMU (something Mike suggested earlier). The 
> reason is that since ATS is not an option the EP must initiate the DMA using 
> the addresses passed down to it. If the IOMMU is on then this is an IOVA that 
> could (with some non-zero probability) point to an IO Memory address in the 
> same PCI domain. So if we disable ACS we are in trouble as we might MemWr to 
> the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. 
> Disabling the IOMMU removes the IOVA risk and ironically also resolves the 
> IOMMU grouping issues.
> 
> So I think if we want to support performant P2PDMA for devices that don't 
> have ATS (and no NVMe SSDs today support ATS) then we have to disable the 
> IOMMU. I know this is problematic for AMDs use case so perhaps we also need 
> to consider a mode for P2PDMA for devices that DO support ATS where we can 
> enable the IOMMU (but in this case EPs without ATS cannot participate as 
> P2PDMA DMA iniators).
> 
> Make sense?
> 

Note that on GPUs we would not rely on ATS for peer to peer. Some parts of
the GPU (the DMA engines) do not necessarily support ATS, yet those are
the parts most likely to be used for peer to peer.

However, there is a distinction in objectives here that I believe is lost.
We (aka GPU people, aka the good guys ;)) are not after peer to peer for
performance reasons, ie we do not mind having our transactions go up to
the root complex and back down to the destination. At least in the use
case I am working on this is fine.

The reason is that GPUs are giving up on PCIe (see all the specialized
links like NVLink that are popping up in the GPU space); for fast GPU
interconnect we have these new links. Yet for legacy and interoperability
we would like to do peer to peer with other devices like RDMA NICs, and
going through the root complex would be fine from a performance point of
view. Worst case, it is slower than the existing design where system
memory is used as a bounce buffer.

Also, IOMMU isolation does matter a lot to us. Think of someone using this
peer to peer path to gain control of a server in the cloud.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Jerome Glisse
On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
> 
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me 
> too it) or is this all done via user-space? Based on my grepping of the 
> kernel code I see zero EP drivers using in-kernel ATS functionality right 
> now...
> 

As it is tied to PASID, this is done through the IOMMU, so look for
callers of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPU land the
existing user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for this
discussion though. You do not need to do much to use PASID.
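
For a concrete feel, the core of the AMD binding boils down to a call
like the following (a simplified sketch of how amdkfd uses it; the
wrapper function here is mine, not the driver's, and error handling is
omitted):

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

/* Bind the current task's address space to @pasid on @pdev, so that
 * ATS translation requests carrying that PASID are resolved by the
 * IOMMU against the task's CPU page table. */
static int bind_process_to_gpu(struct pci_dev *pdev, int pasid)
{
        return amd_iommu_bind_pasid(pdev, pasid, current);
}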

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Jerome Glisse
On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen Bates wrote:
> Christian
> 
> >Interesting point, give me a moment to check that. That finally makes 
> >all the hardware I have standing around here valuable :)
> 
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs, ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG interface,
but not in normal operation).

GPUs are like CPUs: you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Inside that page table you can point a GPU virtual address at GPU
memory or at system memory. Those system memory entries can also be marked
as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes
through PASID & ATS (so accesses in that window do not go through the page
table but directly through PASID & ATS).

Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Jerome Glisse
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here; Alex properly explained the scheme. The
PCIe device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.

ATS is implemented by the IOMMU, not by the device (well, the device
implements the client side of it). Also, ATS is meaningless without
something like PASID as far as I know.
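
To make the scheme concrete, the exchange looks roughly like this
(simplified; PRI/page-fault handling is left out):

  1. The endpoint sends a Translation Request for a virtual address
     (tagged with a PASID), routed upstream to the root port / IOMMU.
  2. The IOMMU walks its tables and answers with a Translation
     Completion carrying the physical address and permissions.
  3. The endpoint caches the result in its Address Translation Cache
     and issues subsequent memory requests with the AT field set to
     "Translated", so they bypass IOMMU translation.
  4. When the mapping changes, the IOMMU sends an Invalidation Request
     and the endpoint flushes the stale entry from its cache.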

Cheers,
Jérôme


Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue

2018-04-20 Thread Jerome Glisse
On Fri, Apr 20, 2018 at 12:57:41PM -0700, Tim Chen wrote:
> On 04/04/2018 12:17 PM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=generic-write-protection-rfc
> > 
> > This is an RFC for LSF/MM discussions. It impacts the file subsystem,
> > the block subsystem and the mm subsystem. Hence it would benefit from
> > a cross sub-system discussion.
> > 
> > Patchset is not fully bake so take it with a graint of salt. I use it
> > to illustrate the fact that it is doable and now that i did it once i
> > believe i have a better and cleaner plan in my head on how to do this.
> > I intend to share and discuss it at LSF/MM (i still need to write it
> > down). That plan lead to quite different individual steps than this
> > patchset takes and his also easier to split up in more manageable
> > pieces.
> > 
> > I also want to apologize for the size and number of patches (and i am
> > not even sending them all).
> > 
> > --
> > The Why ?
> > 
> > I have two objectives: duplicate memory read only accross nodes and or
> > devices and work around PCIE atomic limitations. More on each of those
> > objective below. I also want to put forward that it can solve the page
> > wait list issue ie having each page with its own wait list and thus
> > avoiding long wait list traversale latency recently reported [1].
> > 
> > It does allow KSM for file back pages (truely generic KSM even between
> > both anonymous and file back page). I am not sure how useful this can
> > be, this was not an objective i did pursue, this is just a for free
> > feature (see below).
> > 
> > [1] https://groups.google.com/forum/#!topic/linux.kernel/Iit1P5BNyX8
> > 
> > --
> > Per page wait list, so long page_waitqueue() !
> > 
> > Not implemented in this RFC but below is the logic and pseudo code
> > at bottom of this email.
> > 
> > When there is a contention on struct page lock bit, the caller which
> > is trying to lock the page will add itself to a waitqueue. The issues
> > here is that multiple pages share the same wait queue and on large
> > system with a lot of ram this means we can quickly get to a long list
> > of waiters for differents pages (or for the same page) on the same
> > list [1].
> 
> Your approach seems useful if there are lots of locked pages sharing
> the same wait queue.  
> 
> That said, in the original workload from our customer with the long wait queue
> problem, there was a single super hot page getting migrated, and it
> is being accessed by all threads which caused the big log jam while they wait 
> for
> the migration to get completed.  
> With your approach, we will still likely end up with a long queue 
> in that workload even if we have per page wait queue.
> 
> Thanks.

OK, so I re-read the thread. I was writing this cover letter from memory
and had a bad recollection of your issue, sorry.

First, do you have a way to reproduce the issue? Something easy would
be nice :)

So what I am proposing for the per page wait queue would only marginally
help you (it might not even be measurable in your workload). It would
certainly make the code smaller and easier to understand, I believe.

Now that I have looked back at your issue, I think there are 2 things we
should do. First, keep the migrated page mapped read only; this would at
least avoid CPU read faults. In the trace you captured I wasn't able to
ascertain whether these were read or write faults.

The second idea I have is about NUMA: every time we NUMA-migrate a page we
could attach a temporary struct to the page (using page->mapping). So if
we scan that page again we can inspect information about the previous
migration and see whether we are over-migrating that page (ie bouncing it
all over). If so we can mark the page (maybe with a page flag, if we can
find one) to protect it from further migration. That temporary struct
would be removed after a while, ie autonuma would preallocate a bunch of
them, keep them on an LRU, and recycle the oldest when it needs a new one
to migrate another page.
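
A rough sketch of what that temporary struct could look like (purely
hypothetical, nothing like this exists in the kernel):

#include <linux/list.h>
#include <linux/mm_types.h>

/* Hypothetical per-page NUMA migration history, parked in
 * page->mapping while the page is being tracked. A preallocated pool
 * is kept on an LRU and the oldest entry is recycled when a new page
 * gets migrated. */
struct numa_migrate_hist {
        struct list_head lru;        /* global LRU of preallocated entries */
        struct page *page;           /* page this history belongs to */
        unsigned long last_migrate;  /* jiffies of the previous migration */
        int last_nid;                /* node the page last migrated from */
        unsigned int bounces;        /* migrations in the recent window */
};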


LSF/MM slots:

Michal, can I get 2 slots to talk about this? MM-only discussion: one to
talk about doing migration with the page mapped read only but write
protected while the migration is happening, the other to talk about
attaching an auto NUMA tracking struct to the page.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 10:21:37PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 04:58:20PM -0400, Jerome Glisse wrote:
> 
> > I need a struct to link part of device context with mm struct for a
> > process. Most of device context is link to the struct file of the
> > device file (ie process open has a file descriptor for the device
> > file).
> 
> Er...  You do realize that
>   fd = open(...)
>   mmap(fd, ...)
>   close(fd)
> is absolutely legitimate, right?  IOW, existence/stability/etc. of
> a file descriptor is absolutely *not* guaranteed - in fact, there might
> be not a single file descriptor referring to a given openen and mmaped
> file.

Yes, and that's fine: on close(fd) the device driver state is torn down,
and the struct I want to store is torn down and freed too.

> 
> > Device driver for GPU have some part of their process context tied to
> > the process mm (accessing process address space directly from the GPU).
> > However we can not store this context information in the struct file
> > private data because of clone (same struct file accross different mm).
> > 
> > So today driver have an hashtable in their global device structure to
> > lookup context information for a given mm. This is sub-optimal and
> > duplicate a lot of code among different drivers.
> 
> Umm...  Examples?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_mn.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_userptr.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c

RDMA folks have similar constructs too.

> 
> > Hence why i want something generic that allow a device driver to store
> > context structure that is specific to a mm. I thought that adding a
> > new array on the side of struct file array would be a good idea but it
> > has too many kludges.
> > 
> > So i will do something inside mmu_notifier and there will be no tie to
> > any fs aspect. I expect only a handful of driver to care about this and
> > for a given platform you won't see that many devices hence you won't
> > have that many pointer to deal with.
> 
> Let's step back for a second - lookups by _what_?  If you are associating
> somethin with a mapping, vm_area_struct would be a natural candidate for
> storing such data, wouldn't it?
> 
> What do you have and what do you want to find?

So you are in an ioctl against the device file: you have the struct file,
and the driver stores a pointer to some file context info in the struct
file private data, which itself has a pointer to some global device driver
structure, which itself has a pointer to the struct device.

Hence I have the struct mm (from current->mm) and the dev_t easily
available.

The context information is tied to the mm for the device and can only be
used against said mm. Even if the struct file of the device outlives the
original process, no one can use that struct with a process that does not
have the same mm. Moreover, that struct is freed when the mm is destroyed.

If a child shares the struct file but has a different mm and wants to use
the same feature, then a new structure is created with the same property,
ie it can only be used against this new mm.

The link with the struct file is not explicit, but you can only use
objects tied to that struct through ioctls against the struct file.

Hope this clarifies the use case.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 09:39:53PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:25:13PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 04:15:02PM -0400, Jerome Glisse wrote:
> > > On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> > > > > Well scratch that whole idea, i would need to add a new array to task
> > > > > struct which make it a lot less appealing. Hence a better solution is
> > > > > to instead have this as part of mm (well indirectly).
> > > > 
> > > > It shouldn't be too bad to add a struct radix_tree to the fdtable.
> > > > 
> > > > I'm sure we could just not support weird cases like sharing the fdtable
> > > > without sharing the mm.  Does anyone actually do that?
> > > 
> > > Well like you pointed out what i really want is a 1:1 structure linking
> > > a device struct an a mm_struct. Given that this need to be cleanup when
> > > mm goes away hence tying this to mmu_notifier sounds like a better idea.
> > > 
> > > I am thinking of adding a hashtable to mmu_notifier_mm using file id for
> > > hash as this should be a good hash value for common cases. I only expect
> > > few drivers to need that (GPU drivers, RDMA). Today GPU drivers do have
> > > a hashtable inside their driver and they has on the mm struct pointer,
> > > i believe hash mmu_notifier_mm using file id will be better.
> > 
> > file descriptors are small positive integers ...
> 
> ... except when there's a lot of them.  Or when something uses dup2() in
> interesting ways, but hey - we could "just not support" that, right?
> 
> > ideal for the radix tree.
> > If you need to find your data based on the struct file address, then by
> > all means a hashtable is the better data structure.
> 
> Perhaps it would be a good idea to describe whatever is being attempted?
> 
> FWIW, passing around descriptors is almost always a bloody bad idea.  There
> are very few things really associated with those and just about every time
> I'd seen internal APIs that work in terms of those "small positive numbers"
> they had been badly racy and required massive redesign to get something even
> remotely sane.

OK, I will use the struct device pointer as index, or something else (I
would like to use the PCI domain:bus:slot but I don't want this to be
PCIe-only), maybe dev_t ...

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 09:33:07PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:26:10PM -0400, Jerome Glisse wrote:
> 
> > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > add void * *private_data; to struct fdtable (also a default array to
> > struct files_struct). The callback would be part of struct file_operations.
> > and only call if it exist (os overhead is only for device driver that
> > care).
> 
> Hell, *NO*.  This is insane - you would need to maintain extra counts
> ("how many descriptors refer to this struct file... for this descriptor
> table").
> 
> Besides, _what_ private_data?  What would own and maintain it?  A specific
> driver?  What if more than one of them wants that thing?

I didn't have something complex in mind (ie lifetime tied to the struct
file and no refcounting changes). But anyway, I gave up on that idea and
will add what I need in mmu_notifier.

>  
> > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > should be all set here.
> 
> That looks like an extremely misguided kludge for hell knows what purpose,
> almost certainly architecturally insane.  What are you actually trying to
> achieve?

I need a struct to link part of the device context with the mm struct of a
process. Most of the device context is linked to the struct file of the
device file (ie the process has opened a file descriptor for the device
file).

Device drivers for GPUs have some part of their process context tied to
the process mm (accessing the process address space directly from the
GPU). However we can not store this context information in the struct file
private data because of clone (same struct file across different mms).

So today drivers have a hashtable in their global device structure to look
up the context information for a given mm. This is sub-optimal and
duplicates a lot of code among different drivers.
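
That duplicated pattern looks roughly like the sketch below (condensed
from what radeon_mn.c and amdgpu_mn.c do; "my_device" and "my_mn" are
placeholder names, and locking plus the allocation path are trimmed):

#include <linux/hashtable.h>
#include <linux/mmu_notifier.h>

struct my_device {
        DECLARE_HASHTABLE(mn_hash, 7);  /* keyed on the mm_struct pointer */
};

struct my_mn {
        struct hlist_node node;         /* entry in my_device.mn_hash */
        struct mm_struct *mm;           /* key */
        struct mmu_notifier mn;         /* registered against @mm */
};

static struct my_mn *my_mn_lookup(struct my_device *mdev,
                                  struct mm_struct *mm)
{
        struct my_mn *mmn;

        hash_for_each_possible(mdev->mn_hash, mmn, node, (unsigned long)mm)
                if (mmn->mm == mm)
                        return mmn;

        /* Not found: the caller allocates a new entry, calls
         * mmu_notifier_register() and hash_add()s it. */
        return NULL;
}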

Hence why I want something generic that allows a device driver to store a
context structure that is specific to an mm. I thought that adding a new
array on the side of the struct file array would be a good idea, but it
has too many kludges.

So I will do something inside mmu_notifier, and there will be no tie to
any fs aspect. I expect only a handful of drivers to care about this, and
for a given platform you won't see that many devices, hence you won't have
that many pointers to deal with.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 03:31:08PM -0400, Jerome Glisse wrote:
> > > > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > > > add void * *private_data; to struct fdtable (also a default array to
> > > > struct files_struct). The callback would be part of struct 
> > > > file_operations.
> > > > and only call if it exist (os overhead is only for device driver that
> > > > care).
> > > > 
> > > > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > > > should be all set here.
> > > > 
> > > > I will work on patches i was hoping this would not be too much work.
> > 
> > Well scratch that whole idea, i would need to add a new array to task
> > struct which make it a lot less appealing. Hence a better solution is
> > to instead have this as part of mm (well indirectly).
> 
> It shouldn't be too bad to add a struct radix_tree to the fdtable.
> 
> I'm sure we could just not support weird cases like sharing the fdtable
> without sharing the mm.  Does anyone actually do that?

Well, like you pointed out, what I really want is a 1:1 structure linking
a device struct and an mm_struct. Given that this needs to be cleaned up
when the mm goes away, tying this to mmu_notifier sounds like a better
idea.

I am thinking of adding a hashtable to mmu_notifier_mm using the file id
as hash, as this should be a good hash value for common cases. I only
expect a few drivers to need this (GPU drivers, RDMA). Today GPU drivers
do have a hashtable inside their driver and they hash on the mm struct
pointer; I believe hashing in mmu_notifier_mm using the file id will be
better.

Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 02:31:07PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-19 at 13:26 -0400, Jerome Glisse wrote:
> > On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> > > On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > > > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > > > > any people against having a callback everytime a struct file is 
> > > > > > added
> > > > > > to a task_struct and also having a secondary array so that special
> > > > > > file like device file can store something opaque per task_struct per
> > > > > > struct file.
> > > > > 
> > > > > Do you really want something per _thread_, and not per _mm_?
> > > > 
> > > > Well per mm would be fine but i do not see how to make that happen with
> > > > reasonable structure. So issue is that you can have multiple task with
> > > > same mm but different file descriptors (or am i wrong here ?) thus there
> > > > would be no easy way given a struct file to lookup the per mm struct.
> > > > 
> > > > So as a not perfect solution i see a new array in filedes which would
> > > > allow device driver to store a pointer to their per mm data structure.
> > > > To be fair usualy you will only have a single fd in a single task for
> > > > a given device.
> > > > 
> > > > If you see an easy way to get a per mm per inode pointer store somewhere
> > > > with easy lookup i am all ears :)
> > > > 
> > > 
> > > I may be misunderstanding, but to be clear: struct files don't get
> > > added to a thread, per-se.
> > > 
> > > When userland calls open() or similar, the struct file gets added to
> > > the files_struct. Those are generally shared with other threads within
> > > the same process. The files_struct can also be shared with other
> > > processes if you clone() with the right flags.
> > > 
> > > Doing something per-thread on every open may be rather difficult to do.
> > 
> > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > add void * *private_data; to struct fdtable (also a default array to
> > struct files_struct). The callback would be part of struct file_operations.
> > and only call if it exist (os overhead is only for device driver that
> > care).
> > 
> > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > should be all set here.
> > 
> > I will work on patches i was hoping this would not be too much work.
> > 
> 
> No, I think I misunderstood. I was thinking you wanted to iterate over
> all of the threads that might be associated with a struct file, and
> that's rather non-trivial.
> 
> A callback when you add a file to the files_struct seems like it would
> probably be OK (in principle).

Well, scratch that whole idea; I would need to add a new array to the task
struct, which makes it a lot less appealing. Hence a better solution is to
instead have this as part of the mm (well, indirectly).

Thanks folks for chiming in. I will discuss this in my mmu_notifier
session.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > > any people against having a callback everytime a struct file is added
> > > > to a task_struct and also having a secondary array so that special
> > > > file like device file can store something opaque per task_struct per
> > > > struct file.
> > > 
> > > Do you really want something per _thread_, and not per _mm_?
> > 
> > Well per mm would be fine but i do not see how to make that happen with
> > reasonable structure. So issue is that you can have multiple task with
> > same mm but different file descriptors (or am i wrong here ?) thus there
> > would be no easy way given a struct file to lookup the per mm struct.
> > 
> > So as a not perfect solution i see a new array in filedes which would
> > allow device driver to store a pointer to their per mm data structure.
> > To be fair usualy you will only have a single fd in a single task for
> > a given device.
> > 
> > If you see an easy way to get a per mm per inode pointer store somewhere
> > with easy lookup i am all ears :)
> > 
> 
> I may be misunderstanding, but to be clear: struct files don't get
> added to a thread, per-se.
> 
> When userland calls open() or similar, the struct file gets added to
> the files_struct. Those are generally shared with other threads within
> the same process. The files_struct can also be shared with other
> processes if you clone() with the right flags.
> 
> Doing something per-thread on every open may be rather difficult to do.

Basically I want a callback in __fd_install(), do_dup2() and dup_fd(), and
to add a void **private_data to struct fdtable (also a default array to
struct files_struct). The callback would be part of struct file_operations
and only called if it exists (so the overhead is only for device drivers
that care).

Did I miss something fundamental? copy_files() calls dup_fd(), so I should
be all set here.

I will work on patches; I was hoping this would not be too much work.
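
For clarity, here is a sketch of what the proposal amounts to (note that,
as the messages above show, I later abandoned this idea); the callback
name and exact placement are illustrative only:

/* Illustrative sketch only -- nothing like this was ever merged. */
struct fdtable {
        /* ... existing fields ... */
        void **private_data;    /* one opaque pointer per fd slot */
};

struct file_operations {
        /* ... existing fields ... */
        /* Called from __fd_install(), do_dup2() and dup_fd() whenever
         * the file is added to a descriptor table; skipped when NULL,
         * so only drivers that care pay any cost. */
        void (*fd_install)(struct file *file, struct files_struct *files,
                           unsigned int fd);
};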

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > Oh can i get one more small slot for fs ? I want to ask if they are
> > any people against having a callback everytime a struct file is added
> > to a task_struct and also having a secondary array so that special
> > file like device file can store something opaque per task_struct per
> > struct file.
> 
> Do you really want something per _thread_, and not per _mm_?

Well, per mm would be fine, but I do not see how to make that happen with
a reasonable structure. The issue is that you can have multiple tasks with
the same mm but different file descriptors (or am I wrong here?), thus
there would be no easy way, given a struct file, to look up the per-mm
struct.

So as a not-perfect solution I see a new array in the file descriptor
table which would allow a device driver to store a pointer to its per-mm
data structure. To be fair, usually you will only have a single fd in a
single task for a given device.

If you see an easy way to get a per mm, per inode pointer stored somewhere
with easy lookup, I am all ears :)

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 11:55:08AM +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 05:19:39PM -0400, Jerome Glisse wrote:
> > Just wanted to suggest to push HMM status down one slot in the
> > agenda to avoid having FS and MM first going into their own
> > room and then merging back for GUP and DAX, and re-splitting
> > after. More over HMM and NUMA talks will be good to have back
> > to back as they deal with same kind of thing mostly.
> 
> So while we are talking about schedule suggestions, we see that
> there's lots of empty slots in the FS track. We (xfs guys) were just
> chatting on #xfs about whether we'd have time to have a "XFS devel
> meeting" at some point during LSF/MM as we are rarely in the same
> place at the same time.
> 
> I'd like to propose that we compact the fs sessions so that we get a
> 3-slot session reserved for "Individual filesystem discussions" one
> afternoon. That way we've got time in the schedule for the all the
> ext4/btrfs/XFS/NFS/CIFS devs to get together with each other and
> talk about things of interest only to their own fileystems.
> 
> That means we all don't have to find time outside the schedule to do
> this, and think this wold be time very well spent for most fs people
> at the conf

Oh, can I get one more small slot for fs? I want to ask if there are any
people against having a callback every time a struct file is added to a
task_struct, and also having a secondary array so that special files like
device files can store something opaque per task_struct per struct file.

I will try to stitch together a patchset tomorrow for that. A lot of
device drivers would like to have this.

Cheers,
Jérôme


[LSF/MM] schedule suggestion

2018-04-18 Thread Jerome Glisse
Just wanted to suggest pushing the HMM status down one slot in the agenda
to avoid having FS and MM first going into their own rooms, then merging
back for GUP and DAX, and re-splitting after. Moreover, the HMM and NUMA
talks would be good to have back to back as they mostly deal with the same
kind of thing.

So on Monday afternoon, GUP in the first slot would be nice :)

Just a suggestion

https://docs.google.com/spreadsheets/d/15XFz_Zsklmle--L9CO4-ygCqmSHwBsFPjvjDpiL5qwM/edit#gid=0

Cheers,
Jérôme


Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue

2018-04-18 Thread Jerome Glisse
On Wed, Apr 18, 2018 at 04:13:37PM +0200, Jan Kara wrote:
> Hello,
> 
> so I finally got to this :)
> 
> On Wed 04-04-18 15:17:50, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 

[...]

> > --
> > The Why ?
> > 
> > I have two objectives: duplicate memory read only accross nodes and or
> > devices and work around PCIE atomic limitations. More on each of those
> > objective below. I also want to put forward that it can solve the page
> > wait list issue ie having each page with its own wait list and thus
> > avoiding long wait list traversale latency recently reported [1].
> > 
> > It does allow KSM for file back pages (truely generic KSM even between
> > both anonymous and file back page). I am not sure how useful this can
> > be, this was not an objective i did pursue, this is just a for free
> > feature (see below).
> 
> I know some people (Matthew Wilcox?) wanted to do something like KSM for
> file pages - not all virtualization schemes use overlayfs and e.g. if you
> use reflinks (essentially shared on-disk extents among files) for your
> container setup, you could save significant amounts of memory with the
> ability to share pages in page cache among files that are reflinked.

Yes, I believe there are still use cases where KSM with file backed pages
makes sense; I am just not familiar enough with those workloads to know
how big of a deal this is.

> > [1] https://groups.google.com/forum/#!topic/linux.kernel/Iit1P5BNyX8
> > 
> > --
> > Per page wait list, so long page_waitqueue() !
> > 
> > Not implemented in this RFC but below is the logic and pseudo code
> > at bottom of this email.
> > 
> > When there is a contention on struct page lock bit, the caller which
> > is trying to lock the page will add itself to a waitqueue. The issues
> > here is that multiple pages share the same wait queue and on large
> > system with a lot of ram this means we can quickly get to a long list
> > of waiters for differents pages (or for the same page) on the same
> > list [1].
> > 
> > The present patchset virtualy kills all places that need to access the
> > page->mapping field and only a handfull are left, namely for testing
> > page truncation and for vmscan. The former can be remove if we reuse
> > the PG_waiters flag for a new PG_truncate flag set on truncation then
> > we can virtualy kill all derefence of page->mapping (this patchset
> > proves it is doable). NOTE THIS DOES NOT MEAN THAT MAPPING is FREE TO
> > BE USE BY ANYONE TO STORE WHATEVER IN STRUCT PAGE. SORRY NO !
> 
> It is interesting that you can get rid of page->mapping uses in most
> places. For page reclaim (vmscan) you'll still need a way to get from a
> page to an address_space so that you can reclaim the page so you can hardly
> get rid of page->mapping completely but you're right that with such limited
> use that transition could be more complex / expensive.

The idea for vmscan is that you either have the regular mapping pointer
stored in page->mapping, or you have a pointer to a special struct which
has a function pointer to a reclaim/walker function (rmap_walk_ksm).
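
In the spirit of the existing PAGE_MAPPING_ANON / PAGE_MAPPING_MOVABLE
tricks, a sketch of such a discriminated pointer could look like this
(PAGE_MAPPING_RECLAIM and struct page_reclaim_ops are made-up names):

#include <linux/mm_types.h>
#include <linux/rmap.h>

#define PAGE_MAPPING_RECLAIM    0x4UL   /* made-up tag bit */

/* Special struct stashed in page->mapping instead of the usual
 * struct address_space pointer. */
struct page_reclaim_ops {
        /* rmap walker for this kind of page, eg rmap_walk_ksm() */
        void (*rmap_walk)(struct page *page, struct rmap_walk_control *rwc);
};

static inline bool page_has_reclaim_ops(struct page *page)
{
        return (unsigned long)page->mapping & PAGE_MAPPING_RECLAIM;
}

static inline struct page_reclaim_ops *page_reclaim_ops(struct page *page)
{
        return (void *)((unsigned long)page->mapping &
                        ~PAGE_MAPPING_RECLAIM);
}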

> What I wonder though is what is the cost of this (in the terms of code size
> and speed) - propagating the mapping down the stack costs something... Also
> in terms of maintainability, code readability suffers a bit.

I haven't checked that; I will. I was not so concerned because in the vast
majority of places there is already a struct address_space on the stack
frame (ie a local variable in the function being called), so moving it to
a function argument shouldn't impact that. However, as I expect this will
be merged over multiple kernel release cycles, the intermediary steps will
see an increase in stack size. The code size should only grow marginally,
I expect. I will provide numbers with my next posting after LSF/MM.


> This could be helped though. In some cases it seems we just use the mapping
> because it was easily available but could get away without it. In other
> case (e.g. lot of fs/buffer.c) we could make bh -> mapping transition easy
> by storing the mapping in the struct buffer_head - possibly it could
> replace b_bdev pointer as we could get to that from the mapping with a bit
> of magic and pointer chasing and accessing b_bdev is not very performance
> critical. OTOH such optimizations make a rather complex patches from mostly
> mechanical replacement so I can see why you didn't go that route.

I am willing to do the buffer_head change. I remember considering it, but
I don't remember why I didn't do it (I failed to take a note of that).


> Overall I think you'd need to make a good benchmarking comparison showing
> how much this helps some real workloads (your motivation) and also how
> other loads on lower end machines are affected.

Do you have any specific benchmark you would like to see? My list was:
  https://github.com/01org/lkp-tests
  

Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 09:38:43PM +0000, Stephen Bates wrote:
> > It seems people miss-understand HMM :( 
> 
> Hi Jerome
> 
> Your unhappy face emoticon made me sad so I went off to (re)read up
> on HMM. Along the way I came up with a couple of things.
> 
> While hmm.txt is really nice to read it makes no mention of
> DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when
> one might choose to use one over the other. Would it be possible to
> update hmm.txt to include some discussion on this? I understand
> that DEVICE_PUBLIC creates a mapping in the kernel's linear address
> space for the device memory and DEVICE_PRIVATE does not. However,
> like I said, I am not sure when you would use either one and the
> pros and cons of doing so. I actually ended up finding some useful
> information in memremap.h but I don't think it is fair to expect
> people to dig *that* deep to find this information ;-).

Yes, I need to document that some more in hmm.txt. PRIVATE is for devices
that have memory that does not fit regular memory expectations, ie is not
cachable, so PCIe device memory fits under that category. So if all you
need is struct page for such memory then this is a perfect fit. On top of
that you can use more HMM features, like using this memory transparently
inside a process address space.

PUBLIC is for memory that belongs to a device but can still be accessed by
the CPU in a cache coherent way (CAPI, CCIX, ...). Again, if you have such
memory and just want struct page you can use that, and again, if you want
to use it inside a process address space HMM provides more helpers to do
so.
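
In terms of the 4.1x-era API the choice boils down to which helper you
call; a rough sketch (error handling is omitted and "ops" stands for a
filled-in struct hmm_devmem_ops):

#include <linux/hmm.h>

static struct hmm_devmem *add_device_memory(struct device *dev,
                                            const struct hmm_devmem_ops *ops,
                                            struct resource *res,
                                            unsigned long size)
{
        if (res)
                /* Cache coherent memory (CAPI, CCIX, ...): DEVICE_PUBLIC;
                 * the resource is IORES_DESC_DEVICE_PUBLIC_MEMORY. */
                return hmm_devmem_add_resource(ops, dev, res);

        /* Non-coherent memory (eg behind a PCIe BAR): DEVICE_PRIVATE;
         * HMM finds a hole in the physical address space for it. */
        return hmm_devmem_add(ops, dev, size);
}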


> A quick grep shows no drivers using the HMM API in the upstream code
> today. Is this correct? Are there any examples of out of tree drivers
> that use HMM you can point me too? As a driver developer what
> resources exist to help me write a HMM aware driver?

I am about to send an RFC for nouveau; I am still working out some bugs. I
was hoping to be done today but I am still fighting with the hardware.
There are other drivers being worked on with HMM. I do not know exactly
when they will be made public (I expect in the coming months).

How you use HMM is under the control of the device driver, as well as how
you expose it to userspace; drivers use it how they want to use it. There
is no pattern or requirement imposed by HMM. All drivers being worked on
so far are GPU-like hardware, ie a big chunk of on-board memory (several
gigabytes) that they want to use inside a process address space in a
fashion transparent to the program and CPU.

Each has its own API exposed to userspace, and while there is a lot of
similarity among them, many of the userspace API details are hardware
specific. In the GPU world most of the driver is in userspace; applications
target high level APIs such as OpenGL, Vulkan, OpenCL or CUDA. Those APIs
then have a hardware specific userspace driver that talks to hardware
specific IOCTLs. So this is not like network or block devices.


> The (very nice) hmm.txt document is not references in the MAINTAINERS
> file? You might want to fix that when you have a moment.

I have a couple of small fixes/typo patches that I need to clean up and
send; I will fix MAINTAINERS as part of those.

Cheers,
Jérôme


Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 09:32:20PM +0000, Stephen Bates wrote:
> > your kernel provider needs to decide whether they favor device assignment 
> > or p2p
> 
> Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) 
> are such that we really only expect CONFIG_P2P_DMA to be enabled in specific 
> instances and in those instances the users have made a decision to favor P2P 
> over IOMMU isolation. Or they have setup their PCIe topology in a way that 
> gives them IOMMU isolation where they want it and P2P where they want it.
> 
> 

Note that there are use cases for P2P where IOMMU isolation matters and
the traffic through the root complex isn't seen as an issue. For instance,
for GPUs the idea is that you want to allow the RDMA device to directly
read or write GPU memory, to avoid having to migrate it to system memory.
This isn't so much for performance as for ease of use.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people miss-understand HMM :( you do not have to use all of
> > its features. If all you care about is having struct page then just
> > use that for instance in your case only use those following 3 functions:
> > 
> > hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove()
> > for cleanup.
> 
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm
> interface and disabling all the features, I don't see the point. We've also
> cleaned up the devm_memremap_pages() interface to be more usefully generic
> in such a way that I'd hope HMM starts using it too and gets rid of the code
> duplication.
> 

The first HMM variant finds a hole and does not require a resource as
input parameter. Besides that, internally, devm_memremap_pages() does not
do the right thing for PCIe device memory: last time I checked it always
creates a linear mapping of the range, ie HMM calls add_pages() while
devm_memremap_pages() calls arch_add_memory().

When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages()
to match my needs. I am more than happy to modify devm_memremap_pages() to
also handle HMM's needs.

Note that the intention of HMM is to be a middle layer between low level
infrastructure and device drivers. The idea is that such an impedance
layer should make it easier down the road to change how things are handled
down below without having to touch many device drivers.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some
> > DMA ops do that for example, or some devices might ...
> 
> Although we could probably work around it with some pain, we rely on
> page_address() and virt_to_phys(), etc to work on these pages. So on x86,
> yes, it makes it into the linear mapping.

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
{
        struct hmm_devmem *devmem;
        unsigned long ppfn;

        /* Sanity test, maybe BUG_ON() */
        if (!is_device_private_page(page))
                return -1UL;

        devmem = page->pgmap->data;
        /* Offset within the device memory (page_to_pfn() != physical
         * pfn here), rebased onto the device's real physical base. */
        ppfn = page_to_pfn(page) - devmem->pfn_first;
        return ppfn + devmem->device_phys_base_pfn; /* field TBD, see below */
}

Note that the last field does not exist in today's HMM because I did not
need such a helper so far, but it can be added.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well this again a new user of struct page for device memory just for
> > one usecase. I wanted HMM to be more versatile so that it could be use
> > for this kind of thing too. I guess the message didn't go through. I
> > will take some cycles tomorrow to look into this patchset to ascertain
> > how struct page is use in this context.
> 
> We looked at it but didn't see how any of it was applicable to our needs.
> 

It seems people misunderstand HMM :( You do not have to use all of its
features. If all you care about is having struct page then just use that;
for instance, in your case, only use the following 3 functions:

hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove() for
cleanup.

You can set the fault callback to an empty stub that always returns
VM_FAULT_SIGBUS, or add a patch to allow a NULL callback inside HMM.
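
A minimal sketch of such a stub against the 4.1x-era API (the layout of
struct hmm_devmem_ops may differ slightly between kernel versions):

#include <linux/hmm.h>

/* Stub callbacks for a user that only wants struct page and never
 * expects CPU faults on this memory. */
static int p2p_devmem_fault(struct hmm_devmem *devmem,
                            struct vm_area_struct *vma,
                            unsigned long addr,
                            const struct page *page,
                            unsigned int flags,
                            pmd_t *pmdp)
{
        return VM_FAULT_SIGBUS;
}

static void p2p_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
        /* Nothing to do if you do not care about the free callback. */
}

static const struct hmm_devmem_ops p2p_devmem_ops = {
        .fault = p2p_devmem_fault,
        .free  = p2p_devmem_free,
};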

You don't have to use the free callback if you don't care, and if there is
something that doesn't quite match what you want, HMM can always be
adjusted to address it.

The intention of HMM is to be useful for all device memory that wishes to
have struct page, for various reasons.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that acccording to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd ahve to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI bars typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cachable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver can you look into this ? You sais the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch 
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages to with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problem with things like vmmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for
"holes" in the address space and using those for hotplug (ie
page_to_pfn() != physical pfn of the memory). This is OK for HMM because
the memory is never mapped by the CPU, and we can find the physical pfn
with a little bit of math (page_to_pfn() - page->pgmap->res->start +
page->pgmap->dev->physical_base_address).

To avoid anything going bad I actually do not populate the kernel linear
mapping for the range, hence there is definitely no CPU access at all
through those struct pages. The CPU can still access the PCIe BAR through
the usual MMIO map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory, just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I will
take some cycles tomorrow to look into this patchset and ascertain how
struct page is used in this context.

Note that I also want peer to peer for HMM users, but with ACS and using
the IOMMU, ie having to populate the IOMMU page table of one device to
point at the BAR of another device. I need to test on how many platforms
this works; hardware engineers are unable/unwilling to commit on whether
it works or not.


> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the
previous one.

Cheers,
Jérôme


Re: [Lsf-pc] [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-02-01 Thread Jerome Glisse
On Thu, Feb 01, 2018 at 09:00:13AM -0700, Jens Axboe wrote:
> On 2/1/18 8:57 AM, Jerome Glisse wrote:
> > On Thu, Feb 01, 2018 at 08:34:58AM -0700, Jens Axboe wrote:
> >> On 1/31/18 11:13 AM, Jerome Glisse wrote:
> >>> That's one solution, another one is to have struct bio_vec store
> >>> buffer_head pointer and not page pointer, from buffer_head you can
> >>> find struct page and using buffer_head and struct page pointer you
> >>> can walk the KSM rmap_item chain to find back the mapping. This
> >>> would be needed on I/O error for pending writeback of a newly write
> >>> protected page, so one can argue that the overhead of the chain lookup
> >>> to find back the mapping against which to report IO error, is an
> >>> acceptable cost.
> >>
> >> Ehm nope. bio_vec is a generic container for pages, requiring
> >> buffer_heads to be able to do IO would be insanity.
> > 
> > The extra pointer dereference would be killing performance ?
> 
> No, I'm saying that requiring a buffer_head to be able to do IO
> is insanity. That's how things used to be in the pre-2001 days.

Oh ok, I didn't think it would be a problem; iirc it seemed to me that
nobh filesystems were allocating a buffer_head just to do I/O, but my
memory is probably confused. Well, I can use the one bit flag idea then:
almost the same semantic patch, but if the flag is set (ie the page is
write protected) then to get the real page address you have to do an extra
memory dereference. So it would add an extra test for the common existing
case and an extra dereference for the write protect case. No need for
buffer_head.
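A minimal sketch of that flag idea, assuming we can steal bit 0 of the
stored pointer (struct page pointers are word aligned, so bit 0 is free);
wp_entry_to_page() is a made-up helper for the extra dereference:

#define BV_PAGE_WP	0x1UL	/* page behind this entry is write protected */

static struct page *bvec_page(const struct bio_vec *bvec)
{
	unsigned long v = (unsigned long)bvec->bv_page;

	if (likely(!(v & BV_PAGE_WP)))
		return (struct page *)v;	/* common case: one test, no extra lookup */
	/* write protected case: one extra dereference to the real page */
	return wp_entry_to_page((void *)(v & ~BV_PAGE_WP));
}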

Thanks for pointing out this buffer_head thing :)

Jérôme


Re: [Lsf-pc] [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-02-01 Thread Jerome Glisse
On Thu, Feb 01, 2018 at 08:34:58AM -0700, Jens Axboe wrote:
> On 1/31/18 11:13 AM, Jerome Glisse wrote:
> > That's one solution, another one is to have struct bio_vec store
> > buffer_head pointer and not page pointer, from buffer_head you can
> > find struct page and using buffer_head and struct page pointer you
> > can walk the KSM rmap_item chain to find back the mapping. This
> > would be needed on I/O error for pending writeback of a newly write
> > protected page, so one can argue that the overhead of the chain lookup
> > to find back the mapping against which to report IO error, is an
> > acceptable cost.
> 
> Ehm nope. bio_vec is a generic container for pages, requiring
> buffer_heads to be able to do IO would be insanity.

The extra pointer dereference would be killing performance ? Note that
I am not saying have one vec entry per buffer_head, but keep things as
they are and run the following semantic patch:

@@
struct bio_vec *bvec;
expression E;
@@
-bvec->bv_page = E;
+bvec_set_page(bvec, E);

@@
struct bio_vec *bvec;
@@
-bvec->bv_page
+bvec_get_page(bvec)

Then inside struct bio_vec:
s/struct page *bv_page;/struct buffer_head *bv_bh;/

Finally add:
struct page *bvec_get_page(const struct bio_vec *bvec)
{
	return bvec->bv_bh->b_page;
}

void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
	bvec->bv_bh = page_buffers(page);	/* first buffer_head of the page */
}

Well, you get the idea. The point is that it just adds one more pointer
dereference, so one more memory lookup. But if it is an issue there are
other ways to achieve what I want. For instance I can have a flag in the
stored address (1 bit) and make the extra dereference needed only for
write protected pages. Or the other solution in the previous email, or
something I haven't thought of yet :)

Like I said, I don't think I will change the block subsystem; for
block I would only need changes if I ever want to allow write
protection to happen before pending writeback completes. Which
as of now feels to me like a micro-optimization that I might never
need.

In any case I am happy to discuss my ideas and try to find one
that people like :)

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-02-01 Thread Jerome Glisse
On Thu, Feb 01, 2018 at 03:27:52PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jan 31, 2018 at 12:42:45PM -0500, Jerome Glisse wrote:
> > The overall idea i have is that in any place in the kernel (except
> > memory reclaim but that's ok) we can either get mapping or buffer_head
> > information without relying on struct page and if we have either one
> > and a struct page then we can find the other one.
> 
> Why is it okay for reclaim?
> 
> And what about physical memory scanners that doesn't have any side information
> about the page they step onto?

Reclaim is only interested in unmapping and reclaiming; KSM already
provides a special function for unmapping (rmap walk), so it is just about
extending that for file backed pages.

For physical memory scanners it depends on what their individual
objectives are. I have not reviewed them, but had the feeling that I could
add special KSM helpers to achieve the aims of each one of them.

The mapping information is not lost, it would just not be easily
accessible for those write protected pages.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 05:55:58PM +, Al Viro wrote:
> On Wed, Jan 31, 2018 at 12:42:45PM -0500, Jerome Glisse wrote:
> 
> > For block devices the idea is to use struct page and buffer_head
> > (first one of a page) as a key to find mapping (struct address_space)
> > back.
> 
> Details, please...

Note that I am not talking about block device pages (I am excluding
those). So just regular filesystem pages (ext*, xfs, btrfs, ...).

So in the block device context, AFAIK the only time you need the mapping
is if there is some I/O error. Given that I am doing this with the intent
to write protect the page, one can argue that I can wait for all writeback
to complete before proceeding. At that time, it does not matter to the
block device if page->mapping is no longer an address_space, because the
block device code is done with the page and has forgotten about it.

That's one solution. Another one is to have struct bio_vec store a
buffer_head pointer and not a page pointer; from the buffer_head you can
find the struct page, and using the buffer_head and struct page pointers
you can walk the KSM rmap_item chain to find back the mapping. This
would be needed on I/O error for pending writeback of a newly write
protected page, so one can argue that the overhead of the chain lookup,
to find back the mapping against which to report the I/O error, is an
acceptable cost.

Another solution is to override the writeback end callback with a
special one capable of finding the mapping from the struct page and bio
pointer. This would not need any change to block device code. It would
have the same overhead, though, as solution 2 above.

My intention was to stick to the first solution (wait for writeback and
make no modification to block device structs or functions), then later,
if it makes sense, add support to write protect a page before writeback
is done.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 07:09:48PM +0200, Igor Stoppa wrote:
> On 30/01/18 02:43, Jerome Glisse wrote:
> 
> [...]
> 
> > Maybe we can kill page->mapping altogether as a result of this. However 
> > this is
> > not my motivation at this time.
> 
> We had a discussion some time ago
> 
> http://www.openwall.com/lists/kernel-hardening/2017/07/07/7
> 
> where you advised to use it for tracking pmalloc pages vs area, which
> generated this patch:
> 
> http://www.openwall.com/lists/kernel-hardening/2018/01/24/7
> 
> Could you please comment what wold happen to the shortcut from struct
> page to vm_struct that this patch is now introducing?

Sadly struct page fields mean different things depending on the context
in which the page is used. This is confusing, I know. So when I say kill
page->mapping I am not saying shrink the struct page and remove that
field; I am saying maybe we can kill the current users of page->mapping
for regular process pages (ie pages that are in some mmap() area of a
process).

Other uses of that field in different contexts, like yours, are not
affected by this change and can ignore it altogether.

Hope this clarifies it :)

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 04:56:46PM +, Al Viro wrote:
> On Mon, Jan 29, 2018 at 07:43:48PM -0500, Jerome Glisse wrote:
> > I started a patchset about $TOPIC a while ago; right now I am working
> > on other things but I hope to have an RFC for $TOPIC before LSF/MM and
> > thus would like a slot during the common track to talk about it, as it
> > impacts FS, BLOCK and MM (I am assuming there will be a common track).
> > 
> > The idea is that the mapping (struct address_space) is available in
> > virtually all the places where it is needed and that there should be
> > no reason to depend only on the struct page->mapping field. My patchset
> > basically adds the mapping to a bunch of vfs callbacks (struct
> > address_space_operations) where it is missing, changing call sites.
> > Then I do an individual patch per filesystem to leverage the new
> > argument instead of struct page.
> 
> Oh?  What about the places like fs/coda?  Or block devices, for that matter...
> You can't count upon file->f_mapping->host == file_inode(file).

What matters is that the place that calls an address_space_operations
callback already has mapping == page->mapping; in many places this is
obvious. For instance the page has just been looked up using the mapping,
and thus you must have mapping == page->mapping. But I believe this holds
in all places. There are a few dark corners (fuse, splice, ...). Truncate
also factors into all this, as page->mapping is used to determine if a
page has been truncated, but it should not be an issue.

So I am not counting on file->f_mapping->host == file_inode(file), but I
might count in _some_ places on vma->file->f_mapping == page->mapping for
any non private page inside that vma. AFAICT this holds for coda and
should hold elsewhere too.

For block devices the idea is to use struct page and buffer_head (the
first one of a page) as a key to find the mapping (struct address_space)
back.

The overall idea I have is that in any place in the kernel (except memory
reclaim, but that's ok) we can either get the mapping or buffer_head
information without relying on struct page, and if we have either one and
a struct page then we can find the other one.

Like I said, I am not done with a patchset for that yet, so maybe I am too
optimistic. I have another patchset I need to finish first before I go
back to this. I hope to have an RFC sometime in February or March, and
maybe by then I will have found a roadblock; I am crossing my fingers
until then :)

If it turns out that it is not doable I will comment on this thread and
we can kill that off from the agenda.

Cheers,
Jérôme


[LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-29 Thread Jerome Glisse
I started a patchset about $TOPIC a while ago; right now I am working on
other things, but I hope to have an RFC for $TOPIC before LSF/MM and thus
would like a slot during the common track to talk about it, as it impacts
FS, BLOCK and MM (I am assuming there will be a common track).

The idea is that the mapping (struct address_space) is available in
virtually all the places where it is needed and that there should be no
reason to depend only on the struct page->mapping field. My patchset
basically adds the mapping to a bunch of vfs callbacks (struct
address_space_operations) where it is missing, changing call sites. Then I
do an individual patch per filesystem to leverage the new argument instead
of struct page.
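For illustration, the kind of change I have in mind for one callback; this
is a sketch, the exact prototypes in the RFC may differ:

/* Today the callback has to trust page->mapping: */
int (*readpage)(struct file *, struct page *);

/* Sketch: pass the mapping explicitly so the callback never needs
 * to dereference page->mapping: */
int (*readpage)(struct file *, struct address_space *, struct page *);

/* The call site already knows which mapping it looked the page up in: */
mapping->a_ops->readpage(file, mapping, page);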

I am doing this for a generic page write protection mechanism which
generalizes KSM to file backed pages. There are a couple of other aspects,
like struct page->index and struct page->private, which are addressed in a
similar way. The block layer is mostly affected because on a block device
error it needs page->mapping to report the I/O error.

Maybe we can kill page->mapping altogether as a result of this. However
this is not my motivation at this time.

Sorry for the absence of a patchset at this time, but I wanted to submit
the subject before the LSF/MM deadline.

Cheers,
Jérôme


Re: [PATCH 04/12] pci-p2p: Clear ACS P2P flags for all client devices

2018-01-04 Thread Jerome Glisse
On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> On Thu, 4 Jan 2018 17:00:47 -0700
> Logan Gunthorpe  wrote:
> 
> > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > on the isolation of that downstream port and I suspect also any peers
> > > within the same PCI slot of that port and their downstream devices.  The
> > > entire sub-hierarchy grouping needs to be re-evaluated.  This
> > > potentially affects running devices that depend on that isolation, so
> > > I'm not sure how that happens dynamically.  A boot option might be
> > > easier.  Thanks,  
> > 
> > I don't see how this is the case in current kernel code. It appears to 
> > only enable ACS globally if the IOMMU requests it.
> 
> IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> both request ACS be enabled if an IOMMU is present, so I'm not sure
> what you're getting at here.  Also, in reply to your other email, if
> the IOMMU is enabled, every device handled by the IOMMU is a member of
> an IOMMU group, see struct device.iommu_group.  There's an
> iommu_group_get() accessor to get a reference to it.
>  
> > I also don't see how turning off ACS isolation for a specific device is 
> > going to hurt anything. The IOMMU should still be able to keep going on 
> > unaware that anything has changed. The only worry is that a security 
> > hole may now be created if a user was relying on the isolation between 
> > two devices that are in different VMs or something. However, if a user 
> > was relying on this, they probably shouldn't have turned on P2P in the 
> > first place.
> 
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices.  By poking this hole, the
> IOMMU group is invalid.  We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices.  Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
>  
> > We started with a fairly unintelligent choice to simply disable ACS on 
> > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like 
> > a good idea going forward. Instead, we now selectively disable the ACS 
> > bit only on the downstream ports that are involved in P2P transactions. 
> > This seems like the safest choice and still allows people to (carefully) 
> > use P2P adjacent to other devices that need to be isolated.
> 
> I don't see that the code is doing much checking that adjacent devices
> are also affected by the p2p change and of course the IOMMU group is
> entirely invalid once the p2p holes start getting poked.
> 
> > I don't think anyone wants another boot option that must be set in order 
> > to use this functionality (and only some hardware would require this). 
> > That's just a huge pain for users.
> 
> No, but nor do we need IOMMU groups that no longer represent what
> they're intended to describe or runtime, unchecked routing changes
> through the topology for devices that might already be using
> conflicting IOVA ranges.  Maybe soft hotplugs are another possibility,
> designate a sub-hierarchy to be removed and re-scanned with ACS
> disabled.  Otherwise it seems like disabling and re-enabling ACS needs
> to also handle merging and splitting groups dynamically.  Thanks,
> 

Dumb question: can we use the PCI bar address of one device in the
IOMMU page table of another device, ie like we would DMA map a
regular system page ?

It would be much better in my view to follow such a path if that
is at all possible from a hardware point of view (I am not sure where
to dig in the specification to answer my question above).
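To make the question concrete, with the existing IOMMU API it would look
something like the sketch below; whether the hardware actually routes such
an address correctly is exactly the open question:

/*
 * Sketch only: map the BAR (physical) address of device B into the
 * IOMMU domain of device A, as if it were a regular system page.
 * deva/devb are the two struct pci_dev; iova is chosen by the caller.
 */
struct iommu_domain *domain = iommu_get_domain_for_dev(&deva->dev);
phys_addr_t bar = pci_resource_start(devb, 0) + offset;
int ret;

ret = iommu_map(domain, iova, bar, PAGE_SIZE,
		IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO);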

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Jerome Glisse
On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.

Note that in my case it is even worse. The pfn of the page does not
correspond to anything, so it needs to go through a special function
to find out if a page can be mapped for another device and to provide a
valid pfn at which the page can be accessed by the other device.

Basically the PCIE bar is like a window into the device memory that is
dynamically remapped to specific pages of the device memory. Not all
device memory can be exposed through the PCIE bar because of PCIE issues.
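So something like the hypothetical helper below is needed before peer
access is possible at all (all names here are invented for illustration):

/*
 * Sketch only: the BAR is a small movable window into device memory,
 * so the pfn a peer must use has to be looked up, and may not exist
 * if the page is not currently mapped through the window.
 */
static int dev_page_to_bar_pfn(struct page *page, unsigned long *pfn)
{
	struct my_dev *mdev = page_to_my_dev(page);		/* hypothetical */
	unsigned long slot = find_bar_window_slot(mdev, page);	/* hypothetical */

	if (slot == INVALID_SLOT)
		return -ENOENT;	/* not currently exposed through the BAR */
	*pfn = (mdev->bar_base >> PAGE_SHIFT) + slot;
	return 0;
}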

> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

Obviously I would like to join this discussion :)

Cheers,
Jérôme


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-15 Thread Jerome Glisse
On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote:
> On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> [... snip: page handling ...]
> 
> > > So won't it be easier to leave the pagecache page where it is and *copy* 
> > > it
> > > to the device? Can the device notify us *before* it is going to modify a
> > > page, not just after it has modified it? Possibly if we just give it the
> > > page read-only and it will have to ask CPU to get write permission? If 
> > > yes,
> > > then I belive this could work and even fs support should be doable.
> > 
> > Well yes and no. Device obey the same rule as CPU so if a file back page is
> > map read only in the process it must first do a write fault which will call
> > in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> > there is no way to be notify by hardware on every write. First the hardware
> > do not have the capability. Second we are talking thousand (10 000 is upper
> > range in today device) of concurrent thread, each can possibly write to page
> > under consideration.
> 
> Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> notification which apparently it is. OK.
> 
> > We really want the device page to behave just like regular page. Most fs 
> > code
> > path never map file content, it only happens during read/write and i believe
> > this can be handled either by migrating back or by using bounce page. I want
> > to provide the choice between the two solutions as one will be better for 
> > some
> > workload and the other for different workload.
> 
> I agree with keeping page used by the device behaving as similar as
> possible as any other page. I'm just exploring different possibilities how
> to make that happen. E.g. the scheme I was aiming at is:
> 
> When you want page A to be used by the device, you set up page A' in the
> device but make sure any access to it will fault.
> 
> When the device wants to access A', it notifies the CPU, that writeprotects
> all mappings of A, copy A to A' and map A' read-only for the device.
> 
> When the device wants to write to A', it notifies CPU, that will clear all
> mappings of A and mark A as not-uptodate & dirty. When the CPU will then
> want to access the data in A again - we need to catch ->readpage,
> ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> the device, copy data to A, mark A as uptodate & dirty, and off we go.
> 
> When we want to write to the page on CPU - we get either wp fault if it was
> via mmap, or we have to catch that in places using kmap() - we just remove
> access to A' from the device.
> 
> This scheme makes the device mapping functionality transparent to the
> filesystem (you actually don't need to hook directly into ->readpage etc.
> handlers, you can just have wrappers around them for this functionality)
> and fairly straightforward... It is so transparent that even direct IO works
> with this since the page cache invalidation pass we do before actually doing
> the direct IO will make sure to pull all the pages from the device and write
> them to disk if needed. What do you think?

This is doable, but I think it will require the same amount of changes as
what I had in mind (excluding the block bounce code), with one drawback:
doing it that way we can not free page A.

On some workloads this probably does not hurt much, but on a workload
where you read a big dataset from disk and then use it only on the GPU for
a long period of time (minutes/hours), you will waste GBs of system memory.

Right now I am working on some other patchset; I intend to take a stab at
this in the January/February time frame, before the summit, so I can post
an RFC and have a clear picture of every code path that needs modification.
I expect this will provide a better frame for discussion.

I assume I will have to change ->readpage, ->readpages, ->writepage and
->writepages, but I think the only places I really need to change are
do_generic_file_read() and generic_perform_write() (or iov_iter_copy_*).
Of course this only applies to filesystems that use those generic helpers.
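As a sketch of what the read side change could look like (the device_*
helpers and the is_device_unaddressable() test are hypothetical):

/* Inside the do_generic_file_read() loop, before copying to user space. */
if (is_device_unaddressable(page)) {		/* hypothetical test */
	struct page *bounce = alloc_page(GFP_KERNEL);

	if (!bounce)
		return -ENOMEM;			/* error handling elided */
	/* ask the device to DMA the current page content into the bounce */
	device_copy_from_page(page, bounce);	/* hypothetical */
	page = bounce;	/* copy_to_user reads the bounce page instead */
}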

I also probably will change ->mmap, or rather the helper it uses to set
the pte, depending on what looks better.

Note that I don't think wrapping is an easy task. I would need to replace
page A's mapping (struct page.mapping) to point to a wrapping
address_space, but there are enough places in the kernel that directly
dereference that and expect to hit the right (real) address_space. I would
need to replace all dereferences of page->mapping with a helper function
and possibly would need to change some of the call site logic accordingly.
This might prove a bigger change than just having to use a bounce in
do_generic_file_read() and generic_perform_write().

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-14 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 03:23:13PM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > > > From kernel point of view such memory is almost like any
> > > > > > > > other, it has a struct page and most of the mm code is non
> > > > > > > > the wiser, nor need to be about it. CPU access trigger a
> > > > > > > > migration back to regular CPU accessible page.
> > > > > > > 
> > > > > > > That sounds ... complex. Page migration on page cache access
> > > > > > > inside the filesytem IO path locking during read()/write()
> > > > > > > sounds like a great way to cause deadlocks
> > > > > > 
> > > > > > There are few restriction on device page, no one can do GUP on
> > > > > > them and thus no one can pin them. Hence they can always be
> > > > > > migrated back. Yes each fs need modification, most of it (if
> > > > > > not all) is isolated in common filemap helpers.
> > > > > 
> > > > > Sure, but you haven't answered my question: how do you propose we
> > > > > address the issue of placing all the mm locks required for migration
> > > > > under the filesystem IO path locks?
> > > > 
> > > > Two different plans (which are non exclusive of each other). First
> > > > is to use workqueue and have read/write wait on the workqueue to be
> > > > done migrating the page back.
> > > 
> > > Pushing something to a workqueue and then waiting on the workqueue
> > > to complete the work doesn't change lock ordering problems - it
> > > just hides them away and makes them harder to debug.
> > 
> > Migration doesn't need many lock below is a list and i don't see any
> > lock issue in respect to ->read or ->write.
> > 
> >  lock_page(page);
> >  spin_lock_irq(&mapping->tree_lock);
> >  lock_buffer(bh); // if page has buffer_head
> >  i_mmap_lock_read(mapping);
> >  vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > // page table lock for each entry
> >  }
> 
> We can't take the page or mapping tree locks that while we hold
> various filesystem locks.
> 
> e.g. The IO path lock order is, in places:
> 
> inode->i_rwsem
>   get page from page cache
>   lock_page(page)
>   inode->allocation lock
> zero page data
> 
> Filesystems are allowed to do this, because the IO path has
> guaranteed them access to the page cache data on the page that is
> locked. Your ZONE_DEVICE proposal breaks this guarantee - we might
> have a locked page, but we don't have access to it's data.
> 
> Further, in various filesystems once the allocation lock is taken
> (e.g. the i_lock in XFS) we're not allowed to lock pages or the
> mapping tree as that leads to deadlocks with truncate, hole punch,
> etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that
> requires migration before the zeroing can occur, we can't perform
> migration here.
> 
> Why are we even considering migration in situations where we already
> hold the ZONE_DEVICE page locked, hold other filesystem locks inside
> the page lock, and have an open dirty filesystem transaction as well?
> 
> Even if migration si possible and succeeds, the struct page in the
> mapping tree for the file offset we are operating on is going to be
> different after migration. That implies we need to completely
> restart the operation. But given that we've already made changes,
> backing out at this point is ...  complex and may not even be
> possible.

So I skimmed through the xfs code and I still think this is doable. So in
the above sequence:

  inode->i_rwsem
  page = find_get_page();
  if (device_unaddressable(page)) {
     page = migratepage();
  }
  ...

Now there are things like filemap_write_and_wait...() but this can be

Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > From kernel point of view such memory is almost like any other, it
> > > > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > > > accessible page.
> > > > > 
> > > > > That sounds ... complex. Page migration on page cache access inside
> > > > > the filesytem IO path locking during read()/write() sounds like
> > > > > a great way to cause deadlocks
> > > > 
> > > > There are few restriction on device page, no one can do GUP on them and
> > > > thus no one can pin them. Hence they can always be migrated back. Yes
> > > > each fs need modification, most of it (if not all) is isolated in common
> > > > filemap helpers.
> > > 
> > > Sure, but you haven't answered my question: how do you propose we
> > > address the issue of placing all the mm locks required for migration
> > > under the filesystem IO path locks?
> > 
> > Two different plans (which are non exclusive of each other). First is to use
> > workqueue and have read/write wait on the workqueue to be done migrating the
> > page back.
> 
> Pushing something to a workqueue and then waiting on the workqueue
> to complete the work doesn't change lock ordering problems - it
> just hides them away and makes them harder to debug.

Migration doesn't need many locks; below is a list and I don't see any
lock issue with respect to ->read or ->write.

 lock_page(page);
 spin_lock_irq(&mapping->tree_lock);
 lock_buffer(bh); // if page has buffer_head
 i_mmap_lock_read(mapping);
 vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
// page table lock for each entry
 }

I don't think I missed any, and thus I don't see any real issues here.
Care to point to the lock you think is going to be problematic ?


> > Second solution is to use a bounce page during I/O so that there is no need
> > for migration.
> 
> Which means the page in the device is left with out-of-date
> contents, right?
>
> If so, how do you prevent data corruption/loss when the device
> has modified the page out of sight of the CPU and the bounce page
> doesn't contain those modifications? Or if the dirty device page is
> written back directly without containing the changes made in the
> bounce page?

There is no issue here. If a bounce page is used, then the page is marked
read only on the device until the write is done and the device copy is
updated with what we have been asked to write. So no coherency issue
between the 2 copies.

> 
> Hmmm - what happens when we invalidate and release a range of
> file pages that have been migrated to a device? e.g. on truncate?

Same as if it were regular memory: access by the device triggers SIGBUS,
which is reported through the device API. In that respect it follows the
exact same code path as a regular page.

> > > > > > But for thing like writeback i want to be able to do writeback with-
> > > > > > out having to migrate page back first. So that data can stay on the
> > > > > > device while writeback is happening.
> > > > > 
> > > > > Why can't you do writeback before migration, so only clean pages get
> > > > > moved?
> > > > 
> > > > Because device can write to the page while the page is inside the device
> > > > memory and we might want to writeback to disk while page stays in device
> > > > memory and computation continues.
> > > 
> > > Ok. So how does the device trigger ->page_mkwrite on a clean page to
> > > tell the filesystem that the page has been dirtied? So that, for
> > > example, if the page covers a hole because the file is sparse the
> > > filesytem can do the required block allocation and data
> > > initialisation (i.e. zero the cached page) before it gets marked
> > > dirty and any data gets written to it?
> > > 
> > > And if zeroing the page during such a fault requires CPU access to
> > > the data, how do you propose we handle page migration in the middle
> > > of the page fault to allow the CPU to zero the page? Seems like more
> > > lock order/inversion problems there, too...
> > 
> > File back page 

Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Tue, Dec 13, 2016 at 02:08:22PM -0800, Dave Hansen wrote:
> On 12/13/2016 01:24 PM, Jerome Glisse wrote:
> > 
> >>> > > From kernel point of view such memory is almost like any other, it
> >>> > > has a struct page and most of the mm code is non the wiser, nor need
> >>> > > to be about it. CPU access trigger a migration back to regular CPU
> >>> > > accessible page.
> >> > 
> >> > That sounds ... complex. Page migration on page cache access inside
> >> > the filesytem IO path locking during read()/write() sounds like
> >> > a great way to cause deadlocks
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Huh, that's pretty different from the other ZONE_DEVICE uses.  For
> those, you *can* do get_user_pages().
> 
> I'd be really interested to see the feature set that these pages have
> and how it differs from regular memory and the ZONE_DEVICE memory that
> have have in-kernel today.

Well, I can do a list for the current patchset, where I do not allow
migration of file backed pages. Roughly, you can not kmap or GUP them. But
GUP has many more implications, like direct I/O (source or destination of
direct I/O) ...

> 
> BTW, how is this restriction implemented?  I would have expected to see
> follow_page_pte() or vm_normal_page() getting modified.  I don't see a
> single reference to get_user_pages or "GUP" in any of the latest HMM
> patch set or the changelogs.
> 
> As best I can tell, the slow GUP path will get stuck in a loop inside
> follow_page_pte(), while the fast GUP path will allow you to acquire a
> reference to the page.  But, maybe I'm reading the code wrong.

It is a side effect of having a special swap pte: follow_page_pte()
returns NULL, which triggers a page fault through handle_mm_fault(), which
triggers migration back to a regular page. Same for the fast GUP version.
There is never a valid pte for an un-addressable page.
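A sketch of the fault side, to make the mechanism concrete; the entry
test and the migration helper are illustrative names, not the real HMM
functions:

/*
 * Sketch only: an un-addressable page is encoded as a special swap
 * entry, so the fault path can detect it and migrate it back.
 */
swp_entry_t entry = pte_to_swp_entry(pte);

if (is_device_entry(entry))		/* hypothetical predicate */
	/* migrate back to a regular page, then retry the access */
	return device_migrate_back(vma, address, entry);	/* hypothetical */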

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context 
> > > > > > of
> > > > > > filesystem and block device. Specificaly how to handle write-back, 
> > > > > > read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU 
> > > > > > can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
Well, not only a target, it can be a source too. The device can read
and write any system memory and DMA to/from that memory to its on-board
memory.
> 
> So you want the device to be able to dirty mmapped pages that the
> CPU can't access?

Yes, correct.


> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> 
> Thought so. Lots of problems with this.
> 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file.
> 
> From the application point of view. Filesystem, page cache, etc
> there's substantial problems here...
> 
> > It is a regular file on your filesystem.
> 
> ... because of this.
> 
> > > > From kernel point of view such memory is almost like any other, it
> > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > accessible page.
> > > 
> > > That sounds ... complex. Page migration on page cache access inside
> > > the filesytem IO path locking during read()/write() sounds like
> > > a great way to cause deadlocks
> > 
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are not exclusive of each other). The first is
to use a workqueue and have read/write wait on the workqueue to be done
migrating the page back.

The second solution is to use a bounce page during I/O so that there is no
need for migration.


> > > > But for thing like writeback i want to be able to do writ

Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > I would like to discuss un-addressable device memory in the context of
> > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > access.
> > > 
> > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > becomes non-DAX capable?
> > > 
> > > If you are not talking about pmem and DAX, then exactly what does
> > > "when a filesystem page is migrated to device memory that CPU can
> > > not access" mean? What "filesystem page" are we talking about that
> > > can get migrated from main RAM to something the CPU can't access?
> > 
> > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > board memory that can not be expose transparently to the CPU. I am
> > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > https://lwn.net/Articles/706856/
> 
> So ZONE_DEVICE memory that is a DMA target but not CPU addressable?

Well not only target, it can be source too. But the device can read
and write any system memory and dma to/from that memory to its on
board memory.

> 
> > So in my case i am only considering non DAX/PMEM filesystem ie any
> > "regular" filesystem back by a "regular" block device. I want to be
> > able to migrate mmaped area of such filesystem to device memory while
> > the device is actively using that memory.
> 
> "migrate mmapped area of such filesystem" means what, exactly?

fd = open("/path/to/some/file")
ptr = mmap(fd, ...);
gpu_compute_something(ptr);

> 
> Are you talking about file data contents that have been copied into
> the page cache and mmapped into a user process address space?
> IOWs, migrating ZONE_NORMAL page cache page content and state
> to a new ZONE_DEVICE page, and then migrating back again somehow?

Take any existing application that mmaps a file, and allow migrating
chunks of that mmaped file to device memory without the application
even knowing about it. So nothing special in respect to that mmaped
file. It is a regular file on your filesystem.


> > From kernel point of view such memory is almost like any other, it
> > has a struct page and most of the mm code is non the wiser, nor need
> > to be about it. CPU access trigger a migration back to regular CPU
> > accessible page.
> 
> That sounds ... complex. Page migration on page cache access inside
> the filesytem IO path locking during read()/write() sounds like
> a great way to cause deadlocks

There are few restrictions on device pages: no one can do GUP on them and
thus no one can pin them. Hence they can always be migrated back. Yes,
each fs needs modification; most of it (if not all) is isolated in common
filemap helpers.


> > But for thing like writeback i want to be able to do writeback with-
> > out having to migrate page back first. So that data can stay on the
> > device while writeback is happening.
> 
> Why can't you do writeback before migration, so only clean pages get
> moved?

Because the device can write to the page while the page is inside device
memory, and we might want to write back to disk while the page stays in
device memory and computation continues.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specificaly how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that CPU can not
> > access.
> 
> You mean pmem that is DAX-capable that suddenly, without warning,
> becomes non-DAX capable?
> 
> If you are not talking about pmem and DAX, then exactly what does
> "when a filesystem page is migrated to device memory that CPU can
> not access" mean? What "filesystem page" are we talking about that
> can get migrated from main RAM to something the CPU can't access?

I am talking about GPUs, FPGAs, ... any PCIE device that has fast on-board
memory that can not be exposed transparently to the CPU. I am
reusing ZONE_DEVICE for this; you can see the HMM patchset on linux-mm
https://lwn.net/Articles/706856/

So in my case I am only considering non DAX/PMEM filesystems, ie any
"regular" filesystem backed by a "regular" block device. I want to be
able to migrate mmaped areas of such a filesystem to device memory while
the device is actively using that memory.

From the kernel point of view such memory is almost like any other: it
has a struct page and most of the mm code is none the wiser, nor needs
to be about it. CPU access triggers a migration back to a regular CPU
accessible page.

But for things like writeback I want to be able to do writeback without
having to migrate the page back first, so that data can stay on the
device while writeback is happening.

Jérôme


Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Tue, Dec 13, 2016 at 12:01:04PM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> > On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> > > On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > > > I would like to discuss un-addressable device memory in the
> > > > context 
> > > > of filesystem and block device. Specificaly how to handle write
> > > > -back,
> > > > read, ... when a filesystem page is migrated to device memory
> > > > that 
> > > > CPU can not access.
> > > > 
> > > > I intend to post a patchset leveraging the same idea as the
> > > > existing
> > > > block bounce helper (block/bounce.c) to handle this. I believe
> > > > this 
> > > > is worth discussing during summit see how people feels about such
> > > > plan and if they have better ideas.
> > > 
> > > Isn't this pretty much what the transcendent memory interfaces we
> > > currently have are for?  It's current use cases seem to be
> > > compressed
> > > swap and distributed memory, but there doesn't seem to be any
> > > reason in
> > > principle why you can't use the interface as well.
> > > 
> > 
> > I am not a specialist of tmem or cleancache
> 
> Well, that makes two of us; I just got to sit through Dan Magenheimer's
> talks and some stuff stuck.
> 
> >  but my understand is that there is no way to allow for file back 
> > page to be dirtied while being in this special memory.
> 
> Unless you have some other definition of dirtied, I believe that's what
> an exclusive tmem get in frontswap actually does.  It marks the page
> dirty when it comes back because it may have been modified.

Well, frontswap only supports anonymous or shared pages, not random
filemap pages. So it doesn't help for what I am aiming at :) Note that in
my case the device reports accurate dirty information (did the device
modify the page or not), assuming hardware bugs don't exist.


> > In my case when you migrate a page to the device it might very well 
> > be so that the device can write something in it (results of some sort 
> > of computation). So page might migrate to device memory as clean but
> > return from it in dirty state.
> > 
> > Second aspect is that even if memory i am dealing with is un
> > -addressable i still have struct page for it and i want to be able to 
> > use regular page migration.
> 
> Tmem keeps a struct page ... what's the problem with page migration?
> the fact that tmem locks the page when it's not addressable and you
> want to be able to migrate the page even when it's not addressable?

Well, the way cleancache or frontswap work is that they are used when the
kernel is trying to make room or evict something. In my case it is the
device that triggers the migration, for a range of virtual addresses of a
process. Sure, I can make a weird helper that would force-frontswap
or cleancache the pages I want to migrate, but it seems counter intuitive
to me.

One extra requirement for me is to be able to easily and quickly find
the migrated page by looking at the CPU page table of the process.
With frontswap there is a level of indirection where I need to find
the memory through frontswap. With cleancache there isn't even any
information left (the page table entry is cleared).


> 
> > So given my requirement i didn't thought that cleancache was the way
> > to address them. Maybe i am wrong.
> 
> I'm not saying it is, I just asked if you'd considered it, since the
> requirements look similar.

Yes, I briefly considered it, but from the high-level overview I had it
did not seem to address all my requirements. Maybe it is because I lack
in-depth knowledge of cleancache/frontswap, but skimming through the code
didn't convince me that I needed to dig deeper.

The solution I am pursuing uses struct page, and thus everything is as
if it was a regular page to the kernel. The only things that don't work
are kmap or mapping it into a process, but this can easily be handled.
For filesystems the issues are about anything that does I/O, so read/
write/writeback.

In many cases, if CPU I/O happens what I want to do is migrate back to a
regular page, so the read/write case is easy. But for writeback, if the
page is dirty on the device and the device reports it (calling
set_page_dirty()), then I still want writeback to work so I don't lose
data (if the device dirtied the page it is probably because it was
instructed to save current computations).

With this in mind, the bounce helper, designed to work around block device
limitations with respect to the pages they can access, seemed a perfect
fit. All I care about is providing a bounce page allowing writeback to
happen without having to go through the "slow" page migration back to a
system page.

Jérôme


Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > I would like to discuss un-addressable device memory in the context 
> > of filesystem and block device. Specificaly how to handle write-back,
> > read, ... when a filesystem page is migrated to device memory that 
> > CPU can not access.
> > 
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this 
> > is worth discussing during summit see how people feels about such 
> > plan and if they have better ideas.
> 
> Isn't this pretty much what the transcendent memory interfaces we
> currently have are for?  It's current use cases seem to be compressed
> swap and distributed memory, but there doesn't seem to be any reason in
> principle why you can't use the interface as well.
> 

I am not a specialist of tmem or cleancache, but my understanding is that
there is no way to allow a file backed page to be dirtied while it is
in this special memory.

In my case, when you migrate a page to the device it might very well be
so that the device can write something to it (results of some sort of
computation). So a page might migrate to device memory as clean but
return from it in a dirty state.

The second aspect is that even if the memory I am dealing with is
un-addressable, I still have struct page for it and I want to be able to
use regular page migration.

So given my requirements I didn't think that cleancache was the way
to address them. Maybe I am wrong.

Cheers,
Jérôme


[LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread Jerome Glisse
I would like to discuss un-addressable device memory in the context of
filesystems and block devices. Specifically, how to handle write-back,
read, ... when a filesystem page is migrated to device memory that the
CPU can not access.

I intend to post a patchset leveraging the same idea as the existing
block bounce helper (block/bounce.c) to handle this. I believe this is
worth discussing during the summit, to see how people feel about such a
plan and whether they have better ideas.


I would also like to join discussions on:
  - Peer-to-Peer DMAs between PCIe devices
  - CDM coherent device memory
  - PMEM
  - overall mm discussions

Cheers,
Jérôme