Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Jeff Garzik

Andrea Arcangeli wrote:
> 
> On Fri, Oct 20, 2000 at 02:19:59PM -0400, Jeff Garzik wrote:
> > Why? [..]
> 
> vma information isn't passed from v4l layer to lowlevel layer.

so I see :(

The Matrox Meteor II driver I'm developing uses DMA memory, PCI shared
memory, -or- reserve_bootmem memory in mmap(2), depending on card and
system capabilities.  It sure would be nice to have the vma there,
especially now that I know about using nopage() for mmap, but it can be
worked around...

-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |



Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Andrea Arcangeli

On Fri, Oct 20, 2000 at 02:19:59PM -0400, Jeff Garzik wrote:
> Why? [..]

vma information isn't passed from v4l layer to lowlevel layer.

Andrea
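
For context, a sketch of the limitation: the v4l driver-level mmap hook of
that era received only an address and a size, so the vma never reached the
lowlevel driver. The signature below is recalled from 2.4-era videodev.h and
should be treated as an assumption:

struct video_device {
        /* ... */
        int (*mmap)(struct video_device *dev, const char *adr,
                    unsigned long size);   /* no vm_area_struct here */
        /* ... */
};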



Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Linus Torvalds



On Fri, 20 Oct 2000, Jeff Garzik wrote:
> 
> If I understand your patch, I should call vma_reserve(), and then
> completely remove my no-op swapout().  Correct?

Note that I dislike "wrapper.h", and I just removed that part.

I don't think it's any clearer to write "vma_reserve(vma)" than it is to
just say "vma->vm_flags |= VM_RESERVE". 

But yes, add that line and remove the swapout, and you should be golden -
no unnecessary faults (well, it won't pre-fault, of course) and no trouble
with calculating locked pages.

Linus
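
A minimal sketch of what Linus describes, with hypothetical driver names:
set the flag in mmap() and provide no swapout callback at all.

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_ops = &mydrv_vm_ops;            /* nopage only, no swapout */
        vma->vm_private_data = file->private_data;
        vma->vm_flags |= VM_RESERVED;           /* skip this vma in swap_out */
        return 0;
}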




Re: mapping user space buffer to kernel address space

2000-10-20 Thread Stephen Tweedie

Hi,

On Tue, Oct 17, 2000 at 09:42:36PM -0700, Linus Torvalds wrote:

> Now, the way I've always envisioned this to work is that the VM scanning
> function basically always does the equivalent of just
> 
>  - get PTE entry, clear it out.
>  - if PTE was dirty, add the page to the swap cache, and mark it dirty,
>but DON'T ACTUALLY START THE IO!
>  - free the page.
> 
> Then, we'd move the "writeout" part into the LRU queue side, and at that
> point I agree with you 100% that we probably should just delay it until
> there are no mappings available

I've just been talking about this with Ben LaHaise and Rik van Riel,
and Ben brought up a nasty problem --- NFS, which because of its
credentials requirements needs to have the struct file available in
its writepage function.  Of course, if we defer the write then we
don't necessarily have the file available when we come to flush the
page from cache.

One answer is to say "well then NFS is broken, fix it".  It's not too
hard --- NFS mmaps need a wp_page function which registers the
caller's credentials against the page when we dirty it so that we can
use those credentials on flush.  That means that writes to a
multiply-mapped file essentially get random credentials, but I don't
think we care --- the credentials eventually used will be enough to
avoid the root_squash problems and the permissions at open make sure
we're not doing anything illegal.  

(Changing permissions on an already-mmaped file and causing the NFS
server to refuse the write raises problems which are ... interesting,
but I'm not convinced that that is a new problem; I suspect we can
fabricate such a failure today.)

--Stephen
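
A purely hypothetical sketch of the idea -- no such wp_page hook existed,
and every name below is invented for illustration: record the dirtying
task's credentials on the page so a deferred flush can reuse them.

struct nfs_page_cred {
        uid_t fsuid;
        gid_t fsgid;
};

static int nfs_wp_page(struct vm_area_struct *vma, struct page *page)
{
        struct nfs_page_cred *cred = nfs_page_cred_of(page); /* invented */

        cred->fsuid = current->fsuid;   /* last dirtier "wins" */
        cred->fsgid = current->fsgid;
        return 0;
}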



Re: mapping user space buffer to kernel address space

2000-10-20 Thread faith

In article <[EMAIL PROTECTED]>,
 you write:
>
>
>On Thu, 19 Oct 2000, Jeff Garzik wrote:
>>
>> I stole the last two lines from drivers/char/drm/vm.c, which might need
>> to be fixed up also..  He uses the vm_flags above and never calls
>> get_page, at the very least.
>
>The DRM code does
>
>   atomic_inc(&virt_to_page(physical)->count);
>
>which is basically what get_page() expands into. The DRM code looks ugly,
>but correct, at least as far as this thing is concerned.
>
>But you're right about the mmap vm_flags/vm_file things.

We'll look at this and submit the changes with the next patch set.



Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Andrea Arcangeli

On Fri, Oct 20, 2000 at 01:46:53PM -0400, Jeff Garzik wrote:
> In any case, we shouldn't modify videodev.c to call vma_reserve()... 
> Let the driver's mmap operation do that or not do that, as it chooses.

It can't with the current mmap video4linux kernel API.

In practice it doesn't matter, because no driver (before your VIA soundcard :)
implements the swapout callback. And vma_reserve fits perfectly in your
->nopage driver too.

Andrea



Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Andrea Arcangeli

On Fri, Oct 20, 2000 at 10:44:40AM -0700, Linus Torvalds wrote:
> agree with your change, but I just suspect it will break drivers that have

you're right, it would break them: with PG_reserved removed, the driver should
really increase the page count for each mapping somehow (in the future that
can easily be done in the nopage callback).

Andrea
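
A sketch of that future shape, with hypothetical driver names: take one page
reference per mapping inside the nopage() callback itself.

static struct page *mydrv_nopage(struct vm_area_struct *vma,
                                 unsigned long address, int write_access)
{
        struct page *page = mydrv_find_page(vma, address); /* hypothetical */

        if (!page)
                return NOPAGE_SIGBUS;
        get_page(page);         /* one reference per mapping */
        return page;
}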



Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Jeff Garzik

Andrea Arcangeli wrote:
> > As to your rvmalloc()/rvfree() changes, I don't think they are safe as-is:
> > I think it's the right thing to do, but I don't trust the drivers to
> > maintain the right page counts. The code used to mark the pages as
> > reserved, which probably means that it hides bad drivers that do not do
> > proper reference counting - and I'm not willing to make that kind of
> > change at this point.
> 
> The page count of the mapped pages should be ok; it seems those mapped pages
> have a reference count of 1 just from the vmalloc allocation, and they use
> PG_reserved just to skip swap_out. But I'd feel safer too if the bttv
> maintainers checked it and sent it to you themselves after verifying it's
> correct. (I only verified that it was compiling correctly)

In any case, we shouldn't modify videodev.c to call vma_reserve()... 
Let the driver's mmap operation do that or not do that, as it chooses.

Jeff



-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |



Re: mapping user space buffer to kernel address space

2000-10-20 Thread Linus Torvalds



On Thu, 19 Oct 2000, Stephen Tweedie wrote:
> > 
> > Then, we'd move the "writeout" part into the LRU queue side, and at that
> > point I agree with you 100% that we probably should just delay it until
> > there are no mappings available
> 
> I've just been talking about this with Ben LaHaise and Rik van Riel,
> and Ben brought up a nasty problem --- NFS, which because of its
> credentials requirements needs to have the struct file available in
> its writepage function.  Of course, if we defer the write then we
> don't necessarily have the file available when we come to flush the
> page from cache.

Yes. But that doesn't mean that swapping couldn't do it (swapping
fundamentally doesn't have credentials).

And note that this is not about "NFS is broken" - any remote filesystem
will have some issues like this, and shared mappings will always have to
handle this case.

So basically I agree that shared mappings cannot be converted to this
setup, I was only talking about the specific case of the swapping (and
anonymous shared memory, which along with SysV IPC shm is basically the
same thing and already uses the swap cache).

So what I was thinking of was the very end of try_to_swap_out(), where we
have noticed that we do not have a "swapout()" function, and we need to
add the page to the swap cache. I would suggest moving _that_ code to the
LRU queue, and handling it conceptually together with the stuff that
handles the buffer cache writeout.

--

And no, I haven't forgotten about the case of direct IO into a shared
mapping. That _is_ going to be different in many ways, and I suspect that
a solution to that particular issue may be to move the "vm_file"
information from when we do the virtual kiobuf lookup into the kiobuf's,
because otherwise we'd basically lose that information.

(We _already_ lose that information, in fact. Keeping the page in the
virtual mapping doesn't really even fix it - because the page can be in
multiple virtual mappings with different vm_file's and thus different
credentials. And the kiobuf's do not really contain any information of
_which_ of the credentials we looked up. It happens to work, but it's
conceptually not very correct).

Linus




Re: VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Andrea Arcangeli

On Fri, Oct 20, 2000 at 10:10:05AM -0700, Linus Torvalds wrote:
> Sure. I have no problem at all with this suggestion: it's basically just a
> hint to the VM layer that trying to page something out in this vma is
> useless, as its backing store is in memory anyway.

Yes, that is _exactly_ the point.

> As to your rvmalloc()/rvfree() changes, I don't think they are safe as-is:
> I think it's the right thing to do, but I don't trust the drivers to
> maintain the right page counts. The code used to mark the pages as
> reserved, which probably means that it hides bad drivers that do not do
> proper reference counting - and I'm not willing to make that kind of
> change at this point.

The page count of the mapped pages should be ok; it seems those mapped pages
have a reference count of 1 just from the vmalloc allocation, and they use
PG_reserved just to skip swap_out. But I'd feel safer too if the bttv
maintainers checked it and sent it to you themselves after verifying it's
correct. (I only verified that it was compiling correctly)

Thanks.

Andrea



VM_RESERVED [was Re: mapping user space buffer to kernel address space]

2000-10-20 Thread Andrea Arcangeli

On Thu, Oct 19, 2000 at 03:45:43PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 20 Oct 2000, Andrea Arcangeli wrote:
> 
> > On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > > solution is really elegant.  Excluding all the debug code and assertions
> > > I stick in there, the guts of via audio mmap support went from ~50 lines
> > > to ~10.
> > 
> > Was it 50 lines with remap_page_range?
> 
> You cannot use remap_page_range() if the pages aren't physically
> contiguous.

Yep.

And about the other question?

> Face it, Andrea, remap_page_range() is a very limited hack that only works
> for a very limited subset of the things that real people want to do. It is
> _not_ a generic solution, while the nopage() approach _is_ generic.

But without keeping those pages pinned in the ptes, the nopage approach is
_suboptimal_, and adding a dummy swapout that returns "progress" where no
progress has been made will also _break_ the VM with relatively large device
mmapped vmas. It would at least _certainly_ break 2.2.18pre15aa1 (if 2.4.x has
an oom deadlock in GFP it won't break, of course).

I'm fine with dropping remap_page_range and the PG_reserved bit, but to do
that I'd suggest adding a new per-VMA VM_RESERVED bitflag. This will also
avoid walking pagetables where it makes no sense to unmap the pages (it won't
only avoid senseless page faults), and it will fix the problem of VM callbacks
that return "success" when no progress is made. (SHM returns a dummy 0 too,
but progress has been made in that case)

The only reason it could make sense to do that unmapping is if we were also
able to free the _pagetables_, but we do that only via munmap if a whole pgd
entry gets unmapped (and after munmap the page has just been unmapped anyway).
So if we ever free pagetables via swap_out (maybe never), some driver with
hundreds of mbytes of virtual mapping may choose to drop the VM_RESERVED
bitflag from its vmas, if pagefaults aren't a performance problem for that
driver.

I just find it hard to propose widely something that ends up doing the wrong
thing (ok, I see it's not a big deal, but it's still the _wrong_ thing to do;
the VM_RESERVED way will fix it at _zero_ cost, it will improve performance
compared to PG_reserved, and it will also avoid the ugliness of having all the
SG drivers out there implement a dummy swapout callback :).

This introduces VM_RESERVED against test10-pre4:

--- 2.4.0-test10-pre4/include/linux/mm.h.~1~Fri Oct 20 17:56:09 2000
+++ 2.4.0-test10-pre4/include/linux/mm.hFri Oct 20 18:23:47 2000
@@ -95,6 +95,7 @@
 
 #define VM_DONTCOPY    0x00020000  /* Do not copy this vma on fork */
 #define VM_DONTEXPAND  0x00040000  /* Cannot expand with mremap() */
+#define VM_RESERVED    0x00080000  /* Don't unmap it from swap_out */
 
 #define VM_STACK_FLAGS 0x00000177
 
--- 2.4.0-test10-pre4/include/linux/wrapper.h.~1~   Tue Aug  8 06:01:36 2000
+++ 2.4.0-test10-pre4/include/linux/wrapper.h   Fri Oct 20 18:26:02 2000
@@ -29,8 +29,17 @@
 #define vma_get_end(v) v->vm_end
 #define vma_get_page_prot(v) v->vm_page_prot
 
+/*
+ * mem_map_reserve()/unreserve() are going to be obsoleted by
+ * vma_reserve(). (unreserve shouldn't be necessary)
+ *
+ * Instead of marking the pages as reserved, just mark the vma as reserved:
+ * this will improve performance (it's zero cost, unlike the PG_reserved check)
+ * and it will be trivial for not physically contiguous mappings too.
+ */
 #define mem_map_reserve(p) set_bit(PG_reserved, &p->flags)
 #define mem_map_unreserve(p) clear_bit(PG_reserved, &p->flags)
 #define mem_map_inc_count(p) atomic_inc(&(p->count))
 #define mem_map_dec_count(p) atomic_dec(&(p->count))
+#define vma_reserve(vma)   ((vma)->vm_flags |= VM_RESERVED)
 #endif
--- 2.4.0-test10-pre4/mm/vmscan.c.~1~   Fri Oct 20 17:56:10 2000
+++ 2.4.0-test10-pre4/mm/vmscan.c   Fri Oct 20 18:11:09 2000
@@ -318,7 +318,7 @@
unsigned long end;
 
/* Don't swap out areas which are locked down */
-   if (vma->vm_flags & VM_LOCKED)
+   if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
return 0;
 
pgdir = pgd_offset(mm, address);


And this shows a practical (but untested :) demonstration of the VM_RESERVED
usage:

--- 2.4.0-test10-pre4/drivers/media/video/bttv-driver.c.~1~ Thu Oct 12 03:04:42 2000
+++ 2.4.0-test10-pre4/drivers/media/video/bttv-driver.c Fri Oct 20 18:32:43 2000
@@ -39,7 +39,6 @@
 #include <linux/sched.h>
 #include <asm/segment.h>
 #include <linux/types.h>
-#include <linux/wrapper.h>
 #include <linux/interrupt.h>
 #include <linux/kmod.h>
 #include <linux/vmalloc.h>
@@ -181,40 +180,17 @@
 static void * rvmalloc(signed long size)
 {
void * mem;
-   unsigned long adr, page;
 
mem=vmalloc_32(size);
if (mem) 
-   {
memset(mem, 0, size); /* Clear the ram out, no junk to the user */
-   adr=(unsigned long) mem;
-   while (size > 0) 
-{
-   page = kvirt_to_pa(adr);
-   mem_map_reserve(virt_to_page(__va(page)));
-   



Re: mapping user space buffer to kernel address space

2000-10-19 Thread Jeff Garzik

Andrea Arcangeli wrote:
> 
> On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > solution is really elegant.  Excluding all the debug code and assertions
> > I stick in there, the guts of via audio mmap support went from ~50 lines
> > to ~10.
> 
> Was it 50 lines with remap_page_range?

Yeah.  Via audio is scatter-gather, so I had a function via_mmap_chan
which had to walk the scatter-gather list, calling remap_page_range for
each PAGE_SIZE'd dma buffer.

Now with nopage(), the need to walk the scatter-gather list is
completely avoided.  The scatter-gather info is stored as an array, so
finding the physical address for a virtual page is a direct lookup.


> What is the advantage of introducing pagefaults that we can avoid (and
> that we are also used to avoiding)?

Well, I don't know the VM well, so I can't say how badly the lack of
VM_LOCKED will affect latency, if at all.

I prefer this way because it seems the most clean -- let the system
pagefault if it wants to, we're not really losing the buffer, only a
reference to it.

Jeff



-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |
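
For comparison, roughly what the old ~50-line path had to do -- a
reconstruction, not Jeff's actual code: walk the scatter-gather list and
remap each PAGE_SIZE DMA buffer with the 2.4-era remap_page_range():

static int via_mmap_chan(struct via_channel *chan,
                         struct vm_area_struct *vma)
{
        unsigned long start = vma->vm_start;
        int i;

        for (i = 0; i < VIA_DMA_BUFFERS; i++) {
                unsigned long phys = virt_to_phys(chan->sgbuf[i].cpuaddr);

                if (remap_page_range(start, phys, PAGE_SIZE,
                                     vma->vm_page_prot))
                        return -EAGAIN;
                start += PAGE_SIZE;
        }
        return 0;
}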



Re: mapping user space buffer to kernel address space

2000-10-19 Thread Linus Torvalds



On Fri, 20 Oct 2000, Andrea Arcangeli wrote:

> On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> > solution is really elegant.  Excluding all the debug code and assertions
> > I stick in there, the guts of via audio mmap support went from ~50 lines
> > to ~10.
> 
> Was it 50 lines with remap_page_range?

You cannot use remap_page_range() if the pages aren't physically
contiguous.

Face it, Andrea, remap_page_range() is a very limited hack that only works
for a very limited subset of the things that real people want to do. It is
_not_ a generic solution, while the nopage() approach _is_ generic.

Linus




Re: mapping user space buffer to kernel address space

2000-10-19 Thread Linus Torvalds



On Thu, 19 Oct 2000, Jeff Garzik wrote:
>
> I stole the last two lines from drivers/char/drm/vm.c, which might need
> to be fixed up also..  He uses the vm_flags above and never calls
> get_page, at the very least.

The DRM code does

atomic_inc(&virt_to_page(physical)->count);

which is basically what get_page() expands into. The DRM code looks ugly,
but correct, at least as far as this thing is concerned.

But you're right about the mmap vm_flags/vm_file things.

Linus
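
For reference, the 2.4-era definition (recalled from <linux/mm.h>; treat the
exact form as an assumption):

#define get_page(p)     atomic_inc(&(p)->count)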




Re: mapping user space buffer to kernel address space

2000-10-19 Thread Andrea Arcangeli

On Thu, Oct 19, 2000 at 05:16:14PM -0400, Jeff Garzik wrote:
> solution is really elegant.  Excluding all the debug code and assertions
> I stick in there, the guts of via audio mmap support went from ~50 lines
> to ~10.

Was it 50 lines with remap_page_range?

What is the advantage of introducing pagefaults that we can avoid (and
that we are also used to avoiding)?

Andrea



Re: mapping user space buffer to kernel address space

2000-10-19 Thread Jeff Garzik

Linus Torvalds wrote:
> and quite frankly we should just change the damn "nopage()" arguments to
> be
> 
> struct page * (*nopage)(struct vm_area_struct * area, unsigned long pgoff,
>int write_access);
> 
> because the nopage() routine really shouldn't care about "address" at all
> (the VM layer cares, but "nopage()" should just return the proper page at
> the proper offset into the mapping).

fine with me, it makes my code even shorter.  ;-)  This new mmap
solution is really elegant.  Excluding all the debug code and assertions
I stick in there, the guts of via audio mmap support went from ~50 lines
to ~10.


> > The first draft of code had only
> >
> >   vma->vm_ops = &via_mm_ops;
> >   vma->vm_private_data = card;
> >
> > and userspace apps segfaulted.  I then added the two objectionable
> > lines, and things started working:
> >
> >   vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
> >   vma->vm_file = file;
> >
> > Have not yet checked any combinations in between...
> 
> Ahh, I think I see why. You should do a "get_page()" on the page before
> you return it from your nopage() - otherwise your page counts will be
> completely wrong.
> 
> Actually, as far as I can tell your page counts ended up being wrong
> anyway, but probably only at the end of the munmap() rather than at
> runtime.

Thanks, fixed.

I stole the last two lines from drivers/char/drm/vm.c, which might need
to be fixed up also..  He uses the vm_flags above and never calls
get_page, at the very least.

Jeff



-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |



Re: mapping user space buffer to kernel address space

2000-10-19 Thread Linus Torvalds



On Thu, 19 Oct 2000, Jeff Garzik wrote:
> 
> > > Since this code works in my local tests, my two concerns at this point
> > > are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.
> 
> Can I completely ignore vm_pgoff in nopage()?  Currently I just do the
> following, to get the page number relative to the start of the vma:
> 
>   pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> 
> where 'address' is the value passed to nopage().

Oh, no, you can't do that. It should be something like

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);

and quite frankly we should just change the damn "nopage()" arguments to
be

struct page * (*nopage)(struct vm_area_struct * area, unsigned long pgoff, int 
write_access);

because the nopage() routine really shouldn't care about "address" at all
(the VM layer cares, but "nopage()" should just return the proper page at
the proper offset into the mapping).

I didn't notice..

> 
> The first draft of code had only
> 
>   vma->vm_ops = &via_mm_ops;
>   vma->vm_private_data = card;
> 
> and userspace apps segfaulted.  I then added the two objectionable
> lines, and things started working:
> 
>   vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
>   vma->vm_file = file;
> 
> Have not yet checked any combinations in between...

Ahh, I think I see why. You should do a "get_page()" on the page before
you return it from your nopage() - otherwise your page counts will be
completely wrong.

Actually, as far as I can tell your page counts ended up being wrong
anyway, but probably only at the end of the munmap() rather than at
runtime.

Linus
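
The offset computation Linus describes, written with explicit parentheses so
the shift happens before the add (in C, + binds tighter than >>):

/* vm_pgoff is the vma's starting page offset into the object;
 * 'address' is a user virtual address in [vm_start, vm_end). */
pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);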




Re: mapping user space buffer to kernel address space

2000-10-19 Thread Jeff Garzik

Linus Torvalds wrote:
> Looks reasonable - although your "max_size" checks are wrong at mmap time.
> You should check for something like

thanks


> > Since this code works in my local tests, my two concerns at this point
> > are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

Can I completely ignore vm_pgoff in nopage()?  Currently I just do the
following, to get the page number relative to the start of the vma:

pgoff = (address - vma->vm_start) >> PAGE_SHIFT;

where 'address' is the value passed to nopage().


> You should drop the
> 
> vma->vm_file = file;
> 
> as the VM layer will have done that already.
> 
> Also, the VM_LOCKED | VM_SHM thing is wrong, because it will cause
> mis-calculation of the mm->vm_locked fields, and it's also unnecessary.
> Sure, it will cause the VM layer to not swap, but the no-op "swapping"
> shouldn't much hurt anyway, as it's easy enough to re-populate the page
> again.
> 
> So either just drop it, or make sure that the locked page count doesn't
> get corrupted.

The first draft of code had only

vma->vm_ops = &via_mm_ops;
vma->vm_private_data = card;

and userspace apps segfaulted.  I then added the two objectionable
lines, and things started working:

vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
vma->vm_file = file;

Have not yet checked any combinations in between...

Regards,

Jeff



-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |



Re: mapping user space buffer to kernel address space

2000-10-19 Thread Linus Torvalds



On Wed, 18 Oct 2000, Jeff Garzik wrote:
> 
> Well coolio.  Would somebody be up for sanity checking my audio mmap
> code (attached)?  It doesn't look too hard at all to get the audio
> drivers doing the correct thing.

Looks reasonable - although your "max_size" checks are wrong at mmap time.
You should check for something like

if (size > max_size)
goto out;
if (offset > max_size - size)
goto out;

instead of your current (offset >= max_size) check. As it stands, I think
you can mmap past the end by just having size and offset < max_size, but
the _sum_ being bigger..
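
A concrete instance of the hole, with hypothetical numbers:

/* max_size = 16 pages.  The old test (offset >= max_size) accepts
 * offset = 15 pages, size = 8 pages -- a mapping running 7 pages past
 * the end.  The two-step test rejects it: size <= max_size holds, but
 * offset > max_size - size  (15 > 16 - 8). */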

> Since this code works in my local tests, my two concerns at this point
> are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

You should drop the 

vma->vm_file = file;

as the VM layer will have done that already.

Also, the VM_LOCKED | VM_SHM thing is wrong, because it will cause
mis-calculation of the mm->vm_locked fields, and it's also unnecessary.
Sure, it will cause the VM layer to not swap, but the no-op "swapping"
shouldn't much hurt anyway, as it's easy enough to re-populate the page
again.

So either just drop it, or make sure that the locked page count doesn't
get corrupted.

Linus
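
Putting both corrections together, a sketch of the trimmed mmap() tail for
the posted driver code:

        vma->vm_ops = &via_mm_ops;
        vma->vm_private_data = card;
        /* no vma->vm_file = file: the VM layer has already set it;
         * no VM_LOCKED | VM_SHM: the no-op "swapout" just repopulates
         * pages via nopage() and the locked-page count stays honest */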




Re: mapping user space buffer to kernel address space

2000-10-19 Thread Andrea Arcangeli

On Wed, Oct 18, 2000 at 10:30:45PM -0400, Jeff Garzik wrote:
> Well coolio.  Would somebody be up for sanity checking my audio mmap
> code (attached)?  It doesn't look too hard at all to get the audio

Nice patch ;). 

>   vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */

Since you do vm_flags |= VM_LOCKED you won't need to implement swapout, since
it will never run. (and the VM_SHM there is wrong)
 
The driver will also work this way, but it will be slower because it will
generate pagefaults that would otherwise be avoided.

A case where the above would make tons of sense is if the mapped area were on
the order of gigabytes (or at least hundreds of mbytes), where lazy mapping
would avoid a big mmap(/dev/dsp) latency and save precious pagetables at
startup.

I just don't think the additional complexity is worthwhile for a sound driver.

However I see it's fun :).

Andrea



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Jeff Garzik

Linus Torvalds wrote:
> Anyway, I didn't realize you were talking about the sound drivers use of
> remap_page_range(). That's not the original reason for remap_page_range()
> at all, and in fact it's the _ugly_ way to do things. It's simple and it
> works, but it's not pretty.
> 
> Quite frankly, the way I'd conceptually prefer people do these kinds of
> DMA buffers etc is to just have a "nopage()" function, and let that
> nopage() function just fill in the page from the DMA buffer directly. This
> would work fine, and would get rid of all the playing with PG_reserved
> etc.

Well coolio.  Would somebody be up for sanity checking my audio mmap
code (attached)?  It doesn't look too hard at all to get the audio
drivers doing the correct thing.

Converting to this was easy...  my driver hardcodes the buffer size at
PAGE_SIZE.  Since Via audio uses scatter-gather DMA, all I need to do in
'nopage' is get the virtual page number for the vma, and use that as a
direct index into the scatter-gather array.

Since this code works in my local tests, my two concerns at this point
are correct vma->vm_pgoff treatment, and correct vm_flags/vm_file usage.

Jeff



-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |


#ifdef VIA_SUPPORT_MMAP
static struct page * via_mm_nopage (struct vm_area_struct * vma,
unsigned long address, int write_access)
{
struct via_info *card = vma->vm_private_data;
struct via_channel *chan = &card->ch_out;
unsigned long pgoff;
int rd, wr;

DPRINTK ("ENTER, start %lXh, ofs %lXh, pgoff %ld, addr %lXh, wr %d\n",
 vma->vm_start,
 address - vma->vm_start,
 (address - vma->vm_start) >> PAGE_SHIFT,
 address,
 write_access);

assert (VIA_DMA_BUF_SIZE == PAGE_SIZE);

if (address > vma->vm_end) {
DPRINTK ("EXIT, returning NOPAGE_SIGBUS\n");
return NOPAGE_SIGBUS; /* Disallow mremap */
}
if (!card) {
DPRINTK ("EXIT, returning NOPAGE_OOM\n");
return NOPAGE_OOM;  /* Nothing allocated */
}

/* XXX what about vma->vm_pgoff? */
pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
rd = card->ch_in.is_mapped;
wr = card->ch_out.is_mapped;

#ifndef VIA_NDEBUG
{
unsigned long max_bufs = VIA_DMA_BUFFERS;
if (rd && wr) max_bufs *= 2;
/* via_dsp_mmap() should ensure this */
assert (pgoff < max_bufs);
}
#endif

/* if full-duplex (read+write) and we have two sets of bufs,
 * then the playback buffers come first, sez soundcard.c */
if (pgoff >= VIA_DMA_BUFFERS) {
pgoff -= VIA_DMA_BUFFERS;
chan = >ch_in;
} else if (!wr)
chan = >ch_in;

assert ((((unsigned long)chan->sgbuf[pgoff].cpuaddr) % PAGE_SIZE) == 0);
DPRINTK ("EXIT, returning page %p for cpuaddr %lXh\n",
 virt_to_page (chan->sgbuf[pgoff].cpuaddr),
 (unsigned long) chan->sgbuf[pgoff].cpuaddr);
return virt_to_page (chan->sgbuf[pgoff].cpuaddr);
}


static int via_mm_swapout (struct page *page, struct file *filp)
{
return 0;
}


struct vm_operations_struct via_mm_ops = {
nopage: via_mm_nopage,
swapout:via_mm_swapout,
};


static int via_dsp_mmap(struct file *file, struct vm_area_struct *vma)
{
struct via_info *card;
int nonblock = (file->f_flags & O_NONBLOCK);
int rc = -EINVAL, rd=0, wr=0;
unsigned long max_size, size, start, offset;

assert (file != NULL);
assert (vma != NULL);
card = file->private_data;
assert (card != NULL);

DPRINTK ("ENTER, start %lXh, size %ld, pgoff %ld\n",
 vma->vm_start,
 vma->vm_end - vma->vm_start,
 vma->vm_pgoff);

assert (VIA_DMA_BUF_SIZE == PAGE_SIZE);

max_size = 0;
if (file->f_mode & FMODE_READ) {
rd = 1;
max_size += (VIA_DMA_BUFFERS * VIA_DMA_BUF_SIZE);
}
if (file->f_mode & FMODE_WRITE) {
wr = 1;
max_size += (VIA_DMA_BUFFERS * VIA_DMA_BUF_SIZE);
}

start = vma->vm_start;
offset = (vma->vm_pgoff << PAGE_SHIFT);
size = vma->vm_end - vma->vm_start;

if (size > max_size)
goto out;
if ((offset >= max_size) || (offset >= size))
goto out;

rc = via_syscall_down (card, nonblock);
if (rc) goto out;

vma->vm_ops = &via_mm_ops;
vma->vm_private_data = card;
vma->vm_flags |= VM_LOCKED | VM_SHM; /* Don't swap */
vma->vm_file = file;

if (rd)
card->ch_in.is_mapped = 1;

Re: mapping user space buffer to kernel address space

2000-10-18 Thread Linus Torvalds



On Wed, 18 Oct 2000, Andrea Arcangeli wrote:
> 
> > Quite frankly, the way I'd conceptually prefer people do these kinds of
> > DMA buffers etc is to just have a "nopage()" function, and let that
> > nopage() function just fill in the page from the DMA buffer directly. This
> > would work fine, and would get rid of all the playing with PG_reserved
> > etc.
> 
> And it would generate useless page faults as well. What's the point
> of unmapping a page that can't be freed? Once the page gets freed
> the mapping just goes away via munmap anyways. What's the point of
> introducing that performance hit?

As I explained in the part you snipped, this is why I _do_ accept
remap_page_range().

But to answer your question: the point is cleanliness.

For example, remap_page_range() cannot do many things. Because a reserved
page does not maintain any counts (that's the definition of "reserved" as
far as the Linux MM is concerned - the pages never count anything), you
cannot do proper memory management with it.

In the specific case of braindead hardware that needs a 1:1 direct mapping
due to DMA limitations etc, that's exactly what you want. You can't let
people move pages around anyway or do things like that. You can never let
people swap these pages out, because there is no point.

But this is NOT how the MM layer should work in general. And it's sure as
h*ll not how the direct IO layer is going to work. You can argue until you
get blue in the face, I don't care. I'm _not_ going to accept some
braindamaged setup where you can only have one concurrent write into a
page at the same time. I'm _not_ going to accept some braindamaged setup
where you can't unmap such a page in another process or thread.

In short, I'm _not_ willing to paper over bugs by just saying "don't touch
this page".

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Andrea Arcangeli

On Wed, Oct 18, 2000 at 03:40:39PM +0100, Stephen C. Tweedie wrote:
> shm already does it: [..]

Right. Only the shared mappings and anonymous memory need to be changed.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Stephen Tweedie

Hi,

On Wed, Oct 18, 2000 at 03:23:17PM +0200, Andrea Arcangeli wrote:

> This change makes sense and I agree it would cover the problem. However I
> prefer to clarify that doing it for the swap cache as described is not nearly
> enough to cover the mm corruption (everything that gets written via a memory
> balancing mechanism should do the same).
> 
> That said, I think it would be possible to do it for SHM and shared mappings too.

shm already does it: shm_swap_core returns RETRY unless the page count is
1.  File maps are close: file swapping already uses the address space
a_ops, but currently uses them at the wrong time: unmap time rather
than last-release.  Rik?
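
The test in question is tiny; schematically (page_count() and RETRY as in
2.4, everything else in the real shm_swap_core omitted):

    /* Refuse to swap out a shm page while anyone else (a pte, a
     * kiobuf, ...) still holds a reference: they could redirty it
     * under the writeout. */
    if (page_count(page) != 1)
            return RETRY;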

> However this still makes me wonder why should we unmap the pte of a page that
> we can't free until we unmap_kiobuf? That's not as bad as having a
> nopage+swapout dummy operations in the sound driver DMA page case, because
> usually user-kiobufs are temporary just for the time of the DMA I/O, though.

It's no different from mlock(): if you've got some reason why one
specific mapping can't be unmapped (either it's locked or it is
inherently non-swappable like a kiobuf), then there's no reason to
make that per-mapping information visible at the per-page level.

We allow multiple processes to mlock() the same page, after all.

Cheers, 
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Stephen Tweedie

Hi,

On Tue, Oct 17, 2000 at 09:26:07PM -0700, Linus Torvalds wrote:

> I hated people mis-using it the way it's being done by the sound drivers,
> but because I also realize that it allows for some simplifications I do
> accept it - it's basically an ugly hack that doesn't really matter because
> the exact same code _is_ needed for the real IO mapping case anyway, and
> as far as the kernel is concerned the whole thing with PG_reserved is all
> just a game to make the kernel VM subsystem think that some real RAM is
> actually IO space and shouldn't be touched.

I've got kiobuf diffs which allow you to (a) map any kernel memory
(including vmalloced areas) into a kiobuf, and then (b) mmap any
kiobuf into user space using sane, faulting vmas (with the option of
prefaulting for performance if you want it).  Nobody is using it right
now so I wasn't planning on resending it to you until 2.5, but if you
want to give people a chance to start using it earlier I can send it
in.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 09:42:36PM -0700, Linus Torvalds wrote:
>  - get PTE entry, clear it out.
>  - if PTE was dirty, add the page to the swap cache, and mark it dirty,
>but DON'T ACTUALLY START THE IO!
>  - free the page.
> 
> Basically, we removed the page from the virtual mapping, and it's now in
> the LRU queues, and marked dirty there.
> 
> Then, we'd move the "writeout" part into the LRU queue side, and at that
> point I agree with you 100% that we probably should just delay it until
> there are no mappings available - ie we'd only write out a swap cache
> entry if the count == 1 (ie it only exists in the swap cache), because
> before that is true there are other people marking it dirty.

This change makes sense and I agree it would cover the problem. However I
prefer to clarify that doing it for the swap cache as described is not nearly
enough to cover the mm corruption (everything that gets written via a memory
balancing mechanism should do the same).

That said, I think it would be possible to do it for SHM and shared mappings too.

However this still makes me wonder why should we unmap the pte of a page that
we can't free until we unmap_kiobuf? That's not as bad as having a
nopage+swapout dummy operations in the sound driver DMA page case, because
usually user-kiobufs are temporary just for the time of the DMA I/O, though.

> twice, but I think you see what I'm talking about on a conceptual level.

I see.

> See? THAT, in my opinion, is the clean way to handle this all. 

Ok. I'm still not completely convinced that it's right to unmap a page "pinned"
(get_page()) on the physical layer but I think the above is conceptually a good
idea regardless of rawio, and if done everywhere it will avoid us to pin the
page in the pte. Only point left in not pinning the page in the pte is
performance wise that as said it's probably not big deal in real life as user
kiobufs usually stays there only for the duration of the I/O.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-18 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 09:26:07PM -0700, Linus Torvalds wrote:
> Maybe you mean PG_reserved? 

Yes of course. (sorry for the typo)

> Quite frankly, the way I'd conceptually prefer people do these kinds of
> DMA buffers etc is to just have a "nopage()" function, and let that
> nopage() function just fill in the page from the DMA buffer directly. This
> would work fine, and would get rid of all the playing with PG_reserved
> etc.

And it would generate useless page faults as well. What's the point
of unmapping a page that can't be freed? Once the page gets freed
the mapping just goes away via munmap anyways. What's the point of
introducing that performance hit?

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Wed, 18 Oct 2000, Andrea Arcangeli wrote:
> 
> > Are you suggesting something like: if it is reading from a page (ie
> > writing the contents of that page somewhere else), we don't lock it, but
> > if it is writing to a page, we lock it so that the dirty bit won't get
> > lost.
> 
> That wasn't what I suggested, but I also like the way you describe above.
> It makes sense.

I don't think it really makes sense - I can see that it works, but I don't
like the way it ties in the dirty bit handling with the lock bit handling.
I personally think they are (and should be) unrelated.

> > Sure, that works (modulo the fact that it still has the issues with
> > serializing mmap's and accesses to other areas in the same page). But do
> > you really claim that it's the clean solution?
> 
> It looks cleaner than swapping out a page while a device is writing
> to it in DMA under the swapout.

Note that _that_ is something I'd much rather handle another way entirely:
something I've long long wanted to do is to handle all swap-outs from the
"struct page *" rather than based on the VM scan.

Now, the way I've always envisioned this to work is that the VM scanning
function basically always does the equivalent of just

 - get PTE entry, clear it out.
 - if PTE was dirty, add the page to the swap cache, and mark it dirty,
   but DON'T ACTUALLY START THE IO!
 - free the page.

Basically, we removed the page from the virtual mapping, and it's now in
the LRU queues, and marked dirty there.

Then, we'd move the "writeout" part into the LRU queue side, and at that
point I agree with you 100% that we probably should just delay it until
there are no mappings available - ie we'd only write out a swap cache
entry if the count == 1 (ie it only exists in the swap cache), because
before that is true there are other people marking it dirty.
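
In rough C, the scheme reads something like this -- a toy sketch with
invented helper names (add_to_swap_cache_if_missing, start_swap_io,
scan_one_pte, maybe_writeout), not actual kernel code:

    /* VM scan side: drop the pte, transfer the dirty bit, and do NOT
     * start any IO. */
    static void scan_one_pte(struct page *page, pte_t pte)
    {
            if (pte_dirty(pte)) {
                    add_to_swap_cache_if_missing(page);
                    SetPageDirty(page);     /* remember the modification */
            }
            put_page(page);                 /* the pte's reference is gone */
    }

    /* LRU side: write out only once the swap cache holds the last
     * reference, so nobody can still be dirtying the page. */
    static int maybe_writeout(struct page *page)
    {
            if (page_count(page) != 1 || !PageDirty(page))
                    return 0;
            ClearPageDirty(page);
            start_swap_io(page);
            return 1;
    }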

What are the advantages of this approach? Never mind the kiobuf issues at
this point - I wanted to do this long before kiobuf's happened.

Basically, it means that we'd write out shared pages only once, instead of
initiating write-back multiple times for each mapping that the page exists
in. Right now this isn't actually much of a problem, simply because it's
actually fairly hard to get shared dirty pages that would get written out
twice, but I think you see what I'm talking about on a conceptual level.

It also makes the kiobuf dirty issues a _completely_ separate issue, and
makes it very clean to handle: what kiobuf's become is just a kind of
virtual "pseudo-address-space". A "kiobuf address space" doesn't have a
page table, but it ends up being basically equivalent to a virtual address
space without the TLB overhead. Like a "real" address space attached to a
process, it has dirty pages in it, and like the real address space it
informs the VM layer of that through the page dirty bit.

See? THAT, in my opinion, is the clean way to handle this all. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Wed, 18 Oct 2000, Andrea Arcangeli wrote:

> On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> > It so happens that the vmscan stuff won't ever remove a physical page
> > mapping, but that's simply because such a page CANNOT be swapped out. How
> 
> So if I write a mechanism that allows those driver-private-pages that are used
> for DMA of a soundcard to be swapped out, then you would remove the
> PG_referenced bitflag from them too?

PG_referenced?

Maybe you mean PG_reserved? 

Anyway, I didn't realize you were talking about the sound drivers use of
remap_page_range(). That's not the original reason for remap_page_range()
at all, and in fact it's the _ugly_ way to do things. It's simple and it
works, but it's not pretty.

Quite frankly, the way I'd conceptually prefer people do these kinds of
DMA buffers etc is to just have a "nopage()" function, and let that
nopage() function just fill in the page from the DMA buffer directly. This
would work fine, and would get rid of all the playing with PG_reserved
etc.

Obviously, such a page would never really get swapped out: you'd have a
"swapout()" function that would be a no-op (as the page would not truly be
swapped out: the page is actually held by the driver inside the kernel,
and the user mapping is just a _window_ into that kernel buffer). 

Now, remap_page_range() ends up giving this same functionality without
sound drivers needing to have nopage() or swapout() functions at all, and
obviously it's trivial to do this way. But you should realize that this
use of remap_page_range() is very much a hack - the whole point of
remap_page_range() was always to do the PCI (and legacy ISA region)
mapping for things like the X server etc that access hardware directly.

I hated people mis-using it the way it's being done by the sound drivers,
but because I also realize that it allows for some simplifications I do
accept it - it's basically an ugly hack that doesn't really matter because
the exact same code _is_ needed for the real IO mapping case anyway, and
as far as the kernel is concerned the whole thing with PG_reserved is all
just a game to make the kernel VM subsystem think that some real RAM is
actually IO space and shouldn't be touched.

But for going the other way (ie moving a user space page into kernel
space, a la kiobuf mapping) is rather different, and I think it would be a
big ugly hairy lie to mark such a user space page be "PCI memory", because
that user space page might in fact be a page cache page etc. It's not at
all the same kind of controlled lie that the sound driver ugly hack is..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Wed, Oct 18, 2000 at 01:00:48AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> > It so happens that the vmscan stuff won't ever remove a physical page
> > mapping, but that's simply because such a page CANNOT be swapped out. How
> 
> So if I write a mechanism that allows those driver-private-pages that are used
> for DMA of a soundcard to be swapped out, then you would remove the
> PG_referenced bitflag from them too?

Are you sure you don't mean PG_Reserved?

If they get removed from the process's working set, then sure --- why
not?  Why should the driver care?  As long as the page is pinned in
physical memory (ie. as long as it has a raised page count), the
driver won't mind at all.  If your vma has a nopage method to
reestablish the ptes on page fault, then swap would work fine, just as
long as we fix the problem we've already identified about swapout
writing pages early rather than once they are finally leaving the swap
cache.

It's a completely different matter if you're dealing with ioremap of
PCI-aperture pages, but that's mostly because they don't have valid
struct page *'s (at least not unless you have 4GB ram).

> What sense does it make to start writing a page out to disk while it is under
> DMA, when you don't know if the page is being changed under the swapout?

Umm, we've already said that we want to fix the raw IO problem by
setting the page dirty on initial swapout but doing the IO once we
know it's only in swap cache.  Do that and we won't ever write a swap
page that is mapped by any IO subsystem.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 01:02:01PM -0700, Linus Torvalds wrote:
> But what about things like:
>  - linearized circular buffers (where "linearized" means that the buffer
>is mapped twice or more consecutively virtually in memory, so that the
>user doesn't need to worry about the boundary conditions that normal
>circular buffers have)

This is exactly the case pointed out by SCT, and it should be trivial to
handle by browsing backwards through the maplist in the trylock slow path.
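
A sketch of that slow path (nr_pages and maplist are the real kiobuf
fields; the helper itself and its details are invented):

    /* Lock every page of a kiobuf.  A page that is already locked may
     * simply be an earlier entry of this same maplist (the linearized
     * circular buffer case), in which case we already hold the lock
     * and must not sleep on it. */
    static void lock_kiobuf_pages(struct kiobuf *iobuf)
    {
            int i, j;

            for (i = 0; i < iobuf->nr_pages; i++) {
                    struct page *page = iobuf->maplist[i];

                    if (!TryLockPage(page))
                            continue;               /* got the lock */
                    for (j = i - 1; j >= 0; j--)
                            if (iobuf->maplist[j] == page)
                                    break;          /* already ours */
                    if (j < 0)
                            lock_page(page);        /* truly contended */
            }
    }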

>  - multiple buffers per page.
> 
> And note in particular that you may not be able to page-align everything:
> and it can be extremely costly (both in memory use and in cache footprint)
> to always pad out your buffers etc.

See below for this one.

> It's a simple lookup function, looking up the physical pages that
> correspond to a specific range of virtual memory.
> 
> Nothing more.

Yep.

> > In fact the reason I don't like to put VM stuff into rawio is that I like
> > the clean design that you described:
> > 
> > o   lookup the physical page
> > o   do I/O on the physical page
> 
> If you like it, why do you want to break it?
> 
> You _say_ that you like it, but yet you want to add extra conditions like
> 
>  o while I/O is pending the virtual mapping is fixed

That's not the condition that I want to add. Everybody can do a mremap of the
vma while the I/O runs or as you pointed out everybody can munmap the
virtual memory area while the physical page is referenced by the kiobuf.

What I want to add is basically only:

o   swap_out doesn't unmap the pages mapped by a kiobuf from their pte
and in turn doesn't swap them out while they could be under I/O

> Are you suggesting something like: if it is reading from a page (ie
> writing the contents of that page somewhere else), we don't lock it, but
> if it is writing to a page, we lock it so that the dirty bit won't get
> lost.

That wasn't what I suggested, but I also like the way you describe above.
It makes sense.

It also seems to handle the case of multiple buffers in the same page.
When we do the pagein (if a pagein is necessary) all users of the page have to
wait anyways.  For the pageout they will be able to start writes at the same
time this way.

> Sure, that works (modulo the fact that it still has the issues with
> serializing mmap's and accesses to other areas in the same page). But do
> you really claim that it's the clean solution?

It looks cleaner than swapping out a page while a device is writing
to it in DMA under the swapout.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
> It so happens that the vmscan stuff won't ever remove a physical page
> mapping, but that's simply because such a page CANNOT be swapped out. How

So if I write a mechanism that allows those driver-private-pages that are used
for DMA of a soundcard to be swapped out, then you would remove the
PG_referenced bitflag from them too?

What sense does it make to start writing a page out to disk while it is under
DMA, when you don't know if the page is being changed under the swapout?
Furthermore, despite the swapout I/O, those pages can't be freed until we run
unmap_kiobuf.

> would you swap out the PCI space?

(minor note: that's not PCI space (the pci space case is handled by the
!VALID_PAGE(page) check). That's normal kernel memory exported to userspace
handled by the PageReserved(page) check.)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Tue, Oct 17, 2000 at 10:06:35PM +0200, Andrea Arcangeli wrote:

> > also don't see why any bug with kiobufs can't be fixed without the
> > expensive and complex pinning.
> 
> IMHO pinning the page in the pte is less expensive and less complex than making
> rawio and the VM aware of those issues.

raw IO will never be aware of the issues at all.  It's all done when
we do the page lookup.

What are the VM issues?  There are only two left that I know about.
One is the swap code --- we are evicting swap cache pages without
caring that there may be a user of the page who has written to it
since we did the unmap.  We got away with it until now because of the
assumption that shared swap cache was always readonly (anonymous
shared memory cannot swap through the swap cache for precisely this
reason).  In current 2.4, this is a trivial assumption to eliminate.

The other VM gotcha is threads plus fork --- a kiobuf mapping is
basically equivalent to a temporary shared mapping as far as the page
is concerned, but the vma is not marked shared.  So we're left with
the possibility of another thread forking, both MMs getting marked as
COW, and then if the page gets touched by the parent thread first, COW
results in the parent getting a new page and the child process
inheriting the page in which the raw IO is still happening.

Does this matter?  Maybe, if you want to play kiobuf tricks with
things like pipes and you care about preserving POSIX semantics in all
cases.  Not, if you're satisfied with "well, don't do that, it's
stupid!"  (Although it _would_ be nice to get the doubled pipe
bandwidth of davem's kiobuf pipes to thoroughly stomp on anybody
else's lmbench numbers...)

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> 
> IMHO pinning the page in the pte is less expensive and less complex than making
> rawio and the VM aware of those issues. (remap_page_range is so clean
> implementation exactly because it pins the page into the pte)

You keep on bringing up remap_page_range(), and it does nothing of the
sort.

remap_page_range() does one thing, and one thing only: it populates the
page tables with physical page mappings.

It has nothing to do with pinning.

It so happens that the vmscan stuff won't ever remove a physical page
mapping, but that's simply because such a page CANNOT be swapped out. How
would you swap out the PCI space?

I don't see why you consider that to have anything to do with kiobuf's.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 12:27:30PM -0700, David S. Miller wrote:
> Hint: smp_flush_tlb_page()
> 
> Current kiobufs never need to do that, under any circumstances.
> This is not by accident.

I don't understand. flush_tlb_page() done in the context of a thread won't care
about the state of the physical page. I don't see how it's related to kiobufs.

> I don't see what the big deal is in requiring that user applications
> do their own locking to protect page contents being modified in one
> thread while a direct I/O from it is happening in another thread.  I

We never talked about this issue in the previous emails. That is required
regardless of how we solve the MM corruption.

> also don't see why any bug with kiobufs can't be fixed without the
> expensive and complex pinning.

IMHO pinning the page in the pte is less expensive and less complex than making
rawio and the VM aware of those issues. (remap_page_range is so clean
implementation exactly because it pins the page into the pte)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Eric Lowe

Hello,

> For example if both threads are reading different part of disk using the same
> buffer that's also a wrong condition that will provide unpredictable result (or
> if they're reading the same part of disk why are they doing it twice?). If both
> threads are writing to different part of disk using the same buffer then you're
> right we could push more I/O at the same time by avoiding the locked bit but
> that case doesn't look very interesting to me.

*nod*

> > Also, note that one of my requirements to accept the direct IO patches in
> > the first place from Stephen was that I wanted them to NOT mess at all
> > with virtual memory mappings. Basically, the way kio buffers work is that
> > they are 100% based on only physical pages. There are no virtual issues at
> > all in the IO, and that's exactly how I want it. There is no reason to
> 
> That's kiobuf.
> 
> Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
> deals with physical pages. It never walks pagetables. It never cares about the
> VM. Those pages never become visible from userspace. All right.
> 
> But in map_user_kiobuf we can't avoid to play with the virtual memory,
> as we can't avoid to do that in the remap_page_range case.

I've been down that road, and I don't think we want to go there.
Most research shows that fast multiprocessors simply suffer too much
from the TLB shootdown costs to make up for the difference between
copying the page and not copying it.  Twiddling with pte's to make them
COW during the I/O is a waste of cycles and it doesn't gain you much
except in the threaded case, in which case the application was brain-dead
to begin with.

Is that what you're suggesting, making the pages COW?  Or am I missing
the point?

> 
> > conceptual standpoint at all. Because the "virtual" part does not even
> > _exist_ by the time we do IO.
> 
> The fact is that the "virtual" part _exists_ since the page that we are doing
> I/O from/to is mapped in the virtual address of the task and so the VM will play
> with it from under us.

Doesn't incrementing the page count mean we still have the page?  I mean,
w/o twiddling with the mapping, the buffer may change while we're doing
I/O on it, but I really don't care if that happens.. if it means I can
avoid 75% of the overhead in the page locking.  Since we're not doing
in-place I/O in the general case (e.g. read/write), I think any sane
people can deal with these changed semantics (fbuf-like, that is)

> In short, for the same reason that all drivers mapping kernel memory into
> userspace via remap_page_range first call mem_map_reserve on the physical
> pages.
> 
> Why should we make the VM and rawio more complex when we can simply pin the
> page in the pte, as mem_map_reserve has always done so far? I can see that you
> can't write the same buffer to two different parts of disk from two threads at
> the same time, but that looks like an uninteresting case, and if we need to
> optimize it I'd prefer to do it in another way than to put further VM stuff
> into rawio.

Again, because you're adding overhead for the brain-dead semantics.

I think this is Linus' point here -- twiddling with the VM doesn't really
gain you anything at all.  The I/O still happens either way, but by
doing it based ONLY on physical pages (as it is now), there's a LOT
less overhead and it still works correctly as long as the application
and driver are well behaved.

And I don't agree that we should ever try to do kiobuf() things in
the read/write general case.  I think that mapping files into
kernel buffers and copying the pages to/from the page cache
would suffice, and with the benefit that all page flushing could
then be done through the VM and bdflush eliminated.

--
Eric Lowe
Software Engineer, Systran Corporation
[EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> 
> For example if both threads are reading different part of disk using the same
> buffer that's also a wrong condition that will provide unpredictable result (or
> if they're reading the same part of disk why are they doing it twice?).

I'm not talking about users that do things that are obviously
meaningless.

But what about things like:
 - linearized circular buffers (where "linearized" means that the buffer
   is mapped twice or more consecutively virtually in memory, so that the
   user doesn't need to worry about the boundary conditions that normal
   circular buffers have)
 - multiple buffers per page.

And note in particular that you may not be able to page-align everything:
and it can be extremely costly (both in memory use and in cache footprint)
to always pad out your buffers etc.
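
For readers who have not met the first trick: a linearized ring can be
built from user space with nothing but mmap.  The sketch below uses modern
POSIX shm calls, which did not exist in 2000, and omits all error handling;
the point is simply that both halves of the mapping share the same physical
pages:

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map len bytes twice, back to back, so an access that runs off
     * the end of the ring lands at its beginning again. */
    void *linear_ring(size_t len)           /* len: page-aligned */
    {
            int fd = shm_open("/ring", O_RDWR | O_CREAT, 0600);
            char *base;

            shm_unlink("/ring");
            ftruncate(fd, len);
            base = mmap(NULL, 2 * len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            mmap(base, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0);
            mmap(base + len, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0);
            close(fd);
            return base;    /* base[i] aliases base[i + len] */
    }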

> But in map_user_kiobuf we can't avoid to play with the virtual memory,
> as we can't avoid to do that in the remap_page_range case.

map_user_kiobuf() isn't "playing" with the virtual memory.

It's a simple lookup function, looking up the physical pages that
correspond to a specific range of virtual memory.

Nothing more.

> In fact the reason I don't like to put VM stuff into rawio is that I like
> the clean design that you described:
> 
> o lookup the physical page
> o do I/O on the physical page

If you like it, why do you want to break it?

You _say_ that you like it, but yet you want to add extra conditions like

 o while I/O is pending the virtual mapping is fixed

Which basically means that you suddenly have lots of interactions with the
VM layer - even though everybody agrees that is a bad thing.

Your solution would be that you'd mark the _virtual_ page dirty, then lock
it so that the dirty bit cannot go away (and the page cannot be unmapped),
and then keep it locked until the IO is complete, and basically have all
dirty handling based on virtual addresses. Even though those virtual
addresses have _nothing_ to do with the IO itself, and you say that you
want to do IO on the physical page.

In contrast, I'm saying that direct IO that reads into a (physical) page
should obviously mark that (physical) page dirty. In the meantime, others
may choose to do IO to the same page. They could even use the same buffer,
because we don't lock things: this has extremely well-defined meaning in
the case of "write this page to these 100 devices asynchronously".

How would you suggest we handle the "write to 100 clients simultaneously"
issue?

Are you suggesting something like: if it is reading from a page (ie
writing the contents of that page somewhere else), we don't lock it, but
if it is writing to a page, we lock it so that the dirty bit won't get
lost.

Sure, that works (modulo the fact that it still has the issues with
serializing mmap's and accesses to other areas in the same page). But do
you really claim that it's the clean solution?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread David S. Miller

   Date: Tue, 17 Oct 2000 21:32:43 +0200
   From: Andrea Arcangeli <[EMAIL PROTECTED]>

   > Bye, bye, performance. You might as well remove the whole thing
   > completely.

   I don't think that is a common case relevant for performance. I've
   seen it only as a case that we must handle without losing
   stability.

Hint: smp_flush_tlb_page()

Current kiobufs never need to do that, under any circumstances.
This is not by accident.

I don't see what the big deal is in requiring that user applications
do their own locking to protect page contents being modified in one
thread while a direct I/O from it is happening in another thread.  I
also don't see why any bug with kiobufs can't be fixed without the
expensive and complex pinning.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 11:36:22AM -0700, Linus Torvalds wrote:
> On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> > 
> > > Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> > > lock down pages in virtual address space with multiple threads, and how
> > > you'd handle the cases of:
> > > 
> > >  - two threads doing direct IO from different parts of the same page
> > >  - one thread starting IO from a page, another thread unmapping the range
> > 
> > I don't see the problem with those two cases.
> > 
> > In the first case the first task will sleep in wait_on_page until the
> > second task will unmap the kiobuf.
> 
> Ehh.. Remind me again what the whole _point_ of physical IO was, in the
> first place?
> 
> Was it maybe PERFORMANCE?

It should certainly run fast :).

> Yeah. I thought so. 
> 
> Bye, bye, performance. You might as well remove the whole thing
> completely.

I don't think that is a common case relevant for performance. I've seen it only
as a case that we must handle without losing stability.

For example if both threads are reading different part of disk using the same
buffer that's also a wrong condition that will provide unpredictable result (or
if they're reading the same part of disk why are they doing it twice?). If both
threads are writing to different part of disk using the same buffer then you're
right we could push more I/O at the same time by avoiding the locked bit but
that case doesn't look very interesting to me.

> Also, note that one of my requirements to accept the direct IO patches in
> the first place from Stephen was that I wanted them to NOT mess at all
> with virtual memory mappings. Basically, the way kio buffers work is that
> they are 100% based on only physical pages. There are no virtual issues at
> all in the IO, and that's exactly how I want it. There is no reason to

That's kiobuf.

Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
deals with physical pages. It never walks pagetables. It never cares about the
VM. Those pages never become visible from userspace. All right.

But in map_user_kiobuf we can't avoid to play with the virtual memory,
as we can't avoid to do that in the remap_page_range case.

> conceptual standpoint at all. Because the "virtual" part does not even
> _exist_ by the time we do IO.

The fact is that the "virtual" part _exists_ since the page that we are doing
I/O from/to is mapped in the virtual address of the task and so the VM will play
with it from under us.

> And WHY do you want to pin the damn things virtually in the first place?

In short, for the same reason that all drivers mapping kernel memory into
userspace via remap_page_range first call mem_map_reserve on the physical
pages.
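
That idiom, for reference, is roughly the following 2.4-style sketch; buf
and size stand for a driver-allocated, physically contiguous kernel buffer
and are not taken from this thread:

    unsigned long kva;

    /* Mark each page reserved so the VM never touches it, then hand
     * the whole range to remap_page_range() from the driver's mmap()
     * method. */
    for (kva = buf; kva < buf + size; kva += PAGE_SIZE)
            mem_map_reserve(virt_to_page(kva));

    remap_page_range(vma->vm_start, virt_to_phys((void *) buf),
                     size, vma->vm_page_prot);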

Why should we make the VM and rawio more complex when we can simply pin the
page in the pte, as mem_map_reserve has always done so far? I can see that you
can't write the same buffer to two different parts of disk from two threads at
the same time, but that looks like an uninteresting case, and if we need to
optimize it I'd prefer to do it in another way than to put further VM stuff
into rawio.

In fact the reason I don't like to put VM stuff into rawio is that I like
the clean design that you described:

o   lookup the physical page
o   do I/O on the physical page

and not:

o   lookup the physical page
o   do I/O on the physical page
o   check what the physical page became, if it's swap cache
then set this bit, otherwise set this other bit and then
the VM will write us to disk again later because we put
some more checks in the VM to handle us

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> 
> > Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> > lock down pages in virtual address space with multiple threads, and how
> > you'd handle the cases of:
> > 
> >  - two threads doing direct IO from different parts of the same page
> >  - one thread starting IO from a page, another thread unmapping the range
> 
> I don't see the problem with those two cases.
> 
> In the first case the first task will sleep in wait_on_page until the
> second task will unmap the kiobuf.

Ehh.. Remind me again what the whole _point_ of physical IO was, in the
first place?

Was it maybe PERFORMANCE?

Yeah. I thought so. 

Bye, bye, performance. You might as well remove the whole thing
completely.

Also, note that one of my requirements to accept the direct IO patches in
the first place from Stephen was that I wanted them to NOT mess at all
with virtual memory mappings. Basically, the way kio buffers work is that
they are 100% based on only physical pages. There are no virtual issues at
all in the IO, and that's exactly how I want it. There is no reason to
confuse virtual addresses into this, because the thing should be usable
even in the complete absense of virtual mappings (ie the kernel can do
direct IO purely based on pages - think sendfile() etc).

What does this mean? It means that the notion of writing directly from
user space doesn't actually exist in the Linux implementation of direct IO
at all. What Linux direct IO does is really a two-phase operation:

 - look up the physical pages. 
 - do IO based on physical pages

The two are completely and entirely unrelated. Think "good design". Think
"modularity". Think "containment".

Which means, btw, that "virtual pinning" does not make sense from a
conceptual standpoint at all. Because the "virtual" part does not even
_exist_ by the time we do IO.

> The only "problematic" case that I can see is the one mentioned by SCT that is
> the same physical page present in two entries of the maplist in the kiobuf (in
> my patch the user_map_kiobuf will return -EINVAL gracefully in that case and
> that behaviour doesn't look horribly wrong to me). And that case can be
> trivially handled by checking the list backwards in the trylock slowpath as I
> just said.

And WHY do you want to pin the damn things virtually in the first place?

Your bug is a non-issue: it can obviously be fixed other ways, and has
nothing to do with virtually pinning the page. Yes, pinning the page in
the virtual map would also fix it,  but at the cost of completely breaking
the whole idea of what direct IO is all about.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 08:11:55PM -0700, Linus Torvalds wrote:
> Oh. So to fix a bug, you say "either delete the code, or do something else
> that is completely idiotic instead"?

I'm not saying this because the "something else" doesn't look completely idiotic
to me.

> Andrea, explain to me how pinning _could_ work? Explain to me how you'd
> lock down pages in virtual address space with multiple threads, and how
> you'd handle the cases of:
> 
>  - two threads doing direct IO from different parts of the same page
>  - one thread starting IO from a page, another thread unmapping the range

I don't see the problem with those two cases.

In the first case the first task will sleep in wait_on_page until the
second task will unmap the kiobuf.

The second case the I/O will run even while the page is not visible
to userspace anymore. Then when unmap_kiobuf will run after the I/O
completed the page will be put back in the freelist.

The only "problematic" case that I can see is the one mentioned by SCT that is
the same physical page present in two entries of the maplist in the kiobuf (in
my patch the user_map_kiobuf will return -EINVAL gracefully in that case and
that behaviour doesn't look horribly wrong to me). And that case can be
trivially handled by checking the list backwards in the trylock slowpath as I
just said.

> The second case would require that unmap() synchronize completely with the
> IO. [..]

unmap doesn't need to synchronize at all. If it works without the locked
bit, it will work with the locked bit too.

NOTE: in the patch I'm definitely also increasing the page count (that is
certainly necessary). I never suggested replacing get_page with
the TryLock. I'm suggesting to _add_ the TryLock to avoid having to deal
with the VM for something that is just under I/O and unfreeable anyways.
Maybe you understood I was suggesting to remove the get_page? In that case I
explained it badly, but you can check the patch to confirm that I never
removed the get_page.

> [..] Which is wasteful, and doesn't make any sense: what's the point, when
> you can avoid it by just not pinning?

The point is to avoid making the driver deal with the VM and the VM know
about new special cases invoked by the driver.  Exactly as a driver that uses
remap_page_range never knows about the VM, because it always keeps the pages
pinned in the pte via the reserved bit just for that purpose.

If you want to put VM knowledge inside device drivers and also put new
conditions in the VM in order to avoid pinning the pages mapped via
map_user_kiobuf in the pte (so with no gain), then you have my disagreement
about the design.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Mon, Oct 16, 2000 at 08:11:55PM -0700, Linus Torvalds wrote:

> I'm sure this bug will get fixed too. And the fix probably won't end up
> even being all that painful - it's probably a question of marking the page
> dirty after completing IO into it and making sure the swap-out logic does
> the right thing (ie try to write it out again - which is exactly the same
> thing that happens right now if a user dirties a page while it's busy
> doing write-out).
> 
> In fact, the code may do this already, I'll let sct look into the exact
> details and fix it.

I've looked, and it doesn't yet, but Rik has already marked the places
where this would be implemented and it looks simple (famous last
words): mark the swap cache page dirty at writeout and flush to disk
during refill_inactive.  Rik is sitting a few rows behind me right now
so we'll look at the code later today.  He's already been talking
about doing this anyway to improve the write clustering of the swapout
code.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Tue, Oct 17, 2000 at 12:13:49AM +0200, Andrea Arcangeli wrote:
> 
> Correct. But the problem is that the page won't stay in physical memory after
> we finished the I/O because swap cache with page count 1 will be freed by the
> VM.

Rik has been waiting for an excuse to get deferred swapout into the
mainline.  Sounds like we've got the excuse.

> And anyways from a design standpoint it looks much better to really pin the
> page in the pte too (just like kernel reserved pages are pinned after a
> remap_page_range).

Unfortunately, there is one common case where we want to do exactly
that.  "dd < /dev/zero > something_using_raw_io" maps a whole series
of identical readonly ZERO_PAGE pages into the kiobuf.  One of the
reasons I removed the automatic page locking was that otherwise we're
forced to special-case things like ZERO_PAGE in the locking code.

Even ignoring that, users _will_ submit multiple IOs in the same page.
Pinning the physical page with page->count is clean.  Doing the
locking with the page lock makes no sense if you have adjacent IOs or
if you want to maintain the kiobuf mapping for any length of time.
The point of kiobufs was to avoid VM hacks so that IO can be done at
physical page level.  Pinning ptes should not have anything to do with
the IO or we've lost that abstraction.

> Replacing the get_user/put_user with handle_mm_fault _after_ changing
> follow_page to check the dirty bit too in the write case should be ok.

Right.

> > Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> > to call the existing access_one_page() from ptrace.  You're right,
> 
> access_one_page is missing the pagetable lock too, but that seems the only
> problem. I'm not convinced mixing the internals of access_one_page and
> map_user_kiobuf is a good thing since they need to do very different things
> in the critical section.

Not the whole of access_one_page, but the pagetable-locked
follow-page / handle_mm_fault loop should be common code.  That's
where we're having the problem, so let's avoid having to maintain it
in two places.
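
For reference, the loop in question would look roughly like this (following
the follow_page/handle_mm_fault calls quoted elsewhere in this thread; the
error-path label and return-value handling are assumptions):

	down(&mm->mmap_sem);
	spin_lock(&mm->page_table_lock);
	while (!(map = follow_page(ptr, write))) {
		/* no usable pte yet: drop the lock and fault it in */
		spin_unlock(&mm->page_table_lock);
		err = handle_mm_fault(mm, vma, ptr, write);
		if (err <= 0)
			goto out_unlock;	/* SIGBUS or out of memory */
		spin_lock(&mm->page_table_lock);
	}
	/* map is the physical page: take a reference before unlocking */
	get_page(map);
	spin_unlock(&mm->page_table_lock);
	up(&mm->mmap_sem);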

Cheers, 
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 11:36:22AM -0700, Linus Torvalds wrote:
 On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
  
   Andrea, explain to me how pinning _could_ work? Explain to me how you'd
   lock down pages in virtual address space with multiple threads, and how
   you'd handle the cases of:
   
- two threads doing direct IO from different parts of the same page
- one thread starting IO from a page, another thread unmapping the range
  
  I don't see the problem with those two cases.
  
  In the first case the first task will sleep in wait_on_page until the
  second task unmaps the kiobuf.
 
 Ehh.. Remind me again what the whole _point_ of physical IO was, in the
 first place?
 
 Was it maybe PERFORMANCE?

It should certainly run fast :).

 Yeah. I thought so. 
 
 Bye, bye, performance. You might as well remove the whole thing
 completely.

I don't think that is a common case relevant for performance. I've seen it only
as a case that we must handle without losing stability.

For example if both threads are reading different parts of the disk using the
same buffer, that's also a wrong condition that will produce unpredictable
results (or if they're reading the same part of the disk, why are they doing it
twice?). If both threads are writing different parts of the disk using the same
buffer then you're right, we could push more I/O at the same time by avoiding
the locked bit, but that case doesn't look very interesting to me.

 Also, note that one of my requirements to accept the direct IO patches in
 the first place from Stephen was that I wanted them to NOT mess at all
 with virtual memory mappings. Basically, the way kio buffers work is that
 they are 100% based on only physical pages. There are no virtual issues at
 all in the IO, and that's exactly how I want it. There is no reason to

That's kiobuf.

Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
deals with physical pages. It never walks pagetables. It never cares about the
VM. Those pages never become visible from userspace. All right.

But in map_user_kiobuf we can't avoid playing with the virtual memory,
just as we can't avoid it in the remap_page_range case.

 conceptual standpoint at all. Because the "virtual" part does not even
 _exist_ by the time we do IO.

The fact is that the "virtual" part _exists_ since the page that we are doing
I/O from/to is mapped in the virtual address space of the task, and so the VM
will play with it from under us.

 And WHY do you want to pin the damn things virtually in the first place?

In short, for the same reason all drivers that map kernel memory into
userspace via remap_page_range first call mem_map_reserve on the physical
page.
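
For reference, that driver pattern looks roughly like this in a 2.4-era
driver (mydrv_buf and bufsize are hypothetical; the buffer is a normal
kernel allocation made at init time):

	/* at init: mark each buffer page reserved, so vmscan will leave
	   alone the ptes that remap_page_range() creates later */
	for (page = virt_to_page(mydrv_buf);
	     page <= virt_to_page(mydrv_buf + bufsize - 1); page++)
		mem_map_reserve(page);

	/* the driver's mmap() file operation */
	static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		if (remap_page_range(vma->vm_start, virt_to_phys(mydrv_buf),
				     size, vma->vm_page_prot))
			return -EAGAIN;
		return 0;
	}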

Why should we make the VM and rawio more complex when we can simply pin the
page in the pte, as mem_map_reserve has always done so far? I can see that you
can't write the same buffer to two different parts of the disk from two threads
at the same time, but that looks like an uninteresting case, and if we need to
optimize it I'd prefer to do it in another way than to put further VM stuff
into rawio.

In fact the reason I don't like to put VM stuff into rawio is that I like
the clean design that you described:

o   lookup the physical page
o   do I/O on the physical page

and not:

o   lookup the physical page
o   do I/O on the physical page
o   check what the physical page became: if it's swap cache
    then set this bit, otherwise set this other bit, and then
    the VM will write us to disk again later because we put
    some more checks in the VM to handle us

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread David S. Miller

   Date: Tue, 17 Oct 2000 21:32:43 +0200
   From: Andrea Arcangeli [EMAIL PROTECTED]

Bye, bye, performance. You might as well remove the whole thing
completely.

   I don't think that is a common case relevant for performance. I
   seen it only as a case that we must handle without losing
   stability.

Hint: smp_flush_tlb_page()

Current kiobufs never need to do that, under any circumstances.
This is not by accident.

I don't see what the big deal is in requiring that user applications
do their own locking to protect page contents being modified in one
thread while a direct I/O from it is happening in another thread.  I
also don't see why any bug with kiobufs can't be fixed without the
expensive and complex pinning.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
 
 For example if both threads are reading different parts of the disk using the
 same buffer, that's also a wrong condition that will produce unpredictable
 results (or if they're reading the same part of the disk, why are they doing it twice?).

I'm not talking about users that do things that are obviously
meaningless.

But what about things like:
 - linearized circular buffers (where "linearized" means that the buffer
   is mapped twice or more consecutively virtually in memory, so that the
   user doesn't need to worry about the boundary conditions that normal
   circular buffers have; a sketch follows below this list)
 - multiple buffers per page.

And note in particular that you may not be able to page-align everything:
and it can be extremely costly (both in memory use and in cache footprint)
to always pad out your buffers etc.
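
To make the first case concrete: a linearized circular buffer can be built
in user space by mapping the same backing object twice, back to back. A
rough sketch ("/myring" and the missing error handling are illustrative
only):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* size must be a multiple of the page size */
	static char *linearize_ring(size_t size)
	{
		int fd = shm_open("/myring", O_RDWR | O_CREAT, 0600);
		char *base;

		ftruncate(fd, size);
		/* reserve 2*size of address space, then map the same
		   pages into both halves: base[i] and base[i + size]
		   alias, so wraparound access is always one linear span */
		base = mmap(NULL, 2 * size, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mmap(base, size, PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_FIXED, fd, 0);
		mmap(base + size, size, PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_FIXED, fd, 0);
		close(fd);
		return base;
	}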

 But in map_user_kiobuf we can't avoid playing with the virtual memory,
 just as we can't avoid it in the remap_page_range case.

map_user_kiobuf() isn't "playing" with the virtual memory.

It's a simple lookup function, looking up the physical pages that
correspond to a specific range of virtual memory.

Nothing more.

 In fact the reason I don't like to put VM stuff into rawio is that I like
 the clean design that you described:
 
 o lookup the physical page
 o do I/O on the physical page

If you like it, why do you want to break it?

You _say_ that you like it, but yet you want to add extra conditions like

 o while I/O is pending the virtual mapping is fixed

Which basically means that you suddenly have lots of interactions with the
VM layer - even though everybody agrees that is a bad thing.

Your solution would be that you'd mark the _virtual_ page dirty, then lock
it so that the dirty bit cannot go away (and the page cannot be unmapped),
and then keep it locked until the IO is complete, and basically have all
dirty handling based on virtual addresses. Even though those virtual
addresses have _nothing_ to do with the IO itself, and you say that you
want to do IO on the physical page.

In contrast, I'm saying that direct IO that reads into a (physical) page
should obviously mark that (physical) page dirty. In the meantime, others
may choose to do IO to the same page. They could even use the same buffer,
because we don't lock things: this has extremely well-defined meaning in
the case of "write this page to these 100 devices asynchronously".

How would you suggest we handle the "write to 100 clients simultaneously"
issue?

Are you suggesting something like: if it is reading from a page (ie
writing the contents of that page somewhere else), we don't lock it, but
if it is writing to a page, we lock it so that the dirty bit won't get
lost.

Sure, that works (modulo the fact that it still has the issues with
serializing mmap's and accesses to other areas in the same page). But do
you really claim that it's the clean solution?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Eric Lowe

Hello,

 For example if both threads are reading different parts of the disk using the
 same buffer, that's also a wrong condition that will produce unpredictable
 results (or if they're reading the same part of the disk, why are they doing it
 twice?). If both threads are writing different parts of the disk using the same
 buffer then you're right, we could push more I/O at the same time by avoiding
 the locked bit, but that case doesn't look very interesting to me.

*nod*

  Also, note that one of my requirements to accept the direct IO patches in
  the first place from Stephen was that I wanted them to NOT mess at all
  with virtual memory mappings. Basically, the way kio buffers work is that
  they are 100% based on only physical pages. There are no virtual issues at
  all in the IO, and that's exactly how I want it. There is no reason to
 
 That's kiobuf.
 
 Example: we use kiobuf to handle COWs in the LVM snapshotting code. That only
 deals with physical pages. It never walks pagetables. It never cares about the
 VM. Those pages never become visible from userspace. All right.
 
 But in map_user_kiobuf we can't avoid playing with the virtual memory,
 just as we can't avoid it in the remap_page_range case.

I've been down that road, and I don't think we want to go there.
Most research shows that fast multiprocessors simply suffer too much
from the TLB shootdown costs to make up for the difference between
copying the page and not copying it.  Twiddling with pte's to make them
COW during the I/O is a waste of cycles and it doesn't gain you much
except in the threaded case, in which case the application was brain-dead
to begin with.

Is that what you're suggesting, making the pages COW?  Or am I missing
the point?

 
  conceptual standpoint at all. Because the "virtual" part does not even
  _exist_ by the time we do IO.
 
 The fact is that the "virtual" part _exists_ since the page that we are doing
 I/O from/to is mapped in the virtual address space of the task, and so the VM
 will play with it from under us.

Doesn't incrementing the page count mean we still have the page?  I mean,
w/o twiddling with the mapping, the buffer may change while we're doing
I/O on it, but I really don't care if that happens, if it means I can
avoid 75% of the overhead in the page locking.  Since we're not doing
in-place I/O in the general case (e.g. read/write), I think any sane
person can deal with these changed semantics (fbuf-like, that is).

 In short for the same reason all drivers mapping in userspace kernel memory
 via remap_page_range are first calling mem_map_reserve on the physical
 page.
 
 Why should we make the VM and rawio more complex when we can simply pin the
 page in the pte, as mem_map_reserve has always done so far? I can see that you
 can't write the same buffer to two different parts of the disk from two threads
 at the same time, but that looks like an uninteresting case, and if we need to
 optimize it I'd prefer to do it in another way than to put further VM stuff
 into rawio.

Again, because you're adding overhead for the brain-dead semantics.

I think this is Linus' point here -- twiddling with the VM doesn't really
gain you anything at all.  The I/O still happens either way, but by
doing it based ONLY on physical pages (as it is now), there's a LOT
less overhead and it still works correctly as long as the application
and driver are well behaved.

And I don't agree that we should ever try to do kiobuf() things in
the read/write general case.  I think that mapping files into
kernel buffers and copying the pages to/from the page cache
would suffice, and with the benefit that all page flushing could
then be done through the VM and bdflush eliminated.

--
Eric Lowe
Software Engineer, Systran Corporation
[EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 12:27:30PM -0700, David S. Miller wrote:
 Hint: smp_flush_tlb_page()
 
 Current kiobufs never need to do that, under any circumstances.
 This is not by accident.

I don't understand. flush_tlb_page() done in the context of a thread won't care
about the state of the physical page. I don't see how it's related to kiobufs.

 I don't see what the big deal is in requiring that user applications
 do their own locking to protect page contents being modified in one
 thread while a direct I/O from it is happening in another thread.  I

We never talked about this issue in the previous emails. That is required
regardless of how we solve the MM corruption.

 also don't see why any bug with kiobufs can't be fixed without the
 expensive and complex pinning.

IMHO pinning the page in the pte is less expensive and less complex than making
rawio and the VM aware of those issues. (remap_page_range is such a clean
implementation exactly because it pins the page into the pte.)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
 
 IMHO pinning the page in the pte is less expensive and less complex than making
 rawio and the VM aware of those issues. (remap_page_range is such a clean
 implementation exactly because it pins the page into the pte.)

You keep on bringing up remap_page_range(), and it does nothing of the
sort.

remap_page_range() does one thing, and one thing only: it populates the
page tables with physical page mappings.

It has nothing to do with pinning.

It so happens that the vmscan stuff won't ever remove a physical page
mapping, but that's simply because such a page CANNOT be swapped out. How
would you swap out the PCI space?

I don't see why you consider that to have anything to do with kiobuf's.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Tue, Oct 17, 2000 at 10:06:35PM +0200, Andrea Arcangeli wrote:

  also don't see why any bug with kiobufs can't be fixed without the
  expensive and complex pinning.
 
 IMHO pinning the page in the pte is less expensive and less complex than making
 rawio and the VM aware of those issues.

raw IO will never be aware of the issues at all.  It's all done when
we do the page lookup.

What are the VM issues?  There are only two left that I know about.
One is the swap code --- we are evicting swap cache pages without
caring that there may be a user of the page who has written to it
since we did the unmap.  We got away with it until now because of the
assumption that shared swap cache was always readonly (anonymous
shared memory cannot swap through the swap cache for precisely this
reason).  In current 2.4, this is a trivial assumption to eliminate.

The other VM gotcha is threads plus fork --- a kiobuf mapping is
basically equivalent to a temporary shared mapping as far as the page
is concerned, but the vma is not marked shared.  So we're left with
the possibility of another thread forking, both MMs getting marked as
COW, and then if the page gets touched by the parent thread first, COW
results in the parent getting a new page and the child process
inheriting the page in which the raw IO is still happening.

Does this matter?  Maybe, if you want to play kiobuf tricks with
things like pipes and you care about preserving POSIX semantics in all
cases.  Not, if you're satisfied with "well, don't do that, it's
stupid!"  (Although it _would_ be nice to get the doubled pipe
bandwidth of davem's kiobuf pipes to thoroughly stomp on anybody
else's lmbench numbers...)

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
 It so happens that the vmscan stuff won't ever remove a physical page
 mapping, but that's simply because such a page CANNOT be swapped out. How

So if I write a mechanism that allows those driver-private-pages that are used
for DMA of a soundcard to be swapped out, then you would remove the
PG_referenced bitflag from them too?

What sense does it make to start writing out to disk a page that is under DMA
when you don't know if this page is getting changed under the swapout?
Furthermore, despite the swapout I/O, those pages can't be freed until we run
unmap_kiobuf.

 would you swap out the PCI space?

(minor note: that's not PCI space (the pci space case is handled by the
!VALID_PAGE(page) check). That's normal kernel memory exported to userspace
handled by the PageReserved(page) check.)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Andrea Arcangeli

On Tue, Oct 17, 2000 at 01:02:01PM -0700, Linus Torvalds wrote:
 But what about things like:
  - linearized circular buffers (where "linearized" means that the buffer
is mapped twice or more consecutively virtually in memory, so that the
user doesn't need to worry about the boundary conditions that normal
circular buffers have)

This is exactly the case pointed out by SCT and this should be trivial to
handle just browsing backwards the maplist in the trylock slow path.

  - multiple buffers per page.
 
 And note in particular that you may not be able to page-align everything:
 and it can be extremely costly (both in memory use and in cache footprint)
 to always pad out your buffers etc.

See below for this one.

 It's a simple lookup function, looking up the physical pages that
 correspond to a specific range of virtual memory.
 
 Nothing more.

Yep.

  In fact the reason I don't like to put VM stuff into rawio is that I like
  the clean design that you described:
  
  o   lookup the physical page
  o   do I/O on the physical page
 
 If you like it, why do you want to break it?
 
 You _say_ that you like it, but yet you want to add extra conditions like
 
  o while I/O is pending the virtual mapping is fixed

That's not the condition that I want to add. Anybody can do a mremap of the
vma while the I/O runs or, as you pointed out, anybody can munmap the
virtual memory area while the physical page is referenced by the kiobuf.

What I want to add is basically only:

o   swap_out doesn't unmap the pages mapped by a kiobuf from their pte,
    and in turn it doesn't swap them out while they could be under I/O

 Are you suggesting something like: if it is reading from a page (ie
 writing the contents of that page somewhere else), we don't lock it, but
 if it is writing to a page, we lock it so that the dirty bit won't get
 lost.

That wasn't what I suggested, but I also like the way you describe above.
It makes sense.

It also seems to handle the case of multiple buffers in the same page.
When we do the pagein (if a pagein is necessary) all users of the page have to
wait anyways.  For the pageout they will be able to start writes at the same
time this way.

 Sure, that works (modulo the fact that it still has the issues with
 serializing mmap's and accesses to other areas in the same page). But do
 you really claim that it's the clean solution?

It looks cleaner than swapping out a page while a device is writing
to it in DMA under the swapout.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Stephen Tweedie

Hi,

On Wed, Oct 18, 2000 at 01:00:48AM +0200, Andrea Arcangeli wrote:
 On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
  It so happens that the vmscan stuff won't ever remove a physical page
  mapping, but that's simply because such a page CANNOT be swapped out. How
 
 So if I write a mechanism that allows those driver-private-pages that are used
 for DMA of a soundcard to be swapped out, then you would remove the
 PG_referenced bitflag from them too?

Are you sure you don't mean PG_Reserved?

If they get removed from the process's working set, then sure --- why
not?  Why should the driver care?  As long as the page is pinned in
physical memory (ie. as long as it has a raised page count), the
driver won't mind at all.  If your vma has a nopage method to
reestablish the ptes on page fault, then swap would work fine, just as
long as we fix the problem we've already identified about swapout
writing pages early rather than once they are finally leaving the swap
cache.

It's a completely different matter if you're dealing with ioremap of
PCI-aperture pages, but that's mostly because they don't have valid
struct page *'s (at least not unless you have 4GB ram).

 What sense does it make to start writing out to disk a page that is under DMA
 when you don't know if this page is getting changed under the swapout?

Umm, we've already said that we want to fix the raw IO problem by
setting the page dirty on initial swapout but doing the IO once we
know it's only in swap cache.  Do that and we won't ever write a swap
page that is mapped by any IO subsystem.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Wed, 18 Oct 2000, Andrea Arcangeli wrote:

 On Tue, Oct 17, 2000 at 02:04:10PM -0700, Linus Torvalds wrote:
  It so happens that the vmscan stuff won't ever remove a physical page
  mapping, but that's simply because such a page CANNOT be swapped out. How
 
 So if I write a mechanism that allows those driver-private-pages that are used
 for DMA of a soundcard to be swapped out, then you would remove the
 PG_referenced bitflag from them too?

PG_referenced?

Maybe you mean PG_reserved? 

Anyway, I didn't realize you were talking about the sound drivers' use of
remap_page_range(). That's not the original reason for remap_page_range()
at all, and in fact it's the _ugly_ way to do things. It's simple and it
works, but it's not pretty.

Quite frankly, the way I'd conceptually prefer people do these kinds of
DMA buffers etc is to just have a "nopage()" function, and let that
nopage() function just fill in the page from the DMA buffer directly. This
would work fine, and would get rid of all the playing with PG_reserved
etc.

Obviously, such a page would never really get swapped out: you'd have a
"swapout()" function that would be a no-op (as the page would not truly be
swapped out: the page is actually held by the driver inside the kernel,
and the user mapping is just a _window_ into that kernel buffer). 
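
A minimal sketch of that nopage()-based approach, using 2.4-era interfaces
(mydrv_buf and the offset arithmetic are assumptions, not code from any
real driver):

	static struct page *mydrv_nopage(struct vm_area_struct *vma,
					 unsigned long address, int write)
	{
		/* the mapping is a window into the in-kernel DMA buffer:
		   find the backing page for the faulting address */
		unsigned long off = (address - vma->vm_start)
				    + (vma->vm_pgoff << PAGE_SHIFT);
		struct page *page = virt_to_page(mydrv_buf + off);

		get_page(page);		/* reference for the new mapping */
		return page;
	}

	static int mydrv_swapout(struct page *page, struct file *file)
	{
		/* no-op: the kernel buffer is the page's real home */
		return 0;
	}

	static struct vm_operations_struct mydrv_vm_ops = {
		nopage:		mydrv_nopage,
		swapout:	mydrv_swapout,
	};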

Now, remap_page_range() ends up giving this same functionality without
sound drivers needing to have nopage() or swapout() functions at all, and
obviously it's trivial to do this way. But you should realize that this
use of remap_page_range() is very much a hack - the whole point of
remap_page_range() was always to do the PCI (and legacy ISA region)
mapping for things like the X server etc that access hardware directly.

I hated people mis-using it the way it's being done by the sound drivers,
but because I also realize that it allows for some simplifications I do
accept it - it's basically an ugly hack that doesn't really matter because
the exact same code _is_ needed for the real IO mapping case anyway, and
as far as the kernel is concerned the whole thing with PG_reserved is all
just a game to make the kernel VM subsystem think that some real RAM is
actually IO space and shouldn't be touched.

But going the other way (ie moving a user space page into kernel
space, a la kiobuf mapping) is rather different, and I think it would be a
big ugly hairy lie to mark such a user space page as "PCI memory", because
that user space page might in fact be a page cache page etc. It's not at
all the same kind of controlled lie that the sound driver ugly hack is..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-17 Thread Linus Torvalds



On Wed, 18 Oct 2000, Andrea Arcangeli wrote:
 
  Are you suggesting something like: if it is reading from a page (ie
  writing the contents of that page somewhere else), we don't lock it, but
  if it is writing to a page, we lock it so that the dirty bit won't get
  lost.
 
 That wasn't what I suggested, but I also like the way you describe above.
 It makes sense.

I don't think it really makes sense - I can see that it works, but I don't
like the way it ties in the dirty bit handling with the lock bit handling.
I personally think they are (and should be) unrelated.

  Sure, that works (modulo the fact that it still has the issues with
  serializing mmap's and accesses to other areas in the same page). But do
  you really claim that it's the clean solution?
 
 It looks cleaner than swapping out a page while a device is writing
 to it in DMA under the swapout.

Note that _that_ is something I'd much rather handle another way entirely:
something I've long long wanted to do is to handle all swap-outs from the
"struct page *" rather than based on the VM scan.

Now, the way I've always envisioned this to work is that the VM scanning
function basically always does the equivalent of just

 - get PTE entry, clear it out.
 - if PTE was dirty, add the page to the swap cache, and mark it dirty,
   but DON'T ACTUALLY START THE IO!
 - free the page.

Basically, we removed the page from the virtual mapping, and it's now in
the LRU queues, and marked dirty there.

Then, we'd move the "writeout" part into the LRU queue side, and at that
point I agree with you 100% that we probably should just delay it until
there are no mappings available - ie we'd only write out a swap cache
entry if the count == 1 (ie it only exists in the swap cache), because
before that is true there are other people marking it dirty.
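
In rough, schematic 2.4-flavoured C, that split would look something like
this (helper names are illustrative, not from any real patch):

	/* scan side: detach the pte, remember the dirtiness, do no IO */
	pte = ptep_get_and_clear(ptep);
	if (pte_dirty(pte)) {
		add_to_swap_cache(page, entry);	/* entry: a swap slot
						   allocated earlier */
		SetPageDirty(page);	/* record it, defer the writeout */
	}
	page_cache_release(page);	/* drop the mapping's reference */

	/* LRU side, much later: write out only when the swap cache is
	   the sole owner, so kiobuf users etc. are finished with it */
	if (PageDirty(page) && page_count(page) == 1)
		start_swap_writeout(page);	/* illustrative call */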

What are the advantges of this approach? Never mind the kiobuf issues at
this point - I wanted to do this long before kiobuf's happened.

Basically, it means that we'd write out shared pages only once, instead of
initiating write-back multiple times for each mapping that the page exists
in. Right now this isn't actually much of a problem, simply because it's
actually fairly hard to get shared dirty pages that would get written out
twice, but I think you see what I'm talking about on a conceptual level.

It also makes the kiobuf dirty issues a _completely_ separate issue, and
makes it very clean to handle: what kiobuf's become is just a kind of
virtual "pseudo-address-space". A "kiobuf address space" doesn't have a
page table, but it ends up being basically equivalent to a virtual address
space without the TLB overhead. Like a "real" address space attached to a
process, it has dirty pages in it, and like the real address space it
informs the VM layer of that through the page dirty bit.

See? THAT, in my opinion, is the clean way to handle this all. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> 
> If you won't delete map_user_kiobuf from your tree I think I've just provided a
> real world MM corruption case where the user send the bug report back to us if
> we only increase the reference count of the page to pin it.

Oh. So to fix a bug, you say "either delete the code, or do something else
that is completely idiotic instead"?

Sure, that's sensible. NOT.

Andrea, explain to me how pinning _could_ work? Explain to me how you'd
lock down pages in virtual address space with multiple threads, and how
you'd handle the cases of:

 - two threads doing direct IO from different parts of the same page
 - one thread starting IO from a page, another thread unmapping the range

Basically, you can't handle it sanely, because the notion of virtual
pinning really isn't a sane notion. The first case would need a special
"pinning count". Which is too expensive to be an option, although I've
seen patches that seemed to do something like that - I consider the whole
notion idiotic.

The second case would require that unmap() synchronize completely with the
IO. Which is wasteful, and doesn't make any sense: what's the point, when
you can avoid it by just not pinning?

Neither option is, quite frankly, acceptable.

So we're left with your suggestion to remove direct IO completely.
Something that I wouldn't mind horribly much, but too many people seem to
consider it worthwhile - and while I've stubbornly fought the direct-IO
patches a long time, every single technical argument I've had has been
successfully addressed over time.

I'm sure this bug will get fixed too. And the fix probably won't end up
even being all that painful - it's probably a question of marking the page
dirty after completing IO into it and making sure the swap-out logic does
the right thing (ie try to write it out again - which is exactly the same
thing that happens right now if a user dirties a page while it's busy
doing write-out).

In fact, the code may do this already, I'll let sct look into the exact
details and fix it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 03:21:11PM -0700, Linus Torvalds wrote:
> Pinning will not happen.

Pinning happens every day on my box while I use rawio.

If you want to avoid pinning _userspace_ pages then we should delete
map_user_kiobuf and define a new functionality and API to replace RAWIO for
DBMS that must bypass the kernel cache while doing I/O to disk and possibly
providing zero copy below highmem as rawio does. (later on we should
do the same for networking)

IMHO rawio is providing the above necessary functionality with a
sane userspace interface.

> (And remap_page_range() has nothing to do with pinning - they are just
> pages that cannot be swapped out because they are not normal pages at
> all).

They are _normal_ pages allocated by a device driver and made temporarily
visible to userspace by mapping them into the virtual address space of the
process _after_ "pinning" them using the PG_reserved bitflag. If we didn't pin
them too, they would be unmapped as well as soon as they're visible in the
virtual address space of the process.

I don't think the thing is much different. The main difference I can see is the
one that was buggy: that is remap_page_range doesn't have to care that the page
stays there while pinning it, because before pinning it it's still private and
not visible to the VM (that's why it's much simpler). map_user_kiobuf instead
is more complex because it must make sure that the page stays there while we
pin it (and that should be fixed now).

I hope we're not getting confused by the term "pin". With "pin" I always meant
preventing the userspace-visible page from going away from under us while we
use it from kernel space because of underlying VM activity. I don't see any
other possible meaning for "pin" in the context of map_user_kiobuf.

(in journaling we instead use "pin" to mean a page that can't be freed but that
also can't be written before some other transaction is committed to disk)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 02:59:59PM -0700, Linus Torvalds wrote:
> No. Because "pinning" is _stupid_.

Pinning using map_user_kiobuf looks like just the reverse of what we usually
do with the mmap callback of the device driver and remap_page_range.  I
considered "pinning" simply as converting the user page to one of the pages
mapped with remap_page_range, so the device driver can treat them in the same
way internally (and the ones mapped with remap_page_range don't get unmapped
and then swapped out, of course).

> Instead, the Linux answer is to say: pinning is bad, pinning is stupid,
> pinning is useless - so dont do it.

So just delete map_user_kiobuf from your tree if people shouldn't do it. That
would indeed fix the mm corruption too. I don't see why you provide that
functionality and then keep telling people not to use it. Also rawio
and networking could use the other way around. I think that's mainly a
matter of API. People are used to thinking in read/write terms. And in fact
rawio doesn't provide a completely transparent read/write because it has
constraints on the buffer alignment.

If you won't delete map_user_kiobuf from your tree I think I've just provided a
real world MM corruption case where the user send the bug report back to us if
we only increase the reference count of the page to pin it.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Linus Torvalds



On Tue, 17 Oct 2000, Andrea Arcangeli wrote:
> 
> And anyways from a design standpoint it looks much better to really pin the
> page in the pte too (just like kernel reserved pages are pinned after a
> remap_page_range).

No. 

Read my emails.

"Pinning" is stupid. There are no ifs, buts, or maybes about it.

Pinning will not happen.

(And remap_page_range() has nothing to do with pinning - they are just
pages that cannot be swapped out because they are not normal pages at
all).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 10:14:01PM +0100, Stephen C. Tweedie wrote:
> [..] If the VM
> chooses to unmap the page temporarily, then as long as the page
> remains in physical memory, then a subsequent page fault will
> reestablish the mapping. [..]

Correct. But the problem is that the page won't stay in physical memory after
we finished the I/O because swap cache with page count 1 will be freed by the
VM.  And even if it stays in physical memory because a read fault happened on the
swap cache before we freed it, we could have swap cache that isn't coherent
with the on-disk data (that could lead to the same problem but later).

And anyways from a design standpoint it looks much better to really pin the
page in the pte too (just like kernel reserved pages are pinned after a
remap_page_range).

> It's an important distinction, because we really would rather avoid
> taking the page lock.  If you happen to have the same page mapped into
> multiple VAs within the region being written, should the kernel
> object? [..]

I see your point, but if you want to allow that we should simply check whether
the page that we can't lock was already locked in an earlier part of the kiobuf
(just browsing backwards through the iobuf->maplist). I just don't think that
not locking the page is the right way to provide that feature.

I think it's not horribly wrong if people can't do map_user_kiobuf on virtual
pages pointing to the same physical page because that functionality is quite
special, just like rawio is quite special in requiring people to use
hardblocksize aligned buffers. And yes, we could also allow people to do
rawio without that userspace alignment constraint but with that constraint
we force them to do zero copy.
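
For comparison, that alignment constraint is visible to user code; a trivial
sketch (the device name and 512-byte sector size are assumptions):

	#include <fcntl.h>
	#include <malloc.h>
	#include <unistd.h>

	/* raw IO wants sector-aligned buffers, lengths and offsets */
	char *buf = memalign(512, 64 * 1024);
	int fd = open("/dev/raw1", O_RDONLY);

	read(fd, buf, 64 * 1024);	/* DMAs straight into buf */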

> shouldn't matter.  For some other applications, it might be important,
> which is why I wanted to keep the map_user_kiobuf() and lock_kiobuf()
> functions distinct.

I'm not sure which apps those are, but if we need to we can easily handle that
case by browsing backwards through the maplist in the TryLockPage == 1 slow path.

> Note that if you have a threaded application and another thread goes
> messing with the MM while your IO is in progress, then it's possible
> that the pages in the user's VA at the end of the IO are not the same
> as the ones that were there at the start.  No big deal, that's no
> different to the situation if you have any other read or write going
> on in parallel with other MM activity.

Yep, that looks ok.

> > +	if (!(map = follow_page(ptr, write))) {
> > +		char * user_ptr = (char *) ptr;
> > +		char c;
> > 		spin_unlock(&mm->page_table_lock);
> > +		up(&mm->mmap_sem);
> > +		if (get_user(c, user_ptr))
> > +			goto out;
> > +		if (write && put_user(c, user_ptr))
> > +			goto out;
> > +		down(&mm->mmap_sem);
> > +		goto refind;
> 
> looks unnecessarily messy.  We don't need the get_user() in ptrace:
> why do we need it here?  Similarly, the put_user should be dealt with
> by the handle_mm_fault.  We already absolutely rely on the fact that
> handle_mm_fault never continually fails to make progress for ever.  If
> it did, then a page fault could produce an infinite loop in the
> kernel.  

Replacing the get_user/put_user with handle_mm_fault _after_ changing
follow_page to check the dirty bit too in the write case should be ok.

If you want I can do this change plus the backwards maplist check for allowing
mapping the same physical page multiple times in a kiobuf (the unmap_kiobuf
will unlock the same physical page multiple times but that's ok).

> Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> to call the existing access_one_page() from ptrace.  You're right,

access_one_page is missing the pagetable lock too, but that seems the only
problem. I'm not convinced mixing the internals of access_one_page and
map_user_kiobuf is a good thing since they need to do very different things
in the critical section.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Linus Torvalds



On Mon, 16 Oct 2000, Andrea Arcangeli wrote:
> 
> If the page isn't locked swap_out will unmap it from the pte and anybody will
> be able to start any kind of regular VM I/O on the page.

Doesn't matter.

If you have increased the page count, the page _will_ stay in the page
cache. So everybody who wants to see the page that was mapped, will see it
with no possibility of getting an alias.

> This isn't what the
> programmer expects

If so, the programmer has shit for brains. That simple.

>and that's not what I consider pinning.

No. Because "pinning" is _stupid_.

Imagine having two threads that both do direct IO from overlapping pages.

YOU NEED TO COUNT THE PINNING.

A "lock" bit is useless, and anybody who uses a lock bit is stupid and
ends up having to serialize things for no good reason.

Instead, the Linux answer is to say: pinning is bad, pinning is stupid,
pinning is useless - so dont do it.

Instead, we have the notion of "I have a reference to this page, don't let
it go away". Sure, the page can be _unmapped_ (ie it is not pinned), but
WHO CARES?

Nobody.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 11:29:27AM -0700, Linus Torvalds wrote:
> The page count is (or should be) sufficient, and if it weren't sufficient
> that would be a bug in the swap-out handling of anonymous or shm memory. I

If the page isn't locked swap_out will unmap it from the pte and anybody will
be able to start any kind of regular VM I/O on the page. This isn't what the
programmer expects and that's not what I consider pinning. It just looks much
saner to completely pin it and to keep it mapped as if it were kernel memory
marked as reserved and then mapped to userspace via remap_page_range.  I don't
see any particular benefit in allowing swap_out to mess with pinned pages that
would make it worthwhile not to lock them.

That said (and the above arguments are more of a design issue), there's also a
practical problem that I can see in not taking the page locked.  What will
happen without the lock on the page is that swapout will unmap the pte while
adding the page to the swap cache and while writing it to disk. Then the page
will be clean and in the swap cache with reference count 2 because we take it
pinned and because it's part of the swap cache.  Then we do the DMA from disk
to the page but the page is still considered clean from the VM
because it's an unlocked swap cache page. But instead the contents of the page
aren't in sync anymore with the contents of the on-disk counterpart of the swap
cache after DMA completed. So that will again generate corruption because it
breaks VM assumptions on the swap cache (even if it's more subtle to trigger
than the other problems). Maybe even more subtle issues could arise by
not taking the lock on the page while it's pinned.

> refuse to see page locking or similar for this kind of pinning (counts are
> recursive and re-entrant, a "lock bit" is not).

In fact swapin_readahead was deadlocking exactly because the lock bit isn't
reentrant.  But the fix for that deadlock looks worthwhile regardless of the
pinning-user-memory-with-lock-bit issue.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Stephen Tweedie

Hi,

On Mon, Oct 16, 2000 at 12:08:54AM +0200, Andrea Arcangeli wrote:

> The basic problem is that map_user_kiobuf tries to map those pages by calling
> handle_mm_fault on their virtual addresses, thinking that when
> handle_mm_fault returns 1 the page is mapped.  That's wrong.

Good point --- good work.

> Furthermore in 2.4.x SMP (this couldn't happen in 2.2.x SMP at least) the page
> can also go away from under us even assuming handle_mm_fault did its work right
> by luck.

Hmm --- we had the BKL to protect against this when this code was
first done for 2.3.  That's definitely a regression, agreed.

> I'm also not convinced that only increasing the page count in the critical
> section in map_user_kiobuf is enough because swap_out doesn't care about the
> page count (in 2.2.x rawio it's taking the page lock). The page count is
> significant as a "pin" only for the page cache and not for anonymous or shm
> memory for example. swap_out can mess with anonymous memory with a page
> count > 1 for example.

This bit I think we have to talk about --- I'm not sure I agree.  I
dropped that automatic-page-locking from the map_user_kiobuf code
quite deliberately.  

Basically, we don't care at all about pinning the pte in the page
tables during the IO.  All we really care about is preserving the
mapping between the user's VA and the physical page.  If the VM
chooses to unmap the page temporarily, then as long as the page
remains in physical memory, a subsequent page fault will
reestablish the mapping.  The swap cache and page cache are sufficient
to make this guarantee.
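
The guarantee boils down to a single elevated reference per page, along these
lines (a sketch in the spirit of the 2.4 map_user_kiobuf/unmap_kiobuf pair;
i is the index into the kiobuf's page list):

    /* in map_user_kiobuf(), per page returned by follow_page(): */
    atomic_inc(&page->count);        /* may be unmapped, cannot be freed */
    iobuf->maplist[iobuf->nr_pages++] = page;

    /* in unmap_kiobuf(), per page: */
    __free_page(iobuf->maplist[i]);  /* drop the reference taken above */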

It's an important distinction, because we really would rather avoid
taking the page lock.  If you happen to have the same page mapped into
multiple VAs within the region being written, should the kernel
object?  If you're writing that region to disk, then it really
shouldn't matter.  For some other applications, it might be important,
which is why I wanted to keep the map_user_kiobuf() and lock_kiobuf()
functions distinct.

Note that if you have a threaded application and another thread goes
messing with the MM while your IO is in progress, then it's possible
that the pages in the user's VA at the end of the IO are not the same
as the ones that were there at the start.  No big deal, that's no
different to the situation if you have any other read or write going
on in parallel with other MM activity.

One final point: the new code,

> + if (!(map = follow_page(ptr, write))) {
> + char * user_ptr = (char *) ptr;
> + char c;
>   spin_unlock(&mm->page_table_lock);
> + up(&mm->mmap_sem);
> + if (get_user(c, user_ptr))
> + goto out;
> + if (write && put_user(c, user_ptr))
> + goto out;
> + down(&mm->mmap_sem);
> + goto refind;

looks unnecessarily messy.  We don't need the get_user() in ptrace:
why do we need it here?  Similarly, the put_user should be dealt with
by the handle_mm_fault.  We already absolutely rely on the fact that
handle_mm_fault never continually fails to make progress forever.  If
it did, then a page fault could produce an infinite loop in the
kernel.  
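
In other words, the retry loop could fault the page in directly, something
like the following sketch against the 2.4-test interfaces (vma is assumed to
have been looked up with find_vma; out_unmap is a stand-in for the function's
error path; as in the arch fault handlers, a return <= 0 means SIGBUS or OOM):

    spin_lock(&mm->page_table_lock);
    while (!(map = follow_page(ptr, write))) {
            spin_unlock(&mm->page_table_lock);
            if (handle_mm_fault(mm, vma, ptr, write) <= 0)
                    goto out_unmap;         /* SIGBUS or out of memory */
            spin_lock(&mm->page_table_lock);
    }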

Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
to call the existing access_one_page() from ptrace.  You're right,
this stuff is racy, so we should avoid duplicating it in memory.c.

Cheers, 
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Stephen Tweedie

Hi,

On Fri, Oct 13, 2000 at 12:30:49PM +0100, Malcolm Beattie wrote:

> free_kiovec(1, &iobuf);   /* does an implicit unlock_kiovec */
> 
> It doesn't do an unmap_kiobuf(iobuf) so I don't understand where
> the per-page map->count that map_user_kiobuf incremented gets
> decremented again. Anyone?

The 2.4 raw code I'm looking at does an explicit unmap_kiobuf after
each brw_kiovec().
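
So the full life cycle, as the 2.4 raw driver uses it, is roughly the
following (simplified from memory; error handling and the block-number setup
are omitted):

    struct kiobuf *iobuf;
    unsigned long blocks[KIO_MAX_SECTORS];   /* filled from the file offset */

    alloc_kiovec(1, &iobuf);
    map_user_kiobuf(rw, iobuf, user_addr, len); /* bumps each page's count */
    brw_kiovec(rw, 1, &iobuf, dev, blocks, sector_size);
    unmap_kiobuf(iobuf);                        /* drops the references */
    free_kiovec(1, &iobuf);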

> Lowlevel I/O on a kiovec can be done
> with something like an ll_rw_kiovec which sct said was going to get
> put in but since I haven't read anything more recent than
> 2.4.0-test5 at the moment, I can't say if it's there or what it
> looks like.

It's being maintained inside the SGI XFS tree right now.  They've got
it pretty stable under XFS load so I'll probably put together the
ll_rw_kio functionality using their low level code the next time I do
a kiovec release.  I believe that the XFS tree has 2.4.0-test9 support
for this code.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-16 Thread Andrea Arcangeli

On Mon, Oct 16, 2000 at 10:14:01PM +0100, Stephen C. Tweedie wrote:
> [..] If the VM
> chooses to unmap the page temporarily, then as long as the page
> remains in physical memory, a subsequent page fault will
> reestablish the mapping. [..]

Correct. But the problem is that the page won't stay in physical memory after
we finish the I/O, because a swap cache page with count 1 will be freed by the
VM.  And even if it stays in physical memory because a read fault happened on
the swap cache before we freed it, we could have a swap cache that isn't
coherent with the on-disk data (that could lead to the same problem, but
later).

And anyway, from a design standpoint it looks much better to really pin the
page in the pte too (just like kernel reserved pages are pinned after a
remap_page_range).

> It's an important distinction, because we really would rather avoid
> taking the page lock.  If you happen to have the same page mapped into
> multiple VAs within the region being written, should the kernel
> object? [..]

I see your point, but if you want to allow that we should simply check whether
the page that we can't lock is already locked by an earlier part of the kiobuf
(just browsing backwards through the iobuf->maplist). I just don't think that
not locking the page is the right way to provide that feature.

I think it's not horribly wrong if people can't do map_user_kiobuf on virtual
pages pointing to the same physical page because that functionality is quite
special, just like rawio is quite special in requiring people to use
hardblocksize aligned buffers. And yes, we could also allow people to do
rawio without that userspace alignment constraint but with that constraint
we force them to do zero copy.

> shouldn't matter.  For some other applications, it might be important,
> which is why I wanted to keep the map_user_kiobuf() and lock_kiobuf()
> functions distinct.

I'm not sure which apps those are, but if we need to we can easily handle that
case by browsing backwards through the maplist in the TryLockPage == 1 slow path.
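
That slow path could be as small as the following sketch (i is the number of
pages already mapped into this kiobuf; page_ok is a hypothetical label):

    if (TryLockPage(page)) {
            int j;
            /* locked: is it just an earlier entry of this same
             * kiobuf mapping the same physical page? */
            for (j = i - 1; j >= 0; j--)
                    if (iobuf->maplist[j] == page)
                            goto page_ok;   /* we hold the lock already */
            lock_page(page);                /* genuinely contended: sleep */
    }
    page_ok:
    iobuf->maplist[i++] = page;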

> Note that if you have a threaded application and another thread goes
> messing with the MM while your IO is in progress, then it's possible
> that the pages in the user's VA at the end of the IO are not the same
> as the ones that were there at the start.  No big deal, that's no
> different to the situation if you have any other read or write going
> on in parallel with other MM activity.

Yep, that looks ok.

> > +   if (!(map = follow_page(ptr, write))) {
> > +   char * user_ptr = (char *) ptr;
> > +   char c;
> >     spin_unlock(&mm->page_table_lock);
> > +   up(&mm->mmap_sem);
> > +   if (get_user(c, user_ptr))
> > +   goto out;
> > +   if (write && put_user(c, user_ptr))
> > +   goto out;
> > +   down(&mm->mmap_sem);
> > +   goto refind;
 
> looks unnecessarily messy.  We don't need the get_user() in ptrace:
> why do we need it here?  Similarly, the put_user should be dealt with
> by the handle_mm_fault.  We already absolutely rely on the fact that
> handle_mm_fault never continually fails to make progress forever.  If
> it did, then a page fault could produce an infinite loop in the
> kernel.

Replacing the get_user/put_user with handle_mm_fault _after_ changing
follow_page to also check the dirty bit in the write case should be ok.
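
The follow_page() change would amount to refusing, for writes, any pte that
is not both writable and dirty, so that no COW or dirty-marking fault can be
needed while the I/O is in flight. A sketch close to what the 2.4 mm/memory.c
helper looked like:

    static struct page * follow_page(unsigned long address, int write)
    {
            pgd_t *pgd = pgd_offset(current->mm, address);
            pmd_t *pmd;
            pte_t *pte;

            if (pgd_none(*pgd) || pgd_bad(*pgd))
                    return NULL;
            pmd = pmd_offset(pgd, address);
            if (pmd_none(*pmd) || pmd_bad(*pmd))
                    return NULL;
            pte = pte_offset(pmd, address);
            if (!pte_present(*pte))
                    return NULL;
            /* for writes, insist on writable *and* dirty, so no COW or
             * dirty-tracking fault can be needed under the I/O */
            if (write && !(pte_write(*pte) && pte_dirty(*pte)))
                    return NULL;
            return pte_page(*pte);
    }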

If you want I can do this change plus the backwards maplist check for allowing
mapping the same physical page multiple times in a kiobuf (the unmap_kiobuf
will unlock the same physical page multiple times but that's ok).

> Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> to call the existing access_one_page() from ptrace.  You're right,

access_one_page is missing the pagetable lock too, but that seems to be the
only problem. I'm not convinced that mixing the internals of access_one_page
and map_user_kiobuf is a good thing, since they need to do very different
things in the critical section.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


