Re: Implement mmap for PUD

2011-09-16 Thread Masao Uebayashi
I think mmap can work as follows:
- blktap(4) allocates shared ring memory
- pud(4) is attached to a parent blktap(4)
- userland asks to mmap the buffer (shared ring memory)
- UVM finds a VA range and attaches pud(4) there
- touching the buffer causes a fault; uvm_fault -> udv_fault -> pud_mmap
are called
- pud_mmap in turn calls the parent blktap(4)'s mmap, which returns
the ring buffer's physical address
- the ring buffer's address is entered into the MMU (udv_fault -> pmap_enter)
- userland can access shared ring memory
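
The pud_mmap step in that flow could be sketched roughly as follows. This is a hypothetical, non-compilable sketch: pud_lookup(), struct pud_softc, sc_parent, and blktap_mmap() are all invented names, not an existing pud(4) API.

```c
/*
 * Hypothetical sketch of pud(4)'s cdev mmap hook delegating to the
 * parent blktap(4) driver.  All names other than dev_t/off_t/paddr_t
 * are assumptions.
 */
static paddr_t
pud_cdev_mmap(dev_t dev, off_t off, int prot)
{
	struct pud_softc *sc = pud_lookup(dev);	/* invented lookup */

	if (sc == NULL || sc->sc_parent == NULL)
		return (paddr_t)-1;

	/*
	 * Ask the parent blktap(4) for the physical page backing this
	 * offset of the shared ring.  udv_fault() will then pmap_enter()
	 * the result into the faulting process's pmap.
	 */
	return blktap_mmap(sc->sc_parent, off, prot);
}
```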

On Sat, Sep 17, 2011 at 12:04 PM, Masao Uebayashi  wrote:
> OK, I've re-read this topic.  Your goal is to implement blktap on
> NetBSD/Xen, right?
>
> According to Xen wiki [1], blktap provides an interface for userland
> to handle block device I/O.  I guess blktap gets inter-domain, shared
> ring memory from hypervisor.  Dom0 userland mmaps the ring memory and
> handles I/O requests.
>
> pud(4) is different; it pretends to be a device driver backed by real
> H/W.  The kernel passes buffers to pud(4) so that pud(4) can read/write
> data from/to the real H/W, using either PIO or DMA.  PIO uses kernel
> address space to access the passed buffers.  DMA uses physical addresses.
>
> Here you want to mmap those buffers to userland, right?  I don't think
> it's possible.  Underlying pages of given buffers are marked "busy
> doing I/O" (PG_BUSY).  Users (either vnode/anon owners) are not
> allowed to map those pages until I/O completes.  If some pud(4)
> backend driver process suddenly tries to mmap those pages, VM will
> surely get upset.
>
> So a possible blktap(4) would look like:
> - run only in Xen Dom0
> - allocate (map) shared ring memory (both I/O requests and buffers)
> from hypervisor using Xen machine-dependent API
> - convert blktap I/O requests to NetBSD's bdev/strategy/struct buf format
> - and call pud(4) in kernel
> ?
>
> [1] http://wiki.xensource.com/xenwiki/blktap
>


Re: Implement mmap for PUD

2011-09-16 Thread Masao Uebayashi
OK, I've re-read this topic.  Your goal is to implement blktap on
NetBSD/Xen, right?

According to Xen wiki [1], blktap provides an interface for userland
to handle block device I/O.  I guess blktap gets inter-domain, shared
ring memory from hypervisor.  Dom0 userland mmaps the ring memory and
handles I/O requests.

pud(4) is different; it pretends to be a device driver backed by real
H/W.  The kernel passes buffers to pud(4) so that pud(4) can read/write
data from/to the real H/W, using either PIO or DMA.  PIO uses kernel
address space to access the passed buffers.  DMA uses physical addresses.

Here you want to mmap those buffers to userland, right?  I don't think
it's possible.  Underlying pages of given buffers are marked "busy
doing I/O" (PG_BUSY).  Users (either vnode/anon owners) are not
allowed to map those pages until I/O completes.  If some pud(4)
backend driver process suddenly tries to mmap those pages, VM will
surely get upset.

So a possible blktap(4) would look like:
- run only in Xen Dom0
- allocate (map) shared ring memory (both I/O requests and buffers)
from hypervisor using Xen machine-dependent API
- convert blktap I/O requests to NetBSD's bdev/strategy/struct buf format
- and call pud(4) in kernel
?

[1] http://wiki.xensource.com/xenwiki/blktap
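
The request-conversion step in the list above could look roughly like this. It is only a sketch: the blkif_request_t field names follow the Xen blkif ring ABI, but struct blktap_softc, its members, and blktap_iodone() are invented for illustration.

```c
/*
 * Hypothetical sketch: turn one blktap ring request into a struct buf
 * and hand it to the block layer, which ends up in pud(4)'s strategy.
 * Softc layout and helper names are assumptions.
 */
static void
blktap_to_buf(struct blktap_softc *sc, blkif_request_t *req)
{
	struct buf *bp = getiobuf(NULL, true);

	bp->b_flags = (req->operation == BLKIF_OP_READ) ? B_READ : B_WRITE;
	bp->b_blkno = req->sector_number;
	bp->b_bcount = req->nr_segments * PAGE_SIZE;	/* simplified */
	bp->b_data = sc->sc_ring_buf;	/* the shared ring memory */
	bp->b_dev = sc->sc_pud_dev;	/* the pud(4)-backed bdev */
	bp->b_iodone = blktap_iodone;	/* invented completion hook */

	/* Hand off to the bdev strategy routine. */
	bdev_strategy(bp);
}
```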


Re: Implement mmap for PUD

2011-09-12 Thread Eduardo Horvath
On Sat, 10 Sep 2011, Masao Uebayashi wrote:

> On Sat, Sep 10, 2011 at 7:24 PM, Roger Pau Monné
>  wrote:

> > PUD is a framework present in NetBSD that makes it possible to
> > implement character and block devices in userspace. I'm trying to
> > implement a blktap [1] driver purely in userspace, and the mmap
> > operation is needed (and it would also be beneficial for PUD to have
> > the full set of operations implemented, for future uses). The
> > implementation of a blktap driver using FUSE was discussed on the
> > port-xen mailing list, but the blktap driver needs a set of specific
> > operations on character and block devices.
> >
> > My main concern is whether it is possible to pass memory from a
> > userspace program to another through the kernel (that is mainly what
> > my mmap implementation tries to accomplish). I thought that I could
> > accomplish
> 
> It is called pipe(2), isn't it?

Did you forget a smiley there?  No, it isn't; that's page loaning.

I don't think the device mmap() infrastructure will work for you.  As I 
said before, it's designed to hand out unmanaged device physical memory 
and you're working with managed memory.  While you may be able to cobble 
together something that appears functional, it will probably not properly 
manage VM reference counts and eventually go belly up.  Keep in mind that 
device mmap is not done during the mmap() call itself.  Instead, nothing 
happens until there's a page fault, which is when the driver's mmap() 
routine is called to do the virtual-to-physical mapping and insert it 
into the pmap().
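
That fault-time behavior can be sketched as follows. This is a heavy simplification of what udv_fault() does, not the literal NetBSD source; the real code handles locking, page ranges, and error paths.

```c
/*
 * Simplified sketch of the fault path for a device mapping.  Only at
 * fault time is the driver's mmap hook consulted; the mmap() syscall
 * itself just set up the map entry.
 */
int
udv_fault_sketch(struct uvm_faultinfo *ufi, vaddr_t va, dev_t dev,
    off_t off, vm_prot_t prot)
{
	paddr_t pfn;

	/* Ask the driver which physical page backs this offset. */
	pfn = (*cdevsw_lookup(dev)->d_mmap)(dev, off, prot);
	if (pfn == (paddr_t)-1)
		return EINVAL;

	/* Enter it into the pmap (note: the hook returned a page number). */
	return pmap_enter(ufi->orig_map->pmap, va, ptoa(pfn),
	    prot, PMAP_CANFAIL | prot);
}
```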

I think you may need to create a new uvm object to hold the pages you want 
to share and attach it to the vmspaces of both the server process handing 
out the pages and the ... err ... client(?) process trying to do the mmap.
fork() does this as well as mmap() of a file and sysv_shm.  I think the 
set of operations in sysv_shm is the best bet since it's the closest to 
what you want to do.  
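
The sysv_shm-style approach boils down to something like the following kernel sketch, modeled on the pattern in sys/kern/sysv_shm.c. Error handling is omitted, and server_vm/client_vm are hypothetical handles on the two processes' vmspaces.

```c
/*
 * Sketch: one anonymous uvm object mapped into two vmspaces, so both
 * processes see the same pages.  Error paths trimmed.
 */
struct uvm_object *uobj;
vaddr_t srv_va = 0, cli_va = 0;
vsize_t size = round_page(len);

uobj = uao_create(size, 0);		/* anonymous, swap-backed object */

/* Map into the server's address space; one reference per mapping. */
uao_reference(uobj);
uvm_map(&server_vm->vm_map, &srv_va, size, uobj, 0, 0,
    UVM_MAPFLAG(UVM_PROT_RW, UVM_PROT_RW, UVM_INH_SHARE,
    UVM_ADV_RANDOM, 0));

/* Map the same object into the client's address space. */
uao_reference(uobj);
uvm_map(&client_vm->vm_map, &cli_va, size, uobj, 0, 0,
    UVM_MAPFLAG(UVM_PROT_RW, UVM_PROT_RW, UVM_INH_SHARE,
    UVM_ADV_RANDOM, 0));

/* Drop the creation reference; the two mappings keep the object alive. */
uao_detach(uobj);
```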

You will probably need to find some way to intercept the mmap() syscall 
and have it do something unique for the PUD device, maybe by fiddling with 
vnode OP vectors.  I don't know, but I don't think this will be 
straightforward.

Eduardo

Re: Implement mmap for PUD

2011-09-10 Thread Roger Pau Monné
> It is called pipe(2), isn't it?

Thanks for the reply, but I don't understand why pipe could be helpful
in this situation; the mmap kernel call needs to return a paddr_t (a
memory region), and pipe returns a pair of file descriptors, which I
cannot pass to the kernel. The flow of a PUD call is something like:

1. Program A makes a mmap call to a char device.
2. The kernel PUD driver handles the call and passes it to the
userspace daemon that registered the device.
3. The userspace PUD daemon allocates a memory region and returns the
address to the kernel PUD driver.
4. The kernel PUD driver translates the address to a physical
address and returns it to program A.
5. Program A receives the memory region.

I have achieved most of this process, but I think the memory used to
return the call should be marked as shared somehow, or locked, because
I get the kernel panic described in a previous message, even when
trying to lock the memory with uvm_vslock. I don't know if it's
possible to return a memory address allocated from userspace in an
mmap kernel implementation.

Regards, Roger.


Re: Implement mmap for PUD

2011-09-10 Thread Masao Uebayashi
On Sat, Sep 10, 2011 at 7:24 PM, Roger Pau Monné
 wrote:
> Hello,
>
> Thanks for the "atop" tip, now I'm able to pass the memory around, but
> the kernel crashes shortly after reading the value from the returned
> memory region:
>
> panic: kernel diagnostic assertion "uvm_page_locked_p(pg)" failed:
> file "/usr/src-current/src/sys/arch/x86/x86/pmap.c", line 3215
> cpu0: Begin traceback...
> copyright() at 8098aed4
> uvm_fault(0x80cbc5a0, 0x2bb13000, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 rip 8022d887 cs e030 rflags 10246 cr2
> 2bb13a50 cpl 0 rsp a0002bb130e0
> Skipping crash dump on recursive panic
> panic: trap
> Faulted in mid-traceback; aborting...rebooting...
>
> I still have to look into this (I'm going to try to mark the memory as
> locked using uvm_vslock, but I'm not really sure if that is going to
> help).
>
> 2011/9/9 Eduardo Horvath :
>> On Wed, 7 Sep 2011, Roger Pau Monné wrote:
>>
>>> Basically we use pud_request to pass the request to the user-space
>>> server, and the server returns a memory address, allocated in the
>>> user-space memory of it's process. Then I try to read the value of the
>>> user space memory from the kernel, which works ok, I can fetch the
>>> correct value. After reading the value (that is just used for
>>> debugging), the physical address of the memory region is collected
>>> using pmap_extract and returned.
>>
>> I'm not sure you can do this.  The mmap() interface in drivers is designed
>> to hand out unmanaged pages, not managed page frames.  Userland processes
>> use page frames to hold pages that could be paged out at any time.  You
>> could have nasty problems with wiring and reference counts.  What you
>> really need to do here is shared memory, not a typical device mmap().
>>
>> WHY do you want to do this?  What is PUD?  Why do you have kernel devices
>> backed by userland daemons?  I think a filesystem may be more appropriate
>> than a device in this case.
>
> PUD is a framework present in NetBSD that makes it possible to
> implement character and block devices in userspace. I'm trying to
> implement a blktap [1] driver purely in userspace, and the mmap
> operation is needed (and it would also be beneficial for PUD to have
> the full set of operations implemented, for future uses). The
> implementation of a blktap driver using FUSE was discussed on the
> port-xen mailing list, but the blktap driver needs a set of specific
> operations on character and block devices.
>
> My main concern is whether it is possible to pass memory from a
> userspace program to another through the kernel (that is mainly what
> my mmap implementation tries to accomplish). I thought that I could
> accomplish

It is called pipe(2), isn't it?


Re: Implement mmap for PUD

2011-09-10 Thread Roger Pau Monné
Hello,

Thanks for the "atop" tip, now I'm able to pass the memory around, but
the kernel crashes shortly after reading the value from the returned
memory region:

panic: kernel diagnostic assertion "uvm_page_locked_p(pg)" failed:
file "/usr/src-current/src/sys/arch/x86/x86/pmap.c", line 3215
cpu0: Begin traceback...
copyright() at 8098aed4
uvm_fault(0x80cbc5a0, 0x2bb13000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip 8022d887 cs e030 rflags 10246 cr2
2bb13a50 cpl 0 rsp a0002bb130e0
Skipping crash dump on recursive panic
panic: trap
Faulted in mid-traceback; aborting...rebooting...

I still have to look into this (I'm going to try to mark the memory as
locked using uvm_vslock, but I'm not really sure if that is going to
help).

2011/9/9 Eduardo Horvath :
> On Wed, 7 Sep 2011, Roger Pau Monné wrote:
>
>> Basically we use pud_request to pass the request to the user-space
>> server, and the server returns a memory address, allocated in the
>> user-space memory of it's process. Then I try to read the value of the
>> user space memory from the kernel, which works ok, I can fetch the
>> correct value. After reading the value (that is just used for
>> debugging), the physical address of the memory region is collected
>> using pmap_extract and returned.
>
> I'm not sure you can do this.  The mmap() interface in drivers is designed
> to hand out unmanaged pages, not managed page frames.  Userland processes
> use page frames to hold pages that could be paged out at any time.  You
> could have nasty problems with wiring and reference counts.  What you
> really need to do here is shared memory, not a typical device mmap().
>
> WHY do you want to do this?  What is PUD?  Why do you have kernel devices
> backed by userland daemons?  I think a filesystem may be more appropriate
> than a device in this case.

PUD is a framework present in NetBSD that makes it possible to
implement character and block devices in userspace. I'm trying to
implement a blktap [1] driver purely in userspace, and the mmap
operation is needed (and it would also be beneficial for PUD to have
the full set of operations implemented, for future uses). The
implementation of a blktap driver using FUSE was discussed on the
port-xen mailing list, but the blktap driver needs a set of specific
operations on character and block devices.

My main concern is whether it is possible to pass memory from a
userspace program to another through the kernel (that is mainly what
my mmap implementation tries to accomplish). I thought that I could
accomplish this by making the memory wired using mlock (which ensures
that the physical address of the memory is always the same) and then
passing this block of memory around, but it looks like it's going to
be much more complicated than that; suggestions are welcome.

Thanks for all the help, Roger.

[1] http://lxr.xensource.com/lxr/source/tools/blktap/


Re: Implement mmap for PUD

2011-09-09 Thread Eduardo Horvath
On Wed, 7 Sep 2011, Roger Pau Monné wrote:

> Basically we use pud_request to pass the request to the user-space
> server, and the server returns a memory address, allocated in the
> user-space memory of it's process. Then I try to read the value of the
> user space memory from the kernel, which works ok, I can fetch the
> correct value. After reading the value (that is just used for
> debugging), the physical address of the memory region is collected
> using pmap_extract and returned.

I'm not sure you can do this.  The mmap() interface in drivers is designed 
to hand out unmanaged pages, not managed page frames.  Userland processes 
use page frames to hold pages that could be paged out at any time.  You 
could have nasty problems with wiring and reference counts.  What you 
really need to do here is shared memory, not a typical device mmap().

WHY do you want to do this?  What is PUD?  Why do you have kernel devices 
backed by userland daemons?  I think a filesystem may be more appropriate 
than a device in this case.

Eduardo

Re: Implement mmap for PUD

2011-09-08 Thread David Young
On Wed, Sep 07, 2011 at 10:33:54AM +0200, Roger Pau Monné wrote:
> Hello,
> 
> Since there is no mmap implementation for PUD devices I began working
> on one. I would like to make an implementation that avoids copying
> buffers around user-space and kernel memory, since mmap is usually
> used for fast applications. I began working on pud_dev.c file, that
> contains the kernel implementation of the mmap call, which is then
> passed to user space. My mmap function looks like:
> 
> 
> static paddr_t
> pud_cdev_mmap(dev_t dev, off_t off, int flag)
> {
>   struct pud_creq_mmap pc_mmap;
>   struct vmspace *vm;
>   int error;
>   int num;
>   paddr_t pa;
> 
>   pc_mmap.pm_flag  = flag;
>   pc_mmap.pm_pageoff = off;
> 
>   printf("Inside mmap, off: %jd flag: %d\n", (intmax_t) off, flag);
> 
>   error = pud_request(dev, &pc_mmap, sizeof(pc_mmap),
>   PUD_REQ_CDEV, PUD_CDEV_MMAP);
>   if (error)
>   return (paddr_t) -1;
> 
>   mutex_enter(proc_lock);
>   pc_mmap.pm_proc = proc_find(pc_mmap.pid);
>   mutex_exit(proc_lock);
>   /* Catch error? */
> 
>   error = proc_vmspace_getref(pc_mmap.pm_proc, &vm);
>   if (error)
>   panic("Unable to get vmspace");
> 
>   /* Try to read the value */
>   if(copyin_proc(pc_mmap.pm_proc, (void *) pc_mmap.pm_addr, &num,
> sizeof(num)) == EFAULT)
>   panic("Unable to read value from user-space");
>   printf("Read: %d from addr: %" PRIxPADDR "\n", num, pc_mmap.pm_addr);
> 
>   if ((vaddr_t)pc_mmap.pm_addr & PAGE_MASK)
>   panic("pud_cdev_mmap: memory not page aligned");
> 
>   if (pmap_extract(vm->vm_map.pmap,
>   (vaddr_t) pc_mmap.pm_addr + (u_int) off, &pa) == FALSE)
>   panic("pud_cdev_mmap: memory page not mapped");
> 
>   uvmspace_free(vm);
>   printf("Inside mmap, returning: %" PRIxPADDR "\n", pa);
> 
>   return pa;
> }

I think that you need to return atop(pa) instead of pa.  I.e., a
driver's mmap() returns a physical page number, not a physical address.
Is this unnecessarily confusing?  Yes. :-)  Documented in any manual
page?  Maybe not.

I'm curious whether the pa -> atop(pa) change helps.

> void * memory;
> 
> vaddr_t
> test_mmap(dev_t dev, off_t off, int flag, void *auxdata)
> {
>   int *num;
>   if (off > 0)
>   return (vaddr_t)-1;
>   if (memory == NULL)
>   {
>   memory = malloc(PAGE_SIZE);
>   if (mlock(memory, PAGE_SIZE) < 0)
>   err(EXIT_FAILURE, "Unable to lock pages");

It's probably not desirable to mlock() every PUD device page, but if the
PUD device page gets paged out, what do the userspace device and/or the
kernel need to do about paging it back in?

Dave

-- 
David Young OJC Technologies
dyo...@ojctech.com  Urbana, IL * (217) 344-0444 x24


Re: Implement mmap for PUD

2011-09-07 Thread Jean-Yves Migeon
On 07.09.2011 10:33, Roger Pau Monné wrote:
> Since there is no mmap implementation for PUD devices I began working
> on one. I would like to make an implementation that avoids copying
> buffers around user-space and kernel memory, since mmap is usually
> used for fast applications. I began working on pud_dev.c file, that
> contains the kernel implementation of the mmap call, which is then
> passed to user space. My mmap function looks like:

> [snip]

> This *works* ok, the kernel doesn't panic, but if I perform a mmap on
> the device, the pud_cdev_mmap function is called forever with the same
> arguments, and the function never returns (I have to kill the
> program). Under a GENERIC kernel, I don't see any error messages, but
> using a XEN kernel, I see the following messages from the hypervisor:
> 
> (XEN) d0:v0: reserved bit in page table (ec=000D)
> (XEN) Pagetable walk from 7f7ff7ff6000:
> (XEN)  L4[0x0fe] = 000838c8b027 f374
> (XEN)  L3[0x1ff] = 000c0a3be027 00015c41
> (XEN)  L2[0x1bf] = 00083a701027 d8fe
> (XEN)  L1[0x1f6] = 8198ab000125 

[snip]

> That appears every time a mmap call returns. My guess is that I have
> to mark the memory I'm returning from mmap as shared somehow, but I
> don't know how (or if it is even possible to pass memory from
> user-space programs through the kernel). Any hint on how to solve this
> problem would be really appreciated.

Without looking at your code (or pud), somehow it manages to establish a
mapping with reserved bits set in the page entries, which is something
that will generate a fault.

(XEN) d0:v0: reserved bit in page table (ec=000D)

=> domain 0, VCPU 0, page fault caused by a reserved bit set in the page
table; error code 0x000D = 8 + 4 + 1:
8 - reserved bit overwrite
4 - happened while the CPU was in user mode (PGEX_U)
1 - protection violation (reserved bits are... reserved) (PGEX_P)

Apparently, the native kernel does not trap on this kind of error code [1]
[2]; NetBSD only defines PGEX_{P,W,U,X}. Making the fault explicit in
traps.c will probably reveal the same kind of fault as the one Xen reports.

Now, as to why and how you can fix it: dunno.

[1] http://nxr.netbsd.org/xref/src/sys/arch/x86/include/pte.h#41

[2] http://nxr.netbsd.org/xref/src/sys/arch/amd64/amd64/trap.c#525

-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr