Re: Addition to kauth(9) framework
In article &lt;20110829003259.913f014a...@mail.netbsd.org&gt;,
YAMAMOTO Takashi &lt;y...@mwd.biglobe.ne.jp&gt; wrote:
>> hi,
>>
>> I'd like to apply the attached patch.  It implements two things:
>>
>> - chroot(2)-ed process is given new kauth_cred_t with reference count
>>   equal to 1.
>
> can you find a way to avoid this?
>
> YAMAMOTO Takashi

He tried and I think that this is the minimal hook he needs.

christos
re: Addition to kauth(9) framework
In article &lt;20110829003259.913f014a...@mail.netbsd.org&gt;,
YAMAMOTO Takashi &lt;y...@mwd.biglobe.ne.jp&gt; wrote:
>>>> hi,
>>>>
>>>> I'd like to apply the attached patch.  It implements two things:
>>>>
>>>> - chroot(2)-ed process is given new kauth_cred_t with reference count
>>>>   equal to 1.
>>>
>>> can you find a way to avoid this?
>>>
>>> YAMAMOTO Takashi
>>
>> He tried and I think that this is the minimal hook he needs.
>
> do you mean that we need to unshare the credential unconditionally,
> regardless of whether his module is used or not?  why?

maybe it's just me, but i actually have absolutely no problem
with chroot unsharing kauth_cred_t by default.  it just seems
to have more generic safety aspects.

.mrg.
Re: CVS commit: src/sys/arch/xen
Cc:ing tech-kern, to get wider feedback.  Thread started here:
http://mail-index.netbsd.org/source-changes-d/2011/08/21/msg003897.html

JM == Jean-Yves Migeon &lt;jeanyves.mig...@free.fr&gt; writes:

 JM On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote:

> This is slightly more complicated than it appears.  Some of the ops
> in a per-cpu queue may have ordering dependencies with other cpu
> queues, and I think this would be hard to express trivially.  (an
> example would be a pte update on one queue, and a read of the same
> pte on another queue - these cases are quite analogous (although
> completely unrelated))
>
> Hi, So I had a better look at this - implemented per-cpu queues and
> messed with locking a bit:
>
> reads don't go through the xpq queue, do they ?

 JM Nope, PTEs are directly obtained from the recursive mappings
 JM (vtopte/kvtopte).

Let's call these out-of-band reads.  But see below for in-band reads.

 JM Content is obviously only writable by the hypervisor (so it can
 JM keep control of its mappings alone).

> I think this is similar to a tlb flush but the other way round; I
> guess we could use an IPI for this too.

 JM IIRC that's what the current native x86 code does: it uses an
 JM IPI to signal other processors that a shootdown is necessary.

Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI
(within the domain), which makes the queue ordering even more
important (to make sure that stale ptes are not reloaded before the
per-cpu queue has made progress).  Yes, we can implement a roundabout
ipi-driven queueflush + tlbflush scheme (described below), but that
would be performance sensitive, and the basic issue won't go away,
imho.

Let's stick to the xpq ops for a second, ignoring out-of-band reads
(for which I agree that your assertion, that locking needs to be done
at a higher level, holds true).
The question here, really, is: what are the global ordering
requirements of per-cpu memory op queues, given the following basic
ops:

i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)

ii) read memory via:
	MMUEXT_PIN_L1_TABLE
	MMUEXT_PIN_L2_TABLE
	MMUEXT_PIN_L3_TABLE
	MMUEXT_PIN_L4_TABLE
	MMUEXT_UNPIN_TABLE
	MMUEXT_NEW_BASEPTR
	MMUEXT_TLB_FLUSH_LOCAL
	MMUEXT_INVLPG_LOCAL
	MMUEXT_TLB_FLUSH_MULTI
	MMUEXT_INVLPG_MULTI
	MMUEXT_TLB_FLUSH_ALL
	MMUEXT_INVLPG_ALL
	MMUEXT_FLUSH_CACHE
	MMUEXT_NEW_USER_BASEPTR

(ie; anything that will cause the processor to re-read data updated
via another cpu (via, for eg: a pte update with i) above))

There's two ways I can think of fixing this:

a) *before* queueing a local read-op, synchronously flush queues on
all other CPUs via ipis.  This is slightly racy, but can be done, I
think.  An optimisation for invlpg could be to implement a scoreboard
that watches mem. locations that have been queued for update on any
cpu.  Scan through the scoreboard for the memory range we're
invlpg-ing.  If it's not there, there's no need to flush any queues on
other cpus.

b) read-ops wait on a global condvar.  If it's set, a write-op that
needs flushing is pending.  Wait (with optional timeout and ipi-nudge)
until the remote queue is flushed.  When flushing a queue, send a
cv_broadcast to any waiters.

Option b) is slightly better than my current scheme, which is to lock
any and all mmu-ops and operate the queues synchronously (via
XENDEBUG_SYNC).  I cannot think of anything else, other than ad-hoc
locking + queue flushing, which could be hard to maintain and debug in
the long run.

I'm thinking that it might be easier and more justifiable to nuke the
current queue scheme and implement shadow page tables, which would fit
more naturally and efficiently with CAS pte updates, etc.
> I'm not sure this would completely fix the issue: with shadow page
> tables you can't use a CAS to assure atomic operation with the
> hardware TLB, as this is, precisely, a shadow PT and not the one
> used by hardware.

Definitely worth looking into.  imho, I'm not very comfortable with
the queue-based scheme for MP.  The CAS doesn't provide any guarantees
with the TLB on native h/w, afaict.  If you do a CAS pte update, and
the update succeeded, it's a good idea to invalidate + shootdown
anyway (even on baremetal).

Do let me know your thoughts,

Cheers,
-- 
Cherry
Re: CVS commit: src/sys/arch/xen
On Mon, Aug 29, 2011 at 12:07:05PM +0200, Cherry G. Mathew wrote:
>  JM On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote:
>
> > This is slightly more complicated than it appears.  Some of the ops
> > in a per-cpu queue may have ordering dependencies with other cpu
> > queues, and I think this would be hard to express trivially.  (an
> > example would be a pte update on one queue, and a read of the same
> > pte on another queue - these cases are quite analogous (although
> > completely unrelated))
> >
> > Hi, So I had a better look at this - implemented per-cpu queues and
> > messed with locking a bit:
> >
> > reads don't go through the xpq queue, do they ?
>
>  JM Nope, PTEs are directly obtained from the recursive mappings
>  JM (vtopte/kvtopte).
>
> Let's call these out-of-band reads.  But see below for in-band reads.
>
>  JM Content is obviously only writable by the hypervisor (so it can
>  JM keep control of its mappings alone).
>
> > I think this is similar to a tlb flush but the other way round; I
> > guess we could use an IPI for this too.
>
>  JM IIRC that's what the current native x86 code does: it uses an
>  JM IPI to signal other processors that a shootdown is necessary.
>
> Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI
> (within the domain), which makes the queue ordering even more
> important (to make sure that stale ptes are not reloaded before the
> per-cpu queue has made progress).  Yes, we can implement a roundabout
> ipi-driven queueflush + tlbflush scheme (described below), but that
> would be performance sensitive, and the basic issue won't go away,
> imho.
>
> Let's stick to the xpq ops for a second, ignoring out-of-band reads
> (for which I agree that your assertion, that locking needs to be done
> at a higher level, holds true).
> The question here, really, is: what are the global ordering
> requirements of per-cpu memory op queues, given the following basic
> ops:
>
> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
>
> ii) read memory via:
> 	MMUEXT_PIN_L1_TABLE
> 	MMUEXT_PIN_L2_TABLE
> 	MMUEXT_PIN_L3_TABLE
> 	MMUEXT_PIN_L4_TABLE
> 	MMUEXT_UNPIN_TABLE

This is when adding/removing a page table from a pmap.  When this
occurs, the pmap is locked, isn't it ?

> 	MMUEXT_NEW_BASEPTR
> 	MMUEXT_NEW_USER_BASEPTR

This is a context switch.

> 	MMUEXT_TLB_FLUSH_LOCAL
> 	MMUEXT_INVLPG_LOCAL
> 	MMUEXT_TLB_FLUSH_MULTI
> 	MMUEXT_INVLPG_MULTI
> 	MMUEXT_TLB_FLUSH_ALL
> 	MMUEXT_INVLPG_ALL
> 	MMUEXT_FLUSH_CACHE

This may, or may not, cause a read.  This usually happens after
updating the pmap, and I guess this also happens with the pmap locked
(I have not carefully checked).

So couldn't we just use the pmap lock for this ?  I suspect the same
lock will also be needed for out-of-band reads at some point (right
now it's protected by splvm()).

[...]

> I'm thinking that it might be easier and more justifiable to nuke the
> current queue scheme and implement shadow page tables, which would
> fit more naturally and efficiently with CAS pte updates, etc.

I'm not sure this would completely fix the issue: with shadow page
tables you can't use a CAS to assure atomic operation with the
hardware TLB, as this is, precisely, a shadow PT and not the one used
by hardware.

> Definitely worth looking into.  imho, I'm not very comfortable with
> the queue-based scheme for MP.  The CAS doesn't provide any
> guarantees with the TLB on native h/w, afaict.

What makes you think so ?  I think the hw TLB also does CAS to update
referenced and dirty bits in the PTE, otherwise we couldn't rely on
these bits; this would be bad especially for the dirty bit.

> If you do a CAS pte update, and the update succeeded, it's a good
> idea to invalidate + shootdown anyway (even on baremetal).

Yes, of course inval is needed after updating the PTE.
But using a true CAS is important to get the referenced and dirty bits
right.

-- 
Manuel Bouyer &lt;bou...@antioche.eu.org&gt;
     NetBSD: 26 years of experience will always make the difference
Re: Addition to kauth(9) framework
On Mon 29 Aug 2011 at 13:32:50 +0200, Martin Husemann wrote:
> On Mon, Aug 29, 2011 at 12:13:38PM +0200, Aleksey Cheusov wrote:
> > we will lose our data.  Data set by the first listener will be
> > overridden by the second listener.  This is not just a waste of
> > time.
>
> Yes, but it is a design bug in the modules or in kauth and unrelated
> to the (un-)sharing, isn't it?

My expectation would be that if the first module unshares, the newly
unshared data is passed to the second module, which can unshare it
again.  (I have not looked at the code to check if that is what
happens, or even if it is easy to make it so.)

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- There's no point being grown-up if you
\X/ rhialto/at/xs4all.nl    -- can't be childish sometimes.  -The 4th Doctor
re: Addition to kauth(9) framework
On Aug 29, 7:54pm, m...@eterna.com.au (matthew green) wrote:
-- Subject: re: Addition to kauth(9) framework

| In article &lt;20110829003259.913f014a...@mail.netbsd.org&gt;,
| YAMAMOTO Takashi &lt;y...@mwd.biglobe.ne.jp&gt; wrote:
| hi,
|
| I'd like to apply the attached patch.
| It implements two things:
|
| - chroot(2)-ed process is given new kauth_cred_t with reference count
|   equal to 1.
|
| can you find a way to avoid this?
|
| YAMAMOTO Takashi
|
| He tried and I think that this is the minimal hook he needs.
|
| do you mean that we need to unshare the credential unconditionally,
| regardless of whether his module is used or not?  why?
|
| maybe it's just me, but i actually have absolutely no problem
| with chroot unsharing kauth_cred_t by default.  it just seems
| to have more generic safety aspects.

I share the same sentiment; I don't see the change as a big deal.

christos
Re: CVS commit: src/sys/arch/xen
On Mon, Aug 29, 2011 at 03:03:37PM +0200, Cherry G. Mathew wrote:
> Hi Manuel,
>
> Manuel == Manuel Bouyer &lt;bou...@antioche.eu.org&gt; writes:
>
> [...]
>
> >> Xen's TLB_FLUSH operation is synchronous, and doesn't require an
> >> IPI (within the domain), which makes the queue ordering even more
> >> important (to make sure that stale ptes are not reloaded before
> >> the per-cpu queue has made progress).  Yes, we can implement a
> >> roundabout ipi-driven queueflush + tlbflush scheme (described
> >> below), but that would be performance sensitive, and the basic
> >> issue won't go away, imho.
> >>
> >> Let's stick to the xpq ops for a second, ignoring out-of-band
> >> reads (for which I agree that your assertion, that locking needs
> >> to be done at a higher level, holds true).
> >>
> >> The question here, really, is: what are the global ordering
> >> requirements of per-cpu memory op queues, given the following
> >> basic ops:
> >>
> >> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
> >>
> >> ii) read memory via:
> >> 	MMUEXT_PIN_L1_TABLE
> >> 	MMUEXT_PIN_L2_TABLE
> >> 	MMUEXT_PIN_L3_TABLE
> >> 	MMUEXT_PIN_L4_TABLE
> >> 	MMUEXT_UNPIN_TABLE
>
>  Manuel This is when adding/removing a page table from a pmap.  When
>  Manuel this occurs, the pmap is locked, isn't it ?
>
> >> 	MMUEXT_NEW_BASEPTR
> >> 	MMUEXT_NEW_USER_BASEPTR
>
>  Manuel This is a context switch.
>
> >> 	MMUEXT_TLB_FLUSH_LOCAL
> >> 	MMUEXT_INVLPG_LOCAL
> >> 	MMUEXT_TLB_FLUSH_MULTI
> >> 	MMUEXT_INVLPG_MULTI
> >> 	MMUEXT_TLB_FLUSH_ALL
> >> 	MMUEXT_INVLPG_ALL
> >> 	MMUEXT_FLUSH_CACHE
>
>  Manuel This may, or may not, cause a read.  This usually happens
>  Manuel after updating the pmap, and I guess this also happens with
>  Manuel the pmap locked (I have not carefully checked).
>  Manuel So couldn't we just use the pmap lock for this ?  I suspect
>  Manuel the same lock will also be needed for out of band reads at
>  Manuel some point (right now it's protected by splvm()).
>
> I'm a bit confused now - are we assuming that the pmap lock protects
> the (pte update op queue-push(es) + pmap_pte_flush()) as a single
> atomic operation (ie; no invlpg/tlbflush or out-of-band-read can
> occur between the update(s) and the pmap_pte_flush()) ?
Out-of-band reads can always occur; there's no lock which can protect
against this.

> If so, I think I've slightly misunderstood the scope of the mmu
> queue design - I assumed that the queue is longer-lived, and the
> sync points (for the queue flush) can span across pmap locking - a
> sort of lazy pte update, with the queue being flushed at out-of-band
> or in-band read time (I guess that won't work though - how does one
> know when the hardware walks the page table ?).  It seems that the
> queue is meant for pte updates in loops, for eg:, quickly followed
> by a flush.  Is this correct ?

It was not explicitly designed this way, but I think that's how things
are in practice, yes.  Usage would need to be checked, though.  There
may be some special case in the kernel pmap area ...

> If so, there's just one hazard afaict - the synchronous
> TLB_FLUSH_MULTI could beat the race between the queue update and the
> queue flush via pmap_tlb_shootnow() (see pmap_tlb.c on the
> cherry-xenmp branch), and *if* other CPUs reload their TLBs before
> the flush, they'll have stale info.
>
> So the important question (rmind@ ?) is, is pmap_tlb_shootnow()
> guaranteed to be always called with the pmap lock held ?

I don't think so; but I also don't think that's the problem.  There
shouldn't be more races with this than on native hardware.

> In real life, I removed the global xpq_queue_lock() and the pmap was
> falling apart.  So a bit of debugging ahead.

Hum, on second thought, something more may be needed to protect the
queue.  The pmap lock probably won't work for pmap_kernel ...

> [...]
>
> >> I'm thinking that it might be easier and more justifiable to nuke
> >> the current queue scheme and implement shadow page tables, which
> >> would fit more naturally and efficiently with CAS pte updates,
> >> etc.
>
> > I'm not sure this would completely fix the issue: with shadow page
> > tables you can't use a CAS to assure atomic operation with the
> > hardware TLB, as this is, precisely, a shadow PT and not the one
> > used by hardware.
> Definitely worth looking into.  imho, I'm not very comfortable with
> the queue-based scheme for MP.
>
>  Manuel What makes you think so ?  I think the hw TLB also does CAS
>  Manuel to update referenced and dirty bits in the PTE, otherwise we
>  Manuel couldn't rely on these bits; this would be bad especially
>  Manuel for the dirty bit.
>
> Yes, I missed that one (which is much of the point of the CAS in the
> first place!), you're right.
>
> >> If you do a CAS pte update, and the update succeeded, it's a good
> >> idea to invalidate + shootdown anyway (even on baremetal).
Re: Addition to kauth(9) framework
On Mon, Aug 29, 2011 at 09:19:11AM -0400, Christos Zoulas wrote:
> On Aug 29, 7:54pm, m...@eterna.com.au (matthew green) wrote:
> -- Subject: re: Addition to kauth(9) framework
>
> | In article &lt;20110829003259.913f014a...@mail.netbsd.org&gt;,
> | YAMAMOTO Takashi &lt;y...@mwd.biglobe.ne.jp&gt; wrote:
> | hi,
> |
> | I'd like to apply the attached patch.
> | It implements two things:
> |
> | - chroot(2)-ed process is given new kauth_cred_t with reference count
> |   equal to 1.
> |
> | can you find a way to avoid this?
> |
> | YAMAMOTO Takashi
> |
> | He tried and I think that this is the minimal hook he needs.
> |
> | do you mean that we need to unshare the credential unconditionally,
> | regardless of whether his module is used or not?  why?
> |
> | maybe it's just me, but i actually have absolutely no problem
> | with chroot unsharing kauth_cred_t by default.  it just seems
> | to have more generic safety aspects.
>
> I share the same sentiment; I don't see the change as a big deal.

Likewise - the whole idea behind chroot is the isolation of
operations, and I can only see the unsharing of kauth_cred_t by
default as helping this.  Maybe I'm missing something here?

Thanks,
Alistair
Re: netbsd32 emulation in driver open() or read()
On Mon, 29 Aug 2011, Manuel Bouyer wrote:
> So: is there a way to know if the emulation used by a userland
> program doing an open() is 32 or 64bit ?

sys/proc.h:

	/*
	 * These flags are kept in p_flag and are protected by p_lock.
	 * Access from process context only.
	 */
	...
	#define	PK_32	0x00000004	/* 32-bit process (used on 64-bit kernels) */

So you can check if that bit is set in the current proc's p_flag
member.

Eduardo
Re: Addition to kauth(9) framework
On Mon, Aug 29, 2011 at 02:36:04PM +0200, Aleksey Cheusov wrote:
> If the sender (chroot(2)) cares about unsharing the kauth_cred_t
> structure, all listeners will set their data without any problem,
> provided that the kauth_key_t keys they use are different.  Key
> uniqueness is guaranteed by kauth_register_key.

I'm sorry, I'm very likely still missing some important detail: this
sounds to me as if we have to choose here between the sender
distributing individual unshared credentials to every receiver (I
thought kauth would handle the messaging?), which means every receiver
gets its own copy, but those lack modifications done by previous
receivers - or, if the receiver does the unsharing, its modifications
will get lost if we have multiple receivers.

Both options sound wrong to me; what did I misunderstand?

Thanks,
Martin
Re: netbsd32 emulation in driver open() or read()
In article &lt;20110829151339.ga24...@asim.lip6.fr&gt;,
Manuel Bouyer &lt;bou...@antioche.eu.org&gt; wrote:
> Hello,
> I'm working on getting bpf(4) in a 64bit kernel to play with a 32bit
> userland.  I've translated the ioctls, but I'm now stuck with
> read().  read(2) on a bpf device returns wire packets (no problems
> with this) with a bpf-specific header in front of each packet.  This
> bpf header is:
>
> struct bpf_hdr {
> 	struct bpf_timeval bh_tstamp;	/* time stamp */
> 	uint32_t	bh_caplen;	/* length of captured portion */
> 	uint32_t	bh_datalen;	/* original length of packet */
> 	uint16_t	bh_hdrlen;	/* length of bpf header (this struct
> 					   plus alignment padding) */
> };
>
> with:
>
> struct bpf_timeval {
> 	long	tv_sec;
> 	long	tv_usec;
> };
>
> and this is the problem (sizeof(bpf_timeval) changes).  It doesn't
> look easy to just move struct bpf_timeval to fixed-size types
> (compat issues, I guess this would require a rename of open() or
> read()).  On the other hand, if bpf(4) did know whether the program
> doing the open() syscall is 32 or 64bits, it could append the right
> header (this could also be done in read(), but it's less easy: it
> would require translating an existing buffer, while flagging it at
> open() time allows building the right buffer from the start).
>
> So: is there a way to know if the emulation used by a userland
> program doing an open() is 32 or 64bit ?

Yes, look at PK_32 in the process flags.  If you are going to do this,
please look at what FreeBSD did with bpf_ts/bpf_xhdr and the time
format changes and do the same (provide timespec/bintime etc).  This
is how they handle compatibility mode too.

christos
Re: CVS commit: src/sys/arch/xen
On 29.08.2011 15:03, Cherry G. Mathew wrote:
> I'm a bit confused now - are we assuming that the pmap lock protects
> the (pte update op queue-push(es) + pmap_pte_flush()) as a single
> atomic operation (ie; no invlpg/tlbflush or out-of-band-read can
> occur between the update(s) and the pmap_pte_flush()) ?
>
> If so, I think I've slightly misunderstood the scope of the mmu
> queue design - I assumed that the queue is longer-lived, and the
> sync points (for the queue flush) can span across pmap locking - a
> sort of lazy pte update, with the queue being flushed at out-of-band
> or in-band read time (I guess that won't work though - how does one
> know when the hardware walks the page table ?).  It seems that the
> queue is meant for pte updates in loops, for eg:, quickly followed
> by a flush.  Is this correct ?

IMHO, it should be regarded this way, and nothing else.  x86 and xen
pmap(9) share a lot in common; low level operations (like these: PT/PD
editing, TLB flushes, MMU updates...) should not leak through this
abstraction.  Said differently, the way Xen handles the MMU must
remain transparent to pmap, except in a few places.  Remember,
although we are adding a level of indirection through the hypervisor,
the calls should remain close to the native x86 semantics.

However, for convenience, Xen offers multi-seat MMU hypercalls, where
you can schedule more than one op at a time to avoid unneeded context
switches, like in the pmap_alloc_level() function.  This is our
problematic part.

> If so, there's just one hazard afaict - the synchronous
> TLB_FLUSH_MULTI could beat the race between the queue update and the
> queue flush via pmap_tlb_shootnow() (see pmap_tlb.c on the
> cherry-xenmp branch), and *if* other CPUs reload their TLBs before
> the flush, they'll have stale info.

What stale info?  If a VCPU's queue isn't empty while another VCPU has
scheduled a TLB_FLUSH_MULTI op, the stale content of the queue will
eventually be depleted later, after a pmap_pte_flush() followed by a
local invalidation.
This is the part that should be carefully reviewed.  For clarity, I
would expect a queue to be empty when leaving pmap (e.g. when
releasing its lock).  Assertions might be necessary to catch all
corner cases.

> So the important question (rmind@ ?) is, is pmap_tlb_shootnow()
> guaranteed to be always called with the pmap lock held ?
>
> In real life, I removed the global xpq_queue_lock() and the pmap was
> falling apart.  So a bit of debugging ahead.

Did you keep the same queue for all CPUs?  If that was the case, yes,
this is a call for trouble.

Anyway, we can't put a giant lock around xpq_queue.  That doesn't make
any sense in an MP system, especially for operations that are
frequently called and still may take a while to complete.  Just
imagine: all CPUs waiting for one to finish its TLB flush before they
can edit their PD/PT again... ouch.

-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr
Areca 1880?
Hi, all, Is there any support for the Areca 1880, be it in -current or someone's not-yet-checked-in tree? Thanks, John Klos
Re: Addition to kauth(9) framework
> On Mon, Aug 29, 2011 at 07:54:38PM +1000, matthew green wrote:
> > > do you mean that we need to unshare the credential unconditionally,
> > > regardless of whether his module is used or not?  why?
> >
> > maybe it's just me, but i actually have absolutely no problem
> > with chroot unsharing kauth_cred_t by default.  it just seems
> > to have more generic safety aspects.
>
> So hold on.  kauth_cred_t (ignoring some silly typedef issues) is the
> process credentials structure, with the uids and gids in it.  Right?
>
> In the normal world, each process should have its own, so all sharing
> is purely an optimization and should be copy-on-write.  Therefore,
> anything that changes the contents of the structure should first
> establish a private copy of it.  Because the type is private to
> kern_auth.c, this is fairly easy to establish and audit.  So far so
> good.
>
> Now, the question is: in the Elad world, are there cases where this
> thing is supposed to be shared for modification?  If so, what are
> they,

nothing as far as i know.

YAMAMOTO Takashi

> what's the intent and the intended semantics, and is there an
> implementation that makes it all behave reasonably without being a
> code nightmare?  Like with sharing pages in VM, handling objects that
> are both shared-update and copy-on-write at once is not entirely
> trivial.
>
> -- 
> David A. Holland
> dholl...@netbsd.org