Re: UVM and the NULL page
On Tue, Dec 27, 2016 at 02:12:59PM +0100, Wolfgang Solfrank wrote:
> Hi,
>
> >Any cpu that doesn't require special instructions for copyin/out
> >is susceptible to user processes mapping code to address 0 and
> >converting a kernel 'jump through unset pointer' from a panic
> >into a massive security hole (executing process code with the
> >'supervisor' bit set).
>
> Only if you do a naive implementation of copyin/out.  Nothing prevents
> you from implementing copyin/out on these cpus by mapping only the
> relevant part of the user address space at some reserved address
> (maybe even one page at a time), do the copying and then unmap the
> user space part.  No reason to share the user address space all the
> time.

That requires you do a full 'pmap' change on every system call entry
and exit - which will slow things down somewhat.
You don't even want to invalidate the user TLB.

	David

--
David Laight: da...@l8s.co.uk
Re: vrele vs. syncer deadlock
On Sun, Dec 11, 2016 at 08:39:06PM +, Michael van Elst wrote:
> dholland-t...@netbsd.org (David Holland) writes:
>
> >On a low-memory machine Nick ran into the following deadlock:
> >
> > (a) rename -> vrele on child -> inactive -> truncate -> getblk ->
> >     no memory in buffer pool -> wait for syncer

Could the child vnode tidyup be done at a later time?
i.e. just queue it in the vrele path.

	David

--
David Laight: da...@l8s.co.uk
Re: x86: move the LAPIC va
On Sat, Oct 08, 2016 at 05:14:43PM +0200, Maxime Villard wrote:
> On x86 there's a set of memory-mapped registers that are per-cpu and
> called the LAPIC. They reside at a fixed one-page-sized physical
> address, and in order to read or write to them the kernel has to
> allocate a virtual address and then kenter it to the aforementioned
> physical one.
>
> In the NetBSD kernel, however, we do something a little bizarre:
> instead of following this model, the kernel has a blank page at the
> beginning of the data segment, and it then directly kenters the va of
> this page into the LAPIC pa.
>
> The issue with this design is that it implies the first page of .data
> does not actually belong to .data, and therefore it is not possible
> to map the beginning of .data with large pages. In addition to this,
> without going into useless details, it creates an inconsistency in
> the low memory map, because the pa<->va translation is not linear,
> even if it seemingly is harmless.

If you are going to change it, why not pick a more appropriate fixed
virtual address?
The smp code will already be using one for things like curproc.

That way you don't need to add all the indirections to the asm code
and don't need asm #defines that use temporary registers.

You'll still need a physical page for non-LAPIC cpus (probably not
smp-capable designs).

	David

--
David Laight: da...@l8s.co.uk
Re: UVM and the NULL page
On Mon, Aug 01, 2016 at 03:56:01PM +, Eduardo Horvath wrote:
> On Sat, 30 Jul 2016, Thor Lancelot Simon wrote:
>
> > 1) amd64 partially shares VA space between the kernel and userland.
> >    It is not unique in this but most architectures do not.
>
> FWIW all the pmaps I worked on have split user/kernel address spaces
> and do not share this vulnerability.

Wakes up...

You've worked on a strange set of cpus then.
Any cpu that doesn't require special instructions for copyin/out
is susceptible to user processes mapping code to address 0 and
converting a kernel 'jump through unset pointer' from a panic
into a massive security hole (executing process code with the
'supervisor' bit set).

The only reason I know for mapping address zero would be to run
executables for very old emulations where the program entry point was
zero. There might be some old 68000 ones.

ISTR that wine is actually mapping 'everywhere' in order to ensure the
addresses it needs later can be made available by unmapping specific
ranges.

Anyway mmap() without MAP_FIXED should never return NULL.
Even if technically allowed by the standard.
If nothing else I think the compiler is allowed to assume that NULL is
special and generate 'unexpected' code.

	David

--
David Laight: da...@l8s.co.uk
Re: New Syscall
On Thu, Oct 15, 2015 at 02:12:35PM +0100, Robert Swindells wrote:
>
> Taylor R Campbell wrote:
> >    Date: Wed, 14 Oct 2015 22:55:41 +0100 (BST)
> >    From: Robert Swindells <r...@fdy2.co.uk>
> >
> >    The syscall is sctp_peeloff().
> >
> > Hmm...  Introducing a protocol-specific syscall doesn't strike me
> > as a great design.  I can imagine wanting to do something similar
> > with, e.g., minimalt, if we ever had that in-kernel.
> >
> > If we have to have something protocol-specific, an ioctl would work
> > just as well, and use up a somewhat less scarce resource.
>
> The code is from KAME, I didn't write it from scratch, FreeBSD also
> has a syscall for it.
>
> Linux uses getsockopt() for this, which seems wrong to me as you are
> not just reading a setting when you make the call.

Be careful, I think one of the sctp RFCs requires the use of
setsockopt() for a lot of things that ought to be separate socket
calls. I can't remember about peeloff.

The 'peeloff' code really shouldn't have been anything to do with sctp
- it is just a method of multiplexing connections over a single
socket. A strange solution to the problem I think they were trying to
solve.

Not that much of sctp works the way people expect it to...

	David

--
David Laight: da...@l8s.co.uk
Re: Brainy: bug in x86/cpu_ucode_intel.c
On Sun, Oct 04, 2015 at 04:28:35PM +, David Holland wrote:
> On Sun, Oct 04, 2015 at 11:52:18AM +1100, matthew green wrote:
>  > how about this:
>
> I would suggest using void * for the unaligned pointer, but other than
> that looks at least correctly consistent with the discussion here.

Agree - or char *.
It might not matter for this code, and for x86, but in general you
don't want gcc to see misaligned pointers.

It is also worth noting that you only need to add 8 (for amd64) to the
size, and that the pointer can only need 8 adding.

OTOH having an allocator not return aligned memory is stupid.
Adding a 16 or 32 byte header to allocation requests that are not
powers of 2 probably makes little difference to the footprint.
If code allocates 4k you don't really want a header at all.

	David

--
David Laight: da...@l8s.co.uk
Re: New sysctl entry: proc.PID.realpath
On Mon, Sep 07, 2015 at 07:01:45PM +, David Holland wrote:
> On Mon, Sep 07, 2015 at 11:13:35AM +0200, Joerg Sonnenberger wrote:
>  > > Two nits:
>  > >
>  > > 1) vnode_to_path(9) is undocumented
>  > > 2) it only works if you are lucky (IIUC) - which you mostly are
>  > >
>  > > The former is easy to fix, the latter IMHO is a killer before we
>  > > expose this interface prominently and make debuggers depend on
>  > > it. We then should also make $ORIGIN work in ld.elf_so ;-}
>  >
>  > My suggestion was to just provide the filesystem id and inode
>  > number as fallback. I still believe we should just turn on the code
>  > that remembers the realpath on exec in first place, if you want to
>  > debug something_with_a_very_very_very_very_..._very_long_name, you
>  > can always override the (missing) default.
>
> As best I recall (having tried to page the context in the past few
> days) the only reason that code is disabled is so that it fails in a
> way that's readily explainable (non-absolute paths) vs. arbitrarily
> and capriciously.
>
> There's another problem this thread hasn't mentioned, which is that
> the result of vnode_to_path for non-directories isn't necessarily
> unique or deterministic even if the object hasn't been moved about.

Perhaps the kernel should hold a reference to the directory vnode for
every process.
An open() of the directory could then be used for $ORIGIN etc.
You might want this vnode to be 'revokeable' by unmount.
An actual path could be found using the same code as pwd.

	David

--
David Laight: da...@l8s.co.uk
Re: kernel libraries and dead code in MODULAR kernels
On Fri, Sep 04, 2015 at 06:39:46PM -0700, Dennis Ferguson wrote:
> >
> > Yes, finding unused functions is hard.  Not only in libkern, but
> > also libc, or variables in abandoned `Makefile.kern.inc'.  Removing
> > one needs so much mental energy (especially when those picky
> > wandering).
>
> I'm not so interested in ridding the kernel of all unused code (though
> I suspect someone clever might be able to use the -ffunction-sections
> and -fdata-sections compiler flags plus the ld --gc-sections option to
> find some of it).  I'd be happy if modular and non-modular kernels
> had the same unused stuff, and that libkern.a could be used when
> building either.

Just parse the xref list from ld.

I don't think you should assume that all kernel modules are built at
the same time as the main kernel.
So functions that loadable modules might need have to be present.
Hence the inclusion of all of libkern.

	David

--
David Laight: da...@l8s.co.uk
Re: Understanding SPL(9)
On Mon, Aug 31, 2015 at 03:30:36PM +, Eduardo Horvath wrote:
> On Mon, 31 Aug 2015, Stephan wrote:
>
> > I'm trying to understand interrupt priority levels using the example
> > of x86. From what I've seen so far I'd say that all spl*() functions
> > end up in either splraise() or spllower() from
> > sys/arch/i386/i386/spl.S. What these functions actually do is not
> > clear to me. For example, splraise() starts with this:
> >
> > ENTRY(splraise)
> >         movl    4(%esp),%edx
> >         movl    CPUVAR(ILEVEL),%eax
> >         cmpl    %edx,%eax
> >         ja      1f
> >         movl    %edx,CPUVAR(ILEVEL)
> >         ...
> >
> > I'm unable to find out what CPUVAR(ILEVEL) means. I would guess that
> > something needs to happen to the APIC's task priority register.
> > However I can't see any coherence just now.
>
> Don't look at x86, it doesn't have real interrupt levels.  Look at
> SPARC or 68K which do.

Old x86 fed interrupts through (the equiv of) an 8259 interrupt
controller that dates from the 1970s (8080 cpu).
This has 8 interrupt priority levels and the spl() could (and used to)
modify the mask.
The problem is that these accesses are very, very slow.
Since interrupts are much rarer than spl calls it is much faster to
not update the hardware mask unless you get a level-sensitive
interrupt that should be masked.

Amd64 cpus have a built-in interrupt priority register (cr8) that can
be used to mask low priority interrupts. Unlike all other control
registers, accesses to cr8 aren't sequencing instructions so are fast.
I don't know whether netbsd dynamically changes cr8.

> Most machines nowadays only have one interrupt line and an external
> interrupt controller.  True interrupt levels are simulated by
> assigning levels to individual interrupt sources and masking the
> appropriate ones in the interrupt controller.  This makes the code
> rather complicated, especially since interrupts can nest.

Multiple interrupt priorities for level sensitive interrupts require
hardware support.

	David

--
David Laight: da...@l8s.co.uk
Re: change MSI/MSI-X APIs
On Mon, May 11, 2015 at 03:15:25PM +0900, Kengo NAKAHARA wrote:
> Hi,
>
> I received feedback from some device driver authors. They point out
> establish, disestablish and release APIs should be unified for INTx,
> MSI and MSI-X. So, I would change the APIs as below:

Some more feedback...

PCIe devices that only support MSI-X could have support for very large
numbers of interrupts. Some might only be needed if a specific
function is used, or might only be used for load sharing (eg multi-q
ethernet).
In either case you might want to allocate some of the MSI-X vectors at
driver load time, and allocate others at a much later time.
IIRC nothing in the hardware spec stops you doing this.

Theoretically you could allocate the MSI-X vector (etc) when the
interrupt is enabled, and deallocate on disable (apart from timing
problems on the hardware). I'm not suggesting you go that far!

This would also make it less likely that interrupts won't be available
for drivers that initialise later on.

	David

--
David Laight: da...@l8s.co.uk
Re: kernel constructor
On Thu, Nov 13, 2014 at 03:29:48PM +0900, Masao Uebayashi wrote:
> On Wed, Nov 12, 2014 at 2:53 AM, Taylor R Campbell
> <campb...@mumble.net> wrote:
> >    Date: Tue, 11 Nov 2014 17:42:51 +
> >    From: Antti Kantee <po...@iki.fi>
> >
> >    2: init_main ordering
> >
> >    I think that code reading is an absolute requirement there, i.e.
> >    we should be able to know offline what will happen at runtime.
> >    Maybe that problem is better addressed with an offline
> >    preprocessor which figures out the correct order?
> >
> > rcorder(8)...?
>
> I'll implement tsort in config(1), because config(1) knows module
> dependency. Module objects will be ordered when linking, that is also
> reflected in the order of constructors. I believe this is good enough
> for most cases.

You could look for symbol names (say) init_fn_foo,
init_fn_foo_requires_xxx and init_fn_foo_provides_yyy and use them to
generate a C file of calls to foo() in the correct order and then
relink the kernel.
Or as part of the final stage of converting a netbsd.o (generated by
ld -r) into a fully fixed kernel.

	David

--
David Laight: da...@l8s.co.uk
Re: FW: ixg(4) performances
On Sun, Aug 31, 2014 at 12:07:38PM -0400, Terry Moore wrote:
> This is not 2.5G Transfers per second. PCIe talks about transactions
> rather than transfers; one transaction requires either 12 bytes (for
> 32-bit systems) or 16 bytes (for 64-bit systems) of overhead at the
> transaction layer, plus 7 bytes at the link layer.
>
> The maximum number of transactions per second paradoxically transfers
> the fewest number of bytes; a 4K write takes 16+4096+5+2 byte times,
> and so only about 60,000 such transactions are possible per second
> (moving about 248,000,000 bytes). [Real systems don't see this, quite
> -- Wikipedia claims, for example, 95% efficiency is typical for
> storage controllers.]

The gain for large transfer requests is probably minimal.
There can be multiple requests outstanding at any one time (the limit
is negotiated, I'm guessing that 8 and 16 are typical values).
A typical PCIe dma controller will generate multiple concurrent
transfer requests, so even if the requests are only 128 bytes you can
get a reasonable overall throughput.

> A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
> transactions are possible per second, but those 9 million
> transactions can only move 36 million bytes.

Except that nothing will generate adequately overlapped short
transfers.
The real performance killer is cpu pio cycles. Every one that the
driver does will hit the throughput - the cpu will be spinning for a
long, long time (think ISA bus speeds).
A side effect of this is that PCI-PCIe bridges (either way) are doomed
to be very inefficient.

> Multiple lanes scale things fairly linearly. But there has to be one
> byte per lane; a x8 configuration says that physical transfers are
> padded so that each 4-byte write (which takes 27 bytes on the bus)
> will have to take 32 bytes. Instead of getting 72 million transactions
> per second, you get 62.5 million transactions/second, so it doesn't
> scale as nicely.

I think that individual PCIe transfer requests always use a single
lane.
Multiple lanes help if you have multiple concurrent transfers.
So different chunks of an ethernet frame can be transferred in
parallel over multiple lanes, with the transfer not completing until
all the individual parts complete.
So the ring status transfer can't be scheduled until all the other
data fragment transfers have completed.

I also believe that the PCIe transfers are inherently 64bit.
There are byte-enables indicating which bytes of the first and last
64bit words are actually required.

The real thing to remember about PCIe is that it is a comms protocol,
not a bus protocol. It is high throughput, high latency.
I've had 'fun' getting even moderate PCIe throughput into an fpga.

	David

--
David Laight: da...@l8s.co.uk
Re: msdosfs and small sectors
On Wed, Jul 16, 2014 at 06:26:00PM +, David Holland wrote:
> On Wed, Jul 16, 2014 at 03:10:01PM +0200, Maxime Villard wrote:
>  > I thought about that. I haven't found a clear spec on this, but it
>  > is implicitly suggested that 512 is the minimal size (from what
>  > I've seen here and there). And the smallest BytesPerSec allowed for
>  > fat devices is 512. But still, nothing really clear.
>
> If you're afraid some real device might turn up with 128-byte sectors
> or something, complain if it's less than 64. Or 32. It doesn't really
> matter.

Real floppies certainly had 128 byte sectors.
Some even had 128 byte ones on track 0 but 256 byte ones on the rest
of the disk!

Is there a check that the sector size is a power of two?
That might depend on where it comes from.

Real devices with 'unusual' sector sizes do exist (like audio CD), but
they won't have a FAT fs on them.
(and ICL system25 which wanted 100 byte sectors).

	David

--
David Laight: da...@l8s.co.uk
Re: crunchgen and c++
On Mon, Jul 14, 2014 at 11:54:35AM -0500, Frank Zerangue wrote:
> Is crunchgen compatible with c++ executables?

I think you answer yourself...

> I was able to build the c++ tool into a crunched binary but get an
> illegal instruction trap when trying to execute the tool.

Clearly not :-)

> And static variables in the c++ tool are initialized when any of the
> binaries crunched are executed.

To stop that happening the linker section names for the initialisers
(and destructors) in each tool would need renaming, and then the
constructors run (in the correct order) before calling the tool's
main(). (and even that might not work).

> Thanks for any ideas on this matter.

I'd try a minimal crunched binary and see why it fails.
All crunchgen really does is rename the program's symbols so that the
ones from each 'tool' are separate.

	David

--
David Laight: da...@l8s.co.uk
Re: serious performance regression in .41
On Thu, May 22, 2014 at 07:42:51PM +0200, J. Hannken-Illjes wrote:
> While I'm interested in the results, this change is wrong.
>
> As long as we have forced unmount and revoke it is not safe to access
> an inode without locking the corresponding vnode.  Holding a reference
> to the vnode just guarantees the vnode will not disappear, it does not
> prevent the inode from disappearing.

Forced unmount and revoke can use other synchronisation techniques
that are expensive for the unusual operation but cheap in the normal
path.
Something like rcu would do.
Might even be more generally useful for some of these structures.

	David

--
David Laight: da...@l8s.co.uk
Re: CVS commit: src/sys/ufs/ufs
On Fri, May 16, 2014 at 03:54:44PM +, David Holland wrote:
>  > Indeed rebooting with an updated kernel will give active NFS
>  > clients problems, but I am not sure we should really care nor how
>  > we could possibly avoid this one time issue.
>
> We have changed encoding of filehandles before (at least once). I
> don't think this is a problem, but maybe I'll put a note in UPDATING.

Never mind that problem.
Consider what happens if you reboot with a different CD in the drive!

I once fixed a filesystem to use different faked inode numbers every
time a filesystem was mounted.
Without that NFS clients would write to the wrong file in the wrong FS.
The 'impossible to get rid of' retries for hard mounts were something
up with which I had to put.
(A preposition is something you should not end a sentence with.)

	David

--
David Laight: da...@l8s.co.uk
Re: resource leak in linux emulation?
On Thu, Apr 17, 2014 at 01:23:15AM +0200, Sergio López wrote:
> 2014-04-03 11:57 GMT+02:00 Mark Davies <m...@ecs.vuw.ac.nz>:
>  > Note that nprocs (2nd to last value in the /proc/loadavg output)
>  > increments every time javac runs until it hits maxproc.
>
> You're right, the problem appears when the last thread alive in a
> multithreaded linux process is not the main thread, but one of the
> children. This only happens when using the linux emulation, because
> it is the only case when LWPs have their own respective PIDs.
>
> To fix, this should be added somewhere, probably at
> sys/kern/kern_exit.c:487 (but I'm not sure if there's a better
> location):
>
> 	if ((l->l_pflag & LP_PIDLID) != 0 && l->l_lid != p->p_pid) {
> 		proc_free_pid(l->l_lid);
> 	}

That doesn't look like the right place.
I think it should be further down (and with proc_lock held).

	David

--
David Laight: da...@l8s.co.uk
Re: Patch: cprng_fast performance - please review.
On Fri, Apr 18, 2014 at 02:41:07PM -0400, Thor Lancelot Simon wrote:
> Of the few systems which do have instructions that accelerate AES, on
> the most common implementation -- x86 -- we cannot use the
> instructions in question in the kernel because they use CPU state we
> do not save/restore when the kernel runs.
>
> I'd welcome anyone's work to fix that, so long as it does not impose
> major performance costs of its own, but I do not personally have the
> skill to do it, and if wishes were horses...

On x86 the xmm registers could be used in kernel code provided that:

1) If the fpu registers are owned by a different process they are
   saved into the pcb (because an IPI might ask they be saved).
   (Or save the register values somewhere the IPI can save them to the
   pcb from.)
and:
2) Pre-emption is disabled.
and:
3a) If the fpu registers are owned by the current process the
    registers used are saved and restored.
or:
3b) If the fpu is not active it is enabled (and then disabled).

You probably don't want to do a full fpu save unless you really need
to. I'd guess that the AES instructions would only need a couple of
xmm/ymm registers.

There is one lurking issue with the intel cpus though.
If the user program has used AVX encoded instructions that affect the
ymm registers there is a big clock penalty for the first non-avx
encoded instruction that uses the xmm ones (don't ask what the hw guys
f*cked up and bodged a fix for...).
The ABI requires that the ymm (high) registers be cleared with a
special instruction before every function call - which will include
all system calls, but this won't be true if the kernel is entered by
an interrupt.
I don't know about amd cpus.

	David

--
David Laight: da...@l8s.co.uk
Re: cprng_fast implementation benchmarks
On Wed, Apr 23, 2014 at 03:30:09PM +0200, Manuel Bouyer wrote:
> On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote:
> > [...]
> > Do we still have a compile-time way to check if the kernel (or
> > port) is uniprocessor only?  If so we should probably #ifdef away
> > the percpu calls in such kernels, which are probably for slower
> > hardware anyway.
>
> AFAIK options MULTIPROCESSOR is still here

Do the percpu() calls collapse out for non-MULTIPROCESSOR kernels?

In any case you'd want to do what is done with some of the mutex code.
ie overwrite the code of the SMP version with that of the uniprocessor
one if the current system only has one cpu.

	David

--
David Laight: da...@l8s.co.uk
Re: Changes to make /dev/*random better sooner
On Thu, Apr 10, 2014 at 04:14:46PM -0700, Dennis Ferguson wrote:
> On 10 Apr, 2014, at 05:34 , Thor Lancelot Simon <t...@panix.com> wrote:
> > On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> > > I'd really like to understand what problem is fixed by this. It
> > > seems to make the code more expensive (significantly so since it
> > > precludes using timestamps in their cheapest-to-obtain form) but
> > > I'm missing the benefit this cost pays for.
> >
> > It's no more expensive to sample a 64-bit than a 32-bit cycle
> > counter, if you have both.  Where do we have access to only a
> > 32-bit cycle counter?  I admit that the problem exists in theory.
> > I am not so sure at all that it exists in practice.
>
> 32 bit ARM processors have a 32-bit CPU cycle counter, when they have
> one.  PowerPC processors have a 64-bit counter but the 32-bit
> instruction set provides no way to get an atomic sample of all 64
> bits.  It requires three special instructions followed by a check and
> a possible repeat of the three instructions to get a consistent
> sample, which makes that significantly less useful for accurate event
> timing than the single atomic instruction which obtains the low order
> 32 bits alone.  I know i386, and 32-bit sparc running on a 64-bit
> processor, can get atomic samples of 64 bits of cycle counter from
> the 32-bit instruction set but I think those are exceptions rather
> than rules.

For the purposes of obtaining entropy it doesn't matter if the high
and low parts don't match.
Is there likely to be interesting entropy in the high bits anyway -
certainly not more than once.

Also, having read high, low, high and found that the two 'high' values
differ, take the latter high bits and zero the low bits.
The value returned occurred while the counter was being read - so is a
valid return value.

	David

--
David Laight: da...@l8s.co.uk
Re: Proposal for kernel clock changes
On Fri, Mar 28, 2014 at 06:16:23PM -0400, Dennis Ferguson wrote:
> I would like to rework the clock support in the kernel a bit to
> correct some deficiencies which exist now, and to provide new
> functionality.  The issues I would like to try to address include:

A few comments, I've deleted the body so they aren't hidden!

One problem I do see is knowing which counter to trust most.
You are trying to cross synchronise values and it might be that the
clock with the best long term accuracy is a very slow one with a lot
of jitter (NTP over dialup anyone?). Whereas the fastest clock is
likely to have the least jitter, but may not have the long term
stability.

There are places where you are only interested in the difference
between timestamps - rather than needing them converting to absolute
times.

I also wonder whether there are timestamps for which you are never
really interested in the absolute accuracy of old values. Possibly
because 'old' timestamps will already have been converted to some
other clock.
This might be the case for ethernet packet timestamps, you may want to
be able to synchronise the timestamps from different interfaces, but
you may not be interested in the absolute accuracy of timestamps from
packets taken several hours ago.
This may mean that you can (effectively) count the ticks on all your
clocks since 'boot' and then scale the frequency of each to give the
same 'time since boot' - even though that will slightly change the
relationship between old timestamps taken on different clocks.
Possibly you do need a small offset for each clock to avoid
discrepancies in the 'current time' when you recalculate the clock's
frequency.

If the 128bit divides are being done to generate corrected
frequencies, it might be that you can use the error term to adjust the
current value - and remove the need for the divide at all (after the
initial setup).
One thought I've sometimes had is that, instead of trying to synchronise the TSC counters in an SMP system, move them as far from each other as possible! Then, when you read the TSC, you can tell from the value which cpu it must have come from! David
Re: resource leak in linux emulation?
On Thu, Mar 27, 2014 at 02:00:37PM +1300, Mark Davies wrote:
> On a NetBSD/amd64 6.1_STABLE system, I have a perl script that
> effectively calls /usr/pkg/java/sun-7/bin/javac twice. It doesn't
> really matter what java file its compiling.
>
> If I call this script in an infinite loop, after an hour or so the
> javac's start failing with memory errors:
>
> # There is insufficient memory for the Java Runtime Environment to
> # continue.
> # Cannot create GC thread. Out of system resources.
>
> and after some more time the perl fails to fork (to exec the second
> javac)
>
>   23766      1 perl     CALL  fork
>   23766      1 perl     RET   fork -1 errno 35 Resource temporarily
>   unavailable
>
> Mar 27 11:43:24 test /netbsd: proc: table is full - increase
> kern.maxproc or NPROC
>
> But all through this top et al tell me there are plenty of processes
> and memory

I think this has been seen before.
But I can't remember the resolution.

	David

--
David Laight: da...@l8s.co.uk
Re: Enhance ptyfs to handle multiple instances.
On Mon, Mar 24, 2014 at 10:49:15AM -0400, Christos Zoulas wrote:
> On Mar 24,  5:46pm, net...@izyk.ru (Ilya Zykov) wrote:
> -- Subject: Re: Enhance ptyfs to handle multiple instances.
>
> | Hello!
> |
> | Please, tell me know if I wrong.
> | In general case I can't find(easy), from driver, where its device
> | file located on file system, its vnode or its directory vnode where
> | this file located.
> | Such files can be many and I can't find what file used for current
> | operation.
> | Maybe anybody had being attempted get this info from the driver?
>
> You can't find from the driver where the device node file is located
> in the filesystem, as well as you cannot reliably find from the vnode
> of the device node the filesystem path.  There could be many device
> nodes that satisfy the criteria (you can make your own tty node with
> mknod)

FWIW SYSV ptys (etc) would be created as a 'clone', a /dev entry
created/found with the required path and the correct major/minor, and
then reopened through the filesystem entry.
stat() on the /dev entry and fstat() on the fd would then agree
(probably disk partition and inode?).
This didn't help you find the entry - but it would tell you when you'd
found the correct one.

OTOH ttyname(3) is probably best implemented with a pair of ioctls.
Although chroot() probably complicates things.

	David

--
David Laight: da...@l8s.co.uk
Re: CVS commit: src/sys/kern
On Wed, Mar 05, 2014 at 06:04:02PM +0200, Andreas Gustafsson wrote:
> 2. I also object to the change of kern_sysctl.c 1.247.
>
> This change attempts to work around the problems caused by the
> changes to the variable types by making sysctl() return different
> types depending on the value of the *oldlenp argument.  IMO, this is
> a bad idea.
>
> The *oldlenp argument does *not* specify the size of the data type
> expected by the caller, but rather the size of a buffer.  The
> sysctl() API allows the caller to pass a buffer larger than the
> variable being read, and conversely, guarantees that passing a buffer
> that is too small results in ENOMEM.
>
> Both of these aspects of the API are now broken: reading a 4-byte
> CTLTYPE_INT variable now works for any buffer size >= 4 *except* 8,

That wasn't the intent of the change.
The intent was that if the size was 8 then the code would return a
numeric value of size 8, otherwise the size would be changed to 4
and/or ENOMEM (stupid errno choice) returned.

> and attempting to read an 8-byte CTLTYPE_QUAD variable into a buffer
> of less than 8 bytes is now guaranteed to yield ENOMEM *except* if
> the buffer size happens to be 4.

A request to read a CTLTYPE_QUAD variable into a buffer that is
shorter than 8 bytes has always been a programming error.
The intent of the change was to relax that if the length happened to
be 4.

> IMO, this behavior violates both the letter of the sysctl() man page
> and the principle of least astonishment.

I'm not sure about the latter.
I run 'sysctl -a' and find the name of the sysctl I'm interested in.
The result is a small number so I pass the address and size of an
integer variable and then print the result.
(Or the value is rather large and I think it might exceed 2^31 so I
use an int64.)
The 'principle of least astonishment' would mean that I get the value
that 'sysctl -a' printed.

On a BE system I have to be extremely careful with the return values
from sysctl() or I see garbage.
Note that code calling sysctl() has to either know whether the value
it is expecting is a string, structure, or number, or use the API
calls that expose the kernel internals in order to find out.

> Also, the work-around is ineffective in the case of a caller that
> allocates the buffer dynamically using the size given by an initial
> sysctl() call with oldp = NULL.

Code that does that for a numeric value will be quite happy with
either a 32bit or 64bit result.

	David

--
David Laight: da...@l8s.co.uk
Re: Recent sysctl changes
On Wed, Mar 05, 2014 at 03:56:54PM -0500, Thor Lancelot Simon wrote:
> On Wed, Mar 05, 2014 at 08:55:50PM +0200, Andreas Gustafsson wrote:
> > 2. I also object to the change of kern_sysctl.c 1.247.
> >
> > This change attempts to work around the problems caused by the
> > changes to the variable types by making sysctl() return different
> > types depending on the value of the *oldlenp argument.
>
> As I recall, we considered this approach before creating
> hw.physmem64, and decided it was just a little too cute.
>
> I don't actually know of any code that hands over a wrong-size buffer
> and will therefore break, though.  Do you?  I agree the possibility
> does exist.

I actually wonder if the code should also support single byte reads
for things like machdep.sse which are effectively booleans.
Maybe we should also allow 1, 4 and 8 byte reads for items declared as
booleans.
IIRC one of the arm ABIs uses 4 byte booleans - bound to be a cause
for confusion at some point.

	David

--
David Laight: da...@l8s.co.uk
Re: Vnode API change: mnt_vnodelist traversal
On Mon, Mar 03, 2014 at 03:55:12PM +0100, J. Hannken-Illjes wrote:
> On Mar 3, 2014, at 11:32 AM, Thomas Klausner <w...@netbsd.org> wrote:
> > On Mon, Mar 03, 2014 at 11:11:04AM +0100, J. Hannken-Illjes wrote:
> > > A diff implementing this and using it for those operations
> > > running vrecycle() is at
> > >   http://www.netbsd.org/~hannken/vnode-pass4-1.diff
> > > Once all operations are converted, vmark() / vunmark() will go
> > > and man pages will be updated.
> > > Comments or objections anyone?
> >
> > I have no background clue, so please excuse my questions if they
> > are stupid :)
> >
> > +void
> > +vfs_vnode_iterator_init(struct mount *mp, void **marker)
> > +{
> > +	struct vnode **mvpp = (struct vnode **)marker;
> > +
> > +	*mvpp = vnalloc(mp);
> > +
> > +	mutex_enter(&mntvnode_lock);
> > +	TAILQ_INSERT_HEAD(&mp->mnt_vnodelist, *mvpp, v_mntvnodes);
> > +	mutex_exit(&mntvnode_lock);
> > +}
> > +
> > +void
> > +vfs_vnode_iterator_destroy(void *marker)
> > +{
> > +	struct vnode *mvp = marker;
> > +
> > +	KASSERT((mvp->v_iflag & VI_MARKER) != 0);
> > +	vnfree(mvp);
> > +}
> >
> > Why do you cast marker in init, but not in destroy or next?
>
> Because (void **) to (othertype **) needs a cast.  Added casts to
> destroy and next anyway.
>
> > I assume that the marker is not struct vnode * so that you can
> > change the type later if you want.
>
> It is struct vnode * for now, to the caller it is simply opaque as
> the caller doesn't need to know the internals.

Use the correct type - if the caller doesn't need to know the
internals add a 'struct vnode;' declaration before the function
definitions.
(Or even 'struct foo' - which might currently be a vnode.)
If you use 'void *' it becomes unclear where the pointers are valid.

In this case I'm not sure that adding a marker vnode into the list of
vnodes is a good idea at all.
What you might want is a list of active iterators and their current
position so that the 'right' things can happen when a vnode is deleted
(especially if they need to save the 'next' vnode to allow the
function itself to delete the current one).
In that case the appropriate structure can be allocated on the stack
as part of the iterator data.
For instance, you might decide to scan the vnodes from the hash lists.
And for SMP locking you might want to arrange the hash so that any
'next' pointers are within the hash structure - completely removing
any linked list between the vnodes themselves.

David

--
David Laight: da...@l8s.co.uk
Re: Adding truncate/ftruncate length argument checks
On Wed, Feb 26, 2014 at 08:38:28PM +0100, Nicolas Joly wrote:
> > > The attached patch adds the missing length argument checks, and
> > > updates the man page accordingly.
> >
> > Isn't there (shouldn't there be) some lock needed to read the limit
> > data?
>
> Even for fetching a single value ? I thought it was mostly atomic ?

> +	if (length > l->l_proc->p_rlimit[RLIMIT_FSIZE].rlim_cur) {

Well... l->l_proc is ok.
l_proc->p_rlimit may not be (if it is shared with another process, and
an update by another process/thread causes the pointer to change, and
the other owners all exit ...)
p_rlimit[RLIMIT_FSIZE].rlim_cur is uint64_t so is a problem on 32bit.

David

--
David Laight: da...@l8s.co.uk
Re: Adding truncate/ftruncate length argument checks
On Wed, Feb 26, 2014 at 10:55:52PM +0100, Nicolas Joly wrote:
> > l_proc->p_rlimit may not be (if it is shared with another process,
> > and an update by another process/thread causes the pointer to
> > change, and the other owners all exit ...)
>
> I don't think another process will cause any problem.  Before any
> update, it will have its own private copy, leaving the previous
> shared version unmodified.  Regarding another thread ... The race
> does indeed exist, but only once in process life, for the first
> limit write access.

One copy of the structure is shared between all the lwps in a process.
It can also be shared with the parent and children.
If another lwp in the same process tries to edit a structure shared
by more than one process, then the code could read the old copy after
the ref count has been decreased.
If you are then really unlucky the process it is shared with will exit
- and the data will be freed.
It might in general get unmapped (and fault), or be reallocated,
modified and then garbage read.
Some kind of rcu in the 'free' path would solve the latter.

David

--
David Laight: da...@l8s.co.uk
Re: pmap_kenter_pa pmap_kremove
On Sat, Feb 22, 2014 at 10:04:13PM +0000, Mindaugas Rasiukevicius wrote:
> Matt Thomas <m...@3am-software.com> wrote:
> > > > I've been wondering... Should pmap_kenter_pa overwrite an
> > > > existing entry or should it be operating on an unmapped VA?
> > >
> > > You mean an already mapped VA?  I think that if you want to
> > > change a mapping, you should do a pmap_kremove first.
> >
> > I tend to agree.
>
> I have not seen a need for such re-mapping (overwriting), but even if
> there is, it can be done efficiently by removing, entering and then
> calling pmap_update().  With the deferred update, that would result
> in a single TLB flush/invalidation.

Anything that uses a small KVA area to reference a large amount of
physical addresses?
I'd guess that you'd want a flag somewhere to know it was likely
(either for the call, but probably a property of the KVA address).

David

--
David Laight: da...@l8s.co.uk
Re: pcb offset into uarea
On Wed, Feb 19, 2014 at 09:14:05AM -0800, Matt Thomas wrote:
> For the aarch64 port, the only thing in the PCB is the fpu register
> set.  Everything else is in mdlwp.  Now the context switch code can
> ignore the PCB entirely.  I've been thinking of doing something
> similar for other ports i maintain.

Makes sense.  That would remove a rather pointless indirection for
those fields as well.
On amd64 and i386 the pcb is slightly over 64 bytes (+ fpu save area).
So moving those into the lwp won't make much difference.
It isn't as though anyone has considered swapping uareas for a while.

David

--
David Laight: da...@l8s.co.uk
Re: pcb offset into uarea
On Sun, Feb 16, 2014 at 01:27:50PM -0800, Matt Thomas wrote:
> > An alternative would be to place the FP save area at the start of
> > the uarea.  This would mean that, on stack overflow, the FP save
> > area would be trashed before some random piece of memory.  It might
> > even be worth putting the pcb at the start of the uarea - so that
> > stack overflow crashes out the failing process, and probably
> > earlier than the random corruption would.
>
> For most ports, the pcb is at the start of the uarea.

Interesting, since i386 puts it at the end.

> > This gives me three options:
> > A) Put the save area at the end of the pcb and dynamically adjust
> >    the pcb offset.
> > B) Put the save area at the start of the uarea, with the pcb at a
> >    fixed offset at the end of the uarea.
> > C) Put the save area at the end of the pcb, and put the pcb at the
> >    start of the uarea.
> > Votes? What have I missed?
>
> Keep a default mmx/sse save area in the pcb along with a pointer to
> it.  If a variant is used that needs a larger save area, dynamically
> allocate it and save it in the pcb pointer.  Since it's unlikely most
> processes will be AVX why waste the space?

Unfortunately I don't think it is possible to determine whether a
process has used the AVX instructions.
There is a bit for 'os supports avx' (ie swaps on context switch) that
causes the instructions to fault (if not set), but applications should
look at that before using avx instructions.

If a process switch happens in a system call then the avx (xmm and
ymm) registers need not be saved and restored.
They can be zeroed instead because they are all caller saved.
I'm not 100% sure how easy that is to detect, but it shouldn't be too
hard an optimisation to perform.
Zeroing the ymm registers also has a significant performance benefit.

David

--
David Laight: da...@l8s.co.uk
Re: pcb offset into uarea
On Mon, Feb 17, 2014 at 06:39:26PM +0000, David Holland wrote:
> On Sun, Feb 16, 2014 at 09:41:08PM +0000, David Laight wrote:
> > I'm adding code to i386 and amd64 to save the ymm registers on
> > process switch - allowing userspace to use the AVX instructions.
> > [ensuing crap about the u area]
>
> Why put it in the u area at all?  It's a legacy concept of little
> continuing value.

Certainly most of the stuff that is in the pcb could be put into the
lwp structure.  Apart from the fp save area it isn't even very big.

Putting the FP save area at the low address of the kernel stack pages
saves you having to worry about how big it is (for 'stack grows down'
systems).

David

--
David Laight: da...@l8s.co.uk
pcb offset into uarea
I'm adding code to i386 and amd64 to save the ymm registers on process
switch - allowing userspace to use the AVX instructions.
I also don't want to have to do it all again when the next set of
extensions appear.
This means that the size of the FPU save area (currently embedded in
the pcb) can't be determined until runtime.

Plan A is to move the FPU save area to the end of the pcb, and then
locate the pcb at the correct offset in the uarea so that the written
region ends at the end of the page.
The problem with this is that the offset of the pcb in the uarea is
set by MI code based on some #defines - and there seem to be several
related values.

Now on x86 (like most systems) the cpu stack advances into low memory.
The pcb is placed at the end of the uarea with the initial stack
pointer just below it.
I suspect that a long time ago (when the uarea had a fixed KVA) an
additional memory page was placed below the uarea to give interrupts
more stack space.  I don't think this happens any more.

As an aside: the uarea used to be pageable, whereas (what is now) the
lwp structure isn't.  Paging of uareas was disabled a few years back -
so there is no real difference between the lifetimes of an lwp and a
uarea.  (Zombies probably lose the uarea before the lwp.)

An alternative would be to place the FP save area at the start of the
uarea.  This would mean that, on stack overflow, the FP save area
would be trashed before some random piece of memory.
It might even be worth putting the pcb at the start of the uarea - so
that stack overflow crashes out the failing process, and probably
earlier than the random corruption would.

This gives me three options:
A) Put the save area at the end of the pcb and dynamically adjust the
   pcb offset.
B) Put the save area at the start of the uarea, with the pcb at a
   fixed offset at the end of the uarea.
C) Put the save area at the end of the pcb, and put the pcb at the
   start of the uarea.

Votes? What have I missed?

David

--
David Laight: da...@l8s.co.uk
Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM
On Tue, Feb 11, 2014 at 04:19:26PM +0000, Eduardo Horvath wrote:
> We really should enhance the bus_dma framework to add bus_space-like
> accessor routines so we can implement something like this.  Using
> bswap is a lousy way to implement byte swapping.  Yes, on x86 you
> have byte swap instructions that allow you to work on register
> contents.  But most RISC CPUs do the byte swapping in the load/store
> path.  That really doesn't map well to the bswap API.  Instead of one
> load or store operation to swap a 64-bit value, you need a load/store
> plus another dozen shift and mask operations.  I proposed such an
> extension years ago.  Someone might want to resurrect it.

What you don't want to have is an API that swaps data in memory
(unless that is really what you want to do).
IIRC modern gcc detects uses of its internal byteswap function that
are related to memory read/write and uses the appropriate
byte-swapping memory access.

I can see the advantage of being able to do byteswap in the load/store
path, but sometimes that can't be arranged and a byteswap instruction
is very useful.
I really can't imagine implementing it being a big problem!

David

--
David Laight: da...@l8s.co.uk
Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM
On Tue, Feb 11, 2014 at 09:21:30PM +0000, Eduardo Horvath wrote:
> > What you don't want to have is an API that swaps data in memory
> > (unless that is really what you want to do).
> > IIRC modern gcc detects uses of its internal byteswap function that
> > are related to memory read/write and uses the appropriate
> > byte-swapping memory access.
> > I can see the advantage of being able to do byteswap in the
> > load/store path, but sometimes that can't be arranged and a
> > byteswap instruction is very useful.
>
> When do you ever really want to byte swap the contents of one
> register to another register?  Byte swapping almost always involves
> I/O, which means reading or writing memory or a device register.  In
> this case we are specifically talking about DMA, in which case there
> is always a load or store operation involved.

Quite often the structure of the code means that the value has already
been read into a register - so you are presented with a value in the
wrong byte order.

> > I really can't imagine implementing it being a big problem!
>
> Yes, it is a big problem.  For a 2 byte swap you need to do 2 shift
> operations, one mask operation (if you're lucky) and one or
> operation.  Double that for a 4 byte swap.  And even if you argue
> that a dozen CPU cycles here or there don't make much difference, the
> byte swap code is replicated all over the place since the routines
> are macros, so you're paying for it with your I$ bandwidth.

Sorry - I meant a big problem for those designing cpus.
I know it is a pita in software.
About the only VHDL I've written is for a byteswap 'custom
instruction' for a soft-cpu.
Done because a single cycle byteswap there was easier than getting a
ppc to use the byteswapping memory accesses for the relevant fields.

David

--
David Laight: da...@l8s.co.uk
Re: [Milkymist port] virtual memory management
On Mon, Feb 10, 2014 at 02:38:27PM -0800, Matt Thomas wrote:
> Hopefully, if they make the caches larger they increase the number of
> ways.  I wouldn't add code to flush.  Just add a panic if you detect
> you can have aliases and deal with it if it ever happens.

IIRC a lot of sparc systems have VIPT caches where the cache size
(divided by the ways) is much larger than a page size (IIRC at least
64k).
If memory has to be mapped at different VA (eg in different processes)
then it is best to pick VA so that the data hits the correct part of
the cache.  Otherwise it has to be mapped uncached.
I guess another solution is to use logical pages that are the right
size.

David

--
David Laight: da...@l8s.co.uk
Re: Possible issue with fsck_ffs ?
On Fri, Feb 07, 2014 at 05:50:53PM +0000, David Holland wrote:
> On Fri, Feb 07, 2014 at 08:39:39AM -0800, Paul Goyette wrote:
> > I'm sure we have some experts who could figure this out a lot more
> > quickly than me fumbling through the sources :)
> >
> > At my $DAYJOB we have seen instances where newfs(8) can generate a
> > filesystem where the fragments per cylinder-group can exceed
> > 0x10000.  When newfs(8) stores the value in the file-system's
> > superblock, it works correctly since fs_fpg is a 32-bit integer.
> > However, newfs(8) also stores the value in the partition table's
> > p_cpg member, which is only 16-bits.  Values above 0x10000 will,
> > obviously, get truncated.
> >
> > fsck_ffs(8) works just fine as long as we are able to read the
> > primary superblock.  But if we're unable to access the primary SB,
> > we need to use the p_cpg value to find the alternate superblocks,
> > and because of the truncation noted above the search for alternates
> > will fail.

IMHO fsck should work without having to find the partition table.
One option is to assume that default parameters were used to create
the filesystem and then use the same algorithm as newfs to find the
alternate - maybe trying a few block/fragment sizes (there aren't that
many).
As a last resort a linear search wouldn't take that long.
Although it would be best to check that some later superblocks match,
and that the whole thing is consistent with the partition size.
I'd sometimes rather that fsck didn't actually do any disk writes
until the end (or interactively after asking a question).

IIRC the number of fragments in a 'cylinder group' is limited because
the allocation bitmap has to reside within a single FS block.
Since blocks are limited to 64kB this limits the FS to 0x80000
fragments per CG.
(Any FS with blocks > 8k is likely to have > 0x10000 fragments/CG.)
So if you do have the p_cpg value there are only a few locations to
try.

David

--
David Laight: da...@l8s.co.uk
Re: [PATCH] netbsd32 swapctl, round 3
On Sat, Feb 01, 2014 at 08:41:15AM +0000, Emmanuel Dreyfus wrote:
> Hi
>
> Here is my latest attempt at netbsd32 swapctl.  I had to make
> uvm_swap_stats() available to emul code, but that seems to be what it
> was intended for, according to comments in the code.

I've just looked at the code in uvm_swap_stats().
Might be easier to either clone the entire loop, or pass in a helper
function which is passed the 'sdp' and 'inuse' fields.
Even for the 'normal' case that would save copying the pathname twice.
It isn't as though a lock is released before the copyout.
The read lock is held throughout.

David

--
David Laight: da...@l8s.co.uk
Re: compat_netbsd32 swapctl
On Wed, Jan 29, 2014 at 11:54:29AM +0100, Martin Husemann wrote:
> On Wed, Jan 29, 2014 at 10:42:14AM +0000, Emmanuel Dreyfus wrote:
> > The solution is for netbsd32_swapctl() to call sys_swapctl() for
> > each individual record, but it needs to know the i386 size for
> > struct swapent.  I suspect there is a macro for that.  Someone
> > knows?
>
> Tricky.  You could define swapent32 with se_dev split into two 32bit
> halves and do full conversion back and forth, but better check what
> alignment mips and sparc would require here first.

Look at what is done elsewhere in the i386 compat code.
There is a 64bit integer type that has an alignment requirement of 8.
If that is used instead of a normal 64bit type then the structure
alignment under amd64 matches that of i386.

David

--
David Laight: da...@l8s.co.uk
Re: compat_netbsd32 swapctl
On Wed, Jan 29, 2014 at 06:38:06PM +0000, paul_kon...@dell.com wrote:
> On Wed, Jan 29, 2014 at 06:26:14PM +0000, David Laight wrote:
> > There is a 64bit integer type that has an alignment requirement of
> > 8.  If that is used instead of a normal 64bit type then the
> > structure alignment under amd64 matches that of i386.
>
> The easiest way to get such alignment is to ask for it explicitly:
> __attribute__((aligned(8))).
>
> 	paul

That won't ensure the structure has the same alignment, there could be
pad words before 64bit fields on the 64bit architecture.
You could mark all the 64bit fields with aligned(8) so that they have
the same alignment.
But the general problem is that the 64bit system needs to match a
pre-existing 32bit structure.  Changing the alignment doesn't then
help.

It probably is worth adding an __CTASSERT() for non-trivial structures
that are expected to be a fixed size.
Then if anything 'odd' happens the compiler will bleat.

David

--
David Laight: da...@l8s.co.uk
Re: amd64 kernel, i386 userland
On Sun, Jan 26, 2014 at 05:01:42PM +1100, matthew green wrote:
> i think this could be fixed by introducing new disk major numbers for
> both i386 and amd64 that are associated with the same definition of
> major() and minor(), but i've never gotten around to or found someone
> else willing to code this up.

An entirely new disk minor to partition map might be appropriate.
(Without looking at the current mess...)

I think we have (at least) 16 bits of minor number so could split 8/8,
but reserve the high 'disk numbers' for the 'raw disk' access for all
disks.
So minor 0x0204 would be disk 2 partition 4.
Minor 0xff03 would be raw access to disk 3.
Maybe minor 0xfe03 would be the 'netbsd partition (type 169)' if
found.
The partition slots for 'whole disk' could then be put where they
belong.

I suspect a real VAX might have disks that the hardware can't size,
but nothing else will.
I know I've changed the x86 code to report the actual disk size (for
'd', and maybe 'c') rather than the information that happened to be in
the label.

David

--
David Laight: da...@l8s.co.uk
Re: UVM crash in NetBSD/i386 PAE with 32 GB of RAM
On Tue, Jan 21, 2014 at 10:31:08AM +0000, Emmanuel Dreyfus wrote:
> On Mon, Jan 20, 2014 at 04:18:38PM +0000, Emmanuel Dreyfus wrote:
> > Changing memory fixed the problem.  The machine now boots 6.0 i386
> > PAE with SMP enabled and 128 GB of RAM installed, and it seems to
> > be stable.
>
> But I spoke too fast.  It is stable, but the i386 PAE kernel does not
> see more than 2 GB of memory.  An amd64 kernel sees the whole 128 GB.
> Is it possible that the chipset cannot run PAE?

I doubt it.
2G sounds like the amount of memory below 4G - but I'd have thought
that would be 3G (or even 3.5G).
It might be that having 128G has confused things somewhere.
PAE is also a bodge (at all levels).
I'd run a 64bit kernel on that system.

David

--
David Laight: da...@l8s.co.uk
Re: UVM crash in NetBSD/i386 PAE with 32 GB of RAM
On Tue, Jan 21, 2014 at 08:59:19PM +0100, Christoph Egger wrote:
> On 21.01.14 20:54, David Laight wrote:
> > On Tue, Jan 21, 2014 at 10:31:08AM +0000, Emmanuel Dreyfus wrote:
> > > > Changing memory fixed the problem.  The machine now boots 6.0
> > > > i386 PAE with SMP enabled and 128 GB of RAM installed, and it
> > > > seems to be stable.
> > >
> > > But I spoke too fast.  It is stable, but the i386 PAE kernel does
> > > not see more than 2 GB of memory.  An amd64 kernel sees the whole
> > > 128 GB.  Is it possible that the chipset cannot run PAE?
> >
> > I doubt it.
> > 2G sounds like the amount of memory below 4G - but I'd have thought
> > that would be 3G (or even 3.5G).
>
> That depends on the PCI MMIO memory layout.  And I am wondering if
> 64bit PCI devices are accessible at all when their PCI bar is above
> 4G.

The bios will put values below 4G into all the bars - otherwise a
32bit os wouldn't be able to access them at all.
That is why there is a gap in the physical memory addresses.
Actually the size of the memory chips might force 'low' memory down to
2G - I'm not entirely sure how that memory hole is generated.

If amd64 finds all the memory, I'd do a bit of chasing through the
kernel startup code to see what happens.
There are KVA issues that might give problems with the page tables
needed for that much physical memory.

David

--
David Laight: da...@l8s.co.uk
Re: amd64 kernel, i386 userland
On Tue, Jan 21, 2014 at 09:14:36PM +0100, Emmanuel Dreyfus wrote:
> Joerg Sonnenberger <jo...@britannica.bec.de> wrote:
> > At least raidctl can be found in /rescue, which is statically
> > linked.  That's likely easier to play with than any compat hacks.
>
> Yes, but that does not solve the problem for ipf, for instance.

You could build the 64bit ipf with a different 'elf interpreter' name
in it (or patch the string in the binary, it is unlikely to be
shared).
Then you just need to set an appropriate LD_LIBRARY_PATH.
Or, maybe, run ipf inside a chroot.

David

--
David Laight: da...@l8s.co.uk
Re: BPF memstore and bpf_validate_ext()
On Fri, Dec 20, 2013 at 01:28:12AM +0200, Mindaugas Rasiukevicius wrote:
> Alexander Nasonov <al...@yandex.ru> wrote:
> > Well, if it wasn't needed for many years in bpf, why do we need it
> > now? ;-)
>
> Because it was decided to use BPF byte-code for more applications and
> that meant there is a need for improvements.  It is called
> evolution. :)

Has anyone here looked closely at the changes linux is making to bpf?

David

--
David Laight: da...@l8s.co.uk
Re: qsort_r
On Mon, Dec 09, 2013 at 03:55:30AM +0000, David Holland wrote:
> On Sun, Dec 08, 2013 at 11:26:47PM +0000, David Laight wrote:
> > > > I have done it by having the original, non-_r functions provide
> > > > a thunk for the comparison function, as this is least invasive.
> > > > If we think this is too expensive, an alternative is generating
> > > > a union of function pointers and making tests at the call
> > > > sites; another option is to duplicate the code (hopefully with
> > > > cpp rather than CP) but that seems like a bad plan.
> > >
> > > I'd prefer to not have another indirect call.  The only
> > > difference is the definition and expanding a CMP macro
> > > differently?
> >
> > Is just casting the function pointers safe in C (well in NetBSD)?
> > (with the calling conventions that Unix effectively requires)
>
> No.  Well, it is, but it's explicitly illegal C and I don't think we
> should do it.

Actually, given that these functions are in libc, their interface is
defined by the architecture's function call ABI, not by the C
language.

Consider what you would do if you wrote an asm wrapper for qsort(a,b)
in terms of an asm qsort_r(a,b,d).
For ABI where the first 3 arguments are passed in registers (eg:
amd64, sparc, sparc64) and for ABI where arguments are stacked and
cleared by the caller (eg i386) I don't think you'd consider doing
anything other than putting an extra label on the same code.

There might be ABI where this isn't true - in which case the 'thunk'
is an option, but I don't think NetBSD has one.
FWIW I think Linux is moving to an alternate ppc64 ABI that doesn't
use 'fat pointers'.

David

--
David Laight: da...@l8s.co.uk
Re: qsort_r
On Sun, Dec 08, 2013 at 11:44:28PM +0100, Joerg Sonnenberger wrote:
> On Sun, Dec 08, 2013 at 10:29:53PM +0000, David Holland wrote:
> > I have done it by having the original, non-_r functions provide a
> > thunk for the comparison function, as this is least invasive.  If
> > we think this is too expensive, an alternative is generating a
> > union of function pointers and making tests at the call sites;
> > another option is to duplicate the code (hopefully with cpp rather
> > than CP) but that seems like a bad plan.
>
> I'd prefer to not have another indirect call.  The only difference is
> the definition and expanding a CMP macro differently?

Is just casting the function pointers safe in C (well in NetBSD)?
(with the calling conventions that Unix effectively requires)

Can anything slightly less nasty be done with varargs functions?

David

--
David Laight: da...@l8s.co.uk
Re: qsort_r
On Sun, Dec 08, 2013 at 10:29:53PM +0000, David Holland wrote:
> I have done it by having the original, non-_r functions provide a
> thunk for the comparison function, as this is least invasive.  If we
> think this is too expensive, an alternative is generating a union of
> function pointers and making tests at the call sites; another option
> is to duplicate the code (hopefully with cpp rather than CP) but that
> seems like a bad plan.
>
> Note that the thunks use an extra struct to hold the function
> pointer; this is to satisfy C standards pedantry about void pointers
> vs. function pointers, and if we decide not to care it could be
> simplified.

On most architectures I think just:
	__weak_alias(heapsort_r,heapsort)
	__weak_alias(heapsort_r,_heapsort)
will work.

David

--
David Laight: da...@l8s.co.uk
Re: posix message queues and multiple receivers
On Sat, Dec 07, 2013 at 12:38:42AM +0100, Johnny Billquist wrote:
> You know, you might also hit a different problem, which I have had on
> many occasions.  NFS using 8k transfers saturating the ethernet on
> the server, making the server drop IP fragments.  That in turn forces
> a resend of the whole 8k after a nfs timeout.  That will totally kill
> your nfs performance.  (Obviously, even larger nfs buffers make the
> problem even worse.)

That wasn't the problem in this case since I could see the very
delayed responses.
That is a big problem, I've NFI why i386 defaults to very large
transfers.

> Even with an elevator scan algorithm and four concurrent nfs clients,
> your disk operations will complete within a few hundred ms at most.

This was all from one client.
I'm not sure how many concurrent NFS requests were actually
outstanding - it was quite a few.
I remember that the operation was copying a large file to the nfs
server, the process might have been doing a very large write of an
mmaped file.
So the client could easily have a few MB of data to transfer - and be
trying to do them all at once.

Thinking further, multiple nfsd probably help when there are a lot
more reads than writes - reads can be serviced from the server's
cache.

David

--
David Laight: da...@l8s.co.uk
Re: in which we present an ugly hack to make sys/queue.h CIRCLEQ work
On Sun, Nov 24, 2013 at 06:42:40AM -0500, Mouse wrote:
> > (I think that) strict aliasing rules imply that if two types
> > type{1,2} do not match any of the aliasing rules (e.g. type1 is of
> > the same type as the first member of type2, or type1 is a char, or
> > ...), then any two pointers ptr{1,2} on type{1,2} respectively
> > _ARE_ different, because *ptr1 != *ptr2 per the aliasing rules and
> > this implies ptr1 != ptr2.
>
> Only if you actually evaluate *ptr1 and *ptr2 (in some cases, I
> think, just one of them is enough).  Otherwise you're not accessing
> the relevant object(s); the rule is about accesses to values, not
> about pointers that, if followed, would perform certain accesses to
> values.

One option would have been to replace the comparison:
	(void *)foo == (void *)bar
with:
	(char *)foo - (char *)0 == (char *)bar - (char *)0
which the compiler can't optimise away.
Well, (const char *), but that makes the line too long!

I've had to do something similar to cast to, IIRC, (foo * const *) in
a function that advances a pointer down an array - which might be
const.

David

--
David Laight: da...@l8s.co.uk
Re: zero-length symlinks
On Sun, Nov 03, 2013 at 04:35:19PM -0800, John Nemeth wrote:
> It has to do with the fact that historically mkdir(2) was actually
> mkdir(3), it wasn't an atomic syscall and was a sequence of
> operations performed by a library routine...

Actually I think you'll find that mkdir was always a system call.
It was directory rename that was done with a series of link and unlink
system calls.

Also, if you look at any current fs code the processing of . and ..
is special - they will be treated as requests for the current and
parent directories regardless of the inodes they reference.
Doing otherwise is a complete locking nightmare!

David

--
David Laight: da...@l8s.co.uk
Re: Getting the device name from a struct tty *
On Tue, Oct 15, 2013 at 01:11:40PM -0400, Mouse wrote:
> > In a tty line discipline, I want to get the name of the tty driver
> > instance, e.g. dtyU0.
>
> In what sense is that the name of the tty driver instance?  I'm not
> just being snarky; that's a real question.  Names of the dtyU0 kind
> normally name device special files in /dev but nothing else - the
> kernel doesn't know anything about them.  In theory you could read
> /dev, but (a) nothing says device special files can't exist elsewhere
> and (b) you then have to decide what to do if you find other than
> exactly one device special file pointing to the device in question.
> (And, of course, (c) you may not be in a context from which reading
> /dev is feasible.)  But if that's the name you want, there may be
> little choice.

There is also no reason (in general) why there should be a /dev entry
anywhere at all.
The tty could be a cloning driver that allocates a new minor on every
open (and without any magic to create a /dev entry).

The libc ttyname() function is often implemented using a database of
known tty entries - otherwise the lookup can involve a recursive
search of /dev - which might be needed anyway if entries are
dynamically added - not sure how netbsd handles it.

I have been handed (on paper!) several hundred sheets of system call
trace from a program that was scanning a list looking for an entry
for its current terminal - and calling ttyname() for each entry!
On SYSV we used to go through 'hoops' so that ttyname() could return
the expected /dev entry when there might be multiple /dev entries with
the same major/minor.

David

--
David Laight: da...@l8s.co.uk
Re: mknodat(2) device argument type change
On Sun, Oct 06, 2013 at 10:51:36PM +0200, Nicolas Joly wrote:
> It needs the PAD, syscalls files generation fails without it
> (sysalign=1).
>
> /bin/sh makesyscalls.sh syscalls.conf syscalls.master
> syscalls.master: line 905: unexpected dev (expected a padding
> argument)
> line is: 460 STD RUMP { int | sys | | mknodat ( int fd , const char *
> path , mode_t mode , dev_t dev ) ; }

A certain amount of magic (and luck) applies...

On i386 64bit fields in structures are only 4-byte aligned, however
when the arguments for a function are stacked a 64bit field is 8-byte
aligned.
For system calls a C structure gets mapped onto the user stack (copied
into kernel).
So the kernel struct for the above argument list needs a pad.

On amd64 the first 6 arguments are all in registers, so the C struct
for the argument list must match the register save area.
In this case the structure members all end up being 64bit.
So no pad is needed.
I suspect this is handled 'by magic'.

... searches for the magic ...

I think that the PAD arguments are added by the libc system call stubs
even for 64bit architectures - where they waste a real argument slot.

This doesn't explain why rump needs so much special code in
makesyscalls.sh.

David

--
David Laight: da...@l8s.co.uk
Re: fixing the vnode lifecycle
On Wed, Sep 25, 2013 at 10:22:36PM -0400, Mouse wrote: Expect some file systems to use a key size != sizeof(ino_t) -- nfs for example uses file handles up to 64 bytes. IIRC all file systems provide a filehandle generation routine, There was a time when fh generation was needed only for the filesystem to be NFS-exportable. Is it now actually required for all filesystems? Doesn't posix (more or less) require files to have inode numbers? In particular I thought some of the fields reported by stat() are supposed to uniquely identify the file (probably st_dev and st_ino). Yes, I know some fs break this. David -- David Laight: da...@l8s.co.uk
Re: Max. number of subdirectories dump
On Sun, Aug 18, 2013 at 03:08:21PM +0200, Manuel Wiesinger wrote:
> Hello, I am working on a defrag tool for UFS2/FFSv2 as a Google Summer of Code project. The size of a directory offset is of type int32_t (see src/sys/ufs/ufs/dir.h), which is a signed integer. So the maximum size can be (2^31)-1. When testing, the maximum number of subdirectories was 32767, which is (2^15)-1; when trying to add a 32767th directory, I got the error message: Too many links. When my tool reads only the single indirect blocks, it gets all 32767 subdirectories.

For defrag I'd have thought you'd work from the inode table and treat directories no differently from files. You would need to scan directories if you decide to renumber inodes, but since they are indexed that may not gain much. It might be worth rewriting directories in order to remove gaps and possibly put subdirectories first (but you really want the most frequently used entries first).

FYI a well known British internet payment scheme fell over when the 32768th vendor account was added onto the live system! (Solaris crashed badly.)

David -- David Laight: da...@l8s.co.uk
Re: Use of the PC value in interrupt/exception handlers
On Fri, Aug 02, 2013 at 10:46:31AM +0000, Piyus Kedia wrote:
> Dear all, we are working on developing a dynamic binary translator for the kernel. Towards this, we wanted to confirm whether the interrupted PC value pushed on the stack by an interrupt/exception is used by the interrupt/exception handlers. For example, is the PC value compared against a fixed address to determine the handler behaviour (like Linux's page fault handler compares the faulting PC against an exception table, to allow functions like copy_from_user to fault)?

IIRC i386 and amd64 both check the faulting PC for copyin() and copyout() (and similar functions). Unlike linux these exist as proper functions, so there is only a single set of exception PC bounds (not one for every call site).

There will also be checks that a user-space PC actually contains a user address. Also the signal information, coredump, and registers for GDB (etc) contain the PC.

David -- David Laight: da...@l8s.co.uk
Re: [PATCH] i386 copy.S routine optimization?
On Mon, Jun 10, 2013 at 10:20:25PM +0200, Yann Sionneau wrote:
> Hello, I already talked about this with Radoslaw Kujawa on IRC; I understood that it is far from trivial to say whether it is good to apply the following patch [0] or not, due to x86 cache and pipeline subtleties.

Please inline patches in the mail, that way they are definitely in the mail archive. It also makes them much easier to review. Also ensure you quote the cvs revision of the main file - otherwise the line numbers won't match.

If you want to make a measurable improvement to copystr() don't use lodsb or stosb, and use 32bit reads from userspace. I can't remember whether it is best to do misaligned reads and aligned writes (or aligned reads and misaligned writes); in any case if you do aligned reads you don't have to worry about faulting at the end of a page. Look at the strlen() code for quick ways of testing for a zero byte. For amd64 the bit masking methods are definitely faster.

Probably the worst part of the current code is that the 'jz' to skip the unwanted write will be mis-predicted.

David -- David Laight: da...@l8s.co.uk
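The zero-byte test referred to above (the kind of trick the strlen() code uses) can be sketched as one subtract, one and-not, and one mask per 32-bit word, instead of four byte compares; the function name here is mine.

```c
#include <stdint.h>

/* Returns non-zero iff some byte of 'v' is 0x00.  A copystr() inner
 * loop could read a word at a time and only drop to byte handling once
 * this fires. */
static inline uint32_t
has_zero_byte(uint32_t v)
{
	return (v - 0x01010101u) & ~v & 0x80808080u;
}
```

This particular form is exact: a byte's 0x80 bit survives the mask only when that byte borrowed all the way down, i.e. was zero.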
Re: netbsd-6: pagedaemon freeze when low on memory
On Wed, Mar 06, 2013 at 06:01:50PM -0600, David Young wrote:
> Here's another thought: what about changing some of the VM_SLEEP calls to VM_NOSLEEP, at least for the userspace-initiated syscalls? The syscalls would then fail, moving the responsibility of dealing with low memory onto the userspace apps (they may be unhappy, but at least the kernel will stay functional). This change would be in addition to the vmem_xalloc() wake changes you proposed (because those wake-ups may never come if the system is truly running on fumes).

You could do that. You just have to take care to handle the errors properly. There are more error paths to test. Not trying to discourage you, just pointing out the trade-offs. :-)

Possibly some of the sleeps could be interruptible (by a signal). But that rather depends on the system calls involved.

David -- David Laight: da...@l8s.co.uk
Re: netbsd-6: pagedaemon freeze when low on memory
On Tue, Mar 05, 2013 at 11:43:35PM -0600, David Young wrote: Maybe we can avoid unnecessary locking or redundancy using a generation number? Add a generation number to the vmem_t, volatile uint64_t vm_gen; Increase a vmem_t's generation number every time that vmem_free(), vmem_xfree(), or vmem_backend_ready() is called: Won't that generate a very hot cache line on a large smp system? Maybe the associated structures are actually worse here! But per-cpu virtual address free lists might make sense. David -- David Laight: da...@l8s.co.uk
Re: Post-mortem debugging tools
On Mon, Feb 04, 2013 at 09:39:04PM +0100, Joerg Sonnenberger wrote:
> Hi all, we have quite a few tools in base that still require KVM or optionally support it. Removing all tools that require KVM for operation (and therefore setgid) is one of the open goals. It would be nice if that doesn't require adding lots of duplicate code. For that, a decision is required on which programs are required for post-mortem analysis (i.e. debugging kernel dumps), limiting dual-KVM/sysctl code paths to those.

For post-mortem work you often want the raw information from the kernel structures (ie including the KVA of things), which the normal user-tools don't need.

Putting the work into a single program that grovels KVM for diagnostics (aka SysV crash) means that only one program has to exactly match the kernel (and, maybe, could be compiled with the kernel?). Possibly some of the printf() statements could be shared with ddb.

David -- David Laight: da...@l8s.co.uk
open modes O_DENYREAD and O_DENYWRITE
There is a current thread on some of the linux lists and wine-devel about the semantics of two more open modes, O_DENYREAD and O_DENYWRITE. These are being implemented (I think only for nfs and samba at the moment) in order to support the equivalent windows open modes.

I don't know if NetBSD needs to worry about these (yet).

David -- David Laight: da...@l8s.co.uk
Re: open modes O_DENYREAD and O_DENYWRITE
On Wed, Feb 06, 2013 at 09:52:34AM +0100, Martin Husemann wrote:
> On Wed, Feb 06, 2013 at 08:19:41AM +0000, David Laight wrote:
> > There is a current thread on some of the linux lists and wine-devel about the semantics of two more open modes O_DENYREAD and O_DENYWRITE. These are being implemented (I think only for nfs and samba at the moment) in order to support the equivalent windows open modes.
> FYI, the windows modes that match our model are SHARE_DENY_NONE (0) and SHARE_EXCLUSIVE (O_EXCL), but O_SHLOCK is split into SHARE_DENY_READ and SHARE_DENY_WRITE. It kind of makes sense to me, but I don't have an easy example where the difference would be vital.

IIRC O_EXCL only applies to creates. Reading the man page, O_SHLOCK and O_EXLOCK only acquire flock() type locks.

The windows modes are hard enforced - and are a right PITA at times since a lot of programs open files exclusively.

David -- David Laight: da...@l8s.co.uk
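The advisory nature of the flock()-type locks mentioned above (which is exactly what makes them unlike the hard-enforced windows share modes) can be demonstrated in a few lines; the function name and return convention are mine.

```c
#include <sys/file.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: an exclusive flock() refuses a competing lock request, but a
 * descriptor that never asks for the lock can still do plain I/O.
 * Returns 0 if that advisory behaviour is observed. */
static int
advisory_lock_demo(const char *path)
{
	int fd1, fd2, rv = -1;

	fd1 = open(path, O_RDWR | O_CREAT, 0600);
	fd2 = open(path, O_RDWR);	/* second open, no lock requested */
	if (fd1 < 0 || fd2 < 0)
		goto out;
	if (flock(fd1, LOCK_EX) != 0)
		goto out;
	/* A competing exclusive lock is refused... */
	if (flock(fd2, LOCK_EX | LOCK_NB) == 0 || errno != EWOULDBLOCK)
		goto out;
	/* ...but plain I/O on the unlocked descriptor still works. */
	if (write(fd2, "x", 1) != 1)
		goto out;
	rv = 0;
out:
	if (fd1 >= 0) close(fd1);
	if (fd2 >= 0) close(fd2);
	return rv;
}
```

(flock() locks attach to the open file description, so two separate open()s of the same file conflict even within one process.)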
Re: event counting vs. the cache
On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote: It's customary to use a 64-bit integer to count events in NetBSD because we don't expect for the count to roll over in the lifetime of a box running NetBSD. I've been thinking about what these wide integers do to the cache footprint of a system and wondering if we shouldn't make a couple of changes: 1) Cram just as many counters into each cacheline as possible. Extend/replace evcnt(9) to allow the caller to provide the storage for the integer. On a multiprocessor box, you don't want CPUs sharing counter cachelines if you can help it, but do cram together each individual CPU's counters. Actually, if the counter can be placed in the same area as some other driver data, then it will typically already be in the cache. This is probably most important for things that are changed very often (like ethernet byte and packet counts). Having error counts in different cache lines probably isn't that important. This does mean that the evcnt(9) interface is completely the wrong one! It looks like 8 + 6 x sizeof (void *) bytes per counter - so every increment is (more or less) guaranteed to be a cache line miss. David -- David Laight: da...@l8s.co.uk
Re: event counting vs. the cache
On Thu, Jan 17, 2013 at 05:25:44PM -0600, David Young wrote:
> We can end up with silly values with the status quo, too, can't we? On 32-bit architectures like i386, x++ for uint64_t x compiles to
>	addl $0x1, x
>	adcl $0x0, x
> If the addl carries, then reading x between the addl and adcl will show a silly value. I think that you can avoid the silly values. Say you're using per-CPU counters. If counter x belongs to CPU p, then avoid silly values by reading x in a low-priority thread, t, that's bound to p and reads hi(x) then lo(x) then hi(x) again. If hi(x) changed, then t was preempted by a thread or an interrupt handler that wrapped lo(x), so t has to restart the sequence.

You don't actually need to restart: the value new_hi:0 happened while the function was running - so it is a valid response. I think there is another problem with that scheme - but I can't remember it!

There are other schemes that handle the case of a single writer (guaranteed by something else) and occasional readers that don't want to acquire whatever context single-threads the writes.

One is (I think): add an extra 32bit counter. The writer increments it before updating the stats block, and again afterwards. The reader spins until it is even, reads all the stats, then checks the value hasn't changed.

Another involves writing a 3rd value that contains the middle bits of the value (from both the high and low parts); the reader checks consistency. Or, assume a 63 bit counter will also not wrap and replicate the high bit of the low word into the low bit of the high word - the reader verifies.

On 64bit systems I sometimes wonder whether it is necessary for stats to be 100% accurate - so not using locked increments may be ok.

There are also issues making stats per-cpu. While not unreasonable for 2, 4 or 8 cpus, it gets a bit silly when there are 1024 or more. The differences between common and uncommon (eg error) stats also need to be considered.

David -- David Laight: da...@l8s.co.uk
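The "extra 32bit counter" scheme above can be sketched for a single writer as follows. The struct and function names are mine, and real kernel code would need memory barriers between the sequence and data accesses, which are omitted here.

```c
#include <stdint.h>

struct stats {
	volatile uint32_t seq;	/* odd while the writer is mid-update */
	uint64_t packets;
	uint64_t bytes;
};

/* Writer: single-threaded by some external guarantee. */
static void
stats_update(struct stats *s, uint64_t nbytes)
{
	s->seq++;		/* now odd: readers will retry */
	s->packets++;
	s->bytes += nbytes;
	s->seq++;		/* even again: snapshot is consistent */
}

/* Reader: spins until the sequence is even, then checks it was stable
 * across the reads of the stats block. */
static void
stats_read(const struct stats *s, uint64_t *packets, uint64_t *bytes)
{
	uint32_t before, after;

	do {
		do {
			before = s->seq;
		} while (before & 1);	/* writer mid-update */
		*packets = s->packets;
		*bytes = s->bytes;
		after = s->seq;
	} while (before != after);	/* a write slipped in: retry */
}
```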
Re: event counting vs. the cache
On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote:
> 2) Split all counters into two parts: high-order 32 bits, low-order 32 bits. It's only necessary to touch the high-order part when the low-order part rolls over, so in effect you split the counters into write-often (hot) and write-rarely (cold) parts. Cram together the cold parts in cachelines. Cram together the hot parts in cachelines. Only the hot parts change that often, so the ordinary footprint of counters in the cache is cut almost in half.

That means you have to have special code to read them in order to avoid getting 'silly' values.

David -- David Laight: da...@l8s.co.uk
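The split and the special read code it forces can be sketched as below. The layout here is illustrative only: in a real layout the 'lo' words of many counters would share hot cachelines and the 'hi' words would live together in cold ones, and a concurrent reader would additionally need the fields to be volatile or read with atomics.

```c
#include <stdint.h>

struct split_ctr {
	uint32_t lo;	/* hot: written on every event */
	uint32_t hi;	/* cold: written once per 2^32 events */
};

static void
split_ctr_inc(struct split_ctr *c)
{
	if (++c->lo == 0)	/* low word wrapped: touch the cold part */
		c->hi++;
}

/* Reader: sample hi, lo, hi and retry if a wrap slipped in between,
 * the hi/lo/hi scheme discussed earlier in the thread. */
static uint64_t
split_ctr_read(const struct split_ctr *c)
{
	uint32_t h1, l, h2;

	do {
		h1 = c->hi;
		l = c->lo;
		h2 = c->hi;
	} while (h1 != h2);
	return ((uint64_t)h2 << 32) | l;
}
```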
Re: USB_DEBUG mess
On Sat, Jan 05, 2013 at 11:12:29PM +, Christos Zoulas wrote: In article c873acdc-a444-4442-a021-42c621d86...@3am-software.com, Matt Thomas m...@3am-software.com wrote: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/usb/files.usb?rev=1.106content-type=text/x-cvsweb-markup Normally, the XXX_DEBUG options are not specified in any files.* files, meaning that as they are unknown options, they will translate into a CPPFLAG of -DXXX_DEBUG in the kernel Makefile which is means if you do the define, your sources don't properly get rebuilt. That's why it was made a config option. Yes, but now every file that includes usb.h needs to include opt_usb.h before, otherwise things don't build right. I'll fix it properly for now until we decide something else. How much of a difference does it make to the structure layouts? If there are only a few fields the extra space probably doesn't matter. David -- David Laight: da...@l8s.co.uk
Re: Hijacking a PCU.
On Sat, Dec 15, 2012 at 11:24:09AM -0800, Matt Thomas wrote: On amd64 the safe area we currently have for SSE2 is 512 bytes. Add support for the 256 AVX instructions and it increases to 832. You really don't want to be allocating multiple such saved areas (per lwp) on the off chance the kernel might want to use the registers. Since this is MD, you only need to save the register your kernel MD code will be using. 3) Saving and restoring a register may zero the high bits of an extended version of that register. That's an md problem. Since you are trying to sort out an MI solution you need to be aware of the known MD problems - otherwise the framework won't work for it. If an x86 program is using the 256bit AVX instructions, and some kernel code wants to use one of the 128bit SSE registers, then the kernel code has to save the 256 bit register, not the 128bit one. (And next year, the register might be extended even further, requiring different save/restore instructions.) Effectively this means that a completely separate fpu save area is needed. You can't just save a couple of registers on stack. David -- David Laight: da...@l8s.co.uk
Re: KNF and the C preprocessor
On Mon, Dec 10, 2012 at 09:55:14PM +0000, paul_kon...@dell.com wrote:
> > The compiler has some heuristics about what it is good to inline. gcc tends to treat 'inline' as just a hint.
> I wouldn't describe it that way. And I don't think the GCC documentation does. It does talk about heuristic inlining, but that's for the -O3 feature of inlining stuff that's *not* marked, based on heuristics that it might be useful.
> > Generally it likes inlining static functions that are only called once. But it doesn't always do so - even when marked 'inline' (marking them inline may have no effect).
> There are switches that control what gets inlined. In particular, there is one that says not to inline things (other than called-once things) that are bigger than X. If things are not getting inlined when expected, that is one possible cause.

I was having issues with static functions that are only called once. They are quite large (not large for a function) but adding a couple of 'boring' lines of code stopped them being inlined, and also stopped some 1 line functions being inlined. Marking everything with __attribute__((always_inline)) fixed it.

That particular code (part of an embedded system) has to have everything inlined - any register spills to stack make it too slow. I almost hacked gcc enough to remove the function prologue (which saves some registers on stack even though the function can't return) so that I could use %sp as a general purpose register! There are a couple of other candidate registers that are never used. (It is MIPS-like, and %r1 is reserved for assembler macros, of which there are exactly none; the only interrupts are fatal, so the interrupt %pc save (and debug %pc save) are also unused.)

David -- David Laight: da...@l8s.co.uk
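A minimal example of the attribute discussed above (the function here is only illustrative): plain 'inline' is a hint that gcc is free to ignore, while the attribute tells it to inline regardless of its size heuristics and to complain if it cannot.

```c
/* Force inlining: the attribute goes on a declaration. */
static inline int sq(int x) __attribute__((always_inline));

static inline int
sq(int x)
{
	return x * x;
}
```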
Re: KNF and the C preprocessor
On Mon, Dec 10, 2012 at 07:12:36PM -0500, Mouse wrote:
> > I want func() inlined twice so that there are only 2 conditional branches and usually a conditional branch in cmd() back to the loop top in each path.
> Why? (s/func/cmd/ I assume.)
> > So I need to stop the compiler tail merging the two parts of the inside 'if'. There is nothing I can put inside an inline function version of cmd() that will stop this happening.
> There's nothing you can put in a macro that will prevent it, either. Or, rather, as far as I can think of, anything you can do in a macro to prevent it you can also do in an inline function. If you have an example of something that'll work one way but not the other I'd be interested.

An assembler comment - from something like:
	asm volatile("; " STR(__LINE__))

David -- David Laight: da...@l8s.co.uk
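The STR() in the asm above relies on the usual double-expansion idiom, so that __LINE__ is expanded to its number before being stringised; each use site then produces a textually different asm statement, which is what defeats tail merging. A sketch (STR0 is only there to show what happens without the extra level):

```c
#define STR_(x)	#x
#define STR(x)	STR_(x)	/* expands the argument first: STR(__LINE__) -> "42" */
#define STR0(x)	#x	/* no double expansion: STR0(__LINE__) -> "__LINE__" */
```

Used per path as `asm volatile("; " STR(__LINE__));`, the comment text differs at every line, so the compiler cannot consider the two expansions identical.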
Re: KNF and the C preprocessor
On Tue, Dec 11, 2012 at 06:10:09AM +0000, David Holland wrote:
> On Tue, Dec 11, 2012 at 01:27:09AM +0000, Roland C. Dowdeswell wrote:
> > As an example, I often define a macro when I am using Kerberos or GSSAPI that looks roughly like:
> >	#define K5BAIL(x) do {
> >		ret = x;
> >		if (ret) {
> >			/* format error message and whatnot
> >			 * in terms of #x. */
> >			goto bail;
> >		}
> >	} while (0)
> The code like this in src/sys/nfs is a reliably steady source of problems, and I'd argue that macros of this form are not at all worth the problems they cause.

Absolutely, that is one construct I would ban. Encapsulating so you can do:
	if (K5BAIL(xxx(), error text))
		goto bail;
or even separating the function name (so it can be traced) at least leaves the flow control obvious.

If you embed a goto (or return) in a #define you'd better have the definition very close to the use.

David -- David Laight: da...@l8s.co.uk
Re: fixing compat_12 getdents
On Mon, Dec 10, 2012 at 09:53:46PM +0200, Alan Barrett wrote:
> > also, EINVAL doesn't seem like a great error code for this condition. it's not an input parameter that's causing the error, but rather that the required output format cannot express the data to be returned. I think solaris uses EOVERFLOW for this kind of situation, and ERANGE doesn't seem too bad either. any opinions on that? There's also E2BIG, but I don't think it fits.
> ERANGE is documented in terms of the available space, while EOVERFLOW is documented in terms of a numeric result. So perhaps EOVERFLOW for integer is too large to fit in N bits, and ERANGE for string is too long to fit in N bytes? Or vice versa? Somebody(TM) should go through the errno(2) documentation and make the descriptions more generic, and add guidance for choosing which code to return.

Then people get upset because they say function foo() isn't allowed to set errno to 'bar'. It is rather a shame that posix tries to list all the errno values a function can return, not just those for explicit 'likely' (ie normal) non-success returns from a function.

For the inode number, it is a slight shame that a 'fake' value can't be returned - maybe 0x - since a lot of the code won't really care. More likely to be an issue for the stat() functions - but not much code really cares. Well, not much that you are really going to run compat versions of. There are issues getting unique dev/inode pairs anyway for some filesystems (and things like union mounts).

David -- David Laight: da...@l8s.co.uk
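The conversion under discussion can be sketched in a few lines (the function name and shape are mine, not the kernel's): squeeze a 64-bit inode number into the old 32-bit field, and fail with EOVERFLOW rather than EINVAL when the output format cannot express the value.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative compat conversion: 64-bit ino_t down to the old 32-bit
 * field.  Returns 0 on success, EOVERFLOW if the value doesn't fit. */
static int
compat_ino(uint64_t ino64, uint32_t *ino32)
{
	if (ino64 > UINT32_MAX)
		return EOVERFLOW;	/* output format can't express it */
	*ino32 = (uint32_t)ino64;
	return 0;
}
```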
Re: KNF and the C preprocessor
On Mon, Dec 10, 2012 at 03:50:00PM -0500, Thor Lancelot Simon wrote:
> On Mon, Dec 10, 2012 at 02:28:28PM -0600, David Young wrote:
> > On Mon, Dec 10, 2012 at 07:37:14PM +0000, David Laight wrote:
> > > a) #define macros tend to get optimised better.
> > Better even than an __attribute__((always_inline)) function?

Consider the following code:

	int ring[100];
	#define ring_end (ring + 100)
	int *ring_ptr;
	int ring_wrap_count;

	#define cmd(n) \
		if (__predict_true(ring_ptr < ring_end)) \
			*ring_ptr++ = n; \
		else { \
			ring_ptr = ring; \
			*ring_ptr++ = n; \
			ring_wrap_count++; \
		}

	for (;;) {
		if (__predict_false(...)) {
			if (...) {
				cmd(1);
				continue;
			}
			...
			cmd(2);
			continue;
		}
		...
	}

I want func() inlined twice so that there are only 2 conditional branches and usually a conditional branch in cmd() back to the loop top in each path. So I need to stop the compiler tail merging the two parts of the inside 'if'. There is nothing I can put inside an inline function version of cmd() that will stop this happening. In the #define version I can add things that stop the compiler merging the code. Prizes for thinking what! (Yes I could do the same in the outer code, but that happens quite often and I'd much rather hide the hackery in one place.)

And yes, this is a real case from some code where I needed to minimise the worst case path enough that the extra branch mattered! The 'unusual' worst case of 'ring wrap' doesn't matter.

I've seen other cases where the code for a #define is better than that for an inline function. Possibly because an extra early optimisation happens. I know I've also had issues getting compilers to actually inline stuff. gcc's __attribute__((always_inline)) helps - I've had to use it to get static functions that are only called once reliably inlined.

> I'd like to submit that neither are a good thing, because human beings are demonstrably quite bad at deciding when things should be inlined, particularly in terms of the cache effects of excessive inline use.

Indeed - there are some horrid large #define macros lurking. For some of them I can't imagine when they were beneficial. There have been some places where apparently innocuous #defines have exploded out of all proportion. The worst I remember was the SYSV vn_rele(): by the time the original spl() functions had been replaced with lock functions, and the locks had also become inlined, the whole thing exploded.

> One reason why macros should die is that in the process, inappropriate and harmful excessive inlining of code that would perform better if it were called as subroutines would die.

That is true whether inline functions or #defines are used. Are some computer science courses teaching about optimisations that really haven't been true since the days of the m68k?

David -- David Laight: da...@l8s.co.uk
Re: KNF and the C preprocessor
On Mon, Dec 10, 2012 at 09:26:08PM +0000, paul_kon...@dell.com wrote:
> On Dec 10, 2012, at 4:18 PM, David Young wrote:
> > > > > a) #define macros tend to get optimised better.
> > > > Better even than an __attribute__((always_inline)) function?
> > > I'd like to submit that neither are a good thing, because human beings are demonstrably quite bad at deciding when things should be inlined, particularly in terms of the cache effects of excessive inline use.
> > I agree with that. However, occasionally I have found, when I'm optimizing the code based on actual evidence rather than hunches and the compiler is letting me down, always_inline was necessary. Dave
> Is that because of compiler bugs, or because the compiler was doing what it's supposed to be doing?

The compiler has some heuristics about what it is good to inline. gcc tends to treat 'inline' as just a hint. Generally it likes inlining static functions that are only called once. But it doesn't always do so - even when marked 'inline' (marking them inline may have no effect).

Inlining leaf functions is particularly useful - as it removes a lot of register pressure in the calling function. If you can inline all calls - making a function a leaf one - it is even better.

David -- David Laight: da...@l8s.co.uk
Re: KNF and the C preprocessor
On Mon, Dec 10, 2012 at 06:47:16PM -0500, Mouse wrote: b) __LINE__ (etc) have the value of the use, not the definition. Yes, but if you use static inlines, the debugger's got both -- which it won't, if you use macros... Huh? Okay, what's the static inline version of log() here? #define log(msg) log_(__FILE__,__LINE__,(msg)) extern void log_(const char *, int, const char *); I see a #define lurking ! David -- David Laight: da...@l8s.co.uk
Re: nfsd serializing patch
On Fri, Dec 07, 2012 at 06:46:41AM +0000, YAMAMOTO Takashi wrote:
> hi,
> > Hello, while working on nfs performance issues with overquota writes (which turned out to be a ffs issue), I came up with the attached patch. What this does is, for nfs over TCP, restrict a socket buffer's processing to a single thread (right now, all pending requests are processed by all threads in parallel). This has two advantages:
> > - if a single client sends lots of requests (like writes coming from a linux client), a single thread is busy and other threads will be available to serve other clients' requests quickly
> > - by avoiding CPU cache sharing and lock contention at the vnode level (if all requests are for the same vnode, which is the common case), we get slightly better performance.
> > My testbed is a linux box with 2 Opteron 2431 (12 cores total) and 32GB RAM writing over gigabit ethernet to a NetBSD server (dual Intel(R) Xeon(TM) CPU 3.00GHz, 4 hyperthread cores total) running nfsd -tun4. Without the patch, the server processes about 1230 writes per second; with this patch it processes about 1250 writes/s. Comments?
> interesting. but doesn't it have ill effects if the client has multiple independent activities on the mount point?

They will be hitting the same physical disc, so probably queue behind each other. I've never seen any reason for the historical '4 nfsd server processes'. A lot of configurations work better with only 1.

I've seen cases where the nfs client would be buffering writes, then decide to write a whole load of pages out of the buffer cache. This (or maybe something else) led to a considerable number of concurrent 8k nfs writes. The server processes pick one each and the disk becomes busy. The disk access algorithm (probably staircase) leaves one of the requests unfulfilled as new requests for nearer sectors keep arriving. The stalled nfs request times out and is retried. The stalled request finally finishes, but the rpc request has been timed out so is discarded.

You now have multiple retry requests making matters worse, and almost no progress is made (this is the ethernet trace I was given!). This is fairly typical if the server is slow/overloaded. With only one server process it is all fine.

David -- David Laight: da...@l8s.co.uk
Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1
On Fri, Dec 07, 2012 at 09:57:12AM -0500, Greg Troxel wrote:
> jnem...@victoria.tc.ca (John Nemeth) writes:
> > On Apr 27, 3:15am, David Laight wrote:
> > } One thing I discovered long ago, in an operating system far ... well
> > } not NetBSD, is that dhcp's use of the bpf (equivalent) caused a data
> > } copy for every received ethernet frame - at considerable cost.
> > } I've NFI whether this happens with the current code.
> > Given that DHCP is very low traffic, I'm not sure that this really matters.
> I don't think that's what he means. In most drivers, the idiom is
>	if (there are bpf listeners) {
>		m0 = cons up an mbuf chain that represents the packet
>		bpf_mtap(m0, blah blah)
>	}
> So the work to marshall the packet that might be tapped happens if there is a listener, not if the listener wants this packet.

You've also missed the fact that it wasn't NetBSD - try VxWorks. All the filtering happened in the dhcp code.

David -- David Laight: da...@l8s.co.uk
Re: nfsd serializing patch
On Fri, Dec 07, 2012 at 10:46:44AM +0100, Ignatios Souvatzis wrote:
> On Fri, Dec 07, 2012 at 08:42:39AM +0000, David Laight wrote:
> > With only one server process it is all fine.
> If all is backed by the same disk? Hm

Yep - even with multiple disks all the server processes soon end up trying to process requests for the 'busy' disk. To make multiple server processes work well, there would have to be a separate queue for each disk spindle. But AFAIK the server processes just pull the next request off a single queue - originally the udp socket.

David -- David Laight: da...@l8s.co.uk
Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1
On Tue, Dec 04, 2012 at 10:17:23PM -0800, John Nemeth wrote:
> On Apr 22, 5:50pm, Robert Elz wrote:
> We use ISC's DHCP server. As third party software, it is designed to be portable to many systems. BPF is a fairly portable interface, thus a reasonable interface for it to use.

One thing I discovered long ago, in an operating system far ... well not NetBSD, is that dhcp's use of the bpf (equivalent) caused a data copy for every received ethernet frame - at considerable cost. I've NFI whether this happens with the current code.

Although DHCP has to do strange things in order to acquire the original lease, renewing it should really only require packets with the current IP address.

David -- David Laight: da...@l8s.co.uk
Re: filesystem namespace regions, or making mountd less bozotic
On Wed, Dec 05, 2012 at 09:29:06PM +0000, David Holland wrote:
> I am tired of PR 3019 and its many duplicates, so I'd like to see a scheme that allows managing arbitrary subtrees of the filesystem namespace in a reasonably useful manner. The immediate application is nfs exports and mountd; however, I expect the resulting mechanism will also be useful for handling chroots and possibly also inotify-type mechanisms.

Haven't you forgotten about 'file handles'? Since they refer to files, you don't know anything about the containing directory.

In the old days NFS had the following 'rules':
1) If you export part of a filesystem, you export all of it.
2) If you give anyone access, you give everyone access.
3) If you give anyone write access, you give everyone write access.

I suspect 2 & 3 are no longer true (in NetBSD) as nfs checks the permissions, not just mountd. 1 is true if clients can 'fake up' valid file handles (used to be very easy).

David -- David Laight: da...@l8s.co.uk
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 02:57:52PM +0000, David Holland wrote:
> > What's a kernel panic got to do with it? If you hand the controller, and thus the drive, a 4K write, the kernel panicing won't suddenly cause you to reverse time and have issued 8 512-byte writes instead.
> That depends on additional properties of the pathway from the FS to the drive firmware. It might have sent 1 of 2 2048-byte writes before the panic, for example. Or it might be a vintage controller incapable of handling more than one sector at a time.

The ATA command set supports both writes of multiple sectors and multi-sector writes (probably not using those terms though!). In the first case, although a single command is written, the drive will (effectively) loop through the sectors writing them 1 by 1. All drives support this mode. For multi-sector writes, the data transfer for each group of sectors is done as a single burst. So if the drive supports 8-sector multi-sector writes, and you are doing PIO transfers, you take a single 'data' interrupt and then write all 4k bytes at once (assuming 512 byte sectors). The drive identify response indicates whether multi-sector writes are supported, and if so how many sectors can be written at once. If the data transfer is DMA, it probably makes little difference to the driver. For quite a long time the NetBSD ata driver mixed them up - and would only request writes of multiple sectors if the drive supported multi-sector writes. Multi-sector writes are probably quite difficult to kill part way through since there is only one DMA transfer block.

> > Given how drives actually write data, I would not be so sanguine that any sector, of whatever size, in-flight when the power fails, is actually written with the values you expect, or not written at all.
> Yes, I'm aware of that. It remains a useful approximation, especially for already-existing FS code.

Given that (AFAIK) a physical sector is not dissimilar from an HDLC frame, once the write has started the old data is gone; if the write is actually interrupted you'll get a (correctable) bad sector. If you are really unlucky the write will be long - and trash the following sector (I managed to power off a floppy controller before it wrecked the rest of a track when I'd reset the writer with write enabled). If you are really, really unlucky I think it is possible to destroy adjacent tracks.

David -- David Laight: da...@l8s.co.uk
Re: wapbl_flush() speedup
On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote: On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote: hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes: The attached diff tries to coalesce writes to the journal in MAXPHYS sized and aligned blocks. [...] Comments or objections anyone? + * Write data to the log. + * Try to coalesce writes and emit MAXPHYS aligned blocks. Looks fine, but I would prefer the code to use an arbitrarily sized buffer in case we get individual per device transfer limits. Currently that size would still be MAXPHYS, but then the code could query the driver for a proper size. As `struct wapbl' is per-mount and I suppose this will be per-mount-static it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as tls-maxphys comes to head. Except that you want the writes to be preferably aligned to that length, not just of that length. David -- David Laight: da...@l8s.co.uk
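The point about alignment, not just length, can be sketched as a chunking helper (names and shape are mine): each chunk runs only to the next transfer-size boundary, so after at most one short leading chunk every full chunk is both aligned and maximal. 'align' stands in for MAXPHYS or a per-device limit.

```c
#include <stdint.h>
#include <stddef.h>

/* How many bytes to issue next, starting at byte offset 'off' with
 * 'resid' bytes left, so that chunks end on 'align'-byte boundaries. */
static size_t
next_chunk(uint64_t off, size_t resid, size_t align)
{
	size_t n = align - (size_t)(off % align);	/* bytes to next boundary */

	return resid < n ? resid : n;
}
```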
Re: FFS write coalescing
On Mon, Dec 03, 2012 at 06:21:30PM +0100, Edgar Fuß wrote:
> When FFS does write coalescing, will it try to align the resulting
> 64k chunk? I.e., if I have 32k blocks and I write blocks 1, 2, 3, 4;
> will it write (1,2) and (3,4) or 1, (2,3) and 4?
> Of course, the background for my question is RAID stripe alignment.

With that thought, for RAID5 in particular, you'd want the FS code to indicate to the disk that it had some of the nearby data in memory. That would save the read of the parity data. (Which would be really horrid to implement!)

Perhaps the 'disk' should give some 'good parameters' for writes to the FS code when the filesystem is mounted.

David -- David Laight: da...@l8s.co.uk
Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1
On Wed, Nov 28, 2012 at 03:19:49PM -0800, Brian Buhrow wrote:
> Hello. I've just noticed an issue where broadcast traffic on vlans
> also shows up on the parent interface. My environment is
> NetBSD-5.1/i386 with the wm(4) driver. I'm not sure yet if the
> problem is specific to the wm(4) driver or if it's a more general
> issue. The bug didn't exist in NetBSD-4.x.

There are some very recent messages on the Linux 'netdev' list (or possibly the tcpdump one) about issues with vlan tags being visible (or not) in the messages passed to tcpdump (and maybe processed internally). I only skim-read it, so can't remember the exact issue - but it is similar.

David -- David Laight: da...@l8s.co.uk
Re: Problem identified: WAPL/RAIDframe performance problems
On Fri, Nov 30, 2012 at 08:00:51AM +0000, Michael van Elst wrote:
> da...@l8s.co.uk (David Laight) writes:
> > I must look at how to determine that disks have 4k sectors and to
> > ensure filesystems have 4k fragments - regardless of the fs size.
>
> newfs should already ensure that fragment >= sector.

These disks lie about their actual sector size. The disk's own software does RMW cycles for 512-byte writes.

David -- David Laight: da...@l8s.co.uk
Re: core statement on fexecve, O_EXEC, and O_SEARCH
On Mon, Nov 26, 2012 at 01:49:09AM +0300, Alan Barrett wrote:
> If necessary, the open(2) syscall could be versioned so that O_RDONLY
> is no longer defined as zero.

Actually we could redefine (say):

	O_RDONLY	0x1
	O_WRONLY	0x2
	O_RDWR		(O_RDONLY | O_WRONLY)
	O_SEARCH	0x4
	O_EXEC		0x8

(or similar) and fall back on the value in bits 0 and 1 if none of the above are set. That doesn't require a syscall version bump.

David -- David Laight: da...@l8s.co.uk
Re: fexecve, round 3
On Sun, Nov 25, 2012 at 07:54:59PM +0000, Christos Zoulas wrote:
> Does everyone agree on this interpretation? If we do, next steps are
> - describe threats this introduces to chrooted processes

Given that a chrooted process would need a helping process outside the chroot (to pass it the fd), why is allowing the chrooted process to exec something any different from it arranging to get the helper to do it? I think it can only matter if the uid of the chroot is root. Even then you could (probably) do nothing you couldn't do by mmap()ing some anon space with exec permissions and writing code to it.

FWIW IIRC the standard says that O_EXEC can't be combined with O_RDONLY (or O_RDWR), but does it say that you can't read from a file opened O_EXEC?

David -- David Laight: da...@l8s.co.uk
Re: fexecve, round 2
On Mon, Nov 19, 2012 at 05:23:07AM +0000, David Holland wrote:
> On Sun, Nov 18, 2012 at 06:51:51PM +0000, David Holland wrote:
> > This appears to contradict either the description of O_EXEC in the
> > standard, or the standard's rationale for adding fexecve(). The
> > standard says O_EXEC causes the file to be open for execution only.
> > In other words, O_EXEC means you can't read nor write the file. Now
> > the rationale for fexecve() doesn't hold, since you cannot read from
> > the fd and then exec from it without a reopen. Further, requiring
> > O_EXEC would seem to directly contravene the standard's language
> > about fexecve()'s behavior.
>
> The standard is clearly wrong on a number of points and doesn't match
> the historical design and behavior of Unix. Let's either implement
> something correct, or not implement it at all.
>
> Also it seems that the specification of O_SEARCH (and I think the
> implementation we just got, too) is flawed in the same way - it is
> performing access checks at use time instead of at open time.
>
> So, for the record, I think none of these flags should be added
> unless they behave the same way opening for write does -- the flag
> cannot be set except at open time, and only if the opening process
> has permission to make the selected type of access; once opened the
> resulting file handle functions as a capability that allows the
> selected type of access. Anything else creates horrible
> inconsistencies and violates the principle of least surprise, both of
> which are not acceptable as part of the access control system.

Does fchmod() itself have any issues? If I open a file that doesn't have write permissions, I can use fchmod() to add write permissions. My open fd won't magically gain write access, but maybe I can open it again via /dev/fd (possibly after linking the inode back into the filesystem) and gain the extra permissions. Clearly I would need to be the owner, but with chroots that shouldn't be enough if the file might actually be outside the chroot.

David -- David Laight: da...@l8s.co.uk
Re: fexecve, round 2
On Mon, Nov 19, 2012 at 08:08:58AM +0000, Emmanuel Dreyfus wrote:
> On Mon, Nov 19, 2012 at 05:23:07AM +0000, David Holland wrote:
> > Also, it obviously needs to be possible to open files
> > O_RDONLY|O_EXEC for O_EXEC to be useful, and open directories
> > O_RDONLY|O_SEARCH, and so forth. I don't know what POSIX may have
> > been thinking when they tried to forbid this but forbidding it
> > makes about as much sense as forbidding O_RDWR, maybe less.
>
> It seems consistent with the check at system call time that you
> proposed to forbid. Here is how I understand it for an openat/mkdirat
> sequence:
> - openat() without O_SEARCH: get a search check at mkdirat() time
> - openat() with O_SEARCH: mkdirat() performs no search check.
> And for openat/fexecve:
> - openat() without O_EXEC: get an execute check at fexecve() time
> - openat() with O_EXEC: fexecve() performs no exec check.
> If you have r-x permission, you open with O_RDONLY and you do not
> need O_SEARCH/O_EXEC. If you have --x permission, you open with
> O_SEARCH/O_EXEC.

I think the standard implied that O_EXEC gave you read and execute permissions, so you can't use it to open files that are --x. I haven't seen a quote for O_SEARCH.

Without the xxxat() functions, the read/write state of directory fds (as opposed to that of the directory itself) has never mattered. O_SEARCH might be there to allow you to open . when you don't have read (or write) access to it. For openat() it is plausible that write access to the directory fd might be needed, as well as write access to the underlying directory, in order to create files.

David -- David Laight: da...@l8s.co.uk
Re: fexecve, round 2
On Mon, Nov 19, 2012 at 11:25:07AM -0500, Thor Lancelot Simon wrote:
> On Mon, Nov 19, 2012 at 03:13:02PM +0000, Emmanuel Dreyfus wrote:
> > On Mon, Nov 19, 2012 at 02:39:36PM +0000, Julian Yon wrote:
> > > No, Emmanuel is right: [...] use the O_EXEC flag when opening fd.
> > > In this case, the application will not be able to perform a
> > > checksum test since it will not be able to read the contents of
> > > the file.

You can open with --x but (correctly) you can't read from the file.

Given the comments later about O_SEARCH | O_RDONLY not being distinguishable from O_SEARCH (because, historically, O_RDONLY is zero) and 'similarly for O_EXEC', I suspect the wording of the sections got reworded quite late on - and probably after the bar had opened and everyone at the meeting was hungry!

I suspect that, for --x-- items, opens with O_EXEC or O_SEARCH might need to succeed, and any later read/mmap requests fail.

> > And it means the standard mandates that one can execute without
> > read access. Weird.
>
> What's weird about that?
>
> % cp /bin/ls /tmp
> % chmod 100 /tmp/ls
> % ls -l /tmp/ls
> ---x------  1 tls  users  24521 Nov 19 11:24 /tmp/ls
> % /tmp/ls -l /tmp/ls
> ---x------  1 tls  users  24521 Nov 19 11:24 /tmp/ls
> %

More fun are #! scripts that are --s-- Typically they can be executed by everyone except the owner! (Provided suid scripts are allowed - and I don't know any reason why they shouldn't be, provided the kernel passes the open fd to the interpreter.)

David -- David Laight: da...@l8s.co.uk
Re: fexecve, round 2
On Sat, Nov 17, 2012 at 11:48:20AM +0100, Emmanuel Dreyfus wrote:
> Here is an attempt to address what was said about implementing
> fexecve().
>
> fexecve() checks that the vnode underlying the fd:
> - is of type VREG
> - grants execution right
>
> O_EXEC causes open()/openat() to fail if the file mode does not grant
> execute rights.
>
> There are security concerns with fd passed to chrooted processes,
> which could help executing code. Here is a proposal for chrooted
> processes:
> 1) if the current process and the executed vnode have different
> roots, then fexecve() fails

I'm not sure how you were intending to determine that. You can follow .. for directories, but not for files.

> 2) if the fd was not open with O_EXEC, fexecve() fails.
> The first point avoids executing code from outside the chroot.
> The second point enforces W^X inside the chroot.

If we don't want to allow a chroot'ed process to exec a file that is outside the chroot, then maybe the kernel could hold a reference to the directory vnode (in the file vnode) whenever a file is opened for execute (including the existing exec() family of calls). As well as being used to police fexecve() within a chroot, it could be used by the dynamic linker for $ORIGIN processing (probably via some special flags to openat()).

David -- David Laight: da...@l8s.co.uk
Re: [PATCH] fexecve
On Fri, Nov 16, 2012 at 12:52:30PM +0000, Julian Yon wrote:
> On Fri, 16 Nov 2012 08:34:29 +0000, David Laight da...@l8s.co.uk
> wrote:
> > On Thu, Nov 15, 2012 at 10:14:18PM +0100, Joerg Sonnenberger wrote:
> > > Frankly, I still don't see the point why something would want to
> > > use it.
> >
> > How about running a statically linked executable inside a chroot
> > without needing the executable itself to do the chroot.
>
> What does this gain over passing a filename around? (NB. I'm not
> claiming that's an entirely safe model either, but it's already
> possible).

You don't need the executable image inside the chroot.

David -- David Laight: da...@l8s.co.uk
Re: [PATCH] fexecve
On Thu, Nov 15, 2012 at 10:14:18PM +0100, Joerg Sonnenberger wrote:
> On Thu, Nov 15, 2012 at 08:20:30PM +0100, Emmanuel Dreyfus wrote:
> > Thor Lancelot Simon t...@panix.com wrote:
> > > The point is, this is interesting functionality that makes
> > > something new possible that is potentially useful from a security
> > > point of view, but the new thing that's possible also breaks
> > > assumptions that existing code may rely on to get security
> > > guarantees it wants.
> >
> > Well, it is standard-mandated and we want to be standard-compliant.
> > If it is a security hazard, we can have a sysctl to disable the
> > system call. Something like sysctl -w kern.fexecve=0, and it would
> > return ENOSYS.
>
> Frankly, I still don't see the point why something would want to use
> it.

How about running a statically linked executable inside a chroot without needing the executable itself to do the chroot.

Oh, and now make $ORIGIN work for dynamic executables and fexecve() :-) (Probably not a good idea inside chroots! At least you wouldn't want it to work AFTER the initial program load.)

David -- David Laight: da...@l8s.co.uk
Re: [PATCH] fexecve
On Thu, Nov 15, 2012 at 04:02:50PM -0500, Thor Lancelot Simon wrote:
> > From the spec:
> > "The purpose of the fexecve() function is to enable executing a
> > file which has been verified to be the intended file. It is
> > possible to actively check the file by reading from the file
> > descriptor and be sure that the file is not exchanged for another
> > between the reading and the execution."
> > ...which seems a reasonable enough thing to want to do.
>
> Look at that rationale carefully and I think you will see the race
> condition that it does not eliminate.

Talk about a solution looking for a problem!

You could create a temporary file, unlink it, copy the executable into the new file, verify the contents, and then exec the unlinked temporary file.

Better: add an open mode that hard-disables writes (as used on many systems for executables anyway), open the file with that mode ...

David -- David Laight: da...@l8s.co.uk
Re: [PATCH] POSIX extended API set 2
On Sun, Nov 11, 2012 at 04:19:03AM -0800, Matt Thomas wrote:
> On Nov 11, 2012, at 12:39 AM, Alan Barrett wrote:
> > I want the names to follow a clear and easily-documented pattern.
> >
> > Takes a name   Takes a fd, not a name   Takes a name and an at fd
> >                (prepend f)              (append at)
> > ------------   ----------------------   -------------------------
> > open           - (fopen is different)   openat
> > link           -                        linkat
> > unlink         -                        unlinkat
> > rename         -                        renameat
> > chdir          fchdir                   chdirat
> > mkdir          fmkdir                   mkdirat
> > mkfifo         fmkfifo                  mkfifoat
> > utimens        futimens                 utimensat
> > chmod          fchmod                   chmodat (not fchmodat)
> > chown          fchown                   chownat (not fchownat)
> > stat           fstat                    statat (not fstatat)
> > access         -                        accessat (not faccessat)
> >
> > However, I also want the inconsistent POSIX names to be provided.
>
> Don't forget:
> chroot         fchroot                  chrootat

How do these names fit into the previously reserved namespaces? Or has that been completely ignored by the Posix folks again?

David -- David Laight: da...@l8s.co.uk
Re: cprng sysctl: WARNING pseudorandom rekeying.
On Fri, Nov 09, 2012 at 06:53:45PM -0500, Greg Troxel wrote:
> FWIW, I agree with the notion that defaults should be at a path that
> is ~always in root; it's normal to have /var in a separate filesystem
> (at least for old-school UNIX types; I realize the kids these days
> think there should be one whole-disk fs as /).

I always try to separate the OS files from my files, mainly so that I can reinstall the OS (often into a different root filesystem) and still have access to the other filesystems. As well as /home, I also put any big source trees in their own fs. (So I'd never have /usr/src ...)

David -- David Laight: da...@l8s.co.uk
Re: WAPL/RAIDframe performance problems
> For example, /usr/include/ufs/ffs/fs.h suggests that the super block
> could be in one of 4 different places on your partition, depending on
> what size your disk is, and what version of superblock you're using.

From my memory of the ffs disk layout, fs block/sector numbers start from the beginning of the partition and just avoid allocating the area containing the superblock copies. So the position of the superblocks (one exists in each cylinder group) is rather irrelevant.

What is more likely to cause grief is 512-byte writes - since modern disks have 4k physical sectors. I think netbsd tends to do single-sector writes for directory entries and the journal - these will be somewhat suboptimal!

David -- David Laight: da...@l8s.co.uk
Re: suenv
On Thu, Oct 25, 2012 at 03:58:33AM +0200, Emmanuel Dreyfus wrote:
> David Laight da...@l8s.co.uk wrote:
> > Wasn't there a recent change to ld so that NEEDED entries are only
> > added if the shared library is needed to resolve symbols?
> > Which makes the naive addition of -lpthread useless.
>
> That seems the best workaround to the problem, but it was either not
> pulled up in netbsd-6, or it does not work. What files were touched
> with that change? It is not obvious in ld.elf_so.

I'm thinking of a change to ld itself; maybe NetBSD hasn't imported that version yet. (Or my brain cells are faulty.)

David -- David Laight: da...@l8s.co.uk
Re: suenv
On Tue, Oct 23, 2012 at 06:08:37PM +0200, Martin Husemann wrote:
> On Tue, Oct 23, 2012 at 04:31:52PM +0200, Emmanuel Dreyfus wrote:
> > Opinions?
>
> Either PAM modules should not be allowed to use shared libraries
> that use pthreads, or we need to make sure every application using
> PAM is linked against libpthread.

Wasn't there a recent change to ld so that NEEDED entries are only added if the shared library is needed to resolve symbols? Which makes the naive addition of -lpthread useless.

I also dug into some linux libc.so and libpthread.so. AFAICT the mutex and condvar functions are in libc. There are some stubs in libpthread that do an indirect jump; not sure what that is based on though.

David -- David Laight: da...@l8s.co.uk
Re: Raidframe and disk strategy
On Tue, Oct 16, 2012 at 08:12:39PM +0200, Edgar Fuß wrote:
> > processes would get stuck in biowait
>
> What I usually see is one of nfsd's four subthreads in biowait and
> the other three in tstile.

FWIW I've NFI why nfsd has this default of 4 threads. I've seen a lot of systems where 1 works much better (for all sorts of reasons).

David -- David Laight: da...@l8s.co.uk
Re: NetBSD vs Solaris condvar semantics
On Sun, Oct 14, 2012 at 02:27:48PM +0000, Taylor R Campbell wrote:
>    Date: Sun, 14 Oct 2012 09:37:09 +0200
>    From: Martin Husemann mar...@duskware.de
>
>    In the zfs code, where do they store the mutex needed for cv_wait?
>
> In the two cases I have come across, dirent locks and range locks, a
> number of condvars, one per dirent or one per range, share a common
> mutex in some common enclosing object, such as a znode. So, e.g., the
> end of zfs_dirent_unlock looks like
>
>	cv_broadcast(&dl->dl_cv);	/* dl is a dirent lock stored in dzp. */
>	mutex_unlock(&dzp->z_lock);
>	cv_destroy(&dl->dl_cv);
>	kmem_free(dl, sizeof(*dl));

What do the waiters actually look like?

A lot of cv definitions allow for 'random' wakeups, e.g. cv_broadcast() is allowed to wake all waiters on the cv. So after being woken you are required to re-check the condition.

David -- David Laight: da...@l8s.co.uk