Re: UVM and the NULL page

2016-12-29 Thread David Laight
On Tue, Dec 27, 2016 at 02:12:59PM +0100, Wolfgang Solfrank wrote:
> Hi,
> 
> >Any cpu that doesn't require special instructions for copyin/out
> >is susceptible to user processes mapping code to address 0 and
> >converting a kernel 'jump through unset pointer' from a panic
> >into a massive security hole (executing process code with the
> >'supervisor' bit set).
> 
> Only if you do a naive implementation of copyin/out. Nothing prevents
> you from implementing copyin/out on these cpus by mapping only the
> relevant part of the user address space at some reserved address
> (maybe even one page at a time), do the copying and then unmap the
> user space part. No reason to share the user address space all the
> time.
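The windowed copyin the quoted text describes can be sketched in user space; the page-at-a-time loop below reads through a `window` buffer where a real kernel would pmap_enter()/pmap_remove() the user page, so the names and the memcpy stand-ins are illustrative, not an actual pmap API:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for the reserved kernel address at which one user page is
 * mapped at a time; a real implementation would pmap_enter() the user
 * page here instead of copying into a buffer. */
static char window[PAGE_SIZE];

/* Copy len bytes from "user" memory usrc to kernel dst, mapping (here:
 * copying) at most one page-sized window at a time. */
static void copyin_windowed(void *dst, const void *usrc, size_t len)
{
    const char *src = usrc;
    char *d = dst;

    while (len > 0) {
        /* Only the remainder of the current user page is "mapped". */
        size_t off = (size_t)src & (PAGE_SIZE - 1);
        size_t chunk = PAGE_SIZE - off;
        if (chunk > len)
            chunk = len;
        memcpy(window, src, chunk);   /* "map" + read through window */
        memcpy(d, window, chunk);
        src += chunk;
        d += chunk;
        len -= chunk;                 /* window "unmapped" for next page */
    }
}
```

The cost David objects to is exactly the map/unmap (and TLB shootdown) hidden in each loop iteration.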

That requires you do a full 'pmap' change on every system call
entry and exit - which will slow things down somewhat.
You don't even want to invalidate the user TLB.

David

-- 
David Laight: da...@l8s.co.uk


Re: vrele vs. syncer deadlock

2016-12-29 Thread David Laight
On Sun, Dec 11, 2016 at 08:39:06PM +, Michael van Elst wrote:
> dholland-t...@netbsd.org (David Holland) writes:
> 
> >On a low-memory machine Nick ran into the following deadlock:
> 
> >  (a) rename -> vrele on child -> inactive -> truncate -> getblk ->
> >  no memory in buffer pool -> wait for syncer

Could the child vnode tidyup be done at a later time?
ie just queue it in the vrele path.

David

-- 
David Laight: da...@l8s.co.uk


Re: x86: move the LAPIC va

2016-12-29 Thread David Laight
On Sat, Oct 08, 2016 at 05:14:43PM +0200, Maxime Villard wrote:
> On x86 there's a set of memory-mapped registers that are per-cpu and
> called the LAPIC. They reside at a fixed one-page-sized physical
> address, and in order to read or write to them the kernel has to
> allocate a virtual address and then kenter it to the aforementioned
> physical one.
> 
> In the NetBSD kernel, however, we do something a little bizarre:
> instead of following this model, the kernel has a blank page at the
> beginning of the data segment, and it then directly kenters the va of
> this page into the LAPIC pa.
> 
> The issue with this design is that it implies the first page of .data
> does not actually belong to .data, and therefore it is not possible to
> map the beginning of .data with large pages. In addition to this,
> without going into useless details, it creates an inconsistency in the
> low memory map, because the pa<->va translation is not linear, even if
> it seemingly is harmless.

If you are going to change it, why not pick a more appropriate fixed virtual
address?
The smp code will already be using one for things like curproc.
That way you don't need to add all the indirections to the asm code
and don't need asm #defines that use temporary registers.
You'll still need a physical page for non-LAPIC cpus (probably
not smp-capable designs).

David

-- 
David Laight: da...@l8s.co.uk


Re: UVM and the NULL page

2016-12-26 Thread David Laight
On Mon, Aug 01, 2016 at 03:56:01PM +, Eduardo Horvath wrote:
> On Sat, 30 Jul 2016, Thor Lancelot Simon wrote:
> 
> > 1) amd64 partially shares VA space between the kernel and userland.  It
> >is not unique in this but most architectures do not.
> 
> FWIW all the pmaps I worked on have split user/kernel address spaces and 
> do not share this vulnerability.

Wakes up...

You've worked on a strange set of cpus then.
Any cpu that doesn't require special instructions for copyin/out
is susceptible to user processes mapping code to address 0 and
converting a kernel 'jump through unset pointer' from a panic
into a massive security hole (executing process code with the
'supervisor' bit set).

The only reason I know for mapping address zero would be to run
executables for very old emulations where the program entry point
was zero. There might be some old 68000 ones.

ISTR that wine is actually mapping 'everywhere' in order to ensure
the addresses it needs later can be made available by unmapping
specific ranges.

Anyway mmap() without MAP_FIXED should never return NULL.
Even if technically allowed by the standard.
If nothing else I think the compiler is allowed to assume
that NULL is special and generate 'unexpected' code.

David

-- 
David Laight: da...@l8s.co.uk


Re: New Syscall

2015-10-18 Thread David Laight
On Thu, Oct 15, 2015 at 02:12:35PM +0100, Robert Swindells wrote:
> 
> Taylor R Campbell wrote:
> >   Date: Wed, 14 Oct 2015 22:55:41 +0100 (BST)
> >   From: Robert Swindells <r...@fdy2.co.uk>
> >
> >   The syscall is sctp_peeloff().
> >
> >Hmm...  Introducing a protocol-specific syscall doesn't strike me as a
> >great design.  I can imagine wanting to do something similar with,
> >e.g., minimalt, if we ever had that in-kernel.
> >
> >If we have to have something protocol-specific, an ioctl would work
> >just as well, and use up a somewhat less scarce resource.
> 
> The code is from KAME, I didn't write it from scratch, FreeBSD also has
> a syscall for it.
> 
> Linux uses getsockopt() for this, which seems wrong to me as you are
> not just reading a setting when you make the call.

Be careful, I think one of the SCTP RFCs requires the use of
setsockopt() for a lot of things that ought to be separate
socket calls.
I can't remember about peeloff.

The 'peeloff' code really shouldn't have been anything to do
with sctp - it is just a method of multiplexing connections
over a single socket.
A strange solution to the problem I think they were trying to solve.

Not that much of sctp works the way people expect it to...

David

-- 
David Laight: da...@l8s.co.uk


Re: Brainy: bug in x86/cpu_ucode_intel.c

2015-10-04 Thread David Laight
On Sun, Oct 04, 2015 at 04:28:35PM +, David Holland wrote:
> On Sun, Oct 04, 2015 at 11:52:18AM +1100, matthew green wrote:
>  > how about this:
> 
> I would suggest using void * for the unaligned pointer, but other than
> that looks at least correctly consistent with the discussion here.

Agree - or char *.
It might not matter for this code, and for x86, but in general
you don't want gcc to see misaligned pointers.

It is also worth noting that you only need to add 8 (for amd64) to the size,
and that the pointer only ever needs moving forward by at most 8.

OTOH having an allocator not return aligned memory is stupid.
Adding a 16 or 32 byte header to allocation requests that are
not powers of 2 probably makes little difference to the footprint.
If code allocates 4k you don't really want a header at all.
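The round-up being described can be sketched as follows; `align_up` and `alloc_aligned` are illustrative names, assuming the 8-byte (amd64) alignment from the text:

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES 8   /* the amd64 case discussed above */

/* Round p up to the next 8-byte boundary; moves it by at most 7. */
static void *align_up(void *p)
{
    return (void *)(((uintptr_t)p + (ALIGN_BYTES - 1)) &
                    ~(uintptr_t)(ALIGN_BYTES - 1));
}

/* Get an aligned region of sz bytes from an allocator that may return
 * unaligned memory: over-allocate by ALIGN_BYTES and align the start.
 * *raw receives the pointer the caller must eventually free(). */
static void *alloc_aligned(size_t sz, void **raw)
{
    *raw = malloc(sz + ALIGN_BYTES);
    return *raw == NULL ? NULL : align_up(*raw);
}
```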

    David

-- 
David Laight: da...@l8s.co.uk


Re: New sysctl entry: proc.PID.realpath

2015-09-07 Thread David Laight
On Mon, Sep 07, 2015 at 07:01:45PM +, David Holland wrote:
> On Mon, Sep 07, 2015 at 11:13:35AM +0200, Joerg Sonnenberger wrote:
>  > > Two nits:
>  > > 
>  > >  1) vnode_to_path(9) is undocumented
>  > >  2) it only works if you are lucky (IIUC) - which you mostly are
>  > > 
>  > > The former is easy to fix, the latter IMHO is a killer before we expose
>  > > this interface prominently and make debuggers depend on it. We then 
> should
>  > > also make $ORIGIN work in ld.elf_so ;-}
>  > 
>  > My suggestion was to just provide the filesystem id and inode number as
>  > fallback. I still believe we should just turn on the code that remembers
>  > the realpath on exec in first place, if you want to debug
>  > something_with_a_very_very_very_very_..._very_long_name, you can always
>  > override the (missing) default.
> 
> As best I recall (having tried to page the context in the past few
> days) the only reason that code is disabled is so that it fails in a
> way that's readily explainable (non-absolute paths) vs. arbitrarily
> and capriciously.
> 
> There's another problem this thread hasn't mentioned, which is that
> the result of vnode_to_path for non-directories isn't necessarily
> unique or deterministic even if the object hasn't been moved about.

Perhaps the kernel should hold a reference to the directory vnode
for every process.
An open() of the directory could then be used for $ORIGIN etc.
You might want this vnode to be 'revokeable' by unmount.
An actual path could be found using the same code as pwd.

David

-- 
David Laight: da...@l8s.co.uk


Re: kernel libraries and dead code in MODULAR kernels

2015-09-06 Thread David Laight
On Fri, Sep 04, 2015 at 06:39:46PM -0700, Dennis Ferguson wrote:
> > 
> > Yes, finding unused functions is hard.  Not only in libkern, but also
> > libc, or variables in abandoned `Makefile.kern.inc'.  Removing one
> > needs so much mental energy (especially when those picky wandering).
> 
> I'm not so interested in ridding the kernel of all unused code (though
> I suspect someone clever might be able to use the -ffunction-sections and
> -fdata-sections compiler flags plus the ld --gc-sections option to
> find some of it).  I'd be happy if modular and non-modular kernels
> had the same unused stuff, and that libkern.a could be used when
> building either.

Just parse the xref list from ld.

I don't think you should assume that all kernel modules are built
at the same time as the main kernel.
So functions that loadable modules might need have to be present.
Hence the inclusion of all of libkern.


David

-- 
David Laight: da...@l8s.co.uk


Re: Understanding SPL(9)

2015-09-02 Thread David Laight
On Mon, Aug 31, 2015 at 03:30:36PM +, Eduardo Horvath wrote:
> On Mon, 31 Aug 2015, Stephan wrote:
> 
> > I'm trying to understand interrupt priority levels using the example
> > of x86. From what I've seen so far I'd say that all spl*() functions
> > end up in either splraise() or spllower() from
> > sys/arch/i386/i386/spl.S. What these functions actually do is not
> > clear to me. For example, splraise() starts with this:
> > 
> > ENTRY(splraise)
> > 	movl	4(%esp),%edx
> > 	movl	CPUVAR(ILEVEL),%eax
> > 	cmpl	%edx,%eax
> > 	ja	1f
> > 	movl	%edx,CPUVAR(ILEVEL)
> > ...
> > 
> > I'm unable to find out what CPUVAR(ILEVEL) means. I would guess that
> > something needs to happen to the APIC's task priority register.
> > However I can't see any coherence just now.
> 
> Don't look at x86, it doesn't have real interrupt levels.  Look at SPARC 
> or 68K which do.

Old x86 fed interrupts through (the equiv of) an 8259 interrupt
controller that dates from the 1970s (8080 cpu).
This has 8 interrupt priority levels and the spl() could (and used to)
modify the mask.
The problem is that these accesses are very, very slow. Since interrupts
are much rarer than spl calls it is much faster to not update the
hardware mask unless you get a level-sensitive interrupt that should
be masked.
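A minimal user-space model of that optimisation, assuming a software ILEVEL and a pending mask; the names and level numbers are made up for illustration, not NetBSD's actual spl implementation:

```c
/* Software interrupt-priority model: spl calls never touch hardware. */
static int ilevel;          /* current priority, like CPUVAR(ILEVEL)   */
static unsigned pending;    /* sources deferred while logically masked */

static int splraise(int level)
{
    int old = ilevel;
    if (level > ilevel)
        ilevel = level;     /* cheap: one compare, one store */
    return old;
}

/* Interrupt entry: dispatch now, or record as pending.  Only in the
 * deferred case would the hardware mask actually need updating. */
static int intr_arrives(int src_level, int src_bit)
{
    if (src_level <= ilevel) {
        pending |= 1u << src_bit;
        return 0;                  /* deferred */
    }
    return 1;                      /* dispatched */
}

/* Restore the old level and report which deferred sources can now run. */
static unsigned spllower(int level)
{
    unsigned runnable = pending;
    ilevel = level;
    pending = 0;
    return runnable;
}
```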

Amd64 cpus have a built-in interrupt priority register (cr8) that can
be used to mask low priority interrupts.
Unlike all other control registers, accesses to cr8 aren't sequencing
instructions so are fast.
I don't know whether netbsd dynamically changes cr8.

> Most machines nowadays only have one interrupt line and an external 
> interrupt controller.  True interrupt levels are simulated by assigning 
> levels to individual interrupt sources and masking the appropriate ones in 
> the interrupt controller.  This makes the code rather complicated, 
> especially since interrupts can nest.

Multiple interrupt priorities for level sensitive interrupts require
hardware support.

David

-- 
David Laight: da...@l8s.co.uk


Re: change MSI/MSI-X APIs

2015-05-30 Thread David Laight
On Mon, May 11, 2015 at 03:15:25PM +0900, Kengo NAKAHARA wrote:
> Hi,
> 
> I received feedback from some device driver authors. They point out
> establish, disestablish and release APIs should be unified for INTx,
> MSI and MSI-X. So, I would change the APIs as below:

Some more feedback...

PCIe devices that only support MSI-X could have support for very
large numbers of interrupts.

Some might only be needed if a specific function is used,
or might only be used for load sharing (eg multi-q ethernet).
In either case you might want to allocate some of the MSI-X
vectors at driver load time, and allocate others at a much later
time.
IIRC nothing in the hardware spec stops you doing this.

Theoretically you could allocate the MSI-X vector (etc)
when the interrupt is enabled, and deallocate it on disable
(apart from timing problems on the hardware).
I'm not suggesting you go that far!

This would also make it less likely that drivers that initialise
later on find no interrupt vectors available.

David

-- 
David Laight: da...@l8s.co.uk


Re: kernel constructor

2014-12-15 Thread David Laight
On Thu, Nov 13, 2014 at 03:29:48PM +0900, Masao Uebayashi wrote:
> On Wed, Nov 12, 2014 at 2:53 AM, Taylor R Campbell campb...@mumble.net
> wrote:
> >    Date: Tue, 11 Nov 2014 17:42:51 +
> >    From: Antti Kantee po...@iki.fi
> >
> >    2: init_main ordering
> >
> >    I think that code reading is an absolute requirement there, i.e. we
> >    should be able to know offline what will happen at runtime.  Maybe that
> >    problem is better addressed with an offline preprocessor which figures
> >    out the correct order?
> >
> > rcorder(8)...?
> 
> I'll implement tsort in config(1), because config(1) knows module
> dependency.  Module objects will be ordered when linking, that is also
> reflected in the order of constructors.  I believe this is good enough
> for most cases.

You could look for symbol name (say) init_fn_foo, init_fn_foo_requires_xxx
and init_fn_foo_provides_yyy and use them to generate a C file of calls
to foo() in the correct order and then relink the kernel.
Or as part of the final stage of converting a netbsd.o (generated
by ld -r) into a fully fixed kernel.
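The ordering step behind such a requires/provides scheme is a topological sort. A minimal sketch of it, with a made-up dependency table standing in for what would be scraped from the hypothetical init_fn_* symbol names:

```c
#define NMOD 4

/* dep[i][j] != 0 means module i requires module j, as would be
 * recovered from init_fn_<i>_requires_<j> symbols.  The module set
 * here (dev, bus, kern, vfs) is purely illustrative. */
static int dep[NMOD][NMOD] = {
    /* dev requires bus and kern */ {0, 1, 1, 0},
    /* bus requires kern */         {0, 0, 1, 0},
    /* kern requires nothing */     {0, 0, 0, 0},
    /* vfs requires kern */         {0, 0, 1, 0},
};

/* Fill out[0..NMOD-1] with module indices so every module follows its
 * prerequisites (the order the generated C file would call them in).
 * Returns 0 if the dependencies are cyclic. */
static int init_order(int *out)
{
    int done[NMOD] = {0};

    for (int n = 0; n < NMOD; n++) {
        int picked = -1;
        for (int i = 0; i < NMOD && picked < 0; i++) {
            if (done[i])
                continue;
            int ready = 1;
            for (int j = 0; j < NMOD; j++)
                if (dep[i][j] && !done[j])
                    ready = 0;       /* prerequisite not yet emitted */
            if (ready)
                picked = i;
        }
        if (picked < 0)
            return 0;                /* cyclic dependency */
        done[picked] = 1;
        out[n] = picked;
    }
    return 1;
}
```

The generated file would then simply call foo() for each index in `out`, in order, before relinking the kernel.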

David

-- 
David Laight: da...@l8s.co.uk


Re: FW: ixg(4) performances

2014-10-01 Thread David Laight
On Sun, Aug 31, 2014 at 12:07:38PM -0400, Terry Moore wrote:
 
> This is not 2.5G Transfers per second. PCIe talks about transactions rather
> than transfers; one transaction requires either 12 bytes (for 32-bit
> systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
> layer, plus 7 bytes at the link layer.
> 
> The maximum number of transactions per second paradoxically transfers the
> fewest number of bytes; a 4K write takes 16+4096+5+2 byte times, and so only
> about 60,000 such transactions are possible per second (moving about
> 248,000,000 bytes). [Real systems don't see this, quite -- Wikipedia claims,
> for example, 95% efficiency is typical for storage controllers.]

The gain for large transfer requests is probably minimal.
There can be multiple requests outstanding at any one time (the limit
is negotiated, I'm guessing that 8 and 16 are typical values).
A typical PCIe dma controller will generate multiple concurrent transfer
requests, so even if the requests are only 128 bytes you can get a
reasonable overall throughput.

> A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
> transactions are possible per second, but those 9 million transactions can
> only move 36 million bytes.
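The quoted arithmetic can be checked mechanically, assuming one gen1 lane carries 250e6 byte times per second after 8b/10b encoding and the 16 + 5 + 2 bytes of per-transaction overhead quoted above:

```c
/* Reproduce the quoted PCIe gen1 x1 figures: 2.5 GT/s with 8b/10b
 * encoding leaves 250e6 byte times per second, and each transaction
 * costs 16 (TLP) + 5 + 2 (link layer) bytes on top of its payload. */
static const double byte_rate = 250e6;

static double transactions_per_sec(int payload)
{
    return byte_rate / (16 + payload + 5 + 2);
}

static double payload_bytes_per_sec(int payload)
{
    return transactions_per_sec(payload) * payload;
}
```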

Except that nothing will generate adequately overlapped short transfers.

The real performance killer is cpu pio cycles.
Every one that the driver does will hit the throughput - the cpu will
be spinning for a long, long time (think ISA bus speeds).

A side effect of this is that PCI-PCIe bridges (either way) are doomed
to be very inefficient.

> Multiple lanes scale things fairly linearly. But there has to be one byte
> per lane; a x8 configuration says that physical transfers are padded so that
> each 4-byte write (which takes 27 bytes on the bus) will have to take 32
> bytes. Instead of getting 72 million transactions per second, you get 62.5
> million transactions/second, so it doesn't scale as nicely.

I think that individual PCIe transfer requests always use a single lane.
Multiple lanes help if you have multiple concurrent transfers.
So different chunks of an ethernet frame can be transferred in parallel
over multiple lanes, with the transfer not completing until all the
individual parts complete.
So the ring status transfer can't be scheduled until all the other
data fragment transfers have completed.

I also believe that the PCIe transfers are inherently 64bit.
There are byte-enables indicating which bytes of the first and last
64bit words are actually required.

The real thing to remember about PCIe is that it is a comms protocol,
not a bus protocol.
It is high throughput, high latency.

I've had 'fun' getting even moderate PCIe throughput into an fpga.

David

-- 
David Laight: da...@l8s.co.uk


Re: msdosfs and small sectors

2014-09-10 Thread David Laight
On Wed, Jul 16, 2014 at 06:26:00PM +, David Holland wrote:
> On Wed, Jul 16, 2014 at 03:10:01PM +0200, Maxime Villard wrote:
> > I thought about that. I haven't found a clear spec on this, but it is
> > implicitly suggested that 512 is the minimal size (from what I've seen
> > here and there). And the smallest BytesPerSec allowed for fat devices
> > is 512. But still, nothing really clear.
> 
> If you're afraid some real device might turn up with 128-byte sectors
> or something, complain if it's less than 64. Or 32. It doesn't really
> matter.

Real floppies certainly had 128 byte sectors.
Some even had 128 byte ones on track 0 but 256 byte ones on the rest
of the disk!

Is there a check that the sector size is a power of two?
That might depend on where it comes from.
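The power-of-two check is one line; a sketch of the validation a mount path might apply, with illustrative bounds chosen from the 128-byte-floppy discussion here (not msdosfs's actual limits):

```c
/* A sector size is plausible for a FAT volume if it is a power of two
 * in a sane range; (n & (n - 1)) == 0 is the power-of-two test. */
static int sector_size_ok(unsigned bps)
{
    return bps >= 128 && bps <= 4096 && (bps & (bps - 1)) == 0;
}
```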

Real devices with 'unusual' sector sizes do exist (like audio CD),
but they won't have a FAT fs on them.
(and ICL system25 which wanted 100 byte sectors).

David

-- 
David Laight: da...@l8s.co.uk


Re: crunchgen and c++

2014-08-21 Thread David Laight
On Mon, Jul 14, 2014 at 11:54:35AM -0500, Frank Zerangue wrote:
> Is crunchgen compatible with c++ executables?

I think you answer yourself...

> I was able to build the c++ tool into a crunched binary but get an
> illegal instruction trap when trying to execute the tool.

Clearly not :-)

> And static variables in the c++ tool are initialized when any
> of the binaries crunched are executed.

To stop that happening the linker section names for the initializers
(and destructors) in each tool would need renaming, and then the
constructors run (in the correct order) before calling the tool's main().
(and even that might not work).

> Thanks for any ideas on this matter.

I'd try a minimal crunched binary and see why it fails.
All crunchgen really does is rename the program's symbols so that
the ones from each 'tool' are separate.

David

-- 
David Laight: da...@l8s.co.uk


Re: serious performance regression in .41

2014-05-23 Thread David Laight
On Thu, May 22, 2014 at 07:42:51PM +0200, J. Hannken-Illjes wrote:
 
> While I'm interested in the results, this change is wrong.  As long as
> we have forced unmount and revoke it is not safe to access an inode
> without locking the corresponding vnode.  Holding a reference to the
> vnode just guarantees the vnode will not disappear, it does not
> prevent the inode from disappearing.

Forced unmount and revoke can use other synchronisation techniques
that are expensive for the unusual operation but cheap in the normal path.

Something like rcu would do.
Might even be more generally useful for some of these structures.

David

-- 
David Laight: da...@l8s.co.uk


Re: CVS commit: src/sys/ufs/ufs

2014-05-16 Thread David Laight
On Fri, May 16, 2014 at 03:54:44PM +, David Holland wrote:
 
> >   Indeed rebooting with an updated kernel will give active NFS clients
> >   problems, but I am not sure we should really care nor how we could
> >   possibly avoid this one time issue. We have changed encoding of
> >   filehandles before (at least once).
> 
> I don't think this is a problem, but maybe I'll put a note in UPDATING.

Never mind that problem.
Consider what happens if you reboot with a different CD in the drive!

I once fixed a filesystem to use different faked inode numbers every
time a filesystem was mounted.
Without that NFS clients would write to the wrong file in the wrong FS.

The 'impossible to get rid of' retries for hard mounts were something
up with which I had to put. (A preposition is something you should not
end a sentence with.)

David

-- 
David Laight: da...@l8s.co.uk


Re: resource leak in linux emulation?

2014-04-23 Thread David Laight
On Thu, Apr 17, 2014 at 01:23:15AM +0200, Sergio López wrote:
> 2014-04-03 11:57 GMT+02:00 Mark Davies m...@ecs.vuw.ac.nz:
> > Note that nprocs (2nd to last value in the /proc/loadavg output)
> > increments every time javac runs until it hits maxproc.
> 
> You're right, the problem appears when the last thread alive in a
> multithreaded linux process is not the main thread, but one of the
> children. This only happens when using the linux emulation,
> because it is the only case when LWPs have their own respective PIDs.
> 
> To fix, this should be added somewhere, probably at
> sys/kern/kern_exit.c:487 (but I'm not sure if there's a better
> location):
> 
> if ((l->l_pflag & LP_PIDLID) != 0 && l->l_lid != p->p_pid) {
>         proc_free_pid(l->l_lid);
> }

That doesn't look like the right place.
I think it should be further down (and with proc_lock held).

David

-- 
David Laight: da...@l8s.co.uk


Re: Patch: cprng_fast performance - please review.

2014-04-23 Thread David Laight
On Fri, Apr 18, 2014 at 02:41:07PM -0400, Thor Lancelot Simon wrote:
 
> Of the few systems which do have instructions that accelerate AES, on
> the most common implementation -- x86 -- we cannot use the instructions
> in question in the kernel because they use CPU state we do not save/
> restore when the kernel runs.  I'd welcome anyone's work to fix that,
> so long as it does not impose major performance costs of its own, but
> I do not personally have the skill to do it, and if wishes were horses...

On x86 the xmm registers could be used in kernel code provided that:
1) If the fpu registers are owned by a different process they are saved
   into the pcb (because an IPI might ask they be saved).
   (Or save the register values somewhere the IPI can save them to the
   pcb from.)
and:
2) Pre-emption is disabled.
and:
3a) If the fpu registers are owned by the current process the registers
   used are saved and restored.
or:
3b) If the fpu is not active it is enabled (and then disabled).

You probably don't want to do a full fpu save unless you really need to.

I'd guess that the AES instruction would only need a couple of xmm/ymm
registers.

There is one lurking issue with the intel cpus though.
If the user program has used AVX encoded instructions that affect the
ymm registers there is a big clock penalty for the first non-avx encoded
instruction that uses the xmm ones (don't ask what the hw guys f*cked up
and bodged a fix for...).
The ABI requires that the ymm (high) registers be cleared with a special
instruction before every function call - which will include all system
calls, but this won't be true if the kernel is entered by an interrupt.

I don't know about amd cpus.

David

-- 
David Laight: da...@l8s.co.uk


Re: cprng_fast implementation benchmarks

2014-04-23 Thread David Laight
On Wed, Apr 23, 2014 at 03:30:09PM +0200, Manuel Bouyer wrote:
> On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote:
> > [...]
> > Do we still have a compile-time way to check if the kernel (or port) is
> > uniprocessor only?  If so we should probably #ifdef away the percpu calls
> > in such kernels, which are probably for slower hardware anyway.
> 
> AFAIK options MULTIPROCESSOR is still here

Do the percpu() calls collapse out for non-MULTIPROCESSOR kernels?

In any case you'd want to do what is done with some of the mutex code.
ie overwrite the code of the SMP version with that of the
uniprocessor one if the current system only has one cpu.

David

-- 
David Laight: da...@l8s.co.uk


Re: Changes to make /dev/*random better sooner

2014-04-11 Thread David Laight
On Thu, Apr 10, 2014 at 04:14:46PM -0700, Dennis Ferguson wrote:
 
> On 10 Apr, 2014, at 05:34 , Thor Lancelot Simon t...@panix.com wrote:
> 
> > On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> > > 
> > > I'd really like to understand what problem is fixed by this.  It
> > > seems to make the code more expensive (significantly so since it
> > > precludes using timestamps in their cheapest-to-obtain form) but
> > > I'm missing the benefit this cost pays for.
> > 
> > It's no more expensive to sample a 64-bit than a 32-bit cycle counter,
> > if you have both.  Where do we have access to only a 32-bit cycle
> > counter?  I admit that the problem exists in theory.  I am not so sure
> > at all that it exists in practice.
> 
> 32 bit ARM processors have a 32-bit CPU cycle counter, when they
> have one.  PowerPC processors have a 64-bit counter but the 32-bit
> instruction set provides no way to get an atomic sample of all 64
> bits.  It requires three special instructions followed by a check
> and a possible repeat of the three instructions to get a consistent
> sample, which makes that significantly less useful for accurate event
> timing than the single atomic instruction which obtains the low order
> 32 bits alone.  I know i386, and 32-bit sparc running on a 64-bit
> processor, can get atomic samples of 64 bits of cycle counter from
> the 32-bit instruction set but I think those are exceptions rather
> than rules.

For the purposes of obtaining entropy it doesn't matter if the high
and low parts don't match.
Is there likely to be interesting entropy in the high bits anyway?
Certainly not more than once.

Also, having read high, low, high and found that the two 'high'
values differ, take the latter high bits and zero the low bits.
The value returned occurred while the counter was being read -
so is a valid return value.

David

-- 
David Laight: da...@l8s.co.uk


Re: Proposal for kernel clock changes

2014-04-01 Thread David Laight
On Fri, Mar 28, 2014 at 06:16:23PM -0400, Dennis Ferguson wrote:
> I would like to rework the clock support in the kernel a bit to correct
> some deficiencies which exist now, and to provide new functionality.  The
> issues I would like to try to address include:

A few comments, I've deleted the body so they aren't hidden!

One problem I do see is knowing which counter to trust most.
You are trying to cross synchronise values and it might be that
the clock with the best long term accuracy is a very slow one
with a lot of jitter (NTP over dialup anyone?).
Whereas the fastest clock is likely to have the least jitter, but
may not have the long term stability.

There are places where you are only interested in the difference
between timestamps - rather than needing them converting to absolute
times.

I also wonder whether there are timestamps for which you are never
really interested in the absolute accuracy of old values.
Possibly because 'old' timestamps will already have been converted
to some other clock.
This might be the case for ethernet packet timestamps, you may want
to be able to synchronise the timestamps from different interfaces,
but you may not be interested in the absolute accuracy of timestamps
from packets takem several hours ago.

This may mean that you can (effectively) count the ticks on all your
clocks since 'boot' and then scale the frequency of each to give the
same 'time since boot' - even though that will slightly change the
relationship between old timestamps taken on different clocks.
Possibly you do need a small offset for each clock to avoid
discrepancies in the 'current time' when you recalculate the clocks'
frequency.

If the 128bit divides are being done to generate corrected frequencies,
it might be that you can use the error term to adjust the current value
- and remove the need for the divide at all (after the initial setup).
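One standard way to make the divide a setup-time-only cost is a precomputed multiply-and-shift, as timecounter-style code commonly does; a sketch (constants illustrative, and the multiply overflows for very large tick counts):

```c
#include <stdint.h>

/* Precompute mult so that ns = (ticks * mult) >> shift approximates
 * ticks * 1e9 / freq_hz.  The divide happens once, at setup; each
 * conversion afterwards is one multiply and one shift. */
static uint64_t mult;
static const int shift = 32;

static void set_freq(uint64_t freq_hz)
{
    mult = (1000000000ull << shift) / freq_hz;   /* the only divide */
}

/* Valid while ticks * mult fits in 64 bits (i.e. modest intervals). */
static uint64_t ticks_to_ns(uint64_t ticks)
{
    return (ticks * mult) >> shift;
}
```

Refreshing `mult` slightly, rather than redoing the wide divide, is the "use the error term to adjust" idea in the paragraph above.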

One thought I've sometimes had is that, instead of trying to synchronise
the TSC counters in an SMP system, move them as far from each other
as possible!
Then, when you read the TSC, you can tell from the value which cpu
it must have come from!

David



Re: resource leak in linux emulation?

2014-03-27 Thread David Laight
On Thu, Mar 27, 2014 at 02:00:37PM +1300, Mark Davies wrote:
> On a NetBSD/amd64 6.1_STABLE system, I have a perl script that
> effectively calls /usr/pkg/java/sun-7/bin/javac twice.  It doesn't
> really matter what java file its compiling.
> If I call this script in an infinite loop, after an hour or so the
> javac's start failing with memory errors:
> 
>   # There is insufficient memory for the Java Runtime Environment to
>   # continue.
>   # Cannot create GC thread. Out of system resources.
> 
> and after some more time the perl fails to fork (to exec the second
> javac)
> 
>    23766  1 perl CALL  fork
>    23766  1 perl RET   fork -1 errno 35 Resource temporarily
>    unavailable
> 
>  Mar 27 11:43:24 test /netbsd: proc: table is full - increase
>  kern.maxproc or NPROC
> 
> But all through this top et al tell me there are plenty of processes
> and memory

I think this has been seen before.
But I can't remember the resolution.

David

-- 
David Laight: da...@l8s.co.uk


Re: Enhance ptyfs to handle multiple instances.

2014-03-25 Thread David Laight
On Mon, Mar 24, 2014 at 10:49:15AM -0400, Christos Zoulas wrote:
> On Mar 24,  5:46pm, net...@izyk.ru (Ilya Zykov) wrote:
> -- Subject: Re: Enhance ptyfs to handle multiple instances.
> 
> | Hello!
> | 
> | Please tell me if I am wrong.
> | In the general case I can't easily find, from the driver, where its
> | device file is located on the file system, its vnode or the directory
> | vnode where this file is located.
> | Such files can be many and I can't find what file is used for the
> | current operation.
> | Has anybody attempted to get this info from the driver?
> 
> You can't find from the driver where the device node file is located
> in the filesystem, just as you cannot reliably find the filesystem
> path from the vnode of the device node. There could be many
> device nodes that satisfy the criteria (you can make your own tty
> node with mknod)

FWIW SYSV ptys (etc) would be created as a 'clone', a /dev entry
created/found with the required path and the correct major/minor,
and then reopened through the filesystem entry.
stat() on the /dev entry and fstat() on the fd would then agree
(probably disk partition and inode?).

This didn't help you find the entry - but it would tell you when
you'd found the correct one.
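The "reopen and compare" check described above amounts to matching stat() of the path against fstat() of the descriptor on device and inode; in user space it looks like this (an ordinary temp file stands in for the /dev node):

```c
#include <sys/stat.h>

/* Return 1 if the open descriptor fd and the path name the same
 * underlying file: same containing filesystem and same inode. */
static int same_file(int fd, const char *path)
{
    struct stat fs, ps;

    if (fstat(fd, &fs) == -1 || stat(path, &ps) == -1)
        return 0;
    return fs.st_dev == ps.st_dev && fs.st_ino == ps.st_ino;
}
```

This is essentially how a ttyname()-style search confirms it has found the right /dev entry.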

OTOH ttyname(3) is probably best implemented with a pair of ioctls.
Although chroot() probably complicates things.

David

-- 
David Laight: da...@l8s.co.uk


Re: CVS commit: src/sys/kern

2014-03-06 Thread David Laight
On Wed, Mar 05, 2014 at 06:04:02PM +0200, Andreas Gustafsson wrote:
 
> 2. I also object to the change of kern_sysctl.c 1.247.
> 
> This change attempts to work around the problems caused by the changes
> to the variable types by making sysctl() return different types
> depending on the value of the *oldlenp argument.
> 
> IMO, this is a bad idea.  The *oldlenp argument does *not* specify the
> size of the data type expected by the caller, but rather the size of a
> buffer.  The sysctl() API allows the caller to pass a buffer larger
> than the variable being read, and conversely, guarantees that passing
> a buffer that is too small results in ENOMEM.
> 
> Both of these aspects of the API are now broken: reading a 4-byte
> CTLTYPE_INT variable now works for any buffer size >= 4 *except* 8,

That wasn't the intent of the change.
The intent was that if the size was 8 then the code would return
a numeric value of size 8, otherwise the size would be changed to
4 and/or ENOMEM (stupid errno choice) returned.

> and attempting to read an 8-byte CTLTYPE_QUAD variable into a buffer
> of less than 8 bytes is now guaranteed to yield ENOMEM *except* if the
> buffer size happens to be 4.

A request to read a CTLTYPE_QUAD variable into a buffer that is shorter
than 8 bytes has always been a programming error.
The intent of the change was to relax that if the length happened to be 4.
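The intended size negotiation can be written out as a sketch; the helper name and behaviour follow the description in this mail, not the actual kern_sysctl.c code:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Return a numeric sysctl value into (oldp, *oldlenp): callers passing
 * an 8-byte buffer get all 64 bits; any other length is coerced to the
 * 4-byte int case, with ENOMEM if the buffer cannot hold even that. */
static int sysctl_number(void *oldp, size_t *oldlenp, uint64_t val)
{
    if (*oldlenp == 8) {
        memcpy(oldp, &val, 8);
        return 0;
    }
    if (*oldlenp < 4) {
        *oldlenp = 4;       /* tell the caller the size it needed */
        return ENOMEM;      /* the "stupid errno choice" noted above */
    }
    uint32_t v32 = (uint32_t)val;   /* truncate to the int case */
    memcpy(oldp, &v32, 4);
    *oldlenp = 4;
    return 0;
}
```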

> IMO, this behavior violates both the
> letter of the sysctl() man page and the principle of least astonishment.

I'm not sure about the latter.
I run 'sysctl -a' and find the name of the sysctl I'm interested in.
The result is a small number so I pass the address and size of a integer
variable and then print the result.
(Or the value is rather large and I think it might exceed 2^31 so I
use an int64.)
The 'principle of least astonishment' would mean that I get the value
that 'sysctl -a' printed.

On a BE system I have to be extremely careful with the return values
from sysctl() or I see garbage.

Note that code calling sysctl() has to either know whether the value
it is expecting is a string, structure, or number, or use the API calls
that expose the kernel internals in order to find out.

 Also, the work-around is ineffective in the case of a caller that
 allocates the buffer dynamically using the size given by an initial
 sysctl() call with oldp = NULL.

Code that does that for a numeric value will be quite happy with
either a 32bit or 64bit result.
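The behaviour being argued about can be sketched as follows. This is an invented illustration (copyout_number() is not a real kernel function): a numeric node held as 64 bits in the kernel is copied out as 8 bytes when the caller's buffer is 8 bytes, otherwise as 4 bytes, failing with ENOMEM (the errno choice complained about above) when the buffer is too small or the value does not fit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

static int
copyout_number(uint64_t val, void *oldp, size_t *oldlenp)
{
	if (oldp == NULL) {
		/* Size probe: report the smallest width that fits. */
		*oldlenp = val > UINT32_MAX ?
		    sizeof(uint64_t) : sizeof(uint32_t);
		return 0;
	}
	if (*oldlenp == sizeof(uint64_t)) {
		memcpy(oldp, &val, sizeof(uint64_t));
		return 0;
	}
	if (*oldlenp < sizeof(uint32_t) || val > UINT32_MAX)
		return ENOMEM;
	uint32_t v32 = (uint32_t)val;
	memcpy(oldp, &v32, sizeof(v32));
	*oldlenp = sizeof(v32);
	return 0;
}
```

A caller passing an 8-byte buffer always gets the full value; a 4-byte buffer keeps working as long as the value still fits in 32 bits.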

David

-- 
David Laight: da...@l8s.co.uk


Re: Recent sysctl changes

2014-03-05 Thread David Laight
On Wed, Mar 05, 2014 at 03:56:54PM -0500, Thor Lancelot Simon wrote:
 On Wed, Mar 05, 2014 at 08:55:50PM +0200, Andreas Gustafsson wrote:
  
  2. I also object to the change of kern_sysctl.c 1.247.
  
  This change attempts to work around the problems caused by the changes
  to the variable types by making sysctl() return different types
  depending on the value of the *oldlenp argument.
 
 As I recall, we considered this approach before creating hw.physmem64,
 and decided it was just a little too cute.
 
 I don't actually know of any code that hands over a wrong-size
 buffer and will therefore break, though.  Do you?  I agree the
 possibility does exist.

I actually wonder if the code should also support single byte reads
for things like machdep.sse which are effectively booleans.

Maybe we should also allow 1, 4 and 8 byte reads for items declared
as booleans.
IIRC one of the arm ABIs uses 4 byte booleans - bound to be a cause
for confusion at some point.

David

-- 
David Laight: da...@l8s.co.uk


Re: Vnode API change: mnt_vnodelist traversal

2014-03-03 Thread David Laight
On Mon, Mar 03, 2014 at 03:55:12PM +0100, J. Hannken-Illjes wrote:
 On Mar 3, 2014, at 11:32 AM, Thomas Klausner w...@netbsd.org wrote:
 
  On Mon, Mar 03, 2014 at 11:11:04AM +0100, J. Hannken-Illjes wrote:
  A diff implementing this and using it for those operations running
  vrecycle() is at http://www.netbsd.org/~hannken/vnode-pass4-1.diff
  
  Once all operations are converted, vmark() / vunmark() will go and
  man pages will be updated.
  
  Comments or objections anyone?
  
  I have no background clue, so please excuse my questions if they are
  stupid :)
  
  +void
  +vfs_vnode_iterator_init(struct mount *mp, void **marker)
  +{
  +   struct vnode **mvpp = (struct vnode **)marker;
  +
  +   *mvpp = vnalloc(mp);
  +
  +   mutex_enter(&mntvnode_lock);
  +   TAILQ_INSERT_HEAD(&mp->mnt_vnodelist, *mvpp, v_mntvnodes);
  +   mutex_exit(&mntvnode_lock);
  +}
  +
  +void
  +vfs_vnode_iterator_destroy(void *marker)
  +{
  +   struct vnode *mvp = marker;
  +
  +   KASSERT((mvp->v_iflag & VI_MARKER) != 0);
  +   vnfree(mvp);
  +}
  
  Why do you cast marker in init, but not in destroy or next?
 
 Because (void **) to (othertype **) needs a cast.  Added casts to
 destroy and next anyway.
 
  I assume that the marker is not struct vnode * so that you can change
  the type later if you want.
 
 It is struct vnode * for now, to the caller it is simply opaque as the
 caller doesn't need to know the internals.

Use the correct type - if the caller doesn't need to know the internals
add a 'struct vnode' before the function definition.
(Or even 'struct foo' - which might currently be a vnode.)
If you use 'void *' it becomes unclear where the pointers are valid.

In this case I'm not sure that adding a marker vnode into the list
of vnodes is a good idea at all.

What you might want is a list of active iterators and their current
position so that the 'right' things can happen when a vnode is deleted
(especially if they need to save the 'next' vnode to allow the function
itself delete the current one).

In that case the appropriate structure can be allocated on the stack as
part of the iterator data.

For instance, you might decide to scan the vnodes from the hash lists.
And for SMP locking you might want to arrange the hash so that any
'next' pointers are within the hash structure - completely removing
any linked list between the vnodes themselves.
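For reference, the marker-node technique being debated can be sketched with a toy node type in place of struct vnode (the names here are invented for illustration). The marker is a list element that real consumers must skip; it records the iterator's position so a lock protecting the list could be dropped between steps without losing one's place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/queue.h>

struct node {
	TAILQ_ENTRY(node) entries;
	bool is_marker;
	int val;
};
TAILQ_HEAD(nodelist, node);

/* Move the marker past the next real node and return that node. */
static struct node *
iter_next(struct nodelist *list, struct node *marker)
{
	struct node *np = TAILQ_NEXT(marker, entries);

	while (np != NULL && np->is_marker)	/* skip other markers */
		np = TAILQ_NEXT(np, entries);
	TAILQ_REMOVE(list, marker, entries);
	if (np == NULL)
		return NULL;			/* end of list */
	TAILQ_INSERT_AFTER(list, np, marker, entries);
	return np;
}
```

The cost is visible here too: every real consumer of the list has to know about and skip markers, which is part of the objection above.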


David

-- 
David Laight: da...@l8s.co.uk


Re: Adding truncate/ftruncate length argument checks

2014-02-26 Thread David Laight
On Wed, Feb 26, 2014 at 08:38:28PM +0100, Nicolas Joly wrote:
The attached patch add the missing length argument checks, and update
the man page accordingly.
  
  Isn't there (shouldn't there be) some lock needed to read the limit
  data?
 
 Even for fetching a single value ? I thought it was mostly atomic ?

+   if (length > l->l_proc->p_rlimit[RLIMIT_FSIZE].rlim_cur) {  

Well...
l->l_proc is ok.
l_proc->p_rlimit may not be (if it is shared with another process,
  and an update by another process/thread causes the pointer to change,
  and the other owners all exit ...)
p_rlimit[RLIMIT_FSIZE].rlim_cur is uint64_t so is a problem on 32bit.

David

-- 
David Laight: da...@l8s.co.uk


Re: Adding truncate/ftruncate length argument checks

2014-02-26 Thread David Laight
On Wed, Feb 26, 2014 at 10:55:52PM +0100, Nicolas Joly wrote:
  l_proc->p_rlimit may not be (if it is shared with another process,
and an update by another process/thread causes the pointer to change,
and the other owners all exit ...)
 
 I don't think another process will cause any problem. Before any
 update, it will have its own private copy, leaving the previous shared
 version unmodified.
 
 Regarding an other thread ... The race does indeed exists, but only
 once in process life, for the first limit write access.

One copy of the structure is shared between all the lwps in a process.
It can also be shared with the parent and children.

If another lwp in the same process tries to edit a shared (by more than
one process) structure then the code could read the old copy after
the ref count has been decreased. If you are then really unlucky the
process it is shared with will exit - and the data will be freed.
It might in general get unmapped (and fault) or be reallocated,
modified and then garbage read.

Some kind of rcu in the 'free' path would solve the latter.

David

-- 
David Laight: da...@l8s.co.uk


Re: pmap_kenter_pa pmap_kremove

2014-02-23 Thread David Laight
On Sat, Feb 22, 2014 at 10:04:13PM +, Mindaugas Rasiukevicius wrote:
 Matt Thomas m...@3am-software.com wrote:
  
  I've been wondering...
  
  Should pmap_kenter_pa overwrite an existing entry should it be operating
  on an unmapped VA.
 
 You mean already mapped VA?
 
  I think that if you want to change a mapping, you
  should do a pmap_kremove first.
 
 I tend to agree.  I have not seen a need for such re-mapping (overwriting),
 but even if there is, it can be done efficiently by removing, entering and
 then calling pmap_update().  With the deferred update, that would result in
 a single TLB flush/invalidation.

Anything that uses a small KVA area to reference a large amount of
physical addresses?

I'd guess that you'd want a flag somewhere to know it was likely
(either for the call, or more likely as a property of the KVA address).

David

-- 
David Laight: da...@l8s.co.uk


Re: pcb offset into uarea

2014-02-19 Thread David Laight
On Wed, Feb 19, 2014 at 09:14:05AM -0800, Matt Thomas wrote:
 
 For the aarch64 port, the only thing in the PCB is the fpu register set.
 Everything else is in mdlwp.  Now the context switch code can ignore
 the PCB entirely.  I've been thinking of doing something similar for
 other ports i maintain.

Makes sense.

That would remove a rather pointless indirection for those fields as well.
On amd64 and i386 the pcb is slightly over 64 bytes (+fpu save area).
So moving those into the lwp won't make much difference.
It isn't as though anyone has considered swapping uareas for a while.

David

-- 
David Laight: da...@l8s.co.uk


Re: pcb offset into uarea

2014-02-17 Thread David Laight
On Sun, Feb 16, 2014 at 01:27:50PM -0800, Matt Thomas wrote:
  An alternative would be to place the FP save area at the start of the uarea.
  This would mean that, on stack overflow, the FP save area would be trashed
  before some random piece of memory.
  It might even be worth putting the pcb at the start of the uarea - so that
  stack overflow crashes out the failing process, and probably earlier
  than the random corruption would.
 
 For most ports, the pcb is at the start of the uarea.

Interesting since i386 puts it at the end.

  This gives me three options:
  A) Put the save area at the end of the pcb and dynamically adjust the pcb
offset.
  B) Put the save area at the start of the uarea, with the pcb at a fixed
offset at the end of the uarea.
  C) Put the save area at the end of the pcb, and put the pcb at the start
of the uarea.
  
  Votes?
  What have I missed?
 
 Keep a default mmx/sse save area in the pcb along with a pointer to it.
 If a variant is used that needs a larger save area, dynamically allocate
 it and save it in the pcb pointer.
 
 Since it's unlikely most processes will be AVX why waste the space?

Unfortunately I don't think it is possible to determine whether a
process has used the AVX instructions.
There is a bit for 'os supports avx' (ie swaps on context switch) that
causes the instructions to fault (if not set), but applications should
look at that before using avx instructions.

If a process switch happens in a system call then the avx (xmm and ymm)
registers need not be saved and restored. They can be zeroed instead
because they are all caller saved.
I'm not 100% sure how easy that is to detect, but it shouldn't be too hard an
optimisation to perform.

Zeroing the ymm registers also has a significant performance benefit.

David

-- 
David Laight: da...@l8s.co.uk


Re: pcb offset into uarea

2014-02-17 Thread David Laight
On Mon, Feb 17, 2014 at 06:39:26PM +, David Holland wrote:
 On Sun, Feb 16, 2014 at 09:41:08PM +, David Laight wrote:
   I'm adding code to i386 and amd64 to save the ymm registers on process
   switch - allowing userspace to use the AVX instructions.
   [ensuing crap about the u area]
 
 Why put it in the u area at all? It's a legacy concept of little
 continuing value.

Certainly most of the stuff that is in the pcb could be put into the lwp
structure. Apart form the fp save area it isn't even very big.

Putting the FP save area at the low address of the kernel stack pages
saves you having to worry about how big it is.
(for 'stack grows down' systems).

David

-- 
David Laight: da...@l8s.co.uk


pcb offset into uarea

2014-02-16 Thread David Laight
I'm adding code to i386 and amd64 to save the ymm registers on process
switch - allowing userspace to use the AVX instructions.

I also don't want to have to do it all again when the next set of
extensions appear.
This means that the size of the FPU save area (currently embedded in
the pcb) can't be determined until runtime.

Plan A is to move the FPU save are to the end of the pcb, and then
locate the pcb at the correct offset in the uarea so that the written
region ends at the end of the page.
The problem with this is that the offset of the pcb in the uarea
is set by MI code based on some #defines - and there seem to be
several related values.

Now on x86 (like most systems) the cpu stack advances into low memory.
The pcb is placed at the end of the uarea with the initial stack pointer
just below it.
I suspect that a long time ago (when the uarea had a fixed KVA) an
additional memory page was placed below the uarea to give interrupts
more stack space. I don't think this happens any more.

As an aside: The uarea used to be pageable, whereas (what is now) the
lwp structure isn't. Paging of uarea's was disabled a few years back
- so there is no real difference between the lifetimes of an lwp and a uarea.
(zombies probably lose the uarea before the lwp).

An alternative would be to place the FP save area at the start of the uarea.
This would mean that, on stack overflow, the FP save area would be trashed
before some random piece of memory.
It might even be worth putting the pcb at the start of the uarea - so that
stack overflow crashes out the failing process, and probably earlier
than the random corruption would.

This gives me three options:
A) Put the save area at the end of the pcb and dynamically adjust the pcb
   offset.
B) Put the save area at the start of the uarea, with the pcb at a fixed
   offset at the end of the uarea.
C) Put the save area at the end of the pcb, and put the pcb at the start
   of the uarea.

Votes?
What have I missed?

David

-- 
David Laight: da...@l8s.co.uk


Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM

2014-02-11 Thread David Laight
On Tue, Feb 11, 2014 at 04:19:26PM +, Eduardo Horvath wrote:
 
 We really should enhance the bus_dma framework to add bus_space-like 
 accessor routines so we can implement something like this.  Using bswap is 
 a lousy way to implement byte swapping.  Yes, on x86 you have byte swap 
 instructions that allow you to work on register contents.  But most RISC 
 CPUs do the byte swapping in the load/store path.  That really doesn't 
 map well to the bswap API.  Instead of one load or store operation to 
 swap a 64-bit value, you need a load/store plus another dozen shift and 
 mask operations.  
 
 I proposed such an extension years ago.  Someone might want to resurrect 
 it.

What you don't want to have is an API that swaps data in memory
(unless that is really what you want to do).

IIRC modern gcc detects uses of its internal byteswap function
that are related to memory read/write and uses the appropriate
byte-swapping memory access.

I can see the advantage of being able to do byteswap in the load/store
path, but sometimes that can't be arranged and a byteswap instruction
is very useful.
I really can't imagine implementing it being a big problem!

David

-- 
David Laight: da...@l8s.co.uk


Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM

2014-02-11 Thread David Laight
On Tue, Feb 11, 2014 at 09:21:30PM +, Eduardo Horvath wrote:
  
  What you don't want to have is an API that swaps data in memory
  (unless that is really what you want to do).
  
  IIRC modern gcc detects uses of its internal byteswap function
  that are related to memory read/write and uses the appropriate
  byte-swapping memory access.
  
  I can see the advantage of being able to do byteswap in the load/store
  path, but sometimes that can't be arranged and a byteswap instruction
  is very useful.
 
 When do you ever really want to byte swap the contents of one register to 
 another register?  Byte swapping almost always involves I/O, which 
 means reading or writing memory or a device register.  In this case we 
 are specifically talking about DMA, in which case there is always a load 
 or store operation involved.

Quite often the structure of the code means that the value has already
been read into a register - so you are presented with a value in the
wrong byte order.

  I really can't imagine implementing it being a big problem!
 
 Yes, it a big problem.  For a 2 byte swap you need to do 2 shift 
 operations, one mask operation (if you're lucky) and one or operation.  
 Double that for a 4 byte swap.  And even if you argue that a dozen CPU 
 cycles here or there don't make much difference, the byte swap code is 
 replicated all over the place since the routines are macros, so you're 
 paying for it with your I$ bandwidth.

Sorry I meant a big problem for those designing cpus.
I know it is a pita in software.
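For the record, the shift-and-mask sequence referred to above, written out for a 32-bit value. Modern gcc and clang recognise this whole expression and emit a single bswap instruction (or a byte-reversing load/store on CPUs that have one), which is the compiler behaviour mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
bswap32_portable(uint32_t x)
{
	/* Four masks, four shifts, three ORs - the cost complained about. */
	return ((x & 0x000000ffu) << 24) |
	       ((x & 0x0000ff00u) <<  8) |
	       ((x & 0x00ff0000u) >>  8) |
	       ((x & 0xff000000u) >> 24);
}
```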

About the only VHDL I've written is for a byteswap 'custom instruction'
for a soft-cpu. Done because a single cycle byteswap there was easier
than getting a ppc to use the byteswapping memory accesses for the
relevant fields.

David

-- 
David Laight: da...@l8s.co.uk


Re: [Milkymist port] virtual memory management

2014-02-10 Thread David Laight
On Mon, Feb 10, 2014 at 02:38:27PM -0800, Matt Thomas wrote:
 Hopefully, if they make the caches larger they increase the number of ways.
 I wouldn't add code to flush.  Just add a panic if you detect you can have
 aliases and deal with it if it ever happens.

IIRC A lot of sparc systems have VIPT caches where the cache size
(divided by the ways) is much larger than a page size (IIRC at least 64k).

If memory has to be mapped at different VA (eg in different processes)
then it is best to pick VA so that the data hits the correct part of the
cache.  Otherwise it has to be mapped uncached.

I guess another solution is to use logical pages that are the right size.

David

-- 
David Laight: da...@l8s.co.uk


Re: Possible issue with fsck_ffs ?

2014-02-07 Thread David Laight
On Fri, Feb 07, 2014 at 05:50:53PM +, David Holland wrote:
 On Fri, Feb 07, 2014 at 08:39:39AM -0800, Paul Goyette wrote:
   I'm sure we have some experts who could figure this out a lot more
   quickly than me fumbling through the sources  :)
   
    At my $DAYJOB we have seen instances where newfs(8) can generate a
    filesystem whose fragments per cylinder-group can exceed 0x10000.
   When newfs(8) stores the value in the file-system's superblock, it
   works correctly since fs_fpg is a 32-bit integer.  However,
   newfs(8) also stores the value in the partition table's p_cpg
    member, which is only 16-bits.  Values above 0x10000 will,
   obviously, get truncated.
   
   fsck_ffs(8) works just fine as long as we are able to read the
   primary superblock.  But if we're unable to access the primary SB,
   we need to use the p_cpg value to find the alternate superblocks,
   and because of the truncation noted above the search for alternates
   will fail.

IMHO fsck should work without having to find the partition table.

One option is to assume that default parameters were used to create
the filesystem and the use the same algorithm as newfs to find the
alternate - maybe trying a few block/fragment sizes (there aren't
that many).
As a last resort a linear search wouldn't take that long.
Although it would be best to check that some later superblocks match,
and that the whole thing is consistent with the partition size.

I'd sometimes rather that fsck didn't actually do any disk writes
until the end (or interactively after asking a question).

IIRC the number of fragments in a 'cylinder group' is limited because the
allocation bitmap has to reside within a single FS block.
Since blocks are limited to 64kB this limits it to 0x80000 fragments.
(Any FS with blocks > 8k is likely to have > 0x10000 fragments/CG.)
So if you do have the p_cpg value there are only a few locations to try.
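The arithmetic behind those limits can be checked with a one-liner: the free-fragment bitmap must fit in one FS block, so the fragment count per cylinder group is bounded by the number of bits in a block. This is a back-of-the-envelope illustration only; real values come from the superblock.

```c
#include <assert.h>

static unsigned long
max_frags_per_cg(unsigned long blocksize)
{
	return blocksize * 8;	/* one bitmap bit per fragment */
}
```

A 64kB block allows 0x80000 fragments; an 8kB block allows exactly 0x10000, so any larger block size can overflow a 16-bit field like p_cpg.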

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] netbsd32 swapctl, round 3

2014-02-01 Thread David Laight
On Sat, Feb 01, 2014 at 08:41:15AM +, Emmanuel Dreyfus wrote:
 Hi
 
 Here is my latest attempt at netbsd32 swapctl. I had to make uvm_swap_stats()
 available to emul code, but that seems to be what it was intented for,
 according to comments in the code.

I've just looked at the code in uvm_swap_stats().
Might be easier to either clone the entire loop, or pass in a helper
function which is passed the 'sdp' and 'inuse' fields.

Even for the 'normal' case that would save copying the pathname twice.
It isn't as though a lock is released before the copyout.
The read lock is held throughout.

David

-- 
David Laight: da...@l8s.co.uk


Re: compat_netbsd32 swapctl

2014-01-29 Thread David Laight
On Wed, Jan 29, 2014 at 11:54:29AM +0100, Martin Husemann wrote:
 On Wed, Jan 29, 2014 at 10:42:14AM +, Emmanuel Dreyfus wrote:
  The solution is for netbsd32_swapctl() to call sys_swapctl() for
  each individual record, but it needs to know the i386 size for 
  struct swapent. I suspect there is a macro for that. Someone knows?
 
 Tricky.
 
 You could define swapent32 with se_dev split into two 32bit halves and
 do full conversion back and forth, but better check what alignment
 mips and sparc would require here first.

Look at what is done elsewhere in the i386 compat code.

There is a 64bit integer type that has an alignment requirement of 8.
If that is used instead of a normal 64bit type then the structure
alignement under amd64 matches that of i386.

David

-- 
David Laight: da...@l8s.co.uk


Re: compat_netbsd32 swapctl

2014-01-29 Thread David Laight
On Wed, Jan 29, 2014 at 06:38:06PM +, paul_kon...@dell.com wrote:
 On Wed, Jan 29, 2014 at 06:26:14PM +, David Laight wrote:
  There is a 64bit integer type that has an alignment requirement of 8.
  If that is used instead of a normal 64bit type then the structure
  alignement under amd64 matches that of i386.
 
 The easiest way to get such alignment is to ask for it explicitly: 
 __attribute__((aligned(8))).
 
   paul

That won't ensure the structure has the same alignment, there could
be pad words before 64bit fields on the 64bit architecture.

You could mark all the 64bit fields with aligned(8) so that they have
the same alignment.

But the general problem is that the 64bit system needs to match a 
pre-existing 32bit structure.
Changing the alignment doesn't then help.

It probably is worth adding an __CTASSERT() for non-trivial structures
that are expected to be a fixed size. Then if anything 'odd' happens
the compiler will bleat.
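A sketch of that check, using C11 _Static_assert where the kernel would use __CTASSERT(). struct compat_ent is an invented stand-in for a compat32 structure; forcing 8-byte alignment on the 64-bit member keeps the layout identical under ILP32 and LP64 compilers, per the discussion above.

```c
#include <assert.h>
#include <stdint.h>

struct compat_ent {
	uint64_t ce_dev __attribute__((aligned(8)));	/* same offset on i386 and amd64 */
	int32_t  ce_flags;
	int32_t  ce_nblks;
};

/* Bleat at compile time if the ABI-visible layout ever changes. */
_Static_assert(sizeof(struct compat_ent) == 16,
    "compat_ent layout changed - ABI break");
```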

David

-- 
David Laight: da...@l8s.co.uk


Re: amd64 kernel, i386 userland

2014-01-26 Thread David Laight
On Sun, Jan 26, 2014 at 05:01:42PM +1100, matthew green wrote:
 
 i think this could be fixed by introducing new disk major numbers for
 both i386 and amd64 that are associated with the same definition of
 major() and minor(), but i've never gotten around to or found someone
 else willing to code this up.

An entirely new disk minor to partition map might be appropriate.
(Without looking at the current mess...)
I think we have (at least) 16 bits of minor number so could
split 8/8, but reserve the high 'disk numbers' for the 'raw disk'
access for all disks.
So minor 0x0204 would be disk 2 partition 4.
Minor 0xff03 would be raw access to disk 3.
Maybe minor 0xfe03 would be the 'netbsd partition (type 169)' if found.
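The 8/8 packing proposed above, written as macros. DISKUNIT_RAW and the macro names are invented here; the split and the reserved high unit numbers follow the scheme in the text.

```c
#include <assert.h>

#define DISKUNIT_RAW		0xff			/* whole-disk access */
#define DISKMINOR(unit, part)	(((unit) << 8) | (part))
#define DISKUNIT(minor)		(((minor) >> 8) & 0xff)
#define DISKPART(minor)		((minor) & 0xff)
```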

The partition slots for 'whole disk' could then be put where they belong.
I suspect a real VAX might have disks that the hardware can't size,
but nothing else will.

I know I've changed the x86 code to report the actual disk size (for
'd', and maybe 'c') rather than the information that happened to be
in the label.

David

-- 
David Laight: da...@l8s.co.uk


Re: UVM crash in NetBSD/i386 PAE with 32 GB of RAM

2014-01-21 Thread David Laight
On Tue, Jan 21, 2014 at 10:31:08AM +, Emmanuel Dreyfus wrote:
 On Mon, Jan 20, 2014 at 04:18:38PM +, Emmanuel Dreyfus wrote:
  Changing memory fixed the problem. The machine now boots 6.0 i386 PAE
  with SMP enabled and 128 GB of RAM installed, and it seems to be stable.
 
 But I spoke too fast. It is stable, but the i386 PAE kernel does not
 sees more than 2 GB of memory. An amd64 kernel sees the whole 128 GB.
 
 Is it possible that the chipset cannot run PAE?

I doubt it.

2G sounds like the amount of memory below 4G - but I'd have thought that
would be 3G (or even 3.5G).

It might be that having 128G has confused things somewhere.
PAE is also a bodge (at all levels) - I'd run a 64bit kernel on that system.

David

-- 
David Laight: da...@l8s.co.uk


Re: UVM crash in NetBSD/i386 PAE with 32 GB of RAM

2014-01-21 Thread David Laight
On Tue, Jan 21, 2014 at 08:59:19PM +0100, Christoph Egger wrote:
 Am 21.01.14 20:54, schrieb David Laight:
  On Tue, Jan 21, 2014 at 10:31:08AM +, Emmanuel Dreyfus wrote:
  On Mon, Jan 20, 2014 at 04:18:38PM +, Emmanuel Dreyfus wrote:
  Changing memory fixed the problem. The machine now boots 6.0 i386 PAE
  with SMP enabled and 128 GB of RAM installed, and it seems to be stable.
 
  But I spoke too fast. It is stable, but the i386 PAE kernel does not
  sees more than 2 GB of memory. An amd64 kernel sees the whole 128 GB.
 
  Is it possible that the chipset cannot run PAE?
  
  I doubt it.
  
  2G sounds like the amount of memory below 4G - but I'd have thought that
  would be 3G (or even 3.5G).
 
 That depends on the PCI MMIO memory layout.
 
 And I am wondering if 64bit PCI devices are accessable at all when their
 PCI bar is above 4G.

The bios will put values below 4G into all the bars - otherwise a 32bit
os wouldn't be able to access them at all.
That is why there is a gap in the physical memory addresses.

Actually the size of the memory chips might force 'low' memory
down to 2G - I'm not entirely sure how that memory hole is generated.

If amd64 finds all the memory, I'd do a bit of chasing through the kernel
startup code to see what happens.

There are KVA issues that might give problems with the page tables
needed for that much physical memory.

David

-- 
David Laight: da...@l8s.co.uk


Re: amd64 kernel, i386 userland

2014-01-21 Thread David Laight
On Tue, Jan 21, 2014 at 09:14:36PM +0100, Emmanuel Dreyfus wrote:
 Joerg Sonnenberger jo...@britannica.bec.de wrote:
 
  At least raidctl can be found in /rescue, which is statically linked.
  That's likely easier to play with than any compat hacks.
 
 Yes, but that does not solves the problem for ipf, for instance.

You could build the 64bit ipf with a different 'elf interpreter'
name in it (or patch the string in the binary, it is unlikely to be shared).
Then you just need to set an appropriate LD_LIBRARY_PATH.

Or, maybe, run ipf inside a chroot.

David

-- 
David Laight: da...@l8s.co.uk


Re: BPF memstore and bpf_validate_ext()

2013-12-19 Thread David Laight
On Fri, Dec 20, 2013 at 01:28:12AM +0200, Mindaugas Rasiukevicius wrote:
 Alexander Nasonov al...@yandex.ru wrote:
  
  Well, if it wasn't needed for many year in bpf, why do we need it now? ;-)
  
 
 Because it was decided to use BPF byte-code for more applications and that
 meant there is a need for improvements.  It is called evolution. :)

Has anyone here looked closely at the changes linux is making to bpf?

David

-- 
David Laight: da...@l8s.co.uk


Re: qsort_r

2013-12-09 Thread David Laight
On Mon, Dec 09, 2013 at 03:55:30AM +, David Holland wrote:
 On Sun, Dec 08, 2013 at 11:26:47PM +, David Laight wrote:
 I have done it by having the original, non-_r functions provide a
 thunk for the comparison function, as this is least invasive. If we
 think this is too expensive, an alternative is generating a union of
 function pointers and making tests at the call sites; another option
 is to duplicate the code (hopefully with cpp rather than CP) but that
 seems like a bad plan.

I'd prefer to not have another indirect call. The only difference
is the definition and expanding a CMP macro differently?
   
   Is just casting the function pointers safe in C (well in NetBSD)?
   (with the calling conventions that Unix effectively requires)
 
 No. Well, it is, but it's explicitly illegal C and I don't think we
 should do it.

Actually given that these functions are in libc, their interface
is defined by the architecture's function call ABI, not by the C language.

Consider what you would do if you wrote an asm wrapper for qsort(a,b)
in terms of an asm qsort_r(a,b,d)?

For ABIs where the first 3 arguments are passed in registers
(eg: amd64, sparc, sparc64) and for ABIs where arguments are stacked
and cleared by the caller (eg i386) I don't think you'd consider doing anything
other than putting an extra label on the same code.

There might be ABIs where this isn't true - in which case the
'thunk' is an option, but I don't think NetBSD has one.
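For comparison, the thunk fallback can be sketched in portable C. All names here are invented: my_qsort_r stands in for the reentrant sort (a toy insertion sort, with elements assumed at most 64 bytes), and my_qsort is the strictly C-legal wrapper - paying the extra indirect call per comparison that the alias trick would avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*cmp_fn)(const void *, const void *);

/* Wrap the plain comparator so it can travel through the cookie. */
struct cmp_thunk { cmp_fn cmp; };

static int
thunk_cmp(const void *a, const void *b, void *cookie)
{
	return ((struct cmp_thunk *)cookie)->cmp(a, b);
}

/* Toy reentrant sort (insertion sort) standing in for qsort_r. */
static void
my_qsort_r(void *base, size_t n, size_t sz,
    int (*cmp)(const void *, const void *, void *), void *cookie)
{
	char *b = base, tmp[64];

	for (size_t i = 1; i < n; i++)
		for (size_t j = i; j > 0 &&
		    cmp(b + j * sz, b + (j - 1) * sz, cookie) < 0; j--) {
			memcpy(tmp, b + j * sz, sz);
			memcpy(b + j * sz, b + (j - 1) * sz, sz);
			memcpy(b + (j - 1) * sz, tmp, sz);
		}
}

/* The non-_r entry point, implemented via the thunk. */
static void
my_qsort(void *base, size_t n, size_t sz, cmp_fn cmp)
{
	struct cmp_thunk t = { cmp };

	my_qsort_r(base, n, sz, thunk_cmp, &t);
}

static int
intcmp(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}
```

The alias approach instead makes both names point at the same code, which only works where the ABI lets the extra argument be silently ignored.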

FWIW I think Linux is moving to an alternate ppc64 ABI that doesn't
use 'fat pointers'.

David

-- 
David Laight: da...@l8s.co.uk


Re: qsort_r

2013-12-08 Thread David Laight
On Sun, Dec 08, 2013 at 11:44:28PM +0100, Joerg Sonnenberger wrote:
 On Sun, Dec 08, 2013 at 10:29:53PM +, David Holland wrote:
  I have done it by having the original, non-_r functions provide a
  thunk for the comparison function, as this is least invasive. If we
  think this is too expensive, an alternative is generating a union of
  function pointers and making tests at the call sites; another option
  is to duplicate the code (hopefully with cpp rather than CP) but that
  seems like a bad plan.
 
 I'd prefer to not have another indirect call. The only difference
 is the definition and expanding a CMP macro differently?

Is just casting the function pointers safe in C (well in NetBSD)?
(with the calling conventions that Unix effectively requires)

Can anything slightly less nasty be done with varags functions?

David

-- 
David Laight: da...@l8s.co.uk


Re: qsort_r

2013-12-08 Thread David Laight
On Sun, Dec 08, 2013 at 10:29:53PM +, David Holland wrote:
 
 I have done it by having the original, non-_r functions provide a
 thunk for the comparison function, as this is least invasive. If we
 think this is too expensive, an alternative is generating a union of
 function pointers and making tests at the call sites; another option
 is to duplicate the code (hopefully with cpp rather than CP) but that
 seems like a bad plan. Note that the thunks use an extra struct to
 hold the function pointer; this is to satisfy C standards pedantry
 about void pointers vs. function pointers, and if we decide not to
 care it could be simplified.

On most architectures I think just:
__weak_alias(heapsort_r,heapsort)   
__weak_alias(_heapsort_r,_heapsort)   
will work.

David

-- 
David Laight: da...@l8s.co.uk


Re: posix message queues and multiple receivers

2013-12-07 Thread David Laight
On Sat, Dec 07, 2013 at 12:38:42AM +0100, Johnny Billquist wrote:
 
 You know, you might also hit a different problem, which I have had on 
 many occasions.
 NFS using 8k transfers saturating the ethernet on the server, making the 
 server drop IP fragments. That in turn forces a resend of the whole 8k 
 after a nfs timeout. That will totally kill your nfs performance. 
 (Obviously, even larger nfs buffers make the problem even worse.)

That wasn't the problem in this case since I could see the very delayed
responses.
That is a big problem, I've NFI why i386 defaults to very large transfers.

 Even with an elevator scan algorithm and four concurrent nfs clients, 
 your disk operation will complete within a few hundred ms at most.

This was all from one client. I'm not sure how many concurrent NFS
requests were actually outstanding - it was quite a few.
I remember that the operation was copying a large file to the nfs server,
the process might have been doing a very large write of an mmaped file.
So the client could easily have a few MB of data to transfer - and be
trying to do them all at once.

Thinking further, multiple nfsd probably help when there are a lot more
reads than writes - reads can be serviced from the server's cache.

David

-- 
David Laight: da...@l8s.co.uk


Re: in which we present an ugly hack to make sys/queue.h CIRCLEQ work

2013-12-03 Thread David Laight
On Sun, Nov 24, 2013 at 06:42:40AM -0500, Mouse wrote:
  (I think that) strict aliasing rules implies that if two types
  type{1,2} do not match any of the aliasing rules (e.g. type1 is of
  the same type as the first member of type2, or type1 is a char, or
  ...), then any two pointers ptr{1,2} on type{1,2} respectively _ARE_
  different, because *ptr1 != *ptr2 per the aliasing rules and this
  implies ptr1 != ptr2.
 
 Only if you actually evaluate *ptr1 and *ptr2 (in some cases, I think,
 just one of them is enough).  Otherwise you're not accessing the
 relevant object(s); the rule is about accesses to values, not about
 pointers that, if followed, would perform certain accesses to values.

One option would have been to replace the comparison:
(void *)foo == (void *)bar
with:
(char *)foo - (char *)0 == (char *)bar - (char *)0

Which the compiler can't optimise away.
well (const char *), but that makes the line too long!

I've had to do something similar, casting to, IIRC, (foo * const *),
in a function that advances a pointer down an array - which might be const.


David

-- 
David Laight: da...@l8s.co.uk


Re: zero-length symlinks

2013-11-05 Thread David Laight
On Sun, Nov 03, 2013 at 04:35:19PM -0800, John Nemeth wrote:
 
  It has to do with the fact that historically mkdir(2) was
 actually mkdir(3), it wasn't an atomic syscall and was a sequence
 of operation performed by a library routine...

Actually I think you'll find that mkdir was always a system call.
It was directory rename that was done with a series of link and
unlink system calls.

Also, if you look at any current fs code the processing of . and
.. is special - they will be treated as requests for the current
and parent directories regardless of the inodes they reference.
Doing otherwise is a complete locking nightmare!

David

-- 
David Laight: da...@l8s.co.uk


Re: Getting the device name from a struct tty *

2013-10-16 Thread David Laight
On Tue, Oct 15, 2013 at 01:11:40PM -0400, Mouse wrote:
  In a tty line discipline, I want to get the name of the tty driver
  instance, e.g. dtyU0.
 
 In what sense is that the name of the tty driver instance?
 
 I'm not just being snarky; that's a real question.  Names of the dtyU0
 kind normally name device special files in /dev but nothing else - the
 kernel doesn't know anything about them.  In theory you could read
 /dev, but (a) nothing says device special files can't exist elsewhere
 and (b) you then have to decide what to do if you find other than
 exactly one device special file pointing to the device in question.
 (And, of course, (c) you may not be in a context from which reading
 /dev is feasible.)  But if that's the name you want, there may be
 little choice.

There is also no reason (in general) why there should be a /dev entry
anywhere at all.  The tty could be a cloning driver that allocates a
new minor on every open (and without any magic to create a /dev entry).

The libc ttyname() function is often implemented using a database
of known tty entries - otherwise the lookup can involve a recursive
search of /dev - which might be needed anyway if entries are dynamically
added - not sure how netbsd handles it.

I have been handed (on paper!) several hundred sheets of system call
trace from a program that was scanning a list looking for an entry
for its current terminal - and calling ttyname() for each entry!

On SYSV we used to go through 'hoops' so that ttyname() could return
the expected /dev entry when there might be multiple /dev entries
with the same major/minor.

David

-- 
David Laight: da...@l8s.co.uk


Re: mknodat(2) device argument type change

2013-10-09 Thread David Laight
On Sun, Oct 06, 2013 at 10:51:36PM +0200, Nicolas Joly wrote:
 
 It needs the PAD, syscalls files generation fails without it
 (sysalign=1).
 
 /bin/sh makesyscalls.sh syscalls.conf syscalls.master
 syscalls.master: line 905: unexpected dev (expected a padding argument)
 line is:
 460 STD  RUMP{  int | sys |  | mknodat ( int fd , const char  * 
 path ,mode_t mode , dev_t dev ) ;  }

A certain amount of magic (and luck) applies...

On i386 64bit fields in structures are only 4-byte aligned, however when
the arguments for a function are stacked a 64bit field is 8-byte aligned.
For system calls a C structure gets mapped onto the user stack (copied
into the kernel). So the kernel struct for the above argument list needs a pad.

On amd64 the first 6 arguments are all in registers, so the C struct for
the argument list must match the register save area. In this case
the structure members all end up being 64bit. So no pad is needed.
I suspect this is handled 'by magic'.

... searches for the magic ...

I think that the PAD arguments are added by the libc system call stubs
even for 64bit architectures - where they waste a real argument slot.

This doesn't explain why rump needs so much special code in makesyscalls.sh.

David

-- 
David Laight: da...@l8s.co.uk


Re: fixing the vnode lifecycle

2013-09-26 Thread David Laight
On Wed, Sep 25, 2013 at 10:22:36PM -0400, Mouse wrote:
  Expect some file systems to use  a key size != sizeof(ino_t) -- nfs
  for example uses file handles up to 64 bytes.
  IIRC all file systems provide a filehandle generation routine,
 
 There was a time when fh generation was needed only for the filesystem
 to be NFS-exportable.  Is it now actually required for all filesystems?

Doesn't posix (more or less) require files to have inode numbers?
In particular I thought some of the fields reported by stat() are supposed
to uniquely identify the file (probably st_dev and st_ino).

Yes, I know some fs break this.

David

-- 
David Laight: da...@l8s.co.uk


Re: Max. number of subdirectories dump

2013-08-19 Thread David Laight
On Sun, Aug 18, 2013 at 03:08:21PM +0200, Manuel Wiesinger wrote:
 Hello,
 
 I am working on a defrag tool for UFS2/FFSv2 as Google Summer of Code 
 Project.
 
 The size of a directory offset is of type int32_t (see 
 src/sys/ufs/ufs/dir.h), which is a signed integer. So the maximum size 
 can be (2^31)-1.
 
 When testing, the maximum number of subdirectories was 32767, which is 
 (2^15)-1, when trying to add a 32767th directory, I got the error 
 message: Too many links.
 When my tools reads only the single indirect blocks, it get all 32767 
 subdirectories.

For defrag I'd have thought you'd work from the inode table and treat
directories no different from files.
You would need to scan directories if you decide to renumber inodes,
but since they are indexed that may not gain much.

It might be worth rewriting directories in order to remove gaps and
possibly put subdirectories first (but you really want the most frequently
used entries first).

FYI A well known british internet payment scheme fell over when the 32768th
vendor account was added onto the live system!
(Solaris crashed badly.)

David

-- 
David Laight: da...@l8s.co.uk


Re: Use of the PC value in interrupt/exception handlers

2013-08-06 Thread David Laight
On Fri, Aug 02, 2013 at 10:46:31AM +, Piyus Kedia wrote:
 Dear all,
 
 We are working on developing a dynamic binary translator for the kernel.
 Towards this, we wanted to confirm if the interrupted PC value pushed on
 stack by an interrupt/exception is used by the interrupt/exception handlers?
 For example, is the PC value compared against a fixed address to determine
 the handler behaviour (like Linux's page fault handler compares the faulting
 PC against an exception table, to allow functions like copy_from_user to 
 fault).

IIRC i386 and amd64 both check the faulting PC for copyin() and copyout()
(and similar functions). Unlike linux these exist as proper functions
so there is only a single set of exception PC bounds (not one for every
call site).

There will also be checks that a user-space PC actually contains
a user address.

Also the signal information, coredump, and registers for GDB (etc)
contain the PC.

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] i386 copy.S routine optimization?

2013-06-16 Thread David Laight
On Mon, Jun 10, 2013 at 10:20:25PM +0200, Yann Sionneau wrote:
 Hello,
 
 I already talked about this with Radoslaw Kujawa on IRC, I understood 
 that it is far from trivial to say if it is good to apply the following 
 patch [0] or not due to x86 cache and pipeline subtleties.

Please inline patches in the mail, that way they are definitely in the
mail archive. It also makes them much easier to review.
Also ensure you quote the cvs revision of the main file - otherwise
the line numbers won't match.

If you want to make a measurable improvement to copystr() don't use
lodsb or stosb and use 32bit reads from userspace.
I can't remember whether it is best to do misaligned reads and aligned writes
(or aligned reads and misaligned writes), in any case if you do aligned reads
you don't have to worry about faulting at the end of a page.

Look at the strlen() code for quick ways of testing for a zero byte.
For amd64 the bit masking methods are definitely faster.

Probably the worst part of the current code is that the 'jz' to skip
the unwanted write will be mis-predicted.

David

-- 
David Laight: da...@l8s.co.uk


Re: netbsd-6: pagedaemon freeze when low on memory

2013-03-07 Thread David Laight
On Wed, Mar 06, 2013 at 06:01:50PM -0600, David Young wrote:
  
  Here's another thought:  What about changing some of the VM_SLEEP calls
  to VM_NOSLEEP, at least for the userspace-initiated syscalls?  The
  syscalls would then fail, moving the responsibility of dealing with low
  memory onto the userspace apps (they may be unhappy, but at least the
  kernel will stay functional).  This change would be in addition to the
  vmem_xalloc() wake changes you proposed (because those wake-ups may
  never come if the system is truly running on fumes).
 
 You could do that.  You just have to take care to handle the errors
 properly.  There are more error paths to test.  Not trying to discourage
 you, just point out the trade-offs. :-)

Possibly some of the sleeps could be interruptable (by a signal).
But that rather depends on the system calls involved.

David

-- 
David Laight: da...@l8s.co.uk


Re: netbsd-6: pagedaemon freeze when low on memory

2013-03-05 Thread David Laight
On Tue, Mar 05, 2013 at 11:43:35PM -0600, David Young wrote:
 Maybe we can avoid unnecessary locking or redundancy using a
 generation number?  Add a generation number to the vmem_t,
 
 volatile uint64_t vm_gen;
 
 Increase a vmem_t's generation number every
 time that vmem_free(), vmem_xfree(), or vmem_backend_ready() is
 called:

Won't that generate a very hot cache line on a large smp system?
Maybe the associated structures are actually worse here!
But per-cpu virtual address free lists might make sense.

David

-- 
David Laight: da...@l8s.co.uk


Re: Post-mortem debugging tools

2013-02-06 Thread David Laight
On Mon, Feb 04, 2013 at 09:39:04PM +0100, Joerg Sonnenberger wrote:
 Hi all,
 we have quite a few tools in base that still require KVM or optionally
 support it. Removing all tools that require KVM for operation (and
 therefore setgid) is one of the open goals. It would be nice if that
 doesn't require adding lots of duplicate code. For that, a decision is
 required what programs are required for post-mortem analysis (i.e.
 debugging kernel dumps) and limit dual-KVM/sysctl code paths to that.

For post-mortem work you often want the raw information from the kernel
structures (ie including the KVA of things), which the normal user-tools
don't need.

Putting the work into a single program that grovels KVM for diagnostics
(aka SysV crash) means that only one program has to exactly match the
kernel (and, maybe, could be compiled with the kernel?).

Possibly some of the printf() statements could be shared with ddb.

David

-- 
David Laight: da...@l8s.co.uk


open modes O_DENYREAD and O_DENYWRITE

2013-02-06 Thread David Laight
There is a current thread on some of the linux lists and wine-devel
about the semantics of two more open modes O_DENYREAD and O_DENYWRITE.

These are being implemented (I think only for nfs and samba at the
moment) in order to support the equivalent windows open modes.

I don't know if NetBSD needs to worry about these (yet).

David

-- 
David Laight: da...@l8s.co.uk


Re: open modes O_DENYREAD and O_DENYWRITE

2013-02-06 Thread David Laight
On Wed, Feb 06, 2013 at 09:52:34AM +0100, Martin Husemann wrote:
 On Wed, Feb 06, 2013 at 08:19:41AM +, David Laight wrote:
  There is a current thread on some of the linux lists and wine-devel
  about the semantics of two more open modes O_DENYREAD and O_DENYWRITE.
  
  These are being implemented (I think only for nfs and samba at the
  moment) in order to support the equivalent windows open modes.
 
 FYI, the windows modes that match our model are SHARE_DENY_NONE (0) and
 SHARE_EXCLUSIVE (O_EXCL), but O_SHLOCK is split into SHARE_DENY_READ and
 SHARE_DENY_WRITE.
 
 It kind of makes sense to me, but I don't have an easy example where the
 difference would be vital.

IIRC O_EXCL only applies to creates.
Reading the man page, O_SHLOCK and O_EXLOCK only acquire flock() type locks.

The windows modes are hard enforced - and are a right PITA at times since
a lot of programs open files exclusively.

David

-- 
David Laight: da...@l8s.co.uk


Re: event counting vs. the cache

2013-01-19 Thread David Laight
On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote:
 It's customary to use a 64-bit integer to count events in NetBSD because
 we don't expect for the count to roll over in the lifetime of a box
 running NetBSD.
 
 I've been thinking about what these wide integers do to the cache
 footprint of a system and wondering if we shouldn't make a couple of
 changes:
 
 1) Cram just as many counters into each cacheline as possible.
Extend/replace evcnt(9) to allow the caller to provide the storage
for the integer.
 
On a multiprocessor box, you don't want CPUs sharing counter
cachelines if you can help it, but do cram together each individual
CPU's counters.

Actually, if the counter can be placed in the same area as some other
driver data, then it will typically already be in the cache.
This is probably most important for things that are changed very often
(like ethernet byte and packet counts).
Having error counts in different cache lines probably isn't that important.

This does mean that the evcnt(9) interface is completely the wrong one!
It looks like 8 + 6 x sizeof (void *) bytes per counter - so every increment
is (more or less) guaranteed to be a cache line miss.

David

-- 
David Laight: da...@l8s.co.uk


Re: event counting vs. the cache

2013-01-18 Thread David Laight
On Thu, Jan 17, 2013 at 05:25:44PM -0600, David Young wrote:
 
 We can end up with silly values with the status quo, too, can't we?  On
 32-bit architectures like i386, x++ for uint64_t x compiles to
 
   addl $0x1, x
   adcl $0x0, x
 
 If the addl carries, then reading x between the addl and adcl will show
 a silly value.
 
 I think that you can avoid the silly values.  Say you're using per-CPU
 counters.  If counter x belongs to CPU p, then avoid silly values by
 reading x in a low-priority thread, t, that's bound to p and reads hi(x)
 then lo(x) then hi(x) again.  If hi(x) changed, then t was preempted by
 a thread or an interrupt handler that wrapped lo(x), so t has to restart
 the sequence.

You don't actually need to restart: the value new_hi:0 happened while
the function was running - so it is a valid response.

I think there is another problem with that scheme - but I can't remember it!

There are other schemes that handle the case of a single writer (guaranteed
by something else) and occaisional readers that don't want to acquire
whatever context single-threads the writes.

One is (I think), add an extra 32bit counter. The writer increments it
before updating the stats block, and again afterwards. The reader spins
until it is even, reads all the stats, then checks the value hasn't changed.

Another involves writing a 3rd value that contains the middle bits of the
value (from both the high and low parts). The reader checks consistency.

Or, assume a 63-bit counter will also not wrap and replicate the high bit
of the low word into the low bit of the high word - the reader verifies.

On 64bit systems I sometimes wonder whether it is necessary for stats to
be 100% accurate - so not using locked increments may be ok.

There are also issues making stats per-cpu.
While not unreasonable for 2, 4 or 8 cpus it gets a bit silly when there
are 1024 or more.

The differences between common and uncommon (eg error) stats also needs to
be considered.

David

-- 
David Laight: da...@l8s.co.uk


Re: event counting vs. the cache

2013-01-17 Thread David Laight
On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote:
 
 2) Split all counters into two parts: high-order 32 bits, low-order 32
bits.  It's only necessary to touch the high-order part when the
low-order part rolls over, so in effect you split the counters into
write-often (hot) and write-rarely (cold) parts.  Cram together the
cold parts in cachelines.  Cram together the hot parts in cachelines.
Only the hot parts change that often, so the ordinary footprint of
counters in the cache is cut almost in half.

That means have to have special code to read them in order to avoid
having 'silly' values.

David

-- 
David Laight: da...@l8s.co.uk


Re: USB_DEBUG mess

2013-01-06 Thread David Laight
On Sat, Jan 05, 2013 at 11:12:29PM +, Christos Zoulas wrote:
 In article c873acdc-a444-4442-a021-42c621d86...@3am-software.com,
 Matt Thomas  m...@3am-software.com wrote:
 
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/usb/files.usb?rev=1.106&content-type=text/x-cvsweb-markup
  
  Normally, the XXX_DEBUG options are not specified in any files.* files,
  meaning that as they are unknown options, they will translate into a
  CPPFLAG of -DXXX_DEBUG in the kernel Makefile
 
which means if you do the define, your sources don't properly get rebuilt.
 That's why it was made a config option.
 
 Yes, but now every file that includes usb.h needs to include opt_usb.h before,
 otherwise things don't build right. I'll fix it properly for now until we
 decide something else.

How much of a difference does it make to the structure layouts?
If there are only a few fields the extra space probably doesn't matter.

David

-- 
David Laight: da...@l8s.co.uk


Re: Hijacking a PCU.

2012-12-15 Thread David Laight
On Sat, Dec 15, 2012 at 11:24:09AM -0800, Matt Thomas wrote:
  
  On amd64 the safe area we currently have for SSE2 is 512 bytes.
  Add support for the 256 AVX instructions and it increases to 832.
  You really don't want to be allocating multiple such saved areas
  (per lwp) on the off chance the kernel might want to use the registers.
 
 Since this is MD, you only need to save the register your kernel
 MD code will be using.

  3) Saving and restoring a register may zero the high bits of an
extended version of that register.
 
 That's an md problem.

Since you are trying to sort out an MI solution you need to be aware
of the known MD problems - otherwise the framework won't work for it.

If an x86 program is using the 256bit AVX instructions, and some
kernel code wants to use one of the 128bit SSE registers, then the
kernel code has to save the 256 bit register, not the 128bit one.
(And next year, the register might be extended even further, requiring
different save/restore instructions.)

Effectively this means that a completely separate fpu save area is needed.
You can't just save a couple of registers on stack.

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-11 Thread David Laight
On Mon, Dec 10, 2012 at 09:55:14PM +, paul_kon...@dell.com wrote:
  
  The compiler has some heuristics about what it is good to inline.
  gcc tends to treat 'inline' as just a hint.
 
 I wouldn't describe it that way.  And I don't think the GCC
 documentation does.  It does talk about heuristic inlining,
 but that's for the -O3 feature of inlining stuff that's *not* marked,
 based on heuristics that it might be useful.
  
  Genarally it likes inlining static functions that are only called once.
  But it doesn't always do so - even when marked 'inline'
  (marking them inline may have no effect).
 
 There are switches that control what gets inlined.  In particular,
 there is one that says not to inline things (other than called-once things)
 that are bigger than X.  If things are not getting inlined when expected,
 that is one possible cause.

I was having issues with static functions that are only called once.
They are quite large (not large for a function) but adding a couple
of 'boring' lines of code stopped them being inlined, and also
stopped some 1 line functions being inlined.
Marking everything with __attribute__((always_inline)) fixed it.

That particular code (part of an embedded system) has to have everything
inlined - any register spills to stack make it too slow.
I almost hacked gcc enough to remove the function prologue (which
saves some registers on stack even though the function can't return)
so that I could use %sp as a general purpose register!
There are a couple of other candidate registers that are never used.
(It is MIPS-like: %r1 is reserved for assembler macros, of which
there are exactly none, and the only interrupts are fatal so the interrupt
%pc save (and debug %pc save) are also unused.)

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-11 Thread David Laight
On Mon, Dec 10, 2012 at 07:12:36PM -0500, Mouse wrote:
 
  I want func() inlined twice so that there are only 2 conditional
  branches and usually a conditional branch in cmd() back to the loop
  top in each path.
 
 Why?  (s/func/cmd/ I assume.)
 
  So I need to stop the compiler tail merging the two parts of the
  inside 'if'
  There is nothing I can put inside an inline function version of cmd()
  that will stop this happening.
 
 There's nothing you can put in a macro that will prevent it, either.
 Or, rather, as far as I can think of, anything you can do in a macro to
 prevent it you can also do in an inline function.  If you have an
 example of something that'll work one way but not the other I'd be
 interested.

An assembler comment - from something like:
asm volatile("; " STR(__LINE__))

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-11 Thread David Laight
On Tue, Dec 11, 2012 at 06:10:09AM +, David Holland wrote:
 On Tue, Dec 11, 2012 at 01:27:09AM +, Roland C. Dowdeswell wrote:
   As an example, I often define a macro when I am using Kerberos or
   GSSAPI that looks roughly like:
   
   #define K5BAIL(x) do { \
  	ret = x; \
  	if (ret) { \
  		/* format error message and whatnot \
  		 * in terms of #x. \
  		 */ \
  \
  		goto bail; \
  	} \
  } while (0)
 
 The code like this in src/sys/nfs is a reliably steady source of
 problems, and I'd argue that macros of this form are not at all worth
 the problems they cause.

Absolutely, that is one construct I would ban.
Encapsulating so you can do:
	if (K5BAIL(xxx(), "error text"))
goto bail;
Or even separating the function name (so it can be traced)
at least leaves the flow control obvious.
If you embed a goto (or return) in a #define you'd better have the
definition very close to the use.

David

-- 
David Laight: da...@l8s.co.uk


Re: fixing compat_12 getdents

2012-12-10 Thread David Laight
On Mon, Dec 10, 2012 at 09:53:46PM +0200, Alan Barrett wrote:
 also, EINVAL doesn't seem like a great error code for this 
 condition.  it's not an input parameter that's causing the 
 error, but rather that the required output format cannot express 
 the data to be returned.  I think solaris uses EOVERFLOW for 
 this kind of situation, and ERANGE doesn't seem too bad either. 
 any opinions on that?
 
 There's also E2BIG, but I don't think it fits.  ERANGE is 
 documented in terms of the available space, while EOVERFLOW is 
 documented in terms of a numeric result.  So perhaps EOVERFLOW 
 for integer is too large to fit in N bits, and ERANGE for 
 string is too long to fit in N bytes?  Or vice versa?
 
 Somebody(TM) should go through the errno(2) documentation and make 
 the descriptions more generic, and add guidance for choosing which 
 code to return.
 

Then people get upset because they say function foo() isn't allowed
to set errno to 'bar'.
It is rather a shame that posix tries to list all the errno values a function
can return, not just those for explicit 'likely' (ie normal)
non-success returns from a function.

For the inode number, it is a slight shame that a 'fake' value
can't be returned - maybe 0x - since a lot of the
code won't really care.

More likely to be an issue for the stat() functions - but not much code
really cares.
Well not much that you are really going to run compat versions of.

There are issues getting unique dev/inode pairs anyway for some
filesystems (and things like union mounts).

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-10 Thread David Laight
On Mon, Dec 10, 2012 at 03:50:00PM -0500, Thor Lancelot Simon wrote:
 On Mon, Dec 10, 2012 at 02:28:28PM -0600, David Young wrote:
  On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
  
   a) #define macros tend to get optimised better.
  
  Better even than an __attribute__((always_inline)) function?

Consider the following code:

int ring[100];
#define ring_end (ring + 100)
int *ring_ptr;
int ring_wrap_count;

#define cmd(n) \
if (__predict_true(ring_ptr < ring_end)) \
*ring_ptr++ = n; \
else { \
ring_ptr = ring; \
*ring_ptr++ = n; \
ring_wrap_count++; \
}



for (;;) {
if (__predict_false(...)) {
if (...) {

cmd(1);
continue;
}
...
cmd(2);
continue;
}
...
}

I want func() inlined twice so that there are only 2 conditional
branches and usually a conditional branch in cmd() back to the loop
top in each path.
So I need to stop the compiler tail merging the two parts of the
inside 'if'
There is nothing I can put inside an inline function version of cmd()
that will stop this happening.

In the #define version I can add things that stop the compiler
merging the code. Prizes for thinking what!
(Yes I could do the same in the outer code, but that happens quite
often and I'd much rather hide the hackery in one place.)

And yes, this is a real case from some code where I needed to minimise
the worst case path enough that the extra branch mattered!
The 'unusual' worst case of 'ring wrap' doesn't matter.

I've seen other cases where the code for #define is better than that
for an inline function. Possibly because an extra early optimisation
happens.

I know I've also had issues getting compilers to actually inline stuff.
gcc's __attribute__((always_inline)) helps - I've had to use it to
get static functions that are only called once reliably inlined.

 I'd like to submit that neither are a good thing, because human
 beings are demonstrably quite bad at deciding when things should
 be inlined, particularly in terms of the cache effects of excessive
 inline use.

Indeed - there are some horrid large #define macros lurking.
For some of them I can't imagine when they were beneficial.

There have been some places where apparently innocuous #defines
have exploded out of all proportion.
The worst I remember was the SysV vn_rele(): by the time the
original spl() functions had been replaced with lock functions,
and the locks had also become inlined, the whole thing exploded.
 
 One reason why macros should die is that in the process, inappropriate
 and harmful excessive inlining of code that would perform better if
 it were called as subroutines would die.

That is true whether inline functions or #defines are used.

Are some computer science courses teaching about optimisations that
really haven't been true since the days of the m68k?

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-10 Thread David Laight
On Mon, Dec 10, 2012 at 09:26:08PM +, paul_kon...@dell.com wrote:
 
 On Dec 10, 2012, at 4:18 PM, David Young wrote:
 
  On Mon, Dec 10, 2012 at 03:50:00PM -0500, Thor Lancelot Simon wrote:
  On Mon, Dec 10, 2012 at 02:28:28PM -0600, David Young wrote:
  On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
  
  a) #define macros tend to get optimised better.
  
  Better even than an __attribute__((always_inline)) function?
  
  I'd like to submit that neither are a good thing, because human
  beings are demonstrably quite bad at deciding when things should
  be inlined, particularly in terms of the cache effects of excessive
  inline use.
  
  I agree with that.  However, occasionally I have found when I'm
  optimizing the code based on actual evidence rather than hunches, and
  the compiler is letting me down, always_inline was necessary.
  
  Dave
 
 Is that because of compiler bugs, or because the compiler was doing
 what it's supposed to be doing?

The compiler has some heuristics about what it is good to inline.
gcc tends to treat 'inline' as just a hint.

Generally it likes inlining static functions that are only called once.
But it doesn't always do so - even when marked 'inline'
(marking them inline may have no effect).

Inlining leaf functions is particularly useful - as it removes a lot
of register pressure in the calling function.
If you can inline all of a function's calls - making it a leaf function - it is even better.

David

-- 
David Laight: da...@l8s.co.uk


Re: KNF and the C preprocessor

2012-12-10 Thread David Laight
On Mon, Dec 10, 2012 at 06:47:16PM -0500, Mouse wrote:
 
  b) __LINE__ (etc) have the value of the use, not the definition.
  Yes, but if you use static inlines, the debugger's got both -- which
  it won't, if you use macros...
 
 Huh?
 
 Okay, what's the static inline version of log() here?
 
 #define log(msg) log_(__FILE__,__LINE__,(msg))
 extern void log_(const char *, int, const char *);

I see a #define lurking!

David

-- 
David Laight: da...@l8s.co.uk


Re: nfsd serializing patch

2012-12-07 Thread David Laight
On Fri, Dec 07, 2012 at 06:46:41AM +, YAMAMOTO Takashi wrote:
 hi,
 
  Hello,
  while working on nfs performance issues with overquota writes (which
  turned out to be a ffs issue), I came up with the attached patch.
  What this does is, for nfs over TCP, restrict a socket buffer processing
  to a single thread (right now, all pending requests are processed
  by all threads in parallel). This has two advantages:
  - if a single client sends lots of requests (like writes coming from a
linux client), a single thread is busy and other threads will be
available to serve other client's requests quickly
  - by avoiding CPU cache sharing and lock contention at the vnode level
(if all requests are for the same vnode, which is the common case),
  we get slightly better performance.
  
  My testbed is a linux box with 2 Opteron 2431 (12 core total) and 32GB RAM
  writing over gigabit ethernet to a NetBSD server (dual
  Intel(R) Xeon(TM) CPU 3.00GHz, 4 hyperthread cores total) running nfsd 
  -tun4.
  Without the patch, the server processes about 1230 writes per second,
  with this patch it processes about 1250 writes/s.
  
  Comments ?
 
 interesting.
 
 but doesn't it have ill effects if the client has multiple independent
 activities on the mount point?

They will be hitting the same physical disc, so probably queue behind
each other.

I've never seen any reason for the historical '4 nfsd server processes'.
A lot of configurations work better with only 1.

I've seen cases where the nfs client would be buffering writes, then
decide to write a whole load of pages out of the buffer cache.
This (or maybe something else) led to a considerable number of concurrent 8k
nfs writes. The server processes pick one each and the disk becomes busy.
The disk access algorithm (probably staircase) leaves one of the requests
unfulfilled as new requests for nearer sectors keep arriving.
The stalled nfs request times out and is retried.
The stalled request finally finishes, but the rpc request has been timed
out so is discarded.
You now have multiple retry requests making matters worse, almost no
progress is made (this is the ethernet trace I was given!).
This is fairly typical if the server is slow/overloaded.
With only one server process it is all fine.

David

-- 
David Laight: da...@l8s.co.uk


Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-12-07 Thread David Laight
On Fri, Dec 07, 2012 at 09:57:12AM -0500, Greg Troxel wrote:
 
 jnem...@victoria.tc.ca (John Nemeth) writes:
 
  On Apr 27,  3:15am, David Laight wrote:
  } One thing I discovered long ago, in an operating system far ... well
  } not NetBSD is that dhcp's use of the bpf (equivalent) caused a data
  } copy for every received ethernet frame - at considerable cost.
  } I've NFI whether this happens with the current code.
 
   Given that DHCP is very low traffic, I'm not sure that this really
  matters.
 
 I don't think that's what he means.  In most drivers, the idiom is
 
  if (there are bpf listeners) {
m0 = cons up an mbuf chain that represents the packet
bpf_mtap(m0, blah blah)
  }
 
 So the work to marshall the packet that might be tapped happens if there
 is a listener, not if the listener wants this packet.

You've also missed the fact that it wasn't NetBSD - try VxWorks.
All the filtering happened in the dhcp code.

David

-- 
David Laight: da...@l8s.co.uk


Re: nfsd serializing patch

2012-12-07 Thread David Laight
On Fri, Dec 07, 2012 at 10:46:44AM +0100, Ignatios Souvatzis wrote:
 On Fri, Dec 07, 2012 at 08:42:39AM +, David Laight wrote:
 
  With only one server process it is all fine.
 
 If all is backed by the same disk? Hm 

Yep - even with multiple disks all the server process soon end up
trying to process requests for the 'busy' disk.

To make multiple server processes work well, there would have to be
a separate queue for each disk spindle.
But AFAIK the server processes just pull the next request off a single
queue - originally the udp socket.

David

-- 
David Laight: da...@l8s.co.uk


Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-12-05 Thread David Laight
On Tue, Dec 04, 2012 at 10:17:23PM -0800, John Nemeth wrote:
 On Apr 22,  5:50pm, Robert Elz wrote:
 
  We use ISC's DHCP server.  As third party software, it is designed
 to be portable to many systems.  BPF is a fairly portable interface,
 thus a reasonable interface for it to use.

One thing I discovered long ago, in an operating system far ... well
not NetBSD is that dhcp's use of the bpf (equivalent) caused a data
copy for every received ethernet frame - at considerable cost.
I've NFI whether this happens with the current code.

Although DHCP has to do strange things in order to acquire the
original lease, renewing it should really only require packets
with the current IP address.

David

-- 
David Laight: da...@l8s.co.uk


Re: filesystem namespace regions, or making mountd less bozotic

2012-12-05 Thread David Laight
On Wed, Dec 05, 2012 at 09:29:06PM +, David Holland wrote:
 I am tired of PR 3019 and its many duplicates, so I'd like to see a
 scheme that allows managing arbitrary subtrees of the filesystem
 namespace in a reasonably useful manner.
 
 The immediate application is nfs exports and mountd; however, I expect
 the resulting mechanism will also be useful for handling chroots and
 possibly also inotify-type mechanisms.

Haven't you forgotten about 'file handles'?
Since they refer to files you don't know anything about the containing
directory.

In the old days NFS had the following 'rules':
1) If you export part of a filesystem, you export all of it.
2) If you give anyone access, you give everyone access.
3) If you give anyone write access, you give everyone write access.

I suspect 2 & 3 are no longer true (in NetBSD) as nfs checks the
permissions, not just mountd.
1 is true if clients can 'fake up' valid file handles (used to be very
easy).

David

-- 
David Laight: da...@l8s.co.uk


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Laight
On Tue, Dec 04, 2012 at 02:57:52PM +, David Holland wrote:
   
   What's a kernel panic got to do with it?  If you hand the controller
   and thus the drive 4K write, the kernel panicing won't suddenly cause
   you to reverse time and have issued 8 512-byte writes instead.
 
 That depends on additional properties of the pathway from the FS to
 the drive firmware. It might have sent 1 of 2 2048-byte writes before
 the panic, for example. Or it might be a vintage controller incapable
 of handling more than one sector at a time.

The ATA command set supports writes of multiple sectors and multi-sector
writes (probably not using those terms though!).

In the first case, although a single command is written the drive
will (effectively) loop through the sectors writing them 1 by 1.
All drives support this mode.

For multi-sector writes, the data transfer for each group of sectors
is done as a single burst. So if the drive supports 8 sector multi-sector
writes, and you are doing PIO transfers, you take a single 'data'
interrupt and then write all 4k bytes at once (assuming 512 byte sectors).
The drive identify response indicates whether multi-sector writes are
supported, and if so how many sectors can be written at once.
If the data transfer is DMA, it probably makes little difference to the
driver.

For quite a long time the netbsd ata driver mixed them up - and would
only request writes of multiple sectors if the drive supported multi-sector
writes.

Multi-sector writes are probably quite difficult to kill part way through
since there is only one DMA transfer block.

   Given how drives actually write data, I would not be so sanguine
   that any sector, of whatever size, in-flight when the power fails,
   is actually written with the values you expect, or not written
   at all.
 
 Yes, I'm aware of that. It remains a useful approximation, especially
 for already-existing FS code.

Given that (AFAIK) a physical sector is not dissimilar from an HDLC frame,
once the write has started the old data is gone; if the write is actually
interrupted you'll get a (correctable) bad sector.
If you are really unlucky the write will be long - and trash the
following sector (I managed to power off a floppy controller before it
wrecked the rest of a track when I'd reset the writer with write enabled).
If you are really, really unlucky I think it is possible to destroy
adjacent tracks.

David

-- 
David Laight: da...@l8s.co.uk


Re: wapbl_flush() speedup

2012-12-04 Thread David Laight
On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote:
 
 On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote:
 
  hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
  
  The attached diff tries to coalesce writes to the journal in MAXPHYS
  sized and aligned blocks.
  [...]
  Comments or objections anyone?
  
  + * Write data to the log.
  + * Try to coalesce writes and emit MAXPHYS aligned blocks.
  
  Looks fine, but I would prefer the code to use an arbitrarily sized
  buffer in case we get individual per device transfer limits. Currently
  that size would still be MAXPHYS, but then the code could query the driver
  for a proper size.
 
 As `struct wapbl' is per-mount and I suppose this will be per-mount-static
 it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as
 tls-maxphys comes to head.

Except that you want the writes to be preferably aligned to that length,
not just of that length.
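The chunking arithmetic is straightforward; a sketch with an assumed 64k MAXPHYS and a hypothetical next_chunk() helper (not the actual wapbl code):

```c
#include <stddef.h>

#define MAXPHYS (64 * 1024)     /* assumed per-device transfer limit */

/*
 * Hypothetical helper: size of the next chunk so that the first write
 * only runs up to the next MAXPHYS boundary; every later write then
 * starts aligned and can be a full MAXPHYS until the tail.
 */
static size_t
next_chunk(unsigned long long off, size_t resid)
{
    size_t space = MAXPHYS - (size_t)(off % MAXPHYS);

    return resid < space ? resid : space;
}
```

Only the first write can be short; everything after it starts on a boundary.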

David

-- 
David Laight: da...@l8s.co.uk


Re: FFS write coalescing

2012-12-03 Thread David Laight
On Mon, Dec 03, 2012 at 06:21:30PM +0100, Edgar Fuß wrote:
 When FFS does write coalescing, will it try to align the resulting 64k chunk?
 I.e., if I have 32k blocks and I write blocks 1, 2, 3, 4; will it write (1,2)
 and (3,4) or 1, (2,3) and 4?
 Of course, the background for my question is RAID stripe alignment.

With that thought, for RAID5 in particular, you'd want the FS code
to indicate to the disk that it had some of the nearby data in memory.
That would save the read of the parity data.
(Which would be really horrid to implement!)

Perhaps the 'disk' should give some 'good parameters' for writes to the
FS code when the filesystem is mounted.

David

-- 
David Laight: da...@l8s.co.uk


Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-11-30 Thread David Laight
On Wed, Nov 28, 2012 at 03:19:49PM -0800, Brian Buhrow wrote:
   Hello.  I've just noticed an issue where broadcast traffic on vlans
 also shows up on the parent interface.  My environment is NetBSD-5.1/i386
 with the wm(4) driver.  I'm not sure yet if the problem is specific to the
 wm(4) driver or if it's a more general issue.  The bug  didn't exist in
 NetBSD-4.x.

There are some very recent messages on the Linux 'netdev' list (or
possibly the tcpdump one) about issues with vlan tags being visible
(or not) on the messages passed to tcpdump (and maybe processed
internally).
I only skim-read it so can't remember the exact issue - but it is
similar.

David

-- 
David Laight: da...@l8s.co.uk


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-30 Thread David Laight
On Fri, Nov 30, 2012 at 08:00:51AM +, Michael van Elst wrote:
 da...@l8s.co.uk (David Laight) writes:
 
 I must look at how to determine that disks have 4k sectors and to
 ensure filesystesm have 4k fragments - regardless of the fs size.
 
 newfs should already ensure that fragment = sector.

These disks lie about their actual sector size.
The disk's own software does RMW cycles for 512 byte writes.
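The resulting rule for avoiding the drive's internal read-modify-write is easy to state in code; a sketch, assuming a 4096-byte physical sector:

```c
#include <stdbool.h>
#include <stddef.h>

#define PHYS_SECTOR 4096u   /* assumed physical sector size of the drive */

/*
 * A write only avoids the drive's internal read-modify-write cycle if
 * it both starts on a physical sector boundary and is a whole number
 * of physical sectors long.
 */
static bool
avoids_rmw(unsigned long long off, size_t len)
{
    return (off % PHYS_SECTOR == 0) && (len % PHYS_SECTOR == 0);
}
```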

David

-- 
David Laight: da...@l8s.co.uk


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-11-26 Thread David Laight
On Mon, Nov 26, 2012 at 01:49:09AM +0300, Alan Barrett wrote:
 
 If necessary, the open(2) syscall could be versioned so that 
 O_RDONLY is no longer defined as zero.

Actually we could redefine (say)
O_RDONLY  0x1
O_WRONLY  0x2
O_RDWR    (O_RDONLY | O_WRONLY)
O_SEARCH  0x4
O_EXEC    0x8
(or similar)
and fall back on the value in bits 0 and 1 if none of the above are set.

That doesn't require a syscall version bump.
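A minimal sketch of that decode (the flag values here are made up to stay clear of the historical bits 0-1, in the spirit of the "(or similar)" above; this is not any real kernel's code):

```c
/* Made-up new-style values, chosen clear of the historical bits 0-1. */
#define N_RDONLY 0x04
#define N_WRONLY 0x08
#define N_RDWR   (N_RDONLY | N_WRONLY)
#define N_SEARCH 0x10
#define N_EXEC   0x20

/*
 * Decode an open(2) flags word into new-style access bits, falling back
 * on the historical encoding (0 = rdonly, 1 = wronly, 2 = rdwr) held in
 * bits 0 and 1 when no new-style bit is present.
 */
static int
decode_access(int oflag)
{
    int acc = oflag & (N_RDONLY | N_WRONLY | N_SEARCH | N_EXEC);

    if (acc != 0)
        return acc;             /* caller used new-style bits */
    switch (oflag & 3) {        /* historical value in bits 0 and 1 */
    case 0: return N_RDONLY;
    case 1: return N_WRONLY;
    case 2: return N_RDWR;
    default: return -1;         /* 3 was never a valid mode */
    }
}
```

Old binaries keep passing 0/1/2 and get decoded by the fallback; recompiled ones always set at least one new bit.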

David

-- 
David Laight: da...@l8s.co.uk


Re: fexecve, round 3

2012-11-25 Thread David Laight
On Sun, Nov 25, 2012 at 07:54:59PM +, Christos Zoulas wrote:
 
  Does everyone agrees on this interpretation? If we do, next steps are
  - describe threats this introduce to chrooted processes

Given a chrooted process would need a helping process outside the
chroot (to pass it the fd), why is allowing the chrooted proccess to
exec something any different from it arranging to get the helper
to do it?

I think it can only matter if the uid of the chroot is root.
Even then you could (probably) do nothing you couldn't do by
mmaping some anon space with exec permissions and writing code to it.

FWIW IIRC the standard says that O_EXEC can't be applied with O_RDONLY
(Or O_RDWR) but does it say that you can't read from a file opened O_EXEC ?

David

-- 
David Laight: da...@l8s.co.uk


Re: fexecve, round 2

2012-11-19 Thread David Laight
On Mon, Nov 19, 2012 at 05:23:07AM +, David Holland wrote:
 On Sun, Nov 18, 2012 at 06:51:51PM +, David Holland wrote:
   This appears to contradict either the description of O_EXEC in the
   standard, or the standard's rationale for adding fexecve().  The
   standard says O_EXEC causes the file to be open for execution only.
   
   In other words, O_EXEC means you can't read nor write the file.  Now
   the rationale for fexecve() doesn't hold, since you cannot read from
   the fd, then exec from it without a reopen.
   
   Further, requiring O_EXEC would seem to directly contravene the
   standard's language about fexecve()'s behavior.
 
 The standard is clearly wrong on a number of points and doesn't match
 the historical design and behavior of Unix. Let's either implement
 something correct, or not implement it at all.
   
   Also it seems that the specification of O_SEARCH (and I think the
   implementation we just got, too) is flawed in the same way - it is
   performing access checks at use time instead of at open time.
 
 So, for the record, I think none of these flags should be added unless
 they behave the same way opening for write does -- the flag cannot be
 set except at open time, and only if the opening process has
 permission to make the selected type of access; once opened the
 resulting file handle functions as a capability that allows the
 selected type of access. Anything else creates horrible
 inconsistencies and violates the principle of least surprise, both of
 which are not acceptable as part of the access control system.

Does fchmod() itself have any issues?
If I open a file that doesn't have write permissions, I can use fchmod()
to add write permissions. My open fd won't magically gain write access,
but maybe I can open it again via /dev/fd (possibly after linking the
inode back into the filesystem) and gain the extra permissions.

Clearly I would need to be the owner, but with chroots that shouldn't
be enough if the file might actually be outside the chroot.
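The first half of that is easy to demonstrate today; a small sketch (the temp-file path is arbitrary, and this says nothing about the /dev/fd re-open trick):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if the behaviour described above holds: fchmod() through a
 * read-only fd can add write permission, and a fresh open picks it up,
 * but the original fd never gains write access.
 */
static int
fchmod_demo(void)
{
    char path[] = "/tmp/fchmod-demo.XXXXXX";
    int fd, rfd, wfd;

    if ((fd = mkstemp(path)) == -1)
        return -1;
    fchmod(fd, 0400);
    close(fd);

    if ((rfd = open(path, O_RDONLY)) == -1)
        return -1;

    /* The owner can add write permission through the read-only fd... */
    if (fchmod(rfd, 0600) != 0)
        return -1;

    /* ...but the existing fd does not magically gain write access... */
    if (write(rfd, "x", 1) != -1 || errno != EBADF)
        return -1;

    /* ...while a second open, after the fchmod, does. */
    if ((wfd = open(path, O_RDWR)) == -1 || write(wfd, "x", 1) != 1)
        return -1;

    close(rfd);
    close(wfd);
    unlink(path);
    return 0;
}
```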

David

-- 
David Laight: da...@l8s.co.uk


Re: fexecve, round 2

2012-11-19 Thread David Laight
On Mon, Nov 19, 2012 at 08:08:58AM +, Emmanuel Dreyfus wrote:
 On Mon, Nov 19, 2012 at 05:23:07AM +, David Holland wrote:
  Also, it obviously needs to be possible to open files O_RDONLY|O_EXEC
  for O_EXEC to be useful, and open directories O_RDONLY|O_SEARCH, and
  so forth. I don't know what POSIX may have been thinking when they
  tried to forbid this but forbidding it makes about as much sense as
  forbidding O_RDWR, maybe less.
 
 It seems consistent with the check at system call time that you proposed 
 to forbid. Here is how I understand it for an openat/mkdirat sequence:
 - openat() without O_SEARCH, get a search check at mkdirat() time
 - openat() with O_SEARCH, mkdirat() performs no search check.
 
 and for openat/fexecve:
 - openat() without O_EXEC, get an execute check at fexecve() time
 - openat() with O_EXEC, fexecve() performs no exec check.
 
 If you have r-x permission, you open with O_RDONLY and you do not need
 O_SEARCH/O_EXEC. 
 
 If you have --x permission, you open with O_SEARCH/O_EXEC

I think the standard implied that O_EXEC gave you read and execute
permissions. So you can't use it to open files that are --x.

I haven't seen a quote for O_SEARCH.
Without the xxxat() functions the read/write state of directory fds
(as opposed to that of the directory itself) has never mattered.
O_SEARCH might be there to allow you to open "." when you don't
have read (or write) access to it.

For openat() it is plausible that write access to the directory fd
might be needed as well as write access to the underlying directory
in order to create files.

David

-- 
David Laight: da...@l8s.co.uk


Re: fexecve, round 2

2012-11-19 Thread David Laight
On Mon, Nov 19, 2012 at 11:25:07AM -0500, Thor Lancelot Simon wrote:
 On Mon, Nov 19, 2012 at 03:13:02PM +, Emmanuel Dreyfus wrote:
  On Mon, Nov 19, 2012 at 02:39:36PM +, Julian Yon wrote:
   No, Emmanuel is right: [...] use the O_EXEC flag when opening fd. In
   this case, the application will not be able to perform a checksum test
   since it will not be able to read the contents of the file. You can
   open with --x but (correctly) you can't read from the file.

Given the comments later about O_SEARCH | O_RDONLY not being distinguishable
from O_SEARCH (because, historically, O_RDONLY is zero) and 'similarly
for O_EXEC' I suspect the sections got reworded quite
late on - and probably after the bar had opened and everyone at the
meeting was hungry!

I suspect that, for --x-- items, opens with O_EXEC or O_SEARCH
might need to succeed, and any later read/mmap requests fail.

  And it means the standard mandates that one can execute without
  read access. Weird.
 
 What's weird about that?
 
 % cp /bin/ls /tmp
 % chmod 100 /tmp/ls
 % ls -l /tmp/ls
 ---x------  1 tls  users  24521 Nov 19 11:24 /tmp/ls
 % /tmp/ls -l /tmp/ls
 ---x------  1 tls  users  24521 Nov 19 11:24 /tmp/ls
 %

More fun are #! scripts that are --s--
Typically they can be executed by everyone except the owner!
(Provided suid scripts are allowed - and I don't know any reason
why they shouldn't be, provided the kernel passes the open fd to
the interpreter.)

David

-- 
David Laight: da...@l8s.co.uk


Re: fexecve, round 2

2012-11-18 Thread David Laight
On Sat, Nov 17, 2012 at 11:48:20AM +0100, Emmanuel Dreyfus wrote:
 Here is an attempt to address what was said about implementing fexecve()
 
 fexecve() checks that the vnode underlying the fd :
 - is of type VREG
 - grants execution right
 
 O_EXEC  cause open()/openat() to fail if the file mode does not grant
 execute rights
 
 There are security concerns with fd passed to chrooted processes, which
 could help executing code. Here is a proposal for chrooted processes:
 1) if current process and executed vnode have different roots, then
 fexecve() fails 

I'm not sure how you were intending to determine that.
You can follow .. for directories, but not files.

 2) if the fd was not open with O_EXEC, fexecve() fails.
 
 First point avoids executing code from outside the chroot
 Second point enforces W^X inside the chroot.

If we don't want to allow chroot'ed process to exec a file that
is outside the chroot, then maybe the kernel could hold a reference
to the directory vnode (in the file vnode) whenever a file is opened
for execute (including the existing exec() family of calls).

As well as being used to police fexecve() within a chroot,
it could be used by the dynamic linker for $ORIGIN processing
(Probably by some special flags to openat().).

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] fexecve

2012-11-17 Thread David Laight
On Fri, Nov 16, 2012 at 12:52:30PM +, Julian Yon wrote:
 On Fri, 16 Nov 2012 08:34:29 +
 David Laight da...@l8s.co.uk wrote:
 
  On Thu, Nov 15, 2012 at 10:14:18PM +0100, Joerg Sonnenberger wrote:
   
   Frankly, I still don't see the point why something would want to
   use it.
  
   How about running a statically linked executable inside a chroot without
   needing the executable itself to do the chroot.
 
 What does this gain over passing a filename around? (NB. I'm not
 claiming that's an entirely safe model either, but it's already
 possible).

You don't need the executable image inside the chroot.

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] fexecve

2012-11-16 Thread David Laight
On Thu, Nov 15, 2012 at 10:14:18PM +0100, Joerg Sonnenberger wrote:
 On Thu, Nov 15, 2012 at 08:20:30PM +0100, Emmanuel Dreyfus wrote:
  Thor Lancelot Simon t...@panix.com wrote:
  
   The point is, this is interesting functionality that makes something
   new possible that is potentially useful from a security point of view,
   but the new thing that's possible also breaks assumptions that existing
   code may rely on to get security guarantees it wants.  
  
  Well, it is standard mandated and we want to be standard compliant. If
  it is a security hazard, we can have a sysctl to disable the system
  call. Something like
  sysctl -w kern.fexecve = 0 and it would return ENOSYS.
 
 Frankly, I still don't see the point why something would want to use it.

How about running a statically linked executable inside a chroot without
needing the executable itself to do the chroot.

Oh, and now make $ORIGIN work for dynamic executables and fexec() :-)
(Probably not a good idea inside chroots! At least you wouldn't want it
to work AFTER the initial program load.)

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] fexecve

2012-11-16 Thread David Laight
On Thu, Nov 15, 2012 at 04:02:50PM -0500, Thor Lancelot Simon wrote:
 
  From the spec: "The purpose of the fexecve() function is to enable
  executing a file which has been verified to be the intended file. It is
  possible to actively check the file by reading from the file descriptor
  and be sure that the file is not exchanged for another between the
  reading and the execution." ...which seems a reasonable enough thing to
  want to do.
 
 Look at that rationale carefully and I think you will see the race condition
 that it does not eliminate.  Talk about a solution looking for a problem!

You could create a temporary file, unlink it, copy the executable
into the new file, verify the contents, and then exec the
unlinked temporary file.

Better add an open mode that hard disables writes (as used on many
systems for executables anyway), open the file with that mode ...

David

-- 
David Laight: da...@l8s.co.uk


Re: [PATCH] POSIX extended API set 2

2012-11-11 Thread David Laight
On Sun, Nov 11, 2012 at 04:19:03AM -0800, Matt Thomas wrote:
 
 On Nov 11, 2012, at 12:39 AM, Alan Barrett wrote:
 
  I want the names to follow a clear and easily-documented pattern.
  
   Takes a name    Takes a fd, not a name    Takes a name and an at fd
                   (prepend f)               (append at)
   --------------  ------------------------  -------------------------
   open            - (fopen is different)    openat
   link            -                         linkat
   unlink          -                         unlinkat
   rename          -                         renameat
   chdir           fchdir                    chdirat
   mkdir           fmkdir                    mkdirat
   mkfifo          fmkfifo                   mkfifoat
   utimens         futimens                  utimensat
   chmod           fchmod                    chmodat (not fchmodat)
   chown           fchown                    chownat (not fchownat)
   stat            fstat                     statat (not fstatat)
   access          -                         accessat (not faccessat)
  
  However, I also want the inconsistent POSIX names to be provided.
 
 Don't forget
 
 chroot          fchroot                   chrootat

How do these names fit into the previously reserved namespaces?
Or has that been completely ignored by the Posix folks again?

David

-- 
David Laight: da...@l8s.co.uk


Re: cprng sysctl: WARNING pseudorandom rekeying.

2012-11-10 Thread David Laight
On Fri, Nov 09, 2012 at 06:53:45PM -0500, Greg Troxel wrote:
 
 FWIW, I agree with the notion that defaults should be at a path that is
 ~always in root; it's normal to have /var in a separate filesystem (at
 least for old-school UNIX types; I realize the kids these days think
 there should be one whole-disk fs as /).

I always try to separate the OS files from my files, mainly so that
I can reinstall the OS (often into a different root filesystem)
and still have access to the other filesystems.
As well as /home, I also put any big source trees in their own fs.
(So I'd never have /usr/src ...)

David

-- 
David Laight: da...@l8s.co.uk


Re: WAPL/RAIDframe performance problems

2012-11-10 Thread David Laight

 For example, /usr/include/ufs/ffs/fs.h suggests that the
 super block could be in one of 4 different places on your partition,
 depending on what size your disk is, and what version of superblock you're
 using. 

From my memory of the ffs disk layout, fs block/sector numbers start from
the beginning of the partition and just avoid allocating the area
containing the superblock copies.
So the position of the superblocks (one exists in each cylinder group)
is rather irrelevant.

What is more likely to cause grief is 512 byte writes - since modern
disks have 4k physical sectors.
I think netbsd tends to do single sector writes for directory entries
and the journal - these will be somewhat suboptimal!

David

-- 
David Laight: da...@l8s.co.uk


Re: suenv

2012-10-25 Thread David Laight
On Thu, Oct 25, 2012 at 03:58:33AM +0200, Emmanuel Dreyfus wrote:
 David Laight da...@l8s.co.uk wrote:
 
  Wasn't there a recent change to ld so that NEEDED entries are only
  added if the shared library is needed to resolve symbols?
  
  Which makes the naive addition of -lpthread useless.
 
 That seems the best workaround to the problem, but it was either not
 pulled up in netbsd-6, or it does not work. What files were touched
 with that change? It is not obvious in ld.elf_so.

I'm thinking of a change to ld itself, maybe NetBSD hasn't imported
that version yet.
(Or my brain cells are faulty.)

David

-- 
David Laight: da...@l8s.co.uk


Re: suenv

2012-10-24 Thread David Laight
On Tue, Oct 23, 2012 at 06:08:37PM +0200, Martin Husemann wrote:
 On Tue, Oct 23, 2012 at 04:31:52PM +0200, Emmanuel Dreyfus wrote:
  Opinions?
 
 Either PAM modules should not be allowed to use shared libraries that
 use pthreads, or we need to make sure every application using PAM is
 linked against libpthread.

Wasn't there a recent change to ld so that NEEDED entries are only
added if the shared library is needed to resolve symbols?

Which makes the naive addition of -lpthread useless.

I also dug into some linux libc.so and libpthread.so.
AFAICT the mutex and condvar functions are in libc.
There are some stubs in libpthread that do a jump indirect,
not sure what that is based on though.

David

-- 
David Laight: da...@l8s.co.uk


Re: Raidframe and disk strategy

2012-10-16 Thread David Laight
On Tue, Oct 16, 2012 at 08:12:39PM +0200, Edgar Fuß wrote:
  processes would get stuck in biowait
 What I usually see is one of nfsd's four subthreads in biowait and the other 
 three in tstile.

FWIW I've NFI why nfsd has this default of 4 threads.
I've seen a lot of systems where 1 works much better
(for all sorts of reasons).

David

-- 
David Laight: da...@l8s.co.uk


Re: NetBSD vs Solaris condvar semantics

2012-10-14 Thread David Laight
On Sun, Oct 14, 2012 at 02:27:48PM +, Taylor R Campbell wrote:
Date: Sun, 14 Oct 2012 09:37:09 +0200
From: Martin Husemann mar...@duskware.de
 
In the zfs code, where do they store the mutex needed for cv_wait?
 
 In the two cases I have come across, dirent locks and range locks, a
 number of condvars, one per dirent or one per range, share a common
 mutex in some common enclosing object, such as a znode.  So, e.g., the
 end of zfs_dirent_unlock looks like
 
 cv_broadcast(&dl->dl_cv);   /* dl is a dirent lock stored in dzp.  */
 mutex_unlock(&dzp->z_lock);
 cv_destroy(&dl->dl_cv);
 kmem_free(dl, sizeof(*dl));

What do the waiters actually look like?
A lot of cv definitions allow for 'random' wakeups.
e.g. cv_broadcast() is allowed to wake up every waiter on the cv.
So after being woken you are required to check something.
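That check-after-wakeup requirement is the usual while-loop idiom; a small sketch using pthreads (standing in here for the kernel cv API):

```c
#include <pthread.h>
#include <stdbool.h>

/* Shared state: the predicate a waiter must re-check after every wakeup. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool done = false;

static void *
waker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    done = true;                    /* change the predicate... */
    pthread_cond_broadcast(&cv);    /* ...then wake every waiter */
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Waits for `done`; returns 0 once the predicate really holds. */
static int
wait_for_done(void)
{
    pthread_t t;

    if (pthread_create(&t, NULL, waker, NULL) != 0)
        return -1;
    pthread_mutex_lock(&mtx);
    /*
     * A while loop, not an if: with broadcast (and spurious) wakeups,
     * waking up proves nothing - only the predicate does.
     */
    while (!done)
        pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
    pthread_join(t, NULL);
    return 0;
}
```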

David

-- 
David Laight: da...@l8s.co.uk

