Re: i386 4/4 change

2018-04-01 Thread Bruce Evans


On Sun, 1 Apr 2018, Dimitry Andric wrote:


On 31 Mar 2018, at 17:57, Bruce Evans <b...@optusnet.com.au> wrote:


On Sat, 31 Mar 2018, Konstantin Belousov wrote:


the change to provide full 4G of address space for both kernel and
user on i386 is ready to land.  The motivation for the work was to both
mitigate Meltdown on i386, and to give more breathing space to the
still-used 32-bit architecture.  The patch was tested by Peter Holm, and I am
satisfied with the code.

If you use i386 with HEAD, I recommend you to apply the patch from
https://reviews.freebsd.org/D14633
and report any regressions before the commit, not after.  Unless
a significant issue is reported, I plan to commit the change somewhere
at Wed/Thu next week.

Also I welcome patch comments and reviews.


It crashes at boot time in getmemsize() unless booted with loader which
I don't want to use.



For me, it at least compiles and boots OK, but I'm one of those crazy
people who use the default boot loader. ;)


I found a quick fix and sent it to kib.  (There are 2 crashes in the vm86
code for memory sizing.  This code is not called if loader is used && the
system has SMAP.  Old systems don't have SMAP, so they crash even if loader
is used.)


I haven't yet run any performance tests, I'll try building world and a
few large ports tomorrow.  General operation from the command line does
not feel "sluggish" in any way, however.


Further performance tests:
- reading /dev/zero using tinygrams is 6 times slower
- read/write of a pipe using tinygrams is 25 times slower.  It also gives
  unexpected values in wait statuses on exit, hopefully just because a
  bug in the test program is exposed by the changed timing (but later
  it also gave SIGBUS errors).  This does a context switch or 2 for every
  read/write.  It now runs 7 times slower using two 4.0 GHz CPUs than
  FreeBSD-5 did using one 2.0 GHz CPU.  The faster CPUs and 2 of them used
  to make it run 4 times faster.  It shows another slowdown since FreeBSD-5,
  and much larger slowdowns since FreeBSD-1:

  1996 FreeBSD on P1  133MHz:   72k/s
  1997 FreeBSD on P1  133MHz:   44k/s (after dyson's opts for large sizes)
  1997 Linux   on P1  133MHz:   93k/s (simpler is faster for small sizes)
  1999 FreeBSD on K6  266MHz:  129k/s
  2018 FBSD-~5 on AthXP 2GHz:  696k/s
  2018 FreeBSD on i7  2x4GHz: 2900k/s
  2018 FBSD4+4 on i7  2x4GHz:  113k/s (faster than Linux on a P1 133MHz!!)

Netblast to localhost has much the same 6 times slowness as reading
/dev/zero using tinygrams.  This is the slowdown for syscalls.
Tinygrams are hard to avoid for UDP.  Even 1500 bytes is a tinygram
for /dev/zero.  Without 4+4, localhost is very slow because it does
a context switch or 2 for every packet (even with 2 CPUs when there is
no need to switch).  Without 4+4 this used to cost much the same as the
context switches for the pipe benchmark.  Now it costs relatively much
less since (for netblast to localhost) all of the context switches are
between kernel threads.

The pipe benchmark uses select() to avoid busy-waiting.  That was good
for UP.  But for SMP with just 2 CPUs, it is better to busy-wait and
poll in the reader and writer.

netblast already uses busy-waiting.  It used to be a bug that select()
doesn't work on sockets, at least for UDP, so blasting using busy-waiting
is the only possible method (timeouts are usually too coarse-grained to
go as fast as blasting, and if they are fine-grained enough to go fast
then they are not much better than busy-waiting with time wasted for
setting up timeouts).  SMP makes this a feature.  It forces use of busy-
waiting, which is best if you have a CPU free to run it and this method
doesn't take too much power.

Context switches to task queues give similar slowness.  This won't be
affected by 4+4 since task queues are in the kernel.  I don't like
networking in userland since it has large syscall and context switch
costs.  Increasing these by factors of 6 and 25 doesn't help.  It
can only be better by combining i/o in a way that the kernel neglects
to do or which is imposed by per-packet APIs.  Slowdown factors of 6
or 25 require the combined i/o to be 6 or 25 times larger to amortise the
costs.

Bruce
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: i386 4/4 change

2018-03-31 Thread Bruce Evans

On Sat, 31 Mar 2018, Konstantin Belousov wrote:


the change to provide full 4G of address space for both kernel and
user on i386 is ready to land.  The motivation for the work was to both
mitigate Meltdown on i386, and to give more breathing space to the
still-used 32-bit architecture.  The patch was tested by Peter Holm, and I am
satisfied with the code.

If you use i386 with HEAD, I recommend you to apply the patch from
https://reviews.freebsd.org/D14633
and report any regressions before the commit, not after.  Unless
a significant issue is reported, I plan to commit the change somewhere
at Wed/Thu next week.

Also I welcome patch comments and reviews.


It crashes at boot time in getmemsize() unless booted with loader which
I don't want to use.

It is much slower, and I couldn't find an option to turn it off.

For makeworld, the system time is slightly more than doubled, the user
time is increased by 16%, and the real time is increased by 21%.

On amd64, turning off pti and not having ibrs gives almost no increase
in makeworld times relative to old versions, and pti only costs about
5% IIRC.

Makeworld is not very syscall-intensive.  netblast is very syscall-intensive,
and its throughput is down by a factor of 5 (660/136 = 4.9, 1331/242 = 5.5).

netblast 127.0.0.1 5001 5 10 (localhost, port 5001, 5-byte tinygrams for 10 s):
537 kpps sent, 0 kpps dropped # before this patch (CPU use 1.3)
136 kpps sent, 0 kpps dropped # after (CPU use 2.1)

(Pure software overheads.  It uses 1.6 times as much CPU to go 4 times
slower).

netblast 192.168.2.8 (low end PCI33 lem on low latency 1 Gbps LAN)
275 kpps sent, 1045 kpps dropped  # before (CPU use 1.3)
245 kpps sent, 0 kpps dropped # after (CPU use 1.3)

(The hardware can't do anywhere near line rate of ~1500 kpps, so this
becomes a benchmark of syscalls and dropping packets.  The change makes
FreeBSD so slow that 8 CPUs at 4.08 GHz can't saturate a low end PCI33 NIC
(the hardware saturates at about 282 kpps for tx and about 400 kpps for
rx)).

netblast 192.168.2.8 (low end PCIe em on low latency 1 Gbps LAN)
   1316 kpps sent, 3 kpps dropped # before (CPU use 1.6)
243 kpps sent, 0 kpps dropped # after (CPU use 1.2)

This is seriously slower for the most useful case.  It reduces a system
that could almost reach line rate using about 2 of 8 CPUs at 4 GHz to
one that is slower than with 1 CPU at 2 GHz (the latter saturates
in software at about 640 kpps in old versions of FreeBSD and at about
400 kpps in -current).

Initial debugging of the crash: it crashes on the first pmap_kenter()
in getmemsize().  I configure debug.late_console to 0.  That works,
and without it getmemsize() can't even be debugged since it is after
console initialization and ddb entry with -d.

In getmemsize(), of course all the preload calls return 0 and smapbase is
NULL.  Then vm86 bios calls work and give basemem = 0x276.  Then
basemem_setup() is called and it returns. Then pmap_kenter() is called
and it crashes:

Stopped at  getmemsize+0xb3:pushl   $0x1000
Stopped at  getmemsize+0xb8:pushl   $0x1000
Stopped at  getmemsize+0xbd:callpmap_kenter
Stopped at  pmap_kenter:pushl   %ebp
Stopped at  pmap_kenter+0x1:movl%esp,%ebp
Stopped at  pmap_kenter+0x3:movl0x8(%ebp),%eax
Stopped at  pmap_kenter+0x6:shrl$0xc,%eax
Stopped at  pmap_kenter+0x9:movl0xc(%ebp),%edx
Stopped at  pmap_kenter+0xc:orl $0x3,%edx
Stopped at  pmap_kenter+0xf:movl%edx,PTmap(,%eax,4)

The last instruction crashes because PTmap is not mapped at this point:

db> p/x $edx
1003
db> p/x PTmap
ff80
db> p/x $eax
   1
db> x/x PTmap
PTmap:KDB: reentering
KDB: stack backtrace:
  db_trace_self_wrapper(cec5cb,1420a04,c6de83,1420978,1,...) at 
db_trace_self_wrapper+0x24/frame 0x142095c
kdb_reenter(1420978,1,ff80003a,1420998,8f1419,...) at kdb_reenter+0x24/frame 
0x1420968
trap(1420a10) at trap+0xa0/frame 0x1420a04
calltrap() at calltrap+0x8/frame 0x1420a04
--- trap 0xc, eip = 0xc5c394, esp = 0x1420a50, ebp = 0x1420a88 ---
db_read_bytes(ff81,3,1420aa0) at db_read_bytes+0x29/frame 0x1420a88
db_get_value(ff80,4,0,0,d2d304,...) at db_get_value+0x20/frame 0x1420ab4
db_examine(ff80,1,,1420b00) at db_examine+0x144/frame 0x1420ae4
db_command(cb1d99,1420be4,8f0f01,d1d28a,0,...) at db_command+0x20a/frame 
0x1420b90
db_command_loop(d1d28a,0,1420bac,1420b9c,1420be4,...) at 
db_command_loop+0x55/frame 0x1420b9c
db_trap(a,4ff0,1,1,80046,...) at db_trap+0xe1/frame 0x1420be4
kdb_trap(a,4ff0,1420cc4) at kdb_trap+0xb1/frame 0x1420c10
trap(1420cc4) at trap+0x523/frame 0x1420cb8
calltrap() at calltrap+0x8/frame 0x1420cb8
--- trap 0xa, eip = 0xc65a4a, esp = 0x1420d04, ebp = 0x1420d04 ---
pmap_kenter(1000,1000,1429000,8efe13,0,...) at pmap_kenter+0xf/frame 0x1420d04
getmemsize(1,5a8807ff,ee,59a80097,ee,...) at getmemsize+0xc2/frame 0x1420fc4

Re: Really weird behavior with terminals/sessions in past couple weeks

2017-05-13 Thread Bruce Evans

On Sat, 13 May 2017, Ngie Cooper (yaneurabeya) wrote:


On May 13, 2017, at 11:05, Ngie Cooper (yaneurabeya)  
wrote:


On May 13, 2017, at 11:01, Ngie Cooper (yaneurabeya)  
wrote:

Hi,
I've been noticing some really weird behavior with terminal input
after updating my kernel/userland -- in particular, if I do `arc diff
--create` (which opens vi/vim), and try to do edits/use ^c, it will terminate
the running process for `arc diff --create`. Similarly, I was seeing really
weird input via vim (when doing `svn ci`) where if I had one of the editing
modes on, like insert, it would delete several lines at once; I worked around
this by using ^c to terminate insert mode, but that's a really bad hack. It
worked ok with r316745, got worse in r317727, and doesn't seem to be any
better in r318250.


I forgot to mention: I'm using SSH to access my machine.


My gut feeling is the sc(4) commits might have tickled or introduced some bugs.
I'll try reverting the following commits over the next couple days to see
whether or not my experience improves: r316827 r316830 r316865 r316878 r316974
r316977 r317190 r317198 r317199 r317245 r317256 r317264.


I don't think I touched anything related to editing.  Certainly not for fixing
the mouse cursor starting some time before r317827.  Since then I have spent
too much time on mouse cursors and not much else.

Bruce

Re: kernel coding of nobody/nogroup

2017-04-21 Thread Bruce Evans

On Fri, 21 Apr 2017, Rick Macklem wrote:


I need to set the default uid/gid values for nobody/nogroup into kernel
variables. I reverted the commit that hardcoded them, since I agree that
wasn't a good thing to do.

I didn't realize that "nobody" was already defined in sys/conf.h and I can
use that.


I didn't know nobody was already there either.  They are only used by zfs,
while the others were originally only used for devices.


There is no definition for "nogroup" in sys/conf.h.
Would it be ok to add
#define GID_NOGROUP  65533
to sys/conf.h?
(I know bde@ doesn't like expressing this as 65533, but that is what it is in 
/etc/group.)


sys/conf.h already has GID_NOBODY but it is subtly different from
GID_NOGROUP.  It seems to be a bug that zfs uses nobody's gid instead
of the gid nogroup which is used by no body.
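The resulting defines would look something like this sketch; the placement and
the values of the existing macros are assumptions taken from the stock
/etc/passwd and /etc/group, where nobody is 65534 and nogroup is 65533:

```c
/* Sketch only; values assumed from stock /etc/passwd and /etc/group. */
#define	UID_NOBODY	65534	/* user nobody (already in sys/conf.h) */
#define	GID_NOBODY	65534	/* group nobody (already in sys/conf.h) */
#define	GID_NOGROUP	65533	/* proposed: group nogroup */
```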

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-31 Thread Bruce Evans

On Fri, 31 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 21:53, Bruce Evans wrote:

I think it was the sizing.  The non-updated mode is 80x25, so the row
address can be out of bounds in the teken layer.


I have text 80x30 mode set at the rc stage, and _after_ that may have many
kernel messages on the console, all without causing a reboot.  How is it
different from the shutdown stage?  The syscons mode is unchanged since the
rc stage.


Probably just because there weren't enough messages to go past row 24.
I had no difficulty reproducing the crash today for entering ddb and
reboot starting 80x30 and rows > 24, after removing just the window
size update in the fix.  I missed seeing it the other day because I
tested with 80x60 to see the smaller console window more clearly, but
must have only tried rebooting with row <= 24.

Another recent fix for sc reduced the problem a little.  Mode changes
are supposed to clear the screen and move the cursor to home, but they
only clear the screen.  You should have noticed the ugliness from that
after the switch to 80x30.  There are enough boot messages to
reach row 24 and messages continued from there.  Now they start at the
top of the screen again.  Clearing the messages is not ideal, but syscons
always did it.

Syscons also has new and old bugs preserving colors across mode changes:
- it never preserved changes to the palette (FBIO_SETPALETTE ioctl).
  Some mode changes should reset the palette, but some should not.
  Especially not ones for a vt switch
- BIOSes should reset the palette for mode changes (even to the same mode).
  Some BIOSes are confused by syscons setting the DAC to 8 bit mode and
  reset to a garbage (dark) palette then.  They always switch back to
  6 bit mode
- syscons used to maintain the current colors and didn't change them for
  mode changes.  This was slightly broken, since for a mode change from
  a mode with full color to one with less color, the interpretation of
  the color indexes might change.  The colors are now maintained by
  teken and syscons tells teken to do a full window size change which
  resets the entire teken state including colors.  This bug is normally
  hidden by vidcontrol refreshing the colors.

  vidcontrol could be held responsible for refreshing or resetting
  everything after a mode change ioctl, but I think this is backwards
  since there are many low-level details that are better handled in
  the driver.  Switching to graphics modes is already a complicated
  2-ioctl process with not enough options and poor error handling.
  Like a too-simple wrapper for fork-exec.

vt has some interesting related bugs.  It doesn't support mode switches
of course, and even changing the font seems to be unsupported in text
mode.  But in graphics mode, changing the font works and even redraws
the screen where syscons would clear it for the mode change.  But there
are bugs redrawing the screen -- often old history is redrawn.  This
should work like in xterm or a general X window refresh where the
redrawing must be done for lots of other events than resize (exposure,
etc.).


- sysctl debug.kdb.break_to_debugger.  This is documented in ddb(4), but
  only as equivalent to the unbroken BREAK_TO_DEBUGGER.


Thanx.  Setting debug.kdb.break_to_debugger=1 makes both Ctrl-Alt-ESC and
Ctrl-PrtScr work in sc-only mode, and the "c" exit doesn't cause all chars
to beep like in vt.  I.e. it works.  But I don't understand why serial
debugging is involved in the sc case while not in the vt case, and I fear
that some serial noise may provoke a break.


This is because only syscons has full conflation of serial line breaks
with entering the debugger via a breakpoint instruction.  Syscons does:

kdb_break();

for its KDB keys, while vt does:

kdb_enter(KDB_WHY_BREAK, ...)

for its KDB keys.  The latter bypasses KDB's permissions on entering
the debugger with a BREAK.  It is unclear if this is a layering violation
in vt or incorrect use of kdb_break() in syscons.  It is certainly wrong
for vt to use the KDB_WHY_BREAK code if it is avoiding using kdb_break()
to fix the conflation.


Is there a chance to untie
serial and sc console debuggers?


This is easy to do by copying vt's arguable layering violation.  A little
more is necessary to unconflate serial breaks:
- agree that kdb_break() and KDB_WHY_BREAK are only for serial line breaks
- don't use kdb_break() and KDB_WHY_BREAK for console KDB keys of course.
  vt already has a string saying that the entry is a "manual escape to
  debugger".  Here "to debugger" is redundant, "manual escape" means
  "DDB key hit manually by the user" and the driver that saw the key
  is left out.  "vt KDB key" would be a more useful message.  syscons
  used to print a similar message, but it now calls kdb_break() which
  produces the conflated code KDB_WHY_BREAK and the consistently
  conflated message "Break to debugger".  This is also used for serial
  line breaks.

Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 18:13, Bruce Evans wrote:

On Thu, 30 Mar 2017, Andrey Chernov wrote:

...
Finally I have good news and bad news with today's -current:

1) It seems your latest commit r316136 fixes the premature reboot issue.


Now I need to know how that helped.  Did you use a non-default mode?


Perhaps it didn't really help, but just hid the problem by changing
some race timing parameters.
I use 80x30 text mode on all screens.


I think it was the sizing.  The non-updated mode is 80x25, so the row
address can be out of bounds in the teken layer.


2) I still can't enter KDB using Ctrl-Alt-ESC, while booting, after
booting, after login and while shutdown - nothing happens.
boot -d enters KDB normally, but the keyboard sequence handler is
broken, not boot -d.


Try "~b".


What? It just prints \n, new csh prompt and ~b


This takes ALT_BREAK_TO_DEBUGGER.


It is an old bug that Ctrl-Alt-ESC (and Ctrl-PrtScr)


GENERIC is even more broken than I remembered.  It doesn't even have
ALT_BREAK_TO_DEBUGGER.  In old versions, this didn't affect the syscons
key.  The key was controlled by the SC_DISABLE_DDBKEY option so defaulted
to enabled.  There was no tunable or sysctl to change the default.  Serial
consoles had a BREAK_TO_DEBUGGER option to control entering the debugger
on a serial line break.  This was not per-device or even per-driver.
Things were broken by conflating serial line BREAKs with entering the
debugger using a breakpoint instruction.

Now there are many sysctls and tunables, but the basic enable is the
conflated BREAK_TO_DEBUGGER.  This now gives the default setting for
entering kdb using a breakpoint instruction.  Syscons calls the function
kdb_break() which calls kdb_enter() which does the breakpoint instruction.
Arches that don't have such an instruction must have a virtual one.

The default setting can be modified using a tunable or sysctl.  So to
have a chance of the syscons debugger keys working, you first have to
configure this setting, using either:
- BREAK_TO_DEBUGGER in static config file.  This is documented in ddb(4),
  but only for its unbroken meaning for serial consoles
- tunable debug.kdb.break_to_debugger.  This seems to be undocumented
- sysctl debug.kdb.break_to_debugger.  This is documented in ddb(4), but
  only as equivalent to the unbroken BREAK_TO_DEBUGGER.
You have to set the variable using 1 or more of these knobs if you want
the syscons and vt debugger keys to work, but this also enables debugger
entry for serial line breaks and thus breaks the reason for existence of
the unbroken BREAK_TO_DEBUGGER option.  Normally you don't want to enter
the debugger for serial line breaks, since then unplugging the cable or
noise on the cable may enter the debugger, and the option exists to enable
the entry for the rare cases where it is safe.
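Concretely, setting the enable looks like this (knob names as given above; as
noted, any of these also enables debugger entry on serial line breaks):

```shell
# In the static kernel config:
#   options BREAK_TO_DEBUGGER
# In /boot/loader.conf (tunable):
#   debug.kdb.break_to_debugger="1"
# At runtime:
sysctl debug.kdb.break_to_debugger=1
```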

Next there are the sysctl and vt knobs to set, but these have correct
defaults so are enabled automatically.  SC_DISABLE_DDBKEY is now named
SC_DISABLE_KDBKEY.  It always disabled not only the key, but the code
to enable it.  It actually controls 2 keys and 1 sequence of keys.
When it is not configured, the Ctrl-PtrScr and Ctrl-Alt-ESC keys are
enabled by default.  This can be changed by a sysctl but not by a
tunable.  The sysctl is confusingly named with "kbd" (keyboard) in its
name, while the config option has KDB (kernel debugger) in its name.
The variable for this also controls the sequences of keys which are
more than ddb keys and are controlled by the ALT_BREAK_TO_DEBUGGER
option and its knobs.

vt doesn't have a static config knob to enable the enables.  It has
a tunable as well as a sysctl.  This sysctl only controls the keys,
not key sequences.  (There may be more than 2 debugger keys.  keymap
allows any key to be a debugger key.)

syscons and/or vt also have knobs to control halt, poweroff, reboot
and panic, but not suspend.  Many of these are defeated by the
sequences enabled by ALT_BREAK_TO_DEBUGGER.  This is a larger bug
in vt.  In vt, ALT_BREAK_TO_DEBUGGER is limited by the sysctl for
the kdb keys.  If kdb entry is allowed, then there is no point in
disallowing anything since anything can be done using kdb if it has
a backend.

This complexity is not enough to give enough control.  The control
should be per-device.  You might have 1 secure console and 1 insecure
console.  Then enable kdb on at most the secure console.  Or 1 remote
serial console with a good cable and 1 serial console with a bad cable.
Then enable kdb entry for serial line breaks on at most the one with
the good cable.  With per-device control, the 6 knobs for controlling
entry at the kdb level would be sillier, but at least 1 knob is
needed there to prevent all ddb use.


Ctrl-PrtScr does nothing too.


But I think the misconfiguration is the
same for vt.


No, Ctrl-Alt-ESC works for vt at every phase of the system lifecycle.


My point is that it is easy to misconfigure the maze of knobs.  However,
s

Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 12:34, Andrey Chernov wrote:

On 30.03.2017 12:23, Andrey Chernov wrote:

Yes, only for reboot/shutdown. The system does not do anything wrong
even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to
stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.
(it is from 10-stable sample, old -current samples are lost)

Moreover, GELI swap deactivation lines are never printed too (I already
mention that I change swap to normal, but nothing is changed).


I am starting to have a rough guess that in shutdown mode _any_ kernel
printf causes not a printf but a premature reboot.


Finally I have good news and bad news with today's -current:

1) It seems your latest commit r316136 fixes the premature reboot issue.


Now I need to know how that helped.  Did you use a non-default mode?
The change had 2 parts and I should have split it for testing.  It
fixes the window sizing and constructors.


2) I still can't enter KDB using Ctrl-Alt-ESC, while booting, after
booting, after login and while shutdown - nothing happens.
boot -d enters KDB normally, but the keyboard sequence handler is
broken, not boot -d.


Try "~b".  It is an old bug that Ctrl-Alt-ESC (and Ctrl-PrtScr)
are misconfigured by default.  But I think the misconfiguration is the
same for vt.  There are about 3 layers of options that have to be set
to "enable" or not set to "disable" to enable these keys.

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 14:23, Andriy Gapon wrote:

On 30/03/2017 12:34, Andrey Chernov wrote:

On 30.03.2017 12:23, Andrey Chernov wrote:

Yes, only for reboot/shutdown. The system does not do anything wrong
even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to
stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.
(it is from 10-stable sample, old -current samples are lost)

Moreover, GELI swap deactivation lines are never printed too (I already
mention that I change swap to normal, but nothing is changed).


I am starting to have a rough guess that in shutdown mode _any_ kernel
printf causes not a printf but a premature reboot.


This sounds somewhat familiar...
I vaguely recall an opposite issue that happened in the past.  After one of my
changes the reboot started hanging for one user.  Turned out that the actual bug
was always there, but previously the system rebooted because of a printf that
caused a LOR (between spinlocks, AFAIR), witness tried to report it... using
printf, and that recursed and there was a triple fault in the end.

Let me try to dig some details, maybe the current issue is related in some ways.

By chance, do you have WITNESS but not WITNESS_SKIPSPIN in your kernel config?


No, I don't have WITNESS*
I think removing all vt* lines from the kernel config (and leaving sc)
will be enough to reproduce it, but I am not sure.


INVARIANTS with WITNESS is not a bad way to debug problems :-).  I just
remembered to try it with recent changes.  It didn't find any problems
for rebooting.

The problems reported in Andriy's 2012 threads are almost exactly the
ones that I have mostly fixed in syscons -- LORs and deadlocks, and
endless recursion in WITNESS to report the problem.  Syscons now detects
and handles most LORs and deadlocks in itself, but I haven't committed
the fixes for upper layers yet, so syscons mostly doesn't get called.
cnputs() was "fixed" to silently drop the output.

There is still an annoying LOR for devfs vs ufs in reboot.  This is
reported with no problems since it is not related to consoles.

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


We don't understand the bug yet.  It might not even be in sc.  Do you only
see problems for shutdown?  The shutdown environment is special for
locking.


Yes, only for reboot/shutdown. The system does not do anything wrong
even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to
stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.
(it is from 10-stable sample, old -current samples are lost)

Moreover, GELI swap deactivation lines are never printed too (I already
mention that I change swap to normal, but nothing is changed).


A hang in sc means that deadlock occurred and sc's new deadlock detection
didn't work.


Hangs are rare. Most common are premature reboots.


Check that ddb works before shutdown, or just put a lot of printfs in


I can't check with ddb because I can't enter ddb in sc mode; as I already
wrote, nothing happens. Only vt mode allows Ctrl-Alt-ESC, but the bug
does not exist in vt mode, so it is pointless.


That is significant.  My changes were initially all about making ddb work
almost perfectly with sc.

ddb is entered by kdb first calling cngrab(), which does much the same
things as cnputc(), but more to set up for using the keyboard.  If the
sc part of cngrab() detects a problem, it should return and then the
sc part of cnputc() should detect the same problem and do emergency output
which might be just to buffer it.

Nothing at all happening looks like a simpler problem, with Ctrl-Alt-ESC
not being recognized.  There are too many ways to enable/disable this
entry, but I didn't change this.


You might have entered ddb in a context which used to race or deadlock.


No. I tried about 20 times on a machine which does nothing and couldn't
enter KDB in sc-only mode, but got one dead hang instead when I started
to repeat it too fast.


Even earlier than shutdown, and when booting?


I mean in normal operation mode after booting, earlier than shutdown.
Shutdown with a premature reboot is too fast to press anything at the
right time. I haven't tried to enter ddb while booting yet, but will
tell you the results later.


Look early in kern_reboot(), where it does print_uptime() then cngrab().
Console output before this cngrab() should work normally, and I suspect
that something in cngrab() reboots.  But syncing the file systems is
done before this.  I think they are unmounted later, so are fscked but
don't need more than fsck -p if they have been synced.

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 9:51, Andrey Chernov wrote:

On 30.03.2017 8:53, Bruce Evans wrote:


The escape sequences in dmesg are very interesting.  You should debug
those.


I'll send them to you a bit later. Since I don't want vt at all, I don't
want to debug or fix it, let it die.


Here it is:
kernel: allscreens_kbd cursor^[[=0A^[[=7F^[[=0G^[[=0H^[[=7Ividcontrol:
setting cursor type: Inappropriate ioctl for device

It is caused by a vidcontrol call which is left over from the previous sc setup.


This turns out to be uninteresting then.  I think you have to configure
something specially to get console messages in dmesg, but I get them in
console.log, which also requires special configuration (turn this on in
syslog.conf).

In my configuration, vidcontrol only does ioctls in rc.d, so there are
no escape sequences for vidcontrol in console.log, and only 1 error
message (for changing the font to a syscons font).  There should be
more failures, but some ioctls are null instead of working.
"vidcontrol show >/dev/console" works to show the colors and also to
show that escape sequences end up in console.log.

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 29.03.2017 6:29, Bruce Evans wrote:

...
I just found the cause: it is a new syscons bug (bde@ cc'ed). I never
compile the vt driver into the kernel, i.e. I don't have these lines in the
kernel config:

devicevt
devicevt_vga
devicevt_efifb

When I add them, the bug described is gone. It seems syscons goes off
too early, provoking the reboot.


Bah, I only have vt and vt_vga to check that I didn't break them.

Unfortunately, syscons still works right when I remove these lines.


Maybe two would be enough too, I didn't check. I just don't need _any_ of
the vt lines. What matters is that syscons-only mode (without any vt)
was recently broken, causing shutdown problems and file system damage
each time. Syscons-only mode worked for years until it was broken recently.


Actually, I fixed it not so recently (over the last few months), partly
with much older local fixes.


Kernel messages in syscons are now supposed to be colorized by CPU.  The


It looks really crazy on an 8-core CPU and should not be the default. And
I don't see colors in vt mode (which should be parallel at that point, at
least), but what about invisible escapes on vidcontrol errors (e.g.
invalid argument) in vt mode?


It is tuned for an 8-core CPU :-).  16 CPUs don't get unique colors
by default, but could get 16 unique foreground ones and 1 reverse video
(reverse video indeed looks crazier for short messages).  2 CPUs don't
get the best choice of colors by default.  More than 16 CPUs would
need to use lots of reverse video, except in graphics mode I'm
considering expanding to 256 or 64K colors.

vt doesn't support colorized kernel messages since I don't want to touch
it more than necessary.  See subr_terminal.c:termcn_putc().  This is almost
exactly the same as scteken_puts() where the color change and some bugs
were.  It has to switch to the kernel color, and does this by abusing the
user state.  User escape sequences get corrupted by kernel output, and
kernel escape sequences that change the color change the user's color but
not the kernel's, if they are atomic and not part of a user escape sequence.

The escape sequences in dmesg are very interesting.  You should debug those.
They might be caused by misparsing of kernel escape sequences, or more
likely by corruption of user escape sequences.  This might happen when:
- the user prints "foo" and the terminal starts parsing it
- the kernel interrupts this and prints "bar"; "foo" is a supported
  sequence but "bar" isn't
- the error handling is to print the entire escape sequence (that would
  be the interleaved message "bar" up to the point where the error
  is detected).  Kernel console drivers seem to discard the entire mess.
  Userland xterm seems to print the entire message.
Usually there aren't enough kernel messages interleaved with user ones
to make the problem obvious.  My changes should fix the problem for
syscons, not cause it.  But if they are slightly wrong, then they might
cause it.


Moreover, I can't enter KDB via Ctrl-Alt-ESC in syscons-only mode
anymore - nothing happens. In vt mode I can, but can't exit via "c"
properly; all chars typed after "c" produce a beep unless I switch to
another screen and back.


Try backing out r315984 only.  This is supposed to fix parsing of output.


I'll try, thanks. But the most dangerous new syscons bug is the first one,
damaging the file system on each reboot. I tried to enter KDB to debug it,
but seeing that I can't even enter KDB, I understand that all these bugs,
including the nasty one, were introduced by your syscons changes; it was a
hint to add the completely unneeded and unused vt to my kernel config file.


It's normal to have a slightly damaged file system after a panic.

You might have entered ddb in a context which used to race or deadlock.
It might have seemed to work if it only raced.  After the fix, when in
this mode the following happens:
- in graphics mode, no output is done.  The races and deadlocks are not
  all fixed in the keyboard driver, and it might work in this mode.
- in text mode, output is done specially, direct to the frame buffer,
  in a horizontal window 2/3 of the screen size.  This doesn't use a
  full terminal driver so is hard to use at first.  Even the reduced
  window causes problems.  The colorization was originally to make this
  mode more usable.
This mode is rarely active, except for debugging the console driver
itself, or for low-level trap handlers.  Put a breakpoint almost anywhere
in the console driver to see it.  sc_puts() is a good choice.


vt is a real downgrade. Its default console font is plain ugly; it is
impossible to work with it. I can't find a proper TERM for it to make
function keys and pseudographics work in ncurses apps (not with xterm,
a little better with xterm-sco), lynx can't display everything properly,
etc.


I agree, but started testing vt a few years ago, and have workarounds
for some of its deficiencies.  After 

Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-30 Thread Bruce Evans

On Thu, 30 Mar 2017, Andrey Chernov wrote:


On 30.03.2017 8:53, Bruce Evans wrote:

Maybe two will be enough too, I didn't check. I just don't need _any_ of
the vt lines. What matters is that syscons-only mode (without any vt)
was recently broken, causing shutdown problems and file system damage
each time. Syscons-only mode worked for years until you broke it recently.


Actually, I fixed it not so recently (over the last few months), partly
with much older local fixes.


Please commit your fix as soon as possible.


Committing it is what broke things for you.


vt is broken as designed in
many aspects (I even mention not all of them),


It is not that bad.  It is much cleaner, but 10-20 times slower and too
simple to have as many features or preserve old features, and I don't
like rewrites that remove or move features.  vt does well to be as
compatible as it is, so only annoys people who use the more arcane
syscons features (I don't use most of them, but find them in regression
tests).  Syscons looks ugly, but much better when you look at the details.


but on the other hand I
can't allow a dirty filesystem (or hang) on each reboot using sc-only mode
as always. It is dangerous, and fsck takes a long time. Moreover, using sc
while keeping the vt bloat compiled into the kernel just as a bug workaround
is the best demotivator for a perfectionist.


We don't understand the bug yet.  It might not even be in sc.  Do you only
see problems for shutdown?  The shutdown environment is special for locking.
A hang in sc means that deadlock occurred and sc's new deadlock detection
didn't work.  sc is supposed to either drop the output or do it specially
when it detects deadlock.  Deadlocks can also occur in upper layers of
the console driver, but even more rarely.  I haven't committed fixes for
this yet.  cnputs() detects some deadlocks and handles them by dropping
the output.  This loses WITNESS output when you need it for debugging the
deadlock.


The escape sequences in dmesg are very interesting.  You should debug
those.


I'll send them to you a bit later. Since I don't want vt at all, I don't
want to debug or fix it; let it die.


:-)


I'll try, thanks. But the most dangerous new syscons bug is the first one,
damaging the file system on each reboot. I tried to enter KDB to debug it,
but seeing that I can't even enter KDB, I understand that all these bugs,
including the nasty one, were introduced by your syscons changes; it was a
hint to add the completely unneeded and unused vt to my kernel config file.


It's normal to have a slightly damaged file system after a panic.


In sc-only mode I have no kernel panic, i.e. a panic with a trace on the
console or entering KDB. I get a silent reboot in the middle or at the end
of the shutdown sequence, or a rare dead hang on reboot (which is
absolutely not acceptable for a remote machine).


There's not much that sc does which can cause that.  Maybe a wrong
pointer for the frame buffer access in emergency output.  I saw reboots
when I broke this during booting.

Check that ddb works before shutdown, or just put a lot of printfs in
the shutdown sequence to see where it stops working.  I usually sprinkle
ddb breakpoints instead of printf()s.  This requires more console code
to work.  Both should work until the final shutdown message from a working
version.

ddb breakpoints don't work properly under SMP.  If all CPUs hit the
same one, then the first one corrupts the state for the others.  Shutdown
should be mostly on a single CPU or with not all CPUs running the shutdown
code, so most won't hit breakpoints in shutdown code, so it is fairly
safe to put them there.


You might have entered ddb in a context which used to race or deadlock.


No. I tried about 20 times on a machine which does nothing and couldn't
enter KDB in sc-only mode, but got one dead hang instead when I started
to repeat it too fast.


Even earlier than shutdown, and when booting?

Booting with -d gives a simpler environment until sc is completely attached.
Try testing that first.  Also, do tests before mounting file systems so
that nothing needs fsck'ing.


In vt mode I can enter each time, but there are exit
problems I already mention.
I use text mode in sc.


Strings for function keys:
- these are just broken in both sc and vt


I have all function keys working in sc only mode with TERM=cons25 and
similar ones.


Pseudographics:
- I don't use it enough to see problems in it.  Even finding the unicode
  glyph for the block character took me some time.


Even cp437 has them, and the dialog library uses them for all window
frames; e.g. all ports config windows use pseudographics if it is
available and working (replaced by poor-looking ASCII +-| etc. otherwise).


I call these line-drawing characters for cp437, and use them occasionally,
but I don't know the termcap method for using them very well.

Bruce

Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-29 Thread Bruce Evans

On Tue, 28 Mar 2017, Ngie Cooper wrote:


On Mar 28, 2017, at 21:40, Bruce Evans <b...@optusnet.com.au> wrote:


On Wed, 29 Mar 2017, Bruce Evans wrote:


On Wed, 29 Mar 2017, Andrey Chernov wrote:
...
Moreover, I can't enter KDB via Ctrl-Alt-ESC in syscons-only mode
anymore - nothing happens. In vt mode I can, but can't exit via "c"
properly; all chars typed after "c" produce a beep unless I switch to
another screen and back.
All this means that syscons has become very broken by itself and even
damages kernel operations.


I found a bug in screen resizing (the console context doesn't get resized).
This doesn't cause any keyboard problems.


...
But I suspect it is a usb keyboard problem.  Syscons now does almost
correct locking for the screen, but not for the keyboard, and the usb
keyboard is especially fragile, especially in ddb mode.  Console input
is not used in normal operation except for checking for characters on
reboot.

Try using vt with syscons unconfigured.  Syscons shouldn't be used when
vt is selected, but unconfigure it to be sure.  vt has different bugs
using the usb keyboard.  I haven't tested usb keyboards recently.


...
I tested usb keyboards again.  They sometimes work, much the same as
a few months ago after some fixes:
...

The above testing is with a usb keyboard, no ps/2 keyboard, and no kbdmux.
Other combinations and dynamic switching move the bugs around, and a
serial console is needed to recover in cases where the bugs prevent any
keyboard input.


I filed a bug a few years ago about USB keyboards and usability in ddb. If you 
increase the timeout so the USB hubs have enough time to probe/attach, they 
will work.


Is that for user mode or earlier?  ukbd has some other fixes for ddb now, but
of course it can't work before it finds the device.

I recently found that usb boot drives sometimes don't have enough time to
probe/attach before they are used in mountroot, and the mount -a prompt
does locking that doesn't allow them enough time if they are not ready
before it.  The usb maintainers already know about this.


I haven't taken the time to follow up on that and fix the issue, or at least 
propose a bit more functional workaround.


Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-29 Thread Bruce Evans

On Wed, 29 Mar 2017, Bruce Evans wrote:


On Wed, 29 Mar 2017, Andrey Chernov wrote:
...

Moreover, I can't enter KDB via Ctrl-Alt-ESC in syscons-only mode
anymore - nothing happens. In vt mode I can, but can't exit via "c"
properly; all chars typed after "c" produce a beep unless I switch to
another screen and back.

All this means that syscons has become very broken by itself and even
damages kernel operations.


...
But I suspect it is a usb keyboard problem.  Syscons now does almost
correct locking for the screen, but not for the keyboard, and the usb
keyboard is especially fragile, especially in ddb mode.  Console input
is not used in normal operation except for checking for characters on
reboot.

Try using vt with syscons unconfigured.  Syscons shouldn't be used when
vt is selected, but unconfigure it to be sure.  vt has different bugs
using the usb keyboard.  I haven't tested usb keyboards recently.


I tested usb keyboards again.  They sometimes work, much the same as
a few months ago after some fixes:
- after booting with -d, they never work (give no input) at the ddb
  prompt with either sc or vt.  usb is not initialized then, and no usb
  keyboard is attached to sc or vt
- after booting without loader with -a, sc rarely or never works (gives
  no input) at the mountroot prompt
- after booting with loader with -a, vt works at the mountroot prompt.
  I don't normally use loader but need to use it to change the configuration.
  This might be better than before.  There used to be a screen refresh bug.
- after booting with loader with -a, sc works at the mountroot prompt too.
  I previously debugged that vt worked better because it attaches the keyboard
  before this point, while sc attaches it after.  Booting with loader
  apparently fixes the order.
- after any booting, sc works for user input (except sometimes after a
  too-soft hard reset, the keyboard doesn't even work in the BIOS, and it
  takes unplugging the keyboard to fix this)
- after almost any booting, vt doesn't work for user input (gives no input).
  However, if ddb is entered using a serial console, vt does work!  A few
  months ago, normal input was fixed by configuring kbdmux (the default in
  GENERIC).  It is not fixed by unplugging the keyboard.  kbdmux has a known
  bug of not doing nested switching for the keyboard state.  Perhaps this
  "fixes" ddb mode.  But I would have expected it to break ddb mode.
- I didn't test sc after entering ddb, except early when it doesn't work.

The above testing is with a usb keyboard, no ps/2 keyboard, and no kbdmux.
Other combinations and dynamic switching move the bugs around, and a
serial console is needed to recover in cases where the bugs prevent any
keyboard input.

Bruce


Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others

2017-03-29 Thread Bruce Evans

On Wed, 29 Mar 2017, Andrey Chernov wrote:


On 29.03.2017 0:46, Ngie Cooper (yaneurabeya) wrote:



On Mar 28, 2017, at 14:27, Andrey Chernov  wrote:


???


Using rc_debug=yes I see that it is a kernel problem, not an rc problem.
Sometimes the rc backward sequence is executed even fully, sometimes only
partly, but at an unpredictable moment inside the rc sequence the kernel
decides to reboot quickly (or even hang dead in rare cases). Always without
any "Syncing buffers...", leaving the FS dirty. No zfs etc., just normal UFS,
no EFI, no GPT.
I changed GELI swap to a normal one, but it does not help. The same
untouched config worked for years; I see this bug for the first time in
FreeBSD.


I forgot to mention that the typescript and dmesg do not survive this
reboot (or the rare hang).


Good to note.
The simple explanation to the problem might be r307755, depending on when you 
last synced/built ^/head.

I have a few more questions (if reverting that doesn't pan out):


I just found the cause; it is a new syscons bug (bde@ cc'ed). I never
compile the vt driver into the kernel, i.e. I don't have these lines in
the kernel config:

device  vt
device  vt_vga
device  vt_efifb

When I add them, the bug described is gone. It seems syscons goes off
too early, provoking the reboot.


Bah, I only have vt and vt_vga to check that I didn't break them.

Unfortunately, syscons still works right when I remove these lines.


I also find some lines of the kernel messages strangely colored instead of
white in syscons-only mode. Even in vt mode, vidcontrol errors have
invisible escapes prepended (although visible through /var/log/messages).


Kernel messages in syscons are now supposed to be colorized by CPU.  The
boot messages should show all the colors.  Shutdown and ddb are normally
done by a single random CPU, so are shown in a single random color.  The
colors are bright (light) 8-15 foreground, except bright black (8) is not
so bright.  Configure with a non-default KERNEL_SC_CONS_ATTR (maybe
yellow on black instead of lightwhite on black) to turn off the colorization.
I haven't tested this recently.  There is also a sysctl for setting all
the colors.


Moreover, I can't enter KDB via Ctrl-Alt-ESC in syscons-only mode
anymore - nothing happens. In vt mode I can, but can't exit via "c"
properly; all chars typed after "c" produce a beep unless I switch to
another screen and back.

All this means that syscons has become very broken by itself and even
damages kernel operations.


Try backing out r315984 only.  This is supposed to fix parsing of output.
It switches to a state indexed by the CPU for every character, and switches
back.  Screen switching does a different switch and would fix any bug in
switching back.

But I suspect it is a usb keyboard problem.  Syscons now does almost
correct locking for the screen, but not for the keyboard, and the usb
keyboard is especially fragile, especially in ddb mode.  Console input
is not used in normal operation except for checking for characters on
reboot.

Try using vt with syscons unconfigured.  Syscons shouldn't be used when
vt is selected, but unconfigure it to be sure.  vt has different bugs
using the usb keyboard.  I haven't tested usb keyboards recently.

Bruce

Re: HEADS-UP: IFLIB implementations of sys/dev/e1000 em, lem, igb pending

2017-01-24 Thread Bruce Evans

On Tue, 24 Jan 2017, Sean Bruno wrote:


On 01/24/17 08:27, Olivier Cochard-Labbé wrote:

On Tue, Jan 24, 2017 at 3:17 PM, Sean Bruno wrote:

Did you increase the number of rx/tx rings to 8 and the number of
descriptors to 4k in your tests or just the defaults?

Tuning are same as described in my previous email (rxd|txd=2048, rx|tx
process_limit=-1, max_interrupt_rate=16000).
[root@apu2]~# sysctl hw.igb.
hw.igb.tx_process_limit: -1
hw.igb.rx_process_limit: -1
hw.igb.num_queues: 0
hw.igb.header_split: 0
hw.igb.max_interrupt_rate: 16000
hw.igb.enable_msix: 1
hw.igb.enable_aim: 1
hw.igb.txd: 2048
hw.igb.rxd: 2048


Oh, I think you missed my note on these.  In order to adjust txd/rxd you
need to tweak the iflib version of these numbers.  nrxds/ntxds should be
adjusted upwards to your value of 2048.  nrxqs/ntxqs should be adjusted
upwards to 8, I think, so you can test equivalent settings to the legacy
driver.

Specifically, you may want to adjust these:

dev.em.0.iflib.override_nrxds: 0
dev.em.0.iflib.override_ntxds: 0

dev.em.0.iflib.override_nrxqs: 0
dev.em.0.iflib.override_ntxqs: 0


That is painful.

My hack to increase the ifq length also no longer works:

X Index: if_em.c
X ===
X --- if_em.c   (revision 312696)
X +++ if_em.c   (working copy)
X @@ -1,3 +1,5 @@
X +int em_qlenadj = -1;
X +

-1 gives a null adjustment; 0 gives a default (very large ifq), and other
values give a non-null adjustment.

X  /*-
X   * Copyright (c) 2016 Matt Macy 
X   * All rights reserved.
X @@ -2488,7 +2490,10 @@
X 
X  	/* Single Queue */

X  if (adapter->tx_num_queues == 1) {
X -   if_setsendqlen(ifp, scctx->isc_ntxd[0] - 1);
X +   if (em_qlenadj == 0)
X + em_qlenadj = imax(2 * tick, 0) * 15 / 10;
X + // lem_qlenadj = imax(2 * tick, 0) * 42 / 100;
X +   if_setsendqlen(ifp, scctx->isc_ntxd[0] + em_qlenadj);
X if_setsendqready(ifp);
X   }
X

I don't want larger hardware queues, but sometimes want larger software
queues.  ifq's used to give them.  The if_setsendqlen() call is still there,
but no longer gives them.

The large queues are needed for packet-blasting benchmarks since select()
doesn't work for udp sockets, so if the queues fill up then the benchmarks
must busy-wait or sleep waiting for them to drain, and timeout granularity
tends to prevent short sleeps from working, so the queues run dry while
sleeping unless they are very large.

Bruce

Re: SVN r305382 breaks world32 on amd64 (and native 32-bit)

2016-09-04 Thread Bruce Evans

On Sun, 4 Sep 2016, Michael Butler wrote:


Build fails with:

===> lib/msun (obj,all,install)
Building /usr/obj/usr/src/lib/msun/e_fmodf.o
/usr/src/lib/msun/i387/e_fmodf.S:10:17: error: register %rsp is only
available in 64-bit mode
movss %xmm0,-4(%rsp)
   ^~~~
/usr/src/lib/msun/i387/e_fmodf.S:11:17: error: register %rsp is only


Fixed.  I noticed it proof-reading the committed sources instead of
the commit mail, and missed it for a while since I checked amd64 first.

The bug was there for a couple of hours.  At least the build failure
prevented it being run.

Bruce


Re: problems with mouse

2016-08-30 Thread Bruce Evans

On Mon, 29 Aug 2016, Hans Petter Selasky wrote:


On 08/29/16 22:12, Antonio Olivares wrote:


I apologize in advance if this is not in the right list, if I need to
pose this question in questions, I will do so as soon as I find out.
I am having trouble with switching apps in Lumina desktop with the
mouse, I removed moused from /etc/rc.conf because I have a usb mouse
and still lose when I switch from firefox to terminal or vice versa.

$ uname -a
FreeBSD hp 11.0-RC2 FreeBSD 11.0-RC2 #0 r304729: Wed Aug 24 06:59:03
UTC 2016 r...@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC
 amd64

Is there a way to troubleshoot this?  Is there something that can fix this?


Bruce Evans has fixed some issues with SC/VT mouse/keyboard stuff in 
12-current. Maybe he has some ideas.


I only know about sc/atkbd and am trying not to break ukbd.

The cause of Bug 211884 (ukbd?) is still unknown.  Bugzilla is too hard
to access for me, but the PR seems to be missing critical info about the
environment (is the console vt or sc?).

kbdmux is still missing the fix that is blamed for causing Bug 211884.
I need to fix kbdmux before changing sc to depend on it being fixed.
vt already depends on it being fixed.  However, vt also depends on
going through kbdmux.  ukbd doesn't attach properly directly for vt.

ukbd passed tests of working in panic mode yesterday.  It actually
works perfectly in panic + ddb (polled) mode.  Much better than in
just ddb mode.  Panic mode turns off its locking and thus gives
races instead of deadlocks and assertion failures, and the races
aren't very harmful in panic mode.  So the basic polling method in
ukbd is working except when it tries to do correct locking.

Bruce


Re: Timing issue with Dummynet on high kernel timer interrupt

2015-11-06 Thread Bruce Evans

On Fri, 6 Nov 2015, Ian Lepore wrote:


On Fri, 2015-11-06 at 17:51 +0100, Hans Petter Selasky wrote:

On 11/06/15 17:43, Ian Lepore wrote:

On Fri, 2015-11-06 at 17:28 +0100, Hans Petter Selasky wrote:

Hi,




Do the test II results change with this setting?

   sysctl kern.timecounter.alloweddeviation=0


Yes, it looks much better:

debug.total: 10013 -> 0
debug.total: 10013 -> 0
...

This isn't the first time that the alloweddeviation feature has led
people (including me in the past) to think there is a timing bug.  I
think the main purpose of the feature is to help save battery power on
laptops by clustering nearby scheduled wakeups to all happen at the
same time and then allow for longer sleeps between each wakeup.


I was trying to remember the flag for turning off that "feature".  It
gives the bizarre behaviour that on an old system with a timer resolution
of 10 msec, "time sleep 1" sleeps for 1 second with an average error of
< 10 msec, but with a timer resolution of 1 msec for hardclock and finer
for short timeouts, "time sleep 1" sleeps for an average of an extra 30
msec (worst case 1.069 seconds IIRC).  Thus high resolution timers give
much lower resolution for medium-sized timeouts.  (For "sleep 10", the
average error is again 30 msec but this is relatively smaller, and for
"sleep .001" the average error must be less than 1 msec to work at all,
though it is likely to be relatively large.)


I've been wondering lately whether this might also be behind the
unexplained "load average is always 0.60" problem people have noticed
on some systems.  If load average is calculated by sampling what work
is happening when a timer interrupt fires, and the system is working
hard to ensure that a timer interrupt only happens when there is actual
work to do, you'd end up with statistics reporting that there is work
being done most of the time when it took a sample.


I use HZ = 100 and haven't seen this.  Strangely, HZ = 100 gives the same
69 msec max error for "sleep 1" as HZ = 1000.

Schedulers should mostly use the actual thread runtimes to avoid
sampling biases.  That might even be faster.  But it doesn't work so
well for the load average, or at all for resource usages that are
averages, or for the usr/sys/intr splitting of the runtime.  It is
good enough for scheduling since the splitting is not needed for
scheduling.

Bruce


Re: Hello fdclose

2014-03-18 Thread Bruce Evans

On Tue, 18 Mar 2014, John Baldwin wrote:


On Monday, March 17, 2014 7:23:19 pm Mariusz Zaborski wrote:
...
I think the code is fine.  I have a few suggestions on the manpage wording:

.Sh RETURN VALUES
-Upon successful completion 0 is returned.
+The
+.Fn fcloseall
+function return no value.
+.Pp
+Upon successful completion
+.Fn fclose
+return 0.
+Otherwise,
+.Dv EOF
+is returned and the global variable
+.Va errno
+is set to indicate the error.


The .Rv macro should be used whenever possible.  Unfortunately, it doesn't
support the EOF return, but only -1, so stdio man pages can rarely use it,
and this one is no exception.  Using it gives standard wording that is
quite different from the above:

standard wording:
 The close() function returns the value 0 if successful; otherwise the
 value -1 is returned and the global variable errno is set to indicate the
 error.
above wording (previous):
 Upon successful completion 0 is returned.  Otherwise,
 EOF is returned  and the global variable errno is set to indicate the
 error.
above wording (new):
 Upon successful completion fclose() return [sic] 0.  Otherwise,
 EOF is returned  and the global variable errno is set to indicate the
 error.

These are excessively formal in different ways:
- I don't like the foo() function.  Why not just foo()?  The
  standard wording uses this, and so does the new wording, but the
  previous wording omits the function name (that only works for man pages
  that only have a single function, as they should).
- I don't like the value N.  Why not just N?  The standard wording
  uses this, but the previous and new wordings don't.
- returns N is better than N is returned.  Some man pages use worse
  wordings like N will be returned.
- the global variable errno is excessively detailed/verbose, without
  the details even being correct.  Why not just errno, with this
  identifier documented elsewhere?  errno isn't a global variable in
  most implementations.  It can be, and usually is, a macro that
  expands to a modifiable lvalue of type int.  In FreeBSD, the macro
  expands to a function that returns a pointer to int.
- Upon successful completion is correct but verbose.  The standard
  wording doesn't even use it.
- the standard wording uses a conjunction instead of a new sentence
  before otherwise (this is better).  It is missing a comma after
  otherwise (this is worse).


+.Pp
+The
+.Fn fdclose
+function return the file descriptor if successfull.
Otherwise,
.Dv EOF


successfull is consistently misspelled.


One of English's arcane rules is that most verbs append an 's' when used with
singular subjects, so function returns should be used instead of function
return, etc.  I do think for this section it would be good to combine the
descriptions of fclose() and fdclose() when possible, so perhaps something
like:

 The fcloseall() function returns no value.

  Upon successful completion, fclose() returns 0 and fdclose() returns the
  file descriptor of the underlying file.  Otherwise, EOF is returned and
  the global variable errno is set to indicate the error.  In either case
  no further access to the stream is possible.


OK.  You kept return[s] N and deverbosified the foo() function.
Upon successful completion is needed more with several functions.
the global variable errno remains consistently bad.

There should be a comma after In either case.


This allows in either case to still read correctly and makes it clear it
applies to both fclose() and fdclose().


Better: In every case.



.Sh ERRORS
+.Bl -tag -width Er
+.It Bq Er EOPNOTSUPP
The
+.Fa _close
+method in
+.Fa stream
+argument to
+.Fn fdclose ,
+was not default.
+.It Bq Er EBADF


The ERRORS section should be sorted.


For the errors section, the first error list needs some sort of introductory
text.  Also, this shouldn't claim that fdclose() can return an errno value for
close(2).

ERRORS

  The fdclose() function may will fail if:


I don't like the tense given by will in man pages.  POSIX says shall
fail in similar contexts, and will fail is a mistranslation of this
(shall is a technical term that doesn't suggest future tense).

deshallify.sh does the not-incorrect translation s/shall fail/fails/
(I think this is too simple to always work).  It doesn't translate
anything to will.

I can't parse may will :-).  deshallify.sh doesn't translate may
or should to anything (these are also technical terms in some
contexts, so they might need translation.  IIRC, may is optional
behaviour, mostly for the implementation, while shall is required
behaviour, only for the implementation, but should is recommended
practice, mostly for applications).  Man pages are very unlikely to
be as consistent as POSIX with these terms.


  [EOPNOTSUPP]   The stream to close uses a non-default close method.

  [EBADF]The stream is not backed by a valid file descriptor.

  The fclose() and fdclose() functions may also fail and 

Re: signal 8 (floating point exception) upon resume

2014-03-11 Thread Bruce Evans

On Mon, 10 Mar 2014, John Baldwin wrote:


On Tuesday, March 04, 2014 4:50:01 pm Bruce Evans wrote:

On Tue, 4 Mar 2014, John Baldwin wrote:
% Index: i386/i386/swtch.s
% ===
% --- i386/i386/swtch.s (revision 262711)
% +++ i386/i386/swtch.s (working copy)

[...savectx()]

This function is mostly bogus (see old mails).


I was going off of the commit logs for amd64 that removed this code as savectx()
is not used for fork(), only for IPI_STOP and suspend/resume.


Without fxsave, npxsuspend() cannot be atomic without locking, since
fnsave destroys the state in the FPU and you either need a lock to
reload the old state atomically enough, or a lock to modify FPCURTHREAD
atomically enough.


save_ctx() is now only called from IPI handlers or when doing suspend in
which case we shouldn't have to worry about being preempted.


I don't understand the suspend part.  Is sufficient locking held throughout
suspend/resume to prevent states changing after they have been saved here?


% @@ -520,7 +490,16 @@
%   movl%eax,%dr7
%
%  #ifdef DEV_NPX
% - /* XXX FIX ME */
% + /* Restore FPU state */

Is the problem just this missing functionality?


Possibly.


I now think it was just the clobbering of %cr0 so i386 never had the
problem.


I think on amd64 there was also the desire to have the pcb
state be meaningful in dumps (since we IPI_STOP before a dump).  OTOH,


It should also be meaningful in debuggers.  Hopefully stop IPIs put
it there from all stopped CPUs.  I think it remains in the FPU for
the running CPU.


the current approach used by amd64 (and this patch for i386) is to not
dirty fpcurthread's state during save_ctx(), but to instead leave
fpcurthread alone and explicitly save whatever state the FPU is in
in the PCB used for IPI_STOP or suspend.


Hmm, if kernel debuggers actually supported displaying the FPU state, then
they would prefer to find it in the PCB only (after debugger entry puts
it there), but this doesn't work in places like the dna trap handler.
Similarly for IPIs and suspend.  The dna trap handler would be broken
unless any saving in the PCB is undone when normal operation is resumed,
and it seems more difficult to undo it than to save specially so as not
to have anything to undo.  It is OK to save in the usual place in the PCB
so that debuggers can find it more easily (since that place is not used
in normal operation), but not to change the state in the CPU+FPU across
the operation.  Harmful state changes in the CPU+FPU include toggling
CR0_TS and implicit fninit.  For suspend/resume, we have no option but
to undo everything, since other things may clobber the state.




% @@ -761,7 +761,34 @@
%   PCPU_SET(fpcurthread, NULL);
%  }
%
% +/*
% + * Unconditionally save the current co-processor state across suspend and
% + * resume.
% + */
%  void
% +npxsuspend(union savefpu *addr)
% +{
% + register_t cr0;
% +
% + if (!hw_float)
% + return;
% + cr0 = rcr0();
% + clts();
% + fpusave(addr);
% + load_cr0(cr0);
% +}

In the !fxsave case, this destroys the state in the npx, leaving
fpcurthread invalid.  It also does the save when the state in the
npx is inactive.  I think jkim intentionally saved this state so that
resume can load it unconditionally.  It must be arranged that there
are no interactions with fpcurthread.


Given the single-threaded nature of suspend/resume and IPI_STOP /
restart_cpus(), those requirements are met, so it should be safe
to resume whatever state was in the FPU and leave fpcurthread
unchanged.


Is the whole suspend/resume really locked?


This doesn't work so well
without fxsave.  When fpcurthread != NULL, reloading CR0 keeps
CR0_TS and thus ensures that inconsistent state lives for longer.
Things will only be OK if fpcurthread isn't changed until resume.


After the save_ctx() the CPU is going to either resume without
doing a resume_ctx (IPI_STOP case) leaving fpcurthread unchanged
(so save_ctx() just grabbed a snapshot of the FPU state for
debugging purposes) or the CPU is going to power off for suspend.


If it doesn't restore for IPI_STOP, then it will continue with the
state clobbered by fnsave in the !fxsr case.  That is rare but can
happen.  Most CPUs that have IPIs also have fxsr.  But on at least
i386, there is an option to disable fxsr.


During resume it will invoke resume_ctx() which will restore the
FPU state (whatever state it was in) and fpcurthread and only
after those are true is the CPU able to run other threads which
will modify or use the FPU state.


You can probably fix this by using the old code here.  The old code
doesn't need the hw_float test, since fpcurthread != NULL implies
hw_float != 0.
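Restated as code, that suggestion amounts to something like the following
non-compilable kernel-context sketch (npxsave(), the pcb layout and the
union name are taken loosely from the surrounding discussion, not from the
actual patch):

```
/*
 * Sketch only: save the npx state across suspend without invalidating
 * fpcurthread in the !fxsr case.  fnsave destroys the state it saves
 * (implicit fninit), so save via npxsave() into the owner's pcb --
 * which also clears fpcurthread -- and then copy the result.
 */
void
npxsuspend(union savefpu *addr)
{
	struct thread *td;

	critical_enter();	/* what the old CLI/STI locking became */
	td = PCPU_GET(fpcurthread);
	if (td != NULL) {
		npxsave(td->td_pcb->pcb_save);
		bcopy(td->td_pcb->pcb_save, addr, sizeof(*addr));
	}
	/* else: the npx h/w state is irrelevant; it is already in a pcb. */
	critical_exit();
}
```

No hw_float test is needed here, since fpcurthread != NULL already implies
hw_float != 0.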

Actually, I don't see any need to change anything on i386 -- after
storing the state for the thread, there should be no need to store it
anywhere else across suspend/resume.  We intentionally use this method
(even on amd64 IIRC), although

Re: signal 8 (floating point exception) upon resume

2014-03-04 Thread Bruce Evans

On Tue, 4 Mar 2014, John Baldwin wrote:


On Monday, March 03, 2014 6:49:08 pm Adrian Chadd wrote:

I'll try this soon.

I had it fail back to newcons, rather than Xorg normally dying without
restoring state. It wouldn't let me spawn a shell. Logging in worked
fine, but normal shell exec would eventually and quickly lead to
failure, dropping me back to the login prompt.


If you have set CPUTYPE in /etc/src.conf such that your userland binaries
are built with SSE, etc. then I expect most things to break because the FPU
is in a funky state without this patch.  I suspect if you don't set CPUTYPE
so that your userland binaries do not use the FPU, you can probably resume
just fine without this fix.


Non-SSE FPU state might be broken too.


Complete stab in the dark (not compile tested) here:

http://www.FreeBSD.org/~jhb/patches/i386_fpu_suspend.patch


I forget many details of how this works, but noticed that it seems
to break consistency of the state for the !fxsave case and related
locking.

% Index: i386/i386/swtch.s
% ===
% --- i386/i386/swtch.s (revision 262711)
% +++ i386/i386/swtch.s (working copy)
% @@ -417,42 +417,9 @@
%   str PCB_TR(%ecx)
% 
%  #ifdef DEV_NPX

% - /*
% -  * If fpcurthread == NULL, then the npx h/w state is irrelevant and the
% -  * state had better already be in the pcb.  This is true for forks
% -  * but not for dumps (the old book-keeping with FP flags in the pcb
% -  * always lost for dumps because the dump pcb has 0 flags).
% -  *
% -  * If fpcurthread != NULL, then we have to save the npx h/w state to
% -  * fpcurthread's pcb and copy it to the requested pcb, or save to the
% -  * requested pcb and reload.  Copying is easier because we would
% -  * have to handle h/w bugs for reloading.  We used to lose the
% -  * parent's npx state for forks by forgetting to reload.
% -  */

This function is mostly bogus (see old mails).

% - pushfl
% - CLI
% - movlPCPU(FPCURTHREAD),%eax
% - testl   %eax,%eax
% - je  1f

This CLI/STI locking is bogus.  Accesses to FPCURTHREAD are now locked
by critical_enter(), as on amd64, and perhaps a higher level already
did critical_enter() or even CLI.

(CLI/STI in swtch.s seems to be bogus too.  amd64 doesn't do it, and
I think a higher level does mtx_lock_spin() which does too much, including
CLI via spinlock_enter().)

% -
% - pushl   %ecx
% - movlTD_PCB(%eax),%eax
% - movlPCB_SAVEFPU(%eax),%eax
% - pushl   %eax
% - pushl   %eax
% - callnpxsave
% + pushl   PCB_FPUSUSPEND(%ecx)
% + callnpxsuspend

Without fxsave, npxsuspend() cannot be atomic without locking, since
fnsave destroys the state in the FPU and you either need a lock to
reload the old state atomically enough, or a lock to modify FPCURTHREAD
atomically enough.  Reloading the old state is problematic because
the reload might trap.  So the old version uses the second method.
It calls npxsave() to handle most of the details.  But npxsave() was
designed to be efficient for its usual use in cpu_switch(), so it doesn't
handle the detail of checking FPCURTHREAD or the locking needed for this
check, so the above code had to handle these details.

%   addl$4,%esp
% - popl%eax
% - popl%ecx
% -
% - pushl   $PCB_SAVEFPU_SIZE
% - lealPCB_USERFPU(%ecx),%ecx
% - pushl   %ecx
% - pushl   %eax
% - callbcopy
% - addl$12,%esp
% -1:
% - popfl
%  #endif   /* DEV_NPX */

This probably should never have been written in asm.  Only the similar
code in cpu_switch() is time-critical.

% 
%  	movl	$1,%eax

% ...
% @@ -520,7 +490,16 @@
%   movl%eax,%dr7
% 
%  #ifdef DEV_NPX

% - /* XXX FIX ME */
% + /* Restore FPU state */

Is the problem just this missing functionality?

% ...
% Index: i386/isa/npx.c
% ===
% --- i386/isa/npx.c(revision 262711)
% +++ i386/isa/npx.c(working copy)

This has many vestiges of support for interrupt handling (mainly in
comments and in complications in the probe).  CLI/STI was used for
locking partly to reduce complications for the IRQ13 case.  The
comment before npxsave() still says that it needs CLI/STI locking
by callers, but it actually needs critical_enter() locking and
most callers only provided that.

% @@ -761,7 +761,34 @@
%   PCPU_SET(fpcurthread, NULL);
%  }
% 
% +/*

% + * Unconditionally save the current co-processor state across suspend and
% + * resume.
% + */
%  void
% +npxsuspend(union savefpu *addr)
% +{
% + register_t cr0;
% +
% + if (!hw_float)
% + return;
% + cr0 = rcr0();
% + clts();
% + fpusave(addr);
% + load_cr0(cr0);
% +}

In the !fxsave case, this destroys the state in the npx, leaving
fpcurthread invalid.  It also does the save when the state in the
npx is inactive.  I think jkim 

Re: WEAK_REFERENCE?

2013-11-10 Thread Bruce Evans

On Sat, 9 Nov 2013, Andreas Tobler wrote:


anyone interested in this patch to remove the WEAK_ALIAS and introduce
the WEAK_REFERENCE?

http://people.freebsd.org/~andreast/weak_ref.amd64.diff

I have had this running for months on amd64 and have had no issues with it.

I remember a communication with bde@ in which he was in favour of
doing that, but I lacked the time to complete it.
A similar thing is pending for i386 and sparc64.  The ppc stuff was
committed quite a while ago.

If no one is interested, I'm happy to clean up my tree and skip this.


I have only minor interest in it.

I might have looked at it before.  This version formats the backslashes
in macro definitions very badly by putting them in random columns between
about 96 and 120 instead of in column 72.

Bruce


Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity

2013-07-11 Thread Bruce Evans

On Thu, 11 Jul 2013, David Chisnall wrote:


You're joining in this discussion starting in the middle, so you probably 
missed the earlier explanation.


I was mainly addressing a C99 point.  I know little about C++ or C11.


On 11 Jul 2013, at 05:21, Bruce Evans b...@optusnet.com.au wrote:


I don't see how any conforming program can access the isnan() function
directly.  It is just as protected as __isnan() would be.  (isnan)()
gives the function (the function prototype uses this), but conforming
programs can't do that since the function might not exist.  Maybe some
non-conforming program like autoconfig reads math.h or libm.a and
creates a bug for C++.


The cmath header defines a template function isnan that invokes the isnan 
macro, but then undefines the isnan macro.  This causes a problem because when 
someone does something along the lines of using namespace std then they end up 
with two functions called isnan and the compiler gets to pick the one to use.  
Unfortunately, std::isnan() returns a bool, whereas isnan() returns an int.

The C++ headers are not required to be conforming C code, because they are not C, and 
our math.h causes namespace pollution in C++ when included from cmath.


math.h is also not required to be conforming C code, let alone C++ code,
so there is only a practical requirement that it works when included
in the C++ implementation.


The FreeBSD isnan() implementation would be broken by removing the
isnan() function from libm.a or ifdefing it in math.h.  Changing the
function to __isnan() would cause compatibility problems.  The function
is intentionally named isnan() to reduce compatibility problems.


On OS X this is avoided because their isnan() macro expands to call one of the 
__-prefixed inline functions (which adopt your suggestion of being implemented 
as x != x, for all types).  I am not sure that this is required for standards 
conformance, but it is certainly cleaner.  Your statement that having the 
function not called isnan() causes compatibility problems is demonstrably 
false, as neither OS X nor glibc has a function called isnan() and, unlike us, 
they do not experience problems with this macro.


The compatibility that I'm talking about is with old versions of FreeBSD.
isnan() is still in libc as a function since that was part of the FreeBSD
ABI and too many things depended on getting it from there.  It was recently
removed from libc.so, but is still in libm.a.  This causes some
implementation problems in libm that are still not completely solved.  I
keep having to edit msun/src/s_isnan.c so that the msun sources stay more portable.
Mostly I need to kill the isnan() there so that it doesn't get in the
way of the one in libc.  This mostly works even if there is none in libc,
since the builtins result in neither being used.  isnanf() is more of a
problem, since it is mapped to __isnanf() and there is no builtin for
__isnanf().  The old functions have actually been removed from libc.a
too.  They are only in libc_pic.a.  libc.a still has isnan.o, but that is bogus
since isnan.o is now empty.


It would also be nice to implement these macros using _Generic when compiling 
in C11 mode, as it will allow the compiler to produce more helpful warning 
messages.  I would propose this implementation:



#if __has_builtin(__builtin_isnan)


This won't work for me, since I develop and test msun with old compilers
that don't support __has_builtin().  Much the same set of compilers also
don't have enough FP builtins.

It also doesn't even work.  clang has squillions of builtins that
aren't really builtins, so they reduce to libcalls.  gcc has fewer
builtins, but still many that reduce to libcalls.  An example is fma().
__has_builtin(__builtin_fma) is true for clang on amd64 (freefall),
but at least freefalls's CPU doesn't support fma in hardware, so the
builtin can't really work, and in fact it doesn't -- it reduces to a
libcall.  This might change if the hardware supports fma, but then
__has_builtin(__builtin_fma) would be even more useless for telling
if fma is worth using.  C99 has macros FP_FAST_FMA[FL] whose
implementation makes them almost equally useless.  For example, ia64
has fma in hardware and the implementation defines all of
FP_FAST_FMA[FL] for ia64.  But fma is implemented as an extern function,
partly because there is no way to tell if __builtin_fma is any good
(but IIRC, __builtin_fma is no good on ia64 either, since it reduces
to the same extern function).  The extern function is slow (something
like 20 cycles instead of 1 for the fma operation).  But if you ignore
the existence of the C99 fma API and just write expressions of the
form (a*x + b), then gcc on ia64 will automatically use the hardware
fma, although this is technically wrong in some fenv environments.

For gcc-4.2.1, __has_builtin(__builtin_fma) is a syntax error.  I
test with gcc-3.x.  It is also missing __builtin_isnan().

The msun implementation knows that isnan() and other classification
macros

Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity

2013-07-11 Thread Bruce Evans

On Thu, 11 Jul 2013, Tijl Coosemans wrote:


On 2013-07-11 06:21, Bruce Evans wrote:

On Wed, 10 Jul 2013, Garrett Wollman wrote:

On Wed, 10 Jul 2013 22:12:59 +0200, Tijl Coosemans t...@freebsd.org said:

I think isnan(double) and isinf(double) in math.h should only be
visible if (_BSD_VISIBLE || _XSI_VISIBLE) && __ISO_C_VISIBLE >= 1999.
For C99 and higher there should only be the isnan/isinf macros.


I believe you are correct.  POSIX.1-2008 (which is aligned with C99)
consistently calls isnan() a macro, and gives a pseudo-prototype of

int isnan(real-floating x);


Almost any macro may be implemented as a function, if no conforming
program can tell the difference.  It is impossible for technical reasons
to implement isnan() as a macro (except on weird implementations where
all real-floating types are physically the same).  In the FreeBSD
implementation, isnan() is a macro, but it is also a function, and
the macro expands to the function in double precision:

% #define isnan(x)\
% ((sizeof (x) == sizeof (float)) ? __isnanf(x)\
% : (sizeof (x) == sizeof (double)) ? isnan(x)\
% : __isnanl(x))


The C99 standard says isnan is a macro. I would say that only means
defined(isnan) is true. Whether that macro then expands to function
calls or not is not important.


I think it means only that defined(isnan) is true.  isnan() can still be
a function (declared or just in the compile-time namespace somewhere,
or in a library object).  It is reserved in the compile-time namespace,
and the standard doesn't cover library objects, so conforming applications
can't reference either except via the isnan() macro (if that has its
strange historical implementation).


I don't see how any conforming program can access the isnan() function
directly.  It is just as protected as __isnan() would be.  (isnan)()
gives the function (the function prototype uses this), but conforming
programs can't do that since the function might not exist.


I don't think the standard allows a function to be declared with the same
name as a standard macro (it does allow the reverse: define a macro with
the same name as a standard function). I believe the following code is
C99 conforming but it currently does not compile with our math.h:

--
#include <math.h>

int (isnan)(int a, int b, int c) {
   return (a + b + c);
}
--


I think isnan is just reserved, so you can't redefine it in any way.  I
think the reverse is even less allowed.  Almost any standard function may
be implemented as a macro, and then any macro definition of it would
conflict with the previous macro even more than with a previous prototype.
E.g.:

/* Header. */
void exit(int);
#define exit(x) __exit(x)

/* Application. */
#undef exit /* non-conforming */
#define exit(x) my_exit(x)  /* conflicts without the #undef */

Now suppose the header doesn't define exit().

#define exit(x) my_exit(x)

This hides the protoype but doesn't automatically cause problems, especially
if exit() is not used after this point.  But this is still non-conforming,
since exit() is reserved.

Here are some relevant parts of C99 (n869.txt):

%%%
 -- Each  identifier  with  file scope listed in any of the
following  subclauses  (including  the  future  library
directions)  is  reserved  for  use  as macro and as an
identifier with file scope in the same  name  space  if
any of its associated headers is included.

   [#2]  No  other  identifiers  are  reserved.  If the program
   declares or defines an identifier in a context in  which  it
   is  reserved  (other than as allowed by 7.1.4), or defines a
   reserved  identifier  as  a  macro  name,  the  behavior  is
   undefined.

   [#3]   If  the  program  removes  (with  #undef)  any  macro
   definition of an identifier in the first group listed above,
   the behavior is undefined.
%%%

Without any include of a header that is specified to declare exit(),
file scope things are permitted for it, including defining it and
making it a static function, but not making it an extern function.

isnan is reserved for use as a macro and as an identifier with file
scope by the first clause above.  Thus (isnan) cannot even be defined
as a static function.  But (isnan) is not reserved in inner scopes.
I thought that declarations like int (isnan); are impossible since
they look like syntax errors, but this syntax seems to be allowed and
actually works with gcc-3.3.3 and TenDRA-5.0.0.  So you can have
variables with silly names like (isnan) and (getchar) :-).  However,
(NULL) for a variable name doesn't work, and (isnan) is a syntax error
for struct member names.  The compilers may be correct in allowing
(isnan) but not (NULL) for variables.  isnan happens to be function-like,
so the parentheses are special for (isnan), but the parentheses are not
special for (NULL

Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity

2013-07-11 Thread Bruce Evans

On Thu, 11 Jul 2013, David Chisnall wrote:


On 11 Jul 2013, at 13:11, Bruce Evans b...@optusnet.com.au wrote:


math.h is also not required to be conforming C code, let alone C++ code,
so there is only a practical requirement that it works when included
in the C++ implementation.


Working with the C++ implementation is the problem that we are trying to solve.


The compatibility that I'm talking about is with old versions of FreeBSD.
isnan() is still in libc as a function since that was part of the FreeBSD
ABI and too many things depended on getting it from there.  It was recently
...


I don't see a problem with changing the name of the function in the header and 
leaving the old symbol in libm for legacy code.


I don't even see why old code needs the symbol.  Old code should link to
old compat libraries that still have it.




It would also be nice to implement these macros using _Generic when compiling 
in C11 mode, as it will allow the compiler to produce more helpful warning 
messages.  I would propose this implementation:



#if __has_builtin(__builtin_isnan)


This won't work for me, since I develop and test msun with old compilers
that don't support __has_builtin().  Much the same set of compilers also
don't have enough FP builtins.


Please look in cdefs.h, which defines __has_builtin(x) to 0 if the compiler
does not support it.  It is therefore safe to use __has_builtin() in any 
FreeBSD header.


The old compilers run on old systems that don't have that in cdefs.h
(though I sometimes edit it to add compatibility cruft like that).  msun
sources are otherwise portable to these systems.  Well, not quite.  They
are not fully modular and also depend on stuff in libc/include and
libc/${ARCH}.  I have to update or edit headers there.

This hack also doesn't work with gcc in -current.  gcc has __builtin_isnan
but not __has_builtin(), so __has_builtin(__builtin_isnan) gives the wrong
result 0.


It also doesn't even work.  clang has squillions of builtins that
aren't really builtins, so they reduce to libcalls.


Which, again, is not a problem for code outside of libm.  If libm needs 
different definitions of these macros then that's fine, but they should be 
private to libm, not installed as public headers.


Yes it is.  It means that nothing should use isnan() or FP_FAST_FMA* outside
of libm either, since isnan() is too slow and FP_FAST_FMA* can't be trusted.
Even the implementation can't reliably tell if __builtin_isnan is usable
or better than alternatives.


The msun implementation knows that isnan() and other classification
macros are too slow to actually use, and rarely uses them.


Which makes any concerns that only apply to msun internals irrelevant from the 
perspective of discussing what goes into this header.


No, the efficiency of isnan() is more important for externals, because the
internals already have work-arounds.


#define isnan(x) __builtin_isnan(x)
#else
static __inline int
__isnanf(float __x)
{
  return (__x != __x);
}


Here we can do better in most cases by hard-coding this without the ifdef.


They will generate the same code.  Clang expands the builtin in the LLVM IR to 
a fcmp uno, so will generate the correct code even when doing fast math 
optimisations.


On some arches the same, and not affected by -ffast-math.  But this
is not necessarily the fastest code, so it is a performance bug if clang
always generates the same code for the builtin.  Bit tests are faster in
some cases, and may be required to prevent exceptions for signaling NaNs.
-ffast-math could reasonably optimize x != x to false.  It already assumes
that things like overflow and NaN results can't happen, so why not optimize
further by assuming that NaN inputs can't happen?


Generic stuff doesn't seem to work right for either isnan() or
__builtin_isnan(), though it could for at least the latter.  According
to a quick grep of strings $(which clang), __builtin_classify() is
generic but __builtin_isnan*() isn't (the former has no type suffixes
but the latter does, and testing shows that the latter doesn't work
without the suffixes).


I'm not sure what you were testing:


Mostly isnan() without including math.h, and gcc.  I was confused by
gcc converting floats to doubles.


$ cat isnan2.c

int test(float f, double d, long double l)
{
   return __builtin_isnan(f) |
   __builtin_isnan(d) |
   __builtin_isnan(l);
}
$ clang isnan2.c -S -emit-llvm -o - -O1
...
 %cmp = fcmp uno float %f, 0.00e+00
 %cmp1 = fcmp uno double %d, 0.00e+00
 %or4 = or i1 %cmp, %cmp1
 %cmp2 = fcmp uno x86_fp80 %l, 0xK
...

As you can see, it parses them as generics and generates different IR for each. 
 I don't believe that there's a way that these would be translated back into 
libcalls in the back end.


Yes, most cases work right.  gcc converts f to double and compares the
result, but that mostly works.  It would be just a pessimization except
the conversion gives

Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity

2013-07-11 Thread Bruce Evans

On Thu, 11 Jul 2013, David Chisnall wrote:


On 11 Jul 2013, at 13:11, Bruce Evans b...@optusnet.com.au wrote:


The error message for the __builtin_isnan() version is slightly better up
to where it says more.

The less-unportable macro can do more classification and detect problems
at compile time using __typeof().


The attached patch fixes the related test cases in the libc++ test suite.  
Please review.


OK if the ifdefs work and the style bugs are fixed.


This does not use __builtin_isnan(), but it does:

- Stop exposing isnan and isinf in the header.  We already have __isinf in 
libc, so this is used instead.

- Call the static functions for isnan __inline_isnan*() so that they don't 
conflict with the ones in libm.

- Add an __fp_type_select() macro that uses either _Generic(), 
__builtin_types_compatible_p() / __builtin_choose_expr(), or sizeof() 
comparisons, depending on what the compiler supports.

- Refactor all of the type-generic macros to use __fp_type_select().


% Index: src/math.h
% ===
% --- src/math.h(revision 253148)
% +++ src/math.h(working copy)
% @@ -80,28 +80,39 @@
%  #define  FP_NORMAL   0x04
%  #define  FP_SUBNORMAL0x08
%  #define  FP_ZERO 0x10
% +
% +#if __STDC_VERSION__ >= 201112L
% +#define __fp_type_select(x, f, d, ld) _Generic((x), \
% + float: f(x),\
% + double: d(x),   \
% + long double: ld(x))

The normal formatting of this is unclear.  Except for the tab after #define.
math.h has only 1 other instance of a space after #define.

% +#elif __GNUC_PREREQ__(5, 1)
% +#define __fp_type_select(x, f, d, ld) __builtin_choose_expr(\
% + __builtin_types_compatible_p(__typeof (x), long double), ld(x),\
% +  __builtin_choose_expr(\
% +   __builtin_types_compatible_p(__typeof (x), double), d(x),\
% +__builtin_choose_expr(  \
% + __builtin_types_compatible_p(__typeof (x), float), f(x), (void)0)))

Extra space after __typeof.

Normal formatting doesn't march to the right like this...

% +#else
% +#define __fp_type_select(x, f, d, ld) \
% + ((sizeof (x) == sizeof (float)) ? f(x)\
% +  : (sizeof (x) == sizeof (double)) ? d(x) \
% +  : ld(x))

... or like this.

Extra space after sizeof (bug copied from old code).

% +#endif
% +
% +
% +

Extra blank lines.

%  #define  fpclassify(x) \
% -((sizeof (x) == sizeof (float)) ? __fpclassifyf(x) \
% -: (sizeof (x) == sizeof (double)) ? __fpclassifyd(x) \
% -: __fpclassifyl(x))

Example of normal style in old code (except for the space after sizeof(),
and the backslashes aren't line up like they are in some other places in
this file).

% ...
% @@ -119,10 +130,8 @@
%  #define  isunordered(x, y)   (isnan(x) || isnan(y))
%  #endif /* __MATH_BUILTIN_RELOPS */
% 
% -#define	signbit(x)	\

% -((sizeof (x) == sizeof (float)) ? __signbitf(x)  \
% -: (sizeof (x) == sizeof (double)) ? __signbit(x) \
% -: __signbitl(x))
% +#define signbit(x) \
% + __fp_type_select(x, __signbitf, __signbit, __signbitl)

The tab lossage is especially obvious here.

This macro definition fits on 1 line now.  Similarly for others except
__inline_isnan*, which takes 2 lines.  __inline_isnan* should be named
less verbosely, without __inline.  I think this doesn't cause any
significant conflicts with libm.  Might need __always_inline.
__fp_type_select is also verbose.

% 
%  typedef	__double_t	double_t;

%  typedef  __float_t   float_t;
% @@ -175,6 +184,7 @@
%  int  __isfinite(double) __pure2;
%  int  __isfinitel(long double) __pure2;
%  int  __isinff(float) __pure2;
% +int  __isinf(double) __pure2;
%  int  __isinfl(long double) __pure2;
%  int  __isnanf(float) __pure2;
%  int  __isnanl(long double) __pure2;
% @@ -185,6 +195,23 @@
%  int  __signbitf(float) __pure2;
%  int  __signbitl(long double) __pure2;

The declarations of old extern functions can probably be removed too
when they are replaced by inlines (only __isnan*() for now).  I think
the declarations of __isnan*() are now only used to prevent warnings
(at higher warning levels than have ever been used) in the file that
implement the functions.

% 
% +static __inline int

% +__inline_isnanf(float __x)
% +{
% + return (__x != __x);
% +}
% +static __inline int
% +__inline_isnan(double __x)
% +{
% + return (__x != __x);
% +}
% +static __inline int
% +__inline_isnanl(long double __x)
% +{
% + return (__x != __x);
% +}
% +
% +

Extra blank lines.

Some insertion sort errors.  In this file, APIs are mostly sorted in the
order double, float, long double.

All the inline functions except __inline_isnan*() only evaluate their
args once, so they can be simpler

Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity

2013-07-10 Thread Bruce Evans

On Wed, 10 Jul 2013, Garrett Wollman wrote:


On Wed, 10 Jul 2013 22:12:59 +0200, Tijl Coosemans t...@freebsd.org said:


I think isnan(double) and isinf(double) in math.h should only be
visible if (_BSD_VISIBLE || _XSI_VISIBLE) && __ISO_C_VISIBLE >= 1999.
For C99 and higher there should only be the isnan/isinf macros.


I believe you are correct.  POSIX.1-2008 (which is aligned with C99)
consistently calls isnan() a macro, and gives a pseudo-prototype of

int isnan(real-floating x);


Almost any macro may be implemented as a function, if no conforming
program can tell the difference.  It is impossible for technical reasons
to implement isnan() as a macro (except on weird implementations where
all real-floating types are physically the same).  In the FreeBSD
implementation, isnan() is a macro, but it is also a function, and
the macro expands to the function in double precision:

% #define   isnan(x)\
% ((sizeof (x) == sizeof (float)) ? __isnanf(x) \
% : (sizeof (x) == sizeof (double)) ? isnan(x)  \
% : __isnanl(x))

I don't see how any conforming program can access the isnan() function
directly.  It is just as protected as __isnan() would be.  (isnan)()
gives the function (the function prototype uses this), but conforming
programs can't do that since the function might not exist.  Maybe some
non-conforming program like autoconfig reads math.h or libm.a and
creates a bug for C++.

The FreeBSD isnan() implementation would be broken by removing the
isnan() function from libm.a or ifdefing it in math.h.  Changing the
function to __isnan() would cause compatibility problems.  The function
is intentionally named isnan() to reduce compatibility problems.

OTOH, all of the extern sub-functions that are currently used should
never be used, since using them gives a very low quality of
implementation:
- the functions are very slow
- the functions have names that confuse compilers and thus prevent
  compilers from replacing them by builtins.  Currently, only gcc
  automatically replaces isnan() by __builtin_isnan().  This only
  works in double precision.  So the FreeBSD implementation only
  works right in double precision too, only with gcc, __because__
  it replaces the macro isnan(x) by the function isnan(x).  The
  result is inline expansion, the same as if the macro isnan()
  is replaced by __builtin_isnan().  clang never does this automatic
  replacement, so it generates calls to the slow library functions.
  Other things go wrong for gcc in other precisions:
  - if math.h is not included, then isnan(x) gives
__builtin_isnan((double)x).  This sort of works on x86, but is
low quality since it is broken for signaling NaNs (see below).
One of the main reasons for the existence of the
classification macros is that simply converting the arg to a common
type and classifying the result doesn't always work.
  - if math.h is not included, then spelling the API isnanf() or
isnanl() gives correct results but a warning about these APIs
not being declared.  These APIs are nonstandard but are converted
to __builtin_isnan[fl] by gcc.
  - if math.h is included, then:
- if the API is spelled isnan(), then the macro converts to
  __isnanf() or __isnanl().  gcc doesn't understand these, and
  the slow extern functions are used.
- if the API is spelled isnanf() or isnanl(), then the result is
  correct and the warning magically goes away.  math.h declares
  isnanf(), but gcc apparently declares both iff math.h is included.
  gcc also optimizes isnanl() on a float arg to __builtin_isnanf().
- no function version can work in some cases, because any function version
  may have unwanted side effects.  This is another of the main reasons
  for the existence of these and other macros.  The main unwanted side
  effect is signaling for signaling NaNs.  C99 doesn't really support
  signaling NaNs, even with the IEC 60559 extensions, so almost anything
  is allowed for them.  But IEEE 854 is fairly clear that isnan() and
  classification macros shouldn't raise any exceptions.  IEEE 854 is
  even clearer that copying values without changing their representation
  should (shall?) not cause exceptions.  But on i387, just loading a float
  or double value changes its representation and generates an exception
  for signaling NaNs, while just loading a long double value conforms to
  IEEE 854 and doesn't change its representation or generate an exception.
  Passing of args to functions may or may not load the values.  ABIs may
  require a change of representation.  On i387, passing of double args
  should go through the FPU for efficiency reasons, and this changes the
  representation twice to not even get back to the original (for signaling
  NaNs, it generates an exception and sets the quiet bit in the result;
  thus a classification function can never see a signaling NaN in double
  precision).  So 

Re: [RFC/RFT] calloutng

2013-01-18 Thread Bruce Evans

On Thu, 17 Jan 2013, Ian Lepore wrote:


On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote:



Er, timecounters are called with a spin mutex held in existing code:
though it is dangerous to do so, timecounters are called from fast
interrupt handlers for very timekeeping-critical purposes:
- to implement the TIOCTIMESTAMP ioctl (except this is broken in
   -current).  This was a primitive version of pps timestamping.
- for pps timestamping.  The interrupt handler (which should be a fast
   interrupt handler to minimize latency) calls pps_capture() which
   calls tc_get_timecount() and does other lock-free accesses to the
   timecounter state.  This still works in -current (at least there is
   still code for it).


Unfortunately, calling pps_capture() in the primary interrupt context is
no longer an option with the stock pps driver.  Ever since the ppbus
rewrite all ppbus children must use threaded handlers.  I tried to fix
that a couple different ways, and both ended up with crazy-complex code


Hmm, I didn't notice that ppc supported pps (I try not to look at it since
it is ugly :-), and don't know of any version of it that uses non-threaded
handlers (except in FreeBSD-4 and before, when normal interrupt handlers
were non-threaded, so ppc had their high latency but not the even higher
latency and overheads of threaded handlers).

OTOH, my x86 RTC interrupt handler is threaded and supports pps, and
I haven't noticed any latency problems with this.  It just can't
possibly give the ~1 usec jitter that FreeBSD-[3-4] could give ~15
years ago using a fast interrupt handler (there must be only 1 device
using a fast interrupt handler, with this dedicated to pps, else the
multiple fast interrupt handlers will give latency much larger than
~1 usec to each other).  I don't actually use this for anything except
testing whether the RTC can be used for a poor man's pps.


scattered around the ppbus family just to support the rarely-used pps
capture.  It would have been easier to do if filter and threaded
interrupt handlers had the same function signature.

I ended up writing a separate driver that can be used instead of ppc +
ppbus + pps, since anyone who cares about precise pps capture is
unlikely to be sharing the port with a printer or plip device or some
such.


Probably all pps handlers should be special.  On x86 with reasonable
timecounter hardware, say a TSC, it takes about 10 instructions for
an entire pps interrupt handler:

XintrN:
	pushl	%eax
	pushl	%edx
	rdtsc
	# Need some ugliness for EIO here or later.
	ss:movl	%eax,ppscap	# Hopefully lock-free via time-domain locking.
	ss:movl	%edx,ppscap+4
	popl	%edx
	popl	%eax
	iret

After capturing the timecounter hardware value here, you convert it
to a pps event at leisure.  But since this only happens once per second,
it wouldn't be very inefficient to turn the interrupt handler into a
slow high-latency one, even a threaded one, to handle the pps event
and/or other devices attached to the interrupt.


   OTOH, all drivers that call pps_capture() from their interrupt handler
   then immediately call pps_event().  This has always been very broken,
   and became even more broken with SMPng.  pps_event() does many more
   timecounter and pps accesses whose locking is unclear at best, and
   in some configurations it calls hardpps(), which is only locked by
   Giant, despite comments in kern_ntptime.c still saying that it (and
   many other functions in kern_ntptime.c) must be called at splclock()
   or higher.  splclock() is of course now null, but the locking
   requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
   always needed to be locked by the equivalent of a spin mutex, which
   is stronger locking than was given by splclock().  pps_event() would
   have to acquire the spin mutex before calling hardpps(), although
   this is bad for fast interrupt handlers.  The correct implementation
   is probably to only do the capture part from fast interrupt handlers.


In my rewritten dedicated pps driver I call pps_capture() from the
filter handler and pps_event() from the threaded handler.  I never found


That seems right.


any good documentation on the low-level details of this stuff, and there
isn't enough good example code to work from.  My hazy memory is that I


There seem to be no good examples.


ended up studying the pps_capture() and pps_event() code enough to infer
that their design intent seems to be to allow you to capture with no
locking and do the event processing later in some sort of deferred or
threaded context.


That seems to be the design, but there are no examples of separating
the event from the capture.

I think the correct locking is:
- capture in a fast interrupt handler, into a per-device state that
  is locked by whatever locks all of the state accessed by the fast
  interrupt handler
- switch to a less critical context later:
  - lock this step

Re: [RFC/RFT] calloutng

2013-01-13 Thread Bruce Evans

On Sun, 13 Jan 2013, Alexander Motin wrote:


On 13.01.2013 20:09, Marius Strobl wrote:

On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:

On 06.01.2013 17:23, Marius Strobl wrote:

I'm not really sure what to do about that. Earlier you already said
that sched_bind(9) also isn't an option in case td_critnest > 1.
To be honest, I don't really understand why using a spin lock in the
timecounter path makes sparc64 the only problematic architecture
for your changes. The x86 i8254_get_timecount() also uses a spin lock
so it should be in the same boat.


The problem is not in using spinlock, but in waiting for other CPU while
spinlock is held. Other CPU may also hold spinlock and wait for
something, causing deadlock. i8254 code uses spinlock just to atomically
access hardware registers, so it causes no problems.


Okay, but wouldn't that be a general problem then? Pretty much
anything triggering an IPI holds smp_ipi_mtx while doing so and
the lower level IPI stuff waits for other CPU(s), including on
x86.


The problem is general.  But now it works because a single smp_ipi_mtx is
used in all cases where the IPI result is waited for.  As long as spinning
happens with interrupts still enabled, there are no deadlocks.  But the
problem reappears if any different lock is used, or locks are nested.

In existing code in HEAD and 9 timecounters are never called with spin
mutex held.  I intentionally tried to avoid that in existing eventtimers
code.


Er, timecounters are called with a spin mutex held in existing code:
though it is dangerous to do so, timecounters are called from fast
interrupt handlers for very timekeeping-critical purposes:
- to implement the TIOCTIMESTAMP ioctl (except this is broken in
  -current).  This was a primitive version of pps timestamping.
- for pps timestamping.  The interrupt handler (which should be a fast
  interrupt handler to minimize latency) calls pps_capture() which
  calls tc_get_timecount() and does other lock-free accesses to the
  timecounter state.  This still works in -current (at least there is
  still code for it).

  OTOH, all drivers that call pps_capture() from their interrupt handler
  then immediately call pps_event().  This has always been very broken,
  and became even more broken with SMPng.  pps_event() does many more
  timecounter and pps accesses whose locking is unclear at best, and
  in some configurations it calls hardpps(), which is only locked by
  Giant, despite comments in kern_ntptime.c still saying that it (and
  many other functions in kern_ntptime.c) must be called at splclock()
  or higher.  splclock() is of course now null, but the locking
  requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
  always needed to be locked by the equivalent of a spin mutex, which
  is stronger locking than was given by splclock().  pps_event() would
  have to acquire the spin mutex before calling hardpps(), although
  this is bad for fast interrupt handlers.  The correct implementation
  is probably to only do the capture part from fast interrupt handlers.


Callout code at the same time can be called in any environment with any
locks held. And new callout code may need to know precise current time
in any of those conditions. Attempt to use an IPI and wait there can be
fatal.


Callout code can't be called from such a general any environment as
timecounter code.  Not from a fast interrupt handler.  Not from an NMI
or IPI handler.  I hope.  But timecounter code has a good chance of
working even for the last 2 environments, due to its design requirement
of working in the first.

The spinlock in the i8254 timecounter certainly breaks some cases.
For example, suppose the lock is held for a timecounter read from
normal context.  It masks hardware interrupts on the current CPU (except
in my version).  It doesn't mask NMIs or other traps.  So if the NMI
or other trap handler does a timecounter hardware call, there is
deadlock in at least the !SMP case.  In my version, it blocks normal
interrupts later if they occur, but doesn't block fast interrupts, so
the pps_capture() call would deadlock if it occurs, like a timecounter
call from an NMI.  I avoid this by not using pps in any fast interrupt
handler, and by only using the i8254 timecounter for testing.  I do
use pps in a (nonstandard) x86 RTC clock interrupt handler.  My clock
interrupt handlers are all non-fast to avoid this and other locking
problems.


FYI, these are the results of the v215 (btw., these (ab)use a bus
cycle counter of the host-PCI-bridge as timecounter) with your
calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
select 1 23.82
poll   1   1008.23
usleep 1 23.31
nanosleep  1 23.17
kqueue 1   1010.35
kqueueto   1 26.26
syscall1  1.91
select   300307.72
poll 300   1008.23
usleep   300307.64
nanosleep300 23.21


Please fix the tv_nsec initialization so that we can see if nanosleep()
and 

Re: [RFC/RFT] calloutng

2013-01-03 Thread Bruce Evans

On Wed, 2 Jan 2013, Alexander Motin wrote:


On 02.01.2013 19:09, Konstantin Belousov wrote:

On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


Fine-grained timeouts complete fully opening this security hole.
Synchronization without fine-grained timeouts might allow the same,
but is harder to exploit since you can't control the yielding points
directly.  With fine-grained timeouts, you just have to predict the
statclock firing points.  Use one timeout to arrange to yield just
before statclock fires and another to regain control just after it has
fired.  If the timeout resolution is say 50 usec, then this can hope
to run for all except 100 usec out of every 1/stathz seconds.  With
stathz = 128, 1/stathz is 7812 usec, so this gives 7712/7812 of the
CPU with 0 statclock ticks.  Since the scheduler never sees you running,
your priority remains minimal, so the scheduler should prefer to run
you whenever a timeout expires, with only round-robin with other
minimal-priority threads preventing you getting 7712/7812 of the (user
non-rtprio) CPU.

The previous stage of fully opening this security hole was changing
(the default) HZ from 100 to 1000.  HZ must not be much smaller than
stathz, else the security hole is almost fully open.  With HZ = 100
being less than stathz and timeout granularity limiting the fine control
to 2/HZ = 20 msec (except you can use a periodic itimer to get a 1/HZ
granularity at a minor cost of getting more SIGALRMs), it is impossible
to get near 100% of the CPU with 0 statclock ticks.  After yielding,
you can't get control for another 10 or 20 msec.  Since this exceeds
1/stathz = 7812 usec, you can only hide from statclock ticks by not
running very often or for very long.  Limited hiding is possible by
wasting even more CPU to determine when to hide: since the timeout
granularity is large, it is also ineffective for determining when to
yield.  So when running, you must poll the current time a lot to
determine when to yield.  Yield just before statclock fires, as above.
(Do it 50 usec early, as above, to avoid most races involving polling
the time.)  This actually has good chances of not limiting the hiding
too much, depending on the details of the scheduling.  It yields just
before a statclock tick.  After this tick fires, if the scheduler
reschedules for any reason, then the hiding process would most likely
be run again, since its priority is minimal.  But at least the old
4BSD scheduler doesn't reschedule after _every_ statclock tick.  This
depends on the bugfeature that the priority is not checked on _every_
return to user mode (sched_clock() does change the priority, but this
is not acted on until much later).  Without this bugfeature, there
would be excessive context switches.  OTOH, with timeouts, at least
old non-fine-grained ones, you can force a rescheduling that is acted
on soon enough simply by using timeouts (since timeouts give a context
switch to the softclock thread, the scheduler has no option to skip
checking the priority on return to user mode).

After the previous stage of changing HZ to 1000, the granularity is fine
enough for using timeouts to hide from the scheduler.  Using a periodic
itimer to get a granularity of 1000 usec, start hiding 50-1000 usec
before each statclock tick and regain control 1000 usec later.  With
stathz = 128, 6812/7812 of the CPU with 0 statclock ticks.  Not much
worse (for the hider) than 7712/7812.

Statclock was supposed to be aperiodic to avoid hiding (see
statclk-usenix93.ps), but this was never implemented in FreeBSD.  With
fine-grained timeouts, it would have to be very aperiodic, to the point
of giving large inaccuracies, to limit the hiding very much.  For
example, suppose that it has an average period of 7812 usec with +-50%
jitter.  You would try to hide from it most of the time by running for
a bit less than 7812/2 usec before yielding in most cases.  If too
much scheduling is done on each statclock tick, then you are likely
to regain control after each one (as above) and then know that there
is almost a full minimal period until the next one.  Otherwise, it
seems to be necessary to determine when the previous statclock tick
occurred, so as to determine the minimum time until the next one.

There are many different kinds of accounting with different characteristics. 
Run time for each thread 

Re: [RFC/RFT] calloutng

2013-01-03 Thread Bruce Evans

On Thu, 3 Jan 2013, Alexander Motin wrote:


On 03.01.2013 16:45, Bruce Evans wrote:

On Wed, 2 Jan 2013, Alexander Motin wrote:

More important for scheduling fairness thread's CPU percentage is also
based on hardclock() and hiding from it was trivial before, since all
sleep primitives were strictly aligned to hardclock(). Now it is
slightly less trivial, since this alignment was removed and user-level
APIs provide no easy way to enforce it.


%cpu is actually based on statclock(), and not even used for scheduling.


May be for SCHED_4BSD, but not for SCHED_ULE.  In SCHED_ULE both %cpu and
thread priority are based on the same ts_ticks counter, which is based on
hardclock() as a time source.  Interactivity calculation uses similar logic
and the same time source.


Hmm.  I missed this because it hacks on the 'ticks' global.  It is clearer
in intermediate versions which use the scheduler API sched_tick(), which
is the hardclock analogue of sched_clock() for statclock.  sched_tick() is
now bogus since it is null for all schedulers.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
, Davide Italiano writes:


Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.


And that is way overkill for specifying a callout, at best your clock
has short term stabilities approaching 1e-8, but likely as bad as 1e-6.


So you always agreed with me that bintimes are unsuitable for almost
everything, and especially unsuitable for timeouts? :-)


(The reason why bintime is important for timekeeping is that we
accumulate timeintervals approx 1e3 times a second, so the rounding
error has to be much smaller than the short term stability in order
to not dominate)


bintimes are not unsuitable for timekeeping, but they are painful to use
for other APIs.  You have to either put bintimes in layers in the other
APIs, or convert them to a more suitable format, and there is a problem
placing the conversion at points where it is efficient.  This thread
seems to be mostly about putting the conversion in wrong places.  My
original objection was about using bintimes for almost everything at
the implementation level.


I do not really think it is worth it to create another structure for
handling time (e.g. struct bintime32), as it will lead to code


No, that was exactly my point:  It should be an integer so that
comparisons and arithmetic is trivial.   A 32.32 format fits
nicely into a int64_t which is readily available in the language.


I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks >= 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks < 0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


As I said in my previous email:

   typedef dur_t   int64_t;/* signed for bug catching */
   #define DURSEC  ((dur_t)1 << 32)
   #define DURMIN  (DURSEC * 60)
   #define DURMSEC (DURSEC / 1000)
   #define DURUSEC (DURSEC / 1000000)
   #define DURNSEC (DURSEC / 1000000000)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


Constructing the cookie for my special case would not be so easy.


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1 < d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.

Bruce


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes:


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1 < d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).


Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


Bintimes is a necessary superset of the 32.32 which tries to work
around the necessary but missing int96_t or int128_t[1].

[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.

Bruce


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Davide Italiano wrote:


On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:



I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks >= 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks < 0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


In the old API, it is an error to pass ticks < 0, so only broken old
callers are affected.  Of course, if there are any then it would be
hard to detect their garbage cookies.

Anyway, it's too late to change to this, and maybe also to a 32.32
format.

[32.32 format]

This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.


I think phk's proposal is better.  About your overflow objection,
I think it is really unlikely to happen, but better safe than sorry.


It's very easy for applications to cause kernel overflow using valid
syscall args like tv_sec = TIME_T_MAX for a relative time in
nanosleep().  Adding TIME_T_MAX to the current time in seconds overflows
for all current times except for the first second after the Epoch.
There is no difference between the overflow for 32-bit and 64-bit
time_t's for this.  This is now mostly handled so that the behaviour is
harmless although wrong.  E.g., the timeout might become negative,
and then since it is not a cookie it is silently replaced by a timeout
of 1 tick.  In nanosleep(), IIRC there are further overflows that result
in returning early instead of retrying the 1-tick timeouts endlessly.

Bruce


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes:

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


That's sort of not an option.


I think it is.  It is just probably not necessary since 32-bit systems
will go away before 2038.


The real problem was that time_t was not defined as a floating
point number.


That would be convenient too, but bad for efficiency on some systems.
Kernels might not be able to use it, and then would have to use an
alternative representation, which they should have done all along.


[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.


You can say that about anything but CPU-native operations, and I doubt
it would be as inefficient as struct bintime, which does not have access
to the carry bit.


Yes, I would say that about non-native.  It goes against the spirit of C.

OTOH, compilers are getting closer to giving full access to the carry
bit.  I just checked what clang does in a home-made 128-bit add function:

% static void __noinline
% uadd(struct u *xup, struct u *yup)
% {
% 	unsigned long long t;
%
% 	t = xup->w[0] + yup->w[0];
% 	if (t < xup->w[0])
% 		xup->w[1]++;
% 	xup->w[0] = t;
% 	xup->w[1] += yup->w[1];
% }
%
% 	.align	16, 0x90
% 	.type	uadd,@function
% uadd:					# @uadd
% 	.cfi_startproc
% # BB#0:				# %entry
% 	movq	(%rdi), %rcx
% 	movq	8(%rdi), %rax
% 	addq	(%rsi), %rcx

gcc generates an additional cmpq instruction here.

% 	jae	.LBB2_2

clang uses the carry bit set by the first addition to avoid the comparison,
but still branches.

% # BB#1:				# %if.then
% 	incq	%rax
% 	movq	%rax, 8(%rdi)

This adds 1 explicitly instead of using adcq, but this is the slow path.

% .LBB2_2:				# %if.end
% 	movq	%rcx, (%rdi)
% 	addq	8(%rsi), %rax

This is as efficient as possible except for the extra branch, and the
branch is almost perfectly predictable.

% 	movq	%rax, 8(%rdi)
% 	ret
% .Ltmp22:
% 	.size	uadd, .Ltmp22-uadd
% 	.cfi_endproc

Bruce


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

I finally remembered to remove the .it phk :-).

On Wed, 19 Dec 2012, Luigi Rizzo wrote:


On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote:

...
As I said in my previous email:


typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1 << 32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000000)
#define DURNSEC (DURSEC / 1000000000)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


only thing, we must be careful with the parentheses

For instance, in your macro, DURNSEC evaluates to 0 and so
does any multiple of it.
We should define them as

#define DURNSEC DURSEC / 1000000000
...

so DURNSEC is still 0 and 500*DURNSEC gives 2147

I am curious that Bruce did not mention this :)


Er, he was careful.  DURNSEC gives 4, not 0.  This is not very accurate,
but probably good enough.

Your version without parentheses is not so careful and depends on
a magic order of operations and no overflow from this.  E.g.:

500*DURNSEC = 500*DURSEC / 1000000000 = 500*((dur_t)1 << 32) / 1000000000

This is very accurate and happens not to overflow.  But 5 seconds represented
a little strangely in nanoseconds would overflow:

5000000000*DURNSEC = 5000000000*((dur_t)1 << 32) / 1000000000

So would 5 billion times DURSEC, but 5 billion seconds is more unreasonable
than 5 billion nanoseconds and the format just can't represent that.



(btw the typedef is swapped, should be typedef int64_t dur_t)


Didn't notice this.

Bruce


Re: [RFC/RFT] calloutng

2012-12-16 Thread Bruce Evans

On Sat, 15 Dec 2012, Garrett Cooper wrote:


On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote:


On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote:

Hi.

I'm sorry to interrupt review, but as usual good ideas came during the
final testing, causing another round. :)  Here is updated patch for
HEAD, that includes several new changes:
http://people.freebsd.org/~mav/calloutng_12_15.patch


This patch breaks the libprocstat build.

Specifically, the OpenSolaris sys/time.h defines the preprocessor
symbols gethrestime and gethrestime_sec. These symbols are also defined
in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h.
libprocstat:zfs.c is compiled using include paths that pick up the
OpenSolaris time.h, and with this patch _callout.h includes sys/time.h.

zfs.c includes taskqueue.h (with _KERNEL defined), which includes
_callout.h, so both time.h and zfs_context.h are included in zfs.c, and
the symbols are thus defined twice.


Gross namespace pollution.  sys/_callout.h exists so that the full
namespace pollution of sys/callout.h doesn't get included nested.  But
sys/time.h is much more polluted than sys/callout.h.

However, sys/time.h is old standard pollution in sys/param.h, and
sys/callout.h is not so old standard pollution in sys/systm.h.  It is
a bug to not include sys/param.h and sys/systm.h in most kernel source
code, so these nested includes are just style bugs -- they have no
effect for correct kernel source code.


The patch below fixes the build for me. Another approach might be to
include sys/_task.h instead of taskqueue.h at the beginning of zfs.c.


Good if it works.


I had a patch open once upon a time to clean up inclusion of sys/time.h all
over the tree and deal with the sys/time.h -> time.h pollution issue, but it
got dropped due to lack of interest (20~30 apps/libs were affected IIRC and I
only really got assistance in fixing the UFS and bsnmpd pieces, and gave up
due to lack of response from maintainers).  dtrace/zfs is a definite
instigator in this pollution (I remember nasty cddl/... pollution with the
compat sys/time.h header).


Please use the unix newline character in mail.  The above is difficult to
quote.

The standard sys/time.h pollution in sys/param.h is only in the kernel,
and there aren't many direct includes of sys/time.h in the kernel.  Userland
is different and many of the direct includes were correct.  But now POSIX
specifies that struct timespec and struct timeval be defined in most places
where they are needed, so the includes of sys/time.h are not necessary
for POSIX or FreeBSD, although FreeBSD man pages still say that they
are necessary.  The sys/time.h -> time.h pollution issue is also only
for userland.  Many places depend on one including the other, and include
the wrong one themselves.


Bottom line: make sure anything new you're defining isn't already 
defined via POSIX or other OSes, and if so please try to make the 
implementations match (so that eventual POSIX inclusion might be possible) and 
when in doubt I suggest consulting standards@ / brde@.


Bruce


Re: [RFC/RFT] calloutng

2012-12-15 Thread Bruce Evans

On Fri, 14 Dec 2012, Oliver Pinter wrote:


635 -   return tticks;
636 +   getbinuptime(&pbt);
637 +   bt.sec = data / 1000;
638 +   bt.frac = (data % 1000) * (uint64_t)18446744073709000LL;
639 +   bintime_add(&bt, &pbt);
640 +   return bt;


Style bugs: missing spaces around return value in new and old code.


641  }

What is this 18446744073709000LL constant?


This is

2**64 / 10**6 * 10**3

obfuscated by printing it in hex and doing the scaling by powers of
10 manually, and then giving it a bogus type using the abominable long
long misfeature.  I try to kill this obfuscation and the abomination
whenever I see them.  In sys/time.h, this resulted in a related binary
conversion using a scale factor of

((uint64_t)1 << 63) / (10**9 >> 1).

Here the power of 2 term is 2**63.   2**64 cannot be used since it exceeds
uintmax_t.  The power of 10 term is 10**9.  This is divided by 2 to
compensate for dividing 2**64 by 2.  The abomination is avoided by using
smaller literal values and expanding them to 64-bit values using shifts.

Long long suffixes on literal constants are only needed to support C90
compilers with the long long extension on 32-bit systems anyway.
Otherwise, C90+extension compilers will warn about literal constants
larger than ULONG_MAX (which can only occur on 32-bit systems).  Since
C99 is now the default, the warnings would only occur without the LL in
the above if you use nonstandard CFLAGS.

The above has to convert from the bad units of milliseconds to the bloated
units of bintimes, and it is less refined than most other bintime
conversions.  I think that since it doesn't try to be optimal, it should
just use the standard bintime conversions after first converting
milliseconds to a timeval.  It already does essentially that with its
divisions by 1000:

struct timeval tv;

tv.tv_sec = data / 1000;
tv.tv_usec = data % 1000 * 1000;
timeval2bintime(&tv, &bt);

The compiler will probably optimize /1000 and %1000 to shifts in both
this and the above.  Then timeval2bintime() does almost the same
multiplication as above, but spelled differently.  Both give
unnecessary inaccuracy in the conversion to weenieseconds: the first
gives:

bt.frac = data % 1000 * (2**64 / 10**6 * 10**3);

the second gives:

bt.frac = data % 1000 * 1000 * (2**64 / 10**6);

Because of the different grouping of the multiplications, the second
is unfortunately slower (1 more multiplication that cannot be done at
compile time).  The second also gives unnecessary (but fundamental to
the method) inaccuracy by pulling out the factor of 1000.  The first
gives the same inaccuracy, and now it is because the constant is not
correctly rounded.  It should be

2.0**64 / 10**3 = 18446744073709551.616 (exactly)
= 18446744073709552 (rounded to nearest int)

but is actually rounded down to a multiple of 1000.

It would be better to round the scale factors so that the conversions
are inverses of each other and tticks can be recovered from bt, but
this is impossible.  I tried to make the bintime conversions invert
most values correctly by rounding to nearest, but phk didn't like this
and the result is the bogus comment about always rounding down in time.h.
So when you start with 999 msec in tticks, the resulting bt will be
rounded down a little and converting back will give 998 msec; the
next round of conversions will reduce 1 more, and so on until you
reach a value that is exactly representable in both milliseconds and
weenieseconds (875?).  This despite weenieseconds providing vastly
more accuracy than can be measured and vastly more accuracy than needed
to represent all other time values in the kernel in a unique way.  Just
not in a unique way that is expressible using simple scaling conversions.
The conversions that give uniqueness can still be monotonic, but can't
be linear in the same way that simple scaling is.

Bruce


Re: [RFC/RFT] calloutng

2012-12-15 Thread Bruce Evans

On Sat, 15 Dec 2012, Bruce Evans wrote:


On Fri, 14 Dec 2012, Oliver Pinter wrote:


What is this 18446744073709000LL constant?


This is

   2**64 / 10**6 * 10**3

obfuscated by printing it in hex and doing the scaling by powers of
10 manually, and then giving it a bogus type using the abominable long
long misfeature.  I try to kill this obfuscation and the abomination
whenever I see them.  In sys/time.h, this resulted in a related binary
conversion using a scale factor of

   ((uint64_t)1 << 63) / (10**9 >> 1).

Here the power of 2 term is 2**63.   2**64 cannot be used since it exceeds
uintmax_t.  The power of 10 term is 10**9.  This is divided by 2 to
compensate for dividing 2**64 by 2.  The abomination is avoided by using
smaller literal values and expanding them to 64-bit values using shifts.


Bah, this is only de-obfuscated and de-abominated in my version:

% Index: time.h
% ===
% RCS file: /home/ncvs/src/sys/sys/time.h,v
% retrieving revision 1.65
% diff -u -2 -r1.65 time.h
% --- time.h7 Apr 2004 04:19:49 -   1.65
% +++ time.h7 Apr 2004 11:28:54 -
% @@ -118,6 +118,5 @@
% 
%  	bt->sec = ts->tv_sec;
% -	/* 18446744073 = int(2^64 / 10^9) */
% -	bt->frac = ts->tv_nsec * (uint64_t)18446744073LL;
% +	bt->frac = ts->tv_nsec * (((uint64_t)1 << 63) / (1000000000 >> 1));
%  }
%

The magic 1844... in time.h is at least commented on.  This makes it
less obscure, but takes twice as many source lines and risks the comment
getting out of date with the code.  The comment is also sloppy with types
and uses the '^' operator without saying that it is exponentiation and
nothing like the C '^' operator.  The types are especially critical in
the shift expression.  I like to use the Fortran '**' operator in C
comments instead, without saying what it is.

In another reply to this thread, the value in the explanation is off
by a factor of 1000 and the rounding to a multiple of 1000 is not
explained.  It is easy to have such errors in comments, while the code
tends to be more correct since it gets checked by running it.

Bruce


Re: [RFC/RFT] calloutng

2012-12-15 Thread Bruce Evans

On Sat, 15 Dec 2012, Oliver Pinter wrote:


On 12/15/12, Bruce Evans b...@optusnet.com.au wrote:



...
Because of the different grouping of the multiplications, the second
is unfortunately slower (1 more multiplication that cannot be done at
compile time).  The second also gives unnecessary (but findamental to
the method) inaccuracy by pulling out the factor of 1000.  The first
gives the same inaccuracy, and now it is because the constant is not
correctly rounded.  It should be

2.0**64 / 10**3 = 18446744073709551.616 (exactly)
= 18446744073709552 (rounded to nearest int)

but is actually rounded down to a multiple of 1000.
...


mav@ already fixed the rounding before I wrote that :-).

He also changed some (uint64_t)1's to use the long long abomination :-(.


Thanks for the detailed answer. :)


Bruce


Re: clang compiled kernel panic when mounting zfs root on i386

2012-11-26 Thread Bruce Evans

On Mon, 26 Nov 2012, Konstantin Belousov wrote:


On Mon, Nov 26, 2012 at 06:31:34AM -0800, sig6247 wrote:


Just checked out r243529, this only happens when the kernel is compiled
by clang, and only on i386, either recompiling the kernel with gcc or
booting from a UFS root works fine. Is it a known problem?

It looks like clang uses more stack than gcc, and zfs makes quite
deep call chains.

It would be a waste, generally, to increase the init process kernel
stack size only to pacify zfs. And I suspect that it would not help
in the similar situations when the same procedure initiated for non-root
mounts.


Or to pacify clang...


--
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from zfs:zroot []...

Fatal double fault:
eip = 0xc0adc37d
esp = 0xc86bffc8
ebp = 0xc86c003c
cpuid = 1; apic id = 01
panic: double fault
cpuid = 1
KDB: enter: panic
[ thread pid 1 tid 12 ]
Stopped at  kdb_enter+0x3d: movl$0,kdb_why
db bt
Tracing pid 1 tid 12 td 0xc89efbc0
kdb_enter(c1064aa4,c1064aa4,c10b806f,c139e3b8,f5eacada,...) at kdb_enter+0x3d
panic(c10b806f,1,1,1,c86c003c,...) at panic+0x14b
dblfault_handler() at dblfault_handler+0xab
--- trap 0x17, eip = 0xc0adc37d, esp = 0xc86bffc8, ebp = 0xc86c003c ---
witness_checkorder(c1fd7508,9,c109df18,7fa,0,...) at witness_checkorder+0x37d
__mtx_lock_flags(c1fd7518,0,c109df18,7fa,c135d918,...) at __mtx_lock_flags+0x87
uma_zalloc_arg(c1fd66c0,0,1,4d3,c86c0110,...) at uma_zalloc_arg+0x605
vm_map_insert(c1fd508c,c13dfc10,bb1f000,0,cba1e000,...) at vm_map_insert+0x499
kmem_back(c1fd508c,cba1e000,1000,3,c86c01d4,...) at kmem_back+0x76
kmem_malloc(c1fd508c,1000,3) at kmem_malloc+0x250
page_alloc(c1fd1d80,1000,c86c020b,3,c1fd1d80,...) at page_alloc+0x27
keg_alloc_slab(103,4,c109df18,870,cb99ef6c,...) at keg_alloc_slab+0xc3
keg_fetch_slab(103,c1fd1d80,cb99ef6c,c1fc8230,c86c02c0,...) at 
keg_fetch_slab+0xe2
zone_fetch_slab(c1fd1d80,c1fd0480,103,826,0,...) at zone_fetch_slab+0x43
uma_zalloc_arg(c1fd1d80,0,102,3,2,...) at uma_zalloc_arg+0x3f2
malloc(4c,c1686100,102,c86c0388,c173d09a,...) at malloc+0xe9
zfs_kmem_alloc(4c,102,cb618820,c89efbc0,cb618820,...) at zfs_kmem_alloc+0x20
vdev_mirror_io_start(cb8218a0,10,cb8218a0,1,0,...) at vdev_mirror_io_start+0x14a
zio_vdev_io_start(cb8218a0,c89efbc0,0,cb8218a0,c86c0600,...) at 
zio_vdev_io_start+0x228
zio_execute(cb8218a0,cb618000,cba1b640,cb90,400,...) at zio_execute+0x106
spa_load_verify_cb(cb618000,0,cba1b640,cb884b40,c86c0600,...) at 
spa_load_verify_cb+0x89
traverse_visitbp(cb884b40,cba1b640,c86c0600,c86c0ba0,0,...) at 
traverse_visitbp+0x29f
traverse_dnode(cb884b40,0,0,8b,0,...) at traverse_dnode+0x92
traverse_visitbp(cb884bb8,cba07200,c86c0890,cb884bf4,c16ce7e0,...) at 
traverse_visitbp+0xe47
traverse_visitbp(cb884bf4,cb9bf840,c86c0968,c86c0ba0,0,...) at 
traverse_visitbp+0xf32
traverse_dnode(cb884bf4,0,0,0,0,...) at traverse_dnode+0x92
traverse_visitbp(0,cb618398,c86c0b50,2,cb9f1c78,...) at traverse_visitbp+0x96d
traverse_impl(0,0,cb618398,3e1,0,...) at traverse_impl+0x268
traverse_pool(cb618000,3e1,0,d,c1727830,...) at traverse_pool+0x79
spa_load(0,1,c86c0ec4,1e,0,...) at spa_load+0x1dde
spa_load(0,0,c13d8c94,1,3,...) at spa_load+0x11a5
spa_load_best(0,,,1,c0adc395,...) at spa_load_best+0x71
spa_open_common(c17e0e1e,0,0,c86c1190,c16f5a1c,...) at spa_open_common+0x11a
spa_open(c86c1078,c86c1074,c17e0e1e,c135d918,c1fd7798,...) at spa_open+0x27
dsl_dir_open_spa(0,cb770030,c17e11b1,c86c11f8,c86c11f4,...) at 
dsl_dir_open_spa+0x6c
dsl_dataset_hold(cb770030,cb613800,c86c1240,cb613800,cb613800,...) at 
dsl_dataset_hold+0x3a
dsl_dataset_own(cb770030,0,cb613800,c86c1240,c1684e30,...) at 
dsl_dataset_own+0x21
dmu_objset_own(cb770030,2,1,cb613800,c86c1290,...) at dmu_objset_own+0x2a
zfsvfs_create(cb770030,c86c13ac,c17ee05d,681,0,...) at zfsvfs_create+0x4c
zfs_mount(cb78ed20,c17f411c,c9ff4600,c89cae80,0,...) at zfs_mount+0x42c
vfs_donmount(c89efbc0,4000,0,c86c1790,cb6c0800,...) at vfs_donmount+0xc6d
kernel_mount(cb7700b0,4000,0,0,1,...) at kernel_mount+0x6b
parse_mount(cb7700e0,c1194498,0,1,0,...) at parse_mount+0x606
vfs_mountroot(c13d95b0,4,c105c042,2bb,0,...) at vfs_mountroot+0x6cf
start_init(0,c86c1d08,c105e94c,3db,0,...) at start_init+0x6a
fork_exit(c0a42090,0,c86c1d08) at fork_exit+0x7f
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc86c1d40, ebp = 0 ---
db


43 deep (before the double fault) is disgusting, but even if clang has
broken stack alignment due to a wrong default and no
-mpreferred-stack-boundary to fix it, that's still only about 8*43
extra bytes (8 for the average extra stack to align to 16 bytes).
Probably zfs is also putting large data structures on the stack.

It would be useful if the stack trace printed the stack pointer
on every function call, so that you could see how much stack each
function used.

All those ', ...' printed after 5 args show further 

Re: Use of C99 extra long double math functions after r236148

2012-07-25 Thread Bruce Evans

On Wed, 25 Jul 2012, Rainer Hurling wrote:


On 25.07.2012 19:00 (UTC+2), Steve Kargl wrote:

On Wed, Jul 25, 2012 at 06:29:18PM +0200, Rainer Hurling wrote:


Many thanks to you three for implementing expl() with r238722 and r238724.

I am not a C programmer, but would like to ask if the following example
is correct and suituable as a minimalistic test of this new C99 function?


It's not clear to me what you mean by test.  If expl() is
not available in libm, then linking the code would fail.
So, testing for the existence of expl() (for example,
in a configure script) is as simple as


Sorry for not being clear enough. I didn't mean testing for the existence, 
but for some comparable output between exp() and expl(), on a system with 
expl() available in libm.


This is basically what I do to test exp() (with a few billion cases
automatically generated and compared).  It is not sufficient for
checking expl(), except for consistency.  (It is assumed that expl()
is reasonably accurate.  If it is in fact less accurate than exp(),
this tends to show up in the comparisons.)


#include <math.h>
long double
func(long double x)
{
   return (expl(x));
}


//---
#include <stdio.h>
#include <math.h>

int main(void)
{
   double c = 2.0;
   long double d = 2.0;

   double e = exp(c);
   long double f = expl(d);

   printf("exp(%f)  is %.*f\n",  c, 90, e);
   printf("expl(%Lf) is %.*Lf\n", d, 90, f);


If you mean testing that the output is correct, then
asking for 90 digits is of little use.  The following
is sufficient (and may actually produce a digit or two
more than is available in the number)


Ok, I understand. I printed the 90 digits to be able to take a look at the 
decimal places, I did not expect to get valid digits in this area.


Use binary format (%a) for manual comparison.  Don't print any more
bits than the format has.  This is DBL_MANT_DIG (53) for doubles and
LDBL_MANT_DIG (64 on x86) for long doubles.  %a format is in nybbles
and tends to group the bits into nybbles badly.  See below on reducing
problems from this.  Decimal format has to print about 3 more digits
than are really meaningful, to allow recovering the original value
uniquely.  For manual comparison, you need to print these extra digits
and manually round or ignore them as appropriate.  The correct number
of extra digits is hard to determine.  For the long double type, it is
DECIMAL_DIG (21) on x86.  The corresponding number of normally-accurate
decimal digits for long doubles is given by LDBL_DIG (18).  For
floats and doubles, this corresponds to FLT_DIG (6) and DBL_DIG (15).
Unfortunately, float.h doesn't define anything corresponding to
DECIMAL_DIG for the smaller types.  21 is a lot of digits and noise
digits take a long time to determine and ignore (it's worse on sparc64
where DECIMAL_DIG is 36).  I usually add 2 extra digits to the number
of normally-accurate digits.  This is sloppy.  3 is needed in some
cases, depending on MANT_DIG and the bits in log(2) and/or log(10).


troutmask:fvwm:kargl[203] diff -u a.c.orig a.c
--- a.c.orig2012-07-25 09:38:31.0 -0700
+++ a.c 2012-07-25 09:40:36.0 -0700
@@ -1,5 +1,6 @@
  #include <stdio.h>
  #include <math.h>
+#include <float.h>

  int main(void)
  {
@@ -9,8 +10,8 @@
double e = exp(c);
long double f = expl(d);

-  printf("exp(%f)  is %.*f\n",  c, 90, e);
-  printf("expl(%Lf) is %.*Lf\n", d, 90, f);
+  printf("exp(%f)  is %.*f\n",  c, DBL_DIG+2, e);
+  printf("expl(%Lf) is %.*Lf\n", d, LDBL_DIG+2, f);

return 0;
  }


Thanks, I was not aware of DBL_DIG and LDBL_DIG.


Steve is sloppy and adds 2 also :-).  For long doubles, it is clear that
3 are strictly needed, since DECIMAL_DIG is 3 more.

For most long double functions on i386, you need to switch the rounding
precision to 64 bits around calls to them, and also to do any operations
on the results except printing them.  expl() is one of the few large
functions that does the switch internally.  So the above should work
(since it only prints), but (expl(d) + 0) should round to the default
53-bit precision and thus give the same result as exp(d).


If you actually want to test expl() to see if it is producing
a decent result, you need a reference solution that contains
a higher precision.  I use mpfr with 256 bits of precision.

troutmask:fvwm:kargl[213] ./testl -V 2
ULP = 0.3863
   x = 2.00e+00
libm: 7.389056098930650227e+00 0x1.d8e64b8d4ddadcc4p+2
mpfr: 7.389056098930650227e+00 0x1.d8e64b8d4ddadcc4p+2
mpfr: 7.3890560989306502272304274605750078131803155705518\
   47324087127822522573796079054e+00
mpfr: 
0x7.63992e35376b730ce8ee881ada2aeea11eb9ebd93c887eb59ed77977d109f148p+0


The 1st 'mpfr:' line is produced after converting the results
of mpfr_exp() to long double.  The 2nd 'mpfr:' line is
produced by mpfr_printf() where the number of printed
digits depends on the 256-bit precision.  The last 'mpfr:'
line is mpfr_printf()'s hex formatting.  Unfortunately, it
does not normalize the hex 

Re: Use of C99 extra long double math functions after r236148

2012-07-25 Thread Bruce Evans

On Wed, 25 Jul 2012, Stephen Montgomery-Smith wrote:


On 07/25/12 12:31, Steve Kargl wrote:

On Wed, Jul 25, 2012 at 12:27:43PM -0500, Stephen Montgomery-Smith wrote:

Just as a point of comparison, here is the answer computed using
Mathematica:

N[Exp[2], 50]
7.3890560989306502272304274605750078131803155705518

As you can see, the expl solution has only a few digits more accuracy
that exp.


Unless you are using sparc64 hardware.

flame:kargl[204] ./testl -V 2
ULP = 0.2670 for x = 2.0e+00
mpfr exp: 7.389056098930650227230427460575008e+00
libm exp: 7.389056098930650227230427460575008e+00



Yes.  It would be nice if long double on the Intel was as long as on the sparc64.

You want it to be as slow as sparc64?  (About 300 times slower, after
scaling the CPU clock rates.  Doubles on sparc64 are less than 2 times
slower.)

I forgot to mention in a previous reply is that expl has only a few
more decimal digits of accuracy than exp because the extra precision
on x86 wasn't designed to give much more accuracy.  It was designed
to give more chance of full double precision accuracy in naive code.
It was designed in ~1980 when bits were expensive and the extra 11
provided by the 8087 were considered the best tradeoff between cost
and accuracy.  They only provide 2-3 extra decimal digits of accuracy.
They are best thought of as guard bits.  Floating point uses 1 or 2
guard bits internally.  11 extends that significantly and externalizes
it, but is far from doubling the number of bits.  Their use to provide
extra precision was mostly defeated in C by bad C bindings and
implementations.  This was consolidated by my not using the extra bits
for the default rounding precision in FreeBSD.  This has been further
consolidated by SSE not supporting extended precision.  Now the naive
code that uses doubles never gets the extra precision on amd64.  Mixing
of long doubles with doubles is much slower with SSE+i387 than with
i387, since the long doubles are handled in different registers and
must be translated with SSE+i387, while with i387, using long doubles
is almost free (it actually has a negative cost in non-naive code since
it allows avoiding extra precision in software).  Thus SSE also inhibits
using the extra precision intentionally.

Bruce


Re: [head tinderbox] failure on i386/i386

2012-05-22 Thread Bruce Evans

On Tue, 22 May 2012, FreeBSD Tinderbox wrote:


[...]
from /obj/i386.i386/src/tmp/usr/include/sys/_types.h:33,
from /obj/i386.i386/src/tmp/usr/include/stdio.h:41,
from /src/sbin/devd/parse.y:33:
/obj/i386.i386/src/tmp/usr/include/x86/_types.h:51: error: expected '=', ',', 
';', 'asm' or '__attribute__' before 'typedef'
/obj/i386.i386/src/tmp/usr/include/x86/_types.h:96: error: expected '=', ',', 
';', 'asm' or '__attribute__' before '__int_least8_t'
cc1: warnings being treated as errors
/src/sbin/devd/parse.y: In function 'yyparse':
/src/sbin/devd/parse.y:103: warning: implicit declaration of function 
'add_attach'


Another bug in the new yacc is that it uses hard-coded GNUisms
like __attribute__(()) (maybe firm-coded by autoconfig) instead of
hard-coded FreeBSDisms like __printflike().

But this is not the bug here.  devd.h is just included in a wrong order
(before its prerequisites) in parse.y.  This worked accidentally because
old yacc includes sufficient namespace pollution earlier.

Bruce


Re: Some performance measurements on the FreeBSD network stack

2012-04-21 Thread Bruce Evans

On Fri, 20 Apr 2012, K. Macy wrote:


On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo ri...@iet.unipi.it wrote:



The small penalty when flowtable is disabled but compiled in is
probably because the net.flowtable.enable flag is checked
a bit deep in the code.

The advantage with non-connect()ed sockets is huge. I don't
quite understand why disabling the flowtable still helps there.


Do you mean having it compiled in but disabled still helps
performance? Yes, that is extremely strange.


This reminds me that when I worked on this, I saw very large throughput
differences (in the 20-50% range) as a result of minor changes in
unrelated code.  I could get these changes intentionally by adding or
removing padding in unrelated unused text space, so the differences were
apparently related to text alignment.  I thought I had some significant
micro-optimizations, but it turned out that they were acting mainly by
changing the layout in related used text space where it is harder to
control.  Later, I suspected that the differences were more due to cache
misses for data than for text.  The CPU and its caching must affect this
significantly.  I tested on an AthlonXP and Athlon64, and the differences
were larger on the AthlonXP.  Both of these have a shared I/D cache so
pressure on the I part would affect the D part, but in this benchmark
the D part is much more active than the I part so it is unclear how
text layout could have such a large effect.

Anyway, the large differences made it impossible to trust the results
of benchmarking any single micro-benchmark.  Also, ministat is useless
for understanding the results.  (I note that luigi didn't provide any
standard deviations and neither would I. :-).  My results depended on
the cache behaviour but didn't change significantly when rerun, unless
the code was changed.

Bruce


Re: strange ping response times...

2012-04-12 Thread Bruce Evans

On Thu, 12 Apr 2012, Luigi Rizzo wrote:


On Thu, Apr 12, 2012 at 01:18:59PM +1000, Bruce Evans wrote:

On Wed, 11 Apr 2012, Luigi Rizzo wrote:


On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote:

...

ping takes a timestamp in userspace before trying to transmit
the packet, and then the timestamp for the received packet
is recorded in the kernel (in the interrupt or netisr thread
i believe -- anyways, not in userspace).


No, all timestamps used by ping are recorded in userland.


Bruce, look at the code in ping.c -- SO_TIMESTAMP is defined,
so the program does (successfully) a

setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, on, sizeof(on));

and then (verified that at runtime the code follows this path)
...


Indeed it does.  This accounts for the previously unaccounted for
binuptime call in the kernel.

This is part of fenner's change (perhaps the most important part) to
put the timestamping closer to the actual i/o, so that the RTT doesn't
count setup overheads.  Timestamping still works when SO_TIMESTAMP is
undefed (after fixing the ifdefs on it).  Not using it costs only 1
more gettimeofday() call in ping, but it increases my apparent RTT
from 7-8 usec to 10-11 usec.  3 extra usec seems a lot for the overhead.

I found that ping -f saves 1 gettimeofday() call per packet, and my
version saves another 1.

-current ping -q  localhost: 4
my   ping -q  localhost: 3
-current ping -fq localhost: 3
my   ping -fq localhost: 2

1 gettimeofday() call is needed for putting a timestamp in the output
packet.  Apparently, this is not well integrated with the bookkeeping
for select(), and up to 3 more gettimeofday() calls are used.  select()
timeouts only have 1/HZ granularity, so the gettimeofday() calls to
set them up could use clock_gettime() with CLOCK_REALTIME_FAST_N_BROKEN,
but this would be a bogus optimization since most of the overhead is in
the syscall provided the timecounter hardware is not slow (it takes 9-80
cycles to read TSC timecounter hardware and 600-800 cycles for
clock_gettime() with a TSC timecounter).

I just noticed that CLOCK_MONOTONIC cannot be used for the timestamp
in the output packet if SO_TIMESTAMP is used for the timestamp in the
input packet, since SO_TIMESTAMP uses CLOCK_REALTIME for historical
reasons.

There are 2 select()s per packet.  Perhaps the number of gettimeofday()s
can be reduced to 1 per packet in most cases (get one for the output packet
and use it for both select()s).  With my version the truss trace for
ping localhost is:

% 64 bytes from 127.0.0.1: icmp_seq=6 ttl=64 time=0.473 ms
% write(1,0x80d4000,57)  = 57 (0x39)
% gettimeofday({1334226305 409249},0x0)  = 0 (0x0)

Need this for the next select().  There is a ~1 second pause here.

% select(4,{3},0x0,0x0,{0 989607})   = 0 (0x0)

Truss doesn't show this until select() returns ~1 second later.  The
gettimeofday() call was needed because we don't simply use a 1 second
pause, but adjust for overheads.  My version uses a fancier adjustment.

% gettimeofday({1334226306 408632},0x0)  = 0 (0x0)

We need this accurately to put in the packet.  But all the other timestamps
can be derived from this for the non-flood case.  We can just try to send
every `interval' and adjust the timeouts a bit when we see that we get
here a bit early or late, and when we get here more than a bit early or
late we can either recalibrate or adjust by more.

Note that this gettimeofday() returned 1.0 - 0.000617 seconds after the
previous one, although we asked for a timeout of ~1/HZ = 0.001 seconds.
Select timeouts have a large granularity and we expect errors of O(1/HZ)
and mostly compensate for them.  The drift without compensation would be
1% with HZ = 100 and -i 1.0, and much larger with -i small.  My
version compensates more accurately than -current.

% sendto(0x3,0x80b8d34,0,0x0,{ AF_INET 127.0.0.1:0 },0x10) = 64 (0x40)
% gettimeofday({1334226306 408831},0x0)  = 0 (0x0)
% select(4,{3},0x0,0x0,{0 990025})   = 1 (0x1)
% recvmsg(0x3,0xbfbfe8c0,0x0)= 84 (0x54)
% gettimeofday({1334226306 409104},0x0)  = 0 (0x0)

I don't know what this is for.  We got a timestamp in the returned packet,
and use it.  Since this timestamp has microsecond resolution and select()
only has 1/HZ resolution, this timestamp should be more than good enough
for sleeping for the interval.

% 64 bytes from 127.0.0.1: icmp_seq=7 ttl=64 time=0.472 ms
% write(1,0x80d4000,57)  = 57 (0x39)

Next packet:

% gettimeofday({1334226306 409279},0x0)  = 0 (0x0)
% ...

But select() timeouts are not needed at all.  Versions before fenner's
changes used a select() timeout for flood pings.  alarm() was used to
generate other timeouts.  The alarm() code was not very good.  It did
a lot of work in the signal handler to set up the next alarm (1 call
to signal() and 1 call to alarm()), and it did unsafe

Re: strange ping response times...

2012-04-11 Thread Bruce Evans

On Wed, 11 Apr 2012, Luigi Rizzo wrote:


On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote:

On 11.04.2012 13:00, Luigi Rizzo wrote:

On Wed, Apr 11, 2012 at 12:35:10PM +0200, Andre Oppermann wrote:

On 11.04.2012 01:32, Luigi Rizzo wrote:

Things going through loopback go through a NETISR and may
end up queued to avoid LOR situations.  In addition per-cpu
queues with hash-distribution for affinity may cause your
packet to be processed by a different core.  Hence the additional
delay.


so you suggest that the (de)scheduling is costing several microseconds ?


Not directly.  I'm just trying to explain what's going on to
get a better idea where it may go wrong.

There may be a poor ISR/scheduler interaction that causes the packet
to be processed only on the next tick or something like that.  I don't
have a better explanation for this.


It's certainly abysmally slow.  Just the extra context switching made
in FreeBSD-5 made the RTT for pinging localhost 3-4 times slower than
in FreeBSD-3 in old tests (I compared with FreeBSD-3 instead of
FreeBSD-4 since general bloat had already made FreeBSD-4 significantly
slower, although not 3-4 times).  Direct dispatch of netisrs never did
anything good in old tests, and the situation doesn't seem to have
improved -- you now need an i7 2600 (SMP?) to get the same speed as
my Athlon64 2000 (UP) in the best cases for both (2-3 usec RTT).  SMP
and multiple cores give more chances for scheduler pessimizations.


ok, some final remarks just for archival purposes
(still related to the loopback ping)

ping takes a timestamp in userspace before trying to transmit
the packet, and then the timestamp for the received packet
is recorded in the kernel (in the interrupt or netisr thread
i believe -- anyways, not in userspace).


No, all timestamps used by ping are recorded in userland.
IIRC, there is no kernel timestamping at all for ping packets, unless
ping is invoked with -M time to make it use ICMP_TSTAMP, and
ICMP_TSTAMP gives at best milliseconds resolution so it is useless
for measuring RTTs in the 2-999 usec range.
   (ICMP_TSTAMP uses iptime(), and the protocol only supports
   milliseconds resolution, which was good enough for 1 Mbps ethernet.
   iptime() is more broken than that (except in my version), since it
   uses getmicrotime() instead of microtime().  getmicrotime() gives
   at best 1/HZ resolution, so it is not even good enough for 1 Mbps
   ethernet when HZ is small, and now it may give extra inaccuracies
   from stopping the 1/HZ clock while in sleep states.)

This reminds me that slow timecounters make measuring small differences
in times difficult.  It can take longer to read the timecounter than
the entire RTT.  I tested this by pessimizing kern.timecounter.hardware
from TSC to i8254.  On my test system, clock_gettime() with
CLOCK_MONOTONIC takes an average of 273 nsec with the TSC timecounter
and 4695 nsec with the i8254 timecounter.  ping uses gettimeofday()
which is slightly slower and more broken (since it uses CLOCK_REALTIME).
My normal ping -fq localhost RTT is 2-3 usec
(closer to 3; another bug in this area is that the timestamps only
have microseconds resolution so you can't see if 3 is actually
more like 2.5.  I was thinking of changing the resolution to
nanoseconds 8-10 years ago, before the FreeBSD-5 pessimizations
and CPU speeds hitting a wall made this not really necessary),
but the kernel I'm testing with uses ipfw which bloats the RTT to 8-9
usec.  Then kern.timecounter.hardware=i8254 bloats the RTT to 24-25!
That's 16 usec extra, enough for the extra overhead of 4
gettimeofday() calls.  Timecounter statistics confirm that there are
many more than 2 timecounter calls per packet:
- 7 binuptime calls per packet.  That's the hardware part that is
  very slow with an i8254 timecounter.  It apparently takes more
  like 3000 nsec than 4695 nsec (to fit 7 in 24-25 usec).
- 3 bintime calls per packet.  bintime calls binuptime, so this
  accounts for 3 of the above 7.  The other 4 are apparently for
  context switching.  There are 2 context switches per packet :-(.
  I can't explain why there are apparently 2 timestamps per
  context switch.
  (Note that -current uses the inferior cputicker mechanism
  instead of timecounters for timestamping context switches.
  It does this because some timecounters are very slow.  But
  when the timecounter is the TSC, binuptime() only takes
  a few cycles more than cpu_ticks().  (The above time of
  273 nsec for reading the TSC timecounter is from userland.
  The kernel part takes only about 30 nsec, while cpu_ticks()
  might take 15 nsec.)  So -current wouldn't be pessimized for
  this part by changing the timecounter to i8254, but without
  the pessimization it would be only a few nsec faster than
  old kernels provided the timecounter 

Re: Potential deadlock on mbuf

2012-04-03 Thread Bruce Evans

On Tue, 3 Apr 2012, Andre Oppermann wrote:


On 02.04.2012 18:21, Alexandre Martins wrote:

Dear,

I am currently having trouble with a basic socket stress test.

The socket are setup to use non-blocking I/O.

During this stress test, the kernel is driven into mbuf exhaustion; the
goal is to see the system limits.

If the program makes a write on a socket during this mbuf exhaustion, it
becomes blocked in the write system call.  The status of the process is
zonelimit and all network I/O falls into timeout.

I have found the root cause of the blocking:
http://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?view=markup#l1279

So, the question is: why is m_uiotombuf called with a blocking parameter
(M_WAITOK) even if it is for a non-blocking socket?

Then, if M_NOWAIT is used, maybe it would be useful to return an 'ENOMEM'
error.


I'm surprised you can even see blocking of malloc(... M_WAITOK).
O_NONBLOCK is mostly for operations that might block for a long time,
but malloc() is not expected to block for long.  Regular files are
always so non-blocking that most file systems have no references to
O_NONBLOCK (or FNONBLOCK), but file systems often execute memory
allocation code that can easily block for as long as malloc() does.
When malloc() starts blocking for a long time, lots of things will
fail.


This is a bit of a catch-22 we have here.  Trouble is that when
we return with EAGAIN the next select/poll cycle will tell you
that this and possibly other sockets are writeable again, when in
fact they are not due to kernel memory shortage.  Then the application
will tightly loop around sockets that are reported writeable but are
not actually writeable.
It's about the interaction of write with O_NONBLOCK and select/poll
on the socket.


This would be difficult to handle better.


Do you have any references how other OSes behave, in particular
Linux?

I've added bde@ as our resident standards compliance expert.
Hopefully he can give us some more insight on this issue.


Standards won't say what happens at this level of detail.

Blocking for network i/o is still completely broken at levels below
sockets AFAIK.  I (and ttcp) mainly wanted it to work for send() of
udp.  I saw no problems at the socket level, but driver queues just
filled up and send() returned ENOBUFS.  I wanted either the opposite
of O_NONBLOCK (block until !ENOBUFS), or at least for select() to work
for waiting until !ENOBUFS.  But select() doesn't work at all for this.
It seemed to work better in Linux.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-25 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Fri, 23 Dec 2011, Alexander Best wrote:

...

the gcc(1) man page states the following:


This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.


the comment in sys/conf/kern.mk however sorta suggests that the default
alignment of 4 bytes might improve performance.


The default stack alignment is 16 bytes, which unimproves performance.


maybe the part of the comment in sys/conf/kern.mk which mentions that a
stack alignment of 16 bytes might improve micro benchmark results should
be removed.  this would prevent people (like me) from thinking that using
a stack alignment of 4 bytes is a compromise between size and efficiency.
it isn't! currently a stack alignment of 16 bytes has no advantages over
one with 4 bytes on i386.


I think the comment is clear enough.  It mentions all the tradeoffs.
It is only slightly cryptic in saying that these are tradeoffs and that
the configuration is our best guess at the best tradeoff -- it just says
"while" for both.  It goes without saying that we don't use our worst
guess.  Anyone wanting to change this should run benchmarks and beware
that micro-benchmarks are especially useless.  The changed comment is not
so good since it no longer mentions micro-benchmarks or says "while".


if micro benchmark results aren't of any use, why should the claim that the
default stack alignment of 16 bytes might produce better outcome stay?


Because:
- the actual claim is the opposite of that (it is that the default 16-byte
  alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit micro-benchmarks
  is true, even without the weaselish miswording of might in it.  There
  is always at least 1 micro-benchmark that will benefit from almost any
  change, and here we expect a benefit in many microbenchmarks that don't
  bust the caches.  Except, 16-byte alignment isn't supported (*) in the
  kernel, so we actually expect a loss from many microbenchmarks that
  don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the
  default is better because it is better in microbenchmarks.


it doesn't seem as if anybody has micro benchmarked 16 bytes vs. 4 bytes stack
alignment, until now. so the micro benchmark statement in the comment seems to
be pure speculation.


No, it is obviously true.


even worse...it indicates that by removing the
-mpreferred-stack-boundary=2 flag, one can gain a performance boost by
sacrifying a few more bytes of kernel (and module) size.


No, it is part of the sentence explaining why removing the
-mpreferred-stack-boundary=2 flag will probably regain the overall loss
that is avoided by using the flag.


this suggests that the behavior of -mpreferred-stack-boundary=2 vs. not
specifying it loosely equals the semantics of -Os vs. -O2.


No, -Os guarantees slower execution by forcing optimization to prefer
space savings over time savings in more ways.  Except, -Os is completely
broken in -current (in the kernel), and gives very large negative space
savings (about 50%).  It last worked with gcc-3.  Its brokenness with
gcc-4 is related to kern.pre.mk still specifying -finline-limit flags
that are more suitable for gcc-3 (gcc has _many_ flags for giving more
delicate control over inlining, and better defaults for them) and
excessive inlining in gcc-4 given by -funit-at-a-time
-finline-functions-called-once.  These apparently cause gcc's inliner
to go insane with -Os.  When I tried to fix this by reducing inlining,
I couldn't find any threshold that fixed -Os without breaking inlining
of functions that are declared inline.

(*) A primary part of the lack of support for 16-byte stack alignment in
the kernel is that there is no special stack alignment for the main kernel
entry point, namely syscall().  From i386/exception.s:

%   SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it (it was 16-byte aligned before
that).

%   pushl   $2  /* sizeof int 0x80 */
%   subl$4,%esp /* skip over tf_trapno */
%   pushal
%   pushl   %ds
%   pushl   %es
%   pushl   %fs
%   SET_KERNEL_SREGS
%   cld
%   FAKE_MCOUNT(TF_EIP(%esp))
%   pushl   %esp

We push 14 more words.  This gives perfect misalignment to the worst odd
word boundary (perfect if only word boundaries are allowed).  gcc wants
the stack to be aligned to a 4*n word boundary before function calls,
but here we have a 4*n+3 word boundary.  (4*n+3 is worse than 4*n+1
since 2 more words instead of 4 will cross the next 16-byte boundary).

% 	call	syscall

Using the default -mpreferred

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Fri, 23 Dec 2011, Alexander Best wrote:


is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer?
i built GENERIC (including modules) with and without that flag. the results
are:


The same as it has always been.  It avoids some bloat.


1654496 bytes with the flag set
vs.
1654952 bytes with the flag unset


I don't believe this.  GENERIC is enormously bloated, so it has size
more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
Changing the default target arch from i386 to pentium-undocumented has
reduced the text space savings a little, since the default for passing
args is now to preallocate stack space for them and store to this,
instead of to push them; this preallocation results in more functions
needing to allocate some stack space explicitly, and when some is
allocated explicitly, the text space cost for this doesn't depend on
the size of the allocation.

Anyway, the savings are mostly from avoiding cache misses caused by
sparse allocation on stacks.

Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
- KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
  stack might push something over the edge
- not much care is taken to align the initial stack or to keep the
  stack aligned in calls from asm code.  E.g., any alignment for
  mi_startup() (and thus proc0?) is accidental.  This may result
  in perfect alignment or perfect misalignment.  Hopefully, more
  care is taken with thread startup.  For gcc, the alignment is
  done bogusly in main() in userland, but there is no main() in
  the kernel.  The alignment doesn't matter much (provided the
  perfect misalignment is still to a multiple of 4), but when it
  matters, the random misalignment that results from not trying to
  do it at all is better than perfect misalignment from getting it
  wrong.  With 4-byte alignment, the only cases that it helps are
  with 64-bit variables.


the gcc(1) man page states the following:


This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.


the comment in sys/conf/kern.mk however sorta suggests that the default
alignment of 4 bytes might improve performance.


The default stack alignment is 16 bytes, which unimproves performance.

clang handles stack alignment correctly (only does it when it is needed)
so it doesn't need a -mpreferred-stack-boundary option and doesn't
always break without alignment in main().  Well, at least it used to,
IIRC.  Testing it now shows that it does the necessary andl of the
stack pointer for __aligned(32), but for __aligned(16) it now assumes
that the stack is aligned by the caller.  So it now needs
-mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
do the andl in main() like gcc does (unless you put a dummy __aligned(32)
there), but requires crt to pass an aligned stack.

Bruce


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Fri, 23 Dec 2011, Adrian Chadd wrote:


Well, the whole kernel is bloated at the moment, sorry.

I've been trying to build the _bare minimum_ required to bootstrap
-HEAD on these embedded boards and I can't get the kernel down below 5
megabytes - ie, one with FFS (with options disabled), MIPS, INET (no
INET6), net80211, ath (which admittedly is big, but I need it no
matter what, right?) comes in at:

-r-xr-xr-x  1 root  wheel   5307021 Nov 29 19:14 kernel.LSSR71

And with INET6, on another board (and this includes MSDOS and the
relevant geom modules):

-r-xr-xr-x  1 root  wheel   5916759 Nov 28 12:00 kernel.RSPRO

.. honestly, that's what should be addressed. That's honestly a bit ridiculous.


It's disgusting, but what problems does it cause apart from minor slowness
from cache misses?

I used to monitor the size of a minimal i386 kernel:

% machine   i386
% cpu   I686_CPU
% ident MIN
% options   SCHED_4BSD

In FreeBSD-5-CURRENT between 5.1R and 5.2R, this had size:

   textdata bss dec hex filename
 931241   86524   62356 1080121  107b39 /sysc/i386/compile/min/kernel

A minimal kernel is not useful, but maybe you can add some i/o to it
without bloating it too much.

This almost builds in -current too.  I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD-9.  The kernel .mk files are
  still broken (depend on nonstandard/new features in sys.mk).
- comment out a line in if.c that refers to V_loif.  if.c is standard
  but the loop device is optional.

A few more changes to remove non-minimalities that are not defaults
made little difference:

% machine   i386
% cpu   I686_CPU
% ident MIN
% options   SCHED_4BSD
% 
% # XXX kill default misconfigurations.

% makeoptions   NO_MODULES=yes
% makeoptions   COPTFLAGS=-O -pipe
% 
% # XXX from here on is to try to kill everything in DEFAULTS.
% 
% # nodevice		isa	# needed for DELAY...

% # nooptions   ISAPNP  # needed ...
% 
% nodevice		npx
% 
% nodevice		mem

% nodevice  io
% 
% nodevice		uart_ns8250
% 
% nooptions 	GEOM_PART_BSD

% nooptions GEOM_PART_EBR
% nooptions GEOM_PART_EBR_COMPAT
% nooptions GEOM_PART_MBR
% 
% # nooptions 	NATIVE		# needed ...

% # nodevice	atpic	# needed ...
% 
% nooptions 	NEW_PCIB
% 
% nooptions		VFS_ALLOW_NONMPSAFE


   textdata bss dec hex filename
1663902  110632  136892 1911426  1d2a82 kernel

(This was about 100K larger with -O2 and all DEFAULTS).  The bloat since
FreeBSD-5 is only 70%.

Here are some sizes for my standard kernel (on i386).  The newer
versions have about the same number of features since they don't support
so many old isa devices or so many NICs:

   textdata bss dec hex filename
1483269  106972  172524 1762765  1ae5cd FreeBSD-3/kernel
1917408  157472  194228 2269108  229fb4 FreeBSD-4/kernel
2604498  198948  237720 3041166  2e678e FreeBSD-5.1.5/kernel
2833842  206856  242936 3283634  321ab2 FreeBSD-5.1.5/kernel-with-acpi
2887573  192456  288696 3368725  336715 FreeBSD-5.1.5/kernel
with my changes, -O2 and usb
added relative to the above
2582782  195756  298936 3077474  2ef562 previous, with some excessive
inlining avoided, and without -O2,
and with ipfilter
1998276  159436  137748 2295460  2306a4 kernel.4
a more up to date and less hacked on
FreeBSD-4
4365549  262656  209588 4837793  49d1a1 kernel.7
4406155  266496  496532 5169183  4ee01f kernel.7.invariants
3953248  242464  207252 4402964  432f14 kernel.7.noacpi
4418063  268288  240084 4926435  4b2be3 kernel.7.smp
various fairly stock FreeBSD-7R
kernels
3669544  262848  249712 4182104  3fd058 kernel.c
4174317  258240  540144 4972701  4be09d kernel.c.invariants
3964455  250656  249808 4464919  442117 kernel.c.noacpi
3213928  240160  240596 3694684  38605c kernel.c.noacpi-ule
4285040  268288  286160 4839488  49d840 kernel.c.smp
current before FreeBSD-8R
not all built at the same time or
with the same options.  The 20%
bloat between kernel.c.noacpi.ule
and kernel.c.noacpi is mainly
from not killing the default of
-O2.
4742714  315008  401692 5459414  534dd6 kernel.8
4816900  319200 1813916 6950016  6a0c80 kernel.8.invariants
4490209  304832  395260 5190301  4f329d kernel.8.noacpi
4795475  323680  475420 5594575  555dcf kernel.8.smp
  

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Fri, 23 Dec 2011, Alexander Best wrote:


is -mpreferred-stack-boundary=2 really necessary for i386 builds any
longer?
i built GENERIC (including modules) with and without that flag. the results
are:


The same as it has always been.  It avoids some bloat.


1654496 bytes with the flag set
vs.
1654952 bytes with the flag unset


I don't believe this.  GENERIC is enormously bloated, so it has size
more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes


i'm sorry. i used du(1) to get those numbers, so i believe those numbers
represent the amount of 512-byte blocks. if i'm correct GENERIC is even
more bloated than you feared and almost reaches 1GB:

807.859375  megabytes with flag set
vs.
808.0820313 megabytes without the flag set


That's certainly bloated.  It counts all object files and modules, and
probably everything is compiled with -g.  I only counted kernel text
size.

Bruce


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

This almost builds in -current too.  I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD-9.  The kernel .mk files are
  still broken (depend on nonstandard/new features in sys.mk).


strange. the build(7) man page claims that:


WITH_CTF  If defined, the build process will run the DTrace CTF
  conversion tools on built objects.  Please note that
  this WITH_ option is handled differently than all other
  WITH_ options (there is no WITHOUT_CTF, or correspond-
  ing MK_CTF in the build system).


... so setting MK_CTF to anything shouldn't have any effect (according to the man page).


MK_CTF is an implementation detail.  It is normally set in bsd.own.mk
(not in sys.mk like I said -- this gives another, much larger bug (*)).
But when usr/share/mk is old, it doesn't know anything about MK_CTF.
(For example, in FreeBSD-9, sys.mk sets NO_CTF to 1 if WITH_CTF is not
defined.  This corresponds to bsd.own.mk in -current setting MK_CTF
to no if WITH_CTF is not defined.  Go back to an older version of
FreeBSD and /usr/share/mk/* won't know anything about any CTF variable.)
So when you try to build a current kernel under an old version of
FreeBSD, MK_CTF is used uninitialized and the build fails.  (Of course,
you build kernels normally and don't use the bloated buildkernel
method.)  The bug is in the following files:

kern.post.mk:.if ${MK_CTF} != no
kern.pre.mk:.if ${MK_CTF} != no
kmod.mk:.if defined(MK_CTF) && ${MK_CTF} != no

except for the last one where it has been fixed.

(*) Well, not completely broken, but just annoyingly unportable.
Consider the following makefile:

%%%
foo: foo.c
%%%

Invoking this under FreeBSD-9 gives:

%%%
cc -O2 -pipe   foo.c  -o foo
[ -z ctfconvert -o -n 1 ] ||  (echo ctfconvert -L VERSION foo &&
ctfconvert -L VERSION foo)
%%%

This is the old ctf method.  It is ugly but is fairly portable.

Invoking this under FreeBSD-9 but with -mpath-to-current-mk-directory gives

%%%
cc -O2 -pipe   foo.c  -o foo
${CTFCONVERT_CMD} expands to empty string
%%%

This is because:
- the rule in sys.mk says ${CTFCONVERT_CMD}
- CTFCONVERT_CMD is normally defined in bsd.own.mk.  But bsd.own.mk is only
  included by BSD makefiles.  It is never included by portable makefiles.
  So ${CTFCONVERT_CMD} is used uninitialized.
- for some reason, using variables uninitialized is not fatal in this
  context, although it is for the comparisons of ${MK_CTF} above.
- ${CTFCONVERT_CMD} is replaced by the empty string.  Old versions of
  make warn about the use of an empty string as a shell command.
- the code that is supposed to prevent the previous warning is in
  bsd.own.mk, where it is not reached for portable makefiles.  It is:

% .if ${MK_CTF} != no
% CTFCONVERT_CMD=   ${CTFCONVERT} ${CTFFLAGS} ${.TARGET}

This uses the full ctfconvert if WITH_CTF.

% .elif ${MAKE_VERSION} >= 520300
% CTFCONVERT_CMD=

make(1) has been modified to not complain about the empty string.  The
version test detects which versions of make don't complain.

% .else
% CTFCONVERT_CMD=   @:

The default is to generate this non-empty string and an extra shell command
to execute it, for old versions of make.

% .endif

But none of this works for portable makefiles, since it is not reached.

Bruce


Re: SCHED_ULE should not be the default

2011-12-18 Thread Bruce Evans

On Wed, 14 Dec 2011, Ivan Klymenko wrote:


On Wed, 14 Dec 2011 00:04:42 +0100
Jilles Tjoelker <jil...@stack.nl> wrote:


On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:

If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
my case (Core2Duo) partially helps the following patch:
--- sched_ule.c.orig2011-11-24 18:11:48.0 +0200
+++ sched_ule.c 2011-12-10 22:47:08.0 +0200
...
@@ -2118,13 +2119,21 @@
struct td_sched *ts;

THREAD_LOCK_ASSERT(td, MA_OWNED);
+   if (td->td_pri_class & PRI_FIFO_BIT)
+   return;
+   ts = td->td_sched;
+   /*
+* We used up one time slice.
+*/
+   if (--ts->ts_slice > 0)
+   return;


This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.

It looks wrong but it is a data point if it helps your workload.


Yes, I did it to delay for as long as possible the execution of the code
in this section:


I don't understand what you are doing here, but recently noticed that
the timeslicing in SCHED_4BSD is completely broken.  This bug may be a
feature.  SCHED_4BSD doesn't have its own timeslice counter like ts_slice
above.  It uses `switchticks' instead.  But switchticks hasn't been usable
for this purpose since long before SCHED_4BSD started using it for this
purpose.  switchticks is reset on every context switch, so it is useless
for almost all purposes -- any interrupt activity on a non-fast interrupt
clobbers it.

Removing the check of ts_slice in the above and always returning might
give a similar bug to the SCHED_4BSD one.

I noticed this while looking for bugs in realtime scheduling.  In the
above, returning early for PRI_FIFO_BIT also skips most of the periodic
functionality.  In SCHED_4BSD, returning early is the usual case, so
the PRI_FIFO_BIT might as well not be checked, and it is the unusual
fifo scheduling case (which is supposed to only apply to realtime
priority threads) which has a chance of working as intended, while the
usual roundrobin case degenerates to an impure form of fifo scheduling
(it is impure since priority decay still works, so it is only fifo
among threads of the same priority).


...

@@ -2144,9 +2153,6 @@
if
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx; }
-   ts = td->td_sched;
-   if (td->td_pri_class & PRI_FIFO_BIT)
-   return;
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
/*
 * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
sched_priority(td);
}
/*
-* We used up one time slice.
-*/
-   if (--ts->ts_slice > 0)
-   return;
-   /*
 * We're out of time, force a requeue at userret().
 */
ts->ts_slice = sched_slice;


With the ts_slice check here before you moved it, removing it might
give buggy behaviour closer to SCHED_4BSD.


and refusal to use options FULL_PREEMPTION


4-5 years ago, I found that any form of PREEMPTION was a pessimization
for at least makeworld (since it caused too many context switches).
PREEMPTION was needed for the !SMP case, at least partly because of
the broken switchticks (switchticks, when it works, gives voluntary
yielding by some CPU hogs in the kernel.  PREEMPTION, if it works,
should do this better).  So I used PREEMPTION in the !SMP case and
not for the SMP case.  I didn't worry about the CPU hogs in the SMP
case since it is rare to have more than 1 of them and 1 will use at
most 1/2 of a multi-CPU system.


But no one has responded to my letter saying whether my patch helps or
not in the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with the SMP...
Maybe I'm in something wrong, but I want to help in solving this
problem ...


The main point of SCHED_ULE is to give better affinity for multi-CPU
systems.  But the `multi' apparently needs to be strictly more than
2 for it to break even.

Bruce

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)

2011-11-26 Thread Bruce Evans

On Sat, 26 Nov 2011, Robert Millan wrote:


On Fri, Nov 25, 2011 at 11:16:15AM -0700, Warner Losh wrote:

Hey Bruce,

These sound like good suggestions, but I'd hoped to actually go through all 
these files with a fine-toothed comb to see which ones were still relevant.  
You've found a bunch of good areas to clean up, but I'd like to humbly suggest 
they be done in a follow-on commit.


Hi,

I'm sending a new patch.  Thanks Bruce for your input.  TTBOMK this corrects
all the problems you spotted that were introduced by my patch.  It doesn't
fix pre-existing problems in the files however, except in cases where I had
to modify that line anyway.

I think it's a good compromise between my initial patch and an exhaustive
cleanup of those headers (which I'm probably not the best suited for).


It fixes most style bugs, but not some pre-existing problems, even in
cases where you had to modify the line anyway.

% Index: sys/cam/scsi/scsi_low.h
% ===
% --- sys/cam/scsi/scsi_low.h   (revision 227956)
% +++ sys/cam/scsi/scsi_low.h   (working copy)
% @@ -53,10 +53,10 @@
%  #define  SCSI_LOW_INTERFACE_XS
%  #endif   /* __NetBSD__ */
% 
% -#ifdef	__FreeBSD__

% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #define  SCSI_LOW_INTERFACE_CAM
%  #define  CAM
% -#endif   /* __FreeBSD__ */
% +#endif /* __FreeBSD__ || __FreeBSD_kernel__ */

It still has the whitespace-after tab style change for cam.

% Index: sys/dev/firewire/firewirereg.h
% ===
% --- sys/dev/firewire/firewirereg.h(revision 227956)
% +++ sys/dev/firewire/firewirereg.h(working copy)
% @@ -75,7 +75,8 @@
%  };
% 
%  struct firewire_softc {

% -#if defined(__FreeBSD__) && __FreeBSD_version >= 500000
% +#if (defined(__FreeBSD__) || defined(__FreeBSD_kernel__)) && \
% +__FreeBSD_version >= 500000
%   struct cdev *dev;
%  #endif
%   struct firewire_comm *fc;

Here is a pre-existing problem that you didn't fix on a line that you
changed.  The __FreeBSD__ ifdef is nonsense here, since __FreeBSD__
being defined has nothing to do with either whether __FreeBSD_version
is defined or whether there is a struct cdev * in the data structure.

Previously:
- defined(__FreeBSD__) means that the compiler is for FreeBSD
- __FreeBSD_version >= 500000 means that FreeBSD sys/param.h has
  been included and has defined __FreeBSD_version to a value that
  satisfies this.  It would be a bug for anything else to define
  __FreeBSD_version.  Unfortunately, there is a bogus #undef of
  __FreeBSD_version that breaks detection of other things defining
  it.
- the __FreeBSD__ part of the test has no effect except to break
  compiling this file with a non-gcc compiler.  In particular,
  it doesn't prevent errors for -Wundef -Werror.  But other ifdefs
  in this file use an unguarded __FreeBSD_version.  Thus this file
  never worked with -Wundef -Werror, and the __FreeBSD__ part has
  no effect except the breakage.

Now: as above, except:
- defined(__FreeBSD_kernel__) means that FreeBSD sys/param.h
  been included and that this header is new enough to define
  __FreeBSD_kernel__.  This has the same bug with the #undef,
  which I pointed out before (I noticed it for this but not
  for __FreeBSD_version).  And it has a style bug in its name
  which I pointed out before -- 2 underscores in its name.
  __FreeBSD_version doesn't have this style bug.  The definition
  of __FreeBSD_kernel__ has already been committed.  Is it too
  late to fix its name?
- when sys/param.h is new enough to define __FreeBSD_kernel__,
  it must be new enough to define __FreeBSD_version >= 500000.
  Thus there is now no -Wundef error.
- the __FreeBSD__ ifdef remains nonsense.  If you just removed it,
  then you wouldn't need the __FreeBSD_kernel__ ifdef (modulo the
  -Wundef error).  You didn't add the __FreeBSD_kernel__ ifdef to
  any of the other lines with the __FreeBSD_kernel__ ifdef in this
  file, apparently because the others don't have the nonsensical
  __FreeBSD__ ifdef.

The nonsense and changes to work around it make the logic for this
ifdef even more convoluted and broken than might first appear.  In
a previous patchset, you included sys/param.h to ensure that
__FreeBSD_kernel__ is defined for newer kernel sources (instead of
testing if it is defined).  Ifdefs like the above make sys/param.h
a prerequisite for this file anyway, since without knowing
__FreeBSD_version it is impossible to determine if the data structure
has new fields like the cdev in it.  sys/param.h is a prerequisite
for almost all kernel .c files, so this prerequisite should be satisfied
automatically for them, but it isn't clear what happens for user .c files.
I think the ifdef should be something like the following to enforce the
prerequisite:

#ifndef _SYS_PARAM_H_
/*
 * Here I don't support __FreeBSD_version__ to be set outside of
 * sys/param.h to hack around a 

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)

2011-11-25 Thread Bruce Evans

On Thu, 24 Nov 2011, Robert Millan wrote:


2011/11/24 Bruce Evans b...@optusnet.com.au:

Now it adds lots of namespace pollution (all of sys/param.h, including
all of its namespace pollution), just to get 1 new symbol defined.


Well, my initial patch (see mail with same subject modulo v2) didn't
have this problem.  Now that __FreeBSD_kernel__ is defined, many
#ifdefs can be simplified, but maybe it's not desirable for all of
them.  At least not until we can rely on the compiler to define this
macro.

So in this particular case maybe it's better to use the other approach?

See attachment.


That is clean enough, except for some style bugs.  (I thought of worse
ways like duplicating the logic of sys/param.h, or directing
sys/param.h to only declare version macros, or putting version macros
in a little separate param header and including that.  The latter would
be cleanest, but gives even more includes, and not worth it for this,
but it would have been better for __FreeBSD_version.  I don't like
having to recompile half the universe according to dependencies on
sys/param.h because only __FreeBSD_version__ in it changed.  Basic
headers rarely change apart from that.  BTW, a recent discussion in
the POSIX mailing list says that standardized generation of dependencies
should not generate dependencies on system headers.  This would break
the effect of putting mistakes like __FreeBSD_version__ in any system
header :-).)

% diff -ur sys.old/cam/scsi/scsi_low.h sys/cam/scsi/scsi_low.h
% --- sys.old/cam/scsi/scsi_low.h   2007-12-25 18:52:02.0 +0100
% +++ sys/cam/scsi/scsi_low.h   2011-11-13 14:12:41.121908380 +0100
% @@ -53,7 +53,7 @@
%  #define  SCSI_LOW_INTERFACE_XS
%  #endif   /* __NetBSD__ */
% 
% -#ifdef	__FreeBSD__

% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #define  SCSI_LOW_INTERFACE_CAM
%  #define  CAM
%  #endif   /* __FreeBSD__ */

This also fixes some style bugs (tab instead of space after `#ifdef').
But it doesn't fix others (tab instead of space after `#ifdef', and
comment on a short ifdef).  And it introduces a new one (the comment
on the ifdef now doesn't even match the code).

cam has a highly non-KNF style, so it may require all of these style
bugs except the comment not matching the code.  This makes it hard
for non-cam programmers to maintain.  According to grep, it prefers
a tab to a space after `#ifdef' by a ratio of 89:38 in a version
checked out a year or two ago.  But in 9.0-BETA1, the counts have
blown out and the ratio has reduced to 254:221.  The counts are
more than doubled because the first version is a cvs checkout and
the second version is a svn checkout, and it is too hard to filter
out the svn duplicates.  I guess the ratio changed because the new
ata subsystem is not bug for bug compatible with cam style.  Anyway,
there never was a consistent cam style to match.

% @@ -64,7 +64,7 @@
%  #include <dev/isa/ccbque.h>
%  #endif   /* __NetBSD__ */
% 
% -#ifdef	__FreeBSD__

% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #include <sys/device_port.h>
%  #include <sys/kdb.h>
%  #include <cam/cam.h>

Same problems, but now the ifdef is larger (but not large enough to
need a comment on its endif), so the inconsistent comment is not
visible in the patch.

% [... similarly throughout cam]

% diff -ur sys.old/contrib/altq/altq/if_altq.h sys/contrib/altq/altq/if_altq.h
% --- sys.old/contrib/altq/altq/if_altq.h   2011-03-10 19:49:15.0 
+0100
% +++ sys/contrib/altq/altq/if_altq.h   2011-11-13 14:12:41.119907128 +0100
% @@ -29,7 +29,7 @@
%  #ifndef _ALTQ_IF_ALTQ_H_
%  #define  _ALTQ_IF_ALTQ_H_
% 
% -#ifdef __FreeBSD__

% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #include <sys/lock.h>  /* XXX */
%  #include <sys/mutex.h> /* XXX */
%  #include <sys/event.h> /* XXX */
% @@ -51,7 +51,7 @@
%   int ifq_len;
%   int ifq_maxlen;
%   int ifq_drops;
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%   struct  mtx ifq_mtx;
%  #endif
%

No new problems, but I wonder how this even compiles when the ifdefs
are not satisfied.  Here we are exporting mounds of kernel data structures
to userland.  There is a similar mess in net/if_var.h.  There it has
no ifdefs at all for the lock, mutex and event headers there, and you
didn't touch it.  net/if_var.h is unfortunately actually needed in
userland.  The mutexes in its data structures cannot simply be left
out, since then the data structures become incompatible with the actual
ones.  I don't see how the above can work with the mutex left out.

By not even compiles, I meant the header itself, but there should be
no problems there because the second ifdef should kill the only use of
all the headers.  And userland should compile since it shouldn't use
the ifdefed out (kernel) parts of the data struct.  But leaving out
the data substructures changes the ABI, so how could any application that
actually

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)

2011-11-23 Thread Bruce Evans

On Wed, 23 Nov 2011, Robert Millan wrote:


Here we go again :-)

Out of the kernel headers that are installed in the /usr/include/ hierarchy, there
are some which include support for multiple operating systems (usually FreeBSD and
other *BSD flavours).

This patch adds support to detect GNU/kFreeBSD as well.  In all cases, we
match the same declarations as FreeBSD does (which is to be expected in kernel
headers, since both systems share the same kernel).


Now it adds lots of namespace pollution (all of sys/param.h, including
all of its namespace pollution), just to get 1 new symbol defined.

% Index: sys/cam/scsi/scsi_low.h
% ===
% --- sys/cam/scsi/scsi_low.h   (revision 227831)
% +++ sys/cam/scsi/scsi_low.h   (working copy)
% @@ -44,6 +44,8 @@
%  #ifndef  _SCSI_LOW_H_
%  #define  _SCSI_LOW_H_
% 
% +#include <sys/param.h>

% +
%  /*
%   * Scsi low OSDEP 
%   * (All os depend structures should be here!)
% 
% [... 22 more headers polluted]


All the affected headers are poorly implemented ones.  Mostly kernel
headers which escaped to userland.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers

2011-11-20 Thread Bruce Evans

On Sun, 20 Nov 2011, Kostik Belousov wrote:


On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote:

On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote:

I fully agree with an idea that compiler is not an authorative source
of the knowledge of the FreeBSD version. Even more, I argue that we shall
not rely on compiler for this at all. Ideally, we should be able to
build FreeBSD using the stock compilers without local modifications.
Thus relying on the symbols defined by compiler, and not the source
is the thing to avoid and consistently remove.

We must do this to be able to use a third-party toolchain for FreeBSD builds.

That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ?
And then make more strong wording about other systems that use the macro,
e.g. remove 'may' from the kFreeBSD example.
Also, please remove the smile from comment.


Ok. New patch attached.


And the last, question, why not do
#ifndef __FreeBSD_kernel__
#define __FreeBSD_kernel__ __FreeBSD_version
#endif
?

#undef is too big a tool to apply there, IMO.


#ifndef is too big to apply here, IMO :-).  __FreeBSD_kernel__ is in the
implementation namespace, so any previous definition of it is a bug.  The
#ifndef breaks the warning for this bug.

And why not use FreeBSD style?  In KNF, the fields are separated by
tabs, not spaces.  In FreeBSD style, trailing underscores are not used
for names in the implementation namespace, since they have no effect
on namespaces.  The name __FreeBSD_version is an example of this.  Does
existing practice require using the name with the trailing underscores?

Bruce


Re: [PATCH] Netdump for review and testing -- preliminary version

2010-10-17 Thread Bruce Evans

On Fri, 15 Oct 2010, Robert N. M. Watson wrote:


On 15 Oct 2010, at 20:39, Garrett Cooper wrote:


   But there are already some cases that aren't properly handled
today in the ddb area dealing with dumping that aren't handled
properly. Take for instance the following two scenarios:
1. Call doadump twice from the debugger.
2. Call doadump, exit the debugger, reenter the debugger, and call
doadump again.
   Both of these scenarios hang reliably for me.
   I'm not saying that we should regress things further, but I'm just
noting that there are most likely a chunk of edgecases that aren't
being handled properly when doing dumps that could be handled better /
fixed.


Even thinking about calling doadump even once from within the debugger is
an error.  I was asleep when the similar error for panic was committed,
and this error has propagated.  Debuggers should use a trampoline to
call the any function, not the least so that they can be used to debug
the any function without the extra complications to make themselves
reentrant.  I think gdb has always used a trampoline for this outside of
the kernel.  Not sure what it does within the kernel, but it would have
even larger problems than in userland finding a place for the trampoline.
In the kernel, there is the additional problem of keeping control while
the any function is run.  Other CPUs must be kept stopped and interrupts
must be kept masked, except when the any function really needs other CPUs
or unmasked interrupts.  Single stepping also needs this and doesn't have
it (other CPUs and interrupt handlers can run and execute any number of
instructions while you are trying to execute a single one).  All ddb
commands that change the system state are really non-ddb commands that
should use an external function via a trampoline.  Panicing and dumping
are just the largest ones, so they are the most impossible to do correctly
as commands and the most in need of ddb to debug them.


Right: one of the points I've made to Attilio is that we need to move to a more 
principled model as to what sorts of things we allow in various kernel 
environments. The early boot is a special environment -- so is the debugger, 
but the debugger on panic is not the same as the debugger when you can 
continue. Likewise, the crash dumping code is special, but also not the same as 
the debugger. Right now, exceptional behaviour to limit hangs/etc is done 
inconsistently. We need to develop a set of principles that tell us what is 
permitted in what contexts, and then use that to drive design decisions, 
normalizing what's there already.


ENONUNIXEDITOR.  Format not recovered.

panic() from within a debugger (or a fast interrupt handler, or a fast
interrupt handler that has trapped to the debugger by request...) is,
although an error, not too bad since panic() must be prepared to work
starting from the any state anyway, and as you mention it doesn't need
to be able to return (except for RESTARTABLE_PANICS, which makes things
impossibly difficult).  Continuing from a debugger is feasible mainly
because in the usual case the system state is not changed (except for
time-dependent things).  If you use it to modify memory or i/o or run
one of its unsafe commands then you have to be careful.


This is not dissimilar to what we do with locking already, BTW: we define a set 
of kernel environments (fast interrupt handlers, non-sleepable threads, 
sleepable thread holding non-sleepable locks, etc), and based on those 
principles prevent significant sources of instability that might otherwise 
arise in a complex, concurrent kernel. We need to apply the same sort of 
approach to handling kernel debugging and crashing.


Locking has imposed considerable discipline, which if followed by panic()
would show how wrong most of the things done by panic() are -- it will
hit locks, but shouldn't even be calling functions that have locks, since
such functions expect their locks to work.

The rules for fast interrupt handlers are simple and mostly not followed.
They are that a fast interrupt handler may not access any state not
specially locked by its subsystem.  This means that they may not call
any other subsystem or any upper layer except the null set of ones
documented to be safe to call.  In practice, this means not calling the
any function, but it is necessary for atomic ops, bus space accesses,
and a couple of scheduling functions to be safe enough.


BTW, my view is that except in very exceptional cases, it should not be 
possible to continue after generating a dump. Dumps often cause disk 
controllers to get reset, which may leave outstanding I/O in nasty situations. 
Unless the dump device and model is known not to interfere with operation, we 
should set state indicating that the system is non-continuable once a dump has 
occurred.


It might be safe if the system reinitialized everything.  Too hard for just
dumping, but it is needed after resume anyway.  So the following could
reasonably work:
- 

Re: newfs_msdos and DVD-RAM

2010-04-04 Thread Bruce Evans

On Sat, 3 Apr 2010, Tijl Coosemans wrote:


Wikipedia's article on FAT has this to say about the maximum size of
clusters:

The limit on partition size was dictated by the 8-bit signed count of
sectors per cluster, which had a maximum power-of-two value of 64. With


That seems unlikely.  The MS-DOS file system is an old 1970's one meant
for implementation in assembly language on an 8-bit CPU.  No assembly
language programmer for an 8-bit microprocessor would expect an 8 bit
or 16 bit counter to be signed, since there aren't enough bits to waste
1 for the sign bit.  My reference written in 1986 by an assembly-language
oriented programmer (Duncan) only says that the value must be a power
of 2 though it says that the most other 8-bit variables are BYTEs.


the standard hard disk sector size of 512 bytes, this gives a maximum
of 32 KB clusters, thereby fixing the definitive limit for the FAT16
partition size at 2 gigabytes. On magneto-optical media, which can have
1 or 2 KB sectors instead of 1/2 KB, this size limit is proportionally
larger.


However, there was no need to use counts larger than 1 in 1980, so
support for values of 128 could easily have been broken.


Much later, Windows NT increased the maximum cluster size to 64 KB by
considering the sectors-per-cluster count as unsigned. However, the
resulting format was not compatible with any other FAT implementation
of the time, and it generated greater internal fragmentation. Windows
98 also supported reading and writing this variant, but its disk
utilities did not work with it.


This is demonstrably false, since pcfs in FreeBSD-1 was another FAT
implementation of the time (1993), and it should be missing the bug
since it uses the natural unsigned types for everything in the BPB.
msdosfs in Linux probably provides a better demonstration since it was
of production quality a year or 2 earlier and unlikely to have the bug.
(I don't have its sources handy to check.)


I'm not sure the second paragraph is worth supporting, but the first
seems to say that 32k limit you have in your patch only applies to
disks with 512 byte sectors. For disks with larger sectors it would
be proportionally larger.


It would be interesting to see what breaks with cluster sizes > 64K.
These can be obtained using emulated or physical sector sizes larger
than 512.

Of course you don't want to actually use cluster sizes larger than 4K
(far below 32K), since they just give portability and fragmentation
losses for tiny or negative performance gains (lose both space and
time to fragmentation).  My implementation of clustering for msdosfs
made the cluster sizes unimportant provided it is small enough not to
produce fragmentation, and there is little fragmentation due to other
problems, and there is enough CPU to enblock and deblock the clusters.
Clustering works better for msdosfs than for ffs because there are no
indirect blocks or far-away inode blocks to put bubbles in the i/o
pipeline.

Bruce


Re: newfs_msdos and DVD-RAM

2010-03-29 Thread Bruce Evans

On Mon, 29 Mar 2010, Andriy Gapon wrote:


...
I am not a FAT expert and I know to take Wikipedia with a grain of salt.
But please take a look at this:
http://en.wikipedia.org/wiki/File_Allocation_Table#Boot_Sector

In our formula:
SecPerClust *= pmp->pm_BlkPerSec;
we have the following parameters:
SecPerClust[in] - sectors per cluster
pm_BlkPerSec - bytes per sector divided by 512 (pm_BytesPerSec / DEV_BSIZE)
SecPerClust[out] - bytes per cluster divided by 512

So we have:
sectors per cluster: 64
bytes per sector: 4096

That Wikipedia article says: However, the value must not be such that the 
number
of bytes per cluster becomes greater than 32 KB.


64K works under FreeBSD, and I often do performance tests with it (it gives
very bad performance).  It should be avoided for portability too.


But in our case it's 256K, the same value that is passed as 'size' parameter to
bread() in the crash stack trace below.


This error should be detected more cleanly.  ffs fails the mount if the
block size exceeds 64K.  ffs can handle larger block sizes, and it is
unfortunate that it is limited by the non-ffs parameter MAXBSIZE, but
MAXBSIZE has been 64K and non-fuzzy for so long that the portability
considerations for using larger values are even clearer -- larger sizes
shouldn't be used, but 64K works almost everywhere.  I used to often do
performance tests with block size 64K for ffs.  It gives very bad
performance, and since there are more combinations of block sizes to
test for ffs than for msdosfs, I stopped testing block size 64K for ffs
long ago.

msdosfs has lots more sanity tests for its BPB than does ffs for its
superblock.  Some of these were considered insane and removed, and there
never seems to have been one for this.


By the way, that 32KB limit means that value of SecPerClust[out] should never be
greater than 64 and SecPerClust[in] is limited to 128, so its current type must be of
sufficient size to hold all allowed values.

Thus, clearly, it is a fault of a tool that formatted the media for FAT.
It should have picked correct values, or rejected incorrect values if those were
provided as overrides via command line options.


If 256K works under WinDOS, then we should try to support it too.  mav@
wants to increase MAXPHYS.  I don't really believe in this, but if MAXPHYS
is increased then it would be reasonable to increase MAXBSIZE too, but
probably not to more than 128K.


f...@r500 /usr/crash $kgdb kernel.1/kernel.symbols vmcore.1

[snip]

Unread portion of the kernel message buffer:
panic: getblk: size(262144) > MAXBSIZE(65536)

[snip]

#11 0x803bedfb in panic (fmt=Variable fmt is not available.
) at /usr/src/sys/kern/kern_shutdown.c:562


BTW, why can't gdb find any variables?  They are just stack variables whose
address is easy to find.


...
#14 0x8042f24e in bread (vp=Variable vp is not available.
) at /usr/src/sys/kern/vfs_bio.c:748


... and isn't vp a variable?  Maybe the bad default -O2 is destroying
debugging.  Kernels intended for being debugged (and that is almost all
kernels) shouldn't be compiled with many optimizations.  Post-gcc-3, -O2
breaks even backtraces by inlining static functions that are called only
once.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: System hangs solid with ATAPICAM

2003-12-02 Thread Bruce Evans
On Tue, 2 Dec 2003, Sean McNeil wrote:

 I've tried over several weeks to get ATAPICAM to work for me.  I've
 tried with and without acpi (compiled in or disabled via boot).  I've
 tried turning on all debug.  I've tried a few misc. things.  All leave my

Did you try backing out rev.1.23 of ata-lowlevel.c?

 system hanging after the GEOM initialization without any indication of
 debug output.  The only clue I have is that it sounds like my zip-100
 was accessed right before the hang.

That's interesting.  The bug avoided by backing out rev.1.23 of
ata-lowlevel.c is obviously system dependent.  I only see it on a
system that has a zip100.

Bruce


Re: 40% slowdown with dynamic /bin/sh

2003-11-27 Thread Bruce Evans
On Wed, 26 Nov 2003, Garance A Drosihn wrote:

 At 12:23 AM -0500 11/26/03, Michael Edenfield wrote:
 
 Just to provide some real-world numbers, here's what I got
 out of a buildworld:

 I have reformatted the numbers that Michael reported,
 into the following table:

 Static /bin/sh:          Dynamic /bin/sh:
 real  385m29.977s        real  455m44.852s   = 18.22%
 user  111m58.508s        user  113m17.807s   =  1.18%
 sys    93m14.450s        sys   103m16.509s   = 10.76%
                                    user+sys  =  5.53%

What are people doing to make buildworld so slow?  I once optimized
makeworld to take 75 minutes on a K6-233 with 64MB of RAM.  Things
have been pessimized a bit since then, but not significantly except for
the 100% slowdown of gcc (we now build large things like secure but
this is partly compensated for by not building large things like perl).
Michael's K7-500 with 320MB (?) of RAM should be several times faster
than the K6-233, so I would be unhappy if it took more than 75 minutes
but would expect it to take a bit more than 2 hours when well configured.

 Here are some buildworld numbers of my own, from my system.
 In my case, I am running on a single Athlon MP2000, with a
 gig of memory.  It does a buildworld without paging to disk.

I have a similar configuration, except with a single Athlon XP1600
overclocked by 146/133 and I always benchmark full makeworlds.  I
was unhappy when the gcc pessimizations between gcc-2.95 and gcc-3.0
increased the makeworld time from about 24 minutes to about 33 minutes.
The time has since increased to about 38 minutes.  The latter is
cheating slightly -- I leave out the DYNAMICROOT and RESCUE mistakes
and the KERBEROS non-mistake.

 Static sh, No -j:  Dynamic sh, No -j:
real84m31.366s real86m22.429s   =  2.04%
user50m33.013s user51m13.080s   =  1.32%
sys 29m59.047s sys 33m04.082s   = 10.29%
   user+sys =  4.66%

 Static sh, -j2:Dynamic sh, -j2:
real92m38.656s real95m21.027s   =  2.92%
user51m48.970s user52m29.152s   =  1.29%
sys 32m07.293s sys 34m40.595s   =  7.95%
   user+sys =  3.84%

This also shows why -j should not be used on non-SMP machines.  Apart
from the make -j bug that causes missed opportunities to run a job,
make -j increases real and user times due to competition for resources,
so it can only possibly help on systems where unbalanced resources (mainly
slow disks) give too much idle time.

My current worst makeworld time is a bit more than half the fastest
buildworld time in the above (2788 seconds vs 5071 seconds).  From my
collection of makeworld benchmarks:

%%%
Fastest makeworld on a Celeron 366 overclocked by 95/66 (2000/05/15):
3309.30 real  2443.75 user   488.68 sys

Last makeworld on a Celeron 366 overclocked by 95/66 (2001/11/19):
4219.83 real  3253.04 user   667.64 sys

Fastest makeworld on an Athlon XP1600 overclocked by 146/133 (2002/01/03):
1390.18 real   913.56 user   232.63 sys

Last makeworld before gcc-3 on an Athlon XP1600 o/c by 143/133 (2002/05/09)
(overclocking reduced and due to memory problems and some local
memory-related optimizations turned off):
 1532.99 real  1093.08 user   293.15 sys

Early makeworld with gcc-3 on an Athlon XP1600 o/c by 143/133 (2002/05/12):
2268.13 real  1613.25 user   313.56 sys

Fastest makeworld with gcc-3 an Athlon XP1600 overclocked by 146/133
(maximal overclocking recovered; memory increased from 512MB to 1GB, local
memory-related optimizations turned on and tuned) (2003/03/31):
1929.02 real  1576.67 user   205.30 sys

Last makeworld before the default bloat became too large for me and I
started stopping it for me by putting things like NO_KERBEROS in
/etc/make.conf on an Athlon XP1600 o/c by 143/133 (2003/04/29):
2012.75 real  1637.59 user   225.07 sys

Makeworld with the defaults (no /etc/make.conf and no local optimizations
in the src tree; mainly no pessimizing for Athlons by optimizing for PII's,
and no building dependencies; only optimizations in the host environment
(mainly no dynamic linkage) on an Athlon as usual (2003/05/06):

Last recorded makeworld with local source and make.conf optimizations
(mainly no dynamic linkage) on an Athlon as usual (2003/10/22):
2225.83 real  1890.64 user   256.33 sys

Last recorded makeworld with the defaults on an Athlon as usual (2003/11/11):
2788.41 real  2316.49 user   357.34 sys
%%%

I don't see such a large slowdown from using a dynamic /bin/sh.  Unrecorded
runs of makeworld gave times like the following:

2262 real ... with local opts including src ones and no dynamic linkage
2290 real ... with same except for /bin/sh (only) dynamically linked

The difference may be because my /usr/bin/true and similar utilities remain

Re: 5.2-BETA: giving up on 4 buffers (ata)

2003-11-27 Thread Bruce Evans
On Thu, 27 Nov 2003, Stefan Ehmann wrote:

 On Wed, 2003-11-26 at 19:37, Matthias Andree wrote:
  Hi,
 
  when I rebooted my 5.2-BETA (kernel about 24 hours old), it gave up on
  flushing 4 dirty blocks.
 
  I had three UFS1 softdep file systems mounted on one ATA drive, one
 ext2
  file system on another ATA drive and one ext2 file system on a SCSI
  drive.  Both ext2 file systems had been mounted read-only, so they
 can't
  have had dirty blocks.

 This is a known problem for nearly three months now (See PR 56675). It
 happens to me every time I shut down the system if i don't unmount my
 (read-only) ext2 file systems manually.

I'm not sure if the problem is known for the read-only case.  It is
the same problem as in the read-write case.  ext2fs hangs onto buffers,
so shutdown cannot tell if it can look at the buffers and considers
them to be busy.  Then since shutdown cannot tell if it synced all dirty
buffers or which buffers are associated with which file systems, it
doesn't unmount any file systems and all dirty file systems that aren't
unmounted before shutdown are left dirty.  Read-only-mounted ext2fs file
systems aren't left dirty but they break cleaning of other file systems.

Bruce


Re: Hanging at boot

2003-11-26 Thread Bruce Evans
On Wed, 26 Nov 2003, Manfred Lotz wrote:

 On Mon, 24 Nov 2003 08:00:49 +0100, Manfred Lotz wrote:

  Hi there,
 
  Last time (around middle of October) when I tried out a new current kernel
  it was hanging at boot time at acd1
 
  ata1 is:
  acd1: DVD-ROM TOSHIBA DVD-ROM SD-M1612 at ata1-slave UDMA33
 
 
  I tried it again yesterday. Now acd1 seems to be fine. However it hangs
  at acd2.After the following message
   acd2: CD-RW MITSUMI DW-7801TE at ata3-master UDMA33
 
  it stops working. No error message is showing up.

 In the meantime I found out that the cause of the problem is atapicam.
 If I remove it from my kernel config I'm fine (but I have no atapicam).

Try backing out rev.1.23 of ata-lowlevel.c.

Bruce


Re: dumb question 'Bad system call' after make world

2003-11-22 Thread Bruce Evans
On Fri, 21 Nov 2003, Barney Wolff wrote:

 Will somebody please tell me when make world is ever correct in the
 environment of the last several years?  I've been unable to understand
 its continued existence as a target.

From my normal world-building script:

DESTDIR=/c/z/root \
MAKEOBJDIRPREFIX=/c/z/obj \
time -l make -s world > /tmp/world.out 2>&1

Bruce


Re: HEADS UP: /bin and /sbin are now dynamically linked

2003-11-22 Thread Bruce Evans
On Sat, 22 Nov 2003, M. Warner Losh wrote:

 In message: [EMAIL PROTECTED]
 Bruce Evans [EMAIL PROTECTED] writes:
 : On Sat, 22 Nov 2003, M. Warner Losh wrote:

 :  Timing Solutions uses the following minimal termcap for its embedded
 :  applications.  It has a number of terminals that it supports, while
 :  still being tiny.  it is 3.5k in size, which was the goal (< 4k block
 :  size we were using).  One could SED this down by another 140 bytes or
 :  so.  Removing the comments and the verbose names would net another 300
 :  odd bytes.
 :
 : What's wrong with FreeBSD's /usr/src/etc/termcap.small, except it is
 : twice as large and has a weird selection of entries (zillions of
 : variants of cons25, dosansi and pc3).

 Mine is better because it has a more representative slice of currently
 used terminal types.  Maybe we should replace termcap.small with mine
 (maybe with the copyright notice).

I agree.  termcap.small is amazingly uncurrent.  However, perhaps some
merging and reducing is in order.  Why is a full cons25 or vt2xx needed?
vi only needs a few capabilities.  I think we mostly use copies of large
termcap entries because copying the whole things is easier.

Bruce



Re: HEADS UP: /bin and /sbin are now dynamically linked

2003-11-22 Thread Bruce Evans
On Sat, 22 Nov 2003, M. Warner Losh wrote:

 In message: [EMAIL PROTECTED]
 Richard Coleman [EMAIL PROTECTED] writes:
 : M. Warner Losh wrote:
 :
 :  : I agree.  termcap.small is amazingly uncurrent.  However, perhaps some
 :  : merging and reducing is in order.  Why is a full cons25 or vt2xx needed?
 :  : vi only needs a few capabilities.  I think we mostly use copies of large
 :  : termcap entries because copying the whole things is easier.
 : 
 :  You have a good point.  My termcap was done so that we could run a
 :  number of applications...
 : 
 :  Grepping seems unsatisfying to find out which keys are used.  Do you
 :  have a list?

nvi/cl/cl_bsd.c has a possibly complete enough list in its terminfo
translation table.

 : Is the extra maintenance worth it to save a few hundred bytes?

Probably not, if this is mainly for use by rescue on larger (multi-megabyte)
disks.  I used an 8K termcap on 1200MB floppy rescue disks many years ago,

 Generating them automatically can be kind of difficult.  termcap
 doesn't change that often.

As someone pointed out, ed is sufficient.  It's all we had on the root
partition.  I remember how to use it mainly from using it there.

Bruce


Re: Unfortunate dynamic linking for everything

2003-11-21 Thread Bruce Evans
On Fri, 21 Nov 2003, Tim Kientzle wrote:

 Bruce Evans wrote:
  It obviously uses NSS.  How else could it be so bloated? :
 
  $ ls -l /sbin/init
  -r-x--  1 root  wheel  453348 Nov 18 10:30 /sbin/init

 I believe it's actually DNS, not NSS.

 Pre-5.0, the resolver ballooned significantly.
 A lot of the bloat in /bin and /sbin came
 from the NIS functions which in turn pull in
 the resolver.

Perhaps both.

 Example: /bin/date on 5.1 is also over 450k
 because of a single call to getservbyname().
 Removing that one call shrinks a static /bin/date
 to a quite reasonable size. (I seem to recall 80k when
 I did this experiment.)

The 2 calls to logwtmp() must also be removed, at least now.
I get the following text sizes for /bin/date:

RELENG_4: 137491
-current*: 93214 (* = getservbyname() and logwtmp() calls removed)
-current: 371226 (only 412492 total, not 450K yet)

 I note that /sbin/init calls getpwnam();
 I expect that's where the bloat gets pulled in.

Yes, except it's only the latest 200+K of bloat (from 413558 bytes text
to 633390).  Before that there was 100+K of miscellaneous bloat
relative to RELENG_4 (text size 305289 there).  Before that there was
another 200+K of bloat from implementing history.  Compiling with
-DNO_HISTORY removes history support and reduces the text size to
162538 (this is without getpwnam()).  Then there is another 30K of
mostly non-bloat for actual changes within /bin/sh, since compiling
the FreeBSD-1 /bin/sh with current libraries gives a text size of
132966.  Finally, IIRC the text size of the FreeBSD-1 /bin/sh is
70K (total size 90K), so there is another 60K of miscellaneous bloat
in current libraries to increase the text size from 70K to 130K.

Total text sizes for /bin/sh's internals:
FreeBSD-1 sh compiled with -current's compiler: 55350
current sh compiled with -current's compiler: 87779
87:55 is about right for the increased functionality.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Unfortunate dynamic linking for everything

2003-11-19 Thread Bruce Evans
On Wed, 19 Nov 2003, Marcel Moolenaar wrote:

 set init_path=/rescue/init

If dynamic root were ready to be turned on, then /rescue/init would be
in the default init_path.

 A dynamicly linked /sbin/init just
 makes it harder to get to the rescue bits, so it makes sense to
 link init(8) staticly. Especially since there's no advantage to
 dynamic linking init(8) that compensates for the inconvenience.

It obviously uses NSS.  How else could it be so bloated? :

$ ls -l /sbin/init
-r-x--  1 root  wheel  453348 Nov 18 10:30 /sbin/init

(My version is linked statically of course.)

The NSS parts of init might not be needed in normal operation, but it's
hard to tell.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: hard lock-up writing to tape

2003-11-19 Thread Bruce Evans
On Wed, 19 Nov 2003, Mike Durian wrote:

 On Tuesday 18 November 2003 08:29 pm, Bruce Evans wrote:
  - -current has the kern.console sysctl for enabling multiple consoles
(but only 1 sio one).  You can boot with a syscons console and then
enable the serial, and the latter should work if it is on a working
port to begin with.  Anyway, this sysctl shows which sio port can be
a console, if any.

 Is there any documentation on this sysctl?  I'm not sure what I
 should set it to.  After a normal boot, it reads:

Only in the source code.

 kern.console: consolectl,/ttyd1,consolectl,

Not even the bug that syscons's consolectl device is printed here is
documented (the actual syscons console is on /dev/ttyv0, but this
bogusly shares a tty struct with /dev/consolectl and many things
cannot tell the difference.  This bug also messes up the columns in
pstat -t, since consolectl is too wide to fit).

Anyway, the stuff to the left of the slash in the above is the list
of active consoles and the stuff to the right of the slash is the
list of possible consoles.  You have to move stuff from one list to
the other.  I vaguely remember that this is done using '-' to delete
things from the left hand list and something more direct to add them.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Unfortunate dynamic linking for everything

2003-11-19 Thread Bruce Evans
On Wed, 19 Nov 2003, Ken Smith wrote:

 On Thu, Nov 20, 2003 at 06:27:31AM +1100, Bruce Evans wrote:

   set init_path=/rescue/init
 
  If dynamic root were ready to be turned on, then /rescue/init would be
  in the default init_path.

 I had that explained to me too. :-)

 There is a loop in sys/kern/init_main.c that probes for an init
 to run.  But it only does what you want for cases of the files
 not existing or otherwise just totally not executable.  It won't
 handle the started but then dumped core case the way it would
 need to if /sbin/init were to fail because of shared library
 problems.  So if just relying on this mechanism it would either
 not work right (/sbin/init in the path before /rescue/init) or
 it would always start /rescue/init (/rescue/init before /sbin/init
 in the path).

Oops, better add ... and error handling for init_path would be fixed :-).

I should have remembered this since I got bitten by it recently.  I was
trying to boot RELENG_3 and had a backup init that worked but that didn't
help because there was an execable init earlier in the path.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Unfortunate dynamic linking for everything

2003-11-18 Thread Bruce Evans
On Tue, 18 Nov 2003, M. Warner Losh wrote:

 In message: [EMAIL PROTECTED]
 [EMAIL PROTECTED] writes:
 : It really doesn't make sense to arbitrarily cut-off a
 : discussion especially when a decision might be incorrect.

 I'd say that good technical discussion about why this is wrong would
 be good.  However, emotional ones should be left behind.  Except for
 John's message, most of the earlier messages have been more emotional
 than technical.

I used to use all dynamic linkage, but switched to all static linkage
(except for ports) when I understood John's points many years ago.  It
shouldn't be necessary to repeat the arguments.

 John, do you have any good set of benchmarks that people can run to
 illustrate your point?

Almost any benchmark that does lots of forks or execs, or uses libraries
a lot, will do.  IIRC, 5-10% of my speedup for makeworld came from building
tools static.  Makeworld is not as good a benchmark for this as it used
to be, since it always builds tools static, so the non-staticness of
standard binaries doesn't matter so much.  Perhaps it still matters for
/bin/sh.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: HEADS-UP new statfs structure

2003-11-18 Thread Bruce Evans
On Tue, 18 Nov 2003, Rudolf Cejka wrote:

 Hello, and is it possible to review some of my (one from the masses :o)
 questions/suggestions?

 * cvtstatfs() for freebsd4_* compat syscalls does not copy text fields
   correctly, so old binaries with new kernel know just about first
   16 characters from mount points - what do you think about the
   following patch? (Or maybe with even safer sizeof() - but I did not
   test it.)

Hmm, there were 2 bugs here:
- MFSNAMELEN was confused with MNAMELEN in some places.  This gives
  unterminated strings as well as excessively truncated strings.
- there were off-by-1 errors which would have given unterminated
  strings even without the previous bug.

 --- sys/kern/vfs_syscalls.c.orig  Sun Nov 16 11:12:09 2003
 +++ sys/kern/vfs_syscalls.c   Sun Nov 16 11:56:07 2003
 @@ -645,11 +645,11 @@
 	osp->f_syncreads = MIN(nsp->f_syncreads, LONG_MAX);
 	osp->f_asyncreads = MIN(nsp->f_asyncreads, LONG_MAX);
 	bcopy(nsp->f_fstypename, osp->f_fstypename,
 - MIN(MFSNAMELEN, OMNAMELEN));
 + MIN(MFSNAMELEN, OMFSNAMELEN - 1));

MFSNAMELEN didn't change, so there is currently only a logical problem
here.  The -1 term could be moved outside of the MIN().  It works in
either place and would save duplicating the terminating NUL in the
unlikely event that the new name length becomes smaller than the old
one.  I'm not sure which is clearest.

 	bcopy(nsp->f_mntonname, osp->f_mntonname,
 - MIN(MFSNAMELEN, OMNAMELEN));
 + MIN(MNAMELEN, OMNAMELEN - 1));

Similarly, plus the larger bug.  MNAMELEN increased from
(88 - 2 * sizeof(long)) to 88, so if it were used without the -1 in
the above, then mount point name lengths longer than the old value
would have been unterminated instead of truncated.

 	bcopy(nsp->f_mntfromname, osp->f_mntfromname,
 - MIN(MFSNAMELEN, OMNAMELEN));
 + MIN(MNAMELEN, OMNAMELEN - 1));

Similarly.

   if (suser(td)) {
 	osp->f_fsid.val[0] = osp->f_fsid.val[1] = 0;
   } else {
 ---

 * sys/compat/freebsd32/freebsd32_misc.c: If you look into copy_statfs(),
   you copy 88-byte strings into just 80-byte strings. Fortunately it seems
   that there are just overwritten spare fields and f_syncreads/f_asyncreads
   before they are set to the correct value. What about these patches, which
   furthermore are resistant to possible MFSNAMELEN change in the future?
   [I'm sorry, these patches are untested]

 --- sys/compat/freebsd32/freebsd32.h.orig Tue Nov 18 16:58:28 2003
 +++ sys/compat/freebsd32/freebsd32.h  Tue Nov 18 16:59:36 2003
 @@ -75,6 +75,7 @@
   int32_t ru_nivcsw;
  };

 +#define FREEBSD32_MFSNAMELEN  16 /* length of type name including null */
 #define FREEBSD32_MNAMELEN	(88 - 2 * sizeof(int32_t)) /* size of on/from name bufs */


MFSNAMELEN hasn't changed, so this part is cosmetic.  But don't we now need
to clone all of this compatibility cruft for the new statfs()?  Native
32-bit systems have both.  Then MFSNAMELEN for this version should probably
be spelled OMFSNAMELEN.

  struct statfs32 {
 @@ -92,7 +93,7 @@
   int32_t f_flags;
   int32_t f_syncwrites;
   int32_t f_asyncwrites;
 -	char	f_fstypename[MFSNAMELEN];
 +	char	f_fstypename[FREEBSD32_MFSNAMELEN];
 	char	f_mntonname[FREEBSD32_MNAMELEN];
   int32_t f_syncreads;
   int32_t f_asyncreads;
 --- sys/compat/freebsd32/freebsd32_misc.c.origTue Nov 18 16:59:49 2003
 +++ sys/compat/freebsd32/freebsd32_misc.c Tue Nov 18 17:03:31 2003
 @@ -276,6 +276,7 @@
  static void
  copy_statfs(struct statfs *in, struct statfs32 *out)
  {
 + bzero(out, sizeof *out);

Yikes.  All copied out structs that might have holes (i.e., all structs
unless you want to examine them in binary for every combination of
arch/compiler/etc) need to be bzero()ed like this, but there are no
bzero()'s in files in this directory.

   CP(*in, *out, f_bsize);
   CP(*in, *out, f_iosize);
   CP(*in, *out, f_blocks);
 @@ -290,14 +291,14 @@
   CP(*in, *out, f_flags);
   CP(*in, *out, f_syncwrites);
   CP(*in, *out, f_asyncwrites);
 -	bcopy(in->f_fstypename,
 -	    out->f_fstypename, MFSNAMELEN);
 -	bcopy(in->f_mntonname,
 -	    out->f_mntonname, MNAMELEN);
 +	bcopy(in->f_fstypename, out->f_fstypename,
 +	    MIN(MFSNAMELEN, FREEBSD32_MFSNAMELEN - 1));
 +	bcopy(in->f_mntonname, out->f_mntonname,
 +	    MIN(MNAMELEN, FREEBSD32_MNAMELEN - 1));
 	CP(*in, *out, f_syncreads);
 	CP(*in, *out, f_asyncreads);
 -	bcopy(in->f_mntfromname,
 -	    out->f_mntfromname, MNAMELEN);
 +	bcopy(in->f_mntfromname, out->f_mntfromname,
 +	    MIN(MNAMELEN, FREEBSD32_MNAMELEN - 1));
  }

  int

This seems to be correct except possibly for the style (placement of -1
and fixing the indentation of the continuation lines so that it is not
bug-for-bug compatible with the rest of the file).

 ---

 * sys/ia64/ia32/ia32.h: It seems to me that statfs32 has similar 

Re: hard lock-up writing to tape

2003-11-18 Thread Bruce Evans
On Tue, 18 Nov 2003, Mike Durian wrote:

 On Monday 17 November 2003 04:41 pm, Mike Durian wrote:
 
  I was finally able to get some partial success by setting flag 0x30
  for sio1.  When I'd boot, I'd get console messages on my remote
  tip session.  However, I'd only receive those messages printed
  from user-level applications.  I would not see any of the bold-face
  messages from the kernel.

 I'm still stumbling with the remote serial console.  Can someone
 who does this often test and verify they can use COM2 as the
 serial console - and then tell me what you did.

Moving the 0x10 flag from sio0 to sio1 should be sufficient for the kernel
part.  Setting the 0x20 flag for sio1 together with the 0x10 flag should
mainly save having to edit the flag for sio0.  If the kernel's serial
console is the same as the boot blocks', then it should use the same speed
as the boot blocks set it to.  Otherwise there may be a speed mismatch.

 The best I can manage is described above and then I get neither
 the bold kernel messages nor the debugger prompt.

This could be from a speed mismatch or from kern.consmute somehow getting
set.

Some of this stuff can be configured after booting:
- RELENG4 has non-broken boot-time configuration which allows changing
  during the boot.
- -current has the kern.console sysctl for enabling multiple consoles
  (but only 1 sio one).  You can boot with a syscons console and then
  enable the serial, and the latter should work if it is on a working
  port to begin with.  Anyway, this sysctl shows which sio port can be
  a console, if any.
- RELENG_4 and -current have the machdep.conspeed sysctl for setting the
  console speed.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Who needs these silly statfs changes...

2003-11-15 Thread Bruce Evans
On Sat, 15 Nov 2003, Terry Lambert wrote:

 Bruce Evans wrote:
  I just got around to testing the patch in that reply:
 [ ... ]
  This seems to work.  On a 2TB-epsilon ffs1 file system (*) on an md malloc
  disk (**):

 Try it again.  This time, take the remote FS below its free reserve
 as the root user, and see what the client machine reports.  Compare
 the results to an identical local FS.

Er, that is the main thing that the test did.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Who needs these silly statfs changes...

2003-11-15 Thread Bruce Evans
On Fri, 14 Nov 2003 [EMAIL PROTECTED] wrote:

  Bruce Evans wrote:
   ...
   I just got around to testing the patch in that reply:
   ...
 
  Your patch to nfs_vfsops won't apply to my Solaris kernel :-)
  The protocol says abytes is unsigned, so the server shouldn't be lying
  by sending a huge positive value for available space on a full
  filesystem. No?
 
 Possibly not, but the protocol is broken if it actually requires that.

 What makes you say that? I would think the utility of negative counts
 for disk sizes and available spaces is marginal. Solaris, POSIX, and
 NFS seem to get on fine without it. What am I (and they) missing?

Well, the f_bavail field (not to mention all the other fields (until
recently, sigh)) has always been signed and does go negative in BSD's
statfs, so the protocol is broken if it can't support negative values
in it.

 The type pun to negative values is in most versions of BSD:
  [snip code snippets and bug]

 That's great for interacting with other BSDs, but it still abusing
 the protocol. As filesystems with approaching 2^64 bytes become possible
 it probably has more of an impact.

2^63 won't be needed any time soon.  This problem was more serious with
nfsv2 when file systems reached 2^31 bytes not so long ago.

The current problem is actually more with non-BSD clients and a BSD server.
The BSD server will send the negative values and the non-BSD client may
convert them to huge positive ones.  Non-BSD servers presumably won't send
negative values.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: DIAGNOSTIC LOR in softclock

2003-11-15 Thread Bruce Evans
On Sat, 15 Nov 2003, Poul-Henning Kamp wrote:

 This looks slightly different if I use SCHED_ULE, but the effect is
 the same.

 Off the top of my head, I have not been able to find any places
 where softclock would call schedcpu directly.

schedcpu() is a timeout routine, so it is always called indirectly from
softclock.

 lock order reversal
  1st 0xc072dca0 callout_dont_sleep (callout_dont_sleep) @ kern/kern_timeout.c:223
  2nd 0xc072d080 allproc (allproc) @ kern/sched_4bsd.c:257
 Stack backtrace:
 backtrace(c06d148d,c072d080,c06cd881,c06cd881,c06cf38b) at backtrace+0x17
 witness_lock(c072d080,0,c06cf38b,101,c5061c3c) at witness_lock+0x672
 _sx_slock(c072d080,c06cf382,101,8,c06cf0a0) at _sx_slock+0xae
 schedcpu(0,0,c06cf097,df,c1183140) at schedcpu+0x3f
 softclock(0,0,c06cbce6,23a,c1189388) at softclock+0x1fb
 ithread_loop(c1180400,c5061d48,c06cbb54,311,558b0424) at ithread_loop+0x192
 fork_exit(c050b090,c1180400,c5061d48) at fork_exit+0xb5
 fork_trampoline() at fork_trampoline+0x8
 --- trap 0x1, eip = 0, esp = 0xc5061d7c, ebp = 0 ---

I'm sure this is known.  schedcpu() always calls sx_lock(allproc_lock),
so the above always occurs if sx_lock() happens to block.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Who needs these silly statfs changes...

2003-11-14 Thread Bruce Evans
On Fri, 14 Nov 2003, Peter Edwards wrote:

 Bernd Walter wrote:

 On Thu, Nov 13, 2003 at 12:54:18AM -0800, Kris Kennaway wrote:
 
 
 On Thu, Nov 13, 2003 at 06:44:25PM +1100, Peter Jeremy wrote:
 
 
 On Wed, Nov 12, 2003 at 06:04:00PM -0800, Kris Kennaway wrote:
 
 
 ...my sparc machine reports that my i386 nfs server has 15 exabytes of
 free space!
 
 enigma# df -k
 Filesystem  1K-blocks Used Avail Capacity  Mounted on
  rot13:/mnt2  56595176 54032286 18014398507517260     0%    /rot13/mnt2
 
 
 18014398507517260 = 2^54 - 1964724.  and 2^54KB == 2^64 bytes.  Is it
 possible that rot13:/mnt2 has negative free space?  (ie it's into the
 8-10% reserved area).
 
 
 Yes, that's precisely what it is..the bug is either in df or the
 kernel (I suspect the latter, i.e. something in the nfs code).
 
 
 
 And it's nothing new - I'm seeing this since several years now.
 
 

  The NFS protocols have unsigned fields where statfs has signed
  equivalents: NFS can't represent negative available disk space.  (Without
  the knowledge of the underlying filesystem on the server, negative free
  space is a little nonsensical anyway, I suppose.)

 The attached patch stops the NFS server assigning negative values to
 unsigned fields in the statfs response, and works against my local
 solaris box.  Seem reasonable?

The client attempts to fix this by pretending that the unsigned fields
are signed.  -current tries to do more to support file system sizes larger
than 1TB, but the code for this is not even wrong except it may be wrong
enough to break the negative values.  See my reply to one of the PRs
for more details.

I just got around to testing the patch in that reply:

%%%
Index: nfs_vfsops.c
===
RCS file: /home/ncvs/src/sys/nfsclient/nfs_vfsops.c,v
retrieving revision 1.143
diff -u -2 -r1.143 nfs_vfsops.c
--- nfs_vfsops.c12 Nov 2003 02:54:46 -  1.143
+++ nfs_vfsops.c12 Nov 2003 14:37:46 -
@@ -223,5 +223,5 @@
struct mbuf *mreq, *mrep, *md, *mb;
struct nfsnode *np;
-   u_quad_t tquad;
+   quad_t tquad;
int bsize;

@@ -254,19 +254,19 @@
 	for (bsize = NFS_FABLKSIZE; ; bsize *= 2) {
 		sbp->f_bsize = bsize;
-		tquad = fxdr_hyper(sfp->sf_tbytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		tquad = (quad_t)fxdr_hyper(sfp->sf_tbytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_blocks = tquad / bsize;
-		tquad = fxdr_hyper(sfp->sf_fbytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		sbp->f_blocks = tquad;
+		tquad = (quad_t)fxdr_hyper(sfp->sf_fbytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_bfree = tquad / bsize;
-		tquad = fxdr_hyper(sfp->sf_abytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		sbp->f_bfree = tquad;
+		tquad = (quad_t)fxdr_hyper(sfp->sf_abytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_bavail = tquad / bsize;
+		sbp->f_bavail = tquad;
 		sbp->f_files = (fxdr_unsigned(int32_t,
 		    sfp->sf_tfiles.nfsuquad[1]) & 0x7fffffff);
%%%

This seems to work.  On a 2TB-epsilon ffs1 file system (*) on an md malloc
disk (**):

server:
Filesystem 1K-blocks  Used  Avail Capacity  Mounted on
/dev/md0   2147416896        0 1975624000     0%    /b
client:
Filesystem 1024-blocks Used  Avail Capacity  Mounted on
besplex:/b  2147416896        0 1975624000     0%    /b

These are 1K-blocks so their count fits in an int32_t, but the count in
512-blocks is too large for an int32_t so the scaling must be helping.

With newfs -m 100 (***) to get near negative free space:

server:
Filesystem 1K-blocks  Used Avail Capacity  Mounted on
/dev/md0   2147416896        0  5696     0%    /b
client:
Filesystem 1K-blocks  Used Avail Capacity  Mounted on
besplex:/b  2147416896        0  5696     0%    /b

After using up all the free space by creating a 6MB file:

server:
Filesystem 1K-blocks  Used Avail Capacity  Mounted on
/dev/md0   2147416896 6208  -512   109%    /b
client:
Filesystem 1024-blocks Used Avail Capacity  Mounted on
besplex:/b  2147416896 6208  -512   109%   

Re: Who needs these silly statfs changes...

2003-11-14 Thread Bruce Evans
On Fri, 14 Nov 2003, Peter Edwards wrote:

 Bruce Evans wrote:

  On Fri, 14 Nov 2003, Peter Edwards wrote:

  The NFS protocols have unsigned fields where statfs has signed
  equivalents: NFS can't represent negative available disk space ( Without
  the knowledge of the underlying filesystem on the server, negative free
  space is a little nonsensical anyway, I suppose)
 
  The attached patch stops the NFS server assigning negative values to
  unsigned fields in the statfs response, and works against my local
  solaris box. Seem reasonable?
 
   The client attempts to fix this by pretending that the unsigned fields
   are signed. -current tries to do more to support file system sizes larger
   than 1TB, but the code for this is not even wrong except it may be wrong
  enough to break the negative values. See my reply to one of the PRs
  for more details.
 
  I just got around to testing the patch in that reply:
  ...

 Your patch to nfs_vfsops won't apply to my Solaris kernel :-)
 The protocol says abytes is unsigned, so the server shouldn't be lying
 by sending a huge positive value for available space on a full
 filesystem. No?

Possibly not, but the protocol is broken if it actually requires that.
The free fields are signed in struct statfs so that they can be negative.
However, this is broken in POSIX's struct statvfs (all count fields have
type fsblkcnt_t or fsfilcnt_t and these are specified to be unsigned).
Is Solaris bug for bug compatible with that?

Anyway, my patch is mainly supposed to fix the scaling.  The main bug
in the initial scaling patch was that the huge positive values were
scaled before they were interpreted as negative values, so they became
not so huge but still preposterous values that could not be interpreted
as negative values.

The type pun to negative values is in most versions of BSD:

RELENG_4:
u_quad_t tquad;
...
if (v3) {
	sbp->f_bsize = NFS_FABLKSIZE;
	tquad = fxdr_hyper(sfp->sf_tbytes);
	sbp->f_blocks = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	tquad = fxdr_hyper(sfp->sf_fbytes);
	sbp->f_bfree = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	tquad = fxdr_hyper(sfp->sf_abytes);
	sbp->f_bavail = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	sbp->f_files = (fxdr_unsigned(int32_t,
	    sfp->sf_tfiles.nfsuquad[1]) & 0x7fffffff);
	sbp->f_ffree = (fxdr_unsigned(int32_t,
	    sfp->sf_ffiles.nfsuquad[1]) & 0x7fffffff);
} else {
	sbp->f_bsize = fxdr_unsigned(int32_t, sfp->sf_bsize);
	sbp->f_blocks = fxdr_unsigned(int32_t, sfp->sf_blocks);
	sbp->f_bfree = fxdr_unsigned(int32_t, sfp->sf_bfree);
	sbp->f_bavail = fxdr_unsigned(int32_t, sfp->sf_bavail);
	sbp->f_files = 0;
	sbp->f_ffree = 0;
}

Oops, this has the cast to long perfectly misplaced so that negative
sizes are not converted like I want.  It just prevents warnings.
Overflow has occurred long before, on the server when negative block
counts were converted to huge positive sizes.

NetBSD (nfs_vfsops.c 1.132):
u_quad_t tquad;
...
...
if (v3) {
	sbp->f_bsize = NFS_FABLKSIZE;
	tquad = fxdr_hyper(sfp->sf_tbytes);
	sbp->f_blocks = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(sfp->sf_fbytes);
	sbp->f_bfree = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(sfp->sf_abytes);
	sbp->f_bavail = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(sfp->sf_tfiles);
	sbp->f_files = (long)tquad;
	tquad = fxdr_hyper(sfp->sf_ffiles);
	sbp->f_ffree = (long)tquad;
} else {
	sbp->f_bsize = fxdr_unsigned(int32_t, sfp->sf_bsize);
	sbp->f_blocks = fxdr_unsigned(int32_t, sfp->sf_blocks);
	sbp->f_bfree = fxdr_unsigned(int32_t, sfp->sf_bfree);
	sbp->f_bavail = fxdr_unsigned(int32_t, sfp->sf_bavail);
	sbp->f_files = 0;
	sbp->f_ffree = 0;
}

This converts tquad to quad_t so that the divisions work like I want.  These
conversions were added in rev.1.82 in 1999.

More changes are needed here to catch up with the recent changes to struct
statfs in FreeBSD.  The casts to long are now just wrong since the block
count fields don't have type long.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: new kernel and old programs - bad system call

2003-11-13 Thread Bruce Evans
On Thu, 13 Nov 2003, John Hay wrote:

 Is it ok to run a new kernel (after the statfs changes) and older
 programs? I thought so from what i gathered out of the commit
 messages, but my test box doesn't like it at all... Well except
 if something else broke stuff:

I have no problems with a current kernel and an old world.

 ##
 ...
 Mounting root from ufs:/dev/da0s1a
 pid 50 (sh), uid 0: exited on signal 12
 Enter full pathname of shell or RETURN for /bin/sh:
 # ls
 pid 56 (ls), uid 0: exited on signal 12
 Bad system call
 #
 ##

Maybe you don't have old programs.   Unfortunately, even /bin/sh is
affected by the changes (it has a reference to fstatfs).

I often boot old kernels (back to RELENG_4) with current utilities
and will have to do something about this.  Everything except things
like ps works with only the following changes:
- don't use the new eaccess() syscall in test(1).
- change SYS_sigaction and SYS_sigreturn to their old (RELENG_4)
  values so that the newest signal handling is not used.  This
  works almost perfectly because there are no significant changes
  to the data structures (only some semantic changes that most
  utilities don't care about).  Larger changes in signal handling
  are the main thing that prevents current utilities running under
  RELENG_3.
The statfs changes affect data structures, so they can't be avoided
by simply changing the syscall numbers.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: boot0 and fdisk / disklabel misbehaviour

2003-11-11 Thread Bruce Evans
On Tue, 11 Nov 2003, Dag-Erling [iso-8859-1] Smørgrav wrote:

 I've been busy installing various OSes on a spare disk in order to try
 to reproduce some of fefe's benchmarks.  In the process, I've noticed
 a couple of bogons in boot0 and disklabel:

  - disklabel -B trashes the partition table:

# dd if=/dev/zero of=/dev/ad0 count=20
# fdisk -i ad0
(create a FreeBSD partition)
# disklabel -rw ad0s1 auto
# newfs -U /dev/ad0s1a
# disklabel -B ad0s1a
(this trashes the partition table)

I think you mean bsdlabel.  disklabel is just a link to bsdlabel in
-current.

This was fixed in rev.1.8 of disklabel.c, but the change was lost in
bsdlabel.

  This probably happens because fdisk silently allows the user to
  create a partition that overlaps the partition table.  Arguably
  pilot error, but very confusing at the time, and fdisk should warn
  about it.

Yes.  This is the dangerously undedicated case.  Some consider this to
be an error.  I only ever used it for one drive.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Found a problem with new source code

2003-11-11 Thread Bruce Evans
On Mon, 10 Nov 2003, Jason wrote:

 I just wanted to let someone know that my buildworld fails at
 /usr/src/sys/boot/i386/boot2/boot2.c at line 362.  I get an undefined
 error for RB_BOOTINFO, by adding #define RB_BOOTINFO 0x1f it worked.

Sorry, I broke it last night.  It is now fixed.

 Also it failed at sendmail.fc or something, I don't use sendmail so I
 just did not build it.  It looks like someone already reported the
 device apic problem.  I just tried options SMP and device apic on my
 single proc athlon, panic on boot unless I chose no apic or is it no
 acpi(?) at boot.

 By the way, why adding the smp options do any good for my machine?  I
 mostly care about speed, but it seems it might just make the os unstable
 for me.

No; it is only good for multi-CPU machines.

Bruce
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: the PS/2 mouse problem

2003-11-10 Thread Bruce Evans
On Sat, 8 Nov 2003, Morten Johansen wrote:

 Scott Long wrote:
  Bruce Evans wrote:
 [... possibly too much trimmed]
  The problem here is that the keyboard controller driver tries to be too
  smart. If it detects that the hardware FIFO is full, it'll drain it into
  a per-softc, per-function ring buffer.  So having psm(4) just directly
  read the hardware is insufficient in this scheme.

What is the per-function part?  (I'm not very familiar with psm, but once
understood simpler versions of the keyboard driver.)  Several layers of
buffering might not be too bad for slow devices.  The i/o times tend to
dominate unless you do silly things like a context switch to move each
character from one buffer to other, and even that can be fast enough
(I believe it is normal for interactive input on ptys; then there's often
a remote IPC or two per character as well).

  - it sometimes calls the DELAY() function, which is not permitted in fast
interrupt handlers since apart from locking issues, fast interrupt
  handlers
are not permitted to busy-wait.
 
  Again, the keyboard controller driver is too smart for its own good.  To
  summarize:
 
  read_aux_data_no_wait()
  {
  Does softc-aux ring buffer contain data?
  return ring buffer data
 
  Check the status register
  Is the keyboard fifo full?
  DELAY(7us)
  read keyboard fifo into softc-kbd ring buffer
  Check the status register
 
  Is the aux fifo full?
  DELAY(7us)
  return aux fifo data
  }
 
  So you can wind up stalling for 14us in there, presumably because you
  cannot read the status and data registers back-to-back without a delay.
  I don't have the atkbd spec handy so I'm not sure how to optimize this.
  Do you really need to check the status register before reading the data
  register?

At least it's a bounded delay.  I believe such delays are required for
some layers of the keyboard.  Perhaps only for the keyboard (old hardware
only?) and not for the keyboard controller or the mouse.

  Many of the complications for fast interrupt handlers shouldn't be needed
  in psm.  Just make psmintr() INTR_MPSAFE.
 
  I believe that the previous poster actually tried making it INTR_MPSAFE,
  but didn't see a measurable benefit because the latency of scheduling
  the ithread is still unacceptable.

 That is 100% correct.
 In the meantime I have taken yours and Bruce's advice and rearranged
 the interrupt handler to look like this:

 mtx_lock(&sc->input_mtx);

Er, this is reasonable for INTR_MPSAFE but not for INTR_FAST.
mtx_lock() is a sleep lock so it cannot be used in fast interrupt
handlers.  mtx_lock_spin() must be used.  (My version doesn't permit
use of mtx_lock_spin() either; more primitive locking must be used.)

 while((c = read_aux_data_no_wait(sc->kbdc)) != -1) {

This is probably INTR_FAST-safe enough in practice.

  sc->input_queue.buf[sc->input_queue.tail] = c;
  if ((++ sc->input_queue.tail) >= PSM_BUFSIZE)
   sc->input_queue.tail = 0;
  count = (++ sc->input_queue.count);
 }
 mtx_unlock(&sc->input_mtx);

The locking for the queue seems to be correct except this should operate
on a spinlock too.

 if (count >= sc->mode.packetsize)
  taskqueue_enqueue(taskqueue_swi_giant, &sc->psm_task);

taskqueue_enqueue() can only be used in non-fast interrupt handlers.
taskqueue_enqueue_fast() must be used in fast interrupt handlers (except
in my version, it is not permitted so it shouldn't exist).  Note that
the spinlock/fast versions can be used for normal interrupt handlers
too, so not much more code is needed to support handlers whose fastness
is dynamically configured.
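For illustration, here is a userland sketch of the pattern being discussed: the handler side stashes bytes into a small ring buffer under a spin lock, and a deferred task drains it.  Names are hypothetical, and C11 atomic_flag stands in for the kernel's mtx_lock_spin()/mtx_unlock_spin(); this is a model of the technique, not kernel code.

```c
#include <stdatomic.h>
#include <stddef.h>

#define PSM_BUFSIZE 128         /* same buffer-size name as in the thread */

struct input_queue {
    unsigned char buf[PSM_BUFSIZE];
    size_t head, tail, count;
    atomic_flag lock;           /* stands in for a kernel spin mutex */
};

/* "Fast handler" side: take the spin lock, stash one byte, drop the lock. */
static int queue_put(struct input_queue *q, unsigned char c)
{
    int ok = 0;
    while (atomic_flag_test_and_set(&q->lock))
        ;                       /* spin */
    if (q->count < PSM_BUFSIZE) {
        q->buf[q->tail] = c;
        if (++q->tail >= PSM_BUFSIZE)
            q->tail = 0;
        q->count++;
        ok = 1;
    }
    atomic_flag_clear(&q->lock);
    return ok;
}

/* Deferred side (what the taskqueue handler would do): drain one byte,
 * or return -1 if the ring is empty. */
static int queue_get(struct input_queue *q)
{
    int c = -1;
    while (atomic_flag_test_and_set(&q->lock))
        ;
    if (q->count > 0) {
        c = q->buf[q->head];
        if (++q->head >= PSM_BUFSIZE)
            q->head = 0;
        q->count--;
    }
    atomic_flag_clear(&q->lock);
    return c;
}
```

The point of the split is that only the two small functions above touch the shared data, so only they need the spin lock.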

 And it works, but having it INTR_MPSAFE still does NOT help my problem.
 It looks to me like data is getting lost because the interrupt handler
 is unable to read it before it's gone, and the driver gets out of sync,
 and has to reset itself.
 However it now takes a few more tries to provoke the problem, so
 something seems to have improved somewhere.

This is a bit surprising.  There are still so few INTR_MPSAFE handlers
that there aren't many system activities that get in the way of running
the INTR_MPSAFE ones.  Shared interrupts prevent running of a handler
while other handlers on the same interrupt are running, and the mouse
interrupt is often shared, but if it is shared then it couldn't be fast
until recently and still can't be fast unless all the other handlers on
it are fast.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: erroneous message from locked-up machine

2003-11-10 Thread Bruce Evans
On Mon, 10 Nov 2003, Michael W. Lucas wrote:

 I came in to work today to find one of my -current machines unable to
 open a pipe.  (This probably had a lot to do with the spamd that went
 stark raving nutters overnight, but that's a separate problem.)  A
 power cycle fixed the problem, but /var/log/messages was filled with:

 Nov 10 11:05:44 bewilderbeast kernel: kern.maxpipekva exceeded, please see tuning(7).

 Interesting.

 bewilderbeast~;sysctl kern.maxpipekva
 sysctl: unknown oid 'kern.maxpipekva'
 bewilderbeast~;

The following patch fixes this and some nearby style bugs:
- source style bug: line too long
- output style bugs: comma splice, verboseness (helps make the source line
  too long), and kernel message terminated with a "."

%%%
Index: sys_pipe.c
===
RCS file: /home/ncvs/src/sys/kern/sys_pipe.c,v
retrieving revision 1.158
diff -u -2 -r1.158 sys_pipe.c
--- sys_pipe.c  9 Nov 2003 09:17:24 -   1.158
+++ sys_pipe.c  10 Nov 2003 17:21:47 -
@@ -331,5 +331,5 @@
if (error != KERN_SUCCESS) {
if (ppsratecheck(lastfail, curfail, 1))
-   printf("kern.maxpipekva exceeded, please see tuning(7).\n");
+   printf("kern.ipc.maxpipekva exceeded; see tuning(7)\n");
return (ENOMEM);
}
%%%

 And tuning(7) doesn't mention this, either.

 Is this just work-in-progress, or did someone forget to commit something?

Seems like tuning pipe kva is completely absent in tuning(7) (so the above
message can be shortened further).  You can tune kva generally as documented
there, but the pipe limit is separate.

 PS: Lesson of the day: no pipe KVA, no su.  Great fun on remote
 machines!  :-)

It's interesting that su was the point of failure.  It uses a pipe hack
for IPC.  Otherwise it doesn't use pipes, at least directly.  It
shouldn't need to use the pipe hack.  My version uses signals instead.
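As a hedged illustration of the general pattern (not su's actual code): a parent/child handshake over a pipe fails outright when pipe() returns -1, e.g. with ENOMEM once pipe KVA is exhausted, which is consistent with su being the point of failure.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Parent waits for a one-byte "go" message from the child over a pipe.
 * Returns the byte received, or -1 on any failure (including pipe()
 * itself failing, which is what happens when maxpipekva is hit). */
static int pipe_handshake(void)
{
    int fds[2];
    unsigned char c;

    if (pipe(fds) == -1)        /* fails with ENOMEM when pipe KVA is gone */
        return -1;
    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {             /* child: signal readiness and exit */
        c = 'G';
        (void)!write(fds[1], &c, 1);
        _exit(0);
    }
    close(fds[1]);
    int n = (int)read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? c : -1;
}
```

A signal-based handshake (SIGUSR1 plus sigsuspend, say) needs no kernel pipe resources, which is presumably why the signals version survives this failure mode.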

Bruce


Re: serial console oddity

2003-11-09 Thread Bruce Evans
On Sat, 8 Nov 2003, Don Lewis wrote:

 I've been seeing some weird things for many months when using a serial
 console on my -CURRENT box.  I finally had a chance to take a closer
 look today.

 It looks like the problem is some sort of interference between kernel
 output to the console and userland writes to /dev/console.  I typically
 see syslogd output to the console get corrupted.  Each message that
 syslogd writes seems to get truncated or otherwise corrupted.  The most
 common thing I see is that each syslog message is reduced to a space and
 the first character of the month, or sometimes just a space, or
 sometimes nothing at all.

This is (at least primarily) a longstanding bug in ttymsg().  It uses
nonblocking mode so that it doesn't block in write() or close().  For
the same reason, it doesn't wait for output to drain before close().
If the close happens to be the last one on the device, this causes any
data buffered in the tty and lower software layers to be discarded
cleanly and any data in lower hardware layers to be discarded in a
driver plus hardware-dependent way (usually not so cleanly, especially
for the character being transmitted).

 This is totally consistent until I kill
 -HUP syslogd, which I believe causes syslogd to close and open
 /dev/console, after which the syslog output appears correct on the
 console. When the syslogd output is being corrupted, I can cat a file to
 /dev/console and the output appears to be correct.

When I debugged this, syslogd didn't seem to keep the console open,
so the open()/close() in ttymsg() always caused the problem.  I didn't
notice killing syslogd makes a difference.  Perhaps it helps due to a
missing close.  Holding the console open may be a workaround or even
the correct fix.  It's not clear where this should be done (should all
clients of ttymsg() do it?).  Running getty on the console or on the
underlying tty device should do it accidentally.

 I truss'ed syslogd, and it appears to be working normally, the writev()
 call that writes the data to the console appears to be writing the
 correct character count, so it would appear that the fault is in the
 kernel.

If there are any kernel bugs in this area, then they would be that
last close of the console affects the underlying tty.  The multiple
console changes are quite likely to have broken this if getty is run
on the underlying tty (they silently discarded the half-close of the
underlying tty which was needed to avoid trashing some of its state
when only the console is closed).

 The problem doesn't appear to be specific to syslogd, because I have
 seen the output from the shutdown scripts that goes to the console get
 truncated as well.

Yes, in theory it should affect anything that uses ttymsg() or does
direct non-blocking writes without waiting for the output to drain.

 I have my serial console running at the default 9600 bps.

I always use 115200 bps and the symptoms are similar right down to
normally getting only the first character of the month name followed
by 0-1 bytes of garbage.  The first character of the month name is
just the first character of the message.  Apparently my systems are
fast enough for close() to be called before transmission of the second
character has completed (2 * 87+ usec at 115200 bps).
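The arithmetic behind the 87 usec figure: one asynchronous character in 8N1 framing is 10 bit times (start + 8 data + stop), so a character takes 10/115200 s, about 86.8 usec.  A trivial helper, illustrative only:

```c
/* Wire time in microseconds for one asynchronous character,
 * assuming 8N1 framing: start bit + 8 data bits + stop bit = 10 bits. */
static double char_time_us(double bps)
{
    return 10.0 / bps * 1e6;
}
```

At the default 9600 bps the same formula gives about 1042 usec per character, which is why slower consoles hide the race less often than one might expect: the message is still cut off after the first character or two.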

Here are some half-baked fixes.  The part that clears O_NONBLOCK is
wrong, and the usleep() part is obviously a hack.  ttymsg() shouldn't
block even in close(), since if the close is in the parent ttymsg()
might block forever and if the close() is in a forked child then
blocking could create zillions of blocked children.
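A minimal sketch of the "wait for drain before close" idea (a hypothetical helper, not the actual patch; as noted above, unbounded blocking in the drain is itself a problem, so a real fix would need a timeout around it):

```c
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Before the final close, drop O_NONBLOCK and wait for pending output
 * to drain, so bytes buffered in the tty are not discarded.  tcdrain()
 * fails harmlessly with ENOTTY on non-tty descriptors. */
static int close_after_drain(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags != -1)
        (void)fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
    (void)tcdrain(fd);          /* may block; a real fix must bound this */
    return close(fd);
}
```
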

Another part of the patch is concerned with limiting forked children.
If I were happy with that part then blocking would not be so bad.  In
practice, I don't have enough system activity for blocked children to
be a problem.  To see the problem with blocked children, do something
like the following:
- turn off clocal on the console so that the console can block better.
  For sio consoles this often requires turning it off in the lock-state
  device, since the driver defends against this foot shooting by locking
  it on.
- hold the console open or otherwise avoid the original bug in this
  thread, else messages will just be discarded in close() faster than
  they can pile up.
- turn off your external console device or otherwise drop carrier.
- send lots of messages.

%%%
Index: ttymsg.c
===
RCS file: /home/ncvs/src/usr.bin/wall/ttymsg.c,v
retrieving revision 1.11
diff -u -2 -r1.11 ttymsg.c
--- ttymsg.c11 Oct 2002 14:58:34 -  1.11
+++ ttymsg.c11 Oct 2002 18:13:51 -
@@ -32,14 +32,16 @@
  */

-#include <sys/cdefs.h>
-
-__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $");
-
+#if 0
 #ifndef lint
-static const char sccsid[] = "@(#)ttymsg.c	8.2 (Berkeley) 11/16/93";
+static char sccsid[] = "@(#)ttymsg.c	8.2 (Berkeley) 11/16/93";
+#endif /* not lint */
 

Re: hard lockup with new interrupt code, possible cause irq14: ata0

2003-11-08 Thread Bruce Evans
On Sat, 8 Nov 2003, Barney Wolff wrote:

 Try adding
 options   NO_MIXED_MODE
 to your conf.  That fixed boot-time hangs on my Asus A7M266-D.

BTW, NO_MIXED_MODE is missing in NOTES.

Bruce


RE: New interrupt stuff breaks ASUS 2 CPU system

2003-11-07 Thread Bruce Evans
On Thu, 6 Nov 2003, John Baldwin wrote:

 On 06-Nov-2003 Harti Brandt wrote:
  JB> I figured out what is happening I think.  You are getting a spurious
  JB> interrupt from the 8259A PIC (which comes in on IRQ 7).  The IRR register
  JB> lists pending interrupts still waiting to be serviced.  Try using
  JB> 'options NO_MIXED_MODE' to stop using the 8259A's for the clock and see if
  JB> the spurious IRQ 7 interrupts go away.
 
  Ok, that seems to help. Interesting although why do these interrupts
  happen only with a larger HZ and when the kernel is doing printfs (this
  machine has a serial console). I have also not tried to disable SIO2 and
  the parallel port.

 Can you also try turning mixed mode back on and using
 http://www.FreeBSD.org/~jhb/patches/spurious.patch

 You should get some stray IRQ 7's in the vmstat -i output as well as a few
 printf's to the kernel console.

Other changes fixed the problem with the apic case not working on my BP6,
except the apic causes many more interrupts on serial ports at 921600 bps,
almost enough to overload the system with just 2 active serial ports.
I've now gathered lots of statistics for sio interrupt performance.  The
bad effect of the apic on performance is shown in the -current(apic)
lines for a45 and a45b only:

%%%
Keywords:
c04 = send at 115200 bps on cuac00, receive at 115200 bps on cuac04
c04b = like c04 plus send and receive in other direction too (b = bidirectional)
  (cuac* are on a Cyclades 8yo (2 * cd1400 isa))
a01 = like c04 except use ports cuaa[01]
a01b = like a01 except bidirectional
  (cuaa[01] are standard motherboard 16550 clones)
a45 = like a01 except use speed 921600 bps and ports cuaa[45]
a45b = like a45 except bidirectional
  (cuaa[45] are on a VScom 200HV2 (2 * 16950 pci))
-current(ointr) = -current before new interrupt code
-current = plain current (2003/11/06)
-current(apic) = -current with apic configured for UP kernel on SMP hardware
-current(bde) = my version of -current (new interrupt code not merged yet)
+iir,+stream,+intr0 = my version of -current with variants of sio
  optimizations (only UART-independent ones; optimizations for 16950 UARTs
  give factor of 2 reduction in overheads)

Overheads for doing above I/O in percent (min-max for 3 runs) on an ABIT BP6
with 366 MHz and 400 MHz Celerons:

Devices OS              UP              SMP
------- --              --              ---
c04     RELENG_4(4.9)   6.58-6.59       Not measured (method problems)
        -current(ointr) 9.65-9.76       6.77-7.11
        -current        10.64-10.69     6.09-6.36
        -current(apic)  9.63-9.90       As above (apic standard)
        -current(bde)   6.83-6.96       3.54-3.78
c04b    RELENG_4(4.9)   12.83-12.90     Not measured (method problems)
        -current(ointr) 19.42-19.44     13.70-13.90
        -current        20.23-20.24     12.01-12.48
        -current(apic)  17.77-17.89     As above (apic standard)
        -current(bde)   12.74-13.23     6.23-6.53
a01     RELENG_4(4.9)   7.50-7.50       Not measured (method problems)
        -current(ointr) 7.67-7.69       4.44-4.77
        -current        8.09-8.13       4.72-5.60
        -current(apic)  7.75-8.02       As above (apic standard)
        -current(bde)   7.53-7.63       4.49-4.54
        +iir            7.09-7.30       Not measured (kernel problems)
        +stream         6.23-6.24
        +iir+stream     5.47-5.52
        +intr0+iir      5.24-5.26       2.75-2.91
a01b    RELENG_4(4.9)   14.64-14.84     Not measured (method problems)
        -current(ointr) 14.36-15.10     8.65-8.92
        -current        14.79-14.87     8.18-9.77
        -current(apic)  14.80-14.91     As above (apic standard)
        -current(bde)   14.19-14.24     8.13-8.46
        +iir            14.05-14.13
        +stream         12.12-12.17
        +iir+stream     10.58-10.62
        +intr0+iir      10.07-10.12     5.10-5.63
a45     RELENG_4(4.9)   21.81-21.86     Not measured (method problems)
        -current(ointr) 24.00-24.04     13.3
        -current        25.13-25.20     31.4-31.5(86)
        -current(apic)  51.02-51.05(87) As above (apic standard)
        -current(bde)   21.83-22.02     10.71-10.89
        +iir            21.98-22.05
        +stream         27.78-27.81
        +iir+stream     22.08-22.16
        +intr0+iir      16.76-16.92     6.85-8.11
a45b    RELENG_4(4.9)   46.23-46.44(87) Not measured (method problems)
        -current(ointr) 54.01-54.37(86) 25.2 (82/82)
        -current        56.04-56.93(85) 70.1-70.7(80)
        -current(apic)  87.35-88.22(78) As above (apic standard)
        -current(bde)   42.06-42.12     Not measured (kernel problems)
   

Re: the PS/2 mouse problem

2003-11-07 Thread Bruce Evans
On Fri, 7 Nov 2003, Morten Johansen wrote:

 Morten Johansen wrote:
  Scott Long wrote:
 
  One thought that I had was to make psmintr() be INTR_FAST.  I need to
  stare at the code some more to fully understand it, but it looks like it
  wouldn't be all that hard to do.  Basically just use the interrupt
  handler
  to pull all of the data out of the hardware and into a ring buffer in
  memory, and then a fast taskqueue to process that ring buffer.  It would
  at least answer the question of whether the observed problems are due to
  ithread latency.  And if done right, no locks would be needed in
  psmintr().

However, it is usually easier to use a lock even if not strictly necessary.
psm as currently structured uses the technique of calling psmintr() from
the timeout handler.  This requires a lock.  If this were not done, then
the timeout routine would probably need to access hardware using scattered
i/o instructions, and these would need locks (to prevent them competing
with i/o instructions in psmintr()).  Putting all the hardware accesses
in the fast interrupt handler is simpler.  The sio driver uses this technique
but doesn't manage to put _all_ the i/o's in the interrupt handler, so it
ends up having to lock out the interrupt handler all over the place.
Ring buffers can be self-locking using delicate atomic instructions, but
they are easier to implement using locks.
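A sketch of the self-locking variant mentioned above: a single-producer, single-consumer ring that needs no lock because each index is written by only one side, with C11 atomics providing the ordering.  Userland-style and illustrative only, not kernel code.

```c
#include <stdatomic.h>

#define RB_SIZE 64u             /* power of two, so wrap is a cheap mask */

struct spsc_ring {
    unsigned char buf[RB_SIZE];
    _Atomic unsigned head;      /* consumer index, advanced only by reader */
    _Atomic unsigned tail;      /* producer index, advanced only by writer */
};

/* Producer (interrupt handler) side: returns 0 if the ring is full. */
static int spsc_put(struct spsc_ring *r, unsigned char c)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);

    if (t - h == RB_SIZE)
        return 0;               /* full */
    r->buf[t & (RB_SIZE - 1)] = c;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns the next byte, or -1 if the ring is empty. */
static int spsc_get(struct spsc_ring *r)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    unsigned char c;

    if (t == h)
        return -1;              /* empty */
    c = r->buf[h & (RB_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return c;
}
```

This is the "delicate atomic instructions" trade-off in miniature: no lock to contend on, but the release/acquire pairing must be exactly right, which is why the locked version is easier to get correct.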

  I can reproduce the problem consistently on my machine, by moving the
  mouse around, while executing e.g this command in a xterm:
 
  dd if=/dev/zero of=test bs=32768 count=4000; sync; sync; sync
 
  when the sync'ing sets in the mouse attacks.
  It is very likely due to interrupt latency.
 
  I'd be happy to test any clever patches.

 Wow. You are completely right!
 By using a MTX_SPIN mutex instead, and marking the interrupt handler
 INTR_MPSAFE | INTR_FAST, my problem goes away.
 I am no longer able to reproduce the mouse attack.
 I have not noticed any side-effects of this. Could there be any?
 I will file a PR with an updated patch, unless you think it's a better
 idea to rearrange the driver.
 Probably the locking could be done better anyway.

Er, psmintr() needs large changes to become a fast interrupt handler.  It
does many things that may not be done by a fast interrupt handler, starting
with the first statement in it:

/* read until there is nothing to read */
	while((c = read_aux_data_no_wait(sc->kbdc)) != -1) {

This calls into the keyboard driver, which is not written to support any
fast interrupt handlers.  In general, fast interrupt handlers may not call
any functions, since a called function doesn't know that it is called in
fast interrupt handler context and may do things that may not be done in
fast interrupt handler context.  As it happens, read_aux_data_no_wait()
does the following bad things:
- it accesses private keyboard data.  All data that is accessed by a fast
  interrupt handler must be locked by a common lock or use self-locking
  accesses.  Data in another subsystem can't reasonably be locked by this
  (although the keyboard subsystem is close to psm, you don't want to
  export the complexities of psmintr()'s locking to the keyboard subsystem).
- it calls other functions.  The closure of all these calls must be examined
  and made fast-interrupt-handler safe before this is safe.  The lowest level
  will resolve to something like inb(PSMPORT) and this alone is obviously
  safe provided PSMPORT is only accessed in the interrupt handler or is
  otherwise locked.  (Perhaps the private keyboard data is actually private
  psm data that mainly points to PSMPORT.  Then there is no problem with the
  data accesses.  But the function calls make it unclear who owns the data.)
- it sometimes calls the DELAY() function, which is not permitted in fast
  interrupt handlers since apart from locking issues, fast interrupt handlers
  are not permitted to busy-wait.
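A userland model of how the busy-wait can be bounded (hypothetical names; the real kbdc code differs): retry the status check a fixed number of times instead of calling DELAY() open-endedly, so the handler's worst-case run time is known.

```c
/* Bounded polling: call check() up to max_tries times; return 1 as soon
 * as it reports ready, 0 if the budget is exhausted.  A fast interrupt
 * handler must bound its worst case this way rather than spinning in
 * DELAY() for an arbitrary time. */
static int poll_ready(int (*check)(void *), void *arg, int max_tries)
{
    for (int i = 0; i < max_tries; i++)
        if (check(arg))
            return 1;
    return 0;
}

/* Toy status source for demonstration: becomes ready on the n-th poll. */
static int ready_after(void *arg)
{
    int *n = arg;
    return --*n <= 0;
}
```

When the budget runs out, the handler gives up and lets the deferred (non-fast) code retry, rather than stalling all other interrupts.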

Many of the complications for fast interrupt handlers shouldn't be needed
in psm.  Just make psmintr() INTR_MPSAFE.  This is nontrivial, however.
Fine-grained locking gives many of the complications that were only
in fast interrupt handlers in RELENG_4.  E.g., for psmintr() to be MPSAFE,
all of its calls into the keyboard subsystem need to be MPSAFE, and they
are unlikely to be so unless the keyboard subsystem is made MPSAFE.

The following method can be used to avoid some of the complications:
make the interrupt handler not touch much data, so that it can be
locked easily.  The data should be little more than a ring buffer.
Make the handler either INTR_MPSAFE or INTR_FAST (it doesn't matter
for slow devices like psm).  Put all the rest of what was in the
interrupt handler in non-MPSAFE code (except where it accesses data
shared with the interrupt handler) so that all of this code and its
closure doesn't need to be made MPSAFE.  This method is what the sio
driver uses in -current, sort of accidentally.  sio's SWI handler and
all of the tty subsystem are not 

Re: New interrupt stuff breaks ASUS 2 CPU system

2003-11-07 Thread Bruce Evans
On Fri, 7 Nov 2003, Stefan Eßer wrote:

 On 2003-11-07 20:04 +1100, Bruce Evans [EMAIL PROTECTED] wrote:
  However, using the apic almost doubles the overheads for the a45 cases.
  This seems to be due to extra interrupts.  The UART and/or driver already

 Just another data point:

 Seems that the interrupt rate doubled for drm0 on my system
 (from 60 to 120 driving a LCD at 60Hz vertical refresh).

 I thought this might be a problem with shared interrupts (drm0
 and xl0 shared APIC IRQ 16), but removing the (actually unused)
 xl driver did not make a difference ...

Hmm.  My a45 UARTs are the only ones with a pci level triggered interrupt:

Nov  7 01:48:44 gamplex kernel: ioapic0: Routing IRQ 5 -> intpin 19
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 5 disabled
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 19 trigger: level
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 19 polarity: active-lo

There is only one other level triggered interrupt in the system that is
used:

Nov  7 01:48:44 gamplex kernel: ioapic0: Routing IRQ 11 -> intpin 18
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 11 disabled
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 18 trigger: level
Nov  7 01:48:44 gamplex kernel: ioapic0: intpin 18 polarity: active-lo

and I suspect it may be doing strange things too: I found that rev.1.23
of ata_lowlevel.c broke atapicam, but the new interrupt code magically
fixed it.  One of the atapicam devices is the only device on IRQ11.

Bruce


Re: new interrupt code: panic when going multiuser

2003-11-05 Thread Bruce Evans
On Tue, 4 Nov 2003, John Baldwin wrote:

 On 04-Nov-2003 Bruce Evans wrote:
   - on a BP6, UP kernels without apic work except for cyintr(), but SMP
 kernels have problems with missing interrupts for ata devices and hang
 at boot time.
 
  Is this related to the ata-lowlevel commit you mentioned above?
 
  No.  It looks like the interrupt is really going missing for some
  reason.  This is without any acpica.

 What if you try a UP kernel with 'device apic' (i.e. no options SMP),
 do you still have ata problems?  Is this on an SMP machine btw?

Yes, 'device apic' breaks the UP case in the same way that the new
interrupt code breaks the SMP case.  BP6's are SMP and mine used to
mostly work, though not well enough to actually be worth using in SMP
mode (it works faster in UP mode with its slowest CPU overclocked 42%;
mismatched CPUs and thermal problems prevent significant overclocking
in SMP mode).

Other bugs in the new interrupt code that I've noticed so far:
- lots of pessimizations.  The main one is that the PIC is now masked
  and unmasked for fast interrupt handlers.  The masking should be
  done at a higher level for all interrupt handlers so that it doesn't
  need to be undone in some cases, and neither masking nor unmasking
  should be done for fast interrupt handlers.  This pessimization and
  others make fast interrupt handlers more non-fast than before.  They
  are now slower than normal interrupt handlers in FreeBSD-[1-4].  They
  still have lower latency than normal interrupt handlers in FreeBSD-[1-4],
  but not as low as actual fast interrupt handlers.

Bruce


Re: NULL td passed to propagate_priority() when using xmms...

2003-11-04 Thread Bruce Evans
On Mon, 3 Nov 2003, John Baldwin wrote:

 On 01-Nov-2003 Soren Schmidt wrote:
  It seems Sean Chittenden wrote:
  Howdy.  I'm not sure if this is a ULE bug or a KSE bug, or both, but,
  for those interested (this is using ule 1.67, rebuilding world now),
  here's my stack.  I couldn't figure out where td was being set to
  NULL.  :( Oh!  Where is TD_SET_LOCK defined?  egrep -r didn't turn up
  anything.  -sc
 
  Its not ULE, I'm running 4BSD and has gotten this on boot for over a
  week now, rendering -current totally useless...

 Having a kernel panic with INVARIANTS on would really help narrow down
 where the bug is.

I found something that causes this bug fairly reliably:
- configure ddb so that db_print_backtrace() is called on panics.
- break the fd driver so that the panic() in fdstrategy() is called on
  floppy accesses.
- attempt to access a floppy so that fdstrategy() is called.
- db_print_backtrace() then does bad things.  It never completes here,
  though it works in other contexts.  Usually it prints only the first
  line or two.  Then quite often ddb is called for a null pointer panic
  in propagate_priority().

More details about the null pointer panic:  This seems to have nothing
to do with scheduling.  propagate_priority() is not called with a null
td of course, but it sometimes follows a null m:

%%%
/*
 * Pick up the mutex that td is blocked on.
 */
m = td->td_blocked;
MPASS(m != NULL);

/*
 * Check if the thread needs to be moved up on
 * the blocked chain
 */
if (td == TAILQ_FIRST(&m->mtx_blocked)) {
continue;
}
%%%

I don't have invariants enabled, so MPASS(m != NULL) doesn't do anything,
but m is null so attempting to load m->mtx_blocked causes a panic.
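A userland model of why the assertion was invisible here: without INVARIANTS the MPASS macro compiles to nothing, so the null pointer is only caught by the hardware fault later.  The macro name mirrors the kernel's, but this is only a sketch.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef INVARIANTS
#define MPASS(ex) do {                                          \
    if (!(ex)) {                                                \
        fprintf(stderr, "panic: assertion %s failed\n", #ex);   \
        abort();                                                \
    }                                                           \
} while (0)
#else
#define MPASS(ex) do { } while (0)  /* no-op: a bad pointer faults later */
#endif

/* With INVARIANTS off, a NULL input sails past the MPASS; this function
 * reports whether the pointer was non-null instead of dereferencing it. */
static int check_blocked(void *m)
{
    MPASS(m != NULL);
    return m != NULL;
}
```
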

For the backtrace context, propagate_priority() gets called for attempting
to aquire a lock in softclock().  Tasks like the softclock task get
scheduled despite the system being in panic().  ps seemed to show that the
user process doing the floppy access no longer existed.  I don't know how
that could happen, since the panic() is done in the context of the that
process.

More details about bugs in db_print_backtrace(): Maybe the stack is
messed up.  Attempting to access invalid stack offsets can cause
problems.  My version of db_print_backtrace() has extra code to attempt
not to access invalid offsets, but there is normally no problem since
ddb's trap handler fixes up the problem.  But backtrace() bogusly calls
db_print_backtrace() in non-ddb context and then the longjmp in the trap
handler goes to hyperspace if anywhere.

Bugs tripped over while debugging this: Putting a breakpoint in fdopen()
didn't work, because fd.c:fdopen() conflicts with kern_descrip.c:fdopen().
This was broken in fd.c 1.259.  There are hundreds of similar conflicts
in GENERIC, some for obviously broken things like the same malloc type
being static in several files.

Bruce


Re: new interrupt code: panic when going multiuser

2003-11-04 Thread Bruce Evans
On Tue, 4 Nov 2003, John Baldwin wrote:

 On 04-Nov-2003 Lukas Ertl wrote:
  On Tue, 4 Nov 2003, Lukas Ertl wrote:
 
  I somehow can't get at a good vmcore :-(.  But I found out that the
  machine boots fine in Safe Mode, where DMA and hw.ata.wc is turned off.
 
  Ok, if I set hw.ata.ata_dma=0 in loader.conf, it boots fine.  Could there
  be some issue with ATAng + new interrupt code?

 Can you provide a dmesg please?  There may be a weird issue with
 some PPro's for example that I haven't been able to test.

I have noticed the following problems with the new interrupt code so far:
- it conflicts with a few thousand lines of local changes.
- yesterday's backup kernels which I preserved to run benchmarks with
  all hang at boot time while probing atapicam devices.  Backing out
  rev.1.23 of ata-lowlevel.c fixes the hang, but I didn't back up
  yesterday's sources so it will take some work to regenerate working
  versions of yesterday's kernels.

The following is without the local changes:
- cyintr(int unit) panics because it is passed a pointer to somewhere.
  I think all compat_isa devices are broken for unit 0 because unit 0
  is represented by a null pointer.
- on a BP6, UP kernels without apic work except for cyintr(), but SMP
  kernels have problems with missing interrupts for ata devices and hang
  at boot time.

Bruce


Re: new interrupt code: panic when going multiuser

2003-11-04 Thread Bruce Evans
  The following is without the local changes:
  - cyintr(int unit) panics because it is passed a pointer to somewhere.
I think all compat_isa devices are broken for unit 0 because unit 0
is represented by a null pointer.

 Ah, ok.  Yes, this is a semantic change.  To try and support clock interrupts,
 a fast handler that passes a NULL argument will get a pointer to the intrframe
 as its argument.  I got the idea via sparc64 from [EMAIL PROTECTED].  Perhaps
 something can be faked up in the compat_isa shims to fix this.

Clock interrupt handlers have always been a nasty special case.

 Please try http://www.FreeBSD.org/~jhb/patches/isa_compat.patch

Will try later today.  It should work, but adds yet more overhead.

  - on a BP6, UP kernels without apic work except for cyintr(), but SMP
kernels have problems with missing interrupts for ata devices and hang
at boot time.

 Is this related to the ata-lowlevel commit you mentioned above?

No.  It looks like the interrupt is really going missing for some
reason.  This is without any acpica.

Bruce


Re: More ULE bugs fixed.

2003-11-03 Thread Bruce Evans
On Sun, 2 Nov 2003, Jeff Roberson wrote:

 On Sat, 1 Nov 2003, Bruce Evans wrote:

  My simple make benchmark now takes infinitely longer with ULE under SMP,
  since make -j 16 with ULE under SMP now hangs nfs after about a minute.
  4BSD works better.  However, some networking bugs have developed in the
  last few days.  One of their manifestations is that SMP kernels always
  panic in sbdrop() on shutdown.

This was fixed by setting debug.mpsafenet to 0 (fxp is apparently not MPSAFE
yet).

The last run with sched_ule.c 1.75 shows little difference between ULE
and 4BSD:

% *** zqz.4bsd.1	Wed Oct 29 22:03:29 2003
% --- zqz.ule.3	Sun Nov  2 22:58:53 2003
% ***
% *** 4 
% --- 5,6 
% + === atm
% + === atm/sscop

The tree compiled by 4BSD is 4 days older so ULE does these extra.

% ***
% *** 227 
% !18.49 real 8.26 user 6.38 sys
% --- 229 
% !18.44 real 8.00 user 6.43 sys

Differences for make obj (all this in usr.bin tree).

% ***
% *** 229,233 
% !265  average shared memory size
% !116  average unshared data size
% !125  average unshared stack size
% !  23222  page reclaims
% ! 26  page faults
% --- 231,235 
% !274  average shared memory size
% !118  average unshared data size
% !128  average unshared stack size
% !  22760  page reclaims
% ! 25  page faults
% ***
% *** 236,241 
% !918  block output operations
% !   9893  messages sent
% !   9893  messages received
% !230  signals received
% !  13034  voluntary context switches
% !   1216  involuntary context switches
% --- 238,243 
% !926  block output operations
% !   9973  messages sent
% !   9973  messages received
% !232  signals received
% !  17432  voluntary context switches
% !   1583  involuntary context switches

Tiny differences in time -l output for obj stage, except ULE does more
context switches.

The signals are mostly SIGCHLD (needed to fix make(1)).

% ***
% *** 245 
% --- 248,249 
% + === atm
% + === atm/sscop
% ***
% *** 506 
% !   126.67 real57.42 user43.83 sys
% --- 510 
% !   124.43 real58.07 user42.17 sys
% ***
% *** 508,512 
% !   1973  average shared memory size
% !803  average unshared data size
% !128  average unshared stack size
% ! 203770  page reclaims
% !   1459  page faults
% --- 512,516 
% !   1920  average shared memory size
% !784  average unshared data size
% !127  average unshared stack size
% ! 203124  page reclaims
% !   1464  page faults
% ***
% *** 514,520 
% !165  block input operations
% !   1463  block output operations
% !  83118  messages sent
% !  83117  messages received
% !265  signals received
% ! 100319  voluntary context switches
% !   8113  involuntary context switches
% --- 518,524 
% !167  block input operations
% !   1469  block output operations
% !  83234  messages sent
% !  83236  messages received
% !267  signals received
% ! 125750  voluntary context switches
% !  17825  involuntary context switches

Similarly for depend stage.

% ***
% *** 524 
% --- 529,530 
% + === atm
% + === atm/sscop
% ***
% *** 701 
% !   291.30 real   307.00 user73.77 sys
% --- 707 
% !   290.28 real   308.16 user74.05 sys
% ***
% *** 703,707 
% !   2073  average shared memory size
% !   2076  average unshared data size
% !127  average unshared stack size
% ! 624020  page reclaims
% !156  page faults
% --- 709,713 
% !   2084  average shared memory size
% !   2056  average unshared data size
% !128  average unshared stack size
% ! 626651  page reclaims
% !154  page faults
% ***
% *** 709,715 
% ! 72  block input operations
% !   2122  block output operations
% !  45315  messages sent
% !  45317  messages received
% !691  signals received
% ! 195785  voluntary context switches
% !  58130  involuntary context switches
% --- 715,721 
% ! 83  block input operations
% !   2133  block output operations
% !  45532  messages sent
% !  45524  messages received
% !759  signals received
% ! 228998  voluntary context switches
% ! 128078  involuntary context switches

Similarly for the all stage.  The benchmark was not run carefully enough
for the 1 second differences in the times to be significant.

 You commented on the nice cutoff before.  What do you believe the correct
 behavior is?  In ULE I went to great lengths to be certain that I emulated
 the old behavior of denying nice +20

Re: More ULE bugs fixed.

2003-11-02 Thread Bruce Evans
On Fri, 31 Oct 2003, Sam Leffler wrote:

 On Friday 31 October 2003 09:04 am, Bruce Evans wrote:

  My simple make benchmark now takes infinitely longer with ULE under SMP,
  since make -j 16 with ULE under SMP now hangs nfs after about a minute.
  4BSD works better.  However, some networking bugs have developed in the
  last few days.  One of their manifestations is that SMP kernels always
  panic in sbdrop() on shutdown.

 I'm looking at something similar now.  If you have a stack trace please send
 it to me (along with any other info).  You might also try booting
 debug.mpsafenet=0.

Turning off mpsafenet fixed all these problems.

These console messages are with it not turned off.  fxp is the only
physical network device.

%%%
WARNING: loader(8) metadata is missing!
[ preserving 869208 bytes of kernel symbol table ]
Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.1-CURRENT #1005: Sun Nov  2 20:38:42 EST 2003
[EMAIL PROTECTED]:/c/sysc/i386/compile/smp
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Pentium II/Pentium II Xeon/Celeron (400.91-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x665  Stepping = 5
  Features=0x183fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 268435456 (256 MB)
avail memory = 255369216 (243 MB)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 - irq 0
IOAPIC #0 intpin 17 - irq 9
IOAPIC #0 intpin 18 - irq 11
IOAPIC #0 intpin 19 - irq 5
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): apic id:  0, version: 0x00040011, at 0xfee0
 cpu1 (AP):  apic id:  1, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec0
Pentium Pro MTRR support enabled
npx0: math processor on motherboard
npx0: flags 0x80
npx0: INT 16 interface
pcibios: BIOS version 2.10
Using $PIR table, 8 entries at 0xc00fdef0
pcib0: Intel 82443BX (440 BX) host to PCI bridge at pcibus 0 on motherboard
pci0: PCI bus on pcib0
pcib1: PCI-PCI bridge at device 1.0 on pci0
pci1: PCI bus on pcib1
pci1: display, VGA at device 0.0 (no driver attached)
isab0: PCI-ISA bridge at device 7.0 on pci0
isa0: ISA bus on isab0
atapci0: Intel PIIX4 UDMA33 controller port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata0: [MPSAFE]
ata1: at 0x170 irq 15 on atapci0
ata1: [MPSAFE]
pci0: serial bus, USB at device 7.2 (no driver attached)
piix0: PIIX Timecounter port 0x5000-0x500f at device 7.3 on pci0
Timecounter PIIX frequency 3579545 Hz quality 0
pci0: multimedia, video at device 11.0 (no driver attached)
pci0: multimedia at device 11.1 (no driver attached)
fxp0: Intel 82559 Pro/100 Ethernet port 0xa400-0xa43f mem 
0xea00-0xea0f,0xea104000-0xea104fff irq 9 at device 13.0 on pci0
fxp0: Ethernet address 00:90:27:99:02:99
miibus0: MII bus on fxp0
inphy0: i82555 10/100 media interface on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: [MPSAFE]
puc0: Titan VScom PCI-200HV2 port 0xb000-0xb01f,0xac00-0xac07,0xa800-0xa807 mem 
0xea103000-0xea103fff,0xea102000-0xea102fff irq 5 at device 17.0 on pci0
sio4: Titan VScom PCI-200HV2 on puc0
sio4: type 16550A
sio5: Titan VScom PCI-200HV2 on puc0
sio5: type 16550A
atapci1: HighPoint HPT366 UDMA66 controller port 
0xbc00-0xbcff,0xb800-0xb803,0xb400-0xb407 irq 11 at device 19.0 on pci0
atapci1: [MPSAFE]
ata2: at 0xb400 on atapci1
ata2: [MPSAFE]
atapci2: HighPoint HPT366 UDMA66 controller port 
0xc800-0xc8ff,0xc400-0xc403,0xc000-0xc007 irq 11 at device 19.1 on pci0
atapci2: [MPSAFE]
ata3: at 0xc000 on atapci2
ata3: [MPSAFE]
orm0: Option ROMs at iomem 0xc8000-0xcbfff,0xc-0xc7fff on isa0
fdc0: Enhanced floppy controller (i82077, NE72065 or clone) at port 
0x3f7,0x3f0-0x3f5 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1440-KB 3.5 drive on fdc0 drive 0
atkbdc0: Keyboard controller (i8042) at port 0x64,0x60 on isa0
atkbd0: AT Keyboard flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: PS/2 Mouse irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: Generic ISA VGA at port 0x3c0-0x3df iomem 0xa-0xb on isa0
sc0: System console at flags 0x100 on isa0
sc0: VGA 16 virtual consoles, flags=0x100
sio0 at port 0x3f8-0x3ff irq 4 flags 0x90 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
cy0 at iomem 0xd4000-0xd5fff irq 10 on isa0
cy0: driver is using old-style compatibility shims
ppc0: Parallel port at port 0x378-0x37f irq 7 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
ppbus0: Parallel port bus on ppc0
ppbus0: IEEE1284 device found
Probing for PnP devices on ppbus0:
plip0: PLIP network interface on ppbus0
lpt0: Printer on ppbus0
lpt0: Interrupt-driven port
ppi0: Parallel I/O on ppbus0
unknown: PNP0303 can't assign resources (port)
speaker0: PC

RE: lockmgr panic on shutdown

2003-11-01 Thread Bruce Evans
On Sun, 2 Nov 2003 [EMAIL PROTECTED] wrote:

 The obvious solution might be to change line 1161 of ffs_vfsops to
 pass vget() curthread rather than td. I assume there's a good
 reason why thread0 is passed from boot(), but I can't see why
 that's of any use to the vnode locking.

Passing thread0 in boot() is a quick (and not even wrong) fix for
the problem that there is no valid current process^Wthread in the
panic case.  Long ago in Net/2 (still in Lite2 for at least the
i386 version), sync() in boot() was passed the completely bogus
parameters ((struct sigcontext *)0) (instead of (p, uap, retval)).
This worked to the extent that sync()'s proc pointer was not passed
further or not dereferenced.  Now there are lots of locks, and since
thread0 is never the correct lock holder, things work at most to
the extent that sync()'s proc pointer is not passed further.
curthread is never null in -current, so upgrading to the version that
passes it (i386/i386/machdep.c 1.111 (actually passes curproc)) would
probably help in the non-panic case without increasing bugs for the
panic case.  However, passing curthread is still wrong for the panic
case due to the following complications:
- panics may occur during context switches or in other critical regions
  when curthread is not quite current.
- under SMP, curthread is per-CPU, so having it non-null doesn't really
  help.  Locks may be held by curproc's running on other CPUs, and in
  panic() it is difficult to handle the other CPUs correctly -- if you
  stop them then they won't be able to release their locks, and if you
  let them run they may run into you.  Hopefully in the case of a
  normal shutdown all the other CPUs release their locks and stop before
  the sync().

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Anyone object to the following change in libc?

2003-10-31 Thread Bruce Evans
On Thu, 30 Oct 2003, Garrett Wollman wrote:

 On Thu, 30 Oct 2003 12:32:46 +0100 (CET), Harti Brandt [EMAIL PROTECTED] said:

  The c89 utility (which specified a compiler for the C Language specified
  by the ISO/IEC 9899:1990 standard) has been replaced by a c99 utility
  (which specifies a compiler for the C Language specified by the
  ISO/IEC 9899:1999 standard).

 More specifically: IEEE Std. 1003.1-2001 is aligned to ISO/IEC
 9899:1999 in all respects.  C99 alignment was one of the principal
 reasons for bringing out a whole new standard in the first place,
 rather than continuing the amendment process.  (This is also why POSIX
 now requires eight-bit bytes.)

This doesn't follow, since C99 doesn't require 8-bit bytes.  int8_t is
optional in C99 and all code that uses it unconditionally is unportable.
Similarly for most other types in stdint.h.  The required ones are
[u]int_least{8,16,32,64}_t and [u]int_fast{8,16,32,64}_t and [u]intmax_t,
i.e., nothing that can't be declared in C90 except a 64-bit type.

POSIX requires in addition [u]int{8,16,32}_t, and [u]int64_t if 64 bit
integer types exist.  It says that the existence of int8_t implies
that a byte is 8 bits and CHAR_BIT is 8.  I'm not sure what prevents
int8_t being smaller than char.

Bruce


Re: ATAng regression: cdcontrol close not working

2003-10-31 Thread Bruce Evans
On Thu, 30 Oct 2003, Soren Schmidt wrote:

 It seems Lars Eggert wrote:
   I've already committed a solution that works on all the drives I
   could test on (some of which failed before), if this still
   fails for you I'd like a more detailed description of what
   exactly goes wrong...
 
  I must have missed that commit message, sorry. This is the drive I am
  having the issues with:
 
  acd0: CDRW PHILIPS DVD+RW-D28 at ata1-master UDMA33
 
  Sending CDIOCCLOSE from a short C snippet shows that the ioctl yields EBUSY.

 Hmm, now that's a real stupid return code from the drive, IMNHO...

 Anyhow, if you lose the test for error in atapi-cd.c::acd_tray in the
 close case, does it work then?  Problem is that the call to read the toc
 might fail early, but it's worth a shot...

I was mistaken in thinking that my patch fixed this.  It only works for
1 drive, as does at acd^H^Htapi-cd.c rev.1.148.

Some boot messages:

acd0: CDRW RICOH CD-RW MP7320A at ata1-slave UDMA33
acd1: CDROM ATAPI 44X CDROM at ata2-slave PIO4
(cd1:ata2:0:1:0): Recovered Sense
(cd1:ata2:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd1:ata2:0:1:0): CAM Status: SCSI Status Error
(cd1:ata2:0:1:0): SCSI Status: Check Condition
(cd1:ata2:0:1:0): NOT READY asc:3a,0
(cd1:ata2:0:1:0): Medium not present
cd1 at ata2 bus 0 target 1 lun 0
cd1: ATAPI 44X CDROM 3.40 Removable CD-ROM SCSI-0 device
cd1: 16.000MB/s transfers
cd1: Attempt to query device size failed: NOT READY, Medium not present
(cd0:ata1:0:1:0): Recovered Sense
(cd0:ata1:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd0:ata1:0:1:0): CAM Status: SCSI Status Error
(cd0:ata1:0:1:0): SCSI Status: Check Condition
(cd0:ata1:0:1:0): NOT READY asc:3a,0
(cd0:ata1:0:1:0): Medium not present
cd0 at ata1 bus 0 target 1 lun 0
cd0: RICOH CD-RW MP7320A bp13 Removable CD-ROM SCSI-0 device
cd0: 33.000MB/s transfers
cd0: Attempt to query device size failed: NOT READY, Medium not present

These drives now have the following behaviour for eject/close:

cd0, acd1, cd1: eject and close just work
acd0: eject works.  Close fails with EBUSY.  None of
 acd_start_stop(cdp, [0-3]) (called from ddb, ignoring errors) closes
 the tray when it is open.

The related bugs in the cd driver are only that the messages are too verbose.
The above boot messages show only some of them.  Eject + close gives the
following messages (all unwanted):

(cd0:ata1:0:1:0): Recovered Sense
(cd0:ata1:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd0:ata1:0:1:0): CAM Status: SCSI Status Error
(cd0:ata1:0:1:0): SCSI Status: Check Condition
(cd0:ata1:0:1:0): NOT READY asc:3a,0
(cd0:ata1:0:1:0): Medium not present
(cd0:ata1:0:1:0): Recovered Sense
(cd0:ata1:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd0:ata1:0:1:0): CAM Status: SCSI Status Error
(cd0:ata1:0:1:0): SCSI Status: Check Condition
(cd0:ata1:0:1:0): NOT READY asc:3a,0
(cd0:ata1:0:1:0): Medium not present
(cd0:ata1:0:1:0): Recovered Sense
(cd0:ata1:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd0:ata1:0:1:0): CAM Status: SCSI Status Error
(cd0:ata1:0:1:0): SCSI Status: Check Condition
(cd0:ata1:0:1:0): NOT READY asc:3a,0
(cd0:ata1:0:1:0): Medium not present
(cd0:ata1:0:1:0): Recovered Sense
(cd0:ata1:0:1:0): READ CD RECORDED CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
(cd0:ata1:0:1:0): CAM Status: SCSI Status Error
(cd0:ata1:0:1:0): SCSI Status: Check Condition
(cd0:ata1:0:1:0): NOT READY asc:3a,0
(cd0:ata1:0:1:0): Medium not present

This is with my modified version of scsi_cd.c.  The current one is similarly
verbose, but doesn't do as many read-capacity's.

Bruce

