Re: i386 4/4 change
On Sun, 1 Apr 2018, Dimitry Andric wrote:

On 31 Mar 2018, at 17:57, Bruce Evans wrote:

On Sat, 31 Mar 2018, Konstantin Belousov wrote: the change to provide a full 4G of address space for both kernel and user on i386 is ready to land. The motivation for the work was both to mitigate Meltdown on i386 and to give more breathing space to the still-used 32-bit architecture. The patch was tested by Peter Holm, and I am satisfied with the code. If you use i386 with HEAD, I recommend you apply the patch from https://reviews.freebsd.org/D14633 and report any regressions before the commit, not after. Unless a significant issue is reported, I plan to commit the change somewhere around Wed/Thu next week. I also welcome patch comments and reviews.

It crashes at boot time in getmemsize() unless booted with loader, which I don't want to use.

For me, it at least compiles and boots OK, but I'm one of those crazy people who use the default boot loader. ;)

I found a quick fix and sent it to kib. (2 crashes in vm86 code for memory sizing. This is not called if loader is used && the system has smap. Old systems don't have smap, so they crash even if loader is used.)

I haven't yet run any performance tests; I'll try building world and a few large ports tomorrow. General operation from the command line does not feel "sluggish" in any way, however.

Further performance tests:
- reading /dev/zero using tinygrams is 6 times slower
- read/write of a pipe using tinygrams is 25 times slower.

The pipe test also gives unexpected values in wait statuses on exit, hopefully just because a bug in the test program is exposed by the changed timing (but later it also gave SIGBUS errors). It does a context switch or 2 for every read/write. It now runs 7 times slower using two 4 GHz CPUs than it did in FreeBSD-5 using one 2.0 GHz CPU. The faster CPUs, and 2 of them, used to make it run 4 times faster.
It shows another slowdown since FreeBSD-5, and much larger slowdowns since FreeBSD-1:

1996 FreeBSD on P1 133MHz:    72k/s
1997 FreeBSD on P1 133MHz:    44k/s (after dyson's opts for large sizes)
1997 Linux   on P1 133MHz:    93k/s (simpler is faster for small sizes)
1999 FreeBSD on K6 266MHz:   129k/s
2018 FBSD-~5 on AthXP 2GHz:  696k/s
2018 FreeBSD on i7 2x4GHz:  2900k/s
2018 FBSD4+4 on i7 2x4GHz:   113k/s (faster than Linux on a P1 133MHz!!)

Netblast to localhost has much the same 6 times slowness as reading /dev/zero using tinygrams. This is the slowdown for syscalls. Tinygrams are hard to avoid for UDP. Even 1500 bytes is a tinygram for /dev/zero.

Without 4+4, localhost is very slow because it does a context switch or 2 for every packet (even with 2 CPUs when there is no need to switch). Without 4+4 this used to cost much the same as the context switches for the pipe benchmark. Now it costs relatively much less, since (for netblast to localhost) all of the context switches are between kernel threads.

The pipe benchmark uses select() to avoid busy-waiting. That was good for UP. But for SMP with just 2 CPUs, it is better to busy-wait and poll in the reader and writer. netblast already uses busy-waiting. It used to be a bug that select() doesn't work on sockets, at least for UDP, so blasting using busy-waiting is the only possible method (timeouts are usually too coarse-grained to go as fast as blasting, and if they are fine-grained enough to go fast then they are not much better than busy-waiting, with time wasted for setting up timeouts). SMP makes this a feature. It forces use of busy-waiting, which is best if you have a CPU free to run it and this method doesn't take too much power.

Context switches to task queues give similar slowness. This won't be affected by 4+4, since task queues are in the kernel. I don't like networking in userland since it has large syscall and context switch costs. Increasing these by factors of 6 and 25 doesn't help.
It can only do better by combining i/o in ways that the kernel neglects to do or that are precluded by per-packet APIs. Slowdown factors of 6 or 25 require the combined i/o to be 6 or 25 times larger to amortise the costs.

Bruce
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: i386 4/4 change
On Sat, 31 Mar 2018, Konstantin Belousov wrote: the change to provide a full 4G of address space for both kernel and user on i386 is ready to land. The motivation for the work was both to mitigate Meltdown on i386 and to give more breathing space to the still-used 32-bit architecture. The patch was tested by Peter Holm, and I am satisfied with the code. If you use i386 with HEAD, I recommend you apply the patch from https://reviews.freebsd.org/D14633 and report any regressions before the commit, not after. Unless a significant issue is reported, I plan to commit the change somewhere around Wed/Thu next week. I also welcome patch comments and reviews.

It crashes at boot time in getmemsize() unless booted with loader, which I don't want to use.

It is much slower, and I couldn't find an option to turn it off. For makeworld, the system time is slightly more than doubled, the user time is increased by 16%, and the real time is increased by 21%. On amd64, turning off pti and not having ibrs gives almost no increase in makeworld times relative to old versions, and pti only costs about 5% IIRC. Makeworld is not very syscall-intensive. netblast is very syscall-intensive, and its throughput is down by a factor of 5 (660/136 = 4.9, 1331/242 = 5.5).

netblast 127.0.0.1 5001 5 10 (localhost, port 5001, 5-byte tinygrams for 10 s):
537 kpps sent, 0 kpps dropped  # before this patch (CPU use 1.3)
136 kpps sent, 0 kpps dropped  # after (CPU use 2.1)

(Pure software overheads. It uses 1.6 times as much CPU to go 4 times slower.)

netblast 192.168.2.8 (low end PCI33 lem on low latency 1 Gbps LAN):
275 kpps sent, 1045 kpps dropped  # before (CPU use 1.3)
245 kpps sent, 0 kpps dropped     # after (CPU use 1.3)

(The hardware can't do anywhere near the line rate of ~1500 kpps, so this becomes a benchmark of syscalls and dropping packets. The change makes FreeBSD so slow that 8 CPUs at 4.08 GHz can't saturate a low end PCI33 NIC (the hardware saturates at about 282 kpps for tx and about 400 kpps for rx).)
netblast 192.168.2.8 (low end PCIe em on low latency 1 Gbps LAN):
1316 kpps sent, 3 kpps dropped  # before (CPU use 1.6)
243 kpps sent, 0 kpps dropped   # after (CPU use 1.2)

This is seriously slower for the most useful case. It reduces a system that could almost reach line rate using about 2 of 8 CPUs at 4 GHz to one that is slower than with 1 CPU at 2 GHz (the latter saturates in software at about 640 kpps in old versions of FreeBSD and at about 400 kpps in -current).

Initial debugging of the crash: it crashes on the first pmap_kenter() in getmemsize(). I configure debug.late_console to 0. That works; without it, getmemsize() can't even be debugged, since it runs before console initialization and ddb entry with -d. In getmemsize(), of course all the preload calls return 0 and smapbase is NULL. Then vm86 bios calls work and give basemem = 0x276. Then basemem_setup() is called and it returns. Then pmap_kenter() is called and it crashes:

Stopped at getmemsize+0xb3:  pushl  $0x1000
Stopped at getmemsize+0xb8:  pushl  $0x1000
Stopped at getmemsize+0xbd:  call   pmap_kenter
Stopped at pmap_kenter:      pushl  %ebp
Stopped at pmap_kenter+0x1:  movl   %esp,%ebp
Stopped at pmap_kenter+0x3:  movl   0x8(%ebp),%eax
Stopped at pmap_kenter+0x6:  shrl   $0xc,%eax
Stopped at pmap_kenter+0x9:  movl   0xc(%ebp),%edx
Stopped at pmap_kenter+0xc:  orl    $0x3,%edx
Stopped at pmap_kenter+0xf:  movl   %edx,PTmap(,%eax,4)

The last instruction crashes because PTmap is not mapped at this point:

db> p/x $edx
1003
db> p/x PTmap
ff80
db> p/x $eax
1
db> x/x PTmap
PTmap: KDB: reentering
KDB: stack backtrace:
db_trace_self_wrapper(cec5cb,1420a04,c6de83,1420978,1,...) at db_trace_self_wrapper+0x24/frame 0x142095c
kdb_reenter(1420978,1,ff80003a,1420998,8f1419,...)
at kdb_reenter+0x24/frame 0x1420968
trap(1420a10) at trap+0xa0/frame 0x1420a04
calltrap() at calltrap+0x8/frame 0x1420a04
--- trap 0xc, eip = 0xc5c394, esp = 0x1420a50, ebp = 0x1420a88 ---
db_read_bytes(ff81,3,1420aa0) at db_read_bytes+0x29/frame 0x1420a88
db_get_value(ff80,4,0,0,d2d304,...) at db_get_value+0x20/frame 0x1420ab4
db_examine(ff80,1,,1420b00) at db_examine+0x144/frame 0x1420ae4
db_command(cb1d99,1420be4,8f0f01,d1d28a,0,...) at db_command+0x20a/frame 0x1420b90
db_command_loop(d1d28a,0,1420bac,1420b9c,1420be4,...) at db_command_loop+0x55/frame 0x1420b9c
db_trap(a,4ff0,1,1,80046,...) at db_trap+0xe1/frame 0x1420be4
kdb_trap(a,4ff0,1420cc4) at kdb_trap+0xb1/frame 0x1420c10
trap(1420cc4) at trap+0x523/frame 0x1420cb8
calltrap() at calltrap+0x8/frame 0x1420cb8
--- trap 0xa, eip = 0xc65a4a, esp = 0x1420d04, ebp = 0x1420d04 ---
pmap_kenter(1000,1000,1429000,8efe13,0,...) at pmap_kenter+0xf/frame 0x1420d04
getmemsize(1,5a8807ff,ee,59a80097,ee,...) at getmemsize+0xc2/frame 0x1420fc4
init386(1
Re: Really weird behavior with terminals/sessions in past couple weeks
On Sat, 13 May 2017, Ngie Cooper (yaneurabeya) wrote:

On May 13, 2017, at 11:05, Ngie Cooper (yaneurabeya) wrote:

On May 13, 2017, at 11:01, Ngie Cooper (yaneurabeya) wrote: Hi, I've been noticing some really weird behavior with terminal input after updating my kernel/userland -- in particular, if I do `arc diff --create` (which opens vi/vim) and try to do edits/use ^c, it will terminate the running process for `arc diff --create`. Similarly, I was seeing really weird input via vim (when doing `svn ci`) where, if I had one of the editing modes on, like insert, it would delete several lines at once; I worked around this by using ^c to terminate insert mode, but that's a really bad hack. It worked ok with r316745, got worse in r317727, and doesn't seem to be any better in r318250.

I forgot to mention: I'm using SSH to access my machine.

My gut feeling is the sc(4) commits might have tickled or introduced some bugs. I'll try reverting the following commits over the next couple days to see whether or not my experience improves: r316827 r316830 r316865 r316878 r316974 r316977 r317190 r317198 r317199 r317245 r317256 r317264.

I don't think I touched anything related to editing. Certainly not for fixing the mouse cursor, starting some time before r317827. Since then I have spent too much time on mouse cursors and not much else.

Bruce
Re: kernel coding of nobody/nogroup
On Fri, 21 Apr 2017, Rick Macklem wrote: I need to set the default uid/gid values for nobody/nogroup into kernel variables. I reverted the commit that hardcoded them, since I agree that wasn't a good thing to do. I didn't realize that "nobody" was already defined in sys/conf.h and I can use that.

I didn't know nobody was already there either. They are only used by zfs, while the others were originally only used for devices.

There is no definition for "nogroup" in sys/conf.h. Would it be ok to add #define GID_NOGROUP 65533 to sys/conf.h? (I know bde@ doesn't like expressing this as 65533, but that is what it is in /etc/group.)

sys/conf.h already has GID_NOBODY, but it is subtly different from GID_NOGROUP. It seems to be a bug that zfs uses nobody's gid instead of the gid nogroup, which is used by no body.

Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Fri, 31 Mar 2017, Andrey Chernov wrote:

On 30.03.2017 21:53, Bruce Evans wrote: I think it was the sizing. The non-updated mode is 80x25, so the row address can be out of bounds in the teken layer.

I have text 80x30 mode set at rc stage, and _after_ that may have many kernel messages on console, all without causing reboot. How is it different from the shutdown stage? The syscons mode is unchanged since the rc stage.

Probably just because there weren't enough messages to go past row 24. I had no difficulty reproducing the crash today by entering ddb and rebooting, starting from 80x30 and rows > 24, after removing just the window size update in the fix. I missed seeing it the other day because I tested with 80x60 to see the smaller console window more clearly, but must have only tried rebooting with row <= 24.

Another recent fix for sc reduced the problem a little. Mode changes are supposed to clear the screen and move the cursor to home, but they only clear the screen. You should have noticed the ugliness from that after the switch to 80x30. There are enough boot messages to reach row 24, and messages continued from there. Now they start at the top of the screen again. Clearing the messages is not ideal, but syscons always did it.

Syscons also has new and old bugs preserving colors across mode changes:
- it never preserved changes to the palette (FBIO_SETPALETTE ioctl). Some mode changes should reset the palette, but some should not. Especially not ones for a vt switch.
- BIOSes should reset the palette for mode changes (even to the same mode). Some BIOSes are confused by syscons setting the DAC to 8 bit mode and then reset to a garbage (dark) palette. They always switch back to 6 bit mode.
- syscons used to maintain the current colors and didn't change them for mode changes. This was slightly broken, since for a mode change from a mode with full color to one with less color, the interpretation of the color indexes might change.
The colors are now maintained by teken, and syscons tells teken to do a full window size change, which resets the entire teken state including colors. This bug is normally hidden by vidcontrol refreshing the colors. vidcontrol could be held responsible for refreshing or resetting everything after a mode change ioctl, but I think this is backwards, since there are many low-level details that are better handled in the driver. Switching to graphics modes is already a complicated 2-ioctl process with not enough options and poor error handling. Like a too-simple wrapper for fork-exec.

vt has some interesting related bugs. It doesn't support mode switches of course, and even changing the font seems to be unsupported in text mode. But in graphics mode, changing the font works and even redraws the screen where syscons would clear it for the mode change. But there are bugs redrawing the screen -- often old history is redrawn. This should work like in xterm or a general X window refresh, where the redrawing must be done for lots of other events than resize (exposure, etc.).

- sysctl debug.kdb.break_to_debugger. This is documented in ddb(4), but only as equivalent to the unbroken BREAK_TO_DEBUGGER.

Thanx. Setting debug.kdb.break_to_debugger=1 makes both Ctrl-Alt-ESC and Ctrl-PrtScr work in sc only mode, and the "c" exit doesn't cause all chars to beep like in vt. I.e. it works. But I don't understand why debugging via serial is involved in the sc case while not in the vt case, and fear that some serial noise may provoke a break.

This is because only syscons has full conflation of serial line breaks with entering the debugger via a breakpoint instruction. Syscons does:

	kdb_break();

for its KDB keys, while vt does:

	kdb_enter(KDB_WHY_BREAK, ...)

for its KDB keys. The latter bypasses KDB's permissions on entering the debugger with a BREAK. It is unclear if this is a layering violation in vt or incorrect use of kdb_break() in syscons.
It is certainly wrong for vt to use the KDB_WHY_BREAK code if it is avoiding using kdb_break() to fix the conflation.

Is there a chance to untie the serial and sc console debuggers?

This is easy to do by copying vt's arguable layering violation. A little more is necessary to unconflate serial breaks:
- agree that kdb_break() and KDB_WHY_BREAK are only for serial line breaks
- don't use kdb_break() and KDB_WHY_BREAK for console KDB keys, of course.

vt already has a string saying that the entry is a "manual escape to debugger". Here "to debugger" is redundant, "manual escape" means "DDB key hit manually by the user", and the driver that saw the key is left out. "vt KDB key" would be a more useful message. syscons used to print a similar message, but it now calls kdb_break(), which produces the conflated code KDB_WHY_BREAK and the consistently conflated message "Break to debugger"
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

On 30.03.2017 18:13, Bruce Evans wrote:

On Thu, 30 Mar 2017, Andrey Chernov wrote: ... Finally I have good news and bad news with today's -current: 1) It seems your latest commit r316136 fixes the premature reboot issue.

Now I need to know how that helped. Did you use a non-default mode?

Perhaps it didn't really help, but just hides the problem by changing some race timing parameters. I use 80x30 text mode on all screens.

I think it was the sizing. The non-updated mode is 80x25, so the row address can be out of bounds in the teken layer.

2) I still can't enter KDB using Ctrl-Alt-ESC, while booting, after booting, after login and while shutdown - nothing happens. boot -d enters KDB normally, so the keyboard sequence handler is broken, not boot -d.

Try "~b".

What? It just prints \n, a new csh prompt and ~b

This takes ALT_BREAK_TO_DEBUGGER. It is an old bug that Ctrl-Alt-ESC (and Ctrl-PrtScr) are misconfigured by default. GENERIC is even more broken than I remembered. It doesn't even have ALT_BREAK_TO_DEBUGGER.

In old versions, this didn't affect the syscons key. The key was controlled by the SC_DISABLE_DDBKEY option, so it defaulted to enabled. There was no tunable or sysctl to change the default. Serial consoles had a BREAK_TO_DEBUGGER option to control entering the debugger on a serial line break. This was not per-device or even per-driver.

Things were broken by conflating serial line BREAKs with entering the debugger using a breakpoint instruction. Now there are many sysctls and tunables, but the basic enable is the conflated BREAK_TO_DEBUGGER. This now gives the default setting for entering kdb using a breakpoint instruction. Syscons calls the function kdb_break(), which calls kdb_enter(), which executes the breakpoint instruction. Arches that don't have such an instruction must have a virtual one. The default setting can be modified using a tunable or sysctl.
So to have a chance of the syscons debugger keys working, you first have to configure this setting, using either:
- BREAK_TO_DEBUGGER in the static config file. This is documented in ddb(4), but only for its unbroken meaning for serial consoles
- the tunable debug.kdb.break_to_debugger. This seems to be undocumented
- the sysctl debug.kdb.break_to_debugger. This is documented in ddb(4), but only as equivalent to the unbroken BREAK_TO_DEBUGGER.

You have to set the variable using 1 or more of these knobs if you want the syscons and vt debugger keys to work, but this also enables debugger entry for serial line breaks, and thus breaks the reason for existence of the unbroken BREAK_TO_DEBUGGER option. Normally you don't want to enter the debugger for serial line breaks, since then unplugging the cable or noise on the cable may enter the debugger, and the option exists to enable the entry for the rare cases where it is safe.

Next there are the sysctl and vt knobs to set, but these have correct defaults so are enabled automatically. SC_DISABLE_DDBKEY is now named SC_DISABLE_KDBKEY. It always disabled not only the key, but the code to enable it. It actually controls 2 keys and 1 sequence of keys. When it is not configured, the Ctrl-PrtScr and Ctrl-Alt-ESC keys are enabled by default. This can be changed by a sysctl but not by a tunable. The sysctl is confusingly named with "kbd" (keyboard) in its name, while the config option has KDB (kernel debugger) in its name. The variable for this also controls the sequences of keys, which are more than ddb keys and are controlled by the ALT_BREAK_TO_DEBUGGER option and its knobs.

vt doesn't have a static config knob to enable the enables. It has a tunable as well as a sysctl. This sysctl only controls the keys, not key sequences. (There may be more than 2 debugger keys. keymap allows any key to be a debugger key.) syscons and/or vt also have knobs to control halt, poweroff, reboot and panic, but not suspend.
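Concretely, the three knobs listed above would be set as sketched below. The knob names are as given in this thread and in ddb(4); the file placements follow the usual FreeBSD conventions rather than anything stated in the thread, so treat them as an assumption.

```
# Kernel config file (static option):
options         BREAK_TO_DEBUGGER

# /boot/loader.conf (tunable, set at boot):
debug.kdb.break_to_debugger="1"

# At runtime (sysctl):
sysctl debug.kdb.break_to_debugger=1
```

Any one of the three suffices to flip the same underlying variable; the caveat above about serial line breaks applies equally to all of them.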
Many of these are defeated by the sequences enabled by ALT_BREAK_TO_DEBUGGER. This is a larger bug in vt. In vt, ALT_BREAK_TO_DEBUGGER is limited by the sysctl for the kdb keys. If kdb entry is allowed, then there is no point in disallowing anything else, since anything can be done using kdb if it has a backend.

Even this complexity is not enough to give adequate control. The control should be per-device. You might have 1 secure console and 1 insecure console; then enable kdb on at most the secure console. Or 1 remote serial console with a good cable and 1 serial console with a bad cable; then enable kdb entry for serial line breaks on at most the one with the good cable. With per-device control, the 6 knobs for controlling entry at the kdb level would be sillier, but at least 1 knob is needed there to prevent all ddb use.

Ctrl-PrtScr does nothing too.

But I think the misconfiguration is the same for vt.

No, Ctrl-Alt-ESC works for vt at every phase of the system lifecycle.

My point is that it is easy to mis
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

On 30.03.2017 12:34, Andrey Chernov wrote:

On 30.03.2017 12:23, Andrey Chernov wrote: Yes, only for reboot/shutdown. The system does not do anything wrong even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.

(it is from a 10-stable sample, old -current samples are lost) Moreover, the GELI swap deactivation lines are never printed too (I already mentioned that I changed swap to normal, but nothing changed). I start to have a raw guess that _any_ kernel printf in shutdown mode causes not a printf but a premature reboot.

Finally I have good news and bad news with today's -current: 1) It seems your latest commit r316136 fixes the premature reboot issue.

Now I need to know how that helped. Did you use a non-default mode? The change had 2 parts, and I should have split it for testing. It fixes the window sizing and constructors.

2) I still can't enter KDB using Ctrl-Alt-ESC, while booting, after booting, after login and while shutdown - nothing happens. boot -d enters KDB normally, so the keyboard sequence handler is broken, not boot -d.

Try "~b".

It is an old bug that Ctrl-Alt-ESC (and Ctrl-PrtScr) are misconfigured by default. But I think the misconfiguration is the same for vt. There are about 3 layers of options that have to be set to "enable" or not set to "disable" to enable these keys.

Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

On 30.03.2017 14:23, Andriy Gapon wrote:

On 30/03/2017 12:34, Andrey Chernov wrote:

On 30.03.2017 12:23, Andrey Chernov wrote: Yes, only for reboot/shutdown. The system does not do anything wrong even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.

(it is from a 10-stable sample, old -current samples are lost) Moreover, the GELI swap deactivation lines are never printed too (I already mentioned that I changed swap to normal, but nothing changed). I start to have a raw guess that _any_ kernel printf in shutdown mode causes not a printf but a premature reboot.

This sounds somewhat familiar... I vaguely recall an opposite issue that happened in the past. After one of my changes the reboot started hanging for one user. It turned out that the actual bug was always there, but previously the system rebooted because of a printf that caused a LOR (between spinlocks, AFAIR); witness tried to report it... using printf, and that recursed, and there was a triple fault in the end. Let me try to dig up some details, maybe the current issue is related in some way. By chance, do you have WITNESS but not WITNESS_SKIPSPIN in your kernel config?

No, I don't have WITNESS*. I think removing all vt* lines from the kernel config (and leaving sc) will be enough to reproduce it, but I am not sure.

INVARIANTS with WITNESS is not a bad way to debug problems :-). I just remembered to try it with recent changes. It didn't find any problems for rebooting. The problems reported in Andriy's 2012 threads are almost exactly the ones that I have mostly fixed in syscons -- LORs and deadlocks, and endless recursion in WITNESS to report the problem.
Syscons now detects and handles most LORs and deadlocks in itself, but I haven't committed the fixes for upper layers yet, so syscons mostly doesn't get called. cnputs() was "fixed" to silently drop the output. There is still an annoying LOR for devfs vs ufs in reboot. This is reported with no problems since it is not related to consoles.

Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

We don't understand the bug yet. It might not even be in sc. Do you only see problems for shutdown? The shutdown environment is special for locking.

Yes, only for reboot/shutdown. The system does not do anything wrong even under high load. On reboot or hang those lines are never printed:

kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
kernel: Syncing disks, vnodes remaining...5 3 0 1 0 0 done
kernel: All buffers synced.

(it is from a 10-stable sample, old -current samples are lost) Moreover, the GELI swap deactivation lines are never printed too (I already mentioned that I changed swap to normal, but nothing changed).

A hang in sc means that deadlock occurred and sc's new deadlock detection didn't work. Hangs are rare. Most common are premature reboots. Check that ddb works before shutdown, or just put a lot of printfs in

I can't check it in ddb because I can't enter ddb in sc mode, as I already wrote; nothing happens. Only vt mode allows Ctrl-Alt-ESC, but the bug does not exist in vt mode, so it is pointless.

That is significant. My changes were initially all about making ddb work almost perfectly with sc. ddb is entered by kdb first calling cngrab(), which does much the same things as cnputc(), but more, to set up for using the keyboard. If the sc part of cngrab() detects a problem, it should return, and then the sc part of cnputc() should detect the same problem and do emergency output, which might be just to buffer it. Nothing at all happening looks like a simpler problem, with Ctrl-Alt-ESC not being recognized. There are too many ways to enable/disable this entry, but I didn't change this.

You might have entered ddb in a context which used to race or deadlock.

No.
I tried about 20 times on a machine which does nothing and couldn't enter KDB in sc only mode, but got one dead hang instead, when starting to repeat it too fast.

Even earlier than shutdown, and when booting?

I mean in normal operation mode after booting, earlier than shutdown. Shutdown with premature reboot is too fast to press anything at the right time. I haven't tried to enter ddb when booting yet, but will tell you the results later.

Look early in kern_reboot(), where it does print_uptime() then cngrab(). Console output before this cngrab() should work normally, and I suspect that something in cngrab() reboots. But syncing the file systems is done before this. I think they are unmounted later, so are fscked but don't need more than fsck -p if they have been synced.

Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

On 30.03.2017 9:51, Andrey Chernov wrote:

On 30.03.2017 8:53, Bruce Evans wrote: The escape sequences in dmesg are very interesting. You should debug those.

I'll send them to you a bit later. Since I don't want vt at all, I don't want to debug or fix it, let it die. Here it is:

kernel: allscreens_kbd cursor^[[=0A^[[=7F^[[=0G^[[=0H^[[=7Ividcontrol: setting cursor type: Inappropriate ioctl for device

It is caused by a vidcontrol call left over from the previous sc setup.

This turns out to be uninteresting then. I think you have to configure something specially to get console messages in dmesg, but I get them in console.log, which also requires special configuration (turn this on in syslog.conf). In my configuration, vidcontrol only does ioctls in rc.d, so there are no escape sequences for vidcontrol in console.log, and only 1 error message (for changing the font to a syscons font). There should be more failures, but some ioctls are null instead of working. "vidcontrol show >/dev/console" works to show the colors and also to show that escape sequences end up in console.log.

Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote:

On 29.03.2017 6:29, Bruce Evans wrote: ...

I just found the cause, it is a new syscons bug (bde@ cc'ed). I never compile the vt driver into the kernel, i.e. I don't have these lines in the kernel config:

device vt
device vt_vga
device vt_efifb

When I add them, the bug described is gone. It seems syscons goes off too early, provoking the reboot.

Bah, I only have vt and vt_vga to check that I didn't break them. Unfortunately, syscons still works right when I remove these lines.

Maybe two will be enough too, I didn't check. I just don't need _any_ of the vt lines. What matters is that syscons only mode (without any vt) was recently broken, causing shutdown problems and file system damage each time. Syscons only mode worked for years until you broke it recently.

Actually, I fixed it not so recently (over the last few months), partly with much older local fixes. Kernel messages in syscons are now supposed to be colorized by CPU.

It looks really crazy on an 8-core CPU and should not be the default. And I don't see colors in vt mode (which should be parallel at that point, at least), but what about invisible escapes on vidcontrol errors (e.g. invalid argument) in vt mode?

It is tuned for an 8-core CPU :-). 16 CPUs don't get unique colors by default, but could get 16 unique foreground ones and 1 reverse video (reverse video indeed looks crazier for short messages). 2 CPUs don't get the best choice of colors by default. More than 16 CPUs would need to use lots of reverse video, except in graphics mode I'm considering expanding to 256 or 64K colors.

vt doesn't support colorized kernel messages since I don't want to touch it more than necessary. See subr_terminal.c:termcn_putc(). This is almost exactly the same as scteken_puts(), where the color change and some bugs were. It has to switch to the kernel color, and does this by abusing the user state.
User escape sequences get corrupted by kernel output, and kernel escape sequences to change the color change the user's color but not the kernel's if they are atomic and not part of a user escape sequence. The escape sequences in dmesg are very interesting. You should debug those. They might be caused by misparsing of kernel escape sequences, or more likely by corruption of user escape sequences. This might happen when: - the user prints "foo" and the terminal parses it as an escape sequence - the kernel interrupts this and prints "bar"; "foo" is a supported sequence but "bar" isn't - the error handling is to print the entire escape sequence (that would be the interleaved message, "bar" up to the point where the error is detected). Kernel console drivers seem to discard the entire mess. Userland xterm seems to print the entire message. Usually there aren't enough kernel messages interleaved with user ones to make the problem obvious. My changes should fix the problem for syscons, not cause it. But if they are slightly wrong, then they might cause it. Moreover, I can't enter KDB via Ctrl-Alt-ESC in the syscons-only mode anymore - nothing happens. In vt mode I can, but can't exit via "c" properly; all chars typed after "c" produce a beep unless I switch to another screen and back. Try backing out r315984 only. This is supposed to fix parsing of output. I'll try. thanx. But the most dangerous new syscons bug is the first one, damaging the file system on each reboot. I tried to go to KDB to debug it, but seeing that I can't even enter KDB I understand that all these bugs, including the nasty one, were introduced by your syscons changes; it was a hint to add completely unneeded and unused vt to my kernel config file. It's normal to have a slightly damaged file system after a panic. You might have entered ddb in a context which used to race or deadlock. It might have seemed to work if it only raced. After the fix, when in this mode the following happens: - in graphics mode, no output is done.
The races and deadlocks are not all fixed in the keyboard driver, and it might work in this mode. - in text mode, output is done specially, direct to the frame buffer, in a horizontal window 2/3 of the screen size. This doesn't use a full terminal driver, so it is hard to use at first. Even the reduced window causes problems. The colorization was originally to make this mode more usable. This mode is rarely active, except for debugging the console driver itself, or for low-level trap handlers. Put a breakpoint almost anywhere in the console driver to see it. sc_puts() is a good choice. vt is a real downgrade. Its default console font is plain ugly; it is impossible to work with it. I can't find a proper TERM for it to make function keys and pseudographics work in ncurses apps (not with xterm, a little better with xterm-sco), lynx can't display all things properly,
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Thu, 30 Mar 2017, Andrey Chernov wrote: On 30.03.2017 8:53, Bruce Evans wrote: Maybe two will be enough too, I didn't check. I just don't need _any_ of the vt lines. What matters is that syscons-only mode (without any vt) was recently broken, causing shutdown problems and file system damage each time. Syscons-only mode worked for years until you broke it recently. Actually, I fixed it not so recently (over the last few months), partly with much older local fixes. Please commit your fix as soon as possible. Committing it is what broke things for you. vt is broken as designed in many aspects (I even mention not all of them), It is not that bad. It is much cleaner, but 10-20 times slower and too simple to have as many features or preserve old features, and I don't like rewrites that remove or move features. vt does well to be as compatible as it is, so only annoys people who use the more arcane syscons features (I don't use most of them, but find them in regression tests). Syscons looks ugly, but much better when you look at the details. but on the other hand I can't allow a dirty filesystem (or hang) on each reboot using sc-only mode as always. It is dangerous, and fsck takes a long time. Moreover, using sc while keeping the vt bloat compiled in the kernel just as a bug workaround is the best demotivator for a perfectionist. We don't understand the bug yet. It might not even be in sc. Do you only see problems for shutdown? The shutdown environment is special for locking. A hang in sc means that deadlock occurred and sc's new deadlock detection didn't work. sc is supposed to either drop the output or do it specially when it detects deadlock. Deadlocks can also occur in upper layers of the console driver, but even more rarely. I haven't committed fixes for this yet. cnputs() detects some deadlocks and handles them by dropping the output. This loses WITNESS output when you need it for debugging the deadlock. The escape sequences in dmesg are very interesting.
You should debug those. I'll send you them a bit later. Since I don't want vt at all, I don't want to debug or fix it, let it die. :-) I'll try. thanx. But the most dangerous new syscons bug is the first one, damaging the file system on each reboot. I tried to go to KDB to debug it, but seeing that I can't even enter KDB I understand that all these bugs, including the nasty one, were introduced by your syscons changes; it was a hint to add completely unneeded and unused vt to my kernel config file. It's normal to have a slightly damaged file system after a panic. In sc-only mode I have no kernel panic, i.e. a panic with a trace on the console or entering KDB. I have a silent reboot in the middle or at the end of the shutdown sequence, or a rare dead hang on reboot (which is absolutely not acceptable for a remote machine). There's not much that sc does which can cause that. Maybe a wrong pointer for the frame buffer access in emergency output. I saw reboots when I broke this during booting. Check that ddb works before shutdown, or just put a lot of printfs in the shutdown sequence to see where it stops working. I usually sprinkle ddb breakpoints instead of printf()s. This requires more console code to work. Both should work until the final shutdown message from a working version. ddb breakpoints don't work properly under SMP. If all CPUs hit the same one, then the first one corrupts the state for the others. Shutdown should be mostly on a single CPU or with not all CPUs running the shutdown code, so most won't hit breakpoints in shutdown code, so it is fairly safe to put them there. You might have entered ddb in a context which used to race or deadlock. No. I tried about 20 times on a machine which does nothing and can't enter KDB in sc-only mode, but got one dead hang instead, when I started to repeat it too fast. Even earlier than shutdown, and when booting? Booting with -d gives a simpler environment until sc is completely attached. Try testing that first.
Also, do tests before mounting file systems so that nothing needs fsck'ing. In vt mode I can enter each time, but there are the exit problems I already mentioned. I use text mode in sc. Strings for function keys: - these are just broken in both sc and vt I have all function keys working in sc-only mode with TERM=cons25 and similar ones. Pseudographics: - I don't use it enough to see problems in it. Even finding the unicode glyph for the block character took me some time. Even cp437 has it, and the dialog library uses it for all window frames; f.e. all ports config windows use pseudographics if available and working (replaced by poor-looking ASCII +-| etc. otherwise). I call this line-drawing characters for cp437, and use them occasionally, but I don't know the termcap method for using them very well. Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Tue, 28 Mar 2017, Ngie Cooper wrote: On Mar 28, 2017, at 21:40, Bruce Evans wrote: On Wed, 29 Mar 2017, Bruce Evans wrote: On Wed, 29 Mar 2017, Andrey Chernov wrote: ... Moreover, I can't enter KDB via Ctrl-Alt-ESC in the syscons-only mode anymore - nothing happens. In vt mode I can, but can't exit via "c" properly; all chars typed after "c" produce a beep unless I switch to another screen and back. All this means is that syscons has become very broken by itself and even damages kernel operations. I found a bug in screen resizing (the console context doesn't get resized). This doesn't cause any keyboard problems. ... But I suspect it is a usb keyboard problem. Syscons now does almost correct locking for the screen, but not for the keyboard, and the usb keyboard is especially fragile, especially in ddb mode. Console input is not used in normal operation except for checking for characters on reboot. Try using vt with syscons unconfigured. Syscons shouldn't be used when vt is selected, but unconfigure it to be sure. vt has different bugs using the usb keyboard. I haven't tested usb keyboards recently. ... I tested usb keyboards again. They sometimes work, much the same as a few months ago after some fixes: ... The above testing is with a usb keyboard, no ps/2 keyboard, and no kbdmux. Other combinations and dynamic switching move the bugs around, and a serial console is needed to recover in cases where the bugs prevent any keyboard input. I filed a bug a few years ago about USB keyboards and usability in ddb. If you increase the timeout so the USB hubs have enough time to probe/attach, they will work. Is that for user mode or earlier? ukbd has some other fixes for ddb now, but of course it can't work before it finds the device. I recently found that usb boot drives sometimes don't have enough time to probe/attach before they are used in mountroot, and the mount -a prompt does locking that doesn't allow them enough time if they are not ready before it.
The usb maintainers already know about this. I haven't taken the time to follow up on that and fix the issue, or at least propose a bit more functional workaround. Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Wed, 29 Mar 2017, Bruce Evans wrote: On Wed, 29 Mar 2017, Andrey Chernov wrote: ... Moreover, I can't enter KDB via Ctrl-Alt-ESC in the syscons-only mode anymore - nothing happens. In vt mode I can, but can't exit via "c" properly; all chars typed after "c" produce a beep unless I switch to another screen and back. All this means is that syscons has become very broken by itself and even damages kernel operations. ... But I suspect it is a usb keyboard problem. Syscons now does almost correct locking for the screen, but not for the keyboard, and the usb keyboard is especially fragile, especially in ddb mode. Console input is not used in normal operation except for checking for characters on reboot. Try using vt with syscons unconfigured. Syscons shouldn't be used when vt is selected, but unconfigure it to be sure. vt has different bugs using the usb keyboard. I haven't tested usb keyboards recently. I tested usb keyboards again. They sometimes work, much the same as a few months ago after some fixes: - after booting with -d, they never work (give no input) at the ddb prompt with either sc or vt. usb is not initialized then, and no usb keyboard is attached to sc or vt - after booting without loader with -a, sc rarely or never works (gives no input) at the mountroot prompt - after booting with loader with -a, vt works at the mountroot prompt. I don't normally use loader but need to use it to change the configuration. This might be better than before. There used to be a screen refresh bug. - after booting with loader with -a, sc works at the mountroot prompt too. I previously debugged that vt worked better because it attaches the keyboard before this point, while sc attaches it after. Booting with loader apparently fixes the order.
- after any booting, sc works for user input (except sometimes after a too-soft hard reset, the keyboard doesn't even work in the BIOS, and it takes unplugging the keyboard to fix this) - after almost any booting, vt doesn't work for user input (gives no input). However, if ddb is entered using a serial console, vt does work! A few months ago, normal input was fixed by configuring kbdmux (the default in GENERIC). It is not fixed by unplugging the keyboard. kbdmux has a known bug of not doing nested switching for the keyboard state. Perhaps this "fixes" ddb mode. But I would have expected it to break ddb mode. - I didn't test sc after entering ddb, except early when it doesn't work. The above testing is with a usb keyboard, no ps/2 keyboard, and no kbdmux. Other combinations and dynamic switching move the bugs around, and a serial console is needed to recover in cases where the bugs prevent any keyboard input. Bruce
Re: New syscons bugs: shutdown -r doesn't execute rc.d sequence and others
On Wed, 29 Mar 2017, Andrey Chernov wrote: On 29.03.2017 0:46, Ngie Cooper (yaneurabeya) wrote: On Mar 28, 2017, at 14:27, Andrey Chernov wrote: Using rc_debug=yes I see that it is a kernel problem, not an rc problem. Sometimes the rc backward sequence is executed fully, sometimes only partly, but at an unpredictable moment inside the rc sequence the kernel decides to reboot quickly (or even hangs dead in rare cases). Always without any "Syncing buffers...", leaving the FS dirty. No zfs etc., just normal UFS, no EFI, no GPT. I changed GELI swap to a normal one, but it does not help. The same untouched config has worked for years; I see this bug for the first time in FreeBSD. I forgot to mention that typescript and dmesg do not survive this reboot (or rare hang). Good to note. The simple explanation to the problem might be r307755, depending on when you last synced/built ^/head. I have a few more questions (if reverting that doesn't pan out): I just found the cause, it is a new syscons bug (bde@ cc'ed). I never compile the vt driver into the kernel, i.e. I don't have these lines in the kernel config: device vt device vt_vga device vt_efifb When I add them, the bug described is gone. It seems syscons goes off too early, provoking the reboot. Bah, I only have vt and vt_vga to check that I didn't break them. Unfortunately, syscons still works right when I remove these lines. I also find some lines of the kernel messages strangely colored instead of white in the syscons-only mode. Even in vt mode vidcontrol errors have invisible escapes prepended (although visible through /var/log/messages). Kernel messages in syscons are now supposed to be colorized by CPU. The boot messages should show all the colors. Shutdown and ddb are normally done by a single random CPU, so are shown in a single random color. The colors are bright (light) 8-15 foreground, except bright black (8) is not so bright.
Configure with a non-default KERNEL_SC_CONS_ATTR (maybe yellow on black instead of lightwhite on black) to turn off the colorization. I haven't tested this recently. There is also a sysctl for setting all the colors. Moreover, I can't enter KDB via Ctrl-Alt-ESC in the syscons-only mode anymore - nothing happens. In vt mode I can, but can't exit via "c" properly; all chars typed after "c" produce a beep unless I switch to another screen and back. All this means is that syscons has become very broken by itself and even damages kernel operations. Try backing out r315984 only. This is supposed to fix parsing of output. It switches to a state indexed by the CPU for every character, and switches back. Screen switching does a different switch and would fix any bug in switching back. But I suspect it is a usb keyboard problem. Syscons now does almost correct locking for the screen, but not for the keyboard, and the usb keyboard is especially fragile, especially in ddb mode. Console input is not used in normal operation except for checking for characters on reboot. Try using vt with syscons unconfigured. Syscons shouldn't be used when vt is selected, but unconfigure it to be sure. vt has different bugs using the usb keyboard. I haven't tested usb keyboards recently. Bruce
Re: HEADS-UP: IFLIB implementations of sys/dev/e1000 em, lem, igb pending
On Tue, 24 Jan 2017, Sean Bruno wrote: On 01/24/17 08:27, Olivier Cochard-Labbé wrote: On Tue, Jan 24, 2017 at 3:17 PM, Sean Bruno wrote: Did you increase the number of rx/tx rings to 8 and the number of descriptors to 4k in your tests or just the defaults? Tuning is the same as described in my previous email (rxd|txd=2048, rx|tx process_limit=-1, max_interrupt_rate=16000). [root@apu2]~# sysctl hw.igb. hw.igb.tx_process_limit: -1 hw.igb.rx_process_limit: -1 hw.igb.num_queues: 0 hw.igb.header_split: 0 hw.igb.max_interrupt_rate: 16000 hw.igb.enable_msix: 1 hw.igb.enable_aim: 1 hw.igb.txd: 2048 hw.igb.rxd: 2048 Oh, I think you missed my note on these. In order to adjust txd/rxd you need to tweak the iflib version of these numbers. nrxds/ntxds should be adjusted upwards to your value of 2048. nrxqs/ntxqs should be adjusted upwards to 8, I think, so you can test equivalent settings to the legacy driver. Specifically, you may want to adjust these: dev.em.0.iflib.override_nrxds: 0 dev.em.0.iflib.override_ntxds: 0 dev.em.0.iflib.override_nrxqs: 0 dev.em.0.iflib.override_ntxqs: 0 That is painful. My hack to increase the ifq length also no longer works: X Index: if_em.c X === X --- if_em.c (revision 312696) X +++ if_em.c (working copy) X @@ -1,3 +1,5 @@ X +int em_qlenadj = -1; X + -1 gives a null adjustment; 0 gives a default (very large ifq), and other values give a non-null adjustment. X /*- X * Copyright (c) 2016 Matt Macy X * All rights reserved. X @@ -2488,7 +2490,10 @@ X X /* Single Queue */ X if (adapter->tx_num_queues == 1) { X - if_setsendqlen(ifp, scctx->isc_ntxd[0] - 1); X + if (em_qlenadj == 0) X + em_qlenadj = imax(2 * tick, 0) * 15 / 10; X + // lem_qlenadj = imax(2 * tick, 0) * 42 / 100; X + if_setsendqlen(ifp, scctx->isc_ntxd[0] + em_qlenadj); X if_setsendqready(ifp); X } X I don't want larger hardware queues, but sometimes want larger software queues. ifq's used to give them. The if_setsendqlen() call is still there.
but no longer gives them. The large queues are needed for packet blasting benchmarks since select() doesn't work for udp sockets, so if the queues fill up then the benchmarks must busy-wait or sleep waiting for them to drain, and timeout granularity tends to prevent short sleeps from working, so the queues run dry while sleeping unless they are very large. Bruce
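For reference, the override knobs Sean lists are tunables read when the driver attaches, so (if I read iflib right) they belong in loader.conf rather than being set at runtime. A sketch with the values from the legacy tuning quoted above, assuming the em(4) unit is numbered 0:

```
# /boot/loader.conf -- hypothetical tuning mirroring the legacy
# hw.igb.rxd/txd=2048 setup quoted in this thread; unit 0 assumed.
dev.em.0.iflib.override_nrxds="2048"
dev.em.0.iflib.override_ntxds="2048"
# 8 queues, matching the rx/tx ring count Sean suggests testing.
dev.em.0.iflib.override_nrxqs="8"
dev.em.0.iflib.override_ntxqs="8"
```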
Re: SVN r305382 breaks world32 on amd64 (and native 32-bit)
On Sun, 4 Sep 2016, Michael Butler wrote: Build fails with: ===> lib/msun (obj,all,install) Building /usr/obj/usr/src/lib/msun/e_fmodf.o /usr/src/lib/msun/i387/e_fmodf.S:10:17: error: register %rsp is only available in 64-bit mode movss %xmm0,-4(%rsp) ^~~~ /usr/src/lib/msun/i387/e_fmodf.S:11:17: error: register %rsp is only Fixed. I noticed it proof-reading the committed sources instead of the commit mail, and missed it for a while since I checked amd64 first. The bug was there for a couple of hours. At least the build failure prevented it being run. Bruce
Re: problems with mouse
On Mon, 29 Aug 2016, Hans Petter Selasky wrote: On 08/29/16 22:12, Antonio Olivares wrote: I apologize in advance if this is not in the right list, if I need to pose this question in questions, I will do so as soon as I find out. I am having trouble with switching apps in Lumina desktop with the mouse, I removed moused from /etc/rc.conf because I have a usb mouse and still lose when I switch from firefox to terminal or vice versa. $ uname -a FreeBSD hp 11.0-RC2 FreeBSD 11.0-RC2 #0 r304729: Wed Aug 24 06:59:03 UTC 2016 r...@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 Is there a way to troubleshoot this? Is there something that can fix this? Bruce Evans has fixed some issues with SC/VT mouse/keyboard stuff in 12-current. Maybe he has some ideas. I only know about sc/atkbd and am trying not to break ukbd. The cause of Bug 211884 (ukbd?) is still unknown. Bugzilla is too hard to access for me, but the PR seems to be missing critical info about the environment (is the console vt or sc?). kbdmux is still missing the fix that is blamed for causing Bug 211884. I need to fix kbdmux before changing sc to depend on it being fixed. vt already depends on it being fixed. However, vt also depends on going through kbdmux. ukbd doesn't attach properly directly for vt. ukbd passed tests of working in panic mode yesterday. It actually works perfectly in panic + ddb (polled) mode. Much better than in just ddb mode. Panic mode turns off its locking and thus gives races instead of deadlocks and assertion failures, and the races aren't very harmful in panic mode. So the basic polling method in ukbd is working except when it tries to do correct locking. Bruce
Re: Timing issue with Dummynet on high kernel timer interrupt
On Fri, 6 Nov 2015, Ian Lepore wrote: On Fri, 2015-11-06 at 17:51 +0100, Hans Petter Selasky wrote: On 11/06/15 17:43, Ian Lepore wrote: On Fri, 2015-11-06 at 17:28 +0100, Hans Petter Selasky wrote: Hi, Do the test II results change with this setting? sysctl kern.timecounter.alloweddeviation=0 Yes, it looks much better: debug.total: 10013 -> 0 debug.total: 10013 -> 0 ... This isn't the first time that the alloweddeviation feature has led people (including me in the past) to think there is a timing bug. I think the main purpose of the feature is to help save battery power on laptops by clustering nearby scheduled wakeups to all happen at the same time and then allow for longer sleeps between each wakeup. I was trying to remember the flag for turning off that "feature". It gives the bizarre behaviour that on an old system with a timer resolution of 10 msec, "time sleep 1" sleeps for 1 second with an average error of < 10 msec, but with a timer resolution of 1 msec for hardclock and finer for short timeouts, "time sleep 1" sleeps for an average of an extra 30 msec (worst case 1.069 seconds IIRC). Thus high resolution timers give much lower resolution for medium-sized timeouts. (For "sleep 10", the average error is again 30 msec but this is relatively smaller, and for "sleep .001" the average error must be less than 1 msec to work at all, though it is likely to be relatively large.) I've been wondering lately whether this might also be behind the unexplained "load average is always 0.60" problem people have noticed on some systems. If load average is calculated by sampling what work is happening when a timer interrupt fires, and the system is working hard to ensure that a timer interrupt only happens when there is actual work to do, you'd end up with statistics reporting that there is work being done most of the time when it took a sample. I use HZ = 100 and haven't seen this. Strangely, HZ = 100 gives the same 69 msec max error for "sleep 1" as HZ = 1000. 
Schedulers should mostly use the actual thread runtimes to avoid sampling biases. That might even be faster. But it doesn't work so well for the load average, or at all for resource usages that are averages, or for the usr/sys/intr splitting of the runtime. It is good enough for scheduling since the splitting is not needed for scheduling. Bruce
Re: Hello fdclose
On Tue, 18 Mar 2014, John Baldwin wrote: On Monday, March 17, 2014 7:23:19 pm Mariusz Zaborski wrote: ... I think the code is fine. I have a few suggestions on the manpage wording: .Sh RETURN VALUES -Upon successful completion 0 is returned. +The +.Fn fcloseall +function return no value. +.Pp +Upon successful completion +.Fn fclose +return 0. +Otherwise, +.Dv EOF +is returned and the global variable +.Va errno +is set to indicate the error. The .Rv macro should be used whenever possible. Unfortunately, it doesn't support the EOF return, but only -1, so stdio man pages can rarely use it, and this one is no exception. Using it gives standard wording that is quite different from the above: standard wording: The close() function returns the value 0 if successful; otherwise the value -1 is returned and the global variable errno is set to indicate the error. above wording (previous): Upon successful completion 0 is returned. Otherwise, EOF is returned and the global variable errno is set to indicate the error. above wording (new): Upon successful completion fclose() return [sic] 0. Otherwise, EOF is returned and the global variable errno is set to indicate the error. These are excessively formal in different ways: - I don't like "the foo() function". Why not just "foo()"? The standard wording uses this, and so does the new wording, but the previous wording omits the function name (that only works for man pages that only have a single function, as they should). - I don't like "the value N". Why not just "N"? The standard wording uses this, but the previous and new wordings don't. - "returns N" is better than "N is returned". Some man pages use worse wordings like "N will be returned". - "the global variable errno" is excessively detailed/verbose, without the details even being correct. Why not just "errno", with this identifier documented elsewhere? errno isn't a global variable in most implementations.
It can be, and usually is, a macro that expands to a modifiable lvalue of type int. In FreeBSD, the macro expands to a function that returns a pointer to int. - "Upon successful completion" is correct but verbose. The standard wording doesn't even use it. - the standard wording uses a conjunction instead of a new sentence before "otherwise" (this is better). It is missing a comma after "otherwise" (this is worse). +.Pp +The +.Fn fdclose +function return the file descriptor if successfull. Otherwise, .Dv EOF "successfull" is consistently misspelled. One of English's arcane rules is that most verbs append an 's' when used with singular subjects, so "function returns" should be used instead of "function return", etc. I do think for this section it would be good to combine the descriptions of fclose() and fdclose() when possible, so perhaps something like: "The fcloseall() function returns no value. Upon successful completion, fclose() returns 0 and fdclose() returns the file descriptor of the underlying file. Otherwise, EOF is returned and the global variable errno is set to indicate the error. In either case no further access to the stream is possible." OK. You kept "return[s] N" and deverbosified "the foo() function". "Upon successful completion" is needed more with several functions. "the global variable errno" remains consistently bad. There should be a comma after "In either case". This allows "in either case" to still read correctly and makes it clear it applies to both fclose() and fdclose(). Better "In every case". .Sh ERRORS +.Bl -tag -width Er +.It Bq Er EOPNOTSUPP The +.Fa _close +method in +.Fa stream +argument to +.Fn fdclose , +was not default. +.It Bq Er EBADF The ERRORS section should be sorted. For the errors section, the first error list needs some sort of introductory text. Also, this shouldn't claim that fdclose() can return an errno value for close(2).
"ERRORS The fdclose() function may will fail if: I don't like the tense given by "will" in man pages. POSIX says "shall fail" in similar contexts, and "will fail" is a mistranslation of this ("shall" is a technical term that doesn't suggest future tense). deshallify.sh does the not-incorrect translation s/shall fail/fails/ (I think this is too simple to always work). It doesn't translate anything to "will". I can't parse "may will" :-). deshallify.txt doesn't translate "may" or "should" to anything (these are also technical terms in some contexts, so they might need translation. IIRC, "may" is optional behaviour, mostly for the implementation, while "shall" is required behaviour, only for the implementation, but "should" is recommended practice, mostly for applications). Man pages are very unlikely to be as consistent as POSIX with these terms. [EOPNOTSUPP] The stream to close uses a non-default close method. [EBADF]The stream is not backed by a valid file de
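Putting jhb's combined wording together with the s/return/returns/ fixes discussed above, the RETURN VALUES section might read roughly as follows in mdoc (a sketch of the suggested wording only, not the committed page):

```
.Sh RETURN VALUES
The
.Fn fcloseall
function returns no value.
.Pp
Upon successful completion,
.Fn fclose
returns 0 and
.Fn fdclose
returns the file descriptor of the underlying file.
Otherwise,
.Dv EOF
is returned and the global variable
.Va errno
is set to indicate the error.
In either case, no further access to the stream is possible.
```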
Re: signal 8 (floating point exception) upon resume
On Mon, 10 Mar 2014, John Baldwin wrote: On Tuesday, March 04, 2014 4:50:01 pm Bruce Evans wrote: On Tue, 4 Mar 2014, John Baldwin wrote: % Index: i386/i386/swtch.s % === % --- i386/i386/swtch.s (revision 262711) % +++ i386/i386/swtch.s (working copy) [...savectx()] This function is mostly bogus (see old mails). I was going off of the commit logs for amd64 that removed this code as savectx() is not used for fork(), only for IPI_STOP and suspend/resume. Without fxsave, npxsuspend() cannot be atomic without locking, since fnsave destroys the state in the FPU and you either need a lock to reload the old state atomically enough, or a lock to modify FPCURTHREAD atomically enough. save_ctx() is now only called from IPI handlers or when doing suspend in which case we shouldn't have to worry about being preempted. I don't understand the suspend part. Is sufficient locking held throughout suspend/resume to prevent states changing after they have been saved here? % @@ -520,7 +490,16 @@ % movl %eax,%dr7 % % #ifdef DEV_NPX % - /* XXX FIX ME */ % + /* Restore FPU state */ Is the problem just this missing functionality? Possibly. I now think it was just the clobbering of %cr0 so i386 never had the problem. I think on amd64 there was also the desire to have the pcb state be meaningful in dumps (since we IPI_STOP before a dump). OTOH, It should also be meaningful in debuggers. Hopefully stop IPIs put it there from all stopped CPUs. I think it remains in the FPU for the running CPU. the current approach used by amd64 (and this patch for i386) is to not dirty fpcurthread's state during save_ctx(), but to instead leave fpcurthread alone and explicitly save whatever state the FPU is in in the PCB used for IPI_STOP or suspend. Hmm, if kernel debuggers actually supported displaying the FPU state, then they would prefer to find it in the PCB only (after debugger entry puts it there), but this doesn't work in places like the dna trap handler. Similarly for IPIs and suspend.
The dna trap handler would be broken unless any saving in the PCB is undone when normal operation is resumed, and it seems more difficult to undo it than to save specially so as not to have anything to undo. It is OK to save in the usual place in the PCB so that debuggers can find it more easily (since that place is not used in normal operation), but not to change the state in the CPU+FPU across the operation. Harmful state changes in the CPU+FPU include toggling CR0_TS and implicit fninit. For suspend/resume, we have no option but to undo everything, since other things may clobber the state. % @@ -761,7 +761,34 @@ % PCPU_SET(fpcurthread, NULL); % } % % +/* % + * Unconditionally save the current co-processor state across suspend and % + * resume. % + */ % void % +npxsuspend(union safefpu *addr) % +{ % + register_t cr0; % + % + if (!hw_float) % + return; % + cr0 = rcr(0); % + clts(); % + fpusave(addr); % + load_cr(0, cr0); % +} In the !fxsave case, this destroys the state in the npx, leaving fpcurthread invalid. It also does the save when the state in the npx is inactive. I think jkim intended this state so that resume can load it unconditionally. It must be arranged that there are no interactions with fpcurthread. Given the single-threaded nature of suspend/resume and IPI_STOP / restart_cpus(), those requirements are met, so it should be safe to resume whatever state was in the FPU and leave fpcurthread unchanged. Is the whole suspend/resume really locked? This doesn't work so well without fxsave. When fpcurthread != NULL, reloading CR0 keeps CR0_TS and thus ensures that inconsistent state lives for longer. Things will only be OK if fpcurthread isn't changed until resume. After the save_ctx() the CPU is going to either resume without doing a resume_ctx (IPI_STOP case) leaving fpcurthread unchanged (so save_ctx() just grabbed a snapshot of the FPU state for debugging purposes) or the CPU is going to power off for suspend.
If it doesn't restore for IPI_STOP, then it will continue with the state clobbered by fnsave in the !fxsr case. That is rare but can happen. Most CPUs that have IPIs also have fxsr. But on at least i386, there is an option to disable fxsr. During resume it will invoke resume_ctx() which will restore the FPU state (whatever state it was in) and fpcurthread and only after those are true is the CPU able to run other threads which will modify or use the FPU state. You can probably fix this by using the old code here. The old code doesn't need the hw_float test, since fpcurthread != NULL implies hw_float != 0. Actually, I don't see any need to change anything on i386 -- after storing the state for the thread, there should be no need to store it anywhere else across suspend/resume. We intentionally use thi
Re: signal 8 (floating point exception) upon resume
On Tue, 4 Mar 2014, John Baldwin wrote: On Monday, March 03, 2014 6:49:08 pm Adrian Chadd wrote: I'll try this soon. I had it fail back to newcons, rather than Xorg normally dying without restoring state. It wouldn't let me spawn a shell. Logging in worked fine, but normal shell exec would eventually and quickly lead to failure, dropping me back to the login prompt. If you have set CPUTYPE in /etc/src.conf such that your userland binaries are built with SSE, etc. then I expect most things to break because the FPU is in a funky state without this patch. I suspect if you don't set CPUTYPE so that your userland binaries do not use the FPU, you can probably resume just fine without this fix. Non-SSE FPU state might be broken too. Complete stab in the dark (not compile tested) here: http://www.FreeBSD.org/~jhb/patches/i386_fpu_suspend.patch I forget many details of how this works, but noticed that it seems to break consistency of the state for the !fxsave case and related locking.

% Index: i386/i386/swtch.s
% ===
% --- i386/i386/swtch.s (revision 262711)
% +++ i386/i386/swtch.s (working copy)
% @@ -417,42 +417,9 @@
% 	str	PCB_TR(%ecx)
%
% #ifdef DEV_NPX
% -	/*
% -	 * If fpcurthread == NULL, then the npx h/w state is irrelevant and the
% -	 * state had better already be in the pcb. This is true for forks
% -	 * but not for dumps (the old book-keeping with FP flags in the pcb
% -	 * always lost for dumps because the dump pcb has 0 flags).
% -	 *
% -	 * If fpcurthread != NULL, then we have to save the npx h/w state to
% -	 * fpcurthread's pcb and copy it to the requested pcb, or save to the
% -	 * requested pcb and reload. Copying is easier because we would
% -	 * have to handle h/w bugs for reloading. We used to lose the
% -	 * parent's npx state for forks by forgetting to reload.
% -	 */

This function is mostly bogus (see old mails).

% -	pushfl
% -	CLI
% -	movl	PCPU(FPCURTHREAD),%eax
% -	testl	%eax,%eax
% -	je	1f

This CLI/STI locking is bogus.
Accesses to FPCURTHREAD are now locked by critical_enter(), as on amd64, and perhaps a higher level already did critical_enter() or even CLI. (CLI/STI in swtch.s seems to be bogus too. amd64 doesn't do it, and I think a higher level does mtx_lock_spin() which does too much, including CLI via spinlock_enter().)

% -
% -	pushl	%ecx
% -	movl	TD_PCB(%eax),%eax
% -	movl	PCB_SAVEFPU(%eax),%eax
% -	pushl	%eax
% -	pushl	%eax
% -	call	npxsave
% +	pushl	PCB_FPUSUSPEND(%ecx)
% +	call	npxsuspend

Without fxsave, npxsuspend() cannot be atomic without locking, since fnsave destroys the state in the FPU and you either need a lock to reload the old state atomically enough, or a lock to modify FPCURTHREAD atomically enough. Reloading the old state is problematic because the reload might trap. So the old version uses the second method. It calls npxsave() to handle most of the details. But npxsave() was designed to be efficient for its usual use in cpu_switch(), so it doesn't handle the detail of checking FPCURTHREAD or the locking needed for this check, so the above code had to handle these details.

% 	addl	$4,%esp
% -	popl	%eax
% -	popl	%ecx
% -
% -	pushl	$PCB_SAVEFPU_SIZE
% -	leal	PCB_USERFPU(%ecx),%ecx
% -	pushl	%ecx
% -	pushl	%eax
% -	call	bcopy
% -	addl	$12,%esp
% -1:
% -	popfl
% #endif /* DEV_NPX */

This probably should never have been written in asm. Only the similar code in cpu_switch() is time-critical.

%
% 	movl	$1,%eax
% ...
% @@ -520,7 +490,16 @@
% 	movl	%eax,%dr7
%
% #ifdef DEV_NPX
% -	/* XXX FIX ME */
% +	/* Restore FPU state */

Is the problem just this missing functionality?

% ...
% Index: i386/isa/npx.c
% ===
% --- i386/isa/npx.c (revision 262711)
% +++ i386/isa/npx.c (working copy)

This has many vestiges of support for interrupt handling (mainly in comments and in complications in the probe). CLI/STI was used for locking partly to reduce complications for the IRQ13 case.
The comment before npxsave() still says that it needs CLI/STI locking by callers, but it actually needs critical_enter() locking and most callers only provided that.

% @@ -761,7 +761,34 @@
% 	PCPU_SET(fpcurthread, NULL);
% }
%
% +/*
% + * Unconditionally save the current co-processor state across suspend and
% + * resume.
% + */
% +void
% +npxsuspend(union savefpu *addr)
% +{
% +	register_t cr0;
% +
% +	if (!hw_float)
% +		return;
% +	cr0 = rcr0();
% +	clts();
% +	fpusave(addr);
% +	load_cr0(cr0);
% +}

In the !fxsave case, this destroys the state in the npx, leaving fpcurthread invalid. It also does the save when the state in the npx is inactive. I think jkim
Re: WEAK_REFERENCE?
On Sat, 9 Nov 2013, Andreas Tobler wrote: anyone interested in this patch to remove the WEAK_ALIAS and introduce the WEAK_REFERENCE? http://people.freebsd.org/~andreast/weak_ref.amd64.diff I have had this running for months on amd64 and have no issues with it. I remember a communication with bde@ in which he was in favour of doing this, but I lacked the time to complete it. A similar thing is pending for i386 and sparc64. The ppc stuff was already committed some time ago. If no one is interested, I'm happy to clean up my tree and skip this. I have only minor interest in it. I might have looked at it before. This version formats the backslashes in macro definitions very badly by putting them in random columns between about 96 and 120 instead of in column 72. Bruce
Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity
On Thu, 11 Jul 2013, David Chisnall wrote: On 11 Jul 2013, at 13:11, Bruce Evans wrote: The error message for the __builtin_isnan() version is slightly better up to where it says more. The less-unportable macro can do more classification and detect problems at compile time using __typeof(). The attached patch fixes the related test cases in the libc++ test suite. Please review. OK if the ifdefs work and the style bugs are fixed. This does not use __builtin_isnan(), but it does: - Stop exposing isnan and isinf in the header. We already have __isinf in libc, so this is used instead. - Call the static functions for isnan __inline__isnan*() so that they don't conflict with the ones in libm. - Add an __fp_type_select() macro that uses either __Generic(), __builtin_choose_expr() / __builtin_choose_expr(), or sizeof() comparisons, depending on what the compiler supports. - Refactor all of the type-generic macros to use __fp_type_select(). % Index: src/math.h % === % --- src/math.h(revision 253148) % +++ src/math.h(working copy) % @@ -80,28 +80,39 @@ % #define FP_NORMAL 0x04 % #define FP_SUBNORMAL0x08 % #define FP_ZERO 0x10 % + % +#if __STDC_VERSION__ >= 201112L % +#define __fp_type_select(x, f, d, ld) _Generic((x), \ % + float: f(x),\ % + double: d(x), \ % + long double: ld(x)) The normal formatting of this is unclear. Except for the tab after #define. math.h has only 1 other instance of a space after #define. % +#elif __GNUC_PREREQ__(5, 1) % +#define __fp_type_select(x, f, d, ld) __builtin_choose_expr( \ % + __builtin_types_compatible_p(__typeof (x), long double), ld(x),\ % + __builtin_choose_expr(\ % + __builtin_types_compatible_p(__typeof (x), double), d(x),\ % +__builtin_choose_expr( \ % + __builtin_types_compatible_p(__typeof (x), float), f(x), (void)0))) Extra space after __typeof. Normal formatting doesn't march to the right like this... % +#else % +#define __fp_type_select(x, f, d, ld) \ % + ((sizeof (x) == sizeof (float)) ? 
f(x)\ % + : (sizeof (x) == sizeof (double)) ? d(x) \ % + : ld(x)) ... or like this. Extra space after sizeof (bug copied from old code). % +#endif % + % + % + Extra blank lines. % #define fpclassify(x) \ % -((sizeof (x) == sizeof (float)) ? __fpclassifyf(x) \ % -: (sizeof (x) == sizeof (double)) ? __fpclassifyd(x) \ % -: __fpclassifyl(x)) Example of normal style in old code (except for the space after sizeof(), and the backslashes aren't line up like they are in some other places in this file). % ... % @@ -119,10 +130,8 @@ % #define isunordered(x, y) (isnan(x) || isnan(y)) % #endif /* __MATH_BUILTIN_RELOPS */ % % -#define signbit(x) \ % -((sizeof (x) == sizeof (float)) ? __signbitf(x) \ % -: (sizeof (x) == sizeof (double)) ? __signbit(x) \ % -: __signbitl(x)) % +#define signbit(x) \ % + __fp_type_select(x, __signbitf, __signbit, __signbitl) The tab lossage is especially obvious here. This macro definition fits on 1 line now. Similarly for others except __inline_isnan*, which takes 2 lines. __inline_isnan* should be named less verbosely, without __inline. I think this doesn't cause any significant conflicts with libm. Might need __always_inline. __fp_type_select is also verbose. % % typedef __double_t double_t; % typedef __float_t float_t; % @@ -175,6 +184,7 @@ % int __isfinite(double) __pure2; % int __isfinitel(long double) __pure2; % int __isinff(float) __pure2; % +int __isinf(double) __pure2; % int __isinfl(long double) __pure2; % int __isnanf(float) __pure2; % int __isnanl(long double) __pure2; % @@ -185,6 +195,23 @@ % int __signbitf(float) __pure2; % int __signbitl(long double) __pure2; The declarations of old extern functions can probably be removed too when they are replaced by inlines (only __isnan*() for now) . I think the declarations of __isnan*() are now only used to prevent warnings (at higher warning levels than have ever been used) in the file that implement the functions. 
% % +static __inline int % +__inline_isnanf(float __x) % +{ % + return (__x != __x); % +} % +static __inline int % +__inline_isnan(double __x) % +{ % + return (__x != __x); % +} % +static __inline int % +__inline_isnanl(long double __x) % +{ % + return (__x != __x); % +} % + % + Extra blank lines. Some insertion sort errors. In this file, APIs are mostly sorted in the order double, float, long double. All the inline functions except __inline_isnan*() only evaluate their args once, so they can be s
Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity
On Thu, 11 Jul 2013, David Chisnall wrote: On 11 Jul 2013, at 13:11, Bruce Evans wrote: <math.h> is also not required to be conforming C code, let alone C++ code, so there is only a practical requirement that it works when included in the C++ implementation. Working with the C++ implementation is the problem that we are trying to solve. The compatibility that I'm talking about is with old versions of FreeBSD. isnan() is still in libc as a function since that was part of the FreeBSD ABI and too many things depended on getting it from there. It was recently ... I don't see a problem with changing the name of the function in the header and leaving the old symbol in libm for legacy code. I don't even see why old code needs the symbol. Old code should link to old compat libraries that still have it. It would also be nice to implement these macros using _Generic when compiling in C11 mode, as it will allow the compiler to produce more helpful warning messages. I would propose this implementation: #if __has_builtin(__builtin_isnan) This won't work for me, since I develop and test msun with old compilers that don't support __has_builtin(). Much the same set of compilers also don't have enough FP builtins. Please look in cdefs.h, which defines __has_builtin(x) to 0 if the compiler does not support it. It is therefore safe to use __has_builtin() in any FreeBSD header. The old compilers run on old systems that don't have that in cdefs.h (though I sometimes edit it to add compatibility cruft like that). msun sources are otherwise portable to these systems. Well, not quite. They are not fully modular and also depend on stuff in libc/include and libc/${ARCH}. I have to update or edit headers there. This hack also doesn't work with gcc in -current. gcc has __builtin_isnan but not __has_builtin(), so __has_builtin(__builtin_isnan) gives the wrong result 0. It also doesn't even work. clang has squillions of builtins that aren't really builtins, so they reduce to libcalls.
Which, again, is not a problem for code outside of libm. If libm needs different definitions of these macros then that's fine, but they should be private to libm, not installed as public headers. Yes it is. It means that nothing should use isnan() or FP_FAST_FMA* outside of libm either, since isnan() is too slow and FP_FAST_FMA* can't be trusted. Even the implementation can't reliably tell if __builtin_isnan is usable or better than alternatives. The msun implementation knows that isnan() and other classification macros are too slow to actually use, and rarely uses them. Which makes any concerns that only apply to msun internals irrelevant from the perspective of discussing what goes into this header. No, the efficiency of isnan() is more important for externals, because the internals already have work-arounds.

#define isnan(x) __builtin_isnan(x)
#else
static __inline int
__isnanf(float __x)
{
	return (__x != __x);
}

Here we can do better in most cases by hard-coding this without the ifdef. They will generate the same code. Clang expands the builtin in the LLVM IR to a fcmp uno, so will generate the correct code even when doing fast math optimisations. On some arches the same, and not affected by -ffast-math. But this is not necessarily the fastest code, so it is a performance bug if clang always generates the same code for the builtin. Bit tests are faster in some cases, and may be required to prevent exceptions for signaling NaNs. -ffast-math could reasonably optimize x != x to "false". It already assumes that things like overflow and NaN results can't happen, so why not optimize further by assuming that NaN inputs can't happen? Generic stuff doesn't seem to work right for either isnan() or __builtin_isnan(), though it could for at least the latter.
According to a quick grep of strings $(which clang), __builtin_classify() is generic but __builtin_isnan*() isn't (the former has no type suffixes but the latter does, and testing shows that the latter doesn't work without the suffixes). I'm not sure what you were testing: Mostly isnan() without including <math.h>, and gcc. I was confused by gcc converting floats to doubles.

$ cat isnan2.c
int test(float f, double d, long double l)
{
	return __builtin_isnan(f) | __builtin_isnan(d) | __builtin_isnan(l);
}
$ clang isnan2.c -S -emit-llvm -o - -O1
...
%cmp = fcmp uno float %f, 0.00e+00
%cmp1 = fcmp uno double %d, 0.00e+00
%or4 = or i1 %cmp, %cmp1
%cmp2 = fcmp uno x86_fp80 %l, 0xK
...

As you can see, it parses them as generics and generates different IR for each. I don't believe that there's a way that these would be translated back into libcalls in the back end. Yes, most cases work right. gcc converts f to double and compares the result, but
Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity
On Thu, 11 Jul 2013, Tijl Coosemans wrote: On 2013-07-11 06:21, Bruce Evans wrote: On Wed, 10 Jul 2013, Garrett Wollman wrote: < said: I think isnan(double) and isinf(double) in math.h should only be visible if (_BSD_VISIBLE || _XSI_VISIBLE) && __ISO_C_VISIBLE < 1999. For C99 and higher there should only be the isnan/isinf macros. I believe you are correct. POSIX.1-2008 (which is aligned with C99) consistently calls isnan() a "macro", and gives a pseudo-prototype of int isnan(real-floating x); Almost any macro may be implemented as a function, if no conforming program can tell the difference. It is impossible for technical reasons to implement isnan() as a macro (except on weird implementations where all real-floating types are physically the same). In the FreeBSD implementation, isnan() is a macro, but it is also a function, and the macro expands to the function in double precision:

% #define	isnan(x)					\
%     ((sizeof (x) == sizeof (float)) ? __isnanf(x)	\
%     : (sizeof (x) == sizeof (double)) ? isnan(x)	\
%     : __isnanl(x))

The C99 standard says isnan is a macro. I would say that only means defined(isnan) is true. Whether that macro then expands to function calls or not is not important. I think it means only that defined(isnan) is true. isnan() can still be a function (declared or just in the compile-time namespace somewhere, or in a library object). It is reserved in the compile-time namespace, and the standard doesn't cover library objects, so conforming applications can't reference either except via the isnan() macro (if that has its strange historical implementation). I don't see how any conforming program can access the isnan() function directly. It is just as protected as __isnan() would be. (isnan)() gives the function (the function prototype uses this), but conforming programs can't do that since the function might not exist.
I don't think the standard allows a function to be declared with the same name as a standard macro (it does allow the reverse: define a macro with the same name as a standard function). I believe the following code is C99 conforming but it currently does not compile with our math.h:

--
#include
int (isnan)(int a, int b, int c)
{
	return (a + b + c);
}
--

I think isnan is just reserved, so you can't redefine it in any way. I think the reverse is even less allowed. Almost any standard function may be implemented as a macro, and then any macro definition of it would conflict with the previous macro even more than with a previous prototype. E.g.:

/* Header. */
void exit(int);
#define exit(x) __exit(x)

/* Application. */
#undef exit			/* non-conforming */
#define exit(x) my_exit(x)	/* conflicts without the #undef */

Now suppose the header doesn't define exit().

#define exit(x) my_exit(x)

This hides the prototype but doesn't automatically cause problems, especially if exit() is not used after this point. But this is still non-conforming, since exit() is reserved. Here are some relevant parts of C99 (n869.txt): %%% -- Each identifier with file scope listed in any of the following subclauses (including the future library directions) is reserved for use as macro and as an identifier with file scope in the same name space if any of its associated headers is included. [#2] No other identifiers are reserved. If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined. [#3] If the program removes (with #undef) any macro definition of an identifier in the first group listed above, the behavior is undefined. %%% Without any include of a header that is specified to declare exit(), file scope things are permitted for it, including defining it and making it a static function, but not making it an extern function.
isnan is reserved for use as a macro and as an identifier with file scope by the first clause above. Thus (isnan) cannot even be defined as a static function. But (isnan) is not reserved in inner scopes. I thought that declarations like "int (isnan);" are impossible since they look like syntax errors, but this syntax seems to be allowed and actually works with gcc-3.3.3 and TenDRA-5.0.0. So you can have variables with silly names like (isnan) and (getchar) :-). However, (NULL) for a variable name doesn't work, and (isnan) is a syntax error for struct member names. The compilers may be correct in allowing (isnan) but not (NULL) for variables. isnan happens to be function-like, so the parentheses are special for (isnan), but the parentheses are not
Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity
On Thu, 11 Jul 2013, David Chisnall wrote: You're joining in this discussion starting in the middle, so you probably missed the earlier explanation. I was mainly addressing a C99 point. I know little about C++ or C11. On 11 Jul 2013, at 05:21, Bruce Evans wrote: I don't see how any conforming program can access the isnan() function directly. It is just as protected as __isnan() would be. (isnan)() gives the function (the function prototype uses this), but conforming programs can't do that since the function might not exist. Maybe some non-conforming program like autoconfig reads or libm.a and creates a bug for C++. The cmath header defines a template function isnan that invokes the isnan macro, but then undefines the isnan macro. This causes a problem because when someone does something along the lines of using namespace std then they end up with two functions called isnan and the compiler gets to pick the one to use. Unfortunately, std::isnan() returns a bool, whereas isnan() returns an int. The C++ headers are not required to be conforming C code, because they are not C, and our math.h causes namespace pollution in C++ when included from <cmath>. <math.h> is also not required to be conforming C code, let alone C++ code, so there is only a practical requirement that it works when included in the C++ implementation. The FreeBSD isnan() implementation would be broken by removing the isnan() function from libm.a or ifdefing it in <math.h>. Changing the function to __isnan() would cause compatibility problems. The function is intentionally named isnan() to reduce compatibility problems. On OS X this is avoided because their isnan() macro expands to call one of the __-prefixed inline functions (which adopt your suggestion of being implemented as x != x, for all types). I am not sure that this is required for standards conformance, but it is certainly cleaner.
Your statement that having the function not called isnan() causes compatibility problems is demonstrably false, as neither OS X nor glibc has a function called isnan() and, unlike us, they do not experience problems with this macro. The compatibility that I'm talking about is with old versions of FreeBSD. isnan() is still in libc as a function since that was part of the FreeBSD ABI and too many things depended on getting it from there. It was recently removed from libc.so, but is still in libm.a. This causes some implementation problems in libm that are still not completely solved. I keep having to edit msun/src/s_isnan.c so that the msun sources are more portable. Mostly I need to kill the isnan() there so that it doesn't get in the way of the one in libc. This mostly works even if there is none in libc, since the builtins result in neither being used. isnanf() is more of a problem, since it is mapped to __isnanf() and there is no builtin for __isnanf(). The old functions have actually been removed from libc.a too. They are now only in libc_pic.a. libc.a still has isnan.o, but that is bogus since isnan.o is now empty. It would also be nice to implement these macros using _Generic when compiling in C11 mode, as it will allow the compiler to produce more helpful warning messages. I would propose this implementation: #if __has_builtin(__builtin_isnan) This won't work for me, since I develop and test msun with old compilers that don't support __has_builtin(). Much the same set of compilers also don't have enough FP builtins. It also doesn't even work. clang has squillions of builtins that aren't really builtins, so they reduce to libcalls. gcc has fewer builtins, but still many that reduce to libcalls. An example is fma(). __has_builtin(__builtin_fma) is true for clang on amd64 (freefall), but at least freefall's CPU doesn't support fma in hardware, so the builtin can't really work, and in fact it doesn't -- it reduces to a libcall.
This might change if the hardware supports fma, but then __has_builtin(__builtin_fma) would be even more useless for telling if fma is worth using. C99 has macros FP_FAST_FMA[FL] whose implementation makes them almost equally useless. For example, ia64 has fma in hardware and the implementation defines all of FP_FAST_FMA[FL] for ia64. But fma is implemented as an extern function, partly because there is no way to tell if __builtin_fma is any good (but IIRC, __builtin_fma is no good on ia64 either, since it reduces to the same extern function). The extern function is slow (something like 20 cycles instead of 1 for the fma operation). But if you ignore the existence of the C99 fma API and just write expressions of the form (a*x + b), then gcc on ia64 will automatically use the hardware fma, although this is technically wrong in some fenv environments. For gcc-4.2.1, __has_builtin(__builtin_fma) is a syntax error. I test with gcc-3.x. It is also missing __builtin_isnan(). The msun implementation knows that isnan() and oth
Re: CURRENT: CLANG 3.3 and -stad=c++11 and -stdlib=libc++: isnan()/isninf() oddity
On Wed, 10 Jul 2013, Garrett Wollman wrote: < said: I think isnan(double) and isinf(double) in math.h should only be visible if (_BSD_VISIBLE || _XSI_VISIBLE) && __ISO_C_VISIBLE < 1999. For C99 and higher there should only be the isnan/isinf macros. I believe you are correct. POSIX.1-2008 (which is aligned with C99) consistently calls isnan() a "macro", and gives a pseudo-prototype of int isnan(real-floating x); Almost any macro may be implemented as a function, if no conforming program can tell the difference. It is impossible for technical reasons to implement isnan() as a macro (except on weird implementations where all real-floating types are physically the same). In the FreeBSD implementation, isnan() is a macro, but it is also a function, and the macro expands to the function in double precision:

% #define	isnan(x)					\
%     ((sizeof (x) == sizeof (float)) ? __isnanf(x)	\
%     : (sizeof (x) == sizeof (double)) ? isnan(x)	\
%     : __isnanl(x))

I don't see how any conforming program can access the isnan() function directly. It is just as protected as __isnan() would be. (isnan)() gives the function (the function prototype uses this), but conforming programs can't do that since the function might not exist. Maybe some non-conforming program like autoconfig reads or libm.a and creates a bug for C++. The FreeBSD isnan() implementation would be broken by removing the isnan() function from libm.a or ifdefing it in <math.h>. Changing the function to __isnan() would cause compatibility problems. The function is intentionally named isnan() to reduce compatibility problems. OTOH, all of the extern sub-functions that are currently used should never be used, since using them gives a very low quality of implementation:
- the functions are very slow
- the functions have names that confuse compilers and thus prevent compilers from replacing them by builtins.
Currently, only gcc automatically replaces isnan() by __builtin_isnan(). This only works in double precision.
So the FreeBSD implementation only works right in double precision too, only with gcc, __because__ it replaces the macro isnan(x) by the function isnan(x). The result is inline expansion, the same as if the macro isnan() is replaced by __builtin_isnan(). clang never does this automatic replacement, so it generates calls to the slow library functions. Other things go wrong for gcc in other precisions:
- if <math.h> is not included, then isnan(x) gives __builtin_isnan((double)x). This sort of works on x86, but is low quality since it is broken for signaling NaNs (see below). One of the main reasons for the existence of the classification macros is that simply converting the arg to a common type and classifying the result doesn't always work.
- if <math.h> is not included, then spelling the API isnanf() or isnanl() gives correct results but a warning about these APIs not being declared. These APIs are nonstandard but are converted to __builtin_isnan[fl] by gcc.
- if <math.h> is included, then:
  - if the API is spelled isnan(), then the macro converts to __isnanf() or __isnanl(). gcc doesn't understand these, and the slow extern functions are used.
  - if the API is spelled isnanf() or isnanl(), then the result is correct and the warning magically goes away. <math.h> declares isnanf(), but gcc apparently declares both iff <math.h> is included. gcc also optimizes isnanl() on a float arg to __builtin_isnanf().
- no function version can work in some cases, because any function version may have unwanted side effects. This is another of the main reasons for the existence of these and other macros. The main unwanted side effect is signaling for signaling NaNs. C99 doesn't really support signaling NaNs, even with the IEC 60559 extensions, so almost anything is allowed for them. But IEEE 854 is fairly clear that isnan() and classification macros shouldn't raise any exceptions. IEEE 854 is even clearer that copying values without changing their representation should (shall?) not cause exceptions.
But on i387, just loading a float or double value changes its representation and generates an exception for signaling NaNs, while just loading a long double value conforms to IEEE 854 and doesn't change its representation or generate an exception. Passing of args to functions may or may not load the values. ABIs may require a change of representation. On i387, passing of double args should go through the FPU for efficiency reasons, and this changes the representation twice to not even get back to the original (for signaling NaNs, it generates an exception and sets the quiet bit in the result; thus a classification function can never see a signaling NaN in double precision). So a high quality implementation must not use function versions, and it must also use builtins that don't
Re: [RFC/RFT] calloutng
On Thu, 17 Jan 2013, Ian Lepore wrote: On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote: Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other "lock-free" accesses to the timecounter state. This still works in -current (at least there is still code for it). Unfortunately, calling pps_capture() in the primary interrupt context is no longer an option with the stock pps driver. Ever since the ppbus rewrite all ppbus children must use threaded handlers. I tried to fix that a couple different ways, and both ended up with crazy-complex code scattered around the ppbus family just to support the rarely-used pps capture. Hmm, I didn't notice that ppc supported pps (I try not to look at it since it is ugly :-), and don't know of any version of it that uses non-threaded handlers (except in FreeBSD-4 and before, where normal interrupt handlers were non-threaded, so ppc had their high latency but not the even higher latency and overheads of threaded handlers). OTOH, my x86 RTC interrupt handler is threaded and supports pps, and I haven't noticed any latency problems with this. It just can't possibly give the < ~1 usec jitter that FreeBSD-[3-4] could give ~15 years ago using a fast interrupt handler (there must be only 1 device using a fast interrupt handler, with this dedicated to pps, else the multiple fast interrupt handlers will give latency much larger than ~1 usec to each other). I don't actually use this for anything except testing whether the RTC can be used for a poor man's pps.
It would have been easier to do if filter and threaded interrupt handlers had the same function signature. I ended up writing a separate driver that can be used instead of ppc + ppbus + pps, since anyone who cares about precise pps capture is unlikely to be sharing the port with a printer or plip device or some such. Probably all pps handlers should be special. On x86 with reasonable timecounter hardware, say a TSC, it takes about 10 instructions for an entire pps interrupt handler:

XintrN:
	pushl	%eax
	pushl	%edx
	rdtsc			# Need some ugliness for EOI here or later.
	ss:movl	%eax,ppscap	# Hopefully lock-free via time-domain locking.
	ss:movl	%edx,ppscap+4
	popl	%edx
	popl	%eax
	iret

After capturing the timecounter hardware value here, you convert it to a pps event at leisure. But since this only happens once per second, it wouldn't be very inefficient to turn the interrupt handler into a slow high-latency one, even a threaded one, to handle the pps event and/or other devices attached to the interrupt. OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to acquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. 
In my rewritten dedicated pps driver I call pps_capture() from the filter handler and pps_event() from the threaded handler. I never found That seems right. any good documentation on the low-level details of this stuff, and there isn't enough good example code to work from. My hazy memory is that I There seem to be no good examples. ended up studying the pps_capture() and pps_event() code enough to infer that their design intent seems to be to allow you to capture with no locking and do the event processing later in some sort of deferred or threaded context. That seems to be the design, but there are no examples of separating the event from the capture. I think the correct locking is: - capture in a fast interrupt handler, into a per-device state that is locked by whatever locks all of the state accessed by the fast interrupt handle
Re: [RFC/RFT] calloutng
On Sun, 13 Jan 2013, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: On 06.01.2013 17:23, Marius Strobl wrote: I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest > 1. To be honest, I don't really understand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using a spinlock, but in waiting for another CPU while the spinlock is held. The other CPU may also hold a spinlock and wait for something, causing deadlock. The i8254 code uses a spinlock just to atomically access hardware registers, so it causes no problems. Okay, but wouldn't that be a general problem then? Pretty much anything triggering an IPI holds smp_ipi_mtx while doing so and the lower level IPI stuff waits for other CPU(s), including on x86. The problem is general. But now it works because the single smp_ipi_mtx is used in all cases where an IPI result is waited for. As long as spinning happens with interrupts still enabled, there are no deadlocks. But the problem reappears if any different lock is used, or locks are nested. In existing code in HEAD and 9 timecounters are never called with a spin mutex held. I intentionally tried to avoid that in the existing eventtimers code. Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other "lock-free" accesses to the timecounter state. 
This still works in -current (at least there is still code for it). OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to acquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. Callout code at the same time can be called in any environment with any locks held. And new callout code may need to know the precise current time in any of those conditions. An attempt to use an IPI and wait there can be fatal. Callout code can't be called from such a general "any" environment as timecounter code. Not from a fast interrupt handler. Not from an NMI or IPI handler. I hope. But timecounter code has a good chance of working even for the last 2 environments, due to its design requirement of working in the first. The spinlock in the i8254 timecounter certainly breaks some cases. For example, suppose the lock is held for a timecounter read from normal context. It masks hardware interrupts on the current CPU (except in my version). It doesn't mask NMIs or other traps. So if the NMI or other trap handler does a timecounter hardware call, there is deadlock in at least the !SMP case. 
In my version, it blocks normal interrupts later if they occur, but doesn't block fast interrupts, so the pps_capture() call would deadlock if it occurs, like a timecounter call from an NMI. I avoid this by not using pps in any fast interrupt handler, and by only using the i8254 timecounter for testing. I do use pps in a (nonstandard) x86 RTC clock interrupt handler. My clock interrupt handlers are all non-fast to avoid this and other locking problems. FYI, these are the results of the v215 (btw., these (ab)use a bus cycle counter of the host-PCI-bridge as timecounter) with your calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:

select		1	23.82
poll		1	1008.23
usleep		1	23.31
nanosleep	1	23.17
kqueue		1	1010.35
kqueueto	1	26.26
syscall		1	1.91
select		300	307.72
poll		300	1008.23
usleep		300	307.64
nanosleep	300	23.21

Please fix the tv_nsec initialization so that we can see if nanosleep() and
Re: [RFC/RFT] calloutng
On Thu, 3 Jan 2013, Alexander Motin wrote: On 03.01.2013 16:45, Bruce Evans wrote: On Wed, 2 Jan 2013, Alexander Motin wrote: More important for scheduling fairness, a thread's CPU percentage is also based on hardclock(), and hiding from it was trivial before, since all sleep primitives were strictly aligned to hardclock(). Now it is slightly less trivial, since this alignment was removed and user-level APIs provide no easy way to enforce it. %cpu is actually based on statclock(), and not even used for scheduling. Maybe for SCHED_4BSD, but not for SCHED_ULE. In SCHED_ULE both %cpu and thread priority are based on the same ts_ticks counter, which is based on hardclock() as a time source. Interactivity calculation uses alike logic and uses the same time source. Hmm. I missed this because it hacks on the 'ticks' global. It is clearer in intermediate versions which use the scheduler API sched_tick(), which is the hardclock analogue of sched_clock() for statclock. sched_tick() is now bogus since it is null for all schedulers. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: [RFC/RFT] calloutng
On Wed, 2 Jan 2013, Alexander Motin wrote: On 02.01.2013 19:09, Konstantin Belousov wrote: On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote: Probably one way to close this discussion would be to provide a sysctl so the sysadmin can decide which point in the interval to pick when there is no suitable callout already scheduled. Isn't trying to synchronize to the external events in this way unsafe? I remember, but cannot find the reference right now, a scheduler exploit(s) which completely hide malicious thread from the time accounting, by making it voluntary yielding right before statclock should fire. If statistic gathering could be piggy-backed on the external interrupt, and attacker can control the source of the external events, wouldn't this give her a handle? Fine-grained timeouts complete the full opening of this security hole. Synchronization without fine-grained timeouts might allow the same, but is harder to exploit since you can't control the yielding points directly. With fine-grained timeouts, you just have to predict the statclock firing points. Use one timeout to arrange to yield just before statclock fires and another to regain control just after it has fired. If the timeout resolution is say 50 usec, then this can hope to run for all except 100 usec out of every 1/stathz seconds. With stathz = 128, 1/stathz is 7812 usec, so this gives 7712/7812 of the CPU with 0 statclock ticks. Since the scheduler never sees you running, your priority remains minimal, so the scheduler should prefer to run you whenever a timeout expires, with only round-robin with other minimal-priority threads preventing you getting 7712/7812 of the (user non-rtprio) CPU. The previous stage of fully opening this security hole was changing (the default) HZ from 100 to 1000. HZ must not be much smaller than stathz, else the security hole is almost fully open. 
With HZ = 100 being less than stathz and timeout granularity limiting the fine control to 2/HZ = 20 msec (except you can use a periodic itimer to get a 1/HZ granularity at a minor cost of getting more SIGALRMs), it is impossible to get near 100% of the CPU with 0 statclock ticks. After yielding, you can't get control for another 10 or 20 msec. Since this exceeds 1/stathz = 7812.5 usec, you can only hide from statclock ticks by not running very often or for very long. Limited hiding is possible by wasting even more CPU to determine when to hide: since the timeout granularity is large, it is also ineffective for determining when to yield. So when running, you must poll the current time a lot to determine when to yield. Yield just before statclock fires, as above. (Do it 50 usec early, as above, to avoid most races involving polling the time.) This actually has good chances of not limiting the hiding too much, depending on the details of the scheduling. It yields just before a statclock tick. After this tick fires, if the scheduler reschedules for any reason, then the hiding process would most likely be run again, since its priority is minimal. But at least the old 4BSD scheduler doesn't reschedule after _every_ statclock tick. This depends on the bugfeature that the priority is not checked on _every_ return to user mode (sched_clock() does change the priority, but this is not acted on until much later). Without this bugfeature, there would be excessive context switches. OTOH, with timeouts, at least old non-fine-grained ones, you can force a rescheduling that is acted on soon enough simply by using timeouts (since timeouts give a context switch to the softclock thread, the scheduler has no option to skip checking the priority on return to user mode). After the previous stage of changing HZ to 1000, the granularity is fine enough for using timeouts to hide from the scheduler. 
Using a periodic itimer to get a granularity of 1000 usec, start hiding 50-1000 usec before each statclock tick and regain control 1000 usec later. With stathz = 128, 6812/7812 of the CPU with 0 statclock ticks. Not much worse (for the hider) than 7712/7812. Statclock was supposed to be aperiodic to avoid hiding (see statclk-usenix93.ps), but this was never implemented in FreeBSD. With fine-grained timeouts, it would have to be very aperiodic, to the point of giving large inaccuracies, to limit the hiding very much. For example, suppose that it has an average period of 7812 usec with +-50% jitter. You would try to hide from it most of the time by running for a bit less than 7812/2 usec before yielding in most cases. If too much scheduling is done on each statclock tick, then you are likely to regain control after each one (as above) and then know that there is almost a full minimal period until the next one. Otherwise, it seems to be necessary to determine when the previous statclock tick occurred, so as to determine the minimum time until the next one. There are many different kinds of accounting with different characteristics. Run time for each thread cal
Re: API explosion (Re: [RFC/RFT] calloutng)
I finally remembered to remove the .it phk :-). On Wed, 19 Dec 2012, Luigi Rizzo wrote: On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote: ... As I said in my previous email:

typedef dur_t int64_t;	/* signed for bug catching */
#define DURSEC	((dur_t)1 << 32)
#define DURMIN	(DURSEC * 60)
#define DURMSEC	(DURSEC / 1000)
#define DURUSEC	(DURSEC / 1000000)
#define DURNSEC	(DURSEC / 1000000000)

(Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) only thing, we must be careful with the parentheses For instance, in your macro, DURNSEC evaluates to 0 and so does any multiple of it. We should define them as #define DURNSEC DURSEC / 1000000000 ... so DURNSEC is still 0 and 500*DURNSEC gives 2147 I am curious that Bruce did not mention this :) Er, he was careful. DURNSEC gives 4, not 0. This is not very accurate, but probably good enough. Your version without parentheses is not so careful and depends on a magic order of operations and no overflow from this. E.g.: 500*DURNSEC = 500*DURSEC / 1000000000 = 500*((dur_t)1 << 32) / 1000000000 This is very accurate and happens not to overflow. But 5 seconds represented a little strangely in nanoseconds would overflow: 5000000000*DURNSEC = 5000000000*((dur_t)1 << 32) / 1000000000 So would 5 billion times DURSEC, but 5 billion seconds is more unreasonable than 5 billion nanoseconds and the format just can't represent that. (btw the typedef is swapped, should be "typedef int64_t dur_t") Didn't notice this. Bruce
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message <20121220005706.i1...@besplex.bde.org>, Bruce Evans writes: On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: Except that for absolute timescales, we're running out of the 32 bits integer part. Except 32 bit time_t works until 2106 if it is unsigned. That's sort of not an option. I think it is. It is just probably not necessary since 32-bit systems will go away before 2038. The real problem was that time_t was not defined as a floating point number. That would be convenient too, but bad for efficiency on some systems. Kernels might not be able to use it, and then would have to use an alternative representation, which they should have done all along. [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. You can say that about anything but CPU-native operations, and I doubt it would be as inefficient as struct bintime, which does not have access to the carry bit. Yes, I would say that about non-native. It goes against the spirit of C. OTOH, compilers are getting closer to giving full access to the carry bit. I just checked what clang does in a home-made 128-bit add function:

% static void __noinline
% uadd(struct u *xup, struct u *yup)
% {
% 	unsigned long long t;
%
% 	t = xup->w[0] + yup->w[0];
% 	if (t < xup->w[0])
% 		xup->w[1]++;
% 	xup->w[0] = t;
% 	xup->w[1] += yup->w[1];
% }
%
% 	.align	16, 0x90
% 	.type	uadd,@function
% uadd:				# @uadd
% 	.cfi_startproc
% # BB#0:			# %entry
% 	movq	(%rdi), %rcx
% 	movq	8(%rdi), %rax
% 	addq	(%rsi), %rcx

gcc generates an additional cmpq instruction here.

% 	jae	.LBB2_2

clang uses the carry bit set by the first addition to avoid the comparison, but still branches. 
% # BB#1:			# %if.then
% 	incq	%rax
% 	movq	%rax, 8(%rdi)

This adds 1 explicitly instead of using adcq, but this is the slow path.

% .LBB2_2:			# %if.end
% 	movq	%rcx, (%rdi)
% 	addq	8(%rsi), %rax

This is as efficient as possible except for the extra branch, and the branch is almost perfectly predictable.

% 	movq	%rax, 8(%rdi)
% 	ret
% .Ltmp22:
% 	.size	uadd, .Ltmp22-uadd
% 	.cfi_endproc

Bruce
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Davide Italiano wrote: On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans wrote: I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks < 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. Bruce, I don't really agree with this. The data addressed by the cookie should still be stored somewhere, and the KBI will end up broken. This, indeed, is not a real problem as long as the current calloutng code heavily breaks the KBI, but if that was your point, I don't see how your proposed change could help. In the old API, it is an error to pass ticks < 0, so only broken old callers are affected. Of course, if there are any then it would be hard to detect their garbage cookies. Anyway, it's too late to change to this, and maybe also to a 32.32 format. [32.32 format] This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. I think phk's proposal is better. About your overflow objection, I think it is really unlikely to happen, but better safe than sorry. It's very easy for applications to cause kernel overflow using valid syscall args like tv_sec = TIME_T_MAX for a relative time in nanosleep(). 
Adding TIME_T_MAX to the current time in seconds overflows for all current times except for the first second after the Epoch. There is no difference between the overflow for 32-bit and 64-bit time_t's for this. This is now mostly handled so that the behaviour is harmless although wrong. E.g., the timeout might become negative, and then since it is not a cookie it is silently replaced by a timeout of 1 tick. In nanosleep(), IIRC there are further overflows that result in returning early instead of retrying the 1-tick timeouts endlessly. Bruce
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message <20121219221518.e1...@besplex.bde.org>, Bruce Evans writes: With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 < d2) This would make a better general format than timevals, timespecs and of course bintimes :-). Except that for absolute timescales, we're running out of the 32 bits integer part. Except 32 bit time_t works until 2106 if it is unsigned. Bintimes is a necessary superset of the 32.32 which tries to work around the necessary but missing int96_t or int128_t[1]. [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. Bruce
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message , Davide Italiano writes: Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. So you always agreed with me that bintimes are unsuitable for almost everything, and especially unsuitable for timeouts? :-) (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) bintimes are not unsuitable for timekeeping, but they are painful to use for other APIs. You have to either put bintimes in layers in the other APIs, or convert them to a more suitable format, and there is a problem placing the conversion at points where it is efficient. This thread seems to be mostly about putting the conversion in wrong places. My original objection was about using bintimes for almost everything at the implementation level. I do not really think it is worth creating another structure for handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks < 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. 
This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. As I said in my previous email:

typedef dur_t int64_t;	/* signed for bug catching */
#define DURSEC	((dur_t)1 << 32)
#define DURMIN	(DURSEC * 60)
#define DURMSEC	(DURSEC / 1000)
#define DURUSEC	(DURSEC / 1000000)
#define DURNSEC	(DURSEC / 1000000000)

(Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) Constructing the cookie for my special case would not be so easy. With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 < d2) This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. Bruce
Re: [RFC/RFT] calloutng
On Sat, 15 Dec 2012, Garrett Cooper wrote: On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote: On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote: Hi. I'm sorry to interrupt review, but as usual good ideas came during the final testing, causing another round. :) Here is updated patch for HEAD, that includes several new changes: http://people.freebsd.org/~mav/calloutng_12_15.patch This patch breaks the libprocstat build. Specifically, the OpenSolaris sys/time.h defines the preprocessor symbols gethrestime and gethrestime_sec. These symbols are also defined in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h. libprocstat:zfs.c is compiled using include paths that pick up the OpenSolaris time.h, and with this patch _callout.h includes sys/time.h. zfs.c includes taskqueue.h (with _KERNEL defined), which includes _callout.h, so both time.h and zfs_context.h are included in zfs.c, and the symbols are thus defined twice. Gross namespace pollution. sys/_callout.h exists so that the full namespace pollution of sys/callout.h doesn't get included nested. But sys/time.h is much more polluted than sys/callout.h. However, sys/time.h is old standard pollution in sys/param.h, and sys/callout.h is not so old standard pollution in sys/systm.h. It is a bug to not include sys/param.h and sys/systm.h in most kernel source code, so these nested includes are just style bugs -- they have no effect for correct kernel source code. The patch below fixes the build for me. Another approach might be to include sys/_task.h instead of taskqueue.h at the beginning of zfs.c. Good if it works. I had a patch open once upon a time to cleanup inclusion of sys/time.h all over the tree and deal with the sys/time.h <-> time.h pollution issue, but it got dropped due to lack of interest (20~30 apps/libs were affected IIRC and I only really got assistance in fixing the UFS and bsnmpd pieces, and gave up due to lack of response from maintainers). 
dtrace/zfs is a definite instigator in this pollution (I remember nasty cddl/... pollution with the compat sys/time.h header). Please use the unix newline character in mail. The above is difficult to quote. The standard sys/time.h pollution in sys/param.h is only in the kernel, and there aren't many direct includes of sys/time.h in the kernel. Userland is different and many of the direct includes were correct. But now POSIX specifies that struct timespec and struct timeval be defined in most places where they are needed, so the includes of sys/time.h are not necessary for POSIX or FreeBSD, although FreeBSD man pages still say that they are necessary. The sys/time.h <-> time.h pollution issue is also only for userland. Many places depend on one including the other, and include the wrong one themselves. Bottom line: make sure anything new you're defining isn't already defined via POSIX or other OSes, and if so please try to make the implementations match (so that eventual POSIX inclusion might be possible) and when in doubt I suggest consulting standards@ / brde@. Bruce
Re: [RFC/RFT] calloutng
On Sat, 15 Dec 2012, Oliver Pinter wrote: On 12/15/12, Bruce Evans wrote: ... Because of the different grouping of the multiplications, the second is unfortunately slower (1 more multiplication that cannot be done at compile time). The second also gives unnecessary (but fundamental to the method) inaccuracy by pulling out the factor of 1000. The first gives the same inaccuracy, and now it is because the constant is not correctly rounded. It should be 2.0**64 / 10**3 = 18446744073709551.616 (exactly) = 18446744073709552 (rounded to nearest int) but is actually rounded down to a multiple of 1000. ... mav@ already fixed the rounding before I wrote that :-). He also changed some (uint64_t)1's to use the long long abomination :-(. Thanks for the detailed answer. :) Bruce
Re: [RFC/RFT] calloutng
On Sat, 15 Dec 2012, Bruce Evans wrote: On Fri, 14 Dec 2012, Oliver Pinter wrote: What is this 18446744073709000LL constant? This is 2**64 / 10**6 * 10**3 obfuscated by printing it in hex and doing the scaling by powers of 10 manually, and then giving it a bogus type using the abominable long long misfeature. I try to kill this obfuscation and the abomination whenever I see them. In sys/time.h, this resulted in a related binary conversion using a scale factor of ((uint64_t)1 << 63) / (1000000000 >> 1). Here the power of 2 term is 2**63. 2**64 cannot be used since it exceeds uintmax_t. The power of 10 term is 10**9. This is divided by 2 to compensate for dividing 2**64 by 2. The abomination is avoided by using smaller literal values and expanding them to 64-bit values using shifts. Bah, this is only de-obfuscated and de-abominated in my version:

% Index: time.h
% ===
% RCS file: /home/ncvs/src/sys/sys/time.h,v
% retrieving revision 1.65
% diff -u -2 -r1.65 time.h
% --- time.h	7 Apr 2004 04:19:49 -0000	1.65
% +++ time.h	7 Apr 2004 11:28:54 -0000
% @@ -118,6 +118,5 @@
%
%  	bt->sec = ts->tv_sec;
% -	/* 18446744073 = int(2^64 / 1000000000) */
% -	bt->frac = ts->tv_nsec * (uint64_t)18446744073LL;
% +	bt->frac = ts->tv_nsec * (((uint64_t)1 << 63) / (1000000000 >> 1));
% }

The magic 1844... in time.h is at least commented on. This makes it less obscure, but takes twice as many source lines and risks the comment getting out of date with the code. The comment is also sloppy with types and uses the '^' operator without saying that it is exponentiation and nothing like the C '^' operator. The types are especially critical in the shift expression. I like to use the Fortran '**' operator in C comments without saying what it is instead. In another reply to this thread, the value in the explanation is off by a factor of 1000 and the rounding to a multiple of 1000 is not explained. It is easy to have such errors in comments, while the code tends to be more correct since it gets checked by running it. 
Bruce
Re: [RFC/RFT] calloutng
On Fri, 14 Dec 2012, Oliver Pinter wrote:

635 -	return tticks;
636 +	getbinuptime(&pbt);
637 +	bt.sec = data / 1000;
638 +	bt.frac = (data % 1000) * (uint64_t)18446744073709000LL;
639 +	bintime_add(&bt, &pbt);
640 +	return bt;

Style bugs: missing parentheses around the return value in new and old code.

641  }

What is this 18446744073709000LL constant? This is 2**64 / 10**6 * 10**3 obfuscated by printing it in hex and doing the scaling by powers of 10 manually, and then giving it a bogus type using the abominable long long misfeature. I try to kill this obfuscation and the abomination whenever I see them. In sys/time.h, this resulted in a related binary conversion using a scale factor of ((uint64_t)1 << 63) / (1000000000 >> 1). Here the power of 2 term is 2**63. 2**64 cannot be used since it exceeds uintmax_t. The power of 10 term is 10**9. This is divided by 2 to compensate for dividing 2**64 by 2. The abomination is avoided by using smaller literal values and expanding them to 64-bit values using shifts. Long long suffixes on literal constants are only needed to support C90 compilers with the long long extension on 32-bit systems anyway. Otherwise, C90+extension compilers will warn about literal constants larger than ULONG_MAX (which can only occur on 32-bit systems). Since C99 is now the default, the warnings would only occur without LL in the above if you use nonstandard CFLAGS. The above has to convert from the bad units of milliseconds to the bloated units of bintimes, and it is less refined than most other bintime conversions. I think that since it doesn't try to be optimal, it should just use the standard bintime conversions after first converting milliseconds to a timeval. It already does essentially that with its divisions by 1000:

	struct timeval tv;

	tv.tv_sec = data / 1000;
	tv.tv_usec = data % 1000 * 1000;
	timeval2bintime(&tv, &bt);

The compiler will probably optimize /1000 and %1000 to shifts in both this and the above. 
Then timeval2bintime() does almost the same multiplication as above, but spelled differently. Both give unnecessary inaccuracy in the conversion to weenieseconds. The first gives:

	bt.frac = data % 1000 * (2**64 / 10**6 * 10**3);

the second gives:

	bt.frac = data % 1000 * 1000 * (2**64 / 10**6);

Because of the different grouping of the multiplications, the second is unfortunately slower (1 more multiplication that cannot be done at compile time). The second also gives unnecessary (but fundamental to the method) inaccuracy by pulling out the factor of 1000. The first gives the same inaccuracy, and now it is because the constant is not correctly rounded. It should be

	2.0**64 / 10**3 = 18446744073709551.616 (exactly)
	                = 18446744073709552 (rounded to nearest int)

but is actually rounded down to a multiple of 1000. It would be better to round the scale factors so that the conversions are inverses of each other and tticks can be recovered from bt, but this is impossible. I tried to make the bintime conversions invert most values correctly by rounding to nearest, but phk didn't like this and the result is the bogus comment about always rounding down in time.h. So when you start with 999 msec in tticks, the resulting bt will be rounded down a little and converting back will give 998 msec; the next round of conversions will reduce it by 1 more, and so on until you reach a value that is exactly representable in both milliseconds and weenieseconds (875?). This despite weenieseconds providing vastly more accuracy than can be measured and vastly more accuracy than needed to represent all other time values in the kernel in a unique way. Just not in a unique way that is expressible using simple scaling conversions. The conversions that give uniqueness can still be monotonic, but can't be nonlinear in the same way that simple scaling gives.
Bruce
Re: clang compiled kernel panic when mounting zfs root on i386
On Mon, 26 Nov 2012, Konstantin Belousov wrote: On Mon, Nov 26, 2012 at 06:31:34AM -0800, sig6247 wrote: Just checked out r243529, this only happens when the kernel is compiled by clang, and only on i386, either recompiling the kernel with gcc or booting from a UFS root works fine. Is it a known problem? It looks like clang uses more stack than gcc, and zfs makes quite deep call chains. It would be a waste, generally, to increase the init process kernel stack size only to pacify zfs. And I suspect that it would not help in similar situations when the same procedure is initiated for non-root mounts. Or to pacify clang...

--
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from zfs:zroot []...
Fatal double fault:
eip = 0xc0adc37d esp = 0xc86bffc8 ebp = 0xc86c003c
cpuid = 1; apic id = 01
panic: double fault
cpuid = 1
KDB: enter: panic
[ thread pid 1 tid 12 ]
Stopped at kdb_enter+0x3d: movl	$0,kdb_why
db> bt
Tracing pid 1 tid 12 td 0xc89efbc0
kdb_enter(c1064aa4,c1064aa4,c10b806f,c139e3b8,f5eacada,...) at kdb_enter+0x3d
panic(c10b806f,1,1,1,c86c003c,...) at panic+0x14b
dblfault_handler() at dblfault_handler+0xab
--- trap 0x17, eip = 0xc0adc37d, esp = 0xc86bffc8, ebp = 0xc86c003c ---
witness_checkorder(c1fd7508,9,c109df18,7fa,0,...) at witness_checkorder+0x37d
__mtx_lock_flags(c1fd7518,0,c109df18,7fa,c135d918,...) at __mtx_lock_flags+0x87
uma_zalloc_arg(c1fd66c0,0,1,4d3,c86c0110,...) at uma_zalloc_arg+0x605
vm_map_insert(c1fd508c,c13dfc10,bb1f000,0,cba1e000,...) at vm_map_insert+0x499
kmem_back(c1fd508c,cba1e000,1000,3,c86c01d4,...) at kmem_back+0x76
kmem_malloc(c1fd508c,1000,3) at kmem_malloc+0x250
page_alloc(c1fd1d80,1000,c86c020b,3,c1fd1d80,...) at page_alloc+0x27
keg_alloc_slab(103,4,c109df18,870,cb99ef6c,...) at keg_alloc_slab+0xc3
keg_fetch_slab(103,c1fd1d80,cb99ef6c,c1fc8230,c86c02c0,...) at keg_fetch_slab+0xe2
zone_fetch_slab(c1fd1d80,c1fd0480,103,826,0,...) at zone_fetch_slab+0x43
uma_zalloc_arg(c1fd1d80,0,102,3,2,...) at uma_zalloc_arg+0x3f2
malloc(4c,c1686100,102,c86c0388,c173d09a,...) at malloc+0xe9
zfs_kmem_alloc(4c,102,cb618820,c89efbc0,cb618820,...) at zfs_kmem_alloc+0x20
vdev_mirror_io_start(cb8218a0,10,cb8218a0,1,0,...) at vdev_mirror_io_start+0x14a
zio_vdev_io_start(cb8218a0,c89efbc0,0,cb8218a0,c86c0600,...) at zio_vdev_io_start+0x228
zio_execute(cb8218a0,cb618000,cba1b640,cb90,400,...) at zio_execute+0x106
spa_load_verify_cb(cb618000,0,cba1b640,cb884b40,c86c0600,...) at spa_load_verify_cb+0x89
traverse_visitbp(cb884b40,cba1b640,c86c0600,c86c0ba0,0,...) at traverse_visitbp+0x29f
traverse_dnode(cb884b40,0,0,8b,0,...) at traverse_dnode+0x92
traverse_visitbp(cb884bb8,cba07200,c86c0890,cb884bf4,c16ce7e0,...) at traverse_visitbp+0xe47
traverse_visitbp(cb884bf4,cb9bf840,c86c0968,c86c0ba0,0,...) at traverse_visitbp+0xf32
traverse_dnode(cb884bf4,0,0,0,0,...) at traverse_dnode+0x92
traverse_visitbp(0,cb618398,c86c0b50,2,cb9f1c78,...) at traverse_visitbp+0x96d
traverse_impl(0,0,cb618398,3e1,0,...) at traverse_impl+0x268
traverse_pool(cb618000,3e1,0,d,c1727830,...) at traverse_pool+0x79
spa_load(0,1,c86c0ec4,1e,0,...) at spa_load+0x1dde
spa_load(0,0,c13d8c94,1,3,...) at spa_load+0x11a5
spa_load_best(0,,,1,c0adc395,...) at spa_load_best+0x71
spa_open_common(c17e0e1e,0,0,c86c1190,c16f5a1c,...) at spa_open_common+0x11a
spa_open(c86c1078,c86c1074,c17e0e1e,c135d918,c1fd7798,...) at spa_open+0x27
dsl_dir_open_spa(0,cb770030,c17e11b1,c86c11f8,c86c11f4,...) at dsl_dir_open_spa+0x6c
dsl_dataset_hold(cb770030,cb613800,c86c1240,cb613800,cb613800,...) at dsl_dataset_hold+0x3a
dsl_dataset_own(cb770030,0,cb613800,c86c1240,c1684e30,...) at dsl_dataset_own+0x21
dmu_objset_own(cb770030,2,1,cb613800,c86c1290,...) at dmu_objset_own+0x2a
zfsvfs_create(cb770030,c86c13ac,c17ee05d,681,0,...) at zfsvfs_create+0x4c
zfs_mount(cb78ed20,c17f411c,c9ff4600,c89cae80,0,...) at zfs_mount+0x42c
vfs_donmount(c89efbc0,4000,0,c86c1790,cb6c0800,...) at vfs_donmount+0xc6d
kernel_mount(cb7700b0,4000,0,0,1,...) at kernel_mount+0x6b
parse_mount(cb7700e0,c1194498,0,1,0,...) at parse_mount+0x606
vfs_mountroot(c13d95b0,4,c105c042,2bb,0,...) at vfs_mountroot+0x6cf
start_init(0,c86c1d08,c105e94c,3db,0,...) at start_init+0x6a
fork_exit(c0a42090,0,c86c1d08) at fork_exit+0x7f
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc86c1d40, ebp = 0 ---
db>

43 deep (before the double fault) is disgusting, but even if clang has broken stack alignment due to a wrong default and no -mpreferred-stack-boundary to fix it, that's still only about 8*43 extra bytes (8 for the average extra stack to align to 16 bytes). Probably zfs is also putting large data structures on the stack. It would be useful if the stack trace printed the stack pointer on every function call, so that you could see how much stack each function used. All those ', ...' printed after 5 args show further appare
Re: Use of C99 extra long double math functions after r236148
On Wed, 25 Jul 2012, Stephen Montgomery-Smith wrote: On 07/25/12 12:31, Steve Kargl wrote: On Wed, Jul 25, 2012 at 12:27:43PM -0500, Stephen Montgomery-Smith wrote: Just as a point of comparison, here is the answer computed using Mathematica:

	N[Exp[2], 50]
	7.3890560989306502272304274605750078131803155705518

As you can see, the expl solution has only a few digits more accuracy than exp. Unless you are using sparc64 hardware.

	flame:kargl[204] ./testl -V 2
	ULP = 0.2670 for x = 2.0e+00
	mpfr exp: 7.389056098930650227230427460575008e+00
	libm exp: 7.389056098930650227230427460575008e+00

Yes. It would be nice if long double on the Intel was as long as on the sparc64. You want it to be as slow as sparc64? (About 300 times slower, after scaling the CPU clock rates. Doubles on sparc64 are less than 2 times slower.) I forgot to mention in a previous reply that expl has only a few more decimal digits of accuracy than exp because the extra precision on x86 wasn't designed to give much more accuracy. It was designed to give more chance of full double precision accuracy in naive code. It was designed in ~1980 when bits were expensive and the extra 11 bits provided by the 8087 were considered the best tradeoff between cost and accuracy. They only provide 2-3 extra decimal digits of accuracy. They are best thought of as guard bits. Floating point uses 1 or 2 guard bits internally. 11 extends that significantly and externalizes it, but is far from doubling the number of bits. Their use to provide extra precision was mostly defeated in C by bad C bindings and implementations. This was consolidated by my not using the extra bits for the default rounding precision in FreeBSD. This has been further consolidated by SSE not supporting extended precision. Now the naive code that uses doubles never gets the extra precision on amd64.
Mixing of long doubles with doubles is much slower with SSE+i387 than with i387, since the long doubles are handled in different registers and must be translated with SSE+i387, while with i387, using long doubles is almost free (it actually has a negative cost in non-naive code since it allows avoiding extra precision in software). Thus SSE also inhibits using the extra precision intentionally.

Bruce
Re: Use of C99 extra long double math functions after r236148
On Wed, 25 Jul 2012, Rainer Hurling wrote: On 25.07.2012 19:00 (UTC+2), Steve Kargl wrote: On Wed, Jul 25, 2012 at 06:29:18PM +0200, Rainer Hurling wrote: Many thanks to you three for implementing expl() with r238722 and r238724. I am not a C programmer, but would like to ask if the following example is correct and suitable as a minimalistic test of this new C99 function? It's not clear to me what you mean by test. If expl() is not available in libm, then linking the code would fail. So, testing for the existence of expl() (for example, in a configure script) is as simple as Sorry for not being clear enough. I didn't mean testing for the existence, but for some comparable output between exp() and expl(), on a system with expl() available in libm. This is basically what I do to test exp() (with a few billion cases automatically generated and compared). It is not sufficient for checking expl(), except for consistency. (It is assumed that expl() is reasonably accurate. If it is in fact less accurate than exp(), this tends to show up in the comparisons.)

	#include <math.h>

	long double
	func(long double x)
	{
		return (expl(x));
	}
	//---
	#include <stdio.h>
	#include <math.h>

	int
	main(void)
	{
		double c = 2.0;
		long double d = 2.0;
		double e = exp(c);
		long double f = expl(d);

		printf("exp(%f) is %.*f\n", c, 90, e);
		printf("expl(%Lf) is %.*Lf\n", d, 90, f);

If you mean testing that the output is correct, then asking for 90 digits is of little use. The following is sufficient (and may actually produce a digit or two more than is available in the number). Ok, I understand. I printed the 90 digits to be able to take a look at the decimal places, I did not expect to get valid digits in this area. Use binary format (%a) for manual comparison. Don't print any more bits than the format has. This is DBL_MANT_DIG (53) for doubles and LDBL_MANT_DIG (64 on x86) for long doubles. %a format is in nybbles and tends to group the bits into nybbles badly. See below on reducing problems from this.
Decimal format has to print about 3 more digits than are really meaningful, to allow recovering the original value uniquely. For manual comparison, you need to print these extra digits and manually round or ignore them as appropriate. The correct number of extra digits is hard to determine. For the "any" type, it is DECIMAL_DIG (21) on x86. The corresponding number of normally-accurate decimal digits for long doubles is given by LDBL_DIG (18). For floats and doubles, this corresponds to FLT_DIG (6) and DBL_DIG (15). Unfortunately, <float.h> doesn't define anything corresponding to DECIMAL_DIG for the smaller types. 21 is a lot of digits, and noise digits take a long time to determine and ignore (it's worse on sparc64, where DECIMAL_DIG is 36). I usually add 2 extra digits to the number of normally-accurate digits. This is sloppy. 3 is needed in some cases, depending on MANT_DIG and the bits in log(2) and/or log(10).

troutmask:fvwm:kargl[203] diff -u a.c.orig a.c
--- a.c.orig	2012-07-25 09:38:31.0 -0700
+++ a.c	2012-07-25 09:40:36.0 -0700
@@ -1,5 +1,6 @@
 #include <stdio.h>
 #include <math.h>
+#include <float.h>
 
 int main(void)
 {
@@ -9,8 +10,8 @@
 	double e = exp(c);
 	long double f = expl(d);
 
-	printf("exp(%f) is %.*f\n", c, 90, e);
-	printf("expl(%Lf) is %.*Lf\n", d, 90, f);
+	printf("exp(%f) is %.*f\n", c, DBL_DIG+2, e);
+	printf("expl(%Lf) is %.*Lf\n", d, LDBL_DIG+2, f);
 	return 0;
 }

Thanks, I was not aware of DBL_DIG and LDBL_DIG. Steve is sloppy and adds 2 also :-). For long doubles, it is clear that 3 are strictly needed, since DECIMAL_DIG is 3 more. For most long double functions on i386, you need to switch the rounding precision to 64 bits around calls to them, and also to do any operations on the results except printing them. expl() is one of the few large functions that does the switch internally. So the above should work (since it only prints), but (expl(d) + 0) should round to the default 53-bit precision and this gives the same result as exp(d).
If you actually want to test expl() to see if it is producing a decent result, you need a reference solution that contains higher precision. I use mpfr with 256 bits of precision.

troutmask:fvwm:kargl[213] ./testl -V 2
ULP = 0.3863
x = 2.00e+00
libm: 7.389056098930650227e+00 0x1.d8e64b8d4ddadcc4p+2
mpfr: 7.389056098930650227e+00 0x1.d8e64b8d4ddadcc4p+2
mpfr: 7.3890560989306502272304274605750078131803155705518\
47324087127822522573796079054e+00
mpfr: 0x7.63992e35376b730ce8ee881ada2aeea11eb9ebd93c887eb59ed77977d109f148p+0

The 1st 'mpfr:' line is produced after converting the results of mpfr_exp() to long double. The 2nd 'mpfr:' line is produced by mpfr_printf(), where the number of printed digits depends on the 256-bit precision. The last 'mpfr:' line is mpfr_printf()'s hex formatting. Unfortunately, it does not normalize the hex representation to start with '0x1.', w
Re: [head tinderbox] failure on i386/i386
On Tue, 22 May 2012, FreeBSD Tinderbox wrote: [...]

from /obj/i386.i386/src/tmp/usr/include/sys/_types.h:33,
from /obj/i386.i386/src/tmp/usr/include/stdio.h:41,
from /src/sbin/devd/parse.y:33:
/obj/i386.i386/src/tmp/usr/include/x86/_types.h:51: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'typedef'
/obj/i386.i386/src/tmp/usr/include/x86/_types.h:96: error: expected '=', ',', ';', 'asm' or '__attribute__' before '__int_least8_t'
cc1: warnings being treated as errors
/src/sbin/devd/parse.y: In function 'yyparse':
/src/sbin/devd/parse.y:103: warning: implicit declaration of function 'add_attach'

Another bug in the new yacc is that it uses hard-coded GNUisms like __attribute__(()) (maybe firm-coded by autoconf) instead of hard-coded FreeBSDisms like __printflike(). But this is not the bug here. devd.h is just included in a wrong order (before its prerequisites) in parse.y. This worked accidentally because old yacc includes sufficient namespace pollution earlier.

Bruce
Re: Some performance measurements on the FreeBSD network stack
On Fri, 20 Apr 2012, K. Macy wrote: On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo wrote: The small penalty when flowtable is disabled but compiled in is probably because the net.flowtable.enable flag is checked a bit deep in the code. The advantage with non-connect()ed sockets is huge. I don't quite understand why disabling the flowtable still helps there. Do you mean having it compiled in but disabled still helps performance? Yes, that is extremely strange. This reminds me that when I worked on this, I saw very large throughput differences (in the 20-50% range) as a result of minor changes in unrelated code. I could get these changes intentionally by adding or removing padding in unrelated unused text space, so the differences were apparently related to text alignment. I thought I had some significant micro-optimizations, but it turned out that they were acting mainly by changing the layout in related used text space where it is harder to control. Later, I suspected that the differences were more due to cache misses for data than for text. The CPU and its caching must affect this significantly. I tested on an AthlonXP and Athlon64, and the differences were larger on the AthlonXP. Both of these have a shared I/D cache so pressure on the I part would affect the D part, but in this benchmark the D part is much more active than the I part so it is unclear how text layout could have such a large effect. Anyway, the large differences made it impossible to trust the results of benchmarking any single micro-benchmark. Also, ministat is useless for understanding the results. (I note that luigi didn't provide any standard deviations and neither would I. :-). My results depended on the cache behaviour but didn't change significantly when rerun, unless the code was changed.

Bruce
Re: strange ping response times...
On Thu, 12 Apr 2012, Luigi Rizzo wrote: On Thu, Apr 12, 2012 at 01:18:59PM +1000, Bruce Evans wrote: On Wed, 11 Apr 2012, Luigi Rizzo wrote: On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote: ... ping takes a timestamp in userspace before trying to transmit the packet, and then the timestamp for the received packet is recorded in the kernel (in the interrupt or netisr thread i believe -- anyways, not in userspace). No, all timestamps used by ping are recorded in userland. Bruce, look at the code in ping.c -- SO_TIMESTAMP is defined, so the program does (successfully) a

	setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

and then (verified that at runtime the code follows this path) ... Indeed it does. This accounts for the previously unaccounted for binuptime call in the kernel. This is part of fenner's change (perhaps the most important part) to put the timestamping closer to the actual i/o, so that the RTT doesn't count setup overheads. Timestamping still works when SO_TIMESTAMP is undefed (after fixing the ifdefs on it). Not using it costs only 1 more gettimeofday() call in ping, but it increases my apparent RTT from 7-8 usec to 10-11 usec. 3 extra usec seems a lot for the overhead. I found that ping -f saves 1 gettimeofday() call per packet, and my version saves another 1. Counts of gettimeofday() calls per packet:

	-current ping -q localhost:	4
	my ping -q localhost:		3
	-current ping -fq localhost:	3
	my ping -fq localhost:		2

1 gettimeofday() call is needed for putting a timestamp in the output packet. Apparently, this is not well integrated with the bookkeeping for select(), and up to 3 more gettimeofday() calls are used.
select() timeouts only have 1/HZ granularity, so the gettimeofday() calls to set them up could use clock_gettime() with CLOCK_REALTIME_FAST_N_BROKEN, but this would be a bogus optimization, since most of the overhead is in the syscall provided the timecounter hardware is not slow (it takes 9-80 cycles to read TSC timecounter hardware and 600-800 cycles for clock_gettime() with a TSC timecounter). I just noticed that CLOCK_MONOTONIC cannot be used for the timestamp in the output packet if SO_TIMESTAMP is used for the timestamp in the input packet, since SO_TIMESTAMP uses CLOCK_REALTIME for historical reasons. There are 2 select()s per packet. Perhaps the number of gettimeofday()s can be reduced to 1 per packet in most cases (get one for the output packet and use it for both select()s). With my version the truss trace for ping localhost is:

% 64 bytes from 127.0.0.1: icmp_seq=6 ttl=64 time=0.473 ms
% write(1,0x80d4000,57) = 57 (0x39)
% gettimeofday({1334226305 409249},0x0) = 0 (0x0)

Need this for the next select(). There is a ~1 second pause here.

% select(4,{3},0x0,0x0,{0 989607}) = 0 (0x0)

Truss doesn't show this until select() returns ~1 second later. The gettimeofday() call was needed because we don't simply use a 1 second pause, but adjust for overheads. My version uses a fancier adjustment.

% gettimeofday({1334226306 408632},0x0) = 0 (0x0)

We need this accurately to put in the packet. But all the other timestamps can be derived from this for the non-flood case. We can just try to send every `interval' and adjust the timeouts a bit when we see that we get here a bit early or late, and when we get here more than a bit early or late we can either recalibrate or adjust by more. Note that this gettimeofday() returned 1.0 - 0.000617 seconds after the previous one, although we asked for a timeout of ~1/HZ = 0.001 seconds. Select timeouts have a large granularity and we expect errors of O(1/HZ) and mostly compensate for them.
The drift without compensation would be 1% with HZ = 100 and -i 1.0, and much larger with -i . My version compensates more accurately than -current.

% sendto(0x3,0x80b8d34,0,0x0,{ AF_INET 127.0.0.1:0 },0x10) = 64 (0x40)
% gettimeofday({1334226306 408831},0x0) = 0 (0x0)
% select(4,{3},0x0,0x0,{0 990025}) = 1 (0x1)
% recvmsg(0x3,0xbfbfe8c0,0x0) = 84 (0x54)
% gettimeofday({1334226306 409104},0x0) = 0 (0x0)

I don't know what this is for. We got a timestamp in the returned packet, and use it. Since this timestamp has microsecond resolution and select() only has 1/HZ resolution, this timestamp should be more than good enough for sleeping for the interval.

% 64 bytes from 127.0.0.1: icmp_seq=7 ttl=64 time=0.472 ms
% write(1,0x80d4000,57) = 57 (0x39)

Next packet:

% gettimeofday({1334226306 409279},0x0) = 0 (0x0)
% ...

But select() timeouts are not needed at all. Versions before fenner's changes used a select() timeout for flood pings. alarm() was used to generate other timeouts. The alarm() code was not very good. It did a lot of work in the signal handler to set up the next alarm (1 call to signal() and 1 call
Re: strange ping response times...
On Wed, 11 Apr 2012, Luigi Rizzo wrote: On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote: On 11.04.2012 13:00, Luigi Rizzo wrote: On Wed, Apr 11, 2012 at 12:35:10PM +0200, Andre Oppermann wrote: On 11.04.2012 01:32, Luigi Rizzo wrote: Things going through loopback go through a NETISR and may end up queued to avoid LOR situations. In addition per-cpu queues with hash-distribution for affinity may cause your packet to be processed by a different core. Hence the additional delay. so you suggest that the (de)scheduling is costing several microseconds ? Not directly. I'm just trying to explain what's going on to get a better idea where it may go wrong. There may be a poor ISR/scheduler interaction that causes the packet to be processed only on the next tick or something like that. I don't have a better explanation for this. It's certainly abysmally slow. Just the extra context switching made in FreeBSD-5 made the RTT for pinging localhost 3-4 times slower than in FreeBSD-3 in old tests (I compared with FreeBSD-3 instead of FreeBSD-4 since general bloat had already made FreeBSD-4 significantly slower, although not 3-4 times). Direct dispatch of netisrs never did anything good in old tests, and the situation doesn't seem to have improved -- you now need an i7 2600 (SMP?) to get the same speed as my Athlon64 2000 (UP) in the best cases for both (2-3 usec RTT). SMP and multiple cores give more chances for scheduler pessimizations. ok, some final remarks just for archival purposes (still related to the loopback ping) ping takes a timestamp in userspace before trying to transmit the packet, and then the timestamp for the received packet is recorded in the kernel (in the interrupt or netisr thread i believe -- anyways, not in userspace). No, all timestamps used by ping are recorded in userland.
IIRC, there is no kernel timestamping at all for ping packets, unless ping is invoked with "-M time" to make it use ICMP_TSTAMP, and ICMP_TSTAMP gives at best milliseconds resolution so it is useless for measuring RTTs in the 2-999 usec range. (ICMP_TSTAMP uses iptime(), and the protocol only supports milliseconds resolution, which was good enough for 1 Mbps ethernet. iptime() is more broken than that (except in my version), since it uses getmicrotime() instead of microtime(). getmicrotime() gives at best 1/HZ resolution, so it is not even good enough for 1 Mbps ethernet when HZ is small, and now it may give extra inaccuracies from stopping the 1/HZ clock while in sleep states.) This reminds me that slow timecounters make measuring small differences in times difficult. It can take longer to read the timecounter than the entire RTT. I tested this by pessimizing kern.timecounter.hardware from TSC to i8254. On my test system, clock_gettime() with CLOCK_MONOTONIC takes an average of 273 nsec with the TSC timecounter and 4695 nsec with the i8254 timecounter. ping uses gettimeofday(), which is slightly slower and more broken (since it uses CLOCK_REALTIME). My normal ping -fq localhost RTT is 2-3 usec (closer to 3; another bug in this area is that the timestamps only have microseconds resolution, so you can't see if 3 is actually more like 2.5. I was thinking of changing the resolution to nanoseconds 8-10 years ago, before the FreeBSD-5 pessimizations and CPU speeds hitting a wall made this not really necessary), but the kernel I'm testing with uses ipfw, which bloats the RTT to 8-9 usec. Then kern.timecounter.hardware=i8254 bloats the RTT to 24-25! That's 16 usec extra, enough for the extra overhead of 4 gettimeofday() calls. Timecounter statistics confirm that there are many more than 2 timecounter calls per packet:

- 7 binuptime calls per packet. That's the hardware part that is very slow with an i8254 timecounter.
It apparently takes more like 3000 nsec than 4695 nsec (to fit 7 in 24-25 usec).

- 3 bintime calls per packet. bintime calls binuptime, so this accounts for 3 of the above 7. The other 4 are apparently for context switching. There are 2 context switches per packet :-(. I can't explain why there are apparently 2 timestamps per context switch. (Note that -current uses the inferior cputicker mechanism instead of timecounters for timestamping context switches. It does this because some timecounters are very slow. But when the timecounter is the TSC, binuptime() only takes a few cycles more than cpu_ticks(). (The above time of 273 nsec for reading the TSC timecounter is from userland. The kernel part takes only about 30 nsec, while cpu_ticks() might take 15 nsec.) So -current wouldn't be pessimized for this part by changing the timecounter to i8254, but without the pessimization it would be only a few nsec faster than old kernels provided the timecounter
Re: Potential deadlock on mbuf
On Tue, 3 Apr 2012, Andre Oppermann wrote: On 02.04.2012 18:21, Alexandre Martins wrote: Dear, I am currently having trouble with a basic socket stress test. The sockets are set up to use non-blocking I/O. During this stress test, the kernel runs into mbuf exhaustion; the goal is to see the system limits. If the program makes a write on a socket during this mbuf exhaustion, it becomes blocked in the "write" system call. The status of the process is "zonelimit" and all network I/O falls into timeout. I have found the root cause of the block: http://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?view=markup#l1279 So, the question is: why is m_uiotombuf called with a blocking parameter (M_WAITOK) even if it is for a non-blocking socket? Then, if M_NOWAIT is used, maybe it will be useful to have an 'ENOMEM' error. I'm surprised you can even see blocking of malloc(... M_WAITOK). O_NONBLOCK is mostly for operations that might block for a long time, but malloc() is not expected to block for long. Regular files are always so non-blocking that most file systems have no references to O_NONBLOCK (or FNONBLOCK), but file systems often execute memory allocation code that can easily block for as long as malloc() does. When malloc() starts blocking for a long time, lots of things will fail. This is a bit of a catch-22 we have here. Trouble is that when we return with EAGAIN, the next select/poll cycle will tell you that this and possibly other sockets are writeable again, when in fact they are not due to kernel memory shortage. Then the application will tightly loop around the "writeable" non-writeable sockets. It's about the interaction of write with O_NONBLOCK and select/poll on the socket. This would be difficult to handle better. Do you have any references how other OSes behave, in particular Linux? I've added bde@ as our resident standards compliance expert. Hopefully he can give us some more insight on this issue. Standards won't say what happens at this level of detail.
Blocking for network i/o is still completely broken at levels below sockets AFAIK. I (and ttcp) mainly wanted it to work for send() of udp. I saw no problems at the socket level, but driver queues just filled up and send() returned ENOBUFS. I wanted either the opposite of O_NONBLOCK (block until !ENOBUFS), or at least for select() to work for waiting until !ENOBUFS. But select() doesn't work at all for this. It seemed to work better in Linux.

Bruce
Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
On Sat, 24 Dec 2011, Alexander Best wrote: On Sat Dec 24 11, Bruce Evans wrote: On Sat, 24 Dec 2011, Alexander Best wrote: On Sat Dec 24 11, Bruce Evans wrote: On Fri, 23 Dec 2011, Alexander Best wrote: ... the gcc(1) man page states the following: " This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2. " the comment in sys/conf/kern.mk however sorta suggests that the default alignment of 4 bytes might improve performance. The default stack alignment is 16 bytes, which unimproves performance. maybe the part of the comment in sys/conf/kern.mk, which mentions that a stack alignment of 16 bytes might improve micro benchmark results, should be removed. this would prevent people (like me) from thinking that using a stack alignment of 4 bytes is a compromise between size and efficiency. it isn't! currently a stack alignment of 16 bytes has no advantages over one of 4 bytes on i386. I think the comment is clear enough. It mentions all the tradeoffs. It is only slightly cryptic in saying that these are tradeoffs and that the configuration is our best guess at the best tradeoff -- it just says "while" for both. It goes without saying that we don't use our worst guess. Anyone wanting to change this should run benchmarks and beware that micro-benchmarks are especially useless. The changed comment is not so good since it no longer mentions micro-benchmarks or says "while". if micro benchmark results aren't of any use, why should the claim that the default stack alignment of 16 bytes might produce better outcome stay?
Because:

- the actual claim is the opposite of that (it is that the default 16-byte alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit micro-benchmarks is true, even without the weaselish miswording of "might" in it. There is always at least 1 micro-benchmark that will benefit from almost any change, and here we expect a benefit in many microbenchmarks that don't bust the caches. Except, 16-byte alignment isn't supported (*) in the kernel, so we actually expect a loss from many microbenchmarks that don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the default is better because it is better in microbenchmarks.

it doesn't seem as if anybody has micro-benchmarked 16 bytes vs. 4 bytes stack alignment until now. so the micro benchmark statement in the comment seems to be pure speculation. No, it is obviously true. even worse... it indicates that by removing the -mpreferred-stack-boundary=2 flag, one can gain a performance boost by sacrificing a few more bytes of kernel (and module) size. No, it is part of the sentence explaining why removing the -mpreferred-stack-boundary=2 flag will probably regain the "overall loss" that is avoided by using the flag. this suggests that the behavior of -mpreferred-stack-boundary=2 vs. not specifying it loosely equals the semantics of -Os vs. -O2. No, -Os guarantees slower execution by forcing optimization to prefer space savings over time savings in more ways. Except, -Os is completely broken in -current (in the kernel), and gives very large negative space savings (about 50%). It last worked with gcc-3. Its brokenness with gcc-4 is related to kern.pre.mk still specifying -finline-limit flags that are more suitable for gcc-3 (gcc has _many_ flags for giving more delicate control over inlining, and better defaults for them) and excessive inlining in gcc-4 given by -funit-at-a-time -finline-functions-called-once.
These apparently cause gcc's inliner to go insane with -Os. When I tried to fix this by reducing inlining, I couldn't find any threshold that fixed -Os without breaking inlining of functions that are declared inline.

(*) A primary part of the lack of support for 16-byte stack alignment in the kernel is that there is no special stack alignment for the main kernel entry point, namely syscall(). From i386/exception.s:

% 	SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it (it was 16-byte aligned before that).

% 	pushl	$2			/* sizeof "int 0x80" */
% 	subl	$4,%esp			/* skip over tf_trapno */
% 	pushal
% 	pushl	%ds
% 	pushl	%es
% 	pushl	%fs
% 	SET_KERNEL_SREGS
% 	cld
% 	FAKE_MCOUNT(TF_EIP(%esp))
% 	pushl	%esp

We "push" 14 more words. This gives perfect misalignment to the worst odd word boundary (perfect if only word boundaries are allowed). gcc wants the stack to be aligned to a 4*n word boundary before function calls, but here we have a 4*n+3 word boundary. (4*n+3 is worse th
Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
On Sat, 24 Dec 2011, Alexander Best wrote: On Sat Dec 24 11, Bruce Evans wrote: This almost builds in -current too. I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD-9. The kernel .mk files are still broken (depend on nonstandard/new features in sys.mk).

strange. the build(7) man page claims that: " WITH_CTF If defined, the build process will run the DTrace CTF conversion tools on built objects. Please note that this WITH_ option is handled differently than all other WITH_ options (there is no WITHOUT_CTF, or corresponding MK_CTF in the build system). " ... so setting MK_CTF to anything shouldn't have any effect (according to the man page).

MK_CTF is an implementation detail. It is normally set in bsd.own.mk (not in sys.mk like I said -- this gives another, much larger bug (*)). But when /usr/share/mk is old, it doesn't know anything about MK_CTF. (For example, in FreeBSD-9, sys.mk sets NO_CTF to 1 if WITH_CTF is not defined. This corresponds to bsd.own.mk in -current setting MK_CTF to "no" if WITH_CTF is not defined. Go back to an older version of FreeBSD and /usr/share/mk/* won't know anything about any CTF variable.) So when you try to build a current kernel under an old version of FreeBSD, MK_CTF is used uninitialized and the build fails. (Of course, "you" build kernels normally and don't use the bloated buildkernel method.) The bug is in the following files:

kern.post.mk:.if ${MK_CTF} != "no"
kern.pre.mk:.if ${MK_CTF} != "no"
kmod.mk:.if defined(MK_CTF) && ${MK_CTF} != "no"

except for the last one where it has been fixed.

(*) Well, not completely broken, but just annoyingly unportable. Consider the following makefile:

%%%
foo: foo.c
%%%

Invoking this under FreeBSD-9 gives:

%%%
cc -O2 -pipe foo.c -o foo
[ -z "ctfconvert" -o -n "1" ] || (echo ctfconvert -L VERSION foo && ctfconvert -L VERSION foo)
%%%

This is the old ctf method. It is ugly but is fairly portable. 
Invoking this under FreeBSD-9 but with -m gives:

%%%
cc -O2 -pipe foo.c -o foo
%%%

(${CTFCONVERT_CMD} expands to the empty string.) This is because:
- the rule in sys.mk says ${CTFCONVERT_CMD}
- CTFCONVERT_CMD is normally defined in bsd.own.mk. But bsd.own.mk is only included by BSD makefiles. It is never included by portable makefiles. So ${CTFCONVERT_CMD} is used uninitialized.
- for some reason, using variables uninitialized is not fatal in this context, although it is for the comparisons of ${MK_CTF} above.
- ${CTFCONVERT_CMD} is replaced by the empty string. Old versions of make warn about the use of an empty string as a shell command.
- the code that is supposed to prevent the previous warning is in bsd.own.mk, where it is not reached for portable makefiles. It is:

% .if ${MK_CTF} != "no"
% CTFCONVERT_CMD=	${CTFCONVERT} ${CTFFLAGS} ${.TARGET}

This uses the full ctfconvert if WITH_CTF.

% .elif ${MAKE_VERSION} >= 520300
% CTFCONVERT_CMD=

make(1) has been modified to not complain about the empty string. The version test detects which versions of make don't complain.

% .else
% CTFCONVERT_CMD=	@:

The default is to generate this non-empty string and an extra shell command to execute it, for old versions of make.

% .endif

But none of this works for portable makefiles, since it is not reached.

Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
On Sat, 24 Dec 2011, Alexander Best wrote: On Sat Dec 24 11, Bruce Evans wrote: On Fri, 23 Dec 2011, Alexander Best wrote: is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer? i built GENERIC (including modules) with and without that flag. the results are:

The same as it has always been. It avoids some bloat.

1654496 bytes with the flag set vs. 1654952 bytes with the flag unset

I don't believe this. GENERIC is enormously bloated, so it has size more like 16MB than 1.6MB. Even a savings of 4K instead of 456 bytes

i'm sorry. i used du(1) to get those numbers, so i believe those numbers represent the amount of 512-byte blocks. if i'm correct GENERIC is even more bloated than you feared and almost reaches 1GB: 807,859375 megabytes with flag set vs. 808,0820313 megabytes without the flag set

That's certainly bloated. It counts all object files and modules, and probably everything is compiled with -g. I only counted kernel text size.

Bruce
Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
On Fri, 23 Dec 2011, Adrian Chadd wrote: Well, the whole kernel is bloated at the moment, sorry. I've been trying to build the _bare minimum_ required to bootstrap -HEAD on these embedded boards and I can't get the kernel down below 5 megabytes - ie, one with FFS (with options disabled), MIPS, INET (no INET6), net80211, ath (which admittedly is big, but I need it no matter what, right?) comes in at:

-r-xr-xr-x 1 root wheel 5307021 Nov 29 19:14 kernel.LSSR71

And with INET6, on another board (and this includes MSDOS and the relevant geom modules):

-r-xr-xr-x 1 root wheel 5916759 Nov 28 12:00 kernel.RSPRO

.. honestly, that's what should be addressed. That's honestly a bit ridiculous.

It's disgusting, but what problems does it cause apart from minor slowness from cache misses? I used to monitor the size of a minimal i386 kernel:

% machine	i386
% cpu		I686_CPU
% ident		MIN
% options	SCHED_4BSD

In FreeBSD-5-CURRENT between 5.1R and 5.2R, this had size:

text	data	bss	dec	hex	filename
931241	86524	62356	1080121	107b39	/sysc/i386/compile/min/kernel

A minimal kernel is not useful, but maybe you can add some i/o to it without bloating it too much. This almost builds in -current too. I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD-9. The kernel .mk files are still broken (depend on nonstandard/new features in sys.mk).
- comment out a line in if.c that refers to V_loif. if.c is standard but the loop device is optional.

A few more changes to remove non-minimalities that are not defaults made little difference:

% machine	i386
% cpu		I686_CPU
% ident		MIN
% options	SCHED_4BSD
%
% # XXX kill default misconfigurations.
% makeoptions	NO_MODULES=yes
% makeoptions	COPTFLAGS="-O -pipe"
%
% # XXX from here on is to try to kill everything in DEFAULTS.
%
% # nodevice	isa		# needed for DELAY...
% # nooptions	ISAPNP		# needed ... 
%
% nodevice	npx
%
% nodevice	mem
% nodevice	io
%
% nodevice	uart_ns8250
%
% nooptions	GEOM_PART_BSD
% nooptions	GEOM_PART_EBR
% nooptions	GEOM_PART_EBR_COMPAT
% nooptions	GEOM_PART_MBR
%
% # nooptions	NATIVE		# needed ...
% # nodevice	atpic		# needed ...
%
% nooptions	NEW_PCIB
%
% nooptions	VFS_ALLOW_NONMPSAFE

text	data	bss	dec	hex	filename
1663902	110632	136892	1911426	1d2a82	kernel

(This was about 100K larger with -O2 and all of DEFAULTS). The bloat since FreeBSD-5 is only 70%. Here are some sizes for my standard kernel (on i386). The newer versions have about the same number of features since they don't support so many old isa devices or so many NICs:

text	data	bss	dec	hex	filename
1483269	106972	172524	1762765	1ae5cd	FreeBSD-3/kernel
1917408	157472	194228	2269108	229fb4	FreeBSD-4/kernel
2604498	198948	237720	3041166	2e678e	FreeBSD-5.1.5/kernel
2833842	206856	242936	3283634	321ab2	FreeBSD-5.1.5/kernel-with-acpi
2887573	192456	288696	3368725	336715	FreeBSD-5.1.5/kernel with my changes, -O2 and usb added relative to the above
2582782	195756	298936	3077474	2ef562	previous, with some excessive inlining avoided, and without -O2, and with ipfilter
1998276	159436	137748	2295460	2306a4	kernel.4 (a more up to date and less hacked-on FreeBSD-4)
4365549	262656	209588	4837793	49d1a1	kernel.7
4406155	266496	496532	5169183	4ee01f	kernel.7.invariants
3953248	242464	207252	4402964	432f14	kernel.7.noacpi
4418063	268288	240084	4926435	4b2be3	kernel.7.smp

(various fairly stock FreeBSD-7R kernels)

3669544	262848	249712	4182104	3fd058	kernel.c
4174317	258240	540144	4972701	4be09d	kernel.c.invariants
3964455	250656	249808	4464919	442117	kernel.c.noacpi
3213928	240160	240596	3694684	38605c	kernel.c.noacpi-ule
4285040	268288	286160	4839488	49d840	kernel.c.smp

(current before FreeBSD-8R; not all built at the same time or with the same options). The 20% bloat between kernel.c.noacpi-ule and kernel.c.noacpi is mainly from not killing the default of -O2. 
4742714	315008	401692	5459414	534dd6	kernel.8
4816900	319200	1813916	6950016	6a0c80	kernel.8.invariants
4490209	304832	395260	5190301	4f329d	kernel.8.noacpi
4795475	323680	475420	5594575	555dcf	kernel.8.smp
Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
On Fri, 23 Dec 2011, Alexander Best wrote: is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer? i built GENERIC (including modules) with and without that flag. the results are: The same as it has always been. It avoids some bloat. 1654496 bytes with the flag set vs. 1654952 bytes with the flag unset I don't believe this. GENERIC is enormously bloated, so it has size more like 16MB than 1.6MB. Even a savings of 4K instead of 456 bytes is hard to believe. I get a savings of 9K (text) in a 5MB kernel. Changing the default target arch from i386 to pentium-undocumented has reduced the text space savings a little, since the default for passing args is now to preallocate stack space for them and store to this, instead of to push them; this preallocation results in more functions needing to allocate some stack space explicitly, and when some is allocated explicitly, the text space cost for this doesn't depend on the size of the allocation. Anyway, the savings are mostly from avoiding cache misses from sparse allocation on stacks. Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
- KSTACK_PAGES on i386 is 2, while on amd64 it is 4. Using more stack might push something over the edge
- not much care is taken to align the initial stack or to keep the stack aligned in calls from asm code. E.g., any alignment for mi_startup() (and thus proc0?) is accidental. This may result in perfect alignment or perfect misalignment. Hopefully, more care is taken with thread startup. For gcc, the alignment is done bogusly in main() in userland, but there is no main() in the kernel. The alignment doesn't matter much (provided the perfect misalignment is still to a multiple of 4), but when it matters, the random misalignment that results from not trying to do it at all is better than perfect misalignment from getting it wrong. With 4-byte alignment, the only cases that it helps are with 64-bit variables. 
the gcc(1) man page states the following: " This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2. " the comment in sys/conf/kern.mk however sorta suggests that the default alignment of 4 bytes might improve performance. The default stack alignment is 16 bytes, which unimproves performance. clang handles stack alignment correctly (only does it when it is needed) so it doesn't need a -mpreferred-stack-boundary option and doesn't always break without alignment in main(). Well, at least it used to, IIRC. Testing it now shows that it does the necessary andl of the stack pointer for __aligned(32), but for __aligned(16) it now assumes that the stack is aligned by the caller. So it now needs -mpreferred-stack-boundary=2, but doesn't have it. OTOH, clang doesn't do the andl in main() like gcc does (unless you put a dummy __aligned(32) there), but requires crt to pass an aligned stack. Bruce
Re: SCHED_ULE should not be the default
On Wed, 14 Dec 2011, Ivan Klymenko wrote: On Wed, 14 Dec 2011 00:04:42 +0100, Jilles Tjoelker wrote: On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote: If the algorithm ULE does not contain problems - it means the problem has Core2Duo, or in a piece of code that uses the ULE scheduler. I already wrote in a mailing list that specifically in my case (Core2Duo) partially helps the following patch:

--- sched_ule.c.orig	2011-11-24 18:11:48.0 +0200
+++ sched_ule.c	2011-12-10 22:47:08.0 +0200
...
@@ -2118,13 +2119,21 @@
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
+	if (td->td_pri_class & PRI_FIFO_BIT)
+		return;
+
 	ts = td->td_sched;
+	/*
+	 * We used up one time slice.
+	 */
+	if (--ts->ts_slice > 0)
+		return;

This skips most of the periodic functionality (long term load balancer, saving switch count (?), insert index (?), interactivity score update for long running thread) if the thread is not going to be rescheduled right now. It looks wrong but it is a data point if it helps your workload.

Yes, I did it for as long as possible to delay the execution of the code in section:

I don't understand what you are doing here, but recently noticed that the timeslicing in SCHED_4BSD is completely broken. This bug may be a feature. SCHED_4BSD doesn't have its own timeslice counter like ts_slice above. It uses `switchticks' instead. But switchticks hasn't been usable for this purpose since long before SCHED_4BSD started using it for this purpose. switchticks is reset on every context switch, so it is useless for almost all purposes -- any interrupt activity on a non-fast interrupt clobbers it. Removing the check of ts_slice in the above and always returning might give a similar bug to the SCHED_4BSD one. I noticed this while looking for bugs in realtime scheduling. In the above, returning early for PRI_FIFO_BIT also skips most of the periodic functionality. 
In SCHED_4BSD, returning early is the usual case, so the PRI_FIFO_BIT might as well not be checked, and it is the unusual fifo scheduling case (which is supposed to only apply to realtime priority threads) which has a chance of working as intended, while the usual roundrobin case degenerates to an impure form of fifo scheduling (it is impure since priority decay still works so it is only fifo among threads of the same priority).

...
@@ -2144,9 +2153,6 @@
 		if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
 			tdq->tdq_ridx = tdq->tdq_idx;
 	}
-	ts = td->td_sched;
-	if (td->td_pri_class & PRI_FIFO_BIT)
-		return;
 	if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
 		/*
 		 * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
 		sched_priority(td);
 	}
 	/*
-	 * We used up one time slice.
-	 */
-	if (--ts->ts_slice > 0)
-		return;
-	/*
 	 * We're out of time, force a requeue at userret().
 	 */
 	ts->ts_slice = sched_slice;

With the ts_slice check here before you moved it, removing it might give buggy behaviour closer to SCHED_4BSD.

and refusal to use options FULL_PREEMPTION

4-5 years ago, I found that any form of PREEMPTION was a pessimization for at least makeworld (since it caused too many context switches). PREEMPTION was needed for the !SMP case, at least partly because of the broken switchticks (switchticks, when it works, gives voluntary yielding by some CPU hogs in the kernel. PREEMPTION, if it works, should do this better). So I used PREEMPTION in the !SMP case and not for the SMP case. I didn't worry about the CPU hogs in the SMP case since it is rare to have more than 1 of them and 1 will use at most 1/2 of a multi-CPU system.

But no one has responded to my letter, whether my patch helps or not in the case of Core2Duo... There is a suspicion that the problems stem from the sections of code associated with the SMP... Maybe I'm in something wrong, but I want to help in solving this problem ...

The main point of SCHED_ULE is to give better affinity for multi-CPU systems. 
But the `multi' apparently needs to be strictly more than 2 for it to break even.

Bruce
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)
On Sat, 26 Nov 2011, Robert Millan wrote: On Fri, Nov 25, 2011 at 11:16:15AM -0700, Warner Losh wrote: Hey Bruce, These sound like good suggestions, but I'd hoped to actually go through all these files with a fine-toothed comb to see which ones were still relevant. You've found a bunch of good areas to clean up, but I'd like to humbly suggest they be done in a follow-on commit.

Hi, I'm sending a new patch. Thanks Bruce for your input. TTBOMK this corrects all the problems you spotted that were introduced by my patch. It doesn't fix pre-existing problems in the files however, except in cases where I had to modify that line anyway. I think it's a good compromise between my initial patch and an exhaustive cleanup of those headers (which I'm probably not the most indicated for).

It fixes most style bugs, but not some pre-existing problems, even in cases where you had to modify the line anyway.

% Index: sys/cam/scsi/scsi_low.h
% ===
% --- sys/cam/scsi/scsi_low.h	(revision 227956)
% +++ sys/cam/scsi/scsi_low.h	(working copy)
% @@ -53,10 +53,10 @@
%  #define SCSI_LOW_INTERFACE_XS
%  #endif /* __NetBSD__ */
%
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #define SCSI_LOW_INTERFACE_CAM
%  #define CAM
% -#endif /* __FreeBSD__ */
% +#endif /* __FreeBSD__ || __FreeBSD_kernel__ */

It still has the whitespace-after-tab style change for cam.

% Index: sys/dev/firewire/firewirereg.h
% ===
% --- sys/dev/firewire/firewirereg.h	(revision 227956)
% +++ sys/dev/firewire/firewirereg.h	(working copy)
% @@ -75,7 +75,8 @@
%  };
%
%  struct firewire_softc {
% -#if defined(__FreeBSD__) && __FreeBSD_version >= 50
% +#if (defined(__FreeBSD__) || defined(__FreeBSD_kernel__)) && \
% +    __FreeBSD_version >= 50
%  	struct cdev *dev;
%  #endif
%  	struct firewire_comm *fc;

Here is a pre-existing problem that you didn't fix on a line that you changed. 
The __FreeBSD__ ifdef is nonsense here, since __FreeBSD__ being defined has nothing to do with either whether __FreeBSD_version is defined or whether there is a struct cdev * in the data structure. Previously:
- defined(__FreeBSD__) means that the compiler is for FreeBSD
- __FreeBSD_version >= 50 means that <sys/param.h> has been included and has defined __FreeBSD_version to a value that satisfies this. It would be a bug for anything else to define __FreeBSD_version. Unfortunately, there is a bogus #undef of __FreeBSD_version that breaks detection of other things defining it.
- the __FreeBSD__ part of the test has no effect except to break compiling this file with a non-gcc compiler. In particular, it doesn't prevent errors for -Wundef -Werror. But other ifdefs in this file use an unguarded __FreeBSD_version. Thus this file never worked with -Wundef -Werror, and the __FreeBSD__ part has no effect except the breakage.

Now: as above, except:
- defined(__FreeBSD_kernel__) means that <sys/param.h> has been included and that this header is new enough to define __FreeBSD_kernel__. This has the same bug with the #undef, which I pointed out before (I noticed it for this but not for __FreeBSD_version). And it has a style bug in its name which I pointed out before -- 2 underscores in its name. __FreeBSD_version doesn't have this style bug. The definition of __FreeBSD_kernel__ has already been committed. Is it too late to fix its name?
- when <sys/param.h> is new enough to define __FreeBSD_kernel__, it must be new enough to define __FreeBSD_version >= 50. Thus there is now no -Wundef error.
- the __FreeBSD__ ifdef remains nonsense. If you just removed it, then you wouldn't need the __FreeBSD_kernel__ ifdef (modulo the -Wundef error).

You didn't add the __FreeBSD_kernel__ ifdef to any of the other lines with the __FreeBSD_kernel__ ifdef in this file, apparently because the others don't have the nonsensical __FreeBSD__ ifdef. 
The nonsense and changes to work around it make the logic for this ifdef even more convoluted and broken than might first appear. In a previous patchset, you included <sys/param.h> to ensure that __FreeBSD_kernel__ is defined for newer kernel sources (instead of testing if it is defined). Ifdefs like the above make <sys/param.h> a prerequisite for this file anyway, since without knowing __FreeBSD_version it is impossible to determine if the data structure has new fields like the cdev in it. <sys/param.h> is a prerequisite for almost all kernel .c files, so this prerequisite should be satisfied automatically for them, but it isn't clear what happens for user .c files. I think the ifdef should be something like the following to enforce the prerequisite:

#ifndef _SYS_PARAM_H_
/*
 * Here I don't support __FreeBSD_version__ to be set outside of
 * <sys/param.h> to hack around a missing include of <sys/param.h>.
 * The case where the kernel is so old that __FreeBSD_
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)
On Thu, 24 Nov 2011, Robert Millan wrote: 2011/11/24 Bruce Evans : Now it adds lots of namespace pollution (all of <sys/param.h>, including all of its namespace pollution), just to get 1 new symbol defined.

Well, my initial patch (see mail with same subject modulo "v2") didn't have this problem. Now that __FreeBSD_kernel__ is defined, many #ifdefs can be simplified, but maybe it's not desireable for all of them. At least not until we can rely on the compiler to define this macro. So in this particular case maybe it's better to use the other approach? See attachment.

That is clean enough, except for some style bugs. (I thought of worse ways like duplicating the logic of <sys/param.h>, or directing <sys/param.h> to only declare version macros, or putting version macros in a little separate param header and including that. The latter would be cleanest, but gives even more includes, and not worth it for this, but it would have been better for __FreeBSD_version. I don't like having to recompile half the universe according to dependencies on <sys/param.h> because only __FreeBSD_version__ in it changed. Basic headers rarely change apart from that. BTW, a recent discussion in the POSIX mailing list says that standardized generation of dependencies should not generate dependencies on system headers. This would break the effect of putting mistakes like __FreeBSD_version__ in any system header :-).)

% diff -ur sys.old/cam/scsi/scsi_low.h sys/cam/scsi/scsi_low.h
% --- sys.old/cam/scsi/scsi_low.h	2007-12-25 18:52:02.0 +0100
% +++ sys/cam/scsi/scsi_low.h	2011-11-13 14:12:41.121908380 +0100
% @@ -53,7 +53,7 @@
%  #define SCSI_LOW_INTERFACE_XS
%  #endif /* __NetBSD__ */
%
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #define SCSI_LOW_INTERFACE_CAM
%  #define CAM
%  #endif /* __FreeBSD__ */

This also fixes some style bugs (tab instead of space after `#ifdef'). But it doesn't fix others (tab instead of space after `#ifdef', and comment on a short ifdef). 
And it introduces a new one (the comment on the ifdef now doesn't even match the code). cam has a highly non-KNF style, so it may require all of these style bugs except the comment not matching the code. This makes it hard for non-cam programmers to maintain. According to grep, it prefers a tab to a space after `#ifdef' by a ratio of 89:38 in a version checked out a year or two ago. But in 9.0-BETA1, the counts have blown out and the ratio has reduced to 254:221. The counts are more than doubled because the first version is a cvs checkout and the second version is a svn checkout, and it is too hard to filter out the svn duplicates. I guess the ratio changed because the new ata subsystem is not bug for bug compatible with cam style. Anyway, there never was a consistent cam style to match.

% @@ -64,7 +64,7 @@
%  #include
%  #endif /* __NetBSD__ */
%
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #include
%  #include
%  #include

Same problems, but now the ifdef is larger (but not large enough to need a comment on its endif), so the inconsistent comment is not visible in the patch.

% [... similarly throughout cam]

% diff -ur sys.old/contrib/altq/altq/if_altq.h sys/contrib/altq/altq/if_altq.h
% --- sys.old/contrib/altq/altq/if_altq.h	2011-03-10 19:49:15.0 +0100
% +++ sys/contrib/altq/altq/if_altq.h	2011-11-13 14:12:41.119907128 +0100
% @@ -29,7 +29,7 @@
%  #ifndef _ALTQ_IF_ALTQ_H_
%  #define _ALTQ_IF_ALTQ_H_
%
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  #include	/* XXX */
%  #include	/* XXX */
%  #include	/* XXX */
% @@ -51,7 +51,7 @@
%  	int ifq_len;
%  	int ifq_maxlen;
%  	int ifq_drops;
% -#ifdef __FreeBSD__
% +#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
%  	struct mtx ifq_mtx;
%  #endif

No new problems, but I wonder how this even compiles when the ifdefs are not satisfied. Here we are exporting mounds of kernel data structures to userland. There is a similar mess in . 
It has no ifdefs at all for the lock, mutex and event headers there, and you didn't touch it. That header is unfortunately actually needed in userland. The mutexes in its data structures cannot simply be left out, since then the data structures become incompatible with the actual ones. I don't see how the above can work with the mutex left out. By "not even compiles", I meant the header itself, but there should be no problems there because the second ifdef should kill the only use of all the headers. And userland should compile since it shouldn't use the ifdefed out (kernel) parts of the data struct. But leaving out the data substructures changes the ABI, so how could any application that actually uses the full structure work? And if nothing uses it, it shouldn't be exported. E
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers (v2)
On Wed, 23 Nov 2011, Robert Millan wrote: Here we go again :-) Out of the kernel headers that are installed in the /usr/include/ hierarchy, there are some which include support for multiple operating systems (usually FreeBSD and other *BSD flavours). This patch adds support to detect GNU/kFreeBSD as well. In all cases, we match the same declarations as FreeBSD does (which is to be expected in kernel headers, since both systems share the same kernel).

Now it adds lots of namespace pollution (all of <sys/param.h>, including all of its namespace pollution), just to get 1 new symbol defined.

% Index: sys/cam/scsi/scsi_low.h
% ===
% --- sys/cam/scsi/scsi_low.h	(revision 227831)
% +++ sys/cam/scsi/scsi_low.h	(working copy)
% @@ -44,6 +44,8 @@
%  #ifndef _SCSI_LOW_H_
%  #define _SCSI_LOW_H_
%
% +#include <sys/param.h>
% +
%  /*
%   * Scsi low OSDEP
%   * (All os depend structures should be here!)
%
% [... 22 more headers polluted]

All the affected headers are poorly implemented ones. Mostly kernel headers which escaped to userland.

Bruce
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers
On Sun, 20 Nov 2011, Kostik Belousov wrote: On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote: On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote: I fully agree with an idea that the compiler is not an authoritative source of the knowledge of the FreeBSD version. Even more, I argue that we shall not rely on the compiler for this at all. Ideally, we should be able to build FreeBSD using the stock compilers without local modifications. Thus relying on the symbols defined by the compiler, and not the source, is the thing to avoid and consistently remove. We must do this to be able to use a third-party toolchain for FreeBSD builds. That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ? And then make more strong wording about other systems that use the macro, e.g. remove 'may' from the kFreeBSD example. Also, please remove the smile from the comment.

Ok. New patch attached.

And the last, question, why not do #ifndef __FreeBSD_kernel__ #define __FreeBSD_kernel__ __FreeBSD_version #endif ? #undef is too big a tool to apply there, IMO.

#ifndef is too big to apply here, IMO :-). __FreeBSD_kernel__ is in the implementation namespace, so any previous definition of it is a bug. The #ifndef breaks the warning for this bug. And why not use FreeBSD style? In KNF, the fields are separated by tabs, not spaces. In FreeBSD style, trailing underscores are not used for names in the implementation namespace, since they have no effect on namespaces. The name __FreeBSD_version is an example of this. Does existing practice require using the name with the trailing underscores?

Bruce
Re: [PATCH] Netdump for review and testing -- preliminary version
On Fri, 15 Oct 2010, Robert N. M. Watson wrote: On 15 Oct 2010, at 20:39, Garrett Cooper wrote: But there are already some cases that aren't properly handled today in the ddb area dealing with dumping that aren't handled properly. Take for instance the following two scenarios: 1. Call doadump twice from the debugger. 2. Call doadump, exit the debugger, reenter the debugger, and call doadump again. Both of these scenarios hang reliably for me. I'm not saying that we should regress things further, but I'm just noting that there are most likely a chunk of edge cases that aren't being handled properly when doing dumps that could be handled better / fixed.

Even thinking about calling doadump even once from within the debugger is an error. I was asleep when the similar error for panic was committed, and this error has propagated. Debuggers should use a trampoline to call the "any" function, not least so that they can be used to debug the "any" function without the extra complications to make themselves reentrant. I think gdb has always used a trampoline for this outside of the kernel. Not sure what it does within the kernel, but it would have even larger problems than in userland finding a place for the trampoline. In the kernel, there is the additional problem of keeping control while the "any" function is run. Other CPUs must be kept stopped and interrupts must be kept masked, except when the "any" function really needs other CPUs or unmasked interrupts. Single stepping also needs this and doesn't have it (other CPUs and interrupt handlers can run and execute any number of instructions while you are trying to execute a single one). All ddb "commands" that change the system state are really non-ddb commands that should use an external function via a trampoline. Panicing and dumping are just the largest ones, so they are the most impossible to do correctly as commands and the most in need of ddb to debug them. 
Right: one of the points I've made to Attilio is that we need to move to a more principled model as to what sorts of things we allow in various kernel environments. The early boot is a special environment -- so is the debugger, but the debugger on panic is not the same as the debugger when you can continue. Likewise, the crash dumping code is special, but also not the same as the debugger. Right now, exceptional behaviour to limit hangs/etc is done inconsistently. We need to develop a set of principles that tell us what is permitted in what contexts, and then use that to drive design decisions, normalizing what's there already.

ENONUNIXEDITOR. Format not recovered.

panic() from within a debugger (or a fast interrupt handler, or a fast interrupt handler that has trapped to the debugger by request...) is, although an error, not too bad since panic() must be prepared to work starting from the "any" state anyway, and as you mention it doesn't need to be able to return (except for RESTARTABLE_PANICS, which makes things impossibly difficult). Continuing from a debugger is feasible mainly because in the usual case the system state is not changed (except for time-dependent things). If you use it to modify memory or i/o or run one of its unsafe commands then you have to be careful.

This is not dissimilar to what we do with locking already, BTW: we define a set of kernel environments (fast interrupt handlers, non-sleepable threads, sleepable thread holding non-sleepable locks, etc), and based on those principles prevent significant sources of instability that might otherwise arise in a complex, concurrent kernel. We need to apply the same sort of approach to handling kernel debugging and crashing.

Locking has imposed considerable discipline, which if followed by panic() would show how wrong most of the things done by panic() are -- it will hit locks, but shouldn't even be calling functions that have locks, since such functions expect their locks to work. 
The rules for fast interrupt handlers are simple and mostly not followed. They are that a fast interrupt handler may not access any state not specially locked by its subsystem. This means that they may not call any other subsystem or any upper layer except the null set of ones documented to be safe to call. In practice, this means not calling the "any" function, but it is necessary for atomic ops, bus space accesses, and a couple of scheduling functions to be safe enough. BTW, my view is that except in very exceptional cases, it should not be possible to continue after generating a dump. Dumps often cause disk controllers to get reset, which may leave outstanding I/O in nasty situations. Unless the dump device and model is known not to interfere with operation, we should set state indicating that the system is non-continuable once a dump has occurred. It might be safe if the system reinitialized everything. Too hard for just dumping, but it is needed after resume anyway. So the following could reason
Re: newfs_msdos and DVD-RAM
On Sat, 3 Apr 2010, Tijl Coosemans wrote: Wikipedia's article on FAT has this to say about the maximum size of clusters: "The limit on partition size was dictated by the 8-bit signed count of sectors per cluster, which had a maximum power-of-two value of 64. With That seems unlikely. The MS-DOS file system is an old 1970's one meant for implementation in assembly language on an 8-bit CPU. No assembly language programmer for an 8-bit microprocessor would expect an 8-bit or 16-bit counter to be signed, since there aren't enough bits to waste 1 for the sign bit. My reference written in 1986 by an assembly-language oriented programmer (Duncan) only says that the value must be a power of 2, though it says that most other 8-bit variables are BYTEs. the standard hard disk sector size of 512 bytes, this gives a maximum of 32 KB clusters, thereby fixing the "definitive" limit for the FAT16 partition size at 2 gigabytes. On magneto-optical media, which can have 1 or 2 KB sectors instead of 1/2 KB, this size limit is proportionally larger. However, there was no need to use counts larger than 1 in 1980, so support for values of 128 could easily have been broken. Much later, Windows NT increased the maximum cluster size to 64 KB by considering the sectors-per-cluster count as unsigned. However, the resulting format was not compatible with any other FAT implementation of the time, and it generated greater internal fragmentation. Windows 98 also supported reading and writing this variant, but its disk utilities did not work with it." This is demonstrably false, since pcfs in FreeBSD-1 was another FAT implementation of the time (1993), and it should be missing the bug since it uses the natural unsigned types for everything in the BPB. msdosfs in Linux probably provides a better demonstration since it was of production quality a year or 2 earlier and unlikely to have the bug. (I don't have its sources handy to check.) 
I'm not sure the second paragraph is worth supporting, but the first seems to say that the 32K limit you have in your patch only applies to disks with 512-byte sectors. For disks with larger sectors it would be proportionally larger. It would be interesting to see what breaks with cluster sizes > 64K. These can be obtained using emulated or physical sector sizes larger than 512. Of course you don't want to actually use cluster sizes larger than 4K (far below 32K) since they just give portability and fragmentation losses for tiny or negative performance gains (lose both space and time to fragmentation). My implementation of clustering for msdosfs made the cluster size unimportant provided it is small enough not to produce fragmentation, and there is little fragmentation due to other problems, and there is enough CPU to enblock and deblock the clusters. Clustering works better for msdosfs than for ffs because there are no indirect blocks or far-away inode blocks to put bubbles in the i/o pipeline. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: newfs_msdos and DVD-RAM
On Mon, 29 Mar 2010, Andriy Gapon wrote: ... I am not a FAT expert and I know to take Wikipedia with a grain of salt. But please take a look at this: http://en.wikipedia.org/wiki/File_Allocation_Table#Boot_Sector In our formula: SecPerClust *= pmp->pm_BlkPerSec; we have the following parameters: SecPerClust[in] - sectors per cluster pm_BlkPerSec - bytes per sector divided by 512 (pm_BytesPerSec / DEV_BSIZE) SecPerClust[out] - bytes per cluster divided by 512 So we have: sectors per cluster: 64 bytes per sector: 4096 That Wikipedia article says: "However, the value must not be such that the number of bytes per cluster becomes greater than 32 KB." 64K works under FreeBSD, and I often do performance tests with it (it gives very bad performance). It should be avoided for portability too. But in our case it's 256K, the same value that is passed as 'size' parameter to bread() in the crash stack trace below. This error should be detected more cleanly. ffs fails the mount if the block size exceeds 64K. ffs can handle larger block sizes, and it is unfortunate that it is limited by the non-ffs parameter MAXBSIZE, but MAXBSIZE has been 64K and non-fuzzy for so long that the portability considerations for using larger values are even clearer -- larger sizes shouldn't be used, but 64K works almost everywhere. I used to often do performance tests with block size 64K for ffs. It gives very bad performance, and since there are more combinations of block sizes to test for ffs than for msdosfs, I stopped testing block size 64K for ffs long ago. msdosfs has lots more sanity tests for its BPB than does ffs for its superblock. Some of these were considered insane and removed, and there never seems to have been one for this. By the way, that 32KB limit means that the value of SecPerClust[out] should never be greater than 64 and SecPerClust[in] is limited to 128, so its current type must be of sufficient size to hold all allowed values. 
Thus, clearly, it is a fault of a tool that formatted the media for FAT. It should have picked correct values, or rejected incorrect values if those were provided as overrides via command line options. If 256K works under WinDOS, then we should try to support it too. mav@ wants to increase MAXPHYS. I don't really believe in this, but if MAXPHYS is increased then it would be reasonable to increase MAXBSIZE too, but probably not to more than 128K. f...@r500 /usr/crash $kgdb kernel.1/kernel.symbols vmcore.1 [snip] Unread portion of the kernel message buffer: panic: getblk: size(262144) > MAXBSIZE(65536) [snip] #11 0x803bedfb in panic (fmt=Variable "fmt" is not available. ) at /usr/src/sys/kern/kern_shutdown.c:562 BTW, why can't gdb find any variables? They are just stack variables whose address is easy to find. ... #14 0x8042f24e in bread (vp=Variable "vp" is not available. ) at /usr/src/sys/kern/vfs_bio.c:748 ... and isn't vp a variable? Maybe the bad default -O2 is destroying debugging. Kernels intended for being debugged (and that is almost all kernels) shouldn't be compiled with many optimizations. Post-gcc-3, -O2 breaks even backtraces by inlining static functions that are called only once. Bruce
Re: System hangs solid with ATAPICAM
On Tue, 2 Dec 2003, Sean McNeil wrote: > I've tried over several weeks to get ATAPICAM to work for me. I've > tried with and without acpi (compiled in or disabled via. boot). I've > tried turning on all debug. I've tried a few misc. things. All leave my Did you try backing out rev.1.23 of ata-lowlevel.c? > system hanging after the GEOM initialization without any indication of > debug output. The only clue I have is that it sounds like my zip-100 > was accessed right before the hang. That's interesting. The bug avoided by backing out rev.1.23 of ata-lowlevel.c is obviously system dependent. I only see it on a system that has a zip100. Bruce
Re: ATAPI CD still not detected, verbose boot logs available
On Tue, 2 Dec 2003, Soren Schmidt wrote: > It seems Christoph Sold wrote: > > FreeBSD 5.2-B still does not detect my ATAPI DVD-ROM drive. This used to > > work until Søren's ATAng commits. Other OSes (Win, Linux, Solaris) > > detect the drive appropriately. > > Hmm from the bootlogs it seems that your drive does not set the proper > ATAPI signature, thats why detection fails: > > atapci0: port 0xd800-0xd80f at device 4.1 on pci0 > ata0: reset tp1 mask=03 ostat0=50 ostat1=50 > ata0-master: stat=0x80 err=0x80 lsb=0x80 msb=0x80 ! bit 0x80 set says that the master is busy > ata0-slave: stat=0x00 err=0x01 lsb=0x14 msb=0x80 > should be 0xeb > ata0-master: stat=0x50 err=0x01 lsb=0x00 msb=0x00 ! now the master is unbusy > ata0: reset tp2 mask=03 stat0=50 stat1=00 devices=0x1 > ata0: at 0x1f0 irq 14 on atapci0 > ata0: [MPSAFE] Accessing the slave while the master is busy is invalid. I believe the failure mechanism is that the master keeps driving the bus while it is busy, so reads of the slave registers give garbage. This isn't a problem unless the slave becomes ready first and it manages to write a success code to the "err" register. Then we trust the garbage. It doesn't help that the master eventually becomes ready, since we don't read the slave registers again. > There isn't much I can do about that one except you experimenting with > the device and finding out why it fails setting the right signature Er, I sent patches for this a few months ago. After reanalysing their debugging output combined with the above debugging output, I think this bug is hit in the usual case if there are 2 drives and the drives' timing after reset is as follows: o The master must take more than 100 msec to become ready. Otherwise the 100 msec initial delay hides the bug. o The slave must become ready before the master. Otherwise there is no problem with using garbage slave registers, although accessing them is strictly invalid. 
The bug is just not often seen since most drives don't take 100 msec to become ready. I only see it on a system with an 8-9 year old pre-ATA IDE drive that takes 574 msec to become ready. For a quick fix, try increasing the initial delay of 100 msec to a second or more. Bruce
Re: 5.2-BETA: giving up on 4 buffers (ata)
On Thu, 27 Nov 2003, Stefan Ehmann wrote: > On Wed, 2003-11-26 at 19:37, Matthias Andree wrote: > > Hi, > > > > when I rebooted my 5.2-BETA (kernel about 24 hours old), it gave up on > > flushing 4 dirty blocks. > > > > I had three UFS1 softdep file systems mounted on one ATA drive, one > ext2 > > file system on another ATA drive and one ext2 file system on a SCSI > > drive. Both ext2 file systems had been mounted read-only, so they > can't > > have had dirty blocks. > > This is a known problem for nearly three months now (See PR 56675). It > happens to me every time I shut down the system if i don't unmount my > (read-only) ext2 file systems manually. I'm not sure if the problem is known for the read-only case. It is the same problem as in the read-write case. ext2fs hangs onto buffers, so shutdown cannot look at them and considers them to be busy. Then since shutdown cannot tell if it synced all dirty buffers or which buffers are associated with which file systems, it doesn't unmount any file systems and all dirty file systems that aren't unmounted before shutdown are left dirty. Read-only-mounted ext2fs file systems aren't left dirty but they break cleaning of other file systems. Bruce
Re: 40% slowdown with dynamic /bin/sh
On Wed, 26 Nov 2003, Garance A Drosihn wrote: > At 12:23 AM -0500 11/26/03, Michael Edenfield wrote: > > > >Just to provide some real-world numbers, here's what I got > >out of a buildworld: > > I have reformatted the numbers that Michael reported, > into the following table: > > >Static /bin/sh: Dynamic /bin/sh: > > real 385m29.977s real 455m44.852s => 18.22% > > user 111m58.508s user 113m17.807s => 1.18% > > sys 93m14.450s sys 103m16.509s => 10.76% > > user+sys => 5.53% What are people doing to make buildworld so slow? I once optimized makeworld to take 75 minutes on a K6-233 with 64MB of RAM. Things have been pessimized a bit since then, but not significantly except for the 100% slowdown of gcc (we now build large things like secure but this is partly compensated for by not building large things like perl). Michael's K7-500 with 320MB (?) of RAM should be several times faster than the K6-233, so I would be unhappy if it took more than 75 minutes but would expect it to take a bit more than 2 hours when well configured. > Here are some buildworld numbers of my own, from my system. > In my case, I am running on a single Athlon MP2000, with a > gig of memory. It does a buildworld without paging to disk. I have a similar configuration, except with a single Athlon XP1600 overclocked by 146/133 and I always benchmark full makeworlds. I was unhappy when the gcc pessimizations between gcc-2.95 and gcc-3.0 increased the makeworld time from about 24 minutes to about 33 minutes. The time has since increased to about 38 minutes. The latter is cheating slightly -- I leave out the DYNAMICROOT and RESCUE mistakes and the KERBEROS non-mistake. 
> Static sh, No -j: Dynamic sh, No -j: > real 84m31.366s real 86m22.429s => 2.04% > user 50m33.013s user 51m13.080s => 1.32% > sys 29m59.047s sys 33m04.082s => 10.29% > user+sys => 4.66% > > Static sh, -j2: Dynamic sh, -j2: > real 92m38.656s real 95m21.027s => 2.92% > user 51m48.970s user 52m29.152s => 1.29% > sys 32m07.293s sys 34m40.595s => 7.95% > user+sys => 3.84% This also shows why -j should not be used on non-SMP machines. Apart from the make -j bug that causes missed opportunities to run a job, make -j increases real and user times due to competition for resources, so it can only possibly help on systems where unbalanced resources (mainly slow disks) give too much idle time. My current worst makeworld time is almost twice as small as the fastest buildworld time in the above (2788 seconds vs 5071 seconds). From my collection of makeworld benchmarks: %%% Fastest makeworld on a Celeron 366 overclocked by 95/66 (2000/05/15): 3309.30 real 2443.75 user 488.68 sys Last makeworld on a Celeron 366 overclocked by 95/66 (2001/11/19): 4219.83 real 3253.04 user 667.64 sys Fastest makeworld on an Athlon XP1600 overclocked by 146/133 (2002/01/03): 1390.18 real 913.56 user 232.63 sys Last makeworld before gcc-3 on an Athlon XP1600 o/c by 143/133 (2002/05/09) (overclocking reduced due to memory problems, and some local memory-related optimizations turned off): 1532.99 real 1093.08 user 293.15 sys Early makeworld with gcc-3 on an Athlon XP1600 o/c by 143/133 (2002/05/12): 2268.13 real 1613.25 user 313.56 sys Fastest makeworld with gcc-3 on an Athlon XP1600 overclocked by 146/133 (maximal overclocking recovered; memory increased from 512MB to 1GB, local memory-related optimizations turned on and tuned) (2003/03/31): 1929.02 real 1576.67 user 205.30 sys Last makeworld on an Athlon XP1600 o/c by 143/133 (2003/04/29): 2012.75 real 1637.59 user 225.07 sys Makeworld with the defaults (no /etc/make.conf and no local optimizations in the src tree; mainly no pessimizing for Athlons by 
optimizing for PII's, and no building dependencies; only optimizations in the host environment (mainly no dynamic linkage) on an Athlon as usual (2003/05/06): Last recorded makeworld with local source and make.conf optimizations (mainly no dynamic linkage) on an Athlon as usual (2003/10/22): 2225.83 real 1890.64 user 256.33 sys Last recorded makeworld with the defaults on an Athlon as usual (2003/11/11): 2788.41 real 2316.49 user 357.34 sys %%% I don't see such a large slowdown from using a dynamic /bin/sh. Unrecorded runs of makeworld gave times like the following: 2262 real ... with local opts including src ones and no dynamic linkage 2290 real ... with same except for /bin/sh (only) dynamically linked The difference may be because my /usr/bin/true and similar utilities remain statically linked. Fork-exec expense depends more on the exec than the fork. From an old
Re: Hanging at boot
On Wed, 26 Nov 2003, Manfred Lotz wrote: > On Mon, 24 Nov 2003 08:00:49 +0100, Manfred Lotz wrote: > > > Hi there, > > > > Last time (around middle of October) when I tried out a new current kernel > > it was hanging at boot time at acd1 > > > > ata1 is: > > acd1: DVD-ROM at ata1-slave UDMA33 > > > > > > I tried it again yesterday. Now acd1 seems to be fine. However it hangs > > at acd2. After the following message > > acd2: CD-RW at ata3-master UDMA33 > > > > it stops working. No error message is showing up. > > In the meantime I found out that the cause of the problem is atapicam. > If I remove it from my kernel config I'm fine (but I have no atapicam). Try backing out rev.1.23 of ata-lowlevel.c. Bruce
Re: HEADS UP: /bin and /sbin are now dynamically linked
On Sat, 22 Nov 2003, M. Warner Losh wrote: > In message: <[EMAIL PROTECTED]> > Richard Coleman <[EMAIL PROTECTED]> writes: > : M. Warner Losh wrote: > : > : > : I agree. termcap.small is amazingly uncurrent. However, perhaps some > : > : merging and reducing is in order. Why is a full cons25 or vt2xx needed? > : > : vi only needs a few capabilities. I think we mostly use copies of large > : > : termcap entries because copying the whole thing is easier. > : > > : > You have a good point. My termcap was done so that we could run a > : > number of applications... > : > > : > Grepping seems unsatisfying to find out which keys are used. Do you > : > have a list? nvi/cl/cl_bsd.c has a possibly complete enough list in its terminfo translation table. > : Is the extra maintenance worth it to save a few hundred bytes? Probably not, if this is mainly for use by rescue on larger (multi-megabyte) disks. I used an 8K termcap on 1200KB floppy rescue disks many years ago. > Generating them automatically can be kind of difficult. termcap > doesn't change that often. As someone pointed out, ed is sufficient. It's all we had on the root partition. I remember how to use it mainly from using it there. Bruce
Re: HEADS UP: /bin and /sbin are now dynamically linked
On Sat, 22 Nov 2003, M. Warner Losh wrote: > In message: <[EMAIL PROTECTED]> > Bruce Evans <[EMAIL PROTECTED]> writes: > : On Sat, 22 Nov 2003, M. Warner Losh wrote: > : > Timing Solutions uses the following minimal termcap for its embedded > : > applications. It has a number of terminals that it supports, while > : > still being tiny. It is 3.5k in size, which was the goal (< 4k block > : > size we were using). One could SED this down by another 140 bytes or > : > so. Removing the comments and the verbose names would net another 300 > : > odd bytes. > : > : What's wrong with FreeBSD's /usr/src/etc/termcap.small, except it is > : twice as large and has a weird selection of entries (zillions of > : variants of cons25, dosansi and pc3)? > > Mine is better because it has a more representative slice of currently > used terminal types. Maybe we should replace termcap.small with mine > (maybe with the copyright notice). I agree. termcap.small is amazingly uncurrent. However, perhaps some merging and reducing is in order. Why is a full cons25 or vt2xx needed? vi only needs a few capabilities. I think we mostly use copies of large termcap entries because copying the whole thing is easier. Bruce
Re: dumb question 'Bad system call' after make world
On Fri, 21 Nov 2003, Barney Wolff wrote: > Will somebody please tell me when "make world" is ever correct in the > environment of the last several years? I've been unable to understand > its continued existence as a target. From my normal world-building script: DESTDIR=/c/z/root \ MAKEOBJDIRPREFIX=/c/z/obj \ time -l make -s world > /tmp/world.out 2>&1 Bruce
Re: Unfortunate dynamic linking for everything
On Fri, 21 Nov 2003, Tim Kientzle wrote: > Bruce Evans wrote: > > It obviously uses NSS. How else could it be so bloated? : > > > > $ ls -l /sbin/init > > -r-x-- 1 root wheel 453348 Nov 18 10:30 /sbin/init > > I believe it's actually DNS, not NSS. > > Pre-5.0, the resolver ballooned significantly. > A lot of the bloat in /bin and /sbin came > from the NIS functions which in turn pull in > the resolver. Perhaps both. > Example: /bin/date on 5.1 is also over 450k > because of a single call to getservbyname(). > Removing that one call shrinks a static /bin/date > to a quite reasonable size. (I seem to recall 80k when > I did this experiment.) The 2 calls to logwtmp() must also be removed, at least now. I get the following text sizes for /bin/date: RELENG_4: 137491 -current*: 93214 (* = getservbyname() and logwtmp() calls removed) -current: 371226 (only 412492 total, not 450K yet) > I note that /sbin/init calls getpwnam(); > I expect that's where the bloat gets pulled in. Yes, except it's only the latest 200+K of bloat (from 413558 bytes text to 633390). Before that there was 100+K of miscellaneous bloat relative to RELENG_4 (text size 305289 there). Before that there was another 200+K of bloat from implementing history. Compiling with -DNO_HISTORY removes history support and reduces the text size to 162538 (this is without getpwnam()). Then there is another 30K of mostly non-bloat for actual changes within /bin/sh, since compiling the FreeBSD-1 /bin/sh with current libraries gives a text size of 132966. Finally, IIRC the text size of the FreeBSD-1 /bin/sh is 70K (total size 90K), so there is another 60K of miscellaneous bloat in current libraries to increase the text size from 70K to 130K. Total text sizes for /bin/sh's internals: FreeBSD-1 sh compiled with -current's compiler: 55350 current sh compiled with -current's compiler: 87779 87:55 is about right for the increased functionality. 
Bruce
Re: Unfortunate dynamic linking for everything
On Wed, 19 Nov 2003, Ken Smith wrote: > On Thu, Nov 20, 2003 at 06:27:31AM +1100, Bruce Evans wrote: > > > > set init_path=/rescue/init > > > > If dynamic root were ready to be turned on, then /rescue/init would be > > in the default init_path. > > I had that explained to me too. :-) > > There is a loop in sys/kern/init_main.c that "probes" for an init > to run. But it only does what you want for cases of the files > not existing or otherwise just totally not executable. It won't > handle the "started but then dumped core" case the way it would > need to if /sbin/init were to fail because of shared library > problems. So if just relying on this mechanism it would either > not work right (/sbin/init in the path before /rescue/init) or > it would always start /rescue/init (/rescue/init before /sbin/init > in the path). Oops, better add "... and error handling for init_path would be fixed" :-). I should have remembered this since I got bitten by it recently. I was trying to boot RELENG_3 and had a backup init that worked but that didn't help because there was an execable init earlier in the path. Bruce
Re: hard lock-up writing to tape
On Wed, 19 Nov 2003, Mike Durian wrote: > On Tuesday 18 November 2003 08:29 pm, Bruce Evans wrote: > > - -current has the kern.console sysctl for enabling multiple consoles > > (but only 1 sio one). You can boot with a syscons console and then > > enable the serial, and the latter should work if it is on a working > > port to begin with. Anyway, this sysctl shows which sio port can be > > a console, if any. > > Is there any documentation on this sysctl? I'm not sure what I > should set it to. After a normal boot, it reads: Only in the source code. > kern.console: consolectl,/ttyd1,consolectl, Not even the bug that syscons's consolectl device is printed here is documented (the actual syscons console is on /dev/ttyv0, but this bogusly shares a tty struct with /dev/consolectl and many things cannot tell the difference. This bug also messes up the columns in pstat -t, since consolectl is too wide to fit). Anyway, the stuff to the left of the slash in the above is the list of active consoles and the stuff to the right of the slash is the list of possible consoles. You have to move stuff from one list to the other. I vaguely remember that this is done using '-' to delete things from the left hand list and something more direct to add them. Bruce
Re: Unfortunate dynamic linking for everything
On Wed, 19 Nov 2003, Marcel Moolenaar wrote: > set init_path=/rescue/init If dynamic root were ready to be turned on, then /rescue/init would be in the default init_path. > A dynamically linked /sbin/init just > makes it harder to get to the rescue bits, so it makes sense to > link init(8) statically. Especially since there's no advantage to > dynamic linking init(8) that compensates for the inconvenience. It obviously uses NSS. How else could it be so bloated? : $ ls -l /sbin/init -r-x-- 1 root wheel 453348 Nov 18 10:30 /sbin/init (My version is linked statically of course.) The NSS parts of init might not be needed in normal operation, but it's hard to tell. Bruce
Re: hard lock-up writing to tape
On Tue, 18 Nov 2003, Mike Durian wrote: > On Monday 17 November 2003 04:41 pm, Mike Durian wrote: > > > > I was finally able to get some partial success by setting flag 0x30 > > for sio1. When I'd boot, I'd get console messages on my remote > > tip session. However, I'd only receive those messages printed > > from user-level applications. I would not see any of the bold-face > > messages from the kernel. > > I'm still stumbling with the remote serial console. Can someone > who does this often test and verify they can use COM2 as the > serial console - and then tell me what you did. Moving the 0x10 flag from sio0 to sio1 should be sufficient for the kernel part. Setting the 0x20 flag for sio1 together with the 0x10 flag should mainly save having to edit the flag for sio0. If the kernel's serial console is the same as the boot blocks', then it should use the same speed as the boot blocks set it to. Otherwise there may be a speed mismatch. > The best I can manage is described above and then I get neither > the bold kernel messages nor the debugger prompt. This could be from a speed mismatch or from kern.consmute somehow getting set. Some of this stuff can be configured after booting: - RELENG_4 has non-broken boot-time configuration which allows changing during the boot. - -current has the kern.console sysctl for enabling multiple consoles (but only 1 sio one). You can boot with a syscons console and then enable the serial, and the latter should work if it is on a working port to begin with. Anyway, this sysctl shows which sio port can be a console, if any. - RELENG_4 and -current have the machdep.conspeed sysctl for setting the console speed. Bruce
Re: HEADS-UP new statfs structure
On Tue, 18 Nov 2003, Rudolf Cejka wrote: > Hello, and is it possible to review some my (one's from masses :o) > questions/suggestions? > > * cvtstatfs() for freebsd4_* compat syscalls does not copy text fields > correctly, so old binaries with new kernel know just about first > 16 characters from mount points - what do you think about the > following patch? (Or maybe with even safer sizeof() - but I did not > test it.) Hmm, there were 2 bugs here: - MFSNAMELEN was confused with MNAMELEN in some places. This gives unterminated strings as well as excessively truncated strings. - there were off-by-1 errors which would have given unterminated strings even without the previous bug. > --- sys/kern/vfs_syscalls.c.orig Sun Nov 16 11:12:09 2003 > +++ sys/kern/vfs_syscalls.c Sun Nov 16 11:56:07 2003 > @@ -645,11 +645,11 @@ > osp->f_syncreads = MIN(nsp->f_syncreads, LONG_MAX); > osp->f_asyncreads = MIN(nsp->f_asyncreads, LONG_MAX); > bcopy(nsp->f_fstypename, osp->f_fstypename, > - MIN(MFSNAMELEN, OMNAMELEN)); > + MIN(MFSNAMELEN, OMFSNAMELEN - 1)); MFSNAMELEN didn't change, so there is currently only a logical problem here. The -1 term could be moved outside of the MIN(). It works in either place and would save duplicating the terminating NUL in the unlikely event that the new name length becomes smaller than the old one. I'm not sure which is clearest. > bcopy(nsp->f_mntonname, osp->f_mntonname, > - MIN(MFSNAMELEN, OMNAMELEN)); > + MIN(MNAMELEN, OMNAMELEN - 1)); Similarly, plus the larger bug. MNAMELEN increased from (88 - 2 * sizeof(long)) to 88, so if it were used without the -1 in the above, then mount point name lengths longer than the old value would have been unterminated instead of truncated. > bcopy(nsp->f_mntfromname, osp->f_mntfromname, > - MIN(MFSNAMELEN, OMNAMELEN)); > + MIN(MNAMELEN, OMNAMELEN - 1)); Similarly. 
> if (suser(td)) { > osp->f_fsid.val[0] = osp->f_fsid.val[1] = 0; > } else { > --- > > * sys/compat/freebsd32/freebsd32_misc.c: If you look into copy_statfs(), > you copy 88-byte strings into just 80-byte strings. Fortunately it seems > that there are just overwritten spare fields and f_syncreads/f_asyncreads > before they are set to the correct value. What about these patches, which > furthermore are resistant to possible MFSNAMELEN change in the future? > [I'm sorry, these patches are untested] > > --- sys/compat/freebsd32/freebsd32.h.orig Tue Nov 18 16:58:28 2003 > +++ sys/compat/freebsd32/freebsd32.h Tue Nov 18 16:59:36 2003 > @@ -75,6 +75,7 @@ > int32_t ru_nivcsw; > }; > > +#define FREEBSD32_MFSNAMELEN 16 /* length of type name including null */ > #define FREEBSD32_MNAMELEN (88 - 2 * sizeof(int32_t)) /* size of on/from > name bufs */ > MFSNAMELEN hasn't changed, so this part is cosmetic. But don't we now need to clone all of this compatibility cruft for the new statfs()? Native 32-bit systems have both. Then MFSNAMELEN for this version should probably be spelled OMFSNAMELEN. > struct statfs32 { > @@ -92,7 +93,7 @@ > int32_t f_flags; > int32_t f_syncwrites; > int32_t f_asyncwrites; > - char f_fstypename[MFSNAMELEN]; > + char f_fstypename[FREEBSD32_MFSNAMELEN]; > char f_mntonname[FREEBSD32_MNAMELEN]; > int32_t f_syncreads; > int32_t f_asyncreads; > --- sys/compat/freebsd32/freebsd32_misc.c.orig Tue Nov 18 16:59:49 2003 > +++ sys/compat/freebsd32/freebsd32_misc.c Tue Nov 18 17:03:31 2003 > @@ -276,6 +276,7 @@ > static void > copy_statfs(struct statfs *in, struct statfs32 *out) > { > + bzero(out, sizeof *out); Yikes. All copied out structs that might have holes (i.e., all structs unless you want to examine them in binary for every combination of arch/compiler/etc) need to be bzero()ed like this, but there are no bzero()'s in files in this directory. 
> 	CP(*in, *out, f_bsize);
> 	CP(*in, *out, f_iosize);
> 	CP(*in, *out, f_blocks);
> @@ -290,14 +291,14 @@
>  	CP(*in, *out, f_flags);
>  	CP(*in, *out, f_syncwrites);
>  	CP(*in, *out, f_asyncwrites);
> -	bcopy(in->f_fstypename,
> -	    out->f_fstypename, MFSNAMELEN);
> -	bcopy(in->f_mntonname,
> -	    out->f_mntonname, MNAMELEN);
> +	bcopy(in->f_fstypename, out->f_fstypename,
> +	    MIN(MFSNAMELEN, FREEBSD32_MFSNAMELEN - 1));
> +	bcopy(in->f_mntonname, out->f_mntonname,
> +	    MIN(MNAMELEN, FREEBSD32_MNAMELEN - 1));
>  	CP(*in, *out, f_syncreads);
>  	CP(*in, *out, f_asyncreads);
> -	bcopy(in->f_mntfromname,
> -	    out->f_mntfromname, MNAMELEN);
> +	bcopy(in->f_mntfromname, out->f_mntfromname,
> +	    MIN(MNAMELEN, FREEBSD32_MNAMELEN - 1));
>  }
>
> int

This seems to be correct except possibly for the style (placement of -1 and fixing the indentation of the continuation lines so that it is not bug-for-bug compatible).
Re: Unfortunate dynamic linking for everything
On Tue, 18 Nov 2003, M. Warner Losh wrote: > In message: <[EMAIL PROTECTED]> > [EMAIL PROTECTED] writes: > : It really doesn't make sense to arbitrarily cut-off a > : discussion especially when a decision might be incorrect. > > I'd say that good technical discussion about why this is wrong would > be good. However, emotional ones should be left behind. Except for > John's message, most of the earlier messages have been more emotional > than technical. I used to use all dynamic linkage, but switched to all static linkage (except for ports) when I understood John's points many years ago. It shouldn't be necessary to repeat the arguments. > John, do you have any good set of benchmarks that people can run to > illustrate your point? Almost any benchmark that does lots of forks or execs, or uses libraries a lot will do. IIRC, 5-10% of my speedup for makeworld was from building tools static. Makeworld is not such a good benchmark for this as it used to be since it always builds tools static so the non-staticness of standard binaries doesn't matter so much. Perhaps it still matters for /bin/sh. Bruce ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: DIAGNOSTIC LOR in softclock
On Sat, 15 Nov 2003, Poul-Henning Kamp wrote: > This looks slightly different if I use SCHED_ULE, but the effect is > the same. > > Off the top of my head, I have not been able to find any places > where softclock would call schedcpu directly. schedcpu() is a timeout routine, so it is always called indirectly from softclock. > lock order reversal > 1st 0xc072dca0 callout_dont_sleep (callout_dont_sleep) @ kern/kern_timeout.c:223 > 2nd 0xc072d080 allproc (allproc) @ kern/sched_4bsd.c:257 > Stack backtrace: > backtrace(c06d148d,c072d080,c06cd881,c06cd881,c06cf38b) at backtrace+0x17 > witness_lock(c072d080,0,c06cf38b,101,c5061c3c) at witness_lock+0x672 > _sx_slock(c072d080,c06cf382,101,8,c06cf0a0) at _sx_slock+0xae > schedcpu(0,0,c06cf097,df,c1183140) at schedcpu+0x3f > softclock(0,0,c06cbce6,23a,c1189388) at softclock+0x1fb > ithread_loop(c1180400,c5061d48,c06cbb54,311,558b0424) at ithread_loop+0x192 > fork_exit(c050b090,c1180400,c5061d48) at fork_exit+0xb5 > fork_trampoline() at fork_trampoline+0x8 > --- trap 0x1, eip = 0, esp = 0xc5061d7c, ebp = 0 --- I'm sure this is known. schedcpu() always calls sx_lock(&allproc_lock), so the above always occurs if sx_lock() happens to block. Bruce
Re: Who needs these silly statfs changes...
On Fri, 14 Nov 2003 [EMAIL PROTECTED] wrote: > >> Bruce Evans wrote: > >> > ... > >> > I just got around to testing the patch in that reply: > >> > ... > >> > >> Your patch to nfs_vfsops won't apply to my Solaris kernel :-) > >> The protocol says "abytes" is unsigned, so the server shouldn't be lying > >> by sending a huge positive value for available space on a full > >> filesystem. No? > > > >Possibly not, but the protocol is broken if it actually requires that. > > What makes you say that? I would think the utility of negative counts > for disk sizes and available spaces is marginal. Solaris, POSIX, and > NFS seem to get on fine without it. What am I (and they) missing? Well, the f_bavail field (not to mention all the other fields (until recently, sigh)) has always been signed and does go negative in BSD's statfs, so the protocol is broken if it can't support negative values in it. > >The type pun to negative values is in most versions of BSD: > > [snip code snippets and bug] > > That's great for interacting with other BSDs, but it still abusing > the protocol. As filesystems with approaching 2^64 bytes become possible > it probably has more of an impact. 2^63 won't be needed any time soon. This problem was more serious with nfsv2 when file systems reached 2^31 bytes not so long ago. The current problem is actually more with non-BSD clients and a BSD server. The BSD server will send the negative values and the non-BSD client may convert them to huge positive ones. Non-BSD servers presumably won't send negative values. Bruce
Re: Who needs these silly statfs changes...
On Sat, 15 Nov 2003, Terry Lambert wrote: > Bruce Evans wrote: > > I just got around to testing the patch in that reply: > [ ... ] > > This seems to work. On a 2TB-epsilon ffs1 file system (*) on an md malloc > > disk (**): > > Try it again. This time, take the remote FS below its free reserve > as the root user, and see what the client machine reports. Compare > the results to an identical local FS. Er, that is the main thing that the test did. Bruce
Re: Who needs these silly statfs changes...
On Fri, 14 Nov 2003, Peter Edwards wrote: > Bruce Evans wrote: > > > On Fri, 14 Nov 2003, Peter Edwards wrote: > >> The NFS protocols have unsigned fields where statfs has signed > >> equivalents: NFS can't represent negative available disk space ( Without > >> the knowledge of the underlying filesystem on the server, negative free > >> space is a little nonsensical anyway, I suppose) > >> > >> The attached patch stops the NFS server assigning negative values to > >> unsigned fields in the statfs response, and works against my local > >> solaris box. Seem reasonable? > > > > The client attempts to fix this by pretending that the unsigned fields > > are signed. -current tries to do more to support file system sizes larger > > than 1TB, but the code for this is not even wrong except it may be wrong > > enough to break the negative values. See my reply to one of the PRs > > for more details. > > > > I just got around to testing the patch in that reply: > > ... > > Your patch to nfs_vfsops won't apply to my Solaris kernel :-) > The protocol says "abytes" is unsigned, so the server shouldn't be lying > by sending a huge positive value for available space on a full > filesystem. No? Possibly not, but the protocol is broken if it actually requires that. The "free" fields are signed in struct statfs so that they can be negative. However, this is broken in POSIX's struct statvfs (all count fields have type fsblkcnt_t or fsfilcnt_t and these are specified to be unsigned). Is Solaris bug for bug compatible with that? Anyway, my patch is mainly supposed to fix the scaling. The main bug in the initial scaling patch was that the huge positive values were scaled before they were interpreted as negative values, so they became not so huge but still preposterous values that could not be interpreted as negative values. The type pun to negative values is in most versions of BSD: RELENG_4: u_quad_t tquad; ... 
if (v3) {
	sbp->f_bsize = NFS_FABLKSIZE;
	tquad = fxdr_hyper(&sfp->sf_tbytes);
	sbp->f_blocks = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	tquad = fxdr_hyper(&sfp->sf_fbytes);
	sbp->f_bfree = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	tquad = fxdr_hyper(&sfp->sf_abytes);
	sbp->f_bavail = (long)(tquad / ((u_quad_t)NFS_FABLKSIZE));
	sbp->f_files = (fxdr_unsigned(int32_t,
	    sfp->sf_tfiles.nfsuquad[1]) & 0x7fff);
	sbp->f_ffree = (fxdr_unsigned(int32_t,
	    sfp->sf_ffiles.nfsuquad[1]) & 0x7fff);
} else {
	sbp->f_bsize = fxdr_unsigned(int32_t, sfp->sf_bsize);
	sbp->f_blocks = fxdr_unsigned(int32_t, sfp->sf_blocks);
	sbp->f_bfree = fxdr_unsigned(int32_t, sfp->sf_bfree);
	sbp->f_bavail = fxdr_unsigned(int32_t, sfp->sf_bavail);
	sbp->f_files = 0;
	sbp->f_ffree = 0;
}

Oops, this has the cast to long perfectly misplaced so that negative sizes are not converted like I want. It just prevents warnings. Overflow has occurred long before, on the server when negative block counts were converted to huge positive sizes.

NetBSD (nfs_vfsops.c 1.132):

u_quad_t tquad;
...
...
if (v3) {
	sbp->f_bsize = NFS_FABLKSIZE;
	tquad = fxdr_hyper(&sfp->sf_tbytes);
	sbp->f_blocks = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(&sfp->sf_fbytes);
	sbp->f_bfree = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(&sfp->sf_abytes);
	sbp->f_bavail = (long)((quad_t)tquad / (quad_t)NFS_FABLKSIZE);
	tquad = fxdr_hyper(&sfp->sf_tfiles);
	sbp->f_files = (long)tquad;
	tquad = fxdr_hyper(&sfp->sf_ffiles);
	sbp->f_ffree = (long)tquad;
} else {
	sbp->f_bsize = fxdr_unsigned(int32_t, sfp->sf_bsize);
	sbp->f_blocks = fxdr_unsigned(int32_t, sfp->sf_blocks);
	sbp->f_bfree = fxdr_unsigned(int32_t, sfp->sf_bfree);
	sbp->f_bavail = fxdr_unsigned(int32_t, sfp->sf_bavail);
	sbp->f_files = 0;
	sbp->f_ffree = 0;
}

This converts tquad to quad_t so that the divisions work like I want. These conversions were added in rev.1.82 in 1999. More changes are needed here to catch up with the recent changes to struct statfs in FreeBSD. 
The casts to long are now just wrong since the block count fields don't have type long. Bruce
Re: Who needs these silly statfs changes...
On Fri, 14 Nov 2003, Peter Edwards wrote: > Bernd Walter wrote: > > >On Thu, Nov 13, 2003 at 12:54:18AM -0800, Kris Kennaway wrote: > > > > > >>On Thu, Nov 13, 2003 at 06:44:25PM +1100, Peter Jeremy wrote: > >> > >> > >>>On Wed, Nov 12, 2003 at 06:04:00PM -0800, Kris Kennaway wrote: > >>> > >>> > ...my sparc machine reports that my i386 nfs server has 15 exabytes of > free space! > > enigma# df -k > Filesystem 1K-blocks Used Avail Capacity Mounted on > rot13:/mnt2 56595176 54032286 18014398507517260 0%/rot13/mnt2 > > > >>>18014398507517260 = 2^54 - 1964724. and 2^54KB == 2^64 bytes. Is it > >>>possible that rot13:/mnt2 has negative free space? (ie it's into the > >>>8-10% reserved area). > >>> > >>> > >>Yes, that's precisely what it is..the bug is either in df or the > >>kernel (I suspect the latter, i.e. something in the nfs code). > >> > >> > > > >And it's nothing new - I'm seeing this since several years now. > > > > > > The NFS protocols have unsigned fields where statfs has signed > equivalents: NFS can't represent negative available disk space ( Without > the knowledge of the underlying filesystem on the server, negative free > space is a little nonsensical anyway, I suppose) > > The attached patch stops the NFS server assigning negative values to > unsigned fields in the statfs response, and works against my local > solaris box. Seem reasonable? The client attempts to fix this by pretending that the unsigned fields are signed. -current tries to do more to support file system sizes larger than 1TB, but the code for this is not even wrong except it may be wrong enough to break the negative values. See my reply to one of the PRs for more details. 
I just got around to testing the patch in that reply:

%%%
Index: nfs_vfsops.c
===
RCS file: /home/ncvs/src/sys/nfsclient/nfs_vfsops.c,v
retrieving revision 1.143
diff -u -2 -r1.143 nfs_vfsops.c
--- nfs_vfsops.c	12 Nov 2003 02:54:46 -	1.143
+++ nfs_vfsops.c	12 Nov 2003 14:37:46 -
@@ -223,5 +223,5 @@
 	struct mbuf *mreq, *mrep, *md, *mb;
 	struct nfsnode *np;
-	u_quad_t tquad;
+	quad_t tquad;
 	int bsize;
@@ -254,19 +254,19 @@
 	for (bsize = NFS_FABLKSIZE; ; bsize *= 2) {
 		sbp->f_bsize = bsize;
-		tquad = fxdr_hyper(&sfp->sf_tbytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		tquad = (quad_t)fxdr_hyper(&sfp->sf_tbytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_blocks = tquad / bsize;
-		tquad = fxdr_hyper(&sfp->sf_fbytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		sbp->f_blocks = tquad;
+		tquad = (quad_t)fxdr_hyper(&sfp->sf_fbytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_bfree = tquad / bsize;
-		tquad = fxdr_hyper(&sfp->sf_abytes);
-		if (((long)(tquad / bsize) > LONG_MAX) ||
-		    ((long)(tquad / bsize) < LONG_MIN))
+		sbp->f_bfree = tquad;
+		tquad = (quad_t)fxdr_hyper(&sfp->sf_abytes) / bsize;
+		if (bsize <= INT_MAX / 2 &&
+		    (tquad > LONG_MAX || tquad < LONG_MIN))
 			continue;
-		sbp->f_bavail = tquad / bsize;
+		sbp->f_bavail = tquad;
 		sbp->f_files = (fxdr_unsigned(int32_t,
 		    sfp->sf_tfiles.nfsuquad[1]) & 0x7fff);
%%%

This seems to work. On a 2TB-epsilon ffs1 file system (*) on an md malloc disk (**):

server:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/md0 21474168960 1975624000 0%/b

client:
Filesystem 1024-blocks Used Avail Capacity Mounted on
besplex:/b 21474168960 1975624000 0%/b

These are 1K-blocks so their count fits in an int32_t, but the count in 512-blocks is too large for an int32_t so the scaling must be helping. 
With newfs -m 100 (***) to get near negative free space: server: Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/md0 21474168960 5696 0%/b client: Filesystem 1K-blocks Used Avail Capacity Mounted on besplex:/b 21474168960 5696 0%/b After using up all the free space by creating a 6MB file: server: Filesystem 1K-blocks Used Avail Capacity Mo
Re: new kernel and old programs - bad system call
On Thu, 13 Nov 2003, John Hay wrote: > Is it ok to run a new kernel (after the statfs changes) and older > programs? I thought so from what i gathered out of the commit > messages, but my test box doesn't like it at all... Well except > if something else broke stuff: I have no problems with a current kernel and an old world. > ## > ... > Mounting root from ufs:/dev/da0s1a > pid 50 (sh), uid 0: exited on signal 12 > Enter full pathname of shell or RETURN for /bin/sh: > # ls > pid 56 (ls), uid 0: exited on signal 12 > Bad system call > # > ## Maybe you don't have old programs. Unfortunately, even /bin/sh is affected by the changes (it has a reference to fstatfs). I often boot old kernels (back to RELENG_4) with current utilities and will have to do something about this. Everything except things like ps works with only the following changes: - don't use the new eaccess() syscall in test(1). - change SYS_sigaction and SYS_sigreturn to their old (RELENG_4) values so that the newest signal handling is not used. This works almost perfectly because there are no significant changes to the data structures (only some semantic changes that most utilities don't care about). Larger changes in signal handling are the main thing that prevents current utilities running under RELENG_3. The statfs changes affect data structures, so they can't be avoided by simply changing the syscall numbers. Bruce
Re: Found a problem with new source code
On Mon, 10 Nov 2003, Jason wrote: > I just wanted to let someone know that my buildworld fails at > /usr/src/sys/boot/i386/boot2/boot2.c at line 362. I get an undefined > error for RB_BOOTINFO, by adding #define RB_BOOTINFO 0x1f it worked. Sorry, I broke it last night. It is now fixed. > Also it failed at sendmail.fc or something, I don't use send mail so I > just did not build it. It looks like someone already reported the > device apic problem. I just tryed option smp and device apic on my > single proc athlon, panic on boot unless I chose no apic or is it no > acpi(?) at boot. > > By the way, why adding the smp options do any good for my machine? I > mostly care about speed, but it seems it might just make the os unstable > for me. No; it is only good for multi-CPU machines. Bruce
Re: boot0 and fdisk / disklabel misbehaviour
On Tue, 11 Nov 2003, Dag-Erling Smørgrav wrote: > I've been busy installing various OSes on a spare disk in order to try > to reproduce some of fefe's benchmarks. In the process, I've noticed > a couple of bogons in boot0 and disklabel: > > - disklabel -B trashes the partition table: > ># dd if=/dev/zero of=/dev/ad0 count=20 ># fdisk -i ad0 >(create a FreeBSD partition) ># disklabel -rw ad0s1 auto ># newfs -U /dev/ad0s1a ># disklabel -B ad0s1a >(this trashes the partition table) I think you mean bsdlabel. disklabel is just a link to bsdlabel in -current. This was fixed in rev.1.8 of disklabel.c, but the change was lost in bsdlabel. >This probably happens because fdisk silently allows the user to >create a partition that overlaps the partition table. Arguably >pilot error, but very confusing at the time, and fdisk should warn >about it. Yes. This is the dangerously undedicated case. Some consider this to be an error. I only ever used it for one drive. Bruce
Re: erroneous message from locked-up machine
On Mon, 10 Nov 2003, Michael W. Lucas wrote:

> I came in to work today to find one of my -current machines unable to
> open a pipe. (This probably had a lot to do with the spamd that went
> stark raving nutters overnight, but that's a separate problem.) A
> power cycle fixed the problem, but /var/log/messages was filled with:
>
> Nov 10 11:05:44 bewilderbeast kernel: kern.maxpipekva exceeded, please see tuning(7).
>
> Interesting.
>
> bewilderbeast~;sysctl kern.maxpipekva
> sysctl: unknown oid 'kern.maxpipekva'
> bewilderbeast~;

The following patch fixes this and some nearby style bugs:
- source style bug: line too long
- output style bugs: comma splice, verboseness (helps make the source line too long), and kernel message terminated with a ".".

%%%
Index: sys_pipe.c
===
RCS file: /home/ncvs/src/sys/kern/sys_pipe.c,v
retrieving revision 1.158
diff -u -2 -r1.158 sys_pipe.c
--- sys_pipe.c	9 Nov 2003 09:17:24 -	1.158
+++ sys_pipe.c	10 Nov 2003 17:21:47 -
@@ -331,5 +331,5 @@
 	if (error != KERN_SUCCESS) {
 		if (ppsratecheck(&lastfail, &curfail, 1))
-			printf("kern.maxpipekva exceeded, please see tuning(7).\n");
+			printf("kern.ipc.maxpipekva exceeded; see tuning(7)\n");
 		return (ENOMEM);
 	}
%%%

> And tuning(7) doesn't mention this, either.
>
> Is this just work-in-progress, or did someone forget to commit something?

Seems like tuning pipe kva is completely absent in tuning(7) (so the above message can be shortened further). You can tune kva generally as documented there, but the pipe limit is separate.

> PS: Lesson of the day: no pipe KVA, no su. Great fun on remote
> machines! :-)

It's interesting that su was the point of failure. It uses a pipe hack for IPC. Otherwise it doesn't use pipes, at least directly. It shouldn't need to use the pipe hack. My version uses signals instead.

Bruce
Re: the PS/2 mouse problem
On Sat, 8 Nov 2003, Morten Johansen wrote: > Scott Long wrote: > > Bruce Evans wrote: > >>[... possibly too much trimmed] > > The problem here is that the keyboard controller driver tries to be too > > smart. If it detects that the hardware FIFO is full, it'll drain it into > > a per-softc, per-function ring buffer. So having psm(4) just directly > > read the hardware is insufficient in this scheme. What is the per-function part? (I'm not very familiar with psm, but once understood simpler versions of the keyboard driver.) Several layers of buffering might not be too bad for slow devices. The i/o times tend to dominate unless you do silly things like a context switch to move each character from one buffer to other, and even that can be fast enough (I believe it is normal for interactive input on ptys; then there's often a remote IPC or two per character as well). > >> - it sometimes calls the DELAY() function, which is not permitted in fast > >> interrupt handlers since apart from locking issues, fast interrupt > >> handlers > >> are not permitted to busy-wait. > > > > Again, the keyboard controller driver is too smart for its own good. To > > summarize: > > > > read_aux_data_no_wait() > > { > > Does softc->aux ring buffer contain data? > > return ring buffer data > > > > Check the status register > > Is the keyboard fifo full? > > DELAY(7us) > > read keyboard fifo into softc->kbd ring buffer > > Check the status register > > > > Is the aux fifo full? > > DELAY(7us) > > return aux fifo data > > } > > > > So you can wind up stalling for 14us in there, presumably because you > > cannot read the status and data registers back-to-back without a delay. > > I don't have the atkbd spec handy so I'm not sure how to optimize this. > > Do you really need to check the status register before reading the data > > register? At least it's a bounded delay. I believe such delays are required for some layers of the keyboard. Perhaps only for the keyboard (old hardware only?) 
and not for the keyboard controller or the mouse. > >> Many of the complications for fast interrupt handlers shouldn't be needed > >> in psm. Just make psmintr() INTR_MPSAFE. > > > > I believe that the previous poster actually tried making it INTR_MPSAFE, > > but didn't see a measurable benefit because the latency of scheduling > > the ithread is still unacceptable. > > That is 100% correct. > In the meantime I have taken your's and Bruce's advice and rearranged > the interrupt handler to look like this: > > mtx_lock(&sc->input_mtx); Er, this is reasonable for INTR_MPSAFE but not for INTR_FAST. mtx_lock() is a "sleep" lock so it cannot be used in fast interrupt handlers. mtx_lock_spin() must be used. (My version doesn't permit use of mtx_lock_spin() either; more primitive locking must be used.) > while((c = read_aux_data_no_wait(sc->kbdc)) != -1) { This is probably INTR_FAST-safe enough in practice. > sc->input_queue.buf[sc->input_queue.tail] = c; > if ((++ sc->input_queue.tail) >= PSM_BUFSIZE) > sc->input_queue.tail = 0; > count = (++ sc->input_queue.count); > } > mtx_unlock(&sc->input_mtx); The locking for the queue seems to be correct except this should operate on a spinlock too. > if (count >= sc->mode.packetsize) > taskqueue_enqueue(taskqueue_swi_giant, &sc->psm_task); taskqueue_enqueue() can only be used in non-fast interrupt handlers. taskqueue_enqueue_fast() must be used in fast interrupt handlers (except in my version, it is not permitted so it shouldn't exist). Note that the spinlock/fast versions can be used for normal interrupt handlers too, so not much more code is needed to support handlers whose fastness is dynamically configured. > And it works, but having it INTR_MPSAFE still does NOT help my problem. > It looks to me like data is getting lost because the interrupt handler > is unable to read it before it's gone, and the driver gets out of sync, > and has to reset itself. 
> However it now takes a few more tries to provoke the problem, so > something seems to have improved somewhere. This is a bit surprising. There are still so few INTR_MPSAFE handlers that there aren't many system activities that get in the way of running the INTR_MPSAFE ones. Shared interrupts prevent running of a handler while other handlers on the same interrupt are running, and the mouse interrupt is often shared, but if it is shared then it couldn't be fast until recently and still can't be fast unless all the other handlers on it are fast. Bruce
Re: serial console oddity
On Sat, 8 Nov 2003, Don Lewis wrote: > I've been seeing some wierd things for many months when using a serial > console on my -CURRENT box. I finally had a chance to take a closer > look today. > > It looks like the problem is some sort of interference between kernel > output to the console and userland writes to /dev/console. I typically > see syslogd output to the console get corrupted. Each message that > syslogd writes seems to get truncated or otherwise corrupted. The most > common thing I see is that each syslog message is reduced to a space and > the first character of the month, or sometimes just a space, or > sometimes nothing at all. This is (at least primarily) a longstanding bug in ttymsg(). It uses nonblocking mode so that it doesn't block in write() or close(). For the same reason, it doesn't wait for output to drain before close(). If the close happens to be the last one on the device, this causes any data buffered in the tty and lower software layers to be discarded cleanly and any data in lower hardware layers to be discarded in a driver plus hardware-dependent way (usually not so cleanly, especially for the character being transmitted). > This is totally consistent until I "kill > -HUP" syslogd, which I believe causes syslogd to close and open > /dev/console, after which the syslog output appears correct on the > console. When the syslogd output is being corrupted, I can cat a file to > /dev/console and the output appears to be correct. When I debugged this, syslogd didn't seem to keep the console open, so the open()/close() in ttymsg() always caused the problem. I didn't notice killing syslogd makes a difference. Perhaps it helps due to a missing close. Holding the console open may be a workaround or even the correct fix. It's not clear where this should be done (should all clients of ttymsg() do it?). Running getty on the console or on the underlying tty device should do it accidentally. 
> I truss'ed syslogd, and it appears to be working normally, the writev() > call that writes the data to the console appears to be writing the > correct character count, so it would appear that the fault is in the > kernel. If there are any kernel bugs in this area, then they would be that last close of the console affects the underlying tty. The multiple console changes are quite likely to have broken this if getty is run on the underlying tty (they silently discarded the half-close of the underlying tty which was needed to avoid trashing some of its state when only the console is closed). > The problem doesn't appear to be specific to syslogd, because I have > seen the output from the shutdown scripts that goes to the console get > truncated as well. Yes, in theory it should affect anything that uses ttymsg() or does direct non-blocking writes without waiting for the output to drain. > I have my serial console running at the default 9600 bps. I always use 115200 bps and the symptoms are similar right down to normally getting only the first character of the month name followed by 0-1 bytes of garbage. The first character of the month name is just the first character of the message. Apparently my systems are fast enough for close() to be called before transmission of the second character has completed (2 * 87+ usec at 115200 bps). Here are some half-baked fixes. The part that clears O_NONBLOCK is wrong, and the usleep() part is obviously a hack. ttymsg() shouldn't block even in close(), since if the close is in the parent ttymsg() might block forever and if the close() is in a forked child then blocking could create zillions of blocked children. Another part of the patch is concerned with limiting forked children. If I were happy with that part then blocking would not be so bad. In practice, I don't have enough system activity for blocked children to be a problem. 
To see the problem with blocked children, do something like the following:
- turn off clocal on the console so that the console can block better. For sio consoles this often requires turning it off in the lock-state device, since the driver defends against this foot shooting by locking it on.
- hold the console open or otherwise avoid the original bug in this thread, else messages will just be discarded in close() faster than they can pile up.
- turn off your external console device or otherwise drop carrier.
- send lots of messages.

%%%
Index: ttymsg.c
===
RCS file: /home/ncvs/src/usr.bin/wall/ttymsg.c,v
retrieving revision 1.11
diff -u -2 -r1.11 ttymsg.c
--- ttymsg.c	11 Oct 2002 14:58:34 -	1.11
+++ ttymsg.c	11 Oct 2002 18:13:51 -
@@ -32,14 +32,16 @@
  */
-#include
-
-__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $");
-
+#if 0
 #ifndef lint
-static const char sccsid[] = "@(#)ttymsg.c 8.2 (Berkeley) 11/16/93";
+static char sccsid[] = "@(#)ttymsg.c 8.2 (Berkeley) 11/16/93";
+#e
Re: hard lockup with new interrupt code, possible cause irq14: ata0
On Sat, 8 Nov 2003, Barney Wolff wrote: > Try adding > options NO_MIXED_MODE > to your conf. That fixed boot-time hangs on my Asus A7M266-D. BTW, NO_MIXED_MODE is missing in NOTES. Bruce
Re: New interrupt stuff breaks ASUS 2 CPU system
On Fri, 7 Nov 2003, Stefan Eßer wrote: > On 2003-11-07 20:04 +1100, Bruce Evans <[EMAIL PROTECTED]> wrote: > > However, using the apic almost doubles the overheads for the a45 cases. > > This seems to be due to extra interrupts. The UART and/or driver already > > Just another data point: > > Seems that the interrupt rate doubled for drm0 on my system > (from 60 to 120 driving a LCD at 60Hz vertical refresh). > > I thought this might be a problem with shared interrupts (drm0 > and xl0 shared APIC IRQ 16), but removing the (actually unused) > xl driver did not make a difference ... Hmm. My a45 UARTs are the only ones with a pci level triggered interrupt: Nov 7 01:48:44 gamplex kernel: ioapic0: Routing IRQ 5 -> intpin 19 Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 5 disabled Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 19 trigger: level Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 19 polarity: active-lo There is only one other level triggered interrupt in the system that is used: Nov 7 01:48:44 gamplex kernel: ioapic0: Routing IRQ 11 -> intpin 18 Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 11 disabled Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 18 trigger: level Nov 7 01:48:44 gamplex kernel: ioapic0: intpin 18 polarity: active-lo and I suspect it may be doing strange things too: I found that rev.1.23 of ata_lowlevel.c broke atapicam, but the new interrupt code magically fixed it. One of the atapicam devices is the only device on IRQ11. Bruce
Re: the PS/2 mouse problem
On Fri, 7 Nov 2003, Morten Johansen wrote: > Morten Johansen wrote: > > Scott Long wrote: > > > >> One thought that I had was to make psmintr() be INTR_FAST. I need to > >> stare at the code some more to fully understand it, but it looks like it > >> wouldn't be all that hard to do. Basically just use the interrupt > >> handler > >> to pull all of the data out of the hardware and into a ring buffer in > >> memory, and then a fast taskqueue to process that ring buffer. It would > >> at least answer the question of whether the observed problems are due to > >> ithread latency. And if done right, no locks would be needed in > >> psmintr(). However, it is usually easier to use a lock even if not strictly necessary. psm as currently structured uses the technique of calling psmintr() from the timeout handler. This requires a lock. If this were not done, then the timeout routine would probably need to access hardware using scattered i/o instructions, and these would need locks (to prevent them competing with i/o instructions in psmintr()). Putting all the hardware accesses in the fast interrupt handler is simpler. The sio driver uses this technique but doesn't manage to put _all_ the i/o's in the interrupt handler, so it ends up having to lock out the interrupt handler all over the place. Ring buffers can be self-locking using delicate atomic instructions, but they are easier to implement using locks. > > I can reproduce the problem consistently on my machine, by moving the > > mouse around, while executing e.g this command in a xterm: > > > > dd if=/dev/zero of=test bs=32768 count=4000; sync; sync; sync > > > > when the sync'ing sets in the mouse attacks. > > It is very likely due to interrupt latency. > > > > I'd be happy to test any clever patches. > > Wow. You are completly right! > By using a MTX_SPIN mutex instead, and marking the interrupt handler > INTR_MPSAFE | INTR_FAST, my problem goes away. > I am no longer able to reproduce the mouse attack. 
> I have not noticed any side-effects of this. Could there be any?
> I will file a PR with an updated patch, unless you think it's a better
> idea to rearrange the driver.
> Probably the locking could be done better anyway.

Er, psmintr() needs large changes to become a fast interrupt handler. It does many things that may not be done by a fast interrupt handler, starting with the first statement in it:

	/* read until there is nothing to read */
	while((c = read_aux_data_no_wait(sc->kbdc)) != -1) {

This calls into the keyboard driver, which is not written to support any fast interrupt handlers. In general, fast interrupt handlers may not call any functions, since the "any" function doesn't know that it is called in fast interrupt handler context and may do things that may not be done in fast interrupt handler context. As it happens, read_aux_data_no_wait() does the following bad things:
- it accesses private keyboard data. All data that is accessed by a fast interrupt handler must be locked by a common lock or use self-locking accesses. Data in another subsystem can't reasonably be locked by this (although the keyboard subsystem is close to psm, you don't want to export the complexities of psmintr()'s locking to the keyboard subsystem).
- it calls other functions. The closure of all these calls must be examined and made fast-interrupt-handler safe before this is safe. The lowest level will resolve to something like inb(PSMPORT), and this alone is obviously safe provided PSMPORT is only accessed in the interrupt handler or is otherwise locked. (Perhaps the private keyboard data is actually private psm data that mainly points to PSMPORT. Then there is no problem with the data accesses. But the function calls make it unclear who owns the data.)
- it sometimes calls the DELAY() function, which is not permitted in fast interrupt handlers since, apart from locking issues, fast interrupt handlers are not permitted to busy-wait.
Many of the complications for fast interrupt handlers shouldn't be needed in psm. Just make psmintr() INTR_MPSAFE. This is nontrivial, however. Fine-grained locking gives many of the complications that were only in fast interrupt handlers in RELENG_4. E.g., for psmintr() to be MPSAFE, all of its calls into the keyboard subsystem need to be MPSAFE, and they are unlikely to be so unless the keyboard subsystem is made MPSAFE. The following method can be used to avoid some of the complications: make the interrupt handler not touch much data, so that it can be locked easily. The data should be little more than a ring buffer. Make the handler either INTR_MPSAFE or INTR_FAST (it doesn't matter for slow devices like psm). Put all the rest of what was in the interrupt handler in non-MPSAFE code (except where it accesses data shared with the interrupt handler) so that all of this code and its closure doesn't need to be made MPSAFE. This method is what the sio driver uses in -current, sort of by accident.
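The small-ring-buffer method described above can be sketched in plain C. This is a userland illustration of the pattern only, not code from the psm or sio drivers; the names and the buffer size are made up:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Single-producer/single-consumer ring buffer: the interrupt handler
 * (producer) only moves bytes from the hardware into the buffer, and
 * the taskqueue or non-MPSAFE code (consumer) drains it later.
 */
#define RB_SIZE 256			/* power of two so index masking works */

struct ringbuf {
	volatile uint32_t head;		/* advanced only by the producer */
	volatile uint32_t tail;		/* advanced only by the consumer */
	uint8_t data[RB_SIZE];
};

/* Producer side: no calls into other subsystems, no waiting. */
static int
rb_put(struct ringbuf *rb, uint8_t c)
{
	if (rb->head - rb->tail == RB_SIZE)
		return (-1);		/* full: drop, a handler must not block */
	rb->data[rb->head & (RB_SIZE - 1)] = c;
	rb->head++;
	return (0);
}

/* Consumer side: drain one buffered byte at a time. */
static int
rb_get(struct ringbuf *rb, uint8_t *c)
{
	if (rb->tail == rb->head)
		return (-1);		/* empty */
	*c = rb->data[rb->tail & (RB_SIZE - 1)];
	rb->tail++;
	return (0);
}
```

Because each index is written by exactly one side, this needs no mutex on a UP machine; under SMP, memory barriers or a spin mutex would still be needed, which is the "delicate atomic instructions" versus locks tradeoff mentioned earlier in the thread.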
RE: New interrupt stuff breaks ASUS 2 CPU system
On Thu, 6 Nov 2003, John Baldwin wrote:

> On 06-Nov-2003 Harti Brandt wrote:
> > JB>I figured out what is happening I think. You are getting a spurious
> > JB>interrupt from the 8259A PIC (which comes in on IRQ 7). The IRR register
> > JB>lists pending interrupts still waiting to be serviced. Try using
> > JB>'options NO_MIXED_MODE' to stop using the 8259A's for the clock and see if
> > JB>the spurious IRQ 7 interrupts go away.
> >
> > Ok, that seems to help. Interesting although why do these interrupts
> > happen only with a larger HZ and when the kernel is doing printfs (this
> > machine has a serial console). I have also not tried to disable SIO2 and
> > the parallel port.
>
> Can you also try turning mixed mode back on and using
> http://www.FreeBSD.org/~jhb/patches/spurious.patch
>
> You should get some stray IRQ 7's in the vmstat -i output as well as a few
> printf's to the kernel console.

Other changes fixed the problem with the apic case not working on my BP6, except the apic causes many more interrupts on serial ports at 921600 bps, almost enough to overload the system with just 2 active serial ports. I've now gathered lots of statistics for sio interrupt performance.
The bad effect of the apic on performance is shown in the "-current(apic)" lines for a45 and a45b only:

%%%
Keywords:
c04  = send at 115200 bps on cuac00, receive at 115200 bps on cuac04
c04b = like c04 plus send and receive in other direction too
       (b = bidirectional)
       (cuac* are on a Cyclades 8yo (2 * cd1400 isa))
a01  = like c04 except use ports cuaa[01]
a01b = like a01 except bidirectional
       (cuaa[01] are standard motherboard 16550 clones)
a45  = like a01 except use speed 921600 bps and ports cuaa[45]
a45b = like a45 except bidirectional
       (cuaa[45] are on a VScom 200HV2 (2 * 16950 pci))

-current(ointr) = -current before new interrupt code
-current        = plain current (2003/11/06)
-current(apic)  = -current with apic configured for UP kernel on SMP hardware
-current(bde)   = my version of -current (new interrupt code not merged yet)
&+iir,+stream,+intr0 = my version of -current with variants of sio
       optimizations (only UART-independent ones; optimizations for
       16950 UARTs give factor of 2 reduction in overheads)

Overheads for doing above I/O in percent (min-max for 3 runs) on an
ABIT BP6 with 366 MHz and 400 MHz Celerons:

Devices OS              UP              SMP
------- --              --              ---
c04     RELENG_4(4.9)   6.58-6.59       Not measured (method problems)
        -current(ointr) 9.65-9.76       6.77-7.11
        -current        10.64-10.69     6.09-6.36
        -current(apic)  9.63-9.90       As above (apic standard)
        -current(bde)   6.83-6.96       3.54-3.78
c04b    RELENG_4(4.9)   12.83-12.90     Not measured (method problems)
        -current(ointr) 19.42-19.44     13.70-13.90
        -current        20.23-20.24     12.01-12.48
        -current(apic)  17.77-17.89     As above (apic standard)
        -current(bde)   12.74-13.23     6.23-6.53
a01     RELENG_4(4.9)   7.50-7.50       Not measured (method problems)
        -current(ointr) 7.67-7.69       4.44-4.77
        -current        8.09-8.13       4.72-5.60
        -current(apic)  7.75-8.02       As above (apic standard)
        -current(bde)   7.53-7.63       4.49-4.54
        &+iir           7.09-7.30       Not measured (kernel problems)
        &+stream        6.23-6.24
        &+iir+stream    5.47-5.52
        &+intr0+iir     5.24-5.26       2.75-2.91
a01b    RELENG_4(4.9)   14.64-14.84     Not measured (method problems)
        -current(ointr) 14.36-15.10     8.65-8.92
        -current        14.79-14.87     8.18-9.77
        -current(apic)  14.80-14.91     As above (apic standard)
        -current(bde)   14.19-14.24     8.13-8.46
        &+iir           14.05-14.13
        &+stream        12.12-12.17
        &+iir+stream    10.58-10.62
        &+intr0+iir     10.07-10.12     5.10-5.63
a45     RELENG_4(4.9)   21.81-21.86     Not measured (method problems)
        -current(ointr) 24.00-24.04     13.3
        -current        25.13-25.20     31.4-31.5(86)
        -current(apic)  51.02-51.05(87) As above (apic standard)
        -current(bde)   21.83-22.02     10.71-10.89
        &+iir           21.98-22.05
        &+stream        27.78-27.81
        &+iir+stream    22.08-22.16
        &+intr0+iir     16.76-16.92     6.85-8.11
a45b    RELENG_4(4.9)   46.23-46.44(87) Not measured (method problems)
        -current(ointr) 54.01-54.37(86) 25.2 (82/82)
        -current        56.04-56.93(85) 70.1-70.7(80)
        -current(apic)  87.35-88.22(78) As above (apic standard)
        -current(bde)   42.06-42.12
Re: new interrupt code: panic when going multiuser
On Tue, 4 Nov 2003, John Baldwin wrote:

> On 04-Nov-2003 Bruce Evans wrote:
> >> > - on a BP6, UP kernels without apic work except for cyintr(), but SMP
> >> > kernels have problems with missing interrupts for ata devices and hang
> >> > at boot time.
> >>
> >> Is this related to the ata-lowlevel commit you mentioned above?
> >
> > No. It looks like the interrupt is really going missing for some
> > reason. This is without any acpica.
>
> What if you try a UP kernel with 'device apic' (i.e. no options SMP),
> do you still have ata problems? Is this on an SMP machine btw?

Yes, 'device apic' breaks the UP case in the same way that the new interrupt code breaks the SMP case. BP6's are SMP and mine used to mostly work, though not well enough to actually be worth using in SMP mode (it works faster in UP mode with its slowest CPU overclocked 42%; mismatched CPUs and thermal problems prevent significant overclocking in SMP mode).

Other bugs in the new interrupt code that I've noticed so far:
- lots of pessimizations. The main one is that the PIC is now masked and unmasked for fast interrupt handlers. The masking should be done at a higher level for all interrupt handlers so that it doesn't need to be undone in some cases, and neither masking nor unmasking should be done for fast interrupt handlers. This pessimization and others make fast interrupt handlers more non-fast than before. They are now slower than normal interrupt handlers in FreeBSD-[1-4]. They still have lower latency than normal interrupt handlers in FreeBSD-[1-4], but not as low as actual fast interrupt handlers.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: new interrupt code: panic when going multiuser
> > The following is without the local changes:
> > - cyintr(int unit) panics because it is passed a pointer to somewhere.
> > I think all compat_isa devices are broken for unit 0 because unit 0
> > is represented by a null pointer.
>
> Ah, ok. Yes, this is a semantic change. To try and support clock interrupts,
> a fast handler that passes a NULL argument will get a pointer to the intrframe
> as its argument. I got the idea via sparc64 from [EMAIL PROTECTED] Perhaps
> something can be faked up in the compat_isa shims to fix this.

Clock interrupt handlers have always been a nasty special case.

> Please try http://www.FreeBSD.org/~jhb/patches/isa_compat.patch

Will try later today. It should work, but adds yet more overhead.

> > - on a BP6, UP kernels without apic work except for cyintr(), but SMP
> > kernels have problems with missing interrupts for ata devices and hang
> > at boot time.
>
> Is this related to the ata-lowlevel commit you mentioned above?

No. It looks like the interrupt is really going missing for some reason. This is without any acpica.

Bruce
Re: new interrupt code: panic when going multiuser
On Tue, 4 Nov 2003, John Baldwin wrote:

> On 04-Nov-2003 Lukas Ertl wrote:
> > On Tue, 4 Nov 2003, Lukas Ertl wrote:
> >
> >> I somehow can't get at a good vmcore :-(. But I found out that the
> >> machine boots fine in "Safe Mode", where DMA and hw.ata.wc is turned off.
> >
> > Ok, if I set hw.ata.ata_dma=0 in loader.conf, it boots fine. Could there
> > be some issue with ATAng + new interrupt code?
>
> Can you provide a dmesg please? There may be a weird issue with
> some PPro's for example that I haven't been able to test.

I have noticed the following problems with the new interrupt code so far:
- it conflicts with a few thousand lines of local changes.
- yesterday's backup kernels which I preserved to run benchmarks with all hang at boot time while probing atapicam devices. Backing out rev.1.23 of ata-lowlevel.c fixes the hang, but I didn't back up yesterday's sources so it will take some work to regenerate working versions of yesterday's kernels.

The following is without the local changes:
- cyintr(int unit) panics because it is passed a pointer to somewhere. I think all compat_isa devices are broken for unit 0 because unit 0 is represented by a null pointer.
- on a BP6, UP kernels without apic work except for cyintr(), but SMP kernels have problems with missing interrupts for ata devices and hang at boot time.

Bruce
Re: NULL td passed to propagate_priority() when using xmms...
On Mon, 3 Nov 2003, John Baldwin wrote:

> On 01-Nov-2003 Soren Schmidt wrote:
> > It seems Sean Chittenden wrote:
> >> Howdy. I'm not sure if this is a ULE bug or a KSE bug, or both, but,
> >> for those interested (this is using ule 1.67, rebuilding world now),
> >> here's my stack. I couldn't figure out where td was being set to
> >> NULL. :( Oh! Where is TD_SET_LOCK defined? egrep -r didn't turn up
> >> anything. -sc
> >
> > Its not ULE, I'm running 4BSD and has gotten this on boot for over a
> > week now, rendering -current totally useless...
>
> Having a kernel panic with INVARIANTS on would really help narrow down
> where the bug is.

I found something that causes this bug fairly reliably:
- configure ddb so that db_print_backtrace() is called on panics.
- break the fd driver so that the panic() in fdstrategy() is called on floppy accesses.
- attempt to access a floppy so that fdstrategy() is called.
- db_print_backtrace() then does bad things. It never completes here, though it works in other contexts. Usually it prints only the first line or two. Then quite often ddb is called for a null pointer panic in propagate_priority().

More details about the null pointer panic: this seems to have nothing to do with scheduling. propagate_priority() is not called with a null td of course, but it sometimes follows a null m:

%%%
	/*
	 * Pick up the mutex that td is blocked on.
	 */
	m = td->td_blocked;
	MPASS(m != NULL);

	/*
	 * Check if the thread needs to be moved up on
	 * the blocked chain
	 */
	if (td == TAILQ_FIRST(&m->mtx_blocked)) {
		continue;
	}
%%%

I don't have invariants enabled, so MPASS(m != NULL) doesn't do anything, but m is null so attempting to load m->mtx_blocked causes a panic. For the backtrace context, propagate_priority() gets called for attempting to acquire a lock in softclock(). Tasks like the softclock task get scheduled despite the system being in panic(). ps seemed to show that the user process doing the floppy access no longer existed.
I don't know how that could happen, since the panic() is done in the context of that process.

More details about bugs in db_print_backtrace(): maybe the stack is messed up. Attempting to access invalid stack offsets can cause problems. My version of db_print_backtrace() has extra code to attempt not to access invalid offsets, but there is normally no problem since ddb's trap handler fixes up the problem. But backtrace() bogusly calls db_print_backtrace() in non-ddb context, and then the longjmp in the trap handler goes to hyperspace if anywhere.

Bugs tripped over while debugging this: putting a breakpoint in fdopen() didn't work, because fd.c:fdopen() conflicts with kern_descrip.c:fdopen(). This was broken in fd.c 1.259. There are hundreds of similar conflicts in GENERIC, some for obviously broken things like the same malloc type being static in several files.

Bruce
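The MPASS() behaviour described here is easy to demonstrate in a userland sketch. The macro definition, structures, and function below are illustrative stand-ins only (the real kernel macro panics via KASSERT rather than assert(3), and the real code has no fallback NULL check):

```c
#include <assert.h>
#include <stddef.h>

/* Without INVARIANTS, MPASS() compiles away to nothing. */
#ifdef INVARIANTS
#define	MPASS(ex)	assert(ex)
#else
#define	MPASS(ex)	do { } while (0)
#endif

/* Cut-down stand-ins for the real kernel structures. */
struct mtx { int mtx_blocked; };
struct thread { struct mtx *td_blocked; };

/*
 * Mimics the fragment quoted above: with INVARIANTS off, MPASS() is
 * inert, so only an explicit NULL check stands between a null
 * td_blocked and a dereference panic.
 */
static int
pick_up_blocked_mutex(struct thread *td)
{
	struct mtx *m = td->td_blocked;

	MPASS(m != NULL);	/* a no-op unless INVARIANTS is defined */
	if (m == NULL)
		return (-1);	/* the real propagate_priority() has no
				   such check; it just dereferences m */
	return (m->mtx_blocked);
}
```

Built without -DINVARIANTS, the assertion vanishes entirely, which is why the panic reported above shows up as a null pointer fault rather than as the intended diagnostic.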
Re: More ULE bugs fixed.
On Sun, 2 Nov 2003, Jeff Roberson wrote:

> On Sat, 1 Nov 2003, Bruce Evans wrote:
> > My simple make benchmark now takes infinitely longer with ULE under SMP,
> > since make -j 16 with ULE under SMP now hangs nfs after about a minute.
> > 4BSD works better. However, some networking bugs have developed in the
> > last few days. One of their manifestations is that SMP kernels always
> > panic in sbdrop() on shutdown.

This was fixed by setting debug.mpsafenet to 0 (fxp is apparently not MPSAFE yet). The last run with sched_ule.c 1.75 shows little difference between ULE and 4BSD:

% *** zqz.4bsd.1  Wed Oct 29 22:03:29 2003
% --- zqz.ule.3   Sun Nov  2 22:58:53 2003
% ***
% *** 4
% --- 5,6
% + ===> atm
% + ===> atm/sscop

The tree compiled by 4BSD is 4 days older so ULE does these extra.

% ***
% *** 227
% !     18.49 real      8.26 user       6.38 sys
% --- 229
% !     18.44 real      8.00 user       6.43 sys

Differences for "make obj" (all this in usr.bin tree).

% ***
% *** 229,233
% !    265 average shared memory size
% !    116 average unshared data size
% !    125 average unshared stack size
% !  23222 page reclaims
% !     26 page faults
% --- 231,235
% !    274 average shared memory size
% !    118 average unshared data size
% !    128 average unshared stack size
% !  22760 page reclaims
% !     25 page faults
% ***
% *** 236,241
% !    918 block output operations
% !   9893 messages sent
% !   9893 messages received
% !    230 signals received
% !  13034 voluntary context switches
% !   1216 involuntary context switches
% --- 238,243
% !    926 block output operations
% !   9973 messages sent
% !   9973 messages received
% !    232 signals received
% !  17432 voluntary context switches
% !   1583 involuntary context switches

Tiny differences in time -l output for obj stage, except ULE does more context switches. The signals are mostly SIGCHLD (needed to fix make(1)).

% ***
% *** 245
% --- 248,249
% + ===> atm
% + ===> atm/sscop
% ***
% *** 506
% !    126.67 real     57.42 user     43.83 sys
% --- 510
% !    124.43 real     58.07 user     42.17 sys
% ***
% *** 508,512
% !   1973 average shared memory size
% !    803 average unshared data size
% !    128 average unshared stack size
% ! 203770 page reclaims
% !   1459 page faults
% --- 512,516
% !   1920 average shared memory size
% !    784 average unshared data size
% !    127 average unshared stack size
% ! 203124 page reclaims
% !   1464 page faults
% ***
% *** 514,520
% !    165 block input operations
% !   1463 block output operations
% !  83118 messages sent
% !  83117 messages received
% !    265 signals received
% ! 100319 voluntary context switches
% !   8113 involuntary context switches
% --- 518,524
% !    167 block input operations
% !   1469 block output operations
% !  83234 messages sent
% !  83236 messages received
% !    267 signals received
% ! 125750 voluntary context switches
% !  17825 involuntary context switches

Similarly for depend stage.

% ***
% *** 524
% --- 529,530
% + ===> atm
% + ===> atm/sscop
% ***
% *** 701
% !    291.30 real    307.00 user     73.77 sys
% --- 707
% !    290.28 real    308.16 user     74.05 sys
% ***
% *** 703,707
% !   2073 average shared memory size
% !   2076 average unshared data size
% !    127 average unshared stack size
% ! 624020 page reclaims
% !    156 page faults
% --- 709,713
% !   2084 average shared memory size
% !   2056 average unshared data size
% !    128 average unshared stack size
% ! 626651 page reclaims
% !    154 page faults
% ***
% *** 709,715
% !     72 block input operations
% !   2122 block output operations
% !  45315 messages sent
% !  45317 messages received
% !    691 signals received
% ! 195785 voluntary context switches
% !  58130 involuntary context switches
% --- 715,721
% !     83 block input operations
% !   2133 block output operations
% !  45532 messages sent
% !  45524 messages received
% !    759 signals received
% ! 228998 voluntary context switches
% ! 128078 involuntary context switches

Similarly for the "all" stage. The benchmark was not run carefully enough for the 1 second differences in the times to be significant.

> You commented on the nice cutoff before. What do you believe the correct
> behavior is? In UL
Re: More ULE bugs fixed.
On Fri, 31 Oct 2003, Sam Leffler wrote: > On Friday 31 October 2003 09:04 am, Bruce Evans wrote: > > > My simple make benchmark now takes infinitely longer with ULE under SMP, > > since make -j 16 with ULE under SMP now hangs nfs after about a minute. > > 4BSD works better. However, some networking bugs have developed in the > > last few days. One of their manifestations is that SMP kernels always > > panic in sbdrop() on shutdown. > > I'm looking at something similar now. If you have a stack trace please send > it to me (along with any other info). You might also try booting > debug.mpsafenet=0. Turning off mpsafenet fixed all these problems. These console messages are with it not turned off. fxp is the only physical network device. %%% WARNING: loader(8) metadata is missing! [ preserving 869208 bytes of kernel symbol table ] Copyright (c) 1992-2003 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. 
FreeBSD 5.1-CURRENT #1005: Sun Nov 2 20:38:42 EST 2003
    [EMAIL PROTECTED]:/c/sysc/i386/compile/smp
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Pentium II/Pentium II Xeon/Celeron (400.91-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x665 Stepping = 5
  Features=0x183fbff
real memory = 268435456 (256 MB)
avail memory = 255369216 (243 MB)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
IOAPIC #0 intpin 17 -> irq 9
IOAPIC #0 intpin 18 -> irq 11
IOAPIC #0 intpin 19 -> irq 5
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): apic id: 0, version: 0x00040011, at 0xfee0
 cpu1 (AP): apic id: 1, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id: 2, version: 0x00170011, at 0xfec0
Pentium Pro MTRR support enabled
npx0: on motherboard
npx0: flags 0x80
npx0: INT 16 interface
pcibios: BIOS version 2.10
Using $PIR table, 8 entries at 0xc00fdef0
pcib0: at pcibus 0 on motherboard
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pci1: at device 0.0 (no driver attached)
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata0: [MPSAFE]
ata1: at 0x170 irq 15 on atapci0
ata1: [MPSAFE]
pci0: at device 7.2 (no driver attached)
piix0: port 0x5000-0x500f at device 7.3 on pci0
Timecounter "PIIX" frequency 3579545 Hz quality 0
pci0: at device 11.0 (no driver attached)
pci0: at device 11.1 (no driver attached)
fxp0: port 0xa400-0xa43f mem 0xea00-0xea0f,0xea104000-0xea104fff irq 9 at device 13.0 on pci0
fxp0: Ethernet address 00:90:27:99:02:99
miibus0: on fxp0
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: [MPSAFE]
puc0: port 0xb000-0xb01f,0xac00-0xac07,0xa800-0xa807 mem 0xea103000-0xea103fff,0xea102000-0xea102fff irq 5 at device 17.0 on pci0
sio4: on puc0
sio4: type 16550A
sio5: on puc0
sio5: type 16550A
atapci1: port 0xbc00-0xbcff,0xb800-0xb803,0xb400-0xb407 irq 11 at device 19.0 on pci0
atapci1: [MPSAFE]
ata2: at 0xb400 on atapci1
ata2: [MPSAFE]
atapci2: port 0xc800-0xc8ff,0xc400-0xc403,0xc000-0xc007 irq 11 at device 19.1 on pci0
atapci2: [MPSAFE]
ata3: at 0xc000 on atapci2
ata3: [MPSAFE]
orm0: at iomem 0xc8000-0xcbfff,0xc-0xc7fff on isa0
fdc0: at port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: at port 0x64,0x60 on isa0
atkbd0: flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: at port 0x3c0-0x3df iomem 0xa-0xb on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x90 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
cy0 at iomem 0xd4000-0xd5fff irq 10 on isa0
cy0: driver is using old-style compatibility shims
ppc0: at port 0x378-0x37f irq 7 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
ppbus0: on ppc0
ppbus0: IEEE1284 device found
Probing for PnP devices on ppbus0:
plip0: on ppbus0
lpt0: on ppbus0
lpt0: Interrupt-driven port
ppi0: on ppbus0
unknown: can't assign resources (port)
speaker0: at port 0x61 on isa0
unknown: can't assign resources (port)
unknown: can't assign resources (irq)
unknown: can't assign resources (port)
unknown: can't assign resources (port)
unknown: can't assign resources (port)
unknown: can't assign resources (port)
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via IOAPIC #0 intpin 2
Timecounters tick every 10.000 msec
ipfw2 initialized, divert enabled, rule-based forwarding enabled, default to accept, logging disabled
GEOM: create disk ad0 dp=0xc2999370
RE: lockmgr panic on shutdown
On Sun, 2 Nov 2003 [EMAIL PROTECTED] wrote:

> The obvious solution might be to change line 1161 of ffs_vfsops to
> pass vget() "curthread" rather than td. I assume there's a good
> reason why "thread0" is passed from boot(), but I can't see why
> that's of any use to the vnode locking.

Passing &thread0 in boot() is a quick (and not even wrong) fix for the problem that there is no valid current process^Wthread in the panic case. Long ago in Net/2 (still in Lite2 for at least the i386 version), sync() in boot() was passed the completely bogus parameters ((struct sigcontext *)0) (instead of (p, uap, retval)). This worked to the extent that sync()'s proc pointer was not passed further or not dereferenced. Now there are lots of locks, and since thread0 is never the correct lock holder, things work at most to the extent that sync()'s proc pointer is not passed further.

curthread is never null in -current, so upgrading to the version that passes it (i386/i386/machdep.c 1.111 (actually passes curproc)) would probably help in the non-panic case without increasing bugs for the panic case. However, passing curthread is still wrong for the panic case due to the following complications:
- panics may occur during context switches or in other critical regions when curthread is not quite current.
- under SMP, curthread is per-CPU, so having it non-null doesn't really help. Locks may be held by curproc's running on other CPUs, and in panic() it is difficult to handle the other CPUs correctly -- if you stop them then they won't be able to release their locks, and if you let them run they may run into you. Hopefully in the case of a normal shutdown all the other CPUs release their locks and stop before the sync().

Bruce
Re: More ULE bugs fixed.
On Fri, 31 Oct 2003, Jeff Roberson wrote:

> I have committed my SMP fixes. I would appreciate it if you could post
> update results. ULE now outperforms 4BSD in a single threaded kernel
> compile and performs almost identically in a 16 way make. I still have a
> few more things that I can do to improve the situation. I would expect
> ULE to pull further ahead in the months to come.

My simple make benchmark now takes infinitely longer with ULE under SMP, since make -j 16 with ULE under SMP now hangs nfs after about a minute. 4BSD works better. However, some networking bugs have developed in the last few days. One of their manifestations is that SMP kernels always panic in sbdrop() on shutdown.

> The nice issue is still outstanding, as is the incorrect wcpu reporting.

It may be related to nfs processes not getting any cycles even when there are no niced processes.

Bruce