Re: zfs problems after rebuilding system [SOLVED]
On Sat, 3 Mar 2018, tech-lists wrote:

On 03/03/2018 00:23, Dimitry Andric wrote:

Indeed. I have had the following for a few years now, due to USB drives with ZFS pools:

--- /usr/src/etc/rc.d/zfs	2016-11-08 10:21:29.820131000 +0100
+++ /etc/rc.d/zfs	2016-11-08 12:49:52.971161000 +0100
@@ -25,6 +25,8 @@
 zfs_start_main()
 {
+	echo "Sleeping for 10 seconds to let USB devices settle..."
+	sleep 10
 	zfs mount -va
 	zfs share -a
 	if [ ! -r /etc/zfs/exports ]; then

For some reason, USB3 (xhci) controllers can take a very, very long time to correctly attach mass storage devices: I usually see many timeouts before they finally get detected. After that, the devices always work just fine, though.

I have one that works for an old USB hard drive but never works for a not-so-old USB flash drive and a new SSD in a USB dock (just to check the SSD speed when handicapped by USB). Win7 has no problems with the xhci and USB flash drive combination, and FreeBSD has no problems with the drive on other systems.

Whether this is due to some sort of BIOS handover trouble, or due to cheap and/or crappy USB-to-SATA bridges (even with brand-name WD and Seagate disks!), I have no idea. I attempted to debug it at some point, but a well-placed "sleep 10" was an acceptable workaround... :)

That fixed it, thank you again :D

That won't work for the boot drive. When no boot drive is detected early enough, the kernel goes to the mountroot prompt. That seems to hold a Giant lock which inhibits further progress being made. Sometimes progress can be made by trying to mount unmountable partitions on other drives, but this usually goes too fast, especially if the USB drive often times out.

Bruce
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
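A fixed "sleep 10" either wastes time or is occasionally too short. The same idea can be made robust by polling for readiness with a timeout instead. A hedged userland sketch in Python (the real script is sh; `device_ready` here is a hypothetical stand-in for whatever probe applies, e.g. the pool's devices having appeared):

```python
import time

def wait_for(predicate, timeout=30.0, interval=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout expires.

    Returns True on success, False on timeout. This replaces a fixed
    "sleep 10" with "sleep only as long as actually needed".
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(min(interval, max(0.0, deadline - clock())))

# Example: a device that "settles" on the third probe.
probes = {"count": 0}
def device_ready():
    probes["count"] += 1
    return probes["count"] >= 3
```

With a generous timeout this behaves like the sleep in the slow case, but boots without USB drives attached no longer pay the full 10 seconds.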
RE: mpslsi0 : Trying sleep, but thread marked as sleeping prohibited
On Fri, 24 Feb 2012, Desai, Kashyap wrote:

From: Alexander Kabaev [mailto:kab...@gmail.com]
... sleep locks are by definition unbound. There is no spinning, no priority propagation. Holders are free to take, say, page faults and go on a long journey to disk and back, etc.

I understood your above lines.

Hardly the stuff _anyone_ would want to do from an interrupt handler, thread or otherwise.

So what the mps driver does in its interrupt handler is as below:

	mps_lock(sc);
	mps_intr_locked(data);
	mps_unlock(sc);

We hold the mtx lock in the interrupt handler and do a whole bunch of work (this is a bit lengthy) under it. It looks like the mps driver is misusing mtx_lock. Are we?

No. Most NIC drivers do this. Lengthy work isn't as long as it used to be, and here the lock only locks out other accesses to a single piece of hardware (provided sc is for a single piece of hardware, as it should be). Worry instead about more global locks, either in your driver or in upper layers. You might need one to lock your whole driver, and upper layers might need one to lock things globally too. Giant locking is an example of the latter. I don't trust the upper layers much, but for interrupt handling they can be trusted to not have anything locked when the interrupt handler is called (except for Giant locking when the driver requests this).

Also worry about your interrupt handler taking too long -- although nothing except interrupt thread priority prevents other code running, it is possible that other code doesn't get enough (or any) cycles if an interrupt handler is too hoggish. This problem is smaller than when there was a single ~1 MHz CPU doing PIO. With multiple ~2 GHz CPUs doing DMA, the interrupt handler can often be 100 times sloppier without anyone noticing. But not 1000 times, and not 100 times with certain hardware.

Bruce
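The pattern under discussion -- one mutex per softc held across the whole handler -- can be sketched in userland with threads (a toy model in Python; the real driver uses FreeBSD mtx(9), and the names here are only analogies). The point is that the lock serializes work per device, so handlers for two different devices never block each other:

```python
import threading

class SoftC:
    """Per-device state with its own lock (analogous to a driver softc)."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()   # plays the role of mps_lock/mps_unlock
        self.completed = 0

def intr_handler(sc, events):
    # Hold the per-device lock for the whole handler, as mps does.
    with sc.lock:
        for _ in range(events):
            sc.completed += 1          # stand-in for the real completion work

def run_interrupts(devices, events_per_device, rounds):
    """Fire `rounds` interrupts per device concurrently."""
    threads = [
        threading.Thread(
            target=lambda d=d: [intr_handler(d, events_per_device)
                                for _ in range(rounds)])
        for d in devices
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A single driver-wide or global lock would make the same code serialize across devices, which is the "more global locks" concern raised above.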
Re: SCHED_ULE should not be the default
On Wed, 14 Dec 2011, Ivan Klymenko wrote:

On Wed, 14 Dec 2011 00:04:42 +0100, Jilles Tjoelker <jil...@stack.nl> wrote:

On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:

If the ULE algorithm does not contain problems, then the problem is in the Core2Duo, or in a piece of code that uses the ULE scheduler. I already wrote to the mailing list that in my specific case (Core2Duo) the following patch partially helps:

--- sched_ule.c.orig	2011-11-24 18:11:48.0 +0200
+++ sched_ule.c	2011-12-10 22:47:08.0 +0200
...
@@ -2118,13 +2119,21 @@
 	struct td_sched *ts;

 	THREAD_LOCK_ASSERT(td, MA_OWNED);
+	if (td->td_pri_class & PRI_FIFO_BIT)
+		return;
+	ts = td->td_sched;
+	/*
+	 * We used up one time slice.
+	 */
+	if (--ts->ts_slice > 0)
+		return;

This skips most of the periodic functionality (long term load balancer, saving switch count (?), insert index (?), interactivity score update for long running thread) if the thread is not going to be rescheduled right now. It looks wrong, but it is a data point if it helps your workload.

Yes, I did it to delay for as long as possible the execution of the code in that section.

I don't understand what you are doing here, but recently noticed that the timeslicing in SCHED_4BSD is completely broken. This bug may be a feature. SCHED_4BSD doesn't have its own timeslice counter like ts_slice above. It uses `switchticks' instead. But switchticks hasn't been usable for this purpose since long before SCHED_4BSD started using it for this purpose. switchticks is reset on every context switch, so it is useless for almost all purposes -- any interrupt activity on a non-fast interrupt clobbers it. Removing the check of ts_slice in the above and always returning might give a similar bug to the SCHED_4BSD one. I noticed this while looking for bugs in realtime scheduling. In the above, returning early for PRI_FIFO_BIT also skips most of the periodic functionality.
In SCHED_4BSD, returning early is the usual case, so the PRI_FIFO_BIT might as well not be checked, and it is the unusual fifo scheduling case (which is supposed to only apply to realtime priority threads) which has a chance of working as intended, while the usual roundrobin case degenerates to an impure form of fifo scheduling (it is impure since priority decay still works, so it is only fifo among threads of the same priority).

...
@@ -2144,9 +2153,6 @@
 		if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
 			tdq->tdq_ridx = tdq->tdq_idx;
 	}
-	ts = td->td_sched;
-	if (td->td_pri_class & PRI_FIFO_BIT)
-		return;
 	if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
 		/*
 		 * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
 		sched_priority(td);
 	}
 	/*
-	 * We used up one time slice.
-	 */
-	if (--ts->ts_slice > 0)
-		return;
-	/*
 	 * We're out of time, force a requeue at userret().
 	 */
 	ts->ts_slice = sched_slice;

With the ts_slice check here, before you moved it, removing it might give buggy behaviour closer to SCHED_4BSD's.

and refusal to use options FULL_PREEMPTION

4-5 years ago, I found that any form of PREEMPTION was a pessimization for at least makeworld (since it caused too many context switches). PREEMPTION was needed for the !SMP case, at least partly because of the broken switchticks (switchticks, when it works, gives voluntary yielding by some CPU hogs in the kernel; PREEMPTION, if it works, should do this better). So I used PREEMPTION in the !SMP case and not for the SMP case. I didn't worry about the CPU hogs in the SMP case since it is rare to have more than 1 of them, and 1 will use at most 1/2 of a multi-CPU system.

But no one has replied to my letter saying whether my patch helps or not in the case of Core2Duo... There is a suspicion that the problems stem from the sections of code associated with SMP... Maybe I'm wrong in something, but I want to help in solving this problem...

The main point of SCHED_ULE is to give better affinity for multi-CPU systems.
But the `multi' apparently needs to be strictly more than 2 for it to break even.

Bruce
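The ts_slice accounting being moved around in the patch can be illustrated with a toy model (Python, my own simplification, not the kernel code): each clock tick decrements the running thread's slice, and only when the slice is used up is the thread requeued, giving round-robin among equal-priority threads.

```python
from collections import deque

SCHED_SLICE = 3   # ticks per slice, standing in for the kernel's sched_slice

def run_ticks(run_queue, slices, nticks):
    """Simulate nticks of sched_clock()-style slice accounting.

    run_queue: deque of thread names; the head is the running thread.
    slices: dict mapping thread -> remaining ticks in its current slice.
    Returns the thread that was running on each tick.
    """
    trace = []
    for _ in range(nticks):
        td = run_queue[0]
        trace.append(td)
        slices[td] -= 1              # like "if (--ts->ts_slice > 0) return;"
        if slices[td] > 0:
            continue                 # slice not used up: keep running
        slices[td] = SCHED_SLICE     # "ts->ts_slice = sched_slice;"
        run_queue.rotate(-1)         # out of time: requeue at the tail
    return trace
```

Removing the slice check (as the discussion suggests SCHED_4BSD effectively does, via the clobbered switchticks) would requeue on every tick; skipping the requeue entirely for PRI_FIFO_BIT threads gives the fifo behaviour described above.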
Re: mountd has resolving problems
On Thu, 17 Feb 2011, John Baldwin wrote:

On Thursday, February 17, 2011 7:18:28 am Steven Hartland wrote:

This has become an issue for us in 8.x as well. I'm pretty sure in pre-8.x these nfs mounts would simply background, but recently machines are now failing to boot. It seems that failure to look up nfs mount point hosts now causes this fatal error :( We've just tried Jeremy's netwait script and it works perfectly, so either this or something similar needs to get pushed into base. For reference, the reason we need a delay here is that our core Cisco router takes a while to bring the port up properly on boot. Thanks for sharing the script Jeremy :)

I use a similar hack that waits up to 30 seconds for the default gateway to be pingable. I think it is at least partly related to the new ARP code that now drops packets in IP output if the link is down.

I use a hackish ping with a timeout much smaller than 30 seconds (since even 2 seconds is annoying) and traceroutes in /etc/rc.d/netif. Don't know if it is the same problem. It affects mainly nfs and ntpdate/ntpd to local systems here, even with all-static routes.

This can be very problematic during boot since some interfaces take a few seconds to negotiate link, but the end result of the new check in IP output is that the attempt to send the packet fails with an error, causing gethostbyname() and getaddrinfo() to fail completely without doing any retries. In 7 the packet would either sit in the descriptor ring until link was up, or it would be dropped, but it would fail silently, so the resolver in libc would just retry in 30 seconds or so, at which time it would work fine.

It also happens after down/up to change something. If you try to use the network before it is back, then you have to wait much longer before it is really back. This is a relatively minor problem since down/up is not needed routinely.
Waiting for the default route to be pingable actually fixed a few other problems for us on 7 as well (often ntpdate would not work on boot and now it works reliably, etc.), so we went with that route.

I thought I first saw the problem a little earlier, and it affected bge more than fxp. Maybe the latter is correct, and the problem is smaller with fxp just because it is ready sooner.

Bruce
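The failure mode described -- gethostbyname()/getaddrinfo() failing instantly while link is still negotiating, with no retries -- is the sort of thing an application or boot helper can paper over with its own retry loop. A hedged sketch (Python; `resolve` is a stand-in for the real lookup, and the backoff parameters are arbitrary):

```python
import time

def resolve_with_retry(resolve, host, attempts=5, delay=1.0,
                       backoff=2.0, sleep=time.sleep):
    """Call resolve(host), retrying with exponential backoff on failure.

    Roughly recovers what the pre-8 behaviour gave for free, when
    packets sat in the descriptor ring until link came up and the libc
    resolver simply retried later.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return resolve(host)
        except OSError as exc:
            last_exc = exc
            if i + 1 < attempts:
                sleep(delay)
                delay *= backoff
    raise last_exc
```

This is a workaround at the wrong layer, of course -- the netwait script and gateway-ping hacks above attack the same race at boot time instead.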
Re: poll()-ing a pipe descriptor, watching for POLLHUP
On Wed, 3 Jun 2009, Kostik Belousov wrote:

On Wed, Jun 03, 2009 at 05:30:51PM +0300, Kostik Belousov wrote:

On Wed, Jun 03, 2009 at 04:10:34PM +0300, Vlad Galu wrote:

Hm, I was having an issue with an internal piece of software, but never checked what kind of pipe caused the problem. Turns out it was a FIFO, and I got bitten by the same bug described here: http://lists.freebsd.org/pipermail/freebsd-bugs/2006-March/017591.html The problem is that the reader process isn't notified when the writer process exits or closes the FIFO fd...

So you did find the relevant PR with a long audit trail and patches attached. You obviously should contact the author of the patches, Oliver Fromme, who has been a FreeBSD committer for some time (CCed). I agree that the thing shall be fixed finally.

Skimming over the patches in kern/94772, I have some doubts about the removal of the POLLINIGNEOF flag. The reason is that we generally do not remove exposed user interfaces.

Maybe, but this flag was not a documented interface, and too much ugliness might be required to preserve its behaviour bug-for-bug compatibly (the old buggy behaviour would probably be more wanted for compatibility than the strange behaviour given by this flag!).

I forward-ported Bruce's patch to CURRENT. It passes the tests from tools/regression/fifo and a test from kern/94772.

Thanks. I won't be committing it any time soon, so you should.

I rewrote the test programs extensively (enclosed at the end) in Oct 2007 and updated the kernel patches to match. Please run the new tests to see if you are missing anything important in the kernel part. If so, I will search for the kernel patches later (actually, now -- enclosed in the middle). I just ran them under RELENG_7 and unpatched -current and found no differences from the Oct 2007 version for RELENG_7 in the old test output.
The old test output is in the following subdirectories:

  4: FreeBSD-4
  7: FreeBSD-7
  l: Linux-2.6.10
  m: my version of FreeBSD-5.2, including patches for this problem.

AFAIR, the FreeBSD output in m is the same as the Linux output in all except a couple of cases where Linux select is inconsistent with itself and/or with Linux poll. However, the differences in the saved output are that the Linux output is mysteriously missing results for tests 5-8. The tests attempt to test certain race possibilities in a non-racy way. This is not easy, and the differences might be due to some races/states not occurring under Linux. POSIX finally specified the behaviour strictly enough for it to be possible to test it a couple of years ago. I didn't follow all the developments and forget the details, but it was close to the Linux behaviour.

For my liking, I did not remove POLLINIGNEOF.

diff --git a/sys/fs/fifofs/fifo_vnops.c b/sys/fs/fifofs/fifo_vnops.c
index 66963bc..7e279ca 100644
--- a/sys/fs/fifofs/fifo_vnops.c
+++ b/sys/fs/fifofs/fifo_vnops.c
@@ -226,11 +226,47 @@ fail1:
 	if (ap->a_mode & FREAD) {
 		fip->fi_readers++;
 		if (fip->fi_readers == 1) {
+			SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
+			if (fip->fi_writers > 0)
+				fip->fi_readsock->so_rcv.sb_state |=
+				    SBS_COULDRCV;

My current version is in fact completely different. It doesn't have SBS_COULDRCV, but uses a generation count. IIRC, this is the same method as is used in Linux, and is needed for the same reasons (something to do with keeping new connections separate from old ones). So I will try to enclose the components of the patch in the order of your diff (might miss some).
First one:

% Index: fifo_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/fifofs/fifo_vnops.c,v
% retrieving revision 1.100
% diff -u -2 -r1.100 fifo_vnops.c
% --- fifo_vnops.c	23 Jun 2004 00:35:50 -	1.100
% +++ fifo_vnops.c	17 Oct 2007 11:36:23 -
% @@ -36,4 +36,5 @@
%  #include <sys/fcntl.h>
%  #include <sys/file.h>
% +#include <sys/filedesc.h>
%  #include <sys/kernel.h>
%  #include <sys/lock.h>
% @@ -61,4 +62,5 @@
%  	long	fi_readers;
%  	long	fi_writers;
% +	int	fi_wgen;
%  };
%
% @@ -182,8 +184,11 @@
%  		struct ucred *a_cred;
%  		struct thread *a_td;
% +		int a_fdidx;
%  	} */ *ap;
%  {
%  	struct vnode *vp = ap->a_vp;
%  	struct fifoinfo *fip;
% +	struct file *fp;
% +	struct filedesc *fdp;
%  	struct thread *td = ap->a_td;
%  	struct ucred *cred = ap->a_cred;
% @@ -240,4 +245,10 @@
%  		}
%  	}
% +	fdp = td->td_proc->p_fd;
% +	FILEDESC_LOCK(fdp);
% +	fp = fget_locked(fdp, ap->a_fdidx);
% +	/* Abuse f_msgcount as a generation count. */
% +	fp->f_msgcount = fip->fi_wgen - fip->fi_writers;
% +	FILEDESC_UNLOCK(fdp);
%  }
% if
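The generation-count idea can be modelled without the kernel (a toy in Python; the field names are borrowed from the patch, but the exact semantics here are my reading of its intent, not the committed code): each writer open bumps a generation, a reader snapshots `wgen - writers` at open, and hangup is reported only when there are no writers and the generation has advanced past the reader's snapshot -- so a fresh reader never sees stale EOF from an earlier writer epoch.

```python
class Fifo:
    def __init__(self):
        self.writers = 0
        self.wgen = 0          # bumped on every writer open (cf. fi_wgen)

    def writer_open(self):
        self.writers += 1
        self.wgen += 1

    def writer_close(self):
        self.writers -= 1

class Reader:
    def __init__(self, fifo):
        self.fifo = fifo
        # Snapshot, as in "fp->f_msgcount = fip->fi_wgen - fip->fi_writers;"
        self.gen = fifo.wgen - fifo.writers

    def pollhup(self):
        # Hang up only if all writers are gone AND some writer has existed
        # since this reader's snapshot.
        return self.fifo.writers == 0 and self.fifo.wgen > self.gen
```

A new writer opening clears the hangup condition (writers > 0 again), which is the "keeping new connections separate from old ones" behaviour mentioned above.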
Re: HEADS UP: More CAM fixes.
On Tue, 17 Feb 2009, Gary Jennejohn wrote:

I tested this with an Adaptec 29160. I saw no real improvement in performance, but also no regressions. I suspect that the old disk I had attached just didn't have enough performance reserves to show an improvement. My test scenario was buildworld. Since /usr/src and /usr/obj were both on the one disk, it got a pretty good workout. This was on a low-end AMD64 X2 (2.5 GHz) with 4GB of RAM.

Buildworld hardly uses the disk at all. It reads and writes a few hundred MB. Ideally the i/o should go at disk speeds of 50-200 MB/s and thus take between 20 and 5 seconds. In practice it will take a few more seconds physically, but perhaps even less virtually due to parallelism.

Bruce
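The back-of-envelope arithmetic behind the "between 20 and 5 seconds" claim is just bytes over bandwidth; a quick check (assuming roughly 1000 MB of total traffic, my own round number for "a few hundred MB" of reads plus writes):

```python
def transfer_time_s(mbytes, mb_per_s):
    """Seconds to move mbytes at mb_per_s, ideal streaming with no seeks."""
    return mbytes / mb_per_s

# ~1000 MB total across the quoted 50-200 MB/s disk-speed range:
slow = transfer_time_s(1000, 50)    # slowest disk in the quoted range
fast = transfer_time_s(1000, 200)   # fastest disk in the quoted range
```

Either way the disk is busy for well under a minute of a multi-hour build, which is why a faster SCSI stack shows no buildworld improvement.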
Re: ZFS, NFS and Network tuning
On Thu, 29 Jan 2009, Brent Jones wrote:

On Wed, Jan 28, 2009 at 11:21 PM, Brent Jones br...@servuhome.net wrote: ...

The issue I am seeing is that for certain file types, the FreeBSD NFS client will either issue an ASYNC write, or an FSYNC. However, NFSv3 and v4 both support safe ASYNC writes in the TCP versions of the protocol, so that should be the default. Issuing FSYNCs for every complete block transmitted adds substantial overhead and slows everything down.

I use some patches (mainly for nfs write clustering on the server) by Bjorn Gronwall and some local fixes (mainly for vfs write clustering on the server, turning off excessive nfs[io]d daemons which get in each other's way due to poor scheduling, and things that only help for lots of small files), and see reasonable performance in all cases (~90% of disk bandwidth with all-async mounts, and half that with the client mounted noasync on an old version of FreeBSD; the client in -current is faster). Writing is actually faster than reading here. ...

My NFS mount command lines I have tried to get all data to ASYNC write:

$ mount_nfs -3T -o async 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/
$ mount_nfs -3T 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/
$ mount_nfs -4TL 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/

Also try -r16384 -w16384, and udp, and async on the server. I think block sizes default to 8K for udp and 32K for tcp. 8K is too small, and 32K may be too large (it increases latency for little benefit if the server fs block size is 16K). udp gives lower latency. async on the server makes little difference provided the server block size is not too small.

I have found a 4-year-old bug which may be related to this. cp uses mmap for small files (and I imagine lots of things use mmap for file operations) and causes slowdowns via NFS, due to the fsync data provided above.
http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/87792

mmap apparently breaks the async mount preference in the following code from vnode_pager.c:

% 	/*
% 	 * pageouts are already clustered, use IO_ASYNC to force a bawrite()
% 	 * rather then a bdwrite() to prevent paging I/O from saturating
% 	 * the buffer cache.  Dummy-up the sequential heuristic to cause
% 	 * large ranges to cluster.  If neither IO_SYNC or IO_ASYNC is set,
% 	 * the system decides how to cluster.
% 	 */
% 	ioflags = IO_VMIO;
% 	if (flags & (VM_PAGER_PUT_SYNC | VM_PAGER_PUT_INVAL))
% 		ioflags |= IO_SYNC;

This apparently gives lots of sync writes. (Sync writes are the default for nfs, but we mount with async to try to get async writes.)

% 	else if ((flags & VM_PAGER_CLUSTER_OK) == 0)
% 		ioflags |= IO_ASYNC;

nfs doesn't even support this flag. In fact, ffs is the only file system that supports it, and here is the only place that sets it. This might explain some slowness.

One of the bugs in vfs clustering that I don't have is related to this. IIRC, mounting the server with -o async doesn't work as well as it should because the buffer cache becomes congested with i/o that should have been sent to the disk. Some writes must be done async as explained above, but one place in vfs_cache.c is too aggressive in delaying async writes for file systems that are mounted async. This problem is more noticeable for nfs, at least with networks not much faster than disks, since it results in the client and server taking turns waiting for each other. (The names here are very confusing -- the async mount flag normally delays both sync and async writes for as long as possible, except for nfs, where it doesn't affect delays but asks for async writes instead of sync writes on the server, while the IO_ASYNC flag asks for async writes and thus often has the opposite sense to the async mount flag.)

% 	ioflags |= (flags & VM_PAGER_PUT_INVAL) ? IO_INVAL : 0;
% 	ioflags |= IO_SEQMAX << IO_SEQSHIFT;

Bruce
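The flag logic quoted above reduces to a small decision function. A Python rendition (the constant values are arbitrary stand-ins, not the kernel's; only the branch structure is taken from the quoted code) makes it easy to see why pageouts without VM_PAGER_CLUSTER_OK get IO_ASYNC, which only ffs honours:

```python
# Stand-in flag bits (values arbitrary; only their identity matters here).
IO_VMIO, IO_SYNC, IO_ASYNC = 0x1, 0x2, 0x4
VM_PAGER_PUT_SYNC, VM_PAGER_PUT_INVAL, VM_PAGER_CLUSTER_OK = 0x1, 0x2, 0x4

def pageout_ioflags(flags):
    """Mirror the ioflags selection from the quoted vnode_pager.c code."""
    ioflags = IO_VMIO
    if flags & (VM_PAGER_PUT_SYNC | VM_PAGER_PUT_INVAL):
        ioflags |= IO_SYNC           # forced synchronous write
    elif not (flags & VM_PAGER_CLUSTER_OK):
        ioflags |= IO_ASYNC          # bawrite() rather than bdwrite();
                                     # ignored by nfs, as noted above
    return ioflags
```

Only when VM_PAGER_CLUSTER_OK is set (and no sync flag) does "the system decide how to cluster", which is the path an async nfs mount actually wants.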
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Mark Fullmer wrote:

On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:

I still don't understand the original problem, that the kernel is not even preemptible enough for network interrupts to work (except in 5.2 where Giant breaks things). Perhaps I misread the problem, and it is actually that networking works but userland is unable to run in time to avoid packet loss.

The test is done with UDP packets between two servers. The em driver is incrementing the received packet count correctly, but the packet is not making it up the network stack. If the application was not servicing the socket fast enough, I would expect to see the "dropped due to full socket buffers" (udps_fullsock) counter incrementing, as shown by netstat -s.

I couldn't see any sign of PREEMPTION not working in 6.3-PRERELEASE. em seemed to keep up with the maximum rate that I can easily generate (640 kpps with tiny udp packets), though it cannot transmit at more than 400 kpps on the same hardware. This is without any syncer activity to cause glitches. The rest of the system couldn't keep up, and with my normal configuration of net.isr.direct=1, systat -ip (udps_fullsock) showed too many packets being dropped, but all the numbers seemed to add up right. (I didn't do end-to-end packet counts. I'm using ttcp to send and receive packets; the receiver loses so many packets that it rarely terminates properly, and when it does terminate it always shows many dropped.) However, with net.isr.direct=0, packets are dropped with no sign of the problem except a reduced count of good packets in systat -ip.

Packet rate counter        net.isr.direct=1    net.isr.direct=0
---------------------------------------------------------------------
netstat -I                 639042              643522 (faster later)
systat -ip (total rx)      639042              382567 (dropped many b4 here)
           (UDP total)     639042              382567
           (udps_fullsock) 298911              70340
(diff of prev 2)           340031              312227 (300+k always dropped)
net.isr.count              small               large (seems to be correct 643k)
net.isr.directed           large (correct?)    no change
net.isr.queued             0                   0
net.isr.drop               0                   0

net.isr.direct=0 is apparently causing dropped packets without even counting them. However, the drop seems to be below the netisr level.

More worryingly, with full 1500-byte packets (1472 data + 28 UDP header), packets can be sent at a rate of 76 kpps (nearly 950 Mbps) with a load of only 80% on the receiver, yet the ttcp receiver still drops about 1000 pps due to socket buffer full. With net.isr.direct=0 it drops an additional 700 pps due to this. Glitches from sync(2) taking 25 ms increase the loss by about 1000 packets, and using rtprio for the ttcp receiver doesn't seem to help at all.

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
#
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't have an option for this). With the default kern.ipc.maxsockbuf of 256K, this didn't seem to help. 20MB should work better :-) but I didn't try that. I don't understand how fast the socket buffer fills up, and would have thought that 256K was enough for tiny packets but not for 1500-byte packets.

There seems to be a general problem: 1 Gbps NICs have or should have rings of size >= 256 or 512 so that they aren't forced to drop packets when their interrupt handler has a reasonable but larger latency, yet if we actually use this feature then we flood the upper layers with hundreds of packets and fill up socket buffers etc. there.
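How fast the socket buffer fills is easy to estimate with a toy model (my own arithmetic, not a measurement; real sockbuf accounting also charges per-mbuf overhead, so effective capacity is lower than the raw byte count):

```python
def sockbuf_fill_ms(sockbuf_bytes, pkt_payload, excess_pps):
    """Milliseconds until the buffer fills, given how many packets per
    second arrive beyond what the application manages to drain."""
    pkts_capacity = sockbuf_bytes / pkt_payload
    return pkts_capacity / excess_pps * 1000.0

# 256 KB buffer, tiny payloads (say ~10 bytes), receiver ~300 kpps behind:
tiny = sockbuf_fill_ms(256 * 1024, 10, 300_000)
# Same buffer, 1472-byte payloads, receiver only ~1 kpps behind:
big = sockbuf_fill_ms(256 * 1024, 1472, 1_000)
```

Under these assumptions the buffer rides out only ~90 ms of deficit with tiny packets and ~180 ms with full-size ones, so any scheduling glitch longer than a fraction of a second overflows it either way.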
Bruce
Re: Packet loss every 30.999 seconds
On Fri, 28 Dec 2007, Bruce Evans wrote:

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
#
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't have an option for this). With the default kern.ipc.maxsockbuf of 256K, this didn't seem to help. 20MB should work better :-) but I didn't try that.

I've now tried this. With kern.ipc.maxsockbuf=2048 (~20MB) and an SO_RCVBUF of 0x1000000 (16MB), the socket-buffer-full lossage increases from ~300 kpps (~47%) to ~450 kpps (70%) with tiny packets. I think this is caused by most accesses to the larger buffer being cache misses -- since the system can't keep up, cache misses make it worse. However, with 1500-byte packets, the larger buffer reduces the lossage from 1 kpps in 76 kpps to precisely zero pps, at a cost of only a small percentage of system overhead (~20% idle to ~18% idle).

The above is with net.isr.direct=1. With net.isr.direct=0, the loss is too small to be obvious and is reported as 0, but I don't trust the report. ttcp's packet counts indicate losses of a few per million with direct=0 but none with direct=1. A `while :; do sync; sleep 0.1; done' loop in the background causes a loss of about 100 pps with direct=0 and a smaller loss with direct=1. Running the ttcp receiver at rtprio 0 doesn't make much difference to the losses.

Bruce
Re: Packet loss every 30.999 seconds
On Fri, 28 Dec 2007, Bruce Evans wrote:

On Fri, 28 Dec 2007, Bruce Evans wrote:

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
#
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I found where drops are recorded for the net.isr.direct=0 case. It is in net.inet.ip.intr_queue_drops. The netisr subsystem just calls IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up. _IF_DROP(ifq) just increments ifq->ifq_drops. The usual case for netisrs is for the queue to be ipintrq for NETISR_IP.

The following details don't help:

- drops for input queues don't seem to be displayed by any utilities (except that the ones for ipintrq are displayed primitively by sysctl net.inet.ip.intr_queue_drops). netstat and systat only display drops for send queues and ip frags.

- the netisr subsystem's drop count doesn't seem to be displayed by any utilities except sysctl. It only counts drops due to there not being a queue; other drops are counted by _IF_DROP() in the per-queue counter. Users have a hard time integrating all these primitively displayed drop counts with other error counters.

- the length of ipintrq defaults to the default ifq length of ipqmaxlen = IPQ_MAXLEN = 50. This is inadequate if there is just one NIC in the system that has an rx ring size of >= slightly less than 50. But 1 Gbps NICs should have an rx ring size of 256 or 512 (I think the size is 256 for em; it is 256 for bge due to bogus configuration of hardware that can handle it being 512).
If the larger hardware rx ring is actually used, then ipintrq drops are almost ensured in the direct=0 case, so using the larger h/w ring is worse than useless (it also increases cache misses). This is for just one NIC. This problem is often limited by handling rx packets in small bursts, at a cost of extra overhead. Interrupt moderation increases it by increasing burst sizes.

This contrasts with the handling of send queues. Send queues are per-interface, and most drivers increase the default length from 50 to their ring size (-1 for bogus reasons). I think this is only an optimization, while a similar change for rx queues is important for avoiding packet loss. For send queues, the ifq acts mainly as a primitive implementation of watermarks. I have found that tx queue lengths need to be more like 5000 than 50 or 500 to provide enough buffering when applications are delayed by other applications or just by sleeping until the next clock tick, and use tx queues of length ~2000 (a couple of clock ticks at HZ = 100), but now think queue lengths should be restricted to more like 50, since long queues cannot fit in L2 caches (not to mention that they are bad for latency).

The length of ipintrq can be changed using sysctl net.inet.ip.intr_queue_maxlen. Changing it from 50 to 1024 turns most or all ipintrq drops into socket-buffer-full drops (640 kpps input packets and 434 kpps socket buffer fulls with direct=0; 640 kpps input packets and 324 kpps socket buffer fulls with direct=1).

Bruce
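The interaction between rx-ring burst size and ipintrq length can be shown with a toy queue model (Python; the numbers are illustrative, not measurements): if the driver hands the netisr bursts near the ring size while a bounded queue drains between bursts, a 50-entry queue drops most of each 256-packet burst while a 1024-entry queue absorbs it.

```python
def run_bursts(qmaxlen, burst, drain, nbursts):
    """Feed nbursts bursts of `burst` packets into a queue bounded at
    qmaxlen, draining up to `drain` packets between bursts (the netisr
    run).  Returns (delivered, dropped); the drops are what _IF_DROP()
    would count against the queue."""
    qlen = delivered = dropped = 0
    for _ in range(nbursts):
        for _ in range(burst):
            if qlen < qmaxlen:
                qlen += 1
            else:
                dropped += 1         # queue full: packet lost at handoff
        served = min(qlen, drain)
        qlen -= served
        delivered += served
    return delivered, dropped
```

This is the arithmetic behind raising net.inet.ip.intr_queue_maxlen above the rx ring size: the drops don't disappear so much as move up to the socket buffer, where they are at least counted where users look.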
Re: Packet loss every 30.999 seconds
On Mon, 24 Dec 2007, Kostik Belousov wrote:

On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:

On Sat, 22 Dec 2007, Kostik Belousov wrote:

Ok, since you talked about this first :). I already made the following patch, but did not publish it since I still have not inspected all callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock. It shall be safe, but better to check. Also, I postponed the check until it was reported that yielding does solve the original problem.

Good. I'd still like to unobfuscate the function call.

What do you mean there?

Make the loop control and overheads clear by making the function call explicit, maybe by expanding MNT_VNODE_FOREACH() inline after fixing the style bugs in it. Later, fix the code to match the comment again by not making a function call in the usual case. This is harder.

Putting the count in the union seems fragile at best. Even if nothing can access the marker vnode, you need to context-switch its old contents while using it for the count, in case its old contents are used. Vnode-printing routines might still be confused.

Could you please describe what you mean by context-switch for the VMARKER?

Oh, I didn't notice that the marker vnode is out of band (a whole new vnode is malloced for each marker). The context switching would be needed if an ordinary active vnode that uses the union were used as a marker.

Bruce
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Kostik Belousov wrote:

On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:

I'm just an observer, and I may be confused, but it seems to me that this is motion in the wrong direction (at least, it's not going to fix the actual problem). As I understand the problem, once you reach a certain point, the system slows down *every* 30.999 seconds. Now, it's possible for the code to cause one slowdown as it cleans up, but why does it need to clean up so much 31 seconds later?

It is just searching for things to clean up, and doing this pessimally due to unnecessary cache misses and (more recently) the introduction of overheads for handling the case where the mount point is locked into the fast path where the mount point is not locked. The search every 30 seconds or so is probably more efficient, and is certainly simpler, than managing the list on every change to every vnode for every file system. However, it gives a high latency in non-preemptible kernels.

Why not find/fix the actual bug? Then work on getting the yield right if it turns out there's an actual problem for it to fix.

Yielding is probably the correct fix for non-preemptible kernels. Some operations just take a long time, but are low priority, so they can be preempted. This operation is partly under user control, since any user can call sync(2) and thus generate the latency every `latency' seconds. But this is no worse than a user generating even larger blocks of latency by reading huge amounts from /dev/zero. My old latency workaround for the latter (and other huge i/o's) is still sort of necessary, though it now works bogusly (hogticks doesn't work since it is reset on context switches to interrupt handlers; however, any context switch mostly fixes the problem).
My old latency workaround only reduces the latency to a multiple of 1/HZ, so a default of 200 ms, so it still is supposed to allow latencies much larger than the ones that cause problems here, but its bogus current operation tends to give latencies of more like 1/HZ which is short enough when HZ has its default misconfiguration to 1000. I still don't understand the original problem, that the kernel is not even preemptible enough for network interrupts to work (except in 5.2 where Giant breaks things). Perhaps I misread the problem, and it is actually that networking works but userland is unable to run in time to avoid packet loss. If the problem is that too much work is being done at a stretch and it turns out this is because work is being done erroneously or needlessly, fixing that should solve the whole problem. Doing the work that doesn't need to be done more slowly is at best an ugly workaround. Lots of necessary work is being done. Yes, rewriting the syncer is the right solution. It probably cannot be done quickly enough. If the yield workaround provide mitigation for now, it shall go in. I don't think rewriting the syncer just for this is the right solution. Rewriting the syncer so that it schedules actual i/o more efficiently might involve a solution. Better scheduling would probably take more CPU and increase the problem. Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH(). 
There are 4 places in vfs and 13 places in 6 file systems: % ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(xvp, mp, mvp) { % ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(vp, mp, mvp) { % ./ufs/ffs/ffs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) { % ./ufs/ffs/ffs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) { % ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./fs/msdosfs/msdosfs_vfsops.c:MNT_VNODE_FOREACH(vp, mp, nvp) { % ./fs/coda/coda_subr.c:MNT_VNODE_FOREACH(vp, mp, nvp) { % ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./kern/vfs_default.c: MNT_VNODE_FOREACH(vp, mp, mvp) { % ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) { % ./nfs4client/nfs4_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) { % ./nfsclient/nfs_subs.c: MNT_VNODE_FOREACH(vp, mp, nvp) { % ./nfsclient/nfs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) { Only file systems that support writing need it (for VOP_SYNC() and for MNT_RELOAD), else there would be many more places. There would also be more places if MNT_RELOAD support were not missing for some file systems. Bruce ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Kostik Belousov wrote: On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote: On Sat, 22 Dec 2007, Kostik Belousov wrote: Yes, rewriting the syncer is the right solution. It probably cannot be done quickly enough. If the yield workaround provide mitigation for now, it shall go in. I don't think rewriting the syncer just for this is the right solution. Rewriting the syncer so that it schedules actual i/o more efficiently might involve a solution. Better scheduling would probably take more CPU and increase the problem. I think that we can easily predict what vnode(s) become dirty at the places where we do vn_start_write(). This works for writes to regular files at most. There are also reads (for ffs, these set IN_ATIME unless the file system is mounted with noatime) and directory operations. By grepping for IN_CHANGE, I get 78 places in ffs alone where dirtying of the inode occurs or is scheduled to occur (ffs = /sys/ufs). The efficiency of marking timestamps, especially for atimes, depends on just setting a flag in normal operation and picking up coalesced settings of the flag later, often at sync time by scanning all vnodes. Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH(). There are 4 places in vfs and 13 places in 6 file systems: ... Only file systems that support writing need it (for VOP_SYNC() and for MNT_RELOAD), else there would be many more places. There would also be more places if MNT_RELOAD support were not missing for some file systems. Ok, since you talked about this first :). I already made the following patch, but did not published it since I still did not inspected all callers of MNT_VNODE_FOREACH() for safety of dropping mount interlock. It shall be safe, but better to check. Also, I postponed the check until it was reported that yielding does solve the original problem. Good. I'd still like to unobfuscate the function call. 
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
 	mtx_assert(MNT_MTX(mp), MA_OWNED);
 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+	if ((*mvp)->v_yield++ == 500) {
+		MNT_IUNLOCK(mp);
+		(*mvp)->v_yield = 0;
+		uio_yield();

Another unobfuscation is to not name this uio_yield().

+		MNT_ILOCK(mp);
+	}
 	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
 	while (vp != NULL && vp->v_type == VMARKER)
 		vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
 		struct socket *vu_socket;	/* v unix domain net (VSOCK) */
 		struct cdev *vu_cdev;		/* v device (VCHR, VBLK) */
 		struct fifoinfo *vu_fifoinfo;	/* v fifo (VFIFO) */
+		int vu_yield;			/* yield count (VMARKER) */
 	} v_un;

Putting the count in the union seems fragile at best. Even if nothing can access the marker vnode, you need to context-switch its old contents while using it for the count, in case its old contents is used. Vnode-printing routines might still be confused. Bruce
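The marker-vnode walk that the patch modifies can be modeled in userland. The sketch below is an assumption-laden simplification, not the kernel code: `marker_next()` stands in for `__mnt_vnode_next()`, the "yield" is a no-op hook, and all names except the TAILQ macros are invented here. The point it demonstrates is the one under discussion: because the marker stays on the list, the lock can be dropped mid-scan (every N nodes, as the `v_yield` patch does every 500) without losing our place.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct node {
	TAILQ_ENTRY(node) link;
	int is_marker;
	int value;
};
TAILQ_HEAD(nodehead, node);

/*
 * Return the marker's successor (skipping other markers) and move the
 * marker past it; NULL at end of list.  Models __mnt_vnode_next()
 * without the mount interlock.
 */
static struct node *
marker_next(struct nodehead *head, struct node *marker)
{
	struct node *np;

	np = TAILQ_NEXT(marker, link);
	while (np != NULL && np->is_marker)
		np = TAILQ_NEXT(np, link);
	TAILQ_REMOVE(head, marker, link);
	if (np == NULL)
		return (NULL);
	TAILQ_INSERT_AFTER(head, np, marker, link);
	return (np);
}

/*
 * Visit every real node via the marker, "yielding" every yield_every
 * nodes -- the spot where the patch drops the lock and calls
 * uio_yield().  Returns the sum of the values as a traversal check.
 */
int
sum_with_marker(struct nodehead *head, int yield_every)
{
	struct node marker = { .is_marker = 1 };
	struct node *np;
	int sum = 0, seen = 0;

	TAILQ_INSERT_HEAD(head, &marker, link);
	while ((np = marker_next(head, &marker)) != NULL) {
		if (++seen % yield_every == 0)
			;	/* drop lock + yield here */
		sum += np->value;
	}
	return (sum);
}
```

Because other scanners skip `is_marker` nodes (as the kernel skips VMARKER vnodes), several concurrent walks can each park a marker on the same list.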
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: I got an almost identical delay (with 64000 vnodes). Now, 17ms isn't much. Says you. On modern systems, trying to run a pseudo real-time application on an otherwise quiescent system, 17ms is just short of an eternity. I agree that the syncer should be preemptable (which is what my bandaid patch attempts to do), but that probably wouldn't have helped my specific problem since my application was a user process, not a kernel thread. FreeBSD isn't a real-time system, and 17ms isn't much for it. I saw lots of syscall delays of nearly 1 second while debugging this. (With another hat, I would say that 17 us was a long time in 1992. 17 us is hundreds of times longer now.) One more followup (I swear I'm done, really!)... I have a laptop here that runs at 150MHz when it is in the lowest running CPU power save mode. At that speed, this bug causes a delay of more than 300ms and is enough to cause loss of keyboard input. I have to switch into high speed mode before I try to type anything, else I end up with random typos. Very annoying. Yes, something is wrong if keystrokes are lost with CPUs that run at 150 kHz (sic) or faster. Debugging shows that the problem is like I said. The loop really does take 125 ns per iteration. This time is actually not very much. The linked list of vnodes could hardly be designed better to maximize cache thrashing. My system has a fairly small L2 cache (512K or 1M), and even a few words from the vnode and the inode don't fit in the L2 cache when there are 64000 vnodes, but the vp and ip are also fairly well designed to maximize cache thrashing, so L2 cache thrashing starts at just a few thousand vnodes.
My system has fairly low latency main memory, else the problem would be larger:

% Memory latencies in nanoseconds - smaller is better
%     (WARNING - may not be correct, check graphs)
% ---------------------------------------------------
% Host                 OS    Mhz   L1 $   L2 $    Main mem    Guesses
% --------- -------------  -----  -----  ------   --------    -------
% besplex.b FreeBSD 7.0-C   2205  1.361  5.6090       42.4    [PC3200 CL2.5 overclocked]
% sledge.fr FreeBSD 8.0-C   1802  1.666  8.9420       99.8
% freefall. FreeBSD 7.0-C   2778  0.746  6.6310      155.5

The loop makes the following memory accesses, at least in 5.2:

% loop:
% 	for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
% 		/*
% 		 * If the vnode that we are about to sync is no longer
% 		 * associated with this mount point, start over.
% 		 */
% 		if (vp->v_mount != mp)
% 			goto loop;
%
% 		/*
% 		 * Depend on the mntvnode_slock to keep things stable enough
% 		 * for a quick test.  Since there might be hundreds of
% 		 * thousands of vnodes, we cannot afford even a subroutine
% 		 * call unless there's a good chance that we have work to do.
% 		 */
% 		nvp = TAILQ_NEXT(vp, v_nmntvnodes);

Access 1 word at vp offset 0x90. Costs 1 cache line. IIRC, my system has a cache line size of 0x40. Assume this, and that vp is aligned on a cache line boundary. So this access costs the cache line at vp offsets 0x80-0xbf.

% 		VI_LOCK(vp);

Access 1 word at vp offset 0x1c. Costs the cache line at vp offsets 0-0x3f.

% 		if (vp->v_iflag & VI_XLOCK) {

Access 1 word at vp offset 0x24. Cache hit.

% 			VI_UNLOCK(vp);
% 			continue;
% 		}
% 		ip = VTOI(vp);

Access 1 word at vp offset 0xa8. Cache hit.

% 		if (vp->v_type == VNON || ((ip->i_flag &

Access 1 word at vp offset 0xa0. Cache hit. Access 1 word at ip offset 0x18. Assume that ip is aligned, as above. Costs the cache line at ip offsets 0-0x3f.

% 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
% 		    TAILQ_EMPTY(&vp->v_dirtyblkhd))) {

Access 1 word at vp offset 0x48. Costs the cache line at vp offsets 0x40-0x7f.

% 			VI_UNLOCK(vp);

Reaccess 1 word at vp offset 0x1c. Cache hit.

% 			continue;
% 		}

The total cost is 4 cache lines or 256 bytes per vnode.
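The per-vnode cache-line count above can be checked mechanically. The helper below is illustrative only (the 0x40 line size and line-aligned structures are the same assumptions made in the analysis): it maps each byte offset to a cache line and counts the distinct lines touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 0x40		/* line size assumed in the analysis */

/*
 * Count the distinct cache lines touched by a set of byte offsets
 * into one structure, assuming the structure starts on a line
 * boundary.  Offsets must be below 64 * CACHE_LINE.
 */
static int
lines_touched(const size_t *offsets, int n)
{
	bool seen[64] = { false };
	int i, count = 0;
	size_t line;

	for (i = 0; i < n; i++) {
		line = offsets[i] / CACHE_LINE;
		if (!seen[line]) {
			seen[line] = true;
			count++;
		}
	}
	return (count);
}
```

Feeding in the vp offsets from the walkthrough (0x90, 0x1c, 0x24, 0xa8, 0xa0, 0x48) gives 3 lines, plus 1 for the single ip access at offset 0x18: the 4 lines (256 bytes) per vnode quoted above.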
So with an L2 cache size of 1MB, the L2 cache will start thrashing at numvnodes = 4096. With thrashing, and at my main memory latency of 42.4 nsec, it might take 4*42.4 = 169.6 nsec to read main memory. This is similar to my observed time. Presumably things aren't quite that bad because there is some locality for the 3 lines in each vp. It might be possible to improve this a bit by accessing the lines sequentially and not interleaving the access to ip. Better, repack vp and move the IN* flags from ip to vp (a change that has other advantages), so that everything is in 1 cache line per vp. This isn't consistent with the delay increasing to 300 ms when the CPU is throttled -- memory shouldn't be
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, Mark Fullmer wrote: A little progress. I have a machine with a KTR enabled kernel running. Another machine is running David's ffs_vfsops.c's patch. I left two other machines (GENERIC kernels) running the packet loss test overnight. At ~ 32480 seconds of uptime the problem starts. This is really Try it with find / -type f >/dev/null to duplicate the problem almost instantly. marks are the intervals between test runs. The window of missing packets (timestamps between two packets where a sequence number is missing) is usually less than 4us, although I'm not sure gettimeofday() can be trusted for measuring this. See https://www.eng.oar.net/~maf/bsd6/p3.png gettimeofday() can normally be trusted to better than 1 us for time differences of up to about 1 second. However, gettimeofday() should not be used in any program written after clock_gettime() became standard in 1994. clock_gettime() has a resolution of 1 ns. It isn't quite that accurate on current machines, but I trust it to measure differences of 10 nsec between back-to-back clock_gettime() calls here.
Sample output from wollman@'s old clock-watching program converted to clock_gettime(): %%% 2007/12/05 (TSC) bde-current, -O2 -mcpu=athlon-xp min 238, max 99730, mean 240.025380, std 77.291436 1th: 239 (1203207 observations) 2th: 240 (556307 observations) 3th: 241 (190211 observations) 4th: 238 (50091 observations) 5th: 242 (20 observations) 2007/11/23 (TSC) bde-current min 247, max 11890, mean 247.857786, std 62.559317 1th: 247 (1274231 observations) 2th: 248 (668611 observations) 3th: 249 (56950 observations) 4th: 250 (23 observations) 5th: 263 (8 observations) 2007/05/19 (TSC) plain -current-noacpi min 262, max 286965, mean 263.941187, std 41.801400 1th: 264 (1343245 observations) 2th: 263 (626226 observations) 3th: 265 (26860 observations) 4th: 262 (3572 observations) 5th: 268 (8 observations) 2007/05/19 (TSC) plain -current-acpi min 261, max 68926, mean 279.848650, std 40.477440 1th: 261 (999391 observations) 2th: 320 (473325 observations) 3th: 262 (373831 observations) 4th: 321 (148126 observations) 5th: 312 (4759 observations) 2007/05/19 (ACPI-fast timecounter) plain -current-acpi min 558, max 285494, mean 827.597038, std 78.322301 1th: 838 (1685662 observations) 2th: 839 (136980 observations) 3th: 559 (72160 observations) 4th: 837 (48902 observations) 5th: 558 (31217 observations) 2007/05/19 (i8254) plain -current-acpi min 3352, max 288288, mean 4182.774148, std 257.977752 1th: 4190 (1423885 observations) 2th: 4191 (440158 observations) 3th: 3352 (65261 observations) 4th: 5028 (39202 observations) 5th: 5029 (15456 observations) %%% min here gives the minimum latency of a clock_gettime() syscall. The improvement from 247 nsec to 240 nsec in the mean due to -O2 -march-athlon-xp can be trusted to be measured very accurately since it is an average over more than 100 million trials, and the improvement from 247 nsec to 238 nsec for min can be trusted because it is consistent with the improvement in the mean. 
The program had to be converted to use clock_gettime() a few years ago when CPU speeds increased so much that the correct min became significantly less than 1 us. With gettimeofday(), it cannot distinguish between an overhead of 1 ns and an overhead of 1 us. For the ACPI and i8254 timecounter, you can see that the low-level timecounters have a low frequency clock from the large gaps between the observations. There is a gap of 279-280 ns for the acpi timecounter. This is the period of the acpi timecounter's clock (frequency 14318182/4 Hz = period 279.3651 ns). Since we can observe this period to within 1 ns, we must have a basic accuracy of nearly 1 ns, but if we make only 2 observations we are likely to have an inaccuracy of 279 ns due to the granularity of the clock. The TSC has a clock granularity of 6 ns on my CPU, and delivers almost that much accuracy with only 2 observations, but technical problems prevent general use of the TSC. Bruce
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote: Debugging shows that the problem is like I said. The loop really does take 125 ns per iteration. This time is actually not very much. The Considering that the CPU clock cycle time is on the order of 300ps, I would say 125ns to do a few checks is pathetic. As I said, 125 nsec is a short time in this context. It is approximately the time for a single L2 cache miss on a machine with slow memory like freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns). As I said, the code is organized so as to give about 4 L2 cache misses per vnode if there are more than a few thousand vnodes, so it is doing very well to take only 125 nsec for a few checks. In any case, it appears that my patch is a no-op, at least for the problem I was trying to solve. This has me confused, however, because at one point the problem was mitigated with it. The patch has gone through several iterations, however, and it could be that it was made to the top of the loop, before any of the checks, in a previous version. Hmmm. The patch should work fine. IIRC, it yields voluntarily so that other things can run. I committed a similar hack for uiomove(). It was easy to make syscalls that take many seconds (now tenths of seconds instead of seconds?), and without yielding or PREEMPTION or multiple CPUs, everything except interrupts has to wait for these syscalls. Now the main problem is to figure out why PREEMPTION doesn't work. I'm not working on this directly since I'm running ~5.2 where nearly-full kernel preemption doesn't work due to Giant locking. Bruce
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote: Try it with find / -type f >/dev/null to duplicate the problem almost instantly. FreeBSD used to have some code that would cause vnodes with no cached pages to be recycled quickly (which would have made a simple find ineffective without reading the files at least a little bit). I guess that got removed when the size of the vnode pool was dramatically increased. It might still. The data should be cached somewhere, but caching it in both the buffer cache/VMIO and the vnode/inode is wasteful. I may have been only caching vnodes for directories. I switched to using a find or a tar on /home/ncvs/ports since that has a very high density of directories. Bruce
Re: Packet loss every 30.999 seconds
On Thu, 20 Dec 2007, Bruce Evans wrote: On Wed, 19 Dec 2007, David G Lawrence wrote: Considering that the CPU clock cycle time is on the order of 300ps, I would say 125ns to do a few checks is pathetic. As I said, 125 nsec is a short time in this context. It is approximately the time for a single L2 cache miss on a machine with slow memory like freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns). As I said, Perfmon counts for the cache misses during sync(1); == /tmp/kg1/z0 == vfs.numvnodes: 630 # s/kx-dc-accesses 484516 # s/kx-dc-misses 20852 misses = 4% == /tmp/kg1/z1 == vfs.numvnodes: 9246 # s/kx-dc-accesses 884361 # s/kx-dc-misses 89833 misses = 10% == /tmp/kg1/z2 == vfs.numvnodes: 20312 # s/kx-dc-accesses 1389959 # s/kx-dc-misses 178207 misses = 13% == /tmp/kg1/z3 == vfs.numvnodes: 80802 # s/kx-dc-accesses 4122411 # s/kx-dc-misses 658740 misses = 16% == /tmp/kg1/z4 == vfs.numvnodes: 138557 # s/kx-dc-accesses 7150726 # s/kx-dc-misses 1129997 misses = 16% === I forgot to only count active vnodes in the above. vfs.freevnodes was small ( 5%). I set kern.maxvnodes to 20, but vfs.numvnodes saturated at 138557 (probably all that fits in kvm or main memory on i386 with 1GB RAM). With 138557 vnodes, a null sync(2) takes 39673 us according to kdump -R. That is 35.1 ns per miss. This is consistent with lmbench2's estimate of 42.5 ns for main memory latency. Watching vfs.*vnodes confirmed that vnode caching still works like you said: o find /home/ncvs/ports -type f only gives a vnode for each directory o a repeated find /home/ncvs/ports -type f is fast because everything remains cached by VMIO. FreeBSD performed very badly at this benchmark before VMIO existed and was used for directories o tar cf /dev/zero /home/ncvs/ports gives a vnode for files too. Bruce ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
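The 35.1 ns/miss figure above is just the null sync(2) time divided by the miss count at the largest vnode count; a trivial helper (name invented here) makes the arithmetic explicit:

```c
#include <assert.h>

/*
 * Nanoseconds per cache miss: syscall wall time (microseconds)
 * converted to ns, divided by the perfmon miss count.
 */
double
ns_per_miss(double sync_us, double misses)
{
	return (sync_us * 1000.0 / misses);
}
```

With the numbers quoted above (39673 us and 1129997 misses at 138557 vnodes) this comes out at about 35.1 ns, close to lmbench2's 42.5 ns main-memory latency estimate, supporting the claim that the loop is bounded by memory latency rather than instruction count.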
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote: The patch should work fine. IIRC, it yields voluntarily so that other things can run. I committed a similar hack for uiomove(). It was It patches the bottom of the loop, which is only reached if the vnode is dirty. So it will only help if there are thousands of dirty vnodes. While that condition can certainly happen, it isn't the case that I'm particularly interested in. Oops. When it reaches the bottom of the loop, it will probably block on i/o sometimes, so that the problem is smaller anyway. CPUs, everything except interrupts has to wait for these syscalls. Now the main problem is to figure out why PREEMPTION doesn't work. I'm not working on this directly since I'm running ~5.2 where nearly-full kernel preemption doesn't work due to Giant locking. I don't understand how PREEMPTION is supposed to work (I mean to any significant detail), so I can't really comment on that. Me neither, but I will comment anyway :-). I think PREEMPTION should even preempt kernel threads in favor of (higher priority of course) user threads that are in the kernel, but doesn't do this now. Even interrupt threads should have dynamic priorities so that when they become too hoggish they can be preempted even by user threads subject to the this priority rule. This is further from happening. ffs_sync() can hold the mountpoint lock for a long time. That gives problems preempting it. To move your fix to the top of the loop, I think you just need to drop the mountpoint lock every few hundred iterations while yielding. This would help for PREEMPTION too. Dropping the lock must be safe because it is already done while flushing. Hmm, the loop is nicely obfuscated and pessimized in current (see rev.1.234). The fast (modulo no cache misses) path used to be just a TAILQ_NEXT() to reach the next vnode, but now unnecessarily joins the slow path at MNT_VNODE_FOREACH(), and MNT_VNODE_FOREACH() hides a function call. 
Bruce
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, David G Lawrence wrote: While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated November 8, 2007 it looks like I've stumbled across a broken driver or kernel routine which stops interrupt processing long enough to severly degrade network performance every 30.99 seconds. I see the same behaviour under a heavily modified version of FreeBSD-5.2 (except the period was 2 ms longer and the latency was 7 ms instead of 11 ms when numvnodes was at a certain value. Now with numvnodes = 17500, the latency is 3 ms. I noticed this as well some time ago. The problem has to do with the processing (syncing) of vnodes. When the total number of allocated vnodes in the system grows to tens of thousands, the ~31 second periodic sync process takes a long time to run. Try this patch and let people know if it helps your problem. It will periodically wait for one tick (1ms) every 500 vnodes of processing, which will allow other things to run. However, the syncer should be running at a relative low priority and not cause packet loss. I don't see any packet loss even in ~5.2 where the network stack (but not drivers) is still Giant-locked. Other too-high latencies showed up: - syscons LED setting and vt switching gives a latency of 5.5 msec because syscons still uses busy-waiting for setting LEDs :-(. Oops, I do see packet loss -- this causes it under ~5.2 but not under -current. For the bge and/or em drivers, the packet loss shows up in netstat output as a few hundred errors for every LED setting on the receiving machine, while receiving tiny packets at the maximum possible rate of 640 kpps. sysctl is completely Giant-locked and so are upper layers of the network stack. The bge hardware rx ring size is 256 in -current and 512 in ~5.2. At 640 kpps, 512 packets take 800 us so bge wants to call the the upper layers with a latency of far below 800 us. I don't know exactly where the upper layers block on Giant. 
- a user CPU hog process gives a latency of over 200 ms every half a second or so when the hog starts up, and 300-400 ms after the hog has been running for some time. Two user CPU hog processes double the latency. Reducing kern.sched.quantum from 100 ms to 10 ms and/or renicing the hogs don't seem to affect this. Running the hogs at idle priority fixes this. This won't affect packet loss, but it might affect user network processes -- they might need to run at real time priority to get low enough latency. They might need to do this anyway -- a scheduling quantum of 100 ms should give a latency of 100 ms per CPU hog quite often, though not usually since the hogs should never be preferred to a higher-priority process. Previously I've used a less specialized clock-watching program to determine the syscall latency. It showed similar problems for CPU hogs. I just remembered that I found the fix for these under ~5.2 -- remove a local hack that sacrifices latency for reduced context switches between user threads. -current with SCHED_4BSD does this non-hackishly, but seems to have a bug somewhere that gives a latency that is large enough to be noticeable in interactive programs. Bruce
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, Mark Fullmer wrote: Thanks. Have a kernel building now. It takes about a day of uptime after reboot before I'll see the problem. Yes run find / /dev/null to see the problem if it is the syncer one. At least the syscall latency problem does seem to be this. Under ~5.2, with the above find and also while :; do sync; done (to give latency spike more often), your program (with some fflush(stdout)'s and args 1 7700) gives: % 1197976029041677 12696 0 % 1197976033196396 9761 4154719 % 1197976034060031 13360 863635 % 1197976039080632 13749 5020601 % 1197976043195594 8536 4114962 % 1197976044100601 13505 905007 % 1197976049121870 14562 5021269 % 1197976052195631 8192 3073761 % 1197976054141545 14024 1945914 % 1197976059162357 14623 5020812 % 1197976063195735 7830 4033378 % 1197976064182564 14618 986829 % 1197976069202982 14823 5020418 % 1197976074223722 15350 5020740 % 1197976079244311 15726 5020589 % 1197976084264690 15893 5020379 % 1197976089289409 15058 5024719 % 1197976094315433 16209 5026024 % 1197976095197277 8015 881844 % 1197976099335529 16092 4138252 % 1197976104356513 16863 5020984 % 1197976109376236 16373 5019723 % 1197976114396803 16727 5020567 % 1197976119416822 16533 5020019 % 1197976124437790 17288 5020968 % 1197976126200637 10060 1762847 % 1197976127198459 7839 997822 % 1197976129457321 16606 2258862 % 1197976134477582 16654 5020261 This clearly shows the spike every 5 seconds, and the latency creeping up as vfs.numvnodes increases. It started at about 2 and ended at about 64000. The syncer won't be fixed soon, so the fix for dropped packets requires figuring out why the syncer affects networking. Bruce ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, Scott Long wrote: Bruce Evans wrote: On Mon, 17 Dec 2007, David G Lawrence wrote: One more comment on my last email... The patch that I included is not meant as a real fix - it is just a bandaid. The real problem appears to be that a very large number of vnodes (all of them?) are getting synced (i.e. calling ffs_syncvnode()) every time. This should normally only happen for dirty vnodes. I suspect that something is broken with this check: if (vp->v_type == VNON || ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 && vp->v_bufobj.bo_dirty.bv_cnt == 0)) { VI_UNLOCK(vp); continue; } Isn't it just the O(N) algorithm with N quite large? Under ~5.2, on Right, it's a non-optimal loop when N is very large, and that's a fairly well understood problem. I think what DG was getting at, though, is that this massive flush happens every time the syncer runs, which doesn't seem correct. Sure, maybe you just rsynced 100,000 files 20 seconds ago, so the upcoming flush is going to be expensive. But the next flush 30 seconds after that shouldn't be just as expensive, yet it appears to be so. I'm sure it doesn't cause many bogus flushes. iostat shows zero writes caused by calling this incessantly using while :; do sync; done. This is further supported by the original poster's claim that it takes many hours of uptime before the problem becomes noticeable. If vnodes are never truly getting cleaned, or never getting their flags cleared so that this loop knows that they are clean, then it's feasible that they'll accumulate over time, keep on getting flushed every 30 seconds, keep on bogging down the loop, and so on.
Using find / /dev/null to grow the problem and make it bad after a few seconds of uptime, and profiling of a single sync(2) call to show that nothing much is done except the loop containing the above: under ~5.2, on a 2.2GHz A64 UP ini386 mode: after booting, with about 700 vnodes: % % cumulative self self total % time seconds secondscalls ns/call ns/call name % 30.8 0.0000.0000 100.00% mcount [4] % 14.9 0.0010.0000 100.00% mexitcount [5] % 5.5 0.0010.0000 100.00% cputime [16] % 5.0 0.0010.00061331213312 vfs_msync [18] % 4.3 0.0010.0000 100.00% user [21] % 3.5 0.0010.00051132111993 ffs_sync [23] after find / /dev/null was stopped after saturating at 64000 vnodes (desiredvodes is 70240): % % cumulative self self total % time seconds secondscalls ns/call ns/call name % 50.7 0.0080.0085 1666427 1667246 ffs_sync [5] % 38.0 0.0150.0066 1041217 1041217 vfs_msync [6] % 3.1 0.0150.0010 100.00% mcount [7] % 1.5 0.0150.0000 100.00% mexitcount [8] % 0.6 0.0150.0000 100.00% cputime [22] % 0.6 0.0160.000 34 2660 2660 generic_bcopy [24] % 0.5 0.0160.0000 100.00% user [26] vfs_msync() is a problem too. It uses an almost identical loop for the case where the vnode is not dirty (but has a different condition for being dirty). ffs_sync() is called 5 times because there are 5 ffs file systems mounted r/w. There is another ffs file system mounted r/o and that combined with a missing r/o optimization might give the extra call to vfs_msync(). With 64000 vnodes, the calls take 1-2 ms each. That is already quite a lot, and there are many calls. Each call only looks at vnodes under the mount point so the number of mounted file systems doesn't affect the total time much. ffs_sync() i taking 125 ns per vnode. That is a more than I would have expected. Bruce ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: Thanks. Have a kernel building now. It takes about a day of uptime after reboot before I'll see the problem. You may also wish to try to get the problem to occur sooner after boot on a non-patched system by doing a tar cf /dev/null / (note: substitute /dev/zero instead of /dev/null, if you use GNU tar, to disable its optimization). You can stop it after it has gone through 100K files. Verify by looking at sysctl vfs.numvnodes. Hmm, I said to use find /, but that is not so good since it only looks at directories and directories (and their inodes) are not packed as tightly as files (and their inodes). Optimized tar, or find / -type f, or ls -lR /, should work best, by doing not much more than stat()ing lots of files, while full tar wastes time reading file data. Bruce
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: I didn't say it caused any bogus disk I/O. My original problem (after a day or two of uptime) was an occasional large scheduling delay for a process that needed to process VoIP frames in real-time. It was happening every 31 seconds and was causing voice frames to be dropped due to the large latency causing the frame to be outside of the jitter window. I wrote a program that measures the scheduling delay by sleeping for one tick and then comparing the timeofday offset from what was expected. This revealed that every 31 seconds, the process was seeing a 17ms delay in scheduling. Further investigation found that 1) the I got an almost identical delay (with 64000 vnodes). Now, 17ms isn't much. Delays must have been much longer when CPUs were many times slower and RAM/vnodes were not so many times smaller. High-priority threads just need to be able to preempt the syncer so that they don't lose data (unless really hard real time is supported, which it isn't). This should work starting with about FreeBSD-6 (probably need options PREEMPT). It doesn't work in ~5.2 due to Giant locking, but I find Giant locking to rarely matter for UP. Old versions of FreeBSD were only able to preempt to non-threads (interrupt handlers) yet they somehow survived the longer delays. They didn't have Giant locking to get in the way, and presumably avoided packet loss by doing lots in interrupt handlers (hardware isr and netisr). I just remembered that I have seen packet loss even under -current when I leave out or turn off options PREEMPT. ... and it completely resolved the problem. Since the wait that I added is at the bottom of the loop and the limit is 500 vnodes, this tells me that every 31 seconds, there are a whole lot of vnodes that are being synced, when there shouldn't have been any (this fact wasn't apparent to me at the time, but when I later realized this, I had no time to investigate further).
My tests and analysis have all been on an otherwise quiet system (no disk I/O), so the bottom of the ffs_sync vnode loop should not have been reached at all, let alone tens of thousands of times every 31 seconds. All machines were uniprocessor, FreeBSD 6+. I don't know if this problem is present in 5.2. I didn't see ffs_syncvnode in your call graph, so it probably is not. I chopped to a flat profile with only top callers. Any significant calls from ffs_sync() would show up as top callers. I still have the data, and the call graph shows much more clearly that there was just one dirty vnode for the whole sync():

%                0.00    0.01       1/1         syscall [3]
% [4]    88.7    0.00    0.01       1               sync [4]
%                0.01    0.00       5/5             ffs_sync [5]
%                0.01    0.00       6/6             vfs_msync [6]
%                0.00    0.00       7/8             vfs_busy [260]
%                0.00    0.00       7/8             vfs_unbusy [263]
%                0.00    0.00       6/7             vn_finished_write [310]
%                0.00    0.00       6/6             vn_start_write [413]
%                0.00    0.00       1/1             vfs_stdnosync [472]
%
% ---
%
%                0.01    0.00       5/5         sync [4]
% [5]     50.7   0.01    0.00       5               ffs_sync [5]
%                0.00    0.00       1/1             ffs_fsync [278]
%                0.00    0.00       1/60            vget <cycle 1> [223]
%                0.00    0.00       1/60            ufs_vnoperatespec <cycle 1> [78]
%                0.00    0.00       1/26            vrele [76]

It passed the flags test just once to get to the vget(). ffs_syncvnode() doesn't exist in 5.2, and ffs_fsync() is called instead.

%
% ---
%
%                0.01    0.00       6/6         sync [4]
% [6]     38.0   0.01    0.00       6               vfs_msync [6]
%
% ---
% ...
%
%                0.00    0.00       1/1         ffs_sync [5]
% [278]    0.0   0.00    0.00       1               ffs_fsync [278]
%                0.00    0.00       1/1             ffs_update [368]
%                0.00    0.00       1/4             vn_isdisk [304]

This is presumably to sync the 1 dirty vnode. BTW I use noatime a lot, including for all file systems used in the test, so the tree walk didn't dirty any vnodes. A tar to /dev/zero would dirty all vnodes if everything were mounted without this option.

% ...
%
%    cumulative   self                self     total
%  time  seconds  seconds    calls  ns/call  ns/call  name
%  50.7    0.008    0.008        5  1666427  1667246  ffs_sync [5]
%  38.0    0.015    0.006        6  1041217  1041217  vfs_msync [6]
%   3.1    0.015    0.001        0  100.00%           mcount [7]
%   1.5    0.015    0.000
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, David G Lawrence wrote: One more comment on my last email... The patch that I included is not meant as a real fix - it is just a bandaid. The real problem appears to be that a very large number of vnodes (all of them?) are getting synced (i.e. calling ffs_syncvnode()) every time. This should normally only happen for dirty vnodes. I suspect that something is broken with this check:

	if (vp->v_type == VNON || ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
		VI_UNLOCK(vp);
		continue;
	}

Isn't it just the O(N) algorithm with N quite large? Under ~5.2, on a 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500 vnodes, which would be explained by the above (and the VI_LOCK() and loop overhead) taking 171 ns per vnode. I would expect it to take more like 20 ns per vnode for UP and 60 for SMP. The comment before this code shows that the problem is known, and says that a subroutine call cannot be afforded unless there is work to do, but the locking accesses look like subroutine calls, have subroutine calls in their internals, and take longer than simple subroutine calls in the SMP case even when they don't make subroutine calls. (IIRC, on A64 a minimal subroutine call takes 4 cycles while a minimal locked instruction takes 18 cycles; subroutine calls are only slow when their branches are mispredicted.) Bruce
Re: Float problem running i386 binary on amd64
On Fri, 16 Nov 2007, Peter Jeremy wrote: I've Cc'd bde@ because this relates to the FPU initialisation - which he is the expert on. On Thu, Nov 15, 2007 at 12:54:29PM +, Pete French wrote: On Fri, Nov 02, 2007 at 10:04:48PM +, Pete French wrote:

int main(int argc, char *argv[])
{
	if(atof("3.2") == atof("3.200"))
		puts("They are equal");
	else
		puts("They are NOT equal!");
	return 0;
}

Since the program as defined above does not include any prototype for atof(), its return value is assumed to be int. The i386 code for the comparison is therefore: Sorry, I didn't bother sticking the include lines in when I sent it to the mailing list as I assumed it would be obvious that you need to include the prototypes! OK, sorry for the confusion. Interestingly, if you recode like this:

	double x = atof("3.2");
	double y = atof("3.200");

	if(x == y)
		puts("They are equal");
	else
		puts("They are NOT equal!");

Then the problem goes away! Glancing at the assembly code they both appear to be doing the same thing as regards the comparison. Glance more closely. Behaviour like this should be expected on i386 but not on amd64. It gives the well-known property of the sin() function, that sin(x) != sin(x) for almost all x (!). It happens because expressions _may_ be evaluated in extra precision (this is perfectly standard), so identical expressions may sometimes be evaluated in different precisions even, or especially, if they are on the same line. atof(s) and sin(x) are expressions, so they may or may not be evaluated in extra precision. Certainly they may be evaluated in extra precision internally. Then when they return a result, C99 doesn't require discarding any extra precision. (It only requires a conversion if the type of the expression being returned is different from the return type. Then it requires a conversion as if by assignment, and such conversions _are_ required to discard any extra precision.
This gives the bizarre behaviour that, if a function returning double uses long double internally until the return statement so as to get extra precision, then it can only return double precision, since the return statement discards the extra precision, while if it uses double precision internally then it may return extra precision and the extra bits may even be correct.) The actual behaviour depends on implementation details and bugs. Programmers are supposed to get almost deterministic behaviour (with no _may_'s) by using casts or assignments to discard any extra precision. E.g., in functions that are declared as double, to actually return only double precision, use return ((double)(x + y)) instead of return (x + y), or assign the result to a double (maybe x += y; return (x);). However, this is completely broken for gcc on i386's. For gcc on i386's, casts and assignments _may_ actually work as required by C99. The -ffloat-store hack is often recommended for fixing problems in this area, but it only works for assignments; casts remain broken, and the results of expressions remain unpredictable and dependent on the optimization level because intermediate values _may_ retain extra precision depending on whether they are spilled to memory and perhaps on other things (spilling certainly removes extra precision). This has been intentionally broken for about 20 years now. It is hard to fix without pessimizing almost everything in much the same way as -ffloat-store. The pessimization is larger than it was 20 years ago since memory is relatively slower (though the stores now normally go to L1 caches which are very fast, they add a relatively large amount to pipeline latency) and register allocation is better. It is hard to write code that avoids the pessimization, since only code that uses very long expressions with no assignments to even register variables can avoid the stores. (Store+load to discard the extra precision is another implementation detail.
It is the fastest way, even if a value with extra precision is in a register.) To work around the gcc bugs, something like *(volatile double *)&x must be used to reduce double x; to actually be a double. The actual behaviour is fairly easy to describe for (f(x) == f(x)): amd64: if f() returns float, then the value is returned in the low quarter of an XMM register, so extra precision is automatically discarded and the results are equal except in exceptional cases (if f(x) is a NaN or varies due to internals in the function). Assignment of the result(s) to variables of any type work correctly and don't change the values since float is the lowest precision. if f() returns double, similarly except the value is returned in the low half of an XMM register, and assignment of the result(s) to variable(s) of type float would work correctly and
Re: Float problem running i386 binary on amd64
On Sat, 17 Nov 2007, Peter Jeremy wrote: On Sat, Nov 17, 2007 at 04:53:22AM +1100, Bruce Evans wrote: Behaviour like this should be expected on i386 but not on amd64. It gives the well-known property of the sin() function, that sin(x) != sin(x) for almost all x (!). It happens because expressions _may_ be evaluated in extra precision (this is perfectly standard), so identical expressions may sometimes be evaluated in different precisions even, or especially, if they are on the same line. Thank you for your detailed analysis. However, I believe you missed the critical point (I may have removed too much reference to the actual problem that Pete French saw): I can take a program that was statically compiled on FreeBSD/i386, run it in legacy (i386) mode on FreeBSD-6.3/amd64 and get different results. Another (admittedly contrived) example: ... Ah, that explains it. This was also a longstanding bug in the Linux emulator. linux_setregs() wasn't fixed to use the Linux npx control word until relatively recently (2005). Linux libraries used to set the control word in the C library (crt), which I think is the right place to initialize it since the correct initialization may depend on the language, so the bug wasn't so obvious at first. This is identical code being executed in supposedly equivalent environments giving different results. I believe the fix is to initialise the FPU using __INITIAL_NPXCW__ in ia32_setregs(), though I'm not sure how difficult this is in reality. Yes, that is the right fix. It is moderately difficult to do correctly. linux_setregs() now just uses fldcw(&control) where control = __LINUX_NPXCW__. This depends on bugs to work, since direct accesses to the FPU in the kernel are not supported. They cause a DNA trap which should be fatal. amd64 is supposed to print a message about this error, but it apparently doesn't, else log files would be fuller. i386 doesn't even print a message. npxdna() and fpudna() check related invariants but not this one.
Correct code would do something like {fpu,npx}init(control) to initialize the control word. setregs() in RELENG_[1-4] does exactly that -- npxinit() hides the complications. Now {fpu,npx}init() is only called once or twice at boot time for each CPU, and the complications are a little larger since most initialization is delayed until the DNA trap ({fpu,npx}init() now mainly sets up a copy of the initial FPU state in memory for the trap handler to load later, and it cannot set up per-thread state since the copy in memory is a global default). The complications for delayed initialization are mainly to optimize switching of the FPU state for signal handling, but are also used for exec. Another complication here is that signal handlers should be given the default control word. This is much more broken than for setregs:

- there are sysent hooks for sendsig and sigreturn, but none for setting registers in sendsig.
- all FreeBSD sendsig's end up using the global default initial FPU state (if they support switching the FPU state at all).
- all Linux sendsig's are missing support for switching the FPU state.
- suppose that the initial FPU (or even CPU) state is language-dependent and this is implemented mainly in the language runtime startup. sendsig's would have a hard time determining the languages' defaults so as to set them. The languages would need to set the defaults in signal trampolines.

Bruce
Re: em watchdogs - OS involvement
On Tue, 30 Oct 2007, Jack Vogel wrote: Another bit of data, if I define DEVICE_POLLING on the Oct. snap it also will work. Defining DEVICE_POLLING (globally) breaks configuration of fast interrupt handlers in em. I have to #undef it to test fast interrupt handlers in em without losing testing of polling in other network drivers. I lose only testing of polling in em. Bruce
Re: [ANN] 8-CURRENT, RELENG_7 and RELENG_6 have gotten latest unionfs improvements
On Wed, 24 Oct 2007, Oliver Fromme wrote: Dmitry Marakasov [EMAIL PROTECTED] wrote: I was told long time ago that -ounion is even more broken than unionfs. That's wrong. The union mount option was _never_ really broken. I'm using it for almost as long as FreeBSD exists. I recently noticed the following bugs in -ounion (which I've never used for anything except testing): (1) It is broken for all file systems except ffs and ext2fs, since all (?) file systems now use nmount(2) and only these two file systems have union in their mount options list. It is still in the global options list in mount/mntopts.h, but this is only used with mount(2). The global options list in mount/mntopts.h has many bogus non-global options, and even the global options list in kern/vfs_mount.c has some bogus non-global options, but union actually is a global option. ext2fs loves union more than ffs -- although its options list is less disordered than ffs's, it has enough disorder to have 2 copies of union. (2) After fixing (1) by not using nmount(2), following of symlinks works strangely for at least devfs: (a) a link foo -> zero (where zero doesn't exist in the underlying file system) doesn't work. mount(1) says that the lookup is done in the mounted file system first. (b) a link foo -> ./zero works. This is correct. Now I wonder if it would work if zero existed only in the underlying file system. Have you noticed these bugs? (2) is presumably old. Bruce
Re: rm(1) bug, possibly serious
On Tue, 25 Sep 2007, LI Xin wrote: I think this is a bug, here is a fix obtained from NetBSD. This bug, if any, cannot be fixed in rm. The reasoning (from NetBSD's rm.c,v 1.16): Bugs can easily be added to rm. Strip trailing slashes of operands in checkdot(). POSIX.2 requires that if . or .. are specified as the basename portion of an operand, a diagnostic message be written to standard error, etc. Note that POSIX only requires this for the rm utility. (See my previous mail about why this is bogus.) Pathname resolution and a similarly bogus restriction on rmdir(2) requires some operations with dot or dot-dot to fail, and any utility that uses these operations should then print a diagnostic, etc. We strip the slashes because POSIX.2 defines basename as the final portion of a pathname after trailing slashes have been removed. POSIX says "the basename portion of the operand" (that is, the final pathname component). This doesn't mean the operand mangled by basename(3). This also makes rm perform actions equivalent to the POSIX.1 rmdir() and unlink() functions when removing directories and files, even when they do not follow POSIX.1's pathname resolution semantics (which require trailing slashes be ignored). Which POSIX.1? POSIX.1-2001 actually requires that trailing slashes "shall be resolved as if a single dot character were appended to the pathname". This is completely different from removing the slash:

	rm regular file/		# ENOTDIR
	rm regular file			# success unless ENOENT etc.
	rm directory/			# success...
	rm directory			# EISDIR
	rm symlink to regular file/	# ENOTDIR
	rm symlink to regular file	# success (removes symlink)
	rm symlink to directory/	# EISDIR
	rm symlink to directory		# success (removes symlink)
	rmdir ...			# reverse most of above

Anyway, mangling the operands makes the utilities perform actions different from the functions.
The problem case is rm -r symlink to directory/ which asks for removing the directory pointed to by the symlink and all its contents, and is useful -- you type the trailing slash if you want to ensure that the removal is as recursive as possible. With breakage of rmdir(2) to POSIX spec, this gives removal of the contents of the directory pointed to by the symlink and then fails to remove the directory. With breakage as in NetBSD, this gives removal of the symlink only. If nobody complains about this I will request for commit approval from [EMAIL PROTECTED] ++ Bruce
Re: Panic in 6.2-PRERELEASE with bge on amd64
On Wed, 10 Jan 2007, Sven Willenberger wrote: Bruce Evans presumably uttered the following on 01/09/07 21:42: Also look at nearby chain entries (especially at (rxidx - 1) mod 512). I think the previous 255 entries and the rxidx one should be non-NULL since we should have refilled them as we used them (so the one at rxidx is least interesting since we certainly just refilled it), and the next 256 entries should be NULL since we bogusly only use half of the entries. If the problem is uninitialization, then I expect all 512 entries except the one just refilled at rxidx to be NULL.

(kgdb) p sc->bge_cdata.bge_rx_std_chain[rxidx]
$1 = (struct mbuf *) 0xff0097a27900
(kgdb) p rxidx
$2 = 499

since rxidx = 499, I assume you are most interested in 498:

(kgdb) p sc->bge_cdata.bge_rx_std_chain[498]
$3 = (struct mbuf *) 0xff00cf1b3100

for the sake of argument, 500 is null:

(kgdb) p sc->bge_cdata.bge_rx_std_chain[500]
$13 = (struct mbuf *) 0x0

the indexes with values basically are 243 through 499:

(kgdb) p sc->bge_cdata.bge_rx_std_chain[241]
$30 = (struct mbuf *) 0x0
(kgdb) p sc->bge_cdata.bge_rx_std_chain[242]
$31 = (struct mbuf *) 0x0
(kgdb) p sc->bge_cdata.bge_rx_std_chain[243]
$32 = (struct mbuf *) 0xff005d4ab700
(kgdb) p sc->bge_cdata.bge_rx_std_chain[244]
$33 = (struct mbuf *) 0xff004f644b00

so it does not seem to be a problem with uninitialization. There are supposed to be only 256 nonzero entries (except briefly while one is being refreshed), but the above indicates that there are 257: #243 through #499 gives 257 nonzero entries. Everything indicates that entry #499 was null before it was refreshed, and that the loop in bge_rxeof() is trying to process a descriptor 1 after the last valid (previously handled) descriptor. I cannot see why it might do this. The next step might be to add active debugging code:

- check that m != NULL when m is taken off the rx chain (before refreshing its entry), and panic if it is.
- check that there are always BGE_SSLOTS (256) nonzero mbufs in the std rx chain. It would be interesting to know if they are always contiguous. They might not be since this depends on how the hardware uses them. Debugging is simpler if they are.
- check that bge_rxeof() is not reentered.
- check the rx producer index and related data before and after getting a null m. It can easily change while bge_rxeof() is running, so recording its value before and after might be useful.

Bruce
Re: Panic in 6.2-PRERELEASE with bge on amd64
On Tue, 9 Jan 2007, John Baldwin wrote: On Tuesday 09 January 2007 09:37, Sven Willenberger wrote: On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote: Oops. I should have asked for the statement in bge_rxeof().

#7 0x801d5f17 in bge_rxeof (sc=0x8836b000) at /usr/src/sys/dev/bge/if_bge.c:2528
2528	m->m_pkthdr.len = m->m_len = cur_rx->bge_len - ETHER_CRC_LEN;

(where m is defined as: 2449 struct mbuf *m = NULL; ) It's assigned earlier in between those two places. Its initialization here is just a style bug. Can you 'p rxidx' as well as 'p sc->bge_cdata.bge_rx_std_chain[rxidx]' and 'p sc->bge_cdata.bge_rx_jumbo_chain[rxidx]'? Also, are you using jumbo frames at all? Also look at nearby chain entries (especially at (rxidx - 1) mod 512). I think the previous 255 entries and the rxidx one should be non-NULL since we should have refilled them as we used them (so the one at rxidx is least interesting since we certainly just refilled it), and the next 256 entries should be NULL since we bogusly only use half of the entries. If the problem is uninitialization, then I expect all 512 entries except the one just refilled at rxidx to be NULL. Bruce
Re: Panic in 6.2-PRERELEASE with bge on amd64
On Mon, 8 Jan 2007, Sven Willenberger wrote: On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote: On Sun, 7 Jan 2007, Sven Willenberger wrote: The short and dirty of the dump: ...

--- trap 0xc, rip = 0x801d5f17, rsp = 0xb371ab50, rbp = 0xb371aba0 ---
bge_rxeof() at bge_rxeof+0x3b7

What is the instruction here? I will do my best to ferret out the information you need. For the bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:

0x801d5f17 bge_rxeof+951: mov %r15,0x28(%r14)

... Looks like a null pointer panic anyway. I guess the instruction is movl to/from 0x28(%reg) where %reg is a null pointer. from the above lines, apparently %r14 is null then. Yes. It's a bit surprising that the access is a write. ...

#8 0x801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707

What is the statement here? It presumably follows a null pointer and only the expression for the pointer is interesting. xsc is already null but that is probably a bug in gdb, or the result of excessive optimization. Compiling kernels with -O2 has little effect except to break debugging. the block of code from if_bge.c:

2705	if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
2706		/* Check RX return ring producer/consumer. */
2707		bge_rxeof(sc);
2708
2709		/* Check TX ring producer/consumer. */
2710		bge_txeof(sc);
2711	}

Oops. I should have asked for the statement in bge_rxeof(). By default -O2 is passed to CC (I don't use any custom make flags other than and only define CPUTYPE in my /etc/make.conf). -O2 is unfortunately the default for COPTFLAGS for most arches in sys/conf/kern.pre.mk. All of my machines and most FreeBSD cluster machines override this default in /etc/make.conf. With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(), so your environment must be a little different to get even the above info. I think gdb can show the correct line numbers but not the call frames (since there is no call). ddb and the kernel stack trace can only show the call frames for actual calls.
With -O1, I couldn't find any instruction similar to the mov to the null pointer + 28. 28 is a popular offset in mbufs. The short of it is that this interface sees pretty much non-stop traffic as this is a mailserver (final destination) and is constantly being delivered to (direct disk access) and mail being retrieved (remote machine(s) with nfs mounted mail spools). If a momentary down of the interface is enough to completely panic the driver and then the kernel, this hardly seems robust if, in fact, this is what is happening. So the question arises as to what would be causing the down/up of the interface; I could start looking at the cable, the switch it's connected to and ... any other ideas? (I don't have watchdog enabled or anything like that, for example). I don't think down/up can occur in normal operation, since it takes ioctls or a watchdog timeout to do it. Maybe some ioctls other than a full down/up can cause problems... bge_init() is called for the following ioctls:

- mtu changes
- some near down/up (possibly only these)

Suspend/resume and of course detach/attach do much the same things as down/up. BTW, I added some sysctls and found it annoying to have to do down/up to make the sysctls take effect. Sysctls in several other NIC drivers require the same, since doing a full reinitialization is easiest. Since I am tuning using sysctls, I got used to doing down/up too much. Similarly for the mtu ioctl. I think a full reinitialization is used for mtu changes mainly in case the change switches on/off support for jumbo buffers. Then there is a lot of buffer reallocation to be done, and interfaces have to be stopped to ensure that the buffers being deallocated are not in use, etc. Bruce
Re: Panic in 6.2-PRERELEASE with bge on amd64
On Sun, 7 Jan 2007, Sven Willenberger wrote: I am starting a new thread on this as what I had assumed was a panic in nfsd turns out to be an issue with the bge driver. This is an amd64 box, dual processor (SMP kernel) that happens to be running nfsd. About every 3-5 days the kernel panics and I have finally managed to get a core dump. The system: FreeBSD 6.2-PRERELEASE #8: Tue Jan 2 10:57:39 EST 2007 Like most NIC drivers, bge unlocks and re-locks around its call to ether_input() in its interrupt handler. This isn't very safe, and it certainly causes panics for bge. I often see it panic when bringing the interface down and up while input is arriving, on a non-SMP non-amd64 (actually i386) non-6.x (actually -current) system. Bringing the interface down is probably the worst case. It creates a null pointer for bge_intr() to follow. The short and dirty of the dump: ...

--- trap 0xc, rip = 0x801d5f17, rsp = 0xb371ab50, rbp = 0xb371aba0 ---
bge_rxeof() at bge_rxeof+0x3b7

What is the instruction here?

bge_intr() at bge_intr+0x1c8
ithread_loop() at ithread_loop+0x14c
fork_exit() at fork_exit+0xbb
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xb371ad00, rbp = 0 ---
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x28

Looks like a null pointer panic anyway. I guess the instruction is movl to/from 0x28(%reg) where %reg is a null pointer. ...

#8 0x801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707

What is the statement here? It presumably follows a null pointer and only the expression for the pointer is interesting. xsc is already null but that is probably a bug in gdb, or the result of excessive optimization. Compiling kernels with -O2 has little effect except to break debugging. I rarely use gdb on kernels and haven't looked closely enough using ddb to see where the null pointer for the panic on down/up came from. BTW, the sbdrop panic in -current isn't bge-only or SMP-only.
I saw it once for sk on a non-SMP system. It rarely happens for non-SMP (much more rarely than the panic in bge_intr()). Under -current, on an SMP amd64 system with bge, it happens almost every time on close of the socket for a ttcp server if input is arriving at the time of the close. I haven't seen it for 6.x. Bruce
Re: kqueue LOR
On Tue, 12 Dec 2006, Kostik Belousov wrote: On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote: Is the mount lock really required, if all we're doing is a single read of a single word (mnt_kern_flags) (v_mount should be read-only for the whole lifetime of the vnode, I believe)? After all, reads of a single word are atomic on all our supported architectures. The only situation I see where there MIGHT be problems are forced unmounts, but I think there are bigger issues with those. Sorry for noticing this email only now. The problem is real with snapshotting. Ignoring MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode inactivation time. This was the big trouble with nfsd and snapshots. As such, I think that precise value of mnt_kern_flag is critical there, and mount interlock is needed. Locking for just read is almost always bogus, but here (as in most cases) there is also a write based on the contents of the flag, and the lock is held across the write. Practically speaking, I agree with claim that reading of m_k_f is surrounded by enough locked operations that would make sure that the read value is not stale. But there is no such guarantee on future/non-i386 arches, isn't it ? I think not-very-staleness is implied by acquire/release semantics which are part of the API for most atomic operations. This behaviour doesn't seem to be documented for mutexes, but I don't see how mutexes could work without it (they have to synchronize all memory accesses, not just the memory accessed by the lock). As a side note, mount interlock scope could be reduced there.
Index: ufs/ufs/ufs_vnops.c
===
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.283
diff -u -r1.283 ufs_vnops.c
--- ufs/ufs/ufs_vnops.c	6 Nov 2006 13:42:09 -	1.283
+++ ufs/ufs/ufs_vnops.c	12 Dec 2006 10:18:04 -
@@ -133,19 +134,19 @@
 {
 	struct inode *ip;
 	struct timespec ts;
-	int mnt_locked;
 
 	ip = VTOI(vp);
-	mnt_locked = 0;
 	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) {
 		VI_LOCK(vp);
 		goto out;
 	}
 	MNT_ILOCK(vp->v_mount);	/* For reading of mnt_kern_flags. */
-	mnt_locked = 1;
 	VI_LOCK(vp);
-	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
-		goto out_unl;
+	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) {
+		MNT_IUNLOCK(vp->v_mount);
+		VI_UNLOCK(vp);
+		return;
+	}

The version that depends on not-very-staleness would test the flags without acquiring the lock(s) and return immediately in the usual case where none of the flags are set. It would have to acquire the locks and repeat the test to make changes (and the test is already repeated one flag at a time). I think this would be correct enough, but still inefficient and/or even messier. The current organization is usually:

	acquire vnode interlock in caller
	release vnode interlock in caller to avoid messes here (inefficient)
	call
		acquire mount interlock
		acquire vnode interlock
		test the flags; goto cleanup code if none set (usual case)
		do the work
		release vnode interlock
		release mount interlock
		return
	acquire vnode interlock (if needed)
	release vnode interlock (if needed)

and it might become:

	acquire vnode interlock in caller
	call
		test the flags; return if none set (usual case)
		release vnode interlock	// check that callers are aware of this
		acquire mount interlock
		acquire vnode interlock
		do the work
		// Assume no LOR problem for release, as below.
		// Otherwise need another release+acquire of vnode interlock.
		release mount interlock
		return
	release vnode interlock

 	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))
 		ip->i_flag |= IN_LAZYMOD;
@@ -155,6 +156,7 @@
 		ip->i_flag |= IN_MODIFIED;
 	else if (ip->i_flag & IN_ACCESS)
 		ip->i_flag |= IN_LAZYACCESS;
+	MNT_IUNLOCK(vp->v_mount);
 	vfs_timestamp(&ts);
 	if (ip->i_flag & IN_ACCESS) {
 		DIP_SET(ip, i_atime, ts.tv_sec);

Is there no LOR problem for release? As I understand it, MNT_ILOCK() is only protecting IN_ACCESS being converted to IN_MODIFIED, so after this conversion is done the lock is not needed. Is this correct?

@@ -172,10 +174,7 @@
 out:
 	ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
-out_unl:
 	VI_UNLOCK(vp);
-	if (mnt_locked)
-		MNT_IUNLOCK(vp->v_mount);
 }

BTW, vfs.lookup_shared defaults to 0 and decides shared access for all operations including read, so I wonder if there are [m]any bugs preventing shared accesses
Re: kqueue LOR
On Tue, 12 Dec 2006, John Baldwin wrote: On Tuesday 12 December 2006 13:43, Suleiman Souhlal wrote: Why is memory barrier usage not encouraged? As you said, they can be used to reduce the number of atomic (LOCKed) operations, in some cases. ... Admittedly, they are harder to use than atomic operations, but it might still be worth having something similar. How would MI code know when using memory barriers is good? This is already hard to know for atomic ops -- if there would be more than a couple of atomic ops then it is probably better to use 1 mutex lock/unlock and no atomic ops, since this reduces the total number of atomic ops in most cases, but it is hard for MI code to know how many a couple is. (This also depends on the SMP option -- without SMP, locking is automatic so atomic ops are very fast but mutexes are still slow since they do a lot more than an atomic op.) Memory barriers just specify ordering, they don't ensure a cache flush so another CPU reads up to date values. You can use memory barriers in conjunction with atomic operations on a variable to ensure that you can safely read other variables (which is what locks do). For example, in this I thought that the acquire/release variants of atomic ops guarantee this. They seem to be documented to do this, while mutexes don't seem to be documented to do this. The MI (?) implementation of mutexes depends on atomic_cmpset_{acq,rel}_ptr() doing this. Bruce
Re: Still possible to directly boot without loader?
On Mon, 30 Oct 2006, John Baldwin wrote: On Thursday 26 October 2006 15:54, Ruslan Ermilov wrote: On Thu, Oct 26, 2006 at 03:42:34PM -0400, John Baldwin wrote: On Thursday 26 October 2006 15:18, Ruslan Ermilov wrote: On Thu, Oct 26, 2006 at 11:38:24AM -0400, John Baldwin wrote: Sorry, I meant that both boot2 and loader should follow your proposal of masking 28 bits. Just masking the top 4 bits is probably sufficient. :-) OK, I'll craft a patch tomorrow. This will also require patching at least sys/boot/common/load_elf.c:__elfN(loadimage), maybe something else. I think we could actually mask 30 bits; that would allow loading 1G kernels, provided that sufficient memory exists. Actually, please mask 4 bits. Not all kernels run at 0xc0000000. You can adjust that address via 'options KVA_PAGES'. I know of folks who run kernels at 0xa0000000 for example because they need more KVA. This is part of why I

They can probably use 0x80000000, but it's not obvious how to get exactly that from KVA_PAGES.

really don't like the masking part, though I'm not sure there's a way to figure out KERNBASE well enough to do the more correct 'pa = addr - KERNBASE' rather than 'pa = addr & 0x0fffffff'. The masking hack is probably only needed for aout. For elf, objdump -h /kernel says:

% Sections:
% Idx Name          Size      VMA       LMA       File off  Algn
% ...
%                   CONTENTS, ALLOC, LOAD, READONLY, DATA
% 4 .text           002853e0  c043b510  c043b510  0003b510  2**4

so KERNBASE = LMA - File off for at least this kernel. boot2 now loads the text section from file offset File off to address LMA (masked). I think it just needs to load at an address that is the same mod PAGE_SIZE as LMA or VMA (these must agree mod PAGE_SIZE), provided it adjusts the entry address to match.

Bruce
Re: Still possible to directly boot without loader?
On Thu, 26 Oct 2006, Ruslan Ermilov wrote: On Mon, Sep 11, 2006 at 01:09:15PM -0500, Brooks Davis wrote: On Sun, Sep 10, 2006 at 09:10:26PM +0200, Stefan Bethke wrote: I just tried to load my standard kernel from the boot blocks (instead of using loader(8)), but I either get a hang before the kernel prints anything, or a BTX halted. Is this still supposed to work in 6- stable, or has it finally disappeared? You may be able to get this to work, but it is unsupported. I normally use it (with a different 1-stage boot loader) for kernels between ~4.10 and -current. I only boot RELENG_4 kernels for running benchmarks and don't bother applying my old fix for missing static symbols there. See another PR for the problem and patch. In newer kernels and userlands, starting some time in 5.0-CURRENT, sysutil programs use sysctls for live kernels so they aren't affected by missing static symbols. I've been investigating this today. Here's what I've found: 1) You need hints statically compiled into your kernel. (This has been a long time requirement.) Even though I normally use it, I once got very confused by this. Everything except GENERIC booted right (with boot loaders missing the bug in (3)). This is because GENERIC has had hints commented out since rev.1.272, and GENERIC also has no acpi (it's not very GENERIC). When there are no hints, except on very old systems, most things except isa devices work, but at least without acpi, console drivers on i386's are on isa so it is hard to see if things work. Hints are probably also needed for ata. I think a diskless machine with no consoles and pci NICs would just work. 2) You can only do it on i386, because boot2 only knows about ELF32, so attempts to load ELF64 amd64 kernels will fail. (loader(8) knows about both ELF32/64.) I haven't got around to fixing this. 3) It's currently broken even on i386; backing out rev. 1.71 of boot2.c by jhb@ fixes this for me. 
: revision 1.71
: date: 2004/09/18 02:07:00; author: jhb; state: Exp; lines: +3 -3
: A long, long time ago in a CVS branch far away (specifically, HEAD prior
: to 4.0 and RELENG_3), the BTX mini-kernel used paging rather than flat
: mode and clients were limited to a virtual address space of 16 megabytes.
: Because of this limitation, boot2 silently masked all physical addresses
: in any binaries it loaded so that they were always loaded into the first
: 16 Meg. Since BTX no longer has this limitation (and hasn't for a long
: time), remove the masking from boot2. This allows boot2 to load kernels
: larger than about 12 to 14 meg (12 for non-PAE, 14 for PAE).
:
: Submitted by: Sergey Lyubka devnull at uptsoft dot com
: MFC after: 1 month

The kernel is linked at 0xc0000000 but loaded in low memory, so the high bits must be masked off like they used to be for the kernel to boot at all. This has nothing to do with paging AFAIK. Rev.1.71 makes no sense, since BTX isn't large, and large kernels are more unbootable than before with 1.71. There is another PR about this. 4) Another rev. broke support for booting with -c and -d to save 4 bytes. -c is useful for RELENG_6 and -d is essential for debugging. If you always use loader(8) then you would only notice this if you try to set these flags in boot2.

Bruce
Re: em network issues
On Wed, 18 Oct 2006, Kris Kennaway wrote: I have been working with someone's system that has em shared with fxp, and a simple fetch over the em (e.g. of a 10 GB file of zeroes) is enough to produce watchdog timeouts after a few seconds. As previously mentioned, changing the INTR_FAST to INTR_MPSAFE in the driver avoids this problem. However, others are seeing sporadic watchdog timeouts at higher system load on non-shared em systems too. em_intr_fast() has no locking whatsoever. I would be very surprised if it even seemed to work for SMP. For UP, masking of CPU interrupts (as is automatic in fast interrupt handlers) might provide sufficient locking, but for many drivers with fast interrupt handlers, whatever locking is used by the fast interrupt handler must be used all over the driver to protect data structures that are or might be accessed by the fast interrupt handler. That means lots of intr_disable/enable()s if the UP case is micro-optimized and lots of mtx_lock/unlock_spin()s for the general case. But em has no references to spinlocks or CPU interrupt disabling. em_intr() starts with EM_LOCK(), so it isn't obviously broken near its first statement. Very few operations are valid in fast interrupt handlers. Locking and fastness must be considered for every operation, not only in the interrupt handler but in all data structures shared by the interrupt handler. For just the interrupt handler in em:

% static void
% em_intr_fast(void *arg)
% {
%	struct adapter *adapter = arg;

This is safe because it has no side effects and doesn't take long.

%	struct ifnet	*ifp;
%	uint32_t	reg_icr;
%
%	ifp = adapter->ifp;
%

This is safe provided other parts of the driver ensure that the interrupt handler is not reached after adapter->ifp goes away. Similarly for other long-lived almost-const parts of *adapter.

%	reg_icr = E1000_READ_REG(&adapter->hw, ICR);
%

This is safe provided reading the register doesn't change it.

%	/* Hot eject?
%	 */
%	if (reg_icr == 0xffffffff)
%		return;
%
%	/* Definitely not our interrupt. */
%	if (reg_icr == 0x0)
%		return;
%

These are safe since we don't do anything with the result.

%	/*
%	 * Starting with the 82571 chip, bit 31 should be used to
%	 * determine whether the interrupt belongs to us.
%	 */
%	if (adapter->hw.mac_type >= em_82571 &&
%	    (reg_icr & E1000_ICR_INT_ASSERTED) == 0)
%		return;
%

This is safe, as above.

%	/*
%	 * Mask interrupts until the taskqueue is finished running. This is
%	 * cheap, just assume that it is needed. This also works around the
%	 * MSI message reordering errata on certain systems.
%	 */
%	em_disable_intr(adapter);

Now that we start doing things, we have various races. The above races to disable interrupts with other entries to this interrupt handler, and may race with other parts of the driver. After we disable driver interrupts, there should be no more races with other entries to this handler. However, reg_icr may be stale at this point even if we handled the race correctly. The other entries may have partly or completely handled the interrupt when we get back here (we should have locked just before here, and then if the lock blocked waiting for the other entries (which can only happen in the SMP case), we should reread the status register to see if we still have anything to do, or more importantly to see what we have to do now (extra scheduling of the SWI handler would just waste time, but missing scheduling would break things).

%	taskqueue_enqueue(adapter->tq, &adapter->rxtx_task);
%

Safe provided the API is correctly implemented. (AFAIK, the API only has huge design errors.)

%	/* Link status change */
%	if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))
%		taskqueue_enqueue(taskqueue_fast, &adapter->link_task);
%

As above, plus might miss this call if the status changed underneath us.

%	if (reg_icr & E1000_ICR_RXO)
%		adapter->rx_overruns++;

Race updating the counter.
Generally, fast interrupt handlers should avoid book-keeping like this, since correct locking for it would poison large parts of the driver with the locking required for the fast interrupt handler. Perhaps similarly for important things. It's safe to read the status register provided reading it doesn't change it. Then it is safe to schedule tasks based on the contents of the register provided we don't do anything else and schedule enough tasks. But don't disable interrupts -- leave that to the task and make the task do nothing if it handled everything for a previous scheduling. This would result in the task usually being scheduled when the interrupt is for us but not if it is for another device. The above doesn't try to do much more than this. However, a fast interrupt handler needs to handle the usual case to be worth having except on systems
Re: [fbsd] HEADS UP: FreeBSD 5.3, 5.4, 6.0 EoLs coming soon
On Wed, 11 Oct 2006, Dmitry Pryanishnikov wrote: On Wed, 11 Oct 2006, Jeremie Le Hen wrote: ... Is it envisageable to extend the RELENG_4's and RELENG_4_11's EoL once more? Yes, I'm also voting for it. This support may be limited to remote-exploitable vulnerabilities only, but I'm sure there are many old slow routers for which the RELENG_4 -> 6 transition still hurts the performance. RELENG_4 is the last stable pre-SMPng branch, and (see my spring letters, Subject: RELENG_4 -> 5 -> 6: significant performance regression) the _very_ significant UP performance loss (which occurred in the RELENG_4 -> 5 transition) still isn't reclaimed. So I think it would be wise to extend { RELENG_4 / RELENG_4_11 / both } [may be limited] support. I hesitate to do anything to kill RELENG_4, but recently spent a few days figuring out why the performance for building kernels over nfs dropped by much more than for building kernels on local disks between RELENG_4 and -current. The most interesting loss (one not very specific to kernels) is that changes on 6 or 7 Dec 2004 resulted in open/close of an nfs file generating twice as much network traffic (2 instead of 1 Access RPCs per open) and thus being almost twice as slow for files that are otherwise locally cached. This combined with not very low network latency gives amazingly large losses of performance for things like make depend and cvs checkouts where 1 RPC per open already made things very slow.

Bruce
Re: missing fpresetsticky in ieeefp.h
On Thu, 2 Feb 2006, O. Hartmann wrote: Bruce Evans schrieb: On Thu, 2 Feb 2006, O. Hartmann wrote: ... Now take a look into machine/ieeefp.h, where this function should be declared. Nothing, I can not find this routine, it seems to be 'not available' on my FreeBSD 6.1-PRERELEASE AMD64 (no 32Bit compatibility). It was removed for amd64 and never existed for some other arches. It [fpresetsticky()] was apparently unused when it was removed a year ago. ... % RCS file: /home/ncvs/src/sys/amd64/include/ieeefp.h,v ... % revision 1.13 ... Thanks a lot. In prior software compilations of GMT on FBSD/AMD64 I commented out the appropriate line in gmt_init.c without any hazardous effects - but I never used GMT that intensively, having never recognized any malicious side effects. I should contact the guys from Soest/Hawaii asking them for any serious effects commenting out this line on amd64 architectures. I think it is probably used only for error detection, if at all. Accumulated IEEE exceptions are supposed to be read using fpgetsticky() and then cleared using fp[re]setsticky() so that the next set accumulated can be distinguished from the old set. Applications should now use fesetexceptflag() instead of fp[re]setsticky(). BTW, the most useful fp* functions other than fp[re]setsticky(), namely fp{get,set}round(), never worked on ia64 due to the rounding flags values being misspelled, so there are unlikely to be any portable uses of the fp* functions in ports. The corresponding fe{get,set}round() functions work on at least i386, amd64 and ia64.

Bruce
Re: missing fpresetsticky in ieeefp.h
On Thu, 2 Feb 2006, O. Hartmann wrote: O. Hartmann schrieb: Hello. I do not know whether this should be a bug report or not, I will ask prior to any further action. Reading 'man fpresetsticky' show up a man page for FPGETROUND(3) and tells me the existence of the fpresetsticky routine. This is a bug in the man page. fpresetsticky() is supposed to only exist on i386's, but the man page and its link to fpresetsticky.3 are installed for all arches. Now take a look into machine/ieeefp.h, where this function should be declared. Nothing, I can not find this routine, it seems to be 'not available' on my FreeBSD6.1-PRERELEASE AMD64 (no 32Bit compatibility). It was removed for amd64 and never existed for some other arches. It was apparently unused when it was removed a year ago. Background is, I try to compile GMT 4.1 and ran into this problem again (I reveal this error since FBSD 5.4-PRE also on i386). If fpresetsticky() isn't available on amd64 anymore, it shouldn't be mentioned in the manpage. But it seems to me to be a bug, so somebody should confirm this. % RCS file: /home/ncvs/src/sys/amd64/include/ieeefp.h,v % Working file: ieeefp.h % head: 1.14 % ... % % revision 1.13 % date: 2005/03/15 15:53:39; author: das; state: Exp; lines: +0 -20 % Remove fpsetsticky(). This was added for SysV compatibility, but due % to mistakes from day 1, it has always had semantics inconsistent with % SVR4 and its successors. In particular, given argument M: % % - On Solaris and FreeBSD/{alpha,sparc64}, it clobbers the old flags % and *sets* the new flag word to M. (NetBSD, too?) % - On FreeBSD/{amd64,i386}, it *clears* the flags that are specified in M % and leaves the remaining flags unchanged (modulo a small bug on amd64.) % - On FreeBSD/ia64, it is not implemented. % % There is no way to fix fpsetsticky() to DTRT for both old FreeBSD apps % and apps ported from other operating systems, so the best approach % seems to be to kill the function and fix any apps that break. 
I % couldn't find any ports that use it, and any such ports would already % be broken on FreeBSD/ia64 and Linux anyway. % % By the way, the routine has always been undocumented in FreeBSD, % except for an MLINK to a manpage that doesn't describe it. This % manpage has stated since 5.3-RELEASE that the functions it describes % are deprecated, so that must mean that functions that it is *supposed* % to describe but doesn't are even *more* deprecated. ;-) % % Note that fpresetsticky() has been retained on FreeBSD/i386. As far % as I can tell, no other operating systems or ports of FreeBSD % implement it, so there's nothing for it to be inconsistent with. % % PR: 75862 % Suggested by: bde % Bruce
Re: Manipulating disk cache (buf) settings
On Mon, 23 May 2005, John-Mark Gurney wrote: Sven Willenberger wrote this message on Mon, May 23, 2005 at 10:58 -0400: We are running a PostgreSQL server (8.0.3) on a dual opteron system with 8G of RAM. If I interpret top and vfs.hibufspace correctly (which show values of 215MB and 225771520 (which equals 215MB) respectively). My understanding from having searched the archives is that this is the value that is used by the system/kernel in determining how much disk data to cache. This is incorrect... FreeBSD merged the vm and buf systems a while back, so all of memory is used as a disk cache.. Indeed. Statistics utilities still haven't caught up with dyson's changes in 1994 or 1995, so their display of statistics related to disk caching is very misleading. systat -v and top display vfs.bufspace but not vfs.hibufspace. Both of these are uninteresting. vfs.bufspace gives the amount of virtual memory that is currently allocated to the buffer cache. vfs.hibufspace gives the maximum for this amount. Virtual memory for buffers is almost never released, so on active systems vfs.bufspace is close to the maximum. The maximum is just a compile-time constant (BKVASIZE) times a boot-time constant (nbuf). There is no way to tell from userland exactly how much of memory is used for the vm part of the disk cache. inact in systat -v gives a maximum. Watch heavy file system activity for a while and you may see inact increase as vm is used for disk data. It decreases mainly when a file system is unmounted. Otherwise, it tends to stay near its maximum, with pages for not recently used disk data being reused for something else (newer disk data or processes). The buf cache is still used for filesystem meta data (and for pending writes of files, but those buf's reference the original page, not local storage)... This is mostly incorrect. The buffer cache is now little more than a window on vm. Metadata is backed by vm except for low quality file systems.
Directories are backed by vm unless vfs.vmiodirenable is 0 (not the default). Just as an experiment, on a quiet system do: dd if=/dev/zero of=somefile bs=1m count=2048 and then read it back in: dd if=somefile of=/dev/null bs=1m and watch systat or iostat and see if any of the file is read... You'll probably see that none of it is... Also, with systat -v: - start with inact small and watch it grow as the file is cached - remove the file and watch inact drop. I haven't tried this lately. The system has some defence against using up all of the free and inactive pages for a single file to the exclusion of other disk data, so you might not get 2GB cached even if you have 4GB memory. If that is in fact the case, then my question would be how to best increase the amount of memory the system can use for disk caching. Just add RAM and don't run bloatware :-).

Bruce
Re: undefined reference to `memset'
On Wed, 23 Mar 2005, Vinod Kashyap wrote: If any kernel module has the following, or a similar line in it: - char x[100] = {0}; - I think you mean: - auto char x[100] = {0}; - or after fixing some style bugs: - char x[100] = { 0 }; - building of the GENERIC kernel on FreeBSD 5 -STABLE for amd64 as of 03/19/05, fails with the following message at the time of linking: undefined reference to `memset'. The same problem is not seen on i386. The problem goes away if the above line is changed to: - char x[100]; memset(x, 0, 100); - This version makes the pessimizations and potential bugs clear: - clearing 100 bytes on every entry to the function is wasteful. C90's auto initializers hide pessimizations like this. They should be used very rarely, especially in kernels. But they are often misused, even in kernels, even for read-only data that should be static. gcc doesn't optimize even auto const x[100] = { 0 }; to a static initialization -- the programmer must declare the object as static to prevent gcc laboriously clearing it on every entry to the function. - 100 bytes may be too much to put on the kernel stack. Objects just a little larger than this must be dynamically allocated unless they can be read-only. Adding CFLAGS+=-fbuiltin, or CFLAGS+=-fno-builtin to /sys/conf/Makefile.amd64 does not help. -fno-builtin is already in CFLAGS, and if it has any effect on this then it should be to cause gcc to generate a call to memset() instead of doing the memory clearing inline. I think gcc has a builtin memset() which is turned off by -fno-builtin, but -fno-builtin doesn't affect cases where memset() is not referenced in the source code. -ffreestanding should prevent gcc generating calls to library functions like memset(). 
However, -ffreestanding is already in CFLAGS too, and there is a problem: certain initializations like the one in your example need to use an interface like memset(), and struct copies need to use an interface like memcpy(), so what is gcc to do when -fno-builtin tells it to turn off its builtins and -ffreestanding tells it that the relevant interfaces might not exist in the library? Anyone knows what's happening? gcc is expecting that memset() is in the library, but the FreeBSD kernel is freestanding and happens not to have memset() in its library. Related bugs: - the FreeBSD kernel shouldn't have memset() at all. The kernel interface for clearing memory is bzero(). A few files misspelled bzero() as memset() and provided a macro to convert from memset() to bzero(), and instead of fixing them a low-quality memset() was added to sys/libkern.h. This gives an inline memset() so it doesn't help here. memset() is of some use for setting to nonzero, but this is rarely needed and can easily be repeated as necessary. The support for the nonzero case in sys/libkern.h is of particularly low quality -- e.g., it crashes if asked to set a length of 0. - memset() to zero and/or gcc methods for initialization to 0 might be much slower than the library's methods for clearing memory. This is not a problem in practice, although bzero() is much faster than gcc's methods in some cases, because: (a) -fno-builtin turns off builtin memset(). (b) the inline memset() just uses bzero() in the fill_byte = 0 case, so using it instead of bzero() is only a tiny pessimization. (c) large copies that bzero() can handle better than gcc's inline method (which is stosl on i386's for your example) cannot occur because the data would be too large to fit on the kernel stack. - there are slightly different problems for memcpy(): (a) memcpy() is in the library and is not inline, so there is no linkage problem if gcc generates a call to memcpy() for a struct copy.
(b) the library memcpy() never uses bcopy(), so it is much slower than bcopy() in many cases. (c) the reason that memcpy() is in the library is to let gcc inline memcpy() for efficiency, but this reason was turned into nonsense by adding -fno-builtin to CFLAGS, and all calls to memcpy() are style bugs and ask for inefficiency. (The inefficiency is small or negative in practice because bzero() has optimizations for large copies that are small pessimizations for non-large copies.) - the FreeBSD kernel shouldn't have memcmp(). It has an inline one that has even lower quality than the inline memset(). memcmp() cannot be implemented using bcmp() since memcmp() is tri-state but bcmp() is boolean, but the inline memcmp() just calls bcmp(). This works, if at all, because nothing actually needs memcmp() and memcmp() is just a misspelling of bcmp().

Bruce
Re: undefined reference to `memset'
On Thu, 24 Mar 2005, Bruce Evans wrote: On Wed, 23 Mar 2005, Vinod Kashyap wrote: If any kernel module has the following, or a similar line in it: - char x[100] = {0}; - building of the GENERIC kernel on FreeBSD 5 -STABLE for amd64 as of 03/19/05, fails with the following message at the time of linking: undefined reference to `memset'. ... ... Anyone knows what's happening? gcc is expecting that memset() is in the library, but the FreeBSD kernel is freestanding and happens not to have memset() in its library. As to why gcc calls memset() on amd64's but not on i386's: - gcc-3.3.3 doesn't call memset() on amd64's either. - gcc-3.4.2 on amd64's calls memset() starting with an array size of 65. It uses mov[qlwb] for sizes up to 16, then stos[qlwb] up to size 64. gcc-3.3.3 on i386's uses mov[lwb] for sizes up to 8, then stos[lwb] for all larger sizes. - the relevant change seems to be:

% Index: i386.c
% ===
% RCS file: /home/ncvs/src/contrib/gcc/config/i386/i386.c,v
% retrieving revision 1.20
% retrieving revision 1.21
% diff -u -2 -r1.20 -r1.21
% --- i386.c	19 Jun 2004 20:40:00 -0000	1.20
% +++ i386.c	28 Jul 2004 04:47:35 -0000	1.21
% @@ -437,26 +502,36 @@
% ...
% +const int x86_rep_movl_optimal = m_386 | m_PENT | m_PPRO | m_K6;
% ...

Note that rep_movl is considered optimal on i386's but not on amd64's.

% @@ -10701,6 +11427,10 @@
%    /* In case we don't know anything about the alignment, default to
%       library version, since it is usually equally fast and result in
% -     shorter code. */
% -  if (!TARGET_INLINE_ALL_STRINGOPS && align < UNITS_PER_WORD)
% +     shorter code.
% +
% +     Also emit call when we know that the count is large and call overhead
% +     will not be important. */
% +  if (!TARGET_INLINE_ALL_STRINGOPS
% +      && (align < UNITS_PER_WORD || !TARGET_REP_MOVL_OPTIMAL))
%      return 0;

TARGET_REP_MOVL_OPTIMAL is x86_rep_movl_optimal modulo a mask. It is zero for amd64's, so 0 is returned for amd64's here unless you use -mfoo to set TARGET_INLINE_ALL_STRINGOPS.
Returning 0 gives the library call instead of a stringop. This is in i386_expand_clrstr(). There is an identical change in i386_expand_movstr() that gives library calls to memcpy() for (at least) copying structs.

Bruce
Re: undefined reference to `memset'
On Thu, 24 Mar 2005, Nick Barnes wrote: At 2005-03-24 08:31:14+0000, Bruce Evans writes: what is gcc to do when -fno-builtin tells it to turn off its builtins and -ffreestanding tells it that the relevant interfaces might not exist in the library? Plainly, GCC should generate code which fills the array with zeroes. It's not obliged to generate code which calls memset (either builtin or in a library). If it knows that it can do so, then fine. Otherwise it must do it the Old Fashioned Way. So this is surely a bug in GCC. Nick B, who used to write compilers for a living But the compiler can require the Old Fashioned Way to be in the library. libgcc.a is probably part of gcc even in the freestanding case. The current implementation of libgcc.a won't all work in the freestanding case, since parts of it call stdio, but some parts of it are needed and work (e.g., __divdi3() on i386's at least). The kernel doesn't use libgcc.a, but it knows that __divdi3() and friends are needed and implements them in its libkern. Strictly, it should do something similar for memset(). I think the only bugs in gcc here are that the function it calls is in the application namespace in the freestanding case, and that the requirements for freestanding implementations are not all documented. The requirement for memset() and friends _is_ documented (in gcc.info), but the requirement for __divdi3() and friends is only documented indirectly by the presence of these functions in libgcc.a.

Bruce
Re: undefined reference to `memset'
On Fri, 25 Mar 2005, Peter Jeremy wrote: On Thu, 2005-Mar-24 12:03:19 -0800, Vinod Kashyap wrote: [ char x[100] = { 0 }; ] A statement like this (auto and not static) I'd point out that this is the first time that you've mentioned that the variable is auto. Leaving out critical information will not encourage people to help you. It was obviously auto, since memset() would not have been called for a global variable. is necessary if you are dealing with re-entrancy. This isn't completely true. The preferred approach is: char *x; x = malloc(100, MEM_POOL_xxx, M_ZERO | M_WAITOK); (with a matching free() later). This is also preferred to alloca() and C99's dynamic arrays. BTW, the kernel has had some dubious examples of dynamic arrays in very important code since long before C99 existed. vm uses some dynamic arrays, and this is only safe since the size of the arrays is bounded and small. But when the size of an array is bounded and small, dynamic allocation is just a pessimization -- it is more efficient to always allocate an array with the maximum size that might be needed. How is it then, that an explicit call to memset (like in my example) works? The code auto char x[100] = {0}; is equivalent to auto char x[100]; memset(x, 0, sizeof(x)); but memset only exists as a static inline function (defined in libkern.h). If an explicit call to memset works then the problem would appear to be that the compiler's implicit expansion is failing to detect the static inline definition, and generating an external reference which can't be satisfied. This would seem to be a gcc bug. No, it is a feature :-). See my earlier reply. 2. I should have mentioned that I don't see the problem if I am building only the kernel module. It happens only when I am building the kernel integrating the module containing the example code. This is the opposite of what you implied previously.
There are some differences in how kernel modules are built so this How about posting a (short) compilable piece of C code that shows the problem. I would expect that an nm of the resultant object would show U memset when the code was compiled for linking into the kernel and some_address t memset or not reference memset at all when compiled as a module. I deleted the actual example. Most likely it would fail at load time due to using memset(). Another possibility is for the code that needs memset to be unreachable in the module since it is inside an ifdef.

Bruce