Re: zfs problems after rebuilding system [SOLVED]

2018-03-03 Thread Bruce Evans

On Sat, 3 Mar 2018, tech-lists wrote:


On 03/03/2018 00:23, Dimitry Andric wrote:

Indeed.  I have had the following for a few years now, due to USB drives
with ZFS pools:

--- /usr/src/etc/rc.d/zfs   2016-11-08 10:21:29.820131000 +0100
+++ /etc/rc.d/zfs   2016-11-08 12:49:52.971161000 +0100
@@ -25,6 +25,8 @@

 zfs_start_main()
 {
+   echo "Sleeping for 10 seconds to let USB devices settle..."
+   sleep 10
zfs mount -va
zfs share -a
if [ ! -r /etc/zfs/exports ]; then

For some reason, USB3 (xhci) controllers can take a very, very long time
to correctly attach mass storage devices: I usually see many timeouts
before they finally get detected.  After that, the devices always work
just fine, though.


I have one that works for an old USB hard drive but never works for a not
so old USB flash drive and a new SSD in a USB dock (just to check the SSD
speed when handicapped by USB).  Win7 has no problems with the xhci and
USB flash drive combination, and FreeBSD has no problems with the drive
on other systems.


Whether this is due to some sort of BIOS handover trouble, or due to
cheap and/or crappy USB-to-SATA bridges (even with brand WD and Seagate
disks!), I have no idea.  I attempted to debug it at some point, but
a well-placed "sleep 10" was an acceptable workaround... :)


That fixed it, thank you again :D


That won't work for the boot drive.

When no boot drive is detected early enough, the kernel goes to the
mountroot prompt.  That seems to hold a Giant lock which inhibits
further progress being made.  Sometimes progress can be made by trying
to mount unmountable partitions on other drives, but this usually goes
too fast, especially if the USB drive often times out.

Bruce
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


RE: mpslsi0 : Trying sleep, but thread marked as sleeping prohibited

2012-02-24 Thread Bruce Evans

On Fri, 24 Feb 2012, Desai, Kashyap wrote:


From: Alexander Kabaev [mailto:kab...@gmail.com]
...
sleep locks are by definition unbound. There is no spinning, no priority
propagation. Holders are free to take, say, page faults and go to long
journey to disk and back, etc.


I understood your above lines.

Hardly the stuff _anyone_ would want to
do from interrupt handler, thread or otherwise.


So what the mps driver does in its interrupt handler is as below.

mps_lock(sc);
mps_intr_locked(data);
mps_unlock(sc);

We hold the mtx lock in the interrupt handler and do a whole bunch of work
(this is somewhat lengthy work) under it.
It looks like the mps driver is misusing mtx_lock. Are we?


No.  Most NIC drivers do this.

Lengthy work isn't as long as it used to be, and here the lock only locks
out other accesses to a single piece of hardware (provided sc is for a
single piece of hardware as it should be).  Worry instead about more
global locks, either in your driver or in upper layers.  You might need
one to lock your whole driver, and upper layers might need one to lock
things globally too.  Giant locking is an example of the latter.  I don't
trust the upper layers much, but for interrupt handling they can be trusted
to not have anything locked when the interrupt handler is called (except
for Giant locking when the driver requests this).  Also worry about your
interrupt handler taking too long -- although nothing except interrupt
thread priority prevents other code running, it is possible that other
code doesn't get enough (or any) cycles if an interrupt handler is too
hoggish.  This problem is smaller than when there was a single ~1 MHz
CPU doing PIO.  With multiple ~2GHz CPUs doing DMA, the interrupt handler
can often be 100 times sloppier without anyone noticing.  But not 1000
times, and not 100 times with certain hardware.

Bruce


Re: SCHED_ULE should not be the default

2011-12-13 Thread Bruce Evans

On Wed, 14 Dec 2011, Ivan Klymenko wrote:


On Wed, 14 Dec 2011 00:04:42 +0100, Jilles Tjoelker <jil...@stack.nl> wrote:


On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:

If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
my case (Core2Duo) partially helps the following patch:
--- sched_ule.c.orig2011-11-24 18:11:48.0 +0200
+++ sched_ule.c 2011-12-10 22:47:08.0 +0200
...
@@ -2118,13 +2119,21 @@
 	struct td_sched *ts;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
+	if (td->td_pri_class & PRI_FIFO_BIT)
+		return;
+	ts = td->td_sched;
+	/*
+	 * We used up one time slice.
+	 */
+	if (--ts->ts_slice > 0)
+		return;


This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.

It looks wrong but it is a data point if it helps your workload.


Yes, I did it to delay for as long as possible the execution of the code in
this section:


I don't understand what you are doing here, but recently noticed that
the timeslicing in SCHED_4BSD is completely broken.  This bug may be a
feature.  SCHED_4BSD doesn't have its own timeslice counter like ts_slice
above.  It uses `switchticks' instead.  But switchticks hasn't been usable
for this purpose since long before SCHED_4BSD started using it for this
purpose.  switchticks is reset on every context switch, so it is useless
for almost all purposes -- any interrupt activity on a non-fast interrupt
clobbers it.

Removing the check of ts_slice in the above and always returning might
give a similar bug to the SCHED_4BSD one.

I noticed this while looking for bugs in realtime scheduling.  In the
above, returning early for PRI_FIFO_BIT also skips most of the periodic
functionality.  In SCHED_4BSD, returning early is the usual case, so
the PRI_FIFO_BIT might as well not be checked, and it is the unusual
fifo scheduling case (which is supposed to only apply to realtime
priority threads) which has a chance of working as intended, while the
usual roundrobin case degenerates to an impure form of fifo scheduling
(it is impure since priority decay still works so it is only fifo
among threads of the same priority).


...

@@ -2144,9 +2153,6 @@
 		if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
 			tdq->tdq_ridx = tdq->tdq_idx;
 	}
-	ts = td->td_sched;
-	if (td->td_pri_class & PRI_FIFO_BIT)
-		return;
 	if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
 		/*
 		 * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
 			sched_priority(td);
 	}
 	/*
-	 * We used up one time slice.
-	 */
-	if (--ts->ts_slice > 0)
-		return;
-	/*
 	 * We're out of time, force a requeue at userret().
 	 */
 	ts->ts_slice = sched_slice;


With the ts_slice check here before you moved it, removing it might
give buggy behaviour closer to SCHED_4BSD.


and refusal to use options FULL_PREEMPTION


4-5 years ago, I found that any form of PREEMPTION was a pessimization
for at least makeworld (since it caused too many context switches).
PREEMPTION was needed for the !SMP case, at least partly because of
the broken switchticks (switchticks, when it works, gives voluntary
yielding by some CPU hogs in the kernel.  PREEMPTION, if it works,
should do this better).  So I used PREEMPTION in the !SMP case and
not for the SMP case.  I didn't worry about the CPU hogs in the SMP
case since it is rare to have more than 1 of them and 1 will use at
most 1/2 of a multi-CPU system.


But no one has responded to my letter saying whether my patch helps or
not in the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with SMP...
Maybe I'm wrong about something, but I want to help in solving this
problem ...


The main point of SCHED_ULE is to give better affinity for multi-CPU
systems.  But the `multi' apparently needs to be strictly more than
2 for it to break even.

Bruce

Re: mountd has resolving problems

2011-02-17 Thread Bruce Evans

On Thu, 17 Feb 2011, John Baldwin wrote:


On Thursday, February 17, 2011 7:18:28 am Steven Hartland wrote:

This has become an issue for us in 8.x as well.

I'm pretty sure in pre 8.x these nfs mounts would simply background but
recently machines are now failing to boot. It seems that failure to
lookup nfs mount point hosts now causes this fatal error :(

We've just tried Jeremy's netwait script and it works perfectly so either
this or something similar needs to get pushed into base.

For reference the reason we need a delay here is our core Cisco router
takes a while to bring the port up properly on boot.

Thanks for sharing the script Jeremy :)


I use a similar hack that waits up to 30 seconds for the default gateway to be
pingable.  I think it is at least partly related to the new ARP code that now
drops packets in IP output if the link is down.


I use a hackish ping with a -t timeout much smaller than 30 seconds (since
even 2 seconds is annoying) and traceroutes in /etc/rc.d/netif.  Don't know
if it is the same problem.  It affects mainly nfs and ntpdate/ntpd to local
systems here.  Even with all-static routes.


This can be very problematic
during boot since some interfaces take a few seconds to negotiate link but
the end result of the new check in IP output is that the attempt to send the
packet fails with an error causing gethostbyname() and getaddrinfo() to fail
completely without doing any retries.  In 7 the packet would either sit in the


Also after down/up to change something.  If you try to use the network
before it is back then you have to wait much longer before it is really
back.  This is a relatively minor problem since down/up is not needed
routinely.


descriptor ring until link was up, or it would be dropped, but it would
silently fail, so the resolver in libc would just retry in 30 seconds or so at
which time it would work fine.

Waiting for the default route to be pingable actually fixed a few other
problems for us on 7 though as well (often ntpdate would not work on boot and
now it works reliably, etc.) so we went with that route.


I thought I first saw the problem a little earlier, and it affected bge more
than fxp.  Maybe the latter is correct and the problem is smaller with fxp
just because it is ready sooner.

Bruce


Re: poll()-ing a pipe descriptor, watching for POLLHUP

2009-06-03 Thread Bruce Evans

On Wed, 3 Jun 2009, Kostik Belousov wrote:


On Wed, Jun 03, 2009 at 05:30:51PM +0300, Kostik Belousov wrote:

On Wed, Jun 03, 2009 at 04:10:34PM +0300, Vlad Galu wrote:

Hm, I was having an issue with an internal piece of software, but
never checked what kind of pipe caused the problem. Turns out it was a
FIFO, and I got bitten by the same bug described here:
http://lists.freebsd.org/pipermail/freebsd-bugs/2006-March/017591.html

The problem is that the reader process isn't notified when the writer
process exits or closes the FIFO fd...


So you did find the relevant PR with a long audit trail and patches
attached. You obviously should contact the author of the patches,
Oliver Fromme, who has been a FreeBSD committer for some time (CCed).

I agree that the thing shall be fixed finally. Skimming over the
patches in kern/94772, I have some doubts about the removal of the
POLLINIGNEOF flag. The reason is that we generally do not
remove exposed user interfaces.


Maybe, but this flag was not a documented interface, and too much
ugliness might be required to preserve its behaviour bug-for-bug
compatibly (the old buggy behaviour would probably be more wanted
for compatibility than the strange behaviour given by this flag!)


I forward-ported Bruce's patch to CURRENT.  It passes the tests
from tools/regression/fifo and a test from kern/94772.


Thanks.  I won't be committing it any time soon, so you should.

I rewrote the test programs extensively (enclosed at the end) in Oct
2007 and updated the kernel patches to match.  Please run the new tests
to see if you are missing anything important in the kernel part.  If
so, I will search for the kernel patches later (actually, now --
enclosed in the middle).  I just ran them under RELENG_7 and unpatched
-current and found no differences with the Oct 2007 version for RELENG_7
in the old test output.

The old test output is in the following subdirectories:
4: FreeBSD-4
7: FreeBSD-7
l: Linux-2.6.10
m: my version of FreeBSD-5.2 including patches for this problem.
AFAIR, the FreeBSD output in m is the same as the Linux output in
all except a couple of cases where Linux select is inconsistent with
itself and/or with Linux poll.  However, the differences in the saved
output are that the Linux output is mysteriously missing results for
tests 5-8.  The tests attempt to test certain race possibilities in a
non-racy way.  This is not easy and the differences might be due to
some races/states not occurring under Linux.

POSIX finally specified the behaviour strictly enough for it to be possible
to test it a couple of years ago.  I didn't follow all the developments
and forget the details, but it was close to the Linux behaviour.


For my liking, I did not remove POLLINIGNEOF.

diff --git a/sys/fs/fifofs/fifo_vnops.c b/sys/fs/fifofs/fifo_vnops.c
index 66963bc..7e279ca 100644
--- a/sys/fs/fifofs/fifo_vnops.c
+++ b/sys/fs/fifofs/fifo_vnops.c
@@ -226,11 +226,47 @@ fail1:
 	if (ap->a_mode & FREAD) {
 		fip->fi_readers++;
 		if (fip->fi_readers == 1) {
+			SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
+			if (fip->fi_writers > 0)
+				fip->fi_readsock->so_rcv.sb_state |=
+				    SBS_COULDRCV;


My current version is in fact completely different.  It doesn't have
SBS_COULDRCV, but uses a generation count.  IIRC, this is the same
method as is used in Linux, and is needed for the same reasons
(something to do with keeping new connections separate from old ones).
So I will try to enclose the components of the patch in the order of
your diff (might miss some).  First one:

% Index: fifo_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/fifofs/fifo_vnops.c,v
% retrieving revision 1.100
% diff -u -2 -r1.100 fifo_vnops.c
% --- fifo_vnops.c  23 Jun 2004 00:35:50 -  1.100
% +++ fifo_vnops.c  17 Oct 2007 11:36:23 -
% @@ -36,4 +36,5 @@
%  #include <sys/fcntl.h>
%  #include <sys/file.h>
% +#include <sys/filedesc.h>
%  #include <sys/kernel.h>
%  #include <sys/lock.h>
% @@ -61,4 +62,5 @@
% 	long	fi_readers;
% 	long	fi_writers;
% +	int	fi_wgen;
%  };
% 
% @@ -182,8 +184,11 @@
% 	struct ucred *a_cred;
% 	struct thread *a_td;
% +	int	a_fdidx;
% 	} */ *ap;
%  {
% 	struct vnode *vp = ap->a_vp;
% 	struct fifoinfo *fip;
% +	struct file *fp;
% +	struct filedesc *fdp;
% 	struct thread *td = ap->a_td;
% 	struct ucred *cred = ap->a_cred;
% @@ -240,4 +245,10 @@
% 	}
% 	}
% +	fdp = td->td_proc->p_fd;
% +	FILEDESC_LOCK(fdp);
% +	fp = fget_locked(fdp, ap->a_fdidx);
% +	/* Abuse f_msgcount as a generation count. */
% +	fp->f_msgcount = fip->fi_wgen - fip->fi_writers;
% +	FILEDESC_UNLOCK(fdp);
% 	}
%   if 

Re: HEADS UP: More CAM fixes.

2009-02-17 Thread Bruce Evans

On Tue, 17 Feb 2009, Gary Jennejohn wrote:


I tested this with an Adaptec 29160.  I saw no real improvement in
performance, but also no regressions.

I suspect that the old disk I had attached just didn't have enough
performance reserves to show an improvement.

My test scenario was buildworld.  Since /usr/src and /usr/obj were both
on the one disk it got a pretty good workout.



AMD64 X2 (2.5 GHz) with 4GB of RAM.


Buildworld hardly uses the disk at all.  It reads and writes a few hundred
MB.  Ideally the i/o should go at disk speeds of 50-200MB/s and thus take
between 20 and 5 seconds.  In practice, it will take a few more seconds
physically, but perhaps even less virtually due to parallelism.

Bruce


Re: ZFS, NFS and Network tuning

2009-01-29 Thread Bruce Evans

On Thu, 29 Jan 2009, Brent Jones wrote:


On Wed, Jan 28, 2009 at 11:21 PM, Brent Jones br...@servuhome.net wrote:



...
The issue I am seeing, is that for certain file types, the FreeBSD NFS
client will either issue an ASYNC write, or an FSYNC.
However, NFSv3 and v4 both support safe ASYNC writes in the TCP
versions of the protocol, so that should be the default.
Issuing FSYNC's for every complete block transmitted adds substantial
overhead and slows everything down.


I use some patches (mainly for nfs write clustering on the server) by
Bjorn Gronwall and some local fixes (mainly for vfs write clustering
on the server, turning off excessive nfs[io]d daemons which get in
each other's way due to poor scheduling, and things that only help for
lots of small files), and see reasonable performance in all cases (~90%
of disk bandwidth with all-async mounts, and half that with the client
mounted noasync on an old version of FreeBSD.  The client in -current
is faster.)  Writing is actually faster than reading here.



...
My NFS mount command lines I have tried to get all data to ASYNC write:

$ mount_nfs -3T -o async 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/
$ mount_nfs -3T 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/
$ mount_nfs -4TL 192.168.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/


Also try -r16384 -w16384, and udp, and async on the server.  I think
block sizes default to 8K for udp and 32K for tcp.  8K is too small,
and 32K may be too large (it increases latency for little benefit
if the server fs block size is 16K).  udp gives lower latency.  async
on the server makes little difference provided the server block size
is not too small.


I have found a 4 year old bug, which may be related to this. cp uses
mmap for small files (and I imagine lots of things use mmap for file
operations) and causes slowdowns via NFS, due to the fsync data
provided above.

http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/87792


mmap apparently breaks the async mount preference in the following code
from vnode_pager.c:

%	/*
%	 * pageouts are already clustered, use IO_ASYNC to force a bawrite()
%	 * rather then a bdwrite() to prevent paging I/O from saturating
%	 * the buffer cache.  Dummy-up the sequential heuristic to cause
%	 * large ranges to cluster.  If neither IO_SYNC or IO_ASYNC is set,
%	 * the system decides how to cluster.
%	 */
%	ioflags = IO_VMIO;
%	if (flags & (VM_PAGER_PUT_SYNC | VM_PAGER_PUT_INVAL))
%		ioflags |= IO_SYNC;

This apparently gives lots of sync writes.  (Sync writes are the default for
nfs, but we mount with async to try to get async writes.)

%	else if ((flags & VM_PAGER_CLUSTER_OK) == 0)
%		ioflags |= IO_ASYNC;

nfs doesn't even support this flag.  In fact, ffs is the only file
system that supports it, and here is the only place that sets it.  This
might explain some slowness.

One of the bugs in vfs clustering that I don't have is related to this.
IIRC, mounting the server with -o async doesn't work as well as it
should because the buffer cache becomes congested with i/o that should
have been sent to the disk.  Some writes must be done async as explained
above, but one place in vfs_cache.c is too agressive in delaying async
writes for file systems that are mounted async.  This problem is more
noticeable for nfs, at least with networks not much faster than disks,
since it results in the client and server taking turns waiting for
each other.  (The names here are very confusing -- the async mount
flag normally delays both sync and async writes for as long as possible,
except for nfs it doesn't affect delays but asks for async writes
instead of sync writes on the server, while the IO_ASYNC flag asks for
async writes and thus often has the opposite sense to the async mount
flag.)

%	ioflags |= (flags & VM_PAGER_PUT_INVAL) ? IO_INVAL : 0;
%	ioflags |= IO_SEQMAX << IO_SEQSHIFT;

Bruce


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Sat, 22 Dec 2007, Mark Fullmer wrote:


On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:


I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.


The test is done with UDP packets between two servers.  The em
driver is incrementing the received packet count correctly but
the packet is not making it up the network stack.  If
the application was not servicing the socket fast enough I would
expect to see the dropped due to full socket buffers (udps_fullsock)
counter incrementing, as shown by netstat -s.


I couldn't see any sign of PREEMPTION not working in 6.3-PRERELEASE.
em seemed to keep up with the maximum rate that I can easily generate
(640 kpps with tiny udp packets), though it cannot transmit at more than
400 kpps on the same hardware.  This is without any syncer activity to
cause glitches.  The rest of the system couldn't keep up, and with my
normal configuration of net.isr.direct=1, systat -ip (udps_fullsock)
showed too many packets being dropped, but all the numbers seemed to
add up right.  (I didn't do end-to-end packet counts.  I'm using ttcp
to send and receive packets; the receiver loses so many packets that
it rarely terminates properly, and when it does terminate it always
shows many dropped.)  However, with net.isr.direct=0, packets are dropped
with no sign of the problem except a reduced count of good packets in
systat -ip.

Packet rate counter     net.isr.direct=1  net.isr.direct=0
-------------------     ----------------  ----------------
netstat -I              639042            643522 (faster later)
systat -ip (total rx)   639042            382567 (dropped many b4 here)
  (UDP total)           639042            382567
  (udps_fullsock)       298911            70340
 (diff of prev 2)       340031            312227 (300+k always dropped)
net.isr.count           small             large (seems to be correct 643k)
net.isr.directed        large (correct?)  no change
net.isr.queued          0                 0
net.isr.drop            0                 0

net.isr.direct=0 is apparently causing dropped packets without even counting
them.  However, the drop seems to be below the netisr level.

More worryingly, with full 1500-byte packets (1472 data + 28 UDP
header), packets can be sent at a rate of 76 kpps (nearly 950 Mbps)
with a load of only 80% on the receiver, yet the ttcp receiver still
drops about 1000 pps due to socket buffer full.  With net.isr.direct=0
it drops an additional 700 pps due to this.  Glitches from sync(2)
taking 25 ms increase the loss by about 1000 packets, and using rtprio
for the ttcp receiver doesn't seem to help at all.

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
# 
# Since the packet never makes it to ip_input() I no longer have

# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't
have an option for this).  With the default kern.ipc.maxsockbuf of 256K,
this didn't seem to help.  20MB should work better :-) but I didn't try that.
I don't understand how fast the socket buffer fills up and would have
thought that 256K was enough for tiny packets but not for 1500-byte packets.
There seems to be a general problem that 1Gbps NICs have or should have
rings of size >= 256 or 512 so that they aren't forced to drop packets
when their interrupt handler has a reasonable but larger latency, yet if
we actually use this feature then we flood the upper layers with hundreds
of packets and fill up socket buffers etc. there.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Fri, 28 Dec 2007, Bruce Evans wrote:


In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
# # Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't
have an option for this).  With the default kern.ipc.maxsockbuf of 256K,
this didn't seem to help.  20MB should work better :-) but I didn't try that.


I've now tried this.  With kern.ipc.maxsockbuf=2048 (~20MB) and an
SO_RCVBUF of 0x1000000 (16MB), the socket buffer full lossage increases
from ~300 kpps (~47%) to ~450 kpps (70%) with tiny packets.  I think
this is caused by most accesses to the larger buffer being cache misses
(since the system can't keep up, cache misses make it worse).

However, with 1500-byte packets, the larger buffer reduces the lossage
from 1 kpps in 76 kpps to precisely zero pps, at a cost of only a small
percentage of system overhead (~20% idle to ~18% idle).

The above is with net.isr.direct=1.  With net.isr.direct=0, the loss is
too small to be obvious and is reported as 0, but I don't trust the
report.  ttcp's packet counts indicate losses of a few per million with
direct=0 but none with direct=1.  Running "while :; do sync; sleep 0.1; done"
in the background causes a loss of about 100 pps with direct=0 and a smaller
loss with direct=1.  Running the ttcp receiver at rtprio 0 doesn't make
much difference to the losses.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Fri, 28 Dec 2007, Bruce Evans wrote:


On Fri, 28 Dec 2007, Bruce Evans wrote:


In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as dropped due to full socket buffers.
# # Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.


I found where drops are recorded for the net.isr.direct=0 case.  It is
in net.inet.ip.intr_queue_drops.  The netisr subsystem just calls
IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up.
_IF_DROP(ifq) just increments ifq->ifq_drops.  The usual case for netisrs
is for the queue to be ipintrq for NETISR_IP.  The following details
don't help:

- drops for input queues don't seem to be displayed by any utilities
  (except ones for ipintrq are displayed primitively by
  sysctl net.inet.ip.intr_queue_drops).  netstat and systat only
  display drops for send queues and ip frags.
- the netisr subsystem's drop count doesn't seem to be displayed by any
  utilities except sysctl.  It only counts drops due to there not being
  a queue; other drops are counted by _IF_DROP() in the per-queue counter.
  Users have a hard time integrating all these primitively displayed drop
  counts with other error counters.
- the length of ipintrq defaults to the default ifq length of ipqmaxlen =
  IPQ_MAXLEN = 50.  This is inadequate if there is just one NIC in the
  system that has an rx ring size of >= slightly less than 50.  But 1
  Gbps NICs should have an rx ring size of 256 or 512 (I think the
  size is 256 for em; it is 256 for bge due to bogus configuration of
  hardware that can handle it being 512).  If the larger hardware rx
  ring is actually used, then ipintrq drops are almost ensured in the
  direct=0 case, so using the larger h/w ring is worse than useless
  (it also increases cache misses).  This is for just one NIC.  This
  problem is often limited by handling rx packets in small bursts, at
  a cost of extra overhead.  Interrupt moderation increases it by
  increasing burst sizes.

  This contrasts with the handling of send queues.  Send queues are
  per-interface and most drivers increase the default length from 50
  to their ring size (-1 for bogus reasons).  I think this is only an
  optimization, while a similar change for rx queues is important for
  avoiding packet loss.  For send queues, the ifq acts mainly as a
  primitive implementation of watermarks.  I have found that tx queue
  lengths need to be more like 5000 than 50 or 500 to provide enough
  buffering when applications are delayed by other applications or
  just by sleeping until the next clock tick, and use tx queues of
  length ~2 (a couple of clock ticks at HZ = 100), but now think
  queue lengths should be restricted to more like 50 since long queues
  cannot fit in L2 caches (not to mention they are bad for latency).

The length of ipintrq can be changed using sysctl
net.inet.ip.intr_queue_maxlen.  Changing it from 50 to 1024 turns most
or all ipintrq drops into socket buffer full drops
(640 kpps input packets and 434 kpps socket buffer fulls with direct=0;
 640 kpps input packets and 324 kpps socket buffer fulls with direct=1).

Bruce


Re: Packet loss every 30.999 seconds

2007-12-24 Thread Bruce Evans

On Mon, 24 Dec 2007, Kostik Belousov wrote:


On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:

On Sat, 22 Dec 2007, Kostik Belousov wrote:

Ok, since you talked about this first :). I already made the following
patch, but did not published it since I still did not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping mount interlock.
It shall be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.


Good.  I'd still like to unobfuscate the function call.

What do you mean there ?


Make the loop control and overheads clear by making the function call
explicit, maybe by expanding MNT_VNODE_FOREACH() inline after fixing
the style bugs in it.  Later, fix the code to match the comment again
by not making a function call in the usual case.  This is harder.


Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Could you, please, describe what you mean by context-switch for the
VMARKER ?


Oh, I didn't notice that the marker vnode is out of band (a whole new
vnode is malloced for each marker).  The context switching would be
needed if an ordinary active vnode that uses the union is used as a
marker.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Bruce Evans

On Sat, 22 Dec 2007, Kostik Belousov wrote:


On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:


I'm just an observer, and I may be confused, but it seems to me that this is
motion in the wrong direction (at least, it's not going to fix the actual
problem). As I understand the problem, once you reach a certain point, the
system slows down *every* 30.999 seconds. Now, it's possible for the code to
cause one slowdown as it cleans up, but why does it need to clean up so much
31 seconds later?


It is just searching for things to clean up, and doing this pessimally due
to unnecessary cache misses and (more recently) the introduction of
overheads for handling the locked-mount-point case even into the fast
path, where the mount point is never unlocked.

The search every 30 seconds or so is probably more efficient, and is
certainly simpler, than managing the list on every change to every vnode
for every file system.  However, it gives a high latency in non-preemptible
kernels.


Why not find/fix the actual bug? Then work on getting the yield right if it
turns out there's an actual problem for it to fix.


Yielding is probably the correct fix for non-preemptible kernels.  Some
operations just take a long time, but are low priority so they can be
preempted.  This operation is partly under user control, since any user
can call sync(2) and thus generate the latency every latency seconds.
But this is no worse than a user generating even larger blocks of latency
by reading huge amounts from /dev/zero.  My old latency workaround for
the latter (and other huge i/o's) is still sort of necessary, though it
now works bogusly (hogticks doesn't work since it is reset on context
switches to interrupt handlers; however, any context switch mostly fixes
the problem).  My old latency workaround only reduces the latency to a
multiple of 1/HZ, so a default of 200 ms, so it still is supposed to allow
latencies much larger than the ones that cause problems here, but its
bogus current operation tends to give latencies of more like 1/HZ which
is short enough when HZ has its default misconfiguration to 1000.

I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.


If the problem is that too much work is being done at a stretch and it turns
out this is because work is being done erroneously or needlessly, fixing
that should solve the whole problem. Doing the work that doesn't need to be
done more slowly is at best an ugly workaround.


Lots of necessary work is being done.


Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
should go in.


I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.

Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:   MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) {

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also
be more places if MNT_RELOAD support were not missing for some file
systems.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Bruce Evans

On Sat, 22 Dec 2007, Kostik Belousov wrote:


On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:

On Sat, 22 Dec 2007, Kostik Belousov wrote:

Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
should go in.


I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.

I think that we can easily predict what vnode(s) become dirty at the
places where we do vn_start_write().


This works for writes to regular files at most.  There are also reads
(for ffs, these set IN_ATIME unless the file system is mounted with
noatime) and directory operations.  By grepping for IN_CHANGE, I get
78 places in ffs alone where dirtying of the inode occurs or is scheduled
to occur (ffs = /sys/ufs).  The efficiency of marking timestamps,
especially for atimes, depends on just setting a flag in normal operation
and picking up coalesced settings of the flag later, often at sync time
by scanning all vnodes.


Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:
...

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also
be more places if MNT_RELOAD support were not missing for some file
systems.


Ok, since you talked about this first :). I already made the following
patch, but did not publish it since I have not yet inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock.
It should be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.


Good.  I'd still like to unobfuscate the function call.


diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
mtx_assert(MNT_MTX(mp), MA_OWNED);

KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+   if ((*mvp)->v_yield++ == 500) {
+   MNT_IUNLOCK(mp);
+   (*mvp)->v_yield = 0;
+   uio_yield();


Another unobfuscation is to not name this uio_yield().


+   MNT_ILOCK(mp);
+   }
vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
while (vp != NULL && vp->v_type == VMARKER)
vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
struct socket   *vu_socket; /* v unix domain net (VSOCK) */
struct cdev *vu_cdev;   /* v device (VCHR, VBLK) */
struct fifoinfo *vu_fifoinfo;   /* v fifo (VFIFO) */
+   int vu_yield;   /*   yield count (VMARKER) */
} v_un;

/*


Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


I got an almost identical delay (with 64000 vnodes).

Now, 17ms isn't much.


   Says you. On modern systems, trying to run a pseudo real-time application
on an otherwise quiescent system, 17ms is just short of an eternity. I agree
that the syncer should be preemptable (which is what my bandaid patch
attempts to do), but that probably wouldn't have helped my specific problem
since my application was a user process, not a kernel thread.


FreeBSD isn't a real-time system, and 17ms isn't much for it.  I saw lots
of syscall delays of nearly 1 second while debugging this.  (With another
hat, I would say that 17 us was a long time in 1992.  In CPU cycles,
17 us is hundreds of times longer now.)


  One more followup (I swear I'm done, really!)... I have a laptop here
that runs at 150MHz when it is in the lowest running CPU power save mode.
At that speed, this bug causes a delay of more than 300ms and is enough
to cause loss of keyboard input. I have to switch into high speed mode
before I try to type anything, else I end up with random typos. Very
annoying.


Yes, something is wrong if keystrokes are lost with CPUs that run at
150 kHz (sic) or faster.

Debugging shows that the problem is like I said.  The loop really does
take 125 ns per iteration.  This time is actually not very much.  The
linked list of vnodes could hardly be designed better to maximize
cache thrashing.  My system has a fairly small L2 cache (512K or 1M),
and even a few words from the vnode and the inode don't fit in the L2
cache when there are 64000 vnodes, but the vp and ip are also fairly
well designed to maximize cache thrashing, so L2 cache thrashing starts
at just a few thousand vnodes.

My system has fairly low latency main memory, else the problem would
be larger:

% Memory latencies in nanoseconds - smaller is better
% (WARNING - may not be correct, check graphs)
% ---
% Host                 OS   Mhz   L1 $   L2 $   Main mem   Guesses
% ---------            --   ---   ----   ----   --------   -------
% besplex.b FreeBSD 7.0-C  2205 1.361 5.6090   42.4 [PC3200 CL2.5 overclocked]
% sledge.fr FreeBSD 8.0-C  1802 1.666 8.9420   99.8
% freefall. FreeBSD 7.0-C  2778 0.746 6.6310  155.5

The loop makes the following memory accesses, at least in 5.2:

% loop:
%   for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
%   /*
%* If the vnode that we are about to sync is no longer
%* associated with this mount point, start over.
%*/
%   if (vp->v_mount != mp)
%   goto loop;
% 
% 		/*
%* Depend on the mntvnode_slock to keep things stable enough
%* for a quick test.  Since there might be hundreds of
%* thousands of vnodes, we cannot afford even a subroutine
%* call unless there's a good chance that we have work to do.
%*/
%   nvp = TAILQ_NEXT(vp, v_nmntvnodes);

Access 1 word at vp offset 0x90.  Costs 1 cache line.  IIRC, my system has
a cache line size of 0x40.  Assume this, and that vp is aligned on a
cache line boundary.  So this access costs the cache line at vp offsets
0x80-0xbf.

%   VI_LOCK(vp);

Access 1 word at vp offset 0x1c.  Costs the cache line at vp offsets 0-0x3f.

%   if (vp->v_iflag & VI_XLOCK) {

Access 1 word at vp offset 0x24.  Cache hit.

%   VI_UNLOCK(vp);
%   continue;
%   }
%   ip = VTOI(vp);

Access 1 word at vp offset 0xa8.  Cache hit.

%   if (vp->v_type == VNON || ((ip->i_flag &

Access 1 word at vp offset 0xa0.  Cache hit.

Access 1 word at ip offset 0x18.  Assume that ip is aligned, as above.  Costs
the cache line at ip offsets 0-0x3f.

%   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
%   TAILQ_EMPTY(&vp->v_dirtyblkhd))) {

Access 1 word at vp offset 0x48.  Costs the cache line at vp offsets 0x40-
0x7f.

%   VI_UNLOCK(vp);

Reaccess 1 word at vp offset 0x1c.  Cache hit.

%   continue;
%   }

The total cost is 4 cache lines or 256 bytes per vnode.  So with an L2
cache size of 1MB, the L2 cache will start thrashing at numvnodes =
4096.  With thrashing, an at my main memory latency of 42.4 nsec, it
might take 4*42.4 = 169.6 nsec to read main memory.  This is similar
to my observed time.  Presumably things aren't quite that bad because
there is some locality for the 3 lines in each vp.  It might be possible
to improve this a bit by accessing the lines sequentially and not
interleaving the access to ip.  Better, repack vp and move the IN*
flags from ip to vp (a change that has other advantages), so that
everything is in 1 cache line per vp.

This isn't consistent with the delay increasing to 300 ms when the CPU
is throttled -- memory shouldn't be 

Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Tue, 18 Dec 2007, Mark Fullmer wrote:


A little progress.

I have a machine with a KTR enabled kernel running.

Another machine is running David's ffs_vfsops.c's patch.

I left two other machines (GENERIC kernels) running the packet loss test
overnight.  At ~ 32480 seconds of uptime the problem starts.  This is really


Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.


marks are the intervals between test runs.  The window of missing packets
(timestamps between two packets where a sequence number is missing)
is usually less than 4us, although I'm not sure gettimeofday() can be
trusted for measuring this.  See https://www.eng.oar.net/~maf/bsd6/p3.png


gettimeofday() can normally be trusted to better than 1 us for time
differences of up to about 1 second.  However, gettimeofday() should
not be used in any program written after clock_gettime() became standard
in 1994.  clock_gettime() has a resolution of 1 ns.  It isn't quite
that accurate on current machines, but I trust it to measure differences
of 10 nsec between back to back clock_gettime() calls here.  Sample
output from wollman@'s old clock-watching program converted to
clock_gettime():

%%%
2007/12/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 238, max 99730, mean 240.025380, std 77.291436
1th: 239 (1203207 observations)
2th: 240 (556307 observations)
3th: 241 (190211 observations)
4th: 238 (50091 observations)
5th: 242 (20 observations)

2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317
1th: 247 (1274231 observations)
2th: 248 (668611 observations)
3th: 249 (56950 observations)
4th: 250 (23 observations)
5th: 263 (8 observations)

2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400
1th: 264 (1343245 observations)
2th: 263 (626226 observations)
3th: 265 (26860 observations)
4th: 262 (3572 observations)
5th: 268 (8 observations)

2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440
1th: 261 (999391 observations)
2th: 320 (473325 observations)
3th: 262 (373831 observations)
4th: 321 (148126 observations)
5th: 312 (4759 observations)

2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301
1th: 838 (1685662 observations)
2th: 839 (136980 observations)
3th: 559 (72160 observations)
4th: 837 (48902 observations)
5th: 558 (31217 observations)

2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
1th: 4190 (1423885 observations)
2th: 4191 (440158 observations)
3th: 3352 (65261 observations)
4th: 5028 (39202 observations)
5th: 5029 (15456 observations)
%%%

min here gives the minimum latency of a clock_gettime() syscall.
The improvement from 247 nsec to 240 nsec in the mean due to -O2
-mcpu=athlon-xp can be trusted to be measured very accurately since
it is an average over more than 100 million trials, and the improvement
from 247 nsec to 238 nsec for min can be trusted because it is
consistent with the improvement in the mean.

The program had to be converted to use clock_gettime() a few years
ago when CPU speeds increased so much that the correct min became
significantly less than 1.  With gettimeofday(), it cannot distinguish
between an overhead of 1 ns and an overhead of 1 us.

For the ACPI and i8254 timecounter, you can see that the low-level
timecounters have a low frequency clock from the large gaps between
the observations.  There is a gap of 279-280 ns for the acpi timecounter.
This is the period of the acpi timecounter's clock (frequency
14318182/4 = period 279.3651 ns.  Since we can observe this period to
within 1 ns, we must have a basic accuracy of nearly 1 ns, but if we
make only 2 observations we are likely to have an inaccuracy of 279
ns due to the granularity of the clock.  The TSC has a clock granularity
of 6 ns on my CPU, and delivers almost that much accuracy with only
2 observations, but technical problems prevent general use of the TSC.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


Debugging shows that the problem is like I said.  The loop really does
take 125 ns per iteration.  This time is actually not very much.  The


  Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.


As I said, 125 nsec is a short time in this context.  It is approximately
the time for a single L2 cache miss on a machine with slow memory like
freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns).  As I said,
the code is organized so as to give about 4 L2 cache misses per vnode
if there are more than a few thousand vnodes, so it is doing very well
to take only 125 nsec for a few checks.


  In any case, it appears that my patch is a no-op, at least for the
problem I was trying to solve. This has me confused, however, because at
one point the problem was mitigated with it. The patch has gone through
several iterations, however, and it could be that it was made to the top
of the loop, before any of the checks, in a previous version. Hmmm.


The patch should work fine.  IIRC, it yields voluntarily so that other
things can run.  I committed a similar hack for uiomove().  It was
easy to make syscalls that take many seconds (now tenths of seconds
insted of seconds?), and without yielding or PREEMPTION or multiple
CPUs, everything except interrupts has to wait for these syscalls.  Now
the main problem is to figure out why PREEMPTION doesn't work.  I'm
not working on this directly since I'm running ~5.2 where nearly-full
kernel preemption doesn't work due to Giant locking.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.


  FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at least a little bit). I guess
that got removed when the size of the vnode pool was dramatically
increased.


It might still.  The data should be cached somewhere, but caching it
in both the buffer cache/VMIO and the vnode/inode is wasteful.

I may have been only caching vnodes for directories.  I switched to
using a find or a tar on /home/ncvs/ports since that has a very high
density of directories.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Thu, 20 Dec 2007, Bruce Evans wrote:


On Wed, 19 Dec 2007, David G Lawrence wrote:

  Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.


As I said, 125 nsec is a short time in this context.  It is approximately
the time for a single L2 cache miss on a machine with slow memory like
freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns).  As I said,


Perfmon counts for the cache misses during sync(1):

== /tmp/kg1/z0 ==
vfs.numvnodes: 630
# s/kx-dc-accesses 
484516
# s/kx-dc-misses 
20852

misses = 4%

== /tmp/kg1/z1 ==
vfs.numvnodes: 9246
# s/kx-dc-accesses 
884361
# s/kx-dc-misses 
89833

misses = 10%

== /tmp/kg1/z2 ==
vfs.numvnodes: 20312
# s/kx-dc-accesses 
1389959
# s/kx-dc-misses 
178207

misses = 13%

== /tmp/kg1/z3 ==
vfs.numvnodes: 80802
# s/kx-dc-accesses 
4122411
# s/kx-dc-misses 
658740

misses = 16%

== /tmp/kg1/z4 ==
vfs.numvnodes: 138557
# s/kx-dc-accesses 
7150726
# s/kx-dc-misses 
1129997

misses = 16%

===

I forgot to only count active vnodes in the above.  vfs.freevnodes was
small (< 5%).

I set kern.maxvnodes to 20, but vfs.numvnodes saturated at 138557
(probably all that fits in kvm or main memory on i386 with 1GB RAM).

With 138557 vnodes, a null sync(2) takes 39673 us according to kdump -R.
That is 35.1 ns per miss.  This is consistent with lmbench2's estimate
of 42.5 ns for main memory latency.

Watching vfs.*vnodes confirmed that vnode caching still works like you
said:
o find /home/ncvs/ports -type f only gives a vnode for each directory
o a repeated find /home/ncvs/ports -type f is fast because everything
  remains cached by VMIO.  FreeBSD performed very badly at this benchmark
  before VMIO existed and was used for directories
o tar cf /dev/zero /home/ncvs/ports gives a vnode for files too.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


The patch should work fine.  IIRC, it yields voluntarily so that other
things can run.  I committed a similar hack for uiomove().  It was


  It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there are thousands of dirty vnodes.
While that condition can certainly happen, it isn't the case that I'm
particularly interested in.


Oops.

When it reaches the bottom of the loop, it will probably block on i/o
sometimes, so that the problem is smaller anyway.


CPUs, everything except interrupts has to wait for these syscalls.  Now
the main problem is to figure out why PREEMPTION doesn't work.  I'm
not working on this directly since I'm running ~5.2 where nearly-full
kernel preemption doesn't work due to Giant locking.


  I don't understand how PREEMPTION is supposed to work (I mean
to any significant detail), so I can't really comment on that.


Me neither, but I will comment anyway :-).  I think PREEMPTION should
even preempt kernel threads in favor of (higher priority of course)
user threads that are in the kernel, but doesn't do this now.  Even
interrupt threads should have dynamic priorities so that when they
become too hoggish they can be preempted even by user threads subject
to this priority rule.  This is further from happening.

ffs_sync() can hold the mountpoint lock for a long time.  That gives
problems preempting it.  To move your fix to the top of the loop, I
think you just need to drop the mountpoint lock every few hundred
iterations while yielding.  This would help for PREEMPTION too.  Dropping
the lock must be safe because it is already done while flushing.

Hmm, the loop is nicely obfuscated and pessimized in current (see
rev.1.234).  The fast (modulo no cache misses) path used to be just a
TAILQ_NEXT() to reach the next vnode, but now unnecessarily joins the
slow path at MNT_VNODE_FOREACH(), and MNT_VNODE_FOREACH() hides a
function call.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, David G Lawrence wrote:


While trying to diagnose a packet loss problem in a RELENG_6 snapshot
dated
November 8, 2007 it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severly
degrade network performance every 30.99 seconds.


I see the same behaviour under a heavily modified version of FreeBSD-5.2
(except the period was 2 ms longer and the latency was 7 ms instead
of 11 ms when numvnodes was at a certain value).  Now with numvnodes =
17500, the latency is 3 ms.


  I noticed this as well some time ago. The problem has to do with the
processing (syncing) of vnodes. When the total number of allocated vnodes
in the system grows to tens of thousands, the ~31 second periodic sync
process takes a long time to run. Try this patch and let people know if
it helps your problem. It will periodically wait for one tick (1ms) every
500 vnodes of processing, which will allow other things to run.


However, the syncer should be running at a relative low priority and not
cause packet loss.  I don't see any packet loss even in ~5.2 where the
network stack (but not drivers) is still Giant-locked.

Other too-high latencies showed up:
- syscons LED setting and vt switching gives a latency of 5.5 msec because
  syscons still uses busy-waiting for setting LEDs :-(.  Oops, I do see
  packet loss -- this causes it under ~5.2 but not under -current.  For
  the bge and/or em drivers, the packet loss shows up in netstat output
  as a few hundred errors for every LED setting on the receiving machine,
  while receiving tiny packets at the maximum possible rate of 640 kpps.
  sysctl is completely Giant-locked and so are upper layers of the
  network stack.  The bge hardware rx ring size is 256 in -current and
  512 in ~5.2.  At 640 kpps, 512 packets take 800 us so bge wants to
call the upper layers with a latency of far below 800 us.  I
  don't know exactly where the upper layers block on Giant.
- a user CPU hog process gives a latency of over 200 ms every half a
  second or so when the hog starts up, and a 300-400 ms after the
  hog has been running for some time.  Two user CPU hog processes
  double the latency.  Reducing kern.sched.quantum from 100 ms to 10
  ms and/or renicing the hogs don't seem to affect this.  Running the
  hogs at idle priority fixes this.  This won't affect packet loss,
  but it might affect user network processes -- they might need to
  run at real time priority to get low enough latency.  They might need
  to do this anyway -- a scheduling quantum of 100 ms should give a
  latency of 100 ms per CPU hog quite often, though not usually since
the hogs should never be preferred to a higher-priority process.

Previously I've used a less specialized clock-watching program to
determine the syscall latency.  It showed similar problems for CPU
hogs.  I just remembered that I found the fix for these under ~5.2 --
remove a local hack that sacrifices latency for reduced context
switches between user threads.  -current with SCHED_4BSD does this
non-hackishly, but seems to have a bug somewhere that gives a latency
that is large enough to be noticeable in interactive programs.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, Mark Fullmer wrote:

Thanks.  Have a kernel building now.  It takes about a day of uptime after 
reboot before I'll see the problem.


Yes, run "find / >/dev/null" to see the problem if it is the syncer one.

At least the syscall latency problem does seem to be this.  Under ~5.2,
with the above find and also "while :; do sync; done" (to give the latency
spike more often), your program (with some fflush(stdout)'s and args
1 7700) gives:

% 1197976029041677 12696 0
% 1197976033196396 9761 4154719
% 1197976034060031 13360 863635
% 1197976039080632 13749 5020601
% 1197976043195594 8536 4114962
% 1197976044100601 13505 905007
% 1197976049121870 14562 5021269
% 1197976052195631 8192 3073761
% 1197976054141545 14024 1945914
% 1197976059162357 14623 5020812
% 1197976063195735 7830 4033378
% 1197976064182564 14618 986829
% 1197976069202982 14823 5020418
% 1197976074223722 15350 5020740
% 1197976079244311 15726 5020589
% 1197976084264690 15893 5020379
% 1197976089289409 15058 5024719
% 1197976094315433 16209 5026024
% 1197976095197277 8015 881844
% 1197976099335529 16092 4138252
% 1197976104356513 16863 5020984
% 1197976109376236 16373 5019723
% 1197976114396803 16727 5020567
% 1197976119416822 16533 5020019
% 1197976124437790 17288 5020968
% 1197976126200637 10060 1762847
% 1197976127198459 7839 997822
% 1197976129457321 16606 2258862
% 1197976134477582 16654 5020261

This clearly shows the spike every 5 seconds, and the latency creeping
up as vfs.numvnodes increases.  It started at about 2 and ended at
about 64000.

The syncer won't be fixed soon, so the fix for dropped packets requires
figuring out why the syncer affects networking.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, Scott Long wrote:


Bruce Evans wrote:

On Mon, 17 Dec 2007, David G Lawrence wrote:


  One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

   if (vp->v_type == VNON || ((ip->i_flag &
   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
   VI_UNLOCK(vp);
   continue;
   }


Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on



Right, it's a non-optimal loop when N is very large, and that's a fairly
well understood problem.  I think what DG was getting at, though, is
that this massive flush happens every time the syncer runs, which
doesn't seem correct.  Sure, maybe you just rsynced 100,000 files 20
seconds ago, so the upcoming flush is going to be expensive.  But the
next flush 30 seconds after that shouldn't be just as expensive, yet it
appears to be so.


I'm sure it doesn't cause many bogus flushes.  iostat shows zero writes
caused by calling this incessantly using "while :; do sync; done".


This is further supported by the original poster's
claim that it takes many hours of uptime before the problem becomes
noticeable.  If vnodes are never truly getting cleaned, or never getting
their flags cleared so that this loop knows that they are clean, then
it's feasible that they'll accumulate over time, keep on getting flushed
every 30 seconds, keep on bogging down the loop, and so on.


Using "find / >/dev/null" to grow the problem and make it bad after a
few seconds of uptime, and profiling of a single sync(2) call to show
that nothing much is done except the loop containing the above:

under ~5.2, on a 2.2GHz A64 UP ini386 mode:

after booting, with about 700 vnodes:

%    %  cumulative    self              self    total
%  time    seconds   seconds    calls  ns/call  ns/call  name
%  30.8      0.000     0.000        0  100.00%           mcount [4]
%  14.9      0.001     0.000        0  100.00%           mexitcount [5]
%   5.5      0.001     0.000        0  100.00%           cputime [16]
%   5.0      0.001     0.000        6    13312    13312  vfs_msync [18]
%   4.3      0.001     0.000        0  100.00%           user [21]
%   3.5      0.001     0.000        5    11321    11993  ffs_sync [23]

after find / >/dev/null was stopped after saturating at 64000 vnodes
(desiredvnodes is 70240):

%   %   cumulative   self              self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  50.7     0.008     0.008        5  1666427  1667246  ffs_sync [5]
%  38.0     0.015     0.006        6  1041217  1041217  vfs_msync [6]
%   3.1     0.015     0.001        0  100.00%           mcount [7]
%   1.5     0.015     0.000        0  100.00%           mexitcount [8]
%   0.6     0.015     0.000        0  100.00%           cputime [22]
%   0.6     0.016     0.000       34     2660     2660  generic_bcopy [24]
%   0.5     0.016     0.000        0  100.00%           user [26]

vfs_msync() is a problem too.  It uses an almost identical loop for
the case where the vnode is not dirty (but has a different condition
for being dirty).  ffs_sync() is called 5 times because there are 5
ffs file systems mounted r/w.  There is another ffs file system mounted
r/o and that combined with a missing r/o optimization might give the
extra call to vfs_msync().  With 64000 vnodes, the calls take 1-2 ms
each.  That is already quite a lot, and there are many calls.  Each
call only looks at vnodes under the mount point so the number of mounted
file systems doesn't affect the total time much.

ffs_sync() is taking 125 ns per vnode.  That is more than I would have
expected.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


Thanks.  Have a kernel building now.  It takes about a day of uptime
after reboot before I'll see the problem.


  You may also wish to try to get the problem to occur sooner after boot
on a non-patched system by doing a tar cf /dev/null / (note: substitute
/dev/zero instead of /dev/null, if you use GNU tar, to disable its
optimization). You can stop it after it has gone through a 100K files.
Verify by looking at sysctl vfs.numvnodes.


Hmm, I said to use find /, but that is not so good since it only
looks at directories and directories (and their inodes) are not packed
as tightly as files (and their inodes).  Optimized tar, or find /
-type f, or ls -lR /, should work best, by doing not much more than
stat()ing lots of files, while full tar wastes time reading file data.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


  I didn't say it caused any bogus disk I/O. My original problem
(after a day or two of uptime) was an occasional large scheduling delay
for a process that needed to process VoIP frames in real-time. It was
happening every 31 seconds and was causing voice frames to be dropped
due to the large latency causing the frame to be outside of the jitter
window. I wrote a program that measures the scheduling delay by sleeping
for one tick and then comparing the timeofday offset from what was
expected. This revealed that every 31 seconds, the process was seeing
a 17ms delay in scheduling. Further investigation found that 1) the


I got an almost identical delay (with 64000 vnodes).

Now, 17ms isn't much.  Delays must have been much longer when CPUs
were many times slower and RAM/vnodes were not so many times smaller.
High-priority threads just need to be able to preempt the syncer so
that they don't lose data (unless really hard real time is supported,
which it isn't).  This should work starting with about FreeBSD-6
(probably need options PREEMPT).  It doesn't work in ~5.2 due to Giant
locking, but I find Giant locking to rarely matter for UP.  Old
versions of FreeBSD were only able to preempt to non-threads (interrupt
handlers) yet they somehow survived the longer delays.  They didn't
have Giant locking to get in the way, and presumably avoided packet
loss by doing lots in interrupt handlers (hardware isr and netisr).

I just remembered that I have seen packet loss even under -current
when I leave out or turn off options PREEMPT.


...
and it completely resolved the problem. Since the wait that I added
is at the bottom of the loop and the limit is 500 vnodes, this tells
me that every 31 seconds, there are a whole lot of vnodes that are
being synced, when there shouldn't have been any (this fact wasn't
apparent to me at the time, but when I later realized this, I had
no time to investigate further). My tests and analysis have all been
on an otherwise quiet system (no disk I/O), so the bottom of the
ffs_sync vnode loop should not have been reached at all, let alone
tens of thousands of times every 31 seconds. All machines were uni-
processor, FreeBSD 6+. I don't know if this problem is present in 5.2.
I didn't see ffs_syncvnode in your call graph, so it probably is not.


I chopped to a float profile with only top callers.  Any significant
calls from ffs_sync() would show up as top callers.  I still have the
data, and the call graph shows much more clearly that there was just
one dirty vnode for the whole sync():

%                 0.00    0.01        1/1        syscall [3]
% [4]    88.7     0.00    0.01        1          sync [4]
%                 0.01    0.00        5/5            ffs_sync [5]
%                 0.01    0.00        6/6            vfs_msync [6]
%                 0.00    0.00        7/8            vfs_busy [260]
%                 0.00    0.00        7/8            vfs_unbusy [263]
%                 0.00    0.00        6/7            vn_finished_write [310]
%                 0.00    0.00        6/6            vn_start_write [413]
%                 0.00    0.00        1/1            vfs_stdnosync [472]
% 
% ---
% 
%                 0.01    0.00        5/5        sync [4]
% [5]    50.7     0.01    0.00        5          ffs_sync [5]
%                 0.00    0.00        1/1            ffs_fsync [278]
%                 0.00    0.00        1/60           vget <cycle 1> [223]
%                 0.00    0.00        1/60           ufs_vnoperatespec <cycle 1> [78]
%                 0.00    0.00        1/26           vrele [76]

It passed the flags test just once to get to the vget().  ffs_syncvnode()
doesn't exist in 5.2, and ffs_fsync() is called instead.

% 
% ---
% 
%                 0.01    0.00        6/6        sync [4]
% [6]    38.0     0.01    0.00        6          vfs_msync [6]
% 
% ---

% ...
% 
%                 0.00    0.00        1/1        ffs_sync [5]
% [278]   0.0     0.00    0.00        1          ffs_fsync [278]
%                 0.00    0.00        1/1            ffs_update [368]
%                 0.00    0.00        1/4            vn_isdisk [304]

This is presumably to sync the 1 dirty vnode.

BTW I use noatime a lot, including for all file systems used in the test,
so the tree walk didn't dirty any vnodes.  A tar to /dev/zero would dirty
all vnodes if everything were mounted without this option.

% ...
%   %   cumulative   self              self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  50.7     0.008     0.008        5  1666427  1667246  ffs_sync [5]
%  38.0     0.015     0.006        6  1041217  1041217  vfs_msync [6]
%   3.1     0.015     0.001        0  100.00%           mcount [7]
%   1.5     0.015     0.000        0

Re: Packet loss every 30.999 seconds

2007-12-17 Thread Bruce Evans

On Mon, 17 Dec 2007, David G Lawrence wrote:


  One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

	if (vp->v_type == VNON || ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
		VI_UNLOCK(vp);
		continue;
	}


Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on
a 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500 vnodes,
which would be explained by the above (and the VI_LOCK() and loop
overhead) taking 171 ns per vnode.  I would expect it to take more like
20 ns per vnode for UP and 60 for SMP.

The comment before this code shows that the problem is known, and says
that a subroutine call cannot be afforded unless there is work to do,
but the locking accesses look like subroutine calls, have subroutine
calls in their internals, and take longer than simple subroutine calls
in the SMP case even when they don't make subroutine calls.  (IIRC, on
A64 a minimal subroutine call takes 4 cycles while a minimal locked
instruction takes 18 cycles; subroutine calls are only slow when their
branches are mispredicted.)

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Float problem running i386 binary on amd64

2007-11-17 Thread Bruce Evans

On Fri, 16 Nov 2007, Peter Jeremy wrote:


I've Cc'd bde@ because this relates to the FPU initialisation - which
he is the expert on.

On Thu, Nov 15, 2007 at 12:54:29PM +, Pete French wrote:

On Fri, Nov 02, 2007 at 10:04:48PM +, Pete French wrote:

int
main(int argc, char *argv[])
{
	if (atof("3.2") == atof("3.200"))
		puts("They are equal");
	else
		puts("They are NOT equal!");
	return 0;
}


Since the program as defined above does not include any prototype for
atof(), its return value is assumed to be int.  The i386 code for the
comparison is therefore:


Sorry, I didn't bother sticking the include lines in when I sent it
to the mailing list as I assumed it would be ovious that you need
to include the prototypes!


OK, sorry for the confusion.


Interestingly, if you recode like this:

	double x = atof("3.2");
	double y = atof("3.200");
	if (x == y)
		puts("They are equal");
	else
		puts("They are NOT equal!");

Then the problem goes away! Glancing at the assembly code they both appear to
be doing the same thing as regards the comparison.


Glance more closely.

Behaviour like this should be expected on i386 but not on amd64.  It
gives the well-known property of the sin() function, that sin(x) != sin(x)
for almost all x (!).  It happens because expressions _may_ be evaluated
in extra precision (this is perfectly standard), so identical expressions 
may sometimes be evaluated in different precisions even, or especially,
if they are on the same line.  atof(s) and sin(x) are expressions, so
they may or may not be evaluated in extra precision.  Certainly they
may be evaluated in extra precision internally.  Then when they return
a result, C99 doesn't require discarding any extra precision.  (It only
requires a conversion if the type of the expression being returned is
different from the return type.  Then it requires a conversion as if by
assignment, and such conversions _are_ required to discard any extra
precision.  This gives the bizarre behaviour that, if a function returning
double uses long double internally until the return statement so as to
get extra precision, then it can only return double precision, since the
return statement discards the extra precision, while if it uses double
precision internally then it may return extra precision and the extra
bits may even be correct.)

The actual behaviour depends on implementation details and bugs.
Programmers are supposed to be get almost deterministic behaviour (with
no _may_'s) by using casts or assignments to discard any extra precision.
E.g., in functions that are declared as double, to actually return
only double precision, use return ((double)(x + y)) instead of return
(x + y), or assign the result to a double (maybe x += y; return (x);).
However, this is completely broken for gcc on i386's. For gcc on i386's,
casts and assignments _may_ actually work as required by C99.  The
-ffloat-store hack is often recommended for fixing problems in this
area, but it only works for assignments; casts remain broken, and the
results of expressions remain unpredictable and dependent on the
optimization level because intermediate values _may_ retain extra
precision depending on whether they are spilled to memory and perhaps
on other things (spilling certainly removes extra precision).  This
has been intentionally broken for about 20 years now.  It is hard to
fix without pessimizing almost everything in much the same way as
-ffloat-store.  The pessimization is larger than it was 20 years ago
since memory is relatively slower (though the stores now normally go
to L1 caches which are very fast, they add a relatively large amount
to pipeline latency) and register allocation is better.  It is hard
to write code that avoids the pessimization, since only code that uses
very long expressions with no assignments to even register variables
can avoid the stores.  (Store+load to discard the extra precision is
another implementation detail.  It is the fastest way, even if a value
with extra precision is in a register.)

To work around the gcc bugs, something like *(volatile double *)&x
must be used to reduce double x; to actually be a double.

The actual behaviour is fairly easy to describe for (f(x) == f(x)):

amd64:
if f() returns float, then the value is returned in the low
quarter of an XMM register, so extra precision is automatically
discarded and the results are equal except in exceptional cases
(if f(x) is a NaN or varies due to internals in the function).
Assignment of the result(s) to variables of any type work
correctly and don't change the values since float is the lowest
precision.

if f() returns double, similarly except the value is returned in
the low half of an XMM register, and assignment of the result(s)
to variable(s) of type float would work correctly and 

Re: Float problem running i386 binary on amd64

2007-11-16 Thread Bruce Evans

On Sat, 17 Nov 2007, Peter Jeremy wrote:


On Sat, Nov 17, 2007 at 04:53:22AM +1100, Bruce Evans wrote:

Behaviour like this should be expected on i386 but not on amd64.  It
gives the well-known property of the sin() function, that sin(x) != sin(x)
for almost all x (!).  It happens because expressions _may_ be evaluated
in extra precision (this is perfectly standard), so identical expressions
may sometimes be evaluated in different precisions even, or especially,
if they are on the same line.


Thank you for your detailed analysis.  However, I believe you missed
the critical point (I may have removed too much reference to the
actual problem that Pete French saw): I can take a program that was
statically compiled on FreeBSD/i386, run it in legacy (i386) mode on
FreeBSD-6.3/amd64 and get different results.

Another (admittedly contrived) example:
...


Ah, that explains it.  This was also a longstanding bug in the Linux
emulator.  linux_setregs() wasn't fixed to use the Linux npx control
word until relatively recently (2005).  Linux libraries used to set
the control word in the C library (crt), which I think is the right
place to initialize it since the correct initialization may depend on
the language, so the bug wasn't so obvious at first.


This is identical code being executed in supposedly equivalent
environments giving different results.

I believe the fix is to initialise the FPU using __INITIAL_NPXCW__ in
ia32_setregs(), though I'm not sure how difficult this is in reality.


Yes, that is the right fix.  It is moderately difficult to do correctly.
linux_setregs() now just uses fldcw(control) where control =
__LINUX_NPXCW__.  This depends on bugs to work, since direct accesses
to the FPU in the kernel are not supported.  They cause a DNA trap
which should be fatal.  amd64 is supposed to print a message about
this error, but it apparently doesn't else log files would be fuller.
i386 doesn't even print a message.  npxdna() and fpudna() check related
invariants but not this one.

Correct code would do something like {fpu,npx}init(control) to
initialize the control word.  setregs() in RELENG_[1-4] does exactly
that -- npxinit() hides the complications.  Now {fpu,npx}init() is
only called once or twice at boot time for each CPU, and the complications
are a little larger since most initialization is delayed until the DNA
trap ({fpu,npx}init() now mainly sets up a copy of the initial FPU
state in memory for the trap handler to load later, and it cannot set
up per-thread state since the copy in memory is a global default).

The complications for delayed initialization are mainly to optimize
switching of the FPU state for signal handling, but are also used for
exec.  Another complication here is that signal handlers should be
given the default control word.  This is much more broken than for
setregs:
- there are sysent hooks for sendsig and sigreturn, but none for setting
  registers in sendsig.
- all FreeBSD sendsig's end up using the global default initial FPU state
  (if they support switching the FPU state at all).
- all Linux sendsig's are missing support for switching the FPU state.
- suppose that the initial FPU (or even CPU) state is language-dependent
  and this is implemented mainly in the language runtime startup.
  sendsig's would have a hard time determining the languages' defaults
  so as to set them.  The languages would need to set the defaults in
  signal trampolines.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: em watchdogs - OS involvement

2007-10-31 Thread Bruce Evans

On Tue, 30 Oct 2007, Jack Vogel wrote:


Another bit of data, if I define DEVICE_POLLING on the Oct. snap it
also will work.


Defining DEVICE_POLLING (globally) breaks configuration of fast
interrupt handlers in em.  I have to #undef it to test fast interrupt
handlers in em without losing testing of polling in other network
drivers.  I lose only testing of polling in em.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: [ANN] 8-CURRENT, RELENG_7 and RELENG_6 have gotten latest ?unionfs improvements

2007-10-25 Thread Bruce Evans

On Wed, 24 Oct 2007, Oliver Fromme wrote:


Dmitry Marakasov [EMAIL PROTECTED] wrote:
 I was told long time ago that -ounion is even
 more broken than unionfs.

That's wrong.  The union mount option was _never_ really
broken.  I'm using it for almost as long as FreeBSD exists.


I recently noticed the following bugs in -ounion (which I've
never used for anything except testing):

(1) It is broken for all file systems except ffs and ext2fs, since
all (?) file systems now use nmount(2) and only these two file
systems have union in their mount options list.  It is still in
the global options list in mount/mntopts.h, but this is only used
with mount(2).  The global options list in mount/mntopts.h has
many bogus non-global options, and even the global options list
in kern/vfs_mount.c has some bogus non-global options, but union
actually is a global option.  ext2fs loves union more than
ffs -- although its options list is less disordered than ffs's,
it has enough disorder to have 2 copies of union.
(2) After fixing (1) by not using nmount(2), following of symlinks works
strangely for at least devfs:
(a) a link foo -> zero (where zero doesn't exist in the underlying
file system) doesn't work.  mount(1) says that the lookup is
done in the mounted file system first.
(b) a link foo -> ./zero works.  This is correct.  Now I wonder
if it would work if zero existed only in the underlying file
system.

Have you noticed these bugs?  (2) is presumably old.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: rm(1) bug, possibly serious

2007-09-26 Thread Bruce Evans

On Tue, 25 Sep 2007, LI Xin wrote:


I think this is a bug, here is a fix obtained from NetBSD.


This bug, if any, cannot be fixed in rm.


The reasoning (from NetBSD's rm.c,v 1.16):


Bugs can easily be added to rm.


Strip trailing slashes of operands in checkdot().

POSIX.2 requires that if . or .. are specified as the basename
portion of an operand, a diagnostic message be written to standard
error, etc.


Note that POSIX only requires this for the rm utility.  (See my previous
mail about why this is bogus.)  Pathname resolution and a similarly
bogus restriction on rmdir(2) requires some operations with dot or
dot-dot to fail, and any utility that uses these operations should
then print a diagnostic, etc.


We strip the slashes because POSIX.2 defines basename
as the final portion of a pathname after trailing slashes have been
removed.


POSIX says the basename portion of the operand (that is, the final
pathname component.  This doesn't mean the operand mangled by
basename(3).


This also makes rm perform actions equivalent to the POSIX.1
rmdir() and unlink() functions when removing directories and files,
even when they do not follow POSIX.1's pathname resolution semantics
(which require trailing slashes be ignored).


Which POSIX.1?  POSIX.1-2001 actually requires trailing slashes "shall
be resolved as if a single dot character were appended to the pathname".
This is completely different from removing the slash:

rm regular file/# ENOTDIR
rm regular file # success unless ENOENT etc.
rm directory/   # success...
rm directory# EISDIR
rm symlink to regular file/ # ENOTDIR
rm symlink to regular file  # success (removes symlink)
rm symlink to directory/# EISDIR
rm symlink to directory # success (removes symlink)
rmdir ...   # reverse most of above

Anyway, mangling the operands makes the utilities perform actions different
from the functions.

The problem case is rm -r symlink to directory/ which asks for
removing the directory pointed to by the symlink and all its contents,
and is useful -- you type the trailing symlink if you want to ensure
that the removal is as recursive as possible.  With breakage of rmdir(2)
to POSIX spec, this gives removal of the contents of the directory pointed
to by the symlink and then fails to remove the directory.  With breakage
as in NetBSD, this gives removal of the symlink only.


If nobody complains about this I will request for commit approval from [EMAIL PROTECTED]


++

Bruce

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Panic in 6.2-PRERELEASE with bge on amd64

2007-01-12 Thread Bruce Evans

On Wed, 10 Jan 2007, Sven Willenberger wrote:


Bruce Evans presumably uttered the following on 01/09/07 21:42:

Also look at nearby chain entries (especially at (rxidx - 1) mod 512).
I think the previous 255 entries and the rxidx one should be
non-NULL since we should have refilled them as we used them (so the
one at rxidx is least interesting since we certainly just refilled
it), and the next 256 entries should be NULL since we bogusly only use
half of the entries.  If the problem is uninitialization, then I expect
all 512 entries except the one just refilled at rxidx to be NULL.



(kgdb) p sc->bge_cdata.bge_rx_std_chain[rxidx]
$1 = (struct mbuf *) 0xff0097a27900
(kgdb) p rxidx
$2 = 499

since rxidx = 499, I assume you are most interested in 498:
(kgdb) p sc->bge_cdata.bge_rx_std_chain[498]
$3 = (struct mbuf *) 0xff00cf1b3100

for the sake of argument, 500 is null:
(kgdb) p sc->bge_cdata.bge_rx_std_chain[500]
$13 = (struct mbuf *) 0x0

the indexes with values basically are 243 through 499:
(kgdb) p sc->bge_cdata.bge_rx_std_chain[241]
$30 = (struct mbuf *) 0x0
(kgdb) p sc->bge_cdata.bge_rx_std_chain[242]
$31 = (struct mbuf *) 0x0
(kgdb) p sc->bge_cdata.bge_rx_std_chain[243]
$32 = (struct mbuf *) 0xff005d4ab700
(kgdb) p sc->bge_cdata.bge_rx_std_chain[244]
$33 = (struct mbuf *) 0xff004f644b00

so it does not seem to be a problem with uninitialization.


There are supposed to be only 256 nonzero entries (except briefly while
one is being refreshed), but the above indicates that there 257: #243
through #499 gives 257 nonzero entries.  Everything indicates that
entry #499 was null before it was refreshed, and that the loop in
bge_rxeof() is trying to process a descriptor 1 after the last valid
(previously handled) descriptor.  I cannot see why it might do this.
The next step might be to add active debugging code:
- check that m != NULL when m is taken off the rx chain (before refreshing
  its entry), and panic if it is.
- check that there are always BGE_SSLOTS (256) nonzero mbufs in the std
  rx chain.  It would be interesting to know if they are always contiguous.
  They might not be since this depends on how the hardware uses them.
  Debugging is simpler if they are.
- check that bge_rxeof() is not reentered.
- check the rx producer index and related data before and after getting
  a null m.  It can easily change while bge_rxeof() is running, so
  recording its value before and after might be useful.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Panic in 6.2-PRERELEASE with bge on amd64

2007-01-09 Thread Bruce Evans

On Tue, 9 Jan 2007, John Baldwin wrote:


On Tuesday 09 January 2007 09:37, Sven Willenberger wrote:

On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote:

Oops.  I should have asked for the statment in bge_rxeof().


#7  0x801d5f17 in bge_rxeof (sc=0x8836b000)

at /usr/src/sys/dev/bge/if_bge.c:2528

2528			m->m_pkthdr.len = m->m_len = cur_rx->bge_len -
			    ETHER_CRC_LEN;


(where m is defined as:
2449 struct mbuf *m = NULL;
)


It's assigned earlier in between those two places.


Its initialization here is just a style bug.


Can you 'p rxidx' as well
as 'p sc->bge_cdata.bge_rx_std_chain[rxidx]' and 'p
sc->bge_cdata.bge_rx_jumbo_chain[rxidx]'?  Also, are you using jumbo frames
at all?


Also look at nearby chain entries (especially at (rxidx - 1) mod 512).
I think the previous 255 entries and the rxidx one should be
non-NULL since we should have refilled them as we used them (so the
one at rxidx is least interesting since we certainly just refilled
it), and the next 256 entries should be NULL since we bogusly only use
half of the entries.  If the problem is uninitialization, then I expect
all 512 entries except the one just refilled at rxidx to be NULL.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Panic in 6.2-PRERELEASE with bge on amd64

2007-01-08 Thread Bruce Evans

On Mon, 8 Jan 2007, Sven Willenberger wrote:


On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:

On Sun, 7 Jan 2007, Sven Willenberger wrote:



The short and dirty of the dump:
...
--- trap 0xc, rip = 0x801d5f17, rsp = 0xb371ab50, rbp = 
0xb371aba0 ---
bge_rxeof() at bge_rxeof+0x3b7


What is the instruction here?


I will do my best to ferret out the information you need. For the
bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:

0x801d5f17 <bge_rxeof+951>: mov    %r15,0x28(%r14)
...

Looks like a null pointer panic anyway.  I guess the instruction is
movl to/from 0x28(%reg) where %reg is a null pointer.



from the above lines, apparently %r14 is null then.


Yes.  It's a bit surprising that the access is a write.


...
#8  0x801db818 in bge_intr (xsc=0x0) at 
/usr/src/sys/dev/bge/if_bge.c:2707


What is the statement here?  It presumably follows a null pointer and only
the expression for the pointer is interesting.  xsc is already null but
that is probably a bug in gdb, or the result of excessive optimization.
Compiling kernels with -O2 has little effect except to break debugging.


the block of code from if_bge.c:

  2705 if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
  2706 /* Check RX return ring producer/consumer. */
  2707 bge_rxeof(sc);
  2708
  2709 /* Check TX ring producer/consumer. */
  2710 bge_txeof(sc);
  2711 }


Oops.  I should have asked for the statment in bge_rxeof().


By default -O2 is passed to CC (I don't use any custom make flags other
than and only define CPUTYPE in my /etc/make.conf).


-O2 is unfortunately the default for COPTFLAGS for most arches in
sys/conf/kern.pre.mk.  All of my machines and most FreeBSD cluster
machines override this default in /etc/make.conf.

With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
so your environment must be a little different to get even the above
info.  I think gdb can show the correct line numbers but not the call
frames (since there is no call).  ddb and the kernel stack trace can
only show the call frames for actual calls.

With -O1, I couldn't find any instruction similar to the mov to the
null pointer + 28.  28 is a popular offset in mbufs.


The short of it is that this interface sees pretty much non-stop traffic
as this is a mailserver (final destination) and is constantly being
delivered to (direct disk access) and mail being retrieved (remote
machine(s) with nfs mounted mail spools). If a momentary down of the
interface is enough to completely panic the driver and then the kernel,
this hardly seems robust if, in fact, this is what is happening. So
the question arises as to what would be causing the down/up of the
interface; I could start looking at the cable, the switch it's connected
to and ... any other ideas? (I don't have watchdog enabled or anything
like that, for example).


I don't think down/up can occur in normal operation, since it takes ioctls
or a watchdog timeout to do it.  Maybe some ioctls other than a full
down/up can cause problems... bge_init() is called for the following
ioctls:
- mtu changes
- some near down/up (possibly only these)
Suspend/resume and of course detach/attach do much the same things as
down/up.

BTW, I added some sysctls and found it annoying to have to do down/up
to make the sysctls take effect.  Sysctls in several other NIC drivers
require the same, since doing a full reinitialization is easiest.
Since I am tuning using sysctls, I got used to doing down/up too much.

Similarly for the mtu ioctl.  I think a full reinitialization is used
for mtu changes mainly in cases the change switches on/off support for
jumbo buffers.  Then there is a lot of buffer reallocation to be
done, and interfaces have to be stopped to ensure that the buffers
being deallocated are not in use, etc.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Panic in 6.2-PRERELEASE with bge on amd64

2007-01-07 Thread Bruce Evans

On Sun, 7 Jan 2007, Sven Willenberger wrote:


I am starting a new thread on this as what I had assumed was a panic in
nfsd turns out to be an issue with the bge driver. This is an amd64 box,
dual processor (SMP kernel) that happens to be running nfsd. About every
3-5 days the kernel panics and I have finally managed to get a core
dump.
The system: FreeBSD 6.2-PRERELEASE #8: Tue Jan  2 10:57:39 EST 2007


Like most NIC drivers, bge unlocks and re-locks around its call to
ether_input() in its interrupt handler.  This isn't very safe, and it
certainly causes panics for bge.  I often see it panic when bringing
the interface down and up while input is arriving, on a non-SMP non-amd64
(actually i386) non-6.x (actually -current) system.  Bringing the
interface down is probably the worst case.  It creates a null pointer
for bge_intr() to follow.


The short and dirty of the dump:
...
--- trap 0xc, rip = 0x801d5f17, rsp = 0xb371ab50, rbp = 
0xb371aba0 ---
bge_rxeof() at bge_rxeof+0x3b7


What is the instruction here?


bge_intr() at bge_intr+0x1c8
ithread_loop() at ithread_loop+0x14c
fork_exit() at fork_exit+0xbb
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xb371ad00, rbp = 0 ---



Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x28


Looks like a null pointer panic anyway.  I guess the instruction is
movl to/from 0x28(%reg) where %reg is a null pointer.


...
#8  0x801db818 in bge_intr (xsc=0x0) at 
/usr/src/sys/dev/bge/if_bge.c:2707


What is the statement here?  It presumably follows a null pointer and only
the expression for the pointer is interesting.  xsc is already null but
that is probably a bug in gdb, or the result of excessive optimization.
Compiling kernels with -O2 has little effect except to break debugging.

I rarely use gdb on kernels and haven't looked closely enough using ddb
to see where the null pointer for the panic on down/up came from.

BTW, the sbdrop panic in -current isn't bge-only or SMP-only.  I saw
it once for sk on a non-SMP system.  It rarely happens for non-SMP
(much more rarely than the panic in bge_intr()).  Under -current, on
an SMP amd64 system with bge, it happens almost every time on close
of the socket for a ttcp server if input is arriving at the time of
the close.  I haven't seen it for 6.x.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: kqueue LOR

2006-12-12 Thread Bruce Evans

On Tue, 12 Dec 2006, Kostik Belousov wrote:


On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote:



Is the mount lock really required, if all we're doing is a single read of a
single word (mnt_kern_flags) (v_mount should be read-only for the whole
lifetime of the vnode, I believe)? After all, reads of a single word are
atomic on all our supported architectures.
The only situation I see where there MIGHT be problems are forced unmounts,
but I think there are bigger issues with those.
Sorry for noticing this email only now.


The problem is real with snapshotting. Ignoring
MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of
mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode
inactivation time. This was the big trouble with nfsd and snapshots. As
such, I think that the precise value of mnt_kern_flag is critical there,
and the mount interlock is needed.


Locking for just read is almost always bogus, but here (as in most
cases) there is also a write based on the contents of the flag, and
the lock is held across the write.


Practically speaking, I agree with claim that reading of m_k_f is
surrounded by enough locked operations that would make sure that
the read value is not stale. But there is no such guarantee on
future/non-i386 arches, is there?


I think not-very-staleness is implied by acquire/release semantics
which are part of the API for most atomic operations.  This behaviour
doesn't seem to be documented for mutexes, but I don't see how mutexes
could work without it (they have to synchronize all memory accesses,
not just the memory accessed by the lock).


As a side note, mount interlock scope could be reduced there.

Index: ufs/ufs/ufs_vnops.c
===
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.283
diff -u -r1.283 ufs_vnops.c
--- ufs/ufs/ufs_vnops.c 6 Nov 2006 13:42:09 -   1.283
+++ ufs/ufs/ufs_vnops.c 12 Dec 2006 10:18:04 -
@@ -133,19 +134,19 @@
{
struct inode *ip;
struct timespec ts;
-   int mnt_locked;

ip = VTOI(vp);
-   mnt_locked = 0;
	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) {
		VI_LOCK(vp);
		goto out;
	}
	MNT_ILOCK(vp->v_mount);	/* For reading of mnt_kern_flags. */
-	mnt_locked = 1;
	VI_LOCK(vp);
-	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
-		goto out_unl;
+	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) {
+		MNT_IUNLOCK(vp->v_mount);
+		VI_UNLOCK(vp);
+		return;
+	}


The version that depends on not-very-staleness would test the flags
without acquiring the lock(s) and return immediately in the usual case
where none of the flags are set.  It would have to acquire the locks
and repeat the test to make changes (and the test is already repeated
one flag at a time).  I think this would be correct enough, but still
inefficient and/or even messier.  The current organization is usually:

acquire vnode interlock in caller
release vnode interlock in caller to avoid messes here (inefficient)
call
acquire mount interlock
acquire vnode interlock
test the flags; goto cleanup code if none set (usual case)
do the work
release vnode interlock
release mount interlock
return
acquire vnode interlock (if needed)
release vnode interlock (if needed)

and it might become:

acquire vnode interlock in caller
call
test the flags; return if none set (usual case)
release vnode interlock // check that callers are aware of this
acquire mount interlock
acquire vnode interlock
do the work
// Assume no LOR problem for release, as below.
// Otherwise need another release+acquire of vnode interlock.
release mount interlock
return
release vnode interlock



	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))
		ip->i_flag |= IN_LAZYMOD;
@@ -155,6 +156,7 @@
		ip->i_flag |= IN_MODIFIED;
	else if (ip->i_flag & IN_ACCESS)
		ip->i_flag |= IN_LAZYACCESS;
+	MNT_IUNLOCK(vp->v_mount);
	vfs_timestamp(&ts);
	if (ip->i_flag & IN_ACCESS) {
		DIP_SET(ip, i_atime, ts.tv_sec);


Is there no LOR problem for release?

As I understand it, MNT_ILOCK() is only protecting IN_ACCESS being
converted to IN_MODIFED, so after this conversion is done the lock
is not needed.  Is this correct?


@@ -172,10 +174,7 @@

 out:
	ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
- out_unl:
	VI_UNLOCK(vp);
-	if (mnt_locked)
-		MNT_IUNLOCK(vp->v_mount);
}

/*



BTW, vfs.lookup_shared defaults to 0 and decides shared access for all
operations including read, so I wonder if there are [m]any bugs
preventing shared accesses 

Re: kqueue LOR

2006-12-12 Thread Bruce Evans

On Tue, 12 Dec 2006, John Baldwin wrote:


On Tuesday 12 December 2006 13:43, Suleiman Souhlal wrote:



Why is memory barrier usage not encouraged? As you said, they can be used to
reduce the number of atomic (LOCKed) operations, in some cases.
...
Admittedly, they are harder to use than atomic operations, but it might
still worth having something similar.


How would MI code know when using memory barriers is good?  This is already
hard to know for atomic ops -- if there would more than a couple of atomic
ops then it is probably better to use 1 mutex lock/unlock and no atomic
ops, since this reduces the total number of atomic ops in most cases, but
it is hard for MI code to know how many a couple is.  (This also depends
on the SMP option -- without SMP, locking is automatic so atomic ops are
very fast but mutexes are still slow since they do a lot more than an
atomic op.)


Memory barriers just specify ordering, they don't ensure a cache flush so
another CPU reads up to date values.  You can use memory barriers in
conjunction with atomic operations on a variable to ensure that you can
safely read other variables (which is what locks do).  For example, in this


I thought that the acquire/release variants of atomic ops guarantee
this.  They seem to be documented to do this, while mutexes don't seem
to be documented to do this.  The MI (?) implementation of mutexes
depends on atomic_cmpset_{acq,rel}_ptr() doing this.

Bruce


Re: Still possible to directly boot without loader?

2006-11-01 Thread Bruce Evans

On Mon, 30 Oct 2006, John Baldwin wrote:


On Thursday 26 October 2006 15:54, Ruslan Ermilov wrote:

On Thu, Oct 26, 2006 at 03:42:34PM -0400, John Baldwin wrote:

On Thursday 26 October 2006 15:18, Ruslan Ermilov wrote:

On Thu, Oct 26, 2006 at 11:38:24AM -0400, John Baldwin wrote:

Sorry, I meant that both boot2 and loader should follow your proposal of

masking 28 bits.

Just masking the top 4 bits is probably sufficient.


:-)

OK, I'll craft a patch tomorrow.  This will also require patching at least
sys/boot/common/load_elf.c:__elfN(loadimage), maybe something else.
I think we could actually mask 30 bits; that would allow loading 1G kernels,
provided that sufficient memory exists.


Actually, please mask 4 bits.  Not all kernels run at 0xc0000000.  You can
adjust that address via 'options KVA_PAGES'.  I know of folks who run kernels
at 0xa0000000 for example because they need more KVA.  This is part of why I


They can probably use 0x80000000, but it's not obvious how to get exactly
that from KVA_PAGES.


really don't like the masking part, though I'm not sure there's a way to
figure out KERNBASE well enough to do the more correct 'pa = addr - KERNBASE'
rather than 'pa = addr & 0x0fffffff'.


The masking hack is probably only needed for aout.  For elf,
objdump -h /kernel says:

% Sections:
% Idx Name  Size  VMA   LMA   File off  Algn
% ...
%   CONTENTS, ALLOC, LOAD, READONLY, DATA
%   4 .text 002853e0  c043b510  c043b510  0003b510  2**4

so KERNBASE = LMA - File off for at least this kernel.  boot2 now
loads the text section from file offset File off to address LMA(masked).
I think it just needs to load at an address that is the same mod
PAGE_SIZE as LMA or VMA (these must agree mod PAGE_SIZE), provided it
adjusts the entry address to match.

Bruce


Re: Still possible to directly boot without loader?

2006-10-26 Thread Bruce Evans

On Thu, 26 Oct 2006, Ruslan Ermilov wrote:


On Mon, Sep 11, 2006 at 01:09:15PM -0500, Brooks Davis wrote:

On Sun, Sep 10, 2006 at 09:10:26PM +0200, Stefan Bethke wrote:

I just tried to load my standard kernel from the boot blocks (instead
of using loader(8)), but I either get a hang before the kernel prints
anything, or a BTX halted.  Is this still supposed to work in 6-
stable, or has it finally disappeared?


You may be able to get this to work, but it is unsupported.


I normally use it (with a different 1-stage boot loader) for kernels
between ~4.10 and -current.  I only boot RELENG_4 kernels for running
benchmarks and don't bother applying my old fix for missing static
symbols there.  See another PR for the problem and patch.  In newer
kernels and userlands, starting some time in 5.0-CURRENT, sysutil
programs use sysctls for live kernels so they aren't affected by missing
static symbols.


I've been investigating this today.  Here's what I've found:

1)  You need hints statically compiled into your kernel.
   (This has been a long time requirement.)


Even though I normally use it, I once got very confused by this.
Everything except GENERIC booted right (with boot loaders missing
the bug in (3)).  This is because GENERIC has had hints commented
out since rev.1.272, and GENERIC also has no acpi (it's not very
GENERIC).  When there are no hints, except on very old systems, most
things except isa devices work, but at least without acpi, console
drivers on i386's are on isa so it is hard to see if things work.
Hints are probably also needed for ata.  I think a diskless machine
with no consoles and pci NICs would just work.


2)  You can only do it on i386, because boot2 only knows
   about ELF32, so attempts to load ELF64 amd64 kernels
   will fail.  (loader(8) knows about both ELF32/64.)


I haven't got around to fixing this.


3)  It's currently broken even on i386; backing out
   rev. 1.71 of boot2.c by jhb@ fixes this for me.

: revision 1.71
: date: 2004/09/18 02:07:00;  author: jhb;  state: Exp;  lines: +3 -3
: A long, long time ago in a CVS branch far away (specifically, HEAD prior
: to 4.0 and RELENG_3), the BTX mini-kernel used paging rather than flat
: mode and clients were limited to a virtual address space of 16 megabytes.
: Because of this limitation, boot2 silently masked all physical addresses
: in any binaries it loaded so that they were always loaded into the first
: 16 Meg.  Since BTX no longer has this limitation (and hasn't for a long
: time), remove the masking from boot2.  This allows boot2 to load kernels
: larger than about 12 to 14 meg (12 for non-PAE, 14 for PAE).
:
: Submitted by:   Sergey Lyubka devnull at uptsoft dot com
: MFC after:  1 month


The kernel is linked at 0xc0000000 but loaded in low memory, so the high
bits must be masked off like they used to be for the kernel to boot at all.
This has nothing to do with paging AFAIK.  Rev.1.71 makes no sense, since
BTX isn't large, and large kernels are more unbootable than before with
1.71.

There is another PR about this.

4) Another rev. broke support for booting with -c and -d to save 4 bytes.
-c is useful for RELENG_6 and -d is essential for debugging.  If you
always use loader(8) then you would only notice this if you try to set
these flags in boot2.

Bruce


Re: em network issues

2006-10-18 Thread Bruce Evans

On Wed, 18 Oct 2006, Kris Kennaway wrote:


I have been working with someone's system that has em shared with fxp,
and a simple fetch over the em (e.g. of a 10 GB file of zeroes) is
enough to produce watchdog timeouts after a few seconds.

As previously mentioned, changing the INTR_FAST to INTR_MPSAFE in the
driver avoids this problem.  However, others are seeing sporadic
watchdog timeouts at higher system load on non-shared em systems too.


em_intr_fast() has no locking whatsoever.  I would be very surprised
if it even seemed to work for SMP.  For UP, masking of CPU interrupts
(as is automatic in fast interrupt handlers) might provide sufficient
locking, but for many drivers with fast interrupt handlers, whatever
locking is used by the fast interrupt handler must be used all over
the driver to protect data structures that are or might be accessed by
the fast interrupt handler.  That means lots of intr_disable/enable()s
if the UP case is micro-optimized and lots of mtx_lock/unlock_spin()s
for the general case.  But em has no references to spinlocks or CPU
interrupt disabling.

em_intr() starts with EM_LOCK(), so it isn't obviously broken near its
first statement.

Very few operations are valid in fast interrupt handlers.  Locking
and fastness must be considered for every operation, not only in
the interrupt handler but in all data structures shared by the
interrupt handler.  For just the interrupt handler in em:

% static void
% em_intr_fast(void *arg)
% {
%   struct adapter  *adapter = arg;

This is safe because it has no side effects and doesn't take long.

%   struct ifnet*ifp;
%   uint32_treg_icr;
% 
% 	ifp = adapter->ifp;

%

This is safe provided other parts of the driver ensure that the interrupt
handler is not reached after adapter->ifp goes away.  Similarly for other
long-lived almost-const parts of *adapter.

%   reg_icr = E1000_READ_REG(&adapter->hw, ICR);
%

This is safe provided reading the register doesn't change it.

%   /* Hot eject?  */
%   if (reg_icr == 0xffffffff)
%   return;
% 
% 	/* Definitely not our interrupt.  */

%   if (reg_icr == 0x0)
%   return;
%

These are safe since we don't do anything with the result.

%   /*
%* Starting with the 82571 chip, bit 31 should be used to
%* determine whether the interrupt belongs to us.
%*/
%   if (adapter->hw.mac_type >= em_82571 &&
%       (reg_icr & E1000_ICR_INT_ASSERTED) == 0)
%   return;
%

This is safe, as above.

%   /*
%* Mask interrupts until the taskqueue is finished running.  This is
%* cheap, just assume that it is needed.  This also works around the
%* MSI message reordering errata on certain systems.
%*/
%   em_disable_intr(adapter);

Now that we start doing things, we have various races.

The above races to disable interrupts with other entries to this interrupt
handler, and may race with other parts of the driver.

After we disable driver interrupts, there should be no more races with
other entries to this handler.  However, reg_icr may be stale at this
point even if we handled the race correctly.  The other entries may have
partly or completely handled the interrupt when we get back here.  We
should have locked just before here, and then, if the lock blocked waiting
for the other entries (which can only happen in the SMP case), we should
reread the status register to see if we still have anything to do, or
more importantly to see what we have to do now (extra scheduling of the
SWI handler would just waste time, but a missed scheduling would break
things).

%   taskqueue_enqueue(adapter->tq, &adapter->rxtx_task);
%

Safe provided the API is correctly implemented.  (AFAIK, the API only
has huge design errors.)

%   /* Link status change */
%   if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))
%   taskqueue_enqueue(taskqueue_fast, &adapter->link_task);
%

As above, plus might miss this call if the status changed underneath us.

%   if (reg_icr & E1000_ICR_RXO)
%   adapter->rx_overruns++;

Race updating the counter.

Generally, fast interrupt handlers should avoid book-keeping like this,
since correct locking for it would poison large parts of the driver
with the locking required for the fast interrupt handler.  Perhaps
similarly for important things.  It's safe to read the status register
provided reading it doesn't change it.  Then it is safe to schedule
tasks based on the contents of the register provided we don't do
anything else and schedule enough tasks.  But don't disable interrupts
-- leave that to the task and make the task do nothing if it handled
everything for a previous scheduling.  This would result in the task
usually being scheduled when the interrupt is for us but not if it is
for another device.  The above doesn't try to do much more than this.
However, a fast interrupt handler needs to handle the usual case to
be worth having except on systems 

Re: [fbsd] HEADS UP: FreeBSD 5.3, 5.4, 6.0 EoLs coming soon

2006-10-11 Thread Bruce Evans

On Wed, 11 Oct 2006, Dmitry Pryanishnikov wrote:


On Wed, 11 Oct 2006, Jeremie Le Hen wrote:

...
Is it envisageable to extend the RELENG_4's and RELENG_4_11's EoL once
more ?


 Yes, I'm also voting for it. This support may be limited to
remote-exploitable vulnerabilities only, but I'm sure there are many old
slow routers for which the RELENG_4 -> 6 transition still hurts performance.
RELENG_4 is the last stable pre-SMPng branch, and (see my spring letters,
Subject: RELENG_4 -> 5 -> 6: significant performance regression) the
_very_ significant UP performance loss (which occurred in the RELENG_4 -> 5
transition) still isn't reclaimed. So I think it would be wise to extend
{ RELENG_4 / RELENG_4_11 / both } [may be limited] support.


I hesitate to do anything to kill RELENG_4, but recently spent a few
days figuring out why the performance for building kernels over nfs
dropped by much more than for building kernels on local disks between
RELENG_4 and -current.  The most interesting loss (one not very specific
to kernels) is that changes on 6 or 7 Dec 2004 resulted in open/close
of an nfs file generating twice as much network traffic (2 instead of
1 Access RPCs per open) and thus being almost twice as slow for files
that are otherwise locally cached.  This combined with not very low
network latency gives amazingly large losses of performance for things
like make depend and cvs checkouts where 1 RPC per open already made
things very slow.

Bruce


Re: missing fpresetsticky in ieeefp.h

2006-02-03 Thread Bruce Evans

On Thu, 2 Feb 2006, O. Hartmann wrote:


Bruce Evans schrieb:

On Thu, 2 Feb 2006, O. Hartmann wrote:
...
Now take a look into machine/ieeefp.h, where this function should be 
declared. Nothing, I can not find this routine, it seems to be 'not 
available' on my FreeBSD6.1-PRERELEASE AMD64 (no 32Bit compatibility).


It was removed for amd64 and never existed for some other arches.  It was

   [fpresetsticky()]

apparently unused when it was removed a year ago.
...
% RCS file: /home/ncvs/src/sys/amd64/include/ieeefp.h,v
...
% revision 1.13

...
Thanks a lot. In prior software compilations of GMT on FBSD/AMD64 I commented
out the appropriate line in gmt_init.c without any hazardous effects - but I
never used GMT intensively enough to have ever recognized any malicious side
effects.


I should contact the guys from Soest/Hawaii asking them about any serious
effects of commenting out this line on amd64 architectures.


I think it is probably used only for error detection, if at all.
Accumulated IEEE exceptions are supposed to be read using fpgetsticky()
and then cleared using fp[re]setsticky() so that the next set accumulated
can be distinguished from the old set.  Applications should now use
fesetexceptflag() instead of fp[re]setsticky().

BTW, the most useful fp* functions other than fp[re]setsticky(), namely
fp{get,set}round(), never worked on ia64 due to the rounding flags
values being misspelled, so there are unlikely to be any portable uses
of the fp* functions in ports.  The corresponding fe{get,set}round()
functions work on at least i386, amd64 and ia64.

Bruce


Re: missing fpresetsticky in ieeefp.h

2006-02-02 Thread Bruce Evans

On Thu, 2 Feb 2006, O. Hartmann wrote:


O. Hartmann schrieb:

Hello.
I do not know whether this should be a bug report or not, I will ask prior 
to any further action.


Reading 'man fpresetsticky' shows a man page for FPGETROUND(3) and tells
me the existence of the fpresetsticky routine.


This is a bug in the man page.  fpresetsticky() is supposed to only exist
on i386's, but the man page and its link to fpresetsticky.3 are installed
for all arches.

Now take a look into machine/ieeefp.h, where this function should be 
declared. Nothing, I can not find this routine, it seems to be 'not 
available' on my FreeBSD6.1-PRERELEASE AMD64 (no 32Bit compatibility).


It was removed for amd64 and never existed for some other arches.  It was
apparently unused when it was removed a year ago.

Background is, I try to compile GMT 4.1 and ran into this problem again (I 
reveal this error since FBSD 5.4-PRE also on i386).


If fpresetsticky() isn't available on amd64 anymore, it shouldn't be 
mentioned in the manpage. But it seems to me to be a bug, so somebody 
should confirm this.


% RCS file: /home/ncvs/src/sys/amd64/include/ieeefp.h,v
% Working file: ieeefp.h
% head: 1.14
% ...
% 
% revision 1.13
% date: 2005/03/15 15:53:39;  author: das;  state: Exp;  lines: +0 -20
% Remove fpsetsticky().  This was added for SysV compatibility, but due
% to mistakes from day 1, it has always had semantics inconsistent with
% SVR4 and its successors.  In particular, given argument M:
% 
% - On Solaris and FreeBSD/{alpha,sparc64}, it clobbers the old flags

%   and *sets* the new flag word to M.  (NetBSD, too?)
% - On FreeBSD/{amd64,i386}, it *clears* the flags that are specified in M
%   and leaves the remaining flags unchanged (modulo a small bug on amd64.)
% - On FreeBSD/ia64, it is not implemented.
% 
% There is no way to fix fpsetsticky() to DTRT for both old FreeBSD apps

% and apps ported from other operating systems, so the best approach
% seems to be to kill the function and fix any apps that break.  I
% couldn't find any ports that use it, and any such ports would already
% be broken on FreeBSD/ia64 and Linux anyway.
% 
% By the way, the routine has always been undocumented in FreeBSD,

% except for an MLINK to a manpage that doesn't describe it.  This
% manpage has stated since 5.3-RELEASE that the functions it describes
% are deprecated, so that must mean that functions that it is *supposed*
% to describe but doesn't are even *more* deprecated.  ;-)
% 
% Note that fpresetsticky() has been retained on FreeBSD/i386.  As far

% as I can tell, no other operating systems or ports of FreeBSD
% implement it, so there's nothing for it to be inconsistent with.
% 
% PR:		75862

% Suggested by: bde
% 

Bruce


Re: Manipulating disk cache (buf) settings

2005-05-23 Thread Bruce Evans

On Mon, 23 May 2005, John-Mark Gurney wrote:


Sven Willenberger wrote this message on Mon, May 23, 2005 at 10:58 -0400:

We are running a PostgreSQL server (8.0.3) on a dual opteron system with
8G of RAM. If I interpret top and vfs.hibufspace correctly (which show
values of 215MB and 225771520 (which equals 215MB) respectively. My
understanding from having searched the archives is that this is the
value that is used by the system/kernel in determining how much disk
data to cache.


This is incorrect...  FreeBSD merged the vm and buf systems a while back,
so all of memory is used as a disk cache..


Indeed.  Statistics utilities still haven't caught up with dyson's changes
in 1994 or 1995, so their display of statistics related to disk caching
is very misleading.  systat -v and top display vfs.bufspace but not
vfs.hibufspace.  Both of these are uninteresting.  vfs.bufspace gives the
amount of virtual memory that is currently allocated to the buffer cache.
vfs.hibufspace gives the maximum for this amount.  Virtual memory for
buffers is almost never released, so on active systems vfs.bufspace is
close to the maximum.  The maximum is just a compile-time constant
(BKVASIZE) times a boot-time constant (nbuf).

There is no way to tell from userland exactly how much of memory is used
for the vm part of the disk cache.  inact in systat -v gives a maximum.
Watch heavy file system activity for a while and you may see inact increase as
vm is used for disk data.  It decreases mainly when a file system is
unmounted.  Otherwise, it tends to stay near its maximum, with pages for
not recently used disk data being reused for something else (newer disk
data or processes).


The buf cache is still used
for filesystem meta data (and for pending writes of files, but those buf's
reference the original page, not local storage)...


This is mostly incorrect.  The buffer cache is now little more than a
window on vm.  Metadata is backed by vm except for low quality file
systems.  Directories are backed by vm unless vfs.vmiodirenable is 0
(not the default).


Just as an experiment, on a quiet system do:
dd if=/dev/zero of=somefile bs=1m count=2048
and then read it back in:
dd if=somefile of=/dev/null bs=1m
and watch systat or iostat and see if any of the file is read...  You'll
probably see that none of it is...


Also, with systat -v:
- start with inact small and watch it grow as the file is cached
- remove the file and watch inact drop.

I haven't tried this lately.  The system has some defence against using up
all of the free and inactive pages for a single file to the exclusion of
other disk data, so you might not get 2GB cached even if you have 4GB memory.


If that is in fact the case, then my question would be how to best
increase the amount of memory the system can use for disk caching.


Just add RAM and don't run bloatware :-).

Bruce


Re: undefined reference to `memset'

2005-03-24 Thread Bruce Evans
On Wed, 23 Mar 2005, Vinod Kashyap wrote:
If any kernel module has the following, or a similar line in it:
-
char x[100] = {0};
-
I think you mean:
-
auto char x[100] = {0};
-
or after fixing some style bugs:
-
char x[100] = { 0 };
-
building of the GENERIC kernel on FreeBSD 5 -STABLE for amd64
as of 03/19/05, fails with the following message at the time of linking:
undefined reference to `memset'.
The same problem is not seen on i386.
The problem goes away if the above line is changed to:
-
char x[100];
memset(x, 0, 100);
-
This version makes the pessimizations and potential bugs clear:
- clearing 100 bytes on every entry to the function is wasteful.  C90's
  auto initializers hide pessimizations like this.  They should be
  used very rarely, especially in kernels.  But they are often misused,
  even in kernels, even for read-only data that should be static.  gcc
  doesn't optimize even auto const x[100] = { 0 }; to a static
  initialization -- the programmer must declare the object as static to
  prevent gcc laboriously clearing it on every entry to the function.
- 100 bytes may be too much to put on the kernel stack.  Objects just
  a little larger than this must be dynamically allocated unless they
  can be read-only.
Adding CFLAGS+=-fbuiltin, or CFLAGS+=-fno-builtin to /sys/conf/Makefile.amd64
does not help.
-fno-builtin is already in CFLAGS, and if it has any effect on this then
it should be to cause gcc to generate a call to memset() instead of doing
the memory clearing inline.  I think gcc has a builtin memset() which is
turned off by -fno-builtin, but -fno-builtin doesn't affect cases where
memset() is not referenced in the source code.
-ffreestanding should prevent gcc generating calls to library functions
like memset().  However, -ffreestanding is already in CFLAGS too, and
there is a problem: certain initializations like the one in your example
need to use an interface like memset(), and struct copies need to use an
interface like memcpy(), so what is gcc to do when -fno-builtin tells it
to turn off its builtins and -ffreestanding tells it that the relevant
interfaces might not exist in the library?
Anyone knows what's happening?
gcc is expecting that memset() is in the library, but the FreeBSD kernel
is freestanding and happens not to have memset() in its library.
Related bugs:
- the FreeBSD kernel shouldn't have memset() at all.  The kernel interface
  for clearing memory is bzero().  A few files misspelled bzero() as
  memset() and provided a macro to convert from memset() to bzero(), and
  instead of fixing them a low-quality memset() was added to sys/libkern.h.
  This gives an inline memset() so it doesn't help here.  memset() is of
  some use for setting to nonzero, but this is rarely needed and can
  easily be repeated as necessary.  The support for the nonzero case in
  sys/libkern.h is of particularly low quality -- e.g., it crashes if
  asked to set a length of 0.
- memset() to zero and/or gcc methods for initialization to 0 might be
  much slower than the library's methods for clearing memory.  This is
  not a problem in practice, although bzero() is much faster than gcc's
  methods in some cases, because:
  (a) -fno-builtin turns off builtin memset().
  (b) the inline memset() just uses bzero() in the fill_byte = 0 case,
  so using it instead of bzero() is only a tiny pessimization.
  (c) large copies that bzero() can handle better than gcc's inline
  method (which is stosl on i386's for your example) cannot happen,
  because the data would be too large to fit on the kernel stack.
- there are slightly different problems for memcpy():
  (a) memcpy() is in the library and is not inline, so there is no
  linkage problem if gcc generates a call to memcpy() for a struct
  copy.
  (b) the library memcpy() never uses bcopy(), so it is much slower than
  bcopy() in most cases.
  (c) the reason that memcpy() is in the library is to let gcc inline
  memcpy() for efficiency, but this reason was turned into nonsense
  by adding -fno-builtin to CFLAGS, and all calls to memcpy() are
  style bugs and ask for inefficiency.  (The inefficiency is small
  or negative in practice because bcopy() has optimizations for
  large copies that are small pessimizations for non-large copies.)
- the FreeBSD kernel shouldn't have memcmp().  It has an inline one that
  has even lower quality than the inline memset().  memcmp() cannot be
  implemented using bcmp() since memcmp() is tri-state but bcmp() is
  boolean, but the inline memcmp() just calls bcmp().  This works, if
  at all, because nothing actually needs memcmp() and memcmp() is just
  a misspelling of bcmp().
Bruce


Re: undefined reference to `memset'

2005-03-24 Thread Bruce Evans
On Thu, 24 Mar 2005, Bruce Evans wrote:
On Wed, 23 Mar 2005, Vinod Kashyap wrote:
If any kernel module has the following, or a similar line in it:
-
char x[100] = {0};
-
building of the GENERIC kernel on FreeBSD 5 -STABLE for amd64
as of 03/19/05, fails with the following message at the time of linking:
undefined reference to `memset'.
...
...
Anyone knows what's happening?
gcc is expecting that memset() is in the library, but the FreeBSD kernel
is freestanding and happens not to have memset() in its library.
As to why gcc calls memset() on amd64's but not on i386's:
- gcc-3.3.3 doesn't call memset() on amd64's either.
- gcc-3.4.2 on amd64's calls memset() starting with an array size of
  65.  It uses mov[qlwb] for sizes up to 16, then stos[qlwb] up to
  size 64.  gcc-3.3.3 on i386's uses mov[lwb] for sizes up to 8,
  then stos[lwb] for all larger sizes.
- the relevant change seems to be:
% Index: i386.c
% ===================================================================
% RCS file: /home/ncvs/src/contrib/gcc/config/i386/i386.c,v
% retrieving revision 1.20
% retrieving revision 1.21
% diff -u -2 -r1.20 -r1.21
% --- i386.c	19 Jun 2004 20:40:00 -0000	1.20
% +++ i386.c	28 Jul 2004 04:47:35 -0000	1.21
% @@ -437,26 +502,36 @@
% ...
% +const int x86_rep_movl_optimal = m_386 | m_PENT | m_PPRO | m_K6;
% ...
Note that rep_movl is considered optimal on i386's but not on amd64's.
% @@ -10701,6 +11427,10 @@
%    /* In case we don't know anything about the alignment, default to
%       library version, since it is usually equally fast and result in
% -     shorter code.  */
% -  if (!TARGET_INLINE_ALL_STRINGOPS && align < UNITS_PER_WORD)
% +     shorter code.
% +
% +     Also emit call when we know that the count is large and call overhead
% +     will not be important.  */
% +  if (!TARGET_INLINE_ALL_STRINGOPS
% +      && (align < UNITS_PER_WORD || !TARGET_REP_MOVL_OPTIMAL))
%       return 0;
%
TARGET_REP_MOVL_OPTIMAL is x86_rep_movl_optimal modulo a mask.  It is
zero for amd64's, so 0 is returned for amd64's here unless you use -mfoo
to set TARGET_INLINE_ALL_STRINGOPS.  Returning 0 gives the library call
instead of a stringop.
This is in i386_expand_clrstr().  There is an identical change in
i386_expand_movstr() that gives library calls to memcpy() for (at
least) copying structs.
Bruce


Re: undefined reference to `memset'

2005-03-24 Thread Bruce Evans
On Thu, 24 Mar 2005, Nick Barnes wrote:
At 2005-03-24 08:31:14+0000, Bruce Evans writes:
what is gcc to do when -fno-builtin tells it to turn off its
builtins and -ffreestanding tells it that the relevant interfaces
might not exist in the library?
Plainly, GCC should generate code which fills the array with zeroes.
It's not obliged to generate code which calls memset (either builtin
or in a library).  If it knows that it can do so, then fine.
Otherwise it must do it the Old Fashioned Way.  So this is surely a
bug in GCC.
Nick B, who used to write compilers for a living
But the compiler can require the Old Fashioned Way to be in the library.
libgcc.a is probably part of gcc even in the freestanding case.  The
current implementation of libgcc.a won't all work in the freestanding
case, since parts of it call stdio, but some parts of it are needed
and work (e.g., __divdi3() on i386's at least).  The kernel doesn't
use libgcc.a, but it knows that __divdi3() and friends are needed and
implements them in its libkern.  Strictly, it should do something
similar for memset().
I think the only bugs in gcc here are that the function it calls is
in the application namespace in the freestanding case, and that the
requirements for freestanding implementations are not all documented.
The requirement for memset() and friends _is_ documented (in gcc.info),
but the requirement for __divdi3() and friends are only documented
indirectly by the presence of these functions in libgcc.a.
Bruce


Re: undefined reference to `memset'

2005-03-24 Thread Bruce Evans
On Fri, 25 Mar 2005, Peter Jeremy wrote:
On Thu, 2005-Mar-24 12:03:19 -0800, Vinod Kashyap wrote:
[  char x[100] = { 0 };  ]
A statement like this (auto and not static)
I'd point out that this is the first time that you've mentioned that
the variable is auto.  Leaving out critical information will not
encourage people to help you.
It was obviously auto, since memset() would not have been called for
a global variable.
is necessary if you
are dealing with re-entrancy.
This isn't completely true.  The preferred approach is:
	char *x;
x = malloc(100, MEM_POOL_xxx, M_ZERO | M_WAITOK);
(with a matching free() later).
This is also preferred to alloca() and C99's dynamic arrays.
BTW, the kernel has had some dubious examples of dynamic arrays in very
important code since long before C99 existed.  vm uses some dynamic
arrays, and this is only safe since the size of the arrays is bounded
and small.  But when the size of an array is bounded and small,
dynamic allocation is just a pessimization -- it is more efficient
to always allocate an array with the maximum size that might be needed.
How is it then, that an explicit call to memset (like in my example) works?
The code
auto char   x[100] = {0};
is equivalent to
auto char   x[100];
memset(x, 0, sizeof(x));
but memset only exists as a static inline function (defined in libkern.h).
If an explicit call to memset works then the problem would appear to be
that the compiler's implicit expansion is failing to detect the static
inline definition, and generating an external reference which can't be
satisfied.  This would seem to be a gcc bug.
No, it is a feature :-).  See my earlier reply.
2. I should have mentioned that I don't see the problem if I am
  building only the kernel module.  It happens only when I am building
  the kernel integrating the module containing the example code.
This is the opposite of what you implied previously.  There are some
differences in how kernel modules are built so this
How about posting a (short) compilable piece of C code that shows the
problem.  I would expect that an nm of the resultant object would
show "U memset" when the code was compiled for linking into the kernel
and "some_address t memset", or not reference memset at all, when
compiled as a module.
I deleted the actual example.  Most likely it would fail at load time
due to using memset().  Another possibility is for the code that needs
memset to be unreachable in the module since it is inside an ifdef.
Bruce