from:"Bruce Evans"

Re: Replace bcopy() to update ether_addr

2012-08-23 Thread Bruce Evans

From owner-freebsd-...@freebsd.org Thu Aug 23 10:58:03 2012
Delivered-To: freebsd-...@freebsd.org
Date: Wed, 22 Aug 2012 23:52:44 +0200
From: Luigi Rizzo ri...@iet.unipi.it
To: John Baldwin j...@freebsd.org
References: 50324db4.6080...@cabletv.dp.ua
201208220802.14588@freebsd.org
CAJ-Vmo=1cbJn3pkSvoCq7y-kEGig-h1Vxo6M5V0=b9=mkfu...@mail.gmail.com
201208221521.06954@freebsd.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 201208221521.06954@freebsd.org
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-hackers@FreeBSD.org, Adrian Chadd adr...@freebsd.org,
Mitya mi...@cabletv.dp.ua,
Wojciech Puchar woj...@wojtek.tensor.gdynia.pl,
freebsd-...@freebsd.org
Subject: Re: Replace bcopy() to update ether_addr
X-BeenThere: freebsd-...@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD freebsd-net.freebsd.org
List-Unsubscribe: http://lists.freebsd.org/mailman/listinfo/freebsd-net,
mailto:freebsd-net-requ...@freebsd.org?subject=unsubscribe
List-Archive: http://lists.freebsd.org/pipermail/freebsd-net
List-Post: mailto:freebsd-...@freebsd.org
List-Help: mailto:freebsd-net-requ...@freebsd.org?subject=help
List-Subscribe: http://lists.freebsd.org/mailman/listinfo/freebsd-net,
mailto:freebsd-net-requ...@freebsd.org?subject=subscribe
Sender: owner-freebsd-...@freebsd.org
Errors-To: owner-freebsd-...@freebsd.org
Content-Length: 2762
Lines: 64

On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote:
 On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote:
  On 22 August 2012 05:02, John Baldwin j...@freebsd.org wrote:
   On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote:
   Hi,

   What about just creating an ETHER_ADDR_COPY(dst, src) and putting that
   in a relevant include file, then hide the ugliness there?

   The same benefits will likely appear when copying wifi MAC addresses
   to/from headers.

   Thanks, I'm glad someone noticed this.

   I doubt we even _need_ the ugliness.  We should just use *dst = *src
   unless there is a compelling reason not to.

  Because it's not very clear? :-) I'd much prefer my array-of-things
  copies to be explicit.

 Eh?  'struct foo *src, *dst; *dst = *src' is pretty bog-standard C.  That 
 isn't really all that obtuse.

the thread has probably forked causing people to miss the explanation
that Bruce gave: quite often the function is called by casting
arbitrary pointers into 'struct foo *', so the compiler's expectations
about alignment do not necessarily match the user's lies.

Unfortunately we are building kernels with many compiler checks
disabled, so there is a fair chance that the compiler will not
detect such invalid casts.

Probably addresses are aligned to 2-byte boundaries, but certainly
not on a 4-byte, which means that a safe copy might require 3
instructions, even though a compiler could otherwise decide to align
all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly
do the copy with 2 or even 1 instruction.

I would also suggest to try the code i posted in response to bruce
so you can check how good or bad are the various solutions on
different architectures or CPUs, and see if there is a reasonable
compromise.

cheers
luigi

  Also, the optimisation and compiler silliness may not be THAT obvious
  on intel (except when you're luigi and using netmap) but I can't help
  but wonder whether the same does hold for MIPS/ARM. Getting it wrong
  there will lead to some very very poor performing code.

 Don't you think there's a really good chance the compiler knows how to copy a 
 structure appropriately for each architecture already?

 -- 
 John Baldwin
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
___
freebsd-...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: Replace bcopy() to update ether_addr

2012-08-23 Thread Bruce Evans

luigi wrote:

 On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote:
  On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote:
   On 22 August 2012 05:02, John Baldwin j...@freebsd.org wrote:
On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote:
Hi,
   
What about just creating an ETHER_ADDR_COPY(dst, src) and putting that
in a relevant include file, then hide the ugliness there?

Good for source code ugliness and bloat (not just in the macro).

The same benefits will likely appear when copying wifi MAC addresses
to/from headers.

Why stop there?  Change all assignments to use the ASSIGN() macro.  The
compiler doesn't understand assignment, but we do ;-).

Thanks, I'm glad someone noticed this.
   
I doubt we even _need_ the ugliness.  We should just use *dst = *src
unless there is a compelling reason not to.
   
   Because it's not very clear? :-) I'd much prefer my array-of-things
   copies to be explicit.
  
  Eh?  'struct foo *src, *dst; *dst = *src' is pretty bog-standard C.  That 
  isn't really all that obtuse.

(Builtin) memcpy is better for this in general (no sarcasm now).  It
reduces to a struct copy if the object being copy is a struct, and is
not limited to structs.

However, ethernet address types are now fully pessimized, so compilers
are almost forced to assume that they are fully misaligned.  From
net/ethernet.h:

% /*
%  * Structure of a 10Mb/s Ethernet header.
%  */

Wow, 10Mb/s.

% struct ether_header {
%   u_char  ether_dhost[ETHER_ADDR_LEN];
%   u_char  ether_shost[ETHER_ADDR_LEN];
%   u_short ether_type;
% } __packed;

This used to be not quite guaranteed to be 16-bit aligned by the short
in it.  Perhaps it is still aligned in practice,  But in 2006, it was
fully pessimized by adding __packed without adding __aligned(2).  This
__packed prevents the struct being padded to a multiple of 32 bits in
size on arm.  It has the side effect of allowing the compiler to not
add padding for alignment before instances of this struct, and allowing
embedded short to be misaligned, so all accesses to this short might
require 2 1-byte load/store instructions and more instructions to
combine the bytes.  On x86, the accesses can be a single load/store
(or sometimes an arithmetic operation), with the hardware doing the
combination, possibly more slowly than for aligned accesses.  The
pessimization is better on other arches.

% /*
%  * Structure of a 48-bit Ethernet address.
%  */
% struct ether_addr {
%   u_char octet[ETHER_ADDR_LEN];
% } __packed;

This was always just an array of bytes, so it was nevever required to
have more than byte alignment.  However, the compiler was permitted to
add padding before and after it to align it and the thing after it.
__packed on it may prevent this.  I hope that it actually takes a
__packed on the containing struct or on the struct member with this
type to prevent external padding, but __packed on everything may be
needed anyway for __packed on this to have much effect.

An ETHER_ADDR_COPY() macro can only do worse than the compiler here,
since the object is nothing more than an array of bytes after forgetting
the extra packing info that may be known to the compiler.  On x86, the
macro can use 2 possibly-misaligned load/stores, but so can the
compiler, or the macro can use a dynamic comparison of the address to
find the alignment point and then do 1, 2 or 3 aligned load/store
pairs, but so can the compiler, or the macro can use rep movsb, but
so can the compiler...  The best choice depends on CFLAGS (a static
test that is easiest for the compiler) or on the CPU.  For a complete
implementation, the macro must of course patch the instructions used
to best suit the current CPU.  This would be a lot of work for an
unimportant (~0.1% max) optimization.  On non-x86 or any arch with
strict alignment requirements, the only obviously correct implementions
of the macro are 6 1-byte load/store pairs, or memcpy(), or bcopy().
The compiler can easily beat all of these.

 the thread has probably forked causing people to miss the explanation
 that Bruce gave: quite often the function is called by casting
 arbitrary pointers into 'struct foo *', so the compiler's expectations
 about alignment do not necessarily match the user's lies.

Except for struct ether_addr, the cast would actually reduce alignment
(unless the compiler is smart enough to not do it literally).

 Unfortunately we are building kernels with many compiler checks
 disabled, so there is a fair chance that the compiler will not
 detect such invalid casts.

Er, we check more in the kernel than almost everywhere else.

 Probably addresses are aligned to 2-byte boundaries, but certainly
 not on a 4-byte, which means that a safe copy might require 3
 instructions, even though a compiler could otherwise decide to align
 all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly
 do the copy with 2 or even 1 instruction.

The code actually

Re: Replace bcopy() to update ether_addr

2012-08-22 Thread Bruce Evans

mitya wrote:

 22.08.2012 05:07, Bruce Evans íàïèñàë:
  On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
  Hi.
  I found some overhead code in /src/sys/net/if_ethersubr.c and
  /src/sys/netgraph/ng_ether.c
 
  It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
  When src and dst are struct ether_addr*, and ETHER_ADDR_LEN equal 6.
  Only ng_ether.c contains such strings.  if_ethersubr.c doesn't exist.
  if_ether.c exists, but was converted to use memcpy() 17.25 years ago.

Oops, if_ethersubr.c does exist (in net/.  if_ether.c is a misnamed file
in netinet/).

  Summary: use builtin memcpy() for small copies, and don't try hard to
  otherwise optimize this.
 You are very surprised to learn that bcopy() and memcpy() are used for 
 copy struct in_addr, whose length is equal to 4 bytes ?
 I found this in sys/netinet/if_ether.c. And I think, there is a chance 
 to find them in other files.

Since memcpy() is the correct method, us of it is only slightly surprising.
Use of bcopy() is a regression.  Bugs are never surprising :-).

 And I found bzero(mtod(m, void *), sizeof(struct in_addr)); in ip_options.c

Speed of options processing isn't important.

Anyway, I dug out some of my old unimportant fixes for some of the
copying pessimizations:

% diff -c2 ./net/if_ethersubr.c~ ./net/if_ethersubr.c
% *** ./net/if_ethersubr.c~ Wed Dec 27 10:49:51 2006
% --- ./net/if_ethersubr.c  Wed Dec 27 10:49:52 2006
% ***
% *** 253,257 
%   hdrcmplt = 1;
%   eh = (struct ether_header *)dst-sa_data;
% ! (void)memcpy(esrc, eh-ether_shost, sizeof (esrc));
%   /* FALLTHROUGH */
%   
% --- 253,257 
%   hdrcmplt = 1;
%   eh = (struct ether_header *)dst-sa_data;
% ! __builtin_memcpy(esrc, eh-ether_shost, sizeof(esrc));
%   /* FALLTHROUGH */
%   

Not the right fix (except to fix the style bug (cast to void)).
memcpy() should be used, and then if the builtin is good then it shoukd
be used.

% ***
% *** 259,263 
%   loop_copy = 0; /* if this is for us, don't do it */
%   eh = (struct ether_header *)dst-sa_data;
% ! (void)memcpy(edst, eh-ether_dhost, sizeof (edst));
%   type = eh-ether_type;
%   break;
% --- 259,263 
%   loop_copy = 0; /* if this is for us, don't do it */
%   eh = (struct ether_header *)dst-sa_data;
% ! __builtin_memcpy(edst, eh-ether_dhost, sizeof(edst));
%   type = eh-ether_type;
%   break;
% ***
% *** 276,288 
%   senderr(ENOBUFS);
%   eh = mtod(m, struct ether_header *);
% ! (void)memcpy(eh-ether_type, type,
% ! sizeof(eh-ether_type));
% ! (void)memcpy(eh-ether_dhost, edst, sizeof (edst));
%   if (hdrcmplt)
% ! (void)memcpy(eh-ether_shost, esrc,
% ! sizeof(eh-ether_shost));
%   else
% ! (void)memcpy(eh-ether_shost, IF_LLADDR(ifp),
% ! sizeof(eh-ether_shost));
%   
%   /*
% --- 276,287 
%   senderr(ENOBUFS);
%   eh = mtod(m, struct ether_header *);
% ! __builtin_memcpy(eh-ether_type, type, sizeof(eh-ether_type));
% ! __builtin_memcpy(eh-ether_dhost, edst, sizeof(edst));
%   if (hdrcmplt)
% ! __builtin_memcpy(eh-ether_shost, esrc,
% ! sizeof(eh-ether_shost));
%   else
% ! __builtin_memcpy(eh-ether_shost, IF_LLADDR(ifp),
% ! sizeof(eh-ether_shost));
%   
%   /*

Larger style fixes.

Non-style fixes/breakages are in the z subdir:

% diff -c2 ./net/z/if_ethersubr.c~ ./net/z/if_ethersubr.c
% *** ./net/z/if_ethersubr.c~   Wed Dec 27 10:49:51 2006
% --- ./net/z/if_ethersubr.cWed Dec 27 20:28:06 2006
% ***
% *** 191,197 
%   
%   if (m-m_flags  M_BCAST)
% ! bcopy(ifp-if_broadcastaddr, edst, ETHER_ADDR_LEN);
%   else
% ! bcopy(ar_tha(ah), edst, ETHER_ADDR_LEN);
%   
%   }
% --- 191,198 
%   
%   if (m-m_flags  M_BCAST)
% ! __builtin_memcpy(edst, ifp-if_broadcastaddr,
% ! sizeof(edst));
%   else
% ! __builtin_memcpy(edst, ar_tha(ah), sizeof(edst));
%   
%   }

bcopy() can't be replaced by a builtin since it is required to handle
overlapping copies, which I think can't happen here (but I didn't check
this).  The builtin shouldn't be hard-coded like this, as above.

% ***
% *** 214,219 
%   } else
%   type = htons(ETHERTYPE_IPX);
% ! bcopy((caddr_t)(((struct sockaddr_ipx 
*)dst)-sipx_addr.x_host),
% ! (caddr_t)edst, sizeof (edst));
%   break;
%   #endif
% --- 215,221 
%   } else
%   type = htons(ETHERTYPE_IPX);
% ! __buitin_memcpy

Re: Replace bcopy() to update ether_addr

2012-08-21 Thread Bruce Evans

 On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
  Hi.
  I found some overhead code in /src/sys/net/if_ethersubr.c and 
  /src/sys/netgraph/ng_ether.c
  
  It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
  When src and dst are struct ether_addr*, and ETHER_ADDR_LEN equal 6.

Only ng_ether.c contains such strings.  if_ethersubr.c doesn't exist.
if_ether.c exists, but was converted to use memcpy() 17.25 years ago.

  This code call every time, when we send Ethernet packet.
  On example, on my machine in invoked nearly 20K per second.
  
  Why we are use bcopy(), to copy only 6 bytes?
  Answer - in some architectures we are can not directly copy unaligned data.
  
  I propose this solution.
  
  In file /usr/src/include/net/ethernet.h add this lines:
  
  static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
  #if defined(__i386__) || defined(__amd64__)
  *dst = *src;
  #else
  bcopy(src, dst, ETHER_ADDR_LEN);
  #endif
  }
  
  On platform i386 gcc produce like this code:
  leal-30(%ebp), %eax
  leal6(%eax), %ecx
  leal-44(%ebp), %edx
  movl(%edx), %eax
  movl%eax, (%ecx)
  movzwl  4(%edx), %eax
  movw%ax, 4(%ecx)
  And clang produce this:
  movl-48(%ebp), %ecx
  movl%ecx, -26(%ebp)
  movw-44(%ebp), %si
  movw%si, -22(%ebp)
  
  All this variants are much faster, than bcopy()

You mean as much as 5 nanoseconds faster.  Possibly even 10 nanoseconds
faster.

 A bit orthogonal to this but also related to the performance
 impact of these bcopy() calls, for !__NO_STRICT_ALIGNMENT
 architectures these places probably should use memcpy()
 instead as bcopy() additionally has to check for overlap
 while the former does not. Overlaps unlikely are an issue
 in these cases and at least NetBSD apparently has done the
 switch to memcpy() 5.5 years ago.

This is essentially just a style bug.  FreeBSD switched to memcpy()
17.25 years ago for selected networking copying.  memcpy() is supposed
to be used if and only if compilers can optimize it.  This means that
the size must be fixed and small, and of course that the copies don't
overlap.  Otherwise, compilers can't do any better than call an extern
copying routine, which is memcpy() in practice.  memcpy() was
interntionally left out in FreeBSD until it was added 17.25 years
ago to satisfy the changes to use memcpy() in networking code (since
with -O0, memcpy() won't be inlined and the extern memcpy() gets
used).  Other uses are style bugs, but there are many now :-(.

bcopy() is still the primary interface, and might be faster than
memcpy(), especially for misaligned cases, but in practice it isn't,
except in the kernel on Pentium1 in 1996-1998 where I only optimized
(large) bcopy()s.  Since it has to support overlapping copies it is
inherently slower.

Although compilers can't do better for large copying than call an
extern routine, some compilers bogusly inline it to something like
rep movsd on i386, (or worse, to a very large number of loads and
stores).  gcc used to have a very large threshold for inlining
moderately-sized copies and/or for switching between rep movsd and
load/store.  It now understands better than ut doesn't understand
memory, and has more reasonable thresholds.   Or rather the thresholds
are more MD.  gcc still makes a mess with some CFLAGS:

% struct foo
% {
%   short x;
%   struct bar {
%   short yy[31];
%   } y;
% } s, t;
% 
% foo()
% {
%   s.y = t.y;
% }

With just -O, gcc-4.2.1 -O on i386 handles this very badly, by
generating 15 misaligned 32-bit load/stores followed by 1 aligned
16-bit load/store.  With -march=almost anything, it generates 1
16-bit aligned load-store followed by an aligned rep movsd with a
count of 15.  The latter is not too bad.  Similarly for yy[32].  But
for yy[33] it switches to a call to memcpy() even with plain -O.

However, improvements in memory systems and branch prediction since
Pentium1 in 1996-98 mean that optimimizing copying mostly gets
nowhere.  Copying is either from the cache[s], in which case it is
fast (~1 nanosecond per 128 bits for L1), or it is not from the
caches in which case it is slow (~50-200 nanseconds per cache miss).
With 6-byte ethernet addresses, using bcopy() does slow the copying
to considerably below 1 nanosecond per 128 bits (a sloppy estimate
gives 5-10 ns/call), but it's hard for a single call to be made often
enough to make a significant difference.  Someone mentioned 2
calls.  That's the same as 0 calls: 2 * 10 nsec = 200 usec =
0.05% of 1 CPU.

If anyone cared about this, then they would use __builtin_memcpy()
instead of memcpy().  (Note that the point of the 17.25-year old
optimization has been defeated for ~10 years by turning off _all_
builtins, which was initially done mainly to kill builtin putc().
(-ffreestanding should have done that.)  So gcc inlines struct
copying for small structs, but never inlines memcpy(), or

Re: Replace bcopy() to update ether_addr

2012-08-21 Thread Bruce Evans

luigi wrote:

 even more orthogonal:
 
 I found that copying 8n + (5, 6 or 7) bytes was much much slower than
 copying a multiple of 8 bytes. For n=0, 1,2,4,8 bytes are efficient,
 other cases are slow (turned into 2 or 3 different writes).
 
 The netmap code uses a pkt_copy routine that does exactly this
 rounding, gaining some 10-20ns per packet for small sizes.

I don't believe 10-20ns for just the extra bytes.  memcpy() ends up
with a movsb to copy the extra bytes.  This can be slow, but I don't
believe 10-20ns (except on machines running at i486 speeds of course).

% ENTRY(memcpy)
%   pushl   %edi
%   pushl   %esi
%   movl12(%esp),%edi
%   movl16(%esp),%esi
%   movl20(%esp),%ecx
%   movl%edi,%eax
%   shrl$2,%ecx /* copy by 32-bit words */
%   cld /* nope, copy forwards */
%   rep
%   movsl
%   movl20(%esp),%ecx
%   andl$3,%ecx /* any bytes left? */

This avoids a branch.  Some optimization manuals say that the branch is
actually better for some machines,

The above 2 instructions have a throughput of 1 per cycle each on
modern x86.  Latency might be 6 cycles.

%   rep

Maybe 5-15 cycles throughput.

%   movsb

Now hopefully at most 1 cycle/byte.  Some hardware might combine the
bytes as much as possible, so the whole function should use 1 single
rep movsb and let the hardware do it all.

%   popl%esi
%   popl%edi
%   ret

Well, it's easy to get a latency of 20 cycles 5-10 ns) and maybe even
a throughput of that.  But all of thus is out of order on modern x86.
The extra cycles for the movsb might not cost at all if nothing accesses
the part of the target that they were written to soon.

With builtin memcpy, 6 bytes would be done using load/store of 4+2 bytes
and thus take the same time as 8 bytes on i386, but on amd64 8 bytes
would be faster.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: Replace bcopy() to update ether_addr

2012-08-21 Thread Bruce Evans

jhb wrote:
 On Monday, August 20, 2012 10:46:12 am Mitya wrote:
  ...
  I propose this solution.
  
  In file /usr/src/include/net/ethernet.h add this lines:
  
  static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
  #if defined(__i386__) || defined(__amd64__)
   *dst = *src;
  #else
   bcopy(src, dst, ETHER_ADDR_LEN);
  #endif
  }
 
 Doesn't '*dst = *src' just work on all platforms?

It does when you have an actual struct, but often in networking code
you only have an array of bytes, and then casting the pointers from
uint8_t * to full object pointers fails iff there is an alignment
problem.

Even on i386, it may be pessimal to use struct copying for 6 bytes,
The bytes should be copied as 4+2 or 2+4 depending on alignment of the
first bytes.  Oops, not even that -- this reminds me of a problem with
penalties for mismatched loads and stores which affects at least
AthlonXP and Athlon64 significantly (10-20 cycle penalty = enough to
copy 80-160 bytes unpenalized).  The 6 should probably be copied as
2+2+2 or even as 1+1+1+1+1+1 to match previous stores and later loads,
also as 2+2+2, etc.  But if the 6 are part of another struct, they
might be accessed as 8+0 or 4+4 as part of copying the full struct.
gcc is not smart about this.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: High-res Timers

2012-05-17 Thread Bruce Evans


On Wed, 16 May 2012, Jason Hellenthal wrote:


Does anyone have a quick list of high-resolution timer functions? Both
user-land and kernel-land? It would be greatly appreciated (doing some
performance timing for applications).


clocks(7) - various system timers
getitimer(2), setitimer(2) - get/set value of interval timer

see ( man -k timer ) list for some other references.


I think the original poster wants timestamping functions.  Timer function
normally means a timer-interrupt-scheduling function.  clocks(7) is bogus
and more than 10 years out of date.  It is mostly about hardware and
virtual clock oscillators that may be used to build timestamping and timer
functions (mostly the former).


I am not sure what sort of high resolution you are refering to but maybe
these will lead you in the right direction. Bruce Evans CCd - he may
have quite a bit that could be added to this. I always find what he has
to say very enlightening.


Generally, some of the timestamping functions have high resolution, but
are too slow to use if you make a lot of timestamps, and if you don't
make a lot of timestamps then you don't need very high resolution.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: [review request] zfsboot/zfsloader: support accessing filesystems within a pool

2012-05-05 Thread Bruce Evans


On Sat, 5 May 2012, Andriy Gapon wrote:


on 04/05/2012 18:25 John Baldwin said the following:

On Thursday, May 03, 2012 11:23:51 am Andriy Gapon wrote:

on 03/05/2012 18:02 Andriy Gapon said the following:


Here's the latest version of the patches:
http://people.freebsd.org/~avg/zfsboot.patches.4.diff


I've found a couple of problems in the previous version, so here's another one:
http://people.freebsd.org/~avg/zfsboot.patches.5.diff
The important change is in the first patch (__exec args).


A few comments/suggestions on the args bits:


John,

these are excellent suggestions!  Thank you!
Some comments:

- Add #ifndef LOCORE guards to the new header around the structure so
  it can be used in assembly as well as C.


Done.  I have had to go into a few btx makefiles and add a necessary include
path and -DLOCORE to make the header usable from asm.


Ugh, why not use genassym, as is done for all old uses of this header in
locore.s, at least on i386 (5% of the i386 genassym.c is for this).


- Move BI_SIZE and ARGOFF into the header as constants.


Done.


- Add a CTASSERT() in loader/main.c that BI_SIZE == sizeof(struct bootinfo)


Ugh, BI_SIZE was already used in locore.s.  It wasn't the size of the struct,
but was the offset of the field that gives the size.  No CTASSERT() was
needed -- the size is whatever it is, as given by sizeof() on the struct
at the time of compilation of the utility that initializes the struct.
It was a feature that sizeof() and offsetof() can't be used in asm so they
must be translated in genassym and no macros are needed in the header (the
size was fully dynamic, so the asm code only needs the offsetof() values).
Of course, you could use CTASSERT()s to check that the struct layout didn't
get broken.  The old code just assumes that the struct is packed by the
programmer and that the arch's struct packing conventions don't change,
so that for example BI_SIZE = offsetof(struct bootinfo, bi_size) never
changes.

genassym is hard to use in boot programs, but the old design was that
boot programs shouldn't use bootinfo in asm and should just use the
target bootinfo.h at compile time (whatever time the target is compiled).
Anyway, LOCORE means for use in locore.[sS], so other uses of it, e.g.
in boot programs, are bogus.



I have added a definition of CTASSERT to boostrap.h as it was not available for
sys/boot and there were two local definitions of the macro in individual files.

However the assertion would fail right now.
The backward-compatible value of BI_SIZE (72 == 0x48) covers only part of the


This isn't backwards compatible.  BI_SIZE was decimal 48 (covers everything
up to the bi_size field).


fields in struct bootinfo, those up to the following comment:
/* Items below only from advanced bootloader */
I am a little bit hesitant: should I increase BI_SIZE to cover the whole struct
bootinfo or should I compare BI_SIZE to offsetof bi_kernend?


Neither.  BI_SIZE shouldn't exist.  It defeats the bi_size field.


Maybe
  create a 'struct kargs_ext' that looks like this:

struct kargs_ext {
uint32_t size;
char data[0];
};


I've decided to skip on this.


Use KNF indentation and KNF field prefixes (ka_) if you add it :-).  Generic
field names like `size' and `data' need prefixes more than mos.

The old struct was:

% #define   N_BIOS_GEOM 8
% ...
% /*
%  * A zero bootinfo field often means that there is no info available.
%  * Flags are used to indicate the validity of fields where zero is a
%  * normal value.
%  */
% struct bootinfo {
%   u_int32_t   bi_version;
%   u_int32_t   bi_kernelname;  /* represents a char * */
%   u_int32_t   bi_nfs_diskless;/* struct nfs_diskless * */
%   /* End of fields that are always present. */

The original size was apparently 12.

% #define   bi_endcommonbi_n_bios_used

Another style difference.  The magic 12 is essentially given by this macro.
This macro is a pseudo-field, like the ones for the copyable and zeroable
regions in struct proc.  Its name is in lower case.

%   u_int32_t   bi_n_bios_used;
%   u_int32_t   bi_bios_geom[N_BIOS_GEOM];

The struct was broken in 1994 by adding the above 2 fields without providing
any way to distinguish it from the old struct.

%   u_int32_t   bi_size;
%   u_int8_tbi_memsizes_valid;
%   u_int8_tbi_bios_dev;/* bootdev BIOS unit number */
%   u_int8_tbi_pad[2];
%   u_int32_t   bi_basemem;
%   u_int32_t   bi_extmem;
%   u_int32_t   bi_symtab;  /* struct symtab * */
%   u_int32_t   bi_esymtab; /* struct symtab * */

The above 8 fields were added in 1995 (together with fixing style bugs
like no prefixes for the old field names).  Now the struct is determined
by its size according to the bi_size field, and the bi_version field is
not really used (it's much easier to add

Re: sizeof(function pointer)

2011-06-01 Thread Bruce Evans


On Tue, 31 May 2011 m...@freebsd.org wrote:


I am looking into potentially MFC'ing r212367 and related, that adds
drains to sbufs.  The reason for MFC is that several pieces of new
code in CURRENT are using the drain functionality and it would make
MFCing those changes much easier.

The problem is that r212367 added a pointer to a drain function in the
sbuf (it replaced a pointer to void).  The C standard doesn't
guarantee that a void * and a function pointer have the same size,
though its true on amd64, i386 and I believe PPC.  What I'm wondering
is, though not guaranteed by the standard, is it *practically* true
that sizeof(void *) == sizeof(int(*)(void)), such that an MFC won't
break binary compatibility for any supported architecture?  (The


Only on supported arches.


standard does guarantee, though not in words, that all function
pointers have the same size, since it guarantees that pointers to
functions can be cast to other pointers to functions and back without
changing the value).


No, it doesn't guarantee that they have the same size.  The casts may
change inything in the representation including the size.

In 1995 I added intfptr_t, uintfptr_t and fptrdiff_t to FreeBSD to
handle this problem in the profil(2) APIs.  You can check the MD
definitions of these to verify that their size is always the same as
the sizeof(void *).  There are some style bugs and namespace issues
with these.  The definitions in /sys/*/include/_types.h are clean, but
are mostly not used in their primary consumer /sys/*/include/profile.h.
intfptr_t and uintfptr_t are bogusly declared for the kernel in a bogus
declaration section in /sys/types.h.  They are intentionally kept out
of general headers, but even the kernel doesn't really need them there,
unless they are used for more than profiling and then you might need
them outside the kernel.


Another possibility is to malloc a blob that is sizeof(int(*)(void))
and store that in a renamed s_unused; this is a bit messier but
guaranteed to work.  I'd just rather the code be an MCF instead of a
partial re-write.


It is only guaranteed to work if the value is converted to the actual
function pointer type before use.  (At least for use as a function
pointer.)

Someone pointed out that POSIX now requires function pointers to have
the same representation as void *.  This is a bug in POSIX IMO.  It
isn't in the 2001 version.  I think POSIX had to do something for
broken parts of dlcfn APIs.  Hopefully this isn't required generally.
Requiring all function pointers to be convertible to void * and back
without breaking them is bad enough, but requiring the same representation
is much more restrictive.  It breaks any system that has type info in
the pointers.  The C standard is careful not to require this breakage
even for data pointers.

Someones pointed out that ia64 function pointers are actually handles.
Systems that want to put type info in pointers would have to use handles
instead of actual pointers to give the same representation, even if this
is not the natural ABI.  Whether handle values are useful when the data
that they point to is unavailable is unclear.  The profiling ABIs mainly
need the function pointers (converted to integers) as ids, and handle
values are hopefully enough for that.  If not, the conversion needs to
mangle the values sufficiently for uniqueness.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: retry mounting with ro when rw fails

2011-04-08 Thread Bruce Evans


On Fri, 8 Apr 2011, Andriy Gapon wrote:


on 08/04/2011 03:00 Jeremy Chadwick said the following:

On Thu, Apr 07, 2011 at 01:20:53PM -0700, Garrett Cooper wrote:

As a generic question / observation, maybe we should just
implement 'errors=remount-ro' (or a reasonable facsimile) like Linux
has in our mount(8) command? Doesn't look like NetBSD, OpenBSD, or
[Open]Solaris sported similar functionality.


I was going to recommend exactly this.  :-)

I like the idea of Andriy's patch, but would feel more comfortable if it
were only used if a mount option was specified (-o errors=remount-ro).


Having the option is appealing, but my main motivation was the simplicity that
comes from having that enabled by default.
That is, you absolutely want an R/W mount then use -o rw, you need R/O then
explicitly -o ro, you just want to get that media mounted then the default
behavior tries its best.


But the default behaviour is backwards, especially for read-mostly
removable media.  The default should be ro, possibly with an automagic
upgrade to rw iff the media really needs to be written too.  Writing
timestamps for file system and inode access times doesn't count as
really needs to be written to.

I think I prefer requiring an explicit upgrade to rw.  rw implies
writing access times unless you also use noatime, and I wouldn't want
noatime to be set automagically depending on whether rw is set explicitly,
so I would want noatime to be set explicitly, and once you do that
then you can easily set rw or ro at the same time.  A new rm (read mostly)
or rwa (read or write automagically) flag could give automatic upgrade
from ro to rw.  I'd also like automatic downgrade to ro after a file
system has not been written to for some time (this would avoid fscks
in most cases for read-mostly file systems.  The ro flag should be
per-cylinder-group in ffs so that on big disks, most parts are read-only
most of the time and don't need to be checked).

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: get_cyclecount(9) deprecation

2011-03-18 Thread Bruce Evans


On Thu, 17 Mar 2011, Jung-uk Kim wrote:


Both get_cyclecount(9) and cpu_ticks() do almost exactly the same
thing now assuming set_cputicker() is called with a correct function
before get_cyclecount() is used, which is true for x86, at least.
The only difference is get_cyclecount() may be inlined but I don't
see much gain from the current uses.

Please review the patch.  Note I didn't really remove get_cyclecount()
just because some random third-party module may use it as it is a
documented feature while cpu_ticks is an internal KPI.

What do you think?


I like this idea, but see some minor problems:
- cpu_ticks() API needs to become more public and stable, and guarantee to
  use the ticker with the highest frequency
- it is hard to see when set_cputicker() is called, but on i386 it is
  called very early so there seem to be no problems with the dummy
  timecounter.  It is called from init_TSC() which is called from
  startrtclock() which is called from machdep.c:cpu_startup() just
  after the garbage Good morning comment which was originally just
  banal since it preceeded its code (the printf that prints the copyright
  message).  The ordering of this is now highly obfuscated, and in fact
  there is now the following enormous amount of code between the printf
  and its comment:

% enum sysinit_sub_id {
%   SI_SUB_DUMMY= 0x000,/* not executed; for linker*/
%   SI_SUB_DONE = 0x001,/* processed*/
%   SI_SUB_TUNABLES = 0x070,/* establish tunable values */
%   SI_SUB_COPYRIGHT= 0x081,/* first use of console*/
   ^ prints the copyright
%   SI_SUB_SETTINGS = 0x088,/* check and recheck settings */
%   SI_SUB_MTX_POOL_STATIC  = 0x090,/* static mutex pool */
%   SI_SUB_LOCKMGR  = 0x098,/* lockmgr locks */
%   SI_SUB_VM   = 0x100,/* virtual memory system init*/
%   SI_SUB_KMEM = 0x180,/* kernel memory*/
%   SI_SUB_KVM_RSRC = 0x1A0,/* kvm operational limits*/
%   SI_SUB_WITNESS  = 0x1A8,/* witness initialization */
%   SI_SUB_MTX_POOL_DYNAMIC = 0x1AC,/* dynamic mutex pool */
%   SI_SUB_LOCK = 0x1B0,/* various locks */
%   SI_SUB_EVENTHANDLER = 0x1C0,/* eventhandler init */
%   SI_SUB_VNET_PRELINK = 0x1E0,/* vnet init before modules */
%   SI_SUB_KLD  = 0x200,/* KLD and module setup */
%   SI_SUB_CPU  = 0x210,/* CPU resource(s)*/
   ^^^ calls cpu_startup()

  but there seem to be no slow operations (like device initialization) in
  there, so cpu_startup() seems to be called early enough.

  The function names for clock initialization are almost equally rotted:
  - the realtime clock was originally the i8254, and startrtclock()
started it.  Now startrtclock() doesn't touch either the the realtime
clock (the timecounter) or the i8254.  It only calls init_TSC().  This
might as well be called directly.  The i8254 is now initialized (only?)
by i8254, which is called much earlier, from init386(), so that DELAY()
works when called early from console drivers.  The same care should be
taken with initializing the TSC for early use, and would be essential
if the i8254 support in DELAY() were removed.
  - the correct interface for attaching the realtime clock now seems to be
cpu_initclocks().  This is now used for the TSC (init_TSC_tc()).  But
the i8254 is now attached as a timecounter in attimer_attach().
- set_cputicker() has various races and implementation bugs:
  - it has no visible locking.  This may be safe when it is called at boot
time.  It is also called from tsc_levels_changed().  BTW, these calls
still don't know about P-state invariance.  They always pass the
parameter saying that the ticker is variable.  This despite the
invariance variable being checked to lines after the call that passes
a hard-coded for variance.
  - the call to set_cputicker() after machdep.tsc_freq changes the ticker
freqency is missing.  This is a feature if the ticker is variable,
since the dynamic ticker calibration code can be more accurate than
a fixed setting, and would undo any fixed setting.  But this code is
buggy...
- set cputicker() has some design bugs.  It assumes that the tick frequency
  is the same across all CPUs, but the TSC is per-CPU.  I have an old SMP
  system with CPUs of different frequency that can demonstrate bugs from
  this.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: get_cyclecount(9) deprecation

2011-03-18 Thread Bruce Evans


On Fri, 18 Mar 2011, Kostik Belousov wrote:


On Fri, Mar 18, 2011 at 05:26:53PM +1100, Bruce Evans wrote:
...

- set cputicker() has some design bugs.  It assumes that the tick frequency
  is the same across all CPUs, but the TSC is per-CPU.  I have an old SMP
  system with CPUs of different frequency that can demonstrate bugs from
  this.

We definitely do not support configurations with different models of
CPUs in SMP, this is what Simmetric is about. Different as in frequency
or stepping.


It worked before this bug was implemented.  The TSC wasn't used so the
O/S didn't notice the asymmetric frequencies directly, and the O/S
also didn't care about this indirectly.  Now there is even more asymmetry
in core frequencies, with the hardware transiently slowing down or
stopping cores independently, at least for cores in different packages.
Ths O/S probably can't see this directly using TSCs since the technology
that changes the core's frequencies probably comes with invariant TSCs,
and the O/S shouldn't care about this indirectly just like before.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: get_cyclecount(9) deprecation

2011-03-18 Thread Bruce Evans


On Sat, 19 Mar 2011, Bruce Evans wrote:


On Fri, 18 Mar 2011, Kostik Belousov wrote:

We definitely do not support configurations with different models of
CPUs in SMP, this is what Simmetric is about. Different as in frequency
or stepping.


...
Now there is even more asymmetry
in core frequencies, with the hardware transiently slowing down or
stopping cores independently, at least for cores in different packages.


Also, with virtualization, the virtualizer cannot reasonably even provide
an invariant TSC that runs at the same rate on all cores.  It should
provide an invariant TSC that claims to run at the same rate on all cores,
but then the cores cannot run at the same rate except on average,
since some of the cores will have to run the virtualizer some of the
time, and it is unreasonable to distribute the overhead for this evenly
except on average.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: Chasing down bugs with access(2)

2010-07-21 Thread Bruce Evans


On Wed, 21 Jul 2010, Jaakko Heinonen wrote:


On 2010-07-20, Garrett Cooper wrote:

I ran into an issue last night where apparently several apps make
faulty assumptions w.r.t. whether or not access(2) returns functional
data when running as a superuser.



New implementations are discouraged from returning X_OK unless at
least one execution permission bit is set.


See PR kern/125009 (http://www.freebsd.org/cgi/query-pr.cgi?pr=125009).

Here is the latest version of the vaccess*() patch which also changes
vaccess_acl_nfs4():

http://people.freebsd.org/~jh/patches/vaccess-VEXEC.diff

The patch is not a complete fix however. Not all file systems use
vaccess*() for VEXEC in their VOP_ACCESS() (ZFS confirmed). Thus the
patch doesn't work with ZFS.


I looked at the patches in the PR.  It seems reasonable to require an X
but for VEXEC for all file types except directories, like I think the
vaccess() version of your patch does.

Keeping the existing behaviour for directories seems necessary.  E.g.,
suppose a user changes all his files and directories to mode 000.  It
should still be possible for root to search, not to mention back up,
all those files and directories, without clobbering any of their
metadata (including atimes, but those are a different problem).

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: Chasing down bugs with access(2)

2010-07-21 Thread Bruce Evans


On Tue, 20 Jul 2010, Garrett Cooper wrote:


Hi Hackers,
   I ran into an issue last night where apparently several apps make
faulty assumptions w.r.t. whether or not access(2) returns functional
data when running as a superuser.

POSIX says:
   In early proposals, some inadequacies in the access() function led
to the creation of an eaccess() function because:

  1. Historical implementations of access() do not test file
access correctly when the process' real user ID is superuser. In
particular, they always return zero when testing execute permissions
without regard to whether the file is executable.
  2. The superuser has complete access to all files on a system.
As a consequence, programs started by the superuser and switched to
the effective user ID with lesser privileges cannot use access() to
test their file access permissions.

   However, the historical model of eaccess() does not resolve
problem (1), so this volume of IEEE Std 1003.1-2001 now allows
access() to behave in the desired way because several implementations
have corrected the problem. It was also argued that problem (2) is
more easily solved by using open(), chdir(), or one of the exec
functions as appropriate and responding to the error, rather than
creating a new function that would not be as reliable. Therefore,
eaccess() is not included in this volume of IEEE Std 1003.1-2001.

   The sentence concerning appropriate privileges and execute
permission bits reflects the two possibilities implemented by
historical implementations when checking superuser access for X_OK.

   New implementations are discouraged from returning X_OK unless at
least one execution permission bit is set.


This seems wrong for directories.  It should say ... unless the file
is 'executable'.  'executable' means searchable for directories, and
the above shouldn't apply.  'executable' actually means executable for
regular files, and the above should only apply indirectly: it is
executability that should be required to have an X perm bit set, and
then access() should just track the capability.  The usual weaseling
with appropriate privilege allows the X perm bits to have any control
on executablity including none, and at least the old POSIX spec doesn't
get in the way of this, since it doesn't mention the X perm bits in
connection with the exec functions.  The spec goes too far in the other
direction for the access function.


FreeBSD says:

Even if a process's real or effective user has appropriate privileges and
indicates success for X_OK, the file may not actually have execute per-
mission bits set.  Likewise for R_OK and W_OK.


Perhaps it is time to fix this.  The part about X_OK never applied to any
version of FreeBSD.  Perhaps it applied to the body deleted version of
execve() in Net/2 and 4.4BSD, but FreeBSD had to reimplement execve() and
it never had this bug.  But^2, the access() syscall and man page weren't
changed to match.  See the end of this reply for more details on execve().
See the next paragraph about more bugs in the above paragraph.

Other bugs:
- R_OK and W_OK are far from likewise.  Everone knows that root can read
  and write any file.
- The permission bits are relatively uninteresting.  access() should track
  the capability, not the bits.  The bits used to map to the capability
  directly for non-root, but now with ACLs, MAC, etc. they don't even do
  that.
- access(2) has no mention of ACLs, MAC, etc.
- See a recent PR about unifdefed CAPABILITIES code in vaccess().  (The
  comment says that the code is always ifdefed out, but it now always
  unifdefed in.)   I don't quite understand this code -- does it give
  all of ACLs, MAC and etc. at this level?



   This results in:

   sh/test - ok-ish (a guy on bash-bugs challenged the fact that the
syscall was buggy based on the details returned).
   bash - broken (builtin test(1) always returns true)
   csh  - not really ok (uses whacky stat-like detection; doesn't
check for ACLs, or MAC info)
   perl - ok (uses eaccess(2) for our OS).
   python - broken (uses straight access(2), so os.access is broken).

   So now the question is how to fix this? Linux reports the correct
mode for the file (when operating as superuser or not), and there's a
lot of code outside of *BSD that builds upon that assumption because
stat(2) doesn't test for permissions via POSIX ACLs, MAC, etc.
   I tried munging through the code to determine where VOP_ACCESS was
coming from but I got lost. Where should I look for this?


Mostly it is in vaccess(9).  But execve() mostly doesn't use VOP_ACCESS().
Here is the FreeBSD-1 version, which is a bit shorter and thus easier to
understand than the current version, and shows that FreeBSD has always
required 1 exec bit for execve():

% /*
%  * Check permissions of file to execute.
%  *Return 0 for success or error code on failure.
%  */
% int
% exec_check_permissions(iparams)
%   struct image_params *iparams;
% {
%   struct proc *p =

Re: Chasing down bugs with access(2)

2010-07-21 Thread Bruce Evans


On Wed, 21 Jul 2010, Garrett Cooper wrote:


On Wed, Jul 21, 2010 at 1:40 AM, Bruce Evans b...@optusnet.com.au wrote:

- See a recent PR about unifdefed CAPABILITIES code in vaccess(). ?(The
?comment says that the code is always ifdefed out, but it now always
?unifdefed in.) ? I don't quite understand this code -- does it give
?all of ACLs, MAC and etc. at this level?


Interestingly standard permissions bypass ACLs/MAC if standard
permissions on the file/directory allow the requested permissions to
succeed; note the return (0) vs the priv_check_cred calls -- this is
where the the ACL/MAC for the inode is checked. This seems backwards,
but I could be missing something..


I was wrong to say that vaccess() does most of the checking.  This is only
correct if there are no ACLs, etc.  Otherwise, e.g., for ffs, the layering
is more like VOP_ACCESS() - ufs_access() =
check r/o ffs
check quota (bogusly in the clause whose comment says that it checks
for a r/o fs, deep in the case statement for that clause.  This
was readable when that clause was only 16 lines long, but now
it has messy locking for quotas, and large comments about this,
so it is 44 lines long, with the number of lines for locking
up from 4 to 32)
check immutability.  Another bug in access()'s man page is that it
doesn't mention either immutability or the errno EPERM that at
least ufs_access() returns for it.
check acls
call vaccess().  The MAC checks seem to be at the end of this and
   are unrelated to acls.  But for exec_check_permissions(), the
   MAC checks are first.


...

% ? ? ? /*
% ? ? ? ?* 1) Check if file execution is disabled for the filesystem that
this
% ? ? ? ?* ? ? ?file resides on.
% ? ? ? ?* 2) Insure that at least one execute bit is on - otherwise root
% ? ? ? ?* ? ? ?will always succeed, and we don't want to happen unless the
% ? ? ? ?* ? ? ?file really is executable.
% ? ? ? ?* 3) Insure that the file is a regular file.
% ? ? ? ?*/
% ? ? ? if ((vnodep-v_mount-mnt_flag  MNT_NOEXEC) ||
% ? ? ? ? ? ((attr-va_mode  0111) == 0) ||
% ? ? ? ? ? (attr-va_type != VREG)) {
% ? ? ? ? ? ? ? return (EACCES);
% ? ? ? }


Your mail program seems to be responsible for making the above unreadable
by changing tabs to hard \xa0's (which are displayed as tabs by my mail
client(s?), but soft \xa0's followed by a space by my editor.


0111 is an old spelling of the S_IX* bits. ?We check these directly


Maybe the changes weren't of tabs.  In this paragraphs, sentence breaks of
2 spaces got changed to 1 space followed by a hard \xa0.


Yet 2 more bugs: not just point 2, but points 1 and 3 in the above
comment are undocumented in execve(2) and access(2). ?The usual weaseling
with appropriate privilege should allow these too, but (as I forgot
to mention above) I think appropriate privilege is supposed to be
documented somewhere, so the man pages are still missing details.


Agreed on the former statement, and I understand the reasoning for the
latter statement, but at least for 1., this is a feature of mount(2)
(of course):

MNT_NOEXEC   Do not allow files to be executed from the file system.


How is a newbie supposed to know to look in mount(2) to find extra failure
cases?  execve(2) even cross-references mount(8), but this was in connection
with the nosuid mount option in a wrong man page:

% Index: execve.2
% ===
% RCS file: /home/ncvs/src/lib/libc/sys/execve.2,v
% retrieving revision 1.11
% retrieving revision 1.12
% diff -u -2 -r1.11 -r1.12
% --- execve.2  11 Jan 1998 21:43:38 -  1.11
% +++ execve.2  27 Apr 1999 03:56:10 -  1.12
% @@ -31,5 +31,5 @@
%  .\
%  .\ @(#)execve.28.5 (Berkeley) 6/1/94
% -.\ $Id$
% +.\ $Id: execve.2,v 1.11 1998/01/11 21:43:38 alex Exp $
%  .\
%  .Dd June 1, 1994
% @@ -144,4 +144,9 @@
%  .ne 1i
%  .Pp
% +The set-ID bits are not honored if the respective file system has the
% +.Ar nosuid
% +option enabled or if the new process file is an interpreter file.  Syscall
% +tracing is disabled if effective IDs are changed.
% +.Pp
%  The new process also inherits the following attributes from
%  the calling process:
% @@ -274,4 +279,6 @@
%  .Xr exit 3 ,
%  .Xr sysctl 3 ,
% +.Xr mount 1 ,

This dangling pointer was fixed to mount(8) in the next commit.  But it
should have been to mount(2) for MNT_NOSUID.

% +.Xr ktrace 1 ,
%  .Xr environ 7
%  .Sh HISTORY

I strongly dislike general references to man pages for 1 detail.  There
should be an Xr where each of the nosuid and tracing details is described
and none at the end.

There are actually many details of interest here, but how is a newbie
going to notice this when the committers didn't?  Details of interest
are at least:
- MNT_RDONLY (related to EROFS error)
- MNT_NOSUID
- MNT_NOEXEC
- MNT_ACLS (new)
- MNT_QUOTA
Better yet, nmount() allows any number of new mount options that might
affect

Re: [Patch] Kgmon/Gprof On/Off Switch

2010-07-07 Thread Bruce Evans


On Tue, 6 Jul 2010, Sean Bruno wrote:


On Thu, 2010-07-01 at 16:46 -0700, Sean Bruno wrote:

Found this lying around the Yahoo tree this week.  Basically it allows
you to activate, reset and deactivate profiling with the '-f flag.


I want something like this, but this implementation has too many style
bugs and obscure or missing locking.


Kind of nice to have if you want the ability to turn on profiling for
debugging live systems.


We already have this via `kgmon -hBb', but this patch implements
something different.  Normal profiling requires configuring the kernel
with config -p[p], but that adds a lot of overhead, even when profiling
is not turned on.  The patch implements dynamic configuration of flat
profiling only (since dynamic call graphs are too hard to implement,
but this restriction is not mentioned anywhere in the patch or the
accompanying mail.

Userland profiling has the same lossage -- you have to configure
programs specially by compiling with -pg, but that adds a lot of
overhead, even when profiling is not turned on.  Moreover, in userland
there is no library support for turning profiling on and off or for
dumping and clearing the statistics except on program start and
completion, so the much larger overhead of always maintaining the call
graph is normally always present.

In FreeBSD-1, I (ab)used a bug in the implementation to turn on flat
profiling (only) in the kernel.  This avoids the large call graph
overhead (it still costs a function call and some other branches on
every call and maybe on every return, but the branches for this are
predictable so the overhead is not very large.  This should be made
a standard feature, and perhaps this patch implements it as a bug
without really trying, as in FreeBSD-1 (implementing it just takes
setting gp-state to a value that gives flat profiling while keeping
mcount() null if full profiling is configured.

% Index: usr.sbin/kgmon/kgmon.8
% ===
% --- usr.sbin/kgmon/kgmon.8(revision 209745)
% +++ usr.sbin/kgmon/kgmon.8(working copy)
% @@ -70,6 +70,9 @@
%  Dump the contents of the profile buffers into a
%  .Pa gmon.out
%  file.
% +.It Fl f
% +Free profiling buffer.
% +Also stops profiling.
%  .It Fl r
%  Reset all the profile buffers.
%  If the

Freeing the profiling buffer alone isn't very useful.  The memory wastage
from always allocating it at boot time and never freeing it would be
rarely noticed, and would fix some races or simplify avoidance of races.
However, apart from the race problems, ordinary statically configured
profiling should free things too, since unlike in userland it is normal
to have profiling turned off most of the time even when it is statically
configured.

The above doesn't really describe what -f does.  In normal profiling,
there are multiple profiling buffers and freeing just the flat profiling
one makes no sense.  -f in fact frees the profiling buffer only if the
kernel is _not_ configured for profiling (as is required to not corrupt
the set of profiling buffers if the kernel is configured for profiling).
The profiling buffer is automatically allocated on first use iff it
is not already allocated, and of course -f has no effect if no buffer
is allocated.  Perhaps -r should automatically deallocate, so -f would
not be needed.  Kernel profiling has this feature of allowing multiple
dumps of the buffer(s) before reset, so reset is a good time to deallocate.

Style bugs in the above: f is unsorted between p and r.

% Index: usr.sbin/kgmon/kgmon.c
% ===
% --- usr.sbin/kgmon/kgmon.c(revision 209745)
% +++ usr.sbin/kgmon/kgmon.c(working copy)
% @@ -72,7 +72,7 @@
%   struct gmonparam gpm;
%  };
% 
% -int	Bflag, bflag, hflag, kflag, rflag, pflag;

% +int  Bflag, bflag, hflag, kflag, rflag, pflag, fflag;

Style bugs: now f is unsorted after r and p.  p was already unsorted after r.

%  int  debug = 0;
%  int  getprof(struct kvmvars *);
%  int  getprofhz(struct kvmvars *);
% @@ -82,6 +82,7 @@
%  void dumpstate(struct kvmvars *kvp);
%  void reset(struct kvmvars *kvp);
%  static void usage(void);
% +voidfreebuf(struct kvmvars *kvp);

Style bugs: f is unsorted after u; all the old declarations are indented
with 1 tab while the new one is indented with spaces.

% 
%  int

%  main(int argc, char **argv)
% @@ -93,7 +94,7 @@
%   seteuid(getuid());
%   kmemf = NULL;
%   system = NULL;
% - while ((ch = getopt(argc, argv, M:N:Bbhpr)) != -1) {
% + while ((ch = getopt(argc, argv, M:N:Bbfhpr)) != -1) {
%   switch((char)ch) {
% 
%  		case 'M':

% @@ -113,6 +114,10 @@
%   bflag = 1;
%   break;
% 
% +		case 'f':

% + fflag = 1;
% + break;
% +
%   case 'h':
%   hflag = 1;
%   break;

Now f is correctly sorted in the above 2 places.

Re: [PATCH] RUSAGE_THREAD

2010-05-06 Thread Bruce Evans


On Thu, 6 May 2010, Kostik Belousov wrote:


On Thu, May 06, 2010 at 04:42:39PM +0400, Alexander Krizhanovsky wrote:

...
Indeed, I concerned more about PROC_LOCK/PROC_SLOCK - they are acquired
both in kern_getrusage() and could be a problem in multithreaded process
with intensive performance accounting in threads. This happens for
example in resource accounting systems for shared hosting which are
solved by custom extensions of Apache, MySQL and other servers.


We still need per-process lock for resource summation code.


It's hard to see how it can be done without any lock at all.  The
oldest example of per-CPU summation code is in vm_meter.c  It doesn't
use any locks, but without any lock at all it is fundamentally broken.
Incoherencies in the data (due to reading and writing the counters at
different unsynchronized times in the past) would be more obvious for
times (except when the unsynchronized accesses give an off-by-2**32
error due to wraparound of a 32-bit counter), so I think getrusage()
should be more careful than vm_meter.  calcru() already does a lot of
work and locking just to synchronize times relative to previous readings.


On the other
hand, I do not understand why p_rux have to be protected by spinlock and
not by process mutex. Hm, spinlock disables preemption, but only for
curthread,


It is probably mostly historical.  The read-modify-write of pc_switchtime
and some other things used to be protected by splstatclock().  Hmm,
that was the lowest spl level, closer to a non-spin mutex than anything.
In RELENG_4, calcru() doesn't modify switchtime either, and mi_switch()
has comments saying that even the splstatclock() should be unnecessary.
RELENG_4 is of course non-preemptive, so it has close to the equivalent
of critical_enter() automatically.


Process spinlock is not acquired in the mi_switch(), so I
really do not see a reason not to change protection to mutex.


mi_switch)() holds thread lock, which is a special type of spinlock.

I think the following philosphy is adequate for reading times: the time
returned shall be no earlier than the time (according to a uniquely
selected clock at the point where the time is read) at the beginning of
the call, and coherent with itself.  (It will be indeterminately far in
the past by the time that it gets back to the caller.  Times read by a
single caller shall be monotonic, but times read by separate callers are
incomparable unless the callers do additional synchronization).  Thus
the locked part of reading a clock is:
- lock sufficiently to synchronize the next step
- update memory copy of the time, so that it is not in the past
- drop locks needed for the update, but keep ones needed to prevent the
  memory copy changing
- read and use the memory copy, reassembling it into the form needed by
  the caller.  This step may be long delayed from the update, but there
  is no point in trying to minimize the delay for this step, since we
  must soon drop all time-related locks and thus lose all control over
  the delay seen by the caller.
Only in very critical kernel environments would it be necessary and
possible to hold strong locks to minimize the delay across the whole
operation of reading the time and acting on the result.


I asked about it in my message from 3rd of May and also there I've
proposesed following patch for calcru1():

-   if ((int64_t)tu  0) {
-   /* XXX: this should be an assert /phk */
-   printf(calcru: negative runtime of %jd usec for pid %d
(%s)\n,
-   (intmax_t)tu, p-p_pid, p-p_comm);
-   tu = ruxp-rux_tu;
-   }
+   KASSERT((int64_t)tu  0, (calcru: negative runtime of %jd usec 
+   for pid %d (%s)\n,
+   (intmax_t)tu, p-p_pid, p-p_comm));


I did understand your proposal, but, as I said, I have no opinion.


This patch is mostly backwards:
- backwards condition in KASSERT()
- removal of KNF formatting
- removal of the fixup when the KASSERT() is null.

I think I prefer removing the whole statement.  But the KASSERT() may be
useful for verifying that the failure case really is unreachable due to
programming bugs.  I think it reachable due to hardware bugs, but only
ones that are less likely than a memory parity error, and the handling
of hardware bugs here shouldn't be to panic.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: question about sb-st_blksize in src/sys/kern/vfs_vnops.c

2008-10-25 Thread Bruce Evans

On Fri, 24 Oct 2008, Thierry Herbelot wrote:

the [SUBJ] file contains the following extract (around line 705) :

* Default to PAGE_SIZE after much discussion.
* XXX: min(PAGE_SIZE, vp-v_bufobj.bo_bsize) may be more correct.
*/

sb-st_blksize = PAGE_SIZE;

which arrived around four years ago, with revision 1.211 (see
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)

Indeed, this was completely broken long ago (in 1.211). Before then, and
after 1.128, some cases worked as intended if not perfectly:
- regular files: file systems still set va_blksize to their idea of the
best i/o size (normally to the file system block size, which is
normally larger than PAGE_SIZE and probably better in all cases) and
this was used here. However, for regular files, the fs block size
and the application's i/o size are almost irrelevant in most cases
due to vfs clustering. Most large i/o's are done physically with
the cluster size (which due to a related bug suite ends up being
hard-coded to MAXPHYS (128K) at a minor cost when this is different
from the best size).
- disk files: non-broken device drivers set si_iosize_best to their idea
of the best i/o size (normally to the max i/o size, which is normally
better than PAGE_SIZE) and this was used here. The bogus default
of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
was for the buffer cache implementation for block devices which no
longer exist and was too small for them anyway).
- non-disk character-special files: the default of PAGE_SIZE was used.
The comment about defaulting to PAGE_SIZE was added in 1.128 and is
mainly for this case. Now the comment is nonsense since the value is
fixed, not a default.
- other file types (fifos, pipes, sockets, ...): these got the default of
PAGE_SIZE too.

In rev.1.1, st_blksize was set to va_blksize in all cases. So file systems
were supposed to set va_blksize reasonably in all cases, but this is not
easy and they did nothing good except for regular files.

Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS
(64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for disks.
This gave nonsense like 64K buffers for slow tty devices (keyboards) and
2K buffers for fast disks. At least for programs that trust st_blksize
o be reasonable. Fortunately, st_blsize is rarely used...

the net effect of this change is to decrease the block buffer size used in
libc/stdio from 16 kbytes (derived from the underlying ufs partition) to
PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth is
lowered (this is on a slow Flash).

... except it is used by stdio. (Another mess here is that stdio mostly
doesn't use its own BUFSIZ. It trusts st_blksize if fstat() to determine
st_blksize works. Of course, the existence of BUFSIZ is a related
historical mistake -- no fixed size can work best for all cases. But
when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)

It's interesting that you can see the difference. Clustering is especially
good for hiding slowness on slow devices. Maybe you are using a configuration
that makes clustering ineffective. Mounting the file system with -o sync
or equivalently, doing a sync after every (too-small) write would do it.
Otherwise, writes are normally delated until the next cluster boundary.

I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to
revert to the block size previoulsly used), and the kernel and world seem to
be running fine.

Seeing the XXX coment above, I'm a bit worried about keeping this new
st_blksize value.

are there any drawbacks with running with this bigger buffer size value ?

Mostly it doesn't matter, since buffering (clustering) hides the differences.
Without clustering, 16K is a much better default for disks than 4K, though
not as good as the non-default va_blksize for regular files. Newer disks
might prefer 32K or 64k, but then the fs block size should also be increased
from 16K. Otherwise, increasing the block size usually reduces performance,
by thrashing caches or increasing latencies. With modern cache sizes and disk
speeds, you won't see these effects for a block size of 64K, so defaulting to
64K would be reasonable for disks. It would be silly for keyboards, but with
modern memory sizes you would notice this even less than when it was that in
old versions.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans


On Tue, 12 Feb 2008, Andriy Gapon wrote:


on 12/02/2008 15:11 Bruce Evans said the following:

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message [EMAIL PROTECTED], Andriy Gapon writes:



3.1. for a fresh buf getlbk would assign the following:
bsize = bo-bo_bsize;
offset = blkno * bsize;

That's clearly wrong.


If units were always 512-blocks, then anything except bsize = DEV_BSIZE
would be clearly wrong.  Things aren't that simple (but probably should
be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.


O, I missed this obvious thing!


Me too.


Bruce, thank you for putting me back on the ground :-)
Even in UDF case, when we bread() via a file or directory vnode we pass
a logical 2048-byte block number (within the file) to bread(). In this
case the code in getblk() does the right thing when it calculates offset
as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.


I think the size is mnt_stat.f_iosize in general (but not for device vnodes).


And the actual reading works correctly because udf_strategy is called
for such vnodes and it translates block numbers from physical to logical
and also correctly re-calculates b_iooffset for actual reading. So
b_iooffset value set in breadn() (using bdtob) is completely ignored.


Is this setting ever used (and the corresponding setting for writing)
ever used?


I remember that this is why g_vfs_* was my primary target.
It seems that I revert my opinion back: either g_vfs_open should be
smart about setting bo_bsize, or g_vfs_strategy should be smart about
calculating bio_offset, or individual filesystems should not depend on
g_vfs_* always doing the best thing for them.


In fact, ffs already overrides the setting of bo_bsize for the device
vnode to a different wrong setting -- g_vfs_open() sets the sector size,
and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
the setting to be DEV_BSIZE (I think).  Other bugs from this:
- ffs_rawread wants the sector size, and it assumes that this is in bo_bsize
  for the device vnode, but ffs_mount() has changed this to fs_bsize.
- multiple r/o mounts are supposed to work, but don't, since there is only
  one device vnode with a shared bufobj, but the bufobj needs to be
  per-file-system since all mounts write to it.  Various bad things
  including panics result.  There is a well-know panic from bo_private
  becoming garbage on unmount.  I just noticed more possibilities for
  panics.  bo_ops points to static storage, so it never becomes complete
  garbage.  However, at least ffs sets it blindly early on in
  ffs_mountfs(), before looking at the file system to see if ffs can
  mount it.  Thus if the file system is already mounted by another
  ffs, then ffs clobbers the other fs's bo_ops.  The ffs mount will
  presumably fail, leaving bo_ops clobbered.  Also, a successful
  g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
  earlier.  g_vfs_open() is fundamental to looking at the file system,
  since I/O is not set up until it completes.  Clobbering the pointers
  is most dangerous, but just clobbering bo_bsize breaks blkno
  calculations for any code that uses bo_bsize.

Apart from these bugs, the fix for the blkno calculations for device
vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
vp of all disk file systems (since all disk file systems use expect
this size).  Currently, ffs seems to be the only file system that
overrides g_vfs_open()'s default of the sector size.  Nothing in any
of fs/, gnu/fs/ and contrib/ references bo_bsize.


In any case, it seems that it is the g_vfs_* that is currently
inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
expects that it would always be provided with correct b_iooffset (the
latter being rigidly calculated via bdtob() in buffcache code). So this
leaves filesystems dead in the water while reading via device vnode:
provide blkno in 512-byte units - screw up VM cache, provide blkno in
bo_bsize units - screw up reading from disk.


I think I/O on the disk doesn't use bo_bsize, but only the sector size
(to scale the byte count to a sector count).  Otherwise, ffs's override
would break I/O.  geom is another place that has few references to
bo_bsize -- it just clobbers it in g_vfs_open().


Not sure if the FS-s should have wits to set bo_bsize properly and/or
provide proper bo_ops - we are talking about a device vnode here.
Should filesystems be involved in the business of setting its
properties? Not sure.


Yes.  They can set it more easily as they can tell g_vfs_open() what to
set it to.  Except, until the bugs are fixed properly, g_vfs_open() can
more easily do tests to prevent clobbering.  For bo_bsize and bo_ops,
sharing a common value is safe and safe changes can be detected by
setting to a special value on last unmount.  For bo_private, a safety
check would probably disallow multiple mounts (since cp is dynamically
allocated (?)).

Bruce

Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans


On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message [EMAIL PROTECTED], Andriy Gapon writes:


2.3. this code passes to bread blkno that is calculated as 4*sector,
where sector is a number of a physical 2048-byte sector. [**]
[**] - I think that this is a requirement of buffcache system, because
internally it performs many calculations that seem to assume that block
size is always 512.


Yes, the buf-cache works in 512 bytes units throughout.


I missed the dbtob() conversions in vfs_bio.c when I replied previously
So blkno cannot be a cookie; it must be for a 512-block.  So how did
the cases where bsize != DEV_BSIZE in getblk() ever work?


3.1. for a fresh buf getlbk would assign the following:
bsize = bo-bo_bsize;
offset = blkno * bsize;


That's clearly wrong.


If units were always 512-blocks, then anything except bsize = DEV_BSIZE
would be clearly wrong.  Things aren't that simple (but probably should
be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.
That seems to include nfs(client).  In fact, nfs_getcacheblk() does
weird scaling which seems to be mainly to compensate for for weird scaling
here.  It calls getblk() with a bn arg that seems to be f_iosize units.
Then at then end, for the VREG case, it sets bp-b_blkno to this bn
scaled to normal DEV_BSIZE units.  bp-b_blkno seems to have DEV_BSIZE
units for all uses of it in nfs.


We need to assert that the blkno is aligned to the start of a sector
and use the 512 byte units, so I guess it would be:

   offset = dbtob(blkno);
   KASSERT(!(offset  (bsize - 1)), (suitable diagnostic));


Barely worth checking.

The current bug has nothing to do with this.  The offset is just wrong
at this point, after using a scale factor that is inconsistent with the
units of blkno, for all (?) disk (and other?) file systems whose sector
size isn't 512.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans


On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message [EMAIL PROTECTED], Andriy Gapon writes:

on 06/02/2008 18:29 Andriy Gapon said the following:

Small summary of the above long description.
For directory reading fs/udf performs bread() on a (underlying) device
vnode. It passes block number as if block size was 512 bytes (i.e.
byte_offset_within_dev/512).


We have three sizes of relevance here, the sectorsize of the provider,
the blocksize of the filesystem and the page size of the system.


4. The block size required for bread() and friends (almost always
   DEV_BSIZE).


In general it is adventurous to have any of them be anything but
powers of two, and it is at best ill-adviced and more likely asking
for trouble to do requests that are not multiple of and aligned to
the sectorsize of the provider.

So up front, I would say that it is an UDF bug to do 512 byte reads off
a 2k sector provider.

Making the buffer cache handle this is possible, but it is not the
direction we have planned for the buffercache (ie: that it should
become a wrapper for using the VM system, rather than being a
separately allocated cache).

So if the objective is to make UDF work in the short term, your
change might work, if the objective is to move FreeBSD's kernel
architecture forward, the right thing to do is to fix UDF to not
access subsectors.


This bug has nothing to do with subsectors, and very little to do with
udf.  udf is just depending on bread()'s API being unbroken.  That API
requires starting with blocks consisting of whole sectors (else the
subsequent I/O would fail) and converting to blocks of size DEV_BSIZE,
phyexcept for unusual (non-disk?) file systems like nfs (and no others?).
All (?) disk file systems do this conversion.  The bug is just more
noticeable for udf since it is used more often on disks with a block
size != DEV_BSIZE, and it apparently does something that makes the bug
mess up VM more than other file systems.

Of course, it might be better for the API to take blocks in units of
the sector size, but that is not what it takes, and this can't be
changed easily or safely.

The standard macro btodb() is often used to convert from bytes to
blocks of this size, and doesn't support bo_bsize or the bufobj pointer
that would be needed to reach bo_bsize.

ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block
(frag?)-sized units to blocks in DEV_BSIZE units so as to pass them
to unbroken bread() and friends.  ffs initializes bo_bsize to its block
(not frag) size, and then never uses it directly except in ffs_rawread(),
where it is used to check that the I/O is in units of whole sectors
(otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest
of ffs).

The details of fsbtodb() are interesting.  They show signs of previous
attempts to use units of sectors.  From ffs/fs.h:

% /*
%  * Turn filesystem block numbers into disk block addresses.
%  * This maps filesystem blocks to device size blocks.
%  */
% #define   fsbtodb(fs, b)  ((daddr_t)(b)  (fs)-fs_fsbtodb)
% #define   dbtofsb(fs, b)  ((b)  (fs)-fs_fsbtodb)

Creation of fs_fsbtodb is left to newfs.  The units of DEV_BSIZE are thus
built in to the on-disk data struct (in a fairly easy to change and/or
override way, but there are a lot of existing disks).  From newfs.c:

%   realsectorsize = sectorsize;
%   if (sectorsize != DEV_BSIZE) {  /* XXX */
%   int secperblk = sectorsize / DEV_BSIZE;
% 
% 		sectorsize = DEV_BSIZE;

%   fssize *= secperblk;
%   if (pp != NULL)
%   pp-p_size *= secperblk;
%   }
%   mkfs(pp, special);

Though mkfs() appears to support disk blocks of size sector size =
sectorsize, its sectorsize parameter is hacked on here, so it always
generates fs_fsbtodb and other derived parameters for disk blocks of
size DEV_BSIZE, as is required for the unbroken bread() API.

msdosfs is similar to ffs here, except it constructs the shift count
at mount time, to arrange for always converting to DEV_BSIZE blocks for
passing the bread() and friends.

udf is effectively similar, but pessimal and disorganized.  For the
conversion for bread(), it uses btodb().  In udf_bmap(), it constructs
a shift count on every call by subtracting DEV_BSHIFT from its block shift
count.  In udf_strategy(), on every call it constructs a multiplier
instead of a shift count, by dividing its block size by DEV_BSIZE.  It's
weird that the strategy routing, which will soon end up sending sectors
to the disk, is converting to DEV_BSIZE units, a size that cannot be the
size for udf's normal media.

cd9660 uses btodb() for one conversion for bread() and
(its block shift count - DEV_BSHIFT) in 7 other conversions to
DEV_BSIZE units.

So it's surprising that any file systems work on CDs and DVDs.  Maybe
the bug affects udf more because udf crosses page boundaries more.

It's also surprising that the bug has such small effects.  This seems
to be because the

Re: Memory allocation performance

2008-02-03 Thread Bruce Evans


On Mon, 4 Feb 2008, Alexander Motin wrote:


Kris Kennaway wrote:
You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out 
exactly where in those functions the events are occurring.  This will often 
help to track down the precise causes.


Thanks to the hint, it was interesting hunting, but it shown nothing. It hits 
into very simple lines like:

bucket = cache-uc_freebucket;
cache-uc_allocs++;
if (zone-uz_ctor != NULL) {
cache-uc_frees++;
and so on.
There is no loops, there is no inlines or macroses. Nothing! And the only 
hint about it is a huge number of p4-resource-stalls in those lines. I have 
no idea what exactly does it means, why does it happens mostly here and how 
to fight it.


Try profiling it one another type of CPU, to get different performance
counters but hopefully not very different stalls.  If the other CPU doesn't
stall at all, put another black mark against P4 and delete your copies of
it :-).

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: Memory allocation performance

2008-02-01 Thread Bruce Evans


On Fri, 1 Feb 2008, Alexander Motin wrote:


Robert Watson wrote:
It would be very helpful if you could try doing some analysis with hwpmc -- 
high resolution profiling is of increasingly limited utility with modern


You mean of increasingly greater utility with modern CPUs.  Low resolution
kernel profiling stopped giving enough resolution in about 1990, and has
become of increasingly limited utility since then, but high resolution
kernel profiling uses the TSC or possibly a performance counter so it
has kept up with CPU speedups.  Cache effects and out of order execution
are larger now, but they affect all types of profiling and still not too
bad with high resulotion kernel profiling.  High resolution kernel profiling
doesn't really work with SMP, but that is not Alexander's problem since he
profiled under UP.

CPUs, where even a high frequency timer won't run very often.  It's also 
quite subject to cycle events that align with other timers in the system.


No, it isn't affected by either of these.  The TSC timer is incremented on
every CPU cycle and the performance counters run are incremented on every
event.  It is completely unaffected by other timers.

I have tried hwpmc but still not completely friendly with it. Whole picture 
is somewhat alike to kgmon's, but it looks very noisy. Is there some know 
how about how to use it better?


hwpmc doesn't work for me either.  I can't see how it could work as well
as high resolution kernel profiling for events at the single function
level, since it is statistics-based.  With the statistics clock interrupt
rate fairly limited, it just cannot get enough resolution over short runs.
Also, it works poorly for me (with a current kernel and ~5.2 userland
except for some utilities like pmc*).  Generation of profiles stopped
working for me, and it often fails with allocation errors.

I have tried it for measuring number of instructions. But I am in doubt that 
instructions is a correct counter for performance measurement as different 
instructions may have very different execution times depending on many 
reasons, like cache misses and current memory traffic.


Cycle counts are more useful, but high resolution kernel profiling can do
this too, with some fixes:
- update perfmon for newer CPUs.  It is broken even for Athlons (takes a
  2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
  sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the
  performance counter for the instruction count (or any).  I use hwpmc
  mainly to determine N :-).  It may also be necessary to change the
  kernel variable cpu_clock_pmc_conf.  Configuration of this is unfinished.
- use high resolution kernel profiling normally.  Note that switching to
  a perfmon counter is only permitted of !SMP (since it is too unsupported
  under SMP to do more than panic if permitted under SMP), and that the
  switch loses the calibration of profiling.  Profiling normally
  compensates for overheads of the profiling itself, and the compensation
  would work almoost perfectly for event counters, unlike for time-related
  counters, since the extra events for profiling aren't much affected by
  caches.

I have tried to use 
tsc to count CPU cycles, but got the error:

# pmcstat -n 1 -S tsc -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification tsc: 
Operation not supported

What have I missed?


This might be just because the TSC really is not supported.  Many things
require an APIC for hwpmc to support them.

I get errors allocation like this for operations that work a few times
before failing.

I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but kernel 
built without SMP to simplify profiling. What counters can you recommend me 
to use on it for regular time profiling?


Try them all :-).  From userland first with an overall count, since looking
at the details in gprof output takes too long (and doesn't work for me with
hwpmc anyway).  I use scripts like the following to try them all from
userland:

runpm:
%%%
c=ttcp -n10 -l5 -u -t epsplex

ctr=0
while test $ctr -lt 256
do
ctr1=$(printf 0x%02x\n $ctr)
case $ctr1 in
0x00)   src=k8-fp-dispatched-fpu-ops;;
0x01)   src=k8-fp-cycles-with-no-fpu-ops-retired;;
0x02)   src=k8-fp-dispatched-fpu-fast-flag-ops;;
0x05)   src=k8-fp-unknown-$ctr1;;
0x09)   src=k8-fp-unknown-$ctr1;;
0x0d)   src=k8-fp-unknown-$ctr1;;
0x11)   src=k8-fp-unknown-$ctr1;;
0x15)   src=k8-fp-unknown-$ctr1;;
0x19)   src=k8-fp-unknown-$ctr1;;
0x1d)   src=k8-fp-unknown-$ctr1;;
0x20)   src=k8-ls-segment-register-load;;   # XXX
0x21)   src=kx-ls-microarchitectural-resync-by-self-mod-code;;
0x22)   src=k8-ls-microarchitectural-resync-by-snoop;;
0x23)   src=kx-ls-buffer2-full;;
0x24)   src=k8-ls-locked-operation;;# XXX

Re: [patch] Adding optimized kernel copying support - Part III

2006-05-31 Thread Bruce Evans


On Wed, 31 May 2006, Attilio Rao wrote:


2006/5/31, Suleiman Souhlal [EMAIL PROTECTED]:

Nice work. Any chance you could also port it to amd64? :-)


Not in the near future, I think. :P


It is not useful for amd64.  An amd64 has enough instruction bandwidth
to saturate the L1 cache using 64-bit accesses although not using
32-bit accesses.  An amd64 has 64-bit integer registers which can be
accesses without the huge setup overheads and code complications for
MMX/XMM registers.  It already uses 64-bit registers or 64-bit movs
for copying and zeroing of course.  Perhaps it should use prefetches
and nontemporal writes more than it already does, but these don't
require using SSE2 instructions like nontemporal writes do for 32-bit
CPUs.


Does that mean it won't work with SMP and PREEMPTION?


Yes it will work (even if I think it needs more testing) but maybe
would give lesser performances on SMP|PREEMPTION due to too much
traffic on memory/cache. For this I was planing to use non-temporal
instructions
(obviously benchmarks would be very appreciate).


Er, isn't its main point to fix some !SMP assumptions made in the old
copying-through-the-FPU code?  (The old code is messy due to its avoidance
of global changes.  It wants to preserve the FPU state on the stack, but
this doesn't quite work so it does extra things (still mostly locally)
that only work in the !SMP  (!SMPng even with UP) case.  Patching this
approach to work with SMP || SMPng cases would make it messier.)

The new code wouldn't behave much differently under SMP.  It just might
be a smaller optimization because more memory pressure for SMP causes
more cache misses for everything and there are no benefits from copying
through MMX/XMM unless nontemporal writes are used.  All (?) CPUs with
MMX or SSE* can saturate main memory using 32-bit instructions.  On
32-bit CPUs, the benefits of using MMX/XMM come from being able to
saturate the L1 cache on some CPUs (mainly Athlons and not P[2-4]),
and from being able to use nontemporal writes on some CPUs (at least
AthlonXP via SSE extensions all CPUs with SSE2).

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-02 Thread Bruce Evans

On Fri, 1 Apr 2005, Matthew Dillon wrote:
:The use of the XMM registers is a cpu optimization. Modern CPUs,
:especially AMD Athlon and Opterons, are more efficient with 128 bit
:moves then with 64 bit moves. I experimented with all sorts of
:configurations, including the use of special data caching instructions,
:but they had so many special cases and degenerate conditions that
:I found that simply using straight XMM instructions, reading as big
:a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that? The amd64 optimization manual says (essentially)
This is in 25112.PDF section 5.16 (Interleave Loads and Stores, with
128 bits of loads followed by 128 bits of stores).
:that big globs are bad, and my benchmarks confirm this. The best glob size
:is 128 bits according to my benchmarks. This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec. movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
I forgot many of my earlier conclusions when I wrote the above. The
speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case
are almost irrelevant. They basically just tell how well we have
used the instruction bandwidth. Plain movsq uses it better and gets
15.9GB/sec. I believe 15.9GB/sec is from saturating the L1 cache.
The CPU is an Athlon64 and its clock frequency is 1994 MHz, and I think
the max L1 cache bandwidth is with a 16-byte load and store per cycle;
16*1994*10^6 is 15.95GB/sec (disk manufacturers GB's).
Plain movsq is best here for many other cases too...
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to bencmark, so I normally benchmark
:the fully uncached case. For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.
Yah, I'm pretty sure. I tested the fully cached (L1), partially
cached (L2), and the fully uncached cases. I don't have a logic
By the partially cached case, I meant the case where some of the source
and/or target addresses are in the L1 or L2 cache, but you don't really
the chance that they are there (or should be there after the copy), so
you can only guess the best strategy.
analyzer but what I think is happening is that the cpu's write buffer
is messing around with the reads and causing extra RAS cycles to occur.
I also tested using various combinations of movdqa, movntdq, and
prefetcha.
Somehow I'm only seeing small variations from different strategies now,
with all tests done in userland on an Athlon64 system (and on athlonXP
systems for reference). Using XMM or MMX can be twice as fast on
the AthlonXPs, but movsq is absolutely the fastest in many cases on
the Athlon64, and is 5% slower than the fastest in all cases
(except for the fully uncached case since it can't do nontemporal
stores), so it is the best general method.
...
I also think there might be some odd instruction pipeline effects
that skew the results when only one or two instructions are between
the load into an %xmm register and the store from the same register.
I tried using 2, 4, and 8 XMM registers. 8 XMM registers seemed to
work the best.
I'm getting only small variations from different load/store patterns.
Of course, I primarily tested on an Athlon 64 3200+, so YMMV. (One
of the first Athlon 64's, so it has a 1MB L2 cache).
My test system is very similar:
%%%
CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU)
Origin = AuthenticAMD Id = 0xf48 Stepping = 8

Features=0x78bfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2
AMD Features=0xe0500800SYSCALL,NX,MMX+,LM,3DNow+,3DNow
L1 2MB data TLB: 8 entries, fully associative
L1 2MB instruction TLB: 8 entries, fully associative
L1 4KB data TLB: 32 entries, fully associative
L1 4KB instruction TLB: 32 entries, fully associative
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 2MB unified TLB: 0 entries, disabled/not present
L2 4KB data TLB: 512 entries, 4-way associative
L2 4KB instruction TLB: 512 entries, 4-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative
%%%
The prefetchnta I have commented out seemed to improve performance,
but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
for cpu's with MMX but without 3dNOW. Prefetching less then 128 bytes
did not help, and prefetching greater then 128 bytes (e.g. 256(%esi))
seemed to cause extra RAS cycles. It was unbelievably finicky, not at
all what I expected.

Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-01 Thread Bruce Evans

On Thu, 31 Mar 2005, Matthew Dillon wrote:
I didn't mean to get into the kernel's use of the FPU, but...
   All I really did was implement a comment that DG had made many years
   ago in the PCB structure about making the FPU save area a pointer rather
   then hardwiring it into the PCB.
ISTR writing something like that.  dg committed most of my early work
since I didn't have commit access at the time.
...
   The use of the XMM registers is a cpu optimization.  Modern CPUs,
   especially AMD Athlon and Opterons, are more efficient with 128 bit
   moves then with 64 bit moves.   I experimented with all sorts of
   configurations, including the use of special data caching instructions,
   but they had so many special cases and degenerate conditions that
   I found that simply using straight XMM instructions, reading as big
   a glob as possible, then writing the glob, was by far the best solution.
Are you sure about that?  The amd64 optimization manual says (essentially)
that big globs are bad, and my benchmarks confirm this.  The best glob size
is 128 bits according to my benchmarks.  This can be obtained using 2
64-bit reads of 64-bit registers followed by 2 64-bit writes of these
registers, or by read-write of a single 128-bit register.  The 64-bit
registers can be either MMX or integer registers on 64-bit systems, but
the 128-registers must be XMM on all systems.  I get identical speeds
of 12.9GB/sec (+-0.1GB/sec) on a fairly old and slow Athlon64 system
for copying 16K (fully cached) through MMX and XMM 128 bits at a time
using the following instructions:
# MMX:  # XMM
movq(%0),%mm0   movdqa  (%0),%xmm0
movq8(%0),%mm1  movdqa  %xmm0,(%1)
movq%mm0,(%1)   ... # unroll same amount
movq%mm1,8(%1)
... # unroll to copy 64 bytes per iteration
Unfortunately (since I want to avoid using both MMX and XMM), I haven't
managed to make copying through 64-integer registers work as well.
Copying 128 bits at a time using 2 pairs of movq's through integer
registers gives only 7.9GB/sec.  movq through MMX is never that slow.
However, movdqu through xmm is even slower (7.4GB/sec).
The fully cached case is too unrepresentative of normal use, and normal
(partially cached) use is hard to bencmark, so I normally benchmark
the fully uncached case.  For that, movnt* is best for benchmarks but
not for general use, and it hardly matters which registers are used.
   The key for fast block copying is to not issue any memory writes other
   then those related directly to the data being copied.  This avoids
   unnecessary RAS cycles which would otherwise kill copying performance.
   In tests I found that copying multi-page blocks in a single loop was
   far more efficient then copying data page-by-page precisely because
   page-by-page copying was too complex to be able to avoid extranious
   writes to memory unrelated to the target buffer inbetween each page copy.
By page-by-page, do you mean prefetch a page at a time into the L1
cache?
I've noticed strange loss (apparently) from extraneous reads or writes
more for benchmarks that do just (very large) writes.  An at least old
Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
(unless you use movntq on AthlonXP's).  The caches are flushed to main
memory some time later, apparently not very well since some pages take
more than twice as long to write as others (as seen by the writer
filling the caches), and the slow case happens enough to affect the
average write speed by up to 50%.  This problem can be reduced by
putting memory bank bits in the page colors.  This is hard to get right
even for the simple unrepresentative case of large writes.
Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Bruce Evans

On Wed, 30 Mar 2005, David Schultz wrote:
On Wed, Mar 30, 2005, Peter Jeremy wrote:
On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
Later in that thread they discuss skipping the restore state to make
things faster.  The minimum buffer size they say this will be good for
is between 2-4k.  Does this make sense, or am I showing my ignorance?
http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
Yes.  There are a variety of options for saving/restoring FPU state:
a) save FPU state on kernel entry
b) save FPU state on a context switch (or if the kernel wants the FPU)
c) only save FPU state if a different process (or the kernel) wants the FPU
1) restore FPU on kernel exit
2) restore FPU state if a process wants the FPU.
a and 1 are the most obvious - that's the way the integer registers are
handled.
I thought FreeBSD used to be c2 but it seems it has been changed to b2
since I looked last.
No, it always used b2.  I never got around to implementing c2.
Linux used to implement c2 on i386's, but I think it switched (to b2?) to
optimize (or at least simplify) the SMP case.
Based on the mail above, it looks like Dfly was changed from 1 to 2
(I'm not sure if it's 'a' or 'c' on save).
'a' seems to be too inefficient to ever use.  '1' makes sense if it
rarely happens and/or the kernel can often use the FPU more than once
per entry (which it probably shouldn't), but it gives complications
like the ones for SMP, especially in FreeBSD where the kernel can be
preempted.
Saving FP state as needed is simplest but can be slow.  My Athlon-with-
SSE-extensions pagecopy and pagezero routines use the FPU (XMM) but
their FP state save isn't slow because only 1 or 2 XMM registers needs
to be saved.  E.g., the saving part of sse_pagezero_for_some_athlons() is:
pushfl  # Also have to save %eflags.
cli # Switch %eflags as needed to safely use FPU.
movl%cr0,%eax   # Also have to save %cr0.
clts# Switch %cr0 as needed to use FPU.
subl$16,%esp# Space to save some FP state.
movups  %xmm0,(%esp)# Save some FP state.  Only this much needed.
On the i386 (and probably most other CPUs), you can place the FPU into
am unavailable state.  This means that any attempt to use it will
trigger a trap.  The kernel will then restore FPU state and return.
On a normal system call, if the FPU hasn't been used, the kernel will
see that it's still in an unavailable state and can avoid saving the
state.  (On an i386, unavailable state is achieved by either setting
CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
state at the expense of an additional trap if the process actually
uses the FPU.
I remember that you (Peter) did extensive benchmarks of this.  I still
think fully lazy switching (c2) is the best general method.  Maybe FP
state should be loaded in advance based on FPU affinity.  It might be
good for CPU affinity to depend on FPU use (prfer not to switch
threads away from a CPU if they own that CPU via its FPU).
This is basically what FreeBSD does on i386 and amd64.  (As a
disclaimer, I haven't read the code very carefully, so I might be
missing some of the details.)  Upon taking a trap for a process
that has never used the FPU before, we save the FPU state for the
last process to use the FPU, then load a fresh FPU state.  On
We don't save the FPU state for the last thread then (c2 behaviour)
since we have already saved it it when we switched away from it.
npxdna() panics if we haven't done that.  Except rev.1.131 added bogus
code (apparently to debug or hide bugs in the other changes in rev.1.131)
that breaks the panic in the fpcurthread == curthread case.
subsequent context switches, the FPU state for processes that have
already used the FPU gets loaded before entering user mode, I
think.  I haven't studied the code in enough detail to know what
No, that doesn't happen.  Instead, cpu_switch() has called npxsave()
on the context switch away from the thread.  npxsave() arranges for
a trap on the next use of the FPU, and we don't do anything more with
the FPU context of the thread until the thread tries to use the FPU
(in userland).  Then we take the trap and load the saved context in
npxdna().
happens for SMP, where a process could be scheduled on a different
processor before its FPU state is saved on the first processor.
There is no difference for SMP, but there would be large complicated
differences if we did fully lazy saving.  npxdna() would have to do
something like sending an IPI to the thread that owns the FPU if
that thread could be different from curthread.  This would be slow,
but might be worth doing if it didn't happen much and if lazy fully
lazy context switching were a significant advantage.  I think it
could be arranged to not happen much, but the advantage is insignificant.
BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on

Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Bruce Evans

On Thu, 31 Mar 2005, Peter Jeremy wrote:
On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
 I still
think fully lazy switching (c2) is the best general method.
I think it depends on the FP workload.  It's a definite win if there
is exactly one FP thread - in this case the FPU state never needs to
be saved (and you could even optimise away the DNA trap by clearing
the TS and EM bits if the switched-to curthread is fputhread).
I think stopping the trap would be the usual method (not sure what
Linux did), but to collect statistics for determining affinity you
would want to take the trap anyway.
The worst case is two (or more) FP-intensive threads - in this case,
lazy switching is of no benefit.  The DNA trap overheads mean that
the performance is worse than just saving/restoring the FP state
during a context switch.
My guess is that the current generation workstation is closer to the
second case - current generation graphical bloatware uses a lot of
FP for rendering, not to mention that the idle task has a reasonable
chance of being an FP-intensive distributed computing task (setiathome
or similar).  It's probably time to do some more measuring (I'm not
offering just now, I have lots of other things on my TODO list).
Bloatware might be so hoggish that it rarely makes context switches :-).
Context switches for interrupts increase the problem though, as would
using FP more in the kernel.
BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on Athlon-XP's and AMD64's.
I gather this is not noticable unless the application is doing its
own FPU save/restore.  Is there a solution or work-around?
It's most noticeable for debugging, and if you worry about leaking
thread context.  Fortunately, the last-instruction pointers won't
have real user data in them unless the application encodes it there
intentionally.  I can't see any efficent solution or workaround.
The kernel should do a full save/restore for processes being debugged.
For applications, the bug seems to be larger.  Even if they know about
the amd behaviour and do a full save/restore because they need it, it
won't work because the kernel doesn't preserve the state across
context switches.  Applications like vmware might care more than most.
I forgot to mention that we couldn't find anything in intel manuals
about this behaviour, so it might be completely amd-specific.  Also,
the instruction pointers are fundamentally broken for 64-bit CPUs,
since although they are 64 bits, they have the segment selector encoded
in their top 32 bits, so they are not really different from the 32:32
selector:pointer format for the non-fxsr case.  Their format is specified
by SSE2 so 64-bit extensions would have to be elsewhere, but amd64 doesn't
seem to extend them.
Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: why timeradd() and co are hidden from the kernel space ?

2004-06-18 Thread Bruce Evans

On Thu, 17 Jun 2004, Cyrille Lefevre wrote:

 is there any reasons to hide timeradd() and co in sys/time.h
 from the kernel space (! _KERNEL) ?

Yes.  This prevents them being used in the kernel.  They are compatibility
cruft for NetBSD.  In the kernel the corresponding interfaces are
spelled timevaladd(), etc.  Unfortunately, the timeradd() form of these
interfaces escaped to userland via NetBSD.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: ATA/CHS problem (path + new information)

2004-04-13 Thread Bruce Evans

On Sun, 11 Apr 2004, Roman Kurakin wrote:

 I remind you that now I have two problems. First one that FreeBSD uses
 wrong assumption about which device should be CHS and which LBA:

 if (!ad_version(atadev-param-version_major) ||
 !(atadev-param-atavalid  ATA_FLAG_54_58) || !lbasize)
 atadev-flags |= ATA_D_USE_CHS;

 True ATA device may not have ATA_FLAG_54_58 valid bit, and also due
 to last ATA standard this bit is obsoleted.

 I also want to know why ata driver doesn't check LBA support from word 49?
 May be this one check could solve my problems and didn't breake code for
 non-ATA devices.

Possibly for similar reasons.  It's hard to tell what's in the LBA bit for
pre-ATA devices older than LBA.  Similarly for the lbasize words, but it's
easier to do a sanity check on a 32-bit values that a 1-bit flag.

 Second one, that only 20G part of my hard disk works with CHS. This is other
 side of the same problem. Device should work in CHS mode. And it works
 witch ICH5 controller. But with ICH2 it doesn't with out hack.

 I've checked standard again and I sew command 91h (Initialize drive
 parameters).

Check out the commands for limiting the (apparent) disk size.  IIRC,
the CHS limit can be set independently of the LBA limit, and some
settings are harder than others so that they can't be cleared by old
commands like 0x91.  The limits may be set to prevent old drivers which
only understand old commands from becoming confused by trying to actually
use the whole disk.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: suspect spl*() code in syscons.c

2003-10-21 Thread Bruce Evans

On Tue, 21 Oct 2003, Luigi Rizzo wrote:

 both -current and -stable have the following snippet of code in
 sys/dev/syscons/syscons.c:scclose():

   {
   ...
   int s;

   if (SC_VTY(dev) != SC_CONSOLECTL) {
   ...
   s = spltty();
   ...
   }
   spltty();
   (*linesw[tp-t_line].l_close)(tp, flag);
   ttyclose(tp);
   spl0();
   return(0);
   }

 Note that the omitted code never does any spl*() call, nor it
 uses the saved value anymore. Also, i am a bit suspicious about the
 spltty()/spl0() sequence.

 Can someone explain if this code is correct ?
 (I have Bcc-ed the committers involved in writing this code,
 maybe they know the answer).

The initial ipl state is known to be that given by spl0(), so there is
no need to save the initial state to restore it using splx(); spl0()
sufficies.

However, when the omitted code neglects to restore the state, the initial
state is not that given by spl0() when the omitted code is executed.
Then using spl0() instead of splx() apparently-accidentally fixes the bug.
The scope of the spltty() seems to be a bit large but I don't know what
it should be exactly.

Ideally, the whole device close should be atomic (and as a prerequisite,
non-interruptible), but that is not possible for all console drivers
although it may be possible for syscons.  I have noticed the following
incompletely handled races in the corresponding code for the sio driver:

1. Last-close may block, mainly in l_close.  Quoting the above code again:

spltty();
(*linesw[tp-t_line].l_close)(tp, flag);
ttyclose(tp);
spl0();

   So despite holding spltty() here, we may block (mainly waiting for output
   to drain).  Perhaps we have to wait for output to drain even for
   synchcronous consoles like syscons, since their output may be blocked by
   something like scroll lock for syscons (I don't know where this block
   actually occurs).  While we are blocked, we have to drop the ipl so we
   may be affected by signals.  But generally signals aren't a problem here.
   More the reverse -- you can have a process waiting forever in exit() for
   output to drain and want to kill it but can't because the process has
   committed to exiting and doesn't accept any more signals.

2. The vfs permits (non-first) opens to succeed while last-close is blocked.
   For sio, this is useful behaviour.  You can use it to issue ioctls to
   unblock last-closes that would otherwise be blocked until the next reboot.
   However, the driver doesn't really understand this so it may make a mess
   of its state when the last-close completes despite the device ending up
   open at the file level.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: flush on close

2003-09-11 Thread Bruce Evans

On Wed, 10 Sep 2003, Eno Thereska wrote:

 In FreeBSD 4.4, I am noticing a huge number of calls
 to ffs_fsync() (in /sys/ufs/ffs/ffs_vnops.c)
 when running a benchmark like Postmark.

 ffs_fsync() flushes all dirty buffers with an open file
 to disk. Normally this function would be called
 either because the application writer explicitly
 flushes the file, or if the syncer daemon or buffer daemon
 decide it's time for the dirty blocks to go to disk.

 Neither of these two options is happening. Files are opened and closed
 very frequently though. I have a  suspicion that BSD is using the
 flush-on-close semantic.

 Could someone confirm or reject this claim?
 If confirmed, I am wondering how to get rid of it...

ffs_fsync() is (or should be) rarely called except as a result of
applications calling fsync(2) or sync(2).  It is not normally called
by the syncer daemon or buffer daemon (seems to be not at all in 4.4,
though -current calls it from vfs-bio.c when there are too many dirty
buffers, and benchmarks like postmark might trigger this).  In 4.4
the only relevant VOP_FSYNC() seems to be the one in vinvalbuf().
Using lots of vnodes might cause this to be called a lot, but this
should only cause a lot of i/o in ffs_fsync() if a lot is really needed.
Dirty buffers for vnodes which will soon be deleted are supposed to be
discarded in ffs_fsync().  Benchmarks that do lots of i/o to short-lived
files tend to cause too much physical i/o, but this is because the i/o
is done by the buffer (?) deamon before ffs_fsync() can discard it.

Bruce
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: seeking help to rewrite the msdos filesystem

2002-11-13 Thread Bruce Evans

On Tue, 12 Nov 2002, Tomas Pluskal wrote:

 I believe that everybody here knows about the slow msdosfs problem, that
 is AFAIK caused by implementation without clustering.

Which problem.  msdosfs has a number of small problems.  Mostly they don't
matter.

 For me this is very annoying, because I use digital camera, and ZIP drive,
 and FAT on both of them. Speed is about 10 times lower than it could be..

ZIP drives have much larger speed problems thn msdosfs.  msdosfs happens
to be a good way to get the worst out of them.  They have a minumum
i/o overhead of 20 msec (at least for all the 100MB ones that I tried),
so if you use msdosfs's minimum block size of 512 then their maximum
speed is 25K/sec which is about 40 times slower than it could be.  The
default block size of 2K gives a speed which is about 10 times slower
than it could be.  The ffs default block size of 16K gives a speed which
is only about 1.25 times slower than it could be.  E.g.:

%%%
Script started on Wed Nov 13 22:13:53 2002
ttyv1:root@gamplex:/tmp newfs /dev/afd0
/dev/afd0: 96.0MB (196608 sectors) block size 16384, fragment size 2048
using 4 cylinder groups of 24.02MB, 1537 blks, 3200 inodes.
super-block backups (for fsck -b #) at:
 32, 49216, 98400, 147584
newfs: ioctl (DIOCWDINFO): /dev/afd0: can't rewrite disk label: Operation not 
supported by device
ttyv1:root@gamplex:/tmp mount /dev/afd0 /mnt
ttyv1:root@gamplex:/tmp dd if=/dev/zero of=/mnt/zz bs=1m count=20
time umount /mnt
20+0 records in
20+0 records out
20971520 bytes transferred in 18.827154 secs (1113898 bytes/sec)
ttyv1:root@gamplex:/tmp time umount /mnt
0.29 real 0.00 user 0.02 sys
ttyv1:root@gamplex:/tmp newfs_msdos -b 16384 /dev/afd0
/dev/afd0: 196512 sectors in 6141 FAT16 clusters (16384 bytes/cluster)
bps=512 spc=32 res=1 nft=2 rde=512 mid=0xf0 spf=24 spt=32 hds=64 hid=0 bsec=196608
ttyv1:root@gamplex:/tmp mount -t msdosfs /dev/afd0 /mnt
ttyv1:root@gamplex:/tmp dd if=/dev/zero of=/mnt/zz bs=1m count=20
time umount /mnt
20+0 records in
20+0 records out
20971520 bytes transferred in 27.729786 secs (756281 bytes/sec)
ttyv1:root@gamplex:/tmp time umount /mnt
5.57 real 0.00 user 0.03 sys
ttyv1:root@gamplex:/tmp exit

Script done on Wed Nov 13 22:16:06 2002
%%%

The above could be calculations are based on a speed of 1000K/sec.  My
test drive can't quite reach this using raw reads with a block size of 64K,
but ffs clusters the data so well that it exceeds this speed for writes.
msdosfs with a block size of 16K achieves about 63% of this speed (not
the 87.5% suggested by the naive calculations).

My times are with some small improvements which I think don't affect
the tests much (they affect latency more than throughput).  With lots
of small files (smaller than the block size), clustering doesn't makes
even less difference; however, msdosfs doesn't support soft updates
or async mounts so it it is about as slow as plain ffs (in my test of
writing 1000 files of size 512, msdosfs is actually only 5 times slower
than ffs with soft updates or async; plain ffs is about 7.5 times slower).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-12 Thread Bruce Evans

On Fri, 11 Oct 2002, Terry Lambert wrote:

 Bruce Evans wrote:
   I know it's not fashionable to write code that's portable to
   compilers other than GCC, but even if FreeBSD is going to ignore
   portability for it's own source code, it's probably unreasonable
   to expect ACE to ignore portability for theirs.
 
  Undefined symbols being 0 in cpp in expressions was in early C compilers
  if not the original one.  Consult your archives for a posting 10-15 years
  or so ago by dmr in comp.std.c about him checking this in his archives.

 You mean in preprocessor expressions, of course.

Oop.  I meant only one in in being 0 in cpp in [sic] expressions.

   This can't be the case; specifically, the sysconf() test will
   only work at runtime, which means that the symbols had to be
   there and resolvable at link time.
 
  Symbols can be resolved at runtime using dlopen(), etc.  They would only
  actually be available on systems where sysconf() says that they are.

 You can't depend on people using the standard C library via dlopen,
 as opposed to, uh, linking.

Yes, a more careful reading of POSIX shows that if _POSIX_REALTIME_SIGNAL_
is undefined, then the (compile-time) system makes no claims as to the
feature being either supported or unsupported.  (Systems should leave it
undefined it they are clueless about it.)  The feature may still work at
runtime, but the compile-time system makes no claims that it will or
won't.  Applications would have to do something magic to use it at
runtime if it turns out to be present at runtime.  It could easily be
present at runtime because you update the OS.

   Uh, the 1990 standard, which allowed #if is only 12 years old.
 
  #if is in KR1 (1978).

 According to my first edition The C Programming Language:

   12.3 Conditional compilation
   A compiler control line of the form

   #if constant-expression

   checks whether the constant expression (see section 15)
   evaluates to non-zero.

   ...

   15 Constant expressions
   In several places C requires expressions which evaluate
   to a constant: after case, as array bounds, and in initializers.
   In the first two cases, the expression can involve only integer
   constants, character constants, and sizeof expressions, possibly
   connected by the binary operators

   + - * / %  | ^   == !=   = =

   or by the unary operators

   - ~

   or by the ternary operator

   ?:

 ...no  or ||... sorry.  Also, defined() is not defined.

 I think you will find that  and || in #if statements are
 later extensions.

 The value of undefined macros in preprocessor expressions being
 assumed to be zero is also not documented in the book.  I would
 have to imagine that they were not considered constant expressions.

This book has lots of bugs (mostly from being too informal and/or
omitting necessary details).  Note that section 15 doesn't even mention
 and || working in non-cpp constant expressions.  But they
cetainly worked in expressions, and expressions with only constants
in them are (informally) constant expessions.  I think you are right
that defined() didn't exist then.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-11 Thread Bruce Evans


On Thu, 10 Oct 2002, Terry Lambert wrote:

 Bruce Evans wrote:
  In Standard C, this is equivalent to the non-verbose version:
 
  #if _POSIX_REALTIME_SIGNALS != -1
  ...
  #endif
 
  since if _POSIX_REALTIME_SIGNALS is not defined then it is equivalent to
  0 in cpp expressions.  The problem cases are if _POSIX_REALTIME_SIGNALS
  is defined to empty or garbage, but these are not permitted in
  POSIX.1-2001.  These cases were permitted for many feature test macros
  in POSIX.1-1990.

 I know it's not fashionable to write code that's portable to
 compilers other than GCC, but even if FreeBSD is going to ignore
 portability for it's own source code, it's probably unreasonable
 to expect ACE to ignore portability for theirs.

Undefined symbols being 0 in cpp in expressions was in early C compilers
if not the original one.  Consult your archives for a posting 10-15 years
or so ago by dmr in comp.std.c about him checking this in his archives.

   Sigh.  Why did the POSIX guys do this? :(
 
  Perhaps because they wanted you to use sysconf() instead of these mistakes.
  Actually, they didn't do this.  _POSIX_REALTIME_SIGNALS is specified to
  have value 0, -1 or 200xxxL (draft 7 says xxx; I think the final standard
  says 112).

 This can't be the case; specifically, the sysconf() test will
 only work at runtime, which means that the symbols had to be
 there and resolvable at link time.

Symbols can be resolved at runtime using dlopen(), etc.  They would only
actually be available on systems where sysconf() says that they are.

   BTW, I think that:
   #if defined(_POSIX_REALTIME_SIGNALS)  (_POSIX_REALTIME_SIGNALS != -1 )
  
   should suffice, but I'll double-check with one of my portability gurus
   to see if that is OK for ACE.
 
  This is variant of the above verbose version.  It requires slightly more
  modern compilers, so it might fail for some 20-year old pre-Standard C
  compilers instead of only for some 25-year old ones.

 Uh, the 1990 standard, which allowed #if is only 12 years old.

#if is in KR1 (1978).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-11 Thread Bruce Evans


On Thu, 10 Oct 2002, Terry Lambert wrote:

 Bruce Evans wrote:
  _POSIX_REALTIME_SIGNALS is undefined:
  Apparently the same as when it is defined to 0, except you cannot assume
  that anything related to it works until you call sysconf(), so you must
  not reference its interfaces statically, and must use a dll or something
  that references it.  The dll is presumably available on systems that
  support it but not (except possibly a dummy version) on systems that
  don't support it.
 
  I think the case where the symbol is undefined should never be implemented
  in practice.  It can be reduced to the case where the symbol is 0 using
  dynamic linkage with the complications for linkage not visible to the
  application.

 I think you will have to go back in time, for this to happen.  As
 things stand today, there are systems where it's undefined that
 were implemented before the symbol was a twinkle in some feature-test
 weenie on the POSIX committee's eye.

_POSIX_REALTIME_SIGNALS is only valid in versions of POSIX that support
it.  Applications must also conditionalize on _POSIX_VERSION if they
want to check for features that are not in all versions.

Runtime configuration only lets us go forward in time.  An application
might use POSIX realtime features if they are available and magically
start working better (without recompiling anything in the application)
when the OS and/or library implementors get around to implementing the
features.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-11 Thread Bruce Evans

On Thu, 10 Oct 2002, Craig Rodrigues wrote:

 On Thu, Oct 10, 2002 at 09:31:56PM +1000, Bruce Evans wrote:
  Perhaps because they wanted you to use sysconf() instead of these mistakes.

 Well in the case of ACE, it is a C++ library that is compiled on
 platforms which may or may not have sysconf() (ie. Windows), so using sysconf() is
 not practical in this case.  Checking a feature macro is much easier.

How can new POSIX interfaces and new POSIX feature test macros work on
systems that don't have ancient POSIX interfaces like sysconf().  As
may have been clarified in other meesages in this thread, you need a
new version of POSIX even to interpret _POSIX_REALTIME_SIGNALS.  It is
in POSIX.1-1996 and perhaps in earlier versions of POSIX (not including
at least the 1990 one), so it is meaningless unless _POSIX_VERSION 
199mumble.

  I used a variant your patch for this in PR 35924 until recently when
  ...
 This patch works for me.  I think it is just as easy to just remove cruft from
 the header file entirely, but since your patch effectively does the same
 thing and has informative comments, that is fine.

I slightly prefer to ifdef them, since an implementation is planned.

 If your patch (or some equivalent variant) is committed, then I think
 PR 35924 can be closed.  Something needs to be done about these prototypes.

OK, I will clean up the patch and commit it.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-10 Thread Bruce Evans


On Wed, 9 Oct 2002, Craig Rodrigues wrote:

 On Wed, Oct 09, 2002 at 07:29:48PM -0700, Terry Lambert wrote:
  To be totally correct, you will need to:
 
  #ifdef _POSIX_REALTIME_SIGNALS
  #if (_POSIX_REALTIME_SIGNALS != -1)
 
  ...
 
  #endif
  #endif
 
  It's annoying, but doing this will ensure that there are no
  gaps through which some system other than FreeBSD might fall.

In Standard C, this is equivalent to the non-verbose version:

#if _POSIX_REALTIME_SIGNALS != -1
...
#endif

since if _POSIX_REALTIME_SIGNALS is not defined then it is equivalent to
0 in cpp expressions.  The problem cases are if _POSIX_REALTIME_SIGNALS
is defined to empty or garbage, but these are not permitted in
POSIX.1-2001.  These cases were permitted for many feature test macros
in POSIX.1-1990.

 Sigh.  Why did the POSIX guys do this? :(

Perhaps because they wanted you to use sysconf() instead of these mistakes.
Actually, they didn't do this.  _POSIX_REALTIME_SIGNALS is specified to
have value 0, -1 or 200xxxL (draft 7 says xxx; I think the final standard
says 112).

 BTW, I think that:
 #if defined(_POSIX_REALTIME_SIGNALS)  (_POSIX_REALTIME_SIGNALS != -1 )

 should suffice, but I'll double-check with one of my portability gurus
 to see if that is OK for ACE.

This is variant of the above verbose version.  It requires slightly more
modern compilers, so it might fail for some 20-year old pre-Standard C
compilers instead of only for some 25-year old ones.

 I have another request.  Even though my preprocessor check was bogus,
 ACE still compiled, and I did not discover that there were any problems
 until link time, ie. I had a libACE.so library which could not
 link with anything because of unresolved symbols.  This was very annoying.
 It would have been nicer if this could have been caught earlier in the
 compile stage.

 Since sigqueue(), sigwait(), sigwaitinfo() are not implemented in FreeBSD,
 is this patch OK?

 --- src/include/signal.h.orig Wed Oct  9 23:15:21 2002
 +++ src/include/signal.h  Wed Oct  9 23:15:31 2002
 @@ -76,13 +76,6 @@
  int sigwait(const sigset_t *, int *);
  #endif

 -#if __BSD_VISIBLE || __POSIX_VISIBLE = 199506 || __XSI_VISIBLE = 600
 -int sigqueue(__pid_t, int, const union sigval);
 -int sigtimedwait(const sigset_t * __restrict, siginfo_t * __restrict,
 -const struct timespec * __restrict);
 -int sigwaitinfo(const sigset_t * __restrict, siginfo_t * __restrict);
 -#endif
 -
  #if __BSD_VISIBLE || __POSIX_VISIBLE = 200112 || __XSI_VISIBLE
  int killpg(__pid_t, int);
  int sigaltstack(const stack_t * __restrict, stack_t * __restrict);

 At some point in the future when POSIX RT signals are implemented
 in FreeBSD (I'm not volunteering :), then
 _POSIX_REALTIME_SIGNALS can be defined to 200112L in unistd.h, and
 these three prototypes can be put back into signal.h.

 Is this OK?

I used a variant your patch for this in PR 35924 until recently when
I noticed that it usually worked for the bogus reason that unistd.h
is not included, then _POSIX_REALTIME_SIGNALS is only defined accidentally
(to whatever value).  I now use just #if 0 and an XXX comment as a
reminder to fix this someday:

%%%
Index: signal.h
===
RCS file: /home/ncvs/src/include/signal.h,v
retrieving revision 1.19
diff -u -2 -r1.19 signal.h
--- signal.h6 Oct 2002 21:54:08 -   1.19
+++ signal.h7 Oct 2002 07:06:19 -
@@ -78,9 +79,18 @@

 #if __BSD_VISIBLE || __POSIX_VISIBLE = 199506 || __XSI_VISIBLE = 600
+#if 0
+/*
+ * PR: 35924
+ * XXX we don't actually have these.  We set _POSIX_REALTIME_SIGNALS to
+ * -1 to show that we don't have them, but this symbol is not necessarily
+ * in scope (in the current implementation), so we can't use it here.
+ */
 intsigqueue(__pid_t, int, const union sigval);
+struct timespec;
 intsigtimedwait(const sigset_t * __restrict, siginfo_t * __restrict,
const struct timespec * __restrict);
 intsigwaitinfo(const sigset_t * __restrict, siginfo_t * __restrict);
 #endif
+#endif

 #if __BSD_VISIBLE || __POSIX_VISIBLE = 200112 || __XSI_VISIBLE
%%%

The patch also moves the forward declaration of struct timespec near to
the one place that it is used (related patches not included).  This
mainly makes it obvious that the messy visibility for it is the same
as the one for the function that uses it.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Problem detecting POSIX symbolic constants

2002-10-10 Thread Bruce Evans

On Wed, 9 Oct 2002, Craig Rodrigues wrote:

Earlier this year on the FreeBSD hackers mailing list:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=278142+0+/usr/local/www/db/text/2002/freebsd-hackers/20020317.freebsd-hackers

I was advised by Terry Lambert to use:
#ifdef _POSIX_REALTIME_SIGNALS

to detect if sigqueue(), sigtimedwait, sigwaitinfo() were defined.

I made this change to the FreeBSD configuration for ACE, a C++
library used for systems programming ( http://www.cs.wustl.edu/~schmidt/ACE.html ).

The change I made worked fine in -STABLE.
However, in -CURRENT, this test breaks, because _POSIX_REALTIME_SIGNALS
is defined, but it is -1.

According to the letter of the law:
http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap02.html

The following symbolic constants shall either be undefined or defined
with a value other than -1.

This law only covers 2 symbols (_POSIX_CHOWN_RESTRICTED and _POSIX_NO_TRUNC).
_POSIX_REALTIME_SIGNALS is covered by more general and complicated laws.
It may be undefined or defined to 0, -1 or 2001mumbleL.

So, this is legal way to implement these macros. It just breaks my code. :)

Can I appeal to the freebsd-standards team to leave these macros undefined
instead of defining them to -1? #ifdef/#ifndef is a pretty common way
to detect if a feature is available on a system, especially when used
in conjunction with something like autoconf.

No. If they were undefined (and _POSIX_VERSION is large enough), then
their undefinedness means that applications must use sysconf() to
determine if they work. Other cases are simpler:

_POSIX_REALTIME_SIGNALS is defined to -1:
This means that the interface is not supported. Applications shouldn't
use it at either compile time, link time or runtime, since it might not
exist in headers or libraries.

_POSIX_REALTIME_SIGNALS is defined to 0:
This means that the interface may work, and that it exists in headers and
libraries, so applications may reference it in normal ways. It may fail
at runtime; applications must use sysconf() to determine if it is actually

_POSIX_REALTIME_SIGNALS is defined to 2001mumbleL:
This means that the interface just works. sysconf() may be used to check
that it works, but sysconf() must not fail.

_POSIX_REALTIME_SIGNALS is undefined:
Apparently the same as when it is defined to 0, except you cannot assume
that anything related to it works until you call sysconf(), so you must
not reference its interfaces statically, and must use a dll or something
that references it. The dll is presumably available on systems that
support it but not (except possibly a dummy version) on systems that
don't support it.

I think the case where the symbol is undefined should never be implemented
in practice. It can be reduced to the case where the symbol is 0 using
dynamic linkage with the complications for linkage not visible to the
application.

Bruce

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: [ GEOM tests ] vinum drives lost

2002-10-05 Thread Bruce Evans


On Fri, 4 Oct 2002, Poul-Henning Kamp wrote:

 Worst case you will have the option to use:

   options NOGEOM
   options vinum

A NOGEOM option would be as acceptable as a NOFFS option for turning off
forcing of the one true file system down everyone's throats.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: [ GEOM tests ] vinum drives lost

2002-10-05 Thread Bruce Evans


On Sat, 5 Oct 2002, Peter Wemm wrote:

 Bruce Evans wrote:
  On Fri, 4 Oct 2002, Poul-Henning Kamp wrote:
 
   Worst case you will have the option to use:
  
 options NOGEOM
 options vinum
 
  A NOGEOM option would be as acceptable as a NOFFS option for turning off
  forcing of the one true file system down everyone's throats.

 Part of the problem there is a weakness in config that I've threatened
 to fix on more than one occasion.  We do not have a way to have options
 default to on and let people turn the option off.  Negative options
 (options NOFOO) are a poor substitute.  In the past, a couple of things
 were unifdefed that might have been better served as being 'default to on'
 options or drivers.

Hmm.  Negative options implemented as negoptions FOO would work OK for
this.  Options could be defaulted to on by putting them in an included
config file, and then turned off using negoptions.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: vmware reads disk on non-sector boundary

2002-10-04 Thread Bruce Evans


On Thu, 3 Oct 2002, Poul-Henning Kamp wrote:

 In message [EMAIL PROTECTED], Bakul Shah writes:
 How hard would it be to bring back block devices without GEOM?

 Not at all hard, pretty trivial in fact.

The easiest way is to restore the old code and use a minor number hack or
ioctl to enable it.  I had better do it instead of complaining about it.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: vmware reads disk on non-sector boundary

2002-10-03 Thread Bruce Evans


On Thu, 3 Oct 2002, Mark Santcroos wrote:

 I have an almost-ready patch that implements linux_read() syscall. This
 will check if we are reading from a raw disk and in that case it will
 enlarge the read() to the next sector boundary. I have it working in the
 kernel but I have problems returning the right read buffer to userland.

Unbreaking block devices would be a better solution.  Without buffering,
reads of raw disks using an unbuffered linux_read() might be sector size
times slower than they should be.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: vmware reads disk on non-sector boundary

2002-10-03 Thread Bruce Evans


On Thu, 3 Oct 2002, Poul-Henning Kamp wrote:

 In message [EMAIL PROTECTED], Mark Santcroos writes:
 On Thu, Oct 03, 2002 at 09:50:45PM +1000, Bruce Evans wrote:
  Unbreaking block devices would be a better solution.  Without buffering,
 ...
 What was the reason for the removal of block devices anyway?
 It would be nice if you would tell me some background about that.. :)

 It's well documented in the mail-archives actually.

 The short story:

Shorter story: phk didn't like them.

 1. We don't in general assign a special vnode type to device modes,
 instead we assign multiple device nodes, SCSI tapes is an
 example of this.

 2. The vnode layer already have enough trouble aliasing /dev/fd0,
 /mnt/dev/fd0, /usr/jail/dev/fd0, /cdrom/dev/fd0 (you get the idea),
 we do not need to make it even harder by also aliasing /dev/fd0 and
 /dev/rfd0.

Aliases that differ in type are slightly easier to handle than aliases
that differ by name or mount point.  They are diferent devices so they
have different vnodes, and the aliasing problems for them are quite
different than the vnode aliasing problems caused by having the same
device under different mount points.  E.g., slices and partitions allow
configuring 31*7 aliases per drive for hard drives (31 slices with 7
configuring partitions each).  Slices and partitions may overlap,
giving something more complicated than aliases.  The vnode layer doesn't
understand any of this.

 3. Write ordering on buffered devices were unspecified.  In other
 words, you cannot use it for anything which even remotely smells
 of transactions, because you have no way to know when your writes
 have hit the disk and in which order they did so.

This is no different than for regular files.

 4. No write errors were reported back to userland.

Actually, write errors were reported at fsync() and close() time in
the same way as for regular files.  fsync()'s handling of write errors
was broken for both regular files and buffered devices at the time
buffered devices were axed.

 (Given 3 and 4, it follows that use of block devices for any sort
 of data you happen to like is a very bad idea.)

It follos similarly that use of fil systems is a bad idea :-).  You
You should only use a databases on raw disks if you value your data.

 5. Block devices was in the way of getting DEVFS working in an
 architecturally sane manner.

This can be considered a feature.

 So they were removed, and good riddance.

 If a buffered access-mode on block devices is desired, it should
 be implemented either as an ioctl controllable feature, or as
 a GEOM module.  The latter is probably by far the easiest way.

It was desired, and was sort of promised.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: bin/20633: fdisk doesn't handle LBA correctly

2002-08-17 Thread Bruce Evans


On Sat, 17 Aug 2002, Artem 'Zazoobr' Ignatjev wrote:

 On Sun, Aug 18, 2002 at 02:12:45AM +1000, Bruce Evans wrote:
  [skip]
 
  Fdisk should print these values, at least optionally, since they are needed
  for debugging.  The magic values might be non-magic on old systems.
  Debugging WHAT? And, I can hardly imagine such a situation inside hard
  drives  slice tables.
 
  Debugging hard disk tables and slice tables.  I do it routinely.  It
  can be important to look at the raw data, but the dp_scyl...dp_ehd
  data is a little too raw.
 Ok, so we must add option to fdisk(8) to print CHS?

If we displayed suppressed the existing display for cases where we know
it is bogus, then there should be an option to display it for these cases.

 And, BTW, does CHS makes sense inside an extended slices?

I think so, since boot loaders might use at least the starting C/H/S if
there are any boot loaders that support booting from logical drives in
extended partitions (I think lilo can).  Using them might even work for
cylinders below 1024.

  ...
  Actually, 1023 is most magic (it is what the kernel expects), but the
  data in the PR shows that 1022 is magic too.  The magic 0xfe in the
  kernel is for the ending head number.  It can be important not to use
  a starting or ending head number of 255 (== a head count of 256) because
  some broken BIOSes crash on it, and there are now conventions that
  prohibit using it.

 Maybe we should take as magic cyls 1021?

Don't assume much when displaying magic values, but only write values that
satisfy current conventions.

 And where can one read all
 this conventions?

google :-).

 [skip]
  The data for partition 13 is:
  sysid 165 (FreeBSD/NetBSD/386BSD),
  start 80051958, size 8401932 (4102 Meg), flag 0
 beg: cyl 1023/ head 254/ sector 63;
 end: cyl 1023/ head 254/ sector 63
  Script done on Fri Aug 16 22:23:06 2002
 
  Seems reasonable.  It shows all of the magic beg and end values, because
  the most magic one (all 0xff's) is so magic that it is never used :-).

 and how  translates that value to CHS? Guess 1023/255/63?

Something like beg: [invalid] for 1023/255/63, and beg: [not used]
for 1023/254/63.

 I suppose, that it will be better to follow the linux' fdisk..

 Could you please look at http://memphis.mephi.ru/~timon/fdisk/ (and 
http://memphis.mephi.ru/~timon at all ;-) since you'll need a patch from there  or 
manually define DOSPTYP_EXTX if you'll try to compile it )?

Not sure if I have time to look at the details.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Increasing size of if_flags field in the ifnet structure [patchfor review]

2002-08-16 Thread Bruce Evans


On Thu, 15 Aug 2002, Maxim Sobolev wrote:

 When implementing ability to switch interface into promisc mode using
 ifconfig(8) I've stumbled into the problem with already exhausted
 space in the `short if_flags' field in the ifnet structure. I need to
 allocate one new flag, while we already have 16 IFF_* flags, and even
 one additional flag which is implemented using currently free
 if_ipending field of the said structure. Attached patch is aimed at
 increasing size of if_flags to 32 bits, as well as to clean-up
 if_ipending abuse. Granted, it will break backward ABI compatibility,
 but IMO it is not a big problem.

Why isn't it a bug problem?  It affects an application ABI (most socket
ioctls).  We have whole syscalls whose purpose is to avoid breaking
application ABIs back to about 4.3BSD.  Some of them may even work.

 Index: src/share/man/man4/netintro.4
 ===
 RCS file: /home/ncvs/src/share/man/man4/netintro.4,v
 retrieving revision 1.20
 diff -d -u -r1.20 netintro.4
 --- src/share/man/man4/netintro.4 18 Mar 2002 12:39:32 -  1.20
 +++ src/share/man/man4/netintro.4 15 Aug 2002 18:33:42 -
 @@ -197,7 +197,7 @@
  structsockaddr ifru_addr;
  structsockaddr ifru_dstaddr;
  structsockaddr ifru_broadaddr;
 -short ifru_flags;
 +int   ifru_flags;
  int   ifru_metric;
  int   ifru_mtu;
  int   ifru_phys;

This particular ABI seems to have been broken before (in if.h 1.50 on
1999/02/09), since the actual struct has short ifru_flags[2]; followed
by short if_index; instead of short ifru_flags;, and it has 2 new
struct members at the end too.  If the struct were actually as above,
then changing the short to an int would almost be binary compatible
since it would just expand ifru_flags to use the 2 bytes of unnamed
padding caused by the poor layout, so the struct wouldn't expand and
the other members wouldn't move.  Enlarging ifru_flags itself might
only break big-endian machines (little-endian ones wouldn't notice
providing the padding is zeroed).

 Index: src/share/man/man9/ifnet.9

Breaking kernel ABIs isn't so important.  They should only be compatible
within major releases.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: gcc -O broken in CURRENT

2002-03-14 Thread Bruce Evans


On Thu, 14 Mar 2002, ozan s. yigit wrote:

  Add the -ffloat-store flag to your compilation flags (or
  add -msoft-float).

 that really means for this compiler on certain platforms, you
 can have slow and correct or fast and incorrect, but NOT fast
 and correct.

I think fast and correct is impossible on i386's.  Correct
requires assignments and casts to discard any extra precision,
and the fastest way to implement this is probably to store to
memory and reload.  The -ffloat-store kludge only does a subset
of the necessary conversions.  Doing them all would be slower
and correct, which is why gcc doesn't do them.  C90 can be read
as permitting this incorrectness, but C99 doesn't permit it.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: sys/types.h or not sys/types.h? [Was: cvs commit: src/includegrp.h]

2002-02-25 Thread Bruce Evans


On Mon, 25 Feb 2002, David Malone wrote:

 I note that in the footnotes for getgrgid, in the section for issue
 6 of the standard:

   The requirement to include sys/types.h has been removed.
   Although sys/types.h was required for conforming
   implementations of previous POSIX specifications, it was
   not required for UNIX applications.

 Curiously, this seems to say the opposit of what you actually see
 in SUSv2, as it lists sys/types.h as a prerequisit.

This is probably because SUSv2 is almost as old as POSIX.1-1996 (which
has the prerequisite).

BTW, unistd.h has the same prerequisite, but our version of unistd.h
is polluted with an include of sys.types.h.  Nevertheless, we fixed
several man pages to document the old-POSIX requirements.  E.g.:

RCS file: /home/ncvs/src/lib/libc/sys/fork.2,v
Working file: fork.2
head: 1.14
...

revision 1.7
date: 1998/01/11 16:51:49;  author: alex;  state: Exp;  lines: +4 -1
branches:  1.7.2;
Add sys/types.h to synopsis.
Correct a grammatical error.
Add cross-reference to setrlimit(2).

Obtained from:  OpenBSD


ISTR that most of the fixes came from OpenBSD.  I haven't attempted to
fix unistd.h since its pollution is old.  I only police new pollution
and must have been asleep when we broke several man pages to undocument
the old-POSIX requirements.  E.g.:

RCS file: /home/ncvs/src/lib/libc/sys/setsid.2,v
Working file: setsid.2
head: 1.12
...

revision 1.6
date: 1997/04/08 10:43:47;  author: peter;  state: Exp;  lines: +1 -1
setsid is declared in unistd.h, which is self sufficient (doesn't need
prior sys/types.h)

Fixes PR#3229, from Dmitrij Tejblum [EMAIL PROTECTED]


Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: stack alignment issues

2002-02-05 Thread Bruce Evans


On Tue, 5 Feb 2002, Nik Clayton wrote:

 On Tue, Feb 05, 2002 at 05:01:29PM +1100, Bruce Evans wrote:
  My patch is not suitable for committing verbatim.  It has 2 or 3 XXX's.

 Do you make these patches available anywhere, so that other people can
 look over them and maybe help you on the XXX'd sections?

Erm, I'm replying to mail that gave a URL for my old mail with the patch.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: stack alignment issues

2002-02-05 Thread Bruce Evans


On Tue, 5 Feb 2002, Mike Silbersack wrote:

 On Tue, 5 Feb 2002, Bruce Evans wrote:
  foo:
  pushl %ebp
  movl %esp,%ebp
  subl $8,%esp# - extra instruction for alignment (for foo)
  addl $-12,%esp  # - extra instruction for alignment (for f1)

 What disgusting code.  I find it amazing that they didn't even stick in
 some peephole optimizer to at least limit it to one operation.

It's clearly the result of work in progress :-).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: stack alignment issues

2002-02-04 Thread Bruce Evans


On Mon, 4 Feb 2002, Michal Mertl wrote:

 Did you look at the patch by Bruce at
 http://groups.yahoo.com/group/freebsd-current/message/39605 ?

 Bruce, is it still fresh in your memory? Can you comment on the patch -
 can it be commited in some form?

I haven't done anything to clean up the patch.  I hope the problem
will go away in future versions of gcc (align the stack at runtime in
the few routines that actually need it).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: stack alignment issues

2002-02-04 Thread Bruce Evans


On Mon, 4 Feb 2002, Mike Silbersack wrote:

 On Tue, 5 Feb 2002, Bruce Evans wrote:
  I haven't done anything to clean up the patch.  I hope the problem
  will go away in future versions of gcc (align the stack at runtime in
  the few routines that actually need it).

 Well, if Linux aligns the initial stack, the chance that gcc will have
 auto-alignment added sounds to be about zero.  You might as well go ahead
 with your patch when you get a chance.

There is a nonzero probability that the pessimization of aligning in almost
every routine will be fixed someday.  Actually, the pessimization is worse
-- the alignment is done before every call.

Example:

%%%
foo.c:
foo()
{
f1(1);
f2(2);
f3(3.0);
}
%%%

gcc -O -S [-mpreferred-stack boundary] currently generates the following
code for this:

%%%
.file   z.c
.version01.01
gcc2_compiled.:
.section.rodata
.p2align 3
.LC0:
.long 0x0,0x4008
.text
.p2align 2,0x90
.globl foo
.typefoo,@function
foo:
pushl %ebp
movl %esp,%ebp
subl $8,%esp# - extra instruction for alignment (for foo)
addl $-12,%esp  # - extra instruction for alignment (for f1)
pushl $1
call f1
addl $-12,%esp  # - extra instruction for alignment (for f2)
pushl $2
call f2
addl $32,%esp   # - extra instruction for alignment (for f3)
addl $-8,%esp   # - extra instruction for alignment (another)
pushl .LC0+4
pushl .LC0
call f3
leave
ret
.Lfe1:
.sizefoo,.Lfe1-foo
.ident  GCC: (c) 2.95.3 20010315 (release)
%%%

It should generate something like:

.file   z.c
.version01.01
gcc2_compiled.:
.section.rodata
.p2align 3
.LC0:
.long 0x0,0x4008
.text
.p2align 2,0x90
.globl foo
.typefoo,@function
foo:
pushl %ebp
movl %esp,%ebp
andl $~0x7,%esp # - extra instruction for alignment (for foo)
# Only needed since foo() uses FPU.
# 8-byte alignment enough for doubles?
# Adjust in prologue so that there are
# hopefully no alloca()-like issues, except
# we need a frame pointer to restore %esp.
pushl $1
call f1
pushl $2
call f2
pushl .LC0+4
pushl .LC0
call f3
leave
ret
.Lfe1:
.sizefoo,.Lfe1-foo
.ident  GCC: (c) 2.95.3 20010315 (release)
%%%

My patch is not suitable for committing verbatim.  It has 2 or 3 XXX's.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: cvs commit: src/lib/libatm atm_addr.c cache_key.c ioctl_subr.c ip_addr.c ip_checksum.c timer.c

2001-09-15 Thread Bruce Evans


On Sat, 15 Sep 2001, Matt Dillon wrote:

 I'm redirecting this to freebsd-hackers.

 Ok, I've comitted a new set of changes to libatm.  Please check them out.
 When we get a format that enough people are happy with we can start
 converting the other libraries.  I'm not particularly interested in
 fixing old old problems, I just want to get the __FBSDID() stuff in shape.

I prefer the following (for one of the changed files).  Especially the
empty line after the copyright message:

%%%
Index: atm_addr.c
===
RCS file: /home/ncvs/src/lib/libatm/atm_addr.c,v
retrieving revision 1.6
diff -u -2 -r1.6 atm_addr.c
--- atm_addr.c  2001/09/15 19:36:55 1.6
+++ atm_addr.c  2001/09/15 20:25:39
@@ -24,10 +24,12 @@
  * notice must be reproduced on all copies.
  */
-#include sys/cdefs.h
+
+#ifdef VENDOR_ID
 #ifndef lint
-#if 0  /* original (broken) import id */
 static char *RCSid = @(#) $Id: atm_addr.c,v 1.1 1998/07/09 21:45:18 johnc Exp $;
 #endif
 #endif
+
+#include sys/cdefs.h
 __FBSDID($FreeBSD: src/lib/libatm/atm_addr.c,v 1.6 2001/09/15 19:36:55 dillon Exp 
$);

%%%

(VENDOR_ID is intended to be left undefined.)

This still changes the vendor id excessively relative to rev.1.1.  The
vendor put it after the comment after the copyright comment, and restoring
it in rev.1.5 moved it to a different place.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Junior Kernel Hacker task: improve vnode-v_tag

2001-09-08 Thread Bruce Evans


On Sat, 8 Sep 2001, Chris Costello wrote:

 On Saturday, September 08, 2001, Poul-Henning Kamp wrote:
  No actually not, I want something short and predictable like
  VT_CODA.

How about my second suggestion: making v_tag point to
 mp-mnt_stat.f_fstypename, or a copy thereof?

Good, but I as far as I understand this, the only legitimate point of
v_tag is to tell applications like fstat(1) the type of the vnode.
fstat can find chase the kernel pointers in
vp-v_mount-mnt_stat-f_fstypename almost as (un)easily as it could
chase vp-v_tag if v_tag is a pointer.  Copying the string would give
ugly bloat, but would not be all that much worse than the bloat for a
pointer on alphas (the contents of the string VT_CODA takes the same
space as a pointer on alphas).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Junior Kernel Hacker task: improve vnode-v_tag

2001-09-04 Thread Bruce Evans


On Tue, 4 Sep 2001, Maxim Sobolev wrote:

  [neither Maxim Sobolev nor Brent Verner wrote]
  In message [EMAIL PROTECTED], Brent Verner writes:
  #include newbie_kernel_hacker_warning.h
  
  I've done a /cursory/ look over how this v_tag is used.  I'm not sure
  this is a simple/clean as you propose, since this is used in the
  IS_LOCKING_VFS macro, as well as in union_subr.c...
 
  Well, that is just too bad, because IS_LOCKING_VFS is wrong then.
 
  The places which inspect v_tag will have to be changed to use
  strcmp() then...

 I think that we can add a new vnode flag, say VCANLOCK, so that each
 particular VFS can set it if it supports locking, which should allow
 to remove pre-defined VFS list from the IS_LOCKING_VFS macro. I can
 produce a patch if it sounds reasonably.

Since IS_LOCKING_VFS() is a temporary kludge, the correct fix is to remove
it (after fixing whatever it was for), not to add more cruft.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset

2001-09-01 Thread Bruce Evans


On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote:

  I see (amost).
 
  Please format the macro the same as the other macros in the file.

 Index: apic_vector.s
 ===
 RCS file: /home/ncvs/src/sys/i386/isa/apic_vector.s,v
 retrieving revision 1.47.2.4
 diff -u -r1.47.2.4 apic_vector.s
 --- apic_vector.s 18 Jul 2000 21:12:41 -  1.47.2.4
 +++ apic_vector.s 31 Aug 2001 19:24:24 -
 @@ -653,7 +707,14 @@
   FAST_INTR(21,fastintr21)
   FAST_INTR(22,fastintr22)
   FAST_INTR(23,fastintr23)
 -#define  CLKINTR_PENDING movl $1,CNAME(clkintr_pending)
 +
 +#define  CLKINTR_PENDING \
 + pushl $clock_lock;  \
 + call s_lock;\
 + movl $1,CNAME(clkintr_pending); \
 + call s_unlock;  \
 + addl $4, %esp
 +
   INTR(0,intr0, CLKINTR_PENDING)
   INTR(1,intr1,)
   INTR(2,intr2,)


The other macros also have a space befor the semicolons :).  Otherwise OK.
Better commit this (especially to 4.4R) until we have something better.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset

2001-08-31 Thread Bruce Evans


On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote:

 I wrote:

  The problem here is that CPU#1 fails to hold clock_lock while setting
  clkintr_pending, causing i8254_offset to be stepped twice, first due
  to clkintr_pending, then due to i8254_lastcount being larger than count.

 With this patch applied to RELENG_4, the clock speedup seems to disappear.

 --- apic_vector.s.old Fri Mar  2 13:47:31 2001
 +++ apic_vector.s Fri Aug 31 01:07:53 2001
 @@ -707,7 +707,12 @@
   FAST_INTR(21,fastintr21)
   FAST_INTR(22,fastintr22)
   FAST_INTR(23,fastintr23)
 -#define  CLKINTR_PENDING movl $1,CNAME(clkintr_pending)
 +#define  CLKINTR_PENDING pushl $clock_lock;  \
 + call s_lock;\
 + movl $1,CNAME(clkintr_pending); \
 + call s_unlock;  \
 + addl $4, %esp
 +
   INTR(0,intr0, CLKINTR_PENDING)
   INTR(1,intr1,)
   INTR(2,intr2,)

I see (amost).

Please format the macro the same as the other macros in the file.

 The corresponding patch for -current is

This does nothing for -current, because -current uses a fast interrupt
handler for clkintr() and FAST_INTR() doesn't take a CLKINTR_PENDING
arg.  I think this is another bug.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset

2001-08-31 Thread Bruce Evans


On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote:

 For RELENG_4, which uses Xintr0, this problem is reduced by adding the
 clkintr_pending variable (cf. revision 1.134 of clock.c and revision
 1.38 of apic_vector.s)

 Looking more at the code, I now see that clkintr_pending is never set
 in -current due to Xintr0 not being used (Xfastintr0 is now used).
 Using a fast interrupt for the clock interrupt gives some of the same
 reduction of the race window size as the use of clkintr_pending.

I remember this a bit better after reading my log message for rev.1.134
of clock.c :-).

At the time, the clock interrupt handler was not fast.  However, it
could be delayed for a while by a fast interrupt handler (it still
can).  Since it is delayed, it can't set clkintr_pending and we have
to rely on the irr overflow test working for longer.  Rev.1.134 only
claims to work for the non-SMP case.  The irr test apparently never
worked for SMP.  Otherwise, it might be preferable to use the irr
overflow check instead of or as a backup to clkintr_pending, but it
can't be, since the irr bit is cleared when the irq is acked.  This
behaviour also gives races for SMP: the irr bit may become invalid
while you're looking it when another CPU acks the irq.

Making the clock interrupt handler fast doesn't significantly change
the problems as far as I can see.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset

2001-08-30 Thread Bruce Evans


On Thu, 30 Aug 2001, Martin Blapp wrote:

 Searching the freebsd mailinglists I have seen that you also suffering
 under this problem on 4.X. STABLE:

I now remember your old mail about this.  I (implicitly) suggested a fix,
but apparently no one tried it:

 On Thu, 21 Jun 2001, Martin Blapp wrote:

  Just gettimeofday() produces 8sec time drifting now. No need
  to use poll() in our little programm I sent previously.
 
  There is no time drifting if we used a 100% load programm with
  just poll().
 
  Very strange. Do you have some idea ?

 From clock.c in -current:

 | #ifdef APIC_IO
 | #define lapic_irr1  ((volatile u_int *)lapic)[0x210 / 4]   /* XXX XXX */
 | /* XXX this assumes that apic_8254_intr is  24. */
 | (lapic_irr1  (1  apic_8254_intr
 | #else
 | (inb(IO_ICU1)  1)))
 | #endif

 Maybe apic_8254_intr is not  24.  I think the second XXX comment has
 rotted in -current, but it still applies in RELENG_4.

I now think apic_8254_intr can't be = 24 (= 32 in -current), but
8254 is routed via 8259 and IOAPIC #0 intpin 0 in your probe messages
says that the relevant status bit is in lapic_irr0, not in lapic_irr1.
Try changing the 0x210 to 0x200.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: MIN()/MAX() definitions in sys/param.h

2001-05-14 Thread Bruce Evans


On Mon, 14 May 2001, Dima Dorfman wrote:

 Is there a reason the definitions of the MIN() and MAX() macros in
 sys/param.h are under an '#ifndef _KERNEL'?  Quite a few files in the

It is to inhibit use of these macros.  The {i,,l,lu,q}{max,min} inline
functions are supposed to be used instead.

Both of these interfaces are problematic.  MIN/MAX can not be implemented
to be safe macros in Standard C since they need to evaluate their
args more than once.  Their upper case names indicate that they are
unsafe.  The {i,,l,lu,q}{max,min} functions are not type-generic, so
they are difficult to use correctly for args whose type is typedef'ed.

 kernel define these (well, at least MIN) themselves, so it would seem
 to make sense to define them globally in sys/param.h for the kernel as
 well.  Any reason this isn't already done this way, or should I come
 up with a patch to fix that?

This is a bug in these files.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message

Re: Displaying options for current NFS mounts

2001-03-25 Thread Bruce Evans


On Sun, 25 Mar 2001, Peter Pentchev wrote:

  Only mount_foofs can reasonably know about the options for foofs.
  perhaps mount(8) could fork-exec mount_foofs(8) to print options for
  foofs.
 
 bikeshed type=question value="mostly stupid"
 Or could mount(8) invoke a couple of sysctl's to get a string representation
 of each mountpoint's mount options?
 /bikeshed

My bikeshed believes that string processing doesn't belong in the kernel :-).

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: Displaying options for current NFS mounts

2001-03-24 Thread Bruce Evans


On Sat, 24 Mar 2001, Dima Dorfman wrote:

 I tried to export this stuff in struct statfs, but ran into a problem:
 I'd need the complete definitions of fs_args in sys/mount.h, but I
 can't include, e.g., nfs/nfs.h because the latter includes the
 former (sys/mount.h)!

mount.h used to know too much about all sorts of filesystems, but this
was fixed in 4.4BSD.  It is impossible for mount.h or mount(8) to know
about all file systems, since filesystems can be dynamically loaded,
and ugly for it to know about more than 1 (or 0 -- ffs is too special).

 The patch below kind of implements this functionality.  I only export
 nfs_args (not otherfs_args), and I only modified mount(8) to print
 the NFS version, but printing the transport and others is simple from
 there.  To work around the above problem, I pasted the struct nfs_args
 definition into mount.h.  It is *horribly* ugly, but it does work.

Only mount_foofs can reasonably know about the options for foofs.
perhaps mount(8) could fork-exec mount_foofs(8) to print options for
foofs.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: What's changed recently with vmware/linuxemu/file I/O

2001-02-07 Thread Bruce Evans


On Wed, 7 Feb 2001, Josef Karthauser wrote:

 On Wed, Feb 07, 2001 at 08:56:14PM +, Josef Karthauser wrote:
  On Wed, Feb 07, 2001 at 08:26:15PM +0100, Dag-Erling Smorgrav wrote:
   Brian Somers [EMAIL PROTECTED] writes:
Indeed.  I've been doing a ``make build'' on an OpenBSD-current vm 
for three days (probably about 36 hours excluding suspends) on a 
366MHz laptop with a ATA33 disk.
   
   Would it be possible for someone experiencing this slowdown to try to
   narrow down the day (or even the week) on which it occurred?
  
  As I think about it it was definitity working before the symbol changes
  in libc/libc_r changed.  Was that last week?  No probably the week
  before.  It was working fine last week, but I'm not sure which day's I
  updated the kernel.
  
  I'll try some builds.
 
 Ok.  The problem definitely began between -D2001-01-29 and -D2001-01-30.
 I'll try and binary chop to workout what caused it.

If you have ata disks, try "options ATA_ENABLE_WC".  Nothing else has
changed significantly in this period.  I don't know how this would effect
vmware boot speeds.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: What's changed recently with vmware/linuxemu/file I/O

2001-02-06 Thread Bruce Evans


On Tue, 6 Feb 2001, Josef Karthauser wrote:

 I'm wondering what's changed recently to cause vmware2 running on
 the linuxemu to lose a lot of performance with disk I/O.

Use of cmpxchg and possibly other SMP pessimizations.

 A couple of weeks ago I could boot win2000 under vmware2 in a matter
 of minutes;  on today's kernel it takes 5 or 10 minutes to boot,
 and disk I/O is through the roof.
 
 Could someone please hit me with a clue-bat :)

Read your freebsd-emulation mail :-).

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: device naming convention

2000-09-20 Thread Bruce Evans


On Mon, 18 Sep 2000, Marc Tardif wrote:

  0cicuta/home/babolo(9)#dd of=/dev/wd0s2 if=/dev/zero bs=660b
  1cicuta/home/babolo(11)#od -b /dev/wd0s2
 [ snip ]
  Why I use 2.2.7 for test?
  Because of my lovely 4.1-STABLE is extremly unstable with content of
  ad0s2 (wd0s2) above and silently reboot after the first dd in the test above.
  
 Assuming my wd0s2 is still unused and of size 0, 3.5-STABLE also crashes in the test 
above (no disk activity, ctrl-c doesn't work, alt-f# doesn't work either). Perhaps it 
eventually reboots, but I wasn't patient enough to wait that long. One solution to 
this problem is to specify the count blocks after which dd returns properly but still 
no bytes are copied.

[Please use lines somewhat shorter than 360 characters.]

This is a completely diferent problem.  wd0s2 was buffered in 3.5, and
buffered devices are very broken in 3.1 and later versions of 3.x (write
errors are retried endlessly.  Among other bugs, writing beyond EOF hangs
the system when it allocates all buffers for writing unwritable data).

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: Fdescfs updates--coming to a devfs near you!

2000-09-18 Thread Bruce Evans


On Sun, 17 Sep 2000, Adrian Filipi-Martin wrote:

   I recently ran into revelant problem with /dev/stdout, while
 working on some software under linux that expected /dev/stdout as an
 argument instead of using stdout.
 
   Using the device file breaks, if the process is suid to a non-root
 user.  This is because it cannot open /dev/stdout, which is owned by your
 UID and not the EUID of the process to which the device was passed.  My
 solution was to add the "-" hack and use the existing open descriptor.

Um, open on fdesc devices doesn't check either uid.  It just checks
the access mode.

Perhaps the software expected /dev/stdout to for read-write like a
tty would be.  Then opening /dev/stdout would fail for normal shell
output redirection which only opens for writing.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: DDB is not setting break points...

2000-06-02 Thread Bruce Evans


On Thu, 1 Jun 2000, G.B.Naidu wrote:

 I am having problems with DDB while setting breakpoints in the kernel. I
 entered the DDB by giving kernel -d at boot prompt. After that I tried to
 set break point at ip_output() by giving "b ip_output". But it complains
 saying that "sumbol not found". I thought this might be due to stripped

Early setting of breakpoints by name was broken by the switch to elf in
FreeBSD-3.0 (symbols aren't available until the kernel module sysinit runs
much later).  I think it works for aout kernels in 3.x but not in 4.0 or
-current.  Use gdb or set breakpoints early by value in broken versions.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: QUEUE_VMIO...???

2000-06-01 Thread Bruce Evans


On Wed, 31 May 2000, Joy Ganguly wrote:

 what is the significance of QUEUE_VMIO buffer (struct buf) queue ?? as
 far as i could see they  are not used at allbut maybe i am wrong.

It signifies bitrot.  Its use was removed more than 5 years ago in
rev.1.35 of sys/kern/vfs_bio.c, but its definition was not removed
until less than a year ago in rev.1.75 of sys/sys/buf.h.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

RE: stupid FS questions

2000-05-31 Thread Bruce Evans


On Tue, 30 May 2000, Garrett Wollman wrote:

 On Tue, 30 May 2000 16:20:53 -0400, "Yevmenkin, Maksim N, CSCIO" 
[EMAIL PROTECTED] said:
 
  i know that :) i guess my questions were
  1) why the same piece of code duplicated in all ``mount_xxx'' utilities?
 
 Because the original loadable module system held strongly to the
 religion that the kernel should never load anything of its own
 accord.  The designers of the current loadable module system made
 different design choices, but the some traces of its predecessor still
 remain.

Including defunct traces like all the duplicated code in the mount utilities
:-).

Relevant history:

RCS file: /home/ncvs/src/lib/libc/gen/getvfsent.c,v
Working file: getvfsent.c
head: 1.14

revision 1.13
date: 1998/11/03 15:02:29;  author: peter;  state: Exp;  lines: +10 -1
A feeble attempt at kld compatability.  The mount_* programs assume that
they cannot mount a filesystem that they cannot see in getvfsbyname().
Part 1 of this is a hack, make vfsisloadable() always return true - the
ultimate decider of whether it's loadable or not is kldload() or mount().
Part 2 of this is to have vfsload() call kldload(2) and return success if
it works.  This means that we will use a viable kld module in preference
to an LKM!
Ultimately, the thing to do is remove the hacks to do a vfsload in all the
^^ should have been more than a year ago
mount_* commands and let the kernel do it by itself in mount(2).


RCS file: /home/ncvs/src/sys/kern/vfs_syscalls.c,v
Working file: vfs_syscalls.c
head: 1.153
...

revision 1.110
date: 1998/11/03 14:29:09;  author: peter;  state: Exp;  lines: +32 -3
make mount(2) automatically kldload modules if the requested filesystem
isn't present.


This commit made the duplicated code redundant except for backwards
compatibility.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

RE: BPF bug or not?

2000-01-26 Thread Bruce Evans


On Wed, 26 Jan 2000, Yevmenkin, Maksim N, CSCIO wrote:

   I've just found that read from /dev/bpfX never return 
  EAGAIN/EWOULDBLOCK.
   It means that when you do a non blocking read and there is 
  no data you will
   always get 0.

 [ untested fix removed :) ]
 
 Yes, it works. But it returns EAGAIN for both O_RDONLY|O_NONBLOCK and 
 O_RDWR|O_NONBLOCK open modes. In the same time pipe returns 0 for 
 O_RDONLY|O_NONBLOCK mode and EAGAIN for O_RDWR|O_NONBLOCK. 
 
 It there any specs for "read" system call? 

Well, POSIX is very complete for read() on regular files and pipes (both
ordinary pipes and fifos.  read() on a pipe with no data and writers
returns 0 because that case is considered to be EOF.  O_RDWR for fifos
gives undefined behaviour.  I don't know of any legitimate use for it.
It has the illegitimate use of talking to oneself using only one channel
:-).  This gives the EAGAIN behaviour for O_RDWR|O_NONBLOCK.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: kern/13644

2000-01-24 Thread Bruce Evans


On Sun, 23 Jan 2000, Warner Losh wrote:

 : This is what I asked for,  when I asked for "other specification". Could
 : you provide the chapter/verse number of where POSIX spec contradicts the
 : man pages? It will help me make my  case on the TCL forum, since the TCL
 : developers  remain  under the  mistaken  assumption,  that select()  may
 : return earlier, but never later than specified.
 
 That's trivially easy to show.

In theory, but not in practice :-).

 Given process X with a priority N + 1 that is doing
  N - 1 (higher priority is actually "lower")
   while (1) ;
 while process Y with a priority of N is doing the select.  The kernel
 won't prempt X until the time slice is done, which can be a long
 time.  If the select'd process is swapped out, then it could take a
 very very long time to swap back in.

Um, if the priorities are actually N vs. N - 1, then the process with
priority N won't run at the end of the timeslice.  It will only run
when its priority becomes "lower", possibly several timeslices later.

In practice, the priorities will never be N vs. N - 1.  The process
doing the select() sleeps at priority PSOCK = 24.  The process doing
the while loop should always have priority = PUSER = 50, but due to
a bug (?) in nice(2), the priority of a nice --20 process can be as
low as PUSER - 20 = 30.  Anyway, that is  PSOCK, so the process doing
the select() will preempt the user process and wake up as soon as it
times out.  Then, due to a longstanding scheduling bug (?), the process
doing the select() will return to userland without being rescheduled,
although its user priority may be much "higher" than that of any other
runnable process.  Processes that do i/o are thus preferred to cpu
hogs much more strongly than their priorities indicate.  This bug is
a feature in most cases.  It reduces context switches.  Interactive
process may get more benefit from it than from the classical scheduling
preference for interactive processes.

Bruce



To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: getopt.h

1999-10-20 Thread Bruce Evans


 In unistd.h we have definitions for getopt, optarg, optind, opterr, and
 optopt.

These are the standard POSIX (.2?) declarations.

 The things I propose to add to unistd.h are the following:

 struct option
 {
   char*name;
   int has_arg;
   int *flag;
   int val;
 };

 #define no_argument   0
 #define required_argument 1
 #define optional_argument 2

 extern int getopt_long (int, char *const [], const char *,
   const struct option *, int *);
 extern int getopt_long_only (int, char *const [], const char *,
const struct option *, int *);
 extern int _getopt_internal (int, char *const [], const char *,
const struct option *, int *, int );

 This should enable us to compile programs which require getopt.h by
 simply including unistd.h.

These are gnu extensions which we don't even support in libc, so they
shouldn't be declared anywhere in /usr/include.  I think ports that
use the gnu extensions handle this problem by duplication the gnu
sources.  Not ideal, but the getopt sources are small.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: kernel/loader world

1999-10-18 Thread Bruce Evans


  In file included from
  /usr/src/sys/boot/i386/libi386/../../../sys/signal.h:236,
   from
  /usr/src/sys/boot/i386/libi386/../../../sys/param.h:90,
   from
  /usr/src/sys/boot/i386/libi386/aout_freebsd.c:29:
  /usr/src/sys/boot/i386/libi386/../../../sys/ucontext.h:34:
  machine/ucontext.h: No such file or directory
  
  We include the files directly from the source tree, but some of them
  then go and include files from machine/*, which, of course, refers
  to /usr/include/machine/*. Thus, some of the files are up-to-date,
  and some are not. Unless you build world first.

 It's dangerous to mix headers from the source tree with headers from
 /usr/include. Are you sure this is the case?

  [but you can't build world until you booted a new kernel, and you
  can't boot a new kernel until you have a new loader, but you can't
  build a new loader...]

 This is "artifical" in that it is solved by fixing the build process.

The bug is that the `machine' symlink is not always created.  The
makefile supports creating it, but has missing dependencies.  See
the old boot loader (sys/i386/boot/Makefile.inc) for a good way to
handle this.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: Init(8) cannot decrease securelevel

1999-09-06 Thread Bruce Evans


 There used to be security holes that allowed root to lower `securelevel'
 using init.  Rev.1.9 defends against any undiscovered holes.

How about following change?

OK.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: Init(8) cannot decrease securelevel

1999-09-06 Thread Bruce Evans


Once securelevel has been increased, no process can decrease it because
kernel always refuse decreasing it.  This is inconsistent with the
manual page of init:

 The kernel runs with four different levels of security.  Any super-user
 process can raise the security level, but only init can lower it.

Is there any security problem to implement this?  If no, could someone
review following patch?

The patch just backs out rev.1.9:

RCS file: /home/ncvs/src/sys/kern/kern_mib.c,v
Working file: kern_mib.c
head: 1.25
...

revision 1.9
date: 1997/06/25 07:31:47;  author: joerg;  state: Exp;  lines: +2 -2
Don't ever allow lowering the securelevel at all.  Allowing it does
nothing good except of opening a can of (potential or real) security
holes.  People maintaining a machine with higher security requirements
need to be on the console anyway, so there's no point in not forcing
them to reboot before starting maintenance.

Agreed by:  hackers, guido


There used to be security holes that allowed root to lower `securelevel'
using init.  Rev.1.9 defends against any undiscovered holes.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: Init(8) cannot decrease securelevel

1999-09-05 Thread Bruce Evans

Once securelevel has been increased, no process can decrease it because
kernel always refuse decreasing it.  This is inconsistent with the
manual page of init:

 The kernel runs with four different levels of security.  Any super-user
 process can raise the security level, but only init can lower it.

Is there any security problem to implement this?  If no, could someone
review following patch?

The patch just backs out rev.1.9:

RCS file: /home/ncvs/src/sys/kern/kern_mib.c,v
Working file: kern_mib.c
head: 1.25
...

revision 1.9
date: 1997/06/25 07:31:47;  author: joerg;  state: Exp;  lines: +2 -2
Don't ever allow lowering the securelevel at all.  Allowing it does
nothing good except of opening a can of (potential or real) security
holes.  People maintaining a machine with higher security requirements
need to be on the console anyway, so there's no point in not forcing
them to reboot before starting maintenance.

Agreed by:  hackers, guido


There used to be security holes that allowed root to lower `securelevel'
using init.  Rev.1.9 defends against any undiscovered holes.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Init(8) cannot decrease securelevel

1999-09-05 Thread Bruce Evans

 There used to be security holes that allowed root to lower `securelevel'
 using init.  Rev.1.9 defends against any undiscovered holes.

How about following change?

OK.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: rndcontrol and SMP

1999-07-23 Thread Bruce Evans


/*
 * XXX the data is 16-bit due to a historical botch, so we use
 * magic 16's instead of ICU_LEN and can't support 24 interrupts
 * under SMP.
 */
intr = *(int16_t *)data;
if (cmd != MEM_RETURNIRQ  (intr  0 || intr = 16))
return (EINVAL);

What is needed to make this support a more sensible number of IRQs?

Mainly changing the ioctl and its clients (rndcontrol only?) to supply
more bits.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: rndcontrol and SMP

1999-07-23 Thread Bruce Evans

/*
 * XXX the data is 16-bit due to a historical botch, so we use
 * magic 16's instead of ICU_LEN and can't support 24 interrupts
 * under SMP.
 */
intr = *(int16_t *)data;
if (cmd != MEM_RETURNIRQ  (intr  0 || intr = 16))
return (EINVAL);

What is needed to make this support a more sensible number of IRQs?

Mainly changing the ioctl and its clients (rndcontrol only?) to supply
more bits.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: cvs commit: src/usr.sbin/inetd inetd.c

1999-07-21 Thread Bruce Evans


Can nohup really prevent processes from trapping SIGHUP? I thought it
just set the SIGUP handler to discard and hoped for the best.

It's normally a bug to catch ignored signals.  Daomons that reread config
files when they receive a signal may be counterexamples.  OTOH, they
probably shouldn't be started with some signals ignored unless ignoring
those signals is really wanted.

Xntpd in the base system explicitly requests its graceful termination
function, called finish(), be loaded as the SIGHUP handler.

This seems to be just a bug.  finish() is used for SIGHUP, SIGINT,
SIGQUIT and SIGTERM.  finish() just finishes (actually it has undefined
behaviour since it calls exit()), so nothing except undefined behaviour
is lost by ignoring these signals.

Bruce


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message

Re: cvs commit: src/usr.sbin/inetd inetd.c

1999-07-21 Thread Bruce Evans

Can nohup really prevent processes from trapping SIGHUP? I thought it
just set the SIGUP handler to discard and hoped for the best.

It's normally a bug to catch ignored signals.  Daomons that reread config
files when they receive a signal may be counterexamples.  OTOH, they
probably shouldn't be started with some signals ignored unless ignoring
those signals is really wanted.

Xntpd in the base system explicitly requests its graceful termination
function, called finish(), be loaded as the SIGHUP handler.

This seems to be just a bug.  finish() is used for SIGHUP, SIGINT,
SIGQUIT and SIGTERM.  finish() just finishes (actually it has undefined
behaviour since it calls exit()), so nothing except undefined behaviour
is lost by ignoring these signals.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Rewriting pca(4) using finetimer(9) (was: Re: MPU401 now works under New Midi Driver Framework with a Fine Timer)

1999-07-09 Thread Bruce Evans

this is problematic.

you cannot add a new element before the pending firing because you can't
tell how far into the present trigger you are.

This is not a problem for readable counters like the i8254.  The problem
for the i8254 is that reading and writing it takes a long time (perhaps
5 usec each).  Every write to it will decrease its accuracy (perhaps by 1
usec after careful calibration).  Thus if the i8254 is used for generating
non-periodic interrupts with an average period of 333 usec, it may take
15/333 of the CPU (estimate another 5 usec for interrupt handling), and if
it is also used for timecounting then it may drift by 333 usec/second.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Rewriting pca(4) using finetimer(9) (was: Re: MPU401 now worksunder New Midi Driver Framework with a Fine Timer)

1999-07-08 Thread Bruce Evans

dfr If I understand this correctly, you are suggesting that we program timer0
dfr so that we only take interrupts when a finetimer is due to fire? If so,
dfr then it sounds very good. The idea of taking 6000+ interrupts/sec made me
dfr uneasy, even though most would return without doing any work.

6000+ interrupts/sec is not many, unless it is done all the time.  A
486/33 can handle about 5 (16000 for pcaudio + 3 * 11520 for sio's).

There is one problem in this method. acquire_timer0() is only implemented
for i386. We would need to write something equivalent for alpha...

This is a serious problem.  acquire_timer0() is currently disabled even
for i386's when the i8254 is used for timecounting.  This is not hard
to fix (the hooks into clkintr() work even better with timecounters
since it is not necessary to resynchronise clock interrupts after a
state change), but an i8254 interrupt frequency of 16000 Hz is too fast
to be used routinely if the i8254 is being used for timecounting (even
if the CPU can keep up with the interrupts, the overflow heuristics in
i8254_get_timecount() may break down).  Other systems may have even
more limitations on the timecounters.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI

1999-05-29 Thread Bruce Evans

I commented out the SMP and FAST code, rebuilt the kernel from scratch and
the panic still occurs.  Is there other code that should be disabled?  Is
this fixed in 3.2?  Why are there evil macro's in my computer?  Why am I
asking so many questions?

I committed a proper fix in /sys/i386/isa/cy.c:
rev.1.88 (-current)
rev.1.83.2.1 (-stable)

This is not in 3.2.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI

1999-05-27 Thread Bruce Evans

There seems to be a problem with nested locks.  What was the panic
message?

I have the same problem with my Cyclom Ye cards (both the isa  pci 
variety)

The kernel panics with:

panic messages:
---
panic: rslock: cpu: 0, addr: 0xf026a15c, lock: 0x0001
mp_lock = 0001; cpuid = 0; lapic.id = 

That's the problem with nested locks.

after crash debugging shows that the value of com is 0x3 at the time of 
crash.  Since com is a pointer, I think we know that it is wrong.  I have 
added a printf statement to a varitation of cy.c that shows that just 
before the call to commctl, the value of com is correct (not 3).

I don't think that's the problem.

Any help towards getting the cy driver up and running on the smp kernel 
would be appreciated.

See hints in my previous mail.  The main problem is that disable_intr()
isn't doesn't nest properly and I only avoided this problem for !SMP
case.  The SMP disable_intr() is an evil macro that calls a locking
function which crashes if the lock is already held.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI

1999-05-26 Thread Bruce Evans

I have multiprocessor machine:

matherboard SOYO 5TX2/X5 with 2 intel 166 proceccors
multiport card Cyclades Cyclom 8Yep

Kernel config:

options SMP # Symmetric MultiProcessor Kernel
options APIC_IO # Symmetric (APIC) I/O
options NCPU=2  # number of CPUs
options NBUS=2  # number of busses

All good work until somthing do start to send to /dev/cXX
Then kernel panic and reboot !!!

All good working under single processor kernel ... no problem

There seems to be a problem with nested locks.  What was the panic
message?

Try deleting the code in the SMP ifdefs in cy.c, and don't use fast
interrupts with SMP:
(a) for the pci version, don't use option CY_PCI_FASTINTR.
(b) for the isa version, delete the line with RI_FAST in it in cy.c.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

Re: CY_PCI_FASTINTR (Was: Re: Cyclom-Y driver for FreeBSD - help!!!)

1999-05-11 Thread Bruce Evans

Am I understand this right: if I configure my PCI card to an
exclusive interrupt, and then turn on CY_PCI_FASTINTR, I'll get
rid of ``silo overflows'' on it.

Except in -current, where CY_PCI_FASTINTR is broken (it has no effect).

Just out of curiosity.
What is the ``interrupt latency in the FreeBSD kernel'' problem
and what are the ``Bruce's fast interrupt hacks''?

The problem is that some mouldy old drivers and subsystems disable
interrupts for too long (e.g. 2-5msec for programming keyboard LEDs;
this breaks unbuffered serial input at speeds larger than 200-500 bps,
and it breaks the Cyclades' 12-character fifos at speeds larger than
600-2400 bps).

The fast interrupt hacks consist of ignoring the normal interrupt
masking system so that drivers using fast interrupts are not affected
by the problem unless there are enough of them (or enough activity on
them) to create their own latency problems.

Bruce


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-hackers in the body of the message

88 matches

Mail list logo