Re: Replace bcopy() to update ether_addr
From owner-freebsd-...@freebsd.org Thu Aug 23 10:58:03 2012 Delivered-To: freebsd-...@freebsd.org Date: Wed, 22 Aug 2012 23:52:44 +0200 From: Luigi Rizzo ri...@iet.unipi.it To: John Baldwin j...@freebsd.org References: 50324db4.6080...@cabletv.dp.ua 201208220802.14588@freebsd.org CAJ-Vmo=1cbJn3pkSvoCq7y-kEGig-h1Vxo6M5V0=b9=mkfu...@mail.gmail.com 201208221521.06954@freebsd.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 201208221521.06954@freebsd.org User-Agent: Mutt/1.4.2.3i Cc: freebsd-hackers@FreeBSD.org, Adrian Chadd adr...@freebsd.org, Mitya mi...@cabletv.dp.ua, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl, freebsd-...@freebsd.org Subject: Re: Replace bcopy() to update ether_addr X-BeenThere: freebsd-...@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD freebsd-net.freebsd.org List-Unsubscribe: http://lists.freebsd.org/mailman/listinfo/freebsd-net, mailto:freebsd-net-requ...@freebsd.org?subject=unsubscribe List-Archive: http://lists.freebsd.org/pipermail/freebsd-net List-Post: mailto:freebsd-...@freebsd.org List-Help: mailto:freebsd-net-requ...@freebsd.org?subject=help List-Subscribe: http://lists.freebsd.org/mailman/listinfo/freebsd-net, mailto:freebsd-net-requ...@freebsd.org?subject=subscribe Sender: owner-freebsd-...@freebsd.org Errors-To: owner-freebsd-...@freebsd.org Content-Length: 2762 Lines: 64 On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote: On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote: On 22 August 2012 05:02, John Baldwin j...@freebsd.org wrote: On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote: Hi, What about just creating an ETHER_ADDR_COPY(dst, src) and putting that in a relevant include file, then hide the ugliness there? The same benefits will likely appear when copying wifi MAC addresses to/from headers. Thanks, I'm glad someone noticed this. I doubt we even _need_ the ugliness. We should just use *dst = *src unless there is a compelling reason not to. Because it's not very clear? :-) I'd much prefer my array-of-things copies to be explicit. Eh? 'struct foo *src, *dst; *dst = *src' is pretty bog-standard C. That isn't really all that obtuse. the thread has probably forked causing people to miss the explanation that Bruce gave: quite often the function is called by casting arbitrary pointers into 'struct foo *', so the compiler's expectations about alignment do not necessarily match the user's lies. Unfortunately we are building kernels with many compiler checks disabled, so there is a fair chance that the compiler will not detect such invalid casts. Probably addresses are aligned to 2-byte boundaries, but certainly not on a 4-byte, which means that a safe copy might require 3 instructions, even though a compiler could otherwise decide to align all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly do the copy with 2 or even 1 instruction. I would also suggest to try the code i posted in response to bruce so you can check how good or bad are the various solutions on different architectures or CPUs, and see if there is a reasonable compromise. cheers luigi Also, the optimisation and compiler silliness may not be THAT obvious on intel (except when you're luigi and using netmap) but I can't help but wonder whether the same does hold for MIPS/ARM. Getting it wrong there will lead to some very very poor performing code. Don't you think there's a really good chance the compiler knows how to copy a structure appropriately for each architecture already? -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Replace bcopy() to update ether_addr
luigi wrote: On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote: On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote: On 22 August 2012 05:02, John Baldwin j...@freebsd.org wrote: On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote: Hi, What about just creating an ETHER_ADDR_COPY(dst, src) and putting that in a relevant include file, then hide the ugliness there? Good for source code ugliness and bloat (not just in the macro). The same benefits will likely appear when copying wifi MAC addresses to/from headers. Why stop there? Change all assignments to use the ASSIGN() macro. The compiler doesn't understand assignment, but we do ;-). Thanks, I'm glad someone noticed this. I doubt we even _need_ the ugliness. We should just use *dst = *src unless there is a compelling reason not to. Because it's not very clear? :-) I'd much prefer my array-of-things copies to be explicit. Eh? 'struct foo *src, *dst; *dst = *src' is pretty bog-standard C. That isn't really all that obtuse. (Builtin) memcpy is better for this in general (no sarcasm now). It reduces to a struct copy if the object being copy is a struct, and is not limited to structs. However, ethernet address types are now fully pessimized, so compilers are almost forced to assume that they are fully misaligned. From net/ethernet.h: % /* % * Structure of a 10Mb/s Ethernet header. % */ Wow, 10Mb/s. % struct ether_header { % u_char ether_dhost[ETHER_ADDR_LEN]; % u_char ether_shost[ETHER_ADDR_LEN]; % u_short ether_type; % } __packed; This used to be not quite guaranteed to be 16-bit aligned by the short in it. Perhaps it is still aligned in practice, But in 2006, it was fully pessimized by adding __packed without adding __aligned(2). This __packed prevents the struct being padded to a multiple of 32 bits in size on arm. It has the side effect of allowing the compiler to not add padding for alignment before instances of this struct, and allowing embedded short to be misaligned, so all accesses to this short might require 2 1-byte load/store instructions and more instructions to combine the bytes. On x86, the accesses can be a single load/store (or sometimes an arithmetic operation), with the hardware doing the combination, possibly more slowly than for aligned accesses. The pessimization is better on other arches. % /* % * Structure of a 48-bit Ethernet address. % */ % struct ether_addr { % u_char octet[ETHER_ADDR_LEN]; % } __packed; This was always just an array of bytes, so it was nevever required to have more than byte alignment. However, the compiler was permitted to add padding before and after it to align it and the thing after it. __packed on it may prevent this. I hope that it actually takes a __packed on the containing struct or on the struct member with this type to prevent external padding, but __packed on everything may be needed anyway for __packed on this to have much effect. An ETHER_ADDR_COPY() macro can only do worse than the compiler here, since the object is nothing more than an array of bytes after forgetting the extra packing info that may be known to the compiler. On x86, the macro can use 2 possibly-misaligned load/stores, but so can the compiler, or the macro can use a dynamic comparison of the address to find the alignment point and then do 1, 2 or 3 aligned load/store pairs, but so can the compiler, or the macro can use rep movsb, but so can the compiler... The best choice depends on CFLAGS (a static test that is easiest for the compiler) or on the CPU. For a complete implementation, the macro must of course patch the instructions used to best suit the current CPU. This would be a lot of work for an unimportant (~0.1% max) optimization. On non-x86 or any arch with strict alignment requirements, the only obviously correct implementions of the macro are 6 1-byte load/store pairs, or memcpy(), or bcopy(). The compiler can easily beat all of these. the thread has probably forked causing people to miss the explanation that Bruce gave: quite often the function is called by casting arbitrary pointers into 'struct foo *', so the compiler's expectations about alignment do not necessarily match the user's lies. Except for struct ether_addr, the cast would actually reduce alignment (unless the compiler is smart enough to not do it literally). Unfortunately we are building kernels with many compiler checks disabled, so there is a fair chance that the compiler will not detect such invalid casts. Er, we check more in the kernel than almost everywhere else. Probably addresses are aligned to 2-byte boundaries, but certainly not on a 4-byte, which means that a safe copy might require 3 instructions, even though a compiler could otherwise decide to align all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly do the copy with 2 or even 1 instruction. The code actually
Re: Replace bcopy() to update ether_addr
mitya wrote: 22.08.2012 05:07, Bruce Evans íàïèñàë: On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote: Hi. I found some overhead code in /src/sys/net/if_ethersubr.c and /src/sys/netgraph/ng_ether.c It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN); When src and dst are struct ether_addr*, and ETHER_ADDR_LEN equal 6. Only ng_ether.c contains such strings. if_ethersubr.c doesn't exist. if_ether.c exists, but was converted to use memcpy() 17.25 years ago. Oops, if_ethersubr.c does exist (in net/. if_ether.c is a misnamed file in netinet/). Summary: use builtin memcpy() for small copies, and don't try hard to otherwise optimize this. You are very surprised to learn that bcopy() and memcpy() are used for copy struct in_addr, whose length is equal to 4 bytes ? I found this in sys/netinet/if_ether.c. And I think, there is a chance to find them in other files. Since memcpy() is the correct method, us of it is only slightly surprising. Use of bcopy() is a regression. Bugs are never surprising :-). And I found bzero(mtod(m, void *), sizeof(struct in_addr)); in ip_options.c Speed of options processing isn't important. Anyway, I dug out some of my old unimportant fixes for some of the copying pessimizations: % diff -c2 ./net/if_ethersubr.c~ ./net/if_ethersubr.c % *** ./net/if_ethersubr.c~ Wed Dec 27 10:49:51 2006 % --- ./net/if_ethersubr.c Wed Dec 27 10:49:52 2006 % *** % *** 253,257 % hdrcmplt = 1; % eh = (struct ether_header *)dst-sa_data; % ! (void)memcpy(esrc, eh-ether_shost, sizeof (esrc)); % /* FALLTHROUGH */ % % --- 253,257 % hdrcmplt = 1; % eh = (struct ether_header *)dst-sa_data; % ! __builtin_memcpy(esrc, eh-ether_shost, sizeof(esrc)); % /* FALLTHROUGH */ % Not the right fix (except to fix the style bug (cast to void)). memcpy() should be used, and then if the builtin is good then it shoukd be used. % *** % *** 259,263 % loop_copy = 0; /* if this is for us, don't do it */ % eh = (struct ether_header *)dst-sa_data; % ! (void)memcpy(edst, eh-ether_dhost, sizeof (edst)); % type = eh-ether_type; % break; % --- 259,263 % loop_copy = 0; /* if this is for us, don't do it */ % eh = (struct ether_header *)dst-sa_data; % ! __builtin_memcpy(edst, eh-ether_dhost, sizeof(edst)); % type = eh-ether_type; % break; % *** % *** 276,288 % senderr(ENOBUFS); % eh = mtod(m, struct ether_header *); % ! (void)memcpy(eh-ether_type, type, % ! sizeof(eh-ether_type)); % ! (void)memcpy(eh-ether_dhost, edst, sizeof (edst)); % if (hdrcmplt) % ! (void)memcpy(eh-ether_shost, esrc, % ! sizeof(eh-ether_shost)); % else % ! (void)memcpy(eh-ether_shost, IF_LLADDR(ifp), % ! sizeof(eh-ether_shost)); % % /* % --- 276,287 % senderr(ENOBUFS); % eh = mtod(m, struct ether_header *); % ! __builtin_memcpy(eh-ether_type, type, sizeof(eh-ether_type)); % ! __builtin_memcpy(eh-ether_dhost, edst, sizeof(edst)); % if (hdrcmplt) % ! __builtin_memcpy(eh-ether_shost, esrc, % ! sizeof(eh-ether_shost)); % else % ! __builtin_memcpy(eh-ether_shost, IF_LLADDR(ifp), % ! sizeof(eh-ether_shost)); % % /* Larger style fixes. Non-style fixes/breakages are in the z subdir: % diff -c2 ./net/z/if_ethersubr.c~ ./net/z/if_ethersubr.c % *** ./net/z/if_ethersubr.c~ Wed Dec 27 10:49:51 2006 % --- ./net/z/if_ethersubr.cWed Dec 27 20:28:06 2006 % *** % *** 191,197 % % if (m-m_flags M_BCAST) % ! bcopy(ifp-if_broadcastaddr, edst, ETHER_ADDR_LEN); % else % ! bcopy(ar_tha(ah), edst, ETHER_ADDR_LEN); % % } % --- 191,198 % % if (m-m_flags M_BCAST) % ! __builtin_memcpy(edst, ifp-if_broadcastaddr, % ! sizeof(edst)); % else % ! __builtin_memcpy(edst, ar_tha(ah), sizeof(edst)); % % } bcopy() can't be replaced by a builtin since it is required to handle overlapping copies, which I think can't happen here (but I didn't check this). The builtin shouldn't be hard-coded like this, as above. % *** % *** 214,219 % } else % type = htons(ETHERTYPE_IPX); % ! bcopy((caddr_t)(((struct sockaddr_ipx *)dst)-sipx_addr.x_host), % ! (caddr_t)edst, sizeof (edst)); % break; % #endif % --- 215,221 % } else % type = htons(ETHERTYPE_IPX); % ! __buitin_memcpy
Re: Replace bcopy() to update ether_addr
On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote: Hi. I found some overhead code in /src/sys/net/if_ethersubr.c and /src/sys/netgraph/ng_ether.c It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN); When src and dst are struct ether_addr*, and ETHER_ADDR_LEN equal 6. Only ng_ether.c contains such strings. if_ethersubr.c doesn't exist. if_ether.c exists, but was converted to use memcpy() 17.25 years ago. This code call every time, when we send Ethernet packet. On example, on my machine in invoked nearly 20K per second. Why we are use bcopy(), to copy only 6 bytes? Answer - in some architectures we are can not directly copy unaligned data. I propose this solution. In file /usr/src/include/net/ethernet.h add this lines: static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) { #if defined(__i386__) || defined(__amd64__) *dst = *src; #else bcopy(src, dst, ETHER_ADDR_LEN); #endif } On platform i386 gcc produce like this code: leal-30(%ebp), %eax leal6(%eax), %ecx leal-44(%ebp), %edx movl(%edx), %eax movl%eax, (%ecx) movzwl 4(%edx), %eax movw%ax, 4(%ecx) And clang produce this: movl-48(%ebp), %ecx movl%ecx, -26(%ebp) movw-44(%ebp), %si movw%si, -22(%ebp) All this variants are much faster, than bcopy() You mean as much as 5 nanoseconds faster. Possibly even 10 nanoseconds faster. A bit orthogonal to this but also related to the performance impact of these bcopy() calls, for !__NO_STRICT_ALIGNMENT architectures these places probably should use memcpy() instead as bcopy() additionally has to check for overlap while the former does not. Overlaps unlikely are an issue in these cases and at least NetBSD apparently has done the switch to memcpy() 5.5 years ago. This is essentially just a style bug. FreeBSD switched to memcpy() 17.25 years ago for selected networking copying. memcpy() is supposed to be used if and only if compilers can optimize it. This means that the size must be fixed and small, and of course that the copies don't overlap. Otherwise, compilers can't do any better than call an extern copying routine, which is memcpy() in practice. memcpy() was interntionally left out in FreeBSD until it was added 17.25 years ago to satisfy the changes to use memcpy() in networking code (since with -O0, memcpy() won't be inlined and the extern memcpy() gets used). Other uses are style bugs, but there are many now :-(. bcopy() is still the primary interface, and might be faster than memcpy(), especially for misaligned cases, but in practice it isn't, except in the kernel on Pentium1 in 1996-1998 where I only optimized (large) bcopy()s. Since it has to support overlapping copies it is inherently slower. Although compilers can't do better for large copying than call an extern routine, some compilers bogusly inline it to something like rep movsd on i386, (or worse, to a very large number of loads and stores). gcc used to have a very large threshold for inlining moderately-sized copies and/or for switching between rep movsd and load/store. It now understands better than ut doesn't understand memory, and has more reasonable thresholds. Or rather the thresholds are more MD. gcc still makes a mess with some CFLAGS: % struct foo % { % short x; % struct bar { % short yy[31]; % } y; % } s, t; % % foo() % { % s.y = t.y; % } With just -O, gcc-4.2.1 -O on i386 handles this very badly, by generating 15 misaligned 32-bit load/stores followed by 1 aligned 16-bit load/store. With -march=almost anything, it generates 1 16-bit aligned load-store followed by an aligned rep movsd with a count of 15. The latter is not too bad. Similarly for yy[32]. But for yy[33] it switches to a call to memcpy() even with plain -O. However, improvements in memory systems and branch prediction since Pentium1 in 1996-98 mean that optimimizing copying mostly gets nowhere. Copying is either from the cache[s], in which case it is fast (~1 nanosecond per 128 bits for L1), or it is not from the caches in which case it is slow (~50-200 nanseconds per cache miss). With 6-byte ethernet addresses, using bcopy() does slow the copying to considerably below 1 nanosecond per 128 bits (a sloppy estimate gives 5-10 ns/call), but it's hard for a single call to be made often enough to make a significant difference. Someone mentioned 2 calls. That's the same as 0 calls: 2 * 10 nsec = 200 usec = 0.05% of 1 CPU. If anyone cared about this, then they would use __builtin_memcpy() instead of memcpy(). (Note that the point of the 17.25-year old optimization has been defeated for ~10 years by turning off _all_ builtins, which was initially done mainly to kill builtin putc(). (-ffreestanding should have done that.) So gcc inlines struct copying for small structs, but never inlines memcpy(), or
Re: Replace bcopy() to update ether_addr
luigi wrote: even more orthogonal: I found that copying 8n + (5, 6 or 7) bytes was much much slower than copying a multiple of 8 bytes. For n=0, 1,2,4,8 bytes are efficient, other cases are slow (turned into 2 or 3 different writes). The netmap code uses a pkt_copy routine that does exactly this rounding, gaining some 10-20ns per packet for small sizes. I don't believe 10-20ns for just the extra bytes. memcpy() ends up with a movsb to copy the extra bytes. This can be slow, but I don't believe 10-20ns (except on machines running at i486 speeds of course). % ENTRY(memcpy) % pushl %edi % pushl %esi % movl12(%esp),%edi % movl16(%esp),%esi % movl20(%esp),%ecx % movl%edi,%eax % shrl$2,%ecx /* copy by 32-bit words */ % cld /* nope, copy forwards */ % rep % movsl % movl20(%esp),%ecx % andl$3,%ecx /* any bytes left? */ This avoids a branch. Some optimization manuals say that the branch is actually better for some machines, The above 2 instructions have a throughput of 1 per cycle each on modern x86. Latency might be 6 cycles. % rep Maybe 5-15 cycles throughput. % movsb Now hopefully at most 1 cycle/byte. Some hardware might combine the bytes as much as possible, so the whole function should use 1 single rep movsb and let the hardware do it all. % popl%esi % popl%edi % ret Well, it's easy to get a latency of 20 cycles 5-10 ns) and maybe even a throughput of that. But all of thus is out of order on modern x86. The extra cycles for the movsb might not cost at all if nothing accesses the part of the target that they were written to soon. With builtin memcpy, 6 bytes would be done using load/store of 4+2 bytes and thus take the same time as 8 bytes on i386, but on amd64 8 bytes would be faster. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Replace bcopy() to update ether_addr
jhb wrote: On Monday, August 20, 2012 10:46:12 am Mitya wrote: ... I propose this solution. In file /usr/src/include/net/ethernet.h add this lines: static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) { #if defined(__i386__) || defined(__amd64__) *dst = *src; #else bcopy(src, dst, ETHER_ADDR_LEN); #endif } Doesn't '*dst = *src' just work on all platforms? It does when you have an actual struct, but often in networking code you only have an array of bytes, and then casting the pointers from uint8_t * to full object pointers fails iff there is an alignment problem. Even on i386, it may be pessimal to use struct copying for 6 bytes, The bytes should be copied as 4+2 or 2+4 depending on alignment of the first bytes. Oops, not even that -- this reminds me of a problem with penalties for mismatched loads and stores which affects at least AthlonXP and Athlon64 significantly (10-20 cycle penalty = enough to copy 80-160 bytes unpenalized). The 6 should probably be copied as 2+2+2 or even as 1+1+1+1+1+1 to match previous stores and later loads, also as 2+2+2, etc. But if the 6 are part of another struct, they might be accessed as 8+0 or 4+4 as part of copying the full struct. gcc is not smart about this. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: High-res Timers
On Wed, 16 May 2012, Jason Hellenthal wrote: Does anyone have a quick list of high-resolution timer functions? Both user-land and kernel-land? It would be greatly appreciated (doing some performance timing for applications). clocks(7) - various system timers getitimer(2), setitimer(2) - get/set value of interval timer see ( man -k timer ) list for some other references. I think the original poster wants timestamping functions. Timer function normally means a timer-interrupt-scheduling function. clocks(7) is bogus and more than 10 years out of date. It is mostly about hardware and virtual clock oscillators that may be used to build timestamping and timer functions (mostly the former). I am not sure what sort of high resolution you are refering to but maybe these will lead you in the right direction. Bruce Evans CCd - he may have quite a bit that could be added to this. I always find what he has to say very enlightening. Generally, some of the timestamping functions have high resolution, but are too slow to use if you make a lot of timestamps, and if you don't make a lot of timestamps then you don't need very high resolution. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: [review request] zfsboot/zfsloader: support accessing filesystems within a pool
On Sat, 5 May 2012, Andriy Gapon wrote: on 04/05/2012 18:25 John Baldwin said the following: On Thursday, May 03, 2012 11:23:51 am Andriy Gapon wrote: on 03/05/2012 18:02 Andriy Gapon said the following: Here's the latest version of the patches: http://people.freebsd.org/~avg/zfsboot.patches.4.diff I've found a couple of problems in the previous version, so here's another one: http://people.freebsd.org/~avg/zfsboot.patches.5.diff The important change is in the first patch (__exec args). A few comments/suggestions on the args bits: John, these are excellent suggestions! Thank you! Some comments: - Add #ifndef LOCORE guards to the new header around the structure so it can be used in assembly as well as C. Done. I have had to go into a few btx makefiles and add a necessary include path and -DLOCORE to make the header usable from asm. Ugh, why not use genassym, as is done for all old uses of this header in locore.s, at least on i386 (5% of the i386 genassym.c is for this). - Move BI_SIZE and ARGOFF into the header as constants. Done. - Add a CTASSERT() in loader/main.c that BI_SIZE == sizeof(struct bootinfo) Ugh, BI_SIZE was already used in locore.s. It wasn't the size of the struct, but was the offset of the field that gives the size. No CTASSERT() was needed -- the size is whatever it is, as given by sizeof() on the struct at the time of compilation of the utility that initializes the struct. It was a feature that sizeof() and offsetof() can't be used in asm so they must be translated in genassym and no macros are needed in the header (the size was fully dynamic, so the asm code only needs the offsetof() values). Of course, you could use CTASSERT()s to check that the struct layout didn't get broken. The old code just assumes that the struct is packed by the programmer and that the arch's struct packing conventions don't change, so that for example BI_SIZE = offsetof(struct bootinfo, bi_size) never changes. genassym is hard to use in boot programs, but the old design was that boot programs shouldn't use bootinfo in asm and should just use the target bootinfo.h at compile time (whatever time the target is compiled). Anyway, LOCORE means for use in locore.[sS], so other uses of it, e.g. in boot programs, are bogus. I have added a definition of CTASSERT to boostrap.h as it was not available for sys/boot and there were two local definitions of the macro in individual files. However the assertion would fail right now. The backward-compatible value of BI_SIZE (72 == 0x48) covers only part of the This isn't backwards compatible. BI_SIZE was decimal 48 (covers everything up to the bi_size field). fields in struct bootinfo, those up to the following comment: /* Items below only from advanced bootloader */ I am a little bit hesitant: should I increase BI_SIZE to cover the whole struct bootinfo or should I compare BI_SIZE to offsetof bi_kernend? Neither. BI_SIZE shouldn't exist. It defeats the bi_size field. Maybe create a 'struct kargs_ext' that looks like this: struct kargs_ext { uint32_t size; char data[0]; }; I've decided to skip on this. Use KNF indentation and KNF field prefixes (ka_) if you add it :-). Generic field names like `size' and `data' need prefixes more than mos. The old struct was: % #define N_BIOS_GEOM 8 % ... % /* % * A zero bootinfo field often means that there is no info available. % * Flags are used to indicate the validity of fields where zero is a % * normal value. % */ % struct bootinfo { % u_int32_t bi_version; % u_int32_t bi_kernelname; /* represents a char * */ % u_int32_t bi_nfs_diskless;/* struct nfs_diskless * */ % /* End of fields that are always present. */ The original size was apparently 12. % #define bi_endcommonbi_n_bios_used Another style difference. The magic 12 is essentially given by this macro. This macro is a pseudo-field, like the ones for the copyable and zeroable regions in struct proc. Its name is in lower case. % u_int32_t bi_n_bios_used; % u_int32_t bi_bios_geom[N_BIOS_GEOM]; The struct was broken in 1994 by adding the above 2 fields without providing any way to distinguish it from the old struct. % u_int32_t bi_size; % u_int8_tbi_memsizes_valid; % u_int8_tbi_bios_dev;/* bootdev BIOS unit number */ % u_int8_tbi_pad[2]; % u_int32_t bi_basemem; % u_int32_t bi_extmem; % u_int32_t bi_symtab; /* struct symtab * */ % u_int32_t bi_esymtab; /* struct symtab * */ The above 8 fields were added in 1995 (together with fixing style bugs like no prefixes for the old field names). Now the struct is determined by its size according to the bi_size field, and the bi_version field is not really used (it's much easier to add
Re: sizeof(function pointer)
On Tue, 31 May 2011 m...@freebsd.org wrote: I am looking into potentially MFC'ing r212367 and related, that adds drains to sbufs. The reason for MFC is that several pieces of new code in CURRENT are using the drain functionality and it would make MFCing those changes much easier. The problem is that r212367 added a pointer to a drain function in the sbuf (it replaced a pointer to void). The C standard doesn't guarantee that a void * and a function pointer have the same size, though its true on amd64, i386 and I believe PPC. What I'm wondering is, though not guaranteed by the standard, is it *practically* true that sizeof(void *) == sizeof(int(*)(void)), such that an MFC won't break binary compatibility for any supported architecture? (The Only on supported arches. standard does guarantee, though not in words, that all function pointers have the same size, since it guarantees that pointers to functions can be cast to other pointers to functions and back without changing the value). No, it doesn't guarantee that they have the same size. The casts may change inything in the representation including the size. In 1995 I added intfptr_t, uintfptr_t and fptrdiff_t to FreeBSD to handle this problem in the profil(2) APIs. You can check the MD definitions of these to verify that their size is always the same as the sizeof(void *). There are some style bugs and namespace issues with these. The definitions in /sys/*/include/_types.h are clean, but are mostly not used in their primary consumer /sys/*/include/profile.h. intfptr_t and uintfptr_t are bogusly declared for the kernel in a bogus declaration section in /sys/types.h. They are intentionally kept out of general headers, but even the kernel doesn't really need them there, unless they are used for more than profiling and then you might need them outside the kernel. Another possibility is to malloc a blob that is sizeof(int(*)(void)) and store that in a renamed s_unused; this is a bit messier but guaranteed to work. I'd just rather the code be an MCF instead of a partial re-write. It is only guaranteed to work if the value is converted to the actual function pointer type before use. (At least for use as a function pointer.) Someone pointed out that POSIX now requires function pointers to have the same representation as void *. This is a bug in POSIX IMO. It isn't in the 2001 version. I think POSIX had to do something for broken parts of dlcfn APIs. Hopefully this isn't required generally. Requiring all function pointers to be convertible to void * and back without breaking them is bad enough, but requiring the same representation is much more restrictive. It breaks any system that has type info in the pointers. The C standard is careful not to require this breakage even for data pointers. Someones pointed out that ia64 function pointers are actually handles. Systems that want to put type info in pointers would have to use handles instead of actual pointers to give the same representation, even if this is not the natural ABI. Whether handle values are useful when the data that they point to is unavailable is unclear. The profiling ABIs mainly need the function pointers (converted to integers) as ids, and handle values are hopefully enough for that. If not, the conversion needs to mangle the values sufficiently for uniqueness. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: retry mounting with ro when rw fails
On Fri, 8 Apr 2011, Andriy Gapon wrote: on 08/04/2011 03:00 Jeremy Chadwick said the following: On Thu, Apr 07, 2011 at 01:20:53PM -0700, Garrett Cooper wrote: As a generic question / observation, maybe we should just implement 'errors=remount-ro' (or a reasonable facsimile) like Linux has in our mount(8) command? Doesn't look like NetBSD, OpenBSD, or [Open]Solaris sported similar functionality. I was going to recommend exactly this. :-) I like the idea of Andriy's patch, but would feel more comfortable if it were only used if a mount option was specified (-o errors=remount-ro). Having the option is appealing, but my main motivation was the simplicity that comes from having that enabled by default. That is, you absolutely want an R/W mount then use -o rw, you need R/O then explicitly -o ro, you just want to get that media mounted then the default behavior tries its best. But the default behaviour is backwards, especially for read-mostly removable media. The default should be ro, possibly with an automagic upgrade to rw iff the media really needs to be written too. Writing timestamps for file system and inode access times doesn't count as really needs to be written to. I think I prefer requiring an explicit upgrade to rw. rw implies writing access times unless you also use noatime, and I wouldn't want noatime to be set automagically depending on whether rw is set explicitly, so I would want noatime to be set explicitly, and once you do that then you can easily set rw or ro at the same time. A new rm (read mostly) or rwa (read or write automagically) flag could give automatic upgrade from ro to rw. I'd also like automatic downgrade to ro after a file system has not been written to for some time (this would avoid fscks in most cases for read-mostly file systems. The ro flag should be per-cylinder-group in ffs so that on big disks, most parts are read-only most of the time and don't need to be checked). Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: get_cyclecount(9) deprecation
On Thu, 17 Mar 2011, Jung-uk Kim wrote: Both get_cyclecount(9) and cpu_ticks() do almost exactly the same thing now assuming set_cputicker() is called with a correct function before get_cyclecount() is used, which is true for x86, at least. The only difference is get_cyclecount() may be inlined but I don't see much gain from the current uses. Please review the patch. Note I didn't really remove get_cyclecount() just because some random third-party module may use it as it is a documented feature while cpu_ticks is an internal KPI. What do you think? I like this idea, but see some minor problems: - cpu_ticks() API needs to become more public and stable, and guarantee to use the ticker with the highest frequency - it is hard to see when set_cputicker() is called, but on i386 it is called very early so there seem to be no problems with the dummy timecounter. It is called from init_TSC() which is called from startrtclock() which is called from machdep.c:cpu_startup() just after the garbage Good morning comment which was originally just banal since it preceeded its code (the printf that prints the copyright message). The ordering of this is now highly obfuscated, and in fact there is now the following enormous amount of code between the printf and its comment: % enum sysinit_sub_id { % SI_SUB_DUMMY= 0x000,/* not executed; for linker*/ % SI_SUB_DONE = 0x001,/* processed*/ % SI_SUB_TUNABLES = 0x070,/* establish tunable values */ % SI_SUB_COPYRIGHT= 0x081,/* first use of console*/ ^ prints the copyright % SI_SUB_SETTINGS = 0x088,/* check and recheck settings */ % SI_SUB_MTX_POOL_STATIC = 0x090,/* static mutex pool */ % SI_SUB_LOCKMGR = 0x098,/* lockmgr locks */ % SI_SUB_VM = 0x100,/* virtual memory system init*/ % SI_SUB_KMEM = 0x180,/* kernel memory*/ % SI_SUB_KVM_RSRC = 0x1A0,/* kvm operational limits*/ % SI_SUB_WITNESS = 0x1A8,/* witness initialization */ % SI_SUB_MTX_POOL_DYNAMIC = 0x1AC,/* dynamic mutex pool */ % SI_SUB_LOCK = 0x1B0,/* various locks */ % SI_SUB_EVENTHANDLER = 0x1C0,/* eventhandler init */ % SI_SUB_VNET_PRELINK = 0x1E0,/* vnet init before modules */ % SI_SUB_KLD = 0x200,/* KLD and module setup */ % SI_SUB_CPU = 0x210,/* CPU resource(s)*/ ^^^ calls cpu_startup() but there seem to be no slow operations (like device initialization) in there, so cpu_startup() seems to be called early enough. The function names for clock initialization are almost equally rotted: - the realtime clock was originally the i8254, and startrtclock() started it. Now startrtclock() doesn't touch either the the realtime clock (the timecounter) or the i8254. It only calls init_TSC(). This might as well be called directly. The i8254 is now initialized (only?) by i8254, which is called much earlier, from init386(), so that DELAY() works when called early from console drivers. The same care should be taken with initializing the TSC for early use, and would be essential if the i8254 support in DELAY() were removed. - the correct interface for attaching the realtime clock now seems to be cpu_initclocks(). This is now used for the TSC (init_TSC_tc()). But the i8254 is now attached as a timecounter in attimer_attach(). - set_cputicker() has various races and implementation bugs: - it has no visible locking. This may be safe when it is called at boot time. It is also called from tsc_levels_changed(). BTW, these calls still don't know about P-state invariance. They always pass the parameter saying that the ticker is variable. This despite the invariance variable being checked to lines after the call that passes a hard-coded for variance. - the call to set_cputicker() after machdep.tsc_freq changes the ticker freqency is missing. This is a feature if the ticker is variable, since the dynamic ticker calibration code can be more accurate than a fixed setting, and would undo any fixed setting. But this code is buggy... - set cputicker() has some design bugs. It assumes that the tick frequency is the same across all CPUs, but the TSC is per-CPU. I have an old SMP system with CPUs of different frequency that can demonstrate bugs from this. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: get_cyclecount(9) deprecation
On Fri, 18 Mar 2011, Kostik Belousov wrote: On Fri, Mar 18, 2011 at 05:26:53PM +1100, Bruce Evans wrote: ... - set cputicker() has some design bugs. It assumes that the tick frequency is the same across all CPUs, but the TSC is per-CPU. I have an old SMP system with CPUs of different frequency that can demonstrate bugs from this. We definitely do not support configurations with different models of CPUs in SMP, this is what Simmetric is about. Different as in frequency or stepping. It worked before this bug was implemented. The TSC wasn't used so the O/S didn't notice the asymmetric frequencies directly, and the O/S also didn't care about this indirectly. Now there is even more asymmetry in core frequencies, with the hardware transiently slowing down or stopping cores independently, at least for cores in different packages. Ths O/S probably can't see this directly using TSCs since the technology that changes the core's frequencies probably comes with invariant TSCs, and the O/S shouldn't care about this indirectly just like before. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: get_cyclecount(9) deprecation
On Sat, 19 Mar 2011, Bruce Evans wrote: On Fri, 18 Mar 2011, Kostik Belousov wrote: We definitely do not support configurations with different models of CPUs in SMP, this is what Simmetric is about. Different as in frequency or stepping. ... Now there is even more asymmetry in core frequencies, with the hardware transiently slowing down or stopping cores independently, at least for cores in different packages. Also, with virtualization, the virtualizer cannot reasonably even provide an invariant TSC that runs at the same rate on all cores. It should provide an invariant TSC that claims to run at the same rate on all cores, but then the cores cannot run at the same rate except on average, since some of the cores will have to run the virtualizer some of the time, and it is unreasonable to distribute the overhead for this evenly except on average. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Chasing down bugs with access(2)
On Wed, 21 Jul 2010, Jaakko Heinonen wrote: On 2010-07-20, Garrett Cooper wrote: I ran into an issue last night where apparently several apps make faulty assumptions w.r.t. whether or not access(2) returns functional data when running as a superuser. New implementations are discouraged from returning X_OK unless at least one execution permission bit is set. See PR kern/125009 (http://www.freebsd.org/cgi/query-pr.cgi?pr=125009). Here is the latest version of the vaccess*() patch which also changes vaccess_acl_nfs4(): http://people.freebsd.org/~jh/patches/vaccess-VEXEC.diff The patch is not a complete fix however. Not all file systems use vaccess*() for VEXEC in their VOP_ACCESS() (ZFS confirmed). Thus the patch doesn't work with ZFS. I looked at the patches in the PR. It seems reasonable to require an X but for VEXEC for all file types except directories, like I think the vaccess() version of your patch does. Keeping the existing behaviour for directories seems necessary. E.g., suppose a user changes all his files and directories to mode 000. It should still be possible for root to search, not to mention back up, all those files and directories, without clobbering any of their metadata (including atimes, but those are a different problem). Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Chasing down bugs with access(2)
On Tue, 20 Jul 2010, Garrett Cooper wrote: Hi Hackers, I ran into an issue last night where apparently several apps make faulty assumptions w.r.t. whether or not access(2) returns functional data when running as a superuser. POSIX says: In early proposals, some inadequacies in the access() function led to the creation of an eaccess() function because: 1. Historical implementations of access() do not test file access correctly when the process' real user ID is superuser. In particular, they always return zero when testing execute permissions without regard to whether the file is executable. 2. The superuser has complete access to all files on a system. As a consequence, programs started by the superuser and switched to the effective user ID with lesser privileges cannot use access() to test their file access permissions. However, the historical model of eaccess() does not resolve problem (1), so this volume of IEEE Std 1003.1-2001 now allows access() to behave in the desired way because several implementations have corrected the problem. It was also argued that problem (2) is more easily solved by using open(), chdir(), or one of the exec functions as appropriate and responding to the error, rather than creating a new function that would not be as reliable. Therefore, eaccess() is not included in this volume of IEEE Std 1003.1-2001. The sentence concerning appropriate privileges and execute permission bits reflects the two possibilities implemented by historical implementations when checking superuser access for X_OK. New implementations are discouraged from returning X_OK unless at least one execution permission bit is set. This seems wrong for directories. It should say ... unless the file is 'executable'. 'executable' means searchable for directories, and the above shouldn't apply. 'executable' actually means executable for regular files, and the above should only apply indirectly: it is executability that should be required to have an X perm bit set, and then access() should just track the capability. The usual weaseling with appropriate privilege allows the X perm bits to have any control on executablity including none, and at least the old POSIX spec doesn't get in the way of this, since it doesn't mention the X perm bits in connection with the exec functions. The spec goes too far in the other direction for the access function. FreeBSD says: Even if a process's real or effective user has appropriate privileges and indicates success for X_OK, the file may not actually have execute per- mission bits set. Likewise for R_OK and W_OK. Perhaps it is time to fix this. The part about X_OK never applied to any version of FreeBSD. Perhaps it applied to the body deleted version of execve() in Net/2 and 4.4BSD, but FreeBSD had to reimplement execve() and it never had this bug. But^2, the access() syscall and man page weren't changed to match. See the end of this reply for more details on execve(). See the next paragraph about more bugs in the above paragraph. Other bugs: - R_OK and W_OK are far from likewise. Everone knows that root can read and write any file. - The permission bits are relatively uninteresting. access() should track the capability, not the bits. The bits used to map to the capability directly for non-root, but now with ACLs, MAC, etc. they don't even do that. - access(2) has no mention of ACLs, MAC, etc. - See a recent PR about unifdefed CAPABILITIES code in vaccess(). (The comment says that the code is always ifdefed out, but it now always unifdefed in.) I don't quite understand this code -- does it give all of ACLs, MAC and etc. at this level? This results in: sh/test - ok-ish (a guy on bash-bugs challenged the fact that the syscall was buggy based on the details returned). bash - broken (builtin test(1) always returns true) csh - not really ok (uses whacky stat-like detection; doesn't check for ACLs, or MAC info) perl - ok (uses eaccess(2) for our OS). python - broken (uses straight access(2), so os.access is broken). So now the question is how to fix this? Linux reports the correct mode for the file (when operating as superuser or not), and there's a lot of code outside of *BSD that builds upon that assumption because stat(2) doesn't test for permissions via POSIX ACLs, MAC, etc. I tried munging through the code to determine where VOP_ACCESS was coming from but I got lost. Where should I look for this? Mostly it is in vaccess(9). But execve() mostly doesn't use VOP_ACCESS(). Here is the FreeBSD-1 version, which is a bit shorter and thus easier to understand than the current version, and shows that FreeBSD has always required 1 exec bit for execve(): % /* % * Check permissions of file to execute. % *Return 0 for success or error code on failure. % */ % int % exec_check_permissions(iparams) % struct image_params *iparams; % { % struct proc *p =
Re: Chasing down bugs with access(2)
On Wed, 21 Jul 2010, Garrett Cooper wrote: On Wed, Jul 21, 2010 at 1:40 AM, Bruce Evans b...@optusnet.com.au wrote: - See a recent PR about unifdefed CAPABILITIES code in vaccess(). ?(The ?comment says that the code is always ifdefed out, but it now always ?unifdefed in.) ? I don't quite understand this code -- does it give ?all of ACLs, MAC and etc. at this level? Interestingly standard permissions bypass ACLs/MAC if standard permissions on the file/directory allow the requested permissions to succeed; note the return (0) vs the priv_check_cred calls -- this is where the the ACL/MAC for the inode is checked. This seems backwards, but I could be missing something.. I was wrong to say that vaccess() does most of the checking. This is only correct if there are no ACLs, etc. Otherwise, e.g., for ffs, the layering is more like VOP_ACCESS() - ufs_access() = check r/o ffs check quota (bogusly in the clause whose comment says that it checks for a r/o fs, deep in the case statement for that clause. This was readable when that clause was only 16 lines long, but now it has messy locking for quotas, and large comments about this, so it is 44 lines long, with the number of lines for locking up from 4 to 32) check immutability. Another bug in access()'s man page is that it doesn't mention either immutability or the errno EPERM that at least ufs_access() returns for it. check acls call vaccess(). The MAC checks seem to be at the end of this and are unrelated to acls. But for exec_check_permissions(), the MAC checks are first. ... % ? ? ? /* % ? ? ? ?* 1) Check if file execution is disabled for the filesystem that this % ? ? ? ?* ? ? ?file resides on. % ? ? ? ?* 2) Insure that at least one execute bit is on - otherwise root % ? ? ? ?* ? ? ?will always succeed, and we don't want to happen unless the % ? ? ? ?* ? ? ?file really is executable. % ? ? ? ?* 3) Insure that the file is a regular file. % ? ? ? ?*/ % ? ? ? if ((vnodep-v_mount-mnt_flag MNT_NOEXEC) || % ? ? ? ? ? ((attr-va_mode 0111) == 0) || % ? ? ? ? ? (attr-va_type != VREG)) { % ? ? ? ? ? ? ? return (EACCES); % ? ? ? } Your mail program seems to be responsible for making the above unreadable by changing tabs to hard \xa0's (which are displayed as tabs by my mail client(s?), but soft \xa0's followed by a space by my editor. 0111 is an old spelling of the S_IX* bits. ?We check these directly Maybe the changes weren't of tabs. In this paragraphs, sentence breaks of 2 spaces got changed to 1 space followed by a hard \xa0. Yet 2 more bugs: not just point 2, but points 1 and 3 in the above comment are undocumented in execve(2) and access(2). ?The usual weaseling with appropriate privilege should allow these too, but (as I forgot to mention above) I think appropriate privilege is supposed to be documented somewhere, so the man pages are still missing details. Agreed on the former statement, and I understand the reasoning for the latter statement, but at least for 1., this is a feature of mount(2) (of course): MNT_NOEXEC Do not allow files to be executed from the file system. How is a newbie supposed to know to look in mount(2) to find extra failure cases? execve(2) even cross-references mount(8), but this was in connection with the nosuid mount option in a wrong man page: % Index: execve.2 % === % RCS file: /home/ncvs/src/lib/libc/sys/execve.2,v % retrieving revision 1.11 % retrieving revision 1.12 % diff -u -2 -r1.11 -r1.12 % --- execve.2 11 Jan 1998 21:43:38 - 1.11 % +++ execve.2 27 Apr 1999 03:56:10 - 1.12 % @@ -31,5 +31,5 @@ % .\ % .\ @(#)execve.28.5 (Berkeley) 6/1/94 % -.\ $Id$ % +.\ $Id: execve.2,v 1.11 1998/01/11 21:43:38 alex Exp $ % .\ % .Dd June 1, 1994 % @@ -144,4 +144,9 @@ % .ne 1i % .Pp % +The set-ID bits are not honored if the respective file system has the % +.Ar nosuid % +option enabled or if the new process file is an interpreter file. Syscall % +tracing is disabled if effective IDs are changed. % +.Pp % The new process also inherits the following attributes from % the calling process: % @@ -274,4 +279,6 @@ % .Xr exit 3 , % .Xr sysctl 3 , % +.Xr mount 1 , This dangling pointer was fixed to mount(8) in the next commit. But it should have been to mount(2) for MNT_NOSUID. % +.Xr ktrace 1 , % .Xr environ 7 % .Sh HISTORY I strongly dislike general references to man pages for 1 detail. There should be an Xr where each of the nosuid and tracing details is described and none at the end. There are actually many details of interest here, but how is a newbie going to notice this when the committers didn't? Details of interest are at least: - MNT_RDONLY (related to EROFS error) - MNT_NOSUID - MNT_NOEXEC - MNT_ACLS (new) - MNT_QUOTA Better yet, nmount() allows any number of new mount options that might affect
Re: [Patch] Kgmon/Gprof On/Off Switch
On Tue, 6 Jul 2010, Sean Bruno wrote: On Thu, 2010-07-01 at 16:46 -0700, Sean Bruno wrote: Found this lying around the Yahoo tree this week. Basically it allows you to activate, reset and deactivate profiling with the '-f flag. I want something like this, but this implementation has too many style bugs and obscure or missing locking. Kind of nice to have if you want the ability to turn on profiling for debugging live systems. We already have this via `kgmon -hBb', but this patch implements something different. Normal profiling requires configuring the kernel with config -p[p], but that adds a lot of overhead, even when profiling is not turned on. The patch implements dynamic configuration of flat profiling only (since dynamic call graphs are too hard to implement, but this restriction is not mentioned anywhere in the patch or the accompanying mail. Userland profiling has the same lossage -- you have to configure programs specially by compiling with -pg, but that adds a lot of overhead, even when profiling is not turned on. Moreover, in userland there is no library support for turning profiling on and off or for dumping and clearing the statistics except on program start and completion, so the much larger overhead of always maintaining the call graph is normally always present. In FreeBSD-1, I (ab)used a bug in the implementation to turn on flat profiling (only) in the kernel. This avoids the large call graph overhead (it still costs a function call and some other branches on every call and maybe on every return, but the branches for this are predictable so the overhead is not very large. This should be made a standard feature, and perhaps this patch implements it as a bug without really trying, as in FreeBSD-1 (implementing it just takes setting gp-state to a value that gives flat profiling while keeping mcount() null if full profiling is configured. % Index: usr.sbin/kgmon/kgmon.8 % === % --- usr.sbin/kgmon/kgmon.8(revision 209745) % +++ usr.sbin/kgmon/kgmon.8(working copy) % @@ -70,6 +70,9 @@ % Dump the contents of the profile buffers into a % .Pa gmon.out % file. % +.It Fl f % +Free profiling buffer. % +Also stops profiling. % .It Fl r % Reset all the profile buffers. % If the Freeing the profiling buffer alone isn't very useful. The memory wastage from always allocating it at boot time and never freeing it would be rarely noticed, and would fix some races or simplify avoidance of races. However, apart from the race problems, ordinary statically configured profiling should free things too, since unlike in userland it is normal to have profiling turned off most of the time even when it is statically configured. The above doesn't really describe what -f does. In normal profiling, there are multiple profiling buffers and freeing just the flat profiling one makes no sense. -f in fact frees the profiling buffer only if the kernel is _not_ configured for profiling (as is required to not corrupt the set of profiling buffers if the kernel is configured for profiling). The profiling buffer is automatically allocated on first use iff it is not already allocated, and of course -f has no effect if no buffer is allocated. Perhaps -r should automatically deallocate, so -f would not be needed. Kernel profiling has this feature of allowing multiple dumps of the buffer(s) before reset, so reset is a good time to deallocate. Style bugs in the above: f is unsorted between p and r. % Index: usr.sbin/kgmon/kgmon.c % === % --- usr.sbin/kgmon/kgmon.c(revision 209745) % +++ usr.sbin/kgmon/kgmon.c(working copy) % @@ -72,7 +72,7 @@ % struct gmonparam gpm; % }; % % -int Bflag, bflag, hflag, kflag, rflag, pflag; % +int Bflag, bflag, hflag, kflag, rflag, pflag, fflag; Style bugs: now f is unsorted after r and p. p was already unsorted after r. % int debug = 0; % int getprof(struct kvmvars *); % int getprofhz(struct kvmvars *); % @@ -82,6 +82,7 @@ % void dumpstate(struct kvmvars *kvp); % void reset(struct kvmvars *kvp); % static void usage(void); % +voidfreebuf(struct kvmvars *kvp); Style bugs: f is unsorted after u; all the old declarations are indented with 1 tab while the new one is indented with spaces. % % int % main(int argc, char **argv) % @@ -93,7 +94,7 @@ % seteuid(getuid()); % kmemf = NULL; % system = NULL; % - while ((ch = getopt(argc, argv, M:N:Bbhpr)) != -1) { % + while ((ch = getopt(argc, argv, M:N:Bbfhpr)) != -1) { % switch((char)ch) { % % case 'M': % @@ -113,6 +114,10 @@ % bflag = 1; % break; % % + case 'f': % + fflag = 1; % + break; % + % case 'h': % hflag = 1; % break; Now f is correctly sorted in the above 2 places.
Re: [PATCH] RUSAGE_THREAD
On Thu, 6 May 2010, Kostik Belousov wrote: On Thu, May 06, 2010 at 04:42:39PM +0400, Alexander Krizhanovsky wrote: ... Indeed, I concerned more about PROC_LOCK/PROC_SLOCK - they are acquired both in kern_getrusage() and could be a problem in multithreaded process with intensive performance accounting in threads. This happens for example in resource accounting systems for shared hosting which are solved by custom extensions of Apache, MySQL and other servers. We still need per-process lock for resource summation code. It's hard to see how it can be done without any lock at all. The oldest example of per-CPU summation code is in vm_meter.c It doesn't use any locks, but without any lock at all it is fundamentally broken. Incoherencies in the data (due to reading and writing the counters at different unsynchronized times in the past) would be more obvious for times (except when the unsynchronized accesses give an off-by-2**32 error due to wraparound of a 32-bit counter), so I think getrusage() should be more careful than vm_meter. calcru() already does a lot of work and locking just to synchronize times relative to previous readings. On the other hand, I do not understand why p_rux have to be protected by spinlock and not by process mutex. Hm, spinlock disables preemption, but only for curthread, It is probably mostly historical. The read-modify-write of pc_switchtime and some other things used to be protected by splstatclock(). Hmm, that was the lowest spl level, closer to a non-spin mutex than anything. In RELENG_4, calcru() doesn't modify switchtime either, and mi_switch() has comments saying that even the splstatclock() should be unnecessary. RELENG_4 is of course non-preemptive, so it has close to the equivalent of critical_enter() automatically. Process spinlock is not acquired in the mi_switch(), so I really do not see a reason not to change protection to mutex. mi_switch)() holds thread lock, which is a special type of spinlock. I think the following philosphy is adequate for reading times: the time returned shall be no earlier than the time (according to a uniquely selected clock at the point where the time is read) at the beginning of the call, and coherent with itself. (It will be indeterminately far in the past by the time that it gets back to the caller. Times read by a single caller shall be monotonic, but times read by separate callers are incomparable unless the callers do additional synchronization). Thus the locked part of reading a clock is: - lock sufficiently to synchronize the next step - update memory copy of the time, so that it is not in the past - drop locks needed for the update, but keep ones needed to prevent the memory copy changing - read and use the memory copy, reassembling it into the form needed by the caller. This step may be long delayed from the update, but there is no point in trying to minimize the delay for this step, since we must soon drop all time-related locks and thus lose all control over the delay seen by the caller. Only in very critical kernel environments would it be necessary and possible to hold strong locks to minimize the delay across the whole operation of reading the time and acting on the result. I asked about it in my message from 3rd of May and also there I've proposesed following patch for calcru1(): - if ((int64_t)tu 0) { - /* XXX: this should be an assert /phk */ - printf(calcru: negative runtime of %jd usec for pid %d (%s)\n, - (intmax_t)tu, p-p_pid, p-p_comm); - tu = ruxp-rux_tu; - } + KASSERT((int64_t)tu 0, (calcru: negative runtime of %jd usec + for pid %d (%s)\n, + (intmax_t)tu, p-p_pid, p-p_comm)); I did understand your proposal, but, as I said, I have no opinion. This patch is mostly backwards: - backwards condition in KASSERT() - removal of KNF formatting - removal of the fixup when the KASSERT() is null. I think I prefer removing the whole statement. But the KASSERT() may be useful for verifying that the failure case really is unreachable due to programming bugs. I think it reachable due to hardware bugs, but only ones that are less likely than a memory parity error, and the handling of hardware bugs here shouldn't be to panic. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: question about sb-st_blksize in src/sys/kern/vfs_vnops.c
On Fri, 24 Oct 2008, Thierry Herbelot wrote: the [SUBJ] file contains the following extract (around line 705) : * Default to PAGE_SIZE after much discussion. * XXX: min(PAGE_SIZE, vp-v_bufobj.bo_bsize) may be more correct. */ sb-st_blksize = PAGE_SIZE; which arrived around four years ago, with revision 1.211 (see http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h) Indeed, this was completely broken long ago (in 1.211). Before then, and after 1.128, some cases worked as intended if not perfectly: - regular files: file systems still set va_blksize to their idea of the best i/o size (normally to the file system block size, which is normally larger than PAGE_SIZE and probably better in all cases) and this was used here. However, for regular files, the fs block size and the application's i/o size are almost irrelevant in most cases due to vfs clustering. Most large i/o's are done physically with the cluster size (which due to a related bug suite ends up being hard-coded to MAXPHYS (128K) at a minor cost when this is different from the best size). - disk files: non-broken device drivers set si_iosize_best to their idea of the best i/o size (normally to the max i/o size, which is normally better than PAGE_SIZE) and this was used here. The bogus default of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it was for the buffer cache implementation for block devices which no longer exist and was too small for them anyway). - non-disk character-special files: the default of PAGE_SIZE was used. The comment about defaulting to PAGE_SIZE was added in 1.128 and is mainly for this case. Now the comment is nonsense since the value is fixed, not a default. - other file types (fifos, pipes, sockets, ...): these got the default of PAGE_SIZE too. In rev.1.1, st_blksize was set to va_blksize in all cases. So file systems were supposed to set va_blksize reasonably in all cases, but this is not easy and they did nothing good except for regular files. Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for disks. This gave nonsense like 64K buffers for slow tty devices (keyboards) and 2K buffers for fast disks. At least for programs that trust st_blksize o be reasonable. Fortunately, st_blsize is rarely used... the net effect of this change is to decrease the block buffer size used in libc/stdio from 16 kbytes (derived from the underlying ufs partition) to PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth is lowered (this is on a slow Flash). ... except it is used by stdio. (Another mess here is that stdio mostly doesn't use its own BUFSIZ. It trusts st_blksize if fstat() to determine st_blksize works. Of course, the existence of BUFSIZ is a related historical mistake -- no fixed size can work best for all cases. But when BUFSIZ is used, it is an even worse default than PAGE_SIZE.) It's interesting that you can see the difference. Clustering is especially good for hiding slowness on slow devices. Maybe you are using a configuration that makes clustering ineffective. Mounting the file system with -o sync or equivalently, doing a sync after every (too-small) write would do it. Otherwise, writes are normally delated until the next cluster boundary. I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to revert to the block size previoulsly used), and the kernel and world seem to be running fine. Seeing the XXX coment above, I'm a bit worried about keeping this new st_blksize value. are there any drawbacks with running with this bigger buffer size value ? Mostly it doesn't matter, since buffering (clustering) hides the differences. Without clustering, 16K is a much better default for disks than 4K, though not as good as the non-default va_blksize for regular files. Newer disks might prefer 32K or 64k, but then the fs block size should also be increased from 16K. Otherwise, increasing the block size usually reduces performance, by thrashing caches or increasing latencies. With modern cache sizes and disk speeds, you won't see these effects for a block size of 64K, so defaulting to 64K would be reasonable for disks. It would be silly for keyboards, but with modern memory sizes you would notice this even less than when it was that in old versions. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Andriy Gapon wrote: on 12/02/2008 15:11 Bruce Evans said the following: On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: 3.1. for a fresh buf getlbk would assign the following: bsize = bo-bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Me too. Bruce, thank you for putting me back on the ground :-) Even in UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. I think the size is mnt_stat.f_iosize in general (but not for device vnodes). And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So b_iooffset value set in breadn() (using bdtob) is completely ignored. Is this setting ever used (and the corresponding setting for writing) ever used? I remember that this is why g_vfs_* was my primary target. It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In fact, ffs already overrides the setting of bo_bsize for the device vnode to a different wrong setting -- g_vfs_open() sets the sector size, and ffs_mount() changes the setting to fs_bsize, but ffs actually needs the setting to be DEV_BSIZE (I think). Other bugs from this: - ffs_rawread wants the sector size, and it assumes that this is in bo_bsize for the device vnode, but ffs_mount() has changed this to fs_bsize. - multiple r/o mounts are supposed to work, but don't, since there is only one device vnode with a shared bufobj, but the bufobj needs to be per-file-system since all mounts write to it. Various bad things including panics result. There is a well-know panic from bo_private becoming garbage on unmount. I just noticed more possibilities for panics. bo_ops points to static storage, so it never becomes complete garbage. However, at least ffs sets it blindly early on in ffs_mountfs(), before looking at the file system to see if ffs can mount it. Thus if the file system is already mounted by another ffs, then ffs clobbers the other fs's bo_ops. The ffs mount will presumably fail, leaving bo_ops clobbered. Also, a successful g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little earlier. g_vfs_open() is fundamental to looking at the file system, since I/O is not set up until it completes. Clobbering the pointers is most dangerous, but just clobbering bo_bsize breaks blkno calculations for any code that uses bo_bsize. Apart from these bugs, the fix for the blkno calculations for device vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device vp of all disk file systems (since all disk file systems use expect this size). Currently, ffs seems to be the only file system that overrides g_vfs_open()'s default of the sector size. Nothing in any of fs/, gnu/fs/ and contrib/ references bo_bsize. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with correct b_iooffset (the latter being rigidly calculated via bdtob() in buffcache code). So this leaves filesystems dead in the water while reading via device vnode: provide blkno in 512-byte units - screw up VM cache, provide blkno in bo_bsize units - screw up reading from disk. I think I/O on the disk doesn't use bo_bsize, but only the sector size (to scale the byte count to a sector count). Otherwise, ffs's override would break I/O. geom is another place that has few references to bo_bsize -- it just clobbers it in g_vfs_open(). Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. Yes. They can set it more easily as they can tell g_vfs_open() what to set it to. Except, until the bugs are fixed properly, g_vfs_open() can more easily do tests to prevent clobbering. For bo_bsize and bo_ops, sharing a common value is safe and safe changes can be detected by setting to a special value on last unmount. For bo_private, a safety check would probably disallow multiple mounts (since cp is dynamically allocated (?)). Bruce
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of buffcache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512 bytes units throughout. I missed the dbtob() conversions in vfs_bio.c when I replied previously So blkno cannot be a cookie; it must be for a 512-block. So how did the cases where bsize != DEV_BSIZE in getblk() ever work? 3.1. for a fresh buf getlbk would assign the following: bsize = bo-bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. That seems to include nfs(client). In fact, nfs_getcacheblk() does weird scaling which seems to be mainly to compensate for for weird scaling here. It calls getblk() with a bn arg that seems to be f_iosize units. Then at then end, for the VREG case, it sets bp-b_blkno to this bn scaled to normal DEV_BSIZE units. bp-b_blkno seems to have DEV_BSIZE units for all uses of it in nfs. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset (bsize - 1)), (suitable diagnostic)); Barely worth checking. The current bug has nothing to do with this. The offset is just wrong at this point, after using a scale factor that is inconsistent with the units of blkno, for all (?) disk (and other?) file systems whose sector size isn't 512. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: on 06/02/2008 18:29 Andriy Gapon said the following: Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. It passes block number as if block size was 512 bytes (i.e. byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. 4. The block size required for bread() and friends (almost always DEV_BSIZE). In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-adviced and more likely asking for trouble to do requests that are not multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is an UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work, if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. This bug has nothing to do with subsectors, and very little to do with udf. udf is just depending on bread()'s API being unbroken. That API requires starting with blocks consisting of whole sectors (else the subsequent I/O would fail) and converting to blocks of size DEV_BSIZE, phyexcept for unusual (non-disk?) file systems like nfs (and no others?). All (?) disk file systems do this conversion. The bug is just more noticeable for udf since it is used more often on disks with a block size != DEV_BSIZE, and it apparently does something that makes the bug mess up VM more than other file systems. Of course, it might be better for the API to take blocks in units of the sector size, but that is not what it takes, and this can't be changed easily or safely. The standard macro btodb() is often used to convert from bytes to blocks of this size, and doesn't support bo_bsize or the bufobj pointer that would be needed to reach bo_bsize. ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block (frag?)-sized units to blocks in DEV_BSIZE units so as to pass them to unbroken bread() and friends. ffs initializes bo_bsize to its block (not frag) size, and then never uses it directly except in ffs_rawread(), where it is used to check that the I/O is in units of whole sectors (otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest of ffs). The details of fsbtodb() are interesting. They show signs of previous attempts to use units of sectors. From ffs/fs.h: % /* % * Turn filesystem block numbers into disk block addresses. % * This maps filesystem blocks to device size blocks. % */ % #define fsbtodb(fs, b) ((daddr_t)(b) (fs)-fs_fsbtodb) % #define dbtofsb(fs, b) ((b) (fs)-fs_fsbtodb) Creation of fs_fsbtodb is left to newfs. The units of DEV_BSIZE are thus built in to the on-disk data struct (in a fairly easy to change and/or override way, but there are a lot of existing disks). From newfs.c: % realsectorsize = sectorsize; % if (sectorsize != DEV_BSIZE) { /* XXX */ % int secperblk = sectorsize / DEV_BSIZE; % % sectorsize = DEV_BSIZE; % fssize *= secperblk; % if (pp != NULL) % pp-p_size *= secperblk; % } % mkfs(pp, special); Though mkfs() appears to support disk blocks of size sector size = sectorsize, its sectorsize parameter is hacked on here, so it always generates fs_fsbtodb and other derived parameters for disk blocks of size DEV_BSIZE, as is required for the unbroken bread() API. msdosfs is similar to ffs here, except it constructs the shift count at mount time, to arrange for always converting to DEV_BSIZE blocks for passing the bread() and friends. udf is effectively similar, but pessimal and disorganized. For the conversion for bread(), it uses btodb(). In udf_bmap(), it constructs a shift count on every call by subtracting DEV_BSHIFT from its block shift count. In udf_strategy(), on every call it constructs a multiplier instead of a shift count, by dividing its block size by DEV_BSIZE. It's weird that the strategy routing, which will soon end up sending sectors to the disk, is converting to DEV_BSIZE units, a size that cannot be the size for udf's normal media. cd9660 uses btodb() for one conversion for bread() and (its block shift count - DEV_BSHIFT) in 7 other conversions to DEV_BSIZE units. So it's surprising that any file systems work on CDs and DVDs. Maybe the bug affects udf more because udf crosses page boundaries more. It's also surprising that the bug has such small effects. This seems to be because the
Re: Memory allocation performance
On Mon, 4 Feb 2008, Alexander Motin wrote: Kris Kennaway wrote: You can look at the raw output from pmcstat, which is a collection of instruction pointers that you can feed to e.g. addr2line to find out exactly where in those functions the events are occurring. This will often help to track down the precise causes. Thanks to the hint, it was interesting hunting, but it shown nothing. It hits into very simple lines like: bucket = cache-uc_freebucket; cache-uc_allocs++; if (zone-uz_ctor != NULL) { cache-uc_frees++; and so on. There is no loops, there is no inlines or macroses. Nothing! And the only hint about it is a huge number of p4-resource-stalls in those lines. I have no idea what exactly does it means, why does it happens mostly here and how to fight it. Try profiling it one another type of CPU, to get different performance counters but hopefully not very different stalls. If the other CPU doesn't stall at all, put another black mark against P4 and delete your copies of it :-). Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Memory allocation performance
On Fri, 1 Feb 2008, Alexander Motin wrote: Robert Watson wrote: It would be very helpful if you could try doing some analysis with hwpmc -- high resolution profiling is of increasingly limited utility with modern You mean of increasingly greater utility with modern CPUs. Low resolution kernel profiling stopped giving enough resolution in about 1990, and has become of increasingly limited utility since then, but high resolution kernel profiling uses the TSC or possibly a performance counter so it has kept up with CPU speedups. Cache effects and out of order execution are larger now, but they affect all types of profiling and still not too bad with high resulotion kernel profiling. High resolution kernel profiling doesn't really work with SMP, but that is not Alexander's problem since he profiled under UP. CPUs, where even a high frequency timer won't run very often. It's also quite subject to cycle events that align with other timers in the system. No, it isn't affected by either of these. The TSC timer is incremented on every CPU cycle and the performance counters run are incremented on every event. It is completely unaffected by other timers. I have tried hwpmc but still not completely friendly with it. Whole picture is somewhat alike to kgmon's, but it looks very noisy. Is there some know how about how to use it better? hwpmc doesn't work for me either. I can't see how it could work as well as high resolution kernel profiling for events at the single function level, since it is statistics-based. With the statistics clock interrupt rate fairly limited, it just cannot get enough resolution over short runs. Also, it works poorly for me (with a current kernel and ~5.2 userland except for some utilities like pmc*). Generation of profiles stopped working for me, and it often fails with allocation errors. I have tried it for measuring number of instructions. But I am in doubt that instructions is a correct counter for performance measurement as different instructions may have very different execution times depending on many reasons, like cache misses and current memory traffic. Cycle counts are more useful, but high resolution kernel profiling can do this too, with some fixes: - update perfmon for newer CPUs. It is broken even for Athlons (takes a 2 line fix, or more lines with proper #defines and if()s). - select the performance counter to be used for profiling using sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the performance counter for the instruction count (or any). I use hwpmc mainly to determine N :-). It may also be necessary to change the kernel variable cpu_clock_pmc_conf. Configuration of this is unfinished. - use high resolution kernel profiling normally. Note that switching to a perfmon counter is only permitted of !SMP (since it is too unsupported under SMP to do more than panic if permitted under SMP), and that the switch loses the calibration of profiling. Profiling normally compensates for overheads of the profiling itself, and the compensation would work almoost perfectly for event counters, unlike for time-related counters, since the extra events for profiling aren't much affected by caches. I have tried to use tsc to count CPU cycles, but got the error: # pmcstat -n 1 -S tsc -O sample.out pmcstat: ERROR: Cannot allocate system-mode pmc with specification tsc: Operation not supported What have I missed? This might be just because the TSC really is not supported. Many things require an APIC for hwpmc to support them. I get errors allocation like this for operations that work a few times before failing. I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but kernel built without SMP to simplify profiling. What counters can you recommend me to use on it for regular time profiling? Try them all :-). From userland first with an overall count, since looking at the details in gprof output takes too long (and doesn't work for me with hwpmc anyway). I use scripts like the following to try them all from userland: runpm: %%% c=ttcp -n10 -l5 -u -t epsplex ctr=0 while test $ctr -lt 256 do ctr1=$(printf 0x%02x\n $ctr) case $ctr1 in 0x00) src=k8-fp-dispatched-fpu-ops;; 0x01) src=k8-fp-cycles-with-no-fpu-ops-retired;; 0x02) src=k8-fp-dispatched-fpu-fast-flag-ops;; 0x05) src=k8-fp-unknown-$ctr1;; 0x09) src=k8-fp-unknown-$ctr1;; 0x0d) src=k8-fp-unknown-$ctr1;; 0x11) src=k8-fp-unknown-$ctr1;; 0x15) src=k8-fp-unknown-$ctr1;; 0x19) src=k8-fp-unknown-$ctr1;; 0x1d) src=k8-fp-unknown-$ctr1;; 0x20) src=k8-ls-segment-register-load;; # XXX 0x21) src=kx-ls-microarchitectural-resync-by-self-mod-code;; 0x22) src=k8-ls-microarchitectural-resync-by-snoop;; 0x23) src=kx-ls-buffer2-full;; 0x24) src=k8-ls-locked-operation;;# XXX
Re: [patch] Adding optimized kernel copying support - Part III
On Wed, 31 May 2006, Attilio Rao wrote: 2006/5/31, Suleiman Souhlal [EMAIL PROTECTED]: Nice work. Any chance you could also port it to amd64? :-) Not in the near future, I think. :P It is not useful for amd64. An amd64 has enough instruction bandwidth to saturate the L1 cache using 64-bit accesses although not using 32-bit accesses. An amd64 has 64-bit integer registers which can be accesses without the huge setup overheads and code complications for MMX/XMM registers. It already uses 64-bit registers or 64-bit movs for copying and zeroing of course. Perhaps it should use prefetches and nontemporal writes more than it already does, but these don't require using SSE2 instructions like nontemporal writes do for 32-bit CPUs. Does that mean it won't work with SMP and PREEMPTION? Yes it will work (even if I think it needs more testing) but maybe would give lesser performances on SMP|PREEMPTION due to too much traffic on memory/cache. For this I was planing to use non-temporal instructions (obviously benchmarks would be very appreciate). Er, isn't its main point to fix some !SMP assumptions made in the old copying-through-the-FPU code? (The old code is messy due to its avoidance of global changes. It wants to preserve the FPU state on the stack, but this doesn't quite work so it does extra things (still mostly locally) that only work in the !SMP (!SMPng even with UP) case. Patching this approach to work with SMP || SMPng cases would make it messier.) The new code wouldn't behave much differently under SMP. It just might be a smaller optimization because more memory pressure for SMP causes more cache misses for everything and there are no benefits from copying through MMX/XMM unless nontemporal writes are used. All (?) CPUs with MMX or SSE* can saturate main memory using 32-bit instructions. On 32-bit CPUs, the benefits of using MMX/XMM come from being able to saturate the L1 cache on some CPUs (mainly Athlons and not P[2-4]), and from being able to use nontemporal writes on some CPUs (at least AthlonXP via SSE extensions all CPUs with SSE2). Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fwd: 5-STABLE kernel build with icc broken
On Fri, 1 Apr 2005, Matthew Dillon wrote: :The use of the XMM registers is a cpu optimization. Modern CPUs, :especially AMD Athlon and Opterons, are more efficient with 128 bit :moves then with 64 bit moves. I experimented with all sorts of :configurations, including the use of special data caching instructions, :but they had so many special cases and degenerate conditions that :I found that simply using straight XMM instructions, reading as big :a glob as possible, then writing the glob, was by far the best solution. : :Are you sure about that? The amd64 optimization manual says (essentially) This is in 25112.PDF section 5.16 (Interleave Loads and Stores, with 128 bits of loads followed by 128 bits of stores). :that big globs are bad, and my benchmarks confirm this. The best glob size :is 128 bits according to my benchmarks. This can be obtained using 2 :... : :Unfortunately (since I want to avoid using both MMX and XMM), I haven't :managed to make copying through 64-integer registers work as well. :Copying 128 bits at a time using 2 pairs of movq's through integer :registers gives only 7.9GB/sec. movq through MMX is never that slow. :However, movdqu through xmm is even slower (7.4GB/sec). I forgot many of my earlier conclusions when I wrote the above. The speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case are almost irrelevant. They basically just tell how well we have used the instruction bandwidth. Plain movsq uses it better and gets 15.9GB/sec. I believe 15.9GB/sec is from saturating the L1 cache. The CPU is an Athlon64 and its clock frequency is 1994 MHz, and I think the max L1 cache bandwidth is with a 16-byte load and store per cycle; 16*1994*10^6 is 15.95GB/sec (disk manufacturers GB's). Plain movsq is best here for many other cases too... : :The fully cached case is too unrepresentative of normal use, and normal :(partially cached) use is hard to bencmark, so I normally benchmark :the fully uncached case. For that, movnt* is best for benchmarks but :not for general use, and it hardly matters which registers are used. Yah, I'm pretty sure. I tested the fully cached (L1), partially cached (L2), and the fully uncached cases. I don't have a logic By the partially cached case, I meant the case where some of the source and/or target addresses are in the L1 or L2 cache, but you don't really the chance that they are there (or should be there after the copy), so you can only guess the best strategy. analyzer but what I think is happening is that the cpu's write buffer is messing around with the reads and causing extra RAS cycles to occur. I also tested using various combinations of movdqa, movntdq, and prefetcha. Somehow I'm only seeing small variations from different strategies now, with all tests done in userland on an Athlon64 system (and on athlonXP systems for reference). Using XMM or MMX can be twice as fast on the AthlonXPs, but movsq is absolutely the fastest in many cases on the Athlon64, and is 5% slower than the fastest in all cases (except for the fully uncached case since it can't do nontemporal stores), so it is the best general method. ... I also think there might be some odd instruction pipeline effects that skew the results when only one or two instructions are between the load into an %xmm register and the store from the same register. I tried using 2, 4, and 8 XMM registers. 8 XMM registers seemed to work the best. I'm getting only small variations from different load/store patterns. Of course, I primarily tested on an Athlon 64 3200+, so YMMV. (One of the first Athlon 64's, so it has a 1MB L2 cache). My test system is very similar: %%% CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU) Origin = AuthenticAMD Id = 0xf48 Stepping = 8 Features=0x78bfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2 AMD Features=0xe0500800SYSCALL,NX,MMX+,LM,3DNow+,3DNow L1 2MB data TLB: 8 entries, fully associative L1 2MB instruction TLB: 8 entries, fully associative L1 4KB data TLB: 32 entries, fully associative L1 4KB instruction TLB: 32 entries, fully associative L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative L2 2MB unified TLB: 0 entries, disabled/not present L2 4KB data TLB: 512 entries, 4-way associative L2 4KB instruction TLB: 512 entries, 4-way associative L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative %%% The prefetchnta I have commented out seemed to improve performance, but it requires 3dNOW and I didn't want to NOT have an MMX copy mode for cpu's with MMX but without 3dNOW. Prefetching less then 128 bytes did not help, and prefetching greater then 128 bytes (e.g. 256(%esi)) seemed to cause extra RAS cycles. It was unbelievably finicky, not at all what I expected.
Re: Fwd: 5-STABLE kernel build with icc broken
On Thu, 31 Mar 2005, Matthew Dillon wrote: I didn't mean to get into the kernel's use of the FPU, but... All I really did was implement a comment that DG had made many years ago in the PCB structure about making the FPU save area a pointer rather then hardwiring it into the PCB. ISTR writing something like that. dg committed most of my early work since I didn't have commit access at the time. ... The use of the XMM registers is a cpu optimization. Modern CPUs, especially AMD Athlon and Opterons, are more efficient with 128 bit moves then with 64 bit moves. I experimented with all sorts of configurations, including the use of special data caching instructions, but they had so many special cases and degenerate conditions that I found that simply using straight XMM instructions, reading as big a glob as possible, then writing the glob, was by far the best solution. Are you sure about that? The amd64 optimization manual says (essentially) that big globs are bad, and my benchmarks confirm this. The best glob size is 128 bits according to my benchmarks. This can be obtained using 2 64-bit reads of 64-bit registers followed by 2 64-bit writes of these registers, or by read-write of a single 128-bit register. The 64-bit registers can be either MMX or integer registers on 64-bit systems, but the 128-registers must be XMM on all systems. I get identical speeds of 12.9GB/sec (+-0.1GB/sec) on a fairly old and slow Athlon64 system for copying 16K (fully cached) through MMX and XMM 128 bits at a time using the following instructions: # MMX: # XMM movq(%0),%mm0 movdqa (%0),%xmm0 movq8(%0),%mm1 movdqa %xmm0,(%1) movq%mm0,(%1) ... # unroll same amount movq%mm1,8(%1) ... # unroll to copy 64 bytes per iteration Unfortunately (since I want to avoid using both MMX and XMM), I haven't managed to make copying through 64-integer registers work as well. Copying 128 bits at a time using 2 pairs of movq's through integer registers gives only 7.9GB/sec. movq through MMX is never that slow. However, movdqu through xmm is even slower (7.4GB/sec). The fully cached case is too unrepresentative of normal use, and normal (partially cached) use is hard to bencmark, so I normally benchmark the fully uncached case. For that, movnt* is best for benchmarks but not for general use, and it hardly matters which registers are used. The key for fast block copying is to not issue any memory writes other then those related directly to the data being copied. This avoids unnecessary RAS cycles which would otherwise kill copying performance. In tests I found that copying multi-page blocks in a single loop was far more efficient then copying data page-by-page precisely because page-by-page copying was too complex to be able to avoid extranious writes to memory unrelated to the target buffer inbetween each page copy. By page-by-page, do you mean prefetch a page at a time into the L1 cache? I've noticed strange loss (apparently) from extraneous reads or writes more for benchmarks that do just (very large) writes. An at least old Celerons and AthlonXPs, the writes go straight to the L1/L2 caches (unless you use movntq on AthlonXP's). The caches are flushed to main memory some time later, apparently not very well since some pages take more than twice as long to write as others (as seen by the writer filling the caches), and the slow case happens enough to affect the average write speed by up to 50%. This problem can be reduced by putting memory bank bits in the page colors. This is hard to get right even for the simple unrepresentative case of large writes. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fwd: 5-STABLE kernel build with icc broken
On Wed, 30 Mar 2005, David Schultz wrote: On Wed, Mar 30, 2005, Peter Jeremy wrote: On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote: Later in that thread they discuss skipping the restore state to make things faster. The minimum buffer size they say this will be good for is between 2-4k. Does this make sense, or am I showing my ignorance? http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html Yes. There are a variety of options for saving/restoring FPU state: a) save FPU state on kernel entry b) save FPU state on a context switch (or if the kernel wants the FPU) c) only save FPU state if a different process (or the kernel) wants the FPU 1) restore FPU on kernel exit 2) restore FPU state if a process wants the FPU. a and 1 are the most obvious - that's the way the integer registers are handled. I thought FreeBSD used to be c2 but it seems it has been changed to b2 since I looked last. No, it always used b2. I never got around to implementing c2. Linux used to implement c2 on i386's, but I think it switched (to b2?) to optimize (or at least simplify) the SMP case. Based on the mail above, it looks like Dfly was changed from 1 to 2 (I'm not sure if it's 'a' or 'c' on save). 'a' seems to be too inefficient to ever use. '1' makes sense if it rarely happens and/or the kernel can often use the FPU more than once per entry (which it probably shouldn't), but it gives complications like the ones for SMP, especially in FreeBSD where the kernel can be preempted. Saving FP state as needed is simplest but can be slow. My Athlon-with- SSE-extensions pagecopy and pagezero routines use the FPU (XMM) but their FP state save isn't slow because only 1 or 2 XMM registers needs to be saved. E.g., the saving part of sse_pagezero_for_some_athlons() is: pushfl # Also have to save %eflags. cli # Switch %eflags as needed to safely use FPU. movl%cr0,%eax # Also have to save %cr0. clts# Switch %cr0 as needed to use FPU. subl$16,%esp# Space to save some FP state. movups %xmm0,(%esp)# Save some FP state. Only this much needed. On the i386 (and probably most other CPUs), you can place the FPU into am unavailable state. This means that any attempt to use it will trigger a trap. The kernel will then restore FPU state and return. On a normal system call, if the FPU hasn't been used, the kernel will see that it's still in an unavailable state and can avoid saving the state. (On an i386, unavailable state is achieved by either setting CR0_TS or CR0_EM). This means you avoid having to always restore FPU state at the expense of an additional trap if the process actually uses the FPU. I remember that you (Peter) did extensive benchmarks of this. I still think fully lazy switching (c2) is the best general method. Maybe FP state should be loaded in advance based on FPU affinity. It might be good for CPU affinity to depend on FPU use (prfer not to switch threads away from a CPU if they own that CPU via its FPU). This is basically what FreeBSD does on i386 and amd64. (As a disclaimer, I haven't read the code very carefully, so I might be missing some of the details.) Upon taking a trap for a process that has never used the FPU before, we save the FPU state for the last process to use the FPU, then load a fresh FPU state. On We don't save the FPU state for the last thread then (c2 behaviour) since we have already saved it it when we switched away from it. npxdna() panics if we haven't done that. Except rev.1.131 added bogus code (apparently to debug or hide bugs in the other changes in rev.1.131) that breaks the panic in the fpcurthread == curthread case. subsequent context switches, the FPU state for processes that have already used the FPU gets loaded before entering user mode, I think. I haven't studied the code in enough detail to know what No, that doesn't happen. Instead, cpu_switch() has called npxsave() on the context switch away from the thread. npxsave() arranges for a trap on the next use of the FPU, and we don't do anything more with the FPU context of the thread until the thread tries to use the FPU (in userland). Then we take the trap and load the saved context in npxdna(). happens for SMP, where a process could be scheduled on a different processor before its FPU state is saved on the first processor. There is no difference for SMP, but there would be large complicated differences if we did fully lazy saving. npxdna() would have to do something like sending an IPI to the thread that owns the FPU if that thread could be different from curthread. This would be slow, but might be worth doing if it didn't happen much and if lazy fully lazy context switching were a significant advantage. I think it could be arranged to not happen much, but the advantage is insignificant. BTW, David and I recently found a bug in the context switching in the fxsr case, at least on
Re: Fwd: 5-STABLE kernel build with icc broken
On Thu, 31 Mar 2005, Peter Jeremy wrote: On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote: I still think fully lazy switching (c2) is the best general method. I think it depends on the FP workload. It's a definite win if there is exactly one FP thread - in this case the FPU state never needs to be saved (and you could even optimise away the DNA trap by clearing the TS and EM bits if the switched-to curthread is fputhread). I think stopping the trap would be the usual method (not sure what Linux did), but to collect statistics for determining affinity you would want to take the trap anyway. The worst case is two (or more) FP-intensive threads - in this case, lazy switching is of no benefit. The DNA trap overheads mean that the performance is worse than just saving/restoring the FP state during a context switch. My guess is that the current generation workstation is closer to the second case - current generation graphical bloatware uses a lot of FP for rendering, not to mention that the idle task has a reasonable chance of being an FP-intensive distributed computing task (setiathome or similar). It's probably time to do some more measuring (I'm not offering just now, I have lots of other things on my TODO list). Bloatware might be so hoggish that it rarely makes context switches :-). Context switches for interrupts increase the problem though, as would using FP more in the kernel. BTW, David and I recently found a bug in the context switching in the fxsr case, at least on Athlon-XP's and AMD64's. I gather this is not noticable unless the application is doing its own FPU save/restore. Is there a solution or work-around? It's most noticeable for debugging, and if you worry about leaking thread context. Fortunately, the last-instruction pointers won't have real user data in them unless the application encodes it there intentionally. I can't see any efficent solution or workaround. The kernel should do a full save/restore for processes being debugged. For applications, the bug seems to be larger. Even if they know about the amd behaviour and do a full save/restore because they need it, it won't work because the kernel doesn't preserve the state across context switches. Applications like vmware might care more than most. I forgot to mention that we couldn't find anything in intel manuals about this behaviour, so it might be completely amd-specific. Also, the instruction pointers are fundamentally broken for 64-bit CPUs, since although they are 64 bits, they have the segment selector encoded in their top 32 bits, so they are not really different from the 32:32 selector:pointer format for the non-fxsr case. Their format is specified by SSE2 so 64-bit extensions would have to be elsewhere, but amd64 doesn't seem to extend them. Bruce ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: why timeradd() and co are hidden from the kernel space ?
On Thu, 17 Jun 2004, Cyrille Lefevre wrote: is there any reasons to hide timeradd() and co in sys/time.h from the kernel space (! _KERNEL) ? Yes. This prevents them being used in the kernel. They are compatibility cruft for NetBSD. In the kernel the corresponding interfaces are spelled timevaladd(), etc. Unfortunately, the timeradd() form of these interfaces escaped to userland via NetBSD. Bruce ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ATA/CHS problem (path + new information)
On Sun, 11 Apr 2004, Roman Kurakin wrote: I remind you that now I have two problems. First one that FreeBSD uses wrong assumption about which device should be CHS and which LBA: if (!ad_version(atadev-param-version_major) || !(atadev-param-atavalid ATA_FLAG_54_58) || !lbasize) atadev-flags |= ATA_D_USE_CHS; True ATA device may not have ATA_FLAG_54_58 valid bit, and also due to last ATA standard this bit is obsoleted. I also want to know why ata driver doesn't check LBA support from word 49? May be this one check could solve my problems and didn't breake code for non-ATA devices. Possibly for similar reasons. It's hard to tell what's in the LBA bit for pre-ATA devices older than LBA. Similarly for the lbasize words, but it's easier to do a sanity check on a 32-bit values that a 1-bit flag. Second one, that only 20G part of my hard disk works with CHS. This is other side of the same problem. Device should work in CHS mode. And it works witch ICH5 controller. But with ICH2 it doesn't with out hack. I've checked standard again and I sew command 91h (Initialize drive parameters). Check out the commands for limiting the (apparent) disk size. IIRC, the CHS limit can be set independently of the LBA limit, and some settings are harder than others so that they can't be cleared by old commands like 0x91. The limits may be set to prevent old drivers which only understand old commands from becoming confused by trying to actually use the whole disk. Bruce ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: suspect spl*() code in syscons.c
On Tue, 21 Oct 2003, Luigi Rizzo wrote: both -current and -stable have the following snippet of code in sys/dev/syscons/syscons.c:scclose(): { ... int s; if (SC_VTY(dev) != SC_CONSOLECTL) { ... s = spltty(); ... } spltty(); (*linesw[tp-t_line].l_close)(tp, flag); ttyclose(tp); spl0(); return(0); } Note that the omitted code never does any spl*() call, nor it uses the saved value anymore. Also, i am a bit suspicious about the spltty()/spl0() sequence. Can someone explain if this code is correct ? (I have Bcc-ed the committers involved in writing this code, maybe they know the answer). The initial ipl state is known to be that given by spl0(), so there is no need to save the initial state to restore it using splx(); spl0() sufficies. However, when the omitted code neglects to restore the state, the initial state is not that given by spl0() when the omitted code is executed. Then using spl0() instead of splx() apparently-accidentally fixes the bug. The scope of the spltty() seems to be a bit large but I don't know what it should be exactly. Ideally, the whole device close should be atomic (and as a prerequisite, non-interruptible), but that is not possible for all console drivers although it may be possible for syscons. I have noticed the following incompletely handled races in the corresponding code for the sio driver: 1. Last-close may block, mainly in l_close. Quoting the above code again: spltty(); (*linesw[tp-t_line].l_close)(tp, flag); ttyclose(tp); spl0(); So despite holding spltty() here, we may block (mainly waiting for output to drain). Perhaps we have to wait for output to drain even for synchcronous consoles like syscons, since their output may be blocked by something like scroll lock for syscons (I don't know where this block actually occurs). While we are blocked, we have to drop the ipl so we may be affected by signals. But generally signals aren't a problem here. More the reverse -- you can have a process waiting forever in exit() for output to drain and want to kill it but can't because the process has committed to exiting and doesn't accept any more signals. 2. The vfs permits (non-first) opens to succeed while last-close is blocked. For sio, this is useful behaviour. You can use it to issue ioctls to unblock last-closes that would otherwise be blocked until the next reboot. However, the driver doesn't really understand this so it may make a mess of its state when the last-close completes despite the device ending up open at the file level. Bruce ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: flush on close
On Wed, 10 Sep 2003, Eno Thereska wrote: In FreeBSD 4.4, I am noticing a huge number of calls to ffs_fsync() (in /sys/ufs/ffs/ffs_vnops.c) when running a benchmark like Postmark. ffs_fsync() flushes all dirty buffers with an open file to disk. Normally this function would be called either because the application writer explicitly flushes the file, or if the syncer daemon or buffer daemon decide it's time for the dirty blocks to go to disk. Neither of these two options is happening. Files are opened and closed very frequently though. I have a suspicion that BSD is using the flush-on-close semantic. Could someone confirm or reject this claim? If confirmed, I am wondering how to get rid of it... ffs_fsync() is (or should be) rarely called except as a result of applications calling fsync(2) or sync(2). It is not normally called by the syncer daemon or buffer daemon (seems to be not at all in 4.4, though -current calls it from vfs-bio.c when there are too many dirty buffers, and benchmarks like postmark might trigger this). In 4.4 the only relevant VOP_FSYNC() seems to be the one in vinvalbuf(). Using lots of vnodes might cause this to be called a lot, but this should only cause a lot of i/o in ffs_fsync() if a lot is really needed. Dirty buffers for vnodes which will soon be deleted are supposed to be discarded in ffs_fsync(). Benchmarks that do lots of i/o to short-lived files tend to cause too much physical i/o, but this is because the i/o is done by the buffer (?) deamon before ffs_fsync() can discard it. Bruce ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: seeking help to rewrite the msdos filesystem
On Tue, 12 Nov 2002, Tomas Pluskal wrote: I believe that everybody here knows about the slow msdosfs problem, that is AFAIK caused by implementation without clustering. Which problem. msdosfs has a number of small problems. Mostly they don't matter. For me this is very annoying, because I use digital camera, and ZIP drive, and FAT on both of them. Speed is about 10 times lower than it could be.. ZIP drives have much larger speed problems thn msdosfs. msdosfs happens to be a good way to get the worst out of them. They have a minumum i/o overhead of 20 msec (at least for all the 100MB ones that I tried), so if you use msdosfs's minimum block size of 512 then their maximum speed is 25K/sec which is about 40 times slower than it could be. The default block size of 2K gives a speed which is about 10 times slower than it could be. The ffs default block size of 16K gives a speed which is only about 1.25 times slower than it could be. E.g.: %%% Script started on Wed Nov 13 22:13:53 2002 ttyv1:root@gamplex:/tmp newfs /dev/afd0 /dev/afd0: 96.0MB (196608 sectors) block size 16384, fragment size 2048 using 4 cylinder groups of 24.02MB, 1537 blks, 3200 inodes. super-block backups (for fsck -b #) at: 32, 49216, 98400, 147584 newfs: ioctl (DIOCWDINFO): /dev/afd0: can't rewrite disk label: Operation not supported by device ttyv1:root@gamplex:/tmp mount /dev/afd0 /mnt ttyv1:root@gamplex:/tmp dd if=/dev/zero of=/mnt/zz bs=1m count=20 time umount /mnt 20+0 records in 20+0 records out 20971520 bytes transferred in 18.827154 secs (1113898 bytes/sec) ttyv1:root@gamplex:/tmp time umount /mnt 0.29 real 0.00 user 0.02 sys ttyv1:root@gamplex:/tmp newfs_msdos -b 16384 /dev/afd0 /dev/afd0: 196512 sectors in 6141 FAT16 clusters (16384 bytes/cluster) bps=512 spc=32 res=1 nft=2 rde=512 mid=0xf0 spf=24 spt=32 hds=64 hid=0 bsec=196608 ttyv1:root@gamplex:/tmp mount -t msdosfs /dev/afd0 /mnt ttyv1:root@gamplex:/tmp dd if=/dev/zero of=/mnt/zz bs=1m count=20 time umount /mnt 20+0 records in 20+0 records out 20971520 bytes transferred in 27.729786 secs (756281 bytes/sec) ttyv1:root@gamplex:/tmp time umount /mnt 5.57 real 0.00 user 0.03 sys ttyv1:root@gamplex:/tmp exit Script done on Wed Nov 13 22:16:06 2002 %%% The above could be calculations are based on a speed of 1000K/sec. My test drive can't quite reach this using raw reads with a block size of 64K, but ffs clusters the data so well that it exceeds this speed for writes. msdosfs with a block size of 16K achieves about 63% of this speed (not the 87.5% suggested by the naive calculations). My times are with some small improvements which I think don't affect the tests much (they affect latency more than throughput). With lots of small files (smaller than the block size), clustering doesn't makes even less difference; however, msdosfs doesn't support soft updates or async mounts so it it is about as slow as plain ffs (in my test of writing 1000 files of size 512, msdosfs is actually only 5 times slower than ffs with soft updates or async; plain ffs is about 7.5 times slower). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Fri, 11 Oct 2002, Terry Lambert wrote: Bruce Evans wrote: I know it's not fashionable to write code that's portable to compilers other than GCC, but even if FreeBSD is going to ignore portability for it's own source code, it's probably unreasonable to expect ACE to ignore portability for theirs. Undefined symbols being 0 in cpp in expressions was in early C compilers if not the original one. Consult your archives for a posting 10-15 years or so ago by dmr in comp.std.c about him checking this in his archives. You mean in preprocessor expressions, of course. Oop. I meant only one in in being 0 in cpp in [sic] expressions. This can't be the case; specifically, the sysconf() test will only work at runtime, which means that the symbols had to be there and resolvable at link time. Symbols can be resolved at runtime using dlopen(), etc. They would only actually be available on systems where sysconf() says that they are. You can't depend on people using the standard C library via dlopen, as opposed to, uh, linking. Yes, a more careful reading of POSIX shows that if _POSIX_REALTIME_SIGNAL_ is undefined, then the (compile-time) system makes no claims as to the feature being either supported or unsupported. (Systems should leave it undefined it they are clueless about it.) The feature may still work at runtime, but the compile-time system makes no claims that it will or won't. Applications would have to do something magic to use it at runtime if it turns out to be present at runtime. It could easily be present at runtime because you update the OS. Uh, the 1990 standard, which allowed #if is only 12 years old. #if is in KR1 (1978). According to my first edition The C Programming Language: 12.3 Conditional compilation A compiler control line of the form #if constant-expression checks whether the constant expression (see section 15) evaluates to non-zero. ... 15 Constant expressions In several places C requires expressions which evaluate to a constant: after case, as array bounds, and in initializers. In the first two cases, the expression can involve only integer constants, character constants, and sizeof expressions, possibly connected by the binary operators + - * / % | ^ == != = = or by the unary operators - ~ or by the ternary operator ?: ...no or ||... sorry. Also, defined() is not defined. I think you will find that and || in #if statements are later extensions. The value of undefined macros in preprocessor expressions being assumed to be zero is also not documented in the book. I would have to imagine that they were not considered constant expressions. This book has lots of bugs (mostly from being too informal and/or omitting necessary details). Note that section 15 doesn't even mention and || working in non-cpp constant expressions. But they cetainly worked in expressions, and expressions with only constants in them are (informally) constant expessions. I think you are right that defined() didn't exist then. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Thu, 10 Oct 2002, Terry Lambert wrote: Bruce Evans wrote: In Standard C, this is equivalent to the non-verbose version: #if _POSIX_REALTIME_SIGNALS != -1 ... #endif since if _POSIX_REALTIME_SIGNALS is not defined then it is equivalent to 0 in cpp expressions. The problem cases are if _POSIX_REALTIME_SIGNALS is defined to empty or garbage, but these are not permitted in POSIX.1-2001. These cases were permitted for many feature test macros in POSIX.1-1990. I know it's not fashionable to write code that's portable to compilers other than GCC, but even if FreeBSD is going to ignore portability for it's own source code, it's probably unreasonable to expect ACE to ignore portability for theirs. Undefined symbols being 0 in cpp in expressions was in early C compilers if not the original one. Consult your archives for a posting 10-15 years or so ago by dmr in comp.std.c about him checking this in his archives. Sigh. Why did the POSIX guys do this? :( Perhaps because they wanted you to use sysconf() instead of these mistakes. Actually, they didn't do this. _POSIX_REALTIME_SIGNALS is specified to have value 0, -1 or 200xxxL (draft 7 says xxx; I think the final standard says 112). This can't be the case; specifically, the sysconf() test will only work at runtime, which means that the symbols had to be there and resolvable at link time. Symbols can be resolved at runtime using dlopen(), etc. They would only actually be available on systems where sysconf() says that they are. BTW, I think that: #if defined(_POSIX_REALTIME_SIGNALS) (_POSIX_REALTIME_SIGNALS != -1 ) should suffice, but I'll double-check with one of my portability gurus to see if that is OK for ACE. This is variant of the above verbose version. It requires slightly more modern compilers, so it might fail for some 20-year old pre-Standard C compilers instead of only for some 25-year old ones. Uh, the 1990 standard, which allowed #if is only 12 years old. #if is in KR1 (1978). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Thu, 10 Oct 2002, Terry Lambert wrote: Bruce Evans wrote: _POSIX_REALTIME_SIGNALS is undefined: Apparently the same as when it is defined to 0, except you cannot assume that anything related to it works until you call sysconf(), so you must not reference its interfaces statically, and must use a dll or something that references it. The dll is presumably available on systems that support it but not (except possibly a dummy version) on systems that don't support it. I think the case where the symbol is undefined should never be implemented in practice. It can be reduced to the case where the symbol is 0 using dynamic linkage with the complications for linkage not visible to the application. I think you will have to go back in time, for this to happen. As things stand today, there are systems where it's undefined that were implemented before the symbol was a twinkle in some feature-test weenie on the POSIX committee's eye. _POSIX_REALTIME_SIGNALS is only valid in versions of POSIX that support it. Applications must also conditionalize on _POSIX_VERSION if they want to check for features that are not in all versions. Runtime configuration only lets us go forward in time. An application might use POSIX realtime features if they are available and magically start working better (without recompiling anything in the application) when the OS and/or library implementors get around to implementing the features. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Thu, 10 Oct 2002, Craig Rodrigues wrote: On Thu, Oct 10, 2002 at 09:31:56PM +1000, Bruce Evans wrote: Perhaps because they wanted you to use sysconf() instead of these mistakes. Well in the case of ACE, it is a C++ library that is compiled on platforms which may or may not have sysconf() (ie. Windows), so using sysconf() is not practical in this case. Checking a feature macro is much easier. How can new POSIX interfaces and new POSIX feature test macros work on systems that don't have ancient POSIX interfaces like sysconf(). As may have been clarified in other meesages in this thread, you need a new version of POSIX even to interpret _POSIX_REALTIME_SIGNALS. It is in POSIX.1-1996 and perhaps in earlier versions of POSIX (not including at least the 1990 one), so it is meaningless unless _POSIX_VERSION 199mumble. I used a variant your patch for this in PR 35924 until recently when ... This patch works for me. I think it is just as easy to just remove cruft from the header file entirely, but since your patch effectively does the same thing and has informative comments, that is fine. I slightly prefer to ifdef them, since an implementation is planned. If your patch (or some equivalent variant) is committed, then I think PR 35924 can be closed. Something needs to be done about these prototypes. OK, I will clean up the patch and commit it. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Wed, 9 Oct 2002, Craig Rodrigues wrote: On Wed, Oct 09, 2002 at 07:29:48PM -0700, Terry Lambert wrote: To be totally correct, you will need to: #ifdef _POSIX_REALTIME_SIGNALS #if (_POSIX_REALTIME_SIGNALS != -1) ... #endif #endif It's annoying, but doing this will ensure that there are no gaps through which some system other than FreeBSD might fall. In Standard C, this is equivalent to the non-verbose version: #if _POSIX_REALTIME_SIGNALS != -1 ... #endif since if _POSIX_REALTIME_SIGNALS is not defined then it is equivalent to 0 in cpp expressions. The problem cases are if _POSIX_REALTIME_SIGNALS is defined to empty or garbage, but these are not permitted in POSIX.1-2001. These cases were permitted for many feature test macros in POSIX.1-1990. Sigh. Why did the POSIX guys do this? :( Perhaps because they wanted you to use sysconf() instead of these mistakes. Actually, they didn't do this. _POSIX_REALTIME_SIGNALS is specified to have value 0, -1 or 200xxxL (draft 7 says xxx; I think the final standard says 112). BTW, I think that: #if defined(_POSIX_REALTIME_SIGNALS) (_POSIX_REALTIME_SIGNALS != -1 ) should suffice, but I'll double-check with one of my portability gurus to see if that is OK for ACE. This is variant of the above verbose version. It requires slightly more modern compilers, so it might fail for some 20-year old pre-Standard C compilers instead of only for some 25-year old ones. I have another request. Even though my preprocessor check was bogus, ACE still compiled, and I did not discover that there were any problems until link time, ie. I had a libACE.so library which could not link with anything because of unresolved symbols. This was very annoying. It would have been nicer if this could have been caught earlier in the compile stage. Since sigqueue(), sigwait(), sigwaitinfo() are not implemented in FreeBSD, is this patch OK? --- src/include/signal.h.orig Wed Oct 9 23:15:21 2002 +++ src/include/signal.h Wed Oct 9 23:15:31 2002 @@ -76,13 +76,6 @@ int sigwait(const sigset_t *, int *); #endif -#if __BSD_VISIBLE || __POSIX_VISIBLE = 199506 || __XSI_VISIBLE = 600 -int sigqueue(__pid_t, int, const union sigval); -int sigtimedwait(const sigset_t * __restrict, siginfo_t * __restrict, -const struct timespec * __restrict); -int sigwaitinfo(const sigset_t * __restrict, siginfo_t * __restrict); -#endif - #if __BSD_VISIBLE || __POSIX_VISIBLE = 200112 || __XSI_VISIBLE int killpg(__pid_t, int); int sigaltstack(const stack_t * __restrict, stack_t * __restrict); At some point in the future when POSIX RT signals are implemented in FreeBSD (I'm not volunteering :), then _POSIX_REALTIME_SIGNALS can be defined to 200112L in unistd.h, and these three prototypes can be put back into signal.h. Is this OK? I used a variant your patch for this in PR 35924 until recently when I noticed that it usually worked for the bogus reason that unistd.h is not included, then _POSIX_REALTIME_SIGNALS is only defined accidentally (to whatever value). I now use just #if 0 and an XXX comment as a reminder to fix this someday: %%% Index: signal.h === RCS file: /home/ncvs/src/include/signal.h,v retrieving revision 1.19 diff -u -2 -r1.19 signal.h --- signal.h6 Oct 2002 21:54:08 - 1.19 +++ signal.h7 Oct 2002 07:06:19 - @@ -78,9 +79,18 @@ #if __BSD_VISIBLE || __POSIX_VISIBLE = 199506 || __XSI_VISIBLE = 600 +#if 0 +/* + * PR: 35924 + * XXX we don't actually have these. We set _POSIX_REALTIME_SIGNALS to + * -1 to show that we don't have them, but this symbol is not necessarily + * in scope (in the current implementation), so we can't use it here. + */ intsigqueue(__pid_t, int, const union sigval); +struct timespec; intsigtimedwait(const sigset_t * __restrict, siginfo_t * __restrict, const struct timespec * __restrict); intsigwaitinfo(const sigset_t * __restrict, siginfo_t * __restrict); #endif +#endif #if __BSD_VISIBLE || __POSIX_VISIBLE = 200112 || __XSI_VISIBLE %%% The patch also moves the forward declaration of struct timespec near to the one place that it is used (related patches not included). This mainly makes it obvious that the messy visibility for it is the same as the one for the function that uses it. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Problem detecting POSIX symbolic constants
On Wed, 9 Oct 2002, Craig Rodrigues wrote: Earlier this year on the FreeBSD hackers mailing list: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=278142+0+/usr/local/www/db/text/2002/freebsd-hackers/20020317.freebsd-hackers I was advised by Terry Lambert to use: #ifdef _POSIX_REALTIME_SIGNALS to detect if sigqueue(), sigtimedwait, sigwaitinfo() were defined. I made this change to the FreeBSD configuration for ACE, a C++ library used for systems programming ( http://www.cs.wustl.edu/~schmidt/ACE.html ). The change I made worked fine in -STABLE. However, in -CURRENT, this test breaks, because _POSIX_REALTIME_SIGNALS is defined, but it is -1. According to the letter of the law: http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap02.html The following symbolic constants shall either be undefined or defined with a value other than -1. This law only covers 2 symbols (_POSIX_CHOWN_RESTRICTED and _POSIX_NO_TRUNC). _POSIX_REALTIME_SIGNALS is covered by more general and complicated laws. It may be undefined or defined to 0, -1 or 2001mumbleL. So, this is legal way to implement these macros. It just breaks my code. :) Can I appeal to the freebsd-standards team to leave these macros undefined instead of defining them to -1? #ifdef/#ifndef is a pretty common way to detect if a feature is available on a system, especially when used in conjunction with something like autoconf. No. If they were undefined (and _POSIX_VERSION is large enough), then their undefinedness means that applications must use sysconf() to determine if they work. Other cases are simpler: _POSIX_REALTIME_SIGNALS is defined to -1: This means that the interface is not supported. Applications shouldn't use it at either compile time, link time or runtime, since it might not exist in headers or libraries. _POSIX_REALTIME_SIGNALS is defined to 0: This means that the interface may work, and that it exists in headers and libraries, so applications may reference it in normal ways. It may fail at runtime; applications must use sysconf() to determine if it is actually _POSIX_REALTIME_SIGNALS is defined to 2001mumbleL: This means that the interface just works. sysconf() may be used to check that it works, but sysconf() must not fail. _POSIX_REALTIME_SIGNALS is undefined: Apparently the same as when it is defined to 0, except you cannot assume that anything related to it works until you call sysconf(), so you must not reference its interfaces statically, and must use a dll or something that references it. The dll is presumably available on systems that support it but not (except possibly a dummy version) on systems that don't support it. I think the case where the symbol is undefined should never be implemented in practice. It can be reduced to the case where the symbol is 0 using dynamic linkage with the complications for linkage not visible to the application. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: [ GEOM tests ] vinum drives lost
On Fri, 4 Oct 2002, Poul-Henning Kamp wrote: Worst case you will have the option to use: options NOGEOM options vinum A NOGEOM option would be as acceptable as a NOFFS option for turning off forcing of the one true file system down everyone's throats. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: [ GEOM tests ] vinum drives lost
On Sat, 5 Oct 2002, Peter Wemm wrote: Bruce Evans wrote: On Fri, 4 Oct 2002, Poul-Henning Kamp wrote: Worst case you will have the option to use: options NOGEOM options vinum A NOGEOM option would be as acceptable as a NOFFS option for turning off forcing of the one true file system down everyone's throats. Part of the problem there is a weakness in config that I've threatened to fix on more than one occasion. We do not have a way to have options default to on and let people turn the option off. Negative options (options NOFOO) are a poor substitute. In the past, a couple of things were unifdefed that might have been better served as being 'default to on' options or drivers. Hmm. Negative options implemented as negoptions FOO would work OK for this. Options could be defaulted to on by putting them in an included config file, and then turned off using negoptions. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: vmware reads disk on non-sector boundary
On Thu, 3 Oct 2002, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Bakul Shah writes: How hard would it be to bring back block devices without GEOM? Not at all hard, pretty trivial in fact. The easiest way is to restore the old code and use a minor number hack or ioctl to enable it. I had better do it instead of complaining about it. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: vmware reads disk on non-sector boundary
On Thu, 3 Oct 2002, Mark Santcroos wrote: I have an almost-ready patch that implements linux_read() syscall. This will check if we are reading from a raw disk and in that case it will enlarge the read() to the next sector boundary. I have it working in the kernel but I have problems returning the right read buffer to userland. Unbreaking block devices would be a better solution. Without buffering, reads of raw disks using an unbuffered linux_read() might be sector size times slower than they should be. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: vmware reads disk on non-sector boundary
On Thu, 3 Oct 2002, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Mark Santcroos writes: On Thu, Oct 03, 2002 at 09:50:45PM +1000, Bruce Evans wrote: Unbreaking block devices would be a better solution. Without buffering, ... What was the reason for the removal of block devices anyway? It would be nice if you would tell me some background about that.. :) It's well documented in the mail-archives actually. The short story: Shorter story: phk didn't like them. 1. We don't in general assign a special vnode type to device modes, instead we assign multiple device nodes, SCSI tapes is an example of this. 2. The vnode layer already have enough trouble aliasing /dev/fd0, /mnt/dev/fd0, /usr/jail/dev/fd0, /cdrom/dev/fd0 (you get the idea), we do not need to make it even harder by also aliasing /dev/fd0 and /dev/rfd0. Aliases that differ in type are slightly easier to handle than aliases that differ by name or mount point. They are diferent devices so they have different vnodes, and the aliasing problems for them are quite different than the vnode aliasing problems caused by having the same device under different mount points. E.g., slices and partitions allow configuring 31*7 aliases per drive for hard drives (31 slices with 7 configuring partitions each). Slices and partitions may overlap, giving something more complicated than aliases. The vnode layer doesn't understand any of this. 3. Write ordering on buffered devices were unspecified. In other words, you cannot use it for anything which even remotely smells of transactions, because you have no way to know when your writes have hit the disk and in which order they did so. This is no different than for regular files. 4. No write errors were reported back to userland. Actually, write errors were reported at fsync() and close() time in the same way as for regular files. fsync()'s handling of write errors was broken for both regular files and buffered devices at the time buffered devices were axed. (Given 3 and 4, it follows that use of block devices for any sort of data you happen to like is a very bad idea.) It follos similarly that use of fil systems is a bad idea :-). You You should only use a databases on raw disks if you value your data. 5. Block devices was in the way of getting DEVFS working in an architecturally sane manner. This can be considered a feature. So they were removed, and good riddance. If a buffered access-mode on block devices is desired, it should be implemented either as an ioctl controllable feature, or as a GEOM module. The latter is probably by far the easiest way. It was desired, and was sort of promised. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: bin/20633: fdisk doesn't handle LBA correctly
On Sat, 17 Aug 2002, Artem 'Zazoobr' Ignatjev wrote: On Sun, Aug 18, 2002 at 02:12:45AM +1000, Bruce Evans wrote: [skip] Fdisk should print these values, at least optionally, since they are needed for debugging. The magic values might be non-magic on old systems. Debugging WHAT? And, I can hardly imagine such a situation inside hard drives slice tables. Debugging hard disk tables and slice tables. I do it routinely. It can be important to look at the raw data, but the dp_scyl...dp_ehd data is a little too raw. Ok, so we must add option to fdisk(8) to print CHS? If we displayed suppressed the existing display for cases where we know it is bogus, then there should be an option to display it for these cases. And, BTW, does CHS makes sense inside an extended slices? I think so, since boot loaders might use at least the starting C/H/S if there are any boot loaders that support booting from logical drives in extended partitions (I think lilo can). Using them might even work for cylinders below 1024. ... Actually, 1023 is most magic (it is what the kernel expects), but the data in the PR shows that 1022 is magic too. The magic 0xfe in the kernel is for the ending head number. It can be important not to use a starting or ending head number of 255 (== a head count of 256) because some broken BIOSes crash on it, and there are now conventions that prohibit using it. Maybe we should take as magic cyls 1021? Don't assume much when displaying magic values, but only write values that satisfy current conventions. And where can one read all this conventions? google :-). [skip] The data for partition 13 is: sysid 165 (FreeBSD/NetBSD/386BSD), start 80051958, size 8401932 (4102 Meg), flag 0 beg: cyl 1023/ head 254/ sector 63; end: cyl 1023/ head 254/ sector 63 Script done on Fri Aug 16 22:23:06 2002 Seems reasonable. It shows all of the magic beg and end values, because the most magic one (all 0xff's) is so magic that it is never used :-). and how translates that value to CHS? Guess 1023/255/63? Something like beg: [invalid] for 1023/255/63, and beg: [not used] for 1023/254/63. I suppose, that it will be better to follow the linux' fdisk.. Could you please look at http://memphis.mephi.ru/~timon/fdisk/ (and http://memphis.mephi.ru/~timon at all ;-) since you'll need a patch from there or manually define DOSPTYP_EXTX if you'll try to compile it )? Not sure if I have time to look at the details. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Increasing size of if_flags field in the ifnet structure [patchfor review]
On Thu, 15 Aug 2002, Maxim Sobolev wrote: When implementing ability to switch interface into promisc mode using ifconfig(8) I've stumbled into the problem with already exhausted space in the `short if_flags' field in the ifnet structure. I need to allocate one new flag, while we already have 16 IFF_* flags, and even one additional flag which is implemented using currently free if_ipending field of the said structure. Attached patch is aimed at increasing size of if_flags to 32 bits, as well as to clean-up if_ipending abuse. Granted, it will break backward ABI compatibility, but IMO it is not a big problem. Why isn't it a bug problem? It affects an application ABI (most socket ioctls). We have whole syscalls whose purpose is to avoid breaking application ABIs back to about 4.3BSD. Some of them may even work. Index: src/share/man/man4/netintro.4 === RCS file: /home/ncvs/src/share/man/man4/netintro.4,v retrieving revision 1.20 diff -d -u -r1.20 netintro.4 --- src/share/man/man4/netintro.4 18 Mar 2002 12:39:32 - 1.20 +++ src/share/man/man4/netintro.4 15 Aug 2002 18:33:42 - @@ -197,7 +197,7 @@ structsockaddr ifru_addr; structsockaddr ifru_dstaddr; structsockaddr ifru_broadaddr; -short ifru_flags; +int ifru_flags; int ifru_metric; int ifru_mtu; int ifru_phys; This particular ABI seems to have been broken before (in if.h 1.50 on 1999/02/09), since the actual struct has short ifru_flags[2]; followed by short if_index; instead of short ifru_flags;, and it has 2 new struct members at the end too. If the struct were actually as above, then changing the short to an int would almost be binary compatible since it would just expand ifru_flags to use the 2 bytes of unnamed padding caused by the poor layout, so the struct wouldn't expand and the other members wouldn't move. Enlarging ifru_flags itself might only break big-endian machines (little-endian ones wouldn't notice providing the padding is zeroed). Index: src/share/man/man9/ifnet.9 Breaking kernel ABIs isn't so important. They should only be compatible within major releases. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: gcc -O broken in CURRENT
On Thu, 14 Mar 2002, ozan s. yigit wrote: Add the -ffloat-store flag to your compilation flags (or add -msoft-float). that really means for this compiler on certain platforms, you can have slow and correct or fast and incorrect, but NOT fast and correct. I think fast and correct is impossible on i386's. Correct requires assignments and casts to discard any extra precision, and the fastest way to implement this is probably to store to memory and reload. The -ffloat-store kludge only does a subset of the necessary conversions. Doing them all would be slower and correct, which is why gcc doesn't do them. C90 can be read as permitting this incorrectness, but C99 doesn't permit it. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: sys/types.h or not sys/types.h? [Was: cvs commit: src/includegrp.h]
On Mon, 25 Feb 2002, David Malone wrote: I note that in the footnotes for getgrgid, in the section for issue 6 of the standard: The requirement to include sys/types.h has been removed. Although sys/types.h was required for conforming implementations of previous POSIX specifications, it was not required for UNIX applications. Curiously, this seems to say the opposit of what you actually see in SUSv2, as it lists sys/types.h as a prerequisit. This is probably because SUSv2 is almost as old as POSIX.1-1996 (which has the prerequisite). BTW, unistd.h has the same prerequisite, but our version of unistd.h is polluted with an include of sys.types.h. Nevertheless, we fixed several man pages to document the old-POSIX requirements. E.g.: RCS file: /home/ncvs/src/lib/libc/sys/fork.2,v Working file: fork.2 head: 1.14 ... revision 1.7 date: 1998/01/11 16:51:49; author: alex; state: Exp; lines: +4 -1 branches: 1.7.2; Add sys/types.h to synopsis. Correct a grammatical error. Add cross-reference to setrlimit(2). Obtained from: OpenBSD ISTR that most of the fixes came from OpenBSD. I haven't attempted to fix unistd.h since its pollution is old. I only police new pollution and must have been asleep when we broke several man pages to undocument the old-POSIX requirements. E.g.: RCS file: /home/ncvs/src/lib/libc/sys/setsid.2,v Working file: setsid.2 head: 1.12 ... revision 1.6 date: 1997/04/08 10:43:47; author: peter; state: Exp; lines: +1 -1 setsid is declared in unistd.h, which is self sufficient (doesn't need prior sys/types.h) Fixes PR#3229, from Dmitrij Tejblum [EMAIL PROTECTED] Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: stack alignment issues
On Tue, 5 Feb 2002, Nik Clayton wrote: On Tue, Feb 05, 2002 at 05:01:29PM +1100, Bruce Evans wrote: My patch is not suitable for committing verbatim. It has 2 or 3 XXX's. Do you make these patches available anywhere, so that other people can look over them and maybe help you on the XXX'd sections? Erm, I'm replying to mail that gave a URL for my old mail with the patch. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: stack alignment issues
On Tue, 5 Feb 2002, Mike Silbersack wrote: On Tue, 5 Feb 2002, Bruce Evans wrote: foo: pushl %ebp movl %esp,%ebp subl $8,%esp# - extra instruction for alignment (for foo) addl $-12,%esp # - extra instruction for alignment (for f1) What disgusting code. I find it amazing that they didn't even stick in some peephole optimizer to at least limit it to one operation. It's clearly the result of work in progress :-). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: stack alignment issues
On Mon, 4 Feb 2002, Michal Mertl wrote: Did you look at the patch by Bruce at http://groups.yahoo.com/group/freebsd-current/message/39605 ? Bruce, is it still fresh in your memory? Can you comment on the patch - can it be commited in some form? I haven't done anything to clean up the patch. I hope the problem will go away in future versions of gcc (align the stack at runtime in the few routines that actually need it). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: stack alignment issues
On Mon, 4 Feb 2002, Mike Silbersack wrote: On Tue, 5 Feb 2002, Bruce Evans wrote: I haven't done anything to clean up the patch. I hope the problem will go away in future versions of gcc (align the stack at runtime in the few routines that actually need it). Well, if Linux aligns the initial stack, the chance that gcc will have auto-alignment added sounds to be about zero. You might as well go ahead with your patch when you get a chance. There is a nonzero probability that the pessimization of aligning in almost every routine will be fixed someday. Actually, the pessimization is worse -- the alignment is done before every call. Example: %%% foo.c: foo() { f1(1); f2(2); f3(3.0); } %%% gcc -O -S [-mpreferred-stack boundary] currently generates the following code for this: %%% .file z.c .version01.01 gcc2_compiled.: .section.rodata .p2align 3 .LC0: .long 0x0,0x4008 .text .p2align 2,0x90 .globl foo .typefoo,@function foo: pushl %ebp movl %esp,%ebp subl $8,%esp# - extra instruction for alignment (for foo) addl $-12,%esp # - extra instruction for alignment (for f1) pushl $1 call f1 addl $-12,%esp # - extra instruction for alignment (for f2) pushl $2 call f2 addl $32,%esp # - extra instruction for alignment (for f3) addl $-8,%esp # - extra instruction for alignment (another) pushl .LC0+4 pushl .LC0 call f3 leave ret .Lfe1: .sizefoo,.Lfe1-foo .ident GCC: (c) 2.95.3 20010315 (release) %%% It should generate something like: .file z.c .version01.01 gcc2_compiled.: .section.rodata .p2align 3 .LC0: .long 0x0,0x4008 .text .p2align 2,0x90 .globl foo .typefoo,@function foo: pushl %ebp movl %esp,%ebp andl $~0x7,%esp # - extra instruction for alignment (for foo) # Only needed since foo() uses FPU. # 8-byte alignment enough for doubles? # Adjust in prologue so that there are # hopefully no alloca()-like issues, except # we need a frame pointer to restore %esp. pushl $1 call f1 pushl $2 call f2 pushl .LC0+4 pushl .LC0 call f3 leave ret .Lfe1: .sizefoo,.Lfe1-foo .ident GCC: (c) 2.95.3 20010315 (release) %%% My patch is not suitable for committing verbatim. It has 2 or 3 XXX's. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: cvs commit: src/lib/libatm atm_addr.c cache_key.c ioctl_subr.c ip_addr.c ip_checksum.c timer.c
On Sat, 15 Sep 2001, Matt Dillon wrote: I'm redirecting this to freebsd-hackers. Ok, I've comitted a new set of changes to libatm. Please check them out. When we get a format that enough people are happy with we can start converting the other libraries. I'm not particularly interested in fixing old old problems, I just want to get the __FBSDID() stuff in shape. I prefer the following (for one of the changed files). Especially the empty line after the copyright message: %%% Index: atm_addr.c === RCS file: /home/ncvs/src/lib/libatm/atm_addr.c,v retrieving revision 1.6 diff -u -2 -r1.6 atm_addr.c --- atm_addr.c 2001/09/15 19:36:55 1.6 +++ atm_addr.c 2001/09/15 20:25:39 @@ -24,10 +24,12 @@ * notice must be reproduced on all copies. */ -#include sys/cdefs.h + +#ifdef VENDOR_ID #ifndef lint -#if 0 /* original (broken) import id */ static char *RCSid = @(#) $Id: atm_addr.c,v 1.1 1998/07/09 21:45:18 johnc Exp $; #endif #endif + +#include sys/cdefs.h __FBSDID($FreeBSD: src/lib/libatm/atm_addr.c,v 1.6 2001/09/15 19:36:55 dillon Exp $); %%% (VENDOR_ID is intended to be left undefined.) This still changes the vendor id excessively relative to rev.1.1. The vendor put it after the comment after the copyright comment, and restoring it in rev.1.5 moved it to a different place. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Junior Kernel Hacker task: improve vnode-v_tag
On Sat, 8 Sep 2001, Chris Costello wrote: On Saturday, September 08, 2001, Poul-Henning Kamp wrote: No actually not, I want something short and predictable like VT_CODA. How about my second suggestion: making v_tag point to mp-mnt_stat.f_fstypename, or a copy thereof? Good, but I as far as I understand this, the only legitimate point of v_tag is to tell applications like fstat(1) the type of the vnode. fstat can find chase the kernel pointers in vp-v_mount-mnt_stat-f_fstypename almost as (un)easily as it could chase vp-v_tag if v_tag is a pointer. Copying the string would give ugly bloat, but would not be all that much worse than the bloat for a pointer on alphas (the contents of the string VT_CODA takes the same space as a pointer on alphas). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Junior Kernel Hacker task: improve vnode-v_tag
On Tue, 4 Sep 2001, Maxim Sobolev wrote: [neither Maxim Sobolev nor Brent Verner wrote] In message [EMAIL PROTECTED], Brent Verner writes: #include newbie_kernel_hacker_warning.h I've done a /cursory/ look over how this v_tag is used. I'm not sure this is a simple/clean as you propose, since this is used in the IS_LOCKING_VFS macro, as well as in union_subr.c... Well, that is just too bad, because IS_LOCKING_VFS is wrong then. The places which inspect v_tag will have to be changed to use strcmp() then... I think that we can add a new vnode flag, say VCANLOCK, so that each particular VFS can set it if it supports locking, which should allow to remove pre-defined VFS list from the IS_LOCKING_VFS macro. I can produce a patch if it sounds reasonably. Since IS_LOCKING_VFS() is a temporary kludge, the correct fix is to remove it (after fixing whatever it was for), not to add more cruft. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset
On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote: I see (amost). Please format the macro the same as the other macros in the file. Index: apic_vector.s === RCS file: /home/ncvs/src/sys/i386/isa/apic_vector.s,v retrieving revision 1.47.2.4 diff -u -r1.47.2.4 apic_vector.s --- apic_vector.s 18 Jul 2000 21:12:41 - 1.47.2.4 +++ apic_vector.s 31 Aug 2001 19:24:24 - @@ -653,7 +707,14 @@ FAST_INTR(21,fastintr21) FAST_INTR(22,fastintr22) FAST_INTR(23,fastintr23) -#define CLKINTR_PENDING movl $1,CNAME(clkintr_pending) + +#define CLKINTR_PENDING \ + pushl $clock_lock; \ + call s_lock;\ + movl $1,CNAME(clkintr_pending); \ + call s_unlock; \ + addl $4, %esp + INTR(0,intr0, CLKINTR_PENDING) INTR(1,intr1,) INTR(2,intr2,) The other macros also have a space befor the semicolons :). Otherwise OK. Better commit this (especially to 4.4R) until we have something better. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset
On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote: I wrote: The problem here is that CPU#1 fails to hold clock_lock while setting clkintr_pending, causing i8254_offset to be stepped twice, first due to clkintr_pending, then due to i8254_lastcount being larger than count. With this patch applied to RELENG_4, the clock speedup seems to disappear. --- apic_vector.s.old Fri Mar 2 13:47:31 2001 +++ apic_vector.s Fri Aug 31 01:07:53 2001 @@ -707,7 +707,12 @@ FAST_INTR(21,fastintr21) FAST_INTR(22,fastintr22) FAST_INTR(23,fastintr23) -#define CLKINTR_PENDING movl $1,CNAME(clkintr_pending) +#define CLKINTR_PENDING pushl $clock_lock; \ + call s_lock;\ + movl $1,CNAME(clkintr_pending); \ + call s_unlock; \ + addl $4, %esp + INTR(0,intr0, CLKINTR_PENDING) INTR(1,intr1,) INTR(2,intr2,) I see (amost). Please format the macro the same as the other macros in the file. The corresponding patch for -current is This does nothing for -current, because -current uses a fast interrupt handler for clkintr() and FAST_INTR() doesn't take a CLKINTR_PENDING arg. I think this is another bug. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset
On Fri, 31 Aug 2001 [EMAIL PROTECTED] wrote: For RELENG_4, which uses Xintr0, this problem is reduced by adding the clkintr_pending variable (cf. revision 1.134 of clock.c and revision 1.38 of apic_vector.s) Looking more at the code, I now see that clkintr_pending is never set in -current due to Xintr0 not being used (Xfastintr0 is now used). Using a fast interrupt for the clock interrupt gives some of the same reduction of the race window size as the use of clkintr_pending. I remember this a bit better after reading my log message for rev.1.134 of clock.c :-). At the time, the clock interrupt handler was not fast. However, it could be delayed for a while by a fast interrupt handler (it still can). Since it is delayed, it can't set clkintr_pending and we have to rely on the irr overflow test working for longer. Rev.1.134 only claims to work for the non-SMP case. The irr test apparently never worked for SMP. Otherwise, it might be preferable to use the irr overflow check instead of or as a backup to clkintr_pending, but it can't be, since the irr bit is cleared when the irq is acked. This behaviour also gives races for SMP: the irr bit may become invalid while you're looking it when another CPU acks the irq. Making the clock interrupt handler fast doesn't significantly change the problems as far as I can see. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Clock speedup on 4.X FreeBSD SMP and serverworks chipset
On Thu, 30 Aug 2001, Martin Blapp wrote: Searching the freebsd mailinglists I have seen that you also suffering under this problem on 4.X. STABLE: I now remember your old mail about this. I (implicitly) suggested a fix, but apparently no one tried it: On Thu, 21 Jun 2001, Martin Blapp wrote: Just gettimeofday() produces 8sec time drifting now. No need to use poll() in our little programm I sent previously. There is no time drifting if we used a 100% load programm with just poll(). Very strange. Do you have some idea ? From clock.c in -current: | #ifdef APIC_IO | #define lapic_irr1 ((volatile u_int *)lapic)[0x210 / 4] /* XXX XXX */ | /* XXX this assumes that apic_8254_intr is 24. */ | (lapic_irr1 (1 apic_8254_intr | #else | (inb(IO_ICU1) 1))) | #endif Maybe apic_8254_intr is not 24. I think the second XXX comment has rotted in -current, but it still applies in RELENG_4. I now think apic_8254_intr can't be = 24 (= 32 in -current), but 8254 is routed via 8259 and IOAPIC #0 intpin 0 in your probe messages says that the relevant status bit is in lapic_irr0, not in lapic_irr1. Try changing the 0x210 to 0x200. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: MIN()/MAX() definitions in sys/param.h
On Mon, 14 May 2001, Dima Dorfman wrote: Is there a reason the definitions of the MIN() and MAX() macros in sys/param.h are under an '#ifndef _KERNEL'? Quite a few files in the It is to inhibit use of these macros. The {i,,l,lu,q}{max,min} inline functions are supposed to be used instead. Both of these interfaces are problematic. MIN/MAX can not be implemented to be safe macros in Standard C since they need to evaluate their args more than once. Their upper case names indicate that they are unsafe. The {i,,l,lu,q}{max,min} functions are not type-generic, so they are difficult to use correctly for args whose type is typedef'ed. kernel define these (well, at least MIN) themselves, so it would seem to make sense to define them globally in sys/param.h for the kernel as well. Any reason this isn't already done this way, or should I come up with a patch to fix that? This is a bug in these files. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: Displaying options for current NFS mounts
On Sun, 25 Mar 2001, Peter Pentchev wrote: Only mount_foofs can reasonably know about the options for foofs. perhaps mount(8) could fork-exec mount_foofs(8) to print options for foofs. bikeshed type=question value="mostly stupid" Or could mount(8) invoke a couple of sysctl's to get a string representation of each mountpoint's mount options? /bikeshed My bikeshed believes that string processing doesn't belong in the kernel :-). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Displaying options for current NFS mounts
On Sat, 24 Mar 2001, Dima Dorfman wrote: I tried to export this stuff in struct statfs, but ran into a problem: I'd need the complete definitions of fs_args in sys/mount.h, but I can't include, e.g., nfs/nfs.h because the latter includes the former (sys/mount.h)! mount.h used to know too much about all sorts of filesystems, but this was fixed in 4.4BSD. It is impossible for mount.h or mount(8) to know about all file systems, since filesystems can be dynamically loaded, and ugly for it to know about more than 1 (or 0 -- ffs is too special). The patch below kind of implements this functionality. I only export nfs_args (not otherfs_args), and I only modified mount(8) to print the NFS version, but printing the transport and others is simple from there. To work around the above problem, I pasted the struct nfs_args definition into mount.h. It is *horribly* ugly, but it does work. Only mount_foofs can reasonably know about the options for foofs. perhaps mount(8) could fork-exec mount_foofs(8) to print options for foofs. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: What's changed recently with vmware/linuxemu/file I/O
On Wed, 7 Feb 2001, Josef Karthauser wrote: On Wed, Feb 07, 2001 at 08:56:14PM +, Josef Karthauser wrote: On Wed, Feb 07, 2001 at 08:26:15PM +0100, Dag-Erling Smorgrav wrote: Brian Somers [EMAIL PROTECTED] writes: Indeed. I've been doing a ``make build'' on an OpenBSD-current vm for three days (probably about 36 hours excluding suspends) on a 366MHz laptop with a ATA33 disk. Would it be possible for someone experiencing this slowdown to try to narrow down the day (or even the week) on which it occurred? As I think about it it was definitity working before the symbol changes in libc/libc_r changed. Was that last week? No probably the week before. It was working fine last week, but I'm not sure which day's I updated the kernel. I'll try some builds. Ok. The problem definitely began between -D2001-01-29 and -D2001-01-30. I'll try and binary chop to workout what caused it. If you have ata disks, try "options ATA_ENABLE_WC". Nothing else has changed significantly in this period. I don't know how this would effect vmware boot speeds. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: What's changed recently with vmware/linuxemu/file I/O
On Tue, 6 Feb 2001, Josef Karthauser wrote: I'm wondering what's changed recently to cause vmware2 running on the linuxemu to lose a lot of performance with disk I/O. Use of cmpxchg and possibly other SMP pessimizations. A couple of weeks ago I could boot win2000 under vmware2 in a matter of minutes; on today's kernel it takes 5 or 10 minutes to boot, and disk I/O is through the roof. Could someone please hit me with a clue-bat :) Read your freebsd-emulation mail :-). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: device naming convention
On Mon, 18 Sep 2000, Marc Tardif wrote: 0cicuta/home/babolo(9)#dd of=/dev/wd0s2 if=/dev/zero bs=660b 1cicuta/home/babolo(11)#od -b /dev/wd0s2 [ snip ] Why I use 2.2.7 for test? Because of my lovely 4.1-STABLE is extremly unstable with content of ad0s2 (wd0s2) above and silently reboot after the first dd in the test above. Assuming my wd0s2 is still unused and of size 0, 3.5-STABLE also crashes in the test above (no disk activity, ctrl-c doesn't work, alt-f# doesn't work either). Perhaps it eventually reboots, but I wasn't patient enough to wait that long. One solution to this problem is to specify the count blocks after which dd returns properly but still no bytes are copied. [Please use lines somewhat shorter than 360 characters.] This is a completely diferent problem. wd0s2 was buffered in 3.5, and buffered devices are very broken in 3.1 and later versions of 3.x (write errors are retried endlessly. Among other bugs, writing beyond EOF hangs the system when it allocates all buffers for writing unwritable data). Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Fdescfs updates--coming to a devfs near you!
On Sun, 17 Sep 2000, Adrian Filipi-Martin wrote: I recently ran into revelant problem with /dev/stdout, while working on some software under linux that expected /dev/stdout as an argument instead of using stdout. Using the device file breaks, if the process is suid to a non-root user. This is because it cannot open /dev/stdout, which is owned by your UID and not the EUID of the process to which the device was passed. My solution was to add the "-" hack and use the existing open descriptor. Um, open on fdesc devices doesn't check either uid. It just checks the access mode. Perhaps the software expected /dev/stdout to for read-write like a tty would be. Then opening /dev/stdout would fail for normal shell output redirection which only opens for writing. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: DDB is not setting break points...
On Thu, 1 Jun 2000, G.B.Naidu wrote: I am having problems with DDB while setting breakpoints in the kernel. I entered the DDB by giving kernel -d at boot prompt. After that I tried to set break point at ip_output() by giving "b ip_output". But it complains saying that "sumbol not found". I thought this might be due to stripped Early setting of breakpoints by name was broken by the switch to elf in FreeBSD-3.0 (symbols aren't available until the kernel module sysinit runs much later). I think it works for aout kernels in 3.x but not in 4.0 or -current. Use gdb or set breakpoints early by value in broken versions. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: QUEUE_VMIO...???
On Wed, 31 May 2000, Joy Ganguly wrote: what is the significance of QUEUE_VMIO buffer (struct buf) queue ?? as far as i could see they are not used at allbut maybe i am wrong. It signifies bitrot. Its use was removed more than 5 years ago in rev.1.35 of sys/kern/vfs_bio.c, but its definition was not removed until less than a year ago in rev.1.75 of sys/sys/buf.h. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
RE: stupid FS questions
On Tue, 30 May 2000, Garrett Wollman wrote: On Tue, 30 May 2000 16:20:53 -0400, "Yevmenkin, Maksim N, CSCIO" [EMAIL PROTECTED] said: i know that :) i guess my questions were 1) why the same piece of code duplicated in all ``mount_xxx'' utilities? Because the original loadable module system held strongly to the religion that the kernel should never load anything of its own accord. The designers of the current loadable module system made different design choices, but the some traces of its predecessor still remain. Including defunct traces like all the duplicated code in the mount utilities :-). Relevant history: RCS file: /home/ncvs/src/lib/libc/gen/getvfsent.c,v Working file: getvfsent.c head: 1.14 revision 1.13 date: 1998/11/03 15:02:29; author: peter; state: Exp; lines: +10 -1 A feeble attempt at kld compatability. The mount_* programs assume that they cannot mount a filesystem that they cannot see in getvfsbyname(). Part 1 of this is a hack, make vfsisloadable() always return true - the ultimate decider of whether it's loadable or not is kldload() or mount(). Part 2 of this is to have vfsload() call kldload(2) and return success if it works. This means that we will use a viable kld module in preference to an LKM! Ultimately, the thing to do is remove the hacks to do a vfsload in all the ^^ should have been more than a year ago mount_* commands and let the kernel do it by itself in mount(2). RCS file: /home/ncvs/src/sys/kern/vfs_syscalls.c,v Working file: vfs_syscalls.c head: 1.153 ... revision 1.110 date: 1998/11/03 14:29:09; author: peter; state: Exp; lines: +32 -3 make mount(2) automatically kldload modules if the requested filesystem isn't present. This commit made the duplicated code redundant except for backwards compatibility. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
RE: BPF bug or not?
On Wed, 26 Jan 2000, Yevmenkin, Maksim N, CSCIO wrote: I've just found that read from /dev/bpfX never return EAGAIN/EWOULDBLOCK. It means that when you do a non blocking read and there is no data you will always get 0. [ untested fix removed :) ] Yes, it works. But it returns EAGAIN for both O_RDONLY|O_NONBLOCK and O_RDWR|O_NONBLOCK open modes. In the same time pipe returns 0 for O_RDONLY|O_NONBLOCK mode and EAGAIN for O_RDWR|O_NONBLOCK. It there any specs for "read" system call? Well, POSIX is very complete for read() on regular files and pipes (both ordinary pipes and fifos. read() on a pipe with no data and writers returns 0 because that case is considered to be EOF. O_RDWR for fifos gives undefined behaviour. I don't know of any legitimate use for it. It has the illegitimate use of talking to oneself using only one channel :-). This gives the EAGAIN behaviour for O_RDWR|O_NONBLOCK. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: kern/13644
On Sun, 23 Jan 2000, Warner Losh wrote: : This is what I asked for, when I asked for "other specification". Could : you provide the chapter/verse number of where POSIX spec contradicts the : man pages? It will help me make my case on the TCL forum, since the TCL : developers remain under the mistaken assumption, that select() may : return earlier, but never later than specified. That's trivially easy to show. In theory, but not in practice :-). Given process X with a priority N + 1 that is doing N - 1 (higher priority is actually "lower") while (1) ; while process Y with a priority of N is doing the select. The kernel won't prempt X until the time slice is done, which can be a long time. If the select'd process is swapped out, then it could take a very very long time to swap back in. Um, if the priorities are actually N vs. N - 1, then the process with priority N won't run at the end of the timeslice. It will only run when its priority becomes "lower", possibly several timeslices later. In practice, the priorities will never be N vs. N - 1. The process doing the select() sleeps at priority PSOCK = 24. The process doing the while loop should always have priority = PUSER = 50, but due to a bug (?) in nice(2), the priority of a nice --20 process can be as low as PUSER - 20 = 30. Anyway, that is PSOCK, so the process doing the select() will preempt the user process and wake up as soon as it times out. Then, due to a longstanding scheduling bug (?), the process doing the select() will return to userland without being rescheduled, although its user priority may be much "higher" than that of any other runnable process. Processes that do i/o are thus preferred to cpu hogs much more strongly than their priorities indicate. This bug is a feature in most cases. It reduces context switches. Interactive process may get more benefit from it than from the classical scheduling preference for interactive processes. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: getopt.h
In unistd.h we have definitions for getopt, optarg, optind, opterr, and optopt. These are the standard POSIX (.2?) declarations. The things I propose to add to unistd.h are the following: struct option { char*name; int has_arg; int *flag; int val; }; #define no_argument 0 #define required_argument 1 #define optional_argument 2 extern int getopt_long (int, char *const [], const char *, const struct option *, int *); extern int getopt_long_only (int, char *const [], const char *, const struct option *, int *); extern int _getopt_internal (int, char *const [], const char *, const struct option *, int *, int ); This should enable us to compile programs which require getopt.h by simply including unistd.h. These are gnu extensions which we don't even support in libc, so they shouldn't be declared anywhere in /usr/include. I think ports that use the gnu extensions handle this problem by duplication the gnu sources. Not ideal, but the getopt sources are small. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: kernel/loader world
In file included from /usr/src/sys/boot/i386/libi386/../../../sys/signal.h:236, from /usr/src/sys/boot/i386/libi386/../../../sys/param.h:90, from /usr/src/sys/boot/i386/libi386/aout_freebsd.c:29: /usr/src/sys/boot/i386/libi386/../../../sys/ucontext.h:34: machine/ucontext.h: No such file or directory We include the files directly from the source tree, but some of them then go and include files from machine/*, which, of course, refers to /usr/include/machine/*. Thus, some of the files are up-to-date, and some are not. Unless you build world first. It's dangerous to mix headers from the source tree with headers from /usr/include. Are you sure this is the case? [but you can't build world until you booted a new kernel, and you can't boot a new kernel until you have a new loader, but you can't build a new loader...] This is "artifical" in that it is solved by fixing the build process. The bug is that the `machine' symlink is not always created. The makefile supports creating it, but has missing dependencies. See the old boot loader (sys/i386/boot/Makefile.inc) for a good way to handle this. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Init(8) cannot decrease securelevel
There used to be security holes that allowed root to lower `securelevel' using init. Rev.1.9 defends against any undiscovered holes. How about following change? OK. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Init(8) cannot decrease securelevel
Once securelevel has been increased, no process can decrease it because kernel always refuse decreasing it. This is inconsistent with the manual page of init: The kernel runs with four different levels of security. Any super-user process can raise the security level, but only init can lower it. Is there any security problem to implement this? If no, could someone review following patch? The patch just backs out rev.1.9: RCS file: /home/ncvs/src/sys/kern/kern_mib.c,v Working file: kern_mib.c head: 1.25 ... revision 1.9 date: 1997/06/25 07:31:47; author: joerg; state: Exp; lines: +2 -2 Don't ever allow lowering the securelevel at all. Allowing it does nothing good except of opening a can of (potential or real) security holes. People maintaining a machine with higher security requirements need to be on the console anyway, so there's no point in not forcing them to reboot before starting maintenance. Agreed by: hackers, guido There used to be security holes that allowed root to lower `securelevel' using init. Rev.1.9 defends against any undiscovered holes. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Init(8) cannot decrease securelevel
Once securelevel has been increased, no process can decrease it because kernel always refuse decreasing it. This is inconsistent with the manual page of init: The kernel runs with four different levels of security. Any super-user process can raise the security level, but only init can lower it. Is there any security problem to implement this? If no, could someone review following patch? The patch just backs out rev.1.9: RCS file: /home/ncvs/src/sys/kern/kern_mib.c,v Working file: kern_mib.c head: 1.25 ... revision 1.9 date: 1997/06/25 07:31:47; author: joerg; state: Exp; lines: +2 -2 Don't ever allow lowering the securelevel at all. Allowing it does nothing good except of opening a can of (potential or real) security holes. People maintaining a machine with higher security requirements need to be on the console anyway, so there's no point in not forcing them to reboot before starting maintenance. Agreed by: hackers, guido There used to be security holes that allowed root to lower `securelevel' using init. Rev.1.9 defends against any undiscovered holes. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Init(8) cannot decrease securelevel
There used to be security holes that allowed root to lower `securelevel' using init. Rev.1.9 defends against any undiscovered holes. How about following change? OK. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: rndcontrol and SMP
/* * XXX the data is 16-bit due to a historical botch, so we use * magic 16's instead of ICU_LEN and can't support 24 interrupts * under SMP. */ intr = *(int16_t *)data; if (cmd != MEM_RETURNIRQ (intr 0 || intr = 16)) return (EINVAL); What is needed to make this support a more sensible number of IRQs? Mainly changing the ioctl and its clients (rndcontrol only?) to supply more bits. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: rndcontrol and SMP
/* * XXX the data is 16-bit due to a historical botch, so we use * magic 16's instead of ICU_LEN and can't support 24 interrupts * under SMP. */ intr = *(int16_t *)data; if (cmd != MEM_RETURNIRQ (intr 0 || intr = 16)) return (EINVAL); What is needed to make this support a more sensible number of IRQs? Mainly changing the ioctl and its clients (rndcontrol only?) to supply more bits. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: cvs commit: src/usr.sbin/inetd inetd.c
Can nohup really prevent processes from trapping SIGHUP? I thought it just set the SIGUP handler to discard and hoped for the best. It's normally a bug to catch ignored signals. Daomons that reread config files when they receive a signal may be counterexamples. OTOH, they probably shouldn't be started with some signals ignored unless ignoring those signals is really wanted. Xntpd in the base system explicitly requests its graceful termination function, called finish(), be loaded as the SIGHUP handler. This seems to be just a bug. finish() is used for SIGHUP, SIGINT, SIGQUIT and SIGTERM. finish() just finishes (actually it has undefined behaviour since it calls exit()), so nothing except undefined behaviour is lost by ignoring these signals. Bruce To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: cvs commit: src/usr.sbin/inetd inetd.c
Can nohup really prevent processes from trapping SIGHUP? I thought it just set the SIGUP handler to discard and hoped for the best. It's normally a bug to catch ignored signals. Daomons that reread config files when they receive a signal may be counterexamples. OTOH, they probably shouldn't be started with some signals ignored unless ignoring those signals is really wanted. Xntpd in the base system explicitly requests its graceful termination function, called finish(), be loaded as the SIGHUP handler. This seems to be just a bug. finish() is used for SIGHUP, SIGINT, SIGQUIT and SIGTERM. finish() just finishes (actually it has undefined behaviour since it calls exit()), so nothing except undefined behaviour is lost by ignoring these signals. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Rewriting pca(4) using finetimer(9) (was: Re: MPU401 now works under New Midi Driver Framework with a Fine Timer)
this is problematic. you cannot add a new element before the pending firing because you can't tell how far into the present trigger you are. This is not a problem for readable counters like the i8254. The problem for the i8254 is that reading and writing it takes a long time (perhaps 5 usec each). Every write to it will decrease its accuracy (perhaps by 1 usec after careful calibration). Thus if the i8254 is used for generating non-periodic interrupts with an average period of 333 usec, it may take 15/333 of the CPU (estimate another 5 usec for interrupt handling), and if it is also used for timecounting then it may drift by 333 usec/second. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Rewriting pca(4) using finetimer(9) (was: Re: MPU401 now worksunder New Midi Driver Framework with a Fine Timer)
dfr If I understand this correctly, you are suggesting that we program timer0 dfr so that we only take interrupts when a finetimer is due to fire? If so, dfr then it sounds very good. The idea of taking 6000+ interrupts/sec made me dfr uneasy, even though most would return without doing any work. 6000+ interrupts/sec is not many, unless it is done all the time. A 486/33 can handle about 5 (16000 for pcaudio + 3 * 11520 for sio's). There is one problem in this method. acquire_timer0() is only implemented for i386. We would need to write something equivalent for alpha... This is a serious problem. acquire_timer0() is currently disabled even for i386's when the i8254 is used for timecounting. This is not hard to fix (the hooks into clkintr() work even better with timecounters since it is not necessary to resynchronise clock interrupts after a state change), but an i8254 interrupt frequency of 16000 Hz is too fast to be used routinely if the i8254 is being used for timecounting (even if the CPU can keep up with the interrupts, the overflow heuristics in i8254_get_timecount() may break down). Other systems may have even more limitations on the timecounters. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI
I commented out the SMP and FAST code, rebuilt the kernel from scratch and the panic still occurs. Is there other code that should be disabled? Is this fixed in 3.2? Why are there evil macro's in my computer? Why am I asking so many questions? I committed a proper fix in /sys/i386/isa/cy.c: rev.1.88 (-current) rev.1.83.2.1 (-stable) This is not in 3.2. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI
There seems to be a problem with nested locks. What was the panic message? I have the same problem with my Cyclom Ye cards (both the isa pci variety) The kernel panics with: panic messages: --- panic: rslock: cpu: 0, addr: 0xf026a15c, lock: 0x0001 mp_lock = 0001; cpuid = 0; lapic.id = That's the problem with nested locks. after crash debugging shows that the value of com is 0x3 at the time of crash. Since com is a pointer, I think we know that it is wrong. I have added a printf statement to a varitation of cy.c that shows that just before the call to commctl, the value of com is correct (not 3). I don't think that's the problem. Any help towards getting the cy driver up and running on the smp kernel would be appreciated. See hints in my previous mail. The main problem is that disable_intr() isn't doesn't nest properly and I only avoided this problem for !SMP case. The SMP disable_intr() is an evil macro that calls a locking function which crashes if the lock is already held. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: Multiproc kernel 3.1-Release Cyclades Cyclom 8Yep PCI
I have multiprocessor machine: matherboard SOYO 5TX2/X5 with 2 intel 166 proceccors multiport card Cyclades Cyclom 8Yep Kernel config: options SMP # Symmetric MultiProcessor Kernel options APIC_IO # Symmetric (APIC) I/O options NCPU=2 # number of CPUs options NBUS=2 # number of busses All good work until somthing do start to send to /dev/cXX Then kernel panic and reboot !!! All good working under single processor kernel ... no problem There seems to be a problem with nested locks. What was the panic message? Try deleting the code in the SMP ifdefs in cy.c, and don't use fast interrupts with SMP: (a) for the pci version, don't use option CY_PCI_FASTINTR. (b) for the isa version, delete the line with RI_FAST in it in cy.c. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message
Re: CY_PCI_FASTINTR (Was: Re: Cyclom-Y driver for FreeBSD - help!!!)
Am I understand this right: if I configure my PCI card to an exclusive interrupt, and then turn on CY_PCI_FASTINTR, I'll get rid of ``silo overflows'' on it. Except in -current, where CY_PCI_FASTINTR is broken (it has no effect). Just out of curiosity. What is the ``interrupt latency in the FreeBSD kernel'' problem and what are the ``Bruce's fast interrupt hacks''? The problem is that some mouldy old drivers and subsystems disable interrupts for too long (e.g. 2-5msec for programming keyboard LEDs; this breaks unbuffered serial input at speeds larger than 200-500 bps, and it breaks the Cyclades' 12-character fifos at speeds larger than 600-2400 bps). The fast interrupt hacks consist of ignoring the normal interrupt masking system so that drivers using fast interrupts are not affected by the problem unless there are enough of them (or enough activity on them) to create their own latency problems. Bruce To Unsubscribe: send mail to majord...@freebsd.org with unsubscribe freebsd-hackers in the body of the message