Corrected gettimeofday() test code

2003-11-29 Thread Kris Kennaway
I forwarded the reports of timecounter problems to phk, and he asked
that people who are seeing timecounter problems provide FULL details
of their system configuration, including:

* dmesg

* kernel configuration

* compiler options

* time-related system configuration (whether ntpd/timed/ntpdate is
running, and if so whether it's correcting for a seriously drifting
clock)

* The kernel timecounter configuration, e.g. the
kern.timecounter.method and kern.timecounter.hardware sysctls, and
whether changing them has any effect.

* The exact output of the corrected test program below (the original
would give spurious errors if it didn't run at least once a second,
which may have been confusing some people if their systems were
sufficiently loaded).

* The system status when the problem is observed (i.e. does it only
occur under load; what else is running at the time)

It's hard to track down this kind of problem when people don't provide
this level of detail, and half-reports just confuse the issue.

Thanks,

Kris

- Forwarded message from Poul-Henning Kamp [EMAIL PROTECTED] -

X-Original-To: [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Date: Sat, 29 Nov 2003 23:48:12 +0100
From: Poul-Henning Kamp [EMAIL PROTECTED]
Subject: Re: gettimeofday() test code
In-reply-to: Your message of Sat, 29 Nov 2003 14:21:24 PST.
 [EMAIL PROTECTED]
To: Kris Kennaway [EMAIL PROTECTED]
X-UIDL: 1f9b02cde6c0ab3cc62d0309b681fcc3
X-Bogosity: No, tests=bogofilter, spamicity=0.00, version=0.15.7


Have them run this one instead...

#include stdio.h
#include sys/time.h

int
main()
{
struct timeval prevtime;
struct timeval curtime;
double last, now;

gettimeofday(prevtime, NULL);
last = prevtime.tv_sec + prevtime.tv_usec * 1e-6;

for (;;)
{
gettimeofday(curtime, NULL);
now = curtime.tv_sec + curtime.tv_usec * 1e-6;

if (now  last) {
printf(%f %f %d.%06d %d.%06d\n, now, last,
curtime.tv_sec, curtime.tv_usec,
prevtime.tv_sec, prevtime.tv_usec);
}
last = now;
prevtime = curtime;
}

return 0;
}

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

- End forwarded message -


pgp0.pgp
Description: PGP signature


Re: Corrected gettimeofday() test code

2003-11-29 Thread Marc G. Fournier
On Sat, 29 Nov 2003, Kris Kennaway wrote:

 * dmesg

Attached ...

 * kernel configuration

Attached ...

 * compiler options

CFLAGS= -O -mpentium -pipe -g -DKVA_PAGES=512
COPTFLAGS= -O -mpentium -pipe -DKVA_PAGES=512

 * time-related system configuration (whether ntpd/timed/ntpdate is
 running, and if so whether it's correcting for a seriously drifting
 clock)

ntpdate running twice a day (11:59  23:59)

 * The kernel timecounter configuration, e.g. the
 kern.timecounter.method and kern.timecounter.hardware sysctls, and
 whether changing them has any effect.

neptune# sysctl kern.timecounter
kern.timecounter.method: 0
kern.timecounter.hardware: i8254

 * The exact output of the corrected test program below (the original
 would give spurious errors if it didn't run at least once a second,
 which may have been confusing some people if their systems were
 sufficiently loaded).

neptune# ./timetest
1070147310.273323 1070148005.799516 1070147310.273323 1070148005.799516
1070147313.623382 1070148009.299574 1070147313.623382 1070148009.299574
1070147320.223573 1070148015.699766 1070147320.223573 1070148015.699766
1070147351.373946 1070148046.800136 1070147351.373946 1070148046.800136
1070147361.174180 1070148056.700378 1070147361.174180 1070148056.700378
1070147374.524295 1070148070.100490 1070147374.524295 1070148070.100490
1070147378.674361 1070148074.100552 1070147378.674361 1070148074.100552
1070147396.124712 1070148091.700905 1070147396.124712 1070148091.700905
1070147413.724886 1070148109.401081 1070147413.724886 1070148109.401081
1070147447.925401 1070148143.401593 1070147447.925401 1070148143.401593
1070147450.475440 1070148146.101635 1070147450.475440 1070148146.101635
1070147453.175562 1070148148.701758 1070147453.175562 1070148148.701758
1070147474.275892 1070148169.702084 1070147474.275892 1070148169.702084
1070147527.826607 1070148223.302801 1070147527.826607 1070148223.302801
1070147531.876673 1070148227.302864 1070147531.876673 1070148227.302864
1070147531.876665 1070148227.402859 1070147531.876665 1070148227.402859
1070147545.176941 1070148240.703142 1070147545.176941 1070148240.703142
1070147616.077946 1070148311.504135 1070147616.077946 1070148311.504135
1070147616.178065 1070148311.704261 1070147616.178065 1070148311.704261

 * The system status when the problem is observed (i.e. does it only
 occur under load; what else is running at the time)

neptune# uptime
 7:24PM  up 47 days,  6:34, 5 users, load averages: 1.98, 1.23, 0.79
neptune# ps aux | wc -l
 346

Anything else that I can provide, let me know ...



Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664

kernel.gz
Description: Binary data


dmesg.boot.gz
Description: Binary data
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


RE: Corrected gettimeofday() test code

2003-11-29 Thread Don Bowman
From: Kris Kennaway [mailto:[EMAIL PROTECTED]
 
 I forwarded the reports of timecounter problems to phk, and he asked
 that people who are seeing timecounter problems provide FULL details
 of their system configuration, including:
 
 * dmesg
 
 * kernel configuration
 
 * compiler options
 
 * time-related system configuration (whether ntpd/timed/ntpdate is
 running, and if so whether it's correcting for a seriously drifting
 clock)
 
 * The kernel timecounter configuration, e.g. the
 kern.timecounter.method and kern.timecounter.hardware sysctls, and
 whether changing them has any effect.
 
 * The exact output of the corrected test program below (the original
 would give spurious errors if it didn't run at least once a second,
 which may have been confusing some people if their systems were
 sufficiently loaded).
 
 * The system status when the problem is observed (i.e. does it only
 occur under load; what else is running at the time)
 

For this config (below), kern.timecounter.method=0 reproduces the
problem, kern.timecounter.method=1 does not.

Output in 'error' case:
1070147643.248866 1070147651.028646 1070147643.248866 1070147651.028646
1070147656.287818 1070147664.067692 1070147656.287818 1070147664.067692
1070147659.326429 1070147667.106238 1070147659.326429 1070147667.106238
1070147668.370071 1070147676.149884 1070147668.370071 1070147676.149884
1070147681.433111 1070147689.212926 1070147681.433111 1070147689.212926
1070147683.418743 1070147691.198632 1070147683.418743 1070147691.198632

problem shows up within ~30s of starting the test program, and the messages
will come out about once per 1-5s period after that, not regularly.

kern.timecounter.hardware: TSC

on this machine, i have others which are i8254 which do it too.

hw.ncpu=1

compiler flags:
COPTFLAGS= -O2 -pipe -malign-loops=4 -malign-jumps=4 -malign-functions=4
-mcpu=i686 -march=i686 -fno-gcse -g



machine is running 4.7-RELEASE-p2.

dmesg, kernel config attached.


Intel-specific functions:
Version 0673:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 7 - Pentium III/Pentium III Xeon - external L2 cache
Stepping 3
Reserved 0

from cpuid.

ntpd -p /var/run/ntpd.pid -b -g -A
runs.

Clock does not drift much, less than 2s/day. Clock is 
stepped when system boots. There is a chimer on our router 
which broadcasts ntp time every minute. ntpd is not
observed to step clock.

On the 8254 machine [a dual 0f27 xeon, 533 FSB with HTT enabled]
1070147645.119531 1070148340.497729 1070148339.-693880469 1070148340.497729

both systems were unloaded. This was with the corrected program
you included.

--don



dmesg.boot
Description: Binary data
___#
___# GENERIC -- Generic kernel configuration file for FreeBSD/i386
___#
___# For more information on this file, please read the handbook section on
___# Kernel Configuration Files:
___#
___#http://www.FreeBSD.org/handbook/kernelconfig-config.html
___#
___# The handbook is also available locally in /usr/share/doc/handbook
___# if you've installed the doc distribution, otherwise always see the
___# FreeBSD World Wide Web server (http://www.FreeBSD.org/) for the
___# latest information.
___#
___# An exhaustive list of options and more detailed explanations of the
___# device lines is also present in the ./LINT configuration file. If you are
___# in doubt as to the purpose or necessity of a line, check first in LINT.
___#
___# $FreeBSD: src/sys/i386/conf/GENERIC,v 1.246.2.39 2002/03/24 13:19:10 wilko Exp $
___machine  i386
___#cpu I386_CPU
___#cpu I486_CPU
___cpu  I586_CPU
___cpu  I686_CPU
___identTPC
___maxusers 0
___makeoptions  DEBUG=-g#Build kernel with gdb(1) debug symbols
___#options MATH_EMULATE#Support for x87 emulation
___options  INET#InterNETworking
___#options INET6   #IPv6 communications protocols
___options  FFS #Berkeley Fast Filesystem
___options  FFS_ROOT#FFS usable as root device [keep this!]
___options  SOFTUPDATES #Enable FFS soft updates support
___options  UFS_DIRHASH #Improve performance on big directories
___options  MFS #Memory Filesystem
___options  MD_ROOT #MD is a potential root device
___options  NFS #Network Filesystem
___options  NFS_ROOT#NFS usable as root device, NFS required
___options  MSDOSFS #MSDOS Filesystem
___options  CD9660  #ISO 9660 Filesystem
___options  CD9660_ROOT #CD-ROM usable as root, CD9660 required
___options  PROCFS  #Process filesystem
___options  COMPAT_43   #Compatible with BSD 4.3 [KEEP THIS!]
___options  SCSI_DELAY=2000 #Delay (in ms) before probing SCSI
___options  UCONSOLE#Allow users to grab the console
___options  

RE: Corrected gettimeofday() test code

2003-11-29 Thread Marc G. Fournier
On Sat, 29 Nov 2003, Don Bowman wrote:

 For this config (below), kern.timecounter.method=0 reproduces the
 problem, kern.timecounter.method=1 does not.

Can anyone comment on what effect setting method to 1 has on the system?
Like, would one notice a degradation in performance?


Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Corrected gettimeofday() test code

2003-11-29 Thread Poul-Henning Kamp
In message [EMAIL PROTECTED], Marc G. Fournier writes:

Hmm, I'm not saying this makes sense, but at least there is
an emergent pattern...

Col#1 is duration of the shift, sorted in increasing order.
Col#2 is the delta to the line above.
Notice stratification on 50msec boundaries:

695.426189 -
695.426190 0.01
695.426191 0.01
695.426191 0.00
695.426192 0.01

695.476192 0.05
695.476193 0.01
695.476194 0.01

695.526193 0.04
695.526194 0.01
695.526196 0.02
695.526196 0.00
695.526198 0.02
695.526201 0.03

695.576193 0.049992
695.576195 0.02

695.626195 0.05

695.676192 0.049997
695.676195 0.03

The only place I can think of 50msec in relation to the timecounter
is NTIMECOUNTER * 1/HZ.

Try to modify some of that, and see if you can affect the results.

In particular, try to yank NTIMECOUNTER _way_ up, like a thousand
and see if the problem goes away.

Another (uglier) option is that ACPI/SMI/APM or similar BIOS magic
screws up the i8254 on a regular basis, or even that the latch
functionality of the i8254 doesn't work properly.  This is not
unheard off.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Corrected gettimeofday() test code

2003-11-29 Thread Poul-Henning Kamp
In message [EMAIL PROTECTED], Marc G. Fournier writes:
On Sat, 29 Nov 2003, Don Bowman wrote:

 For this config (below), kern.timecounter.method=0 reproduces the
 problem, kern.timecounter.method=1 does not.

Can anyone comment on what effect setting method to 1 has on the system?
Like, would one notice a degradation in performance?

Setting it to 1 forces the timecounter to update every HZ.

You can say that 0 was an optimization which it seems may not be
safe on certain hardware combinations...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Corrected gettimeofday() test code

2003-11-29 Thread Sten
On Sat, 29 Nov 2003, Kris Kennaway wrote:

 I forwarded the reports of timecounter problems to phk, and he asked
 that people who are seeing timecounter problems provide FULL details
 of their system configuration, including:

snip

Please note that there are quite a few broken timecounters out there,
a lot of them break when HZ != 100. I encountered this with a newer
revision of the dell 1550, the old revision was fine in this respect
oddly enough. These problems do however show up quickly when running
ntpd which will complain about a borked system clock every 10 minutes
or so.

Make sure that your hardware is good before blaming software :).

HTH, HAND.

-- 
Sten Spans

There is a crack in everything that's how the light gets in.
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]