Re: NFS corruption on p4 machines (please test)

2003-10-03 Thread Lars Eggert
Kris Kennaway wrote:
> On Fri, Oct 03, 2003 at 10:10:20AM -0700, Lars Eggert wrote:
>> Kris,
>>
>> Kris Kennaway wrote:
>>
>>> For some months now I have been experiencing NFS corruption on the
>>> three machines in the dosirak.kr package cluster - these are SMP
>>> pentium 4 machines that run -CURRENT.  Setting DISABLE_PSE and
>>> DISABLE_PG_G does not fix these problems.  I am able to easily
>>> reproduce these problems using /usr/src/tools/regression/fsx on a
>>> loopback nfs mount - they are not deterministic, but it blows up
>>> within about 8000 operations (less than a minute of operation).  In
>>> fact sometimes it even manages to make fsx segfault, which is fairly
>>> impressive :)
>>>
>>> Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
>>> filesystem for a few minutes.
>>
>> I just ran an fsx cycle on my desktop machine over a TCP mount, and it
>> seemed to work fine:
>
> Thanks.  What hardware specs?

Attached.

Lars
--
Lars Eggert <[EMAIL PROTECTED]>   USC Information Sciences Institute
cam: using minimum scsi_delay (100ms)
Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.1-CURRENT #0: Tue Sep 30 10:11:59 PDT 2003
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/KERNEL-1.31
Preloaded elf kernel "/boot/kernel/kernel" at 0xc06ed000.
Preloaded elf module "/boot/kernel/vesa.ko" at 0xc06ed21c.
Preloaded elf module "/boot/kernel/md.ko" at 0xc06ed2c8.
Preloaded elf module "/boot/kernel/linux.ko" at 0xc06ed370.
Preloaded elf module "/boot/kernel/if_gif.ko" at 0xc06ed41c.
Preloaded elf module "/boot/kernel/if_tun.ko" at 0xc06ed4c8.
Preloaded elf module "/boot/kernel/ipfw.ko" at 0xc06ed574.
Preloaded elf module "/boot/kernel/if_an.ko" at 0xc06ed620.
Preloaded elf module "/boot/kernel/wlan.ko" at 0xc06ed6cc.
Preloaded elf module "/boot/kernel/rc4.ko" at 0xc06ed778.
Preloaded elf module "/boot/kernel/pccard.ko" at 0xc06ed820.
Preloaded elf module "/boot/kernel/if_em.ko" at 0xc06ed8cc.
Preloaded elf module "/boot/kernel/if_fxp.ko" at 0xc06ed978.
Preloaded elf module "/boot/kernel/miibus.ko" at 0xc06eda24.
Preloaded elf module "/boot/kernel/if_lnc.ko" at 0xc06edad0.
Preloaded elf module "/boot/kernel/if_wi.ko" at 0xc06edb7c.
Preloaded elf module "/boot/kernel/if_xl.ko" at 0xc06edc28.
Preloaded elf module "/boot/kernel/snd_emu10k1.ko" at 0xc06edcd4.
Preloaded elf module "/boot/kernel/snd_pcm.ko" at 0xc06edd84.
Preloaded elf module "/boot/kernel/snd_es137x.ko" at 0xc06ede30.
Preloaded elf module "/boot/kernel/snd_ich.ko" at 0xc06edee0.
Preloaded elf module "/boot/kernel/snd_maestro3.ko" at 0xc06edf8c.
Preloaded elf module "/boot/kernel/ugen.ko" at 0xc06ee040.
Preloaded elf module "/boot/kernel/usb.ko" at 0xc06ee0ec.
Preloaded elf module "/boot/kernel/uhid.ko" at 0xc06ee194.
Preloaded elf module "/boot/kernel/ukbd.ko" at 0xc06ee240.
Preloaded elf module "/boot/kernel/ulpt.ko" at 0xc06ee2ec.
Preloaded elf module "/boot/kernel/ums.ko" at 0xc06ee398.
Preloaded elf module "/boot/kernel/umass.ko" at 0xc06ee440.
Preloaded elf module "/boot/kernel/umodem.ko" at 0xc06ee4ec.
Preloaded elf module "/boot/kernel/ucom.ko" at 0xc06ee598.
Preloaded elf module "/boot/kernel/bktr.ko" at 0xc06ee644.
Preloaded elf module "/boot/kernel/bktr_mem.ko" at 0xc06ee6f0.
Preloaded elf module "/boot/kernel/agp.ko" at 0xc06ee7a0.
Preloaded elf module "/boot/kernel/random.ko" at 0xc06ee848.
Preloaded elf module "/boot/kernel/ip_mroute.ko" at 0xc06ee8f4.
Preloaded elf module "/boot/kernel/ip6fw.ko" at 0xc06ee9a4.
Preloaded elf module "/boot/kernel/netgraph.ko" at 0xc06eea50.
Preloaded elf module "/boot/kernel/dummynet.ko" at 0xc06eeb00.
Preloaded elf module "/boot/kernel/radeon.ko" at 0xc06eebb0.
Preloaded elf module "/boot/kernel/r128.ko" at 0xc06eec5c.
Preloaded elf module "/boot/kernel/ahc.ko" at 0xc06eed08.
Preloaded elf module "/boot/kernel/mpt.ko" at 0xc06eedb0.
Preloaded elf module "/boot/kernel/fdc.ko" at 0xc06eee58.
Preloaded elf module "/boot/kernel/cbb.ko" at 0xc06eef00.
Preloaded elf module "/boot/kernel/exca.ko" at 0xc06eefa8.
Preloaded elf module "/boot/kernel/cardbus.ko" at 0xc06ef054.
Preloaded elf module "/boot/kernel/lpt.ko" at 0xc06ef100.
Preloaded elf module "/boot/kernel/ubsa.ko" at 0xc06ef1a8.
Preloaded elf module "/boot/kernel/firewire.ko" at 0xc06ef254.
Preloaded elf module "/boot/kernel/sbp.ko" at 0xc06ef304.
Preloaded elf module "/boot/kernel/smbus.ko" at 0xc06ef3ac.
Preloaded elf module "/boot/kernel/intpm.ko" at 0xc06ef458.
Preloaded elf module "/boot/kernel/smb.ko" at 0xc06ef504.
Preloaded elf module "/boot/kernel/iicbus.ko" at 0xc06ef5ac.
Preloaded elf module "/boot/kernel/iic.ko" at 0xc06ef658.
Preloaded elf module "/boot/kernel/iicsmb.ko" at 0xc06ef700.
Preloaded elf module "/boot/kernel/uart.ko" at 0xc06ef7ac.
Preloaded elf module "/boot/kernel/acpi.ko" at 0xc06ef858.
Timecounter "i8254" frequency 1193121 Hz quality 0
CPU: Intel(R) XEON(TM) CPU 2.40GHz (2372.81-MHz 686-class CPU)
  Origin = "GenuineIntel

Re: NFS corruption on p4 machines (please test)

2003-10-03 Thread Kris Kennaway
On Fri, Oct 03, 2003 at 10:10:20AM -0700, Lars Eggert wrote:
> Kris,
> 
> Kris Kennaway wrote:
> 
> >For some months now I have been experiencing NFS corruption on the
> >three machines in the dosirak.kr package cluster - these are SMP
> >pentium 4 machines that run -CURRENT.  Setting DISABLE_PSE and
> >DISABLE_PG_G does not fix these problems.  I am able to easily
> >reproduce these problems using /usr/src/tools/regression/fsx on a
> >loopback nfs mount - they are not deterministic, but it blows up
> >within about 8000 operations (less than a minute of operation).  In
> >fact sometimes it even manages to make fsx segfault, which is fairly
> >impressive :)
> >
> >Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
> >filesystem for a few minutes.
> 
> I just ran an fsx cycle on my desktop machine over a TCP mount, and it
> seemed to work fine:

Thanks.  What hardware specs?

Kris




Re: NFS corruption on p4 machines (please test)

2003-10-03 Thread Lars Eggert
Lars Eggert wrote:
> Kris Kennaway wrote:
>> Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
>> filesystem for a few minutes.
>
> I just ran an fsx cycle on my desktop machine over a TCP mount, and it
> seemed to work fine:

I should have mentioned that this is a Pentium 4 Xeon SMP machine
running -current.

Lars
--
Lars Eggert <[EMAIL PROTECTED]>   USC Information Sciences Institute




Re: NFS corruption on p4 machines (please test)

2003-10-03 Thread Lars Eggert
Kris,

Kris Kennaway wrote:

> For some months now I have been experiencing NFS corruption on the
> three machines in the dosirak.kr package cluster - these are SMP
> pentium 4 machines that run -CURRENT.  Setting DISABLE_PSE and
> DISABLE_PG_G does not fix these problems.  I am able to easily
> reproduce these problems using /usr/src/tools/regression/fsx on a
> loopback nfs mount - they are not deterministic, but it blows up
> within about 8000 operations (less than a minute of operation).  In
> fact sometimes it even manages to make fsx segfault, which is fairly
> impressive :)
>
> Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
> filesystem for a few minutes.

I just ran an fsx cycle on my desktop machine over a TCP mount, and it
seemed to work fine:
[EMAIL PROTECTED]: /usr/src/tools/regression/fsx] ./fsx /tmp/nfs/x
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
truncating to largest ever: 0x3fe96
truncating to largest ever: 0x3ff9d
truncating to largest ever: 0x3
skipping zero size read
skipping zero size write
skipping zero size write
^Csignal 2
testcalls = 166863
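
To compare a clean run such as the one above with a failing one, the counters in the output can be pulled out mechanically. A small sketch, assuming only the output format shown in this thread ("truncating to largest ever: 0x..." and "testcalls = N"); summarize_fsx_log is a hypothetical helper and not part of fsx.

```python
import re


def summarize_fsx_log(text):
    """Extract the largest truncate size and the testcalls count from fsx output.

    Hypothetical helper: it only understands the two line formats quoted
    in this thread and ignores everything else.
    """
    sizes = [
        int(value, 16)
        for value in re.findall(r"truncating to largest ever: 0x([0-9a-fA-F]+)", text)
    ]
    calls = re.search(r"testcalls = (\d+)", text)
    return {
        "largest_truncate": max(sizes) if sizes else None,
        "testcalls": int(calls.group(1)) if calls else None,
    }


if __name__ == "__main__":
    sample = """\
truncating to largest ever: 0x3fe96
truncating to largest ever: 0x3ff9d
skipping zero size read
^Csignal 2
testcalls = 166863
"""
    print(summarize_fsx_log(sample))
```

A run that dies early (for instance on an ftruncate error) prints no testcalls line, so that field comes back as None.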
Lars
--
Lars Eggert <[EMAIL PROTECTED]>   USC Information Sciences Institute




NFS corruption on p4 machines (please test)

2003-10-02 Thread Kris Kennaway
For some months now I have been experiencing NFS corruption on the
three machines in the dosirak.kr package cluster - these are SMP
pentium 4 machines that run -CURRENT.  Setting DISABLE_PSE and
DISABLE_PG_G does not fix these problems.  I am able to easily
reproduce these problems using /usr/src/tools/regression/fsx on a
loopback nfs mount - they are not deterministic, but it blows up
within about 8000 operations (less than a minute of operation).  In
fact sometimes it even manages to make fsx segfault, which is fairly
impressive :)

Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
filesystem for a few minutes.

e.g.:
dosirak# fsx foo
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
ftruncate1: 30cc3
dotruncate: ftruncate: Permission denied

Is anyone else able to test this?  The three machines I see this on
have the same hardware specs, so it may be an interaction with certain
hardware.
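
For anyone who cannot build fsx, the kind of check it performs can be approximated roughly as follows. This is not the fsx source. It is a minimal Python sketch, assuming only that the test applies random writes, reads and truncates to one file and compares the result with a copy held in memory. The names exercise, max_size and seed are invented for this sketch, and os.pread/os.pwrite assume a POSIX system.

```python
import os
import random
import tempfile


def exercise(path, ops=2000, max_size=0x40000, seed=1):
    """Apply random write/read/truncate operations to `path`.

    Every operation is mirrored in an in-memory model. An AssertionError
    is raised on the first disagreement between the file and the model.
    Returns the number of operations attempted.
    """
    rng = random.Random(seed)
    model = bytearray()  # what the file is expected to contain
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(ops):
            op = rng.choice(("write", "read", "truncate"))
            if op == "write":
                off = rng.randrange(max_size)
                length = rng.randrange(1, min(4096, max_size - off) + 1)
                data = rng.getrandbits(8 * length).to_bytes(length, "little")
                if off > len(model):
                    # Writing past EOF leaves a hole that reads back as zeros.
                    model.extend(bytes(off - len(model)))
                model[off:off + length] = data
                written = os.pwrite(fd, data, off)
                assert written == length, "short write"
            elif op == "read":
                if not model:
                    continue  # nothing to read from an empty file
                off = rng.randrange(len(model))
                length = rng.randrange(1, min(4096, len(model) - off) + 1)
                got = os.pread(fd, length, off)
                expected = bytes(model[off:off + length])
                assert got == expected, "data mismatch at offset 0x%x" % off
            else:
                size = rng.randrange(max_size)
                os.ftruncate(fd, size)
                if size < len(model):
                    del model[size:]
                else:
                    model.extend(bytes(size - len(model)))
            assert os.fstat(fd).st_size == len(model), "size mismatch"
        return ops
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(exercise(os.path.join(tmp, "foo")))
```

Pointing path at a file on the loopback NFS mount would be the rough analogue of 'fsx foo'. A single-threaded Python loop exercises the system far less aggressively than fsx does, so a clean run here says little about whether the corruption is present.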

Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.1-CURRENT #0: Fri Sep 26 20:23:51 KST 2003
[EMAIL PROTECTED]:/usr/obj/d/src/sys/DALKI
Preloaded elf kernel "/boot/kernel/kernel" at 0xc0588000.
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) XEON(TM) CPU 2.20GHz (2199.94-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf24  Stepping = 4
  Features=0x3febfbff
  Hyperthreading: 2 logical CPUs
real memory  = 2147418112 (2047 MB)
avail memory = 2084302848 (1987 MB)
Programming 16 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 16 pins in IOAPIC #1
Programming 16 pins in IOAPIC #2
Programming 16 pins in IOAPIC #3
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): apic id:  0, version: 0x00050014, at 0xfee0
 cpu1 (AP):  apic id:  1, version: 0x00050014, at 0xfee0
 cpu2 (AP):  apic id:  2, version: 0x00050014, at 0xfee0
 cpu3 (AP):  apic id:  3, version: 0x00050014, at 0xfee0
 io0 (APIC): apic id:  8, version: 0x000f0011, at 0xfec0
 io1 (APIC): apic id:  9, version: 0x000f0011, at 0xfec01000
 io2 (APIC): apic id: 10, version: 0x000f0011, at 0xfec02000
 io3 (APIC): apic id: 11, version: 0x000f0011, at 0xfec03000
Pentium Pro MTRR support enabled
ACPI-0660: *** Warning: Type override - [DEB_] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [MLIB] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [IO__] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [DATA] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [ICNT] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [ACPI] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [IORG] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [BIOS] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [CMOS] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [KBC_] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [OEM_] had invalid type (Integer) for Scope operator, changed to (Scope)
acpi0:  on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
pcibios: BIOS version 2.10
Using $PIR table, 7 entries at 0xc00f4a70
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
acpi_cpu0:  on acpi0
acpi_cpu1:  on acpi0
acpi_cpu2:  on acpi0
acpi_cpu3:  on acpi0
acpi_cpu4:  on acpi0
acpi_cpu5:  on acpi0
acpi_cpu6:  on acpi0
acpi_cpu7:  on acpi0
acpi_button0:  on ac