RE: Problems in VM structure ?

1999-02-18 Thread tcobb
I've adjusted MAXUSERS to 128 on my heavily loaded PIIs and the crashes
have not re-occurred for 24 hours now.  (Had to adjust NMBCLUSTERS up,
though)
The panics were happening every 5-8 hours like clockwork prior to this.

I believe these crashes are caused by heavy network traffic rather than
high load averages, so a make world may not trigger them.  Then again, I
couldn't force a crash when I hit the box hard with web traffic during
testing, so it is probably a combination of factors.

Another clue is the fact that I can't seem to get a Pentium (P5) to crash
at all, ever, even when running exactly the same kernel config.  
The Pentium IIs fell over like crazy.


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net


To Unsubscribe: send mail to majord...@freebsd.org
with unsubscribe freebsd-current in the body of the message



RE: RE: Problems in VM structure ?

1999-02-17 Thread tcobb


   -Original Message-
   From: Matthew Dillon [mailto:dil...@apollo.backplane.com]
   :What's the chance that our kernel adaptations for PIIs
   :are partly at fault?
   :
   :-Troy Cobb
   : Circle Net, Inc.
   : http://www.circle.net
   
   With what config?  Have you tried reducing maxusers to 128?
   
   -Matt


I've had it at MAXUSERS=256 on both the P5 and the P6.  The P5 stays
stable, the P6 doesn't.  If I reduce MAXUSERS to 128 then these
heavily loaded boxen will fall over due to out of MBUFs errors, or
so I believe.

I'd love to find some real kernel-tuning documentation out there.
One of my panics is "pipeinit: cannot allocate pipe -- out of kvm",
and I can't pull a crashdump due to a DSCHECK error because my
SWAP is > 2GB.


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net





RE: Problems in VM structure ?

1999-02-16 Thread tcobb
I'm seeing different responses depending on hardware.

On regular Pentium 166 machines, I almost NEVER get
a panic.  On brand-new Pentium II 350s, I get a panic
every 6-9 hours.  This happens when both kernels are
configured the same for maxusers.  It happens when
both machines are under the same load level -- the
P5 stays rock solid, the P6 flakes out.

What's the chance that our kernel adaptations for PIIs
are partly at fault?


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net

   -Original Message-
   From: Brian Feldman [mailto:gr...@unixhelp.org]
   Sent: Tuesday, February 16, 1999 7:48 AM
   To: Matthew Dillon
   Cc: Khetan Gajjar; curr...@freebsd.org
   Subject: Re: Problems in VM structure ?
   
   
   On Tue, 16 Feb 1999, Matthew Dillon wrote:
   
   > :maxusers 256
   >
   > Try reducing maxusers to 128.  Another person reported similar behavior
   > to me and after a bunch of work he tried going back to a basic
   > distribution -- and everything started working again.
   >
   > It turned out that a maxusers value of 256 and 512 were causing his machine
   > to go poof, but a maxusers value of 128 worked fine.
   >
   > I haven't tracked the problem down yet.  Please try reducing your maxusers
   > to 128 and email the results to current.
   
   For what it's worth, my maxusers is 250 and my system is quite stable, even
   during a make -j25 buildworld.
   
   > -Matt



   
   Brian Feldman                  _ __        ___ ___ ___
   gr...@unixhelp.org         _ __ ___       | _ ) __|   \
   http://www.freebsd.org/    _ __ ___       | _ \__ \ |) |
   FreeBSD: The Power to Serve!  _ __ ___  _ |___/___/___/
   
   




panics deciphering VMSTAT output

1999-02-14 Thread tcobb
I've been trying to track down a regular, but not
manually reproducible crash in 3.0-BETA (19990205).

I can't get a crashdump due to a DSCHECK negative
number bug.  I think my swap space of 3+GB is
too large for it.  So, I've been having it send
me the output of vmstat -m every 15 minutes to
try to track it down.  The couple of times I've
seen the panic message on the console, it was
typically, but not always:
pipeinit: cannot allocate pipe - out of kvm

It has been happening approximately every 6 hours
on a heavily loaded server with 100+ chrooted daemons
and NFS.

In short: so that I can compare my other vmstat outputs
to the one I captured 3 minutes before the last crash,
can anyone tell me what the vmstat entries mean? :)
A quick legend or tutorial would be helpful, and I'll
turn it into a FAQ for the documentation project, too.


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net



RE: 3c905B stops responding during ifconfig alias

1999-02-11 Thread tcobb
Bill,
Your patch worked perfectly.  THANK YOU!

By the way, I'd be happy to fedex a 3c905B to
you for your use in testing these sorts of things
if that would be helpful.  We have a fairly large
commitment to this card now (40+) and I'd do this
happily to facilitate continuing performance 
enhancements or other improvements to it.

Sincerely,

-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net

   -Original Message-
   From: wp...@ctr.columbia.edu [mailto:wp...@ctr.columbia.edu]
   My apologies for not replying to you on this sooner; it took me a while
   to locate a machine with which I could do some testing (all the 3c905B
   hardware I have is in the form of embedded chipsets in Dell desktop
   machines, and they've been moving around on me a lot).
 
   > This does NOT happen on the:
   > xl0: <3Com 3c905 Fast Etherlink XL 10/100BaseTX> rev 0x00 int a irq 10 on
   
   I think I found the problem. Currently, xl_stop() and xl_init() both
   issue RX and TX resets. Seems logical doesn't it? I mean, the purpose
   of xl_init() is to put the NIC into a known good state, and the purpose
   of xl_stop() is to slap it in the face and make it shut up ASAP. The
   difference between the 3c905 and the 3c905B (well, the important
   difference in this case) is that the 3c905B's chipset has a built-in PHY,
   while the 3c905 requires an external one (3Com uses a DP83840A for the
   3c905 boards, judging by the one sample 3c905 card I have). Apparently,
   issuing the RX and TX reset commands on the 3c905B causes it to also
   reset the PHY, which causes the PHY to restart its autonegotiation session
   with its link partner. It takes a few seconds for the autoneg session to
   finish, and during this time the 3c905B stops receiving packets.
   
   This doesn't happen on the 3c905 because issuing the RX and TX reset
   commands does not have any effect on the external PHY: the only way
   to reset the PHY is by writing to the PHY's basic mode control register
   via the MII management interface.
   
   I'm including a patch which should fix this problem. It just disables
   the code that does the reset in both xl_stop() and xl_init(). Please
   try this and let me know if it helps.
   
   To apply the patch, do the following:
   
   - Make sure you have the kernel source code installed under /usr/src.
   - Save this message to a file, e.g. /tmp/xl.patch
   - Become root.
   - Run the following commands:
   # cd /sys/pci
   # patch < /tmp/xl.patch
   - Compile a new kernel and boot it.
   
   This patch was generated using a version of if_xl.c from FreeBSD-current,
   but it should work on any version of the driver with only a couple of
   mild warnings.
   
   -Bill
   
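   -- 
   =============================================================================
   -Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
   Work:         wp...@ctr.columbia.edu | Center for Telecommunications Research
   Home:  wp...@skynet.ctr.columbia.edu | Columbia University, New York City
   =============================================================================
   Mulder, toads just fell from the sky! I guess their parachutes didn't open.
   =============================================================================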
   
   *** ../CVSWORK/sys_pci/if_xl.c  Mon Feb  1 16:25:52 1999
   --- if_xl.c Thu Feb 11 18:34:39 1999
   ***************
   *** 2363,2373 ****
   --- 2363,2375 ----
     	for (i = 0; i < 3; i++)
     		CSR_WRITE_2(sc, XL_W2_STATION_MASK_LO + (i * 2), 0);
     
   + #ifdef notdef
     	/* Reset TX and RX. */
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_RX_RESET);
     	xl_wait(sc);
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_TX_RESET);
     	xl_wait(sc);
   + #endif
     
     	/* Init circular RX list. */
     	if (xl_list_rx_init(sc) == ENOBUFS) {
   ***************
   *** 2715,2724 ****
   --- 2717,2728 ----
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_TX_DISABLE);
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_COAX_STOP);
     	DELAY(800);
   + #ifdef notdef
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_RX_RESET);
     	xl_wait(sc);
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_TX_RESET);
     	xl_wait(sc);
   + #endif
     	CSR_WRITE_2(sc, XL_COMMAND, XL_CMD_INTR_ACK|XL_STAT_INTLATCH);
     
     	/* Stop the stats updater. */
   



Tracking a Fatal Double Fault

1999-02-08 Thread tcobb
Can someone please give me a short guide
on how to track down a fatal double fault?
System is 3.0-19990205-STABLE, and I've written
down the fault info.

Thanks,

-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net



RE: Tracking a Fatal Double Fault

1999-02-08 Thread tcobb
The machine is running a custom kernel, but nothing
very unusual.  My instinct is that it may be related to
the 3c905B 3Com cards that I reported on earlier.  I'm
trying Intel EtherExpresses right now and getting no
faults.

The double-fault does not occur consistently, unfortunately,
and typically only occurs during my rc.local stuff (loading
a bunch (100+) of chrooted daemons) on boot-up.

Would the eip/esp/ebp values be worth sending?


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net

   -Original Message-
   From: Mike Smith [mailto:m...@smith.net.au]
   Sent: Monday, February 08, 1999 6:55 PM
   To: tc...@staff.circle.net
   Cc: curr...@freebsd.org
   Subject: Re: Tracking a Fatal Double Fault 
   
   
   > Can someone please give me a short guide
   > on how to track down a fatal double fault?
   > System is 3.0-19990205-STABLE, and I've written
   > down the fault info.
   
   Ack.  It's actually pretty difficult.  You can start by trying to
   locate the PC for the fault in the kernel image, but the typical cause
   of a double fault is running out of kernel stack.
   
   Are you running any custom kernel code?
   
   -- 
   \\  Sometimes you're ahead,   \\  Mike Smith
   \\  sometimes you're behind.  \\  m...@smith.net.au
   \\  The race is long, and in the  \\  msm...@freebsd.org
   \\  end it's only with yourself.  \\  msm...@cdrom.com
   
   



RE: Tracking a Fatal Double Fault

1999-02-08 Thread tcobb
So a double-fault is always a kernel stack problem?

I find it suspicious that this same machine
also had trouble with the 3c905B flaking out --
dropping packets during an ifconfig alias, and
sometimes never reactivating the interface
according to what tcpdump shows.

The 3c905B problem repeats itself on EVERY machine
that I've installed them into (7 or so).  The double-faults,
by contrast, are infrequent, show up on some of the busier
machines, and almost always occur during the initial boot process.


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net

   -Original Message-
   From: Mike Smith [mailto:m...@smith.net.au]
   There's nothing immediately obvious in the xl driver that would suggest
   that it uses excessive kernel stack either.  8(  Maybe someone has some
   clues on measuring stack usage (or simply on how to increase the kernel
   stack allocation...).
   
   
   -- 
   \\  Sometimes you're ahead,   \\  Mike Smith
   \\  sometimes you're behind.  \\  m...@smith.net.au
   \\  The race is long, and in the  \\  msm...@freebsd.org
   \\  end it's only with yourself.  \\  msm...@cdrom.com
   
   



3c905B stops responding during ifconfig alias

1999-02-06 Thread tcobb
This happens in -current and -stable.

Machine:
CPU: Pentium II (quarter-micron) (350.80-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x652  Stepping=2
  Features=0x183fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,<b24>>
real memory  = 402653184 (393216K bytes)
config> quit
avail memory = 388808704 (379696K bytes)
...
xl0: <3Com 3c905B Fast Etherlink XL 10/100BaseTX> rev 0x30 int a irq 9 on pci0.18.0
...

During an ifconfig xl0 alias, the xl0 interface drops packets.
It does NOT generate errors (netstat -in).
In fact, on several occasions I've seen it go completely
unresponsive (not responding to arp requests) until kicked back
to life by outbound packets.

This does NOT happen on the:
xl0: <3Com 3c905 Fast Etherlink XL 10/100BaseTX> rev 0x00 int a irq 10 on pci0.18.0

Here's a quick test program that I use:

#!/usr/bin/perl

# call this as just_alias.pl class_c num_ips
# class_c is the first three octets with a trailing dot, e.g. 209.95.67.
$ip_base=$ARGV[0] || "209.95.67.";
$num=$ARGV[1] || 250;

for (1..$num)
{
	print "aliasing for $ip_base".$_."\n";
	system ("ifconfig xl0 alias $ip_base".$_." netmask 255.255.255.255");
	# the sleep command allows us to see the problem more
	# clearly, though it does happen w/o the sleep here...
	sleep 1;
}


-Troy Cobb
 Circle Net, Inc.
 http://www.circle.net
