11-STABLE build failure in gnu/usr.bin/binutils/ld on recent -CURRENT

2020-07-26 Thread Don Lewis
I ran into another problem updating my 11-STABLE poudriere jails on my
package build machine, which runs a fairly recent version of -CURRENT.

If I try to cross build:

  -O2 -pipe -DBFD_DEFAULT_TARGET_SIZE=64 -I. -I/tmp/src11/gnu/usr.bin/binutils/ld
  -I/tmp/src11/gnu/usr.bin/binutils/ld/../libbfd
  -I/usr/obj/tmp/src11/gnu/usr.bin/binutils/ld/../libbfd
  -I/tmp/src11/gnu/usr.bin/binutils/ld/../../../../contrib/binutils/include
  -DTARGET=\"x86_64-unknown-freebsd\" -DDEFAULT_EMULATION=\"elf_x86_64_fbsd\"
  -DSCRIPTDIR=\"/usr/libdata\" -DBFD_VERSION_STRING=\""2.17.50 [FreeBSD] 2007-07-03"\"
  -DBINDIR=\"/usr/bin\" -DTARGET_SYSTEM_ROOT=\"/\" -DTOOLBINDIR=\"//usr/bin/libexec\"
  -D_GNU_SOURCE -I/tmp/src11/gnu/usr.bin/binutils/ld/../../../../contrib/binutils/ld
  -I/tmp/src11/gnu/usr.bin/binutils/ld/../../../../contrib/binutils/bfd -g -MD
  -MF.depend.ldlex.o -MTldlex.o -std=gnu99 -fstack-protector-strong -Wsystem-headers
  -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes
  -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign
  -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable
  -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality
  -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef
  -Wno-address-of-packed-member -Qunused-arguments -c ldlex.c -o ldlex.o
ldlex.c:3216:3: error: incompatible pointer types passing 'int *' to parameter
  of type 'yy_size_t *' (aka 'unsigned long *')
  [-Werror,-Wincompatible-pointer-types]
  ...YY_INPUT( (&YY_CURRENT_BUFFER_LVALUE->yy_ch_buf[number_to_move]),
     ^
/tmp/src11/gnu/usr.bin/binutils/ld/../../../../contrib/binutils/ld/ldlex.l:64:54: note:
  expanded from macro 'YY_INPUT'
#define YY_INPUT(buf,result,max_size) yy_input (buf, &result, max_size)
                                                      ^~~~~~~
/tmp/src11/gnu/usr.bin/binutils/ld/../../../../contrib/binutils/ld/ldlex.l:73:42: note:
  passing argument to parameter here
static void yy_input (char *, yy_size_t *, yy_size_t);
                              ^
1 error generated.
*** Error code 1


The problem is that the skeleton defines yy_n_chars as type 'int'
instead of type 'yy_size_t'.  That's a bit of a puzzle because it is
defined as 'yy_size_t' in usr.bin/lex/initskel.c.
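
For anyone not staring at the generated scanner, here is a minimal,
self-contained sketch of the same type mismatch (hypothetical names, not
the actual flex output); compiling it with -Wall -Werror trips the same
-Wincompatible-pointer-types error:

  /* yy_mismatch.c -- illustrative only */
  typedef unsigned long yy_size_t;  /* matches the "aka 'unsigned long *'" above */

  /* modeled on the yy_input() prototype quoted in the error output */
  static void yy_input_count(yy_size_t *result) { *result = 0; }

  int
  main(void)
  {
          yy_size_t n_old = 0;    /* type the older skeleton used for yy_n_chars */
          int n_new = 0;          /* type the flex 2.6.4 skeleton reportedly uses */

          yy_input_count(&n_old); /* fine */
          yy_input_count(&n_new); /* 'int *' vs 'yy_size_t *': error with -Werror */
          return (0);
  }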

If I force lex to always be built as a bootstrap tool, then I get a
successful build, so it looks like the host version of lex is getting
used by default.

I think this is a new problem when the build host is -CURRENT. This
commit:
  
  r362333 | jkim | 2020-06-18 11:09:16 -0700 (Thu, 18 Jun 2020) | 4 lines
  
  MFV:  r362286
  
  Merge flex 2.6.4.
  
  
changes the type of yy_n_chars from 'yy_size_t' to 'int'.

I'm not sure what the best fix for this is.


Re: 11-STABLE and 12-STABLE build failures

2020-07-25 Thread Don Lewis
On 25 Jul, Yuri Pankov wrote:
> Don Lewis wrote:
>> On 25 Jul, Don Lewis wrote:
>>> On 25 Jul, Don Lewis wrote:
>>>> On 25 Jul, Warner Losh wrote:
>>>>> Liby.a was retired. Maybe there is some dangling references?
>>>>
>>>> # grep yydebug *
>>>> localedef.c:   yydebug = 0;
>>>> localedef.h:extern int yydebug;
>>>>
>>>> I see the same in the 13-CURRENT source and it builds successfully.
>>>
>>> If I diff the two source trees, I see that the 13-CURRENT references are
>>> guarded by #if YYDEBUG.
>> 
>> Looks like we need this MFC:
>> 
>> %svn log -c 362569
>> 
>> r362569 | jkim | 2020-06-23 19:08:08 -0700 (Tue, 23 Jun 2020) | 2 lines
>> 
>> Fix build with recent byacc.
>> 
>> 
> 
> Just checked that I see this too trying to build stable/12 on recent 
> head, failing while bootstrapping tools where we use the host system 
> binaries.

I just did the MFC, so it should be fixed now.



Re: 11-STABLE and 12-STABLE build failures

2020-07-25 Thread Don Lewis
On 25 Jul, Don Lewis wrote:
> On 25 Jul, Don Lewis wrote:
>> On 25 Jul, Warner Losh wrote:
>>> Liby.a was retired. Maybe there is some dangling references?
>> 
>> # grep yydebug *
>> localedef.c: yydebug = 0;
>> localedef.h:extern int yydebug;
>> 
>> I see the same in the 13-CURRENT source and it builds successfully.
> 
> If I diff the two source trees, I see that the 13-CURRENT references are
> guarded by #if YYDEBUG.

Looks like we need this MFC:

%svn log -c 362569

r362569 | jkim | 2020-06-23 19:08:08 -0700 (Tue, 23 Jun 2020) | 2 lines

Fix build with recent byacc.





Re: 11-STABLE and 12-STABLE build failures

2020-07-25 Thread Don Lewis
On 25 Jul, Don Lewis wrote:
> On 25 Jul, Warner Losh wrote:
>> Liby.a was retired. Maybe there is some dangling references?
> 
> # grep yydebug *
> localedef.c:  yydebug = 0;
> localedef.h:extern int yydebug;
> 
> I see the same in the 13-CURRENT source and it builds successfully.

If I diff the two source trees, I see that the 13-CURRENT references are
guarded by #if YYDEBUG.
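
For reference, the guard amounts to something like this (a sketch, not
the exact 13-CURRENT diff):

  /* localedef.h */
  #if YYDEBUG
  extern int yydebug;
  #endif

  /* localedef.c, in main() */
  #if YYDEBUG
          yydebug = 0;
  #endif

Presumably recent byacc only provides yydebug when YYDEBUG is defined,
so the guard keeps localedef from referencing a symbol the generated
parser no longer supplies.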



Re: 11-STABLE and 12-STABLE build failures

2020-07-25 Thread Don Lewis
On 25 Jul, Warner Losh wrote:
> Liby.a was retired. Maybe there is some dangling references?

# grep yydebug *
localedef.c:yydebug = 0;
localedef.h:extern int yydebug;

I see the same in the 13-CURRENT source and it builds successfully.

> On Sat, Jul 25, 2020, 2:05 PM Don Lewis  wrote:
> 
>> I'm seeing this failure when building 11-STABLE and 12-STABLE poudriere
>> jails:
>>
>> --- localedef.full ---
>> cc -O2 -pipe -fno-common -I.
>> -I/var/poudriere/jails/12STABLEamd64/usr/src/usr.bin/localedef
>> -I/var/poudriere/jails/12STABLEamd64/usr/src/lib/libc/locale
>> -I/var/poudriere/jails/12STABLEamd64/usr/src/lib/libc/stdtime -g -std=gnu99
>> -Qunused-arguments
>> -I/usr/obj/var/poudriere/jails/12STABLEamd64/usr/src/amd64.amd64/tmp/legacy/usr/include
>> -static
>> -L/usr/obj/var/poudriere/jails/12STABLEamd64/usr/src/amd64.amd64/tmp/legacy/usr/lib
>> -o localedef.full charmap.o collate.o ctype.o localedef.o messages.o
>> monetary.o numeric.o parser.o scanner.o time.o wide.o  -legacy
>> ld: error: undefined symbol: yydebug
>> >>> referenced by localedef.c:276
>> (/var/poudriere/jails/12STABLEamd64/usr/src/usr.bin/localedef/localedef.c:276)
>> >>>   localedef.o:(main)
>> cc: error: linker command failed with exit code 1 (use -v to see
>> invocation)
>> *** [localedef.full] Error code 1
>>
>> I hadn't done a build in a while, so I don't know how long this has been
>> broken.  My builds last night failed and things are still broken right
>> now.
>>



11-STABLE and 12-STABLE build failures

2020-07-25 Thread Don Lewis
I'm seeing this failure when building 11-STABLE and 12-STABLE poudriere
jails:

--- localedef.full ---
cc -O2 -pipe -fno-common -I. 
-I/var/poudriere/jails/12STABLEamd64/usr/src/usr.bin/localedef 
-I/var/poudriere/jails/12STABLEamd64/usr/src/lib/libc/locale 
-I/var/poudriere/jails/12STABLEamd64/usr/src/lib/libc/stdtime -g -std=gnu99 
-Qunused-arguments 
-I/usr/obj/var/poudriere/jails/12STABLEamd64/usr/src/amd64.amd64/tmp/legacy/usr/include
  -static  
-L/usr/obj/var/poudriere/jails/12STABLEamd64/usr/src/amd64.amd64/tmp/legacy/usr/lib
 -o localedef.full charmap.o collate.o ctype.o localedef.o messages.o 
monetary.o numeric.o parser.o scanner.o time.o wide.o  -legacy
ld: error: undefined symbol: yydebug
>>> referenced by localedef.c:276 
>>> (/var/poudriere/jails/12STABLEamd64/usr/src/usr.bin/localedef/localedef.c:276)
>>>   localedef.o:(main)
cc: error: linker command failed with exit code 1 (use -v to see invocation)
*** [localedef.full] Error code 1

I hadn't done a build in a while, so I don't know how long this has been
broken.  My builds last night failed and things are still broken right
now.



Re: ZFS...

2019-05-05 Thread Don Lewis
On  3 May, Michelle Sullivan wrote:
> 
> 
> Michelle Sullivan
> http://www.mhix.org/
> Sent from my iPad
> 
>> On 03 May 2019, at 03:18, N.J. Mann  wrote:
>> 
>> Hi,
>> 
>> 
>> On Friday, May 03, 2019 03:00:05 +1000 Michelle Sullivan
>>  wrote:
>>>>> I am sorry to hear about your loss of data, but where does the
>>>>> 11kV come from? I can understand 415V, i.e. two phases in contact,
>>>>> but the type of overhead lines in the pictures you reference are
>>>>> three phase each typically 240V to neutral and 415V between two
>>>>> phases.
>>>>>
>>>> Bottom lines on the power pole are normal 240/415 .. top lines are
>>>> the 11KV distribution network.
>>> 
>>> Oh and just so you know,  it’s sorta impossible to get 415 down a
>>> 240v connection
>> 
>> No it is not.  As I said, if two phases come into contact you can
>> have 415v between live and neutral.
>> 
>> 
> 
> You’re not an electrician then..  the connection point on my house has
> the earth connected to the return on the pole and that also connected
> to the ground stake (using 16mm copper).  You’d have to cut that link
> before dropping a phase on the return to get 415 past the distribution
> board... sorta impossible... cut the ground link first then it’s
> possible... but as every connection has the same, that’s a lot of
> ground links to cut to make it happen... unless you drop the return on
> both sizes of your pole and your ground stake and then drop a phase on
> that floating terminal ...

A friend had a similar catastrophic UPS failure several years ago.  In
her case utility power was 120V single-phase, or 240V hot to hot.
Neutral was bonded to ground at the meter box.  Under normal
circumstances, any current imbalance between the two hot legs returns to
the utility distribution transformer center tap over the neutral wire.
In her case, the neutral connection failed at the pole end of her power
line.  In that case, the imbalance current was forced to return via the
ground rod outside her house and then through some combination of the
ground rods at neighboring houses and the transformer ground connection
at the base of the pole.  Any resistance in this path will reduce the
hot to neutral voltage of the heavily loaded side and increase the
voltage by the same amount on the lightly loaded side.  Fire code
specifies a maximum 25 ohm ground resistance, but it seems this is
seldom actually measured.  In addition her house was old, so there is no
telling what the ground resistance actually was.  If we assume a 25 ohm
resistance, it only takes 1 amp of imbalance current to increase the
voltage on the lightly loaded side by 25V.  At that rate, it doesn't
require much to exceed the continuous maximum voltage rating of the
protective MOVs in the UPS. Once you get past that point, the magic
smoke escapes.
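
For concreteness, a tiny sketch of the arithmetic behind that 25V figure
(using the 120V legs and the 25 ohm fire-code maximum mentioned above):

  /* neutral_shift.c -- Ohm's law applied to a lost neutral */
  #include <stdio.h>

  int
  main(void)
  {
          double r_ground = 25.0;         /* ohms, assumed ground resistance */
          double i_imbalance = 1.0;       /* amps of imbalance between the hot legs */
          double shift = i_imbalance * r_ground;  /* V = I * R */

          /* lightly loaded leg rises, heavily loaded leg drops by the same amount */
          printf("legs shift from 120 V to roughly %.0f V and %.0f V\n",
              120.0 + shift, 120.0 - shift);
          return (0);
  }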

The UPS was actually a spare that I had lent her.  I thought about
repairing it by replacing the MOVs after I got it back from her, but I
abandoned that plan after I opened the UPS and found this insides were
heavily coated with a layer of conductive-looking soot.  Two of the MOVs
were pretty much obliterated.  The third was intact, but charred a bit
by its neighbors.



Re: Congratulations on the Silver Anniversery Edition of FreeBSD

2018-12-13 Thread Don Lewis
On 12 Dec, Kevin Oberman wrote:
> I worked with 4.2BSD and 4.3BSD back in the 70s and 80s. I moved to other
> OSes (Digital RT, RSX and VMS an Varian Vortex) and returned in about 1999
> to FreeBSD 3.0. I've been using FreeBSD ever since. It's been wonderful and
> I am so grateful to all of the people who have keeping it going over the
> years. I try to contribute when I can, but I am not a coder, so it's in
> other ways.

Wow, I haven't heard Varian Vortex mentioned in a long time.  The first
real computer that I got to play with was a Varian, back in the
mid-70's.  I think I still have some of the manuals.  After that it was
Univac Chi/OS, Digital RT, Harris Vulcan / VOS, Masscomp RTU, Sun SunOS
/ Solaris, a bit of NetBSD, FreeBSD starting somewhere in the 2.2.x
timeframe, and more recently CentOS, Fedora, and Debian.



Re: early boot netisr_init() panic on older AMD SMP machine with recent 11-STABLE

2018-10-10 Thread Don Lewis
On  9 Oct, Don Lewis wrote:
> My desktop machine has an older AMD SMP CPU and tracks 11-STABLE.  For
> about six months or so it frequently panics early in boot.  If I retry a
> sufficient number of times I can get a successful boot, but this is
> rather annoying.
> 
> A normal boot looks like this:
> 
> Copyright (c) 1992-2018 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 11.2-STABLE #16 r339017M: Sat Sep 29 19:18:41 PDT 2018
> d...@mousie.catspoiler.org:/usr/obj/usr/src/sys/GENERICDDB amd64
> FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 
> 6.0.1
> )
> WARNING: WITNESS option enabled, expect reduced performance.
> VT(vga): resolution 640x480
> CPU: AMD Athlon(tm) II X3 450 Processor (3214.60-MHz K8-class CPU)
>   Origin="AuthenticAMD"  Id=0x100f53  Family=0x10  Model=0x5  Stepping=3
>   
> Features=0x178bfbff MOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
>   Features2=0x802009
>   AMD 
> Features=0xee500800>
>   AMD 
> Features2=0x37ff SKINIT,WDT>
>   SVM: NP,NRIP,NAsids=64
>   TSC: P-state invariant
> real memory  = 34359738368 (32768 MB)
> avail memory = 33275473920 (31733 MB)
> Event timer "LAPIC" quality 100
> ACPI APIC Table: 
> FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
> FreeBSD/SMP: 1 package(s) x 3 core(s)
> ioapic0: Changing APIC ID to 2
> ioapic0  irqs 0-23 on motherboard
> SMP: AP CPU #1 Launched!
> SMP: AP CPU #2 Launched!
> Timecounter "TSC-low" frequency 1607298818 Hz quality 800
> random: entropy device external interface
> [SNIP]
> 
> An unsuccessful boot looks like this (hand transcribed):
> [SNIP]
> ACPI APIC Table: 
> FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
> FreeBSD/SMP: 1 package(s) x 3 core(s)
> ioapic0: Changing APIC ID to 2
> ioapic0  irqs 0-23 on motherboard
> SMP: AP CPU #2 Launched!
> SMP: AP CPU #1 Launched!
> Timecounter "TSC-low" frequency 1607298818 Hz quality 800
> panic: netisr_init: not on CPU 0
> cpuid = 2
> KDB: stack backtrace:
> db_trace_selfwrapper() ...
> vpanic() ...
> doadump() ...
> netisr_init() ...
> mi_startup() ...
> btext() ...
> 
> This problem may be silently occuring on many other machines.  This
> machine is running a custom kernel with INVARIANTS and WITNESS.  The
> panic is coming from a KASSERT(), which is only checked when the kernel
> is built with INVARIANTS.
> 
> This KASSERT was removed from 12.0-CURRENT with this commit:
>   https://svnweb.freebsd.org/base/head/sys/net/netisr.c?r1=301270&r2=302595
> 
>   Revision 302595 - (view) (download) (annotate) - [select for diffs]
>   Modified Mon Jul 11 21:25:28 2016 UTC (2 years, 2 months ago) by nwhitehorn
>   File length: 44729 byte(s)
>   Diff to previous 301270
> 
>   Remove assumptions in MI code that the BSP is CPU 0.
> 
> Perhaps this should be MFC'ed, but it seems odd that the BSP is
> non-deterministic.

I now wonder if this panic is a side effect of EARLY_AP_STARTUP, which
was enabled by default in 11-STABLE GENERIC back in May, so the
timeframe fits.  Since the panic only happens with INVARIANTS enabled,
most users are unlikely to encounter this problem.


early boot netisr_init() panic on older AMD SMP machine with recent 11-STABLE

2018-10-09 Thread Don Lewis
My desktop machine has an older AMD SMP CPU and tracks 11-STABLE.  For
about six months or so it frequently panics early in boot.  If I retry a
sufficient number of times I can get a successful boot, but this is
rather annoying.

A normal boot looks like this:

Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.2-STABLE #16 r339017M: Sat Sep 29 19:18:41 PDT 2018
d...@mousie.catspoiler.org:/usr/obj/usr/src/sys/GENERICDDB amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1
)
WARNING: WITNESS option enabled, expect reduced performance.
VT(vga): resolution 640x480
CPU: AMD Athlon(tm) II X3 450 Processor (3214.60-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x100f53  Family=0x10  Model=0x5  Stepping=3
  Features=0x178bfbff
  Features2=0x802009
  AMD Features=0xee500800
  AMD Features2=0x37ff
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant
real memory  = 34359738368 (32768 MB)
avail memory = 33275473920 (31733 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: 
FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
FreeBSD/SMP: 1 package(s) x 3 core(s)
ioapic0: Changing APIC ID to 2
ioapic0  irqs 0-23 on motherboard
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
Timecounter "TSC-low" frequency 1607298818 Hz quality 800
random: entropy device external interface
[SNIP]

An unsuccessful boot looks like this (hand transcribed):
[SNIP]
ACPI APIC Table: 
FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
FreeBSD/SMP: 1 package(s) x 3 core(s)
ioapic0: Changing APIC ID to 2
ioapic0  irqs 0-23 on motherboard
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
Timecounter "TSC-low" frequency 1607298818 Hz quality 800
panic: netisr_init: not on CPU 0
cpuid = 2
KDB: stack backtrace:
db_trace_selfwrapper() ...
vpanic() ...
doadump() ...
netisr_init() ...
mi_startup() ...
btext() ...

This problem may be silently occurring on many other machines.  This
machine is running a custom kernel with INVARIANTS and WITNESS.  The
panic is coming from a KASSERT(), which is only checked when the kernel
is built with INVARIANTS.
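
For anyone not looking at netisr.c, the assertion amounts to something
like this (a sketch based on the panic string, not a verbatim copy):

  #include <sys/param.h>
  #include <sys/systm.h>        /* KASSERT() */
  #include <sys/pcpu.h>         /* curcpu */

  static void
  netisr_init_sketch(void)
  {
          /* With INVARIANTS, a false condition panics with the message below;
           * without INVARIANTS, KASSERT() compiles away entirely, which is
           * why most users would never notice. */
          KASSERT(curcpu == 0, ("netisr_init: not on CPU 0"));
  }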

This KASSERT was removed from 12.0-CURRENT with this commit:
  https://svnweb.freebsd.org/base/head/sys/net/netisr.c?r1=301270&r2=302595

  Revision 302595 - (view) (download) (annotate) - [select for diffs]
  Modified Mon Jul 11 21:25:28 2016 UTC (2 years, 2 months ago) by nwhitehorn
  File length: 44729 byte(s)
  Diff to previous 301270

  Remove assumptions in MI code that the BSP is CPU 0.

Perhaps this should be MFC'ed, but it seems odd that the BSP is
non-deterministic.



Re: building 11.2-STABLE on CURRENT is broken

2018-08-09 Thread Don Lewis
On  9 Aug, Kyle Evans wrote:
> On Thu, Aug 9, 2018 at 12:48 PM, Kyle Evans  wrote:
>> On Thu, Aug 9, 2018 at 12:32 PM, Don Lewis  wrote:
>>> My poudriere machine is running fairly recent 12.0-CURRENT, r336859.
>>>
>>> I just tried upgrading my 11.2-STABLE poudriere jail from r336040 to
>>> r337508 and got the errors below.  Universe builds are also affected.
>>> I just started a universe build of r336040 and it looks likely to work,
>>> so the problem appears to be with the stable/11 branch sometime after
>>> r336040.
>>>
>>
>> Hi,
>>
>> This is almost-certainly my fault... taking a look now.
>>
> 
> Hi Don,
> 
> Can you  try with this small patch [1] to remove -I${SRCTOP}/sys from
> libnv's include paths? I should have done this back when I added it to
> legacy, I think, but it seems to fix this for me.
> 
> Thanks,
> 
> Kyle Evans
> 
> [1] https://people.freebsd.org/~kevans/libnv-nosys.diff

That works for me as well.



building 11.2-STABLE on CURRENT is broken

2018-08-09 Thread Don Lewis
My poudriere machine is running fairly recent 12.0-CURRENT, r336859.

I just tried upgrading my 11.2-STABLE poudriere jail from r336040 to
r337508 and got the errors below.  Universe builds are also affected.
I just started a universe build of r336040 and it looks likely to work,
so the problem appears to be with the stable/11 branch sometime after
r336040.

--- nvlist.o ---
cc  -O2 -pipe  -I/var/poudriere/jails/110STABLEamd64/usr/src/sys 
-I/var/poudriere/jails/110STABLEamd64/usr/src/lib/libnv -MD  
-MF.depend.nvlist.o -MTnvlist.o -std=gnu99  -Qunused-arguments  
-I/usr/obj/var/poudriere/jails/110STABLEamd64/usr/src/tmp/legacy/usr/include -c 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvlist.c -o 
nvlist.o
--- nvpair.o ---
cc  -O2 -pipe  -I/var/poudriere/jails/110STABLEamd64/usr/src/sys 
-I/var/poudriere/jails/110STABLEamd64/usr/src/lib/libnv -MD  
-MF.depend.nvpair.o -MTnvpair.o -std=gnu99  -Qunused-arguments  
-I/usr/obj/var/poudriere/jails/110STABLEamd64/usr/src/tmp/legacy/usr/include -c 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvpair.c -o 
nvpair.o
--- cnvlist.o ---
In file included from 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/cnvlist.c:43:
In file included from /usr/include/stdarg.h:6:
In file included from /usr/include/x86/stdarg.h:33:
/usr/include/sys/_stdarg.h:41:11: error: unknown type name '__va_list'
  typedef __va_list   va_list;
  ^
--- dnvlist.o ---
In file included from 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/dnvlist.c:44:
In file included from /usr/include/stdarg.h:6:
In file included from /usr/include/x86/stdarg.h:33:
/usr/include/sys/_stdarg.h:41:11: error: unknown type name '__va_list'
  typedef __va_list   va_list;
  ^
1 error generated.
--- cnvlist.o ---
1 error generated.
--- nvpair.o ---
In file included from 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvpair.c:50:
In file included from /usr/include/stdarg.h:6:
In file included from /usr/include/x86/stdarg.h:33:
/usr/include/sys/_stdarg.h:41:11: error: unknown type name '__va_list'
  typedef __va_list   va_list;
  ^
--- dnvlist.o ---
*** [dnvlist.o] Error code 1

make[3]: stopped in /var/poudriere/jails/110STABLEamd64/usr/src/lib/libnv
--- nvlist.o ---
In file included from 
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvlist.c:52:
In file included from /usr/include/stdarg.h:6:
In file included from /usr/include/x86/stdarg.h:33:
/usr/include/sys/_stdarg.h:41:11: error: unknown type name '__va_list'
  typedef __va_list   va_list;
  ^
--- cnvlist.o ---
*** [cnvlist.o] Error code 1

make[3]: stopped in /var/poudriere/jails/110STABLEamd64/usr/src/lib/libnv
--- nvpair.o ---
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvpair.c:1176:2: 
warning: incompatible integer to pointer conversion passing 'va_list' (aka 
'int') to parameter of type 'struct __va_list_tag *' [-Wint-conversion]
va_start(valueap, valuefmt);
^~~
/usr/include/sys/_stdarg.h:45:49: note: expanded from macro 'va_start'
  #define   va_start(ap, last)  __builtin_va_start((ap), (last))
   ^~~~
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvpair.c:1178:9: 
warning: incompatible integer to pointer conversion passing 'va_list' (aka 
'int') to parameter of type 'struct __va_list_tag *' [-Wint-conversion]
va_end(valueap);
   ^~~
/usr/include/sys/_stdarg.h:51:40: note: expanded from macro 'va_end'
  #define   va_end(ap)  __builtin_va_end(ap)
 ^~
--- nvlist.o ---
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvlist.c:1483:2: 
warning: incompatible integer to pointer conversion passing 'va_list' (aka 
'int') to parameter of type 'struct __va_list_tag *' [-Wint-conversion]
va_start(valueap, valuefmt);
^~~
/usr/include/sys/_stdarg.h:45:49: note: expanded from macro 'va_start'
  #define   va_start(ap, last)  __builtin_va_start((ap), (last))
   ^~~~
/var/poudriere/jails/110STABLEamd64/usr/src/sys/contrib/libnv/nvlist.c:1485:9: 
warning: incompatible integer to pointer conversion passing 'va_list' (aka 
'int') to parameter of type 'struct __va_list_tag *' [-Wint-conversion]
va_end(valueap);
   ^~~
/usr/include/sys/_stdarg.h:51:40: note: expanded from macro 'va_end'
  #define   va_end(ap)  __builtin_va_end(ap)
 ^~
--- nvpair.o ---
2 warnings and 1 error generated.
--- nvlist.o ---
2 warnings and 1 error generated.
--- nvpair.o ---
*** [nvpair.o] Error code 1

make[3]: stopped in /var/poudriere/jails/110STABLEamd64/usr/src/lib/libnv
--- nvlist.o ---
*** [nvlist.o] Error code 1


Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-06-02 Thread Don Lewis
On  2 Jun, Pete French wrote:
> So, I notice that https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584 was
> closed as fixed. I can't remember if there was another bug report for ongoing
> Ryzen issues at all - I have been experimenting and have re-enabled most of
> the BIOS settings and tweaks fine, but I still need SMP disabled or it
> locks up. Haven't tried for a week or so with that, and maybe some
> of the latest changes in STABLE will help. Is this fixed for
> everyone else, or are you all still running with SMP off ?

With that bug fix, I get pretty much the same behavior on my Ryzen
machine as on my AMD FX-8320E.  BIOS settings are pretty much the just
the defaults.  I'm running 12.0-CURRENT, so I can't really comment on
11-STABLE.





Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-04-24 Thread Don Lewis
On 24 Apr, Mike Tancsa wrote:
> On 4/24/2018 10:01 AM, Pete French wrote:
>> 
>> Well, I ranh the iperf tests between real machine for 24 hours and that
>> worked fine. I also then spun up a Virtualbox with Win10 in it, and ran
>> iuperf to there at the same time as doing it betwene real machines, and
>> also did a full virus scan to exercise the disc. the idea being to
>> replicate Mondays lockup.
> 
> I was able to lock it up with vbox. bhyve was just a little easier to
> script and also I figured would be good to get VBox out of the mix in
> case it was something specific to VBox.  I dont recall if I tried it
> with SMT disabled.  Regardless, on Intel based systems I ran these tests
> for 72hrs straight without issue.   I can sort of believe hardware issue
> or flaky motherboard BIOSes (2 ASUS MBs, 1 MSI MB, 3 Ryzen chips), but
> the fact that two server class MBs from SuperMicro along with an Epyc
> chip also does the same thing makes me think something specific to
> FreeBSD and this class of AMD CPU :(

It would be interesting to test other AMD CPUs.  I think AMD and Intel
have some differences in the virtualization implementations.



Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-04-23 Thread Don Lewis
On 23 Apr, Mike Tancsa wrote:
> On 4/22/2018 5:29 PM, Don Lewis wrote:
>> Pretty much all of my BIOS settings are the defaults.
>> 
>> I suspect that the idle hang issues are motherboard and/or BIOS
>> specific.  For the record my motherboard is a Gigabyte
>> GA-AX370-Gaming 5.
> Hi Don,
>   Any chance you could try that bhyve test ? Basically, or 3 VMs and then
> run iperf3 between the instances.  I can lock up all 3 of my AMD boards
> (2 ASUS, one MSI) and both my Epyc (SuperMicro) boards.

I should be able to do that, but it will likely be a few days before I
can get around to it.



Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-04-22 Thread Don Lewis
On 20 Apr, Pete French wrote:
> So, resurrecting the thread from a few weeks ago, as I finally found
> time yesterday to out together the Ryzen machine I bought the parts for in
> Jaunary (busy year at work). All went smoothl;y, checked it
> booted up, used it for 15 minutes, was impressed by the speed and went home.
> 
> ...and by the time I got home, an hour or so later, it had locked up hard.
> 
> I was somewhat dissapointed, as I had seen various fixes go in,. and had hoped
> the issues were fixed. This morning I have booted the machine back up,
> tweaking the BIOPS to do things mentioned in this thread, viz:
> 
>   Disable Turbo Boost
>   Disable SMT
>   Disable global C-states
> 
> The memory was already ruunning correctly at 2133 (though I have locked that
> in the BIOS too) and I was already using kern.eventtimer.periodic=1, so
> the lockup was not related to those. Its the latest BIOS, and a -STABLE
> build from yesterday.
> 
> I suspect it will now be stable, but I was wondering if anyone was any further
> forward on working out which of the settings above are the ones which 'fix'
> the issue - or indeed if its really fixed, by them or just made far less 
> likely
> to happen.
> 
> Anyone got any more comments on this ?

In terms of hangs and system crashes, my Ryzen system has been stable
since the fix to relocate the shared page.

The random segfault problem during parallel builds went away when I
RMAed my original, early-build CPU.

This commit:
  r329254 | kib | 2018-02-13 16:31:45 -0800 (Tue, 13 Feb 2018) | 43 lines

  Ensure memory consistency on COW.

Fixed most of the remaining random port build errors that I had.  I
think the only remaining problem is random build failures of
guile-related ports, but I also see these on my FX-8320E, so they are
not Ryzen-specific.

Pretty much all of my BIOS settings are the defaults.

I suspect that the idle hang issues are motherboard and/or BIOS
specific.  For the record my motherboard is a Gigabyte
GA-AX370-Gaming 5.



Re: package building performance (was: Re: FreeBSD on AMD Epyc boards)

2018-02-17 Thread Don Lewis
On 14 Feb, Mark Linimon wrote:
> On Wed, Feb 14, 2018 at 09:15:53AM +0100, Kurt Jaeger wrote:
>> On the plus side: 16+16 cores, on the minus: a low CPU clock of 2.2 GHz.
>> Would a box like this be better for a package build host instead of 4+4 cores
>> with 3.x GHz ?
> 
> In my experience, "it depends".
> 
> I think that above a certain number of cores, I/O will dominate.  I _think_;
> I have never done any metrics on any of this.
> 
> The dominant term of the equation is, as you might guess, RAM.  Previous
> experience suggests that you need at least 2GB per build.  By default,
> nbuilds is set equal to ncores.  Less than 2GB-per and you're going to be
> unhappy.
> 
> (It's true that for modern systems, where large amounts of RAM are standard,
> that this is probably no longer a concern.)
> 
> Put it this way: with 4 cores and 16GB and netbooting (7GB of which was
> devoted to md(4)), I was having lots of problems on powerpc64.  The same
> machine with 64GB gives me no problems.
> 
> My guess is that after RAM, there is I/O, ncores, and speed.  But I'm just
> speculating.

I've been configuring 4 GB per builder, so on my 8-core 16-thread Ryzen
machine, that means 64 GB of RAM.  I also set USE_TMPFS to "wrkdir data
localbase" in poudriere.conf, so I'm leaning pretty heavily on RAM.  I do
figure that zfs clone is more efficient than tmpfs for the builder
jails.  With this configuration, building my default set of ports is
pretty much CPU-bound.  When it starts building the the larger ports
that need a lot of space for WRKDIR, like openoffice-4,
openoffice-devel, libreoffice, chromium, etc. the machine does end up
using a lot of swap space, but it is mostly dead data from the wrkdirs,
so generally there isn't a lot of paging activity.  I also have
ALLOW_MAKE_JOBS=yes to load up the CPUs a bit more, though I did get
the best results with MAKE_JOBS_NUMBER=7 building my default port set on
this machine.  The hard drive is a fairly old WD Green that I removed
from one of my other machines, and it is plenty fast enough to keep CPU
idle % at or near zero most of the time during the build run.
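
A sketch of the settings described above, assuming the usual file
locations (the values are just the ones from this message, not a
recommendation):

  # /usr/local/etc/poudriere.conf (sketch)
  USE_TMPFS="wrkdir data localbase"
  # builders default to one per hardware thread; ~4 GB of RAM per builder here

  # /usr/local/etc/poudriere.d/make.conf (sketch)
  ALLOW_MAKE_JOBS=yes
  MAKE_JOBS_NUMBER=7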

I did just try out "poudriere bulk -a" on this machine to build ports
for 11.1-RELEASE amd64 and got these results:

[111amd64-default] [2018-02-14_23h40m24s] [committing:] Queued: 29787 Built: 29277 Failed: 59 Skipped: 112 Ignored: 339 Tobuild: 0 Time: 47:39:48

I did notice some periods of high idle CPU during this run, but a lot
of that was due to a bunch of the builders in the fetch state at the
same time.  Without that, the runtime would have been lower.  On the
other hand, some ports failed due to a gmake issue, and others looked
like they failed due to having problems with ALLOW_MAKE_JOBS=yes.  The
runtime would have been higher without those problems.

As far as Epyc goes, I think the larger core count would win.  A lot
depends on how effective cache is for this workload, so it would be
interesting to plot poudriere run time vs. clock speed.  If cache misses
dominate execution time, then lowering the clock speed would not hurt
that much.  Something important to keep in mind with Threadripper and
Epync is NUMA.  For best results, all of the memory channels should be
used and the work should be distributed so that the processes on each
core primarily access RAM local to that CPU die.  If this isn't the case
then the infinity fabric that connects all of the CPU die will be the
bottleneck.  The lower core clock speed on Epyc lessens that penalty,
but it is still something to be avoided if possible.

Something else to consider is price/performance.  If you want to build
packages for four OS/arch combinations, then doing it in parallel on
four Ryzen machines is likely to be both cheaper and faster than doing
the same builds sequentially on an Epyc machine with 4x the core count
and RAM.

It is unfortunate that there don't seem to be any server-grade Ryzen
motherboards.  They all seem to be gamer boards with a lot of
unnecessary bling.



Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-02-12 Thread Don Lewis
On 31 Jan, Eugene Grosbein wrote:
> 31.01.2018 4:36, Mike Tancsa пишет:
>> On 1/30/2018 2:51 PM, Mike Tancsa wrote:
>>>
>>> And sadly, I am still able to hang the compile in about the same place.
>>> However, if I set
>> 
>> 
>> OK, here is a sort of work around. If I have the box a little more busy,
>> I can avoid whatever deadlock is going on.  In another console I have
>> cat /dev/urandom | sha256
>> running while the build runs
>> 
>> ... and I can compile net/samba47 from scratch without the compile
>> hanging.  This problem also happens on HEAD from today.  Should I start
>> a new thread on freebsd-current ? Or just file a bug report ?
>> The compile worked 4/4
> 
> That's really strange. Could you try to do "sysctl kern.eventtimer.periodic=1"
> and re-do the test without extra load?

I'm having really good luck with the kernel patch attached to this
message:
https://docs.freebsd.org/cgi/getmsg.cgi?fetch=417183+0+archive/2018/freebsd-hackers/20180211.freebsd-hackers

Since applying that patch, I did three poudriere runs to build the set
of ~1700 ports that I use.  Other than one gmake-related build runaway
that I've also seen on my AMD FX-8320E, I didn't see any random port
build failures.  When I was last did some testing a few weeks ago,
lang/go would almost always fail.  I also would seem random build
failures in lang/guile or finance/gnucash (which uses guile during its
build) on both my Ryzen and FX-8320E machines, but those built cleanly
all three times.

I even built samba 16 times in a row without a hang.




Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-01-30 Thread Don Lewis
On 30 Jan, Mike Tancsa wrote:
> On 1/30/2018 5:23 PM, Nimrod Levy wrote:
>> That's really strange. I never saw those kinds of deadlocks, but I did
>> notice that if I kept the cpu busy using distributed.net
>>  I could keep the full system lockups away for
>> at least a week if not longer.
>> 
>> Not to keep harping on it, but what worked for me was lowering the
>> memory speed. I'm at 11 days of uptime so far without anything running
>> the cpu. Before the change it would lock up anywhere from an hour to a day.
>> 
> Spoke too soon. After a dozen loops, the process has hung again.  Note,
> this is not the box locking up, just the compile.  I do have memory at a
> lower speed too. -- 2133 instead of the default 2400

I suspect the problem is a race condition that causes a wakeup to be
lost.  Adding load changes the timing enough to avoid the problem most
of the time.
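
As a generic illustration of that kind of race (this is the classic
lost-wakeup pattern, not the FreeBSD umtx/usem code), compare a waiter
that tests its condition outside the lock with one that does not:

  #include <pthread.h>
  #include <stdbool.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  static bool ready = false;

  /* Broken: if the waker sets 'ready' and signals between the unlocked test
   * and the cond_wait(), the wakeup is lost and this thread sleeps forever.
   * Extra load changes the timing and can hide the window. */
  static void *
  broken_waiter(void *arg)
  {
          (void)arg;
          if (!ready) {
                  pthread_mutex_lock(&lock);
                  pthread_cond_wait(&cond, &lock);
                  pthread_mutex_unlock(&lock);
          }
          return (NULL);
  }

  /* Correct: the test and the sleep are atomic with respect to a waker that
   * sets 'ready' and signals while holding the same lock. */
  static void *
  correct_waiter(void *arg)
  {
          (void)arg;
          pthread_mutex_lock(&lock);
          while (!ready)
                  pthread_cond_wait(&cond, &lock);
          pthread_mutex_unlock(&lock);
          return (NULL);
  }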

> I also just tried upgrading to the latest HEAD with a generic kernel and
> same / similar lockups although procstat -kk gives some odd results
> 
> 
> root@amdtestr12:/home/mdtancsa # procstat -kk 6067
>   PIDTID COMMTDNAME  KSTACK
> 
>  6067 100865 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100900 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100901 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100902 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100903 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100904 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100905 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100906 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100907 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100908 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100909 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100910 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0
>  6067 100911 python2.7   -   ??+0 ??+0 ??+0 ??+0
> ??+0 ??+0 ??+0 ??+0 ??+0 ??+0

Strange ... kernel vs. world mismatch?  Some other new regression in
HEAD?




Re: Ryzen issues on FreeBSD ? (with sort of workaround)

2018-01-30 Thread Don Lewis
On 30 Jan, Mike Tancsa wrote:
> On 1/30/2018 2:51 PM, Mike Tancsa wrote:
>> 
>> And sadly, I am still able to hang the compile in about the same place.
>> However, if I set
> 
> 
> OK, here is a sort of work around. If I have the box a little more busy,
> I can avoid whatever deadlock is going on.  In another console I have
> cat /dev/urandom | sha256
> running while the build runs

Interesting ...

> ... and I can compile net/samba47 from scratch without the compile
> hanging.  This problem also happens on HEAD from today.  Should I start
> a new thread on freebsd-current ? Or just file a bug report ?
> The compile worked 4/4

I'd file a PR to capture all the information in one place and drop a
pointer on freebsd-current.



Re: Ryzen issues on FreeBSD ?

2018-01-28 Thread Don Lewis
On 27 Jan, Mike Tancsa wrote:
> On 1/27/2018 3:23 AM, Don Lewis wrote:
>> 
>> I just ran into this for this first time with samba46.  I kicked of a
>> ports build this evening before leaving for several hours.  When I
>> returned, samba46 had failed with a build runaway.  I just tried again
>> and I see python stuck in the usem state.  This is what I see with
>> procstat -k:
> 
> Hmmm, is this indicative of a processor bug or a FreeBSD bug or its
> indeterminate at this point ?

My suspicion is a FreeBSD bug, probably a locking / race issue.  I know
that we've had to make some tweaks to our code for AMD CPUs, like this:


r321608 | kib | 2017-07-27 01:37:07 -0700 (Thu, 27 Jul 2017) | 9 lines

Use MFENCE to serialize RDTSC on non-Intel CPUs.

Kernel already used the stronger barrier instruction for AMDs, correct
the userspace fast gettimeofday() implementation as well.
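
In userland that boils down to something like the following sketch (an
illustration, not the actual libc code):

  #include <stdint.h>

  /* On AMD CPUs, MFENCE is the barrier that keeps RDTSC from executing
   * ahead of earlier memory operations; on Intel CPUs, LFENCE suffices. */
  static inline uint64_t
  rdtsc_serialized_amd(void)
  {
          uint32_t lo, hi;

          __asm__ __volatile__("mfence; rdtsc"
              : "=a" (lo), "=d" (hi) : : "memory");
          return (((uint64_t)hi << 32) | lo);
  }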



I did go back and look at the build runaways that I've occasionally seen
on my AMD FX-8320E package builder.  I haven't seen the python issue
there, but have seen gmake get stuck in a sleeping state with a bunch of
zombie offspring.



Re: Ryzen issues on FreeBSD ?

2018-01-28 Thread Don Lewis
On 27 Jan, Peter Moody wrote:
> Whelp, I replaced the r5 1600x with an r7 1700 (au 1734) and I'm now
> getting minutes of uptime before I hard crash. With smt, without, with c
> states, without, with opcache, without. No difference.

Check the temperatures.  Maybe the heat sink isn't making good contact
after the CPU replacement.



Re: Ryzen issues on FreeBSD ?

2018-01-28 Thread Don Lewis
On 28 Jan, Pete French wrote:
> 
> 
> On 28/01/2018 20:28, Don Lewis wrote:
>> I'd be wary of the B350 boards with the higher TDP eight core Ryzen CPUs
>> since the VRMs on the cheaper boards tend to have less robust VRM
>> designs.
> 
> Gah! Yes, I forgot that. I originally spec'd the board for a smaller Ryzen,
> then thought "what the hell" and got the 1700 without going back and
> checking that kind of stuff. Hmm, shall swap for a different one if I
> can. Thanks for pointing that out.

I started off with a Gigabyte AB350 Gaming for my 1700X back when there
was enough ambiguity about ECC support to give me hope that it would
work.  Everything seemed to work other than ECC and the problems caused
by my buggy CPU and the shared page issue, but the VRM temps in the BIOS
were really high (and I had no way to monitor that under load).  When I
upgraded to get working ECC, I also looked at reviews about VRM quality.



Re: Ryzen issues on FreeBSD ?

2018-01-28 Thread Don Lewis
On 28 Jan, Pete French wrote:
>>> I'm about ready to have a party.  My Ryzen 5 1600 has been up for over 8
>>> days so far after changing the memory to a slower speed.  System load
>>> hovers around .3
>>
>> I couldn't find an easy way to down-speed my memory in the bios :(
> 
> Out of interest, what motherboards are people using ? I still havent
> built my test system, desite being the OP in the thread, but I have
> an MSI B350 Tomahawk as the test board.

Gigabyte AX370 Gaming 5.

I'd be wary of the B350 boards with the higher-TDP eight-core Ryzen CPUs
since the cheaper boards tend to have less robust VRM designs.

Personally I won't put together a system without ECC RAM both for
overall reliability and also the fact that the error reporting will
immediately flag (or eliminate) RAM issues when the system is unstable.
That pretty much confined my motherboard choices to the higher end X370
motherboards.  I think only ASRock makes a B350 motherboard with ECC
support.  There's no reason that ECC support couldn't be universal other
than product differentiation so that the motherboard manufacturers can
collect more $$$ from anyone who cares about this feature.



Re: Ryzen issues on FreeBSD ?

2018-01-27 Thread Don Lewis
On 23 Jan, Mike Tancsa wrote:
> On 1/22/2018 5:13 PM, Don Lewis wrote:
>> On 22 Jan, Mike Tancsa wrote:
>>> On 1/22/2018 1:41 PM, Peter Moody wrote:
>>>> fwiw, I upgraded to 11-STABLE (11.1-STABLE #6 r328223), applied the
>>>> hw.lower_amd64_sharedpage setting to my loader.conf and got a crash
>>>> last night following the familiar high load -> idle. this was with SMT
>>>> re-enabled. no crashdump, so it was the hard crash that I've been
>>>> getting.
>>>
>>> hw.lower_amd64_sharedpage=1 is the default on AMD boxes no ? I didnt
>>> need to set mine to 1
>>>
>>>>
>>>> shrug, I'm at a loss here.
>>>
>>> I am trying an RMA with AMD.
>> 
>> Something else that you might want to try is 12.0-CURRENT.  There might
>> be some changes in HEAD that need to be merged back to 11.1-STABLE.
> 
> 
> Temp works as expected now. However, a (similar?) hang building Samba47.
> 
> ctrl+T shows
> 
> 
> load: 1.98  cmd: python2.7 53438 [usem] 54.70r 14.98u 6.04s 0% 230992k
> make: Working in: /usr/ports/net/samba47
> load: 0.34  cmd: python2.7 53438 [usem] 168.48r 14.98u 6.04s 0% 230992k
> make: Working in: /usr/ports/net/samba47
> load: 0.31  cmd: python2.7 53438 [usem] 174.12r 14.98u 6.04s 0% 230992k
> make: Working in: /usr/ports/net/samba47

I just ran into this for the first time with samba46.  I kicked off a
ports build this evening before leaving for several hours.  When I
returned, samba46 had failed with a build runaway.  I just tried again
and I see python stuck in the usem state.  This is what I see with
procstat -k:

  PIDTID COMMTDNAME  KSTACK 
  
90692 100801 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 100824 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 100857 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 100956 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 100995 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101483 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101538 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101549 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101570 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101572 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101583 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101588 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101593 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101610 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 101629 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex 
__umtx_op_wait_umutex amd64_syscall fast_syscall_common 
90692 101666 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 
__umtx_op_sem2_wait amd64_syscall fast_syscall_common 
90692 102114 python2.7   -   mi_switch 
sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_sem2_wait 

Re: Ryzen issues on FreeBSD ?

2018-01-23 Thread Don Lewis
On 23 Jan, Mike Tancsa wrote:
> On 1/22/2018 5:13 PM, Don Lewis wrote:
>>>
>>> I am trying an RMA with AMD.
>> 
>> Something else that you might want to try is 12.0-CURRENT.  There might
>> be some changes in HEAD that need to be merged back to 11.1-STABLE.
> 
> It looks like this thread got mention on phorix :) In the comments
> section (comment #9) a post makes reference to
> 
> http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen
> 
> I guess Linux is still working through similar lockups too :(

Yes.  Interesting (and fairly concise) thread here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085



Re: Ryzen issues on FreeBSD ?

2018-01-23 Thread Don Lewis
On 23 Jan, Pete French wrote:
> On 22/01/2018 18:25, Don Lewis wrote:
>> On 22 Jan, Pete French wrote:
>>>
>>>
>>> On 21/01/2018 19:05, Peter Moody wrote:
>>>> hm, so i've got nearly 3 days of uptime with smt disabled.
>>>> unfortunately this means that my otherwise '12' cores is actually only
>>>> '6'. I'm also getting occasional segfaults compiling go programs.
>>>
>>> Isn't go known to have issues on BSD anyway though ? I have seen
>>> complaints of random crashes running go under BSD systems - and
>>> preseumably the go compiler itself is written in go, so those issues
>>> might surface when compiling.
>> 
>> Not that I'm aware of.  I'm not a heavy go user on FreeBSD, but I don't
>> recall any unexpected go crashes and I haven't seen  problems building
>> go on my older AMD machines.
> 
> 
>  From the go 1.9 release notes:
> 
> "Known Issues
> There are some instabilities on FreeBSD that are known but not 
> understood. These can lead to program crashes in rare cases. See issue 
> 15658. Any help in solving this FreeBSD-specific issue would be 
> appreciated."
> 
> ( link is to https://github.com/golang/go/issues/15658 )
> 
> Having said that, we use it internally and have not seen any issues with 
> it ourselves. Just I am wary of the release notes, and that issue report.

Interesting ...

I've only seen problems on my Ryzen machine, which has >= 2x the number
of cores as any of my other machines.  All are AMD CPUs.




Re: Ryzen issues on FreeBSD ?

2018-01-22 Thread Don Lewis
On 22 Jan, Mike Tancsa wrote:
> On 1/22/2018 1:41 PM, Peter Moody wrote:
>> fwiw, I upgraded to 11-STABLE (11.1-STABLE #6 r328223), applied the
>> hw.lower_amd64_sharedpage setting to my loader.conf and got a crash
>> last night following the familiar high load -> idle. this was with SMT
>> re-enabled. no crashdump, so it was the hard crash that I've been
>> getting.
> 
> hw.lower_amd64_sharedpage=1 is the default on AMD boxes no ? I didnt
> need to set mine to 1
> 
>> 
>> shrug, I'm at a loss here.
> 
> I am trying an RMA with AMD.

Something else that you might want to try is 12.0-CURRENT.  There might
be some changes in HEAD that need to be merged back to 11.1-STABLE.



Re: Ryzen issues on FreeBSD ?

2018-01-22 Thread Don Lewis
On 22 Jan, Mike Tancsa wrote:
> On 1/21/2018 3:24 PM, Don Lewis wrote:
>>>
>>> I have supplied a customer with a Ryzen5 and a 350MB motherboard.
>>> But he runs Windows 10, but I haven't heard him complain about anything 
>>> like this.
>>> But I'll ask him specific.
>> 
>> Only the BSDs were affected by the shared page issue.  I think Linux
>> already had a guard page.  I don't think Windows was affected by the
>> idle C-state issue.  I suspect it is caused by software not doing the
>> right thing during C-state transitions, but the publicly available
>> documentation from AMD is pretty lacking.  The random segfault issue is
>> primarily triggered by heavy parallel software build loads and how many
>> Windows users do that?
> 
> 
> Are all the AMD accomodations that DragonFly did in FreeBSD ?
> 
> http://lists.dragonflybsd.org/pipermail/commits/2017-August/626190.html

We only lowered the top of user space by 4KB, which should be
sufficient, and we unmapped the boundary page.  The signal trampoline
was already in a separate page from the stack.



Re: Ryzen issues on FreeBSD ?

2018-01-22 Thread Don Lewis
On 22 Jan, Pete French wrote:
> 
> 
> On 21/01/2018 19:05, Peter Moody wrote:
>> hm, so i've got nearly 3 days of uptime with smt disabled.
>> unfortunately this means that my otherwise '12' cores is actually only
>> '6'. I'm also getting occasional segfaults compiling go programs.
> 
> Isn't go known to have issues on BSD anyway though ? I have seen 
> complaints of random crashes running go under BSD systems - and 
> preseumably the go compiler itself is written in go, so those issues 
> might surface when compiling.

Not that I'm aware of.  I'm not a heavy go user on FreeBSD, but I don't
recall any unexpected go crashes and I haven't seen  problems building
go on my older AMD machines.



Re: Ryzen issues on FreeBSD ?

2018-01-21 Thread Don Lewis
On 21 Jan, Willem Jan Withagen wrote:
> On 21/01/2018 21:24, Don Lewis wrote:
>> On 21 Jan, Willem Jan Withagen wrote:
>>> On 19/01/2018 23:29, Don Lewis wrote:
>>>> On 19 Jan, Pete French wrote:
>>>>> Out of interest, is there anyone out there running Ryzen who *hasnt*
>>>>> seen lockups ? I'd be curious if there a lot of lurkers thinking "mine
>>>>> works fine"
>>>>
>>>> No hangs or silent reboots here with either my original CPU or warranty
>>>> replacement once the shared page fix was in place.
>>>
>>> Perhaps a too weird reference:
>>>
>>> I have supplied a customer with a Ryzen5 and a 350MB motherboard.
>>> But he runs Windows 10, but I haven't heard him complain about anything
>>> like this.
>>> But I'll ask him specific.
>> 
>> Only the BSDs were affected by the shared page issue.  I think Linux
>> already had a guard page.  I don't think Windows was affected by the
>> idle C-state issue.  I suspect it is caused by software not doing the
>> right thing during C-state transitions, but the publicly available
>> documentation from AMD is pretty lacking.  The random segfault issue is
>> primarily triggered by heavy parallel software build loads and how many
>> Windows users do that?
> 
> This is an adobe workstation where several users remote login and do 
> work. So I would assume that the system is seriously (ab)used.
> 
> And as expected I'm not aware of any of the detailed things that
> Windows does while powering into lesser active states.

It might depend on the scheduler details.  On Linux and the FreeBSD ULE
scheduler, runnable threads migrate between CPUs to balance the loading
across all cores.  When I did some experiments to disable that, the rate
of build failures greatly decreased.  AMD has been very vague about the
cause of the problem (a "performance marginality") and resorted to
replacing CPUs with this problem without suggesting any sort of software
workaround.
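
If you want to experiment with that yourself, a crude userland
approximation is to pin the whole build to a fixed set of cores with
cpuset(1); a rough, untested sketch (the core list is arbitrary):

  # cpuset -l 0-5 make -j6 buildworld

It doesn't stop migration within that set, but it limits how far the
threads can wander.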


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-21 Thread Don Lewis
On 21 Jan, Peter Moody wrote:
> hm, so i've got nearly 3 days of uptime with smt disabled.
> unfortunately this means that my otherwise '12' cores is actually only
> '6'. I'm also getting occasional segfaults compiling go programs.

Both my original and replacement CPUs croak on go, so I don't think an
RMA is likely to help with that.  Go is a heavy user of threads and my
suspicion is that there is some sort of issue with the locking that it
uses. I'm guessing a memory barrier issue of some sort ...

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-21 Thread Don Lewis
On 21 Jan, Willem Jan Withagen wrote:
> On 19/01/2018 23:29, Don Lewis wrote:
>> On 19 Jan, Pete French wrote:
>>> Out of interest, is there anyone out there running Ryzen who *hasn't*
>>> seen lockups ? I'd be curious if there are a lot of lurkers thinking "mine
>>> works fine"
>> 
>> No hangs or silent reboots here with either my original CPU or warranty
>> replacement once the shared page fix was in place.
> 
> Perhaps a too weird reference:
> 
> I have supplied a customer with a Ryzen5 and a B350 motherboard.
> But he runs Windows 10, but I haven't heard him complain about anything 
> like this.
> But I'll ask him specific.

Only the BSDs were affected by the shared page issue.  I think Linux
already had a guard page.  I don't think Windows was affected by the
idle C-state issue.  I suspect it is caused by software not doing the
right thing during C-state transitions, but the publicly available
documentation from AMD is pretty lacking.  The random segfault issue is
primarily triggered by heavy parallel software build loads and how many
Windows users do that?

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-21 Thread Don Lewis
On 20 Jan, Mark Millard wrote:
> Don Lewis truckman at FreeBSD.org wrote on
> Sat Jan 20 02:35:40 UTC 2018 :
> 
>> The only real problem with the old CPUs is the random segfault problem
>> and some other random strangeness, like the lang/ghc build almost always
>> failing.
> 
> 
> At one time you had written
> ( https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029
> comment #103 on 2017-Oct-09):
> 
> QUOTE
> The ghc build failure seems to be gone after upgrading to a
> more recent 12.0-CURRENT.  I will try to bisect for the fix
> when I have a chance.
> END QUOTE
> 
> Did that not pan out? Did you conclude it was
> hardware-context specific?

I was never able to reproduce the problem.  It seems like it failed on
the first ports build run after I replaced the CPU.  When I upgraded the
OS and ports, the build succeeded.  I tried going back to much earlier
OS and ports versions, but I could never get the ghc build to fail
again.  I'm baffled by this ...

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 5:29 PM, Don Lewis wrote:
>> On 19 Jan, Pete French wrote:
>>> Out of interest, is there anyone out there running Ryzen who *hasn't*
>>> seen lockups ? I'd be curious if there are a lot of lurkers thinking "mine
>>> works fine"
>> 
>> No hangs or silent reboots here with either my original CPU or warranty
>> replacement once the shared page fix was in place.
> 
> 
> Hmmm, I wonder if I have a pair of the old CPUs (came from 2 different
> suppliers however).

The only real problem with the old CPUs is the random segfault problem
and some other random strangeness, like the lang/ghc build almost always
failing.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 6:16 PM, Don Lewis wrote:
>>>
>>> 0(ms-v1)# kldload amdtemp
>>> 0(ms-v1)# dmesg | tail -2
>>> ums0: at uhub0, port 3, addr 1 (disconnected)
>>> ums0: detached
>>> 0(ms-v1)#
>> 
>> What FreeBSD version are you running?  It looks like the amdtemp changes
>> for Ryzen are only in 12.0-CURRENT.   It looks like r323185 and r323195
>> need to be merged to stable/11.
> 
> releng11. It seems amdsmn is needed as well

That sounds right.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 5:45 PM, Don Lewis wrote:
>>>
>>> And it just hangs there. No segfaults, but it just hangs.
>>>
>>> A ctrl+t shows just shows
>>>
>>> load: 0.16  cmd: python2.7 65754 [usem] 589.51r 10.52u 1.63s 0% 122360k
>>> make: Working in: /usr/ports/net/samba47
>> 
>> I sometimes see build runaways when using poudriere to build my
>> standard set of packages on my Ryzen machine.  I don't think this is a
>> Ryzen-specific issue since I also see the same on older AMD FX-8320E
>> machine, but much less frequently there.  It looks like a lost wakeup
>> issue, but I haven't had a chance to dig into it yet.
> 
> Odd, does this happen on Intel machines too ?

Unknown.  The last one of those I had was a Pentium III ...

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ancient FreeBSD update path

2018-01-19 Thread Don Lewis
On 19 Jan, Mathieu Arnold wrote:
> On Fri, Jan 19, 2018 at 01:28:41PM +0100, Andrea Brancatelli wrote:
>> Hello guys. 
>> 
>> I have a couple of ancient FreeBSD install that I have to bring into
>> this century (read either 10.4 or 11.1) :-) 
>> 
>> I'm talking about a FreeBSD 8.0-RELEASE-p4 and a couple of FreeBSD
>> 9.3-RELEASE-p53. 
>> 
>> What upgrade strategy would you suggest? 
>> 
>> Direct jump into the future (8 -> 11)? Progressive steps (8 -> 9 -> 10
>> -> 11)? Boiling water on the HDs? :-) 
>> 
>> Thanks, any suggestion in more than welcome.
> 
> The *supported* upgrade strategy is to upgrade to the latest version of
> your current branch, and jump from latest version to latest version.  So
> 8.4 -> 9.3 -> 10.4 -> 11.1. (Note that you can stay at 10.4, it still is
> supported.)

Only until October 31, 2018.  At this point I'd go all the way to 11.1
to avoid going through the pain of another major OS version upgrade in
the nearish future.
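
From memory, each binary-upgrade hop with freebsd-update looks roughly
like this (a sketch, not a cut-and-paste recipe):

  # freebsd-update -r 11.1-RELEASE upgrade
  # freebsd-update install
  # shutdown -r now
  # freebsd-update install   (repeat until it reports nothing left to do)

with your third-party packages rebuilt or reinstalled before the final
install pass.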

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 3:22 PM, Lucas Holt wrote:
>> I have an Asus Prime X370-pro and a Ryzen 7 1700 that I bought in late
> 
> Thanks! Thats the board I have, but no luck with amdtemp.  Did you have
> to change the source code for it to work ?
> 
> dmidecode shows
> 
> Manufacturer: ASUSTeK COMPUTER INC.
> Product Name: PRIME X370-PRO
> 
> Vendor: American Megatrends Inc.
> Version: 3402
> Release Date: 12/11/2017
> Address: 0xF
> Runtime Size: 64 kB
> ROM Size: 16 MB
> Characteristics:
> 
> memory is
> 
> Type: DDR4
> Type Detail: Synchronous Unbuffered (Unregistered)
> Speed: 2133 MT/s
> Manufacturer: Unknown
> Serial Number: 192BE196
> Asset Tag: Not Specified
> Part Number: CT16G4DFD824A.C16FHD
> Rank: 2
> Configured Clock Speed: 1067 MT/s
> Minimum Voltage: 1.2 V
> Maximum Voltage: 1.2 V
> Configured Voltage: 1.2 V
> 
> 
> 
> When I try and load the kld, I get nothing :(
> 
> 0(ms-v1)# kldload amdtemp
> 0(ms-v1)# dmesg | tail -2
> ums0: at uhub0, port 3, addr 1 (disconnected)
> ums0: detached
> 0(ms-v1)#

What FreeBSD version are you running?  It looks like the amdtemp changes
for Ryzen are only in 12.0-CURRENT.   It looks like r323185 and r323195
need to be merged to stable/11.
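
Once you are on a kernel that has those changes, checking should be
something like this (sketch; output trimmed):

  # kldload amdsmn amdtemp
  # sysctl dev.amdtemp
  dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
  ...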

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 3:48 PM, Peter Moody wrote:
>> 
>> weirdly enough though, with SMT enabled, building net/samba47 would
>> always hang (like compilation segfaults). with SMT disabled, no such
>> problems.
> 
> wow, that's so strange!  I just tried, and I see the same thing as well.
> 
> [ 442/3804] Generating lib/ldb-samba/ldif_handlers_proto.h
> [ 443/3804] Generating source4/lib/registry/tools/common.h
> runner  /usr/local/bin/perl
> "/usr/ports/net/samba47/work/samba-4.7.4/source4/script/mkproto.pl"
> --srcdir=.. --builddir=. --public=/dev/null
> --private="default/lib/ldb-samba/ldif_handlers_proto.h"
> ../lib/ldb-samba/ldif_handlers.c ../lib/ldb-samba/ldb_matching_rules.c
> [ 444/3804] Generating source4/lib/registry/tests/proto.h
> runner  /usr/local/bin/perl
> "/usr/ports/net/samba47/work/samba-4.7.4/source4/script/mkproto.pl"
> --srcdir=.. --builddir=. --public=/dev/null
> --private="default/source4/lib/registry/tools/common.h"
> ../source4/lib/registry/tools/common.c
> 
> 
> And it just hangs there. No segfaults, but it just hangs.
> 
> A ctrl+t shows just shows
> 
> load: 0.16  cmd: python2.7 65754 [usem] 589.51r 10.52u 1.63s 0% 122360k
> make: Working in: /usr/ports/net/samba47

I just tried building samba47 here.  Top shows python spending a lot of
time in that state and steadily growing in size, but forward progress
does happen.  I got a successful build:
  [00:07:54] [01] [00:06:31] Finished net/samba47 | samba47-4.7.4_1: Success

I'm currently running:
  FreeBSD 12.0-CURRENT #0 r327261M: Wed Dec 27 22:44:16 PST 2017

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 3:48 PM, Peter Moody wrote:
>> 
>> weirdly enough though, with SMT enabled, building net/samba47 would
>> always hang (like compilation segfaults). with SMT disabled, no such
>> problems.
> 
> wow, that's so strange!  I just tried, and I see the same thing as well.
> 
> [ 442/3804] Generating lib/ldb-samba/ldif_handlers_proto.h
> [ 443/3804] Generating source4/lib/registry/tools/common.h
> runner  /usr/local/bin/perl
> "/usr/ports/net/samba47/work/samba-4.7.4/source4/script/mkproto.pl"
> --srcdir=.. --builddir=. --public=/dev/null
> --private="default/lib/ldb-samba/ldif_handlers_proto.h"
> ../lib/ldb-samba/ldif_handlers.c ../lib/ldb-samba/ldb_matching_rules.c
> [ 444/3804] Generating source4/lib/registry/tests/proto.h
> runner  /usr/local/bin/perl
> "/usr/ports/net/samba47/work/samba-4.7.4/source4/script/mkproto.pl"
> --srcdir=.. --builddir=. --public=/dev/null
> --private="default/source4/lib/registry/tools/common.h"
> ../source4/lib/registry/tools/common.c
> 
> 
> And it just hangs there. No segfaults, but it just hangs.
> 
> A ctrl+t shows just shows
> 
> load: 0.16  cmd: python2.7 65754 [usem] 589.51r 10.52u 1.63s 0% 122360k
> make: Working in: /usr/ports/net/samba47

I sometimes see build runaways when using poudriere to build my
standard set of packages on my Ryzen machine.  I don't think this is a
Ryzen-specific issue since I also see the same on older AMD FX-8320E
machine, but much less frequently there.  It looks like a lost wakeup
issue, but I haven't had a chance to dig into it yet.

=>> Killing runaway build after 7200 seconds with no output
=>> Cleaning up wrkdir
===>  Cleaning for doxygen-1.8.13_1,2
=>> Warning: Leftover processes:
USER PID %CPU %MEM   VSZ   RSS TT  STAT STARTEDTIME COMMAND
nobody 55576  0.0  0.0 10556  1528  0  I+J  00:32   0:00.04 /usr/bin/make -C 
/usr/ports/devel/doxygen build
nobody 55625  0.0  0.0 11660  1952  0  I+J  00:32   0:00.00 - /bin/sh -e -c (cd 
/wrkdirs/usr/ports/devel/dox
ygen/work/.build; if ! /usr/bin/env 
XDG_DATA_HOME=/wrkdirs/usr/ports/devel/doxygen/work  XDG_CONFIG_HOME=/wr
kdirs/usr/ports/devel/doxygen/work  HOME=/wrkdirs/usr/ports/devel/doxygen/work 
TMPDIR="/tmp" PATH=/wrkdirs/u
sr/ports/devel/doxygen/work/.bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/nonexistent/b
in NO_PIE=yes MK_DEBUG_FILES=no MK_KERNEL_SYMBOLS=no SHELL=/bin/sh NO_LINT=YES 
PREFIX=/usr/local  LOCALBASE=
/usr/local  LIBDIR="/usr/lib"  CC="cc" CFLAGS="-O2 -pipe  -DLIBICONV_PLUG 
-fstack-protector -fno-strict-alia
sing"  CPP="cpp" CPPFLAGS="-DLIBICONV_PLUG"  LDFLAGS=" -fstack-protector" 
LIBS=""  CXX="c++" CXXFLAGS="-O2 -
pipe -DLIBICONV_PLUG -fstack-protector -fno-strict-aliasing  -DLIBICONV_PLUG"  
MANPREFIX="/usr/local" BSD_IN
STALL_PROGRAM="install  -s -m 555"  BSD_INSTALL_LIB="install  -s -m 0644"  
BSD_INSTALL_SCRIPT="install  -m 5
55"  BSD_INSTALL_DATA="install  -m 0644"  BSD_INSTALL_MAN="install  -m 444" 
/usr/bin/make -f Makefile   all
docs; then  if [ -n "" ] ; then  echo "===> Compilation failed unexpectedly.";  
(echo "") | /usr/bin/fmt 75
79 ;  fi;  false;  fi)
nobody 55636  0.0  0.0  9988  1108  0  I+J  00:32   0:00.01 `-- /usr/bin/make 
-f Makefile all docs
nobody  6734  0.0  0.0 10140  1216  0  I+J  00:42   0:00.00   `-- /usr/bin/make 
-f CMakeFiles/Makefile2 docs
nobody  6764  0.0  0.0 10140  1216  0  I+J  00:42   0:00.01 `-- 
/usr/bin/make -f CMakeFiles/Makefile2 do
c/CMakeFiles/docs.dir/all
nobody  7107  0.0  0.0 10512  1536  0  I+J  00:42   0:00.03   `-- 
/usr/bin/make -f examples/CMakeFiles/e
xamples.dir/build.make examples/CMakeFiles/examples.dir/build
nobody 12111  0.0  0.0 61468 27060  0  I+J  00:43   0:00.16 `-- 
../bin/doxygen diagrams.cfg
Killed
build of devel/doxygen | doxygen-1.8.13_1,2 ended at Sat Dec 30 18:44:47 PST 
2017
build time: 02:14:51
!!! build failure encountered !!!


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Mike Tancsa wrote:
> On 1/19/2018 3:32 PM, Peter Moody wrote:
>> 
>> I have a ryzen5 1600X and an ASRock AB350M and I've tried just about
>> everything in all of these threads; disabling C state (no effect),
>> setting the sysctl (doesn't exist on my 11.1 RELEASE), tweaking
>> voltage and cooling settings, rma'ing the board the cpu and the
>> memory. nothing helped.
>> 
>> last night I tried disabling SMT and, so far so good.
> 
> 
> Is there anything that can be done to trigger the lockup more reliably ?
> I haven't found any patterns. I have had lockups when the system is 100%
> idle and lockups when lightly loaded.  I have yet to see any segfaults
> or sig 11s while doing buildworld (make -j12 or make -j16 even)

I've never seen the idle lockup problem here.  Prior to the shared page
fix, I could almost always trigger a system hang or silent reboot by
doing a parallel build of openjdk8.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-19 Thread Don Lewis
On 19 Jan, Pete French wrote:
> Out of interest, is there anyone out there running Ryzen who *hasn't*
> seen lockups ? I'd be curious if there are a lot of lurkers thinking "mine
> works fine"

No hangs or silent reboots here with either my original CPU or warranty
replacement once the shared page fix was in place.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-17 Thread Don Lewis
On 17 Jan, Nimrod Levy wrote:
> I'm running 11-STABLE from 12/9.  amdtemp works for me.  It also has the
> sysctl indicating that it has the shared page fix. I'm pretty sure I've
> seen the lockups since then.  I'll update to the latest STABLE and see
> what happens.
> 
> One weird thing about my experience is that if I keep something running
> continuously like the distributed.net client on 6 of 12 possible threads,
> it keeps the system up for MUCH longer than without.  This is a home server
> and very lightly loaded (one could argue insanely overpowered for the use
> case).

This sounds like the problem with the deep Cx states that has been
reported by numerous Linux users.  I think some motherboard brands are
more likely to have the problem.  See:
http://forum.asrock.com/forum_posts.asp?TID=5963=taichi-x370-with-ubuntu-idle-lock-ups-idle-freeze
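
If you want to experiment with capping the idle state from the FreeBSD
side, the usual knobs are something like this (sketch, untested on
Ryzen):

  # sysctl hw.acpi.cpu.cx_lowest=C1

and, to make it persistent, in /etc/rc.conf:

  performance_cx_lowest="C1"
  economy_cx_lowest="C1"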

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-17 Thread Don Lewis
On 17 Jan, Mike Tancsa wrote:
> On 1/17/2018 3:39 PM, Don Lewis wrote:
>> On 17 Jan, Mike Tancsa wrote:
>>> On 1/17/2018 8:43 AM, Pete French wrote:
>>>>
>>>> Are you running the latest STABLE ? There were some patches for Ryzen
>>>> which went in I believe, and might affect the stability. Specifically the
>>>> changes to stop it locking up when executing code in the top page ?
>>>
>>> Hi,
>>> I was testing with RELENG_11 as of 2 days ago.  The fix seems to be 
>>> there
>>>
>>> # sysctl -A hw.lower_amd64_sharedpage
>>> hw.lower_amd64_sharedpage: 1
>>>
>>> Would love to find a class of motherboard that pushes its "You dont need
>>> to dork around with any BIOS settings. It just works.  Oh, and we have a
>>> hardware watchdog too" ipmi would be stellar.
>> 
>> The shared page change fixed the random lockup and silent reboot problem
>> for me.  I've got a 1700X eight core CPU and a Gigabyte X370 Gaming 5. I
>> did have to RMA my CPU (it was an early one) because it had the problem
>> with random segfaults that seemed to be triggered by process migration
>> between CPU cores.  I still haven't switched over to using it for
>> package builds because I see more random fallout than on my older
>> package builder.  I'm not blaming the hardware for that at this point
>> because I see a lot of the same issues on my older machine, but less
>> frequently.
>> 
>> One thing to watch (though it should be less critical with a six core
>> CPU) is VRM cooling.  I removed the stupid plastic shroud over the VRM
>> sink on my motherboard so that it gets some more airflow.
> 
> Thanks! I will confirm the cooling.  I tried just now looking at the CPU
> FAN control in the BIOS and up'd it to "turbo" from the default.  Does
> amdtemp.ko work with your chipset ? Nothing on mine unfortunately, so I
> can't tell from the OS if it's running hot.
> 
> Is there a way to see if your CPU is old and has that bug ? I haven't
> seen any segfaults on the few dozen buildworlds I have done. So far it's
> always been a total lockup and not a crash with RELENG11.
> 
> x86info v1.31pre
> Found 12 identical CPUs
> Extended Family: 8 Extended Model: 0 Family: 15 Model: 1 Stepping: 1
> CPU Model (x86info's best guess): AMD Zen Series Processor (ZP-B1)
> Processor name string (BIOS programmed): AMD Ryzen 5 1600 Six-Core
> Processor

My original CPU had a date code of 1708SUT (8th week of 2017 I think),
and the replacement has a date code of 1733SUS.  There's a humungous
discussion thread here <https://community.amd.com/thread/215773> where
date codes are discussed.  As I recall, the first replacement parts
shipped had dates codes somewhere in the mid 20's, but I think AMD was
still hand screening parts at that point.  My replacement came in a
sealed box, so it wasn't hand screened and AMD probably was able to
screen for this problem in their production test.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Ryzen issues on FreeBSD ?

2018-01-17 Thread Don Lewis
On 17 Jan, Mike Tancsa wrote:
> On 1/17/2018 8:43 AM, Pete French wrote:
>> 
>> Are you running the latest STABLE ? There were some patches for Ryzen
>> which went in I believe, and might affect the stability. Specifically the
>> changes to stop it locking up when executing code in the top page ?
> 
> Hi,
>   I was testing with RELENG_11 as of 2 days ago.  The fix seems to be 
> there
> 
> # sysctl -A hw.lower_amd64_sharedpage
> hw.lower_amd64_sharedpage: 1
> 
> Would love to find a class of motherboard that pushes its "You dont need
> to dork around with any BIOS settings. It just works.  Oh, and we have a
> hardware watchdog too" ipmi would be stellar.

The shared page change fixed the random lockup and silent reboot problem
for me.  I've got a 1700X eight core CPU and a Gigabyte X370 Gaming 5. I
did have to RMA my CPU (it was an early one) because it had the problem
with random segfaults that seemed to be triggered by process migration
between CPU cores.  I still haven't switched over to using it for
package builds because I see more random fallout than on my older
package builder.  I'm not blaming the hardware for that at this point
because I see a lot of the same issues on my older machine, but less
frequently.

One thing to watch (though it should be less critical with a six core
CPU) is VRM cooling.  I removed the stupid plastic shroud over the VRM
sink on my motherboard so that it gets some more airflow.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 r321349 crashing immediately

2017-07-22 Thread Don Lewis
On 22 Jul, To: pz-freebsd-sta...@ziemba.us wrote:
> On 22 Jul, G. Paul Ziemba wrote:
>> My previous table had an error in the cumulative size column
>> (keyboard made "220" into "20" when I was plugging into the hex
>> calculator), so the stack is 0x200 bigger than I originally thought:
>> 
>> Frame   Stack Pointer   sz  cumu function
>> -   -   ---  
>>  44 0xfe085cfa8a10   amd64_syscall
>>  43 0xfe085cfa88b0  160  160 syscallenter
>>  42 0xfe085cfa87f0  220  380 sys_execve
>>  41 0xfe085cfa87c0   30  3B0 kern_execve
>>  40 0xfe085cfa8090  730  AE0 do_execve
>>  39 0xfe085cfa7ec0  1D0  CB0 namei
>>  38 0xfe085cfa7d40  180  E30 lookup
>>  37 0xfe085cfa7cf0   50  E80 VOP_LOOKUP
>>  36 0xfe085cfa7c80   70  EF0 VOP_LOOKUP_APV
>>  35 0xfe085cfa7650  630 1520 nfs_lookup
>>  34 0xfe085cfa75f0   60 1580 VOP_ACCESS
>>  33 0xfe085cfa7580   70 15F0 VOP_ACCESS_APV
>>  32 0xfe085cfa7410  170 1760 nfs_access
>>  31 0xfe085cfa7240  1D0 1930 nfs34_access_otw
>>  30 0xfe085cfa7060  1E0 1B10 nfsrpc_accessrpc
>>  29 0xfe085cfa6fb0   B0 1BC0 nfscl_request
>>  28 0xfe085cfa6b20  490 2050 newnfs_request
>>  27 0xfe085cfa6980  1A0 21F0 clnt_reconnect_call
>>  26 0xfe085cfa6520  460 2650 clnt_vc_call
>>  25 0xfe085cfa64c0   60 26B0 sosend
>>  24 0xfe085cfa6280  240 28F0 sosend_generic
>>  23 0xfe085cfa6110  170 2A60 tcp_usr_send
>>  22 0xfe085cfa5ca0  470 2ED0 tcp_output
>>  21 0xfe085cfa5900  3A0 3270 ip_output
>>  20 0xfe085cfa5880   80 32F0 looutput
>>  19 0xfe085cfa5800   80 3370 if_simloop
>>  18 0xfe085cfa57d0   30 33A0 netisr_queue
>>  17 0xfe085cfa5780   50 33F0 netisr_queue_src
>>  16 0xfe085cfa56f0   90 3480 netisr_queue_internal
>>  15 0xfe085cfa56a0   50 34D0 swi_sched
>>  14 0xfe085cfa5620   80 3550 intr_event_schedule_thread
>>  13 0xfe085cfa55b0   70 35C0 sched_add
>>  12 0xfe085cfa5490  120 36E0 sched_pickcpu
>>  11 0xfe085cfa5420   70 3750 sched_lowest
>>  10 0xfe085cfa52a0  180 38D0 cpu_search_lowest
>>   9 0xfe085cfa52a0    0 38D0 cpu_search
>>   8 0xfe085cfa5120  180 3A50 cpu_search_lowest
>>   7 0xfe085cfa5120    0 3A50 cpu_search
>>   6 0xfe085cfa4fa0  180 3BD0 cpu_search_lowest
>>   5 0xfe0839778f40  signal handler
> 
> The stack is aligned to a 4096 (0x1000) boundary.  The first access to a
> local variable below 0xfe085cfa5000 is what triggered the trap.  The
> other end of the stack must be at 0xfe085cfa9000 less a bit. I don't
> know why the first stack pointer value in the trace is
> 0xfe085cfa8a10. That would seem to indicate that amd64_syscall is
> using ~1500 bytes of stack space.

Actually there could be quite a bit of CPU context that gets saved. That
could be sizeable on amd64.

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 r321349 crashing immediately

2017-07-22 Thread Don Lewis
On 22 Jul, G. Paul Ziemba wrote:
> My previous table had an error in the cumulative size column
> (keyboard made "220" into "20" when I was plugging into the hex
> calculator), so the stack is 0x200 bigger than I originally thought:
> 
> Frame   Stack Pointer   sz  cumu function
> -   -   ---  
>  44 0xfe085cfa8a10   amd64_syscall
>  43 0xfe085cfa88b0  160  160 syscallenter
>  42 0xfe085cfa87f0  220  380 sys_execve
>  41 0xfe085cfa87c0   30  3B0 kern_execve
>  40 0xfe085cfa8090  730  AE0 do_execve
>  39 0xfe085cfa7ec0  1D0  CB0 namei
>  38 0xfe085cfa7d40  180  E30 lookup
>  37 0xfe085cfa7cf0   50  E80 VOP_LOOKUP
>  36 0xfe085cfa7c80   70  EF0 VOP_LOOKUP_APV
>  35 0xfe085cfa7650  630 1520 nfs_lookup
>  34 0xfe085cfa75f0   60 1580 VOP_ACCESS
>  33 0xfe085cfa7580   70 15F0 VOP_ACCESS_APV
>  32 0xfe085cfa7410  170 1760 nfs_access
>  31 0xfe085cfa7240  1D0 1930 nfs34_access_otw
>  30 0xfe085cfa7060  1E0 1B10 nfsrpc_accessrpc
>  29 0xfe085cfa6fb0   B0 1BC0 nfscl_request
>  28 0xfe085cfa6b20  490 2050 newnfs_request
>  27 0xfe085cfa6980  1A0 21F0 clnt_reconnect_call
>  26 0xfe085cfa6520  460 2650 clnt_vc_call
>  25 0xfe085cfa64c0   60 26B0 sosend
>  24 0xfe085cfa6280  240 28F0 sosend_generic
>  23 0xfe085cfa6110  170 2A60 tcp_usr_send
>  22 0xfe085cfa5ca0  470 2ED0 tcp_output
>  21 0xfe085cfa5900  3A0 3270 ip_output
>  20 0xfe085cfa5880   80 32F0 looutput
>  19 0xfe085cfa5800   80 3370 if_simloop
>  18 0xfe085cfa57d0   30 33A0 netisr_queue
>  17 0xfe085cfa5780   50 33F0 netisr_queue_src
>  16 0xfe085cfa56f0   90 3480 netisr_queue_internal
>  15 0xfe085cfa56a0   50 34D0 swi_sched
>  14 0xfe085cfa5620   80 3550 intr_event_schedule_thread
>  13 0xfe085cfa55b0   70 35C0 sched_add
>  12 0xfe085cfa5490  120 36E0 sched_pickcpu
>  11 0xfe085cfa5420   70 3750 sched_lowest
>  10 0xfe085cfa52a0  180 38D0 cpu_search_lowest
>   9 0xfe085cfa52a0    0 38D0 cpu_search
>   8 0xfe085cfa5120  180 3A50 cpu_search_lowest
>   7 0xfe085cfa5120    0 3A50 cpu_search
>   6 0xfe085cfa4fa0  180 3BD0 cpu_search_lowest
>   5 0xfe0839778f40  signal handler

The stack is aligned to a 4096 (0x1000) boundary.  The first access to a
local variable below 0xfe085cfa5000 is what triggered the trap.  The
other end of the stack must be at 0xfe085cfa9000 less a bit. I don't
know why the first stack pointer value in the trace is
0xfe085cfa8a10. That would seem to indicate that amd64_syscall is
using ~1500 bytes of stack space.
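
For what it's worth, 0xfe085cfa9000 - 0xfe085cfa5000 = 0x4000 bytes, i.e.
16 KiB or four pages, which matches the default amd64 kernel stack size
(KSTACK_PAGES is 4), assuming a stock kernel config.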

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 r321349 crashing immediately

2017-07-22 Thread Don Lewis
On 22 Jul, David Wolfskill wrote:
> On Fri, Jul 21, 2017 at 04:53:18AM +, G. Paul Ziemba wrote:
>> ...
>> >It looks like you are trying to execute a program from an NFS file
>> >system that is exported by the same host.  This isn't exactly optimal
>> >...
>> 
>> Perhaps not optimal for the implementation, but I think it's a
>> common NFS scenario: define a set of NFS-provided paths for files
>> and use those path names on all hosts, regardless of whether they
>> happen to be serving the files in question or merely clients.
> 
> Back when I was doing sysadmin stuff for a group of engineers, my
> usual approach for that sort of thing was to use amd (this was late
> 1990s - 2001) to have maps so it would set up NFS mounts if the
> file system being served was from a different host (from the one
> running amd), but instantiating a symlink instead if the file system
> resided on the current host.

Same here.

It's a bit messy to do this manually, but you could either use a symlink
or a nullfs mount for the filesystems that are local.
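
A minimal sketch, with made-up paths: if /export/src is what the other
hosts NFS-mount as /n/src, then on the server itself either

  # ln -s /export/src /n/src

or

  # mount -t nullfs /export/src /n/src

gives you the same pathname without going through the local NFS client.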

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 r321349 crashing immediately

2017-07-22 Thread Don Lewis
On 21 Jul, G. Paul Ziemba wrote:
> truck...@freebsd.org (Don Lewis) writes:
> 
>>On 21 Jul, G. Paul Ziemba wrote:
>>> GENERIC kernel r321349 results in the following about a minute after
>>> multiuser boot completes.
>>> 
>>> What additional information should I provide to assist in debugging?
>>> 
>>> Many thanks!
>>> 
>>> [Extracted from /var/crash/core.txt.NNN]
>>> 
>>> KDB: stack backtrace:
>>> #0 0x810f6ed7 at kdb_backtrace+0xa7
>>> #1 0x810872a9 at vpanic+0x249
>>> #2 0x81087060 at vpanic+0
>>> #3 0x817d9aca at dblfault_handler+0x10a
>>> #4 0x817ae93c at Xdblfault+0xac
>>> #5 0x810cf76e at cpu_search_lowest+0x35e
>>> #6 0x810cf76e at cpu_search_lowest+0x35e
>>> #7 0x810d5b36 at sched_lowest+0x66
>>> #8 0x810d1d92 at sched_pickcpu+0x522
>>> #9 0x810d2b03 at sched_add+0xd3
>>> #10 0x8101df5c at intr_event_schedule_thread+0x18c
>>> #11 0x8101ddb0 at swi_sched+0xa0
>>> #12 0x81261643 at netisr_queue_internal+0x1d3
>>> #13 0x81261212 at netisr_queue_src+0x92
>>> #14 0x81261677 at netisr_queue+0x27
>>> #15 0x8123da5a at if_simloop+0x20a
>>> #16 0x8123d83b at looutput+0x22b
>>> #17 0x8131c4c6 at ip_output+0x1aa6
>>> 
>>> doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
>>> 298 dumptid = curthread->td_tid;
>>> (kgdb) #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
>>> #1  0x810867e8 in kern_reboot (howto=260)
>>> at /usr/src/sys/kern/kern_shutdown.c:366
>>> #2  0x810872ff in vpanic (fmt=0x81e5f7e0 "double fault", 
>>> ap=0xfe0839778ec0) at /usr/src/sys/kern/kern_shutdown.c:759
>>> #3  0x81087060 in panic (fmt=0x81e5f7e0 "double fault")
>>> at /usr/src/sys/kern/kern_shutdown.c:690
>>> #4  0x817d9aca in dblfault_handler (frame=0xfe0839778f40)
>>> at /usr/src/sys/amd64/amd64/trap.c:828
>>> #5  
>>> #6  0x810cf422 in cpu_search_lowest (
>>> cg=0x826ccd98 <group+280>, 
>>> low=>> 0xfe085cfa4
>>> ff8>) at /usr/src/sys/kern/sched_ule.c:782
>>> #7  0x810cf76e in cpu_search (cg=0x826cccb8 <group+56>, 
>>> low=0xfe085cfa53b8, high=0x0, match=1)
>>> at /usr/src/sys/kern/sched_ule.c:710
>>> #8  cpu_search_lowest (cg=0x826cccb8 <group+56>, 
>>> low=0xfe085cfa53b8) at /usr/src/sys/kern/sched_ule.c:783
>>> #9  0x810cf76e in cpu_search (cg=0x826ccc80 , 
>>> low=0xfe085cfa5430, high=0x0, match=1)
>>> at /usr/src/sys/kern/sched_ule.c:710
>>> #10 cpu_search_lowest (cg=0x826ccc80 , 
>>> low=0xfe085cfa5430)
>>> at /usr/src/sys/kern/sched_ule.c:783
>>> #11 0x810d5b36 in sched_lowest (cg=0x826ccc80 , 
>>> mask=..., pri=28, maxload=2147483647, prefer=4)
>>> at /usr/src/sys/kern/sched_ule.c:815
>>> #12 0x810d1d92 in sched_pickcpu (td=0xf8000a3a9000, flags=4)
>>> at /usr/src/sys/kern/sched_ule.c:1292
>>> #13 0x810d2b03 in sched_add (td=0xf8000a3a9000, flags=4)
>>> at /usr/src/sys/kern/sched_ule.c:2447
>>> #14 0x8101df5c in intr_event_schedule_thread (ie=0xf80007e7ae00)
>>> at /usr/src/sys/kern/kern_intr.c:917
>>> #15 0x8101ddb0 in swi_sched (cookie=0xf8000a386880, flags=0)
>>> at /usr/src/sys/kern/kern_intr.c:1163
>>> #16 0x81261643 in netisr_queue_internal (proto=1, 
>>> m=0xf80026d00500, cpuid=0) at /usr/src/sys/net/netisr.c:1022
>>> #17 0x81261212 in netisr_queue_src (proto=1, source=0, 
>>> m=0xf80026d00500) at /usr/src/sys/net/netisr.c:1056
>>> #18 0x81261677 in netisr_queue (proto=1, m=0xf80026d00500)
>>> at /usr/src/sys/net/netisr.c:1069
>>> #19 0x8123da5a in if_simloop (ifp=0xf800116eb000, 
>>> m=0xf80026d00500, af=2, hlen=0) at /usr/src/sys/net/if_loop.c:358
>>> #20 0x8123d83b in looutput (ifp=0xf800116eb000, 
>>> m=0xf80026d00500, dst=0xf80026ed6550, ro=0xf80026ed6530)
>>> at /usr/src/sys/net/if_loop.c:265
>>> #21 0x8131c4c6 in ip_output (m=0xf80026d00500, opt=0x0, 
>>> ro=0xf80026ed6

Re: stable/11 r321349 crashing immediately

2017-07-21 Thread Don Lewis
On 21 Jul, G. Paul Ziemba wrote:
> GENERIC kernel r321349 results in the following about a minute after
> multiuser boot completes.
> 
> What additional information should I provide to assist in debugging?
> 
> Many thanks!
> 
> [Extracted from /var/crash/core.txt.NNN]
> 
> KDB: stack backtrace:
> #0 0x810f6ed7 at kdb_backtrace+0xa7
> #1 0x810872a9 at vpanic+0x249
> #2 0x81087060 at vpanic+0
> #3 0x817d9aca at dblfault_handler+0x10a
> #4 0x817ae93c at Xdblfault+0xac
> #5 0x810cf76e at cpu_search_lowest+0x35e
> #6 0x810cf76e at cpu_search_lowest+0x35e
> #7 0x810d5b36 at sched_lowest+0x66
> #8 0x810d1d92 at sched_pickcpu+0x522
> #9 0x810d2b03 at sched_add+0xd3
> #10 0x8101df5c at intr_event_schedule_thread+0x18c
> #11 0x8101ddb0 at swi_sched+0xa0
> #12 0x81261643 at netisr_queue_internal+0x1d3
> #13 0x81261212 at netisr_queue_src+0x92
> #14 0x81261677 at netisr_queue+0x27
> #15 0x8123da5a at if_simloop+0x20a
> #16 0x8123d83b at looutput+0x22b
> #17 0x8131c4c6 at ip_output+0x1aa6
> 
> doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
> 298 dumptid = curthread->td_tid;
> (kgdb) #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
> #1  0x810867e8 in kern_reboot (howto=260)
> at /usr/src/sys/kern/kern_shutdown.c:366
> #2  0x810872ff in vpanic (fmt=0x81e5f7e0 "double fault", 
> ap=0xfe0839778ec0) at /usr/src/sys/kern/kern_shutdown.c:759
> #3  0x81087060 in panic (fmt=0x81e5f7e0 "double fault")
> at /usr/src/sys/kern/kern_shutdown.c:690
> #4  0x817d9aca in dblfault_handler (frame=0xfe0839778f40)
> at /usr/src/sys/amd64/amd64/trap.c:828
> #5  
> #6  0x810cf422 in cpu_search_lowest (
> cg=0x826ccd98 , 
> low= 0xfe085cfa4
> ff8>) at /usr/src/sys/kern/sched_ule.c:782
> #7  0x810cf76e in cpu_search (cg=0x826cccb8 , 
> low=0xfe085cfa53b8, high=0x0, match=1)
> at /usr/src/sys/kern/sched_ule.c:710
> #8  cpu_search_lowest (cg=0x826cccb8 , 
> low=0xfe085cfa53b8) at /usr/src/sys/kern/sched_ule.c:783
> #9  0x810cf76e in cpu_search (cg=0x826ccc80 , 
> low=0xfe085cfa5430, high=0x0, match=1)
> at /usr/src/sys/kern/sched_ule.c:710
> #10 cpu_search_lowest (cg=0x826ccc80 , low=0xfe085cfa5430)
> at /usr/src/sys/kern/sched_ule.c:783
> #11 0x810d5b36 in sched_lowest (cg=0x826ccc80 , 
> mask=..., pri=28, maxload=2147483647, prefer=4)
> at /usr/src/sys/kern/sched_ule.c:815
> #12 0x810d1d92 in sched_pickcpu (td=0xf8000a3a9000, flags=4)
> at /usr/src/sys/kern/sched_ule.c:1292
> #13 0x810d2b03 in sched_add (td=0xf8000a3a9000, flags=4)
> at /usr/src/sys/kern/sched_ule.c:2447
> #14 0x8101df5c in intr_event_schedule_thread (ie=0xf80007e7ae00)
> at /usr/src/sys/kern/kern_intr.c:917
> #15 0x8101ddb0 in swi_sched (cookie=0xf8000a386880, flags=0)
> at /usr/src/sys/kern/kern_intr.c:1163
> #16 0x81261643 in netisr_queue_internal (proto=1, 
> m=0xf80026d00500, cpuid=0) at /usr/src/sys/net/netisr.c:1022
> #17 0x81261212 in netisr_queue_src (proto=1, source=0, 
> m=0xf80026d00500) at /usr/src/sys/net/netisr.c:1056
> #18 0x81261677 in netisr_queue (proto=1, m=0xf80026d00500)
> at /usr/src/sys/net/netisr.c:1069
> #19 0x8123da5a in if_simloop (ifp=0xf800116eb000, 
> m=0xf80026d00500, af=2, hlen=0) at /usr/src/sys/net/if_loop.c:358
> #20 0x8123d83b in looutput (ifp=0xf800116eb000, 
> m=0xf80026d00500, dst=0xf80026ed6550, ro=0xf80026ed6530)
> at /usr/src/sys/net/if_loop.c:265
> #21 0x8131c4c6 in ip_output (m=0xf80026d00500, opt=0x0, 
> ro=0xf80026ed6530, flags=0, imo=0x0, inp=0xf80026ed63a0)
> at /usr/src/sys/netinet/ip_output.c:655
> #22 0x8142e1c7 in tcp_output (tp=0xf80026eb2820)
> at /usr/src/sys/netinet/tcp_output.c:1447
> #23 0x81447700 in tcp_usr_send (so=0xf80011ec2360, flags=0, 
> m=0xf80026d14d00, nam=0x0, control=0x0, td=0xf80063ba1000)
> at /usr/src/sys/netinet/tcp_usrreq.c:967
> #24 0x811776f1 in sosend_generic (so=0xf80011ec2360, addr=0x0, 
> uio=0x0, top=0xf80026d14d00, control=0x0, flags=0, 
> td=0xf80063ba1000) at /usr/src/sys/kern/uipc_socket.c:1360
> #25 0x811779bd in sosend (so=0xf80011ec2360, addr=0x0, uio=0x0, 
> top=0xf80026d14d00, control=0x0, flags=0, td=0xf80063ba1000)
> at /usr/src/sys/kern/uipc_socket.c:1405
> #26 0x815276a2 in clnt_vc_call (cl=0xf80063ca0980, 
> ext=0xfe085cfa6e38, proc=4, args=0xf80026c3bc00, 
> resultsp=0xfe085cfa7110, utimeout=...)
> at /usr/src/sys/rpc/clnt_vc.c:413
> 

Re: Where is 10.2-STABLE?

2016-03-31 Thread Don Lewis
On 30 Mar, Ronald F. Guilmette wrote:
> 
> 
> I was looking to download an ISO for 10.2-STABLE for a new build I'm
> doing, however I can't see to locate any such ISO.  It isn't where
> it seems it should be, according to the info on this page:
> 
> https://www.freebsd.org/snapshots/
> 
> Where can I get such an ISO?

If you really want a snapshot from the stable/10 branch, look for
10.3-PRERELEASE.  When stable/10 was branched in preparation for
10.3-RELEASE (creating the releng/10.3 branch), newvers.sh on the
stable/10 branch was edited to change REVISION from 10.2 to 10.3 and
BRANCH from STABLE to PRERELEASE.  Commits to the stable/10 branch after
this point are eligible to be merged to releng/10.3 if approved by re@.

Once 10.3 is officially released, then newvers.sh will be modified again
to change BRANCH from PRERELEASE back to STABLE, and the snapshots will
then be named 10.3-STABLE.
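
The relevant bits of sys/conf/newvers.sh look roughly like this (from
memory) once the branch has been cut:

  REVISION="10.3"
  BRANCH="PRERELEASE"
  ...
  VERSION="${TYPE} ${REVISION}-${BRANCH}"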

I suspect that the oldest of the 10.2-STABLE snapshots has been expired.
I don't know why there haven't been any 10.3-PRERELEASE snapshot .iso
files created since January ...
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: MFC r282973 (disable libgomp build) and r283060 (disable libgcov build)?

2015-07-24 Thread Don Lewis
On 24 Jul, Matthieu Volat wrote:
 On Mon, 20 Jul 2015 14:03:01 -0700 (PDT)
 Don Lewis truck...@freebsd.org wrote:
 
 Should r282973 and r283060 be MFCed to FreeBSD 10?  On amd64 and i386,
 which use clang as their base compiler and don't have gcc in base by
 default, the math/scilab port uses clang for cc and c++ compilation,
 but finds /usr/include/omp.h (and links to libgomp from lang/gcc). 
 The build succeeds, but I suspect this may not run properly.
 
 Does it mean the door to an openmp-enabled cc in base is closed?

That is probably true for FreeBSD 10, since clang 3.4 in base doesn't
support OpenMP.  Even though omp.h and libgomp are present in FreeBSD
10, and using base clang to compile the test program here
http://openmp.org/wp/openmp-compilers/ succeeds, the resulting
executable only runs one thread.  There is probably more hope for
FreeBSD 11 if the runtime is imported, since clang 3.6 in base there at
least knows about the -fopenmp option.
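
A quick way to check what you actually got (a sketch; omp-hello.c here is
just the sample program from that page saved locally):

  % cc -fopenmp omp-hello.c -o omp-hello
  % ./omp-hello

With a toolchain that really supports OpenMP you see output from every
hardware thread; with the FreeBSD 10 base clang you only see one.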

 I'm not fond of lang/gcc as openmp provider: if a port uses c++, it
 will cause linkage headaches with libc++ (I never was able to have
 graphics/darktable working, for example).

You might want to try out lang/clang-devel with devel/libiomp5-devel.
See this thread on freebsd-ports@
http://docs.freebsd.org/cgi/mid.cgi?55AE0474.5050207.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


MFC r282973 (disable libgomp build) and r283060 (disable libgcov build)?

2015-07-20 Thread Don Lewis
Should r282973 and r283060 be MFCed to FreeBSD 10?  On amd64 and i386,
which use clang as their base compiler and don't have gcc in base by
default, the math/scilab port uses clang for cc and c++ compilation, but
finds /usr/include/omp.h (and links to libgomp from lang/gcc).  The
build succeeds, but I suspect this may not run properly.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: There has to be a better way of merging /etc during a major freebsd-update

2015-03-10 Thread Don Lewis
On 10 Mar, Miroslav Lachman wrote:

 This and some other problems with freebsd-update (hanging on the reboot 
 after update) turns me back to using source compiled upgrades.
 I am compiling 10.1 right now to do the upgrades from 8.4 to 10.1 on 15 
 machines.

I only do source upgrades, but I've still run into the hang on reboot
problem with 10.1-STABLE.  The problem seems to have gone away when I
switched the root filesystem (there's actually only one filesystem on
that machine) from SU+J to SU.
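
For anyone wanting to try the same thing, disabling the journal is just
(sketch, with a made-up device; do it from single-user mode with the
filesystem unmounted or mounted read-only):

  # tunefs -j disable /dev/ada0p2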

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: portupgrade(1) | portmaster(8) -- which is more effective for large upgrade?

2013-06-27 Thread Don Lewis
On 26 Jun, Bob Willcox wrote:
 On Wed, Jun 26, 2013 at 01:23:32PM -0700, Jeremy Chadwick wrote:
 On Wed, Jun 26, 2013 at 09:42:43AM -0700, Chris H wrote:
  Greetings,
   I haven't upgraded my tree(s) for awhile. My last attempt to rebuild 
  after updating
  src & ports, resulted in nearly installing the entire ports tree, which
  is why I've
  waited so long. Try as I might, I've had great difficulty finding 
  something that will
  _only_ upgrade what I already have installed, _and_ respect the options 
  used during the
  original make && make install, or those options expressed in make.conf.
  As portupgrade(1)  portmaster(8) appear to be the most used in this 
  scenario,
  I'm soliciting opinions on which of these works best, or if there is 
  something else to
  better manage this situation. Is there such a thing as a FreeBSD upgrade 
  easy button?
 
 Use portmaster, avoid portupgrade.  And no I will not expand on my
 reasoning -- I urge anyone even mentioning the word portupgrade to spend
 a few hours of their day reading the horror stories on the mailing lists
 over the past 10 years or so (including recently).  Choose wisely.
 
 Well, just to offer a counter-opinion here, I use portupgrade and feel that it
 has improved significantly over the past year or two and has become quite
 usable. I run it every two to four weeks on about five systems and haven't had
 any problems with it in a long time. However, YMMV.

I'm also a long-time portupgrade user, though I've been running locally
tweaked versions for quite some time.  Currently I'm using the patch
from the PR ports/177365, which makes the -a -f and -r options play
together much better.

I always start my upgrades by running
  portupgrade -aFc
to fetch any needed distfiles and configure all the port options.  That
avoids breakage in the middle of the upgrade from an unfetchable
distfile, and avoids interactive pauses in the middle of the upgrade to
set options.

In my latest upgrade, I had to deal with the ruby version change as well
as the perl upgrade and the audio/flac library version bump.  On my sole
10-CURRENT machine, I just followed the initial steps listed in UPDATING
for the ruby version change, and then ran:
 portupgrade -afx ruby-1.8.\* -r lang/ruby18 lang/perl5.12 audio/flac
That upgrades all the ports that are out of date and rebuilds all the
ports that depend on the explicitly listed ports, all in the correct
dependency order.

For my 8-STABLE machines, I build pkgng packages on one machine and then
use pkg to upgrade the others.  I build the packages in three steps:

 portupgrade -nfx ruby-1.8.\* -r lang/ruby18 lang/perl5.12 audio/flac > f

 edit the file f to get the list of the origins of the ports that
 would be upgraded

 portupgrade -fpr `cat f`

I do this because if port foo gets upgraded due to being out of date,
I want to rebuild all the packages that depend on foo.  The reason for
that is if I install package bar that depends on foo, I want pkg to
also install the correct version of foo.

If I feel ambitious, I might tweak portupgrade so that it can handle
this internally instead of having to do the extra steps manually.  On
the other hand, I might switch to poudriere, which is probably a better
solution.
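
The poudriere equivalent is pretty short (sketch; the jail name and list
file are made up):

  # poudriere jail -c -j 84amd64 -v 8.4-RELEASE
  # poudriere ports -c
  # poudriere bulk -j 84amd64 -f ~/pkglist

and it rebuilds dependent packages in the right order on its own.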

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: What is the Right Way(™) to run X?

2013-03-17 Thread Don Lewis
On 17 Mar, Andreas Nilsson wrote:
 On Sun, Mar 17, 2013 at 1:58 PM, Daniel O'Connor dar...@dons.net.au wrote:
 

 On 17/03/2013, at 23:08, Matthew D. Fuller fulle...@over-yonder.net
 wrote:
  However, some time back, X _did_ start being all stupid about finding
  the mouse for me.  Un/re-plugging it (USB) after starting X made it
  show up working, but that's annoying and stupid (and not an option on
  other systems with e.g. PS/2 meece).  I wound up sticking the other
  half of that oft-cargo-culted incantation:
 
  Section "ServerFlags"
 Option "AutoAddDevices" "off"
  EndSection
 
  in my config, and it's worked OK since.  's probably worth a try...


 Yeah, that does work too. It's just annoying it's necessary :)

 
 Sure is. One thing that also comes to mind is moused. Do you have it
 running? I seem to remember having weird troubles when moused wasn't
 running.

I ran into this problem a while back.  The problem turned out to be that
moused was exclusively opening /dev/psm0 before hald so that hald was
unable to open it.  This happened first on my laptop, and I just disabled
moused and everything seemed to work except that the trackpad no longer
worked in console mode.  I tried the same thing later when my primary
desktop broke and it sort of worked.  The problem that I ran into was
that Xorg would occasionally wedge and spam its log with messages about
problems with detecting the mouse protocol.  Even worse, I found that my
KVM switch would very reliably trigger this problem.  After much hair
pulling, I eventually re-enabled moused and added this to xorg.conf:
Option  "AllowEmptyInput" "Off"
so that it would obey this mouse configuration section:
Section "InputDevice"
Identifier  "Mouse0"
Driver  "mouse"
Option  "Protocol" "auto"
Option  "Device" "/dev/sysmouse"
Option  "ZAxisMapping" "4 5 6 7"
EndSection
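
Re-enabling moused itself is just the usual rc.conf knob, something like:

  moused_enable="YES"

followed by "service moused start" or a reboot.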

I don't recall if I disabled hald and changed xorg.conf to point to
/dev/psm0 before I re-enabled moused.  I do know that hald is currently
disabled and nothing obvious seems to be broken in Gnome.

I haven't had any issues with AllowEmptyInput so I never bothered to
switch over to the preferred AutoAddDevices.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: amdtemp does not find my CPU.

2013-03-16 Thread Don Lewis
On 16 Mar, Jim Ohlstein wrote:
 On 3/16/13 2:20 AM, Jeremy Chadwick wrote:
 On Fri, Mar 15, 2013 at 03:16:19PM -0400, Jim Ohlstein wrote:
 On 3/15/13 12:15 PM, Zoran Kolic wrote:
 After I installed 9.1 amd64 on node with amd 8120,
 I was not able to read temperatures out of the box.
 I fetched source for head module and compiled. And
 loaded module. Still nothing. I assume my cpu is
 a bit different.
 Best regards

 The module from head works for me with an 8120 on 9.1 stable (r247893)
 though the results are inconsistent. I am not certain of how useful they
 are.

 # sysctl hw.model
 hw.model: AMD FX(tm)-8120 Eight-Core Processor

 # kldstat | grep amd
  51 0x8183e000 1043 amdtemp.ko

 # sysctl -a | grep dev.amdtemp
 dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
 dev.amdtemp.0.%driver: amdtemp
 dev.amdtemp.0.%parent: hostb4
 dev.amdtemp.0.sensor_offset: 0
 dev.amdtemp.0.core0.sensor0: 47.7C

 Here are results taken at 0.1 second intervals using a shell script:

 dev.amdtemp.0.core0.sensor0: 42.1C
 dev.amdtemp.0.core0.sensor0: 42.2C
 dev.amdtemp.0.core0.sensor0: 42.0C
 dev.amdtemp.0.core0.sensor0: 42.1C
 dev.amdtemp.0.core0.sensor0: 41.8C
 dev.amdtemp.0.core0.sensor0: 41.7C
 dev.amdtemp.0.core0.sensor0: 51.1C
 dev.amdtemp.0.core0.sensor0: 51.0C
 dev.amdtemp.0.core0.sensor0: 50.7C
 dev.amdtemp.0.core0.sensor0: 50.5C
 dev.amdtemp.0.core0.sensor0: 50.1C
 dev.amdtemp.0.core0.sensor0: 49.8C
 dev.amdtemp.0.core0.sensor0: 49.5C
 dev.amdtemp.0.core0.sensor0: 49.2C
 dev.amdtemp.0.core0.sensor0: 49.2C


 and again:

 dev.amdtemp.0.core0.sensor0: 41.5C
 dev.amdtemp.0.core0.sensor0: 41.2C
 dev.amdtemp.0.core0.sensor0: 40.8C
 dev.amdtemp.0.core0.sensor0: 40.8C
 dev.amdtemp.0.core0.sensor0: 41.0C
 dev.amdtemp.0.core0.sensor0: 41.3C
 dev.amdtemp.0.core0.sensor0: 41.6C
 dev.amdtemp.0.core0.sensor0: 41.3C
 dev.amdtemp.0.core0.sensor0: 54.0C
 dev.amdtemp.0.core0.sensor0: 53.7C
 dev.amdtemp.0.core0.sensor0: 53.3C
 dev.amdtemp.0.core0.sensor0: 53.1C
 dev.amdtemp.0.core0.sensor0: 52.7C
 dev.amdtemp.0.core0.sensor0: 52.3C
 dev.amdtemp.0.core0.sensor0: 52.1C
 dev.amdtemp.0.core0.sensor0: 51.7C
 dev.amdtemp.0.core0.sensor0: 51.5C

 You can see during each series there are sudden increases of over 9C and
 almost 13C respectively.

 The same effect is seen if I track any of the individual cores with
 dev.cpu.[0-7].temperature. Here's an example with a 9C jump in 0.1 second.

 dev.cpu.3.temperature: 41.5C
 dev.cpu.3.temperature: 41.5C
 dev.cpu.3.temperature: 41.7C
 dev.cpu.3.temperature: 41.7C
 dev.cpu.3.temperature: 41.3C
 dev.cpu.3.temperature: 41.0C
 dev.cpu.3.temperature: 40.7C
 dev.cpu.3.temperature: 49.8C
 dev.cpu.3.temperature: 49.5C
 dev.cpu.3.temperature: 49.2C
 dev.cpu.3.temperature: 48.8C
 dev.cpu.3.temperature: 48.6C
 dev.cpu.3.temperature: 48.2C
 dev.cpu.3.temperature: 48.0C

 I don't have hands on access to this box as it's in a datacenter 1000
 miles from me, but the techs there had a look and all seems to be OK.
 
 1. While it's certainly possible the DTS reading routines and/or the
 calculation formulas may be wrong in amdtemp(4), possibly for your model
 of CPU, it is also certainly possible that what you're seeing is normal
 and fully justified.  This is especially the case for the
 dev.cpu.X.temperature nodes on the K8 family.
 
 Respectfully, not combatively nor dismissively: you've not provided a
 comparison base to prove there's an issue.  You would need to provide
 data from Linux (I forget what daemon/tool they have to get this) or
 Windows (Core Temp).
 
 Respectfully, not combatively nor dismissively: I hadn't attempted to
 prove anything. I said: I am not certain of how useful they [the
 readings] are.. I had merely provided some observational data as an
 aside to the fact that yes, indeed, the module provides readings for me
 on the 8120 This was in direct response to to Zoran's issue with this
 module and that processor model.
 
 This started, for me, when I looked at a graph of average core
 temperatures taken at 30 second intervals on two different machines
 using Zabbix. The fluctuations were visibly (I know that's not
 scientific proof) more wild than on this server than on another using
 the amdtemp module from 9 stable.
 
 I don't have access to another server with this model CPU on any other
 OS, or even on this OS, so I cannot provide the data to prove this is
 an issue according to your criteria. However, I will provide comparative
 data from the other machine with the module from stable and with the the
 module from head.
 
 
 Full data taken now:
 
 # sysctl hw.model
 hw.model: AMD FX(tm)-8120 Eight-Core Processor
 
 Using the module from head:
 
 http://pastebin.com/wqQ0FLq3
 
 Note the big change between lines 34 and 35.

My FX-4100 behaves the same way.  I noticed it because on an idle system
sysctl -a | grep amdtemp
would read consistently higher than
sysctl dev.amdtemp

I think the thermal sensor in this AMD CPU family has a much faster
response 

Re: RELENG_8: amdtemp module and newer CPUs not working. MFC?

2013-02-24 Thread Don Lewis
On 20 Feb, Jeremy Chadwick wrote:
 On Wed, Feb 20, 2013 at 10:29:05PM -0800, Don Lewis wrote:
 On 17 Feb, Torfinn Ingolfsen wrote:
  Hello,
  I'm running FreeBSD 8.3-stable on a machine with an AMD A8-5600K cpu.
  tingo@kg-quiet$ uname -a
  FreeBSD kg-quiet.kg4.no 8.3-STABLE FreeBSD 8.3-STABLE #2: Fri Jan  4 
  19:18:15 CET 2013 
  r...@kg-quiet.kg4.no:/usr/obj/usr/src/sys/GENERIC  amd64
  tingo@kg-quiet$ dmesg | grep CPU | head -1
  CPU: AMD A8-5600K APU with Radeon(tm) HD Graphics(3618.02-MHz K8-class 
  CPU)
  
  Unfortunately, the amdtemp.ko module doesn't work:
  tingo@kg-quiet$ kldstat | grep temp
  101 0x8123e000 f0f  amdtemp.ko
  tingo@kg-quiet$ sysctl dev.amdtemp
  sysctl: unknown oid 'dev.amdtemp'
  
  Based on a thread[1] on the forums, amdtemp.c from -CURRENT work.
  But it doesn't compile under FreeBSD 8.3-stable:
 
 Updating amdtemp is on my TODO list.  It has some issues even on
 -CURRENT.  This is kind of far down my priority list because on most of
 my AMD machines, I can also get the temperature without amdtemp:
 
 % sysctl hw.acpi.thermal.tz0.temperature
 hw.acpi.thermal.tz0.temperature: 30.0C
 
 There's an implication in your statement here, so I want to clarify for
 readers (as the author of sysutils/bsdhwmon):
 
 acpi_thermal(4) does not necessarily tie in to an on-die DTS within
 the CPU.  Your motherboards and CPUs (both matter! (e.g. for Intel CPUs,
 see PECI (not a typo)) may offer this tie-in, but such is not the case
 for many people.  I tend to find ACPI thermal zones used in laptops and
 very rarely anywhere else.
 
 acpi_thermal(4) may return temperatures from zones that are mapped to
 readings from Super I/O chips or dedicated H/W monitoring ICs (such as
 ones provided by Nuvuton/Winbond, LM, ITE, ADT, etc.).  It all depends
 on how the BIOSes ACPI tables are written/what maps to what.
 
 Such ICs DO NOT have anything to do with the on-die DTS which both
 amdtemp(4) and coretemp(4) use -- instead, these chips use external
 thermistors which may be placed anywhere on the motherboard (such as
 under the CPU socket, or wherever the manufacturer chooses (and more
 often than not, does not document)).
 
 My point: under the CPU thermistor != within the CPU DTS.  They measure
 two different things, and are not guaranteed to be even remotely
 similar.  I can show proof of this (a very large delta between Core i5
 core DTSes and an on-board IT87xxx) if requested.

You are correct.  It had been several months since I looked at this and
was misremembering the details.

With amdtemp loaded on one of my systems where it works:

hw.acpi.thermal.tz0.temperature: 34.0C
dev.cpu.0.temperature: 37.2C
dev.cpu.1.temperature: 42.2C
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%parent: hostb3
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.core0.sensor0: 37.5C
dev.amdtemp.0.core0.sensor1: 32.7C
dev.amdtemp.0.core1.sensor0: 42.2C
dev.amdtemp.0.core1.sensor1: 28.2C

When I looked at this previously (on another system with only one DTS),
I noticed that dev.amdtemp.0.core0.sensor0 was giving the same answer as
dev.cpu.0.temperature.  I was unaware that amdtemp was responsible for
both sysctl nodes and thought that some other kernel driver was
responsible for dev.cpu.0.temperature, which is why I stopped work on my
amdtemp updates.  I see that the amdtemp(4) man page explains the
situation.

Thanks for the heads up about sysutils/bsdhwmon.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: RELENG_8: amdtemp module and newer CPUs not working. MFC?

2013-02-20 Thread Don Lewis
On 17 Feb, Torfinn Ingolfsen wrote:
 Hello,
 I'm running FreeBSD 8.3-stable on a machine with an AMD A8-5600K cpu.
 tingo@kg-quiet$ uname -a
 FreeBSD kg-quiet.kg4.no 8.3-STABLE FreeBSD 8.3-STABLE #2: Fri Jan  4 19:18:15 
 CET 2013 
 r...@kg-quiet.kg4.no:/usr/obj/usr/src/sys/GENERIC  amd64
 tingo@kg-quiet$ dmesg | grep CPU | head -1
 CPU: AMD A8-5600K APU with Radeon(tm) HD Graphics(3618.02-MHz K8-class 
 CPU)
 
 Unfortunately, the amdtemp.ko module doesn't work:
 tingo@kg-quiet$ kldstat | grep temp
 101 0x8123e000 f0f  amdtemp.ko
 tingo@kg-quiet$ sysctl dev.amdtemp
 sysctl: unknown oid 'dev.amdtemp'
 
 Based on a thread[1] on the forums, amdtemp.c from -CURRENT works.
 But it doesn't compile under FreeBSD 8.3-stable:

Updating amdtemp is on my TODO list.  It has some issues even on
-CURRENT.  This is kind of far down my priority list because on most of
my AMD machines, I can also get the temperature without amdtemp:

% sysctl hw.acpi.thermal.tz0.temperature
hw.acpi.thermal.tz0.temperature: 30.0C


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: LSI 9240-4i 4K alignment

2012-08-25 Thread Don Lewis
On  8 Aug, George Kontostanos wrote:

 The problem:
 
 When trying to create a RaidZ pool using gpart and perform a 4K
 alignment using gnop, we get the following error immediately after
 exporting the pool and destroying the .nop devices:
 
 id: 8043746387654554958
   state: FAULTED
  status: One or more devices contains corrupted data.
  action: The pool cannot be imported due to damaged devices or data.
   The pool may be active on another system, but can be imported using
   the '-f' flag.
see: http://illumos.org/msg/ZFS-8000-5E
  config:
 
   Pool  FAULTED  corrupted data
 raidz1-0ONLINE
   13283347160590042564  UNAVAIL  corrupted data
   16981727992215676534  UNAVAIL  corrupted data
   6607570030658834339   UNAVAIL  corrupted data
   3435463242860701988   UNAVAIL  corrupted data
 
 When we use glabel for the same purpose with the combination of gnop,
 the pool imports fine.

Might kern/170945 have something to do with this?

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: LSI 9240-4i 4K alignment

2012-08-20 Thread Don Lewis
On 19 Aug, Josh Paetzel wrote:
 On 08/19/2012 14:04, Steven Hartland wrote:

 HBAs are the way to go if you're using ZFS to manage the disks; you only
 need RAID if you're using a FS which doesn't manage the disk side well,
 such as UFS.
 
 It's often quite common for RAID controllers to actually be slower
 vs plain HBAs, as the RAID stack can get in the way.

Any idea of what kind of performance penalty I might see by using the
RAID firmware in JBOD mode vs flashing the IT firmware?

 Just to clear up,
 
 The 9240 is a sas2008 based card with the megaraid software on top of
 it.  In its default config from LSI the FreeBSD mfi driver will recognize it
 in later versions of FreeBSD (the upcoming 9.1 for sure).  Older
 versions of mfi will not recognize it.
 
 The card can be flashed with IT firmware and then becomes a 9211 HBA,
 but it's a bit more expensive than a 9211 is so that doesn't make sense
 to do in many cases.

The price difference was pretty minor when I looked.  Confusingly
enough, the 9211 HBA also has some RAID capabilities.

For me, the biggest advantage of the 9211 would be that it would have
allowed me to use shorter cables.

 On the dmesg posted the firmware on the card is phase 11.  This *must*
 be in lockstep with the driver version or the card may not play nicely.
  FreeBSD 8.3 and 9.0 have v13 of the driver, the upcoming 9.1 will have
 v14.  Note that v14 fixes a *ton* of stability bugs, including issues
 where bad drives would hang the controller or prevent systems from booting.

Where do those version numbers come from?  The mfi driver in 9.0-RELEASE
claims to be version 3.00 and the driver in 9.1 claims to be version
4.23.

This is what shows up in dmesg on my machine:

mfi0: Drake Skinny port 0xce00-0xceff mem 0xfcefc000-0xfcef,0xfce8-0xf
ceb irq 18 at device 0.0 on pci1
mfi0: Using MSI
mfi0: Megaraid SAS driver Ver 4.23
mfi0: 333 (398082533s/0x0020/info) - Shutdown command received from host
mfi0: 334 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0073
/1000/9240/1000)
mfi0: 335 (boot + 3s/0x0020/info) - Firmware version 2.70.04-0862
mfi0: 336 (boot + 5s/0x0020/info) - Board Revision 04A
mfi0: 337 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0073
/1000/9240/1000)
mfi0: 338 (boot + 3s/0x0020/info) - Firmware version 2.70.04-0862
mfi0: 339 (boot + 5s/0x0020/info) - Board Revision 04A
mfi0: 340 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0073
/1000/9240/1000)
mfi0: 341 (boot + 3s/0x0020/info) - Firmware version 2.70.04-0862
mfi0: 342 (boot + 5s/0x0020/info) - Board Revision 04A
mfi0: 343 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0073
/1000/9240/1000)
mfi0: 344 (boot + 3s/0x0020/info) - Firmware version 2.70.04-0862
mfi0: 345 (boot + 5s/0x0020/info) - Board Revision 04A
mfi0: 346 (398759025s/0x0020/info) - Time established as 08/20/12  6:23:45; (25
seconds since power on)
mfi0: 347 (398759051s/0x0020/info) - Time established as 08/20/12  6:24:11; (51
seconds since power on)
mfi0: 348 (398759078s/0x0020/WARN) - Patrol Read can't be started, as PDs are ei
ther not ONLINE, or are in a VD with an active process, or are in an excluded VD


% mfiutil show firmware
mfi0 Firmware Package Version: 20.5.1-0003
mfi0 Firmware Images:
Name  VersionDate Time  Status
BIOS  4.14.00   active
PCLI  03.02-001:#%8  Feb 09 2010  13:09:06  active
BCON  4.0-22-e_10-RelMar 11 2010  12:38:08  active
NVDT  3.04.03-0002   Apr 05 2010  18:50:27  active
APP   2.70.04-0862   May 05 2010  18:12:07  active
BTBL  2.01.00.00-0019May 14 2009  15:52:08  active


The only firmware file on LSI's web site for the 9240-8i is version
20.10.1-107, which appears to be newer than what is on the card if the
20.5.1-0003 is the version number that I should be looking at.  Is the
BIOS Version 4.14 the v14 version that you mention above?

If the FreeBSD mfi driver expects a certain firmware version, shouldn't
it complain if it doesn't find it?

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: LSI 9240-4i 4K alignment

2012-08-19 Thread Don Lewis
On 16 Aug, Steven Hartland wrote:
 - Original Message - 
 From: George Kontostanos gkontos.m...@gmail.com
  
 You are right, the chip specs say: LSISAS2108 RAID-on-Chip
 
 The drives are identified as mfisyspd0, mfisyspd1, etc.
 
 The following might be interesting to you:-
 http://forums.servethehome.com/showthread.php?599-LSI-RAID-Controller-and-HBA-Complete-Listing-Plus-OEM-Models
 
 Which states:-
 LSI MegaRAID SAS 9240-4i 1x4 port internal SAS vertical,
 no cache, no BBU, RAID 0, 1, 10 and 5, can be crossflashed
 to LSI9211 IT/IR
 
 This is interesting as this is the card we're using but
 in the 8 port version under mps :)

I wish I would have known this earlier.  I just put together a ZFS
server using LSI MegaRAID SAS 9240-8i cards.  The cabling probably would
have been cleaner with the 9211-8i, but I went with the 9240 because the
vendor that I purchased the cards from listed that 9240 as being
PCI-Express 2.0, but didn't say that about the 9211.  I also got the
impression that the 9240 recognized JBOD drives with the off-the-shelf
firmware, whereas the 9211 did not.

Even LSI's own site is a bit confusing.  They list the 9211 in the HBA
section, but its specs don't mention JBOD, whereas the 9240 is listed in
the RAID section and its specs do list JBOD.  If the only physical
difference between the cards is the connector position, it seems odd
that they don't offer products with all the combinations of firmware and
connector position.

I haven't configured the ZFS pool yet, but I didn't have any trouble
installing FreeBSD 9.1-BETA on the GPT partitioned boot drive, which
shows up as an mfi device.  I'm planning on getting the ZFS pool up and
running in the next few days.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: LSI 9240-4i 4K alignment

2012-08-19 Thread Don Lewis
On  8 Aug, George Kontostanos wrote:
 Hi all,
 
 We have a server with a LSI 9240-4i controller configured in JBOD with
 4 SATA disks. Running FreeBSD 9.1-Beta1:
 
 Relevant dmesg:
 
 FreeBSD 9.1-BETA1 #0: Thu Jul 12 09:38:51 UTC 2012
 r...@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
 CPU: Intel(R) Xeon(R) CPU E31230 @ 3.20GHz (3200.09-MHz K8-class CPU)
   Origin = GenuineIntel  Id = 0x206a7  Family = 6  Model = 2a  Stepping = 7
   
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
    AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
    AMD Features2=0x1<LAHF>
   TSC: P-state invariant, performance statistics
 real memory  = 17179869184 (16384 MB)
 avail memory = 16471670784 (15708 MB)
 ...
 mfi0: Drake Skinny port 0xe000-0xe0ff mem
 0xf7a6-0xf7a63fff,0xf7a0-0xf7a3 irq 16 at device 0.0 on
 pci1
 mfi0: Using MSI
 mfi0: Megaraid SAS driver Ver 4.23
 ...
 mfi0: 321 (397672301s/0x0020/info) - Shutdown command received from host
 mfi0: 322 (boot + 3s/0x0020/info) - Firmware initialization started
 (PCI ID 0073/1000/9241/1000)
 mfi0: 323 (boot + 3s/0x0020/info) - Firmware version 2.130.354-1664
 mfi0: 324 (boot + 3s/0x0020/info) - Firmware initialization started
 (PCI ID 0073/1000/9241/1000)
 mfi0: 325 (boot + 3s/0x0020/info) - Firmware version 2.130.354-1664
 mfi0: 326 (boot + 5s/0x0020/info) - Package version 20.10.1-0107
 mfi0: 327 (boot + 5s/0x0020/info) - Board Revision 03A
 mfi0: 328 (boot + 25s/0x0002/info) - Inserted: PD 04(e0xff/s3)
 ...
 mfisyspd0 on mfi0
 mfisyspd0: 1907729MB (3907029168 sectors) SYSPD volume
 mfisyspd0:  SYSPD volume attached
 mfisyspd1 on mfi0
 mfisyspd1: 1907729MB (3907029168 sectors) SYSPD volume
 mfisyspd1:  SYSPD volume attached
 mfisyspd2 on mfi0
 mfisyspd2: 1907729MB (3907029168 sectors) SYSPD volume
 mfisyspd2:  SYSPD volume attached
 mfisyspd3 on mfi0
 mfisyspd3: 1907729MB (3907029168 sectors) SYSPD volume
 mfisyspd3:  SYSPD volume attached
 ...
 mfi0: 329 (boot + 25s/0x0002/info) - Inserted: PD 04(e0xff/s3) Info:
 enclPd=, scsiType=0, portMap=00,
 sasAddr=44332211,
 mfi0: 330 (boot + 25s/0x0002/info) - Inserted: PD 05(e0xff/s1)
 mfi0: 331 (boot + 25s/0x0002/info) - Inserted: PD 05(e0xff/s1) Info:
 enclPd=, scsiType=0, portMap=02,
 sasAddr=443322110200,
 mfi0: 332 (boot + 25s/0x0002/info) - Inserted: PD 06(e0xff/s2)
 mfi0: 333 (boot + 25s/0x0002/info) - Inserted: PD 06(e0xff/s2) Info:
 enclPd=, scsiType=0, portMap=03,
 sasAddr=443322110100,
 mfi0: 334 (boot + 25s/0x0002/info) - Inserted: PD 07(e0xff/s0)
 mfi0: 335 (boot + 25s/0x0002/info) - Inserted: PD 07(e0xff/s0) Info:
 enclPd=, scsiType=0, portMap=01,
 sasAddr=443322110300,
 mfi0: 336 (397672376s/0x0020/info) - Time established as 08/07/12
 16:32:56; (28 seconds since power on)
 
 The problem:
 
 When trying to create a RaidZ pool using gpart and perform a 4K
 alignment using gnop, we get the following error immediately after
 exporting the pool and destroying the .nop devices:
 
 id: 8043746387654554958
   state: FAULTED
  status: One or more devices contains corrupted data.
  action: The pool cannot be imported due to damaged devices or data.
   The pool may be active on another system, but can be imported using
   the '-f' flag.
see: http://illumos.org/msg/ZFS-8000-5E
  config:
 
   Pool  FAULTED  corrupted data
 raidz1-0ONLINE
   13283347160590042564  UNAVAIL  corrupted data
   16981727992215676534  UNAVAIL  corrupted data
   6607570030658834339   UNAVAIL  corrupted data
   3435463242860701988   UNAVAIL  corrupted data

I'm planning on doing something similar, but I'm curious about how gnop
and GPT labels interact.  I want to partition the drives for my pool
slightly on the small side so that I'm less likely to run into problems
if I have to replace a drive in the future.  If I used gpart to create
and label a GPT partition on the drive, the partition will show up as
/dev/gpt/label.  The gnop man page says that running gnop on dev creates
/dev/dev.nop.  What happens if you gnop /dev/gpt/label?  Is this what
you are doing?

 When we use glabel for the same purpose with the combination of gnop,
 the pool imports fine.
 
 Any suggestions?

It should be sufficient to only gnop one of the devices.  You should be
able to create the pool with only one gnop device to get the ashift
value that you desire, export the pool, destroy the .nop device, and
import the pool.  If it thinks the device is corrupted (which seems like
a bug of some sort), then ZFS should be able to resilver it.
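
For what it's worth, a rough sketch of that single-gnop workflow (the
device names, pool name, and 4K sector size below are placeholders, not
anything from your setup):

# gnop create -S 4096 /dev/ada0
# zpool create tank raidz ada0.nop ada1 ada2 ada3
# zpool export tank
# gnop destroy /dev/ada0.nop
# zpool import tank
# zdb -C tank | grep ashift    (should now report ashift: 12)

The ashift is recorded per-vdev when the pool is created, so it survives
the removal of the .nop device.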

___
freebsd-stable@freebsd.org mailing list

Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-10-01 Thread Don Lewis
On 30 Sep, Don Lewis wrote:

 The silent reboots that I was seeing with WITNESS go away if I add
 WITNESS_SKIPSPIN.  Witness doesn't complain about anything.

I've tracked down the silent reboot problem.  It happens when a
userland sysctl call gets down into calcru1(), which tries to print a
calcru: ... message.  Eventually sc_puts() wants to grab a spin lock,
which causes a call to witness, which detects a lock order reversal.
This recurses into printf(), which dives back into the console code and
eventually triggers a panic.

I'm still gathering the details on this and will see what I can come up
with for a fix.

 I tested -CURRENT and !SMP seems to work ok.  One difference in terms of
 hardware between the two tests is that I'm using a SATA drive when
 testing -STABLE and a SCSI drive when testing -CURRENT.

I'm not able to trigger the problem with -CURRENT when it is running on
a SCSI drive, but I do see the freezes, long ping RTTs, and ntp insanity
when running a !SMP -CURRENT kernel on my SATA drive with an 8.1-STABLE
world.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-30 Thread Don Lewis
On 30 Sep, Andriy Gapon wrote:
 on 30/09/2010 02:27 Don Lewis said the following:

 vmstat -i ?

I didn't see anything odd in the vmstat -i output that I posted to the list
earlier.  It looked more or less normal as the ntp offset suddenly went
insane.
 
 I did manage to catch the problem with lock profiling enabled:
 http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE_lock_profile_freeze.txt
 I'm currently testing SMP some more to verify if it really avoids this
 problem.
 
 OK.

I wasn't able to cause SMP on stable to break.

The silent reboots that I was seeing with WITNESS go away if I add
WITNESS_SKIPSPIN.  Witness doesn't complain about anything.

I tested -CURRENT and !SMP seems to work ok.  One difference in terms of
hardware between the two tests is that I'm using a SATA drive when
testing -STABLE and a SCSI drive when testing -CURRENT.

At this point, I think the biggest clues are going to be in the lock
profile results.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-30 Thread Don Lewis
On 30 Sep, Andriy Gapon wrote:
 on 30/09/2010 02:27 Don Lewis said the following:

 I tried enabling apic and got worse results.  I saw ping RTTs as high as
 67 seconds.  Here's the timer info with apic enabled:
[snip]
 Here's the verbose boot info with apic:
 http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE-apic-verbose.txt
 
 vmstat -i ?

Here's the vmstat -i output at the time the machine starts experiencing
freezes and ntp goes insane:

Thu Sep 30 11:38:57 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   10  0
irq12: psm0   18  0
irq14: ata0 2845  1
irq17: ahc0  310  0
irq19: fwohci0 1  0
irq22: ehci0+  74628 40
cpu0: timer  3676399   1999
irq256: nfe03915  2
Total3758132   2043
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u  129  128  3770.185   -0.307   0.020

Thu Sep 30 11:39:59 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   10  0
irq12: psm0   18  0
irq14: ata0 2935  1
irq17: ahc0  310  0
irq19: fwohci0 1  0
irq22: ehci0+  78954 41
cpu0: timer  3796447   1998
irq256: nfe04090  2
Total3882771   2043
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u   61  128  3770.185   -0.307   0.023

Thu Sep 30 11:40:59 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   10  0
irq12: psm0   18  0
irq14: ata0 3025  1
irq17: ahc0  310  0
irq19: fwohci0 1  0
irq22: ehci0+  85038 43
cpu0: timer  3916483   1998
irq256: nfe04247  2
Total4009138   2045
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u  121  128  3770.185   -0.307   0.023

Thu Sep 30 11:41:59 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   10  0
irq12: psm0   18  0
irq14: ata0 3115  1
irq17: ahc0  310  0
irq19: fwohci0 1  0
irq22: ehci0+  89099 44
cpu0: timer  4036529   1998
irq256: nfe04384  2
Total4133472   2046
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u   54  128  3770.185   -0.307 43008.9

Thu Sep 30 11:42:59 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   11  0
irq12: psm0   18  0
irq14: ata0 3205  1
irq17: ahc0  310  0
irq19: fwohci0 1  0
irq22: ehci0+  92111 44
cpu0: timer  4156575   1998
irq256: nfe04421  2
Total4256658   2046
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u  114  128  3770.185   -0.307 43008.9

Thu Sep 30 11:43:59 PDT 2010
interrupt  total   rate
irq1: atkbd0   6  0
irq9: acpi0   12  0
irq12: psm0   18  0
irq14: ata0 3295  1
irq17: ahc0  310  0
irq19

Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-29 Thread Don Lewis
On 29 Sep, Jeremy Chadwick wrote:

 Given all the information here, in addition to the other portion of the
 thread (indicating ntpd reports extreme offset between the system clock
 and its stratum 1 source), I would say the motherboard is faulty or
 there is a system device which is behaving badly (possibly something
 pertaining to interrupts, but I don't know how to debug this on a low
 level).

Possible, but I haven't run into any problems running -CURRENT on this
box with an SMP kernel.

 Can you boot verbosely and provide all of the output here or somewhere
 on the web?

http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE-verbose.txt

 If possible, I would start by replacing the mainboard.  The board looks
 to be a consumer-level board (I see an nfe(4) controller, for example).

It's an Abit AN-M2 HD.  The RAM is ECC.  I haven't seen any machine
check errors in the logs.  I'll run prime95 as soon as I have a chance.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-29 Thread Don Lewis
On 29 Sep, Jeremy Chadwick wrote:
 On Wed, Sep 29, 2010 at 12:39:49AM -0700, Don Lewis wrote:
 On 29 Sep, Jeremy Chadwick wrote:
 
  Given all the information here, in addition to the other portion of the
  thread (indicating ntpd reports extreme offset between the system clock
  and its stratum 1 source), I would say the motherboard is faulty or
  there is a system device which is behaving badly (possibly something
  pertaining to interrupts, but I don't know how to debug this on a low
  level).
 
 Possible, but I haven't run into any problems running -CURRENT on this
 box with an SMP kernel.
 
  Can you boot verbosely and provide all of the output here or somewhere
  on the web?
 
 http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE-verbose.txt
 
  If possible, I would start by replacing the mainboard.  The board looks
  to be a consumer-level board (I see an nfe(4) controller, for example).
 
 It's an Abit AN-M2 HD.  The RAM is ECC.  I haven't seen any machine
 check errors in the logs.  I'll run prime95 as soon as I have a chance.
 
 Thanks for the verbose boot.  Since it works on -CURRENT, can you
 provide a verbose boot from that as well?  Possibly someone made some
 changes between RELENG_8 and HEAD which fixed an issue, which could be
 MFC'd.

Even when I saw the weird ntp stepping problem and the calcru messages,
the system was still stable enough to build hundreds of ports.  In the
most recent case, I built 800+ ports over several days without any other
hiccups.

It could also be a difference between SMP and !SMP.  I just found a bug
that causes an immediate panic if lock profiling is enabled on a !SMP
kernel.  This bug also exists in -CURRENT.  Here's the patch:

Index: sys/sys/mutex.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/mutex.h,v
retrieving revision 1.105.2.1
diff -u -r1.105.2.1 mutex.h
--- sys/sys/mutex.h	3 Aug 2009 08:13:06 -0000	1.105.2.1
+++ sys/sys/mutex.h	29 Sep 2010 06:58:52 -0000
@@ -251,8 +251,11 @@
 #define _rel_spin_lock(mp) do {					\
 	if (mtx_recursed((mp)))						\
 		(mp)->mtx_recurse--;					\
-	else								\
+	else {								\
 		(mp)->mtx_lock = MTX_UNOWNED;				\
+		LOCKSTAT_PROFILE_RELEASE_LOCK(LS_MTX_SPIN_UNLOCK_RELEASE, \
+		    mp);						\
+	}								\
 	spinlock_exit();						\
 } while (0)
 #endif /* SMP */


After applying the above patch, I enabled lock profiling and got the
following results when I ran make index:
http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE_lock_profile.txt

I didn't see anything strange happening this time.  I don't know if I
got lucky, or the change in kernel options fixed the bug.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-29 Thread Don Lewis
On 29 Sep, Andriy Gapon wrote:
 on 29/09/2010 00:11 Don Lewis said the following:
 On 28 Sep, Don Lewis wrote:
 
 
 % vmstat -i
 interrupt  total   rate
 irq0: clk   60683442   1000
 irq1: atkbd0   6  0
 irq8: rtc7765537127
 irq9: acpi0   13  0
 irq10: ohci0 ehci1+ 10275064169
 irq11: fwohci0 ahc+   132133  2
 irq12: psm0   21  0
 irq14: ata090982  1
 irq15: nfe0 ata1   18363  0

 I'm not sure why I'm getting USB interrupts.  There aren't any USB
 devices plugged into this machine.
 
 Answer: irq 10 is also shared by vgapci0 and atapci1.
 
 Just curious why Local APIC timer isn't being used for hardclock on your 
 system.

I'm using the same kernel config as the one on a slower !SMP box which
I'm trying to squeeze as much performance out of as possible.  My kernel
config file contains these statements:
nooptions   SMP
nodevice    apic

Testing with an SMP kernel is on my TODO list.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-29 Thread Don Lewis
On 29 Sep, Andriy Gapon wrote:
 on 29/09/2010 11:56 Don Lewis said the following:
 I'm using the same kernel config as the one on a slower !SMP box which
 I'm trying to squeeze as much performance out of as possible.  My kernel
 config file contains these statements:
  nooptions   SMP
  nodevice    apic
 
 Testing with an SMP kernel is on my TODO list.
 
 SMP or not, it's really weird to see apic disabled nowadays.

I tried enabling apic and got worse results.  I saw ping RTTs as high as
67 seconds.  Here's the timer info with apic enabled:

# sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-100)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 53633
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 7988816
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 1341917999
kern.timecounter.tc.TSC.frequency: 2500014018
kern.timecounter.tc.TSC.quality: 800
kern.timecounter.invariant_tsc: 0

Here's the verbose boot info with apic:
http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE-apic-verbose.txt


I've also experimented with SMP as well as SCHED_4BSD (all previous
testing was with !SMP and SCHED_ULE).  I still see occasional problems
with SCHED_4BSD and !SMP, but so far I have not seen any problems with
SCHED_ULE and SMP.

I did manage to catch the problem with lock profiling enabled:
http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE_lock_profile_freeze.txt


I'm currently testing SMP some more to verify if it really avoids this
problem.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-28 Thread Don Lewis
On 28 Sep, Chip Camden wrote:
 Quoth Don Lewis on Monday, 27 September 2010:
 CPU time accounting is broken on one of my machines running 8-STABLE.  I
 ran a test with a simple program that just loops and consumes CPU time:
 
 % time ./a.out
 94.544u 0.000s 19:14.10 8.1% 62+2054k 0+0io 0pf+0w
 
 The display in top shows the process with WCPU at 100%, but TIME
 increments very slowly.
 
 Several hours after booting, I got a bunch of calcru: runtime went
 backwards messages, but they stopped right away and never appeared
 again.
 
 Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1)
 Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001
 Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration
 Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration
 Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s
 [snip]
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 
 6836685136 usec to 5425839798 usec for pid 1526 (csh)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 
 usec to 2403 usec for pid 1519 (csh)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 
 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 
 usec to 3734 usec for pid 1488 (console-kit-daemon)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec 
 to 459 usec for pid 1480 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec 
 to 450 usec for pid 1479 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec 
 to 449 usec for pid 1478 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec 
 to 447 usec for pid 1477 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec 
 to 450 usec for pid 1476 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec 
 to 458 usec for pid 1475 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1026 
 usec to 482 usec for pid 1474 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 
 usec to 626 usec for pid 1473 (getty)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 
 usec to 1160 usec for pid 1440 (inetd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec 
 to 690 usec for pid 1402 (sshd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 
 usec to 56770 usec for pid 1360 (cupsd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 
 usec to 2914 usec for pid 1289 (dbus-daemon)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec 
 to 84 usec for pid 1265 (moused)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 
 usec to 10407 usec for pid 1041 (nfsd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 
 usec to 607 usec for pid 1032 (mountd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 
 usec to 4134 usec for pid 664 (devd)
 Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec 
 to 9 usec for pid 9 (sctp_iterator)
 
 
 If I reboot and run the test again, the CPU time accounting seems to be
 working correctly.
 % time ./a.out
 1144.226u 0.000s 19:06.62 99.7%  5+168k 0+0io 0pf+0w
 
 snip
 
 I notice that before the calcru messages, ntpd reset the clock by
 18 seconds -- that probably accounts for that.

Interesting observation.  Since this happened so early in the log, I
thought that this time change was the initial time change after boot,
but taking a closer look, the time change occurred about 4 1/2 hours
after boot.  The calcru messages occured another 5 1/2 hours after that.

I also just noticed that this log info was from the August 23rd kernel,
before I noticed the CPU time accounting problem, and not the latest
occurrence.  Here's the latest log info:

Sep 23 16:33:50 scratch ntpd[1144]: ntpd 4.2.4p5-a (1)
Sep 23 16:37:03 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 17:43:47 scratch ntpd[1145]: time reset +276.133928 s
Sep 23 17:43:47 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 17:47:15 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 19:02:48 scratch ntpd[1145]: time reset +291.507262 s
Sep 23 19:02:48 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 19:06:37 scratch ntpd[1145]: kernel time sync status change 2001
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 1120690857 u
sec to 367348485 usec for pid 1518 (csh)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 5403 usec to
 466 usec for pid 1477 (hald-addon-mouse-sy)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 7511 usec to
 1502 usec for pid 1472 (hald-runner)
Sep 24 00:03:36

Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-28 Thread Don Lewis
On 28 Sep, Jeremy Chadwick wrote:
 On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote:
 My time source is another FreeBSD box with a GPS receiver on my LAN.  My
 other client machine isn't seeing these time jumps.  The only messages
 from ntp in its log from this period are these:
 
 Sep 23 04:12:23 mousie ntpd[]: kernel time sync status change 6001
 Sep 23 04:29:29 mousie ntpd[]: kernel time sync status change 2001
 Sep 24 03:55:24 mousie ntpd[]: kernel time sync status change 6001
 Sep 24 04:12:28 mousie ntpd[]: kernel time sync status change 2001
 
 I'm speaking purely about ntpd below this point -- almost certainly a
 separate problem/issue, but I'll explain it anyway.  I'm not under the
 impression that the calcru messages indicate RTC clock drift, but I'd
 need someone like John Baldwin to validate my statement.

I don't think the problems are directly related.  I think the calcru
messages get triggered by clock frequency changes that get detected and
change the tick to usec conversion ratio.

 Back to ntpd: you can addressing the above messages by adding maxpoll
 9 to your server lines in ntp.conf.  The comment we use in our
 ntp.conf that documents the well-known problem:

Thanks I'll try that.

 # maxpoll 9 is used to work around PLL/FLL flipping, which happens at
 # exactly 1024 seconds (the default maxpoll value).  Another FreeBSD
 # user recommended using 9 instead:
 # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html
 
  I don't know if that has any connection to time(1) running slower -- but
  perhaps ntpd is aggressively adjusting your clock?
 
 It seems to be pretty stable when the machine is idle:
 
 % ntpq -c pe
  remote   refid  st t when poll reach   delay   offset  
 jitter
 ==
 *gw.catspoiler.o .GPS.1 u8   64  3770.168   -0.081   
 0.007
 
 Not too much degradation under CPU load:
 
 % ntpq -c pe
  remote   refid  st t when poll reach   delay   offset  
 jitter
 ==
 *gw.catspoiler.o .GPS.1 u   40   64  3770.166   -0.156   
 0.026
 
 I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
 much, either.
 
 % ntpq -c pe
  remote   refid  st t when poll reach   delay   offset  
 jitter
 ==
 *gw.catspoiler.o .GPS.1 u   35   64  3770.169   -0.106   
 0.009
 
 Still speaking purely about ntpd:
 
 The above doesn't indicate a single problem.  The deltas shown in both
 delay, offset, and jitter are all 100% legitimate.  A dd (to induce more
 interrupt use) isn't going to exacerbate the problem (depending on your
 system configuration, IRQ setup, local APIC, etc.).

I was hoping to do something to provoke clock interrupt loss.  I don't
see any problems when this machine is idle.  The last two times that the
calcru messages have occurred were when I booted this machine to build a
bunch of ports.

I don't see any problems when this machine is idle.  Offset and jitter
always look really good whenever I've looked.



 How about writing a small shell script that runs every minute in a
 cronjob that does vmstat -i  /some/file.log?  Then when you see calcru
 messages, look around the time frame where vmstat -i was run.  Look for
 high interrupt rates, aside from those associated with cpuX devices.

Ok, I'll give this a try.  Just for reference, this is what is currently
reported:
% vmstat -i
interrupt  total   rate
irq0: clk   60683442   1000
irq1: atkbd0   6  0
irq8: rtc7765537127
irq9: acpi0   13  0
irq10: ohci0 ehci1+ 10275064169
irq11: fwohci0 ahc+   132133  2
irq12: psm0   21  0
irq14: ata090982  1
irq15: nfe0 ata1   18363  0

I'm not sure why I'm getting USB interrupts.  There aren't any USB
devices plugged into this machine.

# usbconfig dump_info
ugen0.1: OHCI root HUB nVidia at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) 
pwr=ON

ugen1.1: EHCI root HUB nVidia at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) 
pwr=ON

ugen2.1: OHCI root HUB nVidia at usbus2, cfg=0 md=HOST spd=FULL (12Mbps) 
pwr=ON

ugen3.1: EHCI root HUB nVidia at usbus3, cfg=0 md=HOST spd=HIGH (480Mbps) 
pwr=ON


 Next, you need to let ntpd run for quite a bit longer than what you did
 above.  Your poll maximum is only 64, indicating ntpd had recently been
 restarted, or that your offset deviates greatly (my guess is ntpd being
 restarted).  poll will increase over time (64, 128, 256, 512, and
 usually max out at 1024), depending on how stable the clock is.  when
 is a counter

Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-28 Thread Don Lewis
On 28 Sep, Don Lewis wrote:


 % vmstat -i
 interrupt  total   rate
 irq0: clk   60683442   1000
 irq1: atkbd0   6  0
 irq8: rtc7765537127
 irq9: acpi0   13  0
 irq10: ohci0 ehci1+ 10275064169
 irq11: fwohci0 ahc+   132133  2
 irq12: psm0   21  0
 irq14: ata090982  1
 irq15: nfe0 ata1   18363  0
 
 I'm not sure why I'm getting USB interrupts.  There aren't any USB
 devices plugged into this machine.

Answer: irq 10 is also shared by vgapci0 and atapci1.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-28 Thread Don Lewis
On 28 Sep, Jeremy Chadwick wrote:

 Still speaking purely about ntpd:
 
 The above doesn't indicate a single problem.  The deltas shown in both
 delay, offset, and jitter are all 100% legitimate.  A dd (to induce more
 interrupt use) isn't going to exacerbate the problem (depending on your
 system configuration, IRQ setup, local APIC, etc.).
 
 How about writing a small shell script that runs every minute in a
 cronjob that does vmstat -i  /some/file.log?  Then when you see calcru
 messages, look around the time frame where vmstat -i was run.  Look for
 high interrupt rates, aside from those associated with cpuX devices.

Looking at the timestamps of things and comparing to my logs, I
discovered that the last instance of ntp instability happened when I was
running make index in /usr/ports.  I tried it again with entertaining
results.  After a while, the machine became unresponsive.  I was logged
in over ssh and it stopped echoing keystrokes.  In parallel I was
running a script that echoed the date, the results of vmstat -i, and
the results of ntpq -c pe.  The latter showed jitter and offset going
insane.  Eventually make index finished and the machine was responsive
again, but the time was way off and ntpd croaked because the necessary
time correction was too large.  Nothing else anomalous showed up in the
logs.  Hmn, about half an hour after ntpd died I started my CPU time
accounting test and two minutes into that test I got a spew of calcru
messages ...
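
Roughly what that monitoring script did (the one-minute interval and the
log path here are stand-ins; the actual script may have differed in
detail):

#!/bin/sh
# sketch -- interval and log path are arbitrary
# record the date, the interrupt counters, and the ntp peer status,
# then sleep and repeat
while :; do
        date
        vmstat -i
        ntpq -c pe
        sleep 60
done >> /var/tmp/clockwatch.log 2>&1

The excerpts below are in that date / vmstat -i / ntpq -c pe order.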

Tue Sep 28 14:52:27 PDT 2010
interrupt  total   rate
irq0: clk   64077827999
irq1: atkbd0  26  0
irq8: rtc8199966127
irq9: acpi0   19  0
irq10: ohci0 ehci1+ 10356112161
irq11: fwohci0 ahc+   132133  2
irq12: psm0   27  0
irq14: ata096064  1
irq15: nfe0 ata1   23350  0
Total   82885524   1293
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u  137  128  3770.1950.111   0.030

Tue Sep 28 14:53:27 PDT 2010
interrupt  total   rate
irq0: clk   64137854999
irq1: atkbd0  26  0
irq8: rtc8207648127
irq9: acpi0   19  0
irq10: ohci0 ehci1+ 10360184161
irq11: fwohci0 ahc+   132133  2
irq12: psm0   27  0
irq14: ata096154  1
irq15: nfe0 ata1   23379  0
Total   82957424   1293
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u   56  128  3770.1950.111 853895.

Tue Sep 28 14:54:27 PDT 2010
interrupt  total   rate
irq0: clk   64197881999
irq1: atkbd0  26  0
irq8: rtc8215329127
irq9: acpi0   21  0
irq10: ohci0 ehci1+ 10360777161
irq11: fwohci0 ahc+   132133  2
irq12: psm0   27  0
irq14: ata096244  1
irq15: nfe0 ata1   23405  0
Total   83025843   1293
 remote   refid  st t when poll reach   delay   offset  jitter
==
*gw.catspoiler.o .GPS.1 u  116  128  3770.1950.111 853895.

Tue Sep 28 14:55:27 PDT 2010
interrupt  total   rate
irq0: clk   64257907999
irq1: atkbd0  26  0
irq8: rtc8223011127
irq9: acpi0   21  0
irq10: ohci0 ehci1+ 10360836161
irq11: fwohci0 ahc+   132133  2
irq12: psm0   27  0
irq14: ata096334  1
irq15: nfe0 ata1   23424  0
Total   83093719   1292
 remote   refid  st t when poll reach   delay   offset  jitter
==
 gw.catspoiler.o .GPS.1 u   48  128  3770.197  2259195 2091608

Tue Sep 28 14:56:27 PDT 2010
interrupt  total   rate
irq0: clk   64317933999
irq1: atkbd0 

Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-28 Thread Don Lewis
On 28 Sep, Don Lewis wrote:

 Looking at the timestamps of things and comparing to my logs, I
 discovered that the last instance of ntp instability happened when I was
 running make index in /usr/ports.  I tried it again with entertaining
 results.  After a while, the machine became unresponsive.  I was logged
 in over ssh and it stopped echoing keystrokes.  In parallel I was
 running a script that echoed the date, the results of vmstat -i, and
 the results of ntpq -c pe.  The latter showed jitter and offset going
 insane.  Eventually make index finished and the machine was responsive
 again, but the time was way off and ntpd croaked because the necessary
 time correction was too large.  Nothing else anomalous showed up in the
 logs.  Hmn, about half an hour after ntpd died I started my CPU time
 accounting test and two minutes into that test I got a spew of calcru
 messages ...

I tried this experiment again using a kernel with WITNESS and
DEBUG_VFS_LOCKS compiled in, and pinging this machine from another.
Things look normal for a while, then the ping times get huge for a while
and then recover.

64 bytes from 192.168.101.3: icmp_seq=1169 ttl=64 time=0.135 ms
64 bytes from 192.168.101.3: icmp_seq=1170 ttl=64 time=0.141 ms
64 bytes from 192.168.101.3: icmp_seq=1171 ttl=64 time=0.130 ms
64 bytes from 192.168.101.3: icmp_seq=1172 ttl=64 time=0.131 ms
64 bytes from 192.168.101.3: icmp_seq=1173 ttl=64 time=0.128 ms
64 bytes from 192.168.101.3: icmp_seq=1174 ttl=64 time=38232.140 ms
64 bytes from 192.168.101.3: icmp_seq=1175 ttl=64 time=37231.309 ms
64 bytes from 192.168.101.3: icmp_seq=1176 ttl=64 time=36230.470 ms
64 bytes from 192.168.101.3: icmp_seq=1177 ttl=64 time=35229.632 ms
64 bytes from 192.168.101.3: icmp_seq=1178 ttl=64 time=34228.791 ms
64 bytes from 192.168.101.3: icmp_seq=1179 ttl=64 time=33227.953 ms
64 bytes from 192.168.101.3: icmp_seq=1180 ttl=64 time=32227.091 ms
64 bytes from 192.168.101.3: icmp_seq=1181 ttl=64 time=31226.262 ms
64 bytes from 192.168.101.3: icmp_seq=1182 ttl=64 time=30225.425 ms
64 bytes from 192.168.101.3: icmp_seq=1183 ttl=64 time=29224.597 ms
64 bytes from 192.168.101.3: icmp_seq=1184 ttl=64 time=28223.757 ms
64 bytes from 192.168.101.3: icmp_seq=1185 ttl=64 time=27222.918 ms
64 bytes from 192.168.101.3: icmp_seq=1186 ttl=64 time=26222.086 ms
64 bytes from 192.168.101.3: icmp_seq=1187 ttl=64 time=25221.164 ms
64 bytes from 192.168.101.3: icmp_seq=1188 ttl=64 time=24220.407 ms
64 bytes from 192.168.101.3: icmp_seq=1189 ttl=64 time=23219.575 ms
64 bytes from 192.168.101.3: icmp_seq=1190 ttl=64 time=22218.737 ms
64 bytes from 192.168.101.3: icmp_seq=1191 ttl=64 time=21217.905 ms
64 bytes from 192.168.101.3: icmp_seq=1192 ttl=64 time=20217.066 ms
64 bytes from 192.168.101.3: icmp_seq=1193 ttl=64 time=19216.228 ms
64 bytes from 192.168.101.3: icmp_seq=1194 ttl=64 time=18215.333 ms
64 bytes from 192.168.101.3: icmp_seq=1195 ttl=64 time=17214.503 ms
64 bytes from 192.168.101.3: icmp_seq=1196 ttl=64 time=16213.720 ms
64 bytes from 192.168.101.3: icmp_seq=1197 ttl=64 time=15210.912 ms
64 bytes from 192.168.101.3: icmp_seq=1198 ttl=64 time=14210.044 ms
64 bytes from 192.168.101.3: icmp_seq=1199 ttl=64 time=13209.194 ms
64 bytes from 192.168.101.3: icmp_seq=1200 ttl=64 time=12208.376 ms
64 bytes from 192.168.101.3: icmp_seq=1201 ttl=64 time=11207.536 ms
64 bytes from 192.168.101.3: icmp_seq=1202 ttl=64 time=10206.694 ms
64 bytes from 192.168.101.3: icmp_seq=1203 ttl=64 time=9205.816 ms
64 bytes from 192.168.101.3: icmp_seq=1204 ttl=64 time=8205.014 ms
64 bytes from 192.168.101.3: icmp_seq=1205 ttl=64 time=7204.186 ms
64 bytes from 192.168.101.3: icmp_seq=1206 ttl=64 time=6203.294 ms
64 bytes from 192.168.101.3: icmp_seq=1207 ttl=64 time=5202.510 ms
64 bytes from 192.168.101.3: icmp_seq=1208 ttl=64 time=4201.677 ms
64 bytes from 192.168.101.3: icmp_seq=1209 ttl=64 time=3200.851 ms
64 bytes from 192.168.101.3: icmp_seq=1210 ttl=64 time=2200.013 ms
64 bytes from 192.168.101.3: icmp_seq=1211 ttl=64 time=1199.100 ms
64 bytes from 192.168.101.3: icmp_seq=1212 ttl=64 time=198.331 ms
64 bytes from 192.168.101.3: icmp_seq=1213 ttl=64 time=0.129 ms
64 bytes from 192.168.101.3: icmp_seq=1214 ttl=64 time=58223.470 ms
64 bytes from 192.168.101.3: icmp_seq=1215 ttl=64 time=57222.637 ms
64 bytes from 192.168.101.3: icmp_seq=1216 ttl=64 time=56221.800 ms
64 bytes from 192.168.101.3: icmp_seq=1217 ttl=64 time=55220.960 ms
64 bytes from 192.168.101.3: icmp_seq=1218 ttl=64 time=54220.116 ms
64 bytes from 192.168.101.3: icmp_seq=1219 ttl=64 time=53219.282 ms
64 bytes from 192.168.101.3: icmp_seq=1220 ttl=64 time=52218.444 ms
64 bytes from 192.168.101.3: icmp_seq=1221 ttl=64 time=51217.618 ms
64 bytes from 192.168.101.3: icmp_seq=1222 ttl=64 time=50216.778 ms
64 bytes from 192.168.101.3: icmp_seq=1223 ttl=64 time=49215.932 ms
64 bytes from 192.168.101.3: icmp_seq=1224 ttl=64 time=48215.095 ms
64 bytes from 192.168.101.3: icmp_seq=1225 ttl=64 time=47214.262 ms
64 bytes from 192.168.101.3: icmp_seq=1226

CPU time accounting broken on 8-STABLE machine after a few hours of uptime

2010-09-27 Thread Don Lewis
CPU time accounting is broken on one of my machines running 8-STABLE.  I
ran a test with a simple program that just loops and consumes CPU time:

% time ./a.out
94.544u 0.000s 19:14.10 8.1%62+2054k 0+0io 0pf+0w

The display in top shows the process with WCPU at 100%, but TIME
increments very slowly.
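
(The looping test program itself isn't included here; it was just a small
compiled busy-loop.  As a rough stand-in, any pure CPU burner shows the
same effect when run under time(1), e.g.:

% time sh -c 'i=0; while [ $i -lt 50000000 ]; do i=$((i+1)); done'

On a healthy system the reported user time should track the elapsed
wall-clock time when the process has the CPU to itself.)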

Several hours after booting, I got a bunch of calcru: runtime went
backwards messages, but they stopped right away and never appeared
again.

Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1)
Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001
Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration
Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration
Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s
[snip]
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 
usec to 5425839798 usec for pid 1526 (csh)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec 
to 2403 usec for pid 1519 (csh)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec 
to 2594 usec for pid 1494 (hald-addon-mouse-sy)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec 
to 3734 usec for pid 1488 (console-kit-daemon)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 
459 usec for pid 1480 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 
450 usec for pid 1479 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 
449 usec for pid 1478 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 
447 usec for pid 1477 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 
450 usec for pid 1476 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 
458 usec for pid 1475 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1026 usec 
to 482 usec for pid 1474 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec 
to 626 usec for pid 1473 (getty)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec 
to 1160 usec for pid 1440 (inetd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 
690 usec for pid 1402 (sshd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec 
to 56770 usec for pid 1360 (cupsd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec 
to 2914 usec for pid 1289 (dbus-daemon)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 
84 usec for pid 1265 (moused)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec 
to 10407 usec for pid 1041 (nfsd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec 
to 607 usec for pid 1032 (mountd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec 
to 4134 usec for pid 664 (devd)
Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 
9 usec for pid 9 (sctp_iterator)


If I reboot and run the test again, the CPU time accounting seems to be
working correctly.
% time ./a.out
1144.226u 0.000s 19:06.62 99.7% 5+168k 0+0io 0pf+0w


I'm not sure how long this problem has been present. I do remember
seeing the calcru messages with an August 23rd kernel.  I have not seen
the calcru messages when running -CURRENT on the same hardware.  I also
have not seen this same problem on my other Athlon 64 box running the
August 23rd kernel.

Before reboot:
# sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-100)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 4294967295
kern.timecounter.tc.i8254.counter: 3534
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 8685335
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2204228369
kern.timecounter.tc.TSC.frequency: 2500018183
kern.timecounter.tc.TSC.quality: 800
kern.timecounter.invariant_tsc: 0

After reboot:
% sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-100)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 4294967295
kern.timecounter.tc.i8254.counter: 2241
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 4636239
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 

Re: Extending your zfs pool with multiple devices

2010-09-03 Thread Don Lewis
On  2 Sep, Jeremy Chadwick wrote:
 On Thu, Sep 02, 2010 at 04:56:04PM -0400, Zaphod Beeblebrox wrote:
 [regarding getting more disks in a machine]

 An inexpensive option is SATA port replicators.  Think SATA switch or
 hub.  1:4 is common and cheap.
 
 I have a motherboard with intel ICH10 chipset.  It commonly provides 6
 ports.  This chipset is happy to configure port replicators.  Meaning
 you can put 24 drives on this motherboard.

 ...

 With 1.5T disks, I find that the 4 to 1 multipliers have a small
 effect on speed.  The 4 drives I have on the multiplier are saturated
 at 100% a little bit more than the drives directly connected.
 Essentially you have 3 gigabit for 4 drives instead of 3 gigabit for 1
 drive.
 
 1:4 SATA replicators impose a bottleneck on the overall bandwidth
 available between the replicator and the disks attached, as you stated.
 Diagram:
 
 ICH10
   |||___ (SATA300) Port 0, Disk 0
   ||____ (SATA300) Port 1, Disk 1
   |_____ (SATA300) Port 2, eSATA Replicator
                      ||||___ (SATA300) Port 0, Disk 2
                      |||____ (SATA300) Port 1, Disk 3
                      ||_____ (SATA300) Port 2, Disk 4
                      |______ (SATA300) Port 3, Disk 5
 
 If Disks 2 through 5 are decent disks (pushing 100MB/sec), essentially
 you have 100*4 = 400MB/sec worth of bandwidth being shoved across a
 300MB/sec link.  That's making the assumption the disks attached are
 magnetic and not SSD, and not taking into consideration protocol
 overhead.
 
 Given the evolutionary rate of hard disks and SSDs, replicators are (in
 my opinion) not a viable solution mid or long-term.

 A better choice is a SATA multilane HBA, which are usually PCIe-based
 with a single connector on the back of the HBA which splits out to
 multiple disks (usually 4, but sometimes more).
 
 An ideal choice is an Areca ARC-1300 series SAS-based PCIe x4 multilane
 adapter, which provides SATA300 to each individual disk and uses PCIe
 x4 (which can handle about 1GByte/sec in each direction, so 2GByte/sec
 total)...
 
 http://www.areca.com.tw/products/sasnoneraid.htm
 
 ...but there doesn't appear to be driver support for FreeBSD for this
 series of controller (arcmsr(4) doesn't mention the ARC-1300 series).  I
 also don't know what Areca means on their site when they say
 BSD/FreeBSD (will be available with 6Gb/s Host Adapter), given that
 none of the ARC-1300 series cards are SATA600.
 
 If people are more focused on total number of devices (disks) that are
 available, then they should probably be looking at dropping a pretty
 penny on a low-end filer.  Otherwise, consider replacing the actual hard
 disks themselves with drives of a higher capacity.

[raises hand]

Here's what I've got on my mythtv box (running Fedora ... sorry):

FilesystemSize  
/dev/sda4 439G  
/dev/sdb1 1.9T  
/dev/sdc1 1.9T  
/dev/sdd1 1.9T  
/dev/sde1 1.9T  
/dev/sdf1 1.4T  
/dev/sdg1 1.4T  
/dev/sdh1 932G  
/dev/sdi1 932G  
/dev/sdj1 1.4T  
/dev/sdk1 1.9T  
/dev/sdl1 932G  
/dev/sdm1 1.9T  
/dev/sdn1 932G  
/dev/sdo1 699G  
/dev/sdp1 1.4T  

I'm currently upgrading the older drives as I run out of space, and I'm
really hoping that >2TB drives arrive soon.  The motherboard is
full-size ATX with six onboard SATA ports, all of which are in use.  The
only x16 PCIe slot is occupied by a graphics card, and all but one of
the x1 PCIe slots are in use.  One of the x1 PCIe slots has a Silicon
Image two-port ESATA controller, which connects to two external
enclosures with 1:4 and 1:5 port replicators.  At the moment there are
also three external USB drives.  This weekend's project is to install a
new 2TB drive and do some consolidation.

Fortunately the bandwidth requirements aren't too high ...

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: cname replace in mail address? [off-topic] (Re: Attn Ronald Klop)

2010-08-27 Thread Don Lewis
On 27 Aug, Ronald Klop wrote:
 Mandatory? I'm googling, but can't find a document that declares it  
 mandatory and only sendmail seems to do it.
 I think it is lame to use DNS info to rewrite e-mail addresses, but the  
 person who made it 'mandatory' will have good reasons for it.
 
 Does somebody have a pointer to the specs about this?

http://www.rfc-editor.org/rfc/rfc1123.txt

  5.2.2  Canonicalization: RFC-821 Section 3.1

 The domain names that a Sender-SMTP sends in MAIL and RCPT
 commands MUST have been  canonicalized, i.e., they must be
 fully-qualified principal names or domain literals, not
 nicknames or domain abbreviations.  A canonicalized name either
 identifies a host directly or is an MX name; it cannot be a
 CNAME.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: any hope for nfe/msk?

2007-11-21 Thread Don Lewis
On 21 Nov, Chris wrote:
 On 07/11/2007, Pyun YongHyeon [EMAIL PROTECTED] wrote:
 On Wed, Nov 07, 2007 at 02:28:00PM +0200, Oleg Lomaka wrote:
   Hello,
  
   Pyun YongHyeon wrote:
   On Thu, Nov 01, 2007 at 10:59:48AM +0200, Oleg Lomaka wrote:
 Hello,

 Pyun YongHyeon wrote:
 On Tue, Oct 30, 2007 at 04:01:04PM +0200, Oleg Lomaka wrote:
 
 [...]
 
   I had RxFIFO overrun again :(
   from dmest:
   msk0: Rx FIFO overrun!
 
 [...]
 
 Please try attached patch again. Sorry for the trouble.
 After applying the patch show me verbosed dmesg output related with
 msk(4)/PHY driver.
 
 Thanks for testing.
 
 pcib1: MPTable PCI-PCI bridge irq 16 at device 28.0 on pci0
 pcib1:   domain0
 pcib1:   secondary bus 2
 pcib1:   subordinate bus   2
 pcib1:   I/O decode0x2000-0x2fff
 pcib1:   memory decode 0xd010-0xd01f
 pcib1:   no prefetched decode
 pci2: PCI bus on pcib1
 pci2: domain=0, physical bus=2
 found- vendor=0x11ab, dev=0x4352, revid=0x14
domain=0, bus=2, slot=0, func=0
class=02-00-00, hdrtype=0x00, mfdev=0
cmdreg=0x0007, statreg=0x4010, cachelnsz=16 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
intpin=a, irq=11
powerspec 2  supports D0 D1 D2 D3  current D0
MSI supports 2 messages, 64 bit
map[10]: type Memory, range 64, base 0xd010, size 14, 
 enabled
 pcib1: requested memory range 0xd010-0xd0103fff: good
map[18]: type I/O Port, range 32, base 0x2000, size  8, enabled
 pcib1: requested I/O range 0x2000-0x20ff: in range
 pcib1: slot 0 INTA routed to irq 16
 mskc0: Marvell Yukon 88E8038 Gigabit Ethernet port 0x2000-0x20ff mem
 0xd010-0xd0103fff irq 16 at device 0.0 on pci2
 mskc0: Reserved 0x4000 bytes for rid 0x10 type 3 at 0xd010
 mskc0: MSI count : 2
 mskc0: RAM buffer size : 4KB
 mskc0: Port 0 : Rx Queue 2KB(0x:0x07ff)
 mskc0: Port 0 : Tx Queue 2KB(0x0800:0x0fff)
 msk0: Marvell Technology Group Ltd. Yukon FE Id 0xb7 Rev 0x01 on 
 mskc0
 msk0: bpf attached
 msk0: Ethernet address: 00:1b:24:0e:bc:26
 miibus0: MII bus on msk0
 e1000phy0: Marvell 88E3082 10/100 Fast Ethernet PHY PHY 0 on miibus0
 e1000phy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
 ioapic0: routing intpin 16 (PCI IRQ 16) to vector 49
 mskc0: [MPSAFE]
 mskc0: [FILTER]

   
   So far all looks good to me. If you encounter watchdog timeouts
   or Rx FIFO overruns let me know.
   
   
  
   Got it again:
   msk0: Rx FIFO overrun!
   I believe this is happening under heavy CPU usage. Now i have firefox
   compiling and watched pictures on remote windows box using rdesktop. And
   after few minutes got network freeze.

 If it only happens under heavy system loads it's probably normal. If the
 system is too busy to serve other jobs the msk(4) may not receive
 more packets because its receive buffer was full. Probably msk(4)
 should just count the overrun errors without printing the message
 such that it would save more CPU cycles.
 Btw, did you also see watchdog timeout errors?

   But it looks i didn't get any packet lost :). Take a look at ping
   statistics... funny...

 I guess something is wrong here. Latency is unacceptable. However
 I have no idea why the ICMP echo response takes such a long time. Are you
 using any power saving mechanism (powerd, cpufreq, etc.)?

   tdevil% ping 10.1.1.254
   PING 10.1.1.254 (10.1.1.254): 56 data bytes
   64 bytes from 10.1.1.254: icmp_seq=0 ttl=64 time=35926.404 ms
   64 bytes from 10.1.1.254: icmp_seq=1 ttl=64 time=34925.694 ms
   64 bytes from 10.1.1.254: icmp_seq=2 ttl=64 time=33924.729 ms
   64 bytes from 10.1.1.254: icmp_seq=3 ttl=64 time=32923.814 ms
   64 bytes from 10.1.1.254: icmp_seq=4 ttl=64 time=31922.833 ms
   64 bytes from 10.1.1.254: icmp_seq=5 ttl=64 time=30921.878 ms
   64 bytes from 10.1.1.254: icmp_seq=6 ttl=64 time=29920.923 ms
   64 bytes from 10.1.1.254: icmp_seq=7 ttl=64 time=28919.960 ms
   64 bytes from 10.1.1.254: icmp_seq=8 ttl=64 time=27919.009 ms
   64 bytes from 10.1.1.254: icmp_seq=9 ttl=64 time=26918.042 ms
   64 bytes from 10.1.1.254: icmp_seq=10 ttl=64 time=25917.078 ms
   64 bytes from 10.1.1.254: icmp_seq=11 ttl=64 time=24916.115 ms
   64 bytes from 10.1.1.254: icmp_seq=12 ttl=64 time=23915.144 ms
   64 bytes from 10.1.1.254: icmp_seq=13 ttl=64 time=22914.192 ms
   64 bytes from 10.1.1.254: icmp_seq=14 ttl=64 time=21913.214 ms
   64 bytes from 10.1.1.254: icmp_seq=15 ttl=64 time=20912.278 ms
   64 bytes from 10.1.1.254: icmp_seq=16 ttl=64 time=19911.330 ms
   64 bytes from 10.1.1.254: icmp_seq=17 ttl=64 time=18910.375 ms
   64 bytes from 10.1.1.254: icmp_seq=18 ttl=64 time=17909.419 ms
   64 bytes from 10.1.1.254: icmp_seq=19 ttl=64 time=16853.821 ms
   64 bytes from 10.1.1.254: 

Re: Call for testing: patch that helps Wine on 6.x

2007-08-06 Thread Don Lewis
On  6 Aug, John Baldwin wrote:
 On Friday 03 August 2007 10:56:48 pm Marc G. Fournier wrote:

 John, I've been running both the signal and pfault patches on my 6.x 
 desktops
 since Tijl posted them, and haven't noticed any issues resulting from 
 them ...
 
 Does cvsup work?  A similar patch broke cvsup on HEAD.

cvsup only broke if it was built with the old and unsupported lang/pm3*
ports.  There is no problem if it was built with lang/ezm3, which is the
dependency listed in the cvsup Makefile.




Re: removing external usb hdd without unmounting causes reboot?

2007-07-18 Thread Don Lewis
On 18 Jul, Momchil Ivanov wrote:

 If the problem is in general with a file system, regardless of the provider, 
 then what does one do when a mounted smbfs becomes unavailable due to remote 
 host down, no route to host or some other network related problems? Same 
 question for NFS mounted filesystems?

In the case of NFS, nothing happens if the filesystem is idle.  If the
filesystem is active, any pending operations are retried indefinitely by
periodically resending the I/O requests if the file system is hard
mounted.  If the filesystem is soft mounted, then the I/O requests are
eventually timed out with the appropriate error status returned to the
process on the client.
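
As a rough userland illustration (the helper below is made up for this
example, not taken from any FreeBSD source): on a soft mount a stalled
operation eventually comes back with an error such as ETIMEDOUT or EIO,
while on a hard mount the same call simply blocks until the server
responds.

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	ssize_t
	nfs_write_report(int fd, const void *buf, size_t len)
	{
		ssize_t n;

		n = write(fd, buf, len);
		if (n == -1 && (errno == ETIMEDOUT || errno == EIO))
			/* Only a soft mount gives up like this; on a hard
			 * mount the write() above would keep blocking. */
			fprintf(stderr, "soft mount timed out: %s\n",
			    strerror(errno));
		return (n);
	}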

An important difference between NFS and UFS is that a loss of network
connectivity (or a clean server reboot) can't cause any filesystem
inconsistencies in the NFS case because complex filesystem operations
that require multiple disk operations are treated as atomic operations
between the client and server.  For example, creating a new directory
requires a number of physical disk writes in the UFS case, and
unplugging the disk in the middle would result in an inconsistent
filesystem state.  In the NFS case, creating a new directory
requires only one NFS operation over the wire, and the client is allowed
to keep retrying the operation until it receives a status response from
the server.  Retries might be necessary if either the request or the
response packet was dropped by the network, the server crashed, etc.



Re: [5.4] mode bits changed by close(2)

2006-01-28 Thread Don Lewis
On 28 Jan, David Malone wrote:
 On Fri, Jan 27, 2006 at 02:01:19PM -0700, [EMAIL PROTECTED] wrote:
 Sticking an fsync() in between the fchmod() and the close() causes the
 bits to be cleared as a side-effect of the fsync().  Doing another
 fchmod() after the fsync() produces the final expected set{u,g}id
 results even after the close.  Unfortunately, fsync() is a rather
 expensive operation.
 
 There is code to clear the suid bits on a file when it is written
 to, and I guess this is being triggered when the write is flushed
 rather than when the write call is made. This would explain why
 flushing before the fsync stops the problem.

The last partial write is probably being cached by the client and not
being flushed to the server until the client calls close().  The server
is seeing the NFS write after it set the set{u,g}id bit(s) and is
clearing them in response to the write.

The sequence
write()
fsync()
fchmod()
close()
should work.

This oddity could be hidden from userland if fchmod() sync'ed the file
before setting the bits, but that doesn't change the performance impact.
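
A minimal sketch of that sequence from the client side (the function name
is invented for the example; error handling kept simple):

	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int
	write_then_chmod(int fd, const void *buf, size_t len, mode_t mode)
	{
		if (write(fd, buf, len) == -1)
			return (-1);
		if (fsync(fd) == -1)		/* push the cached writes to the server */
			return (-1);
		if (fchmod(fd, mode) == -1)	/* now set the set[ug]id bits */
			return (-1);
		return (close(fd));
	}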



Re: Recurring problem: processes block accessing UFS file system

2006-01-12 Thread Don Lewis
On 11 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
   Yes, I'm use a lots of nullfs. This is a host system for about 20
   jails with nullfs mounted ro system:
 
  Don That would be my guess as to the cause of the problem. Hopefully
  Don DEBUG_VFS_LOCKS will help pinpoint the bug.
 
 I've got the problem again. Now I have debug kernel and crash
 dump. That is an output from the kdb. Do you need any additional
 information which I can get from the crash dump?


 0xc6bed3fc: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc84fce0c ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc69d1d80 (pid 546) with 2 
 pending#0 0xc04cd7ad at lockmgr+0x3c6
 #1 0xc053d780 at vop_stdlock+0x32
 #2 0xc0655075 at VOP_LOCK_APV+0x3b
 #3 0xc05db299 at ffs_lock+0x19
 #4 0xc0655075 at VOP_LOCK_APV+0x3b
 #5 0xc0557513 at vn_lock+0x6c
 #6 0xc054a320 at vget+0x87
 #7 0xc053a1f0 at cache_lookup+0xde
 #8 0xc053ac93 at vfs_cache_lookup+0xad
 #9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
 #10 0xc053f786 at lookup+0x419
 #11 0xc0540262 at namei+0x2e7
 #12 0xc0558d37 at vn_open_cred+0x1c2
 #13 0xc055918a at vn_open+0x33
 #14 0xc054e6d3 at kern_open+0xd2
 #15 0xc054f055 at open+0x36
 #16 0xc0648a47 at syscall+0x23a
 #17 0xc06326ff at Xint0x80_syscall+0x1f
 
   ino 2072795, on dev ad4s1g
 
 0xc6c07bf4: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc940a744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc9146c00 (pid 33016) with 1 
 pending#0 0xc04cd7ad at lockmgr+0x3c6
 #1 0xc053d780 at vop_stdlock+0x32
 #2 0xc0655075 at VOP_LOCK_APV+0x3b
 #3 0xc05db299 at ffs_lock+0x19
 #4 0xc0655075 at VOP_LOCK_APV+0x3b
 #5 0xc0557513 at vn_lock+0x6c
 #6 0xc054a320 at vget+0x87
 #7 0xc053a1f0 at cache_lookup+0xde
 #8 0xc053ac93 at vfs_cache_lookup+0xad
 #9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
 #10 0xc053f786 at lookup+0x419
 #11 0xc0540262 at namei+0x2e7
 #12 0xc0554134 at kern_rmdir+0x98
 #13 0xc055436e at rmdir+0x22
 #14 0xc0648a47 at syscall+0x23a
 #15 0xc06326ff at Xint0x80_syscall+0x1f
 
   ino 2072767, on dev ad4s1g
 

 db alltrace

 Tracing command parser3.cgi pid 33016 tid 100652 td 0xc9146c00
 sched_switch(c9146c00,0,1,5e57bc72,aaef377f) at sched_switch+0xd8
 mi_switch(1,0,0,e93d86c4,c04f425b) at mi_switch+0x150
 sleepq_switch(e93d86f8,c04e4cc6,c6bed454,50,c0672a31) at sleepq_switch+0x115
 sleepq_wait(c6bed454,50,c0672a31,0,c602a080) at sleepq_wait+0xb
 msleep(c6bed454,c06b9c98,50,c0672a31,0) at msleep+0x454
 acquire(6,da1dcb4c,c0504340,da1dcb4c,4c) at acquire+0x7a
 lockmgr(c6bed454,2002,c6bed4c4,c9146c00,e93d87e0) at lockmgr+0x4ce
 vop_stdlock(e93d8838,120,c06a4f20,e93d8838,e93d87f0) at vop_stdlock+0x32
 VOP_LOCK_APV(c06a55c0,e93d8838,e93d8808,c0655075,e93d8838) at 
 VOP_LOCK_APV+0x3b
 ffs_lock(e93d8838,f,2,c6bed3fc,e93d8854) at ffs_lock+0x19
 VOP_LOCK_APV(c06a4f20,e93d8838,c0653a6a,c053acd2,c0652eaf) at 
 VOP_LOCK_APV+0x3b
 vn_lock(c6bed3fc,2002,c9146c00,c06326ff,6) at vn_lock+0x6c
 vget(c6bed3fc,2002,c9146c00,c0653a6a,c053acd2) at vget+0x87
 vfs_hash_get(c6268c00,1fa0db,2,c9146c00,e93d89bc) at vfs_hash_get+0xf5
 ffs_vget(c6268c00,1fa0db,2,e93d89bc,e93d89c0) at ffs_vget+0x49
 ufs_lookup(e93d8a6c,c0685ac5,c6c07bf4,e93d8c34,e93d8aa8) at ufs_lookup+0x965
 VOP_CACHEDLOOKUP_APV(c06a4f20,e93d8a6c,e93d8c34,c9146c00,c7fac400) at 
 VOP_CACHEDLOOKUP_APV+0x59
 vfs_cache_lookup(e93d8b14,e93d8ac0,c6c07bf4,e93d8c34,e93d8b30) at 
 vfs_cache_lookup+0xec
 VOP_LOOKUP_APV(c06a4f20,e93d8b14,c9146c00,c065510a,c054a6f6) at 
 VOP_LOOKUP_APV+0x3b
 lookup(e93d8c0c,c954a800,400,e93d8c28,0) at lookup+0x419
 namei(e93d8c0c,ffdf,2,0,c6268c00) at namei+0x2e7
 kern_rmdir(c9146c00,824b500,0,e93d8d30,c0648a47) at kern_rmdir+0x98
 rmdir(c9146c00,e93d8d04,4,d,e93d8d38) at rmdir+0x22
 syscall(804003b,e50003b,bfbf003b,8310560,8310560) at syscall+0x23a
 Xint0x80_syscall() at Xint0x80_syscall+0x1f
 --- syscall (137, FreeBSD ELF32, rmdir), eip = 0xe8201bf, esp = 0xbfbfbcfc, 
 ebp = 0xbfbfbd18 ---

 Tracing command nginx pid 546 tid 100122 td 0xc69d1d80
 sched_switch(c69d1d80,0,1,f002cf9e,6c43d35e) at sched_switch+0xd8
 mi_switch(1,0,c0502e44,e8a73670,c04f425b) at mi_switch+0x150
 sleepq_switch(e8a736a4,c04e4cc6,c6c07c4c,50,c0672a31) at sleepq_switch+0x115
 sleepq_wait(c6c07c4c,50,c0672a31,0,c044b82a) at sleepq_wait+0xb
 msleep(c6c07c4c,c06ba45c,50,c0672a31,0) at msleep+0x454
 acquire(6,c0655075,c05db299,c0655075,c0557513) at acquire+0x7a
 lockmgr(c6c07c4c,2002,c6c07cbc,c69d1d80,e8a7378c) at lockmgr+0x4ce
 vop_stdlock(e8a737e4,c06a4f20,c06a4f20,e8a737e4,e8a7379c) at vop_stdlock+0x32
 VOP_LOCK_APV(c06a55c0,e8a737e4,e8a737b4,c0655075,e8a737e4) at 
 VOP_LOCK_APV+0x3b
 ffs_lock(e8a737e4,c0673faf,2,c6c07bf4,e8a73800) at ffs_lock+0x19
 VOP_LOCK_APV(c06a4f20,e8a737e4,e8a73814,c0504340,e8a73814) at 
 VOP_LOCK_APV+0x3b

Re: Recurring problem: processes block accessing UFS file system

2006-01-12 Thread Don Lewis
On 12 Jan, Don Lewis wrote:
 On 11 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
   Yes, I'm use a lots of nullfs. This is a host system for about 20
   jails with nullfs mounted ro system:
 
  Don That would be my guess as to the cause of the problem. Hopefully
  Don DEBUG_VFS_LOCKS will help pinpoint the bug.
 
 I've got the problem again. Now I have debug kernel and crash
 dump. That is an output from the kdb. Do you need any additional
 information which I can get from the crash dump?

 Process 33016 is executing rmdir().  While doing the lookup, it is
 holding a lock on vnode 0xc6c07bf4 and attempting to lock vnode
 c6bed3fc.  Vnode 0xc6c07bf4 should be the parent directory of c6bed3fc.
 
 Process 546 is executing open().  While doing the lookup, it is holding
 a lock on vnode 0xc6bed3fc while attempting to lock vnode c6c07bf4.
 Vnode 0xc6bed3fc should be the parent directory of c6c07bf4, but this is
 inconsistent with the previous paragraph.
 
 This situation should not be possible.  Using kgdb on your saved crash
 dump, print fmode and *ndp in the vn_open_cred() stack frame of
 process 546, and *nd in the kern_rmdir() stack frame of process 33016.
 The path names being looked up may be helpful.
 
 Are there any symbolic links in the path names?  If so, what are the
 link contents?
 
 Are either of these processes jailed?  If so, same or different jails?
 
 What are inodes 2072767 and 2072795 on ad4s1g?
 
 Are you using snapshots?

I just thought of another possible cause for this problem.  Is it
possible that you have any hard links to directories in the file system
on ad4s1g?  That could put a loop in the directory tree and mess up the
normal parent-child relationship that we rely on to avoid deadlocks.
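
A toy illustration of the ordering rule, with pthread mutexes standing in
for the vnode locks (made up for illustration, not kernel code): two
threads that take "parent" and "child" in opposite orders can deadlock,
which is exactly the situation a loop in the directory tree would allow.

	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t parent = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t child = PTHREAD_MUTEX_INITIALIZER;

	static void *
	lookup_order(void *arg)
	{
		(void)arg;
		/* The order lookup() relies on: parent directory, then child. */
		pthread_mutex_lock(&parent);
		pthread_mutex_lock(&child);
		pthread_mutex_unlock(&child);
		pthread_mutex_unlock(&parent);
		return (NULL);
	}

	static void *
	reversed_order(void *arg)
	{
		(void)arg;
		/* The reverse order; run concurrently, the two threads can
		 * each end up holding one lock and waiting for the other. */
		pthread_mutex_lock(&child);
		pthread_mutex_lock(&parent);
		pthread_mutex_unlock(&parent);
		pthread_mutex_unlock(&child);
		return (NULL);
	}

	int
	main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, lookup_order, NULL);
		pthread_create(&t2, NULL, reversed_order, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		printf("no deadlock on this run\n");
		return (0);
	}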



Re: Swapfile problem in 6?

2006-01-06 Thread Don Lewis
On  2 Jan, Lars Kristiansen wrote:
 Attempting to catch up with my backlog of unread email, only 12K unread
 messages to go ...

 On 24 Nov, Rob wrote:

 I have cvsup'ed the sources to STABLE as of Nov. 23rd
 2005.
 After recompiling/installing world and debug-kernel,
 I again get a kernel deadlock when using swapfile:
http://surfion.snu.ac.kr/~lahaye/swapfile2.txt

 Previous deadlocks are still documented here
http://surfion.snu.ac.kr/~lahaye/swapfile.txt

 I hope this is of use for fixing this bug in 6.
 If further investigation is needed, then please let me
 know.

 This is a deadlock caused by memory exhaustion.  The pagedaemon only has
 a limited number of bufs that it uses for writing dirty pages to swap to
 prevent it from saturating the I/O subsystem with large numbers of
 writes.  In this case, pagedaemon is trying to free up memory by writing
 dirty pages, and it has used up all of its bufs and is waiting for the
 write requests to complete and the bufs to be returned to it.
 This isn't happening because md0 is stuck waiting for memory.  This is a
 little bit surprising to me because it looks like writes to vnode backed
 devices are done synchronously by default.

 If you have a chance to test this again, a stack trace of md0 in the
 deadlock state would be interesting.  I'd like to know where md0 is
 getting stuck.

 I wonder if pagedaemon should scan ahead and more aggressively discard
 clean pages when it has run out of bufs to write dirty pages, especially
 in low memory situations.  Preventing the creation of more dirty pages
 would be nice, but I don't know how to do that ...
 
 Just in case it can help. Do not have this machine available for testing
 at the moment but this is the last debuginfo I did get from it.
 Here is a trace from a situation when a possible idle system got stuck
 during the night and db showed only one locked vnode:
 
 db show lockedvnods
 Locked vnodes
 
 0xc1309330: tag ufs, type VREG
 usecount 1, writecount 1, refcount 154 mountedhere 0
 flags ()
 v_object 0xc12cb39c ref 0 pages 606
  lock type ufs: EXCL (count 1) by thread 0xc126b900 (pid 178)
 ino 8155, on dev ad0s1f
 db trace 178
 Tracing pid 178 tid 100058 td 0xc126b900
 sched_switch(c126b900,0,1) at 0xc066a4db = sched_switch+0x17b
 mi_switch(1,0) at 0xc065f49e = mi_switch+0x27e
 sleepq_switch(c09b2a98,c484bacc,c065f0e3,c09b2a98,0) at 0xc0677f00 =
 sleepq_switch+0xe0
 sleepq_wait(c09b2a98,0,0,c08ad92d,37b) at 0xc0678100 = sleepq_wait+0x30
 msleep(c09b2a98,c09b2d00,244,c08adb6a,0) at 0xc065f0e3 = msleep+0x333
 vm_wait(c12cb39c,0,c08990f3,ad7,c06512a4) at 0xc07c6a71 = vm_wait+0x91
 allocbuf(c28fa9d8,4000,354000,0,354000) at 0xc06a2f89 = allocbuf+0x4e9
 getblk(c1309330,d5,0,4000,0) at 0xc06a29cb = getblk+0x4eb
 cluster_read(c1309330,1000,0,d5,0) at 0xc06a5d65 = cluster_read+0xe5
 ffs_read(c484bc9c) at 0xc07a631f = ffs_read+0x28f
 VOP_READ_APV(c09309a0,c484bc9c) at 0xc0838aab = VOP_READ_APV+0x7b
 mdstart_vnode(c1310800,c1634294,c1310820,1,c0566e10) at 0xc056688c =
 mdstart_vnode+0xec
 md_kthread(c1310800,c484bd38,c1310800,c0566e10,0) at 0xc0566f7f =
 md_kthread+0x16f
 fork_exit(c0566e10,c1310800,c484bd38) at 0xc0645618 = fork_exit+0xa8
 fork_trampoline() at 0xc0816f3c = fork_trampoline+0x8
 --- trap 0x1, eip = 0, esp = 0xc484bd6c, ebp = 0 ---

The md thread is stuck waiting for memory to be freed by pagedaemon.
Pagedaemon is stuck waiting for at least one of its pageout requests to
complete.  The pageout requests are probably all stuck waiting for md.

I had expected that the problem was that while pagedaemon is allowed to
dig deeper into the free page pool, the md thread is not, allowing the
md thread to get wedged first.  That
does not appear to be the case because the vm_page_alloc() call in
allocbuf() has the VM_ALLOC_SYSTEM flag set, which should match
vm_page_alloc()'s treatment of requests by pagedaemon.

I don't see how the md thread could be consuming a large number of
reserved pages, but it looks like that must be what is happening.
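
For reference, the reserve check being described boils down to something
like this (a simplified sketch with made-up flag names, not the actual
vm_page_alloc() code):

	#define ALLOC_SYSTEM	0x01	/* stand-ins for the VM_ALLOC_* flags */
	#define ALLOC_INTERRUPT	0x02

	/* Return non-zero if a request with the given flags may take a page
	 * now, zero if the caller has to sleep in vm_wait() instead. */
	static int
	may_allocate(int free_count, int free_reserved, int intr_free_min,
	    int flags)
	{
		if (free_count > free_reserved)
			return (1);		/* anyone may allocate */
		if ((flags & ALLOC_SYSTEM) && free_count > intr_free_min)
			return (1);		/* system requests dig deeper */
		if ((flags & ALLOC_INTERRUPT) && free_count > 0)
			return (1);		/* interrupt requests, deeper still */
		return (0);
	}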





Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Greg == Greg Rivers [EMAIL PROTECTED] writes:
 
  Greg It's taken more than a month, but the problem has recurred
  Greg without snapshots ever having been run.  I've got a good trace
 
 I think that I have the same problem on a fresh CURRENT. For some
 processes I see MWCHAN = ufs and D in the STAT. And I can't kill
 such processes even with -9. And system can't kill them too on
 shutdown. So, system can't do shutdown and wait forever after All
 buffers synced message. At this moment I've entered to KDB do show
 lockedvnods:
 
 Locked vnodes
 
 0xc687cb58: tag ufs, type VDIR
 usecount 1, writecount 0, refcount 2 mountedhere 0
 flags ()
 v_object 0xcb5b1934 ref 0 pages 0
  lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
 pending
 ino 2072602, on dev ad4s1g
 
 0xc687ca50: tag ufs, type VDIR
 usecount 31, writecount 0, refcount 32 mountedhere 0
 flags ()
 v_object 0xc85d2744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
 pending
 ino 2072603, on dev ad4s1g
 
 0xc687c948: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc875d000 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
 pending
 ino 2072615, on dev ad4s1g
 
 0xc691f420: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc8a773e0 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 
 pending
 ino 2072680, on dev ad4s1g
 
 0xc691f318: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc8a7b2e8 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7019780 (pid 74103) with 2 
 pending
 ino 2072795, on dev ad4s1g
 
 0xc69bb528: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc7890744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f4600 (pid 74129) with 1 
 pending
 ino 2072767, on dev ad4s1g
 Locked vnodes
 
 0xc687cb58: tag ufs, type VDIR
 usecount 1, writecount 0, refcount 2 mountedhere 0
 flags ()
 v_object 0xcb5b1934 ref 0 pages 0
  lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
 pending
 ino 2072602, on dev ad4s1g
 
 0xc687ca50: tag ufs, type VDIR
 usecount 31, writecount 0, refcount 32 mountedhere 0
 flags ()
 v_object 0xc85d2744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
 pending
 ino 2072603, on dev ad4s1g
 
 0xc687c948: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc875d000 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
 pending
 ino 2072615, on dev ad4s1g
 
 0xc691f420: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc8a773e0 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 
 pending
 ino 2072680, on dev ad4s1g
 
 0xc691f318: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc8a7b2e8 ref 0 pages 1
  lock type ufs: EXCL (count 1) by t(kgdb) 
 
 After that I've done call doadump and got vmcore.
 
 ps show me:
 
 (kgdb) ps
 During symbol reading, Incomplete CFI data; unspecified registers at 
 0xc04d97eb.
   pidproc   uid  ppid  pgrp   flag stat comm wchan
 74686 c94640000 1 1  00  1  sh   ufs c687caa8
 74195 c970d0000  3074 74195  4000100  1  sshd ufs c687caa8
 74178 c7682adc0  3074 74178  004000  1  sshd ufs c687c9a0
 74129 c9b82adc 1008 1  5504  004000  1  parser3.cgi   ufs c691f370
 74103 c70b5458 1008 1  5504  00  1  httpdufs c69bb580
 65610 c92c0458 1005 1 65610  004000  1  sftp-server   ufs c691f478
  5518 c6247458 1008 1  5516  004002  1  perl5.8.7ufs c687caa8
  3081 c7523d080 1  3081  00  1  cron ufs c687caa8
  3074 c7682d080 1  3074  000100  1  sshd ufs c687caa8
  3016 c7523adc0 1  3016  00  1  syslogd  ufs c687caa8
   519 c68e4d08   80 1   518  000100  1  nginxufs c691f370
34 c6260 0 0  000204  1  schedcpu - e88b3cf0
33 c62438b00 0 0  000204  1  syncer   ktsusp c6243938
32 c6243adc0 0 0  000204  1  vnlruktsusp c6243b64
31 c6243d080 0 0  000204  1  bufdaemonktsusp c6243d90
30 c62440000 0 0  00020c  1  pagezero pgzero c06c21a0
29 c624422c0 0 0  000204  1  vmdaemon psleep c06c1d08
28 c62444580 0 0  000204  1  pagedaemon   psleep c06c1cc8
27 c602e6840   

WITNESS speedup patch for RELENG_5

2006-01-05 Thread Don Lewis
If you are running RELENG_5 and using WITNESS, you might want to try the
patch below.  It speeds up WITNESS rather dramatically.  This patch was
committed to HEAD in late August (subr_witness.c 1.198) and early
September (subr_witness.c 1.200).  It was MFC'ed to RELENG_6 in the last
few days.  I'd like to MFC it to RELENG_5, but I think it should get a
bit more exposure before I do.

Index: sys/kern/subr_witness.c
===
RCS file: /home/ncvs/src/sys/kern/subr_witness.c,v
retrieving revision 1.178.2.8
diff -u -r1.178.2.8 subr_witness.c
--- sys/kern/subr_witness.c 4 May 2005 19:26:30 -   1.178.2.8
+++ sys/kern/subr_witness.c 12 Sep 2005 04:52:53 -
@@ -165,16 +165,9 @@
 static int isitmychild(struct witness *parent, struct witness *child);
 static int isitmydescendant(struct witness *parent, struct witness *child);
 static int itismychild(struct witness *parent, struct witness *child);
-static int rebalancetree(struct witness_list *list);
 static voidremovechild(struct witness *parent, struct witness *child);
-static int reparentchildren(struct witness *newparent,
-   struct witness *oldparent);
 static int sysctl_debug_witness_watch(SYSCTL_HANDLER_ARGS);
-static voidwitness_displaydescendants(void(*)(const char *fmt, ...),
-  struct witness *, int indent);
 static const char *fixup_filename(const char *file);
-static voidwitness_leveldescendents(struct witness *parent, int level);
-static voidwitness_levelall(void);
 static struct  witness *witness_get(void);
 static voidwitness_free(struct witness *m);
 static struct  witness_child_list_entry *witness_child_get(void);
@@ -185,20 +178,21 @@
 struct lock_object *lock);
 static voidwitness_list_lock(struct lock_instance *instance);
 #ifdef DDB
-static voidwitness_list(struct thread *td);
+static voidwitness_leveldescendents(struct witness *parent, int level);
+static voidwitness_levelall(void);
+static voidwitness_displaydescendants(void(*)(const char *fmt, ...),
+  struct witness *, int indent);
 static voidwitness_display_list(void(*prnt)(const char *fmt, ...),
 struct witness_list *list);
 static voidwitness_display(void(*)(const char *fmt, ...));
+static voidwitness_list(struct thread *td);
 #endif
 
 SYSCTL_NODE(_debug, OID_AUTO, witness, CTLFLAG_RW, 0, "Witness Locking");
 
 /*
- * If set to 0, witness is disabled.  If set to 1, witness performs full lock
- * order checking for all locks.  If set to 2 or higher, then witness skips
- * the full lock order check if the lock being acquired is at a higher level
- * (i.e. farther down in the tree) than the current lock.  This last mode is
- * somewhat experimental and not considered fully safe.  At runtime, this
+ * If set to 0, witness is disabled.  If set to a non-zero value, witness
+ * performs full lock order checking for all locks.  At runtime, this
  * value may be set to 0 to turn off witness.  witness is not allowed be
  * turned on once it is turned off, however.
  */
@@ -250,6 +244,16 @@
 static struct witness_child_list_entry *w_child_free = NULL;
 static struct lock_list_entry *w_lock_list_free = NULL;
 
+static int w_free_cnt, w_spin_cnt, w_sleep_cnt, w_child_free_cnt, w_child_cnt;
+SYSCTL_INT(_debug_witness, OID_AUTO, free_cnt, CTLFLAG_RD, &w_free_cnt, 0, "");
+SYSCTL_INT(_debug_witness, OID_AUTO, spin_cnt, CTLFLAG_RD, &w_spin_cnt, 0, "");
+SYSCTL_INT(_debug_witness, OID_AUTO, sleep_cnt, CTLFLAG_RD, &w_sleep_cnt, 0,
+    "");
+SYSCTL_INT(_debug_witness, OID_AUTO, child_free_cnt, CTLFLAG_RD,
+    &w_child_free_cnt, 0, "");
+SYSCTL_INT(_debug_witness, OID_AUTO, child_cnt, CTLFLAG_RD, &w_child_cnt, 0,
+    "");
+
 static struct witness w_data[WITNESS_COUNT];
 static struct witness_child_list_entry w_childdata[WITNESS_CHILDCOUNT];
 static struct lock_list_entry w_locklistdata[LOCK_CHILDCOUNT];
@@ -575,6 +579,87 @@
 
 #ifdef DDB
 static void
+witness_levelall (void)
+{
+	struct witness_list *list;
+	struct witness *w, *w1;
+
+	/*
+	 * First clear all levels.
+	 */
+	STAILQ_FOREACH(w, &w_all, w_list) {
+		w->w_level = 0;
+	}
+
+	/*
+	 * Look for locks with no parent and level all their descendants.
+	 */
+	STAILQ_FOREACH(w, &w_all, w_list) {
+		/*
+		 * This is just an optimization, technically we could get
+		 * away just walking the all list each time.
+		 */
+		if (w->w_class->lc_flags & LC_SLEEPLOCK)
+			list = &w_sleep;
+		else
+			list = &w_spin;
+		STAILQ_FOREACH(w1, list, w_typelist) {
+			if (isitmychild(w1, w))
+				goto skip;
+

Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don pid 519 wants to lock this vnode but some other thread is
  Don holding the vnode lock.  Unfortunately we don't know who the
  Don lock holder is because the message is truncated.
 
 Is it possible to find out the answer from the crashdump?

It's possible if you have the matching debug kernel, though it is more
painful.  In kgdb:
print *(struct vnode *)0xc691f318
print ((struct vnode *)0xc691f318)->v_vnlock->lk_lockholder->td_proc->p_pid
or something like that.

  Don This might just be a vnode lock leak.  Build a debug kernel with
  Don the DEBUG_VFS_LOCKS and DEBUG_LOCKS options and see if anything
  Don shows up.
 
 I'll try, thank you.

Are you using any unusual file systems, such as nullfs or unionfs?



Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
 Yes, I'm use a lots of nullfs. This is a host system for about 20
 jails with nullfs mounted ro system:

That would be my guess as to the cause of the problem. Hopefully
DEBUG_VFS_LOCKS will help pinpoint the bug.



Re: Recurring problem: processes block accessing UFS file system

2006-01-04 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 3 Jan 2006, Don Lewis wrote:

 Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is
 sleeping on the buf's b_xflags member.  It looks like 87117 is waiting
 for an in-progress write to complete.  There are a large number of other
 sendmail processes waiting in this same place.

 How about show buffer 0xdc76fe30?

 
 db show buffer 0xdc76fe30
 buf at 0xdc76fe30
 b_flags = 0x20a0vmio,delwri,cache
 b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
 b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368
 lockstatus = 0, excl count = 0, excl owner 0x
 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 
 0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 
 0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000)
 db

Hmn, it would be nice if DDB printed b_vflags so that we would know the
state of BV_BKGRDINPROG and BV_BKGRDWAIT.  As it is, we don't know if
the background write got lost or if we missed the wakeup.

At this point, I think it might be easier to do post-mortem debugging on
a core file with kgdb and kernel.debug.  Unless there is an obvious race
condition, I suspect this will be a tough slog.




Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 22 Nov 2005, I wrote:
 
 On Mon, 21 Nov 2005, Kris Kennaway wrote:

 It may not be the same problem.  You should also try to obtain a trace when 
 snapshots are not implicated.
 

 Agreed.  I'll do so at the first opportunity.

 
 First, my thanks to all of you for looking into this.
 
 It's taken more than a month, but the problem has recurred without 
 snapshots ever having been run.  I've got a good trace of the machine in 
 this state (attached).  My apologies for the size of the debug output, but 
 the processes had really stacked up this time before I noticed it.
 
 I have enough capacity that I can afford to have this machine out of 
 production for a while, so I've left it suspended in kdb for the time 
 being in case additional information is needed.  Please let me know if 
 there's anything else I can do to facilitate troubleshooting this. 
 Thanks!

There are large number of sendmail processes waiting on vnode locks
which are held by other sendmail processes that are waiting on other
vnode locks, etc. until we get to sendmail pid 87150 which is holding a
vnode lock and waiting to lock a buf.

Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480
sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5
sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f
sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11
msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7
acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89
lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f
getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211
breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52
bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c
ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383
ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154
ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61
ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36
VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c
vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe
vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33
kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca
open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36
syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This doesn't appear to be a buf/memory exhausting problem because
syncer, bufdaemon, and pagedaemon all appear to be idle.

What does show lockedbufs say?



Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 3 Jan 2006, Don Lewis wrote:
 
 There are large number of sendmail processes waiting on vnode locks
 which are held by other sendmail processes that are waiting on other
 vnode locks, etc. until we get to sendmail pid 87150 which is holding a
 vnode lock and waiting to lock a buf.

 Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480
 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158
 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5
 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f
 sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11
 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7
 acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89
 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f
 getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211
 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52
 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c
 ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383
 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154
 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61
 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36
 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c
 vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe
 vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33
 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca
 open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36
 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324
 Xint0x80_syscall() at Xint0x80_syscall+0x1f

 This doesn't appear to be a buf/memory exhausting problem because
 syncer, bufdaemon, and pagedaemon all appear to be idle.

 What does show lockedbufs say?

 
 db show lockedbufs

[snip]

looks like this is the buf that pid 87150 is waiting for:

 buf at 0xdc713f50
 b_flags = 0xa00200a0remfree,vmio,clusterok,delwri,cache
 b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624
 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80
 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 
 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 
 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000)

which is locked by this thread:

Tracing command sendmail pid 87117 tid 101335 td 0xcfeb5d80
sched_switch(cfeb5d80,0,1,fd1926a,640c65f9) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc76fe8c,ec883b2c) at mi_switch+0x1d5
sleepq_switch(dc76fe8c,ec883b60,c04bb9ce,dc76fe8c,4c) at sleepq_switch+0x16f
sleepq_wait(dc76fe8c,4c,c061e9ac,0,0) at sleepq_wait+0x11
msleep(dc76fe8c,c0662f80,4c,c061e9ac,0) at msleep+0x3d7
getdirtybuf(dc76fe30,c0662f80,1,ec883ba8,0) at getdirtybuf+0x221   
softdep_update_inodeblock(cd1bc528,dc713f50,1,4000,0) at softdep_update_inodeblo
ck+0x267
ffs_update(cd953bb0,1,0,cd953bb0,ec883c78,c0529a59,0,0,0,4,1,cd953c2c) at ffs_up
date+0x27f
ffs_syncvnode(cd953bb0,1,4,ec883c78,c05f9a70) at ffs_syncvnode+0x52e
ffs_fsync(ec883cb4,ec883cd0,c052468a,c0646cc0,ec883cb4) at ffs_fsync+0x1c
VOP_FSYNC_APV(c0646cc0,ec883cb4,0,0,0) at VOP_FSYNC_APV+0x3a
fsync(cfeb5d80,ec883d04,4,cfeb5d80,ec883d2c) at fsync+0x1db
syscall(3b,3b,3b,80c7c1b,bfbfa6b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (95, FreeBSD ELF32, fsync), eip = 0x8830f63f, esp = 0xbfbfa66c, ebp
= 0xbfbfaf98 ---


Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is
sleeping on the buf's b_xflags member.  It looks like 87117 is waiting
for an in-progress write to complete.  There are a large number of other
sendmail processes waiting in this same place.

How about show buffer 0xdc76fe30?

This is getting into an area of the kernel that I do not understand
well.



Re: Swapfile problem in 6?

2006-01-02 Thread Don Lewis
Attempting to catch up with my backlog of unread email, only 12K unread
messages to go ...

On 24 Nov, Rob wrote:

 I have cvsup'ed the sources to STABLE as of Nov. 23rd
 2005.
 After recompiling/installing world and debug-kernel,
 I again get a kernel deadlock when using swapfile:
http://surfion.snu.ac.kr/~lahaye/swapfile2.txt
 
 Previous deadlocks are still documented here
http://surfion.snu.ac.kr/~lahaye/swapfile.txt
 
 I hope this is of use for fixing this bug in 6.
 If further investigation is needed, then please let me
 know.

This is a deadlock caused by memory exhaustion.  The pagedaemon only has
a limited number of bufs that it uses for writing dirty pages to swap to
prevent it from saturating the I/O subsystem with large numbers of
writes.  In this case, pagedaemon is trying to free up memory by writing
dirty pages, and it has used up all of its bufs and is waiting for the
write requests to complete and the bufs to be returned to it.
This isn't happening because md0 is stuck waiting for memory.  This is a
little bit surprising to me because it looks like writes to vnode backed
devices are done synchronously by default.

If you have a chance to test this again, a stack trace of md0 in the
deadlock state would be interesting.  I'd like to know where md0 is
getting stuck.

I wonder if pagedaemon should scan ahead and more aggressively discard
clean pages when it has run out of bufs to write dirty pages, especially
in low memory situations.  Preventing the creation of more dirty pages
would be nice, but I don't know how to do that ...



Re: Swapfile problem in 6?

2006-01-02 Thread Don Lewis
On 17 Nov, Kris Kennaway wrote:
 On Thu, Nov 17, 2005 at 04:33:50PM -0800, Rob wrote:
 --- Kris Kennaway [EMAIL PROTECTED] wrote:
  
  I commented on it elsewhere in this thread.
 
 Do you mean your comment on the swap_pager error:
 
 Quote:
  AFAICT that is just a trigger-happy timer..it's
   supposed to detect when a swap operation took too
   long to complete, but it also triggers on swapfiles
   since they're so much less efficient (i.e. slower)
   than swapping onto a bare device.
 EndQuote.
 
 Right: harmless warning not error.

This isn't totally harmless because the pagedaemon is only allowed a
handful of outstanding writes.  It gets stuck when it runs out.
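
The mechanism amounts to a small fixed pool of I/O slots; a rough userland
analogue of the idea (names and the pool size are invented for the
example):

	#include <pthread.h>

	#define NPAGEBUF 8		/* stand-in for the pagedaemon's buf limit */

	static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t pool_cv = PTHREAD_COND_INITIALIZER;
	static int bufs_free = NPAGEBUF;

	/* Take a buf before issuing a pageout write; this blocks when the
	 * pool is empty, which is where the pagedaemon gets stuck if the
	 * writes it already issued never complete. */
	static void
	pagebuf_get(void)
	{
		pthread_mutex_lock(&pool_lock);
		while (bufs_free == 0)
			pthread_cond_wait(&pool_cv, &pool_lock);
		bufs_free--;
		pthread_mutex_unlock(&pool_lock);
	}

	/* Called from the write-completion path to return the buf. */
	static void
	pagebuf_put(void)
	{
		pthread_mutex_lock(&pool_lock);
		bufs_free++;
		pthread_cond_signal(&pool_cv);
		pthread_mutex_unlock(&pool_lock);
	}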



Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 26 Nov, Tor Egge wrote:
 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.
 
 The vnode syncing loop in ffs_sync() has some problems:
 
   1. Softupdate processing performed after the loop has started might
  trigger the need for retrying the loop.  Processing of dirrem work
  items can cause IN_CHANGE to be set on some inodes (causing
  deadlock in ufs_inactive() later on while the file system is
  suspended).

I also don't like how this loop interacts with the vnode list churn done
by vnlru_free().  Maybe vnode recycling for a file system should be
skipped while a file system is being suspended or unmounted.

   2. nvp might no longer be associated with the same mount point after
  MNT_IUNLOCK(mp) has been called in the loop.  This can cause the
  vnode list traversal to be incomplete, with stale information in
  the snapshot.  Further damage can occur when background fsck uses
  that stale information.

It looks like this is handled in __mnt_vnode_next() by starting over.
Skipping vnode recycling should avoid this problem in the snapshot case.
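
The restart amounts to something like the sketch below: a generation
counter bumped on every list change and checked after any point where the
iterator could have slept (plain C for illustration, not the real
per-mount vnode list code).

	#include <stddef.h>

	struct node {
		struct node *next;
	};

	struct list {
		struct node *head;
		unsigned gen;	/* bumped on every insert or remove */
	};

	/*
	 * Visit every node, restarting from the head whenever the generation
	 * count shows the list changed underneath us (for example while the
	 * callback slept on a lock).  Entries seen before a restart are
	 * simply visited again.
	 */
	static void
	visit_all(struct list *lp, void (*fn)(struct node *))
	{
		struct node *np;
		unsigned gen;

	restart:
		gen = lp->gen;
		for (np = lp->head; np != NULL; np = np->next) {
			fn(np);
			if (lp->gen != gen)
				goto restart;
		}
	}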

This loop should be bypassed in normal operation and the individual
vnode syncing should be done by the syncer.  The only reason this loop
isn't skipped during normal operation is that timestamp updates
aren't sufficient to add vnodes to the syncer worklist.

 Just a few lines down from that loop is a new problem:
 
   3. softdep_flushworklist() might not have processed all dirrem work
  items associated with the file system even if both error and count
  are zero. This can cause both background fsck and softupdate
  processing (after file system has been resumed) to decrement the
  link count of an inode, causing file system corruption or a panic.

Are you sure this is still true after the changes that were committed to
both HEAD and RELENG_6 before 6.0-RELEASE?

All the pending items that hang around various lists make me nervous,
though.  I really think the number of each flavor should be tracked per
mount point and softupdates should complain if the counts are non-zero
at the end of the suspend and unmount tasks.

  Processing of these work items while the file system is suspended
  causes a panic.




Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 26 Nov, Tor Egge wrote:
 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.
 
 The vnode syncing loop in ffs_sync() has some problems:

There is also a MNT_VNODE_FOREACH() loop in ffs_snapshot().



Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 22 Nov, Greg Rivers wrote:
 On Mon, 21 Nov 2005, Kris Kennaway wrote:
 
 It may not be the same problem.  You should also try to obtain a trace 
 when snapshots are not implicated.

 
 Agreed.  I'll do so at the first opportunity.
 
 
 'show lockedvnods' is very important for diagnosing filesystem 
 deadlocks, and 'alltrace' is the easiest way to obtain stack traces for 
 the resulting processes that are holding locks.  You'll want to 'set 
 $lines=0' and capture the output from the serial console to a file.

 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.

There appears to be a lock order reversal between a vnode lock and the
hand-rolled suspend lock.  Basically, it is not permissible to wait for
the suspend lock while holding a vnode lock.  In cases where a thread
holds a vnode lock and then wants to start a file system write, it has
to make a non-blocking attempt to grab the suspend lock and then release
the vnode lock if the suspend lock can't be obtained.  For example, in
vn_open_cred():

	if (vn_start_write(ndp->ni_dvp, &mp, V_NOWAIT) != 0) {
		NDFREE(ndp, NDF_ONLY_PNBUF);
		vput(ndp->ni_dvp);
		VFS_UNLOCK_GIANT(vfslocked);
		if ((error = vn_start_write(NULL, &mp,
		    V_XSLEEP | PCATCH)) != 0)
			return (error);
		goto restart;
	}


The problem is that vput() may call vinactive(), which calls
ufs_inactive(), which may attempt to do a file system write and block on
the snapshot lock.

db trace 99361
Tracing pid 99361 tid 100437 td 0xce67cd80
sched_switch(ce67cd80,0,1,3d4e53a2,95c40548) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c86d706c,ebb00908) at mi_switch+0x1d5
sleepq_switch(c86d706c,ebb0093c,c04bb9ce,c86d706c,9f) at sleepq_switch+0x16f
sleepq_wait(c86d706c,9f,c061a026,0,c0647200) at sleepq_wait+0x11
msleep(c86d706c,c86d7044,29f,c061a026,0,c0653720,c8dc1aa0,ebb00978,ce67cd80) at
msleep+0x3d7
vn_write_suspend_wait(c8dc1aa0,c86d7000,1,0,ca07f817) at vn_write_suspend_wait+0
x181
ufs_inactive(ebb009c8,ebb009dc,c051b330,c0646cc0,ebb009c8) at ufs_inactive+0x1b4
VOP_INACTIVE_APV(c0646cc0,ebb009c8,a03,ebb009d0,c051054f) at VOP_INACTIVE_APV+0x
3a
vinactive(c8dc1aa0,ce67cd80,ffdf,ebb00a24,c0513592) at vinactive+0x82
vput(c8dc1aa0,ffdf,2,ebb00a50,0) at vput+0x187
vn_open_cred(ebb00bc0,ebb00cc0,180,c9430580,5) at vn_open_cred+0xfb
vn_open(ebb00bc0,ebb00cc0,180,5,4b5fcfad) at vn_open+0x33
kern_open(ce67cd80,81414e0,0,a03,180) at kern_open+0xca
open(ce67cd80,ebb00d04,c,ce67cd80,8155000) at open+0x36
syscall(3b,3b,3b,0,a02) at syscall+0x324

99361 ce682624  100   673   673 100 [SLPQ suspfs 0xc86d706c][SLP] sendmail


If a thread gets stuck waiting for the suspend lock while holding a
vnode lock, then it is possible for the thread that is creating the
snapshot to get stuck when it attempts to lock that same vnode.

98639 c8e91418 11008 98637 97958 0004102 [SLPQ ufs 0xc8991d18][SLP] mksnap_ffs

db trace 98639
Tracing pid 98639 tid 100282 td 0xc8e8e300
sched_switch(c8e8e300,0,1,cc167362,733b6597) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c8991d18,eb80c544) at mi_switch+0x1d5
sleepq_switch(c8991d18,eb80c578,c04bb9ce,c8991d18,50) at sleepq_switch+0x16f
sleepq_wait(c8991d18,50,c06186b3,0,c065ae80) at sleepq_wait+0x11  
msleep(c8991d18,c065888c,50,c06186b3,0) at msleep+0x3d7
acquire(eb80c5e0,40,6,c8e8e300,0) at acquire+0x89
lockmgr(c8991d18,2002,c8991d3c,c8e8e300,eb80c608) at lockmgr+0x45f
vop_stdlock(eb80c660,0,c0646cc0,eb80c660,eb80c618) at vop_stdlock+0x2f
VOP_LOCK_APV(c0647200,eb80c660,eb80c630,c05f9f43,eb80c660) at VOP_LOCK_APV+0x44
ffs_lock(eb80c660,0,2002,c8991cc0,eb80c67c) at ffs_lock+0x19
VOP_LOCK_APV(c0646cc0,eb80c660,c06402e0,eb80c7d0,0) at VOP_LOCK_APV+0x44
vn_lock(c8991cc0,2002,c8e8e300,4000,c851f300) at vn_lock+0x132
ffs_snapshot(c86d7000,cc978e00,eb80c9a4,6c,eb80c964) at ffs_snapshot+0x152b
ffs_mount(c86d7000,c8e8e300,0,c8e8e300,c9f71bb0) at ffs_mount+0x9af
vfs_domount(c8e8e300,c88a8760,c88a8870,11211000,c9a77510) at vfs_domount+0x728
vfs_donmount(c8e8e300,11211000,eb80cbec,c9314580,e) at vfs_donmount+0x12e
kernel_mount(c8737b80,11211000,eb80cc30,6c,bfbfe8c8) at kernel_mount+0x46
ffs_cmount(c8737b80,bfbfe110,11211000,c8e8e300,0) at ffs_cmount+0x85
mount(c8e8e300,eb80cd04,10,c8e8e300,8052000) at mount+0x21b
syscall(3b,3b,3b,8814819c,bfbfe0b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f


This block of code in ufs_inactive() is triggering the problem:

	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
		if ((ip->i_flag &

Re: FreeBSD unstable on Dell 1750 using SMP?

2006-01-02 Thread Don Lewis
On 30 Nov, Dan Charrois wrote:
 This is encouraging - it's the first I've heard of someone who has  
 found a way to trigger the problem on demand.  The problems I was  
 experiencing were on a dual Xeon with HTT enabled as well.Perhaps  
 someone out there who knows much more about the inner workings of  
 FreeBSD may have an idea of why running top in aggressive mode like  
 this might trigger the random rebooting.  In particular, it would be  
 nice to *know* that someone out there specifically fixed whatever is  
 wrong in 5.4 when bringing it to 6.0.  It's encouraging that you  
 haven't had any problems since upgrading to 6.0, but I have to wonder  
 if the bug's actually fixed, or the specific trigger of running top  
 doesn't trigger the problem but the problem is still lurking in the  
 background waiting to strike with the right combination of events.
 
 In any case, I'm anxious to try it out myself on our server to see if  
 top -s0 brings it down on command with HTT enabled, and not with  
 HTT disabled.  But I'm going to have to wait until some time over the  
 Christmas holidays to do that sort of experimentation at a time when  
 it isn't affecting the end users of the machine.  I may also upgrade  
 to 6.0 at that time, since by then it will have been out for a couple  
 of months, so most of the worst quirks should be worked out by then.
 
 In the meantime, disabling HTT as I've done seems like a reasonable  
 precaution to improve the stability..
 
 Thanks for your help!
 
 Dan

Try this patch, which I posted to stable@ on October 15.  I had hoped to
commit it to RELENG_5 in November, but my day job intervened.

-- Forwarded message --
From: Don Lewis [EMAIL PROTECTED]
 Subject: testers wanted for 5.4-STABLE sysctl kern.proc patch
Date: Sat, 15 Oct 2005 14:51:37 -0700 (PDT)
  To: [EMAIL PROTECTED]
  Cc: 

The patch below is the 5.4-STABLE version of a patch that was recently
committed to HEAD and 6.0-BETA5 to fix locking problems in the kern.proc
sysctl handler that could cause panics or deadlocks.  It has already
been tested by myself and one other person in 5.4-STABLE, but I think it
deserves wider testing before I commit it.  Testing on SMP systems,
while running threaded applications, and on systems that have
experienced panics in the existing code is of the most interest.  Also
be on the lookout for any regressions, such as incorrect data being
returned.

Index: sys/kern/kern_proc.c
===
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.215.2.6
diff -u -r1.215.2.6 kern_proc.c
--- sys/kern/kern_proc.c22 Mar 2005 13:40:23 -  1.215.2.6
+++ sys/kern/kern_proc.c12 Oct 2005 19:13:14 -
@@ -72,6 +72,8 @@
 
 static void doenterpgrp(struct proc *, struct pgrp *);
 static void orphanpg(struct pgrp *pg);
+static void fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp);
+static void fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 static void pgadjustjobc(struct pgrp *pgrp, int entering);
 static void pgdelete(struct pgrp *);
 static int proc_ctor(void *mem, int size, void *arg, int flags);
@@ -601,33 +603,22 @@
}
 }
 #endif /* DDB */
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 
 /*
- * Fill in a kinfo_proc structure for the specified process.
+ * Clear kinfo_proc and fill in any information that is common
+ * to all threads in the process.
  * Must be called with the target process locked.
  */
-void
-fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp)
-{
-   fill_kinfo_thread(FIRST_THREAD_IN_PROC(p), kp);
-}
-
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp)
+static void
+fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp)
 {
-   struct proc *p;
struct thread *td0;
-   struct ksegrp *kg;
struct tty *tp;
struct session *sp;
struct timeval tv;
struct ucred *cred;
struct sigacts *ps;
 
-   p = td-td_proc;
-
bzero(kp, sizeof(*kp));
 
kp-ki_structsize = sizeof(*kp);
@@ -685,7 +676,8 @@
kp-ki_tsize = vm-vm_tsize;
kp-ki_dsize = vm-vm_dsize;
kp-ki_ssize = vm-vm_ssize;
-   }
+   } else if (p-p_state == PRS_ZOMBIE)
+   kp-ki_stat = SZOMB;
if ((p-p_sflag  PS_INMEM)  p-p_stats) {
kp-ki_start = p-p_stats-p_start;
timevaladd(kp-ki_start, boottime);
@@ -704,71 +696,6 @@
kp-ki_nice = p-p_nice;
bintime2timeval(p-p_runtime, tv);
kp-ki_runtime = tv.tv_sec * (u_int64_t)100 + tv.tv_usec;
-   if (p-p_state != PRS_ZOMBIE) {
-#if 0
-   if (td == NULL) {
-   /* XXXKSE: This should never happen. */
-   printf(fill_kinfo_proc(): pid %d has no threads!\n,
-   p-p_pid);
-   mtx_unlock_spin

testers wanted for 5.4-STABLE sysctl kern.proc patch

2005-10-15 Thread Don Lewis
The patch below is the 5.4-STABLE version of a patch that was recently
committed to HEAD and 6.0-BETA5 to fix locking problems in the kern.proc
sysctl handler that could cause panics or deadlocks.  It has already
been tested by myself and one other person in 5.4-STABLE, but I think it
deserves wider testing before I commit it.  Testing on SMP systems,
while running threaded applications, and on systems that have
experienced panics in the existing code is of the most interest.  Also
be on the lookout for any regressions, such as incorrect data being
returned.

Index: sys/kern/kern_proc.c
===
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.215.2.6
diff -u -r1.215.2.6 kern_proc.c
--- sys/kern/kern_proc.c22 Mar 2005 13:40:23 -  1.215.2.6
+++ sys/kern/kern_proc.c12 Oct 2005 19:13:14 -
@@ -72,6 +72,8 @@
 
 static void doenterpgrp(struct proc *, struct pgrp *);
 static void orphanpg(struct pgrp *pg);
+static void fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp);
+static void fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 static void pgadjustjobc(struct pgrp *pgrp, int entering);
 static void pgdelete(struct pgrp *);
 static int proc_ctor(void *mem, int size, void *arg, int flags);
@@ -601,33 +603,22 @@
}
 }
 #endif /* DDB */
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 
 /*
- * Fill in a kinfo_proc structure for the specified process.
+ * Clear kinfo_proc and fill in any information that is common
+ * to all threads in the process.
  * Must be called with the target process locked.
  */
-void
-fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp)
-{
-   fill_kinfo_thread(FIRST_THREAD_IN_PROC(p), kp);
-}
-
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp)
+static void
+fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp)
 {
-   struct proc *p;
struct thread *td0;
-   struct ksegrp *kg;
struct tty *tp;
struct session *sp;
struct timeval tv;
struct ucred *cred;
struct sigacts *ps;
 
-   p = td-td_proc;
-
bzero(kp, sizeof(*kp));
 
kp-ki_structsize = sizeof(*kp);
@@ -685,7 +676,8 @@
kp-ki_tsize = vm-vm_tsize;
kp-ki_dsize = vm-vm_dsize;
kp-ki_ssize = vm-vm_ssize;
-   }
+   } else if (p-p_state == PRS_ZOMBIE)
+   kp-ki_stat = SZOMB;
if ((p-p_sflag  PS_INMEM)  p-p_stats) {
kp-ki_start = p-p_stats-p_start;
timevaladd(kp-ki_start, boottime);
@@ -704,71 +696,6 @@
kp-ki_nice = p-p_nice;
bintime2timeval(p-p_runtime, tv);
kp-ki_runtime = tv.tv_sec * (u_int64_t)100 + tv.tv_usec;
-   if (p-p_state != PRS_ZOMBIE) {
-#if 0
-   if (td == NULL) {
-   /* XXXKSE: This should never happen. */
-   printf(fill_kinfo_proc(): pid %d has no threads!\n,
-   p-p_pid);
-   mtx_unlock_spin(sched_lock);
-   return;
-   }
-#endif
-   if (td-td_wmesg != NULL) {
-   strlcpy(kp-ki_wmesg, td-td_wmesg,
-   sizeof(kp-ki_wmesg));
-   }
-   if (TD_ON_LOCK(td)) {
-   kp-ki_kiflag |= KI_LOCKBLOCK;
-   strlcpy(kp-ki_lockname, td-td_lockname,
-   sizeof(kp-ki_lockname));
-   }
-
-   if (p-p_state == PRS_NORMAL) { /*  XXXKSE very approximate */
-   if (TD_ON_RUNQ(td) ||
-   TD_CAN_RUN(td) ||
-   TD_IS_RUNNING(td)) {
-   kp-ki_stat = SRUN;
-   } else if (P_SHOULDSTOP(p)) {
-   kp-ki_stat = SSTOP;
-   } else if (TD_IS_SLEEPING(td)) {
-   kp-ki_stat = SSLEEP;
-   } else if (TD_ON_LOCK(td)) {
-   kp-ki_stat = SLOCK;
-   } else {
-   kp-ki_stat = SWAIT;
-   }
-   } else {
-   kp-ki_stat = SIDL;
-   }
-
-   kg = td-td_ksegrp;
-
-   /* things in the KSE GROUP */
-   kp-ki_estcpu = kg-kg_estcpu;
-   kp-ki_slptime = kg-kg_slptime;
-   kp-ki_pri.pri_user = kg-kg_user_pri;
-   kp-ki_pri.pri_class = kg-kg_pri_class;
-
-   /* Things in the thread */
-   kp-ki_wchan = td-td_wchan;
-   kp-ki_pri.pri_level = td-td_priority;
-   kp-ki_pri.pri_native = td-td_base_pri;
-   kp-ki_lastcpu = td-td_lastcpu;
-   kp-ki_oncpu = td-td_oncpu;
-   kp-ki_tdflags = td-td_flags;
- 
