Re: random lockups (now suspecting zfs)

2023-11-07 Thread Stephen Borrill

On Sat, 4 Nov 2023, Simon Burge wrote:

Hi Greg,

Greg Troxel wrote:


On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:

A different machine has locked up, running recent netbsd-10.  I was
doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
physical.  It has a private patch to reduce the amount of memory used
for ARC, which has been working well.


Are you still seeing the problem below even with limiting the amount of
memory ARC can use?


All 3 tmux windows show something like

  [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k

and I can switch among them and ^T, but trying to run top is stuck (in
flt_noram5).  I'll give it an hour or so, and have a look at the
console.


I've seen cc1plus processes wedged in either flt_noram or tstile after
doing multiple builds, and a reboot is the only way out.  I'm using ZFS
for everything except swap and some mostly-unused media files that live
on an FFS.


So to me this feels like a locking botch in a rare path in zfs.


This appears to be the case.  Chuck Silvers has some understanding of
the problem and I'm helping test, but at this stage there isn't a fix
available. :/


It's interesting that you see the lockups during pkgsrc builds, i.e. a 
period where there is lots of file creation. We use zfs on backup systems 
that pull in data with rsync. During the initial runs (where every file is 
new) we usually get a couple of lockups, but during day to day operation 
(few changes) it is reliable. These are on physical and virtual machines 
running NetBSD 9 with the rule of thumb of 1GB RAM per TB of storage 
obeyed, but no patches besides setting MAXPHYS in the module to 32k for 
Xen.


--
Stephen



Re: ipmi0: incorrect critical max

2023-03-21 Thread Stephen Borrill

On Sat, 18 Mar 2023, Lloyd Parkes wrote:

On 18/03/23 05:14, Stephen Borrill wrote:
On an HP Microserver Gen10 Plus, I found that soon after booting, I get the 
following alert:

...
                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[ipmi0]
    11-LOM-CORE:   59.253    0.000  110.471                    degC




Just out of interest, in the BIOS (RBSU) what is the Power Management / Power 
Regulator set to? It will have settings such as "Dynamic Power Savings Mode" 
and "OS Control Mode".


I set it to Maximum I/O Performance (words may not match exactly, it is in 
a box waiting to be installed at a customer).


--
Stephen


ipmi0: incorrect critical max

2023-03-17 Thread Stephen Borrill
On an HP Microserver Gen10 Plus, I found that soon after booting, I get 
the following alert:


ipmi0: critical over limit on '11-LOM-CORE'

If powerd is running (the default), it shuts the machine down (so 
basically as soon as it hits multi-user).


envstat shows that CritMax is zero:

                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[ipmi0]
    11-LOM-CORE:   59.253    0.000  110.471                    degC

Seen on 9.3_STABLE, but also in 10 BETA.

I suppose one simple fix would be to ensure that if CritMax is lower 
than WarnMax, it should be set to the value of WarnMax.
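
To make the idea concrete, here is a minimal sketch of that clamping rule. It is Python purely for illustration: the real change would have to be made in C in the ipmi driver, and the function name is made up for this example.

```python
from typing import Optional


def sanitise_crit_max(crit_max: Optional[float],
                      warn_max: Optional[float]) -> Optional[float]:
    """Return a critical-max limit that is never below the warning-max limit.

    Hypothetical helper: if either limit is missing, the critical limit is
    returned unchanged.
    """
    if crit_max is None or warn_max is None:
        return crit_max
    # A CritMax below WarnMax makes no sense, so raise it to WarnMax.
    return max(crit_max, warn_max)


# Limits reported by envstat for 11-LOM-CORE (degC).
print(sanitise_crit_max(0.000, 110.471))
```

With the values above, the bogus CritMax of 0.000 would be raised to 110.471, so a reading of 59.253 would no longer trip the critical-over alarm.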


Any other things to look at? The machine won't be put into production for 
a few days, so it's a good time to experiment.


I have put the latest BIOS on the machine.

--
Stephen



Re: ixg wierdness

2021-12-23 Thread Stephen Borrill

On Wed, 22 Dec 2021, Patrick Welche wrote:

On Wed, Dec 22, 2021 at 01:34:25PM +0100, Hauke Fath wrote:

On Wed, 22 Dec 2021 12:26:21 +, Patrick Welche wrote:

The box in 53155 is Hauke's - also a Dell, but slightly different model.


he@, not hauke@ -- no Dell boxes here.


Sorry - Havard's!

On the 51355 front, dholland asks if the 2 bnx hang issue is the same as
47229, and it looks like it. From the email threads quoted in 47229, the
gist seems to be that the issue doesn't exist on /i386, just /amd64.


I reported something similar on an IBM x3550M3 back in 2019, too:

http://mail-index.netbsd.org/tech-net/2019/03/19/msg007302.html

--
Stephen



Re: sdmmc_mem_enable failed with error 60

2021-01-29 Thread Stephen Borrill

On Tue, Mar 24, 2020 at 03:48:03PM +, Patrick Welche wrote:

Last time I played with my raspberry pi zero w, I couldn't see the network
card and saw

  sdmmc_mem_enable failed with error 60

Now I'm seeing the same thing on a new amd64 laptop trying to use
another new 32GB microsd card. I opened kern/54959 in the rpi0w
case.

The laptop has a

rtsx0 at pci6 dev 0 function 0: Realtek Semiconductor RTS525A PCI-E Card Reader (rev. 0x01)
rtsx0: interrupting at msi2 vec 0
sdmmc0 at rtsx0


I'm testing a Dell Latitude 3190 and its hard drive is on sdmmc0 so I have 
no storage:


[ 1.016863] sdhc0 at pci0 dev 28 function 0: Intel Gemini Lake eMMC (rev. 0x06)
[ 1.016863] sdhc0: interrupting at ioapic0 pin 39
[ 1.016863] sdhc0: SDHC 3.0, rev 16, SDMA, 20 kHz, embedded slot, HS SDR50 DDR50 SDR104 HS200 1.8V, re-tuning mode 1 (128s timer), 2048 byte blocks
[ 1.016863] sdmmc0 at sdhc0 slot 0
[ 4.914910] sdmmc0: sdmmc_mem_enable failed with error 60
[ 4.924910] sdmmc0: autoconfiguration error: couldn't enable card: 60

Full dmesg (without SDMMC_DEBUG):
http://www.netbsd.org/~sborrill/dmesg.lt3190

And acpidump:
http://www.netbsd.org/~sborrill/acpidump.lt3190

--
Stephen



Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1

2020-07-06 Thread Stephen Borrill

On Mon, 6 Jul 2020, Martin Husemann wrote:

On Mon, Jul 06, 2020 at 05:07:51PM +0100, Mike Pumford wrote:

A quick look around suggests that some of the very high end gaming ones
don't. Also assuming users will actually be able to find a cable to actually
hook up the motherboard COM port is optimistic. You would probably have to
get one second hand these days and if I remember correctly there are 2
incompatible pinouts for the 10 pin header. :(


I had no trouble ordering new ones late last year.


If you wanted a branded one (!), there's a Lenovo server option with 
SKU 7Z17A02577 (I don't know about the pinout off-hand, but I have one 
here I could buzz through if anyone really cared).


--
Stephen



Re: modload & xen and -current 9.99.60

2020-05-11 Thread Stephen Borrill

On Fri, 8 May 2020, Manuel Bouyer wrote:


On Fri, May 08, 2020 at 02:55:10PM +0200, Frank Kardel wrote:

I checked the same kernel in an instance with memory=2048 and it just works.

Using today's kernel also works with memory=2048.

Using memory=65536 for the xen instance gives a surprisingly familiar

TEST-A# modload bpfjit
[  97.4727034] kobj_load, 444: [%M/bpfjit/bpfjit.kmod]: linker error: out of memory
modload: bpfjit: Cannot allocate memory
TEST-A#

So it seems to be linked to available memory.

The more you have the less you get for modload.


It could be a variable overflow somewhere but I can't see how it relates to
64Gb. Does it work with 16Gb ?


This sounds similar to the problem I reported a couple of weeks ago with 
exactly 16GB:


http://mail-index.netbsd.org/port-xen/2020/04/17/msg009654.html


Also could you try with a PVH or HVM guest ? These ones would use modules
from /stand/amd64/ and not /stand/amd64-xen/ and should be close to native.

I don't have a box with that much RAM to test ...

--
Manuel Bouyer 
NetBSD: 26 ans d'experience feront toujours la difference
--



Re: xen & uefi

2020-03-20 Thread Stephen Borrill

On Fri, 20 Mar 2020, Brad Spencer wrote:

Patrick Welche  writes:


Is booting into xen from uefi meant to work?

I have a slightly unorthodox set up, but get:


NetBSD/x86 EFI Boot (x64), Revision 1.1 (Tue Jan 28 13:49:42 UTC 2020) (from)

...
Start @ 0xce60 [1=0xce982000-0xce9820ec]...
Trampoline space cannot be allocated; will try fallback.


I didn't think that a DOM0 + UEFI worked anywhere very well at this
point or at least was not a default...

See https://wiki.xenproject.org/wiki/Xen_EFI for example...
or ... https://ubuntuforums.org/showthread.php?t=2413434


It's been the default on XenServer (Citrix Hypervisor) for a long time (if 
you boot the installer from uEFI).


--
Stephen



Re: XEN3_DOMU no longer shutting down or rebooting

2019-03-04 Thread Stephen Borrill

On Fri, 1 Mar 2019, Chavdar Ivanov wrote:

On 3/1/19 at 1:02 AM, Mathew, Cherry G. wrote:

Would be good to know the dom0 versions it broke under, please.


The DOMU is CentOS 7.6, the virtualizer is XCP-NG v.7.6.


That's not what Cherry's asking for.

Try the output of xl dmesg on XCP-NG. Near the top, you'll see lines 
like:


(XEN) [0.00] Xen version 4.7.6-6.3 (mockbuild@[unknown]) (gcc (GCC) 7.3.0) debug=n  Fri Nov  9 15:37:51 UTC 2018
(XEN) [0.00] Latest ChangeSet: 9a6cc4f5c14b, pq 15428fd29b9a

uname -a will give you something like:

Linux xen01 4.4.0+10 #1 SMP Fri Aug 24 08:15:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


(the above are from fully-patched XenServer 7.6, so I'm mildly curious to 
see what XCP-NG is based on).
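
If you want to pull just the version string out of a saved copy of the xl dmesg output, something like the following would do. This is only a sketch: the regular expression is an assumption based on the single line quoted above, and the function name is made up.

```python
import re
from typing import Optional

# Assumed pattern, derived only from the example banner line above.
XEN_VERSION = re.compile(r"\(XEN\).*Xen version (\S+)")


def xen_version(xl_dmesg: str) -> Optional[str]:
    """Return the first Xen version string found in xl dmesg text, if any."""
    for line in xl_dmesg.splitlines():
        m = XEN_VERSION.search(line)
        if m:
            return m.group(1)
    return None


sample = ("(XEN) [0.00] Xen version 4.7.6-6.3 (mockbuild@[unknown]) "
          "(gcc (GCC) 7.3.0) debug=n  Fri Nov  9 15:37:51 UTC 2018")
print(xen_version(sample))
```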


--
Stephen



Re: mfii0 kudos to bouyer@ Was Re: dmesg | grep -c "not configured" = 240...

2018-12-05 Thread Stephen Borrill

On Tue, 4 Dec 2018, Martin Husemann wrote:

On Tue, Dec 04, 2018 at 07:17:59PM +, Mike Pumford wrote:

One thing that surprised me was that I was testing with the USB install
image but instead of landing in sysinst I ended up at a login prompt which
was unexpected. Could this be because the USB disk that was my root device
ended up as sd23 and there is a hard coded sd0 somewhere in the install
code?


No hardcoded sd0, but maybe the boot device matching did not properly work
for this case (depends on geometry and stuff that the bootloader gets
from bios USB emulation or something).


Interesting, I noticed exactly the same when booting a -current image to
test the original mfii changes. In my case, sd0 and sd1 are the HW RAID 
arrays and sd2 is the USB stick:


[ 8.177453] sd2 at scsibus1 target 0 lun 0: disk removable
[ 8.177453] sd2: 3958 MB, 522 cyl, 255 head, 63 sec, 512 bytes/sect x 8105984 sectors
[ 8.287566] boot device: sd2
[ 8.287566] root on sd2a dumps on sd2b

--
Stephen



Re: mfii0 kudos to bouyer@ Was Re: dmesg | grep -c "not configured" = 240...

2018-11-30 Thread Stephen Borrill

On Thu, 29 Nov 2018, Manuel Bouyer wrote:

On Thu, Nov 29, 2018 at 03:56:37PM +, Stephen Borrill wrote:

[snip]

The other missing driver is handled by mpii in OpenBSD (SAS3408). Our mpii
doesn't yet support any SAS3 cards.
[ 1.048805] vendor 1000 product 00af (SAS mass storage, revision 0x01) at pci4 dev 0 function 0 not configured


Do you have drives connected to this controller ?
If so I can probably come up with a patch this week-end.
The SAS3 has a slightly different interface, but from looking at the OpenBSD
driver it's all in a single function.


I cannot easily attach drives to it (it has external ports only, and I 
would need to drag it to our datacenter to connect it to something). Let's 
see what Mike Pumford's PCI IDs are.


If I do go to the datacenter however, I should also be able to test MegaRAID 
3108 support (IBM ServeRAID M5210).


Do you have a gut feel on how easy it would be to backport your mpii 
changes to -7 and -8?


--
Stephen


Re: Running NetBSD-current in PV mode under Xen

2018-11-30 Thread Stephen Borrill

On Thu, 29 Nov 2018, Chavdar Ivanov wrote:

I was trying to respond to an old PR of mine -
https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=47486 - and
went through the installation of XCP-NG (after I found out about the
existence of this project and that Citrix has apparently changed some
licensing conditions after XenServer 7.2). Latest -current works in
HVM mode after switching the network adapter emulation to e1000,
modulo the weird mouse behaviour under X, but I am not bothered about
it. If I switch the system to PV mode following the method described
in the above PR, the machine apparently starts and in some 10-15
seconds stops, without showing any console. I checked through the
XCP-NG log files, but could not identify anything obvious (I may have
missed stuff - they are copious).

I seem to recall a relatively recent discussion about NetBSD not
working any more under AWS PV - some parameter needed to be modified,
but could not find a reference; could that be related?


/opt/xensource/libexec/xen-cmdline --set-xen pv-linear-pt=true

See N.B. (3) here:
https://www.precedence.co.uk/wiki/Support-KB-Citrix/XenServer-Hotfixes

--
Stephen



Re: mfii0 kudos to bouyer@ Was Re: dmesg | grep -c "not configured" = 240...

2018-11-29 Thread Stephen Borrill

On Mon, 26 Nov 2018, Stephen Borrill wrote:

Thanks Manuel!

[ 1.048805] mfii0 at pci11 dev 0 function 0: "RAID 930-8i 2GB Flash", firmware 50.3.0-1075, 2048MB cache
[ 1.048805] mfii0: interrupting at ioapic4 pin 2
[ 1.048805] scsibus0 at mfii0: 64 targets, 8 luns per target
[ 2.161214] scsibus0: waiting 2 seconds for devices to settle...
[ 2.161214] mfii0: physical disk inserted id 18 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 19 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 20 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 21 enclosure 134
[ 4.163289] sd0 at scsibus0 target 0 lun 0: 5.03> disk fixed
[ 4.163289] sd0: fabricating a geometry
[ 4.163289] sd0: 2234 GB, 2287864 cyl, 64 head, 32 sec, 512 bytes/sect x 4685545472 sectors
[ 4.163289] sd0: fabricating a geometry
[ 4.163289] sd0: tagged queueing
[ 4.163289] sd1 at scsibus0 target 1 lun 0: 5.03> disk fixed
[ 4.163289] sd1: fabricating a geometry
[ 4.163289] sd1: 744 GB, 761985 cyl, 64 head, 32 sec, 512 bytes/sect x 1560545280 sectors
[ 4.163289] sd1: fabricating a geometry
[ 4.163289] sd1: tagged queueing
[32.192359] mfii0: critical limit on 'mfii0 BBU state'
[32.192359] mfii0: normal state on 'mfii0:0' (online)
[32.192359] mfii0: normal state on 'mfii0:1' (online)


The other missing driver is handled by mpii in OpenBSD (SAS3408). Our mpii 
doesn't yet support any SAS3 cards.
[ 1.048805] vendor 1000 product 00af (SAS mass storage, revision 0x01) at pci4 dev 0 function 0 not configured


--
Stephen



mfii0 kudos to bouyer@ Was Re: dmesg | grep -c "not configured" = 240...

2018-11-26 Thread Stephen Borrill

Thanks Manuel!

[ 1.048805] mfii0 at pci11 dev 0 function 0: "RAID 930-8i 2GB Flash", firmware 50.3.0-1075, 2048MB cache
[ 1.048805] mfii0: interrupting at ioapic4 pin 2
[ 1.048805] scsibus0 at mfii0: 64 targets, 8 luns per target
[ 2.161214] scsibus0: waiting 2 seconds for devices to settle...
[ 2.161214] mfii0: physical disk inserted id 18 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 19 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 20 enclosure 134
[ 2.161214] mfii0: physical disk inserted id 21 enclosure 134
[ 4.163289] sd0 at scsibus0 target 0 lun 0: disk fixed
[ 4.163289] sd0: fabricating a geometry
[ 4.163289] sd0: 2234 GB, 2287864 cyl, 64 head, 32 sec, 512 bytes/sect x 4685545472 sectors
[ 4.163289] sd0: fabricating a geometry
[ 4.163289] sd0: tagged queueing
[ 4.163289] sd1 at scsibus0 target 1 lun 0: disk fixed
[ 4.163289] sd1: fabricating a geometry
[ 4.163289] sd1: 744 GB, 761985 cyl, 64 head, 32 sec, 512 bytes/sect x 1560545280 sectors
[ 4.163289] sd1: fabricating a geometry
[ 4.163289] sd1: tagged queueing
[32.192359] mfii0: critical limit on 'mfii0 BBU state'
[32.192359] mfii0: normal state on 'mfii0:0' (online)
[32.192359] mfii0: normal state on 'mfii0:1' (online)

# bioctl mfii0 show
Volume  Status   Size   Device/Label  Level   Stripe
====================================================
     0  Online   2.2T   System        RAID 1  N/A     65535 seconds
   0:0  Online   2.2T   1:0.0 noencl
   0:1  Online   2.2T   1:1.0 noencl
     1  Online   744G   WriteCache    RAID 1  N/A     65535 seconds
   1:0  Online   745G   1:2.0 noencl
   1:1  Online   745G   1:3.0 noencl


dmesg | grep -c "not configured" = 239 :-)

On Mon, 19 Feb 2018, Stephen Borrill wrote:


So I've just got a Lenovo ThinkSystem SR630 and:
# dmesg | grep -c "not configured"
  240

http://www.netbsd.org/~sborrill/sr630.dmesg.txt

Main issues are missing Ethernet (Intel X722) and RAID controller:
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 0 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 1 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 2 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 3 not configured
vendor 1000 product 0016 (RAID mass storage, revision 0x01) at pci11 dev 0 function 0 not configured


msaitoh@ - have you looked at the Intel X722 gigabit controllers?

As for the RAID controller, we are missing support for all recent 
LSI/Symbios/Avago/Broadcom controllers meaning no support for lots of servers 
from Lenovo/HP, etc. OpenBSD's mfii supports most of these:


https://www.precedence.co.uk/wiki/Support-KB-IBM/PCIIDs

NetBSD has extended mfi to support a few variants, but OpenBSD has split the 
driver into mfi and mfii which makes porting more tricky.


I tried OpenBSD 6.2 (last release), but the support for the RAID controller 
in this server was added after 6.2. On OpenBSD:

# dmesg | grep -c "not configured"
350

--
Stephen




Re: M.2 SSDs and Marvell 88SE9230 SATA

2018-07-10 Thread Stephen Borrill

On Tue, 3 Jul 2018, Stephen Borrill wrote:
Any informed guesses whether M.2 SSDs will work if I buy them for my Lenovo 
server?


Based on the following, I've determined they should be Marvell 88SE9230:
https://discussions.citrix.com/topic/396920-unable-to-install-xen-server-75-onto-lenovo-m2-drive/

Only 88SE91XX is currently explicitly supported by ahcisata:
ahcisata_pci.c: { PCI_VENDOR_MARVELL2, PCI_PRODUCT_MARVELL2_88SE91XX,
pcidevs:product MARVELL2 88SE91XX   0x91a3  88SE91XX SATA
pcidevs:product MARVELL2 88SE9215   0x9215  88SE9215 SATA
pcidevs:product MARVELL2 88SE9220   0x9220  88SE9220 SATA
pcidevs:product MARVELL2 88SE9230   0x9230  88SE9230 SATA
pcidevs:product MARVELL2 88SE9235   0x9235  88SE9235 SATA

I'm probably going to try nonetheless (how hard can it be to add???), but any 
hints based on experience would be useful.


OK, so I fitted a pair of M.2 SSDs:
https://lenovopress.com/lp0769-thinksystem-m2-drives-adapters

Without doing anything they appeared as JBODs (along with an extra 
virtual device):

ahcisata2 at pci3 dev 0 function 0: vendor 1b4b product 9230 (rev. 0x11)
ahcisata2: interrupting at ioapic0 pin 18
ahcisata2: 64-bit DMA
ahcisata2: AHCI revision 1.20, 3 ports, 32 slots, CAP 0xc0309f02
atabus12 at ahcisata2 channel 0
atabus13 at ahcisata2 channel 1
atabus14 at ahcisata2 channel 2
wd0: 
wd0: drive supports 1-sector PIO transfers, LBA48 addressing
wd0: 119 GB, 248085 cyl, 16 head, 63 sec, 512 bytes/sect x 250069680 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), NCQ (32 tags)
wd0(ahcisata2:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
wd1 at atabus13 drive 0
wd1: 
wd1: drive supports 1-sector PIO transfers, LBA48 addressing
wd1: 119 GB, 248085 cyl, 16 head, 63 sec, 512 bytes/sect x 250069680 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), NCQ (32 tags)
wd1(ahcisata2:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
atapibus0 at atabus14: 1 targets
uk0 at atapibus0 drive 0:  processor fixed
uk0: drive supports PIO mode 4, Ultra-DMA mode 4 (Ultra/66)
uk0(ahcisata2:2:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA)

By going into the uEFI setup, I could easily set up a RAID-1 and a new 
slightly smaller virtual wd0 appeared and wd1 disappeared:


wd0 at atabus12 drive 0
wd0: 
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 119 GB, 247954 cyl, 16 head, 63 sec, 512 bytes/sect x 249938560 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 7, NCQ (32 tags)
wd0(ahcisata2:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)

Performance isn't too shabby either. bonnie says:
119691 K/sec output char
137535 K/sec output block
130260 K/sec output rewrite
171789 K/sec input char
260260 K/sec input block
13679 seeks/sec

So happy days!

--
Stephen



M.2 SSDs and Marvell 88SE9230 SATA

2018-07-03 Thread Stephen Borrill
Any informed guesses whether M.2 SSDs will work if I buy them for my 
Lenovo server?


Based on the following, I've determined they should be Marvell 88SE9230:
https://discussions.citrix.com/topic/396920-unable-to-install-xen-server-75-onto-lenovo-m2-drive/

Only 88SE91XX is currently explicitly supported by ahcisata:
ahcisata_pci.c: { PCI_VENDOR_MARVELL2, PCI_PRODUCT_MARVELL2_88SE91XX,
pcidevs:product MARVELL2 88SE91XX   0x91a3  88SE91XX SATA
pcidevs:product MARVELL2 88SE9215   0x9215  88SE9215 SATA
pcidevs:product MARVELL2 88SE9220   0x9220  88SE9220 SATA
pcidevs:product MARVELL2 88SE9230   0x9230  88SE9230 SATA
pcidevs:product MARVELL2 88SE9235   0x9235  88SE9235 SATA
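
For anyone who wants to repeat the check for another controller, the grep output above can be turned into a lookup table. This is only a sketch: the regular expression assumes the grep-prefixed pcidevs layout shown above, and parse_products and explicitly_matched are names made up for this example.

```python
import re
from typing import Dict, Tuple

# Assumed layout: "product <VENDOR> <NAME> <0xID> <description>".
PCIDEVS_PRODUCT = re.compile(
    r"product\s+(?P<vendor>\S+)\s+(?P<name>\S+)\s+"
    r"(?P<id>0x[0-9a-fA-F]+)\s+(?P<desc>.+)"
)


def parse_products(text: str) -> Dict[int, Tuple[str, str]]:
    """Map PCI product ID -> (symbolic name, description)."""
    products = {}
    for line in text.splitlines():
        m = PCIDEVS_PRODUCT.search(line)
        if m:
            products[int(m["id"], 16)] = (m["name"], m["desc"].strip())
    return products


grep_output = """\
pcidevs:product MARVELL2 88SE91XX   0x91a3  88SE91XX SATA
pcidevs:product MARVELL2 88SE9215   0x9215  88SE9215 SATA
pcidevs:product MARVELL2 88SE9220   0x9220  88SE9220 SATA
pcidevs:product MARVELL2 88SE9230   0x9230  88SE9230 SATA
pcidevs:product MARVELL2 88SE9235   0x9235  88SE9235 SATA
"""

products = parse_products(grep_output)
# Only this product name shows up in the ahcisata_pci.c grep hit above.
explicitly_matched = {"88SE91XX"}

name, desc = products[0x9230]
print(name, "known to pcidevs;",
      "explicitly matched" if name in explicitly_matched
      else "not explicitly matched")
```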

I'm probably going to try nonetheless (how hard can it be to add???), but 
any hints based on experience would be useful.


--
Stephen



dmesg | grep -c "not configured" = 240...

2018-02-19 Thread Stephen Borrill

So I've just got a Lenovo ThinkSystem SR630 and:
# dmesg | grep -c "not configured"
   240

http://www.netbsd.org/~sborrill/sr630.dmesg.txt

Main issues are missing Ethernet (Intel X722) and RAID controller:
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 0 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 1 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 2 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 3 not configured
vendor 1000 product 0016 (RAID mass storage, revision 0x01) at pci11 dev 0 function 0 not configured
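
A bare count does not say which devices are responsible, so a per-device breakdown is more useful. The following is a rough sketch only: it assumes each "not configured" message sits on a single unwrapped line in the format shown above, and the function name is made up.

```python
import re
from collections import Counter

# Assumed message format, taken from the lines quoted above.
NOT_CONFIGURED = re.compile(
    r"vendor (?P<vendor>[0-9a-f]{4}) product (?P<product>[0-9a-f]{4}) "
    r"\((?P<desc>[^,]+), revision [^)]*\).* not configured"
)


def summarise(dmesg_text: str) -> Counter:
    """Count unconfigured devices per (vendor, product, description)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = NOT_CONFIGURED.search(line)
        if m:
            counts[(m["vendor"], m["product"], m["desc"])] += 1
    return counts


sample = """\
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 0 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 1 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 2 not configured
vendor 8086 product 37d2 (ethernet network, revision 0x09) at pci7 dev 0 function 3 not configured
vendor 1000 product 0016 (RAID mass storage, revision 0x01) at pci11 dev 0 function 0 not configured
"""

for (vendor, product, desc), n in summarise(sample).most_common():
    print(f"{n:4d}  {vendor}:{product}  {desc}")
```

Fed the full dmesg instead of the five sample lines, this would show how many of the 240 are the same device repeated across functions.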

msaitoh@ - have you looked at the Intel X722 gigabit controllers?

As for the RAID controller, we are missing support for all recent 
LSI/Symbios/Avago/Broadcom controllers meaning no support for lots of 
servers from Lenovo/HP, etc. OpenBSD's mfii supports most of these:


https://www.precedence.co.uk/wiki/Support-KB-IBM/PCIIDs

NetBSD has extended mfi to support a few variants, but OpenBSD has split 
the driver into mfi and mfii which makes porting more tricky.


I tried OpenBSD 6.2 (last release), but the support for the RAID 
controller in this server was added after 6.2. On OpenBSD:

# dmesg | grep -c "not configured"
350

--
Stephen



Re: DHCP client: dhclient vs dhcpcd ?

2018-02-01 Thread Stephen Borrill

On Thu, 1 Feb 2018, Thomas Mueller wrote:

On Wed, Jan 31, 2018 at 1:18 PM, KIRIHARA Masaharu  wrote:

NetBSD has two DHCP clients; dhclient(8) and dhcpcd(8).
What's the difference?
Which is better to use?


On Wed, 31 Jan 2018 13:47:42 +0100, Benny Siegert responded:


I agree that this is confusing. dhclient is the older tool, while
dhcpcd has been created by a NetBSD developer, is newer and smaller. I
have run into situations (on Google Compute Engine for instance) where
dhclient was unable to interpret some of the more modern DHCP
features.



I recommend using dhcpcd :)


I have read about NetBSD planning to drop dhclient in favor of dhcpcd.

I have had installations where dhcpcd succeeded where dhclient failed, and 
(7.99.1 amd64) where dhclient succeeded where dhcpcd failed.

Failure means not being able to set up the internet connection even if the 
command ran without error messages.

I have also had a situation where neither dhcpcd nor dhclient could establish 
the internet connection, but I was able to connect by using ifconfig and route 
directly.

I notice NetBSD's dhclient is very big while FreeBSD's dhclient is much 
smaller, like

$ ls -l /sbin/dhclient
-r-xr-xr-x  1 root  wheel  100056 Jul 31  2017 /sbin/dhclient
$ ls -l /media/zip0/sbin/dh*
-r-xr-xr-x  1 root  wheel  5352184 Jun 20  2017 /media/zip0/sbin/dhclient
-r-xr-xr-x  1 root  wheel 6221 Jun 20  2017 /media/zip0/sbin/dhclient-script
-r-xr-xr-x  1 root  wheel   299176 Jun 20  2017 /media/zip0/sbin/dhcpcd

running from FreeBSD 11.1-STABLE where /media/zip0 is mount point for NetBSD 
8.99.1 installation.


Interesting. NetBSD-5/i386:
# ls -l /sbin/dhclient
-r-xr-xr-x  1 root  wheel  353002 May 21  2010 /sbin/dhclient

NetBSD-7/amd64:
# ls -l /sbin/dhclient
-r-xr-xr-x  1 root  wheel  5056282 Jan 16 14:27 /sbin/dhclient

Both dynamically linked, not stripped.

--
Stephen



Re: The NPF firewall leaks! (was Re: in_cksum: out of data)

2016-12-06 Thread Stephen Borrill

On Tue, 6 Dec 2016, Tom Ivar Helbekkmo wrote:

Tom Ivar Helbekkmo  writes:


So far, I have just one improvement suggestion for npf: the ability to
use sets instead of singletons in rules is great, but needs to be
extended to letting sets of addresses and networks cross address
families.


I now have one more.  I accidentally created a leak in my npf
configuration, partially caused by looking at the example in the man
page npf.conf(5).

I've got several VLANs, one of them connected to the outside world, and
the others to internal networks with various levels of trust.  To limit
access among them, I've configured npf to handle each VLAN by allowing
all outbound traffic, statefully, while limiting inbound traffic to the
particular connections I want to allow.

The groups typically follow this pattern:

group "vlan10" on $vlan10 {
   pass stateful out final all
   pass in final proto tcp to $somehost port $someservices
   pass in final proto udp to $somehost port $otherservices
   block return in final all
}

Can you spot the vulnerability?

Some of the attack software that probes well-known ports to look for
holes, will respond to a TCP RST by sending a new TCP SYN from the very
same source port.  Guess what npf does then?  :)

Yup, the TCP RST sent by the last line of the above example gets
permitted out by the rule in the first line, updating the connection
state -- and the next connection attempt is permitted.

I had to change the above to this:

group "vlan10" on $vlan10 {
   pass stateful out final proto tcp flags S/SAFR all
   pass out final proto tcp all
   pass stateful out final all
   pass in final proto tcp to $somehost port $someservices
   pass in final proto udp to $somehost port $otherservices
   block return in final all
}

It's fine and all, but I tend to think that the simplistic first version
might automatically expand to the code in the second one.  In fact, the
documentation seems to agree with me:

By default, a stateful rule implies SYN-only flag check ("flags
S/SAFR") for the TCP packets.  It is not advisable to change this
behavior; however, it can be overridden with the flags keyword.

The code or the documentation needs to change.  I vote for the code.  :)


Yep, I found it was pretty easy to naively end up with rules where if I 
added a block it was allowed and if I removed it, it was blocked...
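
To illustrate the sequence Tom describes, here is a toy model. It is not npf code and makes no claim about how npf implements state tracking: it simply assumes that a "pass stateful out" rule records a flow for any outbound packet it passes (or only for SYN packets in the fixed rule set), and that "block return" answers with an RST that goes through the outbound rules. All class and parameter names are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int
    flags: str  # "S" for SYN, "R" for RST


def flow_key(p: Packet) -> frozenset:
    """Direction-independent key for a flow."""
    return frozenset({(p.src, p.sport), (p.dst, p.dport)})


class ToyFirewall:
    def __init__(self, syn_only_state: bool, allowed_ports):
        self.syn_only_state = syn_only_state
        self.allowed_ports = set(allowed_ports)
        self.states = set()

    def outbound(self, p: Packet) -> str:
        # "pass stateful out": record the flow. With syn_only_state the
        # flow is recorded only for SYN packets (the "flags S/SAFR" rule).
        if not self.syn_only_state or p.flags == "S":
            self.states.add(flow_key(p))
        return "pass"

    def inbound(self, p: Packet) -> str:
        if flow_key(p) in self.states:
            return "pass"  # matches existing state
        if p.dport in self.allowed_ports:
            return "pass"
        # "block return": the RST reply traverses the outbound rules.
        self.outbound(Packet(p.dst, p.dport, p.src, p.sport, "R"))
        return "block"


probe = Packet("203.0.113.9", 40000, "192.0.2.10", 23, "S")

leaky = ToyFirewall(syn_only_state=False, allowed_ports=[22])
print("naive rules:", leaky.inbound(probe), leaky.inbound(probe))

fixed = ToyFirewall(syn_only_state=True, allowed_ports=[22])
print("fixed rules:", fixed.inbound(probe), fixed.inbound(probe))
```

In this model the naive rule set blocks the first probe and passes the retry from the same source port, while the SYN-only variant blocks both.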


--
Stephen



Re: Lenovo T500 hang - down to DRM

2016-06-10 Thread Stephen Borrill

On Fri, 10 Jun 2016, Stephen Borrill wrote:

On Sun, 5 Jun 2016, Michael van Elst wrote:

On Sat, Jun 04, 2016 at 04:15:15PM +0100, Stephen Borrill wrote:


Under Windows it is cool and I get 6-hour battery life...


Does it have an ATI graphics card and also Intel graphics?


Yes. And that's the root of the hanging problem.

If Integrated graphics (Intel) or Discrete graphics (ATI) are selected in the 
BIOS, the machine boots.


With Intel, DRM attaches very early on:
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: vendor 0x8086 product 0x2a40 (rev. 0x07)
agp0 at pchb0: G4X-family chipset
agp0: detected 32252k stolen memory
agp0: aperture at 0xd000, size 0x1000
i915drmkms0 at pci0 dev 2 function 0: vendor 0x8086 product 0x2a42 (rev. 0x07)

drm: Memory usable by graphics device = 512M
drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
drm: Driver supports precise vblank timestamp query.
i915drmkms0: interrupting at ioapic0 pin 16 (i915)
intelfb0 at i915drmkms0
i915drmkms0: info: registered panic notifier

With ATI, DRM attaches much, much later on:
pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
drm: initializing kernel modesetting (RV635 0x1002:0x9591 0x17AA:0x2117).
drm: register mmio base: 0xcfff
drm: register mmio size: 65536
drm kern info: ATOM BIOS: M86M
radeon0: info: VRAM: 256M 0x - 0x0FFF (256M used)
radeon0: info: GTT: 512M 0x1000 - 0x2FFF

Note that this is around the point where the machine hangs. If Switchable 
graphics is enabled in the BIOS (which is the default it likes to keep 
resetting back to), the hang occurs. My theory is that i915drmkms0 attaches 
early and then the later probe for radeon0 (and perhaps even trying to double 
up on DRM?) is causing the hang. When the device is hidden or disabled by the 
BIOS, it is OK.


Between RC2 (OK) and RC3 (hang) there were a number of changes to the radeon 
and i915 drm code. Not had chance to test which yet.


dmesgs here:
http://dmesgd.nycbug.org/index.cgi?do=view=2980
http://dmesgd.nycbug.org/index.cgi?do=view=2981

--
Stephen



Re: Lenovo T500 hang - down to DRM

2016-06-10 Thread Stephen Borrill

On Sun, 5 Jun 2016, Michael van Elst wrote:

On Sat, Jun 04, 2016 at 04:15:15PM +0100, Stephen Borrill wrote:


Under Windows it is cool and I get 6-hour battery life...


Does it have an ATI graphics card and also Intel graphics?


Yes. And that's the root of the hanging problem.

If Integrated graphics (Intel) or Discrete graphics (ATI) are selected in 
the BIOS, the machine boots.


With Intel, DRM attaches very early on:
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: vendor 0x8086 product 0x2a40 (rev. 0x07)
agp0 at pchb0: G4X-family chipset
agp0: detected 32252k stolen memory
agp0: aperture at 0xd000, size 0x1000
i915drmkms0 at pci0 dev 2 function 0: vendor 0x8086 product 0x2a42 (rev. 0x07)

drm: Memory usable by graphics device = 512M
drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
drm: Driver supports precise vblank timestamp query.
i915drmkms0: interrupting at ioapic0 pin 16 (i915)
intelfb0 at i915drmkms0
i915drmkms0: info: registered panic notifier

With ATI, DRM attaches much, much later on:
pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
drm: initializing kernel modesetting (RV635 0x1002:0x9591 0x17AA:0x2117).
drm: register mmio base: 0xcfff
drm: register mmio size: 65536
drm kern info: ATOM BIOS: M86M
radeon0: info: VRAM: 256M 0x - 0x0FFF (256M used)
radeon0: info: GTT: 512M 0x1000 - 0x2FFF

Note that this is around the point where the machine hangs. If Switchable 
graphics is enabled in the BIOS (which is the default it likes to keep 
resetting back to), the hang occurs. My theory is that i915drmkms0 
attaches early and then the later probe for radeon0 (and perhaps even 
trying to double up on DRM?) is causing the hang. When the device is 
hidden or disabled by the BIOS, it is OK.


Between RC2 (OK) and RC3 (hang) there were a number of changes to the 
radeon and i915 drm code. Not had chance to test which yet.


--
Stephen




Re: Lenovo T500 hang

2016-06-04 Thread Stephen Borrill

On Fri, 3 Jun 2016, Michael van Elst wrote:

net...@precedence.co.uk (Stephen Borrill) writes:


Does it run at a sensible temperature? Mine runs very hot and runs the
battery down very quickly.


Probably needs some cleaning and/or there is a problem with the
fan and heat-pipes.


Under Windows it is cool and I get 6-hour battery life...

--
Stephen



Re: Lenovo T500 hang

2016-06-03 Thread Stephen Borrill

On Thu, 2 Jun 2016, Roy Marples wrote:

On 2016-06-01 10:54, Stephen Borrill wrote:

Somewhere after 7.0_RC2, a problem started where the machine hangs
right at the end of the kernel boot just before it prints the boot
device:

pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
*** HANGS HERE ***
boot device: wd0
root on wd0a dumps on wd0b

The only thing to do is to power the machine off. This happens with
all later kernels (including -current).

I've not had chance to bisect the sources yet as I need to use the
machine for real work (and running NetBSD 7 on it also makes it so hot
it burns my legs).

I think a few users (and developers) have Lenovo T500 laptops. Does
any recent NetBSD work for them?


I have a T500 which is my main NetBSD dev machine.


Do you have a dmesg? Perhaps yours is a model without switchable graphics 
or the 3G modem, etc.


I haven't updated the kernel in a month or so, but it's been running -current 
very well.


Does it run at a sensible temperature? Mine runs very hot and runs the 
battery down very quickly.



I don't recall if it ever ran anything earlier than a 7.0 release though.


I've been using mine since netbsd-5 and this problem has only just started.

--
Stephen



Lenovo T500 hang

2016-06-01 Thread Stephen Borrill
Somewhere after 7.0_RC2, a problem started where the machine hangs right 
at the end of the kernel boot just before it prints the boot device:


pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
*** HANGS HERE ***
boot device: wd0
root on wd0a dumps on wd0b

The only thing to do is to power the machine off. This happens with all 
later kernels (including -current).


I've not had chance to bisect the sources yet as I need to use the machine 
for real work (and running NetBSD 7 on it also makes it so hot it burns my 
legs).


I think a few users (and developers) have Lenovo T500 laptops. Does any 
recent NetBSD work for them?


--
Stephen



Re: USB scanners and PR 50340

2016-03-21 Thread Stephen Borrill

On Fri, 18 Mar 2016, Gary Duzan wrote:

=>Dave Tyson  writes:
=>
=>> I note that PR 50340 has been closed and with the latest pkgsrc
=>>under current (amd64) my Mustek 1200 UB scanner seems to work OK
=>>- but I have to comment out the uscanner device in the kernel and use
=>>it as a ugen device. It seems that this is the 'new world order'
=>>and the sane backend code to handle uscanner devices is deprecated.
=>>Given this is the case is there any point in still keeping the
=>>
=>> uscanner* at uhub? port ?
=>>
=>> in GENERIC?
=>
=>Quite possibly we should remove (comment out) uscanner in GENERIC.
=>ulpt is more controversial, but cups wants to use libusb too.
=>
=>> I am of the same opinion as the PR originator that it is easier
=>>to control access permissions with a uscanner device rather than
=>>having to open up a whole raft of ugen devices, but I guess the
=>>sane developers feel that using libusb makes support easier...
=>
=>Perhaps if we had something called uscanner that would match scanners
=>and that libusb would find, we could have the permissions management of
=>direct matching but the cope-with-the-rest-of-the-world benefit of
=>libusb.

  Can we not build some sort of bus-like device to which both the
specialized and generic devices can attach which prevents opening
both at the same time?


An alternative is to have a method to detach the kernel driver so that you 
can revert to ugen access (and probably a method to reattach too). This 
applies to all USB devices (e.g. uvideo, umass, etc.). libusb has the 
following API, but we don't have the kernel support for it.



int libusb_kernel_driver_active(libusb_device_handle *dev, int interface_number)
	Determine if a kernel driver is active on an interface.

int libusb_detach_kernel_driver(libusb_device_handle *dev, int interface_number)
	Detach a kernel driver from an interface.

int libusb_attach_kernel_driver(libusb_device_handle *dev, int interface_number)
	Re-attach an interface's kernel driver, which was previously
	detached using libusb_detach_kernel_driver().

int libusb_set_auto_detach_kernel_driver(libusb_device_handle *dev, int enable)


Being able to detach kernel drivers would allow for USB remoting (e.g. 
http://usbip.sourceforge.net/ or Citrix Receiver). It would aid 
development of drivers with rump too.


--
Stephen



Re: MegaRAID 3008/3108

2015-06-05 Thread Stephen Borrill

On Thu, 4 Jun 2015, David Brownlee wrote:

On 4 June 2015 at 08:03, Frank Kardel kar...@netbsd.org wrote:

On 06/03/15 20:27, Christos Zoulas wrote:


In article 20150603111042.4fad14b2@taliesin-2.local,
Harry Waddell  wadd...@caravaninfotech.com wrote:


On Tue, 2 Jun 2015 16:13:07 +0100 (BST)
Stephen Borrill net...@precedence.co.uk wrote:


Anyone working on adding support for SYMBIOS MEGARAID 3108 (0x1000/0x005d)
or 3008 (0x1000/0x005f)? These are supported in OpenBSD by the mfii driver
which also supports the MEGARAID 2208 (0x1000/0x005b). In NetBSD, the
mfi(4) driver was extended to support the 2208 (Thunderbolt) rather than
adding a new driver. The 3008/3108 will require another MFI_IOP type
(OpenBSD call it 25).

--
Stephen


I have a system with this on the motherboard, but I'm dropping an lsi
9261-i8 in because the newer cards are not supported. My vendor has told
me that the 9261 is near EOL, so it would be really helpful if someone
could add support for the newer LSI cards. Unfortunately, I don't have
much experience in this area.


Shouldn't be too hard to do... As long as someone has a card to test...

christos


One of our customer systems (Dell PowerEdge R730) has this card. I got it to
work by adding the pciids to the driver and crudely adjusting the
thunderbolt support to use EOM markers and removing the setting of a flag.
I/O seemed to be working (installation was OK and the system was running
fine). Issues left were: abysmal I/O performance on SSDs (no non-SSDs were
available) in the range of 5 - 40 MB/sec, averaging around 20 MB/sec.
Checking other OSes delivered: FreeBSD 10 - 5 MB/sec, OpenBSD - 420 MB/sec
slowly decreasing, Linux SuSe 13.2 - 525-490 MB/sec. So, due to time
constraints and this being a customer machine, we went for the fastest.
Patches (mis-using the MFI_IOP type for thunderbolt) have been posted
already. OpenBSD seems to have an additional change in the way I/O commands
are handled.


Would it help to get a card or machine into the hands of someone with
time to work on the driver? Maybe something like
http://www.ebay.co.uk/itm/251453703470 ?

I'm happy to throw something into the pot :)


I'm interested in IBM ServeRAID M5210 (3108):
http://www.redbooks.ibm.com/abstracts/tips1069.html

And IBM ServeRAID M1215 (3008):
http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/tips1174.html

I can get these at trade prices if anyone is interested (also happy to 
provide hardware for someone to work with).


--
Stephen



MegaRAID 3008/3108

2015-06-03 Thread Stephen Borrill
Anyone working on adding support for SYMBIOS MEGARAID 3108 (0x1000/0x005d) 
or 3008 (0x1000/0x005f)? These are supported in OpenBSD by the mfii driver 
which also supports the MEGARAID 2208 (0x1000/0x005b). In NetBSD, the 
mfi(4) driver was extended to support the 2208 (Thunderbolt) rather than 
adding a new driver. The 3008/3108 will require another MFI_IOP type 
(OpenBSD call it 25).
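
To make the request concrete, here is a rough sketch of the kind of match table such support would need. It is not taken from mfi(4) or mfii; every identifier starting with example_ is made up for illustration, and only the PCI IDs and the number 25 come from the text above. The value used for the existing Thunderbolt class is an arbitrary placeholder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IOP classes. Only 25 is from the thread (OpenBSD's number). */
enum example_iop {
	EXAMPLE_IOP_TBOLT = 1,	/* 2208 (Thunderbolt); placeholder value */
	EXAMPLE_IOP_25 = 25	/* 3008/3108 */
};

struct example_mfi_id {
	uint16_t vendor;
	uint16_t product;
	enum example_iop iop;
	const char *name;
};

/* PCI vendor/product pairs as quoted in the message above. */
static const struct example_mfi_id example_mfi_ids[] = {
	{ 0x1000, 0x005b, EXAMPLE_IOP_TBOLT, "MegaRAID 2208" },
	{ 0x1000, 0x005d, EXAMPLE_IOP_25, "MegaRAID 3108" },
	{ 0x1000, 0x005f, EXAMPLE_IOP_25, "MegaRAID 3008" },
};

/* Return the table entry for a vendor/product pair, or NULL if unknown. */
const struct example_mfi_id *
example_mfi_lookup(uint16_t vendor, uint16_t product)
{
	size_t i;
	size_t n = sizeof(example_mfi_ids) / sizeof(example_mfi_ids[0]);

	for (i = 0; i < n; i++) {
		if (example_mfi_ids[i].vendor == vendor &&
		    example_mfi_ids[i].product == product)
			return &example_mfi_ids[i];
	}
	return NULL;
}
```

A match routine would call example_mfi_lookup() with the IDs read from PCI config space and pick the command path from the iop field.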


--
Stephen



PXE entry invalid, so PXE boot hangs

2015-03-27 Thread Stephen Borrill
I'm trying to PXE boot an x86 box. The setup works fine on all my other 
kit, but on the problem one I see:


booting netbsd - starting in 0 seconds.
pxe_init: bad cksum (0xbc) for PXENV+ at 0x900d8
PXE BIOS Version 2.1
*hang*

I'm prepared for a flaky BIOS, but it does boot pxelinux and Citrix 
Provisioning Services OK.


Adding a few printfs, I see it found PXENV+ at two locations: 0x900d8 
(rejected as bad checksum) and 0x8bb52. !PXE was found at 0x8baf2. As it's 
PXE BIOS 2.1, it ignores the PXENV+ info. The hang is because it never 
returns from this call:


pxe_call(PXENV_GET_CACHED_INFO);

http://nxr.netbsd.org/xref/src/sys/arch/i386/stand/pxeboot/pxe.c#380

pxelinux uses 5 methods in priority order to find the pxe structure. 
NetBSD only uses a memory scan, which is a combination of its final 2 (it 
calls these plans D and E).
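
For anyone following along, this is roughly what a memory scan of that kind does. It is a simplified sketch, not the code from pxe.c or pxelinux: it works on a buffer rather than real low memory, the example_ names are made up, and the 2-byte step is an assumption chosen only because the addresses reported above are not paragraph aligned. As I understand the PXE 2.1 specification, the !PXE structure has its length in the byte after the signature, and all bytes of the structure must sum to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum the bytes of a structure modulo 256; a valid one sums to zero. */
uint8_t
example_pxe_cksum(const uint8_t *p, size_t len)
{
	uint8_t sum = 0;

	while (len-- > 0)
		sum += *p++;
	return sum;
}

/*
 * Scan a copy of memory for a "!PXE" structure with a good checksum.
 * Returns the offset of the first valid structure, or -1 if none.
 * The 2-byte step is an assumption, see the text above.
 */
long
example_find_bangpxe(const uint8_t *mem, size_t memlen)
{
	size_t off;

	assert(mem != NULL);
	for (off = 0; off + 6 <= memlen; off += 2) {
		size_t slen;

		if (memcmp(mem + off, "!PXE", 4) != 0)
			continue;
		slen = mem[off + 4];		/* StructLength */
		if (slen < 6 || off + slen > memlen)
			continue;		/* implausible length */
		if (example_pxe_cksum(mem + off, slen) != 0)
			continue;		/* bad checksum, keep looking */
		return (long)off;
	}
	return -1;
}
```

Note that a good checksum only says the bytes are self-consistent. A stale copy of the structure left in memory can still pass this test, which is one reason a scan can come back with the wrong entry point.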


pxelinux prints:
!PXE entry point found (we hope) at 8A44:0100 via plan A

http://git.kernel.org/cgit/boot/syslinux/syslinux.git/tree/core/fs/pxe/bios.c?id=a7f5892c4d85f3685708b8efb237c9c73a8b1ddf#n240

These addresses correspond to bangpxe_seg and bangpxe_off:
http://nxr.netbsd.org/xref/src/sys/arch/i386/stand/pxeboot/pxe.c#360

Printing these shows that bangpxe_seg is 0, not 0x8a44. Hardwiring 
bangpxe_seg to 0x8a44 gets the machine booting (well, the kernel panics 
later on, but that's a different story).
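
To spell out why a zero segment is fatal: in real mode the entry point is segment * 16 + offset. A tiny helper (the name is made up for illustration) shows the difference between the address pxelinux reports and the one we end up calling:

```c
#include <stdint.h>

/* Convert a real-mode segment:offset pair to a linear address. */
uint32_t
example_segoff_to_linear(uint16_t seg, uint16_t off)
{
	return ((uint32_t)seg << 4) + off;
}
```

With 8A44:0100 this gives 0x8a540, whereas a segment of 0 with the same offset gives 0x00100, which is nowhere near the PXE entry point.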


Therefore, it looks like the structure found by memory scanning is 
incorrect and perhaps we should implement pxelinux's plans A and B (C being 
the int 0x1a function 0x5650 that we explicitly choose not to support). 
These involve reading pointers from offsets relative to InitStack. What 
does this correspond to in NetBSD?


--
Stephen



Re: Problems after updating from 6.1.4 to 6.1.5

2014-10-20 Thread Stephen Borrill

On Sun, 19 Oct 2014, John Nemeth wrote:

On Oct 20,  6:38am, Paul Goyette wrote:
} On Sun, 19 Oct 2014, John Nemeth wrote:
}
}  } * I have to load the kernel from an external partition using grub, and
}  }thus have to edit grub's menu.lst config file!
}  }
}  } * The booted kernel is independent of what is in /netbsd, so I currently
}  }have to manually gunzip(1) the kernel on the external partition and
}  }put the results in /netbsd
} 
}  Why would you be using grub when you're keeping the kernel
}  outside the NetBSD partition.  All you need to do is add:
} 
}  kernel = path to kernel
} 
}  to your domU config.  xentools is fully capable of loading a NetBSD
}  kernel, including one that is gzipped.  An example from one of my
}  config files is:
} 
}  kernel = /usr/pkg/etc/xen/kernels/netbsd-7-XEN3_DOMU.gz
} 
}  This is not something that should be dependent on your dom0 as
}  xentools is supplied by the Xen project.
}
} Most likely, I don't know enough (approx zero) of the XEN environment to
} get it right.  I'm just following the explicit instructions from my DOM0
} provider.

So, you're using a VPS.  This can change everything depending
on how they do things.  Which raises the question, who is the
provider?

} As a quick summary, I initially boot up a common DOMU which runs some
} variant of Linux.  A customer-specific very small ext2fs partition is
} mounted (based on my login information) on which I put my grub menu.lst
} file and the kernel(s) I need to boot.  This partition is accessible

This tells me that they are most likely using pygrub or pvgrub.
Basically these things dig around in the image for the domU to find
a grub config file, which then tells p[yv]grub what needs to be
extracted (i.e. it will copy the kernel {and initramhd for linux}
out of the image and hand it to Xen).  This also tells me that in
your case, the menu.lst thing is necessary.


A couple of notes on how this relates to XenServer (and perhaps other 
xapi-based providers) - more for the record rather than necessarily of 
immediate relevance:
- if loading a kernel from the dom0, it has to be in a very specific 
location otherwise you get annoyingly vague errors


- pygrub does not need menu.lst if the kernel path is set in the VM 
properties:

         PV-kernel ( RW):
        PV-ramdisk ( RW):
           PV-args ( RW):
    PV-legacy-args ( RW):
     PV-bootloader ( RW): pygrub
PV-bootloader-args ( RW): --kernel=/netbsd

- pygrub does not need a MBR partition table, it will deal with just a 
disklabel with root starting at 0. However, if you've had an MBR partition 
on there in the past, ensure you wipe sector zero, not just the partition 
table otherwise it will spot the 0xaa55 signature and insist on reading 
the non-existent MBR leading to a No partition bot failure.
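
To illustrate the last point: the 0xaa55 signature lives in the final two bytes of sector zero, after the partition table, so clearing only the table leaves it in place. The check below is a sketch of the conventional test, not pygrub's actual code (how pygrub does it internally is an assumption on my part), and the example_ names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_SECTOR_SIZE	512
#define EXAMPLE_MBR_MAGIC	0xaa55

/*
 * Return 1 if the buffer holds a sector ending in the MBR boot
 * signature (0x55 at byte 510, 0xaa at byte 511), otherwise 0.
 */
int
example_has_mbr_signature(const uint8_t *sector, size_t len)
{
	uint16_t magic;

	assert(sector != NULL);
	if (len < EXAMPLE_SECTOR_SIZE)
		return 0;
	/* The signature is stored little-endian. */
	magic = (uint16_t)(sector[510] | (sector[511] << 8));
	return magic == EXAMPLE_MBR_MAGIC;
}
```

The four partition entries occupy bytes 446 to 509, so zeroing just those still leaves a sector for which this returns 1. Zeroing all 512 bytes makes it return 0.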


--
Stephen


Re: KASSERT fail in uvm_page.c in latest -7

2014-10-20 Thread Stephen Borrill

On Sat, 18 Oct 2014, Stephen Borrill wrote:
Just updated my netbsd-7 sources and built and installed a new release from 
it (amd64). Now GENERIC will not boot, when trying to start init it fails the 
KASSERT on line 1226 of uvm/uvm_page.c:


KASSERT(obj == NULL || mutex_owned(obj->vmobjlock))


Apologies for the noise, I think this was down to running an update build 
after the gcc changes so that .o files were mixed between two different 
compilers. Anyway, wiping out obj and doing a clean build fixed it.


--
Stephen



KASSERT fail in uvm_page.c in latest -7

2014-10-18 Thread Stephen Borrill
Just updated my netbsd-7 sources and built and installed a new release 
from it (amd64). Now GENERIC will not boot, when trying to start init it 
fails the KASSERT on line 1226 of uvm/uvm_page.c:


KASSERT(obj == NULL || mutex_owned(obj->vmobjlock))

--
Stephen