FreeBSD 12.2-RC1 poudriere warning: awk: can't open file /sys/param.h

2020-10-07 Thread Mark Martinec

awk: can't open file /sys/param.h



Probably innocent, but reporting just in case:

$ poudriere version
3.3.4

$ freebsd-version
12.2-RC1

# poudriere jail -c -j 122amd64-srv -v 12.2-RC1
[00:00:00] Creating 122amd64-srv fs at /data0/poudriere/jails/122amd64-srv... done
[00:00:01] Using pre-distributed MANIFEST for FreeBSD 12.2-RC1 amd64
[00:00:01] Fetching base for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/ba 173 MB   20 MBps    08s
[00:00:12] Extracting base... done
[00:00:48] Fetching src for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/sr 163 MB 4192 kBps    40s
[00:01:29] Extracting src... done
[00:02:14] Fetching lib32 for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/li  62 MB 3139 kBps    21s
[00:02:35] Extracting lib32... done
[00:02:48] Cleaning up... done
awk: can't open file /sys/param.h
 source line number 1
[00:02:52] Recording filesystem state for clean... done
[00:02:53] Jail 122amd64-srv 12.2-RC1 amd64 is ready to be used
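
The path in the warning, /sys/param.h, looks like the tail of a longer
path whose leading directory expanded to an empty string. That is only a
guess; nothing in this report confirms it. A minimal sketch of a guarded
lookup follows. The function name, and the idea that the value being read
is __FreeBSD_version, are assumptions for illustration and not poudriere's
actual code:

```sh
# Hypothetical helper, not taken from poudriere: print the value of
# __FreeBSD_version from a param.h file, or return an error if the
# file is not readable instead of letting awk complain.
param_h_version() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "param.h not readable: $file" >&2
        return 1
    fi
    awk '$1 == "#define" && $2 == "__FreeBSD_version" { print $3; exit }' "$file"
}
```

A caller could also refuse an empty prefix outright, for example with
param_h_version "${srcdir:?srcdir is empty}/sys/sys/param.h", where
srcdir is a made-up variable name.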


  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: New Xorg - different key-codes

2020-03-11 Thread Mark Martinec

I just updated my laptop from source, and somewhere along the way
the key-codes Xorg sees changed.


Indeed.  This doesn't just affect -CURRENT: it happened to me on
-STABLE last week, so I'm copying that list too.


And a "Down" key now opens and closes a KDE "Application Launcher",
alternating with its original function (which makes editing a frustration).


  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=244354


Mark


Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

2019-12-10 Thread Mark Martinec

2019-12-10 16:35, Marc Branchaud wrote:


On 2019-12-10 9:18 a.m., Mark Martinec wrote:
Commenting on a thread from 2018-12 and from 2019-09-20, with my solution
to the boot problem at the end, in case anyone is still interested.


Thank you very much for this.  A couple of questions:

(1) Why do you say "raw devices for historical reasons"?  Glancing
through the zpool man page and the Handbook, I see nothing
recommending or requiring GPT partitions.


Apparently using raw devices for zpool is now discouraged,
although I don't think it has ever become officially unsupported.



(2) Just to be 100% clear, my 11.3 non-root zpool looks like this:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0

So this is using raw devices.  Are you saying that if I upgrade this
machine to 12 that it won't be able to boot?


It is possible it won't boot under 12, although not necessarily.

Try booting from a 12.0 (or 12.0) memory stick - if that boots,
it is probably a safe bet that it will survive an upgrade.

Of the bunch of machines that I have upgraded from 11.2 to 12,
only three failed to boot under 12.0 loader. There were a couple
of others which upgraded and booted fine even though they
had a zfs pool on raw devices. I never had a problem of
booting on hosts that had zfs pool on a gpt partition.

So it's a lottery: a few raw devices in a zpool seem to do fine,
while many raw devices in a zpool is asking for trouble
under 12.0 and later.
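
To see which pool members are raw devices rather than partitions, the
device column of zpool status can be inspected. A rough sketch, assuming
whole-disk names look like ada2 or da0 while partitions carry a pN suffix
or a gpt/ label prefix; the function name is made up for this example:

```sh
# Hypothetical helper: read `zpool status` output on stdin and print
# pool members that look like whole (raw) disks, e.g. ada2 or da0.
# Names such as ada2p3 or gpt/disk0 are treated as partitions.
raw_vdevs() {
    awk '$2 ~ /^(ONLINE|DEGRADED|OFFLINE|FAULTED|REMOVED|UNAVAIL)$/ &&
         $1 ~ /^(ada|da|nvd)[0-9]+$/ { print $1 }'
}
```

Typical use would be: zpool status storage | raw_vdevs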

  Mark


Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

2019-12-10 Thread Mark Martinec
Commenting on a thread from 2018-12 and from 2019-09-20, with my solution
to the boot problem at the end, in case anyone is still interested.

===

On 2018-11-29 myself wrote:
(after upgrading from 11.2 to 12.0):

While booting, the 'BTX loader' comes up, lists the BIOS drives,
then the spinner below the list comes up and begins turning,
stuttering, and after a couple of seconds it grinds to a standstill
and nothing happens afterwards.
At this point ZFS and the bootstrap loader are supposed to
come up, but they don't.

[...] (on 2018-12-04):

The situation has not changed: the BTX loader lists all BIOS drives
C..J (disk0..disk7), then a spinner starts and gets stuck forever.
It never reaches the 'BIOS 635kB/3537856kB available memory' line.

While trying to restore the old /boot from 11.2, I tried booting
a live image from a 12.0-RC3 memory stick - and the loader got
stuck again, same as when booting from a disk.
So I had to boot from an 11.2 memstick to be able to regain control.


===

2018-12-04, Ian Lepore writes:

  Toomas Soome wrote:
|ok, if you could perform 2 tests:
|1. from loader prompt enter 0x413 0xa000 - @w . cr
|2. on first spinner, press space and type on boot: prompt:
|/boot/loader_4th and see if that will do better
|thanks, toomas

I don't think that will be an option.  If it hasn't gotten to the point
of saying how much BIOS available memory there is, it's only halfway
through loader main() and has hung before getting to interact().

In fact, if that line hasn't printed, but some disk drives have been
listed, it pretty much has to be hung in the "March through the device
switch probing for things" loop. If all the disks are listed, then it
got through that entry in the devsw, and is likely hanging in the
dv_init calls for either the pxedisk or zfsdev devices.


===

2018-12-07 19:08, Willem Jan Withagen wrote:

Ended up more or less in the same situation this afternoon with
freebsd-upgrade to [12.0]-RC3
Boot stops after listing all DOS disks, in a spinner.
So that is no fix.

I booted from USB 11.2 and replaced the /boot/zfs{boot,loader} by the 11.2 ones.

That makes my server again happy.


===

2019-09-19 16:02, Kurt Jaeger wrote:
Subject: Re: Lockdown adaX numbers to allow booting ?

|  Kurt Jaeger writes:
|The problem is that if all 10 disks are connected, the system
|loses track of where it should boot from and fails to boot (serial boot log):
|
|Consoles: internal video/keyboard  serial port
|BTX loader 1.00  BTX version is 1.02
|Consoles: internal video/keyboard  serial port
|BIOS drive C: is disk0
|BIOS drive D: is disk1
|BIOS drive E: is disk2
|BIOS drive F: is disk3
|BIOS drive G: is disk4
|BIOS drive H: is disk5
|BIOS drive I: is disk6
|BIOS drive J: is disk7
|BIOS drive K: is disk8
|BIOS drive L: is disk9
|//
|[...]
|The solution right now is to unplug all disks of the 'bck' pool,
|reboot, and re-insert the data disks after the boot is finished.
|[...]
|No gpart on the bck pool, raw drives.


2019-09-20 17:27, Mark Martinec wrote:
Subject: Re: Lockdown adaX numbers to allow booting ?


This sounds very much like my experience:

  2018-11-29, Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

https://lists.freebsd.org/pipermail/freebsd-stable/2018-November/090129.html
https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090159.html


I now have three SuperMicro machines which are unable to boot after
upgrading 11.2 to 12.0. After unsuccessfully fiddling with boot loaders,
I have reverted two back to 11.2 (which boots and works fine again),
and the third one is now at 12.0 but needs the boot hack as described
by Kurt, i.e. pull out half the disks (of the 'data' pool), boot the
system, plug the disks back in and zfs mount the remaining pool.

Considering that the 11.2 boots and works fine on these machines,
I consider it a btx loader failure and not a BIOS issue.

What is common with these three machines is that they have one pool
on raw devices for historical reasons (not on gpt partitions).
My guess is that the new loader gets confused by these raw disks.


===

Ok, now to my current situation and solution/workaround.

What was common with these hosts (and similar) is that a machine
has more than a couple of disks, with a zfs pool (non-root) on
raw devices (for historical reasons), not on gpt partitions.

Three workarounds seem possible:

- replace the boot loader with the one from 11.2, or

- using the default loader from 12, disconnect a sufficient number
  of data disks, boot, then reconnect the disks and re-attach the pool,

- or my current solution: zpool offline one disk at a time from
  a data pool, wipe it, set up a gpt partition on it and
  put it back into the pool with 'zpool replace', letting it resilver.
  It was a painful and slightly risky procedure (9 hours of
  resilvering each of the s
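
The third workaround can be sketched as a dry run that only prints the
commands for one disk. This is an illustration under assumptions, not the
exact procedure used above: the function name is invented, zpool
labelclear stands in for the unspecified "wipe", and the 1m alignment is
a common choice rather than something stated in the thread.

```sh
# Hypothetical dry-run helper: print (do not execute) the commands for
# moving one raw pool member onto a GPT partition. Arguments: pool name,
# disk name, GPT label. Review each line before running it as root.
gpt_convert_plan() {
    pool=$1 disk=$2 label=$3
    cat <<EOF
zpool offline $pool $disk
zpool labelclear -f /dev/$disk
gpart create -s gpt $disk
gpart add -t freebsd-zfs -a 1m -l $label $disk
zpool replace $pool $disk gpt/$label
zpool status $pool
EOF
}
```

For example, gpt_convert_plan storage ada2 disk2 prints six commands for
the disk ada2. Two caveats: the partition is slightly smaller than the
whole disk, so the replace can be refused if the pool has no slack, and
the old member may have to be named by its GUID as shown by zpool status.
Only one disk should be converted at a time, waiting for the resilver to
finish before starting the next.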

Re: No amdtemp sysctls, AMD Ryzen 5 3600X

2019-11-15 Thread Mark Martinec

On 15/11/2019 3:27 am, Mark Martinec wrote:

Running 12.1-RELEASE-p1 on AMD Ryzen 5 3600X cpu,
but I don't see any temperatures reported in sysctl,
even though amdtemp.ko and amdsmn.ko are loaded
and they don't produce any complaints on loading.


2019-11-15 03:01, Kubilay Kocak wrote:

Resolver of original Ryzen 2 temperature support:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218264
"In Progress" issue for Ryzen 5 support with patch:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239607


I have applied the patch from Bug 239607 and it works now!
Perfect, thanks!


Was committed to head (CURRENT), apparently merged to stable/12
(can't find the MFC commit).
I've updated/retriaged the issue, and asked about a merge to
stable/11, but at this point it looks like it missed the 12.1-RELEASE
window.


It's unfortunate that it missed the 12.1-RELEASE.
Thanks for a quick response!

  Mark


No amdtemp sysctls, AMD Ryzen 5 3600X

2019-11-14 Thread Mark Martinec

Running 12.1-RELEASE-p1 on AMD Ryzen 5 3600X cpu,
but I don't see any temperatures reported in sysctl,
even though amdtemp.ko and amdsmn.ko are loaded
and they don't produce any complaints on loading.

  $ kldstat | fgrep amd
  271 0x82f3 1458 amdtemp.ko
  281 0x82f32000  808 amdsmn.ko

  $ sysctl -a | grep -i tempe
  $

  $ sysctl dev.amdtemp
  sysctl: unknown oid 'dev.amdtemp'
  $

Nov 13 12:07:27 xxx kernel: CPU: AMD Ryzen 5 3600X 6-Core Processor (4100.09-MHz K8-class CPU)
Nov 13 12:07:27 xxx kernel:   Origin="AuthenticAMD"  Id=0x870f10  Family=0x17  Model=0x71  Stepping=0



Motherboard is an ASUS with X570 chipset, latest BIOS.
No obvious errors are reported during booting.

Any additional information that I can provide?
Any suggestions?

  Mark


Re: mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting

2018-12-27 Thread Mark Martinec

2018-12-26 22:26, Terry Kennedy wrote:

The earlier LSI P20 releases were pretty flakey in some cases - try
flashing 20.00.07.00.



Indeed.

I have upgraded LSI SAS2308 firmware from 20.00.02.00 to 20.00.07.00
a week ago, left it running for a while with 11.2, then upgraded again
to 12.0, and the controller is stable now, even with the new mps driver
that came with 12.0.

To recap:

 - mps driver from FreeBSD 11.2 and earlier is stable with SAS2308 firmware
   20.00.02.00 _and_ 20.00.07.00

 - mps driver from FreeBSD 12.0 causes frequent controller resets
   with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that),
   but is stable with 20.00.07.00.
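
Given this recap, it may be worth checking the firmware version before
upgrading to 12.0. A small sketch that pulls the version out of the
driver's log line and compares it with 20.00.07.00; both function names
are invented for this example, and the log format is taken from the lines
quoted in this thread:

```sh
# Hypothetical helper: extract the firmware version from an mps(4)
# log line such as "mps0: Firmware: 20.00.02.00, Driver: ...".
mps_firmware() {
    sed -n 's/.*mps[0-9]*: Firmware: \([0-9.]*\),.*/\1/p' | head -n 1
}

# Succeeds if dotted version $1 is older than $2, comparing the four
# fields numerically, left to right.
version_older() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | head -n 1)" = "$1" ]
}
```

A possible use: fw=$(dmesg | mps_firmware); version_older "$fw" 20.00.07.00 && echo "firmware $fw predates 20.00.07.00"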

Mark





mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting

2018-12-17 Thread Mark Martinec
One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially,
then to RC3 and lastly to 12.0-RELEASE) is suffering severe instability
of a disk controller, which resets itself a couple of times a day, usually
under high disk usage (like poudriere builds, zfs scrub or nightly file
system scans). The same setup was rock-solid under 11.2 (and still/again is).

The disk controller is LSI SAS2308. It has four disks attached as JBODs,
one pair of SSDs and one pair of hard disks, each pair forming its own zpool.
A controller reset can occur regardless of which pair is in heavy use.

The following can be found in logs, just before the machine becomes unusable
(although not always logged, as disks may be dropped before syslog has a
chance of writing anything):

  xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
  xxx kernel: [2382] mps0: Reinitializing controller
  xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
  xxx kernel: [2383] mps0: IOCCapabilities: 5a85c
  xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack

The IOC Fault location is always the same. Apparently the disk controller
resets, all disk devices are dropped and ZFS finds itself with no disks.
The machine still responds to ping, and if logged in during the event and
running zpool status -v 1, zfs reports loss of all devices for each pool:

  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC


(and similar for the other pool)

At this point the machine is unusable and needs to be hard-reset.

My guess is that after the controller resets, disk devices come up again
(according to the report seen on the console, stating 'periph destroyed'
first, then listing full info on each disk) - but zfs ignores them.

I don't see any mention of changes to the mps driver in the 12.0 release
notes, although diff-ing its sources between 11.2 and 12.0 shows plenty of
nontrivial changes.

After suffering this instability for some time, I finally downgraded the OS
to 11.2, and things are back to normal again!

This downgrade path was nontrivial, as I had foolishly upgraded pool features
to what comes with 12.0, so downgrading involved dismantling
both zfs mirror pools, recreating pools without the two new features, and
zfs send/receive copying, while having the machine hang during some of
these operations. Not something for the faint of heart. I know, foolish
of me to upgrade pools after just one day of uptime with 12.0.

Some info on the controller:

kernel: mps0:  port 0xf000-0xf0ff mem 0xfbe4-0xfbe4,0xfbe0-0xfbe3 irq 64 at device 0.0 numa-domain 1 on pci11

kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd

mpsutil shows:

  mps0 Adapter:
Board Name: LSI2308-IT
Board Assembly:
Chip Name: LSISAS2308
Chip Revision: ALL
BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.02.00
Integrated RAID: no


So, what has changed in the mps driver for this to be happening?
Would it be possible to take mps driver sources from 11.2, transplant
them to 12.0, recompile, and use that? Could the new mps driver be
using some new feature of the controller and hitting a firmware bug?
I have resisted upgrading SAS2308 firmware and its BIOS, as it is
working very well under 11.2.

Anyone else seen problems with mps driver and LSI SAS2308 controller?

(btw, on another machine the mps driver with LSI SAS2004 is working
just fine under 12.0)

  Mark


Re: zfsboot@12.0: Shortening read at xxxx from 16 to -479991569

2018-12-13 Thread Mark Martinec

2018-12-13 16:59, Warner Losh wrote:


Do you have any encrypted disks?


Indeed I do, both pools are encrypted.


(although I haven't seen such messages with 11.2, as far as I can tell)

  Mark





zfsboot@12.0: Shortening read at xxxx from 16 to -479991569

2018-12-13 Thread Mark Martinec

On one of my hosts (now running 12.0-RELEASE) the zfsboot shows
this weird negative number, which sounds suspicious:

  Verifying DMI pool Data .
  Shortening read at 3907029152 from 16 to 15
  Shortening read at 7435283708 from 16 to -479991569

  BTX loader 1.0  BTX version is 1.02
  Consoles: ...
  BIOS drive C: is disk0
  ...

The machine boots up normally and is fine, zpool scrub is happy,
so, should I worry? Anything fishy there?

Searching through sources, the message seems to come from
stand/i386/zfsboot/zfsboot.c :

  printf("Shortening read at %lld from %d to %lld\n",
alignlba, alignnb, (zdsk->dsk.size + zdsk->dsk.start) - alignlba);
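
Taking the printf arguments at face value, the last number is the device
end minus the read position, so the end that zfsboot assumed can be
recovered by adding the two printed values. A small sketch of that
arithmetic; the function name is invented, and it assumes the shell does
64-bit arithmetic and that the boot code printed the values correctly:

```sh
# Hypothetical helper: given the "at" value (alignlba) and the "to"
# value from a "Shortening read" line, print the implied
# zdsk->dsk.size + zdsk->dsk.start, since to = (size + start) - alignlba.
implied_end() {
    echo $(( $1 + $2 ))
}

implied_end 3907029152 15          # prints 3907029167
implied_end 7435283708 -479991569  # prints 6955292139
```

A negative "to" value therefore means the read position lies beyond the
end that the code computed for that device.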


Mark


Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

2018-12-04 Thread Mark Martinec

2018-11-29 18:43, Toomas Soome wrote:

I just did push biosdisk updates to stable/12, I wonder if you could
test those bits…


Myself wrote:

Thank you!  I haven't tried it yet, but I wonder whether this fix was
already incorporated into 12.0-RC3, which would make my rescue easier.
Otherwise I can build a stable/12 on another host and transplant
the problematic file(s) to the affected host - if I knew which files
to copy.


2018-12-02 18:59, Toomas wrote:

The files are /boot/loader* binaries - to be exact, check which one is
linked to /boot/loader. I can provide binaries if needed.
[...]
rgds,
toomas


I got a maintenance window today so I tried with the new loader,
and it did not help.

More specifically:

As it comes with 12-RC2, the /boot/loader was hard linked with loader_lua.
Its size is 421888 bytes. So I concentrated on this loader.

I built a fresh stable/12 on another host, and copied the newly
built loader_lua (425984 bytes) to the /boot directory of the affected
host, deleted the file 'loader', and hard-linked loader_lua to loader.
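
Checking which loader binary /boot/loader is hard-linked to can be done
by comparing inode numbers. A minimal sketch, with invented function
names; it assumes all files are on the same file system:

```sh
# Hypothetical helper: print the inode number of a file.
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# Print each candidate that shares an inode with the target, i.e. is
# hard-linked to it (both must live on the same file system).
linked_to() {
    target=$1; shift
    t=$(inode_of "$target")
    for f in "$@"; do
        [ -e "$f" ] && [ "$(inode_of "$f")" = "$t" ] && echo "$f"
    done
    return 0
}
```

For example: linked_to /boot/loader /boot/loader_lua /boot/loader_4th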

The situation has not changed: the BTX loader lists all BIOS drives
C..J (disk0..disk7), then a spinner starts and gets stuck forever.
It never reaches the 'BIOS 635kB/3537856kB available memory' line.

While trying to restore the old /boot from 11.2, I tried booting
a live image from a 12.0-RC3 memory stick - and the loader got
stuck again, same as when booting from a disk.

So I had to boot from an 11.2 memstick to be able to regain control.

  Mark





Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

2018-12-01 Thread Mark Martinec

2018-11-29 18:43, Toomas Soome wrote:

I just did push biosdisk updates to stable/12, I wonder if you could
test those bits…


Thank you!  I haven't tried it yet, but I wonder whether this fix was
already incorporated into 12.0-RC3, which would make my rescue easier.
Otherwise I can build a stable/12 on another host and transplant
the problematic file(s) to the affected host - if I knew which files
to copy.

I also wonder whether today's posting by cksalexan...@q.com on the
freebsd-stable ML titled "FreeBSD-12.0-RC3-i386-disc1.iso does not boot"
could be describing the same problem?

  Mark




Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2

2018-11-29 Thread Mark Martinec

After successfully upgrading three hosts from 11.2-p4 to 12.0-RC2 (amd64,
zfs, bios), I tried my luck with one of our production hosts, and ended up
with a stuck loader after rebooting with a new kernel (after the first
stage of upgrade).

These were the steps, and all went smoothly and normally until a reboot:

  freebsd-update upgrade -r 12.0-RC2
  freebsd-update install
  shutdown -r now

While booting, the 'BTX loader' comes up, lists the BIOS drives,
then the spinner below the list comes up and begins turning,
stuttering, and after a couple of seconds it grinds to a standstill
and nothing happens afterwards.

At this point ZFS and the bootstrap loader are supposed to
come up, but they don't.

This host has two zfs pools: the system pool consists of two SSDs
in a zfs mirror (also holding a freebsd-boot partition each), the
other pool is a raidz2 with six JBOD disks on an LSI controller.
The gptzfsboot in both freebsd-boot partitions is fresh from 11.2,
both zpool versions are up-to-date with 11.2. The 'zpool status -v'
is happy with both pools.

After rebooting from a USB drive and reverting the /boot directory
to a previous version, the machine comes up normally again
with the 11.2-RELEASE-p4.

I found a file init.core in the / directory, slightly predating the
last reboot with a salvaged system - although it was probably not
a cause of the problem, but a consequence of the rescue operation.

It is unfortunate that this is a production host, so I can't play
much with it. One or two more quick experiments I can probably
afford, but not much more. Should I just first wait for the
official 12.0 release? Should I try booting with a 12.0 on USB
and try to import pools? Suggestions welcome.



Now that the /boot has been manually restored to the 11.2 state,
A SECOND QUESTION is about freebsd-update, which still thinks we are
in the middle of an upgrade procedure. Trying now to just update
the 11.2-RELEASE-p4 to 11.2-RELEASE-p5, the fetch complains:

  # uname -a
  FreeBSD xxx 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4
  #
  # freebsd-version
  11.2-RELEASE-p4
  #
  # freebsd-update fetch
  src component not installed, skipped
  You have a partially completed upgrade pending
  Run '/usr/sbin/freebsd-update install' first.
  Run '/usr/sbin/freebsd-update fetch -F' to proceed anyway.

So what is the right way to get rid of all traces of the
unsuccessful upgrade, and make freebsd-update believe we are cleanly
at 11.2-p4?  Removing /var/db/freebsd-update did not help.

  Mark


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-17 Thread Mark Martinec

On 07/08/2018 15:58, Mark Martinec wrote:

Collected, here it is:
  https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2



2018-08-14 11:18, Andriy Gapon wrote:

I see one memory leak, not sure if it's the only one.
It looks like vdev_geom_read_config() leaks all parsed vdev nvlist-s but
the last.  The problem seems to come from r316760.  Before that commit
the function would return upon finding the first valid config, but now
it keeps iterating.

The memory leak should not be a problem when vdev-s are probed
sufficiently rarely, but it appears that with an unhealthy pool the
probing can happen much more frequently (e.g., every time pools are
listed).



Superb, thanks!!!

I have opened a bug report now:

  Bug 230704: All the memory eaten away by ZFS 'solaris' malloc

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230704


Mark


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-13 Thread Mark Martinec

2018-08-13 21:48, Volodymyr Kostyrko wrote:

I've been in the same situation. ZFS, only pool, no ZFS errors.

I think the problem is rather between swapping and ZFS ARC. This host
has a varying load: sometimes it needs more active memory, sometimes
less... This means that the active zone can expand and shrink by about
+-2G of memory (I have 16 GB installed there). The problem is, when a
huge task is idle it doesn't use much active memory and other activity
is pushing its memory to swap. When active runs low and ARC runs >50% of
memory it becomes very hard to make ARC give some memory back. My host
was even brought to the point where it couldn't get tasks back into
memory from swap, because while some pages were being restored from
swap, other pages were being pushed out to swap due to some ARC
activity. Finally the active zone shrinks so much that the host
becomes unresponsive.

About 6 months ago I tried tweaking kernel and swap settings to make
things go the other way. Currently I have `vm.swap_idle_enabled=1` in
/etc/loader.conf and it looks like this solves my problem. The other
interesting things to look at are `vfs.zfs.arc_free_target`,
`vfs.zfs.arc_shrink_shift`, `vfs.zfs.arc_grow_retry`.

Or you can take another route and simply limit the ARC size with
`vfs.zfs.arc_max`.


What you describe is not the same problem as the one I described
in this thread. In my case the ZFS malloc'ed memory ("solaris" zone)
is growing, while the size of the ARC remains capped to a reasonably
low value, and the ARC even shrinks as the "solaris" zone approaches
the memory size.

I too have been bitten previously by the ARC size being reluctant to
shrink. This problem is described here, and is only partially mitigated
now in the 11.? version:

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

The usually suggested workaround is to limit the size of the ARC,
although it would be nice to find a solution to handle ARC UMA
shrinking automatically, like it worked well in FreeBSD 9 but
broke in FreeBSD 10.

Like I said, the problem I described in this thread is different.

  Mark


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.2-R amd64

2018-08-13 Thread Mark Martinec

2018-08-04 21:47, Mark Johnston wrote:

Sorry, I missed that message.  Given that information, it would be
useful to see the output of the following script instead:

# dtrace -c "zpool list -Hp" -x temporal=off -n '
 dtmalloc::solaris:malloc
   /pid == $target/{@allocs[stack(), args[3]] = count()}
 dtmalloc::solaris:free
   /pid == $target/{@frees[stack(), args[3]] = count();}'
This will record all allocations and frees from a single instance of
"zpool list".




2018-08-07 14:58, Mark Martinec wrote:

Collected, here it is:
  https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2




Was there a mention of a defunct pool?


Indeed.
Haven't tried yet to destroy it, so it is only my hypothesis
that a defunct pool plays a role in this leak.

[...]

I have jumped from 10.3 directly to 11.1-RELEASE-p11, so I'm not sure
with exactly which version / patch level the problem was introduced.

Tried to reproduce the problem on another host running 11.2R,
using memory disk (md), created GPT partition on it and a ZFS pool
on top, then destroyed the disk, so the pool was left as UNAVAILABLE.
Unfortunately this did not reproduce the problem, the "zpool list"
on that host does not cause ZFS to leak memory. Must be something
specific to that failed disk or pool, which is causing the leak.
  Mark



More news: in my last posting I said I can't reproduce the issue
on another 11.2 host. Well, it turned out this was only half the truth.

So this is what I did the last time:

  # create a test pool on md
  mdconfig -a -t swap -s 1Gb
  gpart create -s gpt /dev/md0
  gpart add -t freebsd-zfs -a 4k /dev/md0
  zpool create test /dev/md0p1
  # destroy the disk underneath the pool, making it "unavailable"
  mdconfig -d -u 0 -o force

and I reported that the "zpool list" command does not leak memory,
unlike on another host where the problem was first detected.

But in the following days after this, the second machine
started to run out of memory and ground to a standstill after
a couple of days - this now happened three times, until I realized
the same thing was happening here as on the original host.
(the "zpool list" is running periodically as a plugin to a
"telegraf" monitoring)

Sure enough the "zpool list" was leaking "solaris" zone memory
here too, and even in larger chunks (previously by 570, now by about
2k):


  # (while true; do zpool list >/dev/null; vmstat -m | \
  fgrep solaris; sleep 0.5; done) | awk '{print $2-a; a=$2}'
  12224540
  2509
  3121
  5022
  2507
  1834
  2508
  2505

And it's not just the "zpool list" command. The same leak occurs with
"zpool status" and with "zpool iostat", either when explicitly specifying
the defunct pool as argument, or without specifying a pool (implying all).

(but not when a healthy pool is explicitly specified to such a command)

And to confirm the hypothesis: while running the "zpool list" in the
above loop, I destroyed the defunct pool from another terminal, and
the leak immediately vanished (the vmstat -m | fgrep solaris
no longer grew).
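
Since the leak only shows up while a defunct pool is present, a quick
check for such pools may be useful on other hosts. What follows is only a
sketch, not something that was run in this thread: the function name
non_online_pools is made up, and it assumes tab-separated input as
printed by `zpool list -H -o name,health`. Note that it also reports
DEGRADED or FAULTED pools, not just UNAVAIL ones.

```sh
#!/bin/sh
# Read "name<TAB>health" lines, as printed by
#   zpool list -H -o name,health
# and print the names of pools whose health is not ONLINE.
# (non_online_pools is a hypothetical helper name.)
non_online_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 }'
}

# Sample input modelled on the two pools named elsewhere in this thread;
# on a live host, pipe the zpool command above into the function instead.
printf 'floki\tONLINE\nstuff\tUNAVAIL\n' | non_online_pools
```

On a FreeBSD host the live form would be
`zpool list -H -o name,health | non_online_pools`.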

So the only missing link is: why the leak did not start immediately
after revoking the disk and making the pool unavailable, but only
some time later (hours? few days? after a reboot? after running some
other command?).

  Mark







Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-07 Thread Mark Martinec

On Sat, Aug 04, 2018 at 08:38:04PM +0200, Mark Martinec wrote:

2018-08-04 19:01, Mark Johnston wrote:
> I think running "zpool list" is adding a lot of noise to the output.
> Could you retry without doing that?
No, like I said previously, the "zpool list" (with one defunct
zfs pool) *is* the sole culprit of the zfs memory leak.
With each invocation of "zpool list" the "solaris" malloc
jumps up by the same amount, and never ever drops. Without
running it (like repeatedly under 'telegraf' monitoring
of zfs), the machine runs normally and never runs out of
memory, the "solaris" malloc count no longer grows steadily.


2018-08-04 21:47, Mark Johnston wrote:

Sorry, I missed that message.  Given that information, it would be
useful to see the output of the following script instead:

# dtrace -c "zpool list -Hp" -x temporal=off -n '
 dtmalloc::solaris:malloc
   /pid == $target/{@allocs[stack(), args[3]] = count()}
 dtmalloc::solaris:free
   /pid == $target/{@frees[stack(), args[3]] = count();}'

This will record all allocations and frees from a single instance of
"zpool list".



Collected, here it is:

  https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2



Kevin P. Neal wrote:

Was there a mention of a defunct pool?


Indeed.
Haven't tried yet to destroy it, so it is only my hypothesis
that a defunct pool plays a role in this leak.

I've got a machine with 8GB RAM running 11.1-RELEASE-p4 with a single ZFS
pool. It runs zfs list in a script multiple times a minute, and it has
been doing so for 181 days with no reboot. I have not seen any memory
issues.


I have jumped from 10.3 directly to 11.1-RELEASE-p11, so I'm not sure
with exactly which version / patch level the problem was introduced.

Tried to reproduce the problem on another host running 11.2R,
using memory disk (md), created GPT partition on it and a ZFS pool
on top, then destroyed the disk, so the pool was left as UNAVAILABLE.
Unfortunately this did not reproduce the problem, the "zpool list"
on that host does not cause ZFS to leak memory. Must be something
specific to that failed disk or pool, which is causing the leak.

  Mark


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-04 Thread Mark Martinec

2018-08-04 19:01, Mark Johnston wrote:

I think running "zpool list" is adding a lot of noise to the output.
Could you retry without doing that?


No, like I said previously, the "zpool list" (with one defunct
zfs pool) *is* the sole culprit of the zfs memory leak.
With each invocation of "zpool list" the "solaris" malloc
jumps up by the same amount, and never ever drops. Without
running it (like repeatedly under 'telegraf' monitoring
of zfs), the machine runs normally and never runs out of
memory, the "solaris" malloc count no longer grows steadily.

This leak was introduced sometime between 10.3 and 11.1R-p11,
and is still there with 11.2.

  Mark



On Fri, Aug 03, 2018 at 09:11:42PM +0200, Mark Martinec wrote:

More attempts at tracking this down. The suggested dtrace command
usually aborts with:

   Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
 file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
 line 3330.


Hrmm.  As a workaround you can add "-x temporal=off" to the dtrace(1)
invocation.

but with some luck soon after each machine reboot I can leave the dtrace
running for about 10 or 20 seconds (max) before terminating it with a ^C,
and succeed in collecting the report.  If I miss the opportunity to leave
dtrace running just long enough to collect useful info, but not long
enough for it to hit the assertion check, then any further attempt
to run the dtrace script hits the assertion fault immediately.

Btw, (just in case) I have recompiled kernel from source (base/release/11.2.0)
with debugging symbols, although the behaviour has not changed:

   FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238:
 Fri Aug 3 17:29:42 CEST 2018 m...@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64


Anyway, after several attempts I was able to collect a useful dtrace
output from the suggested dtrace script:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
   count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'


while running "zpool list" repeatedly in another terminal screen:


I think running "zpool list" is adding a lot of noise to the output.
Could you retry without doing that?



Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-03 Thread Mark Martinec

More attempts at tracking this down. The suggested dtrace command
usually aborts with:

  Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
  file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
  line 3330.

but with some luck soon after each machine reboot I can leave the dtrace
running for about 10 or 20 seconds (max) before terminating it with a ^C,
and succeed in collecting the report.  If I miss the opportunity to leave
dtrace running just long enough to collect useful info, but not long
enough for it to hit the assertion check, then any further attempt
to run the dtrace script hits the assertion fault immediately.

Btw, (just in case) I have recompiled kernel from source (base/release/11.2.0)
with debugging symbols, although the behaviour has not changed:

  FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238:
  Fri Aug 3 17:29:42 CEST 2018 m...@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64



Anyway, after several attempts I was able to collect a useful dtrace
output from the suggested dtrace script:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
  count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'

while running "zpool list" repeatedly in another terminal screen:

  # (while true; do zpool list -Hp >/dev/null; vmstat -m | fgrep solaris; \
  sleep 0.2; done) | awk '{print $2-a; a=$2}'
454303
570
570
570
570
570
570
570
570
570
570
570
570
570
570

Two samples of the collected dtrace output (after about 15 seconds)
are at:

  https://www.ijs.si/usr/mark/tmp/dtrace1.out.bz2
  https://www.ijs.si/usr/mark/tmp/dtrace2.out.bz2

(the dtrace2.out is probably cleaner, I made sure no other service
 was running except my sshd and syslog)

Not really sure what I'm looking at, but a couple of large entries
stand out:

$ awk '/^ .*[0-9]+ .*[0-9]$/' dtrace2.out | sort -k1n | tail -5
   114688  138
   114688  138
   114688  138
   114688  138
   114688  138

Thanks in advance for looking into it,
  Mark




2018-08-01 09:12, myself wrote:

On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. Also turned off all services.
ZFS is still leaking memory about 30 MB per hour, until the host
runs out of memory and swap space and crashes, unless I reboot it
first every four days.

Any advice before I try to get rid of that faulted disk with a pool
(or downgrade to 10.3, which was stable) ?


2018-08-01 00:09, Mark Johnston wrote:

If you're able to use dtrace, it would be useful to try tracking
allocations with the solaris tag:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
  count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'

Try letting that run for one minute, then kill it and paste the output.

Ideally the host will be as close to idle as possible while still
demonstrating the leak.


Good and bad news:

The suggested dtrace command bails out:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'
dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file
/usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
line 3330.
Abort trap

But I did get one step further, localizing the culprit.

I realized that the "solaris" malloc count goes up in sync with
the 'telegraf' monitoring service polls, which also has a ZFS plugin
which monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp'
periodically.

So after stopping telegraf (and other remaining services),
the 'vmstat -m' shows that InUse count for "solaris" goes up by 552
every time that I run "zpool list -Hp" :

# (while true; do zpool list -Hp >/dev/null; vmstat -m | \
fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
6664427
541
552
552
552
552
552
552
552
552



Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-01 Thread Mark Martinec

On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. Also turned off all services.
ZFS is still leaking memory about 30 MB per hour, until the host
runs out of memory and swap space and crashes, unless I reboot it
first every four days.

Any advice before I try to get rid of that faulted disk with a pool
(or downgrade to 10.3, which was stable) ?


2018-08-01 00:09, Mark Johnston wrote:

If you're able to use dtrace, it would be useful to try tracking
allocations with the solaris tag:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
  count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'


Try letting that run for one minute, then kill it and paste the output.
Ideally the host will be as close to idle as possible while still
demonstrating the leak.


Good and bad news:

The suggested dtrace command bails out:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'
dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file
/usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
line 3330.
Abort trap

But I did get one step further, localizing the culprit.

I realized that the "solaris" malloc count goes up in sync with
the 'telegraf' monitoring service polls, which also has a ZFS plugin
which monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp'
periodically.

So after stopping telegraf (and other remaining services),
the 'vmstat -m' shows that InUse count for "solaris" goes up by 552
every time that I run "zpool list -Hp" :

# (while true; do zpool list -Hp >/dev/null; vmstat -m | \
fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
6664427
541
552
552
552
552
552
552
552
552
556
548
552
552
552
552
552
552
552
552
552
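
The per-invocation growth above comes from differencing the InUse column
(second field) of successive `vmstat -m` samples. Here is a minimal sketch
of just that step, assuming input lines shaped like the "solaris" row of
`vmstat -m`. The function name inuse_deltas and the sample rows are made
up for illustration. Unlike the one-liner above, it skips the first sample
instead of printing the absolute count.

```sh
#!/bin/sh
# Print the change in the InUse column (field 2) between successive
# rows, skipping the first row, which has no predecessor.
# (inuse_deltas is a hypothetical helper name.)
inuse_deltas() {
    awk 'NR > 1 { print $2 - prev } { prev = $2 }'
}

# Illustrative rows only. On a live FreeBSD host the input would come from
#   while true; do zpool list -Hp >/dev/null; \
#     vmstat -m | fgrep solaris; sleep 1; done
printf '%s\n' \
    'solaris 6664427 400000K - 100' \
    'solaris 6664968 400040K - 200' \
    'solaris 6665520 400080K - 300' | inuse_deltas
```

A constant non-zero delta with nothing else running is what points at the
polled command itself as the source of the allocations.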

# zpool list -Hp
floki   68719476736 37354102272 31365374464 -   -   49% 54  1.00x   ONLINE  -
stuff   -   -   -   -   -   -   -   -   UNAVAIL -



  Mark


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-07-31 Thread Mark Martinec

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. Also turned off all services.
ZFS is still leaking memory about 30 MB per hour, until the host
runs out of memory and swap space and crashes, unless I reboot it
first every four days.

Any advice before I try to get rid of that faulted disk with a pool
(or downgrade to 10.3, which was stable) ?

  Mark


2018-07-23 17:12, myself wrote:

After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
(amd64), ZFS is gradually eating up all memory, so that it crashes every
few days when the memory is completely exhausted (after swapping heavily
for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC
to 1.8 GB the machine can now stay up a bit longer, but in four days
all the memory is used up. The machine is lightly loaded, it runs
a bind resolver and a lightly used web server, the ps output
does not show any excessive memory use by any process.

During the last survival period I ran  vmstat -m  every second
and logged results. What caught my eye was the 'solaris' entry,
which seems to explain all the exhaustion.

The MemUse for the solaris entry starts modestly, e.g. after a few
hours of uptime:

$ vmstat -m :
 Type InUse MemUse HighUse Requests  Size(s)
  solaris 3141552 225178K   - 12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

... but this number keeps steadily growing.

After about four days, shortly before a crash, it grew to 2.5 GB,
which gets dangerously close to all the available memory:

  solaris 39359484 2652696K   - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

Plotting the 'solaris' MemUse entry vs. wall time in seconds, one can see
a steady linear growth, about 25 MB per hour. On a fine-resolution small
scale the step size seems to be one small step increase per about 6 seconds.
All steps are small, but not all are the same size.

The only thing (in my mind) that distinguishes this host from others
running 11.1 seems to be that one of the two ZFS pools is down because
its disk is broken. This is a scratch data pool, not otherwise in use.
The pool with the OS is healthy.

The syslog shows entries like the following periodically:

Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354

The 'zpool status -v' on this pool shows:

  pool: stuff
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

NAME                    STATE     READ WRITE CKSUM
stuff                   UNAVAIL      0     0     0
  11732693005294113354  UNAVAIL      0     0     0  was /dev/da2



The same machine with this broken pool could previously survive indefinitely
under FreeBSD 10.3.

So, could this be the reason for memory depletion?
Any fixes for that? Any more tests suggested to perform
before I try to get rid of this pool?

  Mark


All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-07-23 Thread Mark Martinec

After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
(amd64), ZFS is gradually eating up all memory, so that it crashes every
few days when the memory is completely exhausted (after swapping heavily
for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC
to 1.8 GB the machine can now stay up a bit longer, but in four days
all the memory is used up. The machine is lightly loaded, it runs
a bind resolver and a lightly used web server, the ps output
does not show any excessive memory use by any process.

During the last survival period I ran  vmstat -m  every second
and logged results. What caught my eye was the 'solaris' entry,
which seems to explain all the exhaustion.

The MemUse for the solaris entry starts modestly, e.g. after a few
hours of uptime:

$ vmstat -m :
 Type InUse MemUse HighUse Requests  Size(s)
  solaris 3141552 225178K   - 12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768


... but this number keeps steadily growing.

After about four days, shortly before a crash, it grew to 2.5 GB,
which gets dangerously close to all the available memory:

  solaris 39359484 2652696K   - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768


Plotting the 'solaris' MemUse entry vs. wall time in seconds, one can see
a steady linear growth, about 25 MB per hour. On a fine-resolution small
scale
the step size seems to be one small step increase per about 6 seconds.
All steps are small, but not all are the same size.
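
As a rough cross-check of the growth rate, the two MemUse readings quoted
above can be turned into an average. This is only a sketch: the function
name growth_mb_per_hour is made up, and the 96-hour interval is an
assumption standing in for "about four days" (the first reading was
actually taken a few hours after boot, so the true interval is somewhat
shorter).

```sh
#!/bin/sh
# Average growth in MB per hour between two MemUse readings given in KB.
# Usage: growth_mb_per_hour START_KB END_KB HOURS
# (growth_mb_per_hour is a hypothetical helper name.)
growth_mb_per_hour() {
    awk -v s="$1" -v e="$2" -v h="$3" \
        'BEGIN { printf "%.1f\n", (e - s) / 1024 / h }'
}

# 225178K and 2652696K are the two MemUse values from the vmstat -m
# output above; 96 hours is an assumed interval, not a measured one.
growth_mb_per_hour 225178 2652696 96
```

With these inputs the average comes out close to the roughly 25 MB per
hour read off the plot.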

The only thing (in my mind) that distinguishes this host from others
running 11.1 seems to be that one of the two ZFS pools is down because
its disk is broken. This is a scratch data pool, not otherwise in use.
The pool with the OS is healthy.

The syslog shows entries like the following periodically:

Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354


The 'zpool status -v' on this pool shows:

  pool: stuff
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

NAME                    STATE     READ WRITE CKSUM
stuff                   UNAVAIL      0     0     0
  11732693005294113354  UNAVAIL      0     0     0  was /dev/da2


The same machine with this broken pool could previously survive indefinitely
under FreeBSD 10.3.

So, could this be the reason for memory depletion?
Any fixes for that? Any more tests suggested to perform
before I try to get rid of this pool?

  Mark


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2018-05-24 Thread Mark Martinec

Just a short report to a thread I started when 11.1 came out.

This machine would stall in a busy loop while attaching disks
during boot. Rebuilding a kernel with EARLY_AP_STARTUP disabled
avoided the problem. This was the situation throughout the whole
11.1 life cycle (i.e. patch releases did not help).

Today I have upgraded this host to 11.2-BETA2, and it is
no longer necessary to disable EARLY_AP_STARTUP. Good, thanks!

  Mark



2017-07-20 02:03, Mark Johnston wrote:

One thing to try at this point would be to disable EARLY_AP_STARTUP in
the kernel config. That is, take a configuration with which you're able
to reproduce the hang during boot, and remove "options
EARLY_AP_STARTUP".


2017-07-20 15:45, Mark Martinec wrote:
Done. And it avoids the problem altogether! Thanks.
Tried a reboot several times and it succeeds every time.

Here is all that I had in a config file for building a kernel,
i.e. I took away the 'options DDB' which also seemingly avoided
the problem:
  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP


This feature has a fairly large impact on the bootup process and has
had a few problems that manifested as hangs during boot. There was at
least one other case where an innocuous change to the kernel
configuration "fixed" the problem by introducing some second-order
effect (causing kernel threads to be scheduled in a different
order, for instance).

[...]


Should patch releases to stable 11.1 (errata) include fixes for kernel crashes?

2017-11-30 Thread Mark Martinec
Should patch releases to stable 11.1 (errata) include fixes for kernel crashes?


Referring to:
  Bug 59  - 11.1-R crashing in sendfile syscall, as used by a uwsgi process

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=59
https://svnweb.freebsd.org/base?view=revision=323634

So my background story is: I fell for this trick twice now, upgrading
an 11.1-p3 to -p4, and now a -p4 to -p5 -- forgetting that I need a
patched kernel for our web servers (quite a common setup: nginx+uwsgi);
so after each upgrade the crashes returned.

I know, my fault, but I wonder, shouldn't a fix for kernel crashes
end up in a patch release of a stable version, with an errata notice?

  Mark


Re: 11.1 coredumping in sendfile, as used by a uwsgi process

2017-09-12 Thread Mark Martinec

2017-09-12 15:46, Steven Hartland wrote:

Could you post the decoded crash info from /var/crash/...


Using crashinfo(8) I suppose?


I would also create a bug report:
https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base%20System


Done (with additional info):

  Bug 59 - 11.1-R crashing in sendfile syscall, as used by a uwsgi process

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=59


Mark



11.1 coredumping in sendfile, as used by a uwsgi process

2017-09-12 Thread Mark Martinec

A couple of days ago I upgraded an Intel box from FreeBSD 10.3 to
11.1-RELEASE-p1, and reinstalled all the packages, built on the same OS
version.

This host is running the nginx web server with uwsgi as a backend.
The file system is ZFS (recent as of 10.3, zpool not yet upgraded
to new 11.1 features).

Ever since the upgrade, this host has been crashing/rebooting two or three
times per day. The reported crash location is always the same: it is in a
sendfile function (same addresses each time), and the running process is
always uwsgi:


Sep 12 15:03:12 xxx syslogd: kernel boot file is /boot/kernel/kernel
Sep 12 15:03:12 xxx kernel: [22677]
Sep 12 15:03:12 xxx kernel: [22677]
Sep 12 15:03:12 xxx kernel: [22677] Fatal trap 12: page fault while in kernel mode
Sep 12 15:03:12 xxx kernel: [22677] cpuid = 7; apic id = 07
Sep 12 15:03:12 xxx kernel: [22677] fault virtual address = 0xe8
Sep 12 15:03:12 xxx kernel: [22677] fault code = supervisor write data, page not present
Sep 12 15:03:12 xxx kernel: [22677] instruction pointer = 0x20:0x80afefb2
Sep 12 15:03:12 xxx kernel: [22677] stack pointer = 0x28:0xfe02397da5a0
Sep 12 15:03:12 xxx kernel: [22677] frame pointer = 0x28:0xfe02397da5e0
Sep 12 15:03:12 xxx kernel: [22677] code segment = base 0x0, limit 0xf, type 0x1b
Sep 12 15:03:12 xxx kernel: [22677]   = DPL 0, pres 1, long 1, def32 0, gran 1
Sep 12 15:03:12 xxx kernel: [22677] processor eflags = interrupt enabled, resume, IOPL = 0
Sep 12 15:03:12 xxx kernel: [22677] current process = 34504 (uwsgi)
Sep 12 15:03:12 xxx kernel: [22677] trap number = 12
Sep 12 15:03:12 xxx kernel: [22677] panic: page fault
Sep 12 15:03:12 xxx kernel: [22677] cpuid = 7
Sep 12 15:03:12 xxx kernel: [22677] KDB: stack backtrace:
Sep 12 15:03:12 xxx kernel: [22677] #0 0x80aada97 at kdb_backtrace+0x67
Sep 12 15:03:12 xxx kernel: [22677] #1 0x80a6bb76 at vpanic+0x186
Sep 12 15:03:12 xxx kernel: [22677] #2 0x80a6b9e3 at panic+0x43
Sep 12 15:03:12 xxx kernel: [22677] #3 0x80edf832 at trap_fatal+0x322
Sep 12 15:03:12 xxx kernel: [22677] #4 0x80edf889 at trap_pfault+0x49
Sep 12 15:03:12 xxx kernel: [22677] #5 0x80edf0c6 at trap+0x286
Sep 12 15:03:12 xxx kernel: [22677] #6 0x80ec3641 at calltrap+0x8
Sep 12 15:03:12 xxx kernel: [22677] #7 0x80a6a2af at sendfile_iodone+0xbf
Sep 12 15:03:12 xxx kernel: [22677] #8 0x80a69eae at vn_sendfile+0x124e
Sep 12 15:03:12 xxx kernel: [22677] #9 0x80a6a4dd at sendfile+0x13d
Sep 12 15:03:12 xxx kernel: [22677] #10 0x80ee0394 at amd64_syscall+0x6c4
Sep 12 15:03:12 xxx kernel: [22677] #11 0x80ec392b at Xfast_syscall+0xfb
Sep 12 15:03:12 xxx kernel: [22677] Uptime: 6h17m57s
Sep 12 15:03:12 xxx kernel: [22677] Dumping 983 out of 8129 MB:..2%..12%..22%..31%..41%..51%..61%..72%..82%..92%Copyright (c) 1992-2017 The FreeBSD Project.
Sep 12 15:03:12 xxx kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Sep 12 15:03:12 xxx kernel: The Regents of the University of California. All rights reserved.
Sep 12 15:03:12 xxx kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Sep 12 15:03:12 xxx kernel: FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017

[...]
Sep 12 15:03:12 xxx savecore: reboot after panic: page fault
Sep 12 15:03:12 xxx savecore: writing core to /var/crash/vmcore.4
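
For comparing the crash location between dumps, the call chain can be pulled
out of such a log mechanically. A minimal sketch in Python follows; the regular
expression and the helper name parse_frames are illustrative assumptions, not
part of any FreeBSD tool, and the sample lines are frames #6 to #9 from the
trace above with the line wrapping undone.

```python
import re

# Frames #6..#9 of the backtrace above, one log entry per line.
LOG = """\
Sep 12 15:03:12 xxx kernel: [22677] #6 0x80ec3641 at calltrap+0x8
Sep 12 15:03:12 xxx kernel: [22677] #7 0x80a6a2af at sendfile_iodone+0xbf
Sep 12 15:03:12 xxx kernel: [22677] #8 0x80a69eae at vn_sendfile+0x124e
Sep 12 15:03:12 xxx kernel: [22677] #9 0x80a6a4dd at sendfile+0x13d
"""

# "#<frame> <address> at <function>+<offset>"
FRAME_RE = re.compile(r"#(\d+) (0x[0-9a-f]+) at (\w+)\+(0x[0-9a-f]+)")


def parse_frames(text):
    """Return (frame_number, function, offset) tuples found in the text."""
    return [(int(number), func, int(offset, 16))
            for number, _addr, func, offset in FRAME_RE.findall(text)]


frames = parse_frames(LOG)
for number, func, offset in frames:
    print(f"#{number} {func}+{offset:#x}")
```

Two crashes with the same list of (function, offset) pairs hit the same code
path, which is what "same addresses each time" amounts to.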


This host with the same services was very stable under 10.3, same ZFS pool.

We have several other hosts running 11.1 with no incidents, running various
services (but admittedly no other host has a comparably busy web server).
Interestingly, nginx has its sendfile feature enabled too, but this does
not cause a crash (on this or other hosts); only the sendfile as used
by uwsgi seems to be the problem.

For the time being I have disabled the use of sendfile in uwsgi; we'll see
if this avoids the trouble.

Suggestions?

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


syslogd include directive reads but disregards all but the last included .conf file

2017-08-28 Thread Mark Martinec
Could somebody please check why the new 'include' 11.1 feature of syslogd
does not work when given more than one file to include...

Any chance of fixing this as a patch release to 11.1?



The 11.1 release brought a very desirable feature to syslogd:

$ man syslog.conf :
  A special include keyword can be used to include all files with names
  ending in '.conf' and not beginning with a '.' contained in the directory
  following the keyword.

but ...

It turns out that of all the *.conf files found in the included
directory /etc/syslog.d, only entries found in the (alphabetically)
*last* file there are taken into account, all other entries in
remaining included files are just ignored.
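
The selection rule quoted from the man page can be written down as a small
filter. This is only a sketch in Python: the function name and the sample file
names are made up for illustration, and sorting the result alphabetically is an
assumption made here to match the "(alphabetically) last" wording above, not
something the man page excerpt states.

```python
def eligible_conf_files(names):
    """File names that match the quoted rule: ending in '.conf' and not
    beginning with a '.'. Sorted alphabetically (an assumption, see above)."""
    return sorted(name for name in names
                  if name.endswith(".conf") and not name.startswith("."))


# Hypothetical contents of /etc/syslog.d
names = ["local.conf", ".hidden.conf", "ftp.conf", "README", "ppp.conf.sample"]
eligible = eligible_conf_files(names)
print(eligible)
```

With the behaviour described in this report, only the entries from the last
name in that list would take effect.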

[...]

Details at:

  Bug 221742
  syslogd include directive reads but disregards all but the last included .conf file

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221742


Mark


Re: [USB] hang after upgrade from 11.0 to 11.1, ZFS or callout() related?

2017-08-07 Thread Mark Martinec

But this is all for 11.0, on 11.1 it hangs and I cannot look it up:


Does it also hang if you choose "Safe mode" in the loader dialog?

  Mark


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

2017-07-24 18:25, Ken Merry wrote:

It is possible that the change I MFCed today (r321207 in head, r321415
in stable/11) is related, but Mark will have to boot his machine with
the fix to see if it makes any difference.

What happened in my case on one particular machine (not on most
machines in our lab running the same code) was that mps_wait_command()
/ mpr_wait_command() would not wait the full 60 seconds for a write to
the DPM (Driver Persistent Mapping) table in the controller.
So, it reported that there was a timeout.
[...]
Eliminating bogus timeouts will eliminate most all of the sources of
those panics anyway.


Took r321415 from stable/11 and applied it to 11.1-RC3 - and it makes
no difference to booting: still hangs attempting to attach da0,
with a spinning CPU (according to fan speed).
Booting in safe mode, or with EARLY_AP_STARTUP disabled, avoids the problem.



There is a secondary bug that is still in the mps(4) / mpr(4) drivers
when a timeout does happen — the error recovery code in the
wait_command() routine reinitializes the controller, which clears out
all the commands.  When the wait_command() routine returns, the
command passed in has been freed, but the caller doesn’t know that.
So the caller (it happens in a number of places) dereferences a
pointer to freed memory and the kernel panics.

I’m planning to fix that bug, too, if slm@ doesn’t get to it first,
I’ve just had other bugs to fix first.


No panics in my case, just hangs.

  Mark

Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

Thanks! Tried it, and the message (or a backtrace) does not show
during a boot of a generic (patched) kernel, at least not in
the last 40-lines screen before the hang occurs.
(It also does not show during a "Safe mode" successful boot.)


Btw (may or may not be relevant): after the above experiment
I have rebooted the machine in "Safe mode" (generic kernel,
EARLY_AP_STARTUP enabled by default) - and spent some time
doing non-intensive interactive work on this host (web browsing,
editor, shell, all under KDE) - and after about an hour the
machine froze (clock display not updating, keyboard unresponsive,
console virtual terminals inaccessible) - so I had to reboot.
According to the fan speed the machine was idle.
/var/log/messages does not show anything of interest
before the freeze. All disks are under ZFS.

Can EARLY_AP_STARTUP have an effect also _after_ booting?
This host never hung during normal work when EARLY_AP_STARTUP
was disabled (or with 11.0 and earlier).

  Mark


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

2017-07-24 04:15, Mark Johnston wrote:

Could you try re-enabling EARLY_AP_STARTUP, applying the patch at the
end of this email, and see if the message "sleeping before eventtimer
init" appears in the boot output? If it does, it'll be followed by a
backtrace that might be useful for tracking down the hang. It might
produce false positives, but we'll see.


Thanks! Tried it, and the message (or a backtrace) does not show
during a boot of a generic (patched) kernel, at least not in
the last 40-lines screen before the hang occurs.
(It also does not show during a "Safe mode" successful boot.)

  Mark


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-20 Thread Mark Martinec

2017-07-20 02:03, Mark Johnston wrote:

One thing to try at this point would be to disable EARLY_AP_STARTUP in
the kernel config. That is, take a configuration with which you're able
to reproduce the hang during boot, and remove "options
EARLY_AP_STARTUP".


Done. And it avoids the problem altogether! Thanks.
Tried a reboot several times and it succeeds every time.

Here is all that I had in a config file for building a kernel,
i.e. I took away the 'options DDB' which also seemingly avoided
the problem:
  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP


This feature has a fairly large impact on the bootup process and has
had a few problems that manifested as hangs during boot. There was at
least one other case where an innocuous change to the kernel
configuration "fixed" the problem by introducing some second-order
effect (causing kernel threads to be scheduled in a different
order, for instance).



Regardless of whether the suggestion above makes a difference, it would
be helpful to see verbose dmesgs from both a clean boot and a boot that
hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
assertions that will cause the system to panic when the hang occurs,
making it easier to see what's going on.


Hmmm.
I have now saved a couple of versions of /var/run/dmesg.boot
(in boot_verbose mode) when EARLY_AP_STARTUP is disabled and
the boot is successful. However, I don't know how to capture
such a log when booting hangs, as I have no serial interface
and the boot never completes. All I have is a screen photo
of the last state when a hang occurs (showing ada disks
successfully attached, followed immediately by the attempt
to attach a da disk, which hangs).

  Mark


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-19 Thread Mark Martinec

More news on the matter. As reported yesterday the locally built
kernel with options INVARIANTS and DDB works fine and somehow avoids
the trouble at attaching the da (mps) disks on an LSI controller, so
today I wanted to get back to a reproducible hang - and sure enough,
reverting to the generic kernel as distributed brings back the hang.

So I tried rebuilding the kernel while experimenting with options
like DDB and INVARIANTS.

A locally built GENERIC kernel behaves the same as the original
kernel from the distribution (as installed by freebsd-upgrade),
so no surprises there. It hangs trying to attach the first of the
da disks (after first successfully attaching all the ada disks).
The alt ctrl esc sequence is unable to enter the debugger when the hang
occurs (possibly due to an unresponsive USB keyboard at that time),
even though debug.kdb.break_to_debugger was set to 1 at the
loader prompt. The machine needs the loader "Safe mode" to be able to boot.

Next, a locally built kernel with DDB and INVARIANTS works well
(the remaining options come from an included GENERIC).

Now the funny part: a locally built kernel with just the DDB
option (and the rest included from GENERIC) *also* works well.
Somehow the DDB option makes a difference, even though kernel
debugger is never activated.

To reiterate: at the time of a hang the CPU fan starts revving up,
and the USB keyboard is unresponsive (the scroll lock key does not enter
scroll mode, caps lock and num lock do not toggle their LED indicators,
alt ctrl esc does not activate the kernel debugger). Loader "Safe mode"
avoids the problem (presumably by disabling SMP).

Meanwhile I have successfully upgraded two other similar
hosts from 11.0 to 11.1-RC3, no surprises there (but they do not
have the same disk controller).

Not sure what to try next.

  Mark




Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-18 Thread Mark Martinec

2017-07-18 01:24, Mark Johnston wrote:

Are you able to break into the debugger at this point? Try setting
debug.kdb.break_to_debugger=1 and debug.kdb.alt_break_to_debugger=1 at
the loader prompt, and hit the break key, or the key sequence
CR ~ ctrl-b once the hang occurs. At the debugger prompt, try
"bt" and "show allpcpu" to start.


Thank you for a prompt and good suggestion! I spent an afternoon
fiddling with the machine, with mixed results. Your suggestion to
break into the debugger did not work: there was no reaction to the
break key or to CR ~ ctrl-b.

So I embarked on rebuilding the RC3 kernel with
  options KDB
  options DDB
  options BREAK_TO_DEBUGGER
  options ALT_BREAK_TO_DEBUGGER
  options INVARIANTS
  options INVARIANT_SUPPORT
  options WITNESS
  options WITNESS_SKIPSPIN
but then I realized the break key is mapped to by alt ctrl esc,
which now does break into the debugger - but not as early as where the
holdup occurs.

The WITNESS produced some LOR warnings, but that is probably ok.
I came across a trace just before the problem area, but it flows
by so fast on a vt console and only the last 40 or so lines
remain on the screen (I have a photo), which do not look very
revealing. Unfortunately this machine does not have a serial
interface.

So in my last attempt I rebuilt a kernel with INVARIANTS but
without WITNESS - and now I cannot reproduce the problem, with
or without a "safe mode". What is interesting here is that now
the da0..da3 disks are attached first, and only then the ada
disks - and even within the group of disks on the same
controller their order has been shuffled - no idea what could
have caused it - and it may have avoided the problem by doing so.

Will play some more with this tomorrow...

  Mark



On Tue, Jul 18, 2017 at 01:01:16AM +0200, Mark Martinec wrote:

Upgrading 11.0-RELEASE-p11 to 11.1-RC3 using the usual freebsd-update upgrade
method I ended up with a system which gets stuck while trying to attach
the second set of disks. This happened already after the first phase of
the upgrade procedure (installing and re-booting with a new kernel).

The first set of disks (ada0 .. ada2) are attached successfully, also a
cd0, but then when the first of the set of four (a regular spinning disk)
on an LSI controller is to be attached, the boot procedure just gets
stuck there:
   kernel: ada1: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
   kernel: ada1: Command Queueing enabled
   kernel: ada1: 305245MB (625142448 512 byte sectors)
   kernel: ada2 at ahcich6 bus 0 scbus8 target 0 lun 0
   kernel: ada2:  ATA8-ACS SATA 3.x device
   kernel: ada2: Serial Number OCZ-O1L6RF591R09Z5C8
   kernel: ada2: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
   kernel: ada2: Command Queueing enabled
   kernel: ada2: 114473MB (234441648 512 byte sectors)
   kernel: ada2: quirks=0x1<4K>
   kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0

(stuck here, keyboard not responding, fans raising their pitch,
  presumably CPU is spinning)

[...]


The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-17 Thread Mark Martinec
Upgrading 11.0-RELEASE-p11 to 11.1-RC3 using the usual freebsd-update upgrade
method I ended up with a system which gets stuck while trying to attach
the second set of disks. This happened already after the first phase of
the upgrade procedure (installing and re-booting with a new kernel).

The first set of disks (ada0 .. ada2) are attached successfully, also a
cd0, but then when the first of the set of four (a regular spinning disk)
on an LSI controller is to be attached, the boot procedure just gets
stuck there:

  kernel: ada1: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
  kernel: ada1: Command Queueing enabled
  kernel: ada1: 305245MB (625142448 512 byte sectors)
  kernel: ada2 at ahcich6 bus 0 scbus8 target 0 lun 0
  kernel: ada2:  ATA8-ACS SATA 3.x device
  kernel: ada2: Serial Number OCZ-O1L6RF591R09Z5C8
  kernel: ada2: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
  kernel: ada2: Command Queueing enabled
  kernel: ada2: 114473MB (234441648 512 byte sectors)
  kernel: ada2: quirks=0x1<4K>
  kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0

(stuck here, keyboard not responding, fans raising their pitch,
 presumably CPU is spinning)

(instead of the normal continuation like:
  kernel: da0:  Fixed Direct Access SPC-4 SCSI device
  kernel: da0: Serial Number 
  kernel: da0: 600.000MB/s transfers
  kernel: da0: Command Queueing enabled
  kernel: da0: 1907729MB (3907029168 512 byte sectors)
)

The controller for da0 .. da3 is an LSI:

  kernel: mps0:  port 0x4000-0x40ff mem 0xd174-0xd1743fff,0xd130-0xd133 irq 16 at device 0.0 on pci1
  kernel: mps0: Firmware: 14.00.01.00, Driver: 21.02.00.00-fbsd
  kernel: mps0: IOCCapabilities: 185c

[...]
  kernel: mps0: SAS Address for SATA device = a4a4843003d0cf79
  kernel: mps0: SAS Address from SATA device = a4a4843003d0cf79
  kernel: mps0: SAS Address for SATA device = d3d48904eddff0d5
  kernel: mps0: SAS Address from SATA device = d3d48904eddff0d5
[...]
  kernel: mps0: SAS Address for SATA device = 2a021c07585c665b
  kernel: mps0: SAS Address from SATA device = 2a021c07585c665b
  kernel: mps0: SAS Address for SATA device = 2a021c0758637b7c
  kernel: mps0: SAS Address from SATA device = 2a021c0758637b7c

This host in this configuration worked perfectly well with 11.0 and
many older versions of the OS.

After some frustration I found out that the system can boot fine
if a boot loader option "Safe mode" is set. This way I successfully
finished the upgrade procedure (installing world).

Playing with the loader options that "Safe mode" turns on
( /boot/menu-commands.4th ) it seems that kern.smp.disabled=1
is the crucial option, although my attempts at ruling out the remaining
options of "Safe mode" turned out inconclusive - perhaps there
is some randomness or a race involved. Anyway, in "Safe mode" the machine
always boots normally and attaches all disks.

This experience is much like described in:
  https://forums.freebsd.org/threads/56524/
where the poster ended up disabling SMP to be able to have a working host.

It is also somewhat similar to:
  https://lists.freebsd.org/pipermail/freebsd-hackers/2017-July/051258.html
where a FreeBSD 11.1 prerelease only boots on a single-CPU AWS host,
but fails to boot on a 2-core CPU, with various symptoms, including:
  ( https://lists.freebsd.org/pipermail/freebsd-hackers/2017-July/051260.html )

  Feeding entropy: .
  spin lock 0x80db45c0 (smp rendezvous) held by 0xf80004378560
  (tid 100074) too long
  timeout stopping cpus
  panic: spin lock held too long

Please advise, thanks
  Mark


Re: net.inet.udp.log_in_vain strange syslog reports

2017-02-15 Thread Mark Martinec

2017-02-06 18:04, Eric van Gyzen wrote:


On 02/06/2017 10:19, Mark Martinec wrote:

Hope the fix finds its way into 11.1 (or better yet, as a patch level
in 10.0).  Should I open a bug report?


It will quite likely get into 11.1.  As for a 10.x patch, you would have
to ask re@ (I think), but I doubt it.  These messages are really just
informative and can't be used for any filtering, since the source
address could be spoofed.


I meant to say 11.0-p*, but never mind.


In a similar vein, I noticed also the following in our logs,
with net.inet.tcp.log_in_vain=1.

Looks like messages got concatenated somehow:

Jan 25 01:37:53 mildred kernel: TCP: [2607:ff10:c5:509a::10]:26459 to [2001:1470:ff80::80:16]:4911 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:14898 to [2001:1470:ff80::80:16]:5222 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:58022 to [2001:1470:ff80::80:16]:9981 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:10243 tcpflags 0x2; tcp_input: Connection attempt to closedport
Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:30991 to [2001:1470:ff80::80:16]:8554 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:20012 to [2001:1470:ff80::80:16]:8443 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:14166 to [2001:1470:ff80::80:16]: tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:8010 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:47957 to [2001:1470:ff80::80:16]:3460 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:13579 tcpflags 0x2; tcp_input: Connection attempt to closedport
Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:20012 to [2001:1470:ff80::80:16]:9001 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:30651 to [2001:1470:ff80::80:16]:9000 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 12 04:50:58 mildred kernel: TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::80:16]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:35372 to [2001:1470:ff80::80:16]:62078 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:58022 to [2001:1470:ff80::80:16]:9200 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:8181 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:36877 to [2001:1470:ff80::80:16]:7218 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:7071 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:30651 to [2001:1470:ff80::80:16]:9000 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:36877 to [2001:1470:ff80::80:16]:2332 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:7548 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:5986 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 19 02:52:34 mildred kernel: TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::80:16]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:35372 to [2001:1470:ff80::80:16]:62078 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 19 02:52:34 mildred kernel: TCP: [2607:ff10:c5:509a::1]:61788 to [2001:1470:ff80::80:16]:2 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:34680 to [2001:1470:ff80::80:16]:10243 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 19 02:52:34 mildred kernel: TCP: [2607:ff10:c5:509a::1]:41249 to [2001:1470:ff80::80:16]:44818 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:49717 to [2001:1470:ff80::80:16]:8649 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 20 04:49:15 mildred kernel: TCP: [2607:ff10:c5:509a::1]:36877 to [2001:1470:ff80::143:1]:50100 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::143:1]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed port
Jan 20 10:03:52 mildred kernel: TCP: [2607:ff10:c5:509a::10]:31430 to [2001:1470:ff80::143:1]:8099 tcpflags 0x2; tcp_input: Connection
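
Merged entries like the ones above can be picked out of a log by counting how
many "TCP: [" message headers one syslog entry contains. A small Python sketch,
assuming the log format shown here; the helper names are made up for
illustration, the "merged" sample is the first entry above, and the "intact"
sample is a hypothetical well-formed entry built from its second half.

```python
import re

# Every message written by the kernel starts with one "TCP: [" header.
HEADER_RE = re.compile(r"TCP: \[")


def count_messages(entry):
    """Number of kernel TCP messages found inside one syslog entry."""
    return len(HEADER_RE.findall(entry))


def is_merged(entry):
    """True if a single syslog entry holds more than one message."""
    return count_messages(entry) > 1


merged = (
    "Jan 25 01:37:53 mildred kernel: TCP: [2607:ff10:c5:509a::10]:26459 to "
    "[2001:1470:ff80::80:16]:4911 tcpflags 0x2; tcp_input: Connection "
    "attempt to closed TCP: [2607:ff10:c5:509a::10]:14898 to "
    "[2001:1470:ff80::80:16]:5222 tcpflags 0x2; tcp_input: Connection "
    "attempt to closed port"
)
# Hypothetical intact entry, for contrast.
intact = (
    "Jan 25 01:37:53 mildred kernel: TCP: [2607:ff10:c5:509a::10]:14898 to "
    "[2001:1470:ff80::80:16]:5222 tcpflags 0x2; tcp_input: Connection "
    "attempt to closed port"
)

print(is_merged(merged), is_merged(intact))
```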

GELI with integrity verification on swap

2017-02-09 Thread Mark Martinec

After experiencing an unexplained restart on one host (11.0-RELEASE-p7),
which could be tied to a problem with a swap device (swap on a dedicated
gpt partition), I'm investigating options for adding some checksumming
to swap storage.

As I understand that swap on ZFS is not the way to go, and that gmirror
does not provide any checksumming of data, it seems to me the only
option is to use GELI with integrity verification (authentication)
enabled (aalgo).

Following advice in
  https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/swap-encrypting.html
I ended up with the following in /etc/fstab (on a different host, same OS):

  /dev/gpt/sw1.eli none swap sw,sectorsize=4096,aalgo=HMAC/SHA256 0 0
  /dev/gpt/sw2.eli none swap sw,sectorsize=4096,aalgo=HMAC/SHA256 0 0

which seems to work fine, but raises some questions:


1) On the first manual reboot after adding the above options,
there was a kernel panic. Subsequent reboot(s) were successful.
Is there any known problem with using integrity verification
on GELI for swap?


2) During boot the log shows a short flurry of messages like:

kernel: GEOM_ELI: Device gpt/sw1.eli created.
kernel: GEOM_ELI: Encryption: AES-XTS 128
kernel: GEOM_ELI:  Integrity: HMAC/SHA256
kernel: GEOM_ELI: Crypto: software
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 16384 bytes of data at offset 11452985344.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453235200.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453235200.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 4096.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 0.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 65536.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 8192.
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 0.


which, according to the geli(8) man page, could be normal, as these blocks
were never written to beforehand and contain random stuff. As the geli swap
device is supposed to be ephemeral (Flags: ONETIME, W-DETACH, AUTH, W-OPEN),
there is no way to initialize blocks on a swap device on boot. So, are these
messages really safe to ignore?

Which brings us to another, perhaps more important question: what business
does a kernel have READING from a swap device blocks which have never been
written to before by this incarnation of the kernel???


3) Considering that the underlying device is a 4k sectored device, and
that HMAC/SHA256 takes some space (like 11%) on its own, what does it mean
that the provider (gpt/sw1.eli) as well as the consumer (gpt/sw1)
both show sector size 4096 ? Does that mean that all 4k alignment efforts
are wasted when one enables integrity verification on GELI?

  Geom name: gpt/sw1.eli
  State: ACTIVE
  EncryptionAlgorithm: AES-XTS
  KeyLength: 128
  AuthenticationAlgorithm: HMAC/SHA256
  Crypto: software
  Version: 7
  Flags: ONETIME, W-DETACH, AUTH, W-OPEN
  KeysAllocated: 24
  KeysTotal: 24
  Providers:
  1. Name: gpt/sw1.eli
 Mediasize: 11453243392 (11G)
 Sectorsize: 4096
 Mode: r1w1e0
  Consumers:
  1. Name: gpt/sw1
 Mediasize: 12884901888 (12G)
 Sectorsize: 512
 Stripesize: 4096
 Stripeoffset: 0
 Mode: r1w1e1
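
The sizes in this listing can be checked with a little arithmetic. The sketch
below (Python) rests on one assumption that is not stated in this message:
that with authentication enabled every 512-byte sector of the consumer carries
a 32-byte HMAC/SHA256 tag and 480 bytes of data, so that one 4096-byte provider
sector is spread over a whole number of consumer sectors. All constant names
are made up for illustration; the numbers come from the listing.

```python
# Figures taken from the listing above.
CONSUMER_BYTES = 12884901888   # gpt/sw1 Mediasize
CONSUMER_SECTOR = 512          # gpt/sw1 Sectorsize
PROVIDER_SECTOR = 4096         # gpt/sw1.eli Sectorsize

# Assumption: HMAC/SHA256 yields a 32-byte tag, stored in every
# 512-byte consumer sector next to the data.
HMAC_LEN = 32

data_per_consumer_sector = CONSUMER_SECTOR - HMAC_LEN
# Consumer sectors needed to hold one provider sector (ceiling division).
sectors_per_provider_sector = -(-PROVIDER_SECTOR // data_per_consumer_sector)
bytes_per_provider_sector = sectors_per_provider_sector * CONSUMER_SECTOR

provider_sectors = CONSUMER_BYTES // bytes_per_provider_sector
provider_bytes = provider_sectors * PROVIDER_SECTOR
overhead = 1 - provider_bytes / CONSUMER_BYTES

print(sectors_per_provider_sector, bytes_per_provider_sector)
print(provider_bytes, f"{overhead:.1%}")
```

Under that assumption the computed provider size is 11453243392 bytes, the
Mediasize shown for gpt/sw1.eli, and the loss is about one ninth, in line with
the "like 11%" above. It would also mean that each 4096-byte provider sector
occupies 4608 bytes on the consumer, so consecutive provider sectors would not
all start on a 4k boundary of the underlying device.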


Mark


Re: net.inet.udp.log_in_vain strange syslog reports

2017-02-06 Thread Mark Martinec

On 2017-02-02 12:55, Mark Martinec wrote:

11.0-RELEASE-p7, net.inet.udp.log_in_vain=1

The following syslog entries seem to indicate some buffer overruns
in the reporting code (not all log lines are broken, just some).

(the actual failed connection attempts were indeed there,
it's just that the reported IP address is highly suspicious)

  Mark



2017-02-03 20:05, Eric van Gyzen wrote:

There is no buffer overrun, so no cause for alarm.  The problem is
concurrent usage of a single string buffer by multiple threads.  The
buffer is inside inet_ntoa(), defined in sys/libkern/inet_ntoa.c.  In
this case, it is called from udp_input().

Would you like to test the following patch?

Eric


diff --git a/sys/netinet/udp_usrreq.c b/sys/netinet/udp_usrreq.c
index 173c44c..ca2dda1 100644
--- a/sys/netinet/udp_usrreq.c
+++ b/sys/netinet/udp_usrreq.c
@@ -674,13 +674,13 @@ udp_input(struct mbuf **mp, int *offp, int proto)
             INPLOOKUP_RLOCKPCB, ifp, m);
         if (inp == NULL) {
             if (udp_log_in_vain) {
-                char buf[4*sizeof "123"];
+                char src[4*sizeof "123"];
+                char dst[4*sizeof "123"];
 
-                strcpy(buf, inet_ntoa(ip->ip_dst));
                 log(LOG_INFO,
                     "Connection attempt to UDP %s:%d from %s:%d\n",
-                    buf, ntohs(uh->uh_dport), inet_ntoa(ip->ip_src),
-                    ntohs(uh->uh_sport));
+                    inet_ntoa_r(ip->ip_dst, dst), ntohs(uh->uh_dport),
+                    inet_ntoa_r(ip->ip_src, src), ntohs(uh->uh_sport));
             }
             UDPSTAT_INC(udps_noport);
             if (m->m_flags & (M_BCAST | M_MCAST)) {



Thanks, the explanation makes sense and the patch looks good
(mind the TABs).

Running it now, expecting no surprises there.


One minor nit: instead of a hack:

char src[4*sizeof "123"];
char dst[4*sizeof "123"];

it would be cleaner, and in sync with the equivalent code in
sys/netinet6/udp6_usrreq.c, to use the INET_ADDRSTRLEN constant
(from sys/netinet/in.h, value 16):

char src[INET_ADDRSTRLEN];
char dst[INET_ADDRSTRLEN];


Hope the fix finds its way into 11.1 (or better yet, as a patch level
in 11.0).

Should I open a bug report?

  Mark




net.inet.udp.log_in_vain strange syslog reports

2017-02-02 Thread Mark Martinec

11.0-RELEASE-p7, net.inet.udp.log_in_vain=1

The following syslog entries seem to indicate some buffer overruns
in the reporting code (not all log lines are broken, just some).

(the actual failed connection attempts were indeed there,
it's just that the reported IP address is highly suspicious)

  Mark


Connection attempt to UDP 193.2.4.2:53 from 95.87.1521242:26375
Connection attempt to UDP 193.2.4.2:53 from 95.87.1521242:55806
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:54530
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:55504
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:54530
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:49526
Connection attempt to UDP 193.2.4.2:53 from 95.8231520242:56838
Connection attempt to UDP 193.2.4.2:53 from 95.8231520242:32768
Connection attempt to UDP 193.2.4.2:53 from 95.8241523242:5387
Connection attempt to UDP 193.2.4.2:53 from 95.8241523242:54530
Connection attempt to UDP 193.2.4.2:53 from 21.823154.242:46692
Connection attempt to UDP 193.2.4.2:53 from 21.823154.242:32768
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:51931
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:59881
Connection attempt to UDP 193.2.4.2:53 from 212873154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 212873154.242:53937
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:46692
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:52594
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:59639
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:50869
Connection attempt to UDP 193.2.4.2:53 from 19382.1587242:55806
Connection attempt to UDP 193.2.4.2:53 from 19382.1587242:54650
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:49871
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:57807
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:51931
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:52930
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:50869
Connection attempt to UDP 193.2.4.2:53 from 212823152.242:56838
Connection attempt to UDP 193.2.4.2:53 from 212823152.242:32768
Connection attempt to UDP 193.2.4.2:53 from 21.8231521242:63724
Connection attempt to UDP 193.2.4.2:53 from 21.8231521242:55222
Connection attempt to UDP 193.2.4.2:53 from 1948249.230.46:52599
Connection attempt to UDP 193.2.4.2:53 from 1948249.230.46:38496
Connection attempt to UDP 193.2.4.2:53 from 2128235.209.250:43608
Connection attempt to UDP 193.2.4.2:53 from 2128235.209.250:47257
Connection attempt to UDP 193.2.4.2:53 from 19387.1594242:54324
Connection attempt to UDP 193.2.4.2:53 from 19387.1594242:34613
Connection attempt to UDP 193.2.4.2:53 from 2128235.2124180:54377
Connection attempt to UDP 193.2.4.2:53 from 2128235.2124180:50869
Connection attempt to UDP 193.2.4.2:53 from 95.87.1547242:51698
Connection attempt to UDP 193.2.4.2:53 from 95.87.1547242:55222
Connection attempt to UDP 193.2.4.2:53 from 193.2.4.2242:55222
Connection attempt to UDP 193.2.4.2:53 from 19.8241523242:38496
Connection attempt to UDP 193.2.4.2:53 from 19.8241523242:55135
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:50370
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:64533
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:55222
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:56228
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:53424
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:61230
Connection attempt to UDP 193.2.4.2:53 from 212823154.242:59716
Connection attempt to UDP 193.2.4.2:53 from 212823154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:36439
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:60638
Connection attempt to UDP 193.2.4.2:53 from 19387.1521242:59008
Connection attempt to UDP 193.2.4.2:53 from 19387.1521242:35505
Connection attempt to UDP 193.2.4.2:53 from 19.824154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 19.824154.242:30943
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:51752
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:35165
Connection attempt to UDP 193.2.4.2:53 from 95.87.1587242:36439
Connection attempt to UDP 193.2.4.2:53 from 95.87.1587242:57311
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:36439
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:59280
Connection attempt to UDP 193.2.4.2:53 from 19487.154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 19487.154.242:53247
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:35165
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:50473
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:56838
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:63658
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:60637

vt(4) gibberish characters in 11.0 with nvidia

2017-02-02 Thread Mark Martinec

I have recently upgraded two hosts with identical nvidia boards
(GeForce GT 730, fresh driver nvidia-driver-375.26 from ports), one has
been following 11-STABLE every now and then, the other was on 10.3.
So they are now at 11.0-RELEASE-p7 or on a recent 11-STABLE respectively.

The problem now showing on both hosts is that a virtual terminal console
driver (vt(4), no special settings) now shows gibberish character-cells
in an approx 90x24 raster. Character cells are of varying colors, some
textured, so it looks as if the font loaded was just random junk.

The boot screen sequence looks fine up to the moment when the X starts.
The X11 screen (with KDE) is fine too. It's just the ttyv0-ttyv7
consoles which are broken.

Solved my immediate problem by adding hw.vga.textmode=1 to
/boot/loader.conf, so that ttyv consoles now look fine (in fact much
nicer, as the vt fonts are pretty squashed and ugly to my eyes).

What puzzles me is what has changed recently, as both hosts were happily
using vt consoles in graphical mode until the upgrade.

(btw, I do have nvidia-modeset.ko and nvidia.ko loaded)

  Mark


Re: Does building linux packages under poudriere require linux compatibility emulation?

2017-01-14 Thread Mark Martinec

Thanks to all who responded, makes perfect sense now.

Paul Mather wrote:
The only thing you need on the host is to have the linux kernel module
loaded.  (You don't need to have any Linux packages installed there.)
The default setting in /usr/local/etc/poudriere.conf is to have
NOLINUX=yes commented out, i.e., Linux support in Poudriere is enabled
unless you explicitly disable it.  The easiest way to load the linux
kernel module on the host for use with Poudriere is to add it to the
"kld_list" setting in /etc/rc.conf, e.g.,
kld_list="linux"


I do have NOLINUX=yes commented out, as is the default.


2017-01-14 22:45 Timon wrote:

No, it doesn't require the linuxulator to be configured, but it does
require linux.ko (and linux64.ko if your host is amd64) to be loaded.
Poudriere loads linux.ko, but doesn't load linux64. Try this patch:

--- /usr/local/share/poudriere/common.sh.orig
+++ /usr/local/share/poudriere/common.sh
@@ -1686,6 +1686,9 @@ jail_start() {
                 if [ "${arch}" = "i386" -o "${arch}" = "amd64" ]; then
                         needfs="${needfs} linprocfs"
                         sysctl -n compat.linux.osrelease >/dev/null 2>&1 || kldload linux
+                        if [ "${arch}" = "amd64" ]; then
+                                kldload linux64
+                        fi
                 fi
         fi
         [ -n "${USE_TMPFS}" ] && needfs="${needfs} tmpfs"



Great, that seems to do the trick! (actually, I just loaded the linux64
kmodule, did not try to apply the patch). Thanks!

Looks like the poudriere/common.sh needs this patch.
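
Until such a patch lands, a boot-time alternative is to extend the
kld_list line suggested earlier in this thread so that it names both
modules. This is only a sketch and assumes an amd64 host; if kld_list
already lists other modules, append to it rather than replacing it.

```sh
# /etc/rc.conf fragment (sketch): load both Linux ABI modules at boot.
# Assumes an amd64 host; on i386 only "linux" applies.
kld_list="linux linux64"
```

For an already running host, loading the module by hand (kldload linux64),
as I did, has the same effect until the next reboot.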

  Mark


Does building linux packages under poudriere require linux compatibility emulation?

2017-01-13 Thread Mark Martinec

When building packages under poudriere on 11.0-RELEASE-p7 (from a
command line in a terminal window) I'm noticing occasional streams of
diagnostics:

  ELF binary type "3" not known.

which seem to be related to building some linux packages (example below,
parallel builds). Poudriere still reports success for these builds.

The host where poudriere is running does not have linux.ko loaded.

Does building such packages really require the linuxulator to be
configured on the build host?

   Mark


[00:23:56] >> [02][00:00:00] Starting build of www/linux-c6-qt47-webkit
[00:23:57] >> [13][00:00:00] Starting build of textproc/linux-c6-libxml2
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:09] >> [19][00:00:28] Finished build of textproc/linux-c6-aspell: Success
[00:24:11] >> [19][00:00:00] Starting build of devel/qt4-makeqpf
[00:24:11] >> [11][00:00:24] Finished build of security/linux-c6-openssl-compat: Success
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:12] >> [11][00:00:00] Starting build of x11-toolkits/vte
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:16] >> [07][00:00:24] Finished build of graphics/linux-c6-glx-utils: Success
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:17] >> [07][00:00:00] Starting build of devel/qt4-qdoc3
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:19] >> [13][00:00:22] Finished build of textproc/linux-c6-libxml2: Success
[00:24:19] >> [13][00:00:00] Starting build of graphics/goocanvas
[00:24:27] >> [10][00:02:26] Finished build of graphics/sdl_gfx: Success
[00:24:29] >> [10][00:00:00] Starting build of multimedia/mjpegtools
[00:24:34] >> [04][00:02:15] Finished build of devel/linux-c6-devtools: Success
[00:24:35] >> [04][00:00:00] Starting build of devel/linux-c6-ncurses-base



Re: Is System V IPC namespace still shared across jails?

2016-12-13 Thread Mark Martinec

2016-12-13 16:29, Alan Somers wrote:

I've already added support for sysvmsg, sysvsem, and sysvshm to
iocage.  They all default to "new", which means you won't have to do
anything special in your jail config to make postgres work.  You can
find the patch below.  The only reason it hasn't been merged is
because it can't (yet) be made to work correctly on the develop branch
of iocage.  But it works fine on the master branch.

https://github.com/iocage/iocage/pull/370

-Alan


Superb, appreciated!

  Mark




On Tue, Dec 13, 2016 at 8:08 AM, Mark Martinec
<mark.martinec+free...@ijs.si> wrote:

2016-12-12 20:38, Christian Schwarz wrote:


With the new jail parameters, new namespaces for SysV IPC are possible
on FreeBSD 11.

For those ezjail users, add something like this to the jail's config
after creating it using 'ezjail-admin create':

export jail_postgres_parameters="sysvmsg=new sysvsem=new sysvshm=new"

Cheers,
  Christian



Thank you, this is it!
I missed it in the JAIL(8) man page, and it is not mentioned in the
release notes.

Now if only iocage would recognize the sysvmsg, sysvsem, and sysvshm
options:

# iocage set sysvmsg='new' xxx
  ERROR: Unsupported property: sysvmsg!

I guess I should file a bug report.
  Mark



Re: Is System V IPC namespace still shared across jails?

2016-12-13 Thread Mark Martinec

2016-12-12 20:38, Christian Schwarz wrote:

With the new jail parameters, new namespaces for SysV IPC are possible
on FreeBSD 11.

For those ezjail users, add something like this to the jail's config
after creating it using 'ezjail-admin create':

export jail_postgres_parameters="sysvmsg=new sysvsem=new sysvshm=new"

Cheers,
  Christian



Thank you, this is it!
I missed it in the JAIL(8) man page, and it is not mentioned in the
release notes.

Now if only iocage would recognize the sysvmsg, sysvsem, and sysvshm
options:

# iocage set sysvmsg='new' xxx
  ERROR: Unsupported property: sysvmsg!

I guess I should file a bug report.


  Mark




man 8 jail

 ...
 allow.sysvipc
  A process within the jail has access to System V IPC
  primitives.  This is deprecated in favor of the per-
  module parameters (see below).  When this parameter is
  set, it is equivalent to setting sysvmsg, sysvsem, and
  sysvshm all to ``inherit''.
 ...

   sysvmsg
  Allow access to SYSV IPC message primitives.  If set to
  ``inherit'', all IPC objects on the system are visible to this
  jail, whether they were created by the jail itself, the base
  system, or other jails.  If set to ``new'', the jail will have
  its own key namespace, and can only see the objects that it has
  created; the system (or parent jail) has access to the jail's
  objects, but not to its keys.  If set to ``disable'', the jail
  cannot perform any sysvmsg-related system calls.

sysvsem, sysvshm
  Allow access to SYSV IPC semaphore and shared memory primitives,
  in the same manner as sysvmsg.
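
Since the three parameters take the same set of values, a small helper
can generate them together. This is only a sketch: "sysv_params" is a
made-up name, not part of jail(8), ezjail or iocage, and the accepted
values are simply the three listed in the man page excerpt above.

```sh
#!/bin/sh
# Sketch: build the three per-module SysV IPC jail parameters with one mode.
# "sysv_params" is a hypothetical helper name.
sysv_params() {
    case "$1" in
        inherit|new|disable) ;;
        *)
            echo "sysv_params: mode must be inherit, new or disable" >&2
            return 1
            ;;
    esac
    printf 'sysvmsg=%s sysvsem=%s sysvshm=%s\n' "$1" "$1" "$1"
}

sysv_params new
```

The output could then be passed on a jail(8) command line, for example
jail -c name=pg1 path=/jails/pg1 persist $(sysv_params new), where the
jail name and path are placeholders.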


Regarding installation of PostgreSQL in a FreeBSD jail, the web holds
plenty of warnings/advice that each postgres instance should have a
unique UID, otherwise they stumble across each other's feet:

| allow.sysvipc
|   A process within the jail has access to System V IPC primitives.
| In the current jail implementation, System V primitives share a single
| namespace across the host and jail environments, meaning that
| processes within a jail would be able to communicate with (and
| potentially interfere with) processes outside of the jail, and in
| other jails.


Is this still the case in FreeBSD 11.0 ???

I remember hearing rumors that the System V namespace
is no longer (or will no longer be?) shared across jails.
(Couldn't find it being mentioned in release notes.)

  Mark



Is System V IPC namespace still shared across jails?

2016-12-12 Thread Mark Martinec

Regarding installation of PostgreSQL in a FreeBSD jail, the web holds
plenty of warnings/advice that each postgres instance should have a
unique UID, otherwise they stumble across each other's feet:

| allow.sysvipc
|   A process within the jail has access to System V IPC primitives.
| In the current jail implementation, System V primitives share a single
| namespace across the host and jail environments, meaning that
| processes within a jail would be able to communicate with (and
| potentially interfere with) processes outside of the jail, and in
| other jails.


Is this still the case in FreeBSD 11.0 ???

I remember hearing rumors that the System V namespace
is no longer (or will no longer be?) shared across jails.
(Couldn't find it being mentioned in release notes.)

  Mark


Re: Uppercase RE matching problems in FreeBSD 11

2016-11-07 Thread Mark Martinec

2016-11-06 22:49, Stefan Bethke wrote:

So what do I set my LANG and LC variables to?  I do want UTF-8, but I
do also want my scripts to continue to work.  Clearly, en_US.UTF-8 is
not what I want.  Is it C.UTF-8?




Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?


Yes, that is the safest bet. The LANG sets a default, but the
LC_COLLATE, LC_TIME, LC_NUMERIC and LC_MONETARY should better
be set to "C" to overrule the LANG in their domains.

Leave the LC_ALL undefined or empty, as this one overrules
every other locale setting (unless you really want everything
to be set to "C").
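
As a rough sketch of the above for a Bourne-style shell profile (the
choice of en_US.UTF-8 is just the value from the question; any other
UTF-8 locale would be handled the same way):

```sh
# Sketch for ~/.profile: keep UTF-8 as the default locale, but pin the
# categories that most often surprise scripts to the "C" locale.
LANG=en_US.UTF-8
LC_COLLATE=C
LC_TIME=C
LC_NUMERIC=C
LC_MONETARY=C
export LANG LC_COLLATE LC_TIME LC_NUMERIC LC_MONETARY

# LC_ALL would override every setting above, so make sure it is not set.
unset LC_ALL
```

Running locale(1) afterwards shows which value each category ended up with.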

  Mark


Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Mark Martinec

2016-11-06 12:07, Baptiste Daroussin wrote:
Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world
it means AaBb...Z, because there are way more characters than simple
A-Z. In FreeBSD 11 we have Unicode collation instead of falling back on
LC_COLLATE=C, which means ASCII only.

For regexps, for example, one should use the classes [:upper:] or
[:lower:].


It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?)
at "C" when LANG or LC_CTYPE is set to something else, otherwise
unexpected things may happen.
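
For scripts that cannot control the caller's locale, the character
class mentioned above is the more robust fix. A minimal sketch, using
the kern.boottime line from the original report as canned input (the
function name "extract_date" is made up for illustration):

```sh
#!/bin/sh
# Sample input copied from the sysctl output quoted in this thread.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] follows the locale's character classification, not its
# collation order, so lowercase letters are not matched by it.
extract_date() {
    sed -e 's/.*\([[:upper:]].*\)$/\1/'
}

printf '%s\n' "$line" | extract_date
# prints: Nov  5 16:18:34 2016
```

On a live system the input would come from sysctl kern.boottime instead
of the canned string.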

  Mark



On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
I happened to run an old script today that uses sed(1) to extract the
system boot time from the kern.boottime sysctl MIB. On 11.0 this no
longer works as expected:

$ sysctl kern.boottime
kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016

$ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
v  5 16:18:34 2016

sed passes over 'S' and 'N' until it hits 'v', which it considers
uppercase apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it
works as expected:

$ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
Nov  5 16:18:34 2016

Testing every lowercase character separately gives even more
inconsistent results:

$ cat < a
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
> !
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z

Here sed thinks every lowercase character except for 'a' is uppercase!
This differs from the first test where sed did not think 'o' is
uppercase. Again, the above behaves as expected with LANG=C.

Does anyone have any insight into this? This is likely to break a lot
of existing code.



Re: PKG bootstrap FreeBSD 11.0 / VBox NAT problem

2016-10-28 Thread Mark Martinec

On 10/28/16 14:15, Tomasz CEDRO wrote:

Just for the curious. I am testing on VirtualBox (Version 5.1.8
r111374 (Qt5.5.1), macOS 10.12.1 host). Cannot bootstrap PKG on a host
with NAT enabled. I have noticed this problem occurs only when NAT is
enabled in VBox. When I use Bridged interface there is no problem. I
have noticed that outgoing packet following RST response has invalid
checksum. That may be VBox NAT problem..? Maybe someone noticed
similar behavior..

https://www.virtualbox.org/ticket/16126


Same thing here: after upgrading VirtualBox on Windows 10 to 5.0.28,
the 'pkg upgrade' on a FreeBSD 11.0-RELEASE-p2 guest fails with
a 'Connection reset by peer' - either right away, or after downloading
a (random) couple of packages - when using NAT provided by VirtualBox.

This worked well with a previous release of VirtualBox.

Looks like the problem is not specific to FreeBSD:

https://www.virtualbox.org/ticket/16084


  Mark


Re: update.FreeBSD.org unresponsive?

2016-10-12 Thread Mark Martinec

Perhaps it's time to replace Apache httpd/2.2.16 (released 6+ years ago)
running on update.FreeBSD.org with something lighter and more agile
like nginx (or at least with a fresher version of Apache httpd).

The accf_http(9) (with: accept_filter=httpready) may help too.

  Mark


2016-10-12 17:23, Mark Martinec wrote:

Whatever you did, it started to work now normally. Thank you!
(no changes at our side)
  Mark


2016-10-12 16:29, Mark Martinec wrote:
Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3, 10.3-RELEASE-p10)
to 11.0 (using: freebsd-update upgrade -r 11.0-RELEASE), and it seems
the fetch(1) always fails with a timeout. Even a simple
(freebsd-update fetch) in an attempt to bump a 10.3-RELEASE-p9 to
10.3-RELEASE-p10 now fails with a timeout, while previously it worked
reliably and fast.

The interesting thing is that both the ping and ping6 to
update.FreeBSD.org work flawlessly with no packet loss.
I tried it several times yesterday, and again today. [...]



Re: update.FreeBSD.org unresponsive?

2016-10-12 Thread Mark Martinec

Whatever you did, it started to work now normally. Thank you!
(no changes at our side)

  Mark



2016-10-12 16:29, Mark Martinec wrote:
Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3, 10.3-RELEASE-p10)
to 11.0 (using: freebsd-update upgrade -r 11.0-RELEASE), and it seems
the fetch(1) always fails with a timeout. Even a simple
(freebsd-update fetch) in an attempt to bump a 10.3-RELEASE-p9 to
10.3-RELEASE-p10 now fails with a timeout, while previously it worked
reliably and fast.

The interesting thing is that both the ping and ping6 to
update.FreeBSD.org work flawlessly with no packet loss.

I tried it several times yesterday, and again today. Our network
connectivity is otherwise good and fast. To rule out a possibility
of a firewall or routing issue I even tried it from a different
network (different ISP), and also over Hurricane-Electric tunnel.
Same thing over IPv4 or IPv6. Going through a web proxy doesn't
help either.

tcpdump / wireshark shows that the three-way SYN handshake succeeds,
then the client sends a GET, re-sends the packet several times,
but nothing comes back any more. Or sometimes even the SYN handshake
fails to complete.

Tried to use curl to fetch the same file, it fails too:

$ curl -6 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
curl: (7) Failed to connect to update.FreeBSD.org port 80: Operation timed out

$ curl -4 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
curl: (56) Recv failure: Operation timed out


So, do we just need to be patient, or is the update.FreeBSD.org hosed?

  Mark



update.FreeBSD.org unresponsive?

2016-10-12 Thread Mark Martinec

Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3, 10.3-RELEASE-p10)
to 11.0 (using: freebsd-update upgrade -r 11.0-RELEASE), and it seems
the fetch(1) always fails with a timeout. Even a simple
(freebsd-update fetch) in an attempt to bump a 10.3-RELEASE-p9 to
10.3-RELEASE-p10 now fails with a timeout, while previously it worked
reliably and fast.

The interesting thing is that both the ping and ping6 to
update.FreeBSD.org work flawlessly with no packet loss.

I tried it several times yesterday, and again today. Our network
connectivity is otherwise good and fast. To rule out a possibility
of a firewall or routing issue I even tried it from a different
network (different ISP), and also over Hurricane-Electric tunnel.
Same thing over IPv4 or IPv6. Going through a web proxy doesn't
help either.

tcpdump / wireshark shows that the three-way SYN handshake succeeds,
then the client sends a GET, re-sends the packet several times,
but nothing comes back any more. Or sometimes even the SYN handshake
fails to complete.

Tried to use curl to fetch the same file, it fails too:

$ curl -6 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
curl: (7) Failed to connect to update.FreeBSD.org port 80: Operation timed out

$ curl -4 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
curl: (56) Recv failure: Operation timed out


So, do we just need to be patient, or is the update.FreeBSD.org hosed?

  Mark


Ephemeral /var/run and creating port-specific subdir at service startup time

2016-08-31 Thread Mark Martinec

I prefer to have the /var/run file system reside on a tmpfs,
as its contents are small and ephemeral in nature (like
pid files, lock files, sockets), need not be preserved across
reboots, and should not have to depend on any physical disk.

The problem is that some programs/services/ports like to create
their own subdirectory under /var/run. This works fine if such
subdirectory is created (when missing) by their rc.d script,
such as salt, dbus, jenkins, clamav-clamd, isc-dhcpd, kibana.

Unfortunately there are other ports which create a subdirectory
under /var/run at the installation time (pkg install). In this
case their subdirectory is missing on a reboot when /var/run
is re-created afresh, and they fail to start.

So my question is: are such ports (like influxdb, grafana3)
which do not create their subdirectory at a startup time
in error and a bug report is warranted, or am I wrong in
expecting that /var/run may be ephemeral and is such a setup
(as is common in Linux) unsupported?
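
To illustrate what I mean by creating the subdirectory at startup time,
here is a rough sketch of such a step. The helper name "ensure_rundir"
and the path, mode and owner in the usage note are placeholders, not
taken from any actual port.

```sh
#!/bin/sh
# Sketch of a "create the run directory if missing" step for an rc.d script.
# "ensure_rundir" is a hypothetical helper name.
ensure_rundir() {
    _dir=$1
    _mode=${2:-0755}
    _owner=${3:-}

    # Recreate the directory only when it is missing (e.g. after a reboot
    # with /var/run on tmpfs).
    [ -d "$_dir" ] || mkdir -p "$_dir" || return 1
    chmod "$_mode" "$_dir" || return 1

    # Changing ownership normally needs root, so it is optional here.
    if [ -n "$_owner" ]; then
        chown "$_owner" "$_dir" || return 1
    fi
}
```

In an rc.d script this would typically be called from a function hooked
in through rc.subr's start_precmd variable, e.g.
ensure_rundir /var/run/myservice 0750 myuser.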

  Mark


Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system

2016-01-14 Thread Mark Martinec

On 2016-01-14 23:13, Bryan Drewery wrote:

Where / What is the error?

The only example here was fixed in November.


Here is how a fresh svn checkout on a 10-stable
fails in make buildworld when /usr/src is noexec :


CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
(cd /usr/src/lib/libc/tests/net &&  NO_SUBDIR=1 make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS=  PROG=ether_aton_test  DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test   depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)

*** Error code 1

Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1

Stop.
make[6]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1

Stop.
make[5]: stopped in /usr/src/lib/libc/tests
*** Error code 1

Stop.
make[4]: stopped in /usr/src/lib/libc
*** Error code 1

Stop.
make[3]: stopped in /usr/src/lib
*** Error code 1
[...]


The net/gen_ether_subr looks like the same culprit as reported
in 2015-11-26.

Actually ... it seems that taking out the WITH_TESTS="yes" from
/etc/make.conf avoids the problem - although this was not necessary
in 10.2-RELEASE, as far as I can tell.


  Mark




On 1/14/2016 7:42 AM, Mark Martinec wrote:

Prompted by recent security advisories I did a 'make buildworld'
on a fresh svn checkout, only to find out that it seems the 'exec'
mount flag on /usr/src is still required for a successful build.

This wasn't so for 10.2, and I hope it won't become a requirement
in 10.3 - or at least it should be clearly documented in release notes.


  Mark


On 2015-12-07 16:35, Mark Martinec wrote:

So, is this a new state of affairs that /usr/src file system
needs to be mounted exec in order for buildworld to succeed,
or is this an unintended change and I should file a bug report?

  Mark


On 2015-11-26 19:44, Miroslav Lachman wrote:

Mark Martinec wrote on 11/26/2015 19:31:

Up to about a week ago building world on FreeBSD 10.2-STABLE went
just fine. Today after svn update the build fails:


# make buildworld
[...]

CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
(cd /usr/src/lib/libc/tests/net && make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS=  SUBDIR= PROG=ether_aton_test  DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test   depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
*** Error code 1

Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1


It turns out that our file system /usr/src had an "exec" flag
turned off, so now running a command:
   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
fails with "Permission denied".

It would be valuable if building a system on an exec-protected
src file system would continue to be possible.

Not sure if the
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
is the only such new command breaking the build. Anyway, a simple
workaround is to invoke the shell explicitly on the command line
instead of relying on the shebang line, i.e.:

   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr


instead of:

   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr

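The difference between the two invocations can be tried out without a
noexec mount. In the sketch below a file without the execute bit stands
in for a script on a noexec file system (an analogy, not the same
mechanism), and "run_via_sh" and "gen_demo" are made-up names.

```sh
#!/bin/sh
# Sketch: run a script through its interpreter instead of exec()ing the file.
# "run_via_sh" is a hypothetical helper name.
run_via_sh() {
    _script=$1
    shift
    # /bin/sh only needs to read the file; exec() of the file itself is avoided.
    /bin/sh "$_script" "$@"
}

# Demo: a non-executable file stands in for a script on a noexec mount.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "generated: $1"\n' > "$demo_dir/gen_demo"
chmod 644 "$demo_dir/gen_demo"
run_via_sh "$demo_dir/gen_demo" aton_ether_subr.c
rm -rf "$demo_dir"
```

Invoking "$demo_dir/gen_demo" directly would fail with "Permission
denied", much like the gen_ether_subr call in the build log.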

I was puzzled by a similar thing years ago. I was using /var/db and /tmp
mounted with noexec. And then there were some changes. Ports need
/var/db with exec because of some script in /var/db/pkg, and /tmp must
have exec too for buildworld or installworld (I don't remember it
well, now I always do mount -u -o current,exec /tmp before build +
install world and kernel).

Anyway - it would be better to not have these partitions mounted with
exec.




Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system

2016-01-14 Thread Mark Martinec

Prompted by recent security advisories I did a 'make buildworld'
on a fresh svn checkout, only to find out that it seems the 'exec'
mount flag on /usr/src is still required for a successful build.

This wasn't so for 10.2, and I hope it won't become a requirement
in 10.3 - or at least it should be clearly documented in release notes.

  Mark


On 2015-12-07 16:35, Mark Martinec wrote:

So, is this a new state of affairs that /usr/src file system
needs to be mounted exec in order for buildworld to succeed,
or is this an unintended change and I should file a bug report?

  Mark


On 2015-11-26 19:44, Miroslav Lachman wrote:

Mark Martinec wrote on 11/26/2015 19:31:

Up to about a week ago building world on FreeBSD 10.2-STABLE went
just fine. Today after svn update the build fails:


# make buildworld
[...]

CC='cc ' mkdep -f .depend.getprotoent_test -a
-I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd
-I/usr/src/contrib/netbsd-tests -std=gnu99
/usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a
/usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> 
.depend.getprotoent_test

(cd /usr/src/lib/libc/tests/net && make -f
/usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS=  SUBDIR=
PROG=ether_aton_test  DEPENDFILE=.depend.ether_aton_test
.MAKE.DEPENDFILE=.depend.ether_aton_test   depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
/usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr)
failed (Permission denied)
*** Error code 1

Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1


It turns out that our file system /usr/src had an "exec" flag
turned off, so now running a command:
   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
fails with "Permission denied".

It would be valuable if building a system on an exec-protected
src file system would continue to be possible.

Not sure if the
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
is the only such new command breaking the build. Anyway, a simple
workaround is to run shell from a command line instead of as a
shebang, i.e.:

   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr


instead of:

   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr


I was puzzled by a similar thing years ago. I was using /var/db and /tmp
mounted with noexec. And then there were some changes. Ports need
/var/db with exec because of some script in /var/db/pkg, and /tmp must
have exec too for buildworld or installworld (I don't remember it
well; now I always do mount -u -o current,exec /tmp before build +
install world and kernel).

Anyway - it would be better to not have these partitions mounted with
exec.


Miroslav Lachman

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system

2015-12-07 Thread Mark Martinec

So, is this a new state of affairs that /usr/src file system
needs to be mounted exec in order for buildworld to succeed,
or is this an unintended change and I should file a bug report?

  Mark


On 2015-11-26 19:44, Miroslav Lachman wrote:

Mark Martinec wrote on 11/26/2015 19:31:

Up to about a week ago building world on FreeBSD 10.2-STABLE went
just fine. Today after svn update the build fails:


# make buildworld
[...]

CC='cc ' mkdep -f .depend.getprotoent_test -a
-I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd
-I/usr/src/contrib/netbsd-tests -std=gnu99
/usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a
/usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> 
.depend.getprotoent_test

(cd /usr/src/lib/libc/tests/net && make -f
/usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS=  SUBDIR=
PROG=ether_aton_test  DEPENDFILE=.depend.ether_aton_test
.MAKE.DEPENDFILE=.depend.ether_aton_test   depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
/usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr)
failed (Permission denied)
*** Error code 1

Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1


It turns out that our file system /usr/src had an "exec" flag
turned off, so now running a command:
   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
fails with "Permission denied".

It would be valuable if building a system on an exec-protected
src file system would continue to be possible.

Not sure if the
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
is the only such new command breaking the build. Anyway, a simple
workaround is to run shell from a command line instead of as a
shebang, i.e.:

   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr

instead of:

   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr


I was puzzled by a similar thing years ago. I was using /var/db and /tmp
mounted with noexec. And then there were some changes. Ports need
/var/db with exec because of some script in /var/db/pkg, and /tmp must
have exec too for buildworld or installworld (I don't remember it
well; now I always do mount -u -o current,exec /tmp before build +
install world and kernel).

Anyway - it would be better to not have these partitions mounted with
exec.


Miroslav Lachman

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system

2015-11-26 Thread Mark Martinec

Up to about a week ago building world on FreeBSD 10.2-STABLE went
just fine. Today after svn update the build fails:


# make buildworld
[...]

CC='cc ' mkdep -f .depend.getprotoent_test -a
-I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd 
-I/usr/src/contrib/netbsd-tests -std=gnu99   
/usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a  
/usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> 
.depend.getprotoent_test
(cd /usr/src/lib/libc/tests/net && make -f 
/usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS=  SUBDIR= 
PROG=ether_aton_test  DEPENDFILE=.depend.ether_aton_test 
.MAKE.DEPENDFILE=.depend.ether_aton_test   depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr 
/usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) 
failed (Permission denied)

*** Error code 1

Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1


It turns out that our file system /usr/src had an "exec" flag
turned off, so now running a command:
  /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
fails with "Permission denied".

It would be valuable if building a system on an exec-protected
src file system would continue to be possible.

Not sure if the
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
is the only such new command breaking the build. Anyway, a simple
workaround is to run shell from a command line instead of as a
shebang, i.e.:

  # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr

instead of:

  # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
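
The difference between the two invocations can be reproduced without a
noexec mount. A minimal sketch, assuming that clearing the execute bits is
an acceptable stand-in for the noexec flag (in both cases exec of the file
is refused with "Permission denied"); the script name gen_demo is
hypothetical and only stands in for gen_ether_subr:

```shell
#!/bin/sh
# Create a small script that cannot be exec'ed directly.
workdir=$(mktemp -d)
script="$workdir/gen_demo"          # hypothetical stand-in script
printf '#!/bin/sh\necho generated\n' > "$script"
chmod 644 "$script"                 # no execute bit: direct exec is refused

# Direct invocation relies on the shebang and on exec permission, so it fails.
if "$script" >/dev/null 2>&1; then
    direct_status=0
else
    direct_status=$?
fi

# Passing the file to the interpreter only needs read access, so it works.
via_sh=$(/bin/sh "$script")

echo "direct exec exit status: $direct_status"
echo "output via /bin/sh: $via_sh"
rm -rf "$workdir"
```

The same reasoning suggests that a Makefile rule calling the generator as
"sh script args" would keep working on an exec-protected source tree.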



Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Segmentation fault running ntpd

2015-11-04 Thread Mark Martinec

Upgrading 10.2-RELEASE-p6 to 10.2-RELEASE-p7 now solved ntpd crashes
(apparently fixed by: FreeBSD Errata Notice FreeBSD-EN-15:20.vm).

Thanks!!!

  Mark


On 2015-11-01 10:31, Andre Albsmeier wrote:

On Fri, 30-Oct-2015 at 19:47:59 +0100, Mark Martinec wrote:

Not sure if it's the same issue, but it sure looks like it is.

I have upgraded a couple of hosts (amd64) from 10.2-RELEASE-p5
to 10.2-RELEASE-p6, i.e. the freebsd-upgrade essentially just
replaced the /usr/sbin/ntpd with a new one; then I restarted
the ntpd.

On all hosts but one this was successful: the new ntpd starts
fine and works normally. But on one of these machines the
ntpd process immediately crashes with SIGSEGV. That machine
has an Intel Xeon CPU. It is not apparent to me in what way
this machine differs from the others.


I'll add my observations here:

I am using an ntp.conf with a single server entry:

server ntp.some.domain.org

ntp.some.domain.org is a CNAME pointing to gate.some.domain.org
and the latter contains an A record pointing to 192.168.128.1.

After updating 9.3-STABLE to the latest version (one which includes ntp
4.2.8p4), ntpd crashes:

Nov 1 09:38:38 voyager kernel: pid 4443 (ntpd), uid 0: exited on signal 11


This happens in line 871 of ntpd.c where mlockall() is called:

&& 0 != mlockall(MCL_CURRENT|MCL_FUTURE))

It does NOT crash with MCL_FUTURE only.
It does crash with MCL_CURRENT only.

When adding

rlimit memlock -1

to ntp.conf it does NOT crash (as mlockall() won't be called anymore).

When specifying the IP address (192.168.128.1) as the server it
does NOT crash.
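
Collected as an ntp.conf fragment, the two workarounds described above look
roughly like this. It is only a sketch of the reported configuration: the
address is the one used in this message, and either workaround alone was
reported to avoid the crash.

```
# Workaround 1: do not lock ntpd into memory, so mlockall() is not called
rlimit memlock -1

# Workaround 2: name the server by address instead of by the CNAME
server 192.168.128.1
```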

When specifying gate.some.domain.org as the server it also does
NOT crash. tcpdump shows in this case:

09:49:59.542310 IP 192.168.128.2.21102 > 192.168.128.1.53: 7639+ A? gate.some.domain.org. (41)
09:49:59.542578 IP 192.168.128.1.53 > 192.168.128.2.21102: 7639* 1/1/0 A 192.168.128.1 (71)
09:49:59.542612 IP 192.168.128.2.52455 > 192.168.128.1.53: 42047+ ? gate.some.domain.org. (41)
09:49:59.542792 IP 192.168.128.1.53 > 192.168.128.2.52455: 42047* 0/1/0 (88)


When reverting the server entry back to ntp.some.domain.org
it crashes and tcpdump shows:

09:36:05.172552 IP 192.168.128.2.17836 > 192.168.128.1.53: 49768+ A? ntp.some.domain.org. (40)
09:36:05.173320 IP 192.168.128.1.53 > 192.168.128.2.17836: 49768* 2/1/0 CNAME gate.some.domain.org., A 192.168.128.1 (89)
09:36:05.173361 IP 192.168.128.2.22611 > 192.168.128.1.53: 63808+ ? ntp.some.domain.org. (40)
09:36:05.173595 IP 192.168.128.1.53 > 192.168.128.2.22611: 63808* 1/1/0 CNAME gate.some.domain.org. (106)

The probability of crashing increases with the speed and the
number of cores of the machine: on my old single-core Pentiums
it never crashes, on my quad-core i7-3770K it always crashes.

The (asynchronous) resolving of the names starts in line 3876 of
ntp_config.c:

getaddrinfo_sometime(curr_peer->addr->address,

If we put the mlockall() call directly before this line, the
crash is gone.

Maybe you want to play around with rlimit, CNAMES, IPs and
so on...

-Andre

Anyone else seeing this?

2015-10-30 12:34, David Wolfskill wrote:
> On Fri, Oct 30, 2015 at 09:42:07AM +0100, Dag-Erling Smørgrav wrote:
>> David Wolfskill <da...@catwhisker.org> writes:
>> > ...
>> > bound to 172.17.1.245 -- renewal in 43200 seconds.
>> > pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
>> > Starting Network: lo0 em0 iwn0 lagg0.
>> > ...
>>
>> Did you find a solution?  I'm wondering if the ntpd problems people are
>> reporting on freebsd-security@ are related.  I vaguely recall hearing
>> that this had been traced to a pthread bug, but can't find anything
>> about it in commit logs or mailing list archives.
>> 
>
> I don't recall finding "a solution" per se; that said, I also don't
> recall seeing an occurrence of the above for enough time that I'm not
> sure when I sent that message. :-}
>
> As a reality check:
>
> g1-252(11.0-C)[1] ls -lT /*.core
> -rw-r--r--  1 root  wheel  13783040 Aug 18 04:19:03 2015 /ntpd.core
> g1-252(11.0-C)[2]
>
> So -- among other points -- my last sighting of whatever was causing
> that was the day I built:
>
> FreeBSD 11.0-CURRENT #157  r286880M/286880:1100079: Tue Aug 18
> 04:45:25 PDT 2015
> r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY  amd64
>
> Note that the machines where I run head get updated daily (unless
> there's enough of a problem with head that I can't build it or can't
> boot it (and I'm unable to circumvent the issue within a reasonable
> time)) -- and while I do attempt to run ntpd on the machines, the above
> failure is more "annoying" than "crippling" in my particular case.
>
> And I'm presently running:
>
> FreeBSD 11.0-CURRENT #227  r290

Re: Segmentation fault running ntpd

2015-10-30 Thread Mark Martinec

Not sure if it's the same issue, but it sure looks like it is.

I have upgraded a couple of hosts (amd64) from 10.2-RELEASE-p5
to 10.2-RELEASE-p6, i.e. the freebsd-upgrade essentially just
replaced the /usr/sbin/ntpd with a new one; then I restarted
the ntpd.

On all hosts but one this was successful: the new ntpd starts
fine and works normally. But on one of these machines the
ntpd process immediately crashes with SIGSEGV. That machine
has an Intel Xeon CPU. It is not apparent to me in what way
this machine differs from the others.

Played with some variations of ntpd on that host, here are
some findings:

- the new ntpd (that came with 10.2-RELEASE-p6) runs fine
  if it does *not* daemonize, i.e. ntpd with an option -n or -d
  stays attached to a terminal and works fine; the same
  happens when run under ktrace -d -i ntpd  ... it works fine,
  even when it daemonizes;

- the ntpd built from fresh net/ntp-devel behaves exactly
  the same: crashes on that machine when it daemonizes

- a previous ntpd (from 10.2-RELEASE-p5) works fine,
  so I ended up downgrading ntpd to that previous version
  on that machine. Also a ntpd from a recent 10-STABLE
  when copied to that host runs fine there!

I haven't tried yet to build it with debugging, or capture
a core dump.

Puzzling...

   Mark



2015-10-30 12:34, David Wolfskill wrote:

On Fri, Oct 30, 2015 at 09:42:07AM +0100, Dag-Erling Smørgrav wrote:

David Wolfskill <da...@catwhisker.org> writes:
> ...
> bound to 172.17.1.245 -- renewal in 43200 seconds.
> pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
> Starting Network: lo0 em0 iwn0 lagg0.
> ...

Did you find a solution?  I'm wondering if the ntpd problems people are
reporting on freebsd-security@ are related.  I vaguely recall hearing
that this had been traced to a pthread bug, but can't find anything
about it in commit logs or mailing list archives.



I don't recall finding "a solution" per se; that said, I also don't
recall seeing an occurrence of the above for enough time that I'm not
sure when I sent that message. :-}

As a reality check:

g1-252(11.0-C)[1] ls -lT /*.core
-rw-r--r--  1 root  wheel  13783040 Aug 18 04:19:03 2015 /ntpd.core
g1-252(11.0-C)[2]

So -- among other points -- my last sighting of whatever was causing
that was the day I built:

FreeBSD 11.0-CURRENT #157  r286880M/286880:1100079: Tue Aug 18
04:45:25 PDT 2015
r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY  amd64

Note that the machines where I run head get updated daily (unless
there's enough of a problem with head that I can't build it or can't
boot it (and I'm unable to circumvent the issue within a reasonable
time)) -- and while I do attempt to run ntpd on the machines, the above
failure is more "annoying" than "crippling" in my particular case.

And I'm presently running:

FreeBSD 11.0-CURRENT #227  r290138M/290138:1100084: Thu Oct 29
05:12:58 PDT 2015
r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY  amd64

and building head @r290190 as I type.

And FWIW, I *suspect* that one of the issues involved (in my case)
was a ... lack of determinism ... in events involving getting the
(wireless) network connectivity into a usable state as part of the
initial transition to multi-user mode.  (I only have evidence at
the moment of the issue on my laptop; my build machine, which only
uses a wired NIC, has no /ntpd.core file.  It and my laptop are updated
pretty much in lock-step; it runs a completely GENERIC kernel, while
the laptop runs a modestly customized one based on GENERIC.)

Peace,
david

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: recommended poudriere jail versions?

2015-10-01 Thread Mark Martinec

2015-10-01 10:32, Marko Cupać wrote:

what is the recommended poudriere jail version for building ports? So
far I was trying to be on latest binary patchlevel for every minor
version for both base system, poudriere jails and clients, but I ended
up with three jails just for amd64 (9.3, 10.1 and 10.2), where I need to
rebuild all the ports every time I patch poudriere jails. This is
starting to take too much of my time.

I see that pkg.freebsd.org hosts just one set of ports per
architecture of major version. What is the OS version they are built
on? Are there any downsides in building all the ports for
10.2- on 10.1-?


I used to have poudriere jails based on a minor version like you have,
but ended up with a simplified setup, building ports only on 10.0-RELEASE
and installing them on 10.1, 10.2 and 10-STABLE. I think the
official packages are also built on 10.0-RELEASE.

This mostly works, except for a port like virtualbox-ose-kmod,
which causes a kernel crash when built on 10.0-RELEASE and run
on 10.2. So after each ports upgrade when noticing that pkg
is reinstalling virtualbox-ose-kmod, I re-build this one from
ports on a target host, otherwise the next reboot will end up
crashing on loading a vboxdrv kernel module during startup.

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: Latest stable (r287104) bash leaves zombies on exit

2015-08-27 Thread Mark Martinec

Pete French wrote:


I updated to stable yesterday, plus updated all my ports to
the latest precompiled packages, but I am now seeing odd problems
with bash on exit. Sometimes it quits, but leaves a zombie
process, e.g.:

  PID TT  STAT    TIME COMMAND
44308 v0  IW   0:00.00 -bash (bash)
44312 v0  IW+  0:00.00 /bin/sh /usr/local/bin/startx -listen_tcp
44325 v0  IW+  0:00.00 xinit xterm -listen_tcp -- /usr/local/bin/X :0 -auth /ho
44328 v0  IW   0:00.00 /usr/local/bin/wmaker
44340 v0  S    0:03.35 /usr/local/bin/wmaker --for-real
49101  0- Z+   0:02.73 <defunct>
49314  1- Z+   0:00.17 <defunct>
56068  2  Ss   0:00.01 bash
56498  2  R+   0:00.00 ps
56074  3  Is   0:00.01 bash
56076  3  S+   0:00.00 mail freebsd-stable@freebsd.org
56308  4  Is+  0:00.01 bash

That's the current 'ps' on this machine. The bash processes are running
inside an xterm, so I am not sure if the issue is with bash or the
terminal. Kind of puzzled!


I can reproduce this easily, although not every time.

Running 10.2 under KDE, with bash as a default shell:
start xterm from a KDE 'konsole', then move to within the xterm
and try closing it (^D or exit). More often than not the xterm
will block and stay open, the bash process within goes defunct.

A normal kill of xterm has no effect, although a kill -9 to the
xterm blows away the xterm and the init process then clears
the bash zombie leftover. Seems like running a simple command
like 'date' in xterm before trying to close it does increase
the likelihood that xterm will block on exit.


Currently I have to reboot the machine periodically once I have accumulated
enough zombies to be annoying. It's not really a long term solution though.


There is no need to reboot, just kill -9 the hanging xterm processes
and the init will clear the zombies.
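
To find which processes to kill, it helps to list each zombie together with
its parent, since the parent (here the hung xterm) is what has to go. A
minimal sketch, assuming ps is asked for the pid, ppid, stat and command
columns in that order; the helper name list_zombies is made up for this
example:

```shell
# list_zombies is a hypothetical helper: it reads "pid ppid stat command"
# lines on stdin and reports every process whose STAT starts with Z.
list_zombies() {
    awk '$3 ~ /^Z/ { print "zombie", $1, "parent", $2 }'
}

# Typical use (prints nothing when there are no zombies):
#   ps axo pid,ppid,stat,command | list_zombies
```

The reported parent pid can then be checked with ps before sending it a
kill -9, to make sure it really is one of the stuck xterm processes.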

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: freebsd-update to 10.2-RELEASE broken ?

2015-08-17 Thread Mark Martinec

On Sun, 16 Aug 2015, Kimmo Paasiala wrote:

It could be the classic fall back to TCP on SRV records problem on
your upstream DNS forwarder if you're using one:
http://lists.freebsd.org/pipermail/freebsd-ports/2012-May/074801.html

The cure would be to use your own caching DNS resolver (configured to
query the authoritative name servers directly) such as dns/unbound.



2015-08-16 Christian Kratzer wrote:

I run my own bind9 resolvers on freebsd 10 at both sites.   I never
particularly liked the concept of an upstream resolver.

All my resolvers are behind firewalls although different kinds.
ASA at one site and freebsd pf at the other.

I will investigate though.  Thanks for the tip.


The ASA firewall has a nasty setting that *discards* DNS UDP packets
with a message size over 512 bytes, i.e. it does not honor the EDNS0
option. Check that you have this DNS deep packet inspection
misfeature turned off. Check also the firewall log.

This would affect UDP DNS responses to an SRV query for
_http._tcp.update.FreeBSD.org, which come close to the size limit
(possibly depending on geolocation). Using Google's public DNS server
may avoid the problem by stripping nonessential records from the
DNS reply (like the ADDITIONAL SECTION).
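
One way to see how close a reply comes to the limit is to look at the
"MSG SIZE  rcvd:" line that dig prints at the end of its output. A minimal
sketch, assuming dig is installed; the helper name msg_size_over_512 is
invented for this example and the sizes it sees depend on the resolver
queried:

```shell
# msg_size_over_512 is a hypothetical helper: it reads dig output on stdin
# and succeeds only if the reported message size exceeds 512 bytes.
msg_size_over_512() {
    awk '/MSG SIZE/ { size = $NF } END { exit !(size > 512) }'
}

# Typical use, asking over UDP with EDNS0 and without falling back to TCP:
#   dig +ignore +bufsize=4096 SRV _http._tcp.update.FreeBSD.org | msg_size_over_512 \
#       && echo "reply is larger than 512 bytes"
```

If the same query times out from behind the firewall but succeeds from
outside it, the 512-byte inspection setting is a likely suspect.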

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

