FreeBSD 12.2-RC1 poudriere warning: awk: can't open file /sys/param.h
awk: can't open file /sys/param.h

Probably innocent, but reporting just in case:

$ poudriere version
3.3.4
$ freebsd-version
12.2-RC1

# poudriere jail -c -j 122amd64-srv -v 12.2-RC1
[00:00:00] Creating 122amd64-srv fs at /data0/poudriere/jails/122amd64-srv... done
[00:00:01] Using pre-distributed MANIFEST for FreeBSD 12.2-RC1 amd64
[00:00:01] Fetching base for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/ba 173 MB 20 MBps 08s
[00:00:12] Extracting base... done
[00:00:48] Fetching src for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/sr 163 MB 4192 kBps 40s
[00:01:29] Extracting src... done
[00:02:14] Fetching lib32 for FreeBSD 12.2-RC1 amd64
/data0/poudriere/jails/122amd64-srv/fromftp/li 62 MB 3139 kBps 21s
[00:02:35] Extracting lib32... done
[00:02:48] Cleaning up... done
awk: can't open file /sys/param.h
 source line number 1
[00:02:52] Recording filesystem state for clean... done
[00:02:53] Jail 122amd64-srv 12.2-RC1 amd64 is ready to be used

Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: New Xorg - different key-codes
I just updated my laptop from source, and somewhere along the way the key-codes Xorg sees changed.

Indeed. This doesn't just affect -CURRENT: it happened to me on -STABLE last week, so I'm copying that list too.

And a "Down" key now opens and closes a KDE "Application Launcher", alternating with its original function (which makes editing a frustration).

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=244354

Mark
Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
2019-12-10 16:35, Marc Branchaud wrote:

On 2019-12-10 9:18 a.m., Mark Martinec wrote:

Commenting on a thread from 2018-12 and from 2019-09-20, with my solution to the boot problem at the end, in case anyone is still interested.

Thank you very much for this. A couple of questions:

(1) Why do you say "raw devices for historical reasons"? Glancing through the zpool man page and the Handbook, I see nothing recommending or requiring GPT partitions.

Apparently using raw devices for zpool is now discouraged, although I don't think it has ever become officially unsupported.

(2) Just to be 100% clear, my 11.3 non-root zpool looks like this:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0

So this is using raw devices. Are you saying that if I upgrade this machine to 12 that it won't be able to boot?

It is possible it won't boot under 12, although not necessarily. Try booting from a 12.0 (or later) memory stick - if that boots, it is probably a safe bet that it will survive an upgrade.

Of the bunch of machines that I have upgraded from 11.2 to 12, only three failed to boot under the 12.0 loader. There were a couple of others which upgraded and booted fine even though they had a zfs pool on raw devices. I never had a problem booting hosts that had their zfs pool on a gpt partition.

So it's a lottery: a few raw devices in a zpool seem to do fine, while many raw devices in a zpool is asking for trouble under 12.0 and later.

Mark
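Whether a pool sits on raw devices can be read off the `zpool status` device table, as in the listing above. The following Python sketch is only a heuristic under stated assumptions: it treats a leaf name made of a driver name plus a unit number (`ada2`, `da0`), with no partition suffix and no `gpt/` label, as a whole-disk vdev, and it assumes the first row after the `NAME` header is the pool itself. The function name, the regular expression and the sample text (including the `gpt/d4` label) are illustrative and not part of any FreeBSD tool.

```python
import re

# Heuristic (assumption): a whole-disk vdev is a driver name plus a unit
# number, e.g. "ada2" or "da0". Partitions ("ada2p3") and GPT labels
# ("gpt/d4") do not match.
RAW_DISK = re.compile(r"^[a-z]+\d+$")


def raw_vdevs(status_text):
    """Return leaf vdev names in `zpool status` output that look like raw disks."""
    found = []
    in_config = False
    skip_pool_row = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # The device table starts after the header row; the first row
            # of the table is the pool name itself, which must be skipped.
            in_config = True
            skip_pool_row = True
            continue
        if not in_config or len(fields) < 2:
            continue
        if skip_pool_row:
            skip_pool_row = False
            continue
        if RAW_DISK.match(fields[0]):
            found.append(fields[0])
    return found


SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            gpt/d4  ONLINE       0     0     0
"""

print(raw_vdevs(SAMPLE))
```

A non-empty result only says that the pool has whole-disk members; per the experience reported in this thread, that makes a stuck 12.0 loader possible, not certain.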
Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
Commenting on a thread from 2018-12 and from 2019-09-20, with my solution to the boot problem at the end, in case anyone is still interested.

===

On 2018-11-29 I wrote (after upgrading from 11.2 to 12.0):

While booting, the 'BTX loader' comes up, lists the BIOS drives, then the spinner below the list comes up and begins turning, stuttering, and after a couple of seconds it grinds to a standstill and nothing happens afterwards. At this point the ZFS and the bootstrap loader are supposed to come up, but they don't.
[...]

(on 2018-12-04):

The situation has not changed: the BTX loader lists all BIOS drives C..J (disk0..disk7), then a spinner starts and gets stuck forever. It never reaches the 'BIOS 635kB/3537856kB available memory' line.

While trying to restore the old /boot from 11.2, I tried booting a live image from a 12.0-RC3 memory stick - and the loader got stuck again, same as when booting from a disk. So I had to boot from an 11.2 memstick to be able to regain control.

===

2018-12-04, Ian Lepore writes:

Toomas Soome wrote:
|ok, if you could perform 2 tests:
|1. from loader prompt enter 0x413 0xa000 - @w . cr
|2. on first spinner, press space and type on boot: prompt:
|/boot/loader_4th and see if that will do better
|thanks, toomas

I don't think that will be an option. If it hasn't gotten to the point of saying how much BIOS available memory there is, it's only halfway through loader main() and has hung before getting to interact().

In fact, if that line hasn't printed, but some disk drives have been listed, it pretty much has to be hung in the "March through the device switch probing for things" loop. If all the disks are listed, then it got through that entry in the devsw, and is likely hanging in the dv_init calls for either the pxedisk or zfsdev devices.

===

2018-12-07 19:08, Willem Jan Withagen wrote:

Ended up more or less in the same situation this afternoon with freebsd-upgrade to [12.0]-RC3. Boot stops after listing all DOS disks, in a spinner.
So that is no fix. I booted from USB 11.2 and replaced the /boot/zfs{boot,loader} by the 11.2 ones. That makes my server again happy.

===

2019-09-19 16:02, Kurt Jaeger wrote:
Subject: Re: Lockdown adaX numbers to allow booting ?

| Kurt Jaeger writes:
|The problem is that if all 10 disks are connected, the system
|loses track from where it should boot and fails to boot (serial boot log):
|
|Consoles: internal video/keyboard serial port
|BTX loader 1.00 BTX version is 1.02
|Consoles: internal video/keyboard serial port
|BIOS drive C: is disk0
|BIOS drive D: is disk1
|BIOS drive E: is disk2
|BIOS drive F: is disk3
|BIOS drive G: is disk4
|BIOS drive H: is disk5
|BIOS drive I: is disk6
|BIOS drive J: is disk7
|BIOS drive K: is disk8
|BIOS drive L: is disk9
|//
|[...]
|The solution right now is to unplug all disks of the 'bck' pool,
|reboot, and re-insert the data disks after the boot is finished.
|[...]
|No gpart on the bck pool, raw drives.

2019-09-20 17:27, Mark Martinec wrote:
Subject: Re: Lockdown adaX numbers to allow booting ?

This sounds very much like my experience:

2018-11-29, Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
https://lists.freebsd.org/pipermail/freebsd-stable/2018-November/090129.html
https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090159.html

I now have three SuperMicro machines which are unable to boot after upgrading 11.2 to 12.0. After unsuccessfully fiddling with boot loaders, I have reverted two back to 11.2 (which boots and works fine again), and the third one is now at 12.0 but needs the boot hack as described by Kurt, i.e. pull out half the disks (of the 'data' pool), boot the system, plug the disks back in and zfs mount the remaining pool.

Considering that 11.2 boots and works fine on these machines, I consider it a btx loader failure and not a BIOS issue. What is common with these three machines is that they have one pool on raw devices for historical reasons (not on gpt partitions).
My guess is that the new loader gets confused by these raw disks.

===

Ok, now to my current situation and solution/workaround.

What was common with these hosts (and similar) is that a machine has more than a couple of disks, with a zfs pool (non-root) on raw devices (for historical reasons), not on gpt partitions.

Three workarounds seem possible:

- replace the boot loader with the one from 11.2, or
- using the default loader from 12, disconnect a sufficient number of data disks, boot, then reconnect the disks and re-attach the pool,
- or my current solution: 'zpool offline' one disk at a time from a data pool, wipe it, set up a gpt partition on it and put it back into the pool with 'zpool replace', letting it resilver.

It was a painful and slightly risky procedure (9 hours of resilvering each of the s
Re: No amdtemp sysctls, AMD Ryzen 5 3600X
On 15/11/2019 3:27 am, Mark Martinec wrote:

Running 12.1-RELEASE-p1 on AMD Ryzen 5 3600X cpu, but I don't see any temperatures reported in sysctl, even though amdtemp.ko and amdsmn.ko are loaded and they don't produce any complaints on loading.

2019-11-15 03:01, Kubilay Kocak wrote:

Resolver of original Ryzen 2 temperature support:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218264

"In Progress" issue for Ryzen 5 support with patch:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239607

I have applied the patch from Bug 239607 and it works now!

Perfect, thanks! Was committed to head (CURRENT), apparently merged to stable/12 (can't find the MFC commit). I've updated/retriaged the issue, and asked about a merge to stable/11, but at this point it looks like it missed the 12.1-RELEASE window.

It's unfortunate that it missed the 12.1-RELEASE. Thanks for a quick response!

Mark
No amdtemp sysctls, AMD Ryzen 5 3600X
Running 12.1-RELEASE-p1 on AMD Ryzen 5 3600X cpu, but I don't see any temperatures reported in sysctl, even though amdtemp.ko and amdsmn.ko are loaded and they don't produce any complaints on loading.

$ kldstat | fgrep amd
27 1 0x82f3 1458 amdtemp.ko
28 1 0x82f32000 808 amdsmn.ko
$ sysctl -a | grep -i tempe
$
$ sysctl dev.amdtemp
sysctl: unknown oid 'dev.amdtemp'
$

Nov 13 12:07:27 xxx kernel: CPU: AMD Ryzen 5 3600X 6-Core Processor (4100.09-MHz K8-class CPU)
Nov 13 12:07:27 xxx kernel: Origin="AuthenticAMD" Id=0x870f10 Family=0x17 Model=0x71 Stepping=0

Motherboard is an ASUS with X570 chipset, latest BIOS. No obvious errors are reported during booting.

Any additional information that I can provide? Any suggestions?

Mark
Re: mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
2018-12-26 22:26, Terry Kennedy wrote:

The earlier LSI P20 releases were pretty flakey in some cases - try flashing 20.00.07.00.

Indeed. I have upgraded LSI SAS2308 firmware from 20.00.02.00 to 20.00.07.00 a week ago, left it running for a while with 11.2, then upgraded again to 12.0, and the controller is stable now, even with the new mps driver that came with 12.0.

To recap:

- mps driver from FreeBSD 11.2 and earlier is stable with SAS2308 firmware 20.00.02.00 _and_ 20.00.07.00
- mps driver from FreeBSD 12.0 causes frequent controller resets with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that), but is stable with 20.00.07.00.

Mark

2018-12-17 16:52, Mark Martinec wrote:

One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially, then to RC3 and lastly to 12.0-RELEASE) is suffering severe instability of a disk controller, resetting itself a couple of times a day, usually associated with high disk usage (like poudriere builds or zfs scrub or nightly file system scans).

The same setup was rock-solid under 11.2 (and still/again is).

The disk controller is LSI SAS2308. It has four disks attached as JBODs, one pair of SSDs and one pair of hard disks, each pair forming its own zpool. A controller reset can occur regardless of which pair is in heavy use.

The following can be found in logs, just before the machine becomes unusable (although not always logged, as disks may be dropped before syslog has a chance of writing anything):

xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
xxx kernel: [2382] mps0: Reinitializing controller
xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
xxx kernel: [2383] mps0: IOCCapabilities: 5a85c
xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack

The IOC Fault location is always the same. Apparently the disk controller resets, all disk devices are dropped and ZFS finds itself with no disks.
The machine still responds to ping, and if logged-in during the event and running zpool status -v 1, zfs reports loss of all devices for each pool:

  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC

(and similar for the other pool)

At this point the machine is unusable and needs to be hard-reset.

My guess is that after the controller resets, disk devices come up again (according to the report seen on the console, stating 'periph destroyed' first, then listing full info on each disk) - but zfs ignores them.

I don't see any mention of changes of the mps driver in the 12.0 release notes, although diff-ing its sources between 11.2 and 12.0 shows plenty of nontrivial changes.

After suffering this instability for some time, I finally downgraded the OS to 11.2, and things are back to normal again!

This downgrade path was nontrivial, as I had foolishly upgraded pool features to what comes with 12.0, so downgrading involved dismantling both zfs mirror pools, recreating the pools without the two new features, and zfs send/receive copying, while having the machine hang during some of these operations. Not something for the faint of heart. I know, foolish of me to upgrade pools after just one day of uptime with 12.0.
Some info on the controller:

kernel: mps0: port 0xf000-0xf0ff mem 0xfbe4- 0xfbe4,0xfbe0-0xfbe3 irq 64 at device 0.0 numa-domain 1 on pci11
kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd

mpsutil shows:

mps0 Adapter:
       Board Name: LSI2308-IT
   Board Assembly:
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.02.00
  Integrated RAID: no

So, what has changed in the mps driver for this to be happening? Would it be possible to take mps driver sources from 11.2, transplant them to 12.0, recompile, and use that? Could the new mps driver be using some new feature of the controller and hitting a firmware bug?

I have resisted upgrading SAS2308 firmware and its BIOS, as it is working very well under 11.2.

Anyone else seen problems with mps driver and LSI SAS2308 controller?

(btw, on another machine the mps driver with LSI SAS2004 is working just fine under 12.0)

Mark
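The recap in this message boils down to a firmware version comparison, which can be automated against the `mps0: Firmware: ...` kernel line. This is a minimal Python sketch, assuming the log line format shown in the message and taking 20.00.07.00 as the threshold only because that is the version reported stable in this thread; the constant and function names are made up for illustration.

```python
import re

# Firmware version reported in this thread as stable with the 12.0 mps driver.
KNOWN_GOOD = (20, 0, 7, 0)

# Matches the kernel line format quoted in the message above.
FW_LINE = re.compile(r"mps\d+: Firmware: ([\d.]+), Driver: (\S+)")


def parse_version(text):
    """Turn '20.00.02.00' into the comparable tuple (20, 0, 2, 0)."""
    return tuple(int(part) for part in text.split("."))


def firmware_from_log(line):
    """Extract the firmware version from an mps kernel log line, or None."""
    match = FW_LINE.search(line)
    return parse_version(match.group(1)) if match else None


def older_than_known_good(line):
    """True if the line reports a firmware older than KNOWN_GOOD."""
    firmware = firmware_from_log(line)
    return firmware is not None and firmware < KNOWN_GOOD


line = "xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd"
print(firmware_from_log(line))
print(older_than_known_good(line))
```

Comparing tuples of integers rather than strings avoids surprises with zero-padded fields. The check says nothing about other controllers or firmware branches; it only encodes the single data point from this thread.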
mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially, then to RC3 and lastly to 12.0-RELEASE) is suffering severe instability of a disk controller, resetting itself a couple of times a day, usually associated with high disk usage (like poudriere builds or zfs scrub or nightly file system scans).

The same setup was rock-solid under 11.2 (and still/again is).

The disk controller is LSI SAS2308. It has four disks attached as JBODs, one pair of SSDs and one pair of hard disks, each pair forming its own zpool. A controller reset can occur regardless of which pair is in heavy use.

The following can be found in logs, just before the machine becomes unusable (although not always logged, as disks may be dropped before syslog has a chance of writing anything):

xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
xxx kernel: [2382] mps0: Reinitializing controller
xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
xxx kernel: [2383] mps0: IOCCapabilities: 5a85c
xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack

The IOC Fault location is always the same. Apparently the disk controller resets, all disk devices are dropped and ZFS finds itself with no disks. The machine still responds to ping, and if logged-in during the event and running zpool status -v 1, zfs reports loss of all devices for each pool:

  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC

(and similar for the other pool)

At this point the machine is unusable and needs to be hard-reset.
My guess is that after the controller resets, disk devices come up again (according to the report seen on the console, stating 'periph destroyed' first, then listing full info on each disk) - but zfs ignores them.

I don't see any mention of changes of the mps driver in the 12.0 release notes, although diff-ing its sources between 11.2 and 12.0 shows plenty of nontrivial changes.

After suffering this instability for some time, I finally downgraded the OS to 11.2, and things are back to normal again!

This downgrade path was nontrivial, as I had foolishly upgraded pool features to what comes with 12.0, so downgrading involved dismantling both zfs mirror pools, recreating the pools without the two new features, and zfs send/receive copying, while having the machine hang during some of these operations. Not something for the faint of heart. I know, foolish of me to upgrade pools after just one day of uptime with 12.0.

Some info on the controller:

kernel: mps0: port 0xf000-0xf0ff mem 0xfbe4- 0xfbe4,0xfbe0-0xfbe3 irq 64 at device 0.0 numa-domain 1 on pci11
kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd

mpsutil shows:

mps0 Adapter:
       Board Name: LSI2308-IT
   Board Assembly:
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.02.00
  Integrated RAID: no

So, what has changed in the mps driver for this to be happening? Would it be possible to take mps driver sources from 11.2, transplant them to 12.0, recompile, and use that? Could the new mps driver be using some new feature of the controller and hitting a firmware bug?

I have resisted upgrading SAS2308 firmware and its BIOS, as it is working very well under 11.2.

Anyone else seen problems with mps driver and LSI SAS2308 controller?
(btw, on another machine the mps driver with LSI SAS2004 is working just fine under 12.0)

Mark
Re: zfsboot@12.0: Shortening read at xxxx from 16 to -479991569
2018-12-13 16:59, Warner Losh wrote:

Do you have any encrypted disks?

Indeed I do, both pools are encrypted. (although I haven't seen such messages with 11.2, as far as I can tell)

Mark

On Thu, Dec 13, 2018, 6:19 AM Mark Martinec wrote:

On one of my hosts (now running 12.0-RELEASE) the zfsboot shows this weird negative number, which sounds suspicious:

Verifying DMI pool Data .
Shortening read at 3907029152 from 16 to 15
Shortening read at 7435283708 from 16 to -479991569
BTX loader 1.0 BTX version is 1.02
Consoles: ...
BIOS drive C: is disk0
...

The machine boots up normally and is fine, zpool scrub is happy, so, should I worry? Anything fishy there?

Searching through sources, the message seems to come from stand/i386/zfsboot/zfsboot.c :

    printf("Shortening read at %lld from %d to %lld\n", alignlba, alignnb,
        (zdsk->dsk.size + zdsk->dsk.start) - alignlba);
zfsboot@12.0: Shortening read at xxxx from 16 to -479991569
On one of my hosts (now running 12.0-RELEASE) the zfsboot shows this weird negative number, which sounds suspicious:

Verifying DMI pool Data .
Shortening read at 3907029152 from 16 to 15
Shortening read at 7435283708 from 16 to -479991569
BTX loader 1.0 BTX version is 1.02
Consoles: ...
BIOS drive C: is disk0
...

The machine boots up normally and is fine, zpool scrub is happy, so, should I worry? Anything fishy there?

Searching through sources, the message seems to come from stand/i386/zfsboot/zfsboot.c :

    printf("Shortening read at %lld from %d to %lld\n", alignlba, alignnb,
        (zdsk->dsk.size + zdsk->dsk.start) - alignlba);

Mark
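The third number in the message is simply the quoted expression `(size + start) - alignlba`, so it can be inverted to see which disk end the boot code was working with. The Python sketch below does only that arithmetic with the two values from the boot messages; the function names are made up, and it assumes the printed numbers reflect the real operands, which is not verified here. It is an illustration of why the value goes negative, not a diagnosis of the cause.

```python
def remaining_sectors(disk_start, disk_size, align_lba):
    """Third value printed by the message: (size + start) - alignlba."""
    return (disk_size + disk_start) - align_lba


def implied_disk_end(align_lba, printed_remaining):
    """Invert the expression to recover the (size + start) that was used."""
    return align_lba + printed_remaining


# Values taken from the two "Shortening read" lines quoted above.
first_end = implied_disk_end(3907029152, 15)
second_end = implied_disk_end(7435283708, -479991569)

print(first_end)
print(second_end)
# The second read starts beyond the implied end of the disk,
# which is exactly when the subtraction yields a negative count.
print(second_end < 7435283708)
```

In other words, the first line is an ordinary read clipped at the end of a disk, while the second one reports a read whose starting LBA lies past the disk end the code assumed for that device.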
Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
2018-11-29 18:43, Toomas Soome wrote:

I just did push biosdisk updates to stable/12, I wonder if you could test those bits…

I wrote:

Thank you! I haven't tried it yet, but I wonder whether this fix was already incorporated into 12.0-RC3, which would make my rescue easier. Otherwise I can build a stable/12 on another host and transplant the problematic file(s) to the affected host - if I knew which files to copy.

2018-12-02 18:59, Toomas wrote:

The files are /boot/loader* binaries - to be exact, check which one is linked to /boot/loader. I can provide binaries if needed.
[...]
rgds, toomas

I got a maintenance window today so I tried with the new loader, and it did not help. More specifically:

As it comes with 12-RC2, the /boot/loader was hard linked with loader_lua. Its size is 421888 bytes. So I concentrated on this loader.

I built a fresh stable/12 on another host, and copied the newly built loader_lua (425984 bytes) to the /boot directory of the affected host, deleted the file 'loader', and hard-linked loader_lua to loader.

The situation has not changed: the BTX loader lists all BIOS drives C..J (disk0..disk7), then a spinner starts and gets stuck forever. It never reaches the 'BIOS 635kB/3537856kB available memory' line.

While trying to restore the old /boot from 11.2, I tried booting a live image from a 12.0-RC3 memory stick - and the loader got stuck again, same as when booting from a disk. So I had to boot from an 11.2 memstick to be able to regain control.

Mark

On 29 Nov 2018, at 17:01, Mark Martinec wrote:

After successfully upgrading three hosts from 11.2-p4 to 12.0-RC2 (amd64, zfs, bios), I tried my luck with one of our production hosts, and ended up with a stuck loader after rebooting with a new kernel (after the first stage of upgrade).
These were the steps, and all went smoothly and normally until a reboot:

freebsd-update upgrade -r 12.0-RC2
freebsd-update install
shutdown -r now

While booting, the 'BTX loader' comes up, lists the BIOS drives, then the spinner below the list comes up and begins turning, stuttering, and after a couple of seconds it grinds to a standstill and nothing happens afterwards. At this point the ZFS and the bootstrap loader are supposed to come up, but they don't.

This host has two zfs pools: the system pool consists of two SSDs in a zfs mirror (also holding a freebsd-boot partition each), the other pool is a raidz2 with six JBOD disks on an LSI controller.

The gptzfsboot in both freebsd-boot partitions is fresh from 11.2, both zpool versions are up-to-date with 11.2. The 'zpool status -v' is happy with both pools.

After rebooting from a USB drive and reverting the /boot directory to a previous version, the machine comes up normally again with the 11.2-RELEASE-p4.

I found a file init.core in the / directory, slightly predating the last reboot with a salvaged system - although it was probably not a cause of the problem, but a consequence of the rescue operation.

It is unfortunate that this is a production host, so I can't play much with it. One or two more quick experiments I can probably afford, but not much more. Should I just first wait for the official 12.0 release? Should I try booting with a 12.0 on USB and try to import pools? Suggestions welcome.

Now that the /boot has been manually restored to the 11.2 state, A SECOND QUESTION is about freebsd-update, which still thinks we are in the middle of an upgrade procedure. Trying now to just update the 11.2-RELEASE-p4 to 11.2-RELEASE-p5, the fetch complains:

# uname -a
FreeBSD xxx 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4
#
# freebsd-version
11.2-RELEASE-p4
#
# freebsd-update fetch
src component not installed, skipped
You have a partially completed upgrade pending
Run '/usr/sbin/freebsd-update install' first.
Run '/usr/sbin/freebsd-update fetch -F' to proceed anyway.

So what is the right way to get rid of all traces of the unsuccessful upgrade, and let freebsd-update believe we are cleanly at 11.2-p4 ? Removing /var/db/freebsd-update did not help.

Mark
Re: Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
2018-11-29 18:43, Toomas Soome wrote:

I just did push biosdisk updates to stable/12, I wonder if you could test those bits…

Thank you! I haven't tried it yet, but I wonder whether this fix was already incorporated into 12.0-RC3, which would make my rescue easier. Otherwise I can build a stable/12 on another host and transplant the problematic file(s) to the affected host - if I knew which files to copy.

I wonder also whether today's posting by cksalexan...@q.com on the freebsd-stable ML titled "FreeBSD-12.0-RC3-i386-disc1.iso does not boot" could be describing the same problem?

Mark

On 29 Nov 2018, at 17:01, Mark Martinec wrote:

After successfully upgrading three hosts from 11.2-p4 to 12.0-RC2 (amd64, zfs, bios), I tried my luck with one of our production hosts, and ended up with a stuck loader after rebooting with a new kernel (after the first stage of upgrade).

These were the steps, and all went smoothly and normally until a reboot:

freebsd-update upgrade -r 12.0-RC2
freebsd-update install
shutdown -r now

While booting, the 'BTX loader' comes up, lists the BIOS drives, then the spinner below the list comes up and begins turning, stuttering, and after a couple of seconds it grinds to a standstill and nothing happens afterwards. At this point the ZFS and the bootstrap loader are supposed to come up, but they don't.

This host has two zfs pools: the system pool consists of two SSDs in a zfs mirror (also holding a freebsd-boot partition each), the other pool is a raidz2 with six JBOD disks on an LSI controller.

The gptzfsboot in both freebsd-boot partitions is fresh from 11.2, both zpool versions are up-to-date with 11.2. The 'zpool status -v' is happy with both pools.

After rebooting from a USB drive and reverting the /boot directory to a previous version, the machine comes up normally again with the 11.2-RELEASE-p4.
I found a file init.core in the / directory, slightly predating the last reboot with a salvaged system - although it was probably not a cause of the problem, but a consequence of the rescue operation.

It is unfortunate that this is a production host, so I can't play much with it. One or two more quick experiments I can probably afford, but not much more. Should I just first wait for the official 12.0 release? Should I try booting with a 12.0 on USB and try to import pools? Suggestions welcome.

Now that the /boot has been manually restored to the 11.2 state, A SECOND QUESTION is about freebsd-update, which still thinks we are in the middle of an upgrade procedure. Trying now to just update the 11.2-RELEASE-p4 to 11.2-RELEASE-p5, the fetch complains:

# uname -a
FreeBSD xxx 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4
#
# freebsd-version
11.2-RELEASE-p4
#
# freebsd-update fetch
src component not installed, skipped
You have a partially completed upgrade pending
Run '/usr/sbin/freebsd-update install' first.
Run '/usr/sbin/freebsd-update fetch -F' to proceed anyway.

So what is the right way to get rid of all traces of the unsuccessful upgrade, and let freebsd-update believe we are cleanly at 11.2-p4 ? Removing /var/db/freebsd-update did not help.

Mark
Boot loader stuck after first stage upgrading 11.2 to 12.0-RC2
After successfully upgrading three hosts from 11.2-p4 to 12.0-RC2 (amd64, zfs, bios), I tried my luck with one of our production hosts, and ended up with a stuck loader after rebooting with a new kernel (after the first stage of upgrade).

These were the steps, and all went smoothly and normally until a reboot:

freebsd-update upgrade -r 12.0-RC2
freebsd-update install
shutdown -r now

While booting, the 'BTX loader' comes up, lists the BIOS drives, then the spinner below the list comes up and begins turning, stuttering, and after a couple of seconds it grinds to a standstill and nothing happens afterwards. At this point the ZFS and the bootstrap loader are supposed to come up, but they don't.

This host has two zfs pools: the system pool consists of two SSDs in a zfs mirror (also holding a freebsd-boot partition each), the other pool is a raidz2 with six JBOD disks on an LSI controller.

The gptzfsboot in both freebsd-boot partitions is fresh from 11.2, both zpool versions are up-to-date with 11.2. The 'zpool status -v' is happy with both pools.

After rebooting from a USB drive and reverting the /boot directory to a previous version, the machine comes up normally again with the 11.2-RELEASE-p4.

I found a file init.core in the / directory, slightly predating the last reboot with a salvaged system - although it was probably not a cause of the problem, but a consequence of the rescue operation.

It is unfortunate that this is a production host, so I can't play much with it. One or two more quick experiments I can probably afford, but not much more. Should I just first wait for the official 12.0 release? Should I try booting with a 12.0 on USB and try to import pools? Suggestions welcome.

Now that the /boot has been manually restored to the 11.2 state, A SECOND QUESTION is about freebsd-update, which still thinks we are in the middle of an upgrade procedure.
Trying now to just update the 11.2-RELEASE-p4 to 11.2-RELEASE-p5, the fetch complains:

# uname -a
FreeBSD xxx 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4
#
# freebsd-version
11.2-RELEASE-p4
#
# freebsd-update fetch
src component not installed, skipped
You have a partially completed upgrade pending
Run '/usr/sbin/freebsd-update install' first.
Run '/usr/sbin/freebsd-update fetch -F' to proceed anyway.

So what is the right way to get rid of all traces of the unsuccessful upgrade, and let freebsd-update believe we are cleanly at 11.2-p4 ? Removing /var/db/freebsd-update did not help.

Mark
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
On 07/08/2018 15:58, Mark Martinec wrote:

Collected, here it is:
https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2

2018-08-14 11:18, Andriy Gapon wrote:

I see one memory leak, not sure if it's the only one. It looks like vdev_geom_read_config() leaks all parsed vdev nvlist-s but the last. The problem seems to come from r316760. Before that commit the function would return upon finding the first valid config, but now it keeps iterating.

The memory leak should not be a problem when vdev-s are probed sufficiently rarely, but it appears that with an unhealthy pool the probing can happen much more frequently (e.g., every time pools are listed).

Superb, thanks!!! I have opened a bug report now:

Bug 230704: All the memory eaten away by ZFS 'solaris' malloc
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230704

Mark
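The leak pattern Andriy describes (a loop that used to return on the first valid config, but now keeps iterating and overwrites its result) can be modelled in a few lines. The Python below is a toy model only, not the FreeBSD code: `CountingAllocator` stands in for nvlist allocation, the label list is hypothetical, and the "fixed" variant shows one possible way to avoid the leak rather than the change that was actually committed.

```python
class CountingAllocator:
    """Stand-in for nvlist allocation; counts objects not yet freed."""

    def __init__(self):
        self.live = 0

    def alloc(self, payload):
        self.live += 1
        return {"payload": payload}

    def free(self, obj):
        if obj is not None:
            self.live -= 1


def read_config_leaky(labels, allocator):
    """Keeps iterating over all labels and overwrites the previous result."""
    config = None
    for label in labels:
        if label is None:  # unreadable label, nothing parsed
            continue
        # The previously parsed config is dropped here without being freed.
        config = allocator.alloc(label)
    return config


def read_config_fixed(labels, allocator):
    """Same loop, but releases the previous config before replacing it."""
    config = None
    for label in labels:
        if label is None:
            continue
        allocator.free(config)
        config = allocator.alloc(label)
    return config


labels = ["L0", "L1", None, "L3"]  # hypothetical: three readable labels
for reader in (read_config_leaky, read_config_fixed):
    allocator = CountingAllocator()
    for _ in range(1000):  # frequent probing, as with an unhealthy pool
        allocator.free(reader(labels, allocator))  # caller frees what it gets
    print(reader.__name__, allocator.live)
```

With three readable labels, each probe in the leaky variant strands two objects, so the outstanding count grows linearly with the number of probes. That matches the observation that the leak only hurts when vdevs are probed very often.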
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
2018-08-13 21:48, Volodymyr Kostyrko wrote:

I've been in the same situation. ZFS, only pool, no ZFS errors. I think the problem is rather between swapping and ZFS ARC. This host has different load, sometimes it needs more active memory, sometimes less... This means that active zone can expand and shrink like +-2G of mem (I have 16Gb installed there). The problem is, when huge task is idle it doesn't use much active memory and other activity is pushing its memory to the swap. When active runs low and ARC runs >50% of memory it becomes very hard to make ARC give some memory back. My host even was brought to the point when it couldn't get tasks back into memory from swap, because while some pages were restored from swap the time passes by and the other pages are instead stored to swap due to some ARC activity. Finally active zone shrinks so bad that the host becomes unresponsive.

Like 6 months ago I tried tweaking kernel and swap to make things go other way. Currently I have `vm.swap_idle_enabled=1` in /etc/loader.conf and looks like this solves my problem. The other interesting things to look at are `vfs.zfs.arc_free_target`, `vfs.zfs.arc_shrink_shift`, `vfs.zfs.arc_grow_retry`. Or you can take another route and plain limit current ARC size with `vfs.zfs.arc_max`.

What you describe is not the same problem as the one I described in this thread. In my case the ZFS malloc'ed memory ("solaris" zone) is growing, while the size of the ARC remains capped to a reasonably low value, and the ARC even shrinks as the "solaris" zone approaches the memory size.

I too have been bitten previously by the ARC size being reluctant to shrink. This problem is described here, but only partially mitigated now in the 11.? version:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

The usually suggested workaround is to limit the size of the ARC, although it would be nice to find a solution to handle ARC UMA shrinking automatically, like it worked well in FreeBSD 9 but broke in FreeBSD 10.

Like I said, the problem I described in this thread is different.

Mark
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.2-R amd64
2018-08-04 21:47, Mark Johnston wrote:

Sorry, I missed that message. Given that information, it would be useful to see the output of the following script instead:

  # dtrace -c "zpool list -Hp" -x temporal=off -n '
      dtmalloc::solaris:malloc
        /pid == $target/{@allocs[stack(), args[3]] = count()}
      dtmalloc::solaris:free
        /pid == $target/{@frees[stack(), args[3]] = count();}'

This will record all allocations and frees from a single instance of "zpool list".

2018-08-07 14:58, Mark Martinec wrote:

Collected, here it is: https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2

Was there a mention of a defunct pool?

Indeed. Haven't tried yet to destroy it, so it is only my hypothesis that a defunct pool plays a role in this leak.

[...]

I have jumped from 10.3 directly to 11.1-RELEASE-p11, so I'm not sure with exactly which version / patch level the problem was introduced.

Tried to reproduce the problem on another host running 11.2R, using memory disk (md), created GPT partition on it and a ZFS pool on top, then destroyed the disk, so the pool was left as UNAVAILABLE. Unfortunately this did not reproduce the problem, the "zpool list" on that host does not cause ZFS to leak memory. Must be something specific to that failed disk or pool, which is causing the leak.

Mark

More news: in my last posting I said I can't reproduce the issue on another 11.2 host. Well, it turned out this was only half the truth.

So this is what I did the last time:

  # create a test pool on md
  mdconfig -a -t swap -s 1Gb
  gpart create -s gpt /dev/md0
  gpart add -t freebsd-zfs -a 4k /dev/md0
  zpool create test /dev/md0p1
  # destroy the disk underneath the pool, making it "unavailable"
  mdconfig -d -u 0 -o force

and I reported that the "zpool list" command does not leak memory, unlike on another host where the problem was first detected.

But in the days following this, the second machine started to run out of memory and ground to a standstill after a couple of days - this has now happened three times, until I realized the same thing was happening here as on the original host (the "zpool list" is running periodically as a plugin to a "telegraf" monitoring).

Sure enough the "zpool list" was leaking "solaris" zone memory here too, and even in larger chunks (previously by 570, now by about 2k):

  # (while true; do zpool list >/dev/null; vmstat -m | \
      fgrep solaris; sleep 0.5; done) | awk '{print $2-a; a=$2}'
  12224540
  2509
  3121
  5022
  2507
  1834
  2508
  2505

And it's not just the "zpool list" command. The same leak occurs with "zpool status" and with "zpool iostat", either when explicitly specifying the defunct pool as argument, or without specifying a pool (implying all), but not when a healthy pool is explicitly specified to such a command.

And to confirm the hypothesis: while running the "zpool list" in the above loop, I destroyed the defunct pool from another terminal, and the leak immediately vanished (the vmstat -m | fgrep solaris no longer grew).

So the only missing link is: why the leak did not start immediately after revoking the disk and making the pool unavailable, but only some time later (hours? a few days? after a reboot? after running some other command?).

Mark
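The shell loop above prints, for each sample, how much the "solaris" InUse count changed since the previous sample. The same arithmetic can be sketched in Python for post-processing logged `vmstat -m` lines. This is only an illustration: parse_in_use and successive_deltas are hypothetical helper names, and the second and third sample values in the usage note are reconstructed by summing the deltas printed above.

```python
def parse_in_use(vmstat_line):
    """Return the InUse column of one 'vmstat -m' line (awk's $2)."""
    return int(vmstat_line.split()[1])


def successive_deltas(in_use_samples):
    """Mimic awk '{print $2-a; a=$2}'.

    awk's variable 'a' starts out as 0, so the first value emitted is the
    first sample itself; every later value is the increase since the
    previous sample.
    """
    previous = 0
    deltas = []
    for value in in_use_samples:
        deltas.append(value - previous)
        previous = value
    return deltas
```

For example, successive_deltas([12224540, 12227049, 12230170]) gives [12224540, 2509, 3121], the first three numbers of the output shown above. A delta that stays positive and never goes negative across many samples is the signature of the leak discussed in this thread.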
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
On Sat, Aug 04, 2018 at 08:38:04PM +0200, Mark Martinec wrote: 2018-08-04 19:01, Mark Johnston wrote: > I think running "zpool list" is adding a lot of noise to the output. > Could you retry without doing that? No, like I said previously, the "zpool list" (with one defunct zfs pool) *is* the sole culprit of the zfs memory leak. With each invocation of "zpool list" the "solaris" malloc jumps up by the same amount, and never ever drops. Without running it (like repeatedly under 'telegraf' monitoring of zfs), the machine runs normally and never runs out of memory, the "solaris" malloc count no longer grows steadily. 2018-08-04 21:47, Mark Johnston wrote: Sorry, I missed that message. Given that information, it would be useful to see the output of the following script instead: # dtrace -c "zpool list -Hp" -x temporal=off -n ' dtmalloc::solaris:malloc /pid == $target/{@allocs[stack(), args[3]] = count()} dtmalloc::solaris:free /pid == $target/{@frees[stack(), args[3]] = count();}' This will record all allocations and frees from a single instance of "zpool list". Collected, here it is: https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2 Kevin P. Neal wrote: Was there a mention of a defunct pool? Indeed. Haven't tried yet to destroy it, so it is only my hypothesis that a defunct pool plays a role in this leak. I've got a machine with 8GB RAM running 11.1-RELEASE-p4 with a single ZFS pool. It runs zfs list in a script multiple times a minute, and it has been doing so for 181 days with no reboot. I have not seen any memory issues. I have jumped from 10.3 directly to 11.1-RELEASE-p11, so I'm not sure with exactly which version / patch level the problem was introduced. Tried to reproduce the problem on another host running 11.2R, using memory disk (md), created GPT partition on it and a ZFS pool on top, then destroyed the disk, so the pool was left as UNAVAILABLE. Unfortunately this did not reproduce the problem, the "zpool list" on that host does not cause ZFS to leak memory. 
Must be something specific to that failed disk or pool, which is causing the leak. Mark ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
2018-08-04 19:01, Mark Johnston wrote:

I think running "zpool list" is adding a lot of noise to the output. Could you retry without doing that?

No, like I said previously, the "zpool list" (with one defunct zfs pool) *is* the sole culprit of the zfs memory leak. With each invocation of "zpool list" the "solaris" malloc jumps up by the same amount, and never ever drops. Without running it (like repeatedly under 'telegraf' monitoring of zfs), the machine runs normally and never runs out of memory, the "solaris" malloc count no longer grows steadily.

This leak was introduced sometime between 10.3 and 11.1R-p11, and is still there with 11.2.

Mark

On Fri, Aug 03, 2018 at 09:11:42PM +0200, Mark Martinec wrote:

More attempts at tracking this down. The suggested dtrace command does usually abort with:

  Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
    file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3330.

Hrmm. As a workaround you can add "-x temporal=off" to the dtrace(1) invocation.

but with some luck soon after each machine reboot I can leave the dtrace running for about 10 or 20 seconds (max) before terminating it with a ^C, and succeed in collecting the report. If I miss the opportunity to leave dtrace running just long enough to collect useful info, but not long enough for it to hit the assertion check, then any further attempt to run the dtrace script hits the assertion fault immediately.

Btw, (just in case) I have recompiled kernel from source (base/release/11.2.0) with debugging symbols, although the behaviour has not changed:

  FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238: Fri Aug 3 17:29:42 CEST 2018 m...@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64

Anyway, after several attempts I was able to collect a useful dtrace output from the suggested dtrace script:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'

while running "zpool list" repeatedly in another terminal screen:

I think running "zpool list" is adding a lot of noise to the output. Could you retry without doing that?
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
More attempts at tracking this down. The suggested dtrace command does usually abort with:

  Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
    file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3330.

but with some luck soon after each machine reboot I can leave the dtrace running for about 10 or 20 seconds (max) before terminating it with a ^C, and succeed in collecting the report. If I miss the opportunity to leave dtrace running just long enough to collect useful info, but not long enough for it to hit the assertion check, then any further attempt to run the dtrace script hits the assertion fault immediately.

Btw, (just in case) I have recompiled kernel from source (base/release/11.2.0) with debugging symbols, although the behaviour has not changed:

  FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238: Fri Aug 3 17:29:42 CEST 2018 m...@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64

Anyway, after several attempts I was able to collect a useful dtrace output from the suggested dtrace script:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'

while running "zpool list" repeatedly in another terminal screen:

  # (while true; do zpool list -Hp >/dev/null; vmstat -m | fgrep solaris; \
      sleep 0.2; done) | awk '{print $2-a; a=$2}'
  454303
  570
  570
  570
  570
  570
  570
  570
  570
  570
  570
  570
  570
  570
  570

Two samples of the collected dtrace output (after about 15 seconds) are at:

  https://www.ijs.si/usr/mark/tmp/dtrace1.out.bz2
  https://www.ijs.si/usr/mark/tmp/dtrace2.out.bz2

(the dtrace2.out is probably cleaner, I made sure no other service was running except my sshd and syslog)

Not really sure what I'm looking at, but a couple of large entries stand out:

  $ awk '/^ .*[0-9]+ .*[0-9]$/' dtrace2.out | sort -k1n | tail -5
  114688 138
  114688 138
  114688 138
  114688 138
  114688 138

Thanks in advance for looking into it,

Mark

2018-08-01 09:12, myself wrote:

On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE and the situation has not improved. Also turned off all services. ZFS is still leaking memory at about 30 MB per hour, until the host runs out of memory and swap space and crashes, unless I reboot it first every four days.

Any advice before I try to get rid of that faulted disk with a pool (or downgrade to 10.3, which was stable)?

2018-08-01 00:09, Mark Johnston wrote:

If you're able to use dtrace, it would be useful to try tracking allocations with the solaris tag:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'

Try letting that run for one minute, then kill it and paste the output. Ideally the host will be as close to idle as possible while still demonstrating the leak.

Good and bad news:

The suggested dtrace command bails out:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'
  dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
  Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
    file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3330.
  Abort trap

But I did get one step further, localizing the culprit.

I realized that the "solaris" malloc count goes up in sync with the 'telegraf' monitoring service polls, which also has a ZFS plugin which monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp' periodically.

So after stopping telegraf (and other remaining services), the 'vmstat -m' shows that the InUse count for "solaris" goes up by 552 every time that I run "zpool list -Hp":

  # (while true; do zpool list -Hp >/dev/null; vmstat -m | \
      fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
  6664427
  541
  552
  552
  552
  552
  552
  552
  552
  552
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE and the situation has not improved. Also turned off all services. ZFS is still leaking memory at about 30 MB per hour, until the host runs out of memory and swap space and crashes, unless I reboot it first every four days.

Any advice before I try to get rid of that faulted disk with a pool (or downgrade to 10.3, which was stable)?

2018-08-01 00:09, Mark Johnston wrote:

If you're able to use dtrace, it would be useful to try tracking allocations with the solaris tag:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'

Try letting that run for one minute, then kill it and paste the output. Ideally the host will be as close to idle as possible while still demonstrating the leak.

Good and bad news:

The suggested dtrace command bails out:

  # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
    dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'
  dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
  Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
    file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3330.
  Abort trap

But I did get one step further, localizing the culprit.

I realized that the "solaris" malloc count goes up in sync with the 'telegraf' monitoring service polls, which also has a ZFS plugin which monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp' periodically.

So after stopping telegraf (and other remaining services), the 'vmstat -m' shows that the InUse count for "solaris" goes up by 552 every time that I run "zpool list -Hp":

  # (while true; do zpool list -Hp >/dev/null; vmstat -m | \
      fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
  6664427
  541
  552
  552
  552
  552
  552
  552
  552
  552
  556
  548
  552
  552
  552
  552
  552
  552
  552
  552
  552

  # zpool list -Hp
  floki 68719476736 37354102272 31365374464 - - 49% 54 1.00x ONLINE -
  stuff - - - - - - - - UNAVAIL -

Mark
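Since the leak in this thread only shows up when a defunct pool is present, it may help to flag such pools from the same `zpool list -Hp` output that the monitoring plugin already collects. The following Python sketch does that; unhealthy_pools is a hypothetical helper, and it assumes the column layout shown in the output above, where the health state is the second-to-last field. Other zpool versions may order the columns differently, so treat the field position as an assumption to verify.

```python
def unhealthy_pools(zpool_list_output):
    """Return (name, health) pairs for pools whose health is not ONLINE.

    Assumes 'zpool list -Hp' style lines: whitespace- or tab-separated
    fields with the pool name first and the health state second to last,
    as in the sample output quoted in this thread.
    """
    pools = []
    for line in zpool_list_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, health = fields[0], fields[-2]
        if health != "ONLINE":
            pools.append((name, health))
    return pools
```

Fed the two lines shown above, this returns [("stuff", "UNAVAIL")], which is the pool suspected of triggering the leak.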
Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE and the situation has not improved. Also turned off all services. ZFS is still leaking memory at about 30 MB per hour, until the host runs out of memory and swap space and crashes, unless I reboot it first every four days.

Any advice before I try to get rid of that faulted disk with a pool (or downgrade to 10.3, which was stable)?

Mark

2018-07-23 17:12, myself wrote:

After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11 (amd64), ZFS is gradually eating up all memory, so that it crashes every few days when the memory is completely exhausted (after swapping heavily for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC to 1.8 GB the machine can now stay up a bit longer, but in four days all the memory is used up. The machine is lightly loaded, it runs a bind resolver and a lightly used web server, the ps output does not show any excessive memory use by any process.

During the last survival period I ran vmstat -m every second and logged results. What caught my eye was the 'solaris' entry, which seems to explain all the exhaustion.

The MemUse for the solaris entry starts modestly, e.g. after a few hours of uptime:

  $ vmstat -m
   :
           Type    InUse  MemUse HighUse  Requests  Size(s)
        solaris  3141552 225178K       -  12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

... but this number keeps steadily growing. After about four days, shortly before a crash, it grew to 2.5 GB, which gets dangerously close to all the available memory:

        solaris 39359484 2652696K      - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

Plotting the 'solaris' MemUse entry vs. wall time in seconds, one can see a steady linear growth, about 25 MB per hour. On a fine-resolution small scale the step size seems to be one small step increase per about 6 seconds. All steps are small, but not all are the same size.

The only thing (in my mind) that distinguishes this host from others running 11.1 seems to be that one of the two ZFS pools is down because its disk is broken. This is a scratch data pool, not otherwise in use. The pool with the OS is healthy.

The syslog shows entries like the following periodically:

  Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
  Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
  Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354

The 'zpool status -v' on this pool shows:

    pool: stuff
   state: UNAVAIL
  status: One or more devices could not be opened.  There are insufficient
          replicas for the pool to continue functioning.
  action: Attach the missing device and online it using 'zpool online'.
     see: http://illumos.org/msg/ZFS-8000-3C
    scan: none requested
  config:

          NAME                    STATE     READ WRITE CKSUM
          stuff                   UNAVAIL      0     0     0
            11732693005294113354  UNAVAIL      0     0     0  was /dev/da2

The same machine with this broken pool could previously survive indefinitely under FreeBSD 10.3.

So, could this be the reason for memory depletion? Any fixes for that? Any more tests suggested to perform before I try to get rid of this pool?

Mark
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11 (amd64), ZFS is gradually eating up all memory, so that it crashes every few days when the memory is completely exhausted (after swapping heavily for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC to 1.8 GB the machine can now stay up a bit longer, but in four days all the memory is used up. The machine is lightly loaded, it runs a bind resolver and a lightly used web server, the ps output does not show any excessive memory use by any process.

During the last survival period I ran vmstat -m every second and logged results. What caught my eye was the 'solaris' entry, which seems to explain all the exhaustion.

The MemUse for the solaris entry starts modestly, e.g. after a few hours of uptime:

  $ vmstat -m
   :
           Type    InUse  MemUse HighUse  Requests  Size(s)
        solaris  3141552 225178K       -  12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

... but this number keeps steadily growing. After about four days, shortly before a crash, it grew to 2.5 GB, which gets dangerously close to all the available memory:

        solaris 39359484 2652696K      - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

Plotting the 'solaris' MemUse entry vs. wall time in seconds, one can see a steady linear growth, about 25 MB per hour. On a fine-resolution small scale the step size seems to be one small step increase per about 6 seconds. All steps are small, but not all are the same size.

The only thing (in my mind) that distinguishes this host from others running 11.1 seems to be that one of the two ZFS pools is down because its disk is broken. This is a scratch data pool, not otherwise in use. The pool with the OS is healthy.

The syslog shows entries like the following periodically:

  Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
  Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
  Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354

The 'zpool status -v' on this pool shows:

    pool: stuff
   state: UNAVAIL
  status: One or more devices could not be opened.  There are insufficient
          replicas for the pool to continue functioning.
  action: Attach the missing device and online it using 'zpool online'.
     see: http://illumos.org/msg/ZFS-8000-3C
    scan: none requested
  config:

          NAME                    STATE     READ WRITE CKSUM
          stuff                   UNAVAIL      0     0     0
            11732693005294113354  UNAVAIL      0     0     0  was /dev/da2

The same machine with this broken pool could previously survive indefinitely under FreeBSD 10.3.

So, could this be the reason for memory depletion? Any fixes for that? Any more tests suggested to perform before I try to get rid of this pool?

Mark
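The "about 25 MB per hour" figure in the message above can be cross-checked from the two quoted MemUse values. The Python sketch below does the arithmetic under two labeled assumptions: that the K suffix printed by vmstat means units of 1024 bytes, and that roughly 96 hours separate the two samples (the message only says "a few hours of uptime" and "about four days"). The helper names are hypothetical.

```python
def growth_mib_per_hour(start_kib, end_kib, elapsed_hours):
    """Average growth in MiB per hour between two MemUse readings given in KiB."""
    return (end_kib - start_kib) / 1024 / elapsed_hours


def kib_per_step(rate_mib_per_hour, seconds_per_step):
    """Average size of one growth step in KiB, given one step every N seconds."""
    steps_per_hour = 3600 / seconds_per_step
    return rate_mib_per_hour * 1024 / steps_per_hour


START_KIB = 225178    # MemUse after a few hours of uptime (from the message)
END_KIB = 2652696     # MemUse shortly before the crash (from the message)
ASSUMED_HOURS = 96    # assumption: about four days between the two readings

rate = growth_mib_per_hour(START_KIB, END_KIB, ASSUMED_HOURS)
step = kib_per_step(rate, 6)  # "one small step increase per about 6 seconds"
```

Under these assumptions the rate works out to roughly 24.7 MiB per hour, consistent with the reported figure, and the average step is on the order of 42 KiB. A shorter assumed interval would give a somewhat higher rate.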
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
Just a short report to a thread I started when 11.1 came out. This machine would stall in a busy loop while attaching disks during boot. Rebuilding a kernel with EARLY_AP_STARTUP disabled avoided the problem. This was the situation through the whole 11.1 life cycle (i.e. patch releases did not help).

Today I have upgraded this host to 11.2-BETA2, and it is no longer necessary to disable EARLY_AP_STARTUP. Good, thanks!

Mark

2017-07-20 02:03, Mark Johnston wrote:

One thing to try at this point would be to disable EARLY_AP_STARTUP in the kernel config. That is, take a configuration with which you're able to reproduce the hang during boot, and remove "options EARLY_AP_STARTUP".

2017-07-20 15:45, Mark Martinec wrote:

Done. And it avoids the problem altogether! Thanks. Tried a reboot several times and it succeeds every time. Here is all that I had in a config file for building a kernel, i.e. I took away the 'options DDB' which also seemingly avoided the problem:

  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP

This feature has a fairly large impact on the bootup process and has had a few problems that manifested as hangs during boot. There was at least one other case where an innocuous change to the kernel configuration "fixed" the problem by introducing some second-order effect (causing kernel threads to be scheduled in a different order, for instance).

[...]
Should patch releases to stable 11.1 (errata) include fixes for kernel crashes?
Should patch releases to stable 11.1 (errata) include fixes for kernel crashes?

Referring to:

Bug 59 - 11.1-R crashing in sendfile syscall, as used by a uwsgi process
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=59
https://svnweb.freebsd.org/base?view=revision=323634

So my background story is: I fell for this trick twice now, upgrading an 11.1-p3 to -p4, and now a -p4 to -p5, forgetting that I need a patched kernel for our web servers (quite a common setup: nginx+uwsgi); so after an upgrade the crashes returned.

I know, my fault, but I wonder, shouldn't a fix for kernel crashes end up in a patch release of a stable version, with an errata notice?

Mark
Re: 11.1 coredumping in sendfile, as used by a uwsgi process
2017-09-12 15:46, Steven Hartland wrote:

Could you post the decoded crash info from /var/crash/...

Using crashinfo(8) I suppose?

I would also create a bug report: https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base%20System

Done (with additional info):

Bug 59 - 11.1-R crashing in sendfile syscall, as used by a uwsgi process
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=59

Mark
11.1 coredumping in sendfile, as used by a uwsgi process
A couple of days ago I upgraded an Intel box from FreeBSD 10.3 to 11.1-RELEASE-p1, and reinstalled all the packages, built on the same OS version. This host is running the nginx web server with uwsgi as a backend. The file system is ZFS (recent as of 10.3, zpool not yet upgraded to new 11.1 features).

Ever since the upgrade, this host is crashing/rebooting two or three times per day. The reported crash location is always the same: it is in a sendfile function (same addresses each time), the running process is always uwsgi:

  Sep 12 15:03:12 xxx syslogd: kernel boot file is /boot/kernel/kernel
  Sep 12 15:03:12 xxx kernel: [22677]
  Sep 12 15:03:12 xxx kernel: [22677]
  Sep 12 15:03:12 xxx kernel: [22677] Fatal trap 12: page fault while in kernel mode
  Sep 12 15:03:12 xxx kernel: [22677] cpuid = 7; apic id = 07
  Sep 12 15:03:12 xxx kernel: [22677] fault virtual address = 0xe8
  Sep 12 15:03:12 xxx kernel: [22677] fault code = supervisor write data, page not present
  Sep 12 15:03:12 xxx kernel: [22677] instruction pointer = 0x20:0x80afefb2
  Sep 12 15:03:12 xxx kernel: [22677] stack pointer = 0x28:0xfe02397da5a0
  Sep 12 15:03:12 xxx kernel: [22677] frame pointer = 0x28:0xfe02397da5e0
  Sep 12 15:03:12 xxx kernel: [22677] code segment = base 0x0, limit 0xf, type 0x1b
  Sep 12 15:03:12 xxx kernel: [22677] = DPL 0, pres 1, long 1, def32 0, gran 1
  Sep 12 15:03:12 xxx kernel: [22677] processor eflags = interrupt enabled, resume, IOPL = 0
  Sep 12 15:03:12 xxx kernel: [22677] current process = 34504 (uwsgi)
  Sep 12 15:03:12 xxx kernel: [22677] trap number = 12
  Sep 12 15:03:12 xxx kernel: [22677] panic: page fault
  Sep 12 15:03:12 xxx kernel: [22677] cpuid = 7
  Sep 12 15:03:12 xxx kernel: [22677] KDB: stack backtrace:
  Sep 12 15:03:12 xxx kernel: [22677] #0 0x80aada97 at kdb_backtrace+0x67
  Sep 12 15:03:12 xxx kernel: [22677] #1 0x80a6bb76 at vpanic+0x186
  Sep 12 15:03:12 xxx kernel: [22677] #2 0x80a6b9e3 at panic+0x43
  Sep 12 15:03:12 xxx kernel: [22677] #3 0x80edf832 at trap_fatal+0x322
  Sep 12 15:03:12 xxx kernel: [22677] #4 0x80edf889 at trap_pfault+0x49
  Sep 12 15:03:12 xxx kernel: [22677] #5 0x80edf0c6 at trap+0x286
  Sep 12 15:03:12 xxx kernel: [22677] #6 0x80ec3641 at calltrap+0x8
  Sep 12 15:03:12 xxx kernel: [22677] #7 0x80a6a2af at sendfile_iodone+0xbf
  Sep 12 15:03:12 xxx kernel: [22677] #8 0x80a69eae at vn_sendfile+0x124e
  Sep 12 15:03:12 xxx kernel: [22677] #9 0x80a6a4dd at sendfile+0x13d
  Sep 12 15:03:12 xxx kernel: [22677] #10 0x80ee0394 at amd64_syscall+0x6c4
  Sep 12 15:03:12 xxx kernel: [22677] #11 0x80ec392b at Xfast_syscall+0xfb
  Sep 12 15:03:12 xxx kernel: [22677] Uptime: 6h17m57s
  Sep 12 15:03:12 xxx kernel: [22677] Dumping 983 out of 8129 MB:..2%..12%..22%..31%..41%..51%..61%..72%..82%..92%Copyright (c) 1992-2017 The FreeBSD Project.
  Sep 12 15:03:12 xxx kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
  Sep 12 15:03:12 xxx kernel: The Regents of the University of California. All rights reserved.
  Sep 12 15:03:12 xxx kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
  Sep 12 15:03:12 xxx kernel: FreeBSD 11.1-RELEASE-p1 #0: Wed Aug 9 11:55:48 UTC 2017
  [...]
  Sep 12 15:03:12 xxx savecore: reboot after panic: page fault
  Sep 12 15:03:12 xxx savecore: writing core to /var/crash/vmcore.4

This host with the same services was very stable under 10.3, same ZFS pool. We have several other hosts running 11.1 with no incidents, running various services (but admittedly no other host has a comparably busy web server).

Interestingly, nginx has its sendfile feature enabled too, but this does not cause a crash (on this or other hosts); only the sendfile as used by uwsgi seems to be the problem. For the time being I have disabled the use of sendfile in uwsgi, we'll see if this avoids the trouble.

Suggestions?

Mark
syslogd include directive reads but disregards all but the last included .conf file
Could somebody please check why the new 'include' 11.1 feature of syslogd does not work when given more than one file to include... Any chance of fixing this as a patch release to 11.1?

The 11.1 release brought a very desirable feature to syslogd:

  $ man syslog.conf
   :
  A special include keyword can be used to include all files with names
  ending in '.conf' and not beginning with a '.' contained in the
  directory following the keyword.

but ... It turns out that of all the *.conf files found in the included directory /etc/syslog.d, only entries found in the (alphabetically) *last* file there are taken into account, all other entries in remaining included files are just ignored.

[...]

Details at:

Bug 221742 syslogd include directive reads but disregards all but the last included .conf file
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221742

Mark
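To see which files the rule quoted from the man page would pick up in a given directory, the name filter can be sketched in Python. This only models the documented naming rule (ends in '.conf', does not begin with '.'); included_conf_files is a hypothetical helper, and the alphabetical sort is an assumption made for illustration, since the report above only says that the alphabetically last file is the one whose entries survive.

```python
def included_conf_files(directory_entries):
    """Return the file names an 'include' directory would contribute.

    Keeps names ending in '.conf' that do not begin with a '.', per the
    syslog.conf(5) text quoted above. Sorting is an assumption here.
    """
    return sorted(
        name
        for name in directory_entries
        if name.endswith(".conf") and not name.startswith(".")
    )
```

For a directory holding b.conf, .hidden.conf, a.conf and notes.txt this yields ["a.conf", "b.conf"]; with the bug described above, only the entries from the last of these would take effect.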
Re: [USB] hang after upgrade from 11.0 to 11.1, ZFS or callout() related?
But this is all for 11.0, on 11.1 it hangs and I cannot look it up: Does it also hang if you choose "Safe mode" in the loader dialog? Mark
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-24 18:25, Ken Merry wrote: It is possible that the change I MFCed today (r321207 in head, r321415 in stable/11) is related, but Mark will have to boot his machine with the fix to see if it makes any difference. What happened in my case on one particular machine (not on most machines in our lab running the same code) was that mps_wait_command() / mpr_wait_command() would not wait the full 60 seconds for a write to the DPM table (Driver Persistent Mapping) table in the controller. So, it reported that there was a timeout. [...] Eliminating bogus timeouts will eliminate most all of the sources of those panics anyway. Took r321415 from stable/11 and applied it to 11.1-RC3 - and it makes no difference to booting: still hangs attempting to attach da0, with a spinning CPU (according to fan speed). Booting in safe mode, or with EARLY_AP_STARTUP disabled avoids the problem. There is a secondary bug that is still in the mps(4) / mpr(4) drivers when a timeout does happen — the error recovery code in the wait_command() routine reinitializes the controller, which clears out all the commands. When the wait_command() routine returns, the command passed in has been freed, but the caller doesn’t know that. So the caller (it happens in a number of places) dereferences a pointer to freed memory and the kernel panics. I’m planning to fix that bug, too, if slm@ doesn’t get to it first, I’ve just had other bugs to fix first. No panics in my case, just hangs. Mark ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
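The secondary bug described above (a routine frees the command during timeout recovery while its caller still holds the pointer) can be illustrated in isolation. The following is a minimal userland sketch and not the mps(4) / mpr(4) code: `struct cmd`, `reinit_controller()` and this `wait_command()` signature are hypothetical names chosen for illustration. It shows one possible way to make the free visible to the caller, namely passing the command pointer by reference so that it can be set to NULL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical command structure, for illustration only. */
struct cmd {
	int  id;
	bool completed;
};

/*
 * Hypothetical stand-in for error recovery that reinitializes the
 * controller: it releases the outstanding command and clears the
 * caller's pointer so that it cannot be dereferenced afterwards.
 */
static void
reinit_controller(struct cmd **cmp)
{
	free(*cmp);
	*cmp = NULL;
}

/*
 * Wait for a command. The timed_out flag simulates a timeout.
 * Returns 0 on completion and -1 on timeout; after a timeout
 * *cmp is NULL because the command no longer exists.
 */
int
wait_command(struct cmd **cmp, bool timed_out)
{
	if (timed_out) {
		reinit_controller(cmp);
		return (-1);
	}
	(*cmp)->completed = true;
	return (0);
}
```

A caller written against this sketch would check the return value (and, defensively, the pointer) before touching any field of the command, for example `if (wait_command(&cm, timed_out) != 0 || cm == NULL) return;`.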
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
Thanks! Tried it, and the message (or a backtrace) does not show during a boot of a generic (patched) kernel, at least not in the last 40 lines on the screen before the hang occurs. (It also does not show during a successful "Safe mode" boot.)

Btw (may or may not be relevant): after the above experiment I rebooted the machine in "Safe mode" (generic kernel, EARLY_AP_STARTUP enabled by default) and spent some time doing non-intensive interactive work on this host (web browsing, editor, shell, all under KDE). After about an hour the machine froze (clock display not updating, keyboard unresponsive, console virtual terminals inaccessible), so I had to reboot. Judging by the fan speed the machine was idle. The /var/log/messages does not show anything of interest before the freeze. All disks are under ZFS.

Can EARLY_AP_STARTUP have an effect also _after_ booting? This host never hung during normal work when EARLY_AP_STARTUP was disabled (or with 11.0 and earlier).

Mark
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-24 04:15, Mark Johnston wrote: Could you try re-enabling EARLY_AP_STARTUP, applying the patch at the end of this email, and see if the message "sleeping before eventtimer init" appears in the boot output? If it does, it'll be followed by a backtrace that might be useful for tracking down the hang. It might produce false positives, but we'll see.

Thanks! Tried it, and the message (or a backtrace) does not show during a boot of a generic (patched) kernel, at least not in the last 40 lines on the screen before the hang occurs. (It also does not show during a successful "Safe mode" boot.)

Mark
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-20 02:03, Mark Johnston wrote: One thing to try at this point would be to disable EARLY_AP_STARTUP in the kernel config. That is, take a configuration with which you're able to reproduce the hang during boot, and remove "options EARLY_AP_STARTUP".

Done. And it avoids the problem altogether! Thanks. Tried a reboot several times and it succeeds every time. Here is all that I had in a config file for building a kernel, i.e. I took away the 'options DDB' which also seemingly avoided the problem:

  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP

This feature has a fairly large impact on the bootup process and has had a few problems that manifested as hangs during boot. There was at least one other case where an innocuous change to the kernel configuration "fixed" the problem by introducing some second-order effect (causing kernel threads to be scheduled in a different order, for instance). Regardless of whether the suggestion above makes a difference, it would be helpful to see verbose dmesgs from both a clean boot and a boot that hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some assertions that will cause the system to panic when the hang occurs, making it easier to see what's going on.

Hmmm. I have now saved a couple of versions of /var/run/dmesg.boot (in boot_verbose mode) when EARLY_AP_STARTUP is disabled and the boot is successful. However, I don't know how to capture such a log when booting hangs, as I have no serial interface and the boot never completes. All I have is a screen photo of the last state when a hang occurs (showing ada disks successfully attached, followed immediately by the attempt to attach a da disk, which hangs).

Mark
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
More news on the matter. As reported yesterday, the locally built kernel with options INVARIANTS and DDB works fine and somehow avoids the trouble when attaching the da (mps) disks on an LSI controller, so today I wanted to get back to a reproducible hang - and sure enough, reverting to the generic kernel as distributed brings back the hang. So I tried rebuilding the kernel while experimenting with options like DDB and INVARIANTS.

A locally built GENERIC kernel behaves the same as the original kernel from the distribution (as installed by freebsd-update), so no surprises there. It hangs trying to attach the first of the da disks (after first successfully attaching all the ada disks). The alt ctrl esc key sequence is unable to enter the debugger when the hang occurs (possibly due to an unresponsive USB keyboard at that time), even though debug.kdb.break_to_debugger was set to 1 at the loader prompt. It needs the loader "Safe mode" to be able to boot.

Next, a locally built kernel with DDB and INVARIANTS works well (the remaining options come from an included GENERIC). Now the funny part: a locally built kernel with just the DDB option (and the rest included from GENERIC) *also* works well. Somehow the DDB option makes a difference, even though the kernel debugger is never activated.

To reiterate: at the time of a hang the CPU fan starts revving up, and the USB keyboard is unresponsive (scroll mode cannot be entered, caps lock and num lock do not toggle their LED indicators, alt ctrl esc does not activate the kernel debugger). Loader "Safe mode" avoids the problem (presumably by disabling SMP).

Meanwhile I have successfully upgraded two other similar hosts from 11.0 to 11.1-RC3, no surprises there (but they do not have the same disk controller). Not sure what to try next.

Mark
Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-18 01:24, Mark Johnston wrote: Are you able to break into the debugger at this point? Try setting debug.kdb.break_to_debugger=1 and debug.kdb.alt_break_to_debugger=1 at the loader prompt, and hit the break key, or the key sequence ~ ctrl-b once the hang occurs. At the debugger prompt, try "bt" and "show allpcpu" to start. Thank you for a prompt and good suggestion! I spent an afternoon fiddling with the machine, with mixed results. Your suggestion to break into debugger did not work, there was no reaction to or to ~ ctrl-b. So I embarked on rebuilding the RC3 kernel with options KDB options DDB options BREAK_TO_DEBUGGER options ALT_BREAK_TO_DEBUGGER options INVARIANTS options INVARIANT_SUPPORT options WITNESS options WITNESS_SKIPSPIN but then I realized the key is mapped-to by: alt ctrl , which now does break into debugger - but not so early where the holdup occurs. The WITNESS produced some LOR warnings, but that is probably ok. I came across a trace just before the problem area, but it flows by so fast on a vt console and only the last 40 or so lines remain on the screen (I have a photo), which do not look like revealing much. Unfortunately this machine does not have a serial interface. So in my last attempt I rebuilt a kernel with INVARIANTS but without WITNESS - and now I cannot reproduce the problem, with or without a "safe mode". What is interesting here that now the da0..da3 disks are attached first, and only then the ada disks - and even within the group of disks on the same controller their order has been shuffled - no idea what could have caused it - and it may have avoided the problem by doing so. Will play some more with this tomorrow... Mark On Tue, Jul 18, 2017 at 01:01:16AM +0200, Mark Martinec wrote: Upgrading 11.0-RELEASE-p11 to 11.1-RC3 using the usual freebsd-update upgrade method I ended up with a system which gets stuck while trying to attach the second set of disks. 
This happened already after the first phase of the upgrade procedure (installing and re-booting with a new kernel). The first set of disks (ada0 .. ada2) are attached successfully, also a cd0, but then when the first of the set of four (a regular spinning disk) on an LSI controller is to be attached, the boot procedure just gets stuck there: kernel: ada1: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes) kernel: ada1: Command Queueing enabled kernel: ada1: 305245MB (625142448 512 byte sectors) kernel: ada2 at ahcich6 bus 0 scbus8 target 0 lun 0 kernel: ada2: ATA8-ACS SATA 3.x device kernel: ada2: Serial Number OCZ-O1L6RF591R09Z5C8 kernel: ada2: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes) kernel: ada2: Command Queueing enabled kernel: ada2: 114473MB (234441648 512 byte sectors) kernel: ada2: quirks=0x1<4K> kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0 (stuck here, keyboard not responding, fans rising their pitch, presumably CPU is spinning) [...] ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
Upgrading 11.0-RELEASE-p11 to 11.1-RC3 using the usual freebsd-update upgrade method I ended up with a system which gets stuck while trying to attach the second set of disks. This happened already after the first phase of the upgrade procedure (installing and re-booting with a new kernel). The first set of disks (ada0 .. ada2) are attached successfully, also a cd0, but then when the first of the set of four (a regular spinning disk) on an LSI controller is to be attached, the boot procedure just gets stuck there: kernel: ada1: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes) kernel: ada1: Command Queueing enabled kernel: ada1: 305245MB (625142448 512 byte sectors) kernel: ada2 at ahcich6 bus 0 scbus8 target 0 lun 0 kernel: ada2: ATA8-ACS SATA 3.x device kernel: ada2: Serial Number OCZ-O1L6RF591R09Z5C8 kernel: ada2: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes) kernel: ada2: Command Queueing enabled kernel: ada2: 114473MB (234441648 512 byte sectors) kernel: ada2: quirks=0x1<4K> kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0 (stuck here, keyboard not responding, fans rising their pitch, presumably CPU is spinning) (instead of the normal continuation like: kernel: da0: Fixed Direct Access SPC-4 SCSI device kernel: da0: Serial Number kernel: da0: 600.000MB/s transfers kernel: da0: Command Queueing enabled kernel: da0: 1907729MB (3907029168 512 byte sectors) ) The controller for da0 .. da3 is an LSI: kernel: mps0: port 0x4000-0x40ff mem 0xd174-0xd1743fff,0xd130-0xd133 irq 16 at device 0.0 on pci1 kernel: mps0: Firmware: 14.00.01.00, Driver: 21.02.00.00-fbsd kernel: mps0: IOCCapabilities: 185c[...] kernel: mps0: SAS Address for SATA device = a4a4843003d0cf79 kernel: mps0: SAS Address from SATA device = a4a4843003d0cf79 kernel: mps0: SAS Address for SATA device = d3d48904eddff0d5 kernel: mps0: SAS Address from SATA device = d3d48904eddff0d5 [...] 
kernel: mps0: SAS Address for SATA device = 2a021c07585c665b kernel: mps0: SAS Address from SATA device = 2a021c07585c665b kernel: mps0: SAS Address for SATA device = 2a021c0758637b7c kernel: mps0: SAS Address from SATA device = 2a021c0758637b7c This host in this configuration worked perfectly well with 11.0 and many older versions of the OS. After some frustration I found out that the system can boot fine if a boot loader option "Safe mode" is set. This way I successfully finished the upgrade procedure (installing world). Playing with loader options that the "Safe mode" turns on ( /boot/menu-commands.4th ) it seems that kern.smp.disabled=1 is the crucial option, although my attempts at ruling out remaining options of the "Safe mode" turned out inconclusive - perhaps there is some random/race involved. Anyway, in "Safe mode" the machine always boots normally and attaches all disks. This experience is much like described in: https://forums.freebsd.org/threads/56524/ where the poster ended up disabling SMP to be able to have a working host. It is also somewhat similar to: https://lists.freebsd.org/pipermail/freebsd-hackers/2017-July/051258.html where a FreeBSD 11.1 prerelease only boots on a single-CPU AWS host, but fails to boot on a 2-core CPU, with various symptoms, including: ( https://lists.freebsd.org/pipermail/freebsd-hackers/2017-July/051260.html ) Feeding entropy: . spin lock 0x80db45c0 (smp rendezvous) held by 0xf80004378560 (tid 100074) too long timeout stopping cpus panic: spin lock held too long Please advise, thanks Mark ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: net.inet.udp.log_in_vain strange syslog reports
2017-02-06 18:04, Eric van Gyzen wrote: On 02/06/2017 10:19, Mark Martinec wrote: Hope the fix finds its way into 11.1 (or better yet, as a patch level in 10.0). Should I open a bug report? It will quite likely get into 11.1. As for a 10.x patch, you would have to ask re@ (I think), but I doubt it. These messages are really just informative and can't be used for any filtering, since the source address could be spoofed. I meant to say 11.0-p*, but nevermind. In a similar vein, I noticed also the following in our logs, with net.inet.tcp.log_in_vain=1. Looks like messages got concatenated somehow: Jan 25 01:37:53 mildred kernel: TCP: [2607:ff10:c5:509a::10]:26459 to [2001:1470:ff80::80:16]:4911 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:14898 to [2001:1470:ff80::80:16]:5222 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:58022 to [2001:1470:ff80::80:16]:9981 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:10243 tcpflags 0x2; tcp_input: Connection attempt to closedport Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:30991 to [2001:1470:ff80::80:16]:8554 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:20012 to [2001:1470:ff80::80:16]:8443 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:14166 to [2001:1470:ff80::80:16]: tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:8010 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 25 23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:47957 to [2001:1470:ff80::80:16]:3460 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:34680 to [2001:1470:ff80::80:16]:13579 tcpflags 0x2; tcp_input: Connection attempt to closedport Jan 25 
23:55:09 mildred kernel: TCP: [2607:ff10:c5:509a::10]:20012 to [2001:1470:ff80::80:16]:9001 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:30651 to [2001:1470:ff80::80:16]:9000 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 12 04:50:58 mildred kernel: TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::80:16]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:35372 to [2001:1470:ff80::80:16]:62078 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:58022 to [2001:1470:ff80::80:16]:9200 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:8181 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:36877 to [2001:1470:ff80::80:16]:7218 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:7071 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:30651 to [2001:1470:ff80::80:16]:9000 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:36877 to [2001:1470:ff80::80:16]:2332 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 18 03:01:59 mildred kernel: TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:7548 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::10]:46640 to [2001:1470:ff80::80:16]:5986 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 19 02:52:34 mildred kernel: TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::80:16]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:35372 to [2001:1470:ff80::80:16]:62078 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 19 02:52:34 mildred kernel: TCP: 
[2607:ff10:c5:509a::1]:61788 to [2001:1470:ff80::80:16]:2 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:34680 to [2001:1470:ff80::80:16]:10243 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 19 02:52:34 mildred kernel: TCP: [2607:ff10:c5:509a::1]:41249 to [2001:1470:ff80::80:16]:44818 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:49717 to [2001:1470:ff80::80:16]:8649 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 20 04:49:15 mildred kernel: TCP: [2607:ff10:c5:509a::1]:36877 to [2001:1470:ff80::143:1]:50100 tcpflags 0x2; tcp_input: Connection attempt to closed TCP: [2607:ff10:c5:509a::1]:42266 to [2001:1470:ff80::143:1]:49153 tcpflags 0x2; tcp_input: Connection attempt to closed port Jan 20 10:03:52 mildred kernel: TCP: [2607:ff10:c5:509a::10]:31430 to [2001:1470:ff80::143:1]:8099 tcpflags 0x2; tcp_input: Connection
GELI with integrity verification on swap
After experiencing an unexplained restart on one host (11.0-RELEASE-p7), which could be tied to a problem with a swap device (swap on a dedicated gpt partition), I'm investigating options for adding some checksuming to swap storage. I understand that swap on ZFS is not a way to go, and that a gmirror does not provide any checksuming on data, it seems to me the only option is to use GELI with integrity verification (authentication) enabled (aalgo). Following advice in https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/swap-encrypting.html I ended up with the following in /etc/fstab (on a different host, same OS): /dev/gpt/sw1.eli none swap sw,sectorsize=4096,aalgo=HMAC/SHA256 0 0 /dev/gpt/sw2.eli none swap sw,sectorsize=4096,aalgo=HMAC/SHA256 0 0 which seems to work fine, but spawns some questions: 1) On the first manual reboot after adding the above options, there was a kernel panic. Subsequent reboot(s) were successful. Is there any known problem with using integrity verification on GELI for swap? 2) During boot the log shows a short flurry of messages like: kernel: GEOM_ELI: Device gpt/sw1.eli created. kernel: GEOM_ELI: Encryption: AES-XTS 128 kernel: GEOM_ELI: Integrity: HMAC/SHA256 kernel: GEOM_ELI: Crypto: software kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 16384 bytes of data at offset 11452985344. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453235200. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453235200. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 4096. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 0. 
kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 4096 bytes of data at offset 11453239296. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 65536. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 8192. kernel: GEOM_ELI: gpt/sw1.eli: Failed to authenticate 8192 bytes of data at offset 0. which, according to geli(8) man page, could be normal, as these blocks were never written to beforehand and contain random stuff. As the geli swap device is supposed to be ephemeral (Flags: ONETIME, W-DETACH, AUTH, W-OPEN), there is no way to initialize blocks on a swap device on boot. So, are these messages really safe to be ignored? Which brings us another, perhaps more important question: what business does a kernel has to do READING from a swap device, blocks which never have been written to before by this incarnation of the kernel??? 3) Considering that the underlying device is a 4k sectored device, and that HMAC/SHA256 takes some space (like 11%) on its own, what does it mean that the provider (gpt/sw1.eli) as well as the consumer (gpt/sw1) both show sector size 4096 ? Does that mean that all 4k alignment efforts are wasted when one enables integrity verification on GELI? Geom name: gpt/sw1.eli State: ACTIVE EncryptionAlgorithm: AES-XTS KeyLength: 128 AuthenticationAlgorithm: HMAC/SHA256 Crypto: software Version: 7 Flags: ONETIME, W-DETACH, AUTH, W-OPEN KeysAllocated: 24 KeysTotal: 24 Providers: 1. Name: gpt/sw1.eli Mediasize: 11453243392 (11G) Sectorsize: 4096 Mode: r1w1e0 Consumers: 1. Name: gpt/sw1 Mediasize: 12884901888 (12G) Sectorsize: 512 Stripesize: 4096 Stripeoffset: 0 Mode: r1w1e1 Mark ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
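The two Mediasize values in the geli list output above can be related by a short calculation, sketched here in C. It rests on an assumption that is not stated in the message: that with authentication enabled each 512-byte sector of the underlying provider stores a 32-byte HMAC/SHA256 plus 480 bytes of data, so one 4096-byte logical sector occupies nine 512-byte sectors (4608 bytes). The function name and its parameters are hypothetical, and any sector reserved for metadata is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the usable size of an authenticated provider.
 * Assumed layout (not taken from the message): every physical
 * sector holds hmac_len bytes of HMAC followed by data, and a
 * logical sector is spread over as many whole physical sectors
 * as are needed to hold its data.
 */
uint64_t
geli_auth_mediasize(uint64_t consumer_bytes, uint32_t phys_sector,
    uint32_t logical_sector, uint32_t hmac_len)
{
	assert(phys_sector > hmac_len);

	/* Data bytes that fit in one physical sector: 512 - 32 = 480. */
	uint32_t data_per_phys = phys_sector - hmac_len;
	/* Physical sectors per logical sector: ceil(4096 / 480) = 9. */
	uint32_t phys_per_logical =
	    (logical_sector + data_per_phys - 1) / data_per_phys;
	/* Bytes consumed on disk per logical sector: 9 * 512 = 4608. */
	uint64_t stored_bytes = (uint64_t)phys_per_logical * phys_sector;

	/* Whole logical sectors that fit, expressed in usable bytes. */
	return (consumer_bytes / stored_bytes * logical_sector);
}
```

Called with the consumer values shown above, `geli_auth_mediasize(12884901888, 512, 4096, 32)`, the sketch gives 11453243392, which equals the Mediasize reported for gpt/sw1.eli. Under the same assumption 4096 of every 4608 bytes are usable (about 89%), in line with the roughly 11% overhead mentioned, and consecutive logical sectors would begin at multiples of 4608 bytes on the consumer rather than at multiples of 4096.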
Re: net.inet.udp.log_in_vain strange syslog reports
On 2017-02-02 12:55, Mark Martinec wrote: 11.0-RELEASE-p7, net.inet.udp.log_in_vain=1 The following syslog entries seem to indicate some buffer overruns in the reporting code (not all log lines are broken, just some). (the actual failed connection attempts were indeed there, it's just that the reported IP address is highly suspicious) Mark 2017-02-03 20:05, Eric van Gyzen wrote: There is no buffer overrun, so no cause for alarm. The problem is concurrent usage of a single string buffer by multiple threads. The buffer is inside inet_ntoa(), defined in sys/libkern/inet_ntoa.c. In this case, it is called from udp_input(). Would you like to test the following patch? Eric diff --git a/sys/netinet/udp_usrreq.c b/sys/netinet/udp_usrreq.c index 173c44c..ca2dda1 100644 --- a/sys/netinet/udp_usrreq.c +++ b/sys/netinet/udp_usrreq.c @@ -674,13 +674,13 @@ udp_input(struct mbuf **mp, int *offp, int proto) INPLOOKUP_RLOCKPCB, ifp, m); if (inp == NULL) { if (udp_log_in_vain) { - char buf[4*sizeof "123"]; + char src[4*sizeof "123"]; + char dst[4*sizeof "123"]; - strcpy(buf, inet_ntoa(ip->ip_dst)); log(LOG_INFO, "Connection attempt to UDP %s:%d from %s:%d\n", - buf, ntohs(uh->uh_dport), inet_ntoa(ip->ip_src), - ntohs(uh->uh_sport)); + inet_ntoa_r(ip->ip_dst, dst), ntohs(uh->uh_dport), + inet_ntoa_r(ip->ip_src, src), ntohs(uh->uh_sport)); } UDPSTAT_INC(udps_noport); if (m->m_flags & (M_BCAST | M_MCAST)) { Thanks, the explanation makes sense and the patch looks good (mind the TABs). Running it now, expecting no surprises there. One minor nit: instead of a hack: char src[4*sizeof "123"]; char dst[4*sizeof "123"]; it would be cleaner and in sync with the equivalent code in sys/netinet6/udp6_usrreq.c to use the INET_ADDRSTRLEN constant (from sys/netinet/in.h, value 16): char src[INET_ADDRSTRLEN]; char dst[INET_ADDRSTRLEN]; Hope the fix finds its way into 11.1 (or better yet, as a patch level in 10.0). Should I open a bug report? 
Mark
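The idea behind the patch, one private buffer per address instead of a shared static one, can be sketched in userland C. This is an illustration and not the kernel code: it uses inet_ntop(3) where the kernel patch uses inet_ntoa_r(), and `format_in_vain()` is a hypothetical helper. Each address is formatted into its own INET_ADDRSTRLEN-sized buffer on the caller's stack, so concurrent callers cannot overwrite each other's text.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build a log_in_vain style line into out[]. Ports are expected in
 * network byte order, as they would be in a UDP header.
 * Returns the snprintf() result, or -1 if an address cannot be
 * formatted.
 */
int
format_in_vain(char *out, size_t outlen, struct in_addr dst,
    uint16_t dport, struct in_addr src, uint16_t sport)
{
	char dstbuf[INET_ADDRSTRLEN];	/* private to this call */
	char srcbuf[INET_ADDRSTRLEN];	/* private to this call */

	if (inet_ntop(AF_INET, &dst, dstbuf, sizeof(dstbuf)) == NULL ||
	    inet_ntop(AF_INET, &src, srcbuf, sizeof(srcbuf)) == NULL)
		return (-1);

	return (snprintf(out, outlen,
	    "Connection attempt to UDP %s:%u from %s:%u",
	    dstbuf, (unsigned)ntohs(dport),
	    srcbuf, (unsigned)ntohs(sport)));
}
```

Because both strings live in separate buffers, both can be passed to a single format call without the second conversion clobbering the first, which is the property the `src` / `dst` buffers in the patch provide.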
net.inet.udp.log_in_vain strange syslog reports
11.0-RELEASE-p7, net.inet.udp.log_in_vain=1

The following syslog entries seem to indicate some buffer overruns in the
reporting code (not all log lines are broken, just some).

(the actual failed connection attempts were indeed there, it's just that
the reported IP address is highly suspicious)

  Mark

Connection attempt to UDP 193.2.4.2:53 from 95.87.1521242:26375
Connection attempt to UDP 193.2.4.2:53 from 95.87.1521242:55806
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:54530
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:55504
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:54530
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:49526
Connection attempt to UDP 193.2.4.2:53 from 95.8231520242:56838
Connection attempt to UDP 193.2.4.2:53 from 95.8231520242:32768
Connection attempt to UDP 193.2.4.2:53 from 95.8241523242:5387
Connection attempt to UDP 193.2.4.2:53 from 95.8241523242:54530
Connection attempt to UDP 193.2.4.2:53 from 21.823154.242:46692
Connection attempt to UDP 193.2.4.2:53 from 21.823154.242:32768
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:51931
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:59881
Connection attempt to UDP 193.2.4.2:53 from 212873154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 212873154.242:53937
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:46692
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:52594
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:59639
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:50869
Connection attempt to UDP 193.2.4.2:53 from 19382.1587242:55806
Connection attempt to UDP 193.2.4.2:53 from 19382.1587242:54650
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:49871
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:57807
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:51931
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:52930
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:50869
Connection attempt to UDP 193.2.4.2:53 from 212823152.242:56838
Connection attempt to UDP 193.2.4.2:53 from 212823152.242:32768
Connection attempt to UDP 193.2.4.2:53 from 21.8231521242:63724
Connection attempt to UDP 193.2.4.2:53 from 21.8231521242:55222
Connection attempt to UDP 193.2.4.2:53 from 1948249.230.46:52599
Connection attempt to UDP 193.2.4.2:53 from 1948249.230.46:38496
Connection attempt to UDP 193.2.4.2:53 from 2128235.209.250:43608
Connection attempt to UDP 193.2.4.2:53 from 2128235.209.250:47257
Connection attempt to UDP 193.2.4.2:53 from 19387.1594242:54324
Connection attempt to UDP 193.2.4.2:53 from 19387.1594242:34613
Connection attempt to UDP 193.2.4.2:53 from 2128235.2124180:54377
Connection attempt to UDP 193.2.4.2:53 from 2128235.2124180:50869
Connection attempt to UDP 193.2.4.2:53 from 95.87.1547242:51698
Connection attempt to UDP 193.2.4.2:53 from 95.87.1547242:55222
Connection attempt to UDP 193.2.4.2:53 from 193.2.4.2242:55222
Connection attempt to UDP 193.2.4.2:53 from 19.8241523242:38496
Connection attempt to UDP 193.2.4.2:53 from 19.8241523242:55135
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:50370
Connection attempt to UDP 193.2.4.2:53 from 95.824154.242:64533
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:55222
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:56228
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:53424
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:61230
Connection attempt to UDP 193.2.4.2:53 from 212823154.242:59716
Connection attempt to UDP 193.2.4.2:53 from 212823154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:36439
Connection attempt to UDP 193.2.4.2:53 from 19387.154.242:60638
Connection attempt to UDP 193.2.4.2:53 from 19387.1521242:59008
Connection attempt to UDP 193.2.4.2:53 from 19387.1521242:35505
Connection attempt to UDP 193.2.4.2:53 from 19.824154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 19.824154.242:30943
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:51752
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:35165
Connection attempt to UDP 193.2.4.2:53 from 95.87.1587242:36439
Connection attempt to UDP 193.2.4.2:53 from 95.87.1587242:57311
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:36439
Connection attempt to UDP 193.2.4.2:53 from 19387.1587242:59280
Connection attempt to UDP 193.2.4.2:53 from 19487.154.242:53424
Connection attempt to UDP 193.2.4.2:53 from 19487.154.242:53247
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:35165
Connection attempt to UDP 193.2.4.2:53 from 95.823154.242:50473
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:56838
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:63658
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:54322
Connection attempt to UDP 193.2.4.2:53 from 21287.154.242:60637
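To pick the damaged entries out of a larger log, one could filter for source addresses that are not a plain dotted quad. The following is only a sketch: the function name malformed_sources is made up here, the second sample line (192.0.2.10) is an invented well-formed entry for contrast, and the pattern checks only the shape of the address (four fields of 1-3 digits), not the numeric range of each field.

```shell
#!/bin/sh
# malformed_sources (hypothetical helper): read syslog text on stdin and
# print only the log_in_vain UDP lines whose source is NOT of the form
# d.d.d.d:port with 1-3 digits per field.
malformed_sources() {
    grep 'Connection attempt to UDP' |
        grep -Ev 'from ([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+$'
}

# First line is taken from the report above; the second is an invented,
# well-formed entry and should be filtered out.
sample='Connection attempt to UDP 193.2.4.2:53 from 95.87.1521242:26375
Connection attempt to UDP 193.2.4.2:53 from 192.0.2.10:26375'

printf '%s\n' "$sample" | malformed_sources
```

In real use the input would come from the system log instead of the sample variable, e.g. by redirecting the log file into the function.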
vt(4) gibberish characters in 11.0 with nvidia
I have recently upgraded two hosts with identical nvidia boards (GeForce
GT 730, fresh driver nvidia-driver-375.26 from ports). One has been
following 11-STABLE every now and then, the other was on 10.3. They are
now on a recent 11-STABLE and at 11.0-RELEASE-p7, respectively.

The problem now showing on both hosts is that the virtual terminal
console driver (vt(4), no special settings) shows gibberish character
cells in an approximately 90x24 raster. The character cells are of
varying colors, some textured, so it looks as if the loaded font was
just random junk.

The boot screen sequence looks fine up to the moment when X starts.
The X11 screen (with KDE) is fine too. It's just the ttyv0-ttyv7
consoles which are broken.

I solved my immediate problem by adding hw.vga.textmode=1 to
/boot/loader.conf, so that the ttyv consoles now look fine (in fact much
nicer, as the vt fonts are pretty squashed and ugly to my eyes).

What puzzles me is what has changed recently, as both hosts were happily
using vt consoles in graphical mode until the upgrade.

(btw, I do have nvidia-modeset.ko and nvidia.ko loaded)

  Mark
Re: Does building linux packages under poudriere require linux compatibility emulation?
Thanks to all who responded, makes perfect sense now.

Paul Mather wrote:
> The only thing you need on the host is to have the linux kernel module
> loaded. (You don't need to have any Linux packages installed there.)
> The default setting in /usr/local/etc/poudriere.conf is to have
> NOLINUX=yes commented out, i.e., Linux support in Poudriere is enabled
> unless you explicitly disable it.
>
> The easiest way to load the linux kernel module on the host for use
> with Poudriere is to add it to the "kld_list" setting in /etc/rc.conf,
> e.g.,
>
>   kld_list="linux"

I do have NOLINUX=yes commented out, as is the default.

2017-01-14 22:45 Timon wrote:
> No, it doesn't require linuxulator to be configured, but it requires
> linux.ko (and linux64.ko if your host is amd64) to be loaded.
> Poudriere loads linux.ko, but doesn't load linux64. Try this patch:
>
> --- /usr/local/share/poudriere/common.sh.orig
> +++ /usr/local/share/poudriere/common.sh
> @@ -1686,6 +1686,9 @@ jail_start() {
>  		if [ "${arch}" = "i386" -o "${arch}" = "amd64" ]; then
>  			needfs="${needfs} linprocfs"
>  			sysctl -n compat.linux.osrelease >/dev/null 2>&1 || kldload linux
> +			if [ "${arch}" = "amd64" ]; then
> +				kldload linux64
> +			fi
>  		fi
>  	fi
>  	[ -n "${USE_TMPFS}" ] && needfs="${needfs} tmpfs"

Great, that seems to do the trick! (Actually, I just loaded the linux64
kernel module and did not try to apply the patch.) Thanks!

Looks like poudriere's common.sh needs this patch.

  Mark
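Combining the two suggestions in this message (the kld_list setting in /etc/rc.conf and loading linux64 on an amd64 host), the resulting rc.conf value can be sketched as below. The helper add_kld is hypothetical and only manipulates a string; it does not touch /etc/rc.conf or load any module.

```shell
#!/bin/sh
# add_kld LIST MODULE (hypothetical helper): print LIST with MODULE
# appended, unless MODULE is already one of the space-separated entries.
add_kld() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1 }$2" ;;
    esac
}

# Start from the value suggested for the host, then add the 64-bit module.
kld_list="linux"
kld_list=$(add_kld "$kld_list" linux64)

# This is the line one would expect to end up in /etc/rc.conf.
echo "kld_list=\"$kld_list\""
```

Run as is, this prints kld_list="linux linux64"; calling add_kld again with the same module leaves the list unchanged.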
Does building linux packages under poudriere require linux compatibility emulation?
When building packages under poudriere on 11.0-RELEASE-p7 (from a command
line in a terminal window) I'm noticing occasional streams of diagnostics:

  ELF binary type "3" not known.

which seem to be related to building some linux packages (example below,
parallel builds). Poudriere still reports success for these builds.

The host where poudriere is running does not have linux.ko loaded.
Does building such packages really require the linuxulator to be
configured on the build host?

  Mark

[00:23:56] >> [02][00:00:00] Starting build of www/linux-c6-qt47-webkit
[00:23:57] >> [13][00:00:00] Starting build of textproc/linux-c6-libxml2
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:09] >> [19][00:00:28] Finished build of textproc/linux-c6-aspell: Success
[00:24:11] >> [19][00:00:00] Starting build of devel/qt4-makeqpf
[00:24:11] >> [11][00:00:24] Finished build of security/linux-c6-openssl-compat: Success
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:12] >> [11][00:00:00] Starting build of x11-toolkits/vte
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:16] >> [07][00:00:24] Finished build of graphics/linux-c6-glx-utils: Success
ELF binary type "3" not known.
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:17] >> [07][00:00:00] Starting build of devel/qt4-qdoc3
ELF binary type "3" not known.
ELF binary type "3" not known.
[00:24:19] >> [13][00:00:22] Finished build of textproc/linux-c6-libxml2: Success
[00:24:19] >> [13][00:00:00] Starting build of graphics/goocanvas
[00:24:27] >> [10][00:02:26] Finished build of graphics/sdl_gfx: Success
[00:24:29] >> [10][00:00:00] Starting build of multimedia/mjpegtools
[00:24:34] >> [04][00:02:15] Finished build of devel/linux-c6-devtools: Success
[00:24:35] >> [04][00:00:00] Starting build of devel/linux-c6-ncurses-base
Re: Is System V IPC namespace still shared across jails?
2016-12-13 16:29, Alan Somers wrote:
> I've already added support for sysvmsg, sysvsem, and sysvshm to
> iocage. They all default to "new", which means you won't have to do
> anything special in your jail config to make postgres work. You can
> find the patch below. The only reason it hasn't been merged is because
> it can't (yet) be made to work correctly on the develop branch of
> iocage. But it works fine on the master branch.
>
> https://github.com/iocage/iocage/pull/370
>
> -Alan

Superb, appreciated!

  Mark

> On Tue, Dec 13, 2016 at 8:08 AM, Mark Martinec
> <mark.martinec+free...@ijs.si> wrote:
>> 2016-12-12 20:38, Christian Schwarz wrote:
>>> With the new jail parameters, new namespaces for SysV IPC are
>>> possible on FreeBSD 11.
>>>
>>> For those ezjail users, add something like this to the jail's config
>>> after creating it using 'ezjail-admin create':
>>>
>>>   export jail_postgres_parameters="sysvmsg=new sysvsem=new sysvshm=new"
>>>
>>> Cheers,
>>> Christian
>>
>> Thank you, this is it! I missed it in the JAIL(8) man page, and it is
>> not mentioned in the release notes.
>>
>> Now if only iocage would recognize the sysvmsg, sysvsem, and
>> sysvshm options:
>>
>>   # iocage set sysvmsg='new' xxx
>>     ERROR: Unsupported property: sysvmsg!
>>
>> I guess I should file a bug report.
>>
>>   Mark
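For a jail managed directly through jail.conf rather than ezjail or iocage, the same three parameters would presumably be set per jail. The sketch below merely prints such a stanza; the function name sysvipc_stanza, the jail name "postgres" and the path /jails/postgres are placeholders chosen for illustration, not taken from the thread.

```shell
#!/bin/sh
# sysvipc_stanza NAME PATH (hypothetical helper): print a jail.conf
# stanza that gives the jail its own SysV IPC key namespaces, using the
# sysvmsg/sysvsem/sysvshm parameters discussed above.
sysvipc_stanza() {
    name=$1
    path=$2
    cat <<EOF
$name {
    path = "$path";
    sysvmsg = new;
    sysvsem = new;
    sysvshm = new;
}
EOF
}

# Placeholder jail name and path, for illustration only.
sysvipc_stanza postgres /jails/postgres
```

The printed stanza is meant to be reviewed and merged into /etc/jail.conf by hand; other parameters a real jail needs (host name, addresses, exec.start and so on) are deliberately left out.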
Re: Is System V IPC namespace still shared across jails?
2016-12-12 20:38, Christian Schwarz wrote:
> With the new jail parameters, new namespaces for SysV IPC are possible
> on FreeBSD 11.
>
> For those ezjail users, add something like this to the jail's config
> after creating it using 'ezjail-admin create':
>
>   export jail_postgres_parameters="sysvmsg=new sysvsem=new sysvshm=new"
>
> Cheers,
> Christian

Thank you, this is it! I missed it in the JAIL(8) man page, and it is
not mentioned in the release notes.

Now if only iocage would recognize the sysvmsg, sysvsem, and
sysvshm options:

  # iocage set sysvmsg='new' xxx
    ERROR: Unsupported property: sysvmsg!

I guess I should file a bug report.

  Mark

man 8 jail
...
     allow.sysvipc
             A process within the jail has access to System V IPC
             primitives.  This is deprecated in favor of the per-module
             parameters (see below).  When this parameter is set, it is
             equivalent to setting sysvmsg, sysvsem, and sysvshm all to
             ``inherit''.
...
     sysvmsg
             Allow access to SYSV IPC message primitives.  If set to
             ``inherit'', all IPC objects on the system are visible to
             this jail, whether they were created by the jail itself,
             the base system, or other jails.  If set to ``new'', the
             jail will have its own key namespace, and can only see the
             objects that it has created; the system (or parent jail)
             has access to the jail's objects, but not to its keys.  If
             set to ``disable'', the jail cannot perform any
             sysvmsg-related system calls.

     sysvsem, sysvshm
             Allow access to SYSV IPC semaphore and shared memory
             primitives, in the same manner as sysvmsg.

> Regarding installation of PostgreSQL in a FreeBSD jail, the web holds
> plenty of warnings/advice that each postgres instance should have a
> unique UID, otherwise they stumble across each other's feet:
>
> | allow.sysvipc
> | A process within the jail has access to System V IPC primitives. In the
> | current jail implementation, System V primitives share a single namespace
> | across the host and jail environments, meaning that processes within a jail
> | would be able to communicate with (and potentially interfere with) processes
> | outside of the jail, and in other jails.
>
> Is this still the case in FreeBSD 11.0?
>
> I remember hearing rumors that the System V namespace no longer is
> (will?) be shared across jails. (Couldn't find it being mentioned in
> release notes.)
>
>   Mark
Is System V IPC namespace still shared across jails?
Regarding installation of PostgreSQL in a FreeBSD jail, the web holds
plenty of warnings/advice that each postgres instance should have a
unique UID, otherwise they stumble across each other's feet:

| allow.sysvipc
| A process within the jail has access to System V IPC primitives. In the
| current jail implementation, System V primitives share a single namespace
| across the host and jail environments, meaning that processes within a jail
| would be able to communicate with (and potentially interfere with) processes
| outside of the jail, and in other jails.

Is this still the case in FreeBSD 11.0?

I remember hearing rumors that the System V namespace no longer is
(will?) be shared across jails. (Couldn't find it being mentioned in
release notes.)

  Mark
Re: Uppercase RE matching problems in FreeBSD 11
2016-11-06 22:49, Stefan Bethke wrote:
> So what do I set my LANG and LC variables to? I do want UTF-8, but I
> do also want my scripts to continue to work. Clearly, en_US.UTF-8 is
> not what I want. Is it C.UTF-8? Or do I set LANG=en_US.UTF-8 and
> LC_COLLATE=C?

Yes, that is the safest bet.

LANG sets a default, but LC_COLLATE, LC_TIME, LC_NUMERIC and
LC_MONETARY should better be set to "C" to overrule LANG in their
domains. Leave LC_ALL undefined or empty, as this one overrules every
other locale setting (unless you really want everything to be set
to "C").

  Mark
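As a sketch of that advice for a Bourne-style shell profile (the choice of en_US.UTF-8 for LANG follows the question above; the sort call at the end is just an illustrative check that collation follows "C", with made-up sample input):

```shell
#!/bin/sh
# LANG provides the default for every category not set explicitly.
LANG=en_US.UTF-8
# Override the categories where scripts tend to depend on "C" behaviour.
LC_COLLATE=C
LC_NUMERIC=C
LC_TIME=C
LC_MONETARY=C
# LC_ALL would overrule all of the above, so make sure it is not set.
unset LC_ALL
export LANG LC_COLLATE LC_NUMERIC LC_TIME LC_MONETARY

# With LC_COLLATE=C, uppercase letters sort before lowercase ones.
sorted=$(printf '%s\n' b A a B | sort | tr '\n' ' ')
echo "$sorted"      # prints: A B a b
```

LC_CTYPE is intentionally left to inherit from LANG, so character classification and UTF-8 handling stay as before.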
Re: Uppercase RE matching problems in FreeBSD 11
2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode
> world it means AaBb...Z, because there are way more characters than
> simple A-Z.
>
> In FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the classes [:upper:] or
> [:lower:].

It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?)
at "C" when LANG or LC_CTYPE is set to something else, otherwise
unexpected things may happen.

  Mark

> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the
>> system boot time from the kern.boottime sysctl MIB. On 11.0 this no
>> longer works as expected:
>>
>>   $ sysctl kern.boottime
>>   kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>>   $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>   v 5 16:18:34 2016
>>
>> sed passes over 'S' and 'N' until it hits 'v', which it considers
>> uppercase apparently. This is with LANG=en_US.UTF-8. If I set LANG=C,
>> it works as expected:
>>
>>   $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>   Nov 5 16:18:34 2016
>>
>> Testing every lowercase character separately gives even more
>> inconsistent results:
>>
>>   $ cat < a > b > c > d > e > f > g > h > i > j > k > l > m > n > o > p > q > r > s > t > u > v > w > x > y > z > !
>>   b c d e f g h i j k l m n o p q r s t u v w x y z
>>
>> Here sed thinks every lowercase character except for 'a' is
>> uppercase! This differs from the first test where sed did not think
>> 'o' is uppercase. Again, the above behaves as expected with LANG=C.
>>
>> Does anyone have any insight into this? This is likely to break a lot
>> of existing code.
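A sketch of the character-class approach applied to the boot-time example: inside a bracket expression the class is written [[:upper:]], and it does not depend on collation order. The input line is copied from the report above so the snippet does not need sysctl; on a live system one would pipe `sysctl kern.boottime` into the same sed command.

```shell
#!/bin/sh
# Sample sysctl output, taken from the report above.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Same substitution as the original script, but with the POSIX class
# [[:upper:]] in place of the collation-dependent range [A-Z].
boot=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

echo "$boot"        # prints: Nov 5 16:18:34 2016
```

The greedy leading `.*` makes the group start at the last uppercase letter on the line, which is the 'N' of "Nov" here.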
Re: PKG bootstrap FreeBSD 11.0 / VBox NAT problem
On 10/28/16 14:15, Tomasz CEDRO wrote:
> Just for the curious. I am testing on VirtualBox (Version 5.1.8
> r111374 (Qt5.5.1), macOS 10.12.1 host). Cannot bootstrap PKG on a
> host with NAT enabled. I have noticed this problem occurs only when
> NAT is enabled in VBox. When I use a Bridged interface there is no
> problem. I have noticed that the outgoing packet following the RST
> response has an invalid checksum. That may be a VBox NAT problem..?
> Maybe someone noticed similar behavior..
>
> https://www.virtualbox.org/ticket/16126

Same thing here: after upgrading VirtualBox on Windows 10 to 5.0.28,
'pkg upgrade' on a FreeBSD 11.0-RELEASE-p2 guest fails with
'Connection reset by peer' - either right away, or after downloading a
(random) couple of packages - when using NAT provided by VirtualBox.
This worked well with a previous release of VirtualBox.

Looks like the problem is not specific to FreeBSD:

  https://www.virtualbox.org/ticket/16084

  Mark
Re: update.FreeBSD.org unresponsive?
Perhaps it's time to replace Apache httpd/2.2.16 (released 6+ years ago)
running on update.FreeBSD.org with something lighter and more agile
like nginx (or at least with a fresher version of Apache httpd).

The accf_http(9) (with: accept_filter=httpready) may help too.

  Mark

2016-10-12 17:23, Mark Martinec wrote:
> Whatever you did, it started to work now normally. Thank you!
> (no changes at our side)
>
>   Mark
>
> 2016-10-12 16:29, Mark Martinec wrote:
>> Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3,
>> 10.3-RELEASE-p10) to 11.0 (using: freebsd-update upgrade -r
>> 11.0-RELEASE), and it seems the fetch(1) always fails with a timeout.
>>
>> Even a simple (freebsd-update fetch) in an attempt to bump a
>> 10.3-RELEASE-p9 to 10.3-RELEASE-p10 now fails with a timeout, while
>> previously it worked reliably and fast.
>>
>> The interesting thing is that both the ping and ping6 to
>> update.FreeBSD.org work flawlessly with no packet loss. I tried it
>> several times yesterday, and again today.
>> [...]
Re: update.FreeBSD.org unresponsive?
Whatever you did, it has now started to work normally. Thank you!
(no changes on our side)

  Mark

2016-10-12 16:29, Mark Martinec wrote:
> Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3,
> 10.3-RELEASE-p10) to 11.0 (using: freebsd-update upgrade -r
> 11.0-RELEASE), and it seems the fetch(1) always fails with a timeout.
>
> Even a simple (freebsd-update fetch) in an attempt to bump a
> 10.3-RELEASE-p9 to 10.3-RELEASE-p10 now fails with a timeout, while
> previously it worked reliably and fast.
>
> The interesting thing is that both the ping and ping6 to
> update.FreeBSD.org work flawlessly with no packet loss. I tried it
> several times yesterday, and again today.
>
> Our network connectivity is otherwise good and fast. To rule out a
> possibility of a firewall or routing issue I even tried it from a
> different network (different ISP), and also over a Hurricane-Electric
> tunnel. Same thing over IPv4 or IPv6. Going through a web proxy
> doesn't help either.
>
> tcpdump / wireshark shows that the three-way SYN handshake succeeds,
> then the client sends a GET, re-sends the packet several times, but
> nothing comes back any more. Or sometimes even the SYN handshake fails
> to complete.
>
> Tried to use curl to fetch the same file, it fails too:
>
>   $ curl -6 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
>   curl: (7) Failed to connect to update.FreeBSD.org port 80: Operation timed out
>
>   $ curl -4 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
>   curl: (56) Recv failure: Operation timed out
>
> So, do we just need to be patient, or is the update.FreeBSD.org hosed?
>
>   Mark
update.FreeBSD.org unresponsive?
Trying to upgrade a couple of hosts (11.0-RC2, 11.0-RC3,
10.3-RELEASE-p10) to 11.0 (using: freebsd-update upgrade -r
11.0-RELEASE), and it seems the fetch(1) always fails with a timeout.

Even a simple (freebsd-update fetch) in an attempt to bump a
10.3-RELEASE-p9 to 10.3-RELEASE-p10 now fails with a timeout, while
previously it worked reliably and fast.

The interesting thing is that both the ping and ping6 to
update.FreeBSD.org work flawlessly with no packet loss. I tried it
several times yesterday, and again today.

Our network connectivity is otherwise good and fast. To rule out a
possibility of a firewall or routing issue I even tried it from a
different network (different ISP), and also over a Hurricane-Electric
tunnel. Same thing over IPv4 or IPv6. Going through a web proxy doesn't
help either.

tcpdump / wireshark shows that the three-way SYN handshake succeeds,
then the client sends a GET, re-sends the packet several times, but
nothing comes back any more. Or sometimes even the SYN handshake fails
to complete.

Tried to use curl to fetch the same file, it fails too:

  $ curl -6 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
  curl: (7) Failed to connect to update.FreeBSD.org port 80: Operation timed out

  $ curl -4 http://update.FreeBSD.org/11.0-RC3/amd64/t/78e79429ffc2730cbb467270372d754165c6a0812805d9a0522d412b3e9b7d7e
  curl: (56) Recv failure: Operation timed out

So, do we just need to be patient, or is the update.FreeBSD.org hosed?

  Mark
Ephemeral /var/run and creating port-specific subdir at service startup time
I prefer to have the /var/run file system reside on a tmpfs, as its
contents are small and ephemeral in nature (pid files, lock files,
sockets), need not be preserved across reboots, and should not have to
depend on any physical disk.

The problem is that some programs/services/ports like to create their
own subdirectory under /var/run. This works fine if such a subdirectory
is created (when missing) by their rc.d script, as is done by salt,
dbus, jenkins, clamav-clamd, isc-dhcpd and kibana.

Unfortunately there are other ports which create a subdirectory under
/var/run at installation time (pkg install). In this case their
subdirectory is missing after a reboot, when /var/run is re-created
afresh, and they fail to start.

So my question is: are such ports (like influxdb, grafana3), which do
not create their subdirectory at startup time, in error, so that a bug
report is warranted? Or am I wrong in expecting that /var/run may be
ephemeral, and is such a setup (as is common in Linux) unsupported?

  Mark
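What the well-behaved rc.d scripts do at startup can be sketched roughly as follows. This is not code from any of the ports named above: ensure_rundir and the service name "exampled" are hypothetical, and the demonstration uses a scratch directory in place of /var/run so it can run without root.

```shell
#!/bin/sh
# ensure_rundir DIR [MODE] [OWNER] (hypothetical helper): recreate a
# service's run directory if a freshly mounted tmpfs /var/run no longer
# has it. OWNER is an optional "user:group"; changing it requires root.
ensure_rundir() {
    dir=$1
    mode=${2:-0755}
    owner=${3:-}
    [ -d "$dir" ] || mkdir -p "$dir" || return 1
    chmod "$mode" "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
}

# Demonstration: a scratch directory stands in for /var/run.
fake_var_run=$(mktemp -d)
ensure_rundir "$fake_var_run/exampled" 0750
ls -ld "$fake_var_run/exampled"
```

In an actual rc.d script such a call would typically be made from the function named in start_precmd, so the directory exists before the daemon is started on every boot rather than only at package installation time.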
Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system
On 2016-01-14 23:13, Bryan Drewery wrote:
> Where / What is the error? The only example here was fixed in November.

Here is how a fresh svn checkout on a 10-stable fails in make buildworld
when /usr/src is noexec:

CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
(cd /usr/src/lib/libc/tests/net && NO_SUBDIR=1 make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS= PROG=ether_aton_test DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
*** Error code 1
Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1
Stop.
make[6]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1
Stop.
make[5]: stopped in /usr/src/lib/libc/tests
*** Error code 1
Stop.
make[4]: stopped in /usr/src/lib/libc
*** Error code 1
Stop.
make[3]: stopped in /usr/src/lib
*** Error code 1
[...]

The net/gen_ether_subr looks like the same culprit as reported on
2015-11-26.

Actually ... it seems that taking out WITH_TESTS="yes" from
/etc/make.conf avoids the problem - although this was not necessary in
10.2-RELEASE, as far as I can tell.

  Mark

> On 1/14/2016 7:42 AM, Mark Martinec wrote:
>> Prompted by recent security advisories I did a 'make buildworld' on a
>> fresh svn checkout, only to find out that it seems the 'exec' mount
>> flag on /usr/src is still required for a successful build.
>>
>> This wasn't so for 10.2, and I hope it won't become a requirement in
>> 10.3 - or at least it should be clearly documented in release notes.
>>
>>   Mark
>>
>> On 2015-12-07 16:35, Mark Martinec wrote:
>>> So, is this a new state of affairs that the /usr/src file system
>>> needs to be mounted exec in order for buildworld to succeed, or is
>>> this an unintended change and I should file a bug report?
>>>
>>>   Mark
>>>
>>> On 2015-11-26 19:44, Miroslav Lachman wrote:
>>>> Mark Martinec wrote on 11/26/2015 19:31:
>>>>> Up to about a week ago building world on FreeBSD 10.2-STABLE went
>>>>> just fine. Today after svn update the build fails:
>>>>>
>>>>> # make buildworld
>>>>> [...]
>>>>> CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
>>>>> echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
>>>>> (cd /usr/src/lib/libc/tests/net && make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS= SUBDIR= PROG=ether_aton_test DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test depend)
>>>>> /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
>>>>> make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
>>>>> *** Error code 1
>>>>> Stop.
>>>>> make[7]: stopped in /usr/src/lib/libc/tests/net
>>>>> *** Error code 1
>>>>>
>>>>> It turns out that our file system /usr/src had the "exec" flag
>>>>> turned off, so now running the command:
>>>>>   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>>>> fails with "Permission denied".
>>>>>
>>>>> It would be valuable if building a system on an exec-protected src
>>>>> file system would continue to be possible.
>>>>>
>>>>> Not sure if /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>>>> is the only such new command breaking the build. Anyway, a simple
>>>>> workaround is to invoke the shell on the script explicitly instead
>>>>> of relying on its shebang line, i.e.:
>>>>>   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>>>> instead of:
>>>>>   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>>>
>>>> I was puzzled by a similar thing years ago. I was using /var/db and
>>>> /tmp mounted with noexec. And then there were some changes. Ports
>>>> need /var/db with exec because of some script in /var/db/pkg, and
>>>> /tmp must have exec too for buildworld or installworld (I don't
>>>> remember it well, now I always do mount -u -o current,exec /tmp
>>>> before build + install world and kernel)
>>>>
>>>> Anyway - it would be better to not have these partitions mounted
>>>> with exec.
Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system
Prompted by recent security advisories I did a 'make buildworld' on a
fresh svn checkout, only to find out that it seems the 'exec' mount flag
on /usr/src is still required for a successful build.

This wasn't so for 10.2, and I hope it won't become a requirement in
10.3 - or at least it should be clearly documented in release notes.

  Mark

On 2015-12-07 16:35, Mark Martinec wrote:
> So, is this a new state of affairs that the /usr/src file system needs
> to be mounted exec in order for buildworld to succeed, or is this an
> unintended change and I should file a bug report?
>
>   Mark
>
> On 2015-11-26 19:44, Miroslav Lachman wrote:
>> Mark Martinec wrote on 11/26/2015 19:31:
>>> Up to about a week ago building world on FreeBSD 10.2-STABLE went
>>> just fine. Today after svn update the build fails:
>>>
>>> # make buildworld
>>> [...]
>>> CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
>>> echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
>>> (cd /usr/src/lib/libc/tests/net && make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS= SUBDIR= PROG=ether_aton_test DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test depend)
>>> /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
>>> make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
>>> *** Error code 1
>>> Stop.
>>> make[7]: stopped in /usr/src/lib/libc/tests/net
>>> *** Error code 1
>>>
>>> It turns out that our file system /usr/src had the "exec" flag
>>> turned off, so now running the command:
>>>   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>> fails with "Permission denied".
>>>
>>> It would be valuable if building a system on an exec-protected src
>>> file system would continue to be possible.
>>>
>>> Not sure if /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>> is the only such new command breaking the build. Anyway, a simple
>>> workaround is to invoke the shell on the script explicitly instead
>>> of relying on its shebang line, i.e.:
>>>   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>> instead of:
>>>   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>>
>> I was puzzled by a similar thing years ago. I was using /var/db and
>> /tmp mounted with noexec. And then there were some changes. Ports
>> need /var/db with exec because of some script in /var/db/pkg, and
>> /tmp must have exec too for buildworld or installworld (I don't
>> remember it well, now I always do mount -u -o current,exec /tmp
>> before build + install world and kernel)
>>
>> Anyway - it would be better to not have these partitions mounted
>> with exec.
>>
>> Miroslav Lachman
Re: A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system
So, is this a new state of affairs that the /usr/src file system needs
to be mounted exec in order for buildworld to succeed, or is this an
unintended change and I should file a bug report?

  Mark

On 2015-11-26 19:44, Miroslav Lachman wrote:
> Mark Martinec wrote on 11/26/2015 19:31:
>> Up to about a week ago building world on FreeBSD 10.2-STABLE went
>> just fine. Today after svn update the build fails:
>>
>> # make buildworld
>> [...]
>> CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
>> echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
>> (cd /usr/src/lib/libc/tests/net && make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS= SUBDIR= PROG=ether_aton_test DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test depend)
>> /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
>> make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
>> *** Error code 1
>> Stop.
>> make[7]: stopped in /usr/src/lib/libc/tests/net
>> *** Error code 1
>>
>> It turns out that our file system /usr/src had the "exec" flag turned
>> off, so now running the command:
>>   /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>> fails with "Permission denied".
>>
>> It would be valuable if building a system on an exec-protected src
>> file system would continue to be possible.
>>
>> Not sure if /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>> is the only such new command breaking the build. Anyway, a simple
>> workaround is to invoke the shell on the script explicitly instead of
>> relying on its shebang line, i.e.:
>>   # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>> instead of:
>>   # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
>
> I was puzzled by a similar thing years ago. I was using /var/db and
> /tmp mounted with noexec. And then there were some changes. Ports need
> /var/db with exec because of some script in /var/db/pkg, and /tmp must
> have exec too for buildworld or installworld (I don't remember it
> well, now I always do mount -u -o current,exec /tmp before build +
> install world and kernel)
>
> Anyway - it would be better to not have these partitions mounted with
> exec.
>
> Miroslav Lachman
A recent 10.2-STABLE no longer builds on a no-exec /usr/src file system
Up to about a week ago building world on FreeBSD 10.2-STABLE went just
fine. Today after svn update the build fails:

# make buildworld
[...]
CC='cc ' mkdep -f .depend.getprotoent_test -a -I/usr/src/lib/libc/tests/net -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -std=gnu99 /usr/src/contrib/netbsd-tests/lib/libc/net/t_getprotoent.c
echo getprotoent_test: /usr/obj/usr/src/tmp/usr/lib/libc.a /usr/obj/usr/src/tmp/usr/lib/private/libatf-c.a >> .depend.getprotoent_test
(cd /usr/src/lib/libc/tests/net && make -f /usr/src/lib/libc/tests/net/Makefile _RECURSING_PROGS= SUBDIR= PROG=ether_aton_test DEPENDFILE=.depend.ether_aton_test .MAKE.DEPENDFILE=.depend.ether_aton_test depend)
/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr /usr/src/sys/net/if_ethersubr.c aton_ether_subr.c
make[7]: exec(/usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr) failed (Permission denied)
*** Error code 1
Stop.
make[7]: stopped in /usr/src/lib/libc/tests/net
*** Error code 1

It turns out that our file system /usr/src had the "exec" flag turned
off, so now running the command:
  /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
fails with "Permission denied".

It would be valuable if building a system on an exec-protected src file
system would continue to be possible.

Not sure if /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
is the only such new command breaking the build. Anyway, a simple
workaround is to invoke the shell on the script explicitly instead of
relying on its shebang line, i.e.:
  # /bin/sh /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr
instead of:
  # /usr/src/contrib/netbsd-tests/lib/libc/net/gen_ether_subr

  Mark
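The difference between the two invocations can be reproduced without a noexec mount. The sketch below approximates the situation by removing the execute bit from a throwaway script (an assumption made for the demo; a real noexec mount refuses exec regardless of the file mode), then runs it both ways.

```shell
#!/bin/sh
# Create a small stand-in for a generator script such as gen_ether_subr.
script=$(mktemp)
printf '#!/bin/sh\necho generated\n' > "$script"
chmod a-x "$script"     # approximates the effect of a noexec mount

# Direct execution relies on the kernel exec'ing the file: this fails.
if "$script" 2>/dev/null; then
    direct=ok
else
    direct=denied
fi

# Workaround: the shell only needs to READ the file, not exec it.
via_sh=$(/bin/sh "$script")

echo "direct=$direct via_sh=$via_sh"
rm -f "$script"
```

This is why prefixing the command with /bin/sh works on a noexec /usr/src: only /bin/sh itself is executed, and it lives on a file system that permits exec.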
Re: Segmentation fault running ntpd
Upgrading from 10.2-RELEASE-p6 to 10.2-RELEASE-p7 has now solved the ntpd crashes (apparently fixed by FreeBSD Errata Notice FreeBSD-EN-15:20.vm).

Thanks!!!

Mark

On 2015-11-01 10:31, Andre Albsmeier wrote:

On Fri, 30-Oct-2015 at 19:47:59 +0100, Mark Martinec wrote:

Not sure if it's the same issue, but it sure looks like it is. I have upgraded a couple of hosts (amd64) from 10.2-RELEASE-p5 to 10.2-RELEASE-p6, i.e. freebsd-update essentially just replaced /usr/sbin/ntpd with a new one; then I restarted ntpd. On all hosts but one this was successful: the new ntpd starts fine and works normally. But on one of these machines the ntpd process immediately crashes with SIGSEGV. That machine has an Intel Xeon CPU. It is not apparent to me in what way this machine differs from others,

I'll add my observations here:

I am using an ntp.conf with a single server entry:

server ntp.some.domain.org

ntp.some.domain.org is a CNAME pointing to gate.some.domain.org, and the latter contains an A record pointing to 192.168.128.1.

After updating 9.3-STABLE to the latest version (one which includes ntp 4.2.8p4), ntpd crashes:

Nov 1 09:38:38 voyager kernel: pid 4443 (ntpd), uid 0: exited on signal 11

This happens in line 871 of ntpd.c, where mlockall() is called:

&& 0 != mlockall(MCL_CURRENT|MCL_FUTURE))

It does NOT crash with MCL_FUTURE only.
It does crash with MCL_CURRENT only.

When adding rlimit memlock -1 to ntp.conf it does NOT crash (as mlockall() won't be called anymore).

When specifying the IP address (192.168.128.1) as the server it does NOT crash.

When specifying gate.some.domain.org as the server it also does NOT crash. tcpdump shows in this case:

09:49:59.542310 IP 192.168.128.2.21102 > 192.168.128.1.53: 7639+ A? gate.some.domain.org. (41)
09:49:59.542578 IP 192.168.128.1.53 > 192.168.128.2.21102: 7639* 1/1/0 A 192.168.128.1 (71)
09:49:59.542612 IP 192.168.128.2.52455 > 192.168.128.1.53: 42047+ ? gate.some.domain.org. (41)
09:49:59.542792 IP 192.168.128.1.53 > 192.168.128.2.52455: 42047* 0/1/0 (88)

When reverting the server entry back to ntp.some.domain.org it crashes, and tcpdump shows:

09:36:05.172552 IP 192.168.128.2.17836 > 192.168.128.1.53: 49768+ A? ntp.some.domain.org. (40)
09:36:05.173320 IP 192.168.128.1.53 > 192.168.128.2.17836: 49768* 2/1/0 CNAME gate.some.domain.org., A 192.168.128.1 (89)
09:36:05.173361 IP 192.168.128.2.22611 > 192.168.128.1.53: 63808+ ? ntp.some.domain.org. (40)
09:36:05.173595 IP 192.168.128.1.53 > 192.168.128.2.22611: 63808* 1/1/0 CNAME gate.some.domain.org. (106)

The probability of crashing increases with the speed and the number of cores of the machine: on my old single-core Pentiums it never crashes, on my quad-core i7-3770K it always crashes.

The (asynchronous) resolving of the names starts in line 3876 of ntp_config.c:

getaddrinfo_sometime(curr_peer->addr->address,

If we put the mlockall() call directly before this line, the crash is gone.

Maybe you want to play around with rlimit, CNAMEs, IPs and so on...

-Andre

Anyone else seeing this?

2015-10-30 12:34, David Wolfskill wrote:
> On Fri, Oct 30, 2015 at 09:42:07AM +0100, Dag-Erling Smørgrav wrote:
>> David Wolfskill <da...@catwhisker.org> writes:
>> > ...
>> > bound to 172.17.1.245 -- renewal in 43200 seconds.
>> > pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
>> > Starting Network: lo0 em0 iwn0 lagg0.
>> > ...
>>
>> Did you find a solution? I'm wondering if the ntpd problems people are
>> reporting on freebsd-security@ are related. I vaguely recall hearing
>> that this had been traced to a pthread bug, but can't find anything
>> about it in commit logs or mailing list archives.
>>
>
> I don't recall finding "a solution" per se; that said, I also don't
> recall seeing an occurrence of the above for enough time that I'm not
> sure when I sent that message. :-}
>
> As a reality check:
>
> g1-252(11.0-C)[1] ls -lT /*.core
> -rw-r--r-- 1 root wheel 13783040 Aug 18 04:19:03 2015 /ntpd.core
> g1-252(11.0-C)[2]
>
> So -- among other points -- my last sighting of whatever was causing
> that was the day I built:
>
> FreeBSD 11.0-CURRENT #157 r286880M/286880:1100079: Tue Aug 18
> 04:45:25 PDT 2015
> r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY amd64
>
> Note that the machines where I run head get updated daily (unless
> there's enough of a problem with head that I can't build it or can't
> boot it (and I'm unable to circumvent the issue within a reasonable
> time)) -- and while I do attempt to run ntpd on the machines, the above
> failure is more "annoying" than "crippling" in my particular case.
>
> And I'm presently running:
>
> FreeBSD 11.0-CURRENT #227 r290
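Of the observations in Andre's message, the rlimit memlock -1 line is the one that can be tried without rebuilding anything. A minimal sketch for adding it exactly once; the helper name ensure_line is invented here, and the path /etc/ntp.conf is the usual FreeBSD location and may differ on your system:

```shell
#!/bin/sh
# Hypothetical helper: append a line to a config file unless an identical
# line is already there.
ensure_line() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string, -q: quiet.
    # "--" stops option parsing in case the line starts with a dash.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root), then restart ntpd:
#   ensure_line /etc/ntp.conf 'rlimit memlock -1'
#   service ntpd restart
```

Running it twice leaves a single copy of the line, so it is safe to keep in a provisioning script.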
Re: Segmentation fault running ntpd
Not sure if it's the same issue, but it sure looks like it is. I have upgraded a couple of hosts (amd64) from 10.2-RELEASE-p5 to 10.2-RELEASE-p6, i.e. freebsd-update essentially just replaced /usr/sbin/ntpd with a new one; then I restarted ntpd. On all hosts but one this was successful: the new ntpd starts fine and works normally. But on one of these machines the ntpd process immediately crashes with SIGSEGV. That machine has an Intel Xeon CPU. It is not apparent to me in what way this machine differs from others,

Played with some variations of ntpd on that host; here are some findings:

- the new ntpd (that came with 10.2-RELEASE-p6) runs fine if it does *not* daemonize, i.e. ntpd with an option -n or -d stays attached to a terminal and works fine; the same happens when run under ktrace -d -i ntpd ...: it works fine, even when it daemonizes;

- the ntpd built from a fresh net/ntp-devel behaves exactly the same: it crashes on that machine when it daemonizes;

- a previous ntpd (from 10.2-RELEASE-p5) works fine, so I ended up downgrading ntpd to that previous version on that machine. Also, an ntpd from a recent 10-STABLE, when copied to that host, runs fine there!

I haven't tried yet to build it with debugging, or to capture a core dump.

Puzzling...

Mark

2015-10-30 12:34, David Wolfskill wrote:
> On Fri, Oct 30, 2015 at 09:42:07AM +0100, Dag-Erling Smørgrav wrote:
>> David Wolfskill writes:
>> > ...
>> > bound to 172.17.1.245 -- renewal in 43200 seconds.
>> > pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
>> > Starting Network: lo0 em0 iwn0 lagg0.
>> > ...
>>
>> Did you find a solution? I'm wondering if the ntpd problems people are
>> reporting on freebsd-security@ are related. I vaguely recall hearing
>> that this had been traced to a pthread bug, but can't find anything
>> about it in commit logs or mailing list archives.
>
> I don't recall finding "a solution" per se; that said, I also don't
> recall seeing an occurrence of the above for enough time that I'm not
> sure when I sent that message. :-}
>
> As a reality check:
>
> g1-252(11.0-C)[1] ls -lT /*.core
> -rw-r--r-- 1 root wheel 13783040 Aug 18 04:19:03 2015 /ntpd.core
> g1-252(11.0-C)[2]
>
> So -- among other points -- my last sighting of whatever was causing
> that was the day I built:
>
> FreeBSD 11.0-CURRENT #157 r286880M/286880:1100079: Tue Aug 18 04:45:25 PDT 2015
> r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY amd64
>
> Note that the machines where I run head get updated daily (unless
> there's enough of a problem with head that I can't build it or can't
> boot it (and I'm unable to circumvent the issue within a reasonable
> time)) -- and while I do attempt to run ntpd on the machines, the above
> failure is more "annoying" than "crippling" in my particular case.
>
> And I'm presently running:
>
> FreeBSD 11.0-CURRENT #227 r290138M/290138:1100084: Thu Oct 29 05:12:58 PDT 2015
> r...@g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY amd64
>
> and building head @r290190 as I type.
>
> And FWIW, I *suspect* that one of the issues involved (in my case) was
> a ... lack of determinism ... in events involving getting the (wireless)
> network connectivity into a usable state as part of the initial
> transition to multi-user mode. (I only have evidence at the moment of
> the issue on my laptop; my build machine, which only uses a wired NIC,
> has no /ntpd.core file. It and my laptop are updated pretty much in
> lock-step; it runs a completely GENERIC kernel, while the laptop runs a
> modestly customized one based on GENERIC.)
>
> Peace,
> david
Re: recommended poudriere jail versions?
2015-10-01 10:32, Marko Cupać wrote:

> What is the recommended poudriere jail version for building ports? So far I have tried to stay on the latest binary patch level of every minor version for the base system, poudriere jails and clients, but I ended up with three jails just for amd64 (9.3, 10.1 and 10.2), where I need to rebuild all the ports every time I patch the poudriere jails. This is starting to take too much of my time. I see that pkg.freebsd.org hosts just one set of ports per architecture of major version. What is the OS version they are built on? Are there any downsides in building all the ports for 10.2 on 10.1?

I used to have poudriere jails based on a minor version like you have, but ended up with a simplified setup: building ports only on 10.0-RELEASE and installing them on 10.1, 10.2 and 10-STABLE. I think the official packages are also built on 10.0-RELEASE.

This mostly works, except for a port like virtualbox-ose-kmod, which causes a kernel crash when built on 10.0-RELEASE and run on 10.2. So after each ports upgrade, when I notice that pkg is reinstalling virtualbox-ose-kmod, I rebuild this one from ports on the target host; otherwise the next reboot will end up crashing when the vboxdrv kernel module is loaded during startup.

Mark
Re: Latest stable (r287104) bash leaves zombies on exit
Pete French wrote:

> I updated to stable yesterday, plus updated all my ports to the latest precompiled packages, but I am now seeing odd problems with bash on exit. Sometimes it quits, but leaves a zombie process, e.g.
>
>   PID TT  STAT    TIME COMMAND
> 44308 v0  IW   0:00.00 -bash (bash)
> 44312 v0  IW+  0:00.00 /bin/sh /usr/local/bin/startx -listen_tcp
> 44325 v0  IW+  0:00.00 xinit xterm -listen_tcp -- /usr/local/bin/X :0 -auth /ho
> 44328 v0  IW   0:00.00 /usr/local/bin/wmaker
> 44340 v0  S    0:03.35 /usr/local/bin/wmaker --for-real
> 49101 0-  Z+   0:02.73 defunct
> 49314 1-  Z+   0:00.17 defunct
> 56068 2   Ss   0:00.01 bash
> 56498 2   R+   0:00.00 ps
> 56074 3   Is   0:00.01 bash
> 56076 3   S+   0:00.00 mail freebsd-stable@freebsd.org
> 56308 4   Is+  0:00.01 bash
>
> That's the current 'ps' on this machine. The bash processes are running inside an xterm, so I am not sure if the issue is with bash or the terminal. Kind of puzzled!

I can reproduce this easily, although not every time. Running 10.2 under KDE, with bash as the default shell: start xterm from a KDE 'konsole', then switch to the xterm and try closing it (^D or exit). More often than not the xterm will block and stay open, and the bash process within it goes defunct. A normal kill of the xterm has no effect, although a kill -9 blows the xterm away, and the init process then clears the leftover bash zombie.

It seems that running a simple command like 'date' in the xterm before trying to close it increases the likelihood that the xterm will block on exit.

> Currently I have to reboot the machine periodically once I have accumulated enough zombies to be annoying. It's not really a long-term solution though.

There is no need to reboot; just kill -9 the hanging xterm processes and init will clear the zombies.

Mark
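To find which processes to kill, it helps to list the zombies together with their parents, since it is the parent (here the stuck xterm) that has to go. A minimal sketch, assuming ps output with the columns pid, ppid, stat, command in that order; list_zombies is a name made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: read "pid ppid stat command" lines on stdin and
# print "zombie-pid parent-pid" for every process whose state starts
# with Z. A header line is harmless: "STAT" does not start with Z.
list_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2 }'
}

# Example:
#   ps -axo pid,ppid,stat,comm | list_zombies
```

Each parent PID printed is only a candidate for kill -9; check that it really is one of the hanging xterms before killing it.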
Re: freebsd-update to 10.2-RELEASE broken ?
On Sun, 16 Aug 2015, Kimmo Paasiala wrote:

> It could be the classic fall back to TCP on SRV records problem on your upstream DNS forwarder if you're using one:
>
> http://lists.freebsd.org/pipermail/freebsd-ports/2012-May/074801.html
>
> The cure would be to use your own caching DNS resolver (configured to query the authoritative name servers directly) such as dns/unbound.

2015-08-16 Christian Kratzer wrote:

> I run my own bind9 resolvers on FreeBSD 10 at both sites. I never particularly liked the concept of an upstream resolver. All my resolvers are behind firewalls, although of different kinds: ASA at one site and FreeBSD pf at the other. I will investigate though. Thanks for the tip.

The ASA firewall has a nasty setting to *discard* DNS UDP packets with a UDP message size over 512 bytes, i.e. it does not allow the EDNS0 option. Check that you have this DNS deep packet inspection misfeature turned off. Check the firewall log as well.

This would affect UDP DNS responses to an SRV query for _http._tcp.update.FreeBSD.org, which come close to the size limit (possibly depending on geolocation). Using Google's public DNS server may avoid the problem because it strips nonessential records (like the ADDITIONAL SECTION) from the DNS reply.

Mark
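One way to see how close the SRV reply comes to the limit is to look at the size the query tool reports. A minimal sketch, assuming the tool prints a line of the form ";; MSG SIZE  rcvd: N" (drill and dig both do, as far as I know); the function name reply_exceeds_512 is invented for this example:

```shell
#!/bin/sh
# Hypothetical helper: read drill/dig output on stdin and succeed (exit 0)
# if the reported reply size is above the classic 512-byte UDP limit.
# With no "MSG SIZE" line at all, size stays 0 and the check fails.
reply_exceeds_512() {
    awk '/MSG SIZE/ { size = $NF } END { exit !(size > 512) }'
}

# Example:
#   drill _http._tcp.update.FreeBSD.org SRV | reply_exceeds_512 \
#       && echo "reply is larger than 512 bytes"
```

Run from behind the firewall, a query that simply times out over UDP may itself point at dropped replies, so it is worth comparing against the same query from an unfiltered host.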