Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-08-24 Thread Peter Jeremy via freebsd-stable
TL;DR: Ensure you explicitly destroy all ZFS labels on disused root pools.

On 2020-Jul-19 21:21:02 +1000, Peter Jeremy  wrote:
>I'm sending this to -stable, rather than the src groups because I
>don't believe the problem is the commit itself, rather the commit
>has uncovered a latent problem elsewhere.
>
>On 2020-Jul-01 18:03:38 +, Michael Tuexen  wrote:
>>Author: tuexen
>>Date: Wed Jul  1 18:03:38 2020
>>New Revision: 362848
>>URL: https://svnweb.freebsd.org/changeset/base/362848
>>
>>Log:
>>  MFC r353480: Use event handler in SCTP
>
>I have no idea how, but this update breaks booting amd64 for me (r362847
>works and this doesn't).  I have a custom kernel with ZFS but no SCTP so I
>have no real idea how this could break booting - presumably the
>eventhandler change has uncovered a bug somewhere else.

To close the loop on this, the problem was a combination of:
* changes in GEOM provider ordering;
* insufficient checks when ZFS is looking for the root pool;
* my system having remnants of a disused pool with the same name as the root
  pool.

It seems that the order of GEOM providers is relatively unstable - even
including in the kernel a device that doesn't physically exist can change
the provider order.  Presumably r362848 also resulted in a change in order.

During a root-on-ZFS boot, the kernel scans all providers, looking for ZFS
labels with a pool name matching the root pool.  Only minimal checks are
performed: in particular, there's no check that it's a valid pool, and the
first such label found is assumed to describe the root pool.
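
The "first matching label wins" behaviour described above can be modelled in a few lines.  This is only a sketch of the description, not the kernel's actual code: the Label class, its usable flag, the GUID values and the provider names ada0/ada0p3 are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Label:
    pool_name: str
    pool_guid: int
    usable: bool  # hypothetical flag: is there a complete pool behind this label?


# A provider is modelled as (name, label-or-None); the names are illustrative.
Provider = Tuple[str, Optional[Label]]


def pick_root_guid(providers: List[Provider], root_pool: str) -> Optional[int]:
    """Return the GUID from the first label whose pool name matches."""
    for _name, label in providers:
        if label is not None and label.pool_name == root_pool:
            return label.pool_guid  # first match wins, no validity check
    return None


def can_mount_root(providers: List[Provider], root_pool: str) -> bool:
    """True if the GUID picked above belongs to a usable pool."""
    guid = pick_root_guid(providers, root_pool)
    if guid is None:
        return False
    return any(
        label is not None and label.pool_guid == guid and label.usable
        for _name, label in providers
    )


stale = Label("zroot", pool_guid=111, usable=False)  # remnant of a disused pool
live = Label("zroot", pool_guid=222, usable=True)    # the real root pool

disk_first = [("ada0", stale), ("ada0p3", live)]
partition_first = [("ada0p3", live), ("ada0", stale)]

print(can_mount_root(disk_first, "zroot"))       # False
print(can_mount_root(partition_first, "zroot"))  # True
```

The two lists hold the same providers; only the order differs, which is enough to flip the outcome in this model.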

In my case, some time ago, I'd moved things around on my boot disk.  My old
root pool went to the end of the physical disk but I'd decided to shrink it
and left some free space at the end of the disk.  This meant that ZFS found
one (out of 4) labels when it tasted the physical disk.  If GEOM sorted
the physical disk ahead of its partitions, ZFS would take the pool GUIDs
from the stray label on the physical disk and then fail to find a usable
pool matching those GUIDs.  My fix was to zero the end of my disk.
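
For reference, ZFS keeps four 256 KiB labels per device: two at the start and two at the end, with the device size rounded down to a multiple of the label size.  The sketch below computes those offsets and reports which whole-disk label slots overlap a given byte range, such as free space left behind by a shrunken partition.  The disk size and free-space range are made-up values, and the helper names are hypothetical; the code only identifies candidate slots and says nothing about whether a stale label is actually present in them.

```python
from typing import List

LABEL_SIZE = 256 * 1024  # one vdev label occupies 256 KiB


def label_offsets(device_size: int) -> List[int]:
    """Byte offsets of labels 0-3 on a device of the given size."""
    usable = device_size - (device_size % LABEL_SIZE)  # round down to label size
    return [0, LABEL_SIZE, usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]


def slots_overlapping(disk_size: int, start: int, end: int) -> List[int]:
    """Indices of whole-disk label slots that overlap bytes [start, end)."""
    return [
        index
        for index, offset in enumerate(label_offsets(disk_size))
        if offset < end and offset + LABEL_SIZE > start
    ]


# Hypothetical layout: a 100 GiB disk whose last 1 GiB is unallocated.
disk_size = 100 * 2**30
free_start = disk_size - 2**30

print(label_offsets(disk_size))
print(slots_overlapping(disk_size, free_start, disk_size))  # [2, 3]
```

In this layout both trailing label slots of the whole disk fall inside the unallocated tail, so old data left there is visible when the raw disk is tasted.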

-- 
Peter Jeremy


signature.asc
Description: PGP signature


Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-07-21 Thread Peter Jeremy
On 2020-Jul-21 00:47:23 +0300, Konstantin Belousov  wrote:
>On Tue, Jul 21, 2020 at 07:20:44AM +1000, Peter Jeremy wrote:
>> On 2020-Jul-19 14:48:28 +0300, Konstantin Belousov wrote:
>> >On Sun, Jul 19, 2020 at 09:21:02PM +1000, Peter Jeremy wrote:
>> >> The symptoms are that I get:
>> >> Mounting from zfs:zroot/ROOT/r363310 failed with error 6; retrying for 3 more seconds
>> >> Mounting from zfs:zroot/ROOT/r363310 failed with error 6
>> >> 
>> >> (r363310 is where I was trying to update to and I didn't change the BE
>> >> name as I was searching for the problem and error 6 is ENXIO).
>> >> 
>> >> I tried to reproduce the problem with GENERIC but it hangs after
>> >> displaying the EFI framebuffer information (I've seen that before and
>> >> suspect it is a loader problem but haven't dug into it).
>> 
>> I've confirmed that particular problem is bug 209821.  I've disabled
>> EFI and GENERIC r362848 boots and runs successfully.
>Did you mistype the PR number?  The referenced bug talks about a very
>early hang, while your report said that kernel boots up to the point of
>mounting root.

My failure was with a custom kernel.  Once I narrowed the problem to a
commit that seemed unrelated to my problem, I tried to boot a GENERIC
kernel at r362848.  The GENERIC kernel boot failed much earlier due to
the EFI problem documented in PR 209821.  When I disabled EFI, then
the GENERIC kernel worked, showing that my problem was due to my custom
kernel.

>> Since GENERIC worked, I did some more experimenting and tracked the
>> problem down to a lack of "options ACPI_DMAR" in my kernel config.
>> That makes more sense, though I have no idea why it suddenly became
>> mandatory for my system.
>No, this does not make too much sense either, since DMAR is disabled
>by default.  Did you enable it?

"options ACPI_DMAR" has been in GENERIC since you first submitted the
DMAR code in r257251.  I haven't ever set the hw.dmar.enable=1
loader tunable but it's not at all obvious that a kernel built without
"options ACPI_DMAR" is functionally equivalent to a kernel that has
DMAR compiled in but disabled - there's a lot of IOMMU manipulation
code that is purely conditional on ACPI_DMAR.

That said, I'm not using virtualisation and haven't actually enabled
DMAR in the loader so I suspect that I've only masked the real issue.
I currently have INVARIANTS and WITNESS but will look into some of the
more extensive debugging options.

(It looks like I missed the addition of "options ACPI_DMAR" when I was
updating my custom kernel config with the differences between r250963
and r259512 about 8 years ago, and it hasn't caused any obvious
problems until now.  Obviously, I need to do a more careful review of
my custom kernel config against GENERIC/NOTES).

>BTW, you are using stable, right ?  There were some code reorganization
>commits in HEAD moving DMAR code around, but they were not merged to
>stable.

I'm using 12-STABLE.

-- 
Peter Jeremy




Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-07-20 Thread Konstantin Belousov
On Tue, Jul 21, 2020 at 07:20:44AM +1000, Peter Jeremy wrote:
> On 2020-Jul-19 14:48:28 +0300, Konstantin Belousov wrote:
> >On Sun, Jul 19, 2020 at 09:21:02PM +1000, Peter Jeremy wrote:
> >> I'm sending this to -stable, rather than the src groups because I
> >> don't believe the problem is the commit itself, rather the commit
> >> has uncovered a latent problem elsewhere.
> >> 
> >> On 2020-Jul-01 18:03:38 +, Michael Tuexen  wrote:
> >> >Author: tuexen
> >> >Date: Wed Jul  1 18:03:38 2020
> >> >New Revision: 362848
> >> >URL: https://svnweb.freebsd.org/changeset/base/362848
> >> >
> >> >Log:
> >> >  MFC r353480: Use event handler in SCTP
> >> 
> >> I have no idea how, but this update breaks booting amd64 for me (r362847
> >> works and this doesn't).  I have a custom kernel with ZFS but no SCTP so I
> >> have no real idea how this could break booting - presumably the
> >> eventhandler change has uncovered a bug somewhere else.
> >> 
> >> The symptoms are that I get:
> >> Mounting from zfs:zroot/ROOT/r363310 failed with error 6; retrying for 3 more seconds
> >> Mounting from zfs:zroot/ROOT/r363310 failed with error 6
> >> 
> >> (r363310 is where I was trying to update to and I didn't change the BE
> >> name as I was searching for the problem and error 6 is ENXIO).
> >> 
> >> I tried to reproduce the problem with GENERIC but it hangs after
> >> displaying the EFI framebuffer information (I've seen that before and
> >> suspect it is a loader problem but haven't dug into it).
> 
> I've confirmed that particular problem is bug 209821.  I've disabled
> EFI and GENERIC r362848 boots and runs successfully.
Did you mistype the PR number?  The referenced bug talks about a very
early hang, while your report said that kernel boots up to the point of
mounting root.

> 
> >> Does anyone have any ideas?
> >
> >Did you check that the physical devices where your ZFS pool is located
> >are detected, and that kernel messages for their drivers are as usual?
> >Overall, is there anything strange in the verbose dmesg ?
> 
> There's nothing obviously strange (in particular, I can see the physical
> boot/root disk) but the faulty kernel appears to have moved the msgbuf
> somewhere unexpected so it's not saved across reboots and I'm limited to
> eyeballing the messages via DDB.
> 
> Since GENERIC worked, I did some more experimenting and tracked the
> problem down to a lack of "options ACPI_DMAR" in my kernel config.
> That makes more sense, though I have no idea why it suddenly became
> mandatory for my system.
No, this does not make too much sense either, since DMAR is disabled
by default.  Did you enable it?

BTW, you are using stable, right ?  There were some code reorganization
commits in HEAD moving DMAR code around, but they were not merged to
stable.
___
[email protected] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"


Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-07-20 Thread Peter Jeremy
On 2020-Jul-19 14:48:28 +0300, Konstantin Belousov  wrote:
>On Sun, Jul 19, 2020 at 09:21:02PM +1000, Peter Jeremy wrote:
>> I'm sending this to -stable, rather than the src groups because I
>> don't believe the problem is the commit itself, rather the commit
>> has uncovered a latent problem elsewhere.
>> 
>> On 2020-Jul-01 18:03:38 +, Michael Tuexen  wrote:
>> >Author: tuexen
>> >Date: Wed Jul  1 18:03:38 2020
>> >New Revision: 362848
>> >URL: https://svnweb.freebsd.org/changeset/base/362848
>> >
>> >Log:
>> >  MFC r353480: Use event handler in SCTP
>> 
>> I have no idea how, but this update breaks booting amd64 for me (r362847
>> works and this doesn't).  I have a custom kernel with ZFS but no SCTP so I
>> have no real idea how this could break booting - presumably the
>> eventhandler change has uncovered a bug somewhere else.
>> 
>> The symptoms are that I get:
>> Mounting from zfs:zroot/ROOT/r363310 failed with error 6; retrying for 3 more seconds
>> Mounting from zfs:zroot/ROOT/r363310 failed with error 6
>> 
>> (r363310 is where I was trying to update to and I didn't change the BE
>> name as I was searching for the problem and error 6 is ENXIO).
>> 
>> I tried to reproduce the problem with GENERIC but it hangs after
>> displaying the EFI framebuffer information (I've seen that before and
>> suspect it is a loader problem but haven't dug into it).

I've confirmed that particular problem is bug 209821.  I've disabled
EFI and GENERIC r362848 boots and runs successfully.

>> Does anyone have any ideas?
>
>Did you check that the physical devices where your ZFS pool is located
>are detected, and that kernel messages for their drivers are as usual?
>Overall, is there anything strange in the verbose dmesg ?

There's nothing obviously strange (in particular, I can see the physical
boot/root disk) but the faulty kernel appears to have moved the msgbuf
somewhere unexpected so it's not saved across reboots and I'm limited to
eyeballing the messages via DDB.

Since GENERIC worked, I did some more experimenting and tracked the
problem down to a lack of "options ACPI_DMAR" in my kernel config.
That makes more sense, though I have no idea why it suddenly became
mandatory for my system.

-- 
Peter Jeremy




Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-07-19 Thread Konstantin Belousov
On Sun, Jul 19, 2020 at 09:21:02PM +1000, Peter Jeremy wrote:
> I'm sending this to -stable, rather than the src groups because I
> don't believe the problem is the commit itself, rather the commit
> has uncovered a latent problem elsewhere.
> 
> On 2020-Jul-01 18:03:38 +, Michael Tuexen  wrote:
> >Author: tuexen
> >Date: Wed Jul  1 18:03:38 2020
> >New Revision: 362848
> >URL: https://svnweb.freebsd.org/changeset/base/362848
> >
> >Log:
> >  MFC r353480: Use event handler in SCTP
> 
> I have no idea how, but this update breaks booting amd64 for me (r362847
> works and this doesn't).  I have a custom kernel with ZFS but no SCTP so I
> have no real idea how this could break booting - presumably the
> eventhandler change has uncovered a bug somewhere else.
> 
> The symptoms are that I get:
> Mounting from zfs:zroot/ROOT/r363310 failed with error 6; retrying for 3 more seconds
> Mounting from zfs:zroot/ROOT/r363310 failed with error 6
> 
> (r363310 is where I was trying to update to and I didn't change the BE
> name as I was searching for the problem and error 6 is ENXIO).
> 
> I tried to reproduce the problem with GENERIC but it hangs after
> displaying the EFI framebuffer information (I've seen that before and
> suspect it is a loader problem but haven't dug into it).
> 
> Does anyone have any ideas?

Did you check that the physical devices where your ZFS pool is located
are detected, and that kernel messages for their drivers are as usual?
Overall, is there anything strange in the verbose dmesg ?


Re: svn commit: r362848 - in stable/12/sys: net netinet sys

2020-07-19 Thread Peter Jeremy
I'm sending this to -stable, rather than the src groups because I
don't believe the problem is the commit itself, rather the commit
has uncovered a latent problem elsewhere.

On 2020-Jul-01 18:03:38 +, Michael Tuexen  wrote:
>Author: tuexen
>Date: Wed Jul  1 18:03:38 2020
>New Revision: 362848
>URL: https://svnweb.freebsd.org/changeset/base/362848
>
>Log:
>  MFC r353480: Use event handler in SCTP

I have no idea how, but this update breaks booting amd64 for me (r362847
works and this doesn't).  I have a custom kernel with ZFS but no SCTP so I
have no real idea how this could break booting - presumably the
eventhandler change has uncovered a bug somewhere else.

The symptoms are that I get:
Mounting from zfs:zroot/ROOT/r363310 failed with error 6; retrying for 3 more seconds
Mounting from zfs:zroot/ROOT/r363310 failed with error 6

(r363310 is where I was trying to update to and I didn't change the BE
name as I was searching for the problem and error 6 is ENXIO).
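
The reading of error 6 as ENXIO can be double-checked from userland.  A minimal sketch, assuming a Python interpreter on a system whose errno numbering agrees with FreeBSD's for this value (the message text printed by os.strerror differs between operating systems, so no particular wording is assumed here):

```python
import errno
import os

MOUNT_ERROR = 6  # the value printed in the "failed with error 6" message

# Map the numeric code back to its symbolic name and the local message text.
symbol = errno.errorcode[MOUNT_ERROR]
message = os.strerror(MOUNT_ERROR)

print(symbol, "-", message)
```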

I tried to reproduce the problem with GENERIC but it hangs after
displaying the EFI framebuffer information (I've seen that before and
suspect it is a loader problem but haven't dug into it).

Does anyone have any ideas?

-- 
Peter Jeremy

