Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-29 Thread Ben Parees
On Mon, Oct 28, 2019 at 11:24 PM Clayton Coleman 
wrote:

> There is a known bug in 4.2 where image stream content from the
> release payload is not mirrored correctly.  That is slated to be
> fixed.
>

Being tracked for 4.3 here:
https://bugzilla.redhat.com/show_bug.cgi?id=1741391

Backport BZ for 4.2.z hasn't been created yet, but it'll be linked from that
bug when it is.



>
> > On Oct 28, 2019, at 8:39 PM, W. Trevor King  wrote:
> >
> >> On Mon, Oct 28, 2019 at 5:08 PM Joel Pearson wrote:
> >> It looks like image streams don't honor the imageContentSources mirror,
> and try to reach out to the internet.
> >
> > Mirrors only apply to by-digest pullspecs.  You may need to patch your
> > image specs to use the mirrors [1] explicitly if you want them.
> >
> > Cheers,
> > Trevor
> >
> > [1]:
> https://github.com/openshift/release/blob/592c29c2b0f5422201e0cb2dce5e3f6bb7654cce/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L222-L263
> >
> > ___
> > users mailing list
> > users@lists.openshift.redhat.com
> > http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>


-- 
Ben Parees | OpenShift


Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Clayton Coleman
There is a known bug in 4.2 where image stream content from the
release payload is not mirrored correctly.  That is slated to be
fixed.

> On Oct 28, 2019, at 8:39 PM, W. Trevor King  wrote:
>
>> On Mon, Oct 28, 2019 at 5:08 PM Joel Pearson wrote:
>> It looks like image streams don't honor the imageContentSources mirror, and 
>> try to reach out to the internet.
>
> Mirrors only apply to by-digest pullspecs.  You may need to patch your
> image specs to use the mirrors [1] explicitly if you want them.
>
> Cheers,
> Trevor
>
> [1]: 
> https://github.com/openshift/release/blob/592c29c2b0f5422201e0cb2dce5e3f6bb7654cce/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L222-L263
>



Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread W. Trevor King
On Mon, Oct 28, 2019 at 5:08 PM Joel Pearson wrote:
> It looks like image streams don't honor the imageContentSources mirror, and 
> try to reach out to the internet.

Mirrors only apply to by-digest pullspecs.  You may need to patch your
image specs to use the mirrors [1] explicitly if you want them.
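For example, the rewrite can be sketched in shell (registry.example.com:5000/ocp/release is a hypothetical mirror host; substitute whatever your own 'oc adm release mirror' run reported, and the digest here is the must-gather one mentioned elsewhere in this thread):

```shell
# Hypothetical mirror host; substitute the value from your own
# 'oc adm release mirror' run.
orig=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
mirror=registry.example.com:5000/ocp/release

# Mirrors only match by-digest pullspecs, so keep the @sha256:... part
# and swap only the repository in front of it:
mirrored="${mirror}@${orig#*@}"
echo "${mirrored}"

# Then point the ImageStream tag at the mirror explicitly, e.g.:
# oc tag --source=docker "${mirrored}" openshift/must-gather:latest
```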

Cheers,
Trevor

[1]: 
https://github.com/openshift/release/blob/592c29c2b0f5422201e0cb2dce5e3f6bb7654cce/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L222-L263




Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Joel Pearson
>
> Almost always means a node is broken / blocked / unable to schedule pods,
> which prevents DNS from deploying.


That's the weird thing though. DNS is deployed, and all the nodes are happy
according to "oc get nodes".

It seems that the operator is misreporting the error.  In the console
dashboard there are a number of alerts that seem out of date, which I'm
not able to clear either.

The dns-default DaemonSet says that 7 of 7 pods are ok.

Is there a way to reboot/re-initialise a "stuck" operator?


Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Clayton Coleman
On Oct 28, 2019, at 8:07 PM, Joel Pearson 
wrote:

> Maybe must-gather could be included in the release manifest so that it's
> available in disconnected environments by default?
> It is:
>   $ oc adm release info --image-for=must-gather
> quay.io/openshift-release-dev/ocp-release:4.2.0
>
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
> If your 'oc adm must-gather' is reaching out to Quay, instead of
> hitting your mirror, it may be because your samples operator has yet
> to get the mirrored must-gather ImageStream set up.


It looks like image streams don't honor the imageContentSources mirror, and
try to reach out to the internet.

I had a look at the openshift/must-gather image stream and there was an
error saying:

Internal error occurred: Get https://quay.io/v2: dial tcp: lookup quay.io
on 172.30.0.10:53 server misbehaving

Running "oc adm must-gather --image
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993"
actually worked.

> That (un)available typo should be fixed in master by [1], but looks
> like that hasn't been backported to 4.2.z.  But look for the
> machine-config daemon that is unready (possibly by listing Pods), and
> see why it's not going ready.


Turns out that all of the machine-config daemons are ready (I can see 7 of
them, all marked as ready), but the machine-config operator just doesn't
appear to be trying anymore.

It's listed as Available=False Progressing=False and Degraded=True.

I tried deleting the operator pod in the hope that it'd kickstart
something, but it didn't seem to help.

I noticed a message right up the top saying:
event.go:247] Could not construct reference to: '' Will not
report event 'Normal' 'LeaderElection' 'machine-config-operator-5f47...
become leader'

The pod that I deleted had that same message too, is this a red herring?

I have must-gather logs now, except that it will probably be complicated to
get them off this air-gapped system.  Are there any pointers about where I
should look to find out why it's no longer progressing? Can I make the
operator try again somehow?

I also noticed that the dns operator is marked available, but there is a
degraded status saying "Not all desired DNS DaemonSets available";
however, they are all available.


Almost always means a node is broken / blocked / unable to schedule pods,
which prevents DNS from deploying.


On Tue, 29 Oct 2019 at 05:24, W. Trevor King  wrote:

> On Mon, Oct 28, 2019 at 4:05 AM Joel Pearson wrote:
> > Maybe must-gather could be included in the release manifest so that it's
> available in disconnected environments by default?
>
> It is:
>
>   $ oc adm release info --image-for=must-gather
> quay.io/openshift-release-dev/ocp-release:4.2.0
>
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
>
> If your 'oc adm must-gather' is reaching out to Quay, instead of
> hitting your mirror, it may be because your samples operator has yet
> to get the mirrored must-gather ImageStream set up.
>
> >> Failed to resync 4.2.0 because: timed out waiting for the condition
> during waitForDaemonsetRollout: Daemonset machine-config-daemon is not
> ready. status (desired:7, updated 7, ready: 6, unavailable: 6)
>
> That (un)available typo should be fixed in master by [1], but looks
> like that hasn't been backported to 4.2.z.  But look for the
> machine-config daemon that is unready (possibly by listing Pods), and
> see why it's not going ready.
>
> Cheers,
> Trevor
>
> [1]:
> https://github.com/openshift/machine-config-operator/commit/efb6a96a5bcb13cb3c0c0a0ac0c2e7b022b72665
>


-- 
Kind Regards,

Joel Pearson
Agile Digital | Senior Software Consultant

Love Your Software™ | ABN 98 106 361 273
p: 1300 858 277 | m: 0405 417 843 | w: agiledigital.com.au



Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Joel Pearson
>
> > Maybe must-gather could be included in the release manifest so that it's
> available in disconnected environments by default?
> It is:
>   $ oc adm release info --image-for=must-gather
> quay.io/openshift-release-dev/ocp-release:4.2.0
>
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
> If your 'oc adm must-gather' is reaching out to Quay, instead of
> hitting your mirror, it may be because your samples operator has yet
> to get the mirrored must-gather ImageStream set up.


It looks like image streams don't honor the imageContentSources mirror, and
try to reach out to the internet.

I had a look at the openshift/must-gather image stream and there was an
error saying:

Internal error occurred: Get https://quay.io/v2: dial tcp: lookup quay.io
on 172.30.0.10:53 server misbehaving

Running "oc adm must-gather --image
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993"
actually worked.

> That (un)available typo should be fixed in master by [1], but looks
> like that hasn't been backported to 4.2.z.  But look for the
> machine-config daemon that is unready (possibly by listing Pods), and
> see why it's not going ready.


Turns out that all of the machine-config daemons are ready (I can see 7 of
them, all marked as ready), but the machine-config operator just doesn't
appear to be trying anymore.

It's listed as Available=False Progressing=False and Degraded=True.
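Those three conditions can be dumped along with their messages like so (a sketch; the jsonpath template is standard oc syntax, and the Degraded message usually names the blocked resource):

```shell
# Print each condition of the machine-config ClusterOperator on its own
# line; built as a string here just to show the exact command.
tmpl='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
echo "oc get clusteroperator machine-config -o jsonpath='${tmpl}'"
```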

I tried deleting the operator pod in the hope that it'd kickstart
something, but it didn't seem to help.

I noticed a message right up the top saying:
event.go:247] Could not construct reference to: '' Will not
report event 'Normal' 'LeaderElection' 'machine-config-operator-5f47...
become leader'

The pod that I deleted had that same message too, is this a red herring?

I have must-gather logs now, except that it will probably be complicated to
get them off this air-gapped system.  Are there any pointers about where I
should look to find out why it's no longer progressing? Can I make the
operator try again somehow?

I also noticed that the dns operator is marked available, but there is a
degraded status saying "Not all desired DNS DaemonSets available";
however, they are all available.

On Tue, 29 Oct 2019 at 05:24, W. Trevor King  wrote:

> On Mon, Oct 28, 2019 at 4:05 AM Joel Pearson wrote:
> > Maybe must-gather could be included in the release manifest so that it's
> available in disconnected environments by default?
>
> It is:
>
>   $ oc adm release info --image-for=must-gather
> quay.io/openshift-release-dev/ocp-release:4.2.0
>
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
>
> If your 'oc adm must-gather' is reaching out to Quay, instead of
> hitting your mirror, it may be because your samples operator has yet
> to get the mirrored must-gather ImageStream set up.
>
> >> Failed to resync 4.2.0 because: timed out waiting for the condition
> during waitForDaemonsetRollout: Daemonset machine-config-daemon is not
> ready. status (desired:7, updated 7, ready: 6, unavailable: 6)
>
> That (un)available typo should be fixed in master by [1], but looks
> like that hasn't been backported to 4.2.z.  But look for the
> machine-config daemon that is unready (possibly by listing Pods), and
> see why it's not going ready.
>
> Cheers,
> Trevor
>
> [1]:
> https://github.com/openshift/machine-config-operator/commit/efb6a96a5bcb13cb3c0c0a0ac0c2e7b022b72665
>


-- 
Kind Regards,

Joel Pearson
Agile Digital | Senior Software Consultant

Love Your Software™ | ABN 98 106 361 273
p: 1300 858 277 | m: 0405 417 843 | w: agiledigital.com.au


Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Clayton Coleman
Yes, that is a known 4.2 bug.

> On Oct 28, 2019, at 2:24 PM, W. Trevor King  wrote:
>
>> On Mon, Oct 28, 2019 at 4:05 AM Joel Pearson wrote:
>> Maybe must-gather could be included in the release manifest so that it's 
>> available in disconnected environments by default?
>
> It is:
>
>  $ oc adm release info --image-for=must-gather
> quay.io/openshift-release-dev/ocp-release:4.2.0
>  
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993
>
> If your 'oc adm must-gather' is reaching out to Quay, instead of
> hitting your mirror, it may be because your samples operator has yet
> to get the mirrored must-gather ImageStream set up.
>
>>> Failed to resync 4.2.0 because: timed out waiting for the condition during 
>>> waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. 
>>> status (desired:7, updated 7, ready: 6, unavailable: 6)
>
> That (un)available typo should be fixed in master by [1], but looks
> like that hasn't been backported to 4.2.z.  But look for the
> machine-config daemon that is unready (possibly by listing Pods), and
> see why it's not going ready.
>
> Cheers,
> Trevor
>
> [1]: 
> https://github.com/openshift/machine-config-operator/commit/efb6a96a5bcb13cb3c0c0a0ac0c2e7b022b72665
>



Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread W. Trevor King
On Mon, Oct 28, 2019 at 4:05 AM Joel Pearson wrote:
> Maybe must-gather could be included in the release manifest so that it's 
> available in disconnected environments by default?

It is:

  $ oc adm release info --image-for=must-gather
quay.io/openshift-release-dev/ocp-release:4.2.0
  
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ff29512304f77b0ab70ea6850e7f8295a4d19e497ab690ea5102a7044ea993

If your 'oc adm must-gather' is reaching out to Quay, instead of
hitting your mirror, it may be because your samples operator has yet
to get the mirrored must-gather ImageStream set up.

>> Failed to resync 4.2.0 because: timed out waiting for the condition during 
>> waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. 
>> status (desired:7, updated 7, ready: 6, unavailable: 6)

That (un)available typo should be fixed in master by [1], but looks
like that hasn't been backported to 4.2.z.  But look for the
machine-config daemon that is unready (possibly by listing Pods), and
see why it's not going ready.
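A minimal way to do that listing, assuming the standard 4.2 namespace and
label for the machine-config daemon pods (built as a string here just to
show the exact command line):

```shell
# Standard MCO namespace/label in 4.2 (an assumption; adjust if your
# cluster differs).
ns=openshift-machine-config-operator
list_cmd="oc -n ${ns} get pods -l k8s-app=machine-config-daemon -o wide"
echo "${list_cmd}"
# Run that against the cluster to spot the unready pod, then:
# oc -n "${ns}" describe pod <unready-pod-name>
```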

Cheers,
Trevor

[1]: 
https://github.com/openshift/machine-config-operator/commit/efb6a96a5bcb13cb3c0c0a0ac0c2e7b022b72665




Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-28 Thread Joel Pearson
So I got past bootstrap this time, and it made it almost all the way; it
got stuck on the machine-config operator.  All the other cluster operators
have passed.

I'm not really sure how to diagnose what is wrong with the machine-config
operator.  I tried 'oc adm must-gather', but it didn't work because the
must-gather container isn't part of the release manifest, so it tried to
reach out to quay.io to download that container, which obviously fails in a
disconnected environment.  Maybe must-gather could be included in the
release manifest so that it's available in disconnected environments by
default?

I instead ran "oc describe clusteroperator machine-config", and this error
message was there; how do I diagnose this?

> Failed to resync 4.2.0 because: timed out waiting for the condition during
> waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready.
> status (desired:7, updated 7, ready: 6, unavailable: 6)


On Mon, 28 Oct 2019 at 05:59, Clayton Coleman  wrote:

> We probably need to remove the example from the docs and highlight
> that you must copy the value reported by image mirror
>
> > On Oct 27, 2019, at 11:33 AM, W. Trevor King  wrote:
> >
> >> On Sun, Oct 27, 2019 at 2:17 AM Joel Pearson wrote:
> >> Ooh, does this mean 4.2.2 is out or the release is imminent? Should I
> be trying to install 4.2.2 instead of 4.2.0?
> >
> > 4.2.2 exists and is in candidate-4.2.  That means it's currently
> > unsupported.  The point of candidate-* testing is to test the releases
> > to turn up anything that should block them going stable, so we're
> > certainly launching a bunch of throw-away 4.2.2 clusters at this
> > point, and other folks are welcome to do that too.  But if you want
> > stability and support, you should wait until it is promoted into
> > fast-4.2 or stable-4.2 (which may never happen if testing turns up a
> > serious-enough issue).  So "maybe" to both your questions ;).
> >
> >> I mirrored quay.io/openshift-release-dev/ocp-release:4.2.0
> >
> > Yeah, should be no CI-registry images under that.
> >
> > Cheers,
> > Trevor
> >


Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-27 Thread Clayton Coleman
We probably need to remove the example from the docs and highlight
that you must copy the value reported by image mirror

> On Oct 27, 2019, at 11:33 AM, W. Trevor King  wrote:
>
>> On Sun, Oct 27, 2019 at 2:17 AM Joel Pearson wrote:
>> Ooh, does this mean 4.2.2 is out or the release is imminent? Should I be 
>> trying to install 4.2.2 instead of 4.2.0?
>
> 4.2.2 exists and is in candidate-4.2.  That means it's currently
> unsupported.  The point of candidate-* testing is to test the releases
> to turn up anything that should block them going stable, so we're
> certainly launching a bunch of throw-away 4.2.2 clusters at this
> point, and other folks are welcome to do that too.  But if you want
> stability and support, you should wait until it is promoted into
> fast-4.2 or stable-4.2 (which may never happen if testing turns up a
> serious-enough issue).  So "maybe" to both your questions ;).
>
>> I mirrored quay.io/openshift-release-dev/ocp-release:4.2.0
>
> Yeah, should be no CI-registry images under that.
>
> Cheers,
> Trevor
>



Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-27 Thread W. Trevor King
On Sun, Oct 27, 2019 at 2:17 AM Joel Pearson wrote:
> Ooh, does this mean 4.2.2 is out or the release is imminent? Should I be 
> trying to install 4.2.2 instead of 4.2.0?

4.2.2 exists and is in candidate-4.2.  That means it's currently
unsupported.  The point of candidate-* testing is to test the releases
to turn up anything that should block them going stable, so we're
certainly launching a bunch of throw-away 4.2.2 clusters at this
point, and other folks are welcome to do that too.  But if you want
stability and support, you should wait until it is promoted into
fast-4.2 or stable-4.2 (which may never happen if testing turns up a
serious-enough issue).  So "maybe" to both your questions ;).

> I mirrored quay.io/openshift-release-dev/ocp-release:4.2.0

Yeah, should be no CI-registry images under that.

Cheers,
Trevor




Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-27 Thread Joel Pearson
>
> $ oc adm release info --pullspecs
> quay.io/openshift-release-dev/ocp-release:4.2.2 | grep -A3 Images:


Ooh, does this mean 4.2.2 is out or the release is imminent? Should I be
trying to install 4.2.2 instead of 4.2.0?

> ... And it's not in [1], although you should just be
> recycling whatever 'oc adm release mirror' suggests instead of blindly
> copy/pasting from docs.  Which release did you mirror?


Thanks for this information. Looks like I must have skipped reading the
output of the mirror command. Thanks for that heads up!

I mirrored quay.io/openshift-release-dev/ocp-release:4.2.0

> I dunno what happened with your API-server lock-up, but
> 'openshift-install gather bootstrap ...' will SSH into your bootstrap
> machine and from there onto the control-plane machines and gather the
> things we expected would be useful for debugging this sort of thing,
> so probably start with that.


I'll try out "openshift-install gather bootstrap" tomorrow. It sounds very
useful, thanks for that information.

Thanks,

Joel


Re: Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-26 Thread W. Trevor King
On Fri, Oct 25, 2019 at 3:01 AM Joel Pearson wrote:
> One strange thing that happened was that it was trying to download images 
> from "quay.io/openshift-release-dev/ocp-v4.0-art-dev" instead of the 
> documented "quay.io/openshift-release-dev/ocp-release".

The operator images, and the other images referenced by the release
image, are there.  For example:

  $ oc adm release info --pullspecs
quay.io/openshift-release-dev/ocp-release:4.2.2 | grep -A3 Images:
  Images:
NAME  PULL SPEC
aws-machine-controllers
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3366f4120ebf7465f147a5e43d6e1c43a76cefbd461cdbe25349f8b3c735bc1d
azure-machine-controllers
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f724056624dc3b5130c9b80a88891e21579ddb6fa49c1ed21bd6173879609d2

> - mirrors: - :5000//release
>   source: registry.svc.ci.openshift.org/ocp/release

This entry should only be needed if you are installing a release from
the CI registry.  And it's not in [1], although you should just be
recycling whatever 'oc adm release mirror' suggests instead of blindly
copy/pasting from docs.  Which release did you mirror?
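For reference, 'oc adm release mirror' ends by printing the exact
imageContentSources stanza to paste into install-config.yaml; for a
quay.io release it looks roughly like this (registry.example.com:5000/ocp/release
is a hypothetical mirror host, and note there is no
registry.svc.ci.openshift.org entry):

```yaml
imageContentSources:
- mirrors:
  - registry.example.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - registry.example.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
```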

> Where can I look for errors?

I dunno what happened with your API-server lock-up, but
'openshift-install gather bootstrap ...' will SSH into your bootstrap
machine and from there onto the control-plane machines and gather the
things we expected would be useful for debugging this sort of thing,
so probably start with that.
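The invocation needs the bootstrap and control-plane addresses; a sketch
with made-up IPs (substitute your own, and run it from the directory
holding the install assets):

```shell
# Made-up addresses for illustration; the installer SSHes to the
# bootstrap host and from there to the masters to collect logs.
cmd=(openshift-install gather bootstrap
     --bootstrap 10.0.0.5
     --master 10.0.0.11 --master 10.0.0.12 --master 10.0.0.13)
echo "${cmd[*]}"
# "${cmd[@]}"   # uncomment to actually run the gather
```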

Cheers,
Trevor

[1]: 
https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-preparations.html#installation-mirror-repository_installing-restricted-networks-preparations




Failing to bootstrap disconnected 4.2 cluster on metal

2019-10-25 Thread Joel Pearson
Hi,

I'm trying to bootstrap a disconnected (air-gapped) 4.2 cluster using the
bare metal method.  It is technically VMware, but I'm following the bare
metal version as our VMware cluster wasn't quite compatible with the
VMware instructions.

After a few false starts I managed to get the bootstrapping to start to
take place.  One strange thing that happened was that it was trying to
download images from "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
instead of the documented "quay.io/openshift-release-dev/ocp-release". I
found this rather odd, and I couldn't find many references to
"ocp-v4.0-art-dev" on the internet, so I'm not sure exactly where it came
from.  I did a "strings openshift-install | grep ocp-v4.0-art-dev" but that
didn't show anything, so it's a bit of a strange one.

So my image content sources ended up being:

imageContentSources:
- mirrors:
  - :5000//release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - :5000//release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - :5000//release
  source: registry.svc.ci.openshift.org/ocp/release

I was watching journalctl on the bootstrap server, and I saw each etcd
server join one by one.  Once they had all joined, the apiserver on the
bootstrap server seemed to lock up: when I tried to connect to
https://localhost:6443 the connections would hang.  Initially, I thought
this meant that bootstrap had completed, but then I noticed that none of
the master nodes were listening on 6443; they were all trying to look
themselves up in etcd at "api-int.." but nothing
was listening.

I then scoured the journal on the bootstrap node, but I struggled to find
logs related to why the apiserver had disappeared.  The journal was mostly
full of the bootstrap node trying to connect to https://localhost:6443,
which suggested to me that bootstrap was not yet complete.

I tried rebooting the bootstrap node, but I think that made it worse, it
seemed to be in a crash loop whinging about files in /etc/kubernetes
already existing or something like that.  I had a look through /var/logs
and found this error message in some pod logs:

exiting because of error: log: unable to create log: open
/var/log/bootstrap-control-plane/kube-apiserver.log: permission denied

I'm not sure if that error is because I restarted before bootstrap was
successful, or if that is actually some sort of problem.

I tried reinstalling from scratch a few times, and it always got stuck in
the same place, so it doesn't seem to be transient.

Where can I look for errors? Is "ocp-v4.0-art-dev" an indication of a
problem? Since it's an air-gapped solution it's difficult to get logs out
of the system, so I don't know if I'll be able to use must-gather.
However, if I'm understanding it correctly, must-gather can only be used
after bootstrap has succeeded.

Thoughts?