CCing our Storage team lead (Brad Childs) into the thread.
On Tue, Aug 28, 2018 at 7:52 AM, Alan Christie <[email protected]> wrote:

> OpenShift Master: v3.9.0+ba7faec-1
> Kubernetes Master: v1.9.1+a0ce1bc657
> OpenShift Web Console: v3.9.0+b600d46-dirty
>
> After working successfully for the past few months, my Jenkins deployment
> started to fail to launch build agents for jobs. The event error was
> essentially *Failed to start transient scope unit: Argument list too long*.
> The error was initially confusing because it's just running the same
> agents it's always been running. The agents are configured to live for a
> short time (15 minutes), after which they're removed and another is
> created when necessary.
>
> All of this had been perfectly functional up until today.
>
> The complete event error was:
>
> MountVolume.SetUp failed for volume "fs-input" : mount failed: exit status 1
> Mounting command: systemd-run
> Mounting arguments: --description=Kubernetes transient mount for
> /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input
> --scope -- mount -t nfs -o ro bastion.novalocal:/data/fs-input
> /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input
> Output: Failed to start transient scope unit: Argument list too long
>
> I suspect it might be related to Kubernetes issue #57345
> <https://github.com/kubernetes/kubernetes/issues/57345>: *Number of
> "loaded inactive dead" systemd transient mount units continues to grow*.
>
> In an attempt to rectify the situation I tried the issue's suggestion,
> which was to run:
>
> $ sudo systemctl daemon-reload
>
> ...on the affected node(s). It worked on all nodes except the one that
> was giving me problems.
> On the "broken" node the command took a few seconds to complete but
> failed, responding with:
>
> *Failed to execute operation: Connection timed out*
>
> I was unable to reboot the node from the command line (clearly the system
> was polluted to the point that it was essentially unusable) and I was
> forced to resort to rebooting the node by other means.
>
> When the node returned, Jenkins and its deployments eventually returned
> to an operational state.
>
> So it looks like the issue may be right: *the number of systemd transient
> mount units continues to grow unchecked on nodes*.
>
> Although I've recovered the system and now believe I have a work-around
> for the underlying fault the next time I see it, I wonder whether anyone
> else has seen this in 3.9 and whether there is a long-term solution.
>
> Alan Christie
> [email protected]
>
> _______________________________________________
> users mailing list
> [email protected]
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users

--
Ben Parees | OpenShift
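For anyone watching for the same symptom, a quick way to check whether "loaded inactive dead" transient mount units are accumulating on a node is to count them in the output of `systemctl list-units --all --type=mount --no-legend`. A minimal sketch, assuming GNU grep; the `count_dead_mounts` helper name and the sample listing below are illustrative, not real output from the affected node:

```shell
#!/bin/sh
# Count "loaded inactive dead" mount units in a captured listing, i.e. the
# leaked transient units described in kubernetes/kubernetes#57345.
# On a live node you would capture the listing with:
#   systemctl list-units --all --type=mount --no-legend > /tmp/units.txt
count_dead_mounts() {
  grep -c 'loaded *inactive *dead' "$1"
}

# Illustrative sample listing (fabricated lines, for demonstration only):
cat > /tmp/units.txt <<'EOF'
run-abc123.mount loaded inactive dead    Kubernetes transient mount
var-lib.mount    loaded active   mounted /var/lib
run-def456.mount loaded inactive dead    Kubernetes transient mount
EOF

count_dead_mounts /tmp/units.txt
```

If that number keeps growing between checks, the node is likely heading toward the "Argument list too long" failure, and a `systemctl daemon-reload` (or, failing that, a reboot) is worth scheduling before it becomes unresponsive.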
