CCing our Storage team lead (Brad Childs) into the thread.
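
In the meantime, a quick way to check whether transient mount units are
accumulating on a node (a hedged sketch; the exact unit names vary by
install, but a count that keeps growing over time is consistent with
kubernetes/kubernetes#57345):

```shell
#!/bin/sh
# Sketch: count the mount units systemd is currently tracking on a node.
# Run it periodically; a steadily growing count suggests the transient
# mount unit leak described in kubernetes/kubernetes#57345.
if command -v systemctl >/dev/null 2>&1; then
  count=$(systemctl list-units --type=mount --all --no-legend | wc -l)
  echo "mount units tracked by systemd: $count"
else
  # Not a systemd host (or not run on the node itself).
  echo "systemctl not available on this host"
fi
```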

On Tue, Aug 28, 2018 at 7:52 AM, Alan Christie <
[email protected]> wrote:

> OpenShift Master: v3.9.0+ba7faec-1
> Kubernetes Master: v1.9.1+a0ce1bc657
> OpenShift Web Console: v3.9.0+b600d46-dirty
>
> After working successfully for the past few months, my Jenkins deployment
> started to fail to launch build agents for jobs. The event error was
> essentially *Failed to start transient scope unit: Argument list too long*.
> The error was initially confusing because it’s just running the same agents
> it’s always been running. The agents are configured to live for a short
> time (15 minutes) after which they’re removed and another created when
> necessary.
>
> All this has been perfectly functional up until today.
>
> The complete event error was: -
>
> MountVolume.SetUp failed for volume "fs-input" : mount failed: exit status 1
> Mounting command: systemd-run
> Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input --scope -- mount -t nfs -o ro bastion.novalocal:/data/fs-input /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input
> Output: Failed to start transient scope unit: Argument list too long
>
>
> I suspect it might be related to Kubernetes issue #57345
> <https://github.com/kubernetes/kubernetes/issues/57345> : *Number of
> "loaded inactive dead" systemd transient mount units continues to grow*.
>
> In an attempt to rectify the situation I tried the issue's suggestion, which
> was to run: -
>
> $ *sudo systemctl daemon-reload*
>
> ...on the affected node(s). It worked on all nodes except the one that was
> giving me problems. On the “broken” node the command took a few seconds to
> complete but failed, responding with: -
>
>
> *Failed to execute operation: Connection timed out*
>
> I was unable to reboot the node from the command line (clearly the system
> was polluted to the point that it was essentially unusable) and I was
> forced to resort to rebooting the node by other means.
>
> When the node came back, Jenkins and its deployments eventually returned to
> an operational state.
>
> So it looks like the issue report may be right: - *the number of systemd
> transient mount units continues to grow unchecked on nodes*.
>
> Although I've recovered the system and now believe I have a work-around
> for the next time I see the underlying fault, I wonder: has anyone else
> seen this in 3.9, and is there a long-term solution?
>
> Alan Christie
> [email protected]
>
>
>
>
> _______________________________________________
> users mailing list
> [email protected]
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>
>


-- 
Ben Parees | OpenShift