Thanks Richard.
I’ve come up with what I think is a better implementation for the PR based on
our use case:
    if (user != null && !user.isEmpty()) {
        cmdArgs.add(user);
        int exitCode = 0;
        try {
            exitCode = new ProcessBuilder(cmdArgs).start().waitFor();
        } catch (Exception e) {
            // Ignore
        } finally {
            if (exitCode != 0) {
                LOG.debug("CMD: '{}' returned exit code of {}",
                    String.join(" ", cmdArgs), exitCode);
                cmdArgs.remove(user);
            }
        }
    }
The code will first attempt to run the id -u command with a user specified. If
this is unsuccessful, then it removes the user from the command-line.
In our scenario we do ultimately need to run the id -u command without a user
specified to output the id of the user running the java process.
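For discussion, here is a minimal standalone sketch of the same fallback idea, written so that it can be unit tested without a real process. The names IdCommandFallback, runQuietly and buildIdCommand are made up for this example and are not part of Storm. One assumption that differs from the snippet above: a command that cannot be started at all is reported as -1, so it also triggers the fallback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper for illustration only; not Storm code.
class IdCommandFallback {

    /** Runs a command and returns its exit code, or -1 if it could not be run. */
    static int runQuietly(List<String> cmd) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            p.getInputStream().readAllBytes(); // drain output so the child cannot block
            return p.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return -1;
        } catch (Exception e) {
            return -1;
        }
    }

    /**
     * Returns "id -u <user>" if probing that command succeeds (exit code 0),
     * otherwise plain "id -u", which reports the uid of the current process.
     */
    static List<String> buildIdCommand(String user, ToIntFunction<List<String>> probe) {
        List<String> base = new ArrayList<>(List.of("id", "-u"));
        if (user != null && !user.isEmpty()) {
            List<String> withUser = new ArrayList<>(base);
            withUser.add(user);
            if (probe.applyAsInt(withUser) == 0) {
                return withUser;
            }
        }
        return base;
    }
}
```

Passing the probe as a parameter is only there so that a test reproducer can substitute a fake exit code; in real use it would be IdCommandFallback::runQuietly.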
Should I go ahead and create a PR for this code change? I’m happy to act as a
test reproducer too.
Thanks,
Steve
From: Richard Zowalla <[email protected]>
Sent: Wednesday, May 15, 2024 3:51 PM
To: [email protected]; Stephen Clark <[email protected]>
Subject: [EXTERNAL] Re: Issue with Worker Termination in Kubernetes
Hi,
I think we should fix that and simply ensure that this command cannot fail for a
non-existent user.
I wouldn’t check for "?", though. Maybe we can define a new property to
explicitly skip those uid checks (similar to what is done for Windows).
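A rough sketch of what such a switch could look like is below. Everything in it is an assumption made for illustration: the class UidCheckSwitch, the method shouldCheckUid and the key "example.supervisor.skip.uid.check" are hypothetical and are not existing Storm settings or APIs, and the Windows test is only a guess at the kind of check meant above.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; neither the class nor the config key exists in Storm.
class UidCheckSwitch {

    // Made-up property name, used only for this example.
    static final String SKIP_UID_CHECK = "example.supervisor.skip.uid.check";

    /** Returns false on Windows or when the skip property is set to true. */
    static boolean shouldCheckUid(Map<String, ?> conf, String osName) {
        if (osName != null && osName.toLowerCase(Locale.ROOT).startsWith("windows")) {
            return false;
        }
        Object value = conf.get(SKIP_UID_CHECK);
        // Accept both Boolean true and the string "true"; anything else means "do not skip".
        boolean skip = value instanceof Boolean
                ? (Boolean) value
                : Boolean.parseBoolean(String.valueOf(value));
        return !skip;
    }
}
```

A caller would pass the daemon configuration map and System.getProperty("os.name"), and run the uid lookup only when the method returns true.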
The patch might work because "id -u" will just output the id of the user
running the java process.
Overall, I guess we are open to a PR with a test reproducer :)
Regards,
Richard
On 14.05.2024 at 16:40, Stephen Clark via user
<[email protected]> wrote:
Hi,
I am running the Storm Supervisor in an image that I've created in Kubernetes
using a securityContext that has the following:
    securityContext:
      runAsUser: 1000620005
      fsGroup: 1000620005
      supplementalGroups: [ 64000 ]
The UID 1000620005 is not related to a user specified in the /etc/passwd file
in the Docker image.
When I kill a topology, this generates the following exception:
2024-05-10 06:26:10.661 [SLOT_6700] itemId= { jobName="" ,jobTemplateId="" ,userOrAppId="" ,tenantId="", jobStep="", scaleCopyJobId=""} ERROR apache.storm.daemon.supervisor.Slot - Error when processing event
java.lang.NullPointerException: null
    at org.apache.storm.utils.ServerUtils.getUserId(ServerUtils.java:1095) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.utils.ServerUtils.isAnyPosixProcessPidDirAlive(ServerUtils.java:1284) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.utils.ServerUtils.isAnyPosixProcessPidDirAlive(ServerUtils.java:1216) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.utils.ServerUtils.areAllProcessesDead(ServerUtils.java:1178) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.container.DefaultResourceIsolationManager.areAllProcessesDead(DefaultResourceIsolationManager.java:146) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.daemon.supervisor.Container.areAllProcessesDead(Container.java:248) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.daemon.supervisor.Slot.killContainerFor(Slot.java:237) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.daemon.supervisor.Slot.handleRunning(Slot.java:792) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.daemon.supervisor.Slot.stateMachineStep(Slot.java:184) ~[storm-server-2.6.1.jar:2.6.1]
    at org.apache.storm.daemon.supervisor.Slot.run(Slot.java:1051) [storm-server-2.6.1.jar:2.6.1]
which in turn means that the supervisor process dies and the pod is restarted.
Looking at the Storm source code, I think the issue is in
storm-server/src/main/java/org/apache/storm/utils/ServerUtils.java, where it has
the following code:
    if (user != null && !user.isEmpty()) {
        cmdArgs.add(user);
    }
which results in the following command being executed:
id -u ?
since with the securityContext specified above there is not a named user
associated with the UID of 1000620005 and a username is not available.
I can see the following in worker.yaml for the topology:
bash-4.2$ cat worker.yaml
worker-id: 145eac49-838f-4796-bd77-c3c99e202e32
logs.users: []
logs.groups: []
topology.submitter.user: '?'
The id -u ? command outputs:
bash-4.2$ id -u ?
id: ?: no such user
This then causes the NullPointerException, since the output can't be parsed as a user id.
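To illustrate the parsing problem, here is a small sketch of a defensive parser for the output of id -u. It is not the Storm implementation (I have not reproduced the real getUserId code here); UidOutputParser and parseUid are hypothetical names, and returning -1 for unusable output is just one possible convention.

```java
// Hypothetical helper for illustration only; not Storm code.
class UidOutputParser {

    /**
     * Returns the uid printed by "id -u", or -1 if the output is missing
     * or is not a plain non-negative integer (for example an error message).
     */
    static long parseUid(String output) {
        if (output == null) {
            return -1;
        }
        String trimmed = output.trim();
        if (trimmed.isEmpty() || !trimmed.chars().allMatch(Character::isDigit)) {
            return -1;
        }
        try {
            return Long.parseLong(trimmed);
        } catch (NumberFormatException e) {
            return -1; // digits only, but too large for a long
        }
    }
}
```

With this convention, output such as "id: ?: no such user" (or empty output) gives -1 instead of an exception, and the caller can decide how to handle an unknown uid.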
I am running with a patch locally that detects whether the username is '?' and
doesn't add the user to the command line. This appears to work:
    if (user != null && !user.isEmpty() && !user.equals("?")) {
        cmdArgs.add(user);
    }
Is there a different technique that would work in this scenario, or does it
require a code change in the storm-server to resolve the issue?
Thanks,
Steve