Thanks Richard.

I’ve come up with what I think is a better implementation for the PR based on 
our use case:

        if (user != null && !user.isEmpty()) {
            cmdArgs.add(user);
            int exitCode = 0;
            try {
                exitCode = new ProcessBuilder(cmdArgs).start().waitFor();
            } catch (Exception e) {
                // Ignore
            } finally {
                if (exitCode != 0) {
                    LOG.debug("CMD: '{}' returned exit code of {}", 
String.join(" ", cmdArgs), exitCode);
                    cmdArgs.remove(user);
                }
            }
        }

The code will first attempt to run the id -u command with a user specified.  If 
this is unsuccessful, then it removes the user from the command-line.

In our scenario we do ultimately need to run the id -u command without a user 
specified to output the id of the user running the java process.

Should I go ahead and create a PR for this code change?  I’m happy to act as a 
test reproducer too.

Thanks,

Steve

From: Richard Zowalla <[email protected]>
Sent: Wednesday, May 15, 2024 3:51 PM
To: [email protected]; Stephen Clark <[email protected]>
Subject: [EXTERNAL] Re: Issue with Worker Termination in Kubernetes


CAUTION: This email originated from outside the organization. Do not click 
links or open attachments unless you recognize the sender and know the content 
is safe.
Hi,

I think we should fix that and just ensure, that this command cannot fail for a 
non existing user.
I wouldn’t check for „?“, though. Maybe we can just define a new property to 
explicitly skip those uid checks (similar as it is done for Windows).

The patch might work because „id -u“ will just output the id of the user 
running the java process.

Overall, I guess, that we are open for a PR with a test reproducer :)

Gruß
Richard


Am 14.05.2024 um 16:40 schrieb Stephen Clark via user 
<[email protected]<mailto:[email protected]>>:

Hi,

I am running the Storm Supervisor in an image that I've created in Kubernetes 
using a securityContext that has the following:

      securityContext:
        runAsUser: 1000620005
        fsGroup: 1000620005
        supplementalGroups: [ 64000 ]

The UID 1000620005 is not related to a user specified in the /etc/passwd file 
in the Docker image.

When I kill a topology, this generates the following exception:

2024-05-10 06:26:10.661 [SLOT_6700] itemId= { jobName="" ,jobTemplateId="" 
,userOrAppId="" ,tenantId="",  jobStep="", scaleCopyJobId=""} ERROR 
apache.storm.daemon.supervisor.Slot - Error when processing event 
java.lang.NullPointerException: null
      at org.apache.storm.utils.ServerUtils.getUserId(ServerUtils.java:1095) 
~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.utils.ServerUtils.isAnyPosixProcessPidDirAlive(ServerUtils.java:1284)
 ~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.utils.ServerUtils.isAnyPosixProcessPidDirAlive(ServerUtils.java:1216)
 ~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.utils.ServerUtils.areAllProcessesDead(ServerUtils.java:1178) 
~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.container.DefaultResourceIsolationManager.areAllProcessesDead(DefaultResourceIsolationManager.java:146)
 ~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.daemon.supervisor.Container.areAllProcessesDead(Container.java:248)
 ~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.daemon.supervisor.Slot.killContainerFor(Slot.java:237) 
~[storm-server-2.6.1.jar:2.6.1]
      at org.apache.storm.daemon.supervisor.Slot.handleRunning(Slot.java:792) 
~[storm-server-2.6.1.jar:2.6.1]
      at 
org.apache.storm.daemon.supervisor.Slot.stateMachineStep(Slot.java:184) 
~[storm-server-2.6.1.jar:2.6.1]
      at org.apache.storm.daemon.supervisor.Slot.run(Slot.java:1051) 
[storm-server-2.6.1.jar:2.6.1]

which in turn means that the supervisor process dies, and the pod is restarted.

In looking at the Storm source code I think that the issue is in 
storm-server/src/main/java/org/apache/storm/utils/ServerUtils.javawhere it has 
the following code:

       if (user != null && !user.isEmpty()) {
            cmdArgs.add(user);
        }

which results in the following command being executed:

id -u ?

since with the securityContext specified above there is not a named user 
associated with the UID of 1000620005 and a username is not available.

I can see the following in worker.yaml for the topology:

bash-4.2$ cat worker.yaml
worker-id: 145eac49-838f-4796-bd77-c3c99e202e32
logs.users: []
logs.groups: []
topology.submitter.user: '?'

The id -u ? command outputs:

bash-4.2$ id -u ?
id: ?: no such user

this then causes the Null Pointer Exception since it can't parse the output.

I am running with a patch locally that detects whether the username is '?' and 
doesn't add the user to the command line.  This appears to work:

        if (user != null && !user.isEmpty() && !user.equals("?")) {
            cmdArgs.add(user);
        }

Is there a different technique that would work in this scenario, or does it 
require a code change in the storm-server to resolve the issue?

Thanks,

Steve

Reply via email to