Ok. You can close it as CNR then.

thanks,

Chris

On 11/8/18 10:14 AM, Gary Adams wrote:
Different ipc mechanisms are used on each target platform
  solaris - doors
  linux - unix domain sockets
  windows - named pipes

Solaris machines have a tendency to run for years without a need to reboot.

I have not examined the actual test machine where the original problem
was reported, but I was informed once about hundreds of leftover temp files
on another of the test machines.

The files could have accumulated over time from jprt or mach5 infrastructure
transitions. The collision on pid reuse alone would make this an extremely
rare event.

If the "Permission denied" issue appears again, we should get a dump of
the temp directory from the failed machine.

On 11/8/18, 12:42 PM, Chris Plummer wrote:
Ok. Any idea what might have led to that PID to be created by another user? Isn't all our testing done using the same userid? Also, why is this only a solaris problem?

Chris

On 11/7/18 7:43 PM, gary.ad...@oracle.com wrote:
There is recovery code that cleans up old attach files,
but a permission denied error would prevent the clean
up operation from taking place. We're looking at a case of
a file owned by another user as well as a pid recycling
and failed clean up situation.

On 11/7/18 3:09 PM, Chris Plummer wrote:
Are you saying that the PID was recycled, so the issue is a test run from long ago leaving behind the file? If so, I thought there was something in the attach code that cleaned up these PID files that were left behind. In general I would not count this as an infra issue. PID files left behind are the fault of the JVM. Maybe there is something wrong with the cleanup code on solaris, or maybe we don't clean up on any platform, but cycle through PIDs a lot faster on solaris.

Chris

On 11/7/18 11:20 AM, Gary Adams wrote:
If there are no further suggestions on JDK-8210337,
I plan to close it out as cannot reproduce.

Similar bugs had been filed for the "Permission denied" error
from the openDoor request failure and each was attributed
to an infrastructure issue. e.g. another user with the same
pid left a temporary file that is blocking the current test
from attaching correctly.

On 10/4/18, 1:49 PM, Gary Adams wrote:
My delay and retry did not fix the problem with permission denied.

When I was diagnosing the problem I instrumented the code
to catch an IOException and call checkPermission to get
more detail about the IOException. The error reported
from calling checkPermission was ENOENT (stat).

The code change I then proposed was catch the IOException,
delay, and retry the open. That fixed the problem of
ENOENT, but had nothing to do with "permission denied".

On 10/4/18, 1:25 PM, Chris Plummer wrote:
But I also thought you said the delay and retry fixed the problem. How could fix the problem if it is just duplicating something that is already in place?

Chris

On 10/4/18 9:48 AM, Gary Adams wrote:
My delay and retry just duplicated the openDoor retry.
The normal processing of FileNotFoundException(ENOENT) is to retry
several times until the file is available.

But the original problem reported is a "Permission denied" (EACCESS or EPERM).
Delay and retry will not resolve a permissions error.

On 10/4/18, 12:30 PM, Chris Plummer wrote:
Didn't the retry after 100ms delay work? If yes, why would it if the problem is that a java_pid was not cleaned up?

Chris

On 10/4/18 8:54 AM, Gary Adams wrote:
First, let me retract the proposed change,
it is not the right solution to the problem originally
reported.

Second, as a bit of explanation consider the code fragments below.

The high level processing calls openDoor which is willing to retry
the operation as long as the error is flagged specifically
as a FileNotFoundException.

  VirtualMachineImpl.java:72
  VirtualMachineImpl.c:81

During my testing I had added a check VirtualMachineImpl.java:214
and when an IOException was detected made a call to checkPermissions
to get more detailed information about the IOException. The error
I saw was an ENOENT from the stat call. And not the detailed checks for
specific permissions issues (VirtualMachineImpl.c:143)

    VirtualMachineImpl.c:118
    VirtualMachineImpl.c:147

What I missed in the original proposed solution was a FileNotFoundException
extends IOException. That means my delay and retry just duplicates the higher
level retry around the openDoor call.

Third, the original error message logged in the bug report :

java.io.IOException: Permission denied 
at jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native Method)

had to have come from

  VirtualMachineImpl.c:70
  VirtualMachineImpl.c:84

which means the actual open call reported the file does exist
but the permissions do not allow the file to be accessed.
That also means the normal mechanism of removing leftover
java_pid files would not have cleaned up another user's
java_pid files.

=====
src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java:
...
    67            // Opens the door file to the target VM. If the file is not
    68            // found it might mean that the attach mechanism isn't started in the
    69            // target VM so we attempt to start it and retry.
    70            try {
    71                fd = openDoor(pid);
    72            } catch (FileNotFoundException fnf1) {
    73                File f = createAttachFile(pid);
    74                try {
    75                    sigquit(pid);
    76   
    77                    // give the target VM time to start the attach mechanism
    78                    final int delay_step = 100;
    79                    final long timeout = attachTimeout();
    80                    long time_spend = 0;
    81                    long delay = 0;
    82                    do {
    83                        // Increase timeout on each attempt to reduce polling
    84                        delay += delay_step;
    85                        try {
    86                            Thread.sleep(delay);
    87                        } catch (InterruptedException x) { }
    88                        try {
    89                            fd = openDoor(pid);
    90                        } catch (FileNotFoundException fnf2) {
    91                            // pass
    92                        }
    93   
    94                        time_spend += delay;
    95                        if (time_spend > timeout/2 && fd == -1) {
    96                            // Send QUIT again to give target VM the last chance to react
    97                            sigquit(pid);
    98                        }
    99                    } while (time_spend <= timeout && fd == -1);
   100                    if (fd  == -1) {
   101                        throw new AttachNotSupportedException(
   102                            String.format("Unable to open door %s: " +
   103                              "target process %d doesn't respond within %dms " +
   104                              "or HotSpot VM not loaded", socket_path, pid, time_spend));
   105                    }
...
   212        // The door is attached to .java_pid<pid> in the temporary directory.
   213        private int openDoor(int pid) throws IOException {
   214            socket_path = tmpdir + "/.java_pid" + pid;
   215            fd = open(socket_path);
   216   
   217            // Check that the file owner/permission to avoid attaching to
   218            // bogus process
   219            try {
   220                checkPermissions(socket_path);
   221            } catch (IOException ioe) {
   222                close(fd);
   223                throw ioe;
   224            }
   225            return fd;
   226        }

=====
src/jdk.attach/solaris/native/libattach/VirtualMachineImpl.c:
...
    59    JNIEXPORT jint JNICALL Java_sun_tools_attach_VirtualMachineImpl_open
    60      (JNIEnv *env, jclass cls, jstring path)
    61    {
    62        jboolean isCopy;
    63        const char* p = GetStringPlatformChars(env, path, &isCopy);
    64        if (p == NULL) {
    65            return 0;
    66        } else {
    67            int fd;
    68            int err = 0;
    69   
    70            fd = open(p, O_RDWR);
    71            if (fd == -1) {
    72                err = errno;
    73            }
    74   
    75            if (isCopy) {
    76                JNU_ReleaseStringPlatformChars(env, path, p);
    77            }
    78   
    79            if (fd == -1) {
    80                if (err == ENOENT) {
    81                    JNU_ThrowByName(env, "java/io/FileNotFoundException", NULL);
    82                } else {
    83                    char* msg = strdup(strerror(err));
    84                    JNU_ThrowIOException(env, msg);
    85                    if (msg != NULL) {
    86                        free(msg);
    87                    }
    88                }
    89            }
    90            return fd;
    91        }
    92    }
...
    99    JNIEXPORT void JNICALL Java_sun_tools_attach_VirtualMachineImpl_checkPermissions
   100      (JNIEnv *env, jclass cls, jstring path)
   101    {
   102        jboolean isCopy;
   103        const char* p = GetStringPlatformChars(env, path, &isCopy);
   104        if (p != NULL) {
   105            struct stat64 sb;
   106            uid_t uid, gid;
   107            int res;
   108   
   109            memset(&sb, 0, sizeof(struct stat64));
   110   
   111            /*
   112             * Check that the path is owned by the effective uid/gid of this
   113             * process. Also check that group/other access is not allowed.
   114             */
   115            uid = geteuid();
   116            gid = getegid();
   117   
   118            res = stat64(p, &sb);
   119            if (res != 0) {
   120                /* save errno */
   121                res = errno;
   122            }
   123   
   124            if (res == 0) {
   125                char msg[100];
   126                jboolean isError = JNI_FALSE;
   127                if (sb.st_uid != uid && uid != ROOT_UID) {
   128                    snprintf(msg, sizeof(msg),
   129                        "file should be owned by the current user (which is %d) but is owned by %d", uid, sb.st_uid);
   130                    isError = JNI_TRUE;
   131                } else if (sb.st_gid != gid && uid != ROOT_UID) {
   132                    snprintf(msg, sizeof(msg),
   133                        "file's group should be the current group (which is %d) but the group is %d", gid, sb.st_gid);
   134                    isError = JNI_TRUE;
   135                } else if ((sb.st_mode & (S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH)) != 0) {
   136                    snprintf(msg, sizeof(msg),
   137                        "file should only be readable and writable by the owner but has 0%03o access", sb.st_mode & 0777);
   138                    isError = JNI_TRUE;
   139                }
   140                if (isError) {
   141                    char buf[256];
   142                    snprintf(buf, sizeof(buf), "well-known file %s is not secure: %s", p, msg);
   143                    JNU_ThrowIOException(env, buf);
   144                }
   145            } else {
   146                char* msg = strdup(strerror(res));
   147                JNU_ThrowIOException(env, msg);
   148                if (msg != NULL) {
   149                    free(msg);
   150                }
   151            }

On 10/2/18, 6:23 PM, David Holmes wrote:
Minor correction: EPERM -> EACCES for Solaris

Hard to see how to get a transient EACCES when opening a file ... though as it is really a door I guess there could be additional complexity.

David

On 3/10/2018 7:54 AM, Chris Plummer wrote:
On 10/2/18 2:38 PM, David Holmes wrote:
Chris,

On 3/10/2018 6:57 AM, Chris Plummer wrote:


On 10/2/18 1:44 PM, gary.ad...@oracle.com wrote:
The general attach sequence ...

src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java

 the attacher creates an attach_pid file in a directory where the attachee is runnning
 issues a signal to the attacheee

  loops waiting for the java_pid file to be created
  default timeout is 10 seconds

So getting a FileNotFoundException while in this loop is OK, but IOException is not.

src/hotspot/os/solaris/attachListener_solaris.cpp

   attachee creates the java_pid file
   listens til the attacher opens the door

I'm don't think this is related, but JDK-8199811 made a fix in attachListener_solaris.cpp to make it wait up to 10 seconds for initialization to complete before failing the enqueue.

...
Not sure when a bare IOException is thrown rather than the
more specific FileNotFoundException.
Where is the IOException originating from? I wonder if the issue is that the file is in the process of being created, but is not fully created yet. Maybe it is there, but owner/group/permissions have not been set yet, and this results in an IOException instead of FileNotFoundException.

The exception is shown in the bug report:

 [java.io.IOException: Permission denied
at jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native Method)
at jdk.attach/sun.tools.attach.VirtualMachineImpl.openDoor(VirtualMachineImpl.java:215)
at jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:71)
at jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
at jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:114)
at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:98)

And if you look at the native code the EPERM from open will cause IOException to be thrown.

./jdk.attach/solaris/native/libattach/VirtualMachineImpl.c

JNIEXPORT jint JNICALL Java_sun_tools_attach_VirtualMachineImpl_open
  (JNIEnv *env, jclass cls, jstring path)
{
    jboolean isCopy;
    const char* p = GetStringPlatformChars(env, path, &isCopy);
    if (p == NULL) {
        return 0;
    } else {
        int fd;
        int err = 0;

        fd = open(p, O_RDWR);
        if (fd == -1) {
            err = errno;
        }

        if (isCopy) {
            JNU_ReleaseStringPlatformChars(env, path, p);
        }

        if (fd == -1) {
            if (err == ENOENT) {
                JNU_ThrowByName(env, "java/io/FileNotFoundException", NULL);
            } else {
                char* msg = strdup(strerror(err));
                JNU_ThrowIOException(env, msg);
                if (msg != NULL) {
                    free(msg);
                }


We should add the path to the exception message.

Thanks David. So if EPERM is the error and a retry 100ms later works, I think that supports my hypothesis that the file is not quite fully created. So Gary's fix is probably fine. The only other possible fix I can think of that wouldn't require an explicit delay (or multiple retries) is probably not worth the complexity. It would require that the attachee create two files, and the attacher try to open the second file first. When it either opens or returns EPERM, you know the first file can safety be opened.

Chris
David
-----

Chris



On 10/2/18 4:11 PM, Chris Plummer wrote:
Can you summarize how the attach handshaking is suppose to work? I'm just wondering why the attacher would ever be looking for the file before the attachee has created it. It seems a proper handshake would prevent this. Maybe there's some sort of visibility issue where the attachee has indeed created the file, but it is not immediately visible to the attacher process.

Chris

On 10/2/18 12:27 PM, gary.ad...@oracle.com wrote:
The problem reproduced pretty quickly.
I added a call to checkPermission and revealed the
"file not found" from the stat call when the IOException
was detected.

There has been some flakiness from the Solaris test machines today,
so I'll continue with the testing a bit longer.

On 10/2/18 3:12 PM, Chris Plummer wrote:
Without the fix was this issue easy enough to reproduce that you can be sure this is resolving it?

Chris

On 10/2/18 8:16 AM, Gary Adams wrote:
Solaris debug builds are failing tests that use the attach interface.
An IOException is reported when the java_pid file is not opened.

It appears that the attempt to attach is taking place too quickly.
This workaround will allow the open operation to be retried
after a short pause.

  Webrev: http://cr.openjdk.java.net/~gadams/8210337/webrev/
  Issue: https://bugs.openjdk.java.net/browse/JDK-8210337

Testing is in progress.






















Reply via email to