Didn't the retry after 100ms delay
work? If yes, why would it if the problem is that a java_pid was
not cleaned up?
Chris
On 10/4/18 8:54 AM, Gary Adams wrote:
First, let me retract the proposed change,
it is not the right solution to the problem originally
reported.
Second, as a bit of explanation consider the code fragments below.
The high level processing calls openDoor which is willing to retry
the operation as long as the error is flagged specifically
as a FileNotFoundException.
VirtualMachineImpl.java:72
VirtualMachineImpl.c:81
During my testing I had added a check VirtualMachineImpl.java:214
and when an IOException was detected made a call to
checkPermissions
to get more detailed information about the IOException. The error
I saw was an ENOENT from the stat call. And not the detailed
checks for
specific permissions issues (VirtualMachineImpl.c:143)
VirtualMachineImpl.c:118
VirtualMachineImpl.c:147
What I missed in the original proposed solution was a
FileNotFoundException
extends IOException. That means my delay and retry just duplicates
the higher
level retry around the openDoor call.
Third, the original error message logged in the bug report :
java.io.IOException:
Permission denied
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
Method)
had to have come from
VirtualMachineImpl.c:70
VirtualMachineImpl.c:84
which means the actual open call reported the file does exist
but the permissions do not allow the file to be accessed.
That also means the normal mechanism of removing leftover
java_pid files would not have cleaned up another user's
java_pid files.
=====
src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java:
...
67 // Opens the door file to the target VM. If the
file is not
68 // found it might mean that the attach mechanism
isn't started in the
69 // target VM so we attempt to start it and
retry.
70 try {
71 fd = openDoor(pid);
72 } catch (FileNotFoundException fnf1) {
73 File f = createAttachFile(pid);
74 try {
75 sigquit(pid);
76
77 // give the target VM time to start the
attach mechanism
78 final int delay_step = 100;
79 final long timeout = attachTimeout();
80 long time_spend = 0;
81 long delay = 0;
82 do {
83 // Increase timeout on each attempt
to reduce polling
84 delay += delay_step;
85 try {
86 Thread.sleep(delay);
87 } catch (InterruptedException x) { }
88 try {
89 fd = openDoor(pid);
90 } catch (FileNotFoundException fnf2)
{
91 // pass
92 }
93
94 time_spend += delay;
95 if (time_spend > timeout/2
&& fd == -1) {
96 // Send QUIT again to give
target VM the last chance to react
97 sigquit(pid);
98 }
99 } while (time_spend <= timeout
&& fd == -1);
100 if (fd == -1) {
101 throw new
AttachNotSupportedException(
102 String.format("Unable to open
door %s: " +
103 "target process %d doesn't
respond within %dms " +
104 "or HotSpot VM not loaded",
socket_path, pid, time_spend));
105 }
...
212 // The door is attached to .java_pid<pid> in
the temporary directory.
213 private int openDoor(int pid) throws IOException {
214 socket_path = tmpdir + "/.java_pid" + pid;
215 fd = open(socket_path);
216
217 // Check that the file owner/permission to avoid
attaching to
218 // bogus process
219 try {
220 checkPermissions(socket_path);
221 } catch (IOException ioe) {
222 close(fd);
223 throw ioe;
224 }
225 return fd;
226 }
=====
src/jdk.attach/solaris/native/libattach/VirtualMachineImpl.c:
...
59 JNIEXPORT jint JNICALL
Java_sun_tools_attach_VirtualMachineImpl_open
60 (JNIEnv *env, jclass cls, jstring path)
61 {
62 jboolean isCopy;
63 const char* p = GetStringPlatformChars(env, path,
&isCopy);
64 if (p == NULL) {
65 return 0;
66 } else {
67 int fd;
68 int err = 0;
69
70 fd = open(p, O_RDWR);
71 if (fd == -1) {
72 err = errno;
73 }
74
75 if (isCopy) {
76 JNU_ReleaseStringPlatformChars(env, path,
p);
77 }
78
79 if (fd == -1) {
80 if (err == ENOENT) {
81 JNU_ThrowByName(env,
"java/io/FileNotFoundException", NULL);
82 } else {
83 char* msg = strdup(strerror(err));
84 JNU_ThrowIOException(env, msg);
85 if (msg != NULL) {
86 free(msg);
87 }
88 }
89 }
90 return fd;
91 }
92 }
...
99 JNIEXPORT void JNICALL
Java_sun_tools_attach_VirtualMachineImpl_checkPermissions
100 (JNIEnv *env, jclass cls, jstring path)
101 {
102 jboolean isCopy;
103 const char* p = GetStringPlatformChars(env, path,
&isCopy);
104 if (p != NULL) {
105 struct stat64 sb;
106 uid_t uid, gid;
107 int res;
108
109 memset(&sb, 0, sizeof(struct stat64));
110
111 /*
112 * Check that the path is owned by the effective
uid/gid of this
113 * process. Also check that group/other access
is not allowed.
114 */
115 uid = geteuid();
116 gid = getegid();
117
118 res = stat64(p, &sb);
119 if (res != 0) {
120 /* save errno */
121 res = errno;
122 }
123
124 if (res == 0) {
125 char msg[100];
126 jboolean isError = JNI_FALSE;
127 if (sb.st_uid != uid && uid !=
ROOT_UID) {
128 snprintf(msg, sizeof(msg),
129 "file should be owned by the current
user (which is %d) but is owned by %d", uid, sb.st_uid);
130 isError = JNI_TRUE;
131 } else if (sb.st_gid != gid && uid
!= ROOT_UID) {
132 snprintf(msg, sizeof(msg),
133 "file's group should be the current
group (which is %d) but the group is %d", gid, sb.st_gid);
134 isError = JNI_TRUE;
135 } else if ((sb.st_mode &
(S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH)) != 0) {
136 snprintf(msg, sizeof(msg),
137 "file should only be readable and
writable by the owner but has 0%03o access", sb.st_mode &
0777);
138 isError = JNI_TRUE;
139 }
140 if (isError) {
141 char buf[256];
142 snprintf(buf, sizeof(buf), "well-known
file %s is not secure: %s", p, msg);
143 JNU_ThrowIOException(env, buf);
144 }
145 } else {
146 char* msg = strdup(strerror(res));
147 JNU_ThrowIOException(env, msg);
148 if (msg != NULL) {
149 free(msg);
150 }
151 }
On 10/2/18, 6:23 PM, David Holmes wrote:
Minor correction: EPERM -> EACCES for Solaris
Hard to see how to get a transient EACCES when opening a file
... though as it is really a door I guess there could be
additional complexity.
David
On 3/10/2018 7:54 AM, Chris Plummer wrote:
On 10/2/18 2:38 PM, David Holmes wrote:
Chris,
On 3/10/2018 6:57 AM, Chris Plummer wrote:
On 10/2/18 1:44 PM, gary.ad...@oracle.com wrote:
The general attach sequence ...
src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java
the attacher creates an attach_pid file in a directory
where the attachee is runnning
issues a signal to the attacheee
loops waiting for the java_pid file to be created
default timeout is 10 seconds
So getting a FileNotFoundException while in this loop is
OK, but IOException is not.
src/hotspot/os/solaris/attachListener_solaris.cpp
attachee creates the java_pid file
listens til the attacher opens the door
I'm don't think this is related, but JDK-8199811 made a
fix in attachListener_solaris.cpp to make it wait up to 10
seconds for initialization to complete before failing the
enqueue.
...
Not sure when a bare IOException is thrown rather than
the
more specific FileNotFoundException.
Where is the IOException originating from? I wonder if the
issue is that the file is in the process of being created,
but is not fully created yet. Maybe it is there, but
owner/group/permissions have not been set yet, and this
results in an IOException instead of
FileNotFoundException.
The exception is shown in the bug report:
[java.io.IOException: Permission denied
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
Method)
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.openDoor(VirtualMachineImpl.java:215)
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:71)
at
jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
at
jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
at
jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:114)
at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:98)
And if you look at the native code the EPERM from open will
cause IOException to be thrown.
./jdk.attach/solaris/native/libattach/VirtualMachineImpl.c
JNIEXPORT jint JNICALL
Java_sun_tools_attach_VirtualMachineImpl_open
(JNIEnv *env, jclass cls, jstring path)
{
jboolean isCopy;
const char* p = GetStringPlatformChars(env, path,
&isCopy);
if (p == NULL) {
return 0;
} else {
int fd;
int err = 0;
fd = open(p, O_RDWR);
if (fd == -1) {
err = errno;
}
if (isCopy) {
JNU_ReleaseStringPlatformChars(env, path, p);
}
if (fd == -1) {
if (err == ENOENT) {
JNU_ThrowByName(env,
"java/io/FileNotFoundException", NULL);
} else {
char* msg = strdup(strerror(err));
JNU_ThrowIOException(env, msg);
if (msg != NULL) {
free(msg);
}
We should add the path to the exception message.
Thanks David. So if EPERM is the error and a retry 100ms later
works, I think that supports my hypothesis that the file is
not quite fully created. So Gary's fix is probably fine. The
only other possible fix I can think of that wouldn't require
an explicit delay (or multiple retries) is probably not worth
the complexity. It would require that the attachee create two
files, and the attacher try to open the second file first.
When it either opens or returns EPERM, you know the first file
can safety be opened.
Chris
David
-----
Chris
On 10/2/18 4:11 PM, Chris Plummer wrote:
Can you summarize how the attach
handshaking is suppose to work? I'm just wondering why
the attacher would ever be looking for the file before
the attachee has created it. It seems a proper
handshake would prevent this. Maybe there's some sort
of visibility issue where the attachee has indeed
created the file, but it is not immediately visible to
the attacher process.
Chris
On 10/2/18 12:27 PM, gary.ad...@oracle.com
wrote:
The problem reproduced pretty
quickly.
I added a call to checkPermission and revealed the
"file not found" from the stat call when the
IOException
was detected.
There has been some flakiness from the Solaris test
machines today,
so I'll continue with the testing a bit longer.
On 10/2/18 3:12 PM, Chris Plummer wrote:
Without the fix was this
issue easy enough to reproduce that you can be
sure this is resolving it?
Chris
On 10/2/18 8:16 AM, Gary Adams wrote:
Solaris debug builds are
failing tests that use the attach interface.
An IOException is reported when the java_pid
file is not opened.
It appears that the attempt to attach is taking
place too quickly.
This workaround will allow the open operation to
be retried
after a short pause.
Webrev: http://cr.openjdk.java.net/~gadams/8210337/webrev/
Issue: https://bugs.openjdk.java.net/browse/JDK-8210337
Testing is in progress.
|