Ok. Any idea what might have led to
that PID to be created by another user? Isn't all our testing done
using the same userid? Also, why is this only a solaris problem?
Chris
On 11/7/18 7:43 PM, [email protected] wrote:
There is recovery code that cleans up
old attach files,
but a permission denied error would prevent the clean
up operation from taking place. We're looking at a case of
a file owned by another user as well as a pid recycling
and failed clean up situation.
On 11/7/18 3:09 PM, Chris Plummer wrote:
Are you saying that the PID was
recycled, so the issue is a test run from long ago leaving
behind the file? If so, I thought there was something in the
attach code that cleaned up these PID files that were left
behind. In general I would not count this as an infra issue.
PID files left behind are the fault of the JVM. Maybe there is
something wrong with the cleanup code on solaris, or maybe we
don't clean up on any platform, but cycle through PIDs a lot
faster on solaris.
Chris
On 11/7/18 11:20 AM, Gary Adams wrote:
If there are no further suggestions on JDK-8210337,
I plan to close it out as cannot reproduce.
Similar bugs had been filed for the "Permission denied" error
from the openDoor request failure and each was attributed
to an infrastructure issue. e.g. another user with the same
pid left a temporary file that is blocking the current test
from attaching correctly.
On 10/4/18, 1:49 PM, Gary Adams wrote:
My delay and retry did not fix the problem with permission
denied.
When I was diagnosing the problem I instrumented the code
to catch an IOException and call checkPermission to get
more detail about the IOException. The error reported
from calling checkPermission was ENOENT (stat).
The code change I then proposed was catch the IOException,
delay, and retry the open. That fixed the problem of
ENOENT, but had nothing to do with "permission denied".
On 10/4/18, 1:25 PM, Chris Plummer wrote:
But I also thought you said
the delay and retry fixed the problem. How could fix the
problem if it is just duplicating something that is
already in place?
Chris
On 10/4/18 9:48 AM, Gary Adams wrote:
My delay and retry just duplicated the openDoor retry.
The normal processing of FileNotFoundException(ENOENT)
is to retry
several times until the file is available.
But the original problem reported is a "Permission
denied" (EACCESS or EPERM).
Delay and retry will not resolve a permissions error.
On 10/4/18, 12:30 PM, Chris Plummer wrote:
Didn't the retry after
100ms delay work? If yes, why would it if the
problem is that a java_pid was not cleaned up?
Chris
On 10/4/18 8:54 AM, Gary Adams wrote:
First, let me retract the proposed change,
it is not the right solution to the problem
originally
reported.
Second, as a bit of explanation consider the code
fragments below.
The high level processing calls openDoor which is
willing to retry
the operation as long as the error is flagged
specifically
as a FileNotFoundException.
VirtualMachineImpl.java:72
VirtualMachineImpl.c:81
During my testing I had added a check
VirtualMachineImpl.java:214
and when an IOException was detected made a call to
checkPermissions
to get more detailed information about the
IOException. The error
I saw was an ENOENT from the stat call. And not the
detailed checks for
specific permissions issues
(VirtualMachineImpl.c:143)
VirtualMachineImpl.c:118
VirtualMachineImpl.c:147
What I missed in the original proposed solution was
a FileNotFoundException
extends IOException. That means my delay and retry
just duplicates the higher
level retry around the openDoor call.
Third, the original error message logged in the bug
report :
java.io.IOException: Permission
denied
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
Method)
had to have come from
VirtualMachineImpl.c:70
VirtualMachineImpl.c:84
which means the actual open call reported the file
does exist
but the permissions do not allow the file to be
accessed.
That also means the normal mechanism of removing
leftover
java_pid files would not have cleaned up another
user's
java_pid files.
=====
src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java:
...
67 // Opens the door file to the
target VM. If the file is not
68 // found it might mean that the
attach mechanism isn't started in the
69 // target VM so we attempt to
start it and retry.
70 try {
71 fd = openDoor(pid);
72 } catch (FileNotFoundException
fnf1) {
73 File f =
createAttachFile(pid);
74 try {
75 sigquit(pid);
76
77 // give the target VM time
to start the attach mechanism
78 final int delay_step =
100;
79 final long timeout =
attachTimeout();
80 long time_spend = 0;
81 long delay = 0;
82 do {
83 // Increase timeout on
each attempt to reduce polling
84 delay += delay_step;
85 try {
86
Thread.sleep(delay);
87 } catch
(InterruptedException x) { }
88 try {
89 fd =
openDoor(pid);
90 } catch
(FileNotFoundException fnf2) {
91 // pass
92 }
93
94 time_spend += delay;
95 if (time_spend >
timeout/2 && fd == -1) {
96 // Send QUIT again
to give target VM the last chance to react
97 sigquit(pid);
98 }
99 } while (time_spend <=
timeout && fd == -1);
100 if (fd == -1) {
101 throw new
AttachNotSupportedException(
102
String.format("Unable to open door %s: " +
103 "target process
%d doesn't respond within %dms " +
104 "or HotSpot VM
not loaded", socket_path, pid, time_spend));
105 }
...
212 // The door is attached to
.java_pid<pid> in the temporary directory.
213 private int openDoor(int pid) throws
IOException {
214 socket_path = tmpdir +
"/.java_pid" + pid;
215 fd = open(socket_path);
216
217 // Check that the file
owner/permission to avoid attaching to
218 // bogus process
219 try {
220 checkPermissions(socket_path);
221 } catch (IOException ioe) {
222 close(fd);
223 throw ioe;
224 }
225 return fd;
226 }
=====
src/jdk.attach/solaris/native/libattach/VirtualMachineImpl.c:
...
59 JNIEXPORT jint JNICALL
Java_sun_tools_attach_VirtualMachineImpl_open
60 (JNIEnv *env, jclass cls, jstring path)
61 {
62 jboolean isCopy;
63 const char* p =
GetStringPlatformChars(env, path, &isCopy);
64 if (p == NULL) {
65 return 0;
66 } else {
67 int fd;
68 int err = 0;
69
70 fd = open(p, O_RDWR);
71 if (fd == -1) {
72 err = errno;
73 }
74
75 if (isCopy) {
76
JNU_ReleaseStringPlatformChars(env, path, p);
77 }
78
79 if (fd == -1) {
80 if (err == ENOENT) {
81 JNU_ThrowByName(env,
"java/io/FileNotFoundException", NULL);
82 } else {
83 char* msg =
strdup(strerror(err));
84 JNU_ThrowIOException(env,
msg);
85 if (msg != NULL) {
86 free(msg);
87 }
88 }
89 }
90 return fd;
91 }
92 }
...
99 JNIEXPORT void JNICALL
Java_sun_tools_attach_VirtualMachineImpl_checkPermissions
100 (JNIEnv *env, jclass cls, jstring path)
101 {
102 jboolean isCopy;
103 const char* p =
GetStringPlatformChars(env, path, &isCopy);
104 if (p != NULL) {
105 struct stat64 sb;
106 uid_t uid, gid;
107 int res;
108
109 memset(&sb, 0, sizeof(struct
stat64));
110
111 /*
112 * Check that the path is owned by
the effective uid/gid of this
113 * process. Also check that
group/other access is not allowed.
114 */
115 uid = geteuid();
116 gid = getegid();
117
118 res = stat64(p, &sb);
119 if (res != 0) {
120 /* save errno */
121 res = errno;
122 }
123
124 if (res == 0) {
125 char msg[100];
126 jboolean isError = JNI_FALSE;
127 if (sb.st_uid != uid
&& uid != ROOT_UID) {
128 snprintf(msg, sizeof(msg),
129 "file should be owned
by the current user (which is %d) but is owned by
%d", uid, sb.st_uid);
130 isError = JNI_TRUE;
131 } else if (sb.st_gid != gid
&& uid != ROOT_UID) {
132 snprintf(msg, sizeof(msg),
133 "file's group should
be the current group (which is %d) but the group is
%d", gid, sb.st_gid);
134 isError = JNI_TRUE;
135 } else if ((sb.st_mode &
(S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH)) != 0) {
136 snprintf(msg, sizeof(msg),
137 "file should only be
readable and writable by the owner but has 0%03o
access", sb.st_mode & 0777);
138 isError = JNI_TRUE;
139 }
140 if (isError) {
141 char buf[256];
142 snprintf(buf, sizeof(buf),
"well-known file %s is not secure: %s", p, msg);
143 JNU_ThrowIOException(env,
buf);
144 }
145 } else {
146 char* msg =
strdup(strerror(res));
147 JNU_ThrowIOException(env,
msg);
148 if (msg != NULL) {
149 free(msg);
150 }
151 }
On 10/2/18, 6:23 PM, David Holmes wrote:
Minor correction: EPERM -> EACCES
for Solaris
Hard to see how to get a transient EACCES when
opening a file ... though as it is really a door I
guess there could be additional complexity.
David
On 3/10/2018 7:54 AM, Chris Plummer wrote:
On 10/2/18 2:38 PM, David
Holmes wrote:
Chris,
On 3/10/2018 6:57 AM, Chris Plummer wrote:
On 10/2/18 1:44 PM, [email protected]
wrote:
The general attach
sequence ...
src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java
the attacher creates an attach_pid file
in a directory where the attachee is
runnning
issues a signal to the attacheee
loops waiting for the java_pid file to
be created
default timeout is 10 seconds
So getting a FileNotFoundException while in
this loop is OK, but IOException is not.
src/hotspot/os/solaris/attachListener_solaris.cpp
attachee creates the java_pid file
listens til the attacher opens the door
I'm don't think this is related, but
JDK-8199811 made a fix in
attachListener_solaris.cpp to make it wait
up to 10 seconds for initialization to
complete before failing the enqueue.
...
Not sure when a bare IOException is thrown
rather than the
more specific FileNotFoundException.
Where is the IOException originating from? I
wonder if the issue is that the file is in
the process of being created, but is not
fully created yet. Maybe it is there, but
owner/group/permissions have not been set
yet, and this results in an IOException
instead of FileNotFoundException.
The exception is shown in the bug report:
[java.io.IOException: Permission denied
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
Method)
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.openDoor(VirtualMachineImpl.java:215)
at
jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:71)
at
jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
at
jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
at
jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:114)
at
jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:98)
And if you look at the native code the EPERM
from open will cause IOException to be thrown.
./jdk.attach/solaris/native/libattach/VirtualMachineImpl.c
JNIEXPORT jint JNICALL
Java_sun_tools_attach_VirtualMachineImpl_open
(JNIEnv *env, jclass cls, jstring path)
{
jboolean isCopy;
const char* p =
GetStringPlatformChars(env, path,
&isCopy);
if (p == NULL) {
return 0;
} else {
int fd;
int err = 0;
fd = open(p, O_RDWR);
if (fd == -1) {
err = errno;
}
if (isCopy) {
JNU_ReleaseStringPlatformChars(env, path, p);
}
if (fd == -1) {
if (err == ENOENT) {
JNU_ThrowByName(env,
"java/io/FileNotFoundException", NULL);
} else {
char* msg =
strdup(strerror(err));
JNU_ThrowIOException(env,
msg);
if (msg != NULL) {
free(msg);
}
We should add the path to the exception
message.
Thanks David. So if EPERM is the error and a
retry 100ms later works, I think that supports
my hypothesis that the file is not quite fully
created. So Gary's fix is probably fine. The
only other possible fix I can think of that
wouldn't require an explicit delay (or multiple
retries) is probably not worth the complexity.
It would require that the attachee create two
files, and the attacher try to open the second
file first. When it either opens or returns
EPERM, you know the first file can safety be
opened.
Chris
David
-----
Chris
On 10/2/18 4:11 PM, Chris Plummer wrote:
Can you summarize
how the attach handshaking is suppose to
work? I'm just wondering why the
attacher would ever be looking for the
file before the attachee has created it.
It seems a proper handshake would
prevent this. Maybe there's some sort of
visibility issue where the attachee has
indeed created the file, but it is not
immediately visible to the attacher
process.
Chris
On 10/2/18 12:27 PM, [email protected]
wrote:
The problem
reproduced pretty quickly.
I added a call to checkPermission and
revealed the
"file not found" from the stat call
when the IOException
was detected.
There has been some flakiness from the
Solaris test machines today,
so I'll continue with the testing a
bit longer.
On 10/2/18 3:12 PM, Chris Plummer
wrote:
Without the
fix was this issue easy enough to
reproduce that you can be sure this
is resolving it?
Chris
On 10/2/18 8:16 AM, Gary Adams
wrote:
Solaris
debug builds are failing tests
that use the attach interface.
An IOException is reported when
the java_pid file is not opened.
It appears that the attempt to
attach is taking place too
quickly.
This workaround will allow the
open operation to be retried
after a short pause.
Webrev: http://cr.openjdk.java.net/~gadams/8210337/webrev/
Issue: https://bugs.openjdk.java.net/browse/JDK-8210337
Testing is in progress.
|