[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-04 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075061#comment-17075061
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~tigran]
See the dump file. Your log gives this advice too:

{noformat}
Please refer to /home/tigran/eProjects/dcache-git/modules/common/target/surefire-reports for the individual test results.
Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
{noformat}
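
The advice above names three dump-file patterns. As a minimal sketch of how one might collect them, the helper below lists files ending in .dump or .dumpstream in a given directory. The class name DumpFileFinder, the method findDumpFiles, and the default directory target/surefire-reports are assumptions made for illustration; none of this is part of Surefire, and the directory where dumps actually land should be taken from your own log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Surefire.
final class DumpFileFinder {

    // Returns the names of regular files ending in .dump or .dumpstream, sorted by name.
    static List<String> findDumpFiles(Path reportsDir) throws IOException {
        try (Stream<Path> entries = Files.list(reportsDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith(".dump") || n.endsWith(".dumpstream"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the path printed in your own log instead.
        Path dir = Paths.get(args.length > 0 ? args[0] : "target/surefire-reports");
        for (String name : findDumpFiles(dir)) {
            System.out.println(name);
        }
    }
}
```

An empty result only means that no file with those suffixes exists in that directory; it says nothing about why the fork terminated.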


> Race condition results in "VM crash or System.exit called?" failure
> ---
>
>                 Key: SUREFIRE-1719
>                 URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
>             Project: Maven Surefire
>          Issue Type: Bug
>          Components: Maven Surefire Plugin
>    Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2, 3.0.0-M1, 3.0.0-M3
>            Reporter: Paul Millar
>            Assignee: Tibor Digana
>            Priority: Major
>             Fix For: 3.0.0-M5
>
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, 
> unit tests started to fail with the message "ExecutionException The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl 
> modules/common clean package" and the surefire configuration is:
> {{}}
> {{  org.apache.maven.plugins}}
> {{  maven-surefire-plugin}}
> {{  }}
> {{    }}
> {{  **/*Test.class}}
> {{  **/*Tests.class}}
> {{    }}
> {{    }}
> {{    1C}}
> {{    false}}
> {{  }}
> {{ }}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the 
> problem appear 6 out of 10 times when running the above mvn command. There is 
> (apparently) little that seems to influence whether the build will succeed or 
> fail.
> [I've attached the complete output from running the above mvn command, both 
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores. 
> Our build machine has 24 cores, and I've seen a report of a similar problem 
> when building dCache on a 48-core machine. On the other hand, I have been 
> unable to reproduce the problem on my desktop machine (8 cores) or on my 
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on 
> a machine with a small number of cores. However, I've noticed that, on an 8 
> core machine, increasing the forkCount does not actually result in that many 
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM 
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood 
> of a problem below 10% (0 failures with 10 builds) on our build machine.  On 
> this machine, the default configuration would try to run 24 JVM instances 
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building 
> with surefire v2.19.1, the above mvn command is always successful on our 
> build machine.  Building with surefire v2.20 results in intermittent failures 
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10 
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 
> Acknowledge normal exit of JVM and drain shared memory between processes" is 
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it 
> introduces is somehow racy (for example, reading or updating a field-member 
> outside of a monitor) and problems are triggered if there are a large number 
> of JVMs exiting concurrently.  So, with an increased number of concurrent JVMs 
> there is an increased risk of a thread losing the race, and so triggering 
> this error.
> Such a problem would be consistent with observed behaviour.  However, I don't 
> have any strong evidence that this is what is happening.
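
To make the hypothesis in the description concrete: the sketch below shows what "reading or updating a field-member inside a monitor" looks like for a simple acknowledgement flag. It is an illustration of the general technique only. The class AckFlag and its methods are hypothetical names, this is not Surefire's code, and nothing here confirms that the BYE_ACK handling actually has this kind of race.

```java
// Hypothetical illustration of a monitor-guarded flag; this is not Surefire's code.
final class AckFlag {

    // Guarded by "this": every read and write happens inside a synchronized method.
    private boolean acknowledged;

    // Called by the thread that receives the acknowledgement.
    synchronized void acknowledge() {
        acknowledged = true;
        notifyAll(); // wake up any thread blocked in awaitAck
    }

    // Waits up to timeoutMillis for the acknowledgement; returns whether it arrived.
    synchronized boolean awaitAck(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!acknowledged) { // loop guards against spurious wakeups
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            wait(remainingMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        AckFlag flag = new AckFlag();
        Thread sender = new Thread(flag::acknowledge);
        sender.start();
        System.out.println("acknowledged: " + flag.awaitAck(5_000));
        sender.join();
    }
}
```

If the flag were instead a plain field read and written without synchronization (or without volatile), a waiting thread would have no guarantee of ever observing the update, which is the class of problem the description guesses at.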



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-04 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075060#comment-17075060
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~tigran]
My notes on how to investigate:

First, see the commit. Please look at the history of later changes on 
GitHub and check whether any change overrides it. I believe 
there is no such change, but you should check, as many users do.

Second, and very important: taking a SNAPSHOT is always risky, because 
we are working on PR #240, which is still incomplete and deploys the 
same SNAPSHOT version. Most probably you are testing with that. So my 
recommendation is to check out the project on your side, build it 
without tests (which installs the artifacts into the local repo), and run your 
tests on hardware identical to what Paul used. If it fails, please see the dump 
file (the path is written in your log), where you may find the root cause.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-03 Thread Tigran Mkrtchyan (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074932#comment-17074932
 ] 

Tigran Mkrtchyan commented on SUREFIRE-1719:


Well, as Paul and I are working together on the same project, I am pretty sure 
that this is the same issue.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-03 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074820#comment-17074820
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~tigran]
99% of these logs are useless. You can simply call {{System.exit(0)}} in your 
test and you will get the same logs that many other people on this planet have, 
with a totally different root cause.
So you should check out the 
[commit|https://issues.apache.org/jira/browse/SUREFIRE-1719?focusedCommentId=17004616=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17004616]
 and build the plugin locally using {{mvn install -DskipTests}}, then run your 
tests in your project in offline mode with {{mvn -o test}}.
You may again get an error, but the root cause may be different from what 
[~paulmillar] had. How can you prove that you have his root cause?



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-03 Thread Tigran Mkrtchyan (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074781#comment-17074781
 ] 

Tigran Mkrtchyan commented on SUREFIRE-1719:


Unfortunately, even with the current master I still see the error:


{code}
[INFO] 
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-SNAPSHOT:test (default-test) on project dcache-common: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/tigran/eProjects/dcache-git/modules/common/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/tigran/eProjects/dcache-git/modules/common && /usr/lib/jvm/java-11-openjdk-11.0.6.10-0.fc31.x86_64/bin/java '${argLine}' --add-modules=java.security.jgss -jar /home/tigran/eProjects/dcache-git/modules/common/target/surefire/surefirebooter15798824340157134030.jar /home/tigran/eProjects/dcache-git/modules/common/target/surefire 2020-04-03T20-17-52_663-jvmRun1 surefire6874474157466927398tmp surefire_59109352228556929562tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
{code}



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-02 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074071#comment-17074071
 ] 

Tibor Digana commented on SUREFIRE-1719:


You can do that in the meantime. We are finishing the biggest task, and then we 
will hopefully be very fast with the easy tasks.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-04-02 Thread Tigran Mkrtchyan (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073728#comment-17073728
 ] 

Tigran Mkrtchyan commented on SUREFIRE-1719:


As M5 is not released yet, can we just use threads instead or set forkCount=1 
as a workaround?
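
For context on the forkCount workaround: the issue description states that a forkCount of "1C" on a 24-core machine would try to run 24 JVM instances, and the Surefire documentation describes a trailing "C" as a multiplier on the number of available cores. The sketch below works through that arithmetic. The class ForkCountSketch and the method effectiveForkCount are hypothetical names, and the rounding rule (truncate, but keep at least one fork for a positive product) is an assumption rather than a statement of what the plugin does.

```java
// Hypothetical sketch of the forkCount arithmetic; not the plugin's own code.
final class ForkCountSketch {

    // Converts a forkCount setting such as "6" or "1C" into a number of forks.
    static int effectiveForkCount(String forkCount, int availableCores) {
        String value = forkCount.trim();
        if (value.endsWith("C")) {
            double multiplier = Double.parseDouble(value.substring(0, value.length() - 1));
            double product = multiplier * availableCores;
            // Assumption: truncate, but never round a positive product down to zero.
            return product > 0 ? Math.max((int) product, 1) : 0;
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("1C on this machine: " + effectiveForkCount("1C", cores));
        System.out.println("1C on 24 cores: " + effectiveForkCount("1C", 24));
    }
}
```

Under these assumptions a plain forkCount=1 gives a single forked JVM regardless of the core count, which is why it comes up as a workaround here.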



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-01-07 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010135#comment-17010135
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~paulmillar]
I am glad that it works for you. Please keep using that version for a while, 
because important changes have to be made in January, and then we will cut a 
new release.

> Race condition results in "VM crash or System.exit called?" failure
> ---
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
> Issue Type: Bug
> Components: Maven Surefire Plugin
> Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2, 
> 3.0.0-M1, 3.0.0-M3
> Reporter: Paul Millar
> Assignee: Tibor Digana
> Priority: Major
> Fix For: 3.0.0-M5
>
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, 
> unit tests started to fail with the message "ExecutionException The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl 
> modules/common clean package" and the surefire configuration is:
> {{}}
> {{  org.apache.maven.plugins}}
> {{  maven-surefire-plugin}}
> {{  }}
> {{    }}
> {{  **/*Test.class}}
> {{  **/*Tests.class}}
> {{    }}
> {{    }}
> {{    1C}}
> {{    false}}
> {{  }}
> {{ }}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the 
> problem appear 6 out of 10 times when running the above mvn command. There is 
> (apparently) little that seems to influence whether the build will succeed or 
> fail.
> [I've attached the complete output from running the above mvn command, both 
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores. 
> Our build machine has 24 cores, and I've seen a report of a similar problem 
> when building dCache on a 48 core machine. On the other hand, I have been 
> unable to reproduce the problem with my desktop machine (8 cores) or on my 
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on 
> a machine with a small number of cores. However, I've noticed that, on an 8 
> core machine, increasing the forkCount does not actually result in that many 
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM 
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood 
> of a problem below 10% (0 failures with 10 builds) on our build machine.  On 
> this machine, the default configuration would try to run 24 JVM instances 
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building 
> with surefire v2.19.1, the above mvn command is always successful on our 
> build machine.  Building with surefire v2.20 results in intermittent failures 
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10 
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 
> Acknowledge normal exit of JVM and drain shared memory between processes" is 
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it 
> introduces is somehow racy (for example, reading or updating a field-member 
> outside of a monitor) and problems are triggered if there are a large number 
> of JVMs exiting concurrently.  So, with an increased number of concurrent JVMs 
> there is an increased risk of a thread losing the race, and so triggering 
> this error.
> Such a problem would be consistent with observed behaviour.  However, I don't 
> have any strong evidence that this is what is happening.
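
The quoted report relies on a forkCount of "1C" resolving to 24 JVMs on a 24-core machine. A minimal Java sketch of that arithmetic follows; the helper name and the truncating cast are assumptions made for illustration, not Surefire's actual implementation.

```java
// Hypothetical helper (not Surefire's API): resolves a forkCount string
// such as "6" or "1C" to a number of JVMs for a given core count.
final class ForkCountSketch {

    static int effectiveForkCount(String forkCount, int availableCores) {
        String value = forkCount.trim();
        if (value.endsWith("C")) {
            // A trailing "C" means "multiply by the number of CPU cores".
            double multiplier = Double.parseDouble(value.substring(0, value.length() - 1));
            // Truncation toward zero is an assumption of this sketch.
            return (int) (multiplier * availableCores);
        }
        // A plain number is taken as the fork count itself.
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        System.out.println(effectiveForkCount("1C", 24)); // build machine from the report
        System.out.println(effectiveForkCount("1C", 8));  // desktop from the report
        System.out.println(effectiveForkCount("6", 24));  // the workaround setting
    }
}
```

With these inputs the build machine is asked for 24 forks, the 8-core desktop for 8, and the workaround for 6, which matches the numbers given in the report.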





[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2020-01-07 Thread Paul Millar (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009517#comment-17009517
 ] 

Paul Millar commented on SUREFIRE-1719:
---

Hi Tibor,

I've tested the 3.0.0-SNAPSHOT and it works perfectly.

Thanks for all your help!
Paul.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-12-27 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004367#comment-17004367
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~paulmillar]
Hi Paul,

When will you be back in the office to try it out with the new snapshot 
version of Surefire, {{3.0.0-SNAPSHOT}}?

Thx T



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-20 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978304#comment-16978304
 ] 

Tibor Digana commented on SUREFIRE-1719:


I will inform you when I have something in a branch.
This is not the easiest part of the code, but nevertheless I ask everybody 
the same question: are you open to contributing by writing code?



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-20 Thread Paul Millar (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978189#comment-16978189
 ] 

Paul Millar commented on SUREFIRE-1719:
---

Hi Tibor,

Good to hear you've found the problem.

Is there something I can do to confirm it?  For example, if you have a private 
branch of surefire with additional logging, I could build that version, deploy 
it as a snapshot in our private nexus and exercise building dCache.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-19 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977946#comment-16977946
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~paulmillar]
I found the problem. It is horrible, because our internal tests did not catch 
it, and I believe everybody must have been hitting it, yet nobody reported it!
The whole problem is {{ImmediateCommands.acknowledgeByeEventReceived()}}: 
this interface is shared across all forks.
If you look at ForkStarter, ForkClient and ForkedBooter, you see the sequence 
BYE->ACK->EXIT0, and it is clear what happens before and what happens after. 
The same code tells you that, in this bug, the ACK must have been sent out of 
order. It is a kind of distributed race, and it really happened. Users 
typically do not have such fast CPUs, which is why nobody found this race.
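
To illustrate how an acknowledgement path shared across forks can deliver an ACK out of order, here is a deterministic toy model in Java. It is only a sketch under assumptions: the class and method names are invented for this example, the "shared" variant is modelled as a broadcast to every fork, and none of it is taken from the Surefire sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the BYE -> ACK -> EXIT0 handshake. All names are hypothetical.
final class ByeAckSketch {

    /** One forked JVM; it records the protocol events in the order they occur. */
    static final class Fork {
        final List<String> events = new ArrayList<>();

        void sayBye() { events.add("BYE"); }

        void receiveAck() { events.add("ACK"); }

        /** True if this fork saw an ACK before it had sent its own BYE. */
        boolean ackArrivedOutOfOrder() {
            int ack = events.indexOf("ACK");
            int bye = events.indexOf("BYE");
            return ack >= 0 && (bye < 0 || ack < bye);
        }
    }

    /** Shared variant (assumed): a BYE from any fork acknowledges every fork. */
    static void sharedAcknowledge(List<Fork> allForks) {
        for (Fork fork : allForks) {
            fork.receiveAck();
        }
    }

    /** Per-fork variant: only the fork that said BYE is acknowledged. */
    static void perForkAcknowledge(Fork sender) {
        sender.receiveAck();
    }

    /** Two forks finish one after the other while sharing the acknowledger. */
    static boolean sharedRunHasOutOfOrderAck() {
        Fork a = new Fork();
        Fork b = new Fork();
        List<Fork> forks = Arrays.asList(a, b);
        a.sayBye();
        sharedAcknowledge(forks); // b receives an ACK it has not asked for yet
        b.sayBye();
        sharedAcknowledge(forks);
        return a.ackArrivedOutOfOrder() || b.ackArrivedOutOfOrder();
    }

    /** Same schedule, but each fork has its own acknowledgement path. */
    static boolean perForkRunHasOutOfOrderAck() {
        Fork a = new Fork();
        Fork b = new Fork();
        a.sayBye();
        perForkAcknowledge(a);
        b.sayBye();
        perForkAcknowledge(b);
        return a.ackArrivedOutOfOrder() || b.ackArrivedOutOfOrder();
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + sharedRunHasOutOfOrderAck());
        System.out.println("per fork: " + perForkRunHasOutOfOrderAck());
    }
}
```

In this model the second fork sees ACK before its own BYE only in the shared variant. More forks finishing close together would create more such overlaps, which would fit the reporter's observation that the failure depends on the number of concurrently running JVMs.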



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-19 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977849#comment-16977849
 ] 

Tibor Digana commented on SUREFIRE-1719:


I am analysing the latest code and the commit da7ff6aa2 as well, but I still 
have not found anything that explains the error.



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-18 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976598#comment-16976598
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~paulmillar]
It's strange that you do not have the dump files. I do not have access to your 
system, so the best approach would be to deploy Surefire under a separate 
SNAPSHOT version to your Nexus repo and let the CI run the build. Comment out 
the parts of the code in ForkedBooter and ForkStarter that you suspect. I 
would do the same if I had access to your system. This is the way to confirm 
the bug; after that we have to investigate deeper.
Make sure that your system, especially the file system, is okay and that the 
machine does not run out of resources.

> Race condition results in "VM crash or System.exit called?" failure
> ---
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
>  Issue Type: Bug
>  Components: Maven Surefire Plugin
>Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2, 
> 3.0.0-M1, 3.0.0-M3
>Reporter: Paul Millar
>Priority: Major
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, 
> unit tests started to fail with the message "ExecutionException The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl 
> modules/common clean package" and the surefire configuration is:
> {{}}
> {{  org.apache.maven.plugins}}
> {{  maven-surefire-plugin}}
> {{  }}
> {{    }}
> {{  **/*Test.class}}
> {{  **/*Tests.class}}
> {{    }}
> {{    }}
> {{    1C}}
> {{    false}}
> {{  }}
> {{ }}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the 
> problem appear 6 out of 10 times when running the above mvn command. There is 
> (apparently) little that seems to influence whether the build will succeed or 
> fail.
> [I've attached the complete output from running the above mvn command, both 
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores. 
> Our build machine has 24 cores, and I've seen a report of a similar problem 
> where building dCache on a 48 core machine. On the other side, I have been 
> unable to reproduce the problem with my desktop machine (8 core) or on my 
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on 
> a machine with a small number of cores. However, I've noticed that, on an 8 
> core machine, increasing the forkCount does not actually result in that many 
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM 
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood 
> of a problem below 10% (0 failures with 10 builds) on our build machine.  On 
> this machine, the default configuration would try to run 24 JVM instances 
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building 
> with surefire v2.19.1, the above mvn command is always successful on our 
> build machine.  Building with surefire v2.20 results in intermittent failures 
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10 
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 
> Acknowledge normal exit of JVM and drain shared memory between processes" is 
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it 
> introduces is somehow racy (for example, reading or updating a field-member 
> outside of a monitor) and problems are triggered if there are a large number 
> of JVMs exiting concurrently.  So, with an increased number of concurrent JVMs 
> there is an increased risk of a thread losing the race, and so triggering 
> this error.
> Such a problem would be consistent with observed behaviour.  However, I don't 
> have any strong evidence that this is what is happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-18 Thread Paul Millar (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976582#comment-16976582
 ] 

Paul Millar commented on SUREFIRE-1719:
---

Hi Tibor,

Just a quick update.

I've been able to reproduce the problem with Surefire v3.0.0-M4.

Running the same command (mvn -am -pl modules/common clean package), the 
unit tests failed with output similar to that reported above.  I attempted 
building with v3.0.0-M4 ten times and all ten attempts failed.  This is seemingly 
a worse failure rate than described above; although, if this is a race condition, 
external factors may be having an impact.

This is building on a 24-core machine.  As before, updating the pom.xml to set 
forkCount from "1C" (== 24) to "6" results in all ten build attempts passing.
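To make the "1C" (== 24) arithmetic explicit: the Surefire documentation states that a forkCount value ending in "C" is multiplied by the number of available CPU cores. The following is only a minimal sketch of that rule, not Surefire's own code; the class name, the method name and the truncating conversion to int are assumptions made for illustration.

```java
/** Hypothetical helper (not part of Surefire) that resolves a forkCount string. */
final class ForkCountSketch {

    /**
     * Resolves values such as "6", "1C" or "0.5C" to a number of forks.
     * A trailing "C" means "multiply by the number of cores".
     * Truncating the product to an int is an assumption of this sketch.
     */
    static int resolve(String forkCount, int availableCores) {
        String value = forkCount.trim();
        if (value.endsWith("C")) {
            double multiplier = Double.parseDouble(value.substring(0, value.length() - 1));
            return (int) (multiplier * availableCores);
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores = " + cores);
        System.out.println("forkCount 1C -> " + resolve("1C", cores) + " forks");
        System.out.println("forkCount 6  -> " + resolve("6", cores) + " forks");
    }
}
```

On a 24-core machine, resolve("1C", 24) gives 24 forks while resolve("6", 24) gives 6, which is the difference between the failing and the passing configuration.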

Despite the error message mentioning *.dump and *.dumpstream files, I see no 
such files in the target/surefire-reports directory.  I also checked /tmp and 
/var/tmp: no such files are there, either.
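For anyone who wants to repeat this search, here is a rough sketch that walks a directory tree and lists files ending in .dump or .dumpstream, the suffixes named in the error message. The class and method names are made up for this example, and it is plain JDK code that knows nothing about where Surefire actually writes its dumps.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists *.dump and *.dumpstream files below a directory. */
final class DumpFileFinder {

    static List<Path> findDumpFiles(Path root) throws IOException {
        // The glob is matched against the file name only, not the whole path.
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:*.{dump,dumpstream}");
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> matcher.matches(p.getFileName()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory; pass e.g. modules/common as argument.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> found = findDumpFiles(root);
        if (found.isEmpty()) {
            System.out.println("No dump files below " + root.toAbsolutePath());
        }
        found.forEach(System.out::println);
    }
}
```

Running it against the module directory rather than only target/surefire-reports also catches dump files written elsewhere in the build tree.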

I'm not sure which options I should add to suppress GC logging: I believe this 
is off by default.  If you have a concrete suggestion, I'd be happy to try it 
out.  I'm using OpenJDK "v1.8.0_232" JRE "1.8.0_232-b09", OpenJDK 64-bit server 
(build 25.232-b09, mixed mode).

One additional observation.  The "modules/common" module has 35 test classes.  
For each of the 35 classes, I see a Maven log line saying that it is running 
that class (lines like "[INFO] Running org.dcache").  However, there are *fewer* 
logged test result lines (lines like "[INFO] Tests run: x, Failures: x, Errors: 
x, ").  So, maven/surefire is not logging a result line for some test 
classes, despite (apparently) starting that JVM. 

The specific "missing" results change from run to run.  The precise number of 
missing test class results also changes: sometimes only one class is missing, 
sometimes it's two classes.

For each test class with a test result line, there is a corresponding .txt and 
.xml file in the results/surefire-reports directory.  However, the test classes 
that have no test result line also have no corresponding .txt or .xml file.

HTH,
Paul.


[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-16 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975676#comment-16975676
 ] 

Tibor Digana commented on SUREFIRE-1719:


[~paulmillar]
Please configure your CI system so that you can access the build files on its 
file system and inspect the dump files located in 
{{target/surefire-reports}}. These streams tend to be corrupted by native logs 
that the JVM GC prints to std/out and std/in. Try making the GC quiet for 
one run. My suspicion is that the GC corrupted the std/out stream and the BYE 
event was lost. This would have happened at the end, after all tests completed: 
the GC printed its messages just as the forked JVM sent BYE, the fork then 
waited up to 30 seconds for the acknowledgement, and killed itself when it did 
not arrive.
We will introduce a TCP connector instead of process pipes in 3.0.0-M5. 
Meanwhile, I would like you to confirm this corrupted-stream problem, which 
should be visible in the dump file produced by Surefire.
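The wait-for-acknowledgement step described above can be pictured with a generic sketch. This is not the code in ForkedBooter; the class and method names are hypothetical, and a CountDownLatch merely stands in for whatever mechanism the plugin really uses. It only shows the shape of the problem: if the acknowledgement is lost, the waiting side times out and has to give up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of "send BYE, then wait for the acknowledgement". */
final class ByeAckSketch {

    private final CountDownLatch ack = new CountDownLatch(1);

    /** Called by the thread that reads commands when the acknowledgement arrives. */
    void onAckReceived() {
        ack.countDown();
    }

    /**
     * Called after BYE has been sent.
     * Returns true if the acknowledgement arrived within the timeout,
     * false if it was lost or came too late.
     */
    boolean awaitAck(long timeout, TimeUnit unit) throws InterruptedException {
        return ack.await(timeout, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ByeAckSketch fork = new ByeAckSketch();
        // Simulate the other process acknowledging BYE from a different thread.
        Thread parent = new Thread(fork::onAckReceived);
        parent.start();
        // 30 seconds is the waiting time mentioned in the comment above.
        boolean acknowledged = fork.awaitAck(30, TimeUnit.SECONDS);
        System.out.println(acknowledged
                ? "acknowledged, clean exit"
                : "no acknowledgement, terminating");
    }
}
```

If the BYE event never reaches the other side because the stream was corrupted, onAckReceived() is never called and awaitAck returns false after the timeout, which matches the failure mode suspected here.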



[jira] [Commented] (SUREFIRE-1719) Race condition results in "VM crash or System.exit called?" failure

2019-11-16 Thread Tibor Digana (Jira)


[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975652#comment-16975652
 ] 

Tibor Digana commented on SUREFIRE-1719:


Thanks for having a look into this issue.
I would like to ask you to follow the test instructions from the [current release 
vote|https://lists.apache.org/thread.html/6b9987be2fbd4d82daf8aaabbd1d58dfa2a533598770e0423d0b5ee0@%3Cdev.maven.apache.org%3E]
 and repeat the tests with version 3.0.0-M4. Code changes have been made in 
the parts you described, ForkedBooter and CommandReader.
Regarding the fields, I checked the instance fields used by 
{{acknowledgedExit()}} in {{ForkedBooter}}. I remember fixing a race 
in an old version of the plugin one or two years ago, so I did not expect a 
similar issue to appear again. Please see our code on GitHub and feel free to 
come back with any findings. We will definitely have a look.
