[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646845#comment-16646845 ]

Scott Wegner commented on BEAM-5467:

Thanks for the additional context. I'm not an expert on diagnosing memory issues, but here's what I can pull out of there:

* The build scan shows [some stats on memory usage|https://scans.gradle.com/s/f2u3q2obrgaqu/performance/build#memory], and for this build I see "PS Eden Space" of 1.36/1.36 GB (99.5%). I would deduce that the JVM ran out of allotted memory, causing the segfault.
* The [infrastructure tab|https://scans.gradle.com/s/f2u3q2obrgaqu#infrastructure] shows the "Max JVM memory heap size" for the job: 3824 MB
* In the [timeline|https://scans.gradle.com/s/f2u3q2obrgaqu/timeline] I can see that the task that failed was {{:beam-sdks-python:flinkCompatibilityMatrixBatch}}. Nothing was running concurrently as part of the build, so either this task ate up the entire heap space, or some previous task is leaking memory.

My recommendation would be to work towards getting a local repro so that you can attach a memory profiler and validate potential fixes. The Jenkins job shows the full command-line used to launch the job, including JVM memory configuration:

{{gradlew --info --continue --max-workers=12 -Dorg.gradle.jvmargs=-Xms2g -Dorg.gradle.jvmargs=-Xmx4g :beam-sdks-python:flinkCompatibilityMatrixBatch :beam-sdks-python:flinkCompatibilityMatrixStreaming}}

> Python Flink ValidatesRunner job fixes
> --------------------------------------
>
> Key: BEAM-5467
> URL: https://issues.apache.org/jira/browse/BEAM-5467
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Thomas Weise
> Assignee: Thomas Weise
> Priority: Minor
> Labels: portability-flink
> Time Spent: 9h 40m
> Remaining Estimate: 0h
>
> Add status to README
> Rename script and job for consistency

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
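One detail in the quoted gradlew command line may be worth checking: org.gradle.jvmargs is passed twice. The sketch below only parses the command as quoted in the comment and reports which -D keys are repeated. It assumes that a later -D value replaces an earlier one for the same key; that is an assumption about Gradle's argument handling that this thread does not confirm. The helper name collect_system_properties is hypothetical.

```python
import shlex

# Command line exactly as quoted in the comment above.
COMMAND = (
    "gradlew --info --continue --max-workers=12 "
    "-Dorg.gradle.jvmargs=-Xms2g -Dorg.gradle.jvmargs=-Xmx4g "
    ":beam-sdks-python:flinkCompatibilityMatrixBatch "
    ":beam-sdks-python:flinkCompatibilityMatrixStreaming"
)


def collect_system_properties(command):
    """Return (effective, overridden) for -Dkey=value arguments.

    ASSUMPTION: a repeated key keeps only its last value (last-wins).
    """
    effective = {}
    overridden = []
    for arg in shlex.split(command):
        if arg.startswith("-D") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            if key in effective:
                # Remember the value that the later argument would replace.
                overridden.append((key, effective[key]))
            effective[key] = value
    return effective, overridden


effective, overridden = collect_system_properties(COMMAND)
print("effective:", effective)
print("overridden:", overridden)
```

If the last-wins assumption holds, the -Xms2g setting would not take effect and only -Xmx4g would reach the Gradle JVM. Passing both flags in a single quoted property value would avoid the ambiguity.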
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646790#comment-16646790 ]

Thomas Weise commented on BEAM-5467:

[~swegner] here is an example. (It was necessary to download the full log to find the error message.)

[https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/305/consoleText]

{code:java}
# Thread: Segmentation fault (core dumped)

> Task :beam-sdks-python:flinkCompatibilityMatrixBatch FAILED
...
BUILD FAILED in 11m 6s
59 actionable tasks: 54 executed, 4 from cache, 1 up-to-date

Publishing build scan...
https://gradle.com/s/f2u3q2obrgaqu

Build step 'Invoke Gradle script' changed build result to FAILURE
Build step 'Invoke Gradle script' marked build as failure
Sending e-mails to: commits@beam.apache.org
Finished: FAILURE
{code}
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646757#comment-16646757 ]

Scott Wegner commented on BEAM-5467:

[~thw] offhand this doesn't look familiar to me. Can you link to a Jenkins job run / Gradle build scan with more details?
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640413#comment-16640413 ]

Thomas Weise commented on BEAM-5467:

Yes, and as you had also noticed earlier, they fail in Jenkins frequently with:

Segmentation fault (core dumped)

[https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/]

[~swegner] do you have an idea what the cause could be, or where to look for the next level of detail? Perhaps memory settings?
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640350#comment-16640350 ]

Ankur Goenka commented on BEAM-5467:

The tests pass consistently on a local Jenkins setup.
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631049#comment-16631049 ]

Ankur Goenka commented on BEAM-5467:

I verified that they get executed sequentially, so that should not be a problem.

:beam-sdks-python:flinkCompatibilityMatrixBatch FAILED
Started: 5m 27.699s
Duration: 4m 0.393s

:beam-sdks-python:flinkCompatibilityMatrixStreaming FAILED
Started: 9m 28.093s
Duration: 1m 13.280s
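The sequential-execution claim can be checked from the quoted timings alone: the batch task's start plus its duration should not exceed the streaming task's start. The following is a small arithmetic sketch; the helper to_millis is hypothetical, and the only inputs are the four values quoted in the comment.

```python
import re
from decimal import Decimal


def to_millis(text):
    """Convert a duration such as '5m 27.699s' to whole milliseconds."""
    match = re.fullmatch(r"(?:(\d+)m\s+)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    # Decimal avoids binary floating-point rounding on values like 27.699.
    seconds = Decimal(match.group(2))
    return int((minutes * 60 + seconds) * 1000)


# Values quoted in the comment above.
batch_start = to_millis("5m 27.699s")
batch_end = batch_start + to_millis("4m 0.393s")
streaming_start = to_millis("9m 28.093s")

gap_ms = streaming_start - batch_end
print(f"batch ends at {batch_end} ms, streaming starts at {streaming_start} ms")
print(f"gap: {gap_ms} ms")
```

A non-negative gap means the two tasks did not overlap in this build, which is consistent with the comment.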
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631010#comment-16631010 ]

Thomas Weise commented on BEAM-5467:

[~angoenka] should we try to turn off the parallel execution?

I also think we should move the following to a distinct task in sdks/python/build.gradle:

{code:java}
tasks(':beam-sdks-python:flinkCompatibilityMatrixBatch')
tasks(':beam-sdks-python:flinkCompatibilityMatrixStreaming')
{code}
[jira] [Commented] (BEAM-5467) Python Flink ValidatesRunner job fixes
[ https://issues.apache.org/jira/browse/BEAM-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631002#comment-16631002 ]

Ankur Goenka commented on BEAM-5467:

Anecdotally, tasks are failing because of a segfault, with the following error:

04:18:38 Segmentation fault (core dumped)
04:18:38
04:18:38 > Task :beam-sdks-python:flinkCompatibilityMatrixStreaming FAILED
04:18:38 :beam-sdks-python:flinkCompatibilityMatrixStreaming (Thread[Task worker for ':' Thread 6,5,main]) completed. Took 1 mins 13.28 secs.
04:18:38
04:18:38 FAILURE: Build completed with 2 failures.
04:18:38
04:18:38 1: Task failed with an exception.
04:18:38 ---
04:18:38 * Where:
04:18:38 Build file '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_VR_Flink/src/sdks/python/build.gradle' line: 340
04:18:38
04:18:38 * What went wrong:
04:18:38 Execution failed for task ':beam-sdks-python:flinkCompatibilityMatrixBatch'.
04:18:38 > Process 'command 'sh'' finished with non-zero exit value 139
04:18:38
04:18:38 * Try:
04:18:38 Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output. Run with --scan to get full insights.
04:18:38 ==
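The "non-zero exit value 139" in the log above matches the "Segmentation fault" message: POSIX shells report a process killed by signal N as exit status 128 + N, and 139 - 128 = 11, which is SIGSEGV on Linux. Below is a minimal sketch of that decoding; the helper name describe_exit_value is hypothetical, and signal names are looked up on whatever platform runs the snippet.

```python
import signal


def describe_exit_value(exit_value):
    """Interpret a shell-reported exit value using the 128 + signal convention."""
    if exit_value > 128:
        signum = exit_value - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {exit_value}"


print(describe_exit_value(139))
```

This only confirms that the process launched through sh died from a segmentation fault. It does not say which process crashed or why.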