[jira] [Commented] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984769#comment-16984769 ]

David Moravek commented on FLINK-13550:
---------------------------------------

Hi [~xintongsong], we already use this in our internal build and I'm definitely planning to contribute it back. It would be helpful if you could assign the issue to me. I'll try to send an initial PR with the REST endpoint within the next week.

> Support for CPU FlameGraphs in new web UI
> -----------------------------------------
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / REST, Runtime / Web Frontend
> Reporter: David Moravek
> Priority: Major
>
> For better insight into a running job, it would be useful to have the ability to render a CPU flame graph for a particular job vertex.
> Flink already has a stack-trace sampling mechanism in place, so this should be straightforward to implement.
> It could be done by adding a new REST API endpoint that samples stack traces the same way the current BackPressureTracker does, only with a different sampling rate and sampling duration.
> [Here|https://www.youtube.com/watch?v=GUNDehj9z9o] is a little demo of the feature.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
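The sampling-to-flame-graph step described above can be illustrated with a toy aggregator that folds sampled stack traces into the `frame;frame;frame count` format consumed by common flame-graph renderers (the class and method names below are illustrative, not Flink APIs):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy aggregator: folds sampled stack traces into "a;b;c count" lines. */
public class FoldedStacks {

    /** Each sample is one stack, root frame first; identical stacks are counted together. */
    public static Map<String, Integer> fold(List<List<String>> samples) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> stack : samples) {
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> samples = List.of(
            List.of("Task.run", "Map.map"),
            List.of("Task.run", "Map.map"),
            List.of("Task.run", "Reduce.reduce"));
        // Prints the folded format that flame-graph tooling consumes.
        fold(samples).forEach((stack, count) -> System.out.println(stack + " " + count));
    }
}
```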
[jira] [Created] (FLINK-14746) History server does not handle uncaught exceptions in archive fetcher.
David Moravek created FLINK-14746:
-------------------------------------

             Summary: History server does not handle uncaught exceptions in archive fetcher.
                 Key: FLINK-14746
                 URL: https://issues.apache.org/jira/browse/FLINK-14746
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Web Frontend
    Affects Versions: 1.9.1, 1.8.2
            Reporter: David Moravek

If the archive fetcher fails with an error (e.g. an OOM while parsing JSON archives), the error is swallowed by the ScheduledExecutorService and the submitted runnable is never rescheduled.
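For context, this is standard `ScheduledExecutorService` behavior: a periodic task that throws is silently cancelled, and the only trace is the returned future. A minimal sketch of the failure mode, plus the generic defensive wrapper (the wrapper is the usual pattern, not necessarily the fix applied in Flink):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SwallowedErrorDemo {

    /** Wraps a runnable so a failure is logged instead of cancelling the schedule. */
    static Runnable catching(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                System.err.println("periodic task failed: " + t);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();

        // The first failure cancels all future executions; nothing is logged anywhere.
        ScheduledFuture<?> future = executor.scheduleWithFixedDelay(
            () -> { runs.incrementAndGet(); throw new RuntimeException("boom"); },
            0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(300);
        System.out.println("runs without wrapper: " + runs.get()); // stays at 1
        System.out.println("future done: " + future.isDone());     // true; it holds the exception

        executor.shutdownNow();
    }
}
```

Wrapping the runnable with `catching(...)` before scheduling keeps the periodic task alive across failures.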
[jira] [Created] (FLINK-14709) Allow outputting elements in close method of chained drivers.
David Moravek created FLINK-14709:
-------------------------------------

             Summary: Allow outputting elements in close method of chained drivers.
                 Key: FLINK-14709
                 URL: https://issues.apache.org/jira/browse/FLINK-14709
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Task
    Affects Versions: 1.9.1, 1.8.1, 1.7.2
            Reporter: David Moravek

Currently, BatchTask and DataSourceTask only allow outputting elements from the close method of the "rich" operators that they execute directly. The task workflow is as follows:

1) open the "head" driver (calls the "open" method on the udf)
2) open the chained drivers
3) run the "head" driver
4) close the "head" driver (calls the "close" method on the udf)
5) close the output collector (no elements can be collected after this point)
6) close the chained drivers

In order to properly support outputs from the close method, we want to swap 5) and 6). We also need to tweak the implementation of the chained Reduce / Combine drivers, because they dispose their sorters in the closeTask method (this should happen in the close method instead).

This would bring a huge performance improvement for Beam users, because we could properly implement bundling on batch (whole partition = single bundle).
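The proposed ordering can be sketched with stand-in classes (illustrative names, not the actual BatchTask code): an element emitted from a chained driver's close is still accepted because the collector is now closed last.

```java
import java.util.ArrayList;
import java.util.List;

public class TeardownOrderDemo {

    /** Stand-in for an output collector that rejects elements once closed. */
    static class Collector {
        final List<String> out = new ArrayList<>();
        boolean closed;

        void collect(String record) {
            if (closed) {
                throw new IllegalStateException("collected after close: " + record);
            }
            out.add(record);
        }
    }

    /** Stand-in for a chained driver that flushes a final element on close. */
    static class ChainedDriver {
        void close(Collector collector) {
            collector.collect("flushed-on-close");
        }
    }

    public static void main(String[] args) {
        Collector collector = new Collector();
        ChainedDriver chained = new ChainedDriver();

        // Proposed order: close chained drivers first, then the collector.
        chained.close(collector);
        collector.closed = true;

        System.out.println(collector.out); // [flushed-on-close]
    }
}
```

With the original ordering (collector closed first), the same `close` call would throw instead of emitting.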
[jira] [Commented] (FLINK-14169) Cleanup expired jobs from history server
[ https://issues.apache.org/jira/browse/FLINK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935722#comment-16935722 ]

David Moravek commented on FLINK-14169:
---------------------------------------

Can you please assign this to [~david.hrbacek]? We have this planned for the current sprint.

Also, should this behavior be configurable, so it stays consistent with prior versions?

Thanks, D.

> Cleanup expired jobs from history server
> ----------------------------------------
>
> Key: FLINK-14169
> URL: https://issues.apache.org/jira/browse/FLINK-14169
> Project: Flink
> Issue Type: Improvement
> Reporter: David Moravek
> Priority: Minor
>
> Clean up jobs that are no longer present in the history refresh locations during JobArchiveFetcher::run.
> https://github.com/apache/flink/blob/master/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServerArchiveFetcher.java#L138
[jira] [Created] (FLINK-14169) Cleanup expired jobs from history server
David Moravek created FLINK-14169:
-------------------------------------

             Summary: Cleanup expired jobs from history server
                 Key: FLINK-14169
                 URL: https://issues.apache.org/jira/browse/FLINK-14169
             Project: Flink
          Issue Type: Improvement
            Reporter: David Moravek

Clean up jobs that are no longer present in the history refresh locations during JobArchiveFetcher::run.

https://github.com/apache/flink/blob/master/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServerArchiveFetcher.java#L138
[jira] [Created] (FLINK-14166) Reuse cache from previous history server run
David Moravek created FLINK-14166:
-------------------------------------

             Summary: Reuse cache from previous history server run
                 Key: FLINK-14166
                 URL: https://issues.apache.org/jira/browse/FLINK-14166
             Project: Flink
          Issue Type: Improvement
            Reporter: David Moravek

Currently, the history server is not able to reuse the cache from a previous run, even when `historyserver.web.tmpdir` is set. It could simply "warm up" the cached job-id set from previously fetched jobs.

https://github.com/apache/flink/blob/master/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServerArchiveFetcher.java#L129

This should be configurable, so it does not break backward compatibility.
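The warm-up idea can be sketched as follows, under the assumption that each previously fetched job lives in a directory named after its job id beneath the configured tmpdir (directory layout and names here are illustrative, not the actual history server code):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class CacheWarmup {

    /** Pre-populates the cached job-id set from directories left by a previous run. */
    static Set<String> warmUp(Path webTmpDir) throws IOException {
        Set<String> cachedJobIds = new HashSet<>();
        if (!Files.isDirectory(webTmpDir)) {
            return cachedJobIds; // nothing to reuse on a fresh start
        }
        try (DirectoryStream<Path> jobs = Files.newDirectoryStream(webTmpDir)) {
            for (Path job : jobs) {
                if (Files.isDirectory(job)) {
                    cachedJobIds.add(job.getFileName().toString());
                }
            }
        }
        return cachedJobIds;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("history-server");
        Files.createDirectory(tmp.resolve("jobid-1"));
        Files.createDirectory(tmp.resolve("jobid-2"));
        // The fetcher would skip re-downloading these job archives.
        System.out.println(warmUp(tmp));
    }
}
```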
[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery
[ https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923264#comment-16923264 ]

David Moravek commented on FLINK-13958:
---------------------------------------

Hi Till, we're using attached mode on YARN. We'll try the detached mode instead and let you know.

Anyway, do you think this is an issue worth fixing in general? If so, what would be the correct approach?

> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
> Key: FLINK-13958
> URL: https://issues.apache.org/jira/browse/FLINK-13958
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.9.0
> Reporter: David Moravek
> Priority: Major
>
> https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E
> 1) We have a per-job Flink cluster.
> 2) We use BATCH execution mode + the region failover strategy.
> Point 1) should imply a single user-code class loader per task manager (because there is only a single pipeline, which reuses the class loader cached in BlobLibraryCacheManager). We need this property because we have UDFs that access C libraries using JNI (which may be a fairly common use-case when dealing with legacy code). [JDK internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466] ensure that a single native library can only be loaded by one class loader per JVM.
> When region recovery is triggered, vertices that need to recover are first reset back to the CREATED state and then rescheduled. If all tasks in a task manager are reset, this results in the [cached class loader being released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338]. This unfortunately causes a job failure, because we try to reload a native library in a newly created class loader.
> I believe the correct approach would be to not release the cached class loader while the job is recovering, even though there are no tasks currently registered with the TM.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
[ https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922555#comment-16922555 ]

David Moravek commented on FLINK-11402:
---------------------------------------

This is a [JVM limitation|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466], and you have to make sure to load native libraries using the system class loader if you're submitting multiple jobs into the same cluster.

> User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-11402
> URL: https://issues.apache.org/jira/browse/FLINK-11402
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.7.0
> Reporter: Ufuk Celebi
> Priority: Major
> Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz
>
> As reported on the user mailing list thread "[`env.java.opts` not persisting after job canceled or failed and then restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]", there can be issues with using native libraries and user code class loading.
> h2. Steps to reproduce
> I was able to reproduce the issue reported on the mailing list using [snappy-java|https://github.com/xerial/snappy-java] in a user program. Running the attached user program works fine on initial submission, but results in a failure when re-executed.
> I'm using Flink 1.7.0 with a standalone cluster started via {{bin/start-cluster.sh}}.
> 0. Unpack the attached Maven project and build using {{mvn clean package}} *or* directly use the attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> 1. Download [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar] and unpack libsnappyjava for your system:
> {code}
> jar tf snappy-java-1.1.7.2.jar | grep libsnappy
> ...
> org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
> ...
> org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
> ...
> {code}
> 2. Configure the system library path to {{libsnappyjava}} in {{flink-conf.yaml}} (the path needs to be adjusted for your system):
> {code}
> env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64
> {code}
> 3. Run the attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> Program execution finished
> Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
> Job Runtime: 359 ms
> {code}
> 4. Rerun the attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
>
> The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 7d69baca58f33180cb9251449ddcd396)
> at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
> at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
> at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
> at com.github.uce.HelloSnappy.main(HelloSnappy.java:18)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
> at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
> at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
> at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
> at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
> at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
> at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
> at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
> at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
> at
[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery
[ https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922554#comment-16922554 ]

David Moravek commented on FLINK-13958:
---------------------------------------

[~1u0] That's an unrelated issue. The behavior you are describing is expected when you submit multiple jobs into the same cluster. The only option in that case is to work around it using the system class loader (which is technically not a workaround).

> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
> Key: FLINK-13958
> URL: https://issues.apache.org/jira/browse/FLINK-13958
[jira] [Created] (FLINK-13958) Job class loader may not be reused after batch job recovery
David Moravek created FLINK-13958:
-------------------------------------

             Summary: Job class loader may not be reused after batch job recovery
                 Key: FLINK-13958
                 URL: https://issues.apache.org/jira/browse/FLINK-13958
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.9.0
            Reporter: David Moravek

https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E

1) We have a per-job Flink cluster.
2) We use BATCH execution mode + the region failover strategy.

Point 1) should imply a single user-code class loader per task manager (because there is only a single pipeline, which reuses the class loader cached in BlobLibraryCacheManager). We need this property because we have UDFs that access C libraries using JNI (which may be a fairly common use-case when dealing with legacy code). [JDK internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466] ensure that a single native library can only be loaded by one class loader per JVM.

When region recovery is triggered, vertices that need to recover are first reset back to the CREATED state and then rescheduled. If all tasks in a task manager are reset, this results in the [cached class loader being released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338]. This unfortunately causes a job failure, because we try to reload a native library in a newly created class loader.

I believe the correct approach would be to not release the cached class loader while the job is recovering, even though there are no tasks currently registered with the TM.
[jira] [Commented] (FLINK-13445) Distinguishing Memory Configuration for TaskManager and JobManager
[ https://issues.apache.org/jira/browse/FLINK-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918732#comment-16918732 ]

David Moravek commented on FLINK-13445:
---------------------------------------

Hi [~StephanEwen], we're hitting the same issue on YARN, and this should be fairly easy to fix.

We have a large `containerized.heap-cutoff-min` for taskmanagers (let's say 2 GB), because we need to allocate large chunks of non-Flink-managed memory off-heap. This doesn't allow us to allocate a jobmanager with less than 2 GB, which is wasteful.

I'll send a PR that introduces `jobmanager.containerized.heap-cutoff-min`, which would fall back to `containerized.heap-cutoff-min` and would allow you to override this option for the jobmanager. WDYT?

> Distinguishing Memory Configuration for TaskManager and JobManager
> ------------------------------------------------------------------
>
> Key: FLINK-13445
> URL: https://issues.apache.org/jira/browse/FLINK-13445
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Configuration
> Affects Versions: 1.8.1
> Reporter: madong
> Priority: Major
>
> We use Flink to run some jobs built in a non-Java language, so we increase the value of `containerized.heap-cutoff-ratio` to reserve more memory for the non-Java process, which also affects the memory allocation for the jobManager. Considering the different behaviors of the taskManager and jobManager, should this configuration be applied separately?
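The proposed fallback can be sketched generically, using a plain map as a stand-in for Flink's Configuration (the key `jobmanager.containerized.heap-cutoff-min` is the proposal, not an existing option, and the default value here is an assumption for illustration):

```java
import java.util.Map;

public class CutoffFallback {

    static final String GENERIC_KEY = "containerized.heap-cutoff-min";
    static final String JM_KEY = "jobmanager.containerized.heap-cutoff-min";
    static final int DEFAULT_CUTOFF_MB = 600; // assumed default, for illustration

    /** Jobmanager-specific cutoff, falling back to the generic option, then the default. */
    static int jobManagerCutoffMb(Map<String, String> conf) {
        String value = conf.getOrDefault(JM_KEY, conf.get(GENERIC_KEY));
        return value != null ? Integer.parseInt(value) : DEFAULT_CUTOFF_MB;
    }

    public static void main(String[] args) {
        // A large taskmanager cutoff no longer forces a large jobmanager container.
        Map<String, String> conf = Map.of(GENERIC_KEY, "2048", JM_KEY, "600");
        System.out.println(jobManagerCutoffMb(conf)); // 600

        // Without the jobmanager key, behavior is unchanged: fall back to the generic option.
        System.out.println(jobManagerCutoffMb(Map.of(GENERIC_KEY, "2048"))); // 2048
    }
}
```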
[jira] [Commented] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900143#comment-16900143 ]

David Moravek commented on FLINK-13550:
---------------------------------------

[~vthinkxie] Great! I'll ping you once I have the REST endpoint ready ;) Thanks

> Support for CPU FlameGraphs in new web UI
> -----------------------------------------
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898701#comment-16898701 ]

David Moravek commented on FLINK-13550:
---------------------------------------

[Original thread|https://lists.apache.org/thread.html/8d0c0661c3587a485995f24f4a95841680535d9654534c207517e800@%3Cdev.flink.apache.org%3E].

> Support for CPU FlameGraphs in new web UI
> -----------------------------------------
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550
[jira] [Created] (FLINK-13552) Render vertex FlameGraph in web UI
David Moravek created FLINK-13552:
-------------------------------------

             Summary: Render vertex FlameGraph in web UI
                 Key: FLINK-13552
                 URL: https://issues.apache.org/jira/browse/FLINK-13552
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Web Frontend
            Reporter: David Moravek

Add a new FlameGraph tab to the "vertex detail" page that actively polls the flame graph endpoint and renders the result using the d3 library.
[jira] [Created] (FLINK-13551) Add vertex FlameGraph REST endpoint
David Moravek created FLINK-13551:
-------------------------------------

             Summary: Add vertex FlameGraph REST endpoint
                 Key: FLINK-13551
                 URL: https://issues.apache.org/jira/browse/FLINK-13551
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / REST
            Reporter: David Moravek

Add a new endpoint that returns the data needed for flame graph rendering.
[jira] [Updated] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Moravek updated FLINK-13550:
----------------------------------

    Component/s: Runtime / REST

> Support for CPU FlameGraphs in new web UI
> -----------------------------------------
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550
[jira] [Created] (FLINK-13550) Support for CPU FlameGraphs in new web UI
David Moravek created FLINK-13550:
-------------------------------------

             Summary: Support for CPU FlameGraphs in new web UI
                 Key: FLINK-13550
                 URL: https://issues.apache.org/jira/browse/FLINK-13550
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Web Frontend
            Reporter: David Moravek

For better insight into a running job, it would be useful to have the ability to render a CPU flame graph for a particular job vertex.

Flink already has a stack-trace sampling mechanism in place, so this should be straightforward to implement.

It could be done by adding a new REST API endpoint that samples stack traces the same way the current BackPressureTracker does, only with a different sampling rate and sampling duration.

[Here|https://www.youtube.com/watch?v=GUNDehj9z9o] is a little demo of the feature.
[jira] [Created] (FLINK-13369) Recursive closure cleaner ends up with stackOverflow in case of circular dependency
David Moravek created FLINK-13369:
-------------------------------------

             Summary: Recursive closure cleaner ends up with stackOverflow in case of circular dependency
                 Key: FLINK-13369
                 URL: https://issues.apache.org/jira/browse/FLINK-13369
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.8.1, 1.9.0
            Reporter: David Moravek
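The ticket has no description, but the failure mode is easy to picture: a depth-first walk over captured object graphs recurses forever on a cycle. A generic sketch of the usual fix, an identity-based visited set (illustrative, not the actual ClosureCleaner code; it walks declared fields only, to keep the sketch short):

```java
import java.lang.reflect.Field;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class CycleSafeWalk {

    /** Visits every object reachable from root at most once, surviving cycles. */
    static int countReachable(Object root) throws IllegalAccessException {
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, visited);
        return visited.size();
    }

    private static void walk(Object obj, Set<Object> visited) throws IllegalAccessException {
        if (obj == null || !visited.add(obj)) {
            return; // already seen: this breaks the circular dependency
        }
        for (Field field : obj.getClass().getDeclaredFields()) {
            if (!field.getType().isPrimitive()) {
                field.setAccessible(true);
                walk(field.get(obj), visited);
            }
        }
    }

    static class Node {
        Node next;
    }

    public static void main(String[] args) throws Exception {
        Node a = new Node();
        Node b = new Node();
        a.next = b;
        b.next = a; // circular dependency

        // Terminates with 2 visited objects instead of a StackOverflowError.
        System.out.println(countReachable(a));
    }
}
```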
[jira] [Created] (FLINK-13367) Make ClosureCleaner detect writeReplace serialization override
David Moravek created FLINK-13367:
-------------------------------------

             Summary: Make ClosureCleaner detect writeReplace serialization override
                 Key: FLINK-13367
                 URL: https://issues.apache.org/jira/browse/FLINK-13367
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.8.1, 1.9.0
            Reporter: David Moravek

The nested ClosureCleaner introduced in FLINK-12297 does not respect [writeReplace serialization overrides|https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html].

This is a problem for Apache Beam, which takes advantage of this when serializing Avro schemas.
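A sketch of how a cleaner could detect the override before touching an object's fields: look for a declared `writeReplace()` anywhere in the class hierarchy, the same way Java serialization resolves it (`java.time.LocalDate` is used below purely as a known JDK example of a class with `writeReplace`; this helper is illustrative, not the actual Flink fix):

```java
import java.lang.reflect.Method;

public class WriteReplaceCheck {

    /** True if the class (or a superclass) declares a writeReplace() serialization override. */
    static boolean usesWriteReplace(Class<?> clazz) {
        for (Class<?> c = clazz; c != null && c != Object.class; c = c.getSuperclass()) {
            try {
                Method m = c.getDeclaredMethod("writeReplace");
                if (m.getReturnType() == Object.class) {
                    return true; // serialization will substitute a proxy object
                }
            } catch (NoSuchMethodException ignored) {
                // keep looking up the hierarchy
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // LocalDate serializes through a writeReplace proxy; String does not.
        System.out.println(usesWriteReplace(java.time.LocalDate.class)); // true
        System.out.println(usesWriteReplace(String.class)); // false
    }
}
```

A cleaner that finds the override should leave the object to its own serialization protocol instead of nulling out its fields.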
[jira] [Commented] (FLINK-13127) "--yarnship" doesn't support resource classloading
[ https://issues.apache.org/jira/browse/FLINK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880494#comment-16880494 ]

David Moravek commented on FLINK-13127:
---------------------------------------

Hello Yang, yes, you can definitely do this. The problem arises when you're using existing libraries that rely on classloading. Also, yarnship behavior is currently inconsistent, because classloading works for archives, but not for resources (*which are added to the classpath too, but in an incorrect format*).

I think the proper behavior would be:
* We traverse the ship directory recursively:
** If we find an archive, we add it to an archives list.
** If we find a resource, we resolve its parent directory and add it to a resourceDirectories set.
* We construct the classpath for shipped files as follows: sorted(resourceDirectories) + sorted(archives)

I'll provide a patch for this today.

> "--yarnship" doesn't support resource classloading
> --------------------------------------------------
>
> Key: FLINK-13127
> URL: https://issues.apache.org/jira/browse/FLINK-13127
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.8.1
> Reporter: David Moravek
> Priority: Major
>
> Currently, yarnship works as follows:
> * the user specifies a directory to ship with the job
> * YARN ships it with the container
> * org.apache.flink.yarn.AbstractYarnClusterDescriptor#uploadAndRegisterFiles traverses the directory recursively and adds each file to the classpath
> This works well for shipping jars, but doesn't work correctly for shipping resources that we want to load using the java.lang.ClassLoader#getResource method. In order to make resource classloading work, we need to register the resource's directory instead of the file itself (the Java classpath expects directories or archives).
> CLASSPATH="shipped/custom.conf:${CLASSPATH}" needs to become CLASSPATH="shipped:${CLASSPATH}"

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
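The classpath construction described in the comment can be sketched as pure path logic (an illustrative helper, not the AbstractYarnClusterDescriptor code): archives are added as files, resources contribute their parent directory, and both groups are sorted for determinism.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ShipClasspath {

    /** Builds a classpath: sorted resource parent directories first, then sorted archives. */
    static String buildClasspath(List<Path> shippedFiles) {
        TreeSet<String> resourceDirs = new TreeSet<>(); // TreeSet keeps entries sorted and deduplicated
        TreeSet<String> archives = new TreeSet<>();
        for (Path file : shippedFiles) {
            String name = file.getFileName().toString();
            if (name.endsWith(".jar") || name.endsWith(".zip")) {
                archives.add(file.toString());
            } else {
                resourceDirs.add(file.getParent().toString());
            }
        }
        return Stream.concat(resourceDirs.stream(), archives.stream())
            .collect(Collectors.joining(":"));
    }

    public static void main(String[] args) {
        List<Path> shipped = List.of(
            Path.of("shipped/custom.conf"),
            Path.of("shipped/lib/util.jar"));
        // The resource contributes "shipped" (its directory), not "shipped/custom.conf".
        System.out.println(buildClasspath(shipped));
    }
}
```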
[jira] [Created] (FLINK-13127) "--yarnship" doesn't support resource classloading
David Moravek created FLINK-13127:
-------------------------------------

             Summary: "--yarnship" doesn't support resource classloading
                 Key: FLINK-13127
                 URL: https://issues.apache.org/jira/browse/FLINK-13127
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN
    Affects Versions: 1.8.1
            Reporter: David Moravek

Currently, yarnship works as follows:
* the user specifies a directory to ship with the job
* YARN ships it with the container
* org.apache.flink.yarn.AbstractYarnClusterDescriptor#uploadAndRegisterFiles traverses the directory recursively and adds each file to the classpath

This works well for shipping jars, but doesn't work correctly for shipping resources that we want to load using the java.lang.ClassLoader#getResource method. In order to make resource classloading work, we need to register the resource's directory instead of the file itself (the Java classpath expects directories or archives).

CLASSPATH="shipped/custom.conf:${CLASSPATH}" needs to become CLASSPATH="shipped:${CLASSPATH}"
[jira] [Created] (FLINK-4232) Flink executable does not return correct pid
David Moravek created FLINK-4232:
------------------------------------

             Summary: Flink executable does not return correct pid
                 Key: FLINK-4232
                 URL: https://issues.apache.org/jira/browse/FLINK-4232
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.0.3
            Reporter: David Moravek
            Priority: Minor

E.g. when using supervisor, the pid returned by ./bin/flink is the pid of the shell executable instead of the java process.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
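The usual fix for this class of bug is to end the launcher script with `exec`, so the JVM replaces the shell and inherits its pid; that is what a process supervisor needs to see. A generic illustration of the mechanism (not the actual bin/flink script):

```shell
#!/bin/sh
# Without exec, the wrapper shell stays alive and the child gets a new PID.
# (The trailing ':' prevents the shell from implicitly exec-ing the last command.)
no_exec=$(sh -c 'echo $$; sh -c '\''echo $$'\''; :')

# With exec, the child replaces the wrapper and keeps the wrapper's PID.
with_exec=$(sh -c 'echo $$; exec sh -c '\''echo $$'\''')

echo "without exec: $no_exec"   # two different PIDs
echo "with exec:    $with_exec" # the same PID twice
```

Applied to a launcher, this means replacing a final `java ... "$@"` line with `exec java ... "$@"`.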