[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3525 Great to hear @WangTaoTheTonic. Thanks a lot for testing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 @tillrohrmann I've tried with your commit and the issue is resolved, thanks. Closing this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 Thanks for your fix, i'll check in a day or two. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3525 @WangTaoTheTonic I've opened a PR which addresses the problem: #3888. If you like, then you can try it out and see if it solves your problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 That looks good to me. Looking forward to fix from @tillrohrmann. Thank you very much :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 @WangTaoTheTonic I think we can solve that the following way: - The local upload uses `ATOMIC_MOVE` to rename the file - Only the thread that succeeds will store the blob in HDFS or S3 What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 For HA case, the blob server will upload jars to HDFS for recovery, and there's a cocurrent operations here too. I'm not sure if the solutions ou proposed can cover that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 @WangTaoTheTonic I have debugged a bit further in this issue, and it seems there is a bit more to do. For non-HA blob servers, the atomic rename fix would do it. For HA cases, we need to do a bit more. A recent change was that the blob cache will try and fetch blobs directly from the blob store, which may cause pre-mature reads before the blob has been fully written. Because the storage systems we target for HA do not all support atomic renames (S3 does not), we need to use the `_SUCCESS` file trick to mark completed blobs. I chatted with @tillrohrmann about that, he agreed to take a look at fixing these and will make an effort to get this into the 1.3 release. Hope that this will work for you. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 Yes, @netguy204 - that is definitely one possible way for class loaders to leak over... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user netguy204 commented on the issue: https://github.com/apache/flink/pull/3525 @StephanEwen Yes, I do have at least objects and classes being stored in a static context. Any easy example (that has also bitten me a few times) is the class cache that Avro maintains: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L146 The Avro API's, unless told otherwise, will use a singleton instance of SpecificData and will access that shared cache. Would something like that be enough to cause the classloader to pass between jobs? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 @netguy204 I think you are affected by a different issue. In your case, there are no damaged jar files, but it looks like the classloader has been closed. Flink creates classloaders per job and caches them across different tasks of that job. It closes the dynamically created classloaders when all tasks from the job are done. Is it possible that a classloader passes between jobs, meaning that another job uses a class loaders that was created for another job? Do you store some objects / classes / classloaders somewhere in a static context or a cache or interner so that it can be that one job created them and another job re-uses them? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user netguy204 commented on the issue: https://github.com/apache/flink/pull/3525 +1 I'm looking forward to this fix as I think I'm encountering this bug in production. I bundle my jobs into a single JAR file with multiple mains. I submit the jobs to the cluster sequentially (once the cluster accepts one I submit the next). My job also has two dependency JARs that I provide via HTTP using the -C switch to flink. When a job fails it automatically restarts but it seems to cause other jobs from the same JAR to fail and restart as well. The error is always some variation of: ``` java.lang.IllegalStateException: zip file closed at java.util.zip.ZipFile.ensureOpen(ZipFile.java:669) at java.util.zip.ZipFile.getEntry(ZipFile.java:309) at java.util.jar.JarFile.getEntry(JarFile.java:240) at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128) at java.util.jar.JarFile.getJarEntry(JarFile.java:223) at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005) at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983) at sun.misc.URLClassPath.findResource(URLClassPath.java:188) at java.net.URLClassLoader$2.run(URLClassLoader.java:569) at java.net.URLClassLoader$2.run(URLClassLoader.java:567) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findResource(URLClassLoader.java:566) at java.lang.ClassLoader.getResource(ClassLoader.java:1093) at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232) backtrace from some arbitrary point in my code that never is doing anything with reflection ... ``` The class load that triggers the fault is arbitrary. The same job may fail and restart multiple times in the same day with a different failing class load. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 @WangTaoTheTonic I will take this issue next week. I think we can fix this using `Files.move(src, dest, ATOMIC_MOVE)` to avoid that multiple jobs get in each others' way. By that we preserve the cross-job caching behavior and should fix the issue you described. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 I am currently travelling and then attending Flink Forward. Will come back to this after that. Quick feedback: - I am still thinking that the random suffix breaks the original idea of the cached blobs. - The blob manager counts references to files and does not delete them as long as someone has a reference. That prevents deletion if multiple parties work with the same jar. - Properly handling rename and add reference in one lock, as well as de-reference and delete in the same lock should fix it, I think - The blob manager needs to make sure it has an exclusive directory, so that no other process accesses the files. But I think that is the case already. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 hi stephan, could you help review? @StephanEwen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 ping @StephanEwen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 Right...I have same thought as you at the beginning and i've tried to make the move atomic but it has serveral side affect, like: 1. if we use this way to handle this, which means two job can share the same jar file in blobserver, it will be a problem when one of them being canceled and deleting its jars(now it seems like it doesn't do the delete, but it should do) 2. for job recovery(or other kind of recovery, i'm not sure, just observed the phenomenon) blob server will upload jars to hdfs using same name of local file. Even the two jobs share same jar in blob store, they will upload it twice at same time, which will cause file lease occuptation in hdfs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 I think we should then fix this in the blob server. The problem that only one should succeed upon collision should be fixable by using `Files.move()` with `ATOMIC_MOVE`. Only when that succeeds, we store the file in the blob store. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/flink/pull/3525 The second rename will not fail, but make the file which written by the first corrupted, which will make the first job failed if the task is loading this jar. by the way, the jar file will be uploaded to hdfs for recovery, and the uploading will fail too if there are more than two clients writing file with same name. It is easy to reoccur. First launch a session with enough slots, then run a script contains many same job submitting, says there are 20 lines of "flink run ../examples/steaming/WindowJoin.jar &". Make sure there's a "&" in end of each line to make them run in parallel. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3525 I don't quite understand the issue. Currently, the name should exactly match the hash to make sure that each library is stored only once. Adding a random suffix exactly destroys that behavior. In the case where multiple clients upload the same jar to *different* clusters, it should not be a problem, if they use different storage directories (which they should definitely do). In the case where multiple clients upload the same jar to the *same* cluster, the first rename from tmp to file will succeed. The second rename from tmp to file will fail, but that's not a problem, because the file already exists with the same contents, and the client can assume success. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---