[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-18 Thread tillrohrmann
Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/3525
  
Great to hear @WangTaoTheTonic. Thanks a lot for testing.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-18 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
@tillrohrmann I've tried with your commit and the issue is resolved, 
thanks. Closing this PR.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-16 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
Thanks for your fix, I'll check in a day or two.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-15 Thread tillrohrmann
Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/3525
  
@WangTaoTheTonic I've opened a PR which addresses the problem: #3888. If 
you like, you can try it out and see whether it solves your problem.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-08 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
That looks good to me. Looking forward to the fix from @tillrohrmann. Thank you 
very much :)




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-08 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
@WangTaoTheTonic I think we can solve that in the following way:

  - The local upload uses `ATOMIC_MOVE` to rename the file
  - Only the thread that succeeds will store the blob in HDFS or S3

What do you think?
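
For illustration, that proposal amounts to something like the sketch below; the 
helper name `promoteTempFile` and the HA-store call are assumptions for this 
example, not the actual BlobServer code:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicBlobPut {

    /**
     * Atomically publishes a freshly written temp file under its final,
     * hash-named location. Returns true only for the caller whose move created
     * the file; that caller is then the one that should copy the blob to the
     * HA store (HDFS/S3). Note: if the target already exists, the outcome of
     * ATOMIC_MOVE (replace vs. exception) is file-system specific, so this is
     * only a sketch of the idea.
     */
    static boolean promoteTempFile(Path tempFile, Path finalFile) throws IOException {
        try {
            Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
            return true; // we won the race; the blob is visible under its final name
        } catch (FileAlreadyExistsException e) {
            // Another upload of the same blob key finished first. Its content is
            // identical, so discard our temp file and report success to the client.
            Files.deleteIfExists(tempFile);
            return false;
        }
    }
}
```

Usage would then be roughly: only when `promoteTempFile(...)` returns true does 
the server push the file to the HA blob store; a hypothetical 
`blobStore.put(finalFile, blobKey)` call stands in for whatever the real upload 
path is.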




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-08 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
For the HA case, the blob server also uploads jars to HDFS for recovery, and 
there are concurrent operations there too. I'm not sure whether the solution 
you proposed covers that.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-06 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
@WangTaoTheTonic I have debugged this issue a bit further, and it seems 
there is a bit more to do.
For non-HA blob servers, the atomic rename fix would do it.

For HA cases, we need to do a bit more. A recent change was that the blob 
cache will try to fetch blobs directly from the blob store, which may cause 
premature reads before the blob has been fully written. Because the storage 
systems we target for HA do not all support atomic renames (S3 does not), we 
need to use the `_SUCCESS` file trick to mark completed blobs.

I chatted with @tillrohrmann about that; he agreed to take a look at fixing 
these and will make an effort to get this into the 1.3 release. I hope that 
this will work for you.
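
For illustration, the `_SUCCESS`-file trick amounts to something like the 
sketch below; the path layout and class names are assumptions for this 
example, not Flink's actual BlobStore implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SuccessMarkerBlobStore {

    private final FileSystem fs;
    private final Path baseDir;

    SuccessMarkerBlobStore(Path baseDir) throws IOException {
        this.baseDir = baseDir;
        this.fs = baseDir.getFileSystem(new Configuration());
    }

    /** Writes the blob data first, then an empty _SUCCESS marker next to it. */
    void put(String blobKey, InputStream data) throws IOException {
        Path blobPath = new Path(baseDir, blobKey);
        try (FSDataOutputStream out = fs.create(blobPath, true)) {
            IOUtils.copyBytes(data, out, 4096, false);
        }
        // The marker is created only after the data file is completely written,
        // so a reader that sees the marker knows the blob is not half-written.
        fs.create(new Path(baseDir, blobKey + "._SUCCESS"), true).close();
    }

    /** Readers check the marker before trusting the blob file. */
    boolean isComplete(String blobKey) throws IOException {
        return fs.exists(new Path(baseDir, blobKey + "._SUCCESS"));
    }
}
```

A cache that does not yet see the marker would retry or fall back to fetching 
the blob from the BlobServer directly, rather than reading a partially written 
file.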




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-06 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
Yes, @netguy204 - that is definitely one possible way for class loaders to 
leak over...




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-05 Thread netguy204
Github user netguy204 commented on the issue:

https://github.com/apache/flink/pull/3525
  
@StephanEwen Yes, I do have at least objects and classes being stored in a 
static context. An easy example (that has also bitten me a few times) is the 
class cache that Avro maintains: 


https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L146

The Avro APIs, unless told otherwise, use a singleton instance of 
SpecificData and access that shared cache.

Would something like that be enough to cause the classloader to pass 
between jobs?
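
If that shared singleton turns out to be the culprit, one way around it is to 
give each reader its own `SpecificData` bound to the job's classloader instead 
of `SpecificData.get()`. A hedged sketch, assuming the readers are constructed 
explicitly:

```java
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.specific.SpecificDatumReader;

public class PerJobAvroReaders {

    /**
     * Builds a datum reader backed by a fresh SpecificData instance tied to the
     * current (per-job) classloader, instead of the process-wide
     * SpecificData.get() singleton whose class cache can outlive the job.
     */
    static <T> SpecificDatumReader<T> newReader(Schema writer, Schema reader,
                                                ClassLoader jobClassLoader) {
        SpecificData data = new SpecificData(jobClassLoader);
        return new SpecificDatumReader<>(writer, reader, data);
    }
}
```

The per-instance class cache then dies together with the job classloader 
instead of outliving it.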




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-04 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
@netguy204 I think you are affected by a different issue. In your case, 
there are no damaged jar files, but it looks like the classloader has been 
closed.

Flink creates classloaders per job and caches them across different tasks 
of that job. It closes the dynamically created classloaders when all tasks from 
the job are done.

Is it possible that a classloader passes between jobs, meaning that one job 
uses a class loader that was created for another job? Do you store objects / 
classes / classloaders in a static context, a cache, or an interner, such 
that one job creates them and another job re-uses them?
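
Roughly what such a leak looks like in user code (a contrived example for 
illustration, not code from Flink or this PR):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LeakyCache {

    // Lives in a class loaded by the *parent* classloader, so it survives jobs.
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    static Object getOrCreate(String key, Class<?> userClass) {
        // The cached instance keeps a reference to userClass, and userClass keeps a
        // reference to the job classloader that loaded it. Once that job finishes and
        // Flink closes the classloader, a second job reusing this cache entry can hit
        // "zip file closed" style errors when the old loader tries to read resources.
        return CACHE.computeIfAbsent(key, k -> {
            try {
                return userClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```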




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-05-02 Thread netguy204
Github user netguy204 commented on the issue:

https://github.com/apache/flink/pull/3525
  
+1 I'm looking forward to this fix as I think I'm encountering this bug in 
production.

I bundle my jobs into a single JAR file with multiple mains. I submit the 
jobs to the cluster sequentially (once the cluster accepts one I submit the 
next). My job also has two dependency JARs that I provide via HTTP using the -C 
switch to flink.

When a job fails it automatically restarts, but it seems to cause other jobs 
from the same JAR to fail and restart as well. The error is always some 
variation of:

```
java.lang.IllegalStateException: zip file closed
at java.util.zip.ZipFile.ensureOpen(ZipFile.java:669)
at java.util.zip.ZipFile.getEntry(ZipFile.java:309)
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
 ... backtrace from some arbitrary point in my code that is never 
doing anything with reflection ...
```

The class load that triggers the fault is arbitrary. The same job may fail 
and restart multiple times in the same day with a different failing class load.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-04-21 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
@WangTaoTheTonic I will take this issue next week. I think we can fix this 
using `Files.move(src, dest, ATOMIC_MOVE)` to avoid multiple jobs getting in 
each other's way. That way we preserve the cross-job caching behavior and 
should fix the issue you described.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-04-05 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
I am currently travelling and then attending Flink Forward. Will come back 
to this after that.

Quick feedback:
  - I am still thinking that the random suffix breaks the original idea of 
the cached blobs.
  - The blob manager counts references to files and does not delete them as 
long as someone has a reference. That prevents deletion if multiple parties 
work with the same jar.
  - Properly handling rename-and-add-reference in one lock, as well as 
de-reference-and-delete in the same lock, should fix it, I think (see the 
sketch after this list).
  - The blob manager needs to make sure it has an exclusive directory, so 
that no other process accesses the files. But I think that is the case already.
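
A minimal sketch of the locking discipline from the third point above, with 
hypothetical names (the real BlobServer bookkeeping is more involved): 
publish-and-add-reference and de-reference-and-delete each run under the same 
lock.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

public class RefCountedBlobRegistry {

    private final Object lock = new Object();
    private final Map<Path, Integer> refCounts = new HashMap<>();

    /** Atomically publish the temp file under its final name and take a reference. */
    void registerBlob(Path tempFile, Path finalFile) throws IOException {
        synchronized (lock) {
            if (!Files.exists(finalFile)) {
                Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
            } else {
                // Identical content is already published; just drop our temp file.
                Files.deleteIfExists(tempFile);
            }
            refCounts.merge(finalFile, 1, Integer::sum);
        }
    }

    /** Drop a reference; delete the file only when nobody uses it anymore. */
    void releaseBlob(Path finalFile) throws IOException {
        synchronized (lock) {
            Integer count = refCounts.get(finalFile);
            if (count == null) {
                return;
            }
            if (count == 1) {
                refCounts.remove(finalFile);
                Files.deleteIfExists(finalFile);
            } else {
                refCounts.put(finalFile, count - 1);
            }
        }
    }
}
```

With both operations serialized on one lock, a concurrent release can no 
longer delete a file between another client's rename and its reference 
registration.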




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-04-05 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
Hi Stephan, could you help review? @StephanEwen 




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-03-20 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
ping @StephanEwen 




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-03-17 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
Right... I had the same thought as you at the beginning, and I've tried to 
make the move atomic, but it has several side effects, like:
1. If we handle it this way, two jobs can share the same jar file in the 
blob server, which becomes a problem when one of them is canceled and deletes 
its jars (right now it seems it doesn't do the delete, but it should).
2. For job recovery (or another kind of recovery, I'm not sure, I just 
observed the phenomenon), the blob server uploads jars to HDFS using the same 
name as the local file. Even if the two jobs share the same jar in the blob 
store, they will upload it twice at the same time, which causes a file lease 
conflict in HDFS.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-03-17 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
I think we should then fix this in the blob server.

The requirement that only one writer succeed upon a collision should be 
fixable by using `Files.move()` with `ATOMIC_MOVE`. Only when that succeeds 
do we store the file in the blob store.

What do you think?




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-03-17 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
The second rename will not fail, but it corrupts the file written by the 
first one, which makes the first job fail if a task is loading this jar.

By the way, the jar file is also uploaded to HDFS for recovery, and that 
upload fails too if two or more clients write a file with the same name.

It is easy to reproduce. First launch a session with enough slots, then run a 
script containing many identical job submissions, say 20 lines of 
`flink run ../examples/streaming/WindowJoin.jar &`. Make sure there's an `&` 
at the end of each line so they run in parallel.




[GitHub] flink issue #3525: [FLINK-6020]add a random integer suffix to blob key to av...

2017-03-17 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
I don't quite understand the issue. Currently, the name should exactly 
match the hash to make sure that each library is stored only once. Adding a 
random suffix destroys exactly that behavior.

In the case where multiple clients upload the same jar to *different* 
clusters, it should not be a problem if they use different storage 
directories (which they should definitely do).

In the case where multiple clients upload the same jar to the *same* 
cluster, the first rename from tmp to file will succeed. The second rename 
from tmp to file will fail, but that's not a problem, because the file 
already exists with the same contents, and the client can assume success.

