[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932206#comment-15932206 ]

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

    https://github.com/apache/flink/pull/3525

    ping @StephanEwen

> Blob Server cannot hanlde multiple job sumits(with same content) parallelly
> ---
>
> Key: FLINK-6020
> URL: https://issues.apache.org/jira/browse/FLINK-6020
> Project: Flink
> Issue Type: Bug
> Reporter: Tao Wang
> Assignee: Tao Wang
> Priority: Critical
>
> In yarn-cluster mode, if we submit the same job multiple times in parallel,
> the tasks will encounter class loading problems and lease occupation.
> The blob server stores user jars under a name generated from their sha1sum:
> it first writes a temp file and then moves it to its final name. For recovery
> it also puts them on HDFS under the same file name.
> When multiple clients submit the same job with the same jar at the same time,
> the local jar files in the blob server and the files on HDFS are handled by
> multiple threads (BlobServerConnection) and interfere with each other.
> It would be better to have a way to handle this. Two ideas come to mind:
> 1. lock the write operation, or
> 2. use some unique identifier as the file name instead of (or in addition to)
> the sha1sum of the file contents.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
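The first idea in the issue description, locking the write operation, could look roughly like the sketch below. This is only an illustration, not Flink's code: `LockedBlobWriter` and `putIfAbsent` are hypothetical names, and the sketch only serializes writers inside a single JVM. It does not address the concurrent HDFS upload.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: serializes writes per blob key inside one JVM.
final class LockedBlobWriter {
    // One lock object per key. In this sketch the map is never cleaned up.
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();
    private final Path dir;

    LockedBlobWriter(Path dir) {
        this.dir = dir;
    }

    /** Returns true if this call stored the file, false if it was already there. */
    boolean putIfAbsent(String key, byte[] content) throws IOException {
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            Path target = dir.resolve(key);
            if (Files.exists(target)) {
                return false; // an earlier upload with the same key already finished
            }
            // Write to a temp file first so readers never see a partial file.
            Path tmp = Files.createTempFile(dir, "incoming-", ".tmp");
            Files.write(tmp, content);
            Files.move(tmp, target);
            return true;
        }
    }
}
```

With this shape, only the first of several concurrent uploads of the same key writes the file; the others return `false` and can report success to the client.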
[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931016#comment-15931016 ]

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

    https://github.com/apache/flink/pull/3525

    Right... I had the same thought as you at the beginning, and I tried to make the move atomic, but it has several side effects:

    1. Handling it this way means two jobs can share the same jar file in the blob server. That becomes a problem when one of them is cancelled and deletes its jars (right now it does not seem to delete them, but it should).
    2. For job recovery (or some other kind of recovery; I'm not sure, I only observed the behavior), the blob server uploads jars to HDFS under the same name as the local file. Even if the two jobs share the same jar in the blob store, they will upload it twice at the same time, which causes file lease occupation in HDFS.

[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930495#comment-15930495 ]

ASF GitHub Bot commented on FLINK-6020:
---

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3525

    I think we should then fix this in the blob server.

    The problem that only one should succeed upon collision should be fixable by using `Files.move()` with `ATOMIC_MOVE`. Only when that succeeds, we store the file in the blob store.

    What do you think?

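A minimal sketch of the `Files.move()` with `ATOMIC_MOVE` suggestion follows. It is not the blob server's actual code: `AtomicBlobStore` and `put` are hypothetical names. One caveat from the `Files.move` Javadoc: when the target already exists, it is implementation specific whether an atomic move replaces it or fails with an `IOException`, so the sketch tolerates both outcomes. Either way, readers never observe a half-written file under the final name.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: publish a blob under its content key via an atomic rename.
final class AtomicBlobStore {

    /** Writes content to a temp file in dir, then atomically moves it to dir/key. */
    static Path put(Path dir, String key, byte[] content) throws IOException {
        Path target = dir.resolve(key);
        // The temp file lives in the same directory so the move stays on one file system.
        Path tmp = Files.createTempFile(dir, "incoming-", ".tmp");
        try {
            Files.write(tmp, content);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (FileAlreadyExistsException e) {
            // Another upload won the race. The key is derived from the content,
            // so the existing file is assumed to hold the same bytes.
        } finally {
            // No-op after a successful move; cleans up the temp file otherwise.
            Files.deleteIfExists(tmp);
        }
        return target;
    }
}
```

Only a caller whose move succeeded would then go on to store the file in the blob store; the sketch leaves that step out.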
[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930311#comment-15930311 ]

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

    https://github.com/apache/flink/pull/3525

    The second rename will not fail. Instead, it corrupts the file written by the first one, which makes the first job fail if a task is loading this jar at that moment.

    By the way, the jar file is also uploaded to HDFS for recovery, and that upload will fail too if more than two clients are writing a file with the same name.

    It is easy to reproduce. First launch a session with enough slots, then run a script that submits the same job many times, say 20 lines of "flink run ../examples/streaming/WindowJoin.jar &". Make sure there is an "&" at the end of each line so that they run in parallel.

[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930103#comment-15930103 ]

ASF GitHub Bot commented on FLINK-6020:
---

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3525

    I don't quite understand the issue. Currently, the name should exactly match the hash to make sure that each library is stored only once. Adding a random suffix exactly destroys that behavior.

    In the case where multiple clients upload the same jar to *different* clusters, it should not be a problem, if they use different storage directories (which they should definitely do).

    In the case where multiple clients upload the same jar to the *same* cluster, the first rename from tmp to file will succeed. The second rename from tmp to file will fail, but that's not a problem, because the file already exists with the same contents, and the client can assume success.

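To illustrate why a name that exactly matches the hash stores each library only once, here is a small sketch that derives a file name from the SHA-1 of the contents. `ContentKey` and `sha1Hex` are hypothetical names, and the hex encoding is an assumption for illustration; the blob server's real key format may differ.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: content-addressed naming, same bytes give the same name.
final class ContentKey {

    /** Returns the lowercase hex SHA-1 digest of the given content. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                // Mask to an unsigned value so negative bytes print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

Two clients uploading identical jar bytes compute the same name, so the second upload maps onto the file the first one created.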
[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907321#comment-15907321 ]

ASF GitHub Bot commented on FLINK-6020:
---

GitHub user WangTaoTheTonic opened a pull request:

    https://github.com/apache/flink/pull/3525

    [FLINK-6020]add a random integer suffix to blob key to avoid naming conflicting

    In yarn-cluster mode, if we submit the same job multiple times in parallel, the tasks will encounter class loading problems and lease occupation.

    The blob server stores user jars under a name generated from their sha1sum: it first writes a temp file and then moves it to its final name. For recovery it also puts them on HDFS under the same file name.

    When multiple clients submit the same job with the same jar at the same time, the local jar files in the blob server and the files on HDFS are handled by multiple threads (BlobServerConnection) and interfere with each other.

    I've found a way to solve this by adding a random integer suffix to the blob key, as changed here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WangTaoTheTonic/flink FLINK-6020

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3525

commit 3d9f41afad9c831431b3c7bd0eb2a8006b92718e
Author: WangTaoTheTonic
Date: 2017-03-13T11:52:36Z

    add a random integer suffix to blob key to avoid naming conflicting

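The random-suffix idea can be pictured with the sketch below. It is not the diff from the pull request: `SuffixedBlobName`, `withRandomSuffix`, and the `-` separator are assumptions made for illustration only.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: append a random non-negative integer to a content hash.
final class SuffixedBlobName {

    /** Returns the content hash followed by "-" and a random non-negative int. */
    static String withRandomSuffix(String contentHash) {
        int suffix = ThreadLocalRandom.current().nextInt(Integer.MAX_VALUE);
        return contentHash + "-" + suffix;
    }
}
```

Concurrent uploads of the same jar then almost always get different file names and no longer write to the same path. The cost is that identical jars are stored once per upload instead of once per content hash.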
[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907293#comment-15907293 ]

Tao Wang commented on FLINK-6020:
---

I've found a way to solve this by adding a random integer suffix to the blob key. Will post the commit later.