[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932206#comment-15932206
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
ping @StephanEwen 


> Blob Server cannot hanlde multiple job sumits(with same content) parallelly
> ---
>
> Key: FLINK-6020
> URL: https://issues.apache.org/jira/browse/FLINK-6020
> Project: Flink
>  Issue Type: Bug
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Critical
>
> In yarn-cluster mode, if we submit the same job multiple times in parallel, 
> the tasks run into class loading problems and lease occupation.
> This happens because the blob server names each user jar after the SHA-1 sum 
> of its contents: it first writes a temp file and then moves it to its final 
> name. For recovery it also puts the jar on HDFS under the same file name.
> So when multiple clients submit the same job with the same jar at the same 
> time, the local jar files in the blob server and the files on HDFS are handled 
> by multiple threads (BlobServerConnection), which interfere with each other.
> It would be better to have a way to handle this. Two ideas come to mind:
> 1. lock the write operation, or
> 2. use some unique identifier as the file name instead of (or in addition to) 
> the SHA-1 sum of the file contents.
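
As an illustration of idea 1 above, here is a minimal sketch of serializing writes per blob key. It is not Flink's BlobServer code: `LockedBlobWriter`, `put`, and the storage layout are hypothetical names chosen for this example, and it assumes that an identical key always means identical content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Flink: one writer at a time per blob key.
final class LockedBlobWriter {
    // One lock object per key (for example the hex SHA-1 of the jar).
    // Locks are never removed here; a real implementation would need cleanup.
    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();
    private final Path storageDir;

    LockedBlobWriter(Path storageDir) {
        this.storageDir = storageDir;
    }

    /** Writes the blob unless it already exists; returns true if this call wrote it. */
    boolean put(String key, byte[] content) throws IOException {
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            Path target = storageDir.resolve(key);
            if (Files.exists(target)) {
                return false; // assumed: same key means same content, nothing to do
            }
            // Write to a temp file first, then move it to its final name.
            Path tmp = Files.createTempFile(storageDir, "temp-", null);
            Files.write(tmp, content);
            Files.move(tmp, target);
            return true;
        }
    }
}
```

With this shape, concurrent uploads of the same jar inside one JVM queue up behind the first writer and then return without touching the file. It does not protect against writers in other processes, and it does not address the HDFS upload.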



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931016#comment-15931016
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
Right... I had the same thought as you at the beginning, and I tried to make 
the move atomic, but it has several side effects:
1. If we handle it this way, two jobs can share the same jar file in the 
blob server, which becomes a problem when one of them is canceled and deletes 
its jars (it does not seem to delete them currently, but it should).
2. For job recovery (or some other kind of recovery; I'm not sure, I only 
observed the behavior), the blob server uploads jars to HDFS under the same 
name as the local file. Even if the two jobs share the same jar in the blob 
store, they will upload it twice at the same time, which causes file lease 
occupation in HDFS.




[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930495#comment-15930495
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
I think we should then fix this in the blob server.

The problem that only one should succeed upon collision should be fixable 
by using `Files.move()` with `ATOMIC_MOVE`. Only when that succeeds, we store 
the file in the blob store.

What do you think?
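
A minimal sketch of the `Files.move()` with `ATOMIC_MOVE` suggestion, under stated assumptions. `AtomicBlobMove`, `moveThenStore`, and the `afterMove` callback (standing in for the upload to the blob store) are hypothetical names, not Flink APIs. Note that the `Files.move` Javadoc says that with `ATOMIC_MOVE` it is implementation specific whether an existing target is replaced or the call fails with an `IOException`, so the collision branch below may never trigger on some platforms.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Flink.
final class AtomicBlobMove {
    /**
     * Moves tmp to target atomically and runs afterMove only if the move succeeded.
     * Returns true if this caller performed the move.
     */
    static boolean moveThenStore(Path tmp, Path target, Runnable afterMove) throws IOException {
        try {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (FileAlreadyExistsException e) {
            // Another upload won the race (only on platforms that report it);
            // drop our temp copy and skip the follow-up step.
            Files.deleteIfExists(tmp);
            return false;
        }
        // Only reached when the move succeeded, e.g. upload to the blob store here.
        afterMove.run();
        return true;
    }
}
```

If the file system cannot move atomically, `Files.move` throws `AtomicMoveNotSupportedException`, which this sketch simply propagates to the caller.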




[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930311#comment-15930311
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

Github user WangTaoTheTonic commented on the issue:

https://github.com/apache/flink/pull/3525
  
The second rename will not fail. Instead it corrupts the file written by 
the first, which makes the first job fail if a task is loading this jar at 
that moment.

By the way, the jar file is also uploaded to HDFS for recovery, and that 
upload fails too if multiple clients are writing a file with the same name.

It is easy to reproduce. First launch a session with enough slots, then run 
a script that submits the same job many times, say 20 lines of "flink run 
../examples/streaming/WindowJoin.jar &". Make sure there is a "&" at the end 
of each line so that they run in parallel.




[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930103#comment-15930103
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3525
  
I don't quite understand the issue. Currently, the name should exactly 
match the hash to make sure that each library is stored only once. Adding a 
random suffix exactly destroys that behavior.

In the case where multiple clients upload the same jar to *different* 
clusters, it should not be a problem, if they use different storage directories 
(which they should definitely do).

In the case where multiple clients upload the same jar to the *same* 
cluster, the first rename from tmp to file will succeed. The second rename from 
tmp to file will fail, but that's not a problem, because the file already 
exists with the same contents, and the client can assume success.
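
The expected behavior described above (first rename wins, second fails and is treated as success) can be sketched with `java.nio.file.Files.move` without `REPLACE_EXISTING`, which fails with `FileAlreadyExistsException` when the target exists. This is an illustration only: `RenameOnce` and `publish` are hypothetical names, and the Javadoc does not promise that the existence check and the move happen as one atomic step, so this alone does not close the race between threads.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Flink.
final class RenameOnce {
    /** Returns true if this call created target, false if it already existed. */
    static boolean publish(Path tmp, Path target) throws IOException {
        try {
            Files.move(tmp, target); // no REPLACE_EXISTING: fails if target exists
            return true;
        } catch (FileAlreadyExistsException e) {
            // The name is the content hash, so the stored file is assumed identical.
            Files.delete(tmp);
            return false; // the client can still assume success
        }
    }
}
```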




[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907321#comment-15907321
 ] 

ASF GitHub Bot commented on FLINK-6020:
---

GitHub user WangTaoTheTonic opened a pull request:

https://github.com/apache/flink/pull/3525

[FLINK-6020]add a random integer suffix to blob key to avoid naming 
conflicting

In yarn-cluster mode, if we submit the same job multiple times in parallel, 
the tasks run into class loading problems and lease occupation.

This happens because the blob server names each user jar after the SHA-1 sum 
of its contents: it first writes a temp file and then moves it to its final 
name. For recovery it also puts the jar on HDFS under the same file name.

So when multiple clients submit the same job with the same jar at the same 
time, the local jar files in the blob server and the files on HDFS are handled 
by multiple threads (BlobServerConnection), which interfere with each other.

I've found a way to solve this by adding a random integer suffix to the blob 
key, as changed in this pull request.
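
For illustration, a minimal sketch of what a blob key with a random integer suffix could look like. This is not the code from the pull request: `SuffixedBlobKey`, `sha1Hex`, `newKey`, and the `-` separator are assumptions made for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not the pull request's implementation.
final class SuffixedBlobKey {
    /** Hex-encoded SHA-1 of the content. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Content hash plus a random non-negative integer, so each upload gets its own name. */
    static String newKey(byte[] content) {
        int suffix = ThreadLocalRandom.current().nextInt(Integer.MAX_VALUE);
        return sha1Hex(content) + "-" + suffix;
    }
}
```

Two uploads of the same jar then almost always get different file names, so they no longer overwrite each other; the trade-off is that identical jars are stored more than once.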

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WangTaoTheTonic/flink FLINK-6020

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3525


commit 3d9f41afad9c831431b3c7bd0eb2a8006b92718e
Author: WangTaoTheTonic 
Date:   2017-03-13T11:52:36Z

add a random integer suffix to blob key to avoid naming conflicting






[jira] [Commented] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-13 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907293#comment-15907293
 ] 

Tao Wang commented on FLINK-6020:
-

I've found a way to solve this by adding a random integer suffix to the blob 
key. I will post the commit later.
