[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1090


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-18 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-141406457
  
Will merge this and add a JAR file entry to the pom file...




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-09 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-138940821
  
Yes, this failed check is unrelated to your changes.




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-09 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-138942223
  
Should we bundle the utility into a JAR like the other examples? If so, we 
need to adjust the `pom.xml` file in flink-examples.




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-08 Thread detonator413
Github user detonator413 commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-138547980
  
One profile check fails mysteriously and seems unrelated to the changes I 
introduced. The code should now comply with the guidelines. 




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-07 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-138240660
  
Okay, let's merge it to the examples.

@detonator413 Can you add some class-level comments to the files that 
explain what they do?
Also, we need to remove the author tags. It is Apache policy that code 
is not author-tagged.




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-07 Thread detonator413
Github user detonator413 commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-138241466
  
Sure, will push some changes soon




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-04 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137757590
  
Okay, why not add it to the examples then?




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-04 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137760319
  
Yes, I guess it is a better fit for the examples.




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-04 Thread detonator413
Github user detonator413 commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137680835
  
Actually, Hadoop distcp also has an implementation of a dynamic input format, 
which for my taste is a bit overcomplicated. So I am not sure this Flink tool 
will give much benefit in real life (it also lacks elasticity, unlike Hadoop 
distcp), but it can be a good example of how to implement your own input format 
for a slightly unusual use case. 




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137477152
  
Thanks for your pull request! I'm assuming you would use this utility to 
copy files from your local to a remote file system, right? Your utility starts 
a Flink job to copy the files to the remote file systems. This only works if 
you execute it locally because otherwise the task managers need to have the 
files available and that might defeat the utility's purpose. Also, imagine 
someone embedding the tool in a Flink program. The person might wonder why 
his/her program actually executes two jobs (one for the utility, one for the 
actual job). 

I think this would be more useful as a utility function, e.g. in a 
`FileUtils` class in `flink-core`. The method there would receive a list of 
files and then upload the files like you did using Flink's `FileSystem` 
abstraction. We could still parallelize the method by starting multiple threads 
to upload the files.
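
A minimal sketch of such a helper, for illustration only. It uses `java.nio.file` from the JDK instead of Flink's `FileSystem` abstraction so that it runs without Flink on the classpath; the class name `ParallelCopySketch` and the method `copyAll` are hypothetical and not part of Flink or of this pull request.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: copies a list of files with a fixed pool of threads.
// java.nio.file stands in for Flink's FileSystem abstraction (an assumption
// made to keep the sketch self-contained).
public class ParallelCopySketch {

    /** Copies every source file into targetDir and returns the number of files copied. */
    public static int copyAll(List<Path> sources, Path targetDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(targetDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path source : sources) {
                // One task per file; an existing target file is overwritten.
                Callable<Path> task = () -> Files.copy(
                        source,
                        targetDir.resolve(source.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            int copied = 0;
            for (Future<Path> result : pending) {
                result.get(); // rethrows a failed copy as ExecutionException
                copied++;
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread pool parallelizes the copy within one JVM only, so every source path has to be readable from the machine that calls `copyAll`.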

Correct me if I'm wrong or misunderstood your pull request :)




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread detonator413
Github user detonator413 commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137480178
  
Hi Max,

Take a look at the distcp utility 
(http://hadoop.apache.org/docs/r1.2.1/distcp.html). Its purpose is to copy 
large numbers of files within one cluster or between clusters. In local mode 
the tool also works with the local FS, whereas in distributed mode only HDFS 
paths are supposed to be used. I ran a simple benchmark copying 800 GB of 
data within one cluster, running Hadoop distcp (using the default distcp input 
format) and Flink distcp in parallel. The Flink job was 1.5 minutes faster (it 
took approximately 35 minutes in our setup).

Slava




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread detonator413
GitHub user detonator413 opened a pull request:

https://github.com/apache/flink/pull/1090

Implementation of distributed copying utility using Flink

Uses a "dynamic" input format in which faster nodes receive more files to 
copy. 
The finest level of granularity is a file.
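
The pull-based idea can be sketched in plain Java as follows. This is an illustration, not the code from this pull request: threads stand in for nodes, a shared in-memory queue stands in for the input format's work assignment, and the class name `PullQueueSketch` is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of pull-based work assignment at file granularity.
public class PullQueueSketch {

    /**
     * Starts the given number of worker threads. Each worker repeatedly pulls
     * the next file name from a shared queue until the queue is empty.
     * Returns a map from file name to the id of the worker that handled it.
     */
    public static Map<String, Integer> run(List<String> files, int workers)
            throws InterruptedException {
        Queue<String> pending = new ConcurrentLinkedQueue<>(files);
        Map<String, Integer> handledBy = new ConcurrentHashMap<>();
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                String file;
                // A worker that finishes early simply comes back for more work.
                while ((file = pending.poll()) != null) {
                    handledBy.put(file, id); // stand-in for copying the file
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return handledBy;
    }
}
```

Every file is handled exactly once, but which worker handles which file depends on timing. Because a file is the smallest unit of work, a single very large file cannot be split across workers.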

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/detonator413/flink distcp-example-20150903

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1090


commit 05fc8ec319780eddbb67269e52f3bc1d090df41f
Author: Vyacheslav Zholudev 
Date:   2015-09-03T14:09:18Z

initial Flink DistCp example

commit c4f2b447e89a4ef80ee6ab171d04c519f7498d0e
Author: Vyacheslav Zholudev 
Date:   2015-09-03T14:13:28Z

a bit more comments






[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137533466
  
@detonator413 Good point with the dynamic assignment.

What do you think, would `flink-contrib` or `flink-examples` be a better 
place? Is it rather a nice tool, or is it also educational code?




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137499875
  
Thanks for pointing me to the `distcp` page. I was not aware of this tool 
until now :) The performance of Hadoop and Flink should not differ much 
because copying files is mostly I/O-bound work. Still, it is 
1.5 minutes faster.

Not sure if we can include your code in the Flink examples but definitely 
under `flink-contrib` where we usually put external tools that are not directly 
part of Flink.




[GitHub] flink pull request: Implementation of distributed copying utility ...

2015-09-03 Thread detonator413
Github user detonator413 commented on the pull request:

https://github.com/apache/flink/pull/1090#issuecomment-137516063
  
It could be faster because of the dynamic assignment of files to copy, as 
opposed to the default method of distcp, where a set of files is preassigned 
to each mapper in advance.
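
A small deterministic simulation of that difference, as a sketch under simplifying assumptions: each worker copies at a fixed speed, preassignment is modeled as round-robin, and dynamic assignment is modeled as giving the next file to whichever worker becomes free first. The class `AssignmentSketch` and its methods are hypothetical and model neither Hadoop's nor Flink's actual scheduler.

```java
// Hypothetical model comparing the total copy time (makespan) of static
// preassignment against dynamic, pull-based assignment.
public class AssignmentSketch {

    /** Files are preassigned round-robin; returns the time at which the slowest worker finishes. */
    public static double staticMakespan(long[] sizes, double[] speeds) {
        double[] busy = new double[speeds.length];
        for (int i = 0; i < sizes.length; i++) {
            int worker = i % speeds.length;
            busy[worker] += sizes[i] / speeds[worker];
        }
        return max(busy);
    }

    /** Each file goes to the worker that becomes free earliest (ties go to the lowest index). */
    public static double dynamicMakespan(long[] sizes, double[] speeds) {
        double[] freeAt = new double[speeds.length];
        for (long size : sizes) {
            int next = 0;
            for (int w = 1; w < freeAt.length; w++) {
                if (freeAt[w] < freeAt[next]) {
                    next = w;
                }
            }
            freeAt[next] += size / speeds[next];
        }
        return max(freeAt);
    }

    private static double max(double[] values) {
        double result = 0.0;
        for (double v : values) {
            result = Math.max(result, v);
        }
        return result;
    }
}
```

With eight files of 100 units each and two workers running at 100 and 25 units per second, `staticMakespan` returns 16.0 and `dynamicMakespan` returns 8.0, because the fast worker ends up copying six of the eight files. The gap shrinks as the workers become more uniform.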

