Summing up, what would be the recommendations for copying?

1) DistCp
2) The shell cp command
3) Using the FileSystem API (FileUtils, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what DistCp does)
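For options 1 and 2 above, here is a minimal sketch of how the choice could be scripted. It follows the rule of thumb from the replies quoted below (plain cp for a few GB, DistCp for bulk copies). The function name copy_command and the 10 GiB cut-off are assumptions made up for illustration; neither comes from the thread nor from Hadoop. The function only prints the command line it would use, so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: pick between `hdfs dfs -cp` and `hadoop distcp` for a copy
# inside one cluster. The threshold is an ASSUMED value for illustration.
THRESHOLD_BYTES=$((10 * 1024 * 1024 * 1024))   # 10 GiB, hypothetical cut-off

# copy_command SIZE_BYTES SRC DST
# Prints (does not run) the command line that would be used.
copy_command() {
  local size_bytes=$1 src=$2 dst=$3
  if [ "$size_bytes" -lt "$THRESHOLD_BYTES" ]; then
    # Small copy: serial, streamed through the client, no job to schedule.
    echo "hdfs dfs -cp $src $dst"
  else
    # Large copy: map-only MapReduce job, runs in parallel but needs free map slots.
    echo "hadoop distcp $src $dst"
  fi
}

# Example: a 1 GiB copy stays below the assumed threshold.
copy_command 1073741824 /src/file /dst/
# prints: hdfs dfs -cp /src/file /dst/
```

The size argument would have to come from somewhere such as `hdfs dfs -du -s <path>`; parsing that output is left out here. A sensible cut-off depends on the cluster and on how busy it is, so the number above should be treated as a placeholder rather than a recommendation.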
I did not run any comparisons, as my dev cluster is just a two-node cluster and I am not sure how this would perform on a production cluster.

Kay

On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[email protected]> wrote:

> Yes, makes sense... cp is serialized and simpler, and does not rely on the
> jobtracker, whereas distcp actually only submits a job and waits for
> completion.
> So it can fail if tasks start to fail or time out.
> I have seen distcp fail and hang before, albeit not often.
>
> Sent from my iPhone
>
> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[email protected]>
> wrote:
>
> If the cluster is busy with other jobs, distcp will wait for free map slots.
> Regular cp is more reliable and predictable, especially if you need to copy
> just several GB.
> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[email protected]> wrote:
>
>> The cp command is not parallel. It just calls FileSystem, even if DFSClient
>> has multiple threads.
>>
>> DistCp can work well on the same cluster.
>>
>>
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[email protected]> wrote:
>>
>>> The File System Copy utility copies files byte by byte, if I'm not wrong.
>>> Could it be possible that the cp command works with blocks and moves them,
>>> which could be significantly more efficient?
>>>
>>> Also, how does the cp command work if the file is distributed on
>>> different data nodes?
>>>
>>> Thanks
>>> Kay
>>>
>>>
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[email protected]> wrote:
>>>
>>>> DistCP is a full-blown mapreduce job (mapper only, where the mappers do
>>>> a "fully" parallel copy to the destination).
>>>>
>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>>> issue a copy command for every source file.
>>>>
>>>> I have an additional question: how is CP, which is internal to a cluster,
>>>> optimized (if at all)?
>>>>
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think it's better to use Copy within the same cluster and distCP
>>>>> between clusters. The cp command is a hadoop internal parallel process and
>>>>> will not copy files locally.
>>>>>
>>>>> ------------------------------
>>>>> 麦树荣
>>>>>
>>>>> *From:* KayVajj <[email protected]>
>>>>> *Date:* 2013-04-11 06:20
>>>>> *To:* [email protected]
>>>>> *Subject:* Copy Vs DistCP
>>>>> I have a few questions regarding the usage of DistCP for copying
>>>>> files in the same cluster.
>>>>>
>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>> file size etc.) would influence the usage of one over the other?
>>>>>
>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>> cluster (not a data node), how does the cp command work:
>>>>> i) like an MR job, or
>>>>> ii) by copying files locally and then copying them back to the new
>>>>> location?
>>>>>
>>>>> Example of the copy command
>>>>>
>>>>> hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>
>>>>> Thanks, your responses are appreciated.
>>>>>
>>>>> -- Kay
>>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
