Yes, that makes sense. cp is serialized and simpler, and does not rely on the JobTracker, whereas distcp only submits a job and waits for completion, so it can fail if tasks start to fail or time out. I have seen distcp fail and hang before, albeit not often.

Sent from my iPhone

On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[email protected]> wrote:

> If the cluster is busy with other jobs, distcp will wait for free map slots.
> Regular cp is more reliable and predictable, especially if you need to copy
> just several GB.
>
> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[email protected]> wrote:
>> The CP command is not parallel. It just calls FileSystem, even if DFSClient
>> has multiple threads.
>>
>> DistCp can work well on the same cluster.
>>
>>
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[email protected]> wrote:
>>> The File System Copy utility copies files byte by byte, if I'm not wrong.
>>> Could it be possible that the cp command works with blocks and moves them,
>>> which could be significantly more efficient?
>>>
>>>
>>> Also, how does the cp command work if the file is distributed across
>>> different data nodes?
>>>
>>> Thanks
>>> Kay
>>>
>>>
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[email protected]> wrote:
>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
>>>> "fully" parallel copy to the destination).
>>>>
>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>>> issue a copy command for every source file.
>>>>
>>>> I have an additional question: how is CP, which is internal to a cluster,
>>>> optimized (if at all)?
>>>>
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I think it's better to use Copy within the same cluster and DistCP
>>>>> between clusters. The cp command is a Hadoop-internal parallel process
>>>>> and will not copy files locally.
>>>>>
>>>>> 麦树荣
>>>>>
>>>>> From: KayVajj
>>>>> Date: 2013-04-11 06:20
>>>>> To: [email protected]
>>>>> Subject: Copy Vs DistCP
>>>>> I have a few questions regarding the usage of DistCP for copying files in
>>>>> the same cluster.
>>>>>
>>>>>
>>>>> 1) Which one is better within the same cluster, and what factors (like file
>>>>> size etc.) would influence the usage of one over the other?
>>>>>
>>>>> 2) When we run a cp command like the one below from a client node of the
>>>>> cluster (not a data node), how does the cp command work:
>>>>> i) like an MR job
>>>>> ii) copy files locally and then copy them back to the new location.
>>>>>
>>>>> Example of the copy command
>>>>>
>>>>> hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>
>>>>> Thanks, your responses are appreciated.
>>>>>
>>>>> -- Kay
>>>>
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
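
To make the trade-off discussed in this thread concrete, here is a minimal sketch of choosing between the two tools by data size. The helper name `choose_copy_cmd`, the 10 GB cutoff, and the example paths are all assumptions made up for illustration; nobody in the thread proposed a specific threshold. The script only prints the command it would pick, so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical cutoff: below this size, prefer the simple client-side copy.
# 10 GB is an illustrative assumption, not a value taken from the thread.
THRESHOLD_BYTES=$((10 * 1024 * 1024 * 1024))

# choose_copy_cmd SIZE_BYTES SRC DST
# Prints the copy command to use; it does not execute anything.
choose_copy_cmd() {
    if [ "$1" -lt "$THRESHOLD_BYTES" ]; then
        # Small copy: one client copies the files serially through the
        # FileSystem API and does not depend on free map slots.
        echo "hdfs dfs -cp $2 $3"
    else
        # Large copy: a map-only MapReduce job copies files in parallel,
        # but it has to wait for map slots and can fail if tasks fail.
        echo "hadoop distcp $2 $3"
    fi
}

# Example with made-up paths and a 2 GB size.
choose_copy_cmd 2147483648 /data/in/file /data/out/
# prints: hdfs dfs -cp /data/in/file /data/out/
```

In practice the size argument could be taken from the first column of `hdfs dfs -du -s <path>`, and the cutoff would need tuning to how busy the cluster usually is.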
