That was a hidden shameless plug, Ted ;-)

The main disadvantage of fs -cp is that all data has to transit via the machine you issue the command on. Depending on the size of the data you want to copy, that can be a killer. DistCp is distributed, as its name implies, so it has no bottleneck of this kind.

On Apr 14, 2013 6:15 AM, "Ted Dunning" <[email protected]> wrote:
>
> Lance,
>
> Never say never.
>
> Linux programs can read from the right kind of Hadoop cluster without
> using FUSE.
>
>
> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <[email protected]> wrote:
>
>> Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>> system visible as a Unix mounted file system. Otherwise, Unix programs
>> cannot read or write HDFS files.
>>
>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>
>> Summing up, what would be the recommendations for copy?
>>
>> 1) DistCP
>> 2) shell cp command
>> 3) Using File System API (FileUtils to be precise) inside of a Java
>> program
>> 4) A MR with an Identity Mapper and no Reducer (maybe this is what
>> DistCP does)
>>
>> I did not run any comparisons as my dev cluster is just a two node
>> cluster and I am not sure how this would perform on a production cluster.
>>
>> Kay
>>
>>
>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[email protected]> wrote:
>>
>>> Yes, makes sense... cp is serialized and simpler, and does not rely on
>>> the jobtracker, whereas distcp actually only submits a job and waits for
>>> completion.
>>> So it can fail if tasks start to fail or time out.
>>> I have seen distcp fail and hang before, albeit not often.
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[email protected]>
>>> wrote:
>>>
>>> If the cluster is busy with other jobs, distcp will wait for free map
>>> slots. Regular cp is more reliable and predictable, especially if you need
>>> to copy just several GB.
>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[email protected]> wrote:
>>>
>>>> The CP command is not parallel. It just calls FileSystem, even if
>>>> DFSClient has multiple threads.
>>>>
>>>> DistCp can work well on the same cluster.
>>>>
>>>>
>>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[email protected]> wrote:
>>>>
>>>>> The File System Copy utility copies files byte by byte if I'm not
>>>>> wrong. Could it be possible that the cp command works with blocks and
>>>>> moves them, which could be significantly more efficient?
>>>>>
>>>>> Also, how does the cp command work if the file is distributed on
>>>>> different data nodes?
>>>>>
>>>>> Thanks
>>>>> Kay
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[email protected]> wrote:
>>>>>
>>>>>> DistCP is a full blown mapreduce job (mapper only, where the
>>>>>> mappers do a "fully" parallel copy to the destination).
>>>>>>
>>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>>> and issue a copy command for every source file.
>>>>>>
>>>>>> I have an additional question: how is CP which is internal to a
>>>>>> cluster optimized (if at all)?
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think it's better to use Copy in the same cluster and distCP
>>>>>>> between clusters, and the cp command is a hadoop internal parallel
>>>>>>> process and will not copy files locally.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> 麦树荣
>>>>>>>
>>>>>>> *From:* KayVajj <[email protected]>
>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>> I have a few questions regarding the usage of DistCP for
>>>>>>> copying files in the same cluster.
>>>>>>>
>>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>>
>>>>>>> 2) When we run a cp command like below from a client node of the
>>>>>>> cluster (not a data node), how does the cp command work?
>>>>>>> i) like an MR job
>>>>>>> ii) copy files locally and then copy them back at the new
>>>>>>> location.
>>>>>>>
>>>>>>> Example of the copy command
>>>>>>>
>>>>>>> hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>
>>>>>>> Thanks, your responses are appreciated.
>>>>>>>
>>>>>>> -- Kay
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jay Vyas
>>>>>> http://jayunit100.blogspot.com
>>>>>>
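
To put rough numbers on the trade-off discussed in this thread (cp funnels every byte through one machine, while DistCp spreads the work over map tasks but has to wait for a job to start), here is a back-of-the-envelope sketch in Python. It is only a model, not a measurement. The function names and every figure in it (100 MB/s per stream, 20 map tasks, 30 s of job startup) are hypothetical values I picked for illustration, and it assumes the data splits evenly across the map tasks, which a copy dominated by one huge file may not do.

```python
def cp_seconds(size_gb, client_mb_per_s):
    """Estimated time for a single-client copy.

    Assumption: all data streams through the one machine that issued the
    command, so its throughput caps the whole copy.
    """
    return size_gb * 1024 / client_mb_per_s


def distcp_seconds(size_gb, per_map_mb_per_s, maps, startup_s=0.0):
    """Estimated time for a map-only distributed copy.

    Assumptions: the data divides evenly across `maps` tasks that all run
    at once, and `startup_s` stands in for job submission and waiting for
    free map slots.
    """
    return startup_s + size_gb * 1024 / (per_map_mb_per_s * maps)


if __name__ == "__main__":
    # Hypothetical inputs: 100 MB/s per stream, 20 maps, 30 s job startup.
    for size_gb in (1, 10, 1000):
        single = cp_seconds(size_gb, 100)
        distributed = distcp_seconds(size_gb, 100, 20, startup_s=30)
        print(f"{size_gb:>5} GB  cp ~{single:.0f}s  distcp ~{distributed:.0f}s")
```

With these made-up inputs the 1 GB copy comes out ahead with plain cp, because the startup cost dominates, while the 1000 GB copy comes out far ahead with DistCp. That matches the advice above: cp for a few GB, DistCp when the single machine becomes the bottleneck.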
