Re: Copy Vs DistCP

Lance Norskog Fri, 12 Apr 2013 10:16:15 -0700

Shell 'cp' only works if you use 'fuse', which makes the HDFS filesystem visible as a Unix mounted file system. Otherwise, Unix programs cannot read or write HDFS files.
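
A rough sketch of that distinction, under stated assumptions: the mount point /mnt/hdfs and the helper name hdfs_copy_cmd are made up for illustration. The helper only prints the command that would apply, and it checks that the directory exists as a stand-in for a real mount check.

```shell
# Hypothetical helper: print the copy command that applies.
# $1 = directory where HDFS is assumed to be fuse-mounted (e.g. /mnt/hdfs)
# $2 = source path inside HDFS, $3 = destination path inside HDFS
hdfs_copy_cmd() {
  mount_point=$1
  src=$2
  dst=$3
  if [ -d "$mount_point" ]; then
    # Mounted: ordinary Unix tools can see the files.
    echo "cp $mount_point$src $mount_point$dst"
  else
    # Not mounted: only the Hadoop client can reach HDFS.
    echo "hdfs dfs -cp $src $dst"
  fi
}

hdfs_copy_cmd /mnt/hdfs /data/in/file /data/out/
```

Which of the two commands gets printed depends on whether the assumed mount directory exists on the machine where this runs.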


On 04/11/2013 09:52 AM, KayVajj wrote:

Summing up, what would be the recommendations for copying?


1) DistCP
2) shell cp command
3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what DistCP does)

I did not run any comparisons, as my dev cluster is just a two-node cluster and I am not sure how this would perform on a production cluster.

Kay

On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[email protected]> wrote:


    Yes, makes sense... cp is serialized and simpler, and does not
    rely on the JobTracker, whereas distcp only submits a job and
    waits for completion.
    So it can fail if tasks start to fail or time out.
    I have seen distcp fail and hang before, albeit not often.

    On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov
    <[email protected]> wrote:

    If the cluster is busy with other jobs, distcp will wait for free map
    slots. Regular cp is more reliable and predictable, especially if
    you need to copy just several GB.

    On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[email protected]> wrote:

        The cp command is not parallel. It just calls FileSystem, even
        though DFSClient has multiple threads.

        DistCp can work well on the same cluster.


        On Thu, Apr 11, 2013 at 8:17 AM, KayVajj
        <[email protected]> wrote:

            The FileSystem copy utility copies files byte by byte, if
            I'm not wrong. Could it be possible that the cp command
            works with blocks and moves them, which could be
            significantly more efficient?


            Also, how does the cp command work if the file is
            distributed across different data nodes?

            Thanks
            Kay


            On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas
            <[email protected]> wrote:

                DistCP is a full-blown MapReduce job (mapper only,
                where the mappers do a "fully" parallel copy to the
                destination).

                CP appears (correct me if I'm wrong) to simply invoke
                the FileSystem and issue a copy command for every
                source file.

                I have an additional question: how is CP, which is
                internal to a cluster, optimized (if at all)?



                On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣
                <[email protected]> wrote:

                    Hi,
                    I think it's better to use Copy within the same
                    cluster and distCP between clusters. The cp
                    command is a Hadoop-internal parallel process and
                    will not copy files locally.
                    
------------------------------------------------------------------------
                    麦树荣
                    *From:* KayVajj <[email protected]>
                    *Date:* 2013-04-11 06:20
                    *To:* [email protected]
                    *Subject:* Copy Vs DistCP
                    I have a few questions regarding the usage of
                    DistCP for copying files in the same cluster.


                    1) Which one is better within the same cluster, and
                    what factors (like file size, etc.) would influence
                    the choice of one over the other?

                    2) When we run a cp command like the one below from
                    a client node of the cluster (not a data node), how
                    does the cp command work?
                         i) like an MR job
                        ii) copy the files locally and then copy them
                    back to the new location

                    Example of the copy command

                    hdfs dfs -cp /<some_location>/file /<new_location>/

                    Thanks, your responses are appreciated.

                    -- Kay

--Jay Vyas

                http://jayunit100.blogspot.com
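
The rule of thumb that emerges from the thread (plain cp for copies of a few GB, distcp for larger ones) could be sketched as a small shell helper. This is only an illustration under stated assumptions: the helper name choose_copy_cmd and the 10 GB cutoff are hypothetical, not Hadoop defaults, and the function prints the command instead of running it.

```shell
# Hypothetical helper: choose a copy tool from the total size in bytes.
# The 10 GB cutoff is an assumption for illustration, not a Hadoop setting.
choose_copy_cmd() {
  size_bytes=$1
  src=$2
  dst=$3
  threshold=$((10 * 1024 * 1024 * 1024))
  if [ "$size_bytes" -lt "$threshold" ]; then
    # Small copy: serialized client-side copy, no MapReduce job involved.
    echo "hdfs dfs -cp $src $dst"
  else
    # Large copy: map-only MapReduce job that copies in parallel.
    echo "hadoop distcp $src $dst"
  fi
}

# The size could be taken from the first column of: hdfs dfs -du -s /src/dir
choose_copy_cmd 5368709120 /src/dir /dst/dir   # 5 GB, prints: hdfs dfs -cp /src/dir /dst/dir
```

On a busy cluster the cutoff may reasonably be set higher, since distcp has to wait for free map slots before it starts copying.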
