The Unix shell 'cp' only works on HDFS if you use FUSE, which makes the HDFS file system visible as a mounted Unix file system. Otherwise, Unix programs cannot read or write HDFS files.
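
A rough sketch of what that looks like in practice, assuming HDFS has already been mounted through FUSE. The mount point /mnt/hdfs and the helper name fuse_cp are hypothetical, chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical mount point where a FUSE driver exposes HDFS; adjust to your setup.
HDFS_MOUNT="${HDFS_MOUNT:-/mnt/hdfs}"

# fuse_cp SRC DST: copy with the plain Unix cp, using paths relative to the mount.
# This only works while the mount exists; without it, cp cannot see HDFS files.
fuse_cp() {
    src="$HDFS_MOUNT/$1"
    dst="$HDFS_MOUNT/$2"
    if [ ! -d "$HDFS_MOUNT" ]; then
        echo "mount point $HDFS_MOUNT not found" >&2
        return 1
    fi
    cp -- "$src" "$dst"
}
```

With such a mount in place, something like `fuse_cp some_location/file new_location/` would behave like any local copy, with all bytes passing through the machine that runs cp.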

On 04/11/2013 09:52 AM, KayVajj wrote:
Summing up, what would be the recommendations for copying?

1) DistCP
2) shell cp command
3) Using the File System API (FileUtils to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what DistCp does)
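
For options 1 and 2, the advice given further down this thread (plain cp for a few GB, DistCp for larger copies) could be sketched as a small wrapper. The 10 GiB cutoff and the function name choose_copy_cmd are assumptions made up for illustration, not measured recommendations. The function only prints the suggested command and does not run it:

```shell
#!/bin/sh
# Hypothetical cutoff: below this many bytes, prefer the serial shell copy.
# 10 GiB is an arbitrary illustration, not a benchmarked value.
THRESHOLD_BYTES="${THRESHOLD_BYTES:-10737418240}"

# choose_copy_cmd SIZE_BYTES SRC DST
# Prints (does not execute) the command suggested for a copy of SIZE_BYTES.
choose_copy_cmd() {
    size="$1"; src="$2"; dst="$3"
    if [ "$size" -lt "$THRESHOLD_BYTES" ]; then
        # Small copy: one client streams the data, no MapReduce job involved.
        echo "hdfs dfs -cp $src $dst"
    else
        # Large copy: DistCp submits a map-only job that copies in parallel.
        echo "hadoop distcp $src $dst"
    fi
}
```

The size argument could be taken from the first column of `hdfs dfs -du -s <path>`. On a busy cluster the cutoff would probably need to be higher, since DistCp has to wait for free map slots.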


I did not run any comparisons, as my dev cluster is just a two-node cluster and I am not sure how this would perform on a production cluster.

Kay


On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas wrote:

    Yes, makes sense. cp is serialized and simpler, and does not
    rely on the JobTracker, whereas distcp only submits a job and
    waits for its completion, so it can fail if tasks start to
    fail or time out.
    I have seen distcp fail and hang before, albeit not often.


    On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov wrote:

    If the cluster is busy with other jobs, distcp will wait for free
    map slots. Regular cp is more reliable and predictable, especially
    if you need to copy just several GB.

    On Apr 10, 2013 6:31 PM, "Azuryy Yu" wrote:

        The cp command is not parallel. It just calls FileSystem, even
        though DFSClient has multiple threads.

        DistCp can work well on the same cluster.


        On Thu, Apr 11, 2013 at 8:17 AM, KayVajj wrote:

            The File System copy utility copies files byte by byte, if
            I'm not wrong. Could it be that the cp command
            works with blocks and moves them, which could be
            significantly more efficient?


            Also, how does the cp command work if the file is
            distributed across different data nodes?

            Thanks
            Kay


            On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas wrote:

                DistCp is a full-blown MapReduce job (mapper only,
                where the mappers do a "fully" parallel copy to the
                destination).

                cp appears (correct me if I'm wrong) to simply invoke
                the FileSystem and issue a copy command for every
                source file.

                I have an additional question: how is cp, which is
                internal to a cluster, optimized (if at all)?



                On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 wrote:

                    Hi,
                    I think it's better to use cp within the same
                    cluster and DistCp between clusters. The cp
                    command is a Hadoop-internal parallel process and
                    will not copy files locally.
                    
------------------------------------------------------------------------
                    麦树荣
                    *From:* KayVajj
                    *Date:* 2013-04-11 06:20
                    *To:* [email protected]
                    *Subject:* Copy Vs DistCP
                    I have a few questions regarding the usage of
                    DistCp for copying files in the same cluster.


                    1) Which one is better within the same cluster, and
                    what factors (like file size etc.) would influence
                    the usage of one over the other?

                    2) When we run a cp command like the one below from
                    a client node of the cluster (not a data node), how
                    does the cp command work?
                         i) like an MR job
                        ii) copy the files locally and then copy them
                    back to the new location

                    Example of the copy command

                    hdfs dfs -cp /<some_location>/file /<new_location>/

                    Thanks, your responses are appreciated.

                    -- Kay




-- Jay Vyas
                http://jayunit100.blogspot.com




