Shell 'cp' only works if you use 'fuse', which makes the HDFS file
system visible as a Unix mounted file system. Otherwise, Unix programs
cannot read or write HDFS files.
On 04/11/2013 09:52 AM, KayVajj wrote:
Summing up, what would be the recommendations for copying?
1) DistCP
2) shell cp command
3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what
DistCP does)
I did not run any comparisons, as my dev cluster is just a two-node
cluster and I am not sure how this would perform on a production cluster.
Kay
On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <[email protected]> wrote:
Yes, makes sense... cp is serial and simpler, and does not rely on the
JobTracker, whereas distcp only submits a job and waits for completion,
so it can fail if tasks start to fail or time out.
I have seen distcp fail and hang before, albeit not often.
On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <[email protected]> wrote:
If the cluster is busy with other jobs, distcp will wait for free map
slots. Regular cp is more reliable and predictable, especially if
you only need to copy a few GB.
On Apr 10, 2013 6:31 PM, "Azuryy Yu" <[email protected]> wrote:
The cp command is not parallel; it just calls FileSystem, even
though DFSClient has multiple threads.
DistCp can work well within the same cluster.
On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <[email protected]> wrote:
The File System copy utility copies files byte by byte, if
I'm not wrong. Could it be that the cp command
works with blocks and moves them, which could be
significantly more efficient?
Also, how does the cp command work if the file is
distributed across different data nodes?
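As a rough illustration of what a byte-stream copy through the client
looks like, here is a minimal Java sketch. It uses local files and
java.nio as stand-ins for HDFS streams, and the class name
StreamCopySketch and the 4096-byte buffer are made up for the example;
it is not Hadoop's actual code. The point is only that every byte
passes through the process doing the copy, whichever nodes hold the
blocks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: local files stand in for HDFS input/output streams.
class StreamCopySketch {
    // Reads src through a buffer held by the calling process and writes
    // the bytes to dst. Returns the number of bytes copied.
    public static long copy(Path src, Path dst, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".bin");
        Path dst = Files.createTempFile("dst", ".bin");
        Files.write(src, new byte[10_000]);
        System.out.println(copy(src, dst, 4096) + " bytes copied");
    }
}
```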
Thanks
Kay
On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <[email protected]> wrote:
DistCP is a full-blown MapReduce job (mapper only,
where the mappers do a "fully" parallel copy to the
destination).
CP appears (correct me if I'm wrong) to simply invoke
the FileSystem and issue a copy command for every
source file.
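To make the contrast concrete, here is a small Java sketch of the two
patterns: one copy per source file issued in sequence, versus the same
copies fanned out to workers. Local files and a thread pool are
stand-ins chosen for illustration; DistCP itself runs map tasks on the
cluster, not threads, and the class and method names here are
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: threads stand in for map tasks, local files for HDFS.
class SerialVsParallelCopy {
    // cp-style: one copy per source file, one after another, in the caller.
    public static void copySerial(List<Path> sources, Path destDir) throws IOException {
        for (Path src : sources) {
            Files.copy(src, destDir.resolve(src.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // DistCP-style analogy: each file is handed to a worker; the caller
    // only submits the work and waits for completion.
    public static void copyParallel(List<Path> sources, Path destDir, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (Path src : sources) {
                Callable<Void> task = () -> {
                    Files.copy(src, destDir.resolve(src.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    return null;
                };
                pending.add(pool.submit(task));
            }
            // A failed worker surfaces here as an ExecutionException.
            for (Future<Void> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The parallel variant only pays off when there are many files and idle
workers; for a handful of files the serial loop has less that can go
wrong.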
I have an additional question: how is CP, which is
internal to a cluster, optimized (if at all)?
On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <[email protected]> wrote:
Hi,
I think it's better to use cp within the same cluster
and DistCP between clusters. The cp
command is a Hadoop-internal parallel process and
will not copy files locally.
------------------------------------------------------------------------
麦树荣
*From:* KayVajj <[email protected]>
*Date:* 2013-04-11 06:20
*To:* [email protected]
*Subject:* Copy Vs DistCP
I have a few questions regarding the usage of
DistCP for copying files within the same cluster.
1) Which one is better within the same cluster, and
what factors (like file size etc.) would influence
the choice of one over the other?
2) When we run a cp command like the one below from a
client node of the cluster (not a data node), how
does the cp command work?
i) Like an MR job?
ii) Does it copy the files locally and then copy them
back to the new location?
Example of the copy command
hdfs dfs -cp /<some_location>/file /<new_location>/
Thanks, your responses are appreciated.
-- Kay
--
Jay Vyas
http://jayunit100.blogspot.com