Hello, I am new to Hadoop and HDFS, so I may be misunderstanding something, but I have the following issue:
I have set up a NameNode and a number of DataNodes for HDFS. Each node contributes 1.6 TB of space, so the total capacity shown on the HDFS web front end is about 25 TB. I have set the replication factor to 3. I am downloading large files from Amazon S3 using distcp, running the command on a single data node, like this:

hadoop --config /etc/hadoop distcp s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json hdfs://10.10.0.198:54310/test/2013-12-03.json

where 10.10.0.198 is the Hadoop NameNode. What I am seeing is that the machine I run the command on (one of the data nodes) ends up holding all of the file data; the blocks do not seem to be "spreading" around the HDFS cluster. Is this expected? Did I completely misunderstand the point of a parallel DISTRIBUTED file system? :)

Thanks!
Ognen
