There is a lesson in this, by the way: I just realized I pasted my access key/secret access key for the bucket in a public email. D'oh - changed ;)
Ognen

On Tue, Jan 28, 2014 at 10:55 AM, Ognen Duzlevski <[email protected]> wrote:

> Ahh. No, I do not have a job tracker. OK - I guess I need to set one up :)
>
> Thanks!
> Ognen
>
>
> On Tue, Jan 28, 2014 at 10:51 AM, Bryan Beaudreault
> <[email protected]> wrote:
>
>> Do you have a jobtracker? Without a jobtracker and tasktrackers, distcp
>> runs in LocalRunner mode, i.e. as a single-threaded process on the local
>> machine. The default behavior of the DFSClient is to write data locally
>> first, with replicas being placed off-rack, then on-rack.
>>
>> This would explain why everything seems to be going to the local machine;
>> it is also probably much slower than it could be.
>>
>>
>> On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski
>> <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am new to Hadoop and HDFS, so maybe I am not understanding things
>>> properly, but I have the following issue:
>>>
>>> I have set up a name node and a bunch of data nodes for HDFS. Each node
>>> contributes 1.6 TB of space, so the total space shown on the HDFS web
>>> front end is about 25 TB. I have set the replication factor to 3.
>>>
>>> I am copying large files from Amazon's S3 onto a single data node using
>>> the distcp command, like this:
>>>
>>> hadoop --config /etc/hadoop distcp
>>> s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json
>>> hdfs://10.10.0.198:54310/test/2013-12-03.json
>>>
>>> where 10.10.0.198 is the Hadoop name node.
>>>
>>> All I am seeing is that the machine I am running these commands on (one
>>> of the data nodes) is receiving all the files - they do not seem to be
>>> "spreading" around the HDFS cluster.
>>>
>>> Is this expected? Did I completely misunderstand the point of a parallel
>>> DISTRIBUTED file system? :)
>>>
>>> Thanks!
>>> Ognen
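
For reference, a minimal sketch of the MR1 setup Bryan describes, assuming a
Hadoop 1.x cluster (the thread predates YARN-speak, since both sides say
"jobtracker"); the JobTracker host/port below are illustrative, not taken
from the thread:

    <?xml version="1.0"?>
    <!-- mapred-site.xml, on the submitting client and every TaskTracker. -->
    <configuration>
      <property>
        <!-- Default is "local", which runs the job in-process via the
             LocalJobRunner - the single-threaded behavior seen above.
             Pointing this at a JobTracker lets distcp spread its copy
             tasks across the cluster. -->
        <name>mapred.job.tracker</name>
        <value>10.10.0.198:54311</value>
      </property>
    </configuration>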
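
On the leaked-credentials lesson at the top: the s3n keys do not have to
appear in the distcp URI at all. A sketch using the stock s3n filesystem
properties, with placeholder key values:

    <!-- core-site.xml: keeps S3 credentials out of the command line,
         shell history, and pasted emails. -->
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_KEY</value>
    </property>

With those in place, the same copy needs no embedded credentials:

    hadoop --config /etc/hadoop distcp \
        s3n://data-pipeline/large_data/2013-12-02.json \
        hdfs://10.10.0.198:54310/test/2013-12-03.json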
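
And to confirm whether a file's blocks really are piling up on one datanode,
fsck can list where every replica landed; for example, against the path from
the thread:

    # Lists each block of the file and the datanodes holding its replicas;
    # with replication 3, each block should show three distinct locations
    # (the first replica local to the writer, per Bryan's explanation).
    hadoop fsck /test/2013-12-03.json -files -blocks -locations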
