No. distcp is actually a mapreduce job under the hood.

Warm Regards,
Tariq
cloudfront.blogspot.com
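For reference, distcp accepts file:// URIs on either side, so it can go local-to-HDFS as well as HDFS-to-local. A minimal sketch of both directions; the namenode host/port and all paths here are made up for illustration:

    # local filesystem -> HDFS
    hadoop distcp file:///data/staging/bigfile.gz hdfs://namenode:8020/user/rahul/bigfile.gz

    # HDFS -> local filesystem
    hadoop distcp hdfs://namenode:8020/user/rahul/bigfile.gz file:///data/restore/

One caveat, since distcp runs as a MapReduce job: as far as I understand, a file:// path is only useful when it is reachable from the nodes that run the map tasks (for example an NFS mount shared across the cluster). For data sitting on a single machine, plain hadoop fs -put / -get is the simpler tool.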
On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks to both of you!

Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[email protected]> wrote:

You can do that using file:///

Example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Tariq can you point me to some resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's fs?

Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:

You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul : Yes. distcp can do that.

And the bigger the files, the less metadata there is, hence less memory consumption.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files of combined size 10 TB, you need to have a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in-memory. Actually, memory is the most important metric when it comes to the NN.

Am I correct @Nitin?

@Thoihen : As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like "distcp" for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data never goes to the NN, but a lot of metadata about the data will live on the NN, so it's always a good idea to have a strong NN.
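To put rough numbers on the "bigger the files, less metadata" point, using the ~200 B-per-object figure above (the 128 MB block size and the file sizes below are just assumptions for illustration):

    10 TB as large files in 128 MB blocks  ->  ~82,000 block objects x 200 B   ≈  16 MB of NN heap
    10 TB as 1 MB files                    ->  ~10.5M file objects + ~10.5M block objects
                                               = ~21 million objects x 200 B   ≈  4 GB of NN heap

Same 10 TB of data, but a difference of more than two orders of magnitude in NameNode memory, which is why the small-files question matters more than the raw volume.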
On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of a "capable NN". As I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the dfs client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB that you want to upload to HDFS, several factors come into the picture:

1) Is the machine in the same network as your hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable hadoop cluster. By that I mean you have a capable namenode.

I would definitely not write files sequentially into HDFS. I would prefer to write files in parallel to HDFS to make use of the DFS write features and speed up the process. You can run the hdfs put command in parallel, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question: say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size when using the hadoop command line? Can the hadoop put command work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a regular frequency, retained on HDFS for some duration as needed, and from there sent to archival storage or deleted.

For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.
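As a small illustration of Nitin's earlier "run hdfs put in parallel" suggestion: one simple way is to fan out several hadoop fs -put processes at once. A rough sketch; the local directory, the target path and the parallelism factor of 8 are all made up:

    # push each local file with its own put process, at most 8 at a time
    ls /data/staging/*.gz | xargs -P 8 -I {} hadoop fs -put {} /user/thoihen/incoming/

Each put opens its own write pipeline to the datanodes, so aggregate throughput usually climbs until the client machine's disks or its network link become the bottleneck.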
In any case, if you want to write files to HDFS there are a few options available to you:

1) write your own dfs client which writes to dfs
2) use hdfs proxy
3) there is webhdfs (a rough curl sketch follows at the end of this thread)
4) the hdfs command line
5) data collection tools that come with support for writing to hdfs, like Flume etc.

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?

I don't think they are using the command line hadoop fs put to upload the files, as it would take too long. Or do they divide it into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.

Thanks
thoihen

--
Nitin Pawar
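For completeness, option 3 above (webhdfs) comes down to a two-step HTTP PUT against the NameNode's REST endpoint (it has to be switched on with dfs.webhdfs.enabled). A rough sketch, assuming the stock 50070 NameNode HTTP port of that era; the host names, user name and paths are made up:

    # step 1: ask the namenode where to write; no data is sent, the reply is a
    #         307 redirect whose Location header points at a datanode
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/bigfile.gz?op=CREATE&user.name=thoihen&overwrite=false"

    # step 2: send the actual bytes to the Location URL returned by step 1
    curl -i -X PUT -T /data/staging/bigfile.gz "<Location URL from step 1>"

The appeal is that it only needs HTTP access to the namenode and datanodes, so machines without a Hadoop client installed can still push files in.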
