I had said that if you use distcp to copy data *from the local FS to HDFS*, then you won't be able to exploit parallelism, since the entire file is present on a single machine. So no multiple TTs.
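For concreteness, a minimal sketch of the copy being discussed (local FS source, HDFS destination). The paths and the NameNode address below are made-up placeholders, not taken from this thread:

    # Hypothetical: distcp with a local (file://) source and an HDFS destination.
    # The file:// URI is resolved on whichever node runs each map task, so when the
    # data sits on one machine's local disk the copy cannot really be spread across
    # the cluster, which is the limitation described above.
    hadoop distcp file:///data/staging/logs hdfs://namenode:8020/user/tariq/logs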
Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <[email protected]> wrote:

Yes, it's an MR job under the hood. My question was that you wrote that using distcp you lose the benefits of parallel processing in Hadoop. I think the MR job of distcp divides files into individual map tasks based on the total size of the transfer, so multiple mappers would still be spawned if the size of the transfer is huge, and they would work in parallel.

Correct me if there is anything wrong!

Thanks,
Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[email protected]> wrote:

No. distcp is actually a MapReduce job under the hood.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks to both of you!

Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[email protected]> wrote:

You can do that using file:///

Example:

    hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Tariq, can you point me to some resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's FS?

Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:

You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul: Yes, distcp can do that.

And the bigger the files, the less metadata there is, hence less memory consumption.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, I think the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN.

Am I correct, @Nitin?
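As a rough back-of-envelope illustration of that 200 B figure (the block size, file count, and total size below are assumed for the example, not numbers from this thread):

    # Approximate NameNode metadata for 10 TB stored as 1,000 large files,
    # assuming a 128 MB block size and ~200 bytes of metadata per object.
    BYTES_PER_OBJECT=200
    DATA_MB=$((10 * 1024 * 1024))     # 10 TB expressed in MB
    BLOCKS=$((DATA_MB / 128))         # ~81,920 blocks
    FILES=1000
    OBJECTS=$((FILES + BLOCKS))
    echo "objects: $OBJECTS, approx metadata bytes: $((OBJECTS * BYTES_PER_OBJECT))"
    # ~82,920 objects * 200 B is roughly 16 MB of heap, which is trivial. Store the
    # same 10 TB as millions of small files and the object count, and hence the NN
    # memory required, grows by orders of magnitude.

That is why "bigger files, less metadata" translates directly into less NameNode memory.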
@Thoihen: As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it's always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel DFS writes to HDFS are great, but I could not understand the meaning of a "capable NN". As I know, the NN would not be part of the actual data write pipeline, meaning that the data would not travel through the NN; the DFS client would contact the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine in the same network as your Hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable NameNode.

I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs -put command to upload these files of size 10 TB, and is there any limit to the file size using the Hadoop command line? Can hadoop put on the command line work with huge data?
Thanks in advance.

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It's an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a regular frequency, retained on HDFS for some duration as needed, and from there sent to archival or deleted.

For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own DFS client which writes to DFS
2) Use HDFS proxy
3) There is WebHDFS
4) Command-line HDFS
5) Data collection tools that come with support for writing to HDFS, like Flume etc.

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and after processing how they download those files from HDFS to the local file system?

I don't think they would be using the command-line hadoop fs put to upload the files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command-line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.

Thanks,
thoihen

--
Nitin Pawar
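As an illustration of Nitin's suggestion earlier in the thread to run puts in parallel rather than sequentially, here is a minimal shell sketch. The source directory, the HDFS target path, and the limit of eight concurrent uploads are assumptions for the example, not something prescribed in the thread:

    #!/usr/bin/env bash
    # Hypothetical parallel upload: run up to 8 "hadoop fs -put" commands at once,
    # one per local file, instead of copying the files one after another.
    SRC_DIR=/data/staging             # local directory holding the files (assumed)
    DEST_DIR=/user/thoihen/incoming   # existing HDFS target directory (assumed)

    find "$SRC_DIR" -maxdepth 1 -type f -print0 |
        xargs -0 -P 8 -I{} hadoop fs -put {} "$DEST_DIR"/

The WebHDFS option in Nitin's list achieves the same kind of upload over plain HTTP, so the machine holding the data only needs an HTTP client rather than a full Hadoop client installation.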
