@Tariq, can you point me to a resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's fs?

Rahul
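(For reference, a minimal sketch of what a local-to-HDFS distcp invocation can look like; the paths and namenode address below are made up, and this assumes the file:// source sits on storage that every worker node can read, such as an NFS mount, since distcp runs as a MapReduce job whose map tasks do the copying:

    # copies via MapReduce; the file:// source must be readable from every node running a map task
    hadoop distcp file:///mnt/shared/dataset hdfs://namenode:8020/user/rahul/dataset

For data that lives only on a single machine's local disk, hadoop fs -put or -copyFromLocal from that machine is the usual route instead.)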
On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:

You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!


On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul: Yes, distcp can do that.

And the bigger the files, the less metadata there is and hence the less memory consumption.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files of combined size 10 TB, you need to have a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul


On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

--
Nitin Pawar


On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen: As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like "distcp" for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq
cloudfront.blogspot.com
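(A rough worked example of that figure, assuming the ~200 B per object quoted above and 128 MB blocks: 10 TB stored as 1 GB files is about 10,000 files and 80,000 blocks, i.e. roughly 90,000 objects and under ~20 MB of namenode heap for their metadata; the same 10 TB stored as 1 MB files is about 10 million files plus 10 million blocks, i.e. around 20 million objects and on the order of 4 GB of heap. That is why "bigger files, less metadata" matters.)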
On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture, because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it's always a good idea to have a strong NN.

--
Nitin Pawar


On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of a "capable NN". As I know, the NN is not part of the actual data write pipeline, meaning that the data does not travel through the NN; the dfs client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul


On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct answer, yes or no.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine in the same network as your hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable hadoop cluster. By that I mean you have a capable namenode.

I would definitely not write files sequentially into HDFS. I would prefer to write files in parallel to HDFS, to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

--
Nitin Pawar
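(A minimal sketch of what "put in parallel" can look like from a single edge node; the source directory, target directory and degree of parallelism below are made up:

    # push each local file as its own hadoop fs -put, 8 copies running at a time
    find /data/incoming -type f -print0 | \
        xargs -0 -P 8 -I{} hadoop fs -put {} /user/thoihen/incoming/

Each stream gets its own HDFS write pipeline, so this tends to scale until the edge node's disks or NIC are saturated.)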
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size when using the hadoop command line? Can the hadoop put command work with huge data?

Thanks in advance


On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It's an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a frequent basis, retained on HDFS for some duration as needed, and from there sent to archives or deleted.

For data management products, you can look at Falcon, which is open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own dfs client which writes to dfs
2) Use hdfs proxy
3) There is webhdfs
4) Command line hdfs
5) Data collection tools that come with support for writing to hdfs, like Flume etc.

--
Nitin Pawar


On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?

I don't think they would be using the command line hadoop fs put to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.

Thanks
thoihen
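(For the webhdfs route Nitin lists above, a minimal sketch of the usual two-step upload; the hostnames, ports, user and paths here are made up and use the classic 50070/50075 HTTP defaults:

    # step 1: ask the namenode for a write location; it replies with a 307 redirect to a datanode
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/data/part-0001?op=CREATE&user.name=thoihen"

    # step 2: send the file body to the datanode URL returned in the Location header
    curl -i -X PUT -T part-0001 "http://datanode:50075/webhdfs/v1/user/thoihen/data/part-0001?op=CREATE&..."

This goes over plain HTTP, so it can be handy when the uploading machine has no hadoop client installed.)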
