You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com
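A minimal sketch of the local-to-HDFS distcp copy discussed in the thread below; the paths and the namenode URI are placeholders, not values from the thread:

    # Copy a local directory tree into HDFS with distcp (placeholder paths/URI).
    # Note: distcp runs as a MapReduce job, so a file:// source must be readable
    # from the nodes that run the copy tasks (e.g. an NFS mount); otherwise a
    # plain "hadoop fs -put" from the machine holding the data is simpler.
    hadoop distcp file:///data/incoming hdfs://namenode:8020/user/thoihen/incoming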
On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul : Yes, distcp can do that.

And the bigger the files, the less metadata there is, hence less memory consumption.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is all held in memory. Memory is actually the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen : As Nitin has said, with that much data you don't usually just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable of storing the metadata for your entire dataset. Data never goes to the NN, but a lot of metadata about the data lives on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of "capable NN". As I understand it, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the dfs client only contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB that you want to upload to HDFS, several factors come into the picture:

1) Is the machine in the same network as your Hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable namenode.

I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS, to use the DFS write features to speed up the process. You can run the hdfs put command in parallel, and in my experience it has not failed when we write a lot of data.
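As a rough illustration of the parallel put Nitin describes above, a minimal shell sketch; the directory and HDFS paths are placeholders:

    # Push several top-level directories into HDFS in parallel instead of
    # one sequential put; paths are placeholders.
    for d in /data/incoming/part-*; do
        hadoop fs -put "$d" /user/thoihen/incoming/ &
    done
    wait   # block until every background put has finished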
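A back-of-the-envelope version of the metadata sizing point made earlier in the thread, assuming roughly 200 bytes per namenode object (file or block, as Tariq mentions) and an assumed 128 MB block size:

    # 10 TB stored as large files with 128 MB blocks (assumed block size):
    #   10,485,760 MB / 128 MB = 81,920 blocks  ->  ~80 K objects * 200 B ≈ 16 MB of NN heap
    # 10 TB stored as 1 MB files:
    #   10,485,760 files + 10,485,760 blocks   ->  ~21 M objects * 200 B ≈ 4 GB of NN heap
    echo $((10 * 1024 * 1024 / 128))   # block count with 128 MB blocks
    echo $((10 * 1024 * 1024))         # file count with 1 MB files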
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files totalling 10 TB? Is there any limit to the file size when using the hadoop command line? Can the hadoop put command line work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where data is written to HDFS on a regular frequency, retained on HDFS for some duration as needed, and from there sent to archives or deleted.

For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own dfs client which writes to dfs
2) Use the HDFS proxy
3) There is WebHDFS
4) Command line hdfs
5) Data collection tools that come with support for writing to HDFS, like Flume etc.

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS back to the local file system?

I don't think they would be using the command line "hadoop fs put" to upload the files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and then use "hadoop fs put"?

Or do they use some tool to upload huge files?

Please help me.

Thanks
thoihen
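Relating to option 3 (WebHDFS) in Nitin's list above, a hedged two-step example; host names, port and user are placeholders (50070 was the default NameNode HTTP port at the time):

    # Step 1: ask the NameNode to create the file; it replies 307 with a
    # Location header pointing at a DataNode.
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/data.bin?op=CREATE&user.name=thoihen"
    # Step 2: send the actual bytes to the DataNode URL returned in that
    # Location header.
    curl -i -X PUT -T data.bin "<Location-header-from-step-1>"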
