Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

> Sorry for barging in, guys. I think Nitin is talking about this:
>
> Every file and block in HDFS is treated as an object, and for each object
> around 200 B of metadata gets created. So the NN should be powerful enough
> to handle that much metadata, since it is all held in memory. Memory is
> actually the most important metric when it comes to the NN.
>
> Am I correct, @Nitin?
>
> @Thoihen: As Nitin has said, when you talk about that much data you don't
> just do a "put". You could use something like "distcp" for parallel
> copying. A better approach would be to use a data aggregation tool like
> Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own
> data aggregation tool, called Scribe, for this purpose.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:
>
>> The NN would still be in the picture because it will be writing a lot of
>> metadata for each individual file. So you will need an NN capable enough
>> to store the metadata for your entire dataset. The data itself never goes
>> to the NN, but a lot of metadata about the data will live on the NN, so it
>> is always a good idea to have a strong NN.
>>
>>
>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>> [email protected]> wrote:
>>
>>> @Nitin, writing to HDFS with parallel dfs clients is great, but I could
>>> not understand what you mean by a capable NN. As I understand it, the NN
>>> is not part of the actual data write pipeline; the data does not travel
>>> through the NN. The dfs client contacts the NN from time to time to get
>>> the locations of the DNs where the data blocks should be stored.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> Is it safe? There is no direct yes-or-no answer.
>>>>
>>>> When you say you have files worth 10 TB and you want to upload them to
>>>> HDFS, several factors come into the picture:
>>>>
>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>> 2) Is there a guarantee that the network will not go down?
>>>>
>>>> And, most importantly, I assume that you have a capable Hadoop cluster.
>>>> By that I mean you have a capable namenode.
>>>>
>>>> I would definitely not write files sequentially into HDFS. I would
>>>> prefer to write files in parallel to HDFS to make use of the DFS write
>>>> features and speed up the process. You can run the hdfs put command in
>>>> parallel, and in my experience it has not failed when writing a lot of
>>>> data.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:
>>>>
>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>
>>>>> But I have one more question: say I have 10 TB of data in the pipeline.
>>>>>
>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>> files of size 10 TB, and is there any limit to the file size when using
>>>>> the hadoop command line? Can the hadoop put command line work with huge
>>>>> data?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> First of all, most companies do not get 100 PB of data in one go. It
>>>>>> is an accumulating process, and most companies have a data pipeline in
>>>>>> place where the data is written to HDFS on a regular frequency,
>>>>>> retained on HDFS for some duration as needed, and from there sent to
>>>>>> archives or deleted.
>>>>>>
>>>>>> For data management products, you can look at Falcon, which was open
>>>>>> sourced by InMobi along with Hortonworks.
>>>>>>
>>>>>> In any case, if you want to write files to HDFS there are a few
>>>>>> options available to you:
>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>> 2) Use hdfs proxy
>>>>>> 3) There is webhdfs
>>>>>> 4) Command line hdfs
>>>>>> 5) Data collection tools, like Flume etc., come with support for
>>>>>> writing to hdfs
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo,
>>>>>>> etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop
>>>>>>> HDFS cluster for processing, and after processing, how they download
>>>>>>> those files from HDFS to the local file system?
>>>>>>>
>>>>>>> I don't think they use the command line hadoop fs put to upload the
>>>>>>> files, as it would take too long. Or do they divide the data into,
>>>>>>> say, 10 parts of 10 petabytes each, compress them, and use the
>>>>>>> command line hadoop fs put?
>>>>>>>
>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>
>>>>>>> Please help me.
>>>>>>>
>>>>>>> Thanks
>>>>>>> thoihen
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>>
>>
>> --
>> Nitin Pawar
>
>
> --
> Nitin Pawar
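To put rough numbers on Tariq's point about NameNode memory, here is a
back-of-the-envelope sketch. The ~200 bytes per object figure is the one
quoted in the thread; the block size and file counts are illustrative
assumptions, not measurements.

    # Rough NameNode heap estimate based on the ~200 bytes/object figure
    # quoted in this thread. Block size and file counts are assumptions.
    BYTES_PER_OBJECT = 200           # approx. metadata per file/block object
    BLOCK_SIZE = 128 * 1024 ** 2     # assumed 128 MB HDFS block size

    def namenode_heap_estimate(total_bytes, num_files):
        """Very rough NN heap needed to track the given dataset."""
        num_blocks = total_bytes // BLOCK_SIZE
        # one object per file plus one per block; directory objects and
        # block replicas add more in practice
        return (num_files + num_blocks) * BYTES_PER_OBJECT

    # 10 TB in 10,000 files -> roughly 18 MB of metadata, trivial for the NN
    print(namenode_heap_estimate(10 * 1024 ** 4, 10_000) / 1024 ** 2, "MB")

    # 100 PB in 100 million files -> on the order of 175 GB of metadata
    print(namenode_heap_estimate(100 * 1024 ** 5, 100_000_000) / 1024 ** 3, "GB")

So for the 10 TB case raised in this thread the metadata itself is small;
the NN memory concern really kicks in at the 100 PB scale Thoihen asked
about, which is why Nitin keeps stressing a "capable" NN.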
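To illustrate Nitin's suggestion of running the put in parallel rather than
copying files one by one, a minimal sketch that fans a local staging
directory out across several "hadoop fs -put" processes. The paths and
worker count are hypothetical; each client streams its bytes directly to
DataNodes, and the NameNode only records metadata for the new files.

    # Minimal sketch: parallel "hadoop fs -put" instead of a sequential copy.
    # Paths and worker count below are illustrative assumptions.
    import glob
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    HDFS_TARGET = "/data/incoming"            # hypothetical HDFS directory
    LOCAL_FILES = glob.glob("/staging/*")     # hypothetical local staging area

    def put(path):
        # each worker is an independent DFS client; file data goes straight
        # to DataNodes, only metadata goes through the NameNode
        subprocess.run(["hadoop", "fs", "-put", path, HDFS_TARGET], check=True)

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(put, LOCAL_FILES))

For copies between clusters, or from a source path the whole cluster can
already see, distcp (which Tariq mentions) does a similar fan-out as a
MapReduce job.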
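Finally, for option 3 in Nitin's list, WebHDFS uses a two-step create: ask
the NameNode for a write location, then stream the bytes to the DataNode it
redirects you to. A sketch with Python's requests library; the host, port,
and paths are assumptions, and a secured cluster would also need
authentication parameters.

    # Sketch of a WebHDFS upload: CREATE against the NameNode, then PUT the
    # file body to the DataNode it redirects to. Host, port, and paths are
    # illustrative assumptions.
    import requests

    NAMENODE = "http://namenode.example.com:50070"   # hypothetical NN HTTP address
    hdfs_path = "/data/incoming/part-0001"           # hypothetical target path

    # Step 1: the NameNode does not accept the data itself; it answers with
    # a redirect to a DataNode.
    r = requests.put(NAMENODE + "/webhdfs/v1" + hdfs_path,
                     params={"op": "CREATE", "overwrite": "true"},
                     allow_redirects=False)
    datanode_url = r.headers["Location"]

    # Step 2: stream the local file straight to that DataNode.
    with open("part-0001", "rb") as f:
        resp = requests.put(datanode_url, data=f)
    resp.raise_for_status()   # expect 201 Created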
