IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you still need a capable NN.

Can distcp be used to copy from local to HDFS?

Thanks,
Rahul
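For a rough sense of what "capable NN" means here, using the ~200 B-per-object figure Tariq quotes below: 10 TB stored as ten million 1 MB files is about 20 million namespace objects (each file plus at least one block), which works out to roughly 4 GB of NameNode heap just for metadata. That is only a back-of-the-envelope estimate, and the 1 MB average file size is an assumption for illustration.

On the distcp question: distcp runs as a MapReduce job, so a file:// source only works if the same path is readable from every node that can run a map task (for example, a shared NFS mount). For data that exists on just one machine, hadoop fs -put (or -copyFromLocal) is the usual route. A minimal sketch, with nn.example.com:8020 standing in for a hypothetical NameNode address and /shared/export, /local/export as made-up paths:

    # source path mounted identically on every cluster node (e.g. NFS)
    hadoop distcp file:///shared/export hdfs://nn.example.com:8020/user/rahul/export

    # source that lives only on this machine
    hadoop fs -put /local/export hdfs://nn.example.com:8020/user/rahul/export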
On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is all held in memory. Memory is actually the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen: As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like "distcp" for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. The data itself never goes to the NN, but a lot of metadata about the data will be on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of a capable NN. As far as I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the dfs client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine in the same network as your hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable hadoop cluster. By that I mean a capable namenode.

I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS, to use the DFS write features and speed up the process. You can run the hdfs put command in parallel, and in my experience it has not failed when we write a lot of data.
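A simple way to run puts in parallel from a single edge node, along the lines Nitin describes. This is only a sketch: the /data/export layout, the /user/thoihen/export target, and the fan-out of 8 are assumptions for illustration, not anything from the thread.

    # create the target directory, then run one hadoop fs -put per
    # top-level source directory, at most 8 uploads at a time
    hadoop fs -mkdir /user/thoihen/export
    ls -d /data/export/*/ | xargs -P 8 -I {} hadoop fs -put {} /user/thoihen/export/

Each put is an independent client, so a failed upload can simply be re-run; but every file still costs the NameNode metadata, which is why NN sizing keeps coming up in this thread.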
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline. Is it perfectly OK to use the hadoop fs put command to upload these files of total size 10 TB? Is there any limit to the file size when using the hadoop command line, and can the hadoop put command work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where data is written to hdfs on a regular frequency, retained on hdfs for some duration as needed, and from there sent to archives or deleted.

For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to hdfs there are a few options available to you:
1) Write your own dfs client which writes to dfs
2) Use the hdfs proxy
3) There is webhdfs
4) Command line hdfs
5) Data collection tools such as Flume come with support for writing to hdfs

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?

I don't think they would be using the command line hadoop fs put to upload the files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.

Thanks
thoihen
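To make option 3 from Nitin's list above concrete, here is a minimal WebHDFS sketch. It assumes dfs.webhdfs.enabled is set to true on the cluster and uses the era-default HTTP ports (50070 for the NameNode, 50075 for DataNodes); the host names and the /user/thoihen/data path are made up for illustration. WebHDFS file creation is a two-step handshake: the NameNode answers the first PUT with a 307 redirect to a DataNode, and the data goes to that second URL.

    # step 1: ask the NameNode where to write; no data is sent yet,
    # the response is a 307 redirect with a Location header
    curl -i -X PUT "http://nn.example.com:50070/webhdfs/v1/user/thoihen/data/part-0001?op=CREATE&overwrite=true"

    # step 2: send the actual bytes to the DataNode URL returned in the
    # Location header of step 1
    curl -i -X PUT -T part-0001 "<Location returned by step 1>"

For bulk loads of the size Thoihen describes, though, the aggregation tools mentioned above (Flume, Chukwa, Scribe) or parallel distcp/put jobs are the more common route, since they handle parallelism and retries for you.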
