@Nitin, writing to HDFS in parallel through the DFS client sounds great, but I could not understand what you mean by a capable NN. As far as I know, the NN is not part of the actual data write pipeline; that is, the data does not travel through the NN. The DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.
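Just to make that picture concrete, here is a minimal sketch of a client write through the Hadoop Java FileSystem API. The namenode address and the file paths are placeholders I made up, not anything from an actual setup:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);

        // create() talks to the NN only for metadata (the file entry and,
        // as the write proceeds, DN locations for each block); the bytes
        // written below stream directly to the DNs, never through the NN.
        InputStream in = new BufferedInputStream(
                new FileInputStream("/local/data/part-0001"));          // placeholder local file
        FSDataOutputStream out = fs.create(new Path("/user/rahul/part-0001")); // placeholder HDFS path

        IOUtils.copyBytes(in, out, 4096, true); // true closes both streams
        fs.close();
    }
}
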
Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

> Is it safe? There is no direct answer, yes or no.
>
> When you say you have files worth 10 TB and you want to upload them to
> HDFS, several factors come into the picture:
>
> 1) Is the machine in the same network as your Hadoop cluster?
> 2) Is there a guarantee that the network will not go down?
>
> And most importantly, I assume that you have a capable Hadoop cluster. By
> that I mean you have a capable namenode.
>
> I would definitely not write files sequentially to HDFS. I would prefer to
> write files in parallel to HDFS to utilize the DFS write features and
> speed up the process. You can run hdfs put commands in parallel, and in my
> experience it has not failed when we write a lot of data.
>
>
> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:
>
>> @Nitin Pawar, thanks for clearing my doubts.
>>
>> But I have one more question. Say I have 10 TB of data in the pipeline.
>>
>> Is it perfectly OK to use the hadoop fs put command to upload these files
>> of size 10 TB, and is there any limit to the file size using the hadoop
>> command line? Can the hadoop put command work with huge data?
>>
>> Thanks in advance
>>
>>
>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:
>>
>>> First of all, most companies do not get 100 PB of data in one go. It is
>>> an accumulating process, and most companies have a data pipeline in
>>> place where the data is written to HDFS on a regular frequency, retained
>>> on HDFS for some duration as needed, and from there sent to archival
>>> storage or deleted.
>>>
>>> For data management products, you can look at Falcon, which is open
>>> sourced by InMobi along with Hortonworks.
>>>
>>> In any case, if you want to write files to HDFS there are a few options
>>> available to you:
>>> 1) Write your own DFS client which writes to DFS
>>> 2) Use HDFS proxy
>>> 3) Use WebHDFS
>>> 4) Use the command-line hdfs tools
>>> 5) Use data collection tools that come with support for writing to HDFS,
>>> like Flume etc.
>>>
>>>
>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Can anyone help me understand how companies like Facebook, Yahoo etc.
>>>> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
>>>> cluster for processing, and after processing how they download those
>>>> files from HDFS to the local file system?
>>>>
>>>> I don't think they would be using the command line hadoop fs put to
>>>> upload the files, as it would take too long. Or do they divide the data
>>>> into, say, 10 parts of 10 petabytes each, compress them, and use the
>>>> command line hadoop fs put?
>>>>
>>>> Or do they use some tool to upload huge files?
>>>>
>>>> Please help me.
>>>>
>>>> Thanks
>>>> thoihen
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
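P.S. For the "put in parallel" approach Nitin describes, a rough sketch in Java using a thread pool around copyFromLocalFile is below; the local staging directory, target path, namenode address and thread count are all made-up placeholders:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

        final FileSystem fs = FileSystem.get(conf);
        final Path target = new Path("/user/rahul/incoming");  // placeholder

        // Assumes the staging directory exists and holds the files to upload.
        File[] localFiles = new File("/local/staging").listFiles();
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to taste

        for (final File f : localFiles) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each copy streams its blocks to DNs independently,
                        // so the copies can proceed in parallel.
                        fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        fs.close();
    }
}

At larger scale, distcp does essentially the same thing, spreading the copies across the map tasks of a MapReduce job.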
