You can do that using file:///. Example:

    hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

--
Nitin Pawar
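For the upload direction Rahul asks about below (local to HDFS), the same file:/// scheme should work with the local side as the source. The paths here are only placeholders, and because distcp runs as a MapReduce job, the file:/// path has to be readable from the node(s) that actually execute the copy tasks:

    # placeholder paths: local source, HDFS destination
    hadoop distcp file:///Users/myhome/data hdfs://localhost:8020/user/myhome/data

For data sitting on a single client machine, a plain hadoop fs -put (or -copyFromLocal) achieves the same thing without launching a MapReduce job at all.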
On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Tariq, can you point me to some resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to be already present in Hadoop's FS?

Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:

You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul: Yes, distcp can do that.

And the bigger the files, the less metadata there is, and hence the lower the memory consumption.

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is all kept in memory. Memory is actually the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen: As Nitin has said, with that much data you don't actually just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.
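To put rough, illustrative numbers on the ~200 B per object figure above (the exact per-object cost varies by Hadoop version): 10 TB stored as 128 MB files, each fitting in a single block, is roughly 80,000 files plus 80,000 blocks, i.e. about 160,000 objects, on the order of a few tens of MB of NameNode heap. The same 10 TB stored as 1 MB files is roughly 10 million files plus 10 million blocks, i.e. about 20 million objects, on the order of 4 GB of heap. That is the sense in which bigger files mean less metadata and a less demanding NN.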
On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of "a capable NN". As I understand it, the NN is not part of the actual data write pipeline: the data does not travel through the NN; the dfs client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine on the same network as your Hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable NameNode.

I would definitely not write files sequentially into HDFS. I would prefer to write files in parallel to HDFS, to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB? Is there any limit to the file size using the hadoop command line? Can the hadoop put command line work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where data is written to HDFS on a frequency basis, retained on HDFS for some duration as needed, and from there sent to archivers or deleted.

For data management products, you can look at Falcon, which is open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own dfs client which writes to dfs
2) Use an hdfs proxy
3) There is WebHDFS (a rough curl sketch follows this message)
4) Command-line hdfs
5) Data collection tools that come with support for writing to HDFS, like Flume etc.
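As a rough sketch of option 3 above: hostnames, ports, the path, and the user name below are placeholders, WebHDFS has to be enabled on the cluster (dfs.webhdfs.enabled=true), and the NameNode HTTP port was typically 50070 on clusters of that era. Creating a file is a two-step PUT:

    # Step 1: ask the NameNode for a write location; it replies with a 307 redirect to a DataNode
    curl -i -X PUT "http://namenode.example.com:50070/webhdfs/v1/user/myhome/file1.txt?op=CREATE&user.name=myhome"

    # Step 2: send the actual bytes to the Location URL returned in step 1
    curl -i -X PUT -T file1.txt "<Location URL from the step 1 response>"

Because the data goes straight from the client to a DataNode over HTTP, this works from machines that have no Hadoop client installed.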
On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo, etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?

I don't think they would be using the command-line "hadoop fs put" to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line "hadoop fs put"?

Or do they use some tool to upload huge files?

Please help me.

Thanks,
thoihen
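On the "do they just use hadoop fs put" question: Nitin's suggestion above of running several puts in parallel needs nothing more than the shell. Directory names and the degree of parallelism here are placeholders:

    # create the target directory, then run up to 8 hadoop fs -put processes at once
    # (assumes plain file names without spaces)
    hadoop fs -mkdir /user/myhome/incoming
    ls /data/incoming | xargs -P 8 -I {} hadoop fs -put /data/incoming/{} /user/myhome/incoming/

Each put opens its own HDFS write pipeline, so the parallelism is spread across the DataNodes rather than funneled through any single process; as noted earlier in the thread, what ultimately has to keep up is the NameNode's metadata (and therefore its memory).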
