Yes, it's an MR job under the hood. My question was about your point that using distcp loses the benefits of Hadoop's parallel processing. As I understand it, the MR job behind distcp divides the files into individual map tasks based on the total size of the transfer, so if the transfer is huge, multiple mappers are still spawned and they work in parallel.
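For example, the number of simultaneous map tasks can be capped or raised with distcp's -m option; the cluster addresses, paths, and the count of 20 below are only placeholders, not values from this thread:

hadoop distcp -m 20 hdfs://nn-a:8020/data/src hdfs://nn-b:8020/data/dest

Each map task copies its own share of the file list, so a large transfer is spread across the cluster instead of being funnelled through a single node.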
Correct me if anything here is wrong!

Thanks,
Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[email protected]> wrote:
> No. distcp is actually a mapreduce job under the hood.

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <[email protected]> wrote:
> Thanks to both of you!
>
> Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[email protected]> wrote:
> You can do that using file:///
>
> Example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <[email protected]> wrote:
> @Tariq can you point me to some resource which shows how distcp is used to upload files from local to HDFS?
>
> Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's fs?
>
> Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:
> You're welcome :)

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:
> Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:
> @Rahul: Yes, distcp can do that.
>
> And the bigger the files, the less the metadata, hence less memory consumption.

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:
> IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.
>
> Can distcp be used to copy local-to-HDFS?
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:
> Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:
> Sorry for barging in, guys. I think Nitin is talking about this:
>
> Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is all held in memory. Memory is actually the most important metric when it comes to the NN.
>
> Am I correct, @Nitin?
>
> @Thoihen: As Nitin has said, with that much data you don't usually just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.
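To put rough numbers on the point about bigger files meaning less metadata: assuming the ~200 B per file/block object figure above, a 128 MB block size, and ignoring replication (these are assumptions for illustration, not measurements), the back-of-envelope looks roughly like this:

10 TB as 1 GB files: 10,240 files + 81,920 blocks ≈ 92,000 objects × 200 B ≈ 18 MB of NN heap
10 TB as 10 MB files: 1,048,576 files + 1,048,576 blocks ≈ 2.1 million objects × 200 B ≈ 420 MB of NN heap

Same data, roughly twenty times the NN memory, which is why lots of small files hurt even when the total size is modest.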
On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:
> The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable of storing the metadata for your entire dataset. Data never goes to the NN, but a lot of metadata about the data will be on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:
> @Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of a capable NN. As I know, the NN would not be part of the actual data write pipeline, meaning the data would not travel through the NN; the dfs client would contact the NN from time to time to get the locations of the DNs where the data blocks should be stored.
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:
> Is it safe? There is no direct yes-or-no answer.
>
> When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:
>
> 1) Is the machine in the same network as your hadoop cluster?
> 2) Is there a guarantee that the network will not go down?
>
> And, most importantly, I assume that you have a capable hadoop cluster. By that I mean you have a capable namenode.
>
> I would definitely not write the files sequentially to HDFS. I would prefer to write files in parallel to hdfs, to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:
> @Nitin Pawar, thanks for clearing my doubts.
>
> But I have one more question. Say I have 10 TB of data in the pipeline.
>
> Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size when using the hadoop command line? Can the hadoop put command work with huge data?
>
> Thanks in advance
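A rough sketch of the parallel put Nitin mentions above, assuming the files sit under a local directory such as /data/incoming, GNU xargs is available, and the target HDFS directory already exists (the paths and the parallelism of 8 are placeholders):

ls /data/incoming | xargs -P 8 -I{} hadoop fs -put /data/incoming/{} /user/thoihen/incoming/

Each put opens its own DFS write pipeline, so several files stream into the cluster at once; the NN only hands out block locations.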
On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:
> First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where the data is written to hdfs on a frequency basis, retained on hdfs for whatever duration is needed, and from there sent to archival or deleted.
>
> For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.
>
> In any case, if you want to write files to hdfs, there are a few options available to you:
> 1) write your own dfs client which writes to dfs
> 2) use hdfs proxy
> 3) there is webhdfs
> 4) command line hdfs
> 5) data collection tools that come with support for writing to hdfs, like Flume etc.

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:
> Hi All,
>
> Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?
>
> I don't think they would be using the command line hadoop fs put to upload the files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?
>
> Or do they use some tool to upload huge files?
>
> Please help me.
>
> Thanks
> thoihen
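For the webhdfs option in Nitin's list above, a minimal sketch of a single-file upload, assuming webhdfs is enabled and the namenode's HTTP port is the default 50070 (the host names, user, and paths are placeholders):

curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/data.txt?op=CREATE&user.name=thoihen"

The namenode replies with a 307 redirect pointing at a datanode; the file content is then sent to that Location URL:

curl -i -X PUT -T data.txt "<Location URL from the redirect>"

For anything beyond a handful of files, though, distcp or an aggregation tool like Flume is a better fit, as discussed above.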
