Soon after replying I realized something else related to this. Say we have a single file in HDFS (with the default block size of 64 MB) and the file is 1 GB in size. If we now use distcp to move it from the current HDFS to another one, would there be any parallelism, or would just a single map task be fired?
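To make that concrete, the copy would be kicked off with something along these lines (the namenode hosts and paths here are only illustrative; -m caps the number of simultaneous map tasks distcp may use, but it does not change how the work is split across files):

hadoop distcp -m 20 hdfs://nn1:8020/data/bigfile hdfs://nn2:8020/data/

Whether a single 1 GB file can actually keep more than one of those maps busy is exactly the question above.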
From what I have read, a mapper is launched for a complete file or a set of files; it doesn't operate at the block level. So there is no parallelism for a single file, even if the file resides in HDFS.

Thanks,
Rahul

On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <[email protected]> wrote:

Yeah, you are right. I misread your earlier post.

Thanks,
Rahul

On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <[email protected]> wrote:

I had said that if you use distcp to copy data *from the local FS to HDFS*, then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs.

Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <[email protected]> wrote:

Yes, it's an MR job under the hood. My question was about your statement that with distcp you lose the benefits of Hadoop's parallel processing. I think the MR job behind distcp divides the files into individual map tasks based on the total size of the transfer, so multiple mappers would still be spawned if the transfer is huge, and they would work in parallel.

Correct me if there is anything wrong!

Thanks,
Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[email protected]> wrote:

No. distcp is actually a MapReduce job under the hood.

Warm Regards,
Tariq

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks to both of you!

Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[email protected]> wrote:

You can do that using file:///

Example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Tariq, can you point me to some resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's FS?

Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[email protected]> wrote:

You're welcome :)

Warm Regards,
Tariq

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <[email protected]> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[email protected]> wrote:

@Rahul: Yes, distcp can do that.

And the bigger the files, the less metadata there is, and hence less memory consumption.

Warm Regards,
Tariq

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.
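As a rough back-of-the-envelope illustration of that point, using the ~200 B of metadata per object figure quoted further down in this thread and the 64 MB block size (the file sizes are only indicative):

# 10 TB as 1 GB files: 10,240 files + 163,840 blocks ~= 174,080 objects
echo $(( (10240 + 163840) * 200 ))   # ~35 MB of NN heap
# 10 TB as 1 MB files: ~10,485,760 files, one block each ~= 20,971,520 objects
echo $(( 20971520 * 200 ))           # ~4.2 GB of NN heap

The same 10 TB costs the NN far more memory when it arrives as many small files, which is why a capable NN matters more for the file count than for the raw volume.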
Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:

Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen: As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.

Warm Regards,
Tariq

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file, so you will need a NN capable enough to store the metadata for your entire dataset. The data itself never goes to the NN, but a lot of metadata about the data will be on the NN, so it's always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:

@Nitin, parallel DFS writes to HDFS are great, but I could not understand the meaning of a "capable NN". As far as I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine in the same network as your Hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable Hadoop cluster.
By that I mean you have a capable namenode.

I would definitely not write the files sequentially into HDFS. I would prefer to write files in parallel to HDFS, to utilize the DFS write features and speed up the process. You can run hdfs put commands in parallel, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files totalling 10 TB, and is there any limit to the file size when using the Hadoop command line? Can hadoop put on the command line work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:

First of all, most companies do not get 100 PB of data in one go. It's an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a regular frequency, retained on HDFS for some duration as needed, and from there sent to archives or deleted.

For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own DFS client which writes to DFS
2) Use HDFS proxy
3) There is WebHDFS
4) The command line hdfs
5) Data collection tools such as Flume come with support for writing to HDFS

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and after processing how they download those files from HDFS to the local file system?

I don't think they would be using the command line hadoop fs put to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.
Thanks,
thoihen
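For what it's worth, Nitin's suggestion of running puts in parallel can be as simple as launching several hadoop fs -put processes side by side, along these lines (the local staging layout and target directory are made up for illustration):

# start one put per local part directory, then wait for all of them
for part in /data/staging/part-*; do
  hadoop fs -put "$part" /user/thoihen/incoming/ &
done
wait

Each put streams its data straight to the datanodes through the normal DFS write pipeline, so the copies genuinely proceed in parallel; only the metadata operations go through the NN.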
