Hahaha.. I think we could continue this over there.

Warm Regards,
Tariq
cloudfront.blogspot.com
On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee <[email protected]> wrote:

> Sorry for my blunder as well. My previous post for Tariq went into the
> wrong thread.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee <[email protected]> wrote:
>
>> Oh! I thought distcp works on complete files rather than one mapper per
>> data block. So I guess the parallelism would still be there if there are
>> multiple files. Please correct me if anything here is wrong.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <[email protected]> wrote:
>>
>>> @Rahul : I'm sorry, I am not aware of any such document. But you could
>>> use distcp for a local-to-HDFS copy:
>>>
>>> bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/
>>>
>>> And yes, when you use distcp from local to HDFS you can't take advantage
>>> of the parallelism, as the source data is stored in a non-distributed
>>> fashion.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <[email protected]> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> My 2 cents:
>>>>
>>>> Actually the no. of mappers is primarily governed by the no. of
>>>> InputSplits created by the InputFormat you are using, and the no. of
>>>> reducers by the no. of partitions you get after the map phase. Having
>>>> said that, you should also keep the no. of slots available per slave in
>>>> mind, along with the available memory. But as a general rule you could
>>>> use this approach:
>>>>
>>>> Take the no. of virtual CPUs * 0.75 and that's the no. of slots you can
>>>> configure. For example, if you have 12 physical cores (24 virtual
>>>> cores), you would have 24 * 0.75 = 18 slots. Now, based on your
>>>> requirement, you can choose how many mappers and reducers you want to
>>>> use. With 18 MR slots you could have 9 mappers and 9 reducers, or 12
>>>> mappers and 6 reducers, or whatever split works for you.
>>>>
>>>> I don't know if it makes much sense, but it has worked pretty decently
>>>> for me.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
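To make the slot arithmetic above concrete, here is a minimal Java sketch,
assuming classic pre-YARN (MRv1) slots. The SlotEstimate class, the 24-vcore
node and the even 9/9 split are only illustrative; the two properties shown
are the standard Hadoop 1.x tasktracker settings that such per-node values
would normally be written into via mapred-site.xml.

    // Sketch of the "virtual cores * 0.75" rule of thumb for MRv1 slots.
    import org.apache.hadoop.conf.Configuration;

    public class SlotEstimate {
        public static void main(String[] args) {
            int virtualCores = 24;                        // e.g. 12 physical cores with hyper-threading
            int totalSlots = (int) (virtualCores * 0.75); // 24 * 0.75 = 18 slots on this node

            // Divide the slots between map and reduce to suit the workload,
            // here an even 9/9 split.
            int mapSlots = totalSlots / 2;
            int reduceSlots = totalSlots - mapSlots;

            // Per-node values like these live in mapred-site.xml on each
            // tasktracker; setting them on a Configuration here only shows
            // which properties they correspond to.
            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", mapSlots);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", reduceSlots);

            System.out.println(totalSlots + " slots: "
                    + mapSlots + " map, " + reduceSlots + " reduce");
        }
    }

As the message quoted below points out, these MRv1 slot counts stay fixed
until the config is changed, which is exactly the contrast with YARN's
runtime allocation.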
>>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am also new to the Hadoop world; here is my take on your question. If
>>>>> something is missing, others will surely correct it.
>>>>>
>>>>> Pre-YARN, the slots are fixed and computed from the crunching capacity
>>>>> of the datanode hardware. Once the slots per datanode are ascertained,
>>>>> they are divided into map and reduce slots, that goes into the config
>>>>> files, and it remains fixed until changed. In YARN it is decided at
>>>>> runtime, based on the requirements of each particular task. It is quite
>>>>> possible that at some point one datanode is running 10 tasks while
>>>>> another, similar datanode is running only 4.
>>>>>
>>>>> Coming to your question: the number of map tasks is decided from the
>>>>> data set size, the DFS block size and the InputFormat. Generally, for
>>>>> file-based InputFormats it is one mapper per data block, although there
>>>>> are ways to change this through configuration settings. The number of
>>>>> reduce tasks is set in the job configuration.
>>>>>
>>>>> The general rule I have read in various documents is that mappers
>>>>> should run for at least a minute, so you can run a sample to find a
>>>>> block size that makes your mappers run for more than a minute. Beyond
>>>>> that it depends on your SLA; if you are not chasing a very tight SLA,
>>>>> you can choose to run fewer mappers at the expense of a higher runtime.
>>>>>
>>>>> But again, this is all theory; I am not sure how these things are
>>>>> handled in actual prod clusters.
>>>>>
>>>>> HTH,
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <[email protected]> wrote:
>>>>>
>>>>>> Hi Users,
>>>>>>
>>>>>> I am new to Hadoop and confused about task slots in a cluster. How
>>>>>> would I know how many task slots would be required for a job? Is there
>>>>>> any empirical formula, or on what basis should I set the number of
>>>>>> task slots?
>>>>>>
>>>>>> Thanks in advance
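Tying the replies above back to the original question, here is a short
sketch using the Hadoop 2.x mapreduce API. The job name, the input and
output paths, the 256 MB split size and the reducer count of 9 are made-up
example values, not recommendations.

    // Reducers are an explicit job-level setting; mappers follow from the
    // InputFormat's splits (roughly one per HDFS block for file-based input),
    // which can be nudged by changing the split size.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SlotSizingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "slot-sizing-example");

            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            // The number of reduce tasks comes straight from the job config.
            job.setNumReduceTasks(9);

            // Raising the minimum split size gives each mapper more data,
            // one way to push map task runtime past the "at least a minute"
            // rule of thumb mentioned above.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Fewer, larger splits mean fewer, longer-running mappers; the reducer count,
by contrast, is always an explicit choice made per job.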
