You are welcome. First things first: Hadoop can never really be compared with traditional warehouse systems or DBMSs, because the two are meant for different purposes.
One small example: say you have 1 GB of data. Then there is nothing that can match an RDBMS; you'll get the results almost instantly, as you have observed above. Now suppose your company has done very well, has grown very big, and you have 500 TB of data. If you try to process that much data with any traditional system, you will face a lot of difficulty, because these systems have poor horizontal scalability. The only thing you can do is increase your hardware capacity, which works only up to a certain limit. This is where Hadoop comes into the picture: you combine 'N' small machines and utilize their power collectively to process your huge data. That is the basic principle of distributed computing.

Long story short, you cannot evaluate the power of Hadoop on a small dataset. And if you are going to do some OLTP kind of thing, I would not suggest Hadoop; the same holds good for Hive and Pig. Hadoop is basically a batch-processing system, not meant for real-time stuff.

Now, coming back to your actual question: the no. of mappers depends mainly on the no. of InputSplits created by the InputFormat you are using to process your data, and the no. of reducers depends on the no. of partitions created after the map phase. (I have pasted two small sketches at the very end of this mail, below the quoted thread: a driver showing where these two knobs live, and a snippet that prints your cluster's slot capacity.)

HTH

Regards,
Mohammad Tariq


On Thu, Dec 13, 2012 at 6:25 PM, imen Megdiche <imen.megdi...@gmail.com> wrote:

> Thank you for your explanations. I work in pseudo-distributed mode and
> not on a cluster. Do your recommendations also apply in this mode, and
> what can I do so that the execution time decreases as a function of the
> number of map/reduce tasks, if that is possible?
> In general, I do not understand how MapReduce can be so much more
> performant for analysis than other systems such as data warehouses. For
> example, I tested the simple query "select sum(col1) from table1" with
> Hive: the result obtained with Hive is on the order of 10 min, while with
> Oracle it is on the order of 0.20 min (about 12 seconds), for a data size
> of around 40 MB.
>
> Thank you.
>
>
> 2012/12/13 Mohammad Tariq <donta...@gmail.com>
>
>> Hello Imen,
>>
>> If you have a huge no. of tasks, then the overhead of managing map
>> and reduce task creation begins to dominate the total job execution
>> time. Also, more tasks mean you need more free CPU slots. If the slots
>> are not free, then the data block of interest will be moved to some
>> other node where free slots are available; that consumes time, and it
>> also goes against the most basic principle of Hadoop, i.e. data
>> locality. So the no. of maps and reduces should be raised keeping all
>> these factors in mind, otherwise you may face performance issues.
>>
>> HTH
>>
>>
>> Regards,
>> Mohammad Tariq
>>
>>
>>
>> On Thu, Dec 13, 2012 at 4:11 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> If the number of maps or reducers your job launches is more than the
>>> job queue/cluster capacity, CPU time will increase.
>>> On Dec 13, 2012 4:02 PM, "imen Megdiche" <imen.megdi...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to increase the number of map and reduce tasks for a
>>>> job, and even for the same data size, I noticed that the total CPU
>>>> time increases, although I thought it would decrease. MapReduce is
>>>> known for its performance in computation, but I do not see this in
>>>> these small tests.
>>>>
>>>> What do you think about this issue?
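
P.S. As promised, a minimal driver sketch (Hadoop 1.x, new "mapreduce" API) showing where the two knobs live. I have not tuned or tested this for your setup; the 16 MB split cap and the reducer count of 4 are arbitrary example values, the input/output paths come from the command line, and the identity Mapper/Reducer base classes are used just to keep the sketch self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaskCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "task-count-demo");
        job.setJarByClass(TaskCountDemo.class);

        // The mapper count is controlled only indirectly: it equals the
        // number of InputSplits the InputFormat creates. FileInputFormat
        // computes splitSize = max(minSize, min(maxSize, blockSize)), so
        // capping the max split size at 16 MB yields more, smaller splits
        // (hence more map tasks) for the same input data.
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

        // The reducer count, by contrast, is set directly; each reduce
        // task consumes one partition of the map output.
        job.setNumReduceTasks(4);

        // Identity mapper/reducer so the sketch compiles and runs as-is;
        // with the default TextInputFormat the key/value types are the
        // line offset (LongWritable) and the line itself (Text).
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}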
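
And to Nitin's point about capacity: before raising the task counts, you can ask the cluster how many slots it actually has. A quick sketch with the old "mapred" client API follows; this is the first thing I would check in pseudo-distributed mode, where there is a single TaskTracker with, by default, only 2 map slots and 2 reduce slots. Extra tasks simply wait in the queue, which is why launching more tasks on one machine raises the total CPU time instead of lowering the wall-clock time.

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SlotCheck {
    public static void main(String[] args) throws Exception {
        // Connects to the JobTracker configured in your mapred-site.xml.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();

        // Total concurrent task capacity across all TaskTrackers.
        System.out.println("TaskTrackers:     " + status.getTaskTrackers());
        System.out.println("Max map slots:    " + status.getMaxMapTasks());
        System.out.println("Max reduce slots: " + status.getMaxReduceTasks());
    }
}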