Hi all, I looked into various configurations and have come up with the following information:
1. The underlying files are compressed .seq files, but I guess that is a pretty standard format for HDFS.
2. The files are located on HDFS, spread across 2 servers, on top of which Hive runs.
3. I am not too familiar with map-reduce; however, to the best of my knowledge the JobTracker configuration and core utilization settings were all the defaults.
4. No swapping occurs (~200 MB remains free all the time).

For more information, here is the output of a sample query:

hive> select count (1) from searchlogs where days=20110311;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201010220228_0022, Tracking URL = http://serverName:50030/jobdetails.jsp?jobid=job_201010220228_0022
Kill Command = /home/userName/hadoop/bin/hadoop job -Dmapred.job.tracker=serverName:9001 -kill job_201010220228_0022
2011-03-11 16:49:15,457 Stage-1 map = 0%, reduce = 0%
2011-03-11 16:49:26,677 Stage-1 map = 1%, reduce = 0%
2011-03-11 16:49:32,926 Stage-1 map = 2%, reduce = 0%
2011-03-11 16:49:45,350 Stage-1 map = 3%, reduce = 0%
2011-03-11 16:49:51,988 Stage-1 map = 4%, reduce = 0%
2011-03-11 16:49:57,424 Stage-1 map = 5%, reduce = 0%
2011-03-11 16:50:09,872 Stage-1 map = 6%, reduce = 0%
2011-03-11 16:50:16,056 Stage-1 map = 7%, reduce = 2%
2011-03-11 16:50:28,403 Stage-1 map = 8%, reduce = 2%
.......................................................
.......................................................
.......................................................
2011-03-11 17:02:25,947 Stage-1 map = 96%, reduce = 32%
2011-03-11 17:02:28,026 Stage-1 map = 97%, reduce = 32%
2011-03-11 17:02:37,483 Stage-1 map = 98%, reduce = 32%
2011-03-11 17:02:43,920 Stage-1 map = 99%, reduce = 32%
2011-03-11 17:02:47,036 Stage-1 map = 99%, reduce = 33%
2011-03-11 17:02:50,135 Stage-1 map = 100%, reduce = 33%
2011-03-11 17:03:04,455 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201010220228_0022
OK
6768
Time taken: 835.495 seconds
hive>

As you can see, it took nearly 14 minutes to execute this query. The query

hive> select count(1) from searchlogs;

fired immediately after the one above takes about 25 minutes and returns 15118. As you all pointed out, this is slow, even by Hive standards. How do I proceed further to solve this problem?

P.S.: In this setup, data is continuously being added to HDFS at roughly 1 MB/sec through Flume (https://github.com/cloudera/flume), and Hive queries run on top of that data. Could this, in any way, affect performance? If so, what can be the solution? (I have appended, below the quoted replies, a sketch of the settings and checks I plan to try next.)

Regards,
Abhishek Pathak

________________________________
From: abhishek pathak <forever_yours_a...@yahoo.co.in>
To: user@hive.apache.org
Sent: Tue, 8 March, 2011 12:04:22 PM
Subject: Re: Hive too slow?

Thank you all for the tips. I'll dig into all these and let you people know :)

________________________________
From: Igor Tatarinov <i...@decide.com>
To: user@hive.apache.org
Sent: Tue, 8 March, 2011 11:47:20 AM
Subject: Re: Hive too slow?

Most likely, Hadoop's memory settings are too high and Linux starts swapping. You should be able to detect that with vmstat, too. Just a guess.
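Appended sketch 1: Since Flume writes a steady stream of small .seq files and Hive (with its default input format) tends to launch at least one map task per file, the first thing I plan to try is combining the small files into larger splits. The property names below are the standard Hive/Hadoop ones for this; the split-size values are just guesses for our two-server setup and still need tuning:

-- combine the many small Flume-written .seq files into fewer, larger map splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- target split sizes in bytes (guessed values, to be tuned)
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=128000000;
set mapred.min.split.size.per.rack=128000000;
-- then re-run the partition-pruned count to compare timings
select count(1) from searchlogs where days=20110311;

I would run this either interactively at the hive> prompt or via hive -f, and compare the number of launched map tasks and the total time against the 835-second run above.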
On Mon, Mar 7, 2011 at 10:11 PM, Ajo Fod <ajo....@gmail.com> wrote:

Hmm, I don't know of such a place ... but if I had to debug, I'd try to understand the following:
>1) Are the underlying files zipped/compressed? That usually makes things slower.
>2) Are the files located on the local hard drive or on HDFS?
>3) Are all the cores being used? Check the number of reduce and map tasks.
>
>-Ajo
>
>On Mon, Mar 7, 2011 at 9:24 PM, abhishek pathak <forever_yours_a...@yahoo.co.in> wrote:
>
>I suspected as much. My system is a Core2Duo, 1.86 GHz. I understand that map-reduce is not instantaneous; I just wanted to confirm that 2200 rows in 4 minutes is indeed not normal behaviour. Could you point me at some places where I can get some info on how to tune this up?
>>
>>Regards,
>>Abhishek
>>
>>________________________________
>>From: Ajo Fod <ajo....@gmail.com>
>>To: user@hive.apache.org
>>Sent: Mon, 7 March, 2011 9:21:51 PM
>>Subject: Re: Hive too slow?
>>
>>In my experience, Hive is not instantaneous like other DBs, but 4 minutes to count 2200 rows seems unreasonable.
>>
>>For comparison, my query over 169k rows on one computer with 4 cores running at about 1 GHz took 20 seconds.
>>
>>Cheers,
>>Ajo.
>>
>>On Mon, Mar 7, 2011 at 1:19 AM, abhishek pathak <forever_yours_a...@yahoo.co.in> wrote:
>>
>>Hi,
>>>
>>>I am a Hive newbie. I just finished setting up Hive on a cluster of two servers for my organisation. As a test drill, we ran some simple queries. It took the standard map-reduce job around 4 minutes just to execute this query:
>>>
>>>select count(1) from tablename;
>>>
>>>The answer returned was around 2200. Clearly, this is not a big number by Hadoop standards. My question is whether this is standard performance, or whether there is some configuration that is not optimised. Will scaling the data up by, say, 50 times produce any drastic slowness? I tried reading the documentation but was not clear on these issues, and I would like to have an idea before this setup starts working in a production environment.
>>>
>>>Thanks in advance,
>>>Regards,
>>>Abhishek Pathak
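Appended sketch 2: per Ajo's checklist and Igor's swapping theory, this is the quick diagnostic pass I plan to run from the shell while the query executes. The job id is copied from the session above; the warehouse path is only a guess at where Hive keeps the searchlogs partitions:

# watch memory and swap while the query runs; the si/so columns should stay near 0
vmstat 5

# job state, map/reduce completion, and counters (including launched map/reduce tasks)
hadoop job -status job_201010220228_0022
# the same information is visible in the JobTracker web UI at http://serverName:50030/

# how many files does a single day's partition contain? (path below is a guess)
hadoop fs -ls /user/hive/warehouse/searchlogs/days=20110311 | wc -l

If the file count for one day turns out to be in the thousands, that would point at the small-files problem rather than at memory or core utilization.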