Hi all,

I looked into various configurations and have come up with the following 
information:

1. The underlying files are compressed .seq files, but I guess that is a pretty 
standard format for HDFS.
2. The files are located on HDFS, spread across 2 servers, on top of which 
Hive runs.
3. I am not too familiar with map-reduce; however, to the best of my knowledge, 
all the JobTracker configurations as well as core utilization were left at the 
defaults.
4. No swapping occurs (~200 MB of memory remains free at all times).
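If anyone wants me to check a specific value for point 3, the effective settings 
can be printed from the Hive CLI without changing anything, e.g.:

```sql
-- Print a single effective setting (does not modify it):
set mapred.reduce.tasks;
set mapred.map.tasks;

-- Or dump every Hadoop/Hive variable to compare against the defaults:
set -v;
```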

For more information, here is the output of a sample query:

hive> select count (1) from searchlogs where days=20110311;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201010220228_0022, Tracking URL = 
http://serverName:50030/jobdetails.jsp?jobid=job_201010220228_0022
Kill Command = /home/userName/hadoop/bin/hadoop job 
 -Dmapred.job.tracker=serverName:9001 -kill job_201010220228_0022
2011-03-11 16:49:15,457 Stage-1 map = 0%,  reduce = 0%
2011-03-11 16:49:26,677 Stage-1 map = 1%,  reduce = 0%
2011-03-11 16:49:32,926 Stage-1 map = 2%,  reduce = 0%
2011-03-11 16:49:45,350 Stage-1 map = 3%,  reduce = 0%
2011-03-11 16:49:51,988 Stage-1 map = 4%,  reduce = 0%
2011-03-11 16:49:57,424 Stage-1 map = 5%,  reduce = 0%
2011-03-11 16:50:09,872 Stage-1 map = 6%,  reduce = 0%
2011-03-11 16:50:16,056 Stage-1 map = 7%,  reduce = 2%
2011-03-11 16:50:28,403 Stage-1 map = 8%,  reduce = 2%
.......................................................
.......................................................
.......................................................
2011-03-11 17:02:25,947 Stage-1 map = 96%,  reduce = 32%
2011-03-11 17:02:28,026 Stage-1 map = 97%,  reduce = 32%
2011-03-11 17:02:37,483 Stage-1 map = 98%,  reduce = 32%
2011-03-11 17:02:43,920 Stage-1 map = 99%,  reduce = 32%
2011-03-11 17:02:47,036 Stage-1 map = 99%,  reduce = 33%
2011-03-11 17:02:50,135 Stage-1 map = 100%,  reduce = 33%
2011-03-11 17:03:04,455 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201010220228_0022
OK
6768
Time taken: 835.495 seconds
hive>

As you can see, it took nearly 14 minutes to execute this query.
The query:

hive> select count(1) from searchlogs;

fired immediately after the above one, takes about 25 minutes, and gives the 
answer as 15118.

As you all pointed out, this is slow, even by Hive standards. How do I proceed 
further to solve this problem?
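For reference, I understand the reducer hints Hive prints in the output above 
can be applied per session; the values below are purely illustrative for a 
2-server setup, not something I have tuned yet:

```sql
-- Illustrative only: cap the number of reducers for a small cluster
set hive.exec.reducers.max=4;

-- Or pin the reducer count for this session
set mapred.reduce.tasks=2;

select count(1) from searchlogs where days=20110311;
```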

P.S: In this setup, data is being continuously added to HDFS at an approx. rate 
of 1 MB/sec through Flume (https://github.com/cloudera/flume). Hive runs and 
queries on top of that data. Could this, in any way, affect performance? If so, 
what can be the solution?
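Regarding the P.S.: since Flume tends to produce many small files, I could also 
check how many files the partition holds (the path below is my guess at the 
default warehouse layout) and, if this Hive version supports it, ask Hive to 
combine small files into fewer map splits:

```sql
-- Count directories/files/bytes under the table's warehouse directory
-- (the path is an assumption about the default layout):
dfs -count /user/hive/warehouse/searchlogs;

-- Combine many small files into fewer map splits, if available:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```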

Regards,
Abhishek Pathak



________________________________
From: abhishek pathak <forever_yours_a...@yahoo.co.in>
To: user@hive.apache.org
Sent: Tue, 8 March, 2011 12:04:22 PM
Subject: Re: Hive too slow?


Thank you all for the tips. I'll dig into all these and let you people know :)



________________________________
From: Igor Tatarinov <i...@decide.com>
To: user@hive.apache.org
Sent: Tue, 8 March, 2011 11:47:20 AM
Subject: Re: Hive too slow?

Most likely, Hadoop's memory settings are too high and Linux starts swapping. 
You should be able to detect that too using vmstat.
Just a guess.


On Mon, Mar 7, 2011 at 10:11 PM, Ajo Fod <ajo....@gmail.com> wrote:

>hmm, I don't know of such a place ... but if I had to debug, I'd try to 
>understand the following:
>1) are the underlying files zipped/compressed? ... that usually makes it 
>slower.
>2) are the files located on the hard drive or HDFS?
>3) are all the cores being used? ... check the number of reduce and map tasks.
>
>-Ajo
>
>
>
>On Mon, Mar 7, 2011 at 9:24 PM, abhishek pathak 
><forever_yours_a...@yahoo.co.in> 
>wrote:
>
>>I suspected as much. My system is a Core2Duo, 1.86 GHz. I understand that 
>>map-reduce is not instantaneous, just wanted to confirm that 2200 rows in 4 
>>minutes is indeed not normal behaviour. Could you point me at some places 
>>where I can get some info on how to tune this up?
>>
>>
>>Regards,
>>Abhishek
>>
>>
>>
>>________________________________
>>From: Ajo Fod <ajo....@gmail.com>
>>To: user@hive.apache.org
>>Sent: Mon, 7 March, 2011 9:21:51 PM
>>Subject: Re: Hive too slow?
>>
>>
>>In my experience, hive is not instantaneous like other DBs, but 4 minutes to 
>>count 2200 rows seems unreasonable.
>>
>>For comparison, my query of 169k rows on one computer with 4 cores running at 
>>approx. 1 GHz took 20 seconds.
>>
>>Cheers,
>>Ajo.
>>
>>
>>On Mon, Mar 7, 2011 at 1:19 AM, abhishek pathak 
>><forever_yours_a...@yahoo.co.in> 
>>wrote:
>>
>>>Hi,
>>>
>>>
>>>I am a Hive newbie. I just finished setting up Hive on a cluster of two 
>>>servers for my organisation. As a test drill, we ran some simple queries. 
>>>It took the standard map-reduce job around 4 minutes just to execute this 
>>>query:
>>>
>>>
>>>select count(1) from tablename;
>>>
>>>
>>>The answer returned was around 2200. Clearly, this is not a big number by 
>>>hadoop standards. My question is whether this is standard performance or is 
>>>there some configuration that is not optimised? Will scaling up the data to, 
>>>say, 50 times produce any drastic slowness? I tried reading the 
>>>documentation but was not clear on these issues, and I would like to have an 
>>>idea before this setup starts working in a production environment.
>>>
>>>
>>>Thanks in advance,
>>>Regards,
>>>Abhishek Pathak
>>>
>>>
>>>
>>>
>>>
>>
>>
>
