I'd say start with something simpler ... say, how about converting all the
files to tab-delimited text files in uncompressed format and running the same
query on the new table? If that works, you know the problem is with the .seq
files ... if not, there is something funky about the configuration or the
machine.
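For example, something along these lines (just a sketch; the searchlogs_text
name is made up, and this assumes your Hive version supports CREATE TABLE AS
SELECT):

  -- make sure the copy is written uncompressed
  SET hive.exec.compress.output=false;

  -- tab-delimited, uncompressed text copy of the data
  CREATE TABLE searchlogs_text
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  AS SELECT * FROM searchlogs;

  -- rerun the same query against the copy
  SELECT count(1) FROM searchlogs_text;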

Also, you might want to check the CPU usage ... are all cores being used?
If not, perhaps you want to see if increasing the number of reduce tasks
helps:
  set mapred.reduce.tasks=<number>
Also, if your file is small, you could split it into as many pieces as there
are cores and set:
  set mapred.map.tasks=<number>
... but I suspect this is not the bottleneck given the time the query takes.
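To see per-core usage while the query runs, standard Linux tools are enough
(nothing Hive-specific here):

  mpstat -P ALL 5   # per-core utilization every 5 seconds (sysstat package)
  top               # then press '1' to toggle the per-core view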

Cheers,
-Ajo.

On Fri, Mar 11, 2011 at 4:05 AM, abhishek pathak <
forever_yours_a...@yahoo.co.in> wrote:

> Hi all,
>
> I looked into various configurations and have come up with the following
> information:
>
> 1. The underlying files are compressed .seq files, but I guess that is a
> pretty standard format for HDFS.
> 2. The files are located on HDFS, spread across 2 servers, on top of
> which Hive runs.
> 3. I am not too familiar with map-reduce; however, to the best of my
> knowledge, all the configurations in the jobtrackers as well as the
> core utilization were the defaults.
> 4. No swapping occurs (~200 MB remains free all the time).
>
> For more information I give below the output of a sample query:
>
> hive> select count (1) from searchlogs where days=20110311;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> Starting Job = job_201010220228_0022, Tracking URL =
> http://serverName:50030/jobdetails.jsp?jobid=job_201010220228_0022
> Kill Command = /home/userName/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=serverName:9001 -kill job_201010220228_0022
> 2011-03-11 16:49:15,457 Stage-1 map = 0%,  reduce = 0%
> 2011-03-11 16:49:26,677 Stage-1 map = 1%,  reduce = 0%
> 2011-03-11 16:49:32,926 Stage-1 map = 2%,  reduce = 0%
> 2011-03-11 16:49:45,350 Stage-1 map = 3%,  reduce = 0%
> 2011-03-11 16:49:51,988 Stage-1 map = 4%,  reduce = 0%
> 2011-03-11 16:49:57,424 Stage-1 map = 5%,  reduce = 0%
> 2011-03-11 16:50:09,872 Stage-1 map = 6%,  reduce = 0%
> 2011-03-11 16:50:16,056 Stage-1 map = 7%,  reduce = 2%
> 2011-03-11 16:50:28,403 Stage-1 map = 8%,  reduce = 2%
> .......................................................
> .......................................................
> .......................................................
> 2011-03-11 17:02:25,947 Stage-1 map = 96%,  reduce = 32%
> 2011-03-11 17:02:28,026 Stage-1 map = 97%,  reduce = 32%
> 2011-03-11 17:02:37,483 Stage-1 map = 98%,  reduce = 32%
> 2011-03-11 17:02:43,920 Stage-1 map = 99%,  reduce = 32%
> 2011-03-11 17:02:47,036 Stage-1 map = 99%,  reduce = 33%
> 2011-03-11 17:02:50,135 Stage-1 map = 100%,  reduce = 33%
> 2011-03-11 17:03:04,455 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201010220228_0022
> OK
> 6768
> Time taken: 835.495 seconds
> hive>
>
> As you can see, it took nearly 14 minutes to execute this query.
> The query:
>
> hive> select count(1) from searchlogs;
>
> fired immediately after the above one, takes about 25 minutes, and gives
> the answer as 15118.
>
> As you all pointed out, this is slow, even by Hive standards. How do I
> proceed further to solve this problem?
>
> P.S.: In this setup, data is being continuously added to HDFS at an
> approximate rate of 1 MB/sec through Flume
> (https://github.com/cloudera/flume). Hive runs and queries on top of that
> data. Could this, in any way, affect performance? If so, what can be the
> solution?
>
> Regards,
> Abhishek Pathak
>
> ------------------------------
> *From:* abhishek pathak <forever_yours_a...@yahoo.co.in>
>
> *To:* user@hive.apache.org
> *Sent:* Tue, 8 March, 2011 12:04:22 PM
>
> *Subject:* Re: Hive too slow?
>
> Thank you all for the tips. I'll dig into all of these and let you people
> know :)
>
> ------------------------------
> *From:* Igor Tatarinov <i...@decide.com>
> *To:* user@hive.apache.org
> *Sent:* Tue, 8 March, 2011 11:47:20 AM
> *Subject:* Re: Hive too slow?
>
> Most likely, Hadoop's memory settings are too high and Linux starts
> swapping. You should be able to detect that too using vmstat.
> Just a guess.
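>
> For example, watching the swap columns (si and so) of vmstat while the
> query runs should show it (standard Linux tool, not Hive-specific):
>
>   vmstat 5
>
> Nonzero si/so values would mean the box really is swapping.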
>
> On Mon, Mar 7, 2011 at 10:11 PM, Ajo Fod <ajo....@gmail.com> wrote:
>
>> hmm I don't know of such a place ... but if I had to debug, I'd try to
>> understand the following:
>> 1) are the underlying files zipped/compressed ... that usually makes it
>> slower.
>> 2) are the files located on the hard drive or HDFS?
>> 3) are all the cores being used? ... check the number of reduce and map
>> tasks (see the example below).
>>
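>> For example, the per-job map and reduce task counts show up on the
>> JobTracker web UI, or from the command line with a real job id:
>>
>>   hadoop job -status <job_id>
>>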
>> -Ajo
>>
>>
>> On Mon, Mar 7, 2011 at 9:24 PM, abhishek pathak <
>> forever_yours_a...@yahoo.co.in> wrote:
>>
>>> I suspected as such. My system is a Core2Duo, 1.86 GHz. I understand that
>>> map-reduce is not instantaneous; I just wanted to confirm that 2200 rows
>>> in 4 minutes is indeed not normal behaviour. Could you point me at some
>>> places where I can get some info on how to tune this up?
>>>
>>> Regards,
>>> Abhishek
>>>
>>> ------------------------------
>>> *From:* Ajo Fod <ajo....@gmail.com>
>>> *To:* user@hive.apache.org
>>> *Sent:* Mon, 7 March, 2011 9:21:51 PM
>>> *Subject:* Re: Hive too slow?
>>>
>>> In my experience, Hive is not instantaneous like other DBs, but 4 minutes
>>> to count 2200 rows seems unreasonable.
>>>
>>> For comparison, my query over 169k rows on one computer with 4 cores
>>> running at about 1 GHz took 20 seconds.
>>>
>>> Cheers,
>>> Ajo.
>>>
>>> On Mon, Mar 7, 2011 at 1:19 AM, abhishek pathak <
>>> forever_yours_a...@yahoo.co.in> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am a Hive newbie. I just finished setting up Hive on a cluster of two
>>>> servers for my organisation. As a test drill, we ran some simple
>>>> queries. It took the standard map-reduce job around 4 minutes just to
>>>> execute this query:
>>>>
>>>> select count(1) from tablename;
>>>>
>>>> The answer returned was around 2200. Clearly, this is not a big number
>>>> by Hadoop standards. My question is whether this is standard performance
>>>> or is there some configuration that is not optimised? Will scaling the
>>>> data up, say, 50 times produce any drastic slowness? I tried reading the
>>>> documentation but was not clear on these issues, and I would like to
>>>> have an idea before this setup starts working in a production
>>>> environment.
>>>>
>>>> Thanks in advance,
>>>> Regards,
>>>> Abhishek Pathak
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
>
