You are right, I didn't change this parameter, so the default from
src/mapred/mapred-default.xml is used:
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set
  to 99% of the cluster's reduce capacity, so that if a node fails the
  reduces can still be executed in a single wave.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
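(For what it's worth, my understanding is that this default can be overridden on the client side, e.g. in conf/mapred-site.xml -- the value 4 below is only an illustration, not a recommendation:)

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```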
It's not clear to me what the reduce capacity of my cluster is :)
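My rough guess at what "reduce capacity" means (an assumption on my part, not something I found in the docs): the number of TaskTrackers times the reduce slots per TaskTracker, i.e. mapred.tasktracker.reduce.tasks.maximum, which defaults to 2 in Hadoop 0.20. For a 2-node cluster that would give:

```shell
# Rough estimate (my assumption, not from the docs):
# reduce capacity = number of TaskTrackers x reduce slots per TaskTracker.
NODES=2            # our 2-node cluster
SLOTS_PER_NODE=2   # default mapred.tasktracker.reduce.tasks.maximum
CAPACITY=$((NODES * SLOTS_PER_NODE))
echo "estimated reduce capacity: $CAPACITY reduce slots"
```

So with the default of 1 reduce task, the job would use only one of those slots.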
On 10/08/2010 01:00 PM, Jeff Zhang wrote:
I guess maybe your reduce number is 1, which causes the reduce phase to be very slow.
On Fri, Oct 8, 2010 at 4:44 PM, Vincent<[email protected]> wrote:
Well, I can see from the job tracker that all the jobs are done quite
quickly except 2, for which the reduce phase goes really, really slowly.
But how can I draw the parallel between a job in the Hadoop job tracker
(example: job_201010072150_0045) and the Pig script execution?
And what is more efficient: several small Pig scripts, or one big Pig
script? I wrote one big script to avoid loading the same logs several
times in different scripts. Maybe it is not such a good design...
Thanks for your help.
- Vincent
On 10/08/2010 11:31 AM, Vincent wrote:
I'm using pig-0.7.0 on hadoop-0.20.2.
For the script, well, it's more than 500 lines; I'm not sure anybody
would read it to the end if I posted it here :-)
On 10/08/2010 11:26 AM, Dmitriy Ryaboy wrote:
What version of Pig, and what does your script look like?
On Thu, Oct 7, 2010 at 11:48 PM, Vincent<[email protected]>
wrote:
Hi All,
I'm quite new to Pig/Hadoop, so maybe my cluster size will make you laugh.
I wrote a Pig script that handles 1.5GB of logs in less than one hour
in Pig local mode on an Intel Core 2 Duo with 3GB of RAM.
Then I tried this script on a simple 2-node cluster. These 2 nodes are
not servers but ordinary computers:
- an Intel Core 2 Duo with 3GB of RAM;
- an Intel Quad with 4GB of RAM.
Well, I was aware that Hadoop has overhead and that the job wouldn't be
done in half an hour (the local-mode time divided by the number of
nodes). But I was surprised to see this morning that it took 7 hours to
complete!!!
My configuration was made according to this link:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
My question is simple: is this normal?
Cheers
Vincent