Thanks a lot for your explanation, Felix.
My query is not using a global sort/count, but I am still unable to understand:
even though I set mapred.reduce.tasks=4,
when the Hadoop job runs I still see
14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100
Does that look OK? numReduceTasks should be 4, right?
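(For reference, my understanding is that the property could also be set from
inside Cascalog itself; a rough, untested sketch, assuming cascalog.api's
with-job-conf and a query named my-query. Please correct me if this is wrong:)

(require '[cascalog.api :refer [with-job-conf ?- stdout]])

;; wrap query execution so the property reaches the submitted Hadoop job
(with-job-conf {"mapred.reduce.tasks" "4"}
  (?- (stdout) my-query))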
Also, I am pasting my Cascalog query below. Please point out where I am going
wrong: why has the performance not improved?
Cascalog code:
;; (aliases, defined elsewhere in my namespace: c = the cascalog ops
;; namespace, f = clj-time.format, ct = clj-time.coerce; custom-formatter,
;; start-value and end-value are also defined there)

;; source tap: semicolon-delimited text file with a header row
(def info
  (hfs-delimited "/users/si/File.txt"
                 :delimiter ";"
                 :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
                 :classes [String String String String]
                 :skip-header? true))

;; parse a timestamp string into epoch milliseconds
(defn convert-to-long [a]
  (ct/to-long (f/parse custom-formatter a)))

;; narrow the tap to the two fields the query needs
(def info-tap
  (<- [?timestamp ?BTT367]
      ((select-fields info ["?timestamp" "?BTT367"]) ?timestamp ?BTT367)))

;; parse a numeric string, returning nil for blanks or unparsable values
(defn convert-to-float [a]
  (try
    (when (not= a " ")
      (read-string a))
    (catch Exception e nil)))

;; take the single row with the highest ?BTT367 value inside the time
;; window (a global top-1 via sort + limit)
(?<- (stdout) [?timestamp-out ?highest-value]
     (info-tap ?timestamp ?BTT367)
     (convert-to-float ?BTT367 :> ?converted-BTT367)
     (convert-to-long ?timestamp :> ?converted-timestamp)
     (>= ?converted-timestamp start-value)
     (<= ?converted-timestamp end-value)
     (:sort ?converted-BTT367)
     (:reverse true)
     (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out ?highest-value))
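Separately, if I follow your suggestion about splitting on an output key:
would something like the sketch below spread the reduce work? (Rough and
untested; day-of is a hypothetical helper that derives a grouping key from
the timestamp, and for simplicity it drops the timestamp of the maximum.)

;; stage 1: per-day maxima, grouped on ?day so many reducers can run
(def daily-max
  (<- [?day ?day-max]
      (info-tap ?timestamp ?BTT367)
      (convert-to-float ?BTT367 :> ?converted-BTT367)
      (day-of ?timestamp :> ?day)
      (c/max ?converted-BTT367 :> ?day-max)))

;; stage 2: a global max over the much smaller per-day results
(?<- (stdout) [?highest-value]
     (daily-max _ ?day-max)
     (c/max ?day-max :> ?highest-value))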
Regards,
Sindhu
On 04 Aug 2014, at 19:10, Felix Chern <[email protected]> wrote:
> The mapper and reducer numbers really depend on what your program is trying
> to do. Without your actual query it’s really difficult to tell why you are
> having this problem.
>
> For example, if you try to perform a global sum or count, Cascalog will
> only use one reducer, since that is the only way to do a global sum/count. To
> avoid this behavior you can set an output key that splits the work across
> reducers; e.g. the word count example uses the word as the output key. With
> that word count output you can either sum it up serially, or run the global
> map reduce job over this much smaller input.
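> Roughly, something like this (an illustrative sketch, assuming a word-tap
> that emits one word per tuple and the standard c/count aggregator):
>
> ;; global count: one output tuple, so everything funnels into ONE reducer
> (?<- (stdout) [?count]
>      (word-tap ?word)
>      (c/count ?count))
>
> ;; per-word count: grouping on ?word spreads the work across many reducers
> (?<- (stdout) [?word ?count]
>      (word-tap ?word)
>      (c/count ?count))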
>
> The mapper number is usually not a performance bottleneck. If you are
> curious: when the file is splittable (i.e. unzipped text or a sequence file),
> the number of mappers is controlled by the split size in the configuration.
> The smaller the split size, the more mappers are queued.
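> For instance (a sketch; MR1 property name, value in bytes):
>
> ;; raising the minimum split size yields fewer, larger splits and thus
> ;; fewer mappers; smaller splits mean more mappers
> (with-job-conf {"mapred.min.split.size" (str (* 256 1024 1024))} ; 256 MB
>   (?- (stdout) some-query))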
>
> In short, your problem is likely not a configuration problem but a
> misunderstanding of the map reduce logic. To get it solved, can you paste
> your cascalog query and let people take a look?
>
> Felix
>
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <[email protected]> wrote:
>
>>
>> I am not coding in MapReduce. I am running my Cascalog queries on a Hadoop
>> cluster (1 node) on data of size 280 MB, so all the config settings have to
>> be made on the Hadoop cluster itself.
>> As you said, I set the values of mapred.tasktracker.map.tasks.maximum = 4
>> and mapred.tasktracker.reduce.tasks.maximum = 4,
>> and then kept tuning them up and down in combinations like
>> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
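>> (For reference, I added them like this in mapred-site.xml and restarted
>> the TaskTracker afterwards:)
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>4</value>
>> </property>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>4</value>
>> </property>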
>>
>> But the performance remains the same every time.
>> Whatever combination of
>> mapred.tasktracker.map.tasks.maximum and
>> mapred.tasktracker.reduce.tasks.maximum I use, it produces the same
>> execution time.
>>
>> Then, when the above failed, I also tried mapred.reduce.tasks = 4, but the
>> results are still the same: no reduction in execution time.
>>
>> What other things should I set? I also made sure Hadoop was restarted
>> every time after changing the config.
>> I have attached my conf folder; please indicate what should be added
>> where.
>> I am really stuck, and your help would be much appreciated. Thank you.
>> <(singlenodecuda)conf.zip>
>>
>> Regards,
>> Sindhu
>