Thanks a lot for your explanation, Felix.
My query is not using a global sort/count, but I still can't understand why: even after I set mapred.reduce.tasks=4, when the Hadoop job runs I still see
14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100

Does that look OK? numReduceTasks should be 4, right?
I am also pasting my Cascalog query below. Please point out where I am going wrong and why the performance has not improved.

Cascalog code
(def info
  (hfs-delimited "/users/si/File.txt"
                 :delimiter ";"
                 :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
                 :classes [String String String String]
                 :skip-header? true))
       
;; parse the timestamp with custom-formatter (clj-time) and convert to epoch millis
(defn convert-to-long [a]
  (ct/to-long (f/parse custom-formatter a)))

(def info-tap
  (<- [?timestamp  ?BTT367 ]
      ((select-fields info ["?timestamp"  "?BTT367"]) ?timestamp  ?BTT367)))

(defn convert-to-float [a]
  (try
    (when (not= a " ")
      (read-string a))
    (catch Exception e
      nil)))

(?<- (stdout) [?timestamp-out ?highest-value]
     (info-tap ?timestamp ?BTT367)
     (convert-to-float ?BTT367 :> ?converted-BTT367)
     (convert-to-long ?timestamp :> ?converted-timestamp)
     (>= ?converted-timestamp start-value)
     (<= ?converted-timestamp end-value)
     (:sort ?converted-BTT367) (:reverse true)
     (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out ?highest-value))
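For what it's worth, I am also considering the variant below (an untested sketch of my own; max-value-query is just a name I made up). My understanding is that :sort plus c/limit buffers all tuples in a single reducer, while c/max is a parallel aggregator that can combine map-side; it only yields the value, though, so recovering the matching timestamp would need a further join:

;; Untested sketch: c/max (cascalog.ops) aggregates in parallel, so it
;; should not force everything through one reducer the way
;; :sort + c/limit does. This only produces the highest value;
;; joining back to recover ?timestamp-out would be a second step.
(def max-value-query
  (<- [?highest-value]
      (info-tap ?timestamp ?BTT367)
      (convert-to-float ?BTT367 :> ?converted-BTT367)
      (convert-to-long ?timestamp :> ?converted-timestamp)
      (>= ?converted-timestamp start-value)
      (<= ?converted-timestamp end-value)
      (c/max ?converted-BTT367 :> ?highest-value)))

(?<- (stdout) [?highest-value] (max-value-query ?highest-value))

Would that change the single-reducer behaviour, or is the global aggregation itself the limit here?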


Regards,
Sindhu

On 04 Aug 2014, at 19:10, Felix Chern <[email protected]> wrote:

> The mapper and reducer numbers really depend on what your program is trying 
> to do. Without your actual query, it's really difficult to tell why you are 
> having this problem.
> 
> For example, if you try to perform a global sum or count, Cascalog will 
> only use one reducer, since that is the only way to do a global sum/count. To 
> avoid this behavior you can set an output key that splits the work across 
> reducers; e.g., the word count example uses the word as the output key. With 
> the word count output you can then sum it up serially, or run the global map 
> reduce job on this much smaller input.
> 
> The mapper number is usually not a performance bottleneck. If you're curious: 
> when the file is splittable (i.e., unzipped text or a sequence file), the 
> number of mappers is controlled by the split size in the configuration. The 
> smaller the split size, the more mappers are queued.
> 
> In short, your problem is likely not a configuration problem, but a 
> misunderstanding of the MapReduce logic. To solve it, can you paste your 
> Cascalog query and let people take a look?
> 
> Felix
> 
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <[email protected]> wrote:
> 
>> 
>> I am not coding in MapReduce. I am running my Cascalog queries on a Hadoop 
>> cluster (1 node) on data of size 280 MB, so all the config settings have to 
>> be made on the Hadoop cluster itself.
>> As you said, I set mapred.tasktracker.map.tasks.maximum = 4 
>> and mapred.tasktracker.reduce.tasks.maximum = 4, 
>> and then kept tuning them up and down like below: 
>> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
>> 
>> But the performance remains the same every time.
>> Whatever combination of mapred.tasktracker.map.tasks.maximum and 
>> mapred.tasktracker.reduce.tasks.maximum I use, the execution time stays 
>> the same.
>> 
>> When the above failed, I also tried mapred.reduce.tasks = 4; 
>> still the results are the same, with no reduction in execution time.
>> 
>> What other things should I set? Also, I made sure Hadoop was restarted 
>> every time after changing the config.
>> I have attached my conf folder; please indicate what should be added 
>> where.
>> I am really stuck. Your help would be much appreciated. Thank you.
>> <(singlenodecuda)conf.zip>
>> 
>> Regards,
>> Sindhu
> 
