Hi, is it possible to limit the number of map slots used for the load function? For example, I have five nodes with 10 map slots in total (each node has two slots per CPU), and I want only one map task per node. Is there a way to set this only for the load function? I know there is an option called "mapred.tasktracker.map.tasks.maximum", but that would influence every MapReduce job; I want to change the number only for this specific job.
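For reference, this is the cluster-wide setting I mean; as far as I know it lives in mapred-site.xml on every TaskTracker, so lowering it would throttle all jobs on that node, not just mine:

<!-- mapred-site.xml on each TaskTracker: caps the node's map slots,
     which affects every MapReduce job, not just this one. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>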
My use case is the following: I'm using a modified version of the HBaseStorage function. I try to load, for example, 10 different rowkeys with very large column sizes and join them afterwards. Since the columns all share the same column family, each row can be stored on a different server; for example, rowkeys 1-5 are stored on node1 and the other rowkeys on the other nodes. So if I create a Pig script that loads the 10 keys and joins them afterwards (roughly like the sketch below), it ends up as one MapReduce job with 10 map tasks and some reduce tasks (depending on the parallel factor). The problem is that two map tasks are created on node1, because two slots are available there, which means both tasks are simultaneously reading a large number of columns from the same local hard drive. Maybe I'm wrong, but shouldn't that be a performance issue? Wouldn't it be faster to read the rowkeys one after another?
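To make the shape of the script concrete, here is a rough sketch; the table name, the loader's second argument (a single rowkey, which is the part my modification adds), and the join field are placeholders, not the real HBaseStorage API:

-- Sketch only: MyHBaseStorage stands in for my modified HBaseStorage.
r1 = LOAD 'hbase://mytable' USING MyHBaseStorage('cf:*', 'rowkey1') AS (key:chararray, cols:map[]);
r2 = LOAD 'hbase://mytable' USING MyHBaseStorage('cf:*', 'rowkey2') AS (key:chararray, cols:map[]);
-- ... r3 through r10 load rowkey3 .. rowkey10 the same way ...
joined = JOIN r1 BY cols#'joinfield', r2 BY cols#'joinfield' PARALLEL 4;  -- reducer count comes from the parallel factor
STORE joined INTO 'joined_output';

kind regards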