Take a look at InputSplit: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/InputSplit.java#InputSplit.getLocations%28%29
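The relevant contract there is tiny; paraphrasing the linked 0.20.2 source:

    public abstract class InputSplit {
      /** Size of the split, so the framework can sort splits by size. */
      public abstract long getLength()
          throws IOException, InterruptedException;

      /** Node names where the split's data would be local; the scheduler
       *  uses these purely as placement hints. */
      public abstract String[] getLocations()
          throws IOException, InterruptedException;
    }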
Then take a look at how TableSplit is implemented (the getLocations() method in particular): http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.5/org/apache/hadoop/hbase/mapreduce/TableSplit.java#TableSplit.getLocations%28%29

Also look at the TableInputFormatBase#getSplits method to see how the region locations are populated: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.4/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#TableInputFormatBase.getSplits%28org.apache.hadoop.hbase.mapreduce.JobContext%29

In your case, if you want to run your maps on all available nodes regardless of the fact that only two of those nodes contain your regions, you would implement a custom InputSplit that returns an empty String[] from its getLocations() method, as in the sketch below.
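Something along these lines, as an untested sketch against the 0.90 API; the class names LocationlessTableSplit and AnyNodeTableInputFormat are placeholders I'm making up here, not existing HBase classes:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    /** A TableSplit that reports no preferred hosts, so the scheduler
     *  is free to run the map task on any node with a free slot. */
    public class LocationlessTableSplit extends TableSplit {
      public LocationlessTableSplit() {
        super(); // required for Writable deserialization
      }

      public LocationlessTableSplit(byte[] tableName, byte[] startRow,
          byte[] endRow) {
        super(tableName, startRow, endRow, ""); // no region location hint
      }

      @Override
      public String[] getLocations() {
        return new String[0]; // empty => no data-locality preference
      }
    }

    /** An input format that strips the location hints off the splits
     *  that the stock TableInputFormat computes. */
    class AnyNodeTableInputFormat extends TableInputFormat {
      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException {
        List<InputSplit> rewritten = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(context)) {
          TableSplit ts = (TableSplit) split;
          rewritten.add(new LocationlessTableSplit(
              ts.getTableName(), ts.getStartRow(), ts.getEndRow()));
        }
        return rewritten;
      }
    }

You would then point your job at it with job.setInputFormatClass(AnyNodeTableInputFormat.class). Keep in mind the maps still scan the same two region servers over the network; you only gain if your maps are CPU-bound rather than I/O-bound.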
--Suraj

On Mon, Apr 9, 2012 at 1:29 AM, arnaud but <[email protected]> wrote:
> OK, thanks.
>
>> Yes - if you do a custom split, and have sufficient map slots in your
>> cluster
>
> If I understand correctly, even if the rows are stored on only two nodes
> of my cluster, I can distribute the map tasks to the other nodes?
>
> For example: I have 10 nodes in the cluster and a custom split that
> splits every 100 rows. All the rows are stored on only two nodes, and my
> map/reduce job generates 10 map tasks because I have 1000 rows.
> Will all the nodes receive a map task to execute, or only the two nodes
> where the 1000 rows are stored?
>
>> you can parallelize the map tasks to run on other nodes as
>> well
>
> How can I do that? I do not see how to say programmatically that this
> split will be executed on that node.
>
> On 08/04/2012 18:37, Suraj Varma wrote:
>
>>> if I do a custom input format that splits the table every 100 rows,
>>> can I distribute each part manually to a node, regardless of where
>>> the data is?
>>
>> Yes - if you do a custom split, and have sufficient map slots in your
>> cluster, you can parallelize the map tasks to run on other nodes as
>> well. But if you are using HBase as the sink / source, these map tasks
>> will still reach back to the region server node holding that row. So,
>> if you have all your rows on two nodes, all the map tasks will still
>> reach out to those two nodes. Depending on what your map tasks are
>> doing (intensive crunching vs. I/O), this may or may not help with
>> what you are doing.
>> --Suraj
>>
>> On Thu, Apr 5, 2012 at 6:44 AM, Arnaud Le-roy <[email protected]> wrote:
>>> Yes, I know, but it's just an example; we could make the same example
>>> with one billion rows, although in that case you could tell me the
>>> rows would be stored on all the nodes.
>>>
>>> Maybe it's not possible to distribute the tasks manually across the
>>> cluster, and maybe it's not a good idea, but I would like to know, in
>>> order to design the best schema for my data.
>>>
>>> On 5 April 2012 at 15:08, Doug Meil <[email protected]> wrote:
>>>> If you only have 1000 rows, why use MapReduce?
>>>>
>>>> On 4/5/12 6:37 AM, "Arnaud Le-roy" <[email protected]> wrote:
>>>>> But do you think I can change the default behavior?
>>>>>
>>>>> For example, I have ten nodes in my cluster and my table is stored
>>>>> on only two of them; the table has 1000 rows.
>>>>> With the default behavior, only two nodes will work on a map/reduce
>>>>> job, won't they?
>>>>>
>>>>> If I do a custom input format that splits the table every 100 rows,
>>>>> can I distribute each part manually to a node, regardless of where
>>>>> the data is?
>>>>>
>>>>> On 5 April 2012 at 00:36, Doug Meil <[email protected]> wrote:
>>>>>> The default behavior is that the input splits are where the data
>>>>>> is stored.
>>>>>>
>>>>>> On 4/4/12 5:24 PM, "sdnetwork" <[email protected]> wrote:
>>>>>>> OK, thanks.
>>>>>>>
>>>>>>> But I can't find any information telling me how the result of the
>>>>>>> split is distributed across the different nodes of the cluster:
>>>>>>>
>>>>>>> 1) randomly?
>>>>>>> 2) where the data is stored?
