Take a look at InputSplit: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/InputSplit.java#InputSplit.getLocations%28%29
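The relevant contract there is tiny; paraphrasing the linked 0.20.2 source:

    public abstract class InputSplit {
      /** Size of the split, so the framework can sort splits by size. */
      public abstract long getLength()
          throws IOException, InterruptedException;

      /** Node names where the split's data would be local; the scheduler
       *  uses these purely as placement hints. */
      public abstract String[] getLocations()
          throws IOException, InterruptedException;
    }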
Then take a look at how TableSplit is implemented (the getLocations() method in particular): http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.5/org/apache/hadoop/hbase/mapreduce/TableSplit.java#TableSplit.getLocations%28%29

Also look at the TableInputFormatBase#getSplits method to see how the region locations are populated: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.4/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#TableInputFormatBase.getSplits%28org.apache.hadoop.hbase.mapreduce.JobContext%29

In your case, if you want to run your maps on all available nodes regardless of the fact that only two of those nodes contain your regions, you would implement a custom InputSplit that returns an empty String[] from its getLocations() method, as in the sketch below.
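Something along these lines, as an untested sketch against the 0.90 API; the class names LocationlessTableSplit and AnyNodeTableInputFormat are placeholders I'm making up here, not existing HBase classes:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    /** A TableSplit that reports no preferred hosts, so the scheduler
     *  is free to run the map task on any node with a free slot. */
    public class LocationlessTableSplit extends TableSplit {
      public LocationlessTableSplit() {
        super(); // required for Writable deserialization
      }

      public LocationlessTableSplit(byte[] tableName, byte[] startRow,
          byte[] endRow) {
        super(tableName, startRow, endRow, ""); // no region location hint
      }

      @Override
      public String[] getLocations() {
        return new String[0]; // empty => no data-locality preference
      }
    }

    /** An input format that strips the location hints off the splits
     *  that the stock TableInputFormat computes. */
    class AnyNodeTableInputFormat extends TableInputFormat {
      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException {
        List<InputSplit> rewritten = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(context)) {
          TableSplit ts = (TableSplit) split;
          rewritten.add(new LocationlessTableSplit(
              ts.getTableName(), ts.getStartRow(), ts.getEndRow()));
        }
        return rewritten;
      }
    }

You would then point your job at it with job.setInputFormatClass(AnyNodeTableInputFormat.class). Keep in mind the maps still scan the same two region servers over the network; you only gain if your maps are CPU-bound rather than I/O-bound.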
--Suraj

On Mon, Apr 9, 2012 at 1:29 AM, arnaud but <[email protected]> wrote:
> OK, thanks.
>
>> Yes - if you do a custom split, and have sufficient map slots in your
>> cluster
>
> If I understand correctly, even if the rows are stored on only two nodes
> of my cluster, I can distribute the map tasks to the other nodes?
>
> For example: I have 10 nodes in the cluster and a custom split that
> splits every 100 rows. All the rows are stored on only two nodes, and my
> map/reduce job generates 10 map tasks because I have 1000 rows.
> Will all the nodes receive a map task to execute, or only the two nodes
> where the 1000 rows are stored?
>
>> you can parallelize the map tasks to run on other nodes as
>> well
>
> How can I do that? I do not see how to say programmatically that this
> split will be executed on that node.
>
> On 08/04/2012 18:37, Suraj Varma wrote:
>
>>> if I do a custom input format that splits the table every 100 rows,
>>> can I distribute each part manually to a node, regardless of where
>>> the data is?
>>
>> Yes - if you do a custom split, and have sufficient map slots in your
>> cluster, you can parallelize the map tasks to run on other nodes as
>> well. But if you are using HBase as the sink / source, these map tasks
>> will still reach back to the region server node holding that row. So,
>> if you have all your rows on two nodes, all the map tasks will still
>> reach out to those two nodes. Depending on what your map tasks are
>> doing (intensive crunching vs. I/O), this may or may not help with
>> what you are doing.
>> --Suraj
>>
>> On Thu, Apr 5, 2012 at 6:44 AM, Arnaud Le-roy <[email protected]> wrote:
>>> Yes, I know, but it's just an example; we could make the same example
>>> with one billion rows, although in that case you could tell me the
>>> rows would be stored on all the nodes.
>>>
>>> Maybe it's not possible to distribute the tasks manually across the
>>> cluster, and maybe it's not a good idea, but I would like to know, in
>>> order to design the best schema for my data.
>>>
>>> On 5 April 2012 at 15:08, Doug Meil <[email protected]> wrote:
>>>> If you only have 1000 rows, why use MapReduce?
>>>>
>>>> On 4/5/12 6:37 AM, "Arnaud Le-roy" <[email protected]> wrote:
>>>>> But do you think I can change the default behavior?
>>>>>
>>>>> For example, I have ten nodes in my cluster and my table is stored
>>>>> on only two of them; the table has 1000 rows.
>>>>> With the default behavior, only two nodes will work on a map/reduce
>>>>> job, won't they?
>>>>>
>>>>> If I do a custom input format that splits the table every 100 rows,
>>>>> can I distribute each part manually to a node, regardless of where
>>>>> the data is?
>>>>>
>>>>> On 5 April 2012 at 00:36, Doug Meil <[email protected]> wrote:
>>>>>> The default behavior is that the input splits are where the data
>>>>>> is stored.
>>>>>>
>>>>>> On 4/4/12 5:24 PM, "sdnetwork" <[email protected]> wrote:
>>>>>>> OK, thanks.
>>>>>>>
>>>>>>> But I can't find any information telling me how the result of the
>>>>>>> split is distributed across the different nodes of the cluster:
>>>>>>>
>>>>>>> 1) randomly?
>>>>>>> 2) where the data is stored?
