ok thanks,
> Yes - if you do a custom split, and have sufficient map slots in your
> cluster
If I understand correctly, even if the rows are stored on only two
nodes of my cluster, I can distribute the "map tasks" to the other
nodes?
e.g.
I have 10 nodes in the cluster and I made a custom split that splits
every 100 rows.
All the rows are stored on only two nodes, and my map/reduce job
generates 10 map tasks because I have 1000 rows.
Will all the nodes receive a map task to execute, or only the two
nodes where the 1000 rows are stored?
> you can parallelize the map tasks to run on other nodes as
> well
How can I do that? I do not see how I can say, programmatically, that
this split will be executed on this node.
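
(As far as I can tell, the closest thing to saying it programmatically
is overriding InputSplit.getLocations(), which the scheduler treats as
a placement hint, not a guarantee. A rough, untested sketch against
the org.apache.hadoop.mapreduce API; the class and field names here
are made up for illustration:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split covering rows [startRow, startRow + numRows).
// getLocations() is only a locality *hint* to the scheduler; it does
// not pin the map task to a node.
public class RowRangeSplit extends InputSplit implements Writable {
    private long startRow;
    private long numRows;
    private String[] hosts; // preferred nodes, e.g. {"node7"}

    public RowRangeSplit() {} // required for deserialization

    public RowRangeSplit(long startRow, long numRows, String[] hosts) {
        this.startRow = startRow;
        this.numRows = numRows;
        this.hosts = hosts;
    }

    @Override
    public long getLength() { return numRows; }

    @Override
    public String[] getLocations() { return hosts; } // hint only

    public void write(DataOutput out) throws IOException {
        out.writeLong(startRow);
        out.writeLong(numRows);
        out.writeInt(hosts.length);
        for (String h : hosts) out.writeUTF(h);
    }

    public void readFields(DataInput in) throws IOException {
        startRow = in.readLong();
        numRows = in.readLong();
        hosts = new String[in.readInt()];
        for (int i = 0; i < hosts.length; i++) hosts[i] = in.readUTF();
    }
}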
On 08/04/2012 18:37, Suraj Varma wrote:
If I do a custom input that splits the table by 100 rows, can I
manually distribute each part to a node regardless of where the data
is?
Yes - if you do a custom split, and have sufficient map slots in your
cluster, you can parallelize the map tasks to run on other nodes as
well. But if you are using HBase as the sink/source, these map tasks
will still reach back to the region server node holding that row. So
if you have all your rows on two nodes, all the map tasks will still
reach out to those two nodes. Depending on what your map tasks are
doing (intensive crunching vs. I/O), this may or may not help with
what you are doing.
--Suraj
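
(To make that concrete, a rough, untested sketch of an InputFormat
whose getSplits() emits one split per 100 rows with an empty host
list, so the scheduler is free to place each map task on any node
with a free map slot. It assumes the RowRangeSplit sketched above and
a made-up "total.rows" job property; the record reader is omitted:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RowRangeInputFormat extends InputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        long totalRows = context.getConfiguration().getLong("total.rows", 0);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (long start = 0; start < totalRows; start += 100) {
            long rows = Math.min(100, totalRows - start);
            // Empty host array = no locality preference: any node may
            // run this split's map task.
            splits.add(new RowRangeSplit(start, rows, new String[0]));
        }
        return splits; // 1000 rows -> 10 splits -> 10 map tasks
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // A real reader would scan the split's row range (e.g. via an
        // HBase Scan); omitted here for brevity.
        throw new UnsupportedOperationException("reader omitted in sketch");
    }
}

Each of those map tasks would still fetch its rows over the network
from the two region servers holding them, as noted above.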
On Thu, Apr 5, 2012 at 6:44 AM, Arnaud Le-roy<[email protected]> wrote:
Yes, I know, but it's just an example; we could do the same example
with one billion rows, though in that case you could point out that
the rows would be stored on all the nodes.
Maybe it's not possible to distribute the tasks manually across the
cluster?
And maybe it's not a good idea, but I would like to know, in order to
design the best schema for my data.
On 5 April 2012 15:08, Doug Meil<[email protected]> wrote:
If you only have 1000 rows, why use MapReduce?
On 4/5/12 6:37 AM, "Arnaud Le-roy"<[email protected]> wrote:
But do you think I can change the default behavior?
For example: I have ten nodes in my cluster, and my table, which has
1000 rows, is stored on only two of them.
With the default behavior, only two nodes will work on a map/reduce
job, won't they?
If I do a custom input that splits the table by 100 rows, can I
manually distribute each part to a node regardless of where the data
is?
On 5 April 2012 00:36, Doug Meil<[email protected]> wrote:
The default behavior is that the input splits are where the data is
stored.
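
(In terms of the RowRangeSplit sketched earlier, the default is as if
each split's host hint were filled in with the node that actually
stores its rows, e.g.:)

// Hypothetical default-style split: the locality hint points at the
// node holding rows 0-99, so the scheduler tries to run that map
// task there.
InputSplit s = new RowRangeSplit(0, 100, new String[] { "node1.example.com" });

(For HBase's TableInputFormat specifically, that means one split per
region, located at the region server hosting it, as far as I know.)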
On 4/4/12 5:24 PM, "sdnetwork"<[email protected]> wrote:
ok thanks,
but I can't find the information that tells me how the result of the
split is distributed across the different nodes of the cluster:
1) randomly?
2) where the data is stored?