RE: Hadoop processing

Kartashov, Andy Thu, 08 Nov 2012 07:58:06 -0800

Thanks, guys, for your responses. This is exactly what my gut was telling me.
I suspected as much: "So in that case, yes the data is shipped to the node".


As suggested here, I went through some Hadoop practice test questions and came 
across the one below. I didn't think C (the correct answer) was entirely 
correct, so I went with A. :(
How does Hadoop process large volumes of data?


A.  Hadoop uses a lot of machines in parallel. This optimizes data processing.

B.  Hadoop was specifically designed to process large amounts of data by taking 
advantage of MPP hardware.

C.  Hadoop ships the code to the data instead of sending the data to the code.

D.  Hadoop uses sophisticated caching techniques on the NameNode to speed up 
processing of data.

Rgds,
AK47

From: Michael Segel [mailto:[email protected]]
Sent: Thursday, November 08, 2012 10:03 AM
To: [email protected]
Subject: Re: Hadoop processing

To go back to the OP's initial scenario: two new nodes where data hasn't yet 
been 'balanced'.

First, that's a small window of time.

But to answer your question...

The JT will attempt to schedule work to where the data is. If you're using 3X 
replication, there are 3 nodes where the block resides. So you can calculate 
the odds of getting an open slot to process your data local to its location.
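
As a rough illustration of that calculation, here is a minimal sketch. It 
assumes each node has a free map slot independently with probability p_free, 
which is a made-up parameter and not something the JobTracker exposes. Under 
that assumption, the chance that at least one of the replica holders has a 
free slot is 1 - (1 - p_free)^replication.

```python
def local_slot_probability(p_free: float, replication: int = 3) -> float:
    """Chance that at least one replica holder has a free map slot.

    Assumption (hypothetical): every node has a free slot independently
    with probability p_free. Real clusters are not this tidy.
    """
    if not 0.0 <= p_free <= 1.0:
        raise ValueError("p_free must be between 0 and 1")
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # Probability that none of the replica holders is free, then complement.
    return 1.0 - (1.0 - p_free) ** replication


if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5):
        print(f"p_free={p:.2f} -> P(local slot)={local_slot_probability(p):.3f}")
```

Even on a fairly busy cluster the odds improve quickly with replication, 
which is consistent with the point that locality is usually achievable.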

However, if there is an open slot on a node where the data is not located, the 
data will still be processed in that open slot. You lose data locality, and 
that smaller chunk of data is processed on that node. So in that case, yes the 
data is shipped to the node. If you look at your JobTracker web page for the 
results of your job, you will see what share of the work ran with data 
locality. Hadoop is pretty good in that respect.
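
The behaviour described above can be sketched with a toy assignment loop. This 
is not the real JobTracker algorithm (which hands out tasks as TaskTrackers 
report in); the function name, node names, and slot counts are all 
hypothetical. It only shows the idea: prefer a slot on a replica holder, and 
fall back to any open slot otherwise.

```python
from collections import Counter


def assign_tasks(block_locations, free_slots):
    """Toy locality-aware assignment (NOT the real JobTracker logic).

    block_locations: dict of block id -> set of nodes holding a replica
    free_slots: dict of node -> number of free map slots
    Returns a list of (block, node, is_local) tuples.
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    assignments = []
    pending = []
    # First pass: place each block on a replica holder that has a free slot.
    for block, holders in block_locations.items():
        local = next((n for n in sorted(holders) if slots.get(n, 0) > 0), None)
        if local is not None:
            slots[local] -= 1
            assignments.append((block, local, True))
        else:
            pending.append(block)
    # Second pass: leftover blocks take any open slot, so the data is shipped.
    for block in pending:
        remote = next((n for n in sorted(slots) if slots[n] > 0), None)
        if remote is None:
            break  # no capacity left; a real task would simply wait
        slots[remote] -= 1
        assignments.append((block, remote, False))
    return assignments


if __name__ == "__main__":
    # Hypothetical cluster: dn1..dn3 hold every replica, dn4 and dn5 are new.
    blocks = {f"blk_{i}": {"dn1", "dn2", "dn3"} for i in range(8)}
    slots = {f"dn{i}": 2 for i in range(1, 6)}
    result = assign_tasks(blocks, slots)
    local_count = sum(1 for _, _, is_local in result if is_local)
    print(f"data-local: {local_count}/{len(result)}")
    print(Counter(node for _, node, _ in result))
```

In this made-up setup the empty nodes still receive work once the slots on the 
replica holders are used up, which is the case the OP asked about.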


NOTE THE FOLLOWING...
If you know that the processing time is a couple of orders of magnitude longer 
than the time it takes to ship the data to a node, you can override the normal 
characteristic and force the processing to be done remotely. (We've done this 
and there is a paper on this on InfoQ) [We were bored and didn't like the fact 
that our Ganglia maps were not all red. We are evil in that way ;-) ] We really 
don't recommend doing this unless you are either insane or really know what you 
are doing.

HTH

-Mike

On Nov 8, 2012, at 8:49 AM, Jay Vyas <[email protected]> wrote:


Hmm, this is interesting. I think that:

1) For the map phase, Hadoop is smart enough to try to run mappers locally, 
but I think you could force these DNs to actively participate in a map job by 
decreasing the size of the input splits. That would allow for many more 
mappers, some of which would be forced to run on data that is not necessarily 
local; in this scenario, those DNs don't yet have any local files that would 
be used for the input.

2) For the reduce phase: since the reducers copy mapper outputs from all over 
the cluster, one would expect your DataNodes to naturally take part in this 
portion of the job if the num.reducers parameter was specified.
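
To make point 1 concrete, here is a back-of-the-envelope sketch of how the 
split size drives the number of map tasks. It assumes one map task per split 
and a single splittable file, and it ignores the finer rules of the real 
input format; the function name and the sizes are purely illustrative.

```python
def num_map_tasks(file_size_bytes: int, split_size_bytes: int) -> int:
    """Approximate split count (one map task each) for one splittable file.

    Simplification: plain ceiling division, ignoring any rounding rules
    the real input format applies to the last split.
    """
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must not be negative")
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large files.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


if __name__ == "__main__":
    MIB = 1024 ** 2
    one_gib = 1024 * MIB
    for split_mib in (64, 32, 16):
        tasks = num_map_tasks(one_gib, split_mib * MIB)
        print(f"split={split_mib} MiB -> {tasks} map tasks")
```

With more, smaller tasks than there are local slots, some tasks are more 
likely to end up on nodes that hold none of the input.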

On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <[email protected]> wrote:
Hadoopers,
"Hadoop ships the code to the data instead of sending the data to the code."
Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. 
you have not run the balancer.
In view of the statement quoted above, will these two nodes not participate in 
the MapReduce job until you have balanced some data onto them? Please kindly 
elaborate.

Rgds,
AK47
NOTICE: This e-mail message and any attachments are confidential, subject to 
copyright and may be privileged. Any unauthorized use, copying or disclosure is 
prohibited. If you are not the intended recipient, please delete and contact 
the sender immediately. Please consider the environment before printing this 
e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont 
confidentiels, protégés par le droit d'auteur et peuvent être couverts par le 
secret professionnel. Toute utilisation, copie ou divulgation non autorisée est 
interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, 
supprimez-le et contactez immédiatement l'expéditeur. Veuillez penser à 
l'environnement avant d'imprimer le présent courriel



--
Jay Vyas
http://jayunit100.blogspot.com/

