Distributed processing and CVB clustering

DAN HELM Mon, 04 Jun 2012 08:52:31 -0700

This may be more of a Hadoop question/issue, but I have a question about 
distributed processing in CVB clustering.



Previously I created a derivative of the rowid program that generates multiple 
“matrix” files (i.e., one for each input “part” file generated by seq2sparse).  
In my testing, the new rowid generates 3 “matrix” files: matrix-0, matrix-1, 
and matrix-2.


When running CVB against these multiple “matrix” files I am seeing (possibly) 
odd behavior.  I am running on a 3-node cluster and noticed that, as expected, 
the 3 matrix files reside on 3 separate nodes (3 input split locations).


But when running CVB, where I specify the HDFS folder containing the matrix 
files as input, it seems to run all 3 mappers on a single node in each 
iteration.  For the first iteration of CVB, the 3 mappers ran on the machine I 
submitted the job from (our namenode machine).  For the second iteration, a 
different node was selected to run the 3 mappers; for the third iteration, a 
different node was selected again, and so on.


Each node in our cluster is quite high-end and very underutilized, so I’m 
wondering whether Hadoop is running all the mappers on the same machine because 
it has lots of available cores?

Thanks, Dan
