This could be more of a Hadoop question/issue, but I have a question about 
distributed processing in CVB clustering.


Previously I created a derivative rowid program to generate multiple “matrix” 
files (i.e., one for each input “part” file generated by seq2sparse).  For my 
testing, the new rowid generates 3 “matrix” files, matrix-0, matrix-1, and 
matrix-2.


When running CVB against these multiple “matrix” files I am getting (possibly) 
odd behavior.  I am running on a 3 node cluster and noticed, as expected, the 3 
matrix files are copied/reside on 3 separate nodes (3 input split locations).


But when I run CVB, specifying the HDFS folder containing the matrix 
files as input, it seems to run all 3 mappers on a single node in each 
iteration.  For the first iteration of CVB, the 3 mappers ran on the machine I 
submitted the job from (our namenode machine); for the second iteration a 
different node was selected to run the 3 mappers; for iteration 3, a different 
node was selected again, etc.


Each node in our cluster is quite high-end and very underutilized, so I’m 
wondering if Hadoop is running the mappers on the same machine because there 
are lots of available cores?

Thanks, Dan
