This may be more of a Hadoop question than a Mahout one, but I have a question about distributed processing in CVB clustering.
Previously I created a derivative rowid program to generate multiple “matrix” files (i.e., one for each input “part” file generated by seq2sparse). In my testing, the new rowid generates 3 “matrix” files: matrix-0, matrix-1, and matrix-2. When running CVB against these multiple “matrix” files I am seeing (possibly) odd behavior.

I am running on a 3-node cluster and noticed that, as expected, the 3 matrix files reside on 3 separate nodes (3 input split locations). But when I run CVB, specifying the HDFS folder containing the matrix files as input, it seems to run all 3 mappers on a single node in each iteration. For the first iteration of CVB, the 3 mappers ran on the machine I submitted the job from (our namenode machine); for the second iteration, a different node was selected to run the 3 mappers; for the third iteration, a different node was selected again, and so on.

Each node in our cluster is quite high-end and very underutilized, so I’m wondering whether Hadoop is running the mappers on the same machine because there are lots of available cores.

Thanks,
Dan
