All,
I'm trying to understand how the current FileInputFormat (new API,
mapreduce) implements locality. As far as I can tell, it calculates
splits in getSplits, and each split records only the node(s) hosting the
first block of data in that split. Is my understanding correct?
Looking at the FileInputFormat for the old API (mapred), it appears to
do more to implement locality, using getSplitHosts to "return the
hosts that contribute most for a given split".
If my understanding is correct, why was this changed?
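
To make the difference concrete, here is a rough sketch of the two
behaviours as I understand them. This is not the Hadoop code: Block,
firstBlockHosts and topContributingHosts are names I made up for
illustration, and the second method ignores rack topology, which I
believe the real getSplitHosts also takes into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitLocalitySketch {

    /** Hypothetical stand-in for a block location: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** New-API behaviour as I read it: hosts of the block containing the split's start. */
    static List<String> firstBlockHosts(List<Block> blocks, long splitStart) {
        for (Block b : blocks) {
            if (splitStart >= b.offset && splitStart < b.offset + b.length) {
                return b.hosts;
            }
        }
        throw new IllegalArgumentException("offset outside file: " + splitStart);
    }

    /** Old-API idea, simplified: rank hosts by bytes of the split they hold, keep the top n. */
    static List<String> topContributingHosts(List<Block> blocks, long splitStart,
                                             long splitLength, int n) {
        long splitEnd = splitStart + splitLength;
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block b : blocks) {
            // Bytes of this block that fall inside the split.
            long overlap = Math.min(splitEnd, b.offset + b.length)
                         - Math.max(splitStart, b.offset);
            if (overlap <= 0) {
                continue;
            }
            for (String host : b.hosts) {
                bytesPerHost.merge(host, overlap, Long::sum);
            }
        }
        List<String> ranked = new ArrayList<>(bytesPerHost.keySet());
        // Most bytes first; ties broken by host name so the result is deterministic.
        ranked.sort((x, y) -> {
            int byBytes = Long.compare(bytesPerHost.get(y), bytesPerHost.get(x));
            return byBytes != 0 ? byBytes : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

For a split that starts near the end of one block and mostly covers the
next, the first method returns the hosts of the small leading piece,
while the second returns the hosts holding most of the data.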
Thanks,
Brian