Hi guys,

I've been trying to figure out whether a map-side join using the join-package does anything clever regarding data locality with respect to at least one of the partitions to join. To be more specific, if I want to join two datasets and some partition of dataset A is larger than the corresponding partition of dataset B, does Hadoop account for this and try to ensure that the map task is executed on the datanode storing the bigger partition thus reducing data transfer (if the other partition does not happen to be located on that same datanode)? I couldn't conclude the one or the other behavior from the source code and I couldn't find any documentation about this detail.

Thanks for clarifying!
Sigurd

Reply via email to