I'm having an issue in client code when multiple clusters with HA namenodes are
involved. Example setup using Hadoop 2.3.0:
Cluster A has the following properties defined in its core-site.xml, hdfs-site.xml, etc.:
dfs.nameservices=clusterA
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=
dfs.namenode.http-address.clusterA.nn1=
dfs.namenode.rpc-address.clusterA.nn2=
dfs.namenode.http-address.clusterA.nn2=
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
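The address values are left blank above; for illustration, filled in they would
look something like this (the hostnames are hypothetical; 8020 and 50070 are the
Hadoop 2.x default RPC and HTTP ports):
dfs.namenode.rpc-address.clusterA.nn1=nnA1.example.com:8020
dfs.namenode.http-address.clusterA.nn1=nnA1.example.com:50070
dfs.namenode.rpc-address.clusterA.nn2=nnA2.example.com:8020
dfs.namenode.http-address.clusterA.nn2=nnA2.example.com:50070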
Cluster B has similar properties defined in its core-site.xml, hdfs-site.xml,
etc.
Now, I want to be able to distcp from clusterA to clusterB (an example
invocation is sketched after the list below). Regardless of which cluster I
execute the command from, neither cluster's configuration has all of the
information needed. Looking at DFSClient and DataNode:
- if I put both clusterA and clusterB into dfs.nameservices, then the
datanodes will try to federate the blocks from both nameservices.
- if I don't put both clusterA and clusterB into dfs.nameservices, then the
client won't know how to resolve the namenodes for both nameservices
referenced in the distcp command.
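For reference, the invocation I'm trying to run looks something like this
(paths hypothetical); note that both URIs use the logical nameservice names,
which is exactly why the client has to be able to resolve both:
hadoop distcp hdfs://clusterA/user/me/src hdfs://clusterB/user/me/dest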
I'm wondering if I am missing a property or something that will allow me to
define both nameservices on both clusters and have the datanodes on each
cluster *not* try to federate. Looking at DataNode, it appears that it tries to
connect to every namenode defined, and the first one that sets the clusterid
wins. It seems there should be a dfs.datanode.clusterid property that the
datanode uses. That would line up with the 'hdfs namenode -format -clusterId
<cluster_id>' command used when you have multiple nameservices. Am I missing
something in the configuration that will allow me to do what I want? To get
distcp to work, I had to create a third set of configuration files just for the
client to use (sketched below).
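That client-only configuration is essentially a merge of the two clusters' HA
client settings; roughly (hostnames hypothetical, http-address and other
properties elided):
dfs.nameservices=clusterA,clusterB
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.ha.namenodes.clusterB=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=nnA1.example.com:8020
dfs.namenode.rpc-address.clusterA.nn2=nnA2.example.com:8020
dfs.namenode.rpc-address.clusterB.nn1=nnB1.example.com:8020
dfs.namenode.rpc-address.clusterB.nn2=nnB2.example.com:8020
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.client.failover.proxy.provider.clusterB=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
The job is then launched with that directory instead of the cluster's own,
e.g. 'hadoop --config /path/to/client-conf distcp ...'. Since only the client
reads this directory and no datanode ever does, listing both services in
dfs.nameservices doesn't trigger the federation behavior described above.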