We are trying to run a DistCp workflow action with the "-update" flag.
The action copies around 5 TB of data within the cluster. It completes
on the first run but keeps timing out on subsequent runs, with the
following exception:
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
    at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:750)
    at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:711)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:553)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:53)
    at org.apache.hadoop.tools.DistCp.sameFile(DistCp.java:1245)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1120)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:391)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Intercepting System.exit(-999)
Failing Oozie Launcher, Main class [org.apache.hadoop.tools.DistCp], exit code [-999]
Note that we already pass both the "-i" and "-update" flags.
Oozie client build version: 2.3.2-cdh3u2
Hadoop 0.20.2-cdh3u2
After investigating the code around the exception, we decided to
increase dfs.socket.timeout from its default of 60000 ms ("60 * 1000")
to 300000 ms. Local tests suggest that this _could_ fix our timeout
problem. However, we do not want to change this parameter for the whole
cluster, only for this Oozie job. Is there a way to override it only
when the job is invoked via Oozie?
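
To make the question concrete, here is a rough sketch of the kind of per-action override we have in mind. It assumes the action is a java action with DistCp as the main class (as the LauncherMapper frames in the trace suggest), and that DistCp's use of ToolRunner means a leading generic "-D" option would be parsed into the job configuration. We have not verified that this works on our versions. The action name, transition targets, and the ${...} parameters are hypothetical placeholders.

```xml
<!-- Sketch only: names, transitions, and ${...} parameters are hypothetical. -->
<action name="distcp-update">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>org.apache.hadoop.tools.DistCp</main-class>
        <!-- Generic option placed first so it is parsed before DistCp's own flags;
             raises the socket timeout to 300000 ms for this action only. -->
        <arg>-Ddfs.socket.timeout=300000</arg>
        <arg>-i</arg>
        <arg>-update</arg>
        <arg>${srcDir}</arg>
        <arg>${destDir}</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

If something like this is the right approach, the cluster-wide hdfs-site.xml would stay untouched and only this action would see the longer timeout.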
Thanks,
Badri