We are trying to run a distcp workflow action with the "-update" flag. The action copies about 5 TB of data within the cluster. It keeps timing out on subsequent runs (though not on the first run), and the exception shown is:

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
    at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:750)
    at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:711)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:553)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:53)
    at org.apache.hadoop.tools.DistCp.sameFile(DistCp.java:1245)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1120)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:391)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Intercepting System.exit(-999)
Failing Oozie Launcher, Main class [org.apache.hadoop.tools.DistCp], exit code [-999]


Note that we already pass both the "-i" and "-update" flags.

Oozie client build version: 2.3.2-cdh3u2
Hadoop 0.20.2-cdh3u2

After investigating the code around the exception, we decided to increase dfs.socket.timeout from the default of 60000 ms ("60 * 1000") to 300000 ms. Local tests suggest that this _could_ fix our timeout problem. However, we do not want to change this parameter for the whole cluster, only for this Oozie job. Is there a way to override this parameter only when invoking the job via Oozie?
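
To make the question concrete, here is an untested sketch of the kind of per-job override we have in mind. The stack trace shows DistCp being started through ToolRunner, which parses generic "-D" options before the tool sees its own flags, so passing the property as an argument might be enough. This assumes the action is defined as an Oozie java action whose main class is org.apache.hadoop.tools.DistCp. The action name, the transition targets, and the ${srcDir} and ${destDir} parameters are placeholders, not our real workflow.

```xml
<!-- Sketch only: names, paths, and transitions below are hypothetical placeholders. -->
<action name="distcp-update">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>org.apache.hadoop.tools.DistCp</main-class>
        <!-- Generic option handled by ToolRunner; value is in milliseconds -->
        <arg>-Ddfs.socket.timeout=300000</arg>
        <!-- DistCp's own flags, as we already use them -->
        <arg>-i</arg>
        <arg>-update</arg>
        <arg>${srcDir}</arg>
        <arg>${destDir}</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

We have not verified that a value set this way actually reaches the DFS client used for the checksum comparison inside the launcher, so we would appreciate confirmation or a better approach.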

Thanks,
Badri
