We have noticed that, once in a while, some Oozie workflow actions (part of
coordinator jobs) take a long time to transition.
The Oozie logs (Oozie version 4.1.0) show the following exception, whose
root cause is a java.io.EOFException:
2016-06-30 07:51:12,830 WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0042217-160627222917756-oozie-oozi-W] ACTION[0042217-160627222917756-oozie-oozi-W@pin-translation_wf] Exception while executing check(). Error Code [JA009], Message[JA009: null]
org.apache.oozie.action.ActionExecutorException: JA009: null
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:418)
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:396)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1296)
    at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:181)
    at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:55)
    at org.apache.oozie.command.XCommand.call(XCommand.java:281)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1776)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:270)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:264)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper.getActionData(LauncherMapperHelper.java:264)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1207)
    ... 7 more
This triggered a retry:
2016-06-30 07:51:12,836 INFO ActionCheckXCommand:541 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0042217-160627222917756-oozie-oozi-W] ACTION[0042217-160627222917756-oozie-oozi-W@pin-translation_wf] Next Retry, Attempt Number [1] in [60,000] milliseconds
In the end, the retries maxed out and the workflow job was suspended:
2016-06-30 07:54:13,209 WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0042217-160627222917756-oozie-oozi-W] ACTION[0042217-160627222917756-oozie-oozi-W@pin-translation_wf] Suspending Workflow Job id=0042217-160627222917756-oozie-oozi-W
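One observation, in case it helps narrow this down: the innermost frames of
the stack trace show DataInputStream.readFully failing inside
SequenceFile$Reader.init, which is what I would expect if the file read by
LauncherMapperHelper.getActionData were empty or truncated at the moment of
the check. That is only a guess on my part. The sketch below is plain Java,
not Oozie or Hadoop code: the class and method names are made up, and the
4-byte header is just an illustrative stand-in for whatever the real reader
expects at the start of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical helper, for illustration only (not part of Oozie or Hadoop).
class EmptyHeaderRead {

    // Tries to read a fixed-size header from the given bytes.
    // Returns true if the data ended before the header was complete.
    static boolean headerReadHitsEof(byte[] fileContents, int headerLength)
            throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(fileContents))) {
            byte[] header = new byte[headerLength];
            // readFully throws EOFException if fewer than headerLength
            // bytes are available before end of stream.
            in.readFully(header);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Empty "file": the header cannot be read.
        System.out.println(headerReadHitsEof(new byte[0], 4));
        // Four bytes present (illustrative values): the header read succeeds.
        System.out.println(headerReadHitsEof(new byte[] {'S', 'E', 'Q', 6}, 4));
    }
}
```

If that is what is happening here, the check may be running against a file
that has not been fully written yet, but I have not been able to confirm
this.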
Any idea what is going wrong here?
Thanks,
Shanzhong