Once in a while, we notice that some Oozie workflow actions (part of
coordinator jobs) take a long time to transition.

We found the following exception in the Oozie logs (Oozie version: 4.1.0):

2016-06-30 07:51:12,830  WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0042217-160627222917756-oozie-oozi-W] ACTION[0042217-160627222917756-oozie-oozi-W@pin-translation_wf] Exception while executing check(). Error Code [JA009], Message[JA009: null]

org.apache.oozie.action.ActionExecutorException: JA009: null
        at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:418)
        at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:396)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1296)
        at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:181)
        at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:55)
        at org.apache.oozie.command.XCommand.call(XCommand.java:281)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: *java.io.EOFException*
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1776)
        at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:270)
        at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:264)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.oozie.action.hadoop.LauncherMapperHelper.getActionData(LauncherMapperHelper.java:264)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1207)
        ... 7 more
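
For reference, the EOFException at the bottom of the trace is what
DataInputStream.readFully throws when the stream ends before the requested
number of bytes has been read. That would be consistent with (though it does
not prove) the action data file being empty or truncated at the moment
check() tried to open it. Below is a minimal, stdlib-only sketch of that
behaviour. It is not Oozie or Hadoop code: the class name, the method name,
and the assumed 4-byte header ('S', 'E', 'Q' plus a version byte) are
illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of Oozie or Hadoop.
public class ReadFullyEofSketch {

    // Returns true if headerLen bytes can be read from fileBytes,
    // false if the data ends first (readFully throws EOFException).
    public static boolean headerReadable(byte[] fileBytes, int headerLen) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            in.readFully(new byte[headerLen]);
            return true;
        } catch (EOFException e) {
            // Same exception type as the "Caused by" in the trace above.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed header layout: 'S', 'E', 'Q' followed by a version byte.
        byte[] completeHeader = {'S', 'E', 'Q', 6};
        System.out.println(headerReadable(completeHeader, 4)); // prints "true"
        System.out.println(headerReadable(new byte[0], 4));    // prints "false"
    }
}
```

If this is what is happening, checking the size of the action data file
under the action's HDFS directory around the time of the failure might help
confirm it.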

This triggered a retry:

2016-06-30 07:51:12,836  INFO ActionCheckXCommand:541 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[*0042217-160627222917756-oozie-oozi-W*] ACTION[*0042217-160627222917756-oozie-oozi-W*@pin-translation_wf] Next Retry, Attempt Number [1] in [60,000] milliseconds

In the end, the retries maxed out and the workflow job was suspended:

2016-06-30 07:54:13,209  WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.eadpdata.ea.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[*0042217-160627222917756-oozie-oozi-W*] ACTION[*0042217-160627222917756-oozie-oozi-W*@pin-translation_wf] Suspending Workflow Job id=*0042217-160627222917756-oozie-oozi-W*


Any idea what is going wrong here?

Thanks,

Shanzhong
