I am running an Oozie coordinator job (frequency: 15 minutes) that occasionally misses its SLA because of the following error:

2016-07-03 05:49:35,377 WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.xx.yy.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0096164-160627222917756-oozie-oozi-W] ACTION[0096164-160627222917756-oozie-oozi-W@pin-translation_wf] Exception while executing check(). Error Code [JA009], Message[JA009: null]

org.apache.oozie.action.ActionExecutorException: JA009: null
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:418)
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:396)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1296)
    at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:181)
    at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:55)
    at org.apache.oozie.command.XCommand.call(XCommand.java:281)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1776)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:270)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper$1.run(LauncherMapperHelper.java:264)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.oozie.action.hadoop.LauncherMapperHelper.getActionData(LauncherMapperHelper.java:264)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1207)
    ... 7 more

The workflow action is a Java action that internally generates MR jobs. Oozie version: 4.1.0. Hadoop version: 2.7.0.

My understanding is that JavaActionExecutor failed to open the sequence file action-data.seq (EOFException), but I don't know why this error occurs. The error causes Oozie to retry a few times, and eventually the workflow job is suspended:

2016-07-03 05:49:35,377 INFO ActionCheckXCommand:541 - SERVER[node75-144.prod-aws.xx.yy.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0096164-160627222917756-oozie-oozi-W] ACTION[0096164-160627222917756-oozie-oozi-W@pin-translation_wf] Next Retry, Attempt Number [1] in [60,000] milliseconds

2016-07-03 05:50:35,496 WARN JavaActionExecutor:544 - SERVER[node75-144.prod-aws.xx.yy.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0096164-160627222917756-oozie-oozi-W] ACTION[0096164-160627222917756-oozie-oozi-W@pin-translation_wf] Exception in check(). Message[null]
....

2016-07-03 05:52:35,785 WARN ActionCheckXCommand:544 - SERVER[node75-144.prod-aws.xx.yy.com] USER[hadoop] GROUP[-] TOKEN[] APP[PIN-Translation] JOB[0096164-160627222917756-oozie-oozi-W] ACTION[0096164-160627222917756-oozie-oozi-W@pin-translation_wf] Suspending Workflow Job id=0096164-160627222917756-oozie-oozi-W

However, Oozie automatically re-executes the workflow, and when it does, the action succeeds.

Can someone share any insight into this issue?

Thanks,
Shanzhong
