Hi Eduardo,

If your PROD environment has jobs that are more database-intensive, can you check/increase your oozie server settings for the following

* oozie.service.JPAService.pool.max.active.conn
* oozie.service.coord.normal.default.timeout

Other properties to check

* oozie.command.default.lock.timeout
* If materialization window value is large (you want more coord actions to get materialized simultaneously), but the throttling factor is low, then your actions will stay in PREP

Your log errors are pointing towards transaction-level problems. Can you elaborate a bit more on the difference between your REF and PROD environments?

--
Mona Chitnis

On 9/27/12 3:05 PM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

Hi there,

I've seen some posts about this problem, but I still could not determine what causes it and what is the fix for it.

I have been running Oozie 2.3.2 from Cloudera's package (2.3.2-cdh3u3) until this morning on a VM running Ubuntu 10.04.3 LTS. The database is MySQL running on another VM, same Ubuntu version, MySQL server version 5.1.63-0ubuntu0.10.04.1-log (as displayed by mysql when I connect).

I have coordinators launching workflows with frequency=3, concurrency=2. 5 to 10 Coordinators. My workflows run 2-3 actions each, java, pig, shell (python). Nothing too heavy.

This morning I upgraded Oozie to version 3.2.0 built from the stable branch I downloaded from here (http://incubator.apache.org/oozie/Downloads.html), (06-Jun-2012 18:43). I ran this version for at least one week, maybe two in a REF environment without any problems, but I'm having issues in PROD. I see connection issues to MySQL, timeouts, workflow actions getting stuck in PREP state.

Do you guys know what could be causing this problem? Anything I may have missed on the PROD environment?
Related to the problem, the oozie.log file shows the following:

2012-09-27 13:39:00,403 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Acquired lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]
2012-09-27 13:39:00,404 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Load state for [0000173-120927111953690-oozie-oozi-W]
2012-09-27 13:39:00,403 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobUpdateJPAExecutor]
2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
2012-09-27 13:39:00,405 WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended with an active transaction, rolling back
2012-09-27 13:39:00,405 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Released lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]

Actions that complete their processing perform other operations before releasing the lock shown on the last line above. I suspect something is failing before that, but I don't see a log message indicating what happened.

If you have insights on this problem, please help me.

Thank you.
Eduardo.
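
One way to see how widespread the rollback warning is would be to count it per JPAExecutor across the whole log. The following is a minimal sketch, not an Oozie tool: the names `ROLLBACK_RE`, `count_rollbacks`, and `SAMPLE` are hypothetical, and the pattern is derived only from the WARN line quoted in this thread, so it may need adjusting if other Oozie versions format the line differently.

```python
import re
from collections import Counter

# Assumption: the WARN line format is exactly as in the excerpt quoted in this
# thread; the pattern is not taken from any Oozie documentation.
ROLLBACK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+JPAService:\d+ - .*?"
    r"JPAExecutor \[(?P<executor>\w+)\] ended with an active transaction, rolling back"
)


def count_rollbacks(lines):
    """Return a Counter mapping JPAExecutor name to its number of rollback warnings."""
    counts = Counter()
    for line in lines:
        match = ROLLBACK_RE.search(line)
        if match:
            counts[match.group("executor")] += 1
    return counts


# Two lines copied from the log excerpt in this thread: one DEBUG, one WARN.
SAMPLE = [
    "2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] "
    "APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]",
    "2012-09-27 13:39:00,405 WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] "
    "APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] "
    "ended with an active transaction, rolling back",
]

if __name__ == "__main__":
    # Print each executor with its rollback count, most frequent first.
    for executor, n in count_rollbacks(SAMPLE).most_common():
        print(executor, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_rollbacks(open(path))` with `path` pointing at wherever oozie.log lives on the server. Comparing the counts between REF and PROD might show whether the warning is specific to PROD or also present where jobs run normally.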
