Hi Eduardo,

If your PROD environment has jobs that are more database-intensive, can you
check/increase the following Oozie server settings:

 *   oozie.service.JPAService.pool.max.active.conn
 *   oozie.service.coord.normal.default.timeout

Other properties to check:

 *   oozie.command.default.lock.timeout
 *   If the materialization window is large (i.e., you want more coordinator
actions materialized simultaneously) but the throttling factor is low, your
actions will stay in PREP

Your log errors are pointing towards transaction-level problems. Can you 
elaborate a bit more on the difference between your REF and PROD environments?

--
Mona Chitnis



On 9/27/12 3:05 PM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

Hi there,

I've seen some posts about this problem, but I still haven't been able to
determine what causes it or how to fix it.

Until this morning I had been running Oozie 2.3.2 from Cloudera's package
(2.3.2-cdh3u3) on a VM running Ubuntu 10.04.3 LTS.
The database is MySQL running on another VM, same Ubuntu version, MySQL server 
version 5.1.63-0ubuntu0.10.04.1-log (as displayed by mysql when I connect).

I have 5 to 10 coordinators launching workflows with frequency=3 and
concurrency=2.
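For concreteness, each coordinator looks roughly like the sketch below (illustrative only; the name, path, and dates are made up, not my actual app):

```xml
<!-- Illustrative coordinator sketch; name, path, and dates are made up. -->
<coordinator-app name="example-coord" frequency="3"
                 start="2012-09-27T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- at most 2 workflow instances running at once -->
    <concurrency>2</concurrency>
  </controls>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/example-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```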

My workflows run 2-3 actions each, java, pig, shell (python). Nothing too heavy.


This morning I upgraded Oozie to version 3.2.0, built from the stable branch
I downloaded from http://incubator.apache.org/oozie/Downloads.html
(06-Jun-2012 18:43).

I ran this version for at least one week, maybe two, in a REF environment
without any problems, but I'm having issues in PROD.

I see connection issues to MySQL, timeouts, workflow actions getting stuck in 
PREP state.

Do you guys know what could be causing this problem? Anything I may have missed 
on the PROD environment?

Related to the problem, the oozie.log file shows the following:


2012-09-27 13:39:00,403 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Acquired lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]
2012-09-27 13:39:00,404 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Load state for [0000173-120927111953690-oozie-oozi-W]
2012-09-27 13:39:00,403 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobUpdateJPAExecutor]
2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
2012-09-27 13:39:00,405  WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended with an active 
transaction, rolling back
2012-09-27 13:39:00,405 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Released lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]


Actions that complete processing perform other operations before releasing
the lock shown on the last line above. I suspect something is failing before
that point, but I don't see a log message indicating what happened.
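To pull every such rollback warning out of a large log, I use a quick scan like this (a sketch; the regex is keyed to the WARN line quoted above, and the sample text is just that excerpt with its wrapped line rejoined):

```python
import re

# Excerpt from the oozie.log lines quoted above (wrapped WARN line rejoined).
SAMPLE = """\
2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
2012-09-27 13:39:00,405  WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended with an active transaction, rolling back
"""

# Matches the timestamp and the executor name in each rollback WARN line.
ROLLBACK = re.compile(
    r"(\S+ \S+)\s+WARN\s+JPAService.*JPAExecutor \[(\w+)\] "
    r"ended with an active transaction, rolling back"
)


def find_rollbacks(log_text):
    """Return (timestamp, executor) pairs for every rollback warning."""
    return ROLLBACK.findall(log_text)


if __name__ == "__main__":
    for ts, executor in find_rollbacks(SAMPLE):
        print(ts, executor)
```

Running it against the excerpt flags the WorkflowJobGetJPAExecutor rollback; pointing it at the full oozie.log should show which executors roll back and how often.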

If you have any insight into this problem, please help.

Thank you.
Eduardo.
