See answers inline

--
Mona Chitnis


On 9/28/12 9:22 AM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

Hey, Mona,

Hi Eduardo,

Thanks for elaborating on the suggestions. For the maximum concurrent DB 
connections, we do not currently have benchmarks for a recommended number, but 
it can be an estimate of what you think your particular database can handle 
without hitting network issues. For example, on some production Oracle servers 
I have seen it set to ~300.
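
As a rough sketch (not a recommendation; 50 is just the value you mentioned 
having), the setting goes in oozie-site.xml like this:

  <!-- oozie-site.xml: max active JDBC connections in Oozie's connection pool -->
  <property>
    <name>oozie.service.JPAService.pool.max.active.conn</name>
    <value>50</value>
  </property>

Pick a value your MySQL server can actually sustain alongside its other clients.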


The list of jobs in the PROD environment is basically the same as in REF.

When you mentioned "more database-intensive" I assume you're referring to the 
communication with the Oozie DB, and in that case, nothing is different. My jobs 
don't do any extra communication with the Oozie DB. I just submit a handful of 
coordinators to the Oozie server and let Oozie take care of the rest.

The main difference between REF and PROD is that in PROD we have much more data 
to process; therefore, the jobs that run on the Hadoop cluster take longer to 
complete and we may have a larger number of concurrent Oozie workflows running. 
So I see the importance of setting the max active connections accordingly 
(oozie.service.JPAService.pool.max.active.conn). I currently have it at 50, but 
do you have a recommendation on the ideal value for that? How do you determine 
what number is good, i.e. not too small but not too large?

If you want to configure your Oozie server to check for available input 
dependencies more frequently, you can reduce the 
"oozie.service.coord.input.check.requeue.interval" (default is 60 sec). This 
will result in more memory usage (Oozie queue) and network usage (requests to 
the NameNode), but it might make your actions start sooner.
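
For illustration only, a sketch of the override in oozie-site.xml, assuming the 
value is given in milliseconds (so the 60-second default corresponds to 60000, 
and the 30000 below is just an example, not a recommendation):

  <!-- oozie-site.xml: re-queue interval for coordinator input dependency checks -->
  <property>
    <name>oozie.service.coord.input.check.requeue.interval</name>
    <value>30000</value>
  </property>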

What does the timeout attribute do 
(oozie.service.coord.normal.default.timeout)? It's currently set at 120 
(minutes).

This is the timeout for a coordinator action's input check (in minutes) for a 
normal job. The default value is pretty high, so unless you set a smaller 
non-default value, this is not something to worry about.
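
For reference, this is how the setting you already have (120 minutes) would 
appear in oozie-site.xml; a sketch only:

  <!-- oozie-site.xml: default timeout (minutes) for a coordinator action's input check -->
  <property>
    <name>oozie.service.coord.normal.default.timeout</name>
    <value>120</value>
  </property>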

How do the throttle attributes affect the way coordinators launch new workflows?
I'm talking about the attributes oozie.service.coord.default.throttle (12) and 
oozie.service.coord.materialization.throttling.factor (0.05). Are there any 
other attributes I could or should use that are related to materialization?

oozie.service.coord.default.throttle (12) controls how many actions per 
coordinator can be in the WAITING state concurrently. WAITING is when the input 
dependency checks occur. The "materialization.throttling.factor" is similar, 
but it is a function (percentage) of your queue size. This makes more sense in 
a multi-tenant environment, where you don't want your whole Oozie command queue 
getting filled up by only one user's jobs.
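
For reference, here are the two server-side knobs as they would appear in 
oozie-site.xml, with the values you quoted (not recommendations):

  <!-- oozie-site.xml: max WAITING actions materialized per coordinator job -->
  <property>
    <name>oozie.service.coord.default.throttle</name>
    <value>12</value>
  </property>

  <!-- oozie-site.xml: fraction of the Oozie command queue that materialization may occupy -->
  <property>
    <name>oozie.service.coord.materialization.throttling.factor</name>
    <value>0.05</value>
  </property>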


Also, would you elaborate a little more on how I can control materialization so 
that my coordinators won't get stuck, as happens every once in a while in my 
PROD environment?


How should I set those attributes to make coordinators do things like these:
- Always launch new jobs (workflows) if their nominal time is reached and 
there's room (max concurrency not reached yet).
- Launch the most recent jobs first (execution=LIFO)

This is controlled by the coordinator's <controls><execution> element; set it 
to LIFO (a minimal snippet follows the list below). Possible values are

* FIFO (oldest first) default
* LIFO (newest first)
* LAST_ONLY (discards all older materializations)
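
A minimal sketch of the control block in a coordinator.xml, using the timeout 
and concurrency values you mentioned (120 minutes and 2):

  <!-- coordinator.xml: control block; execution order set to newest-first -->
  <controls>
    <timeout>120</timeout>        <!-- minutes -->
    <concurrency>2</concurrency>  <!-- max workflows running at once -->
    <execution>LIFO</execution>   <!-- FIFO (default), LIFO, or LAST_ONLY -->
  </controls>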

- Don't create a bunch of WAITING jobs. I don't care about WAITING (future) 
jobs, only about READY ones, i.e. those whose nominal time has been reached (or 
is about to be reached).

Jobs will go to WAITING first, have their input dependencies checked, and then 
become READY. For this case (to avoid stuck jobs, accepting fewer jobs that 
become READY more quickly), decrease your throttle values and decrease the 
input check requeue interval (not by too much) to have actions addressed faster.

- Don't leave actions in the PREP state forever. If they can't start for any 
reason, either retry or get them out of the way (FAIL, KILL, whatever). 
Whenever an action is left in the PREP state, one of the concurrency slots is 
blocked, and if several actions end up in the same scenario, pretty soon the 
coordinator gets stuck with no more room to launch new tasks.

Actions in PREP will not block concurrency slots (PREP just means the action ID 
has been persisted in the database).

Let me know if the above steps work for you. Happy to help.



Thank you.
Eduardo.



________________________________
From: Mona Chitnis <[email protected]>
To: "[email protected]" <[email protected]>; Eduardo Afonso Ferreira 
<[email protected]>
Sent: Thursday, September 27, 2012 6:25 PM
Subject: Re: Action stuck in PREP state.
Hi Eduardo,

If your PROD environment has jobs that are more database-intensive, can you 
check/increase your Oozie server settings for the following:

*   oozie.service.JPAService.pool.max.active.conn
*   oozie.service.coord.normal.default.timeout

Other properties to check:

*   oozie.command.default.lock.timeout
*   If the materialization window value is large (you want more coordinator 
actions to get materialized simultaneously) but the throttling factor is low, 
then your actions will stay in PREP

Your log errors are pointing towards transaction-level problems. Can you 
elaborate a bit more on the difference between your REF and PROD environments?

--
Mona Chitnis



On 9/27/12 3:05 PM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

Hi there,

I've seen some posts about this problem, but I still could not determine what 
causes it or how to fix it.

Until this morning I had been running Oozie 2.3.2 from Cloudera's package 
(2.3.2-cdh3u3) on a VM running Ubuntu 10.04.3 LTS.
The database is MySQL running on another VM, same Ubuntu version, MySQL server 
version 5.1.63-0ubuntu0.10.04.1-log (as displayed by mysql when I connect).

I have 5 to 10 coordinators launching workflows with frequency=3 and 
concurrency=2.

My workflows run 2-3 actions each, java, pig, shell (python). Nothing too heavy.


This morning I upgraded Oozie to version 3.2.0, built from the stable branch I 
downloaded from http://incubator.apache.org/oozie/Downloads.html 
(06-Jun-2012 18:43).

I ran this version for at least one week, maybe two, in a REF environment 
without any problems, but I'm having issues in PROD.

I see connection issues to MySQL, timeouts, and workflow actions getting stuck 
in the PREP state.

Do you guys know what could be causing this problem? Is there anything I may 
have missed in the PROD environment?

Related to the problem, I see the following in the oozie.log file:


2012-09-27 13:39:00,403 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Acquired lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]
2012-09-27 13:39:00,404 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Load state for [0000173-120927111953690-oozie-oozi-W]
2012-09-27 13:39:00,403 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobUpdateJPAExecutor]
2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
2012-09-27 13:39:00,405  WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended with an active 
transaction, rolling back
2012-09-27 13:39:00,405 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] 
TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] 
Released lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]


Actions that complete their processing perform other operations before 
releasing the lock shown on the last line above. I suspect something is failing 
before that, but I don't see a log message indicating what happened.

If you have insights on this problem, please help me.

Thank you.
Eduardo.
