See answers inline -- Mona Chitnis
On 9/28/12 9:22 AM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

> Hey, Mona,

Hi Eduardo,

Thanks for thinking through the suggestions. For max concurrent DB connections, we do not currently have any benchmarks for a recommended number, but you can estimate what you think your particular database can handle without hitting network issues. For example, on some production Oracle servers I have seen it set to ~300.

> The list of jobs in the PROD environment is basically the same as in REF. When you mentioned "more database-intensive" I assume you're referring to the communication with the Oozie DB, and in that case nothing is different. My jobs don't do any extra communication with the Oozie DB. I just submit a handful of coordinators to the Oozie server and let Oozie take care of the rest. The main difference between REF and PROD is that in PROD we have much larger data to process; therefore, the jobs that run on the Hadoop cluster take longer to complete and we may have a larger number of concurrent Oozie workflows running. So I see the importance of setting the max active connections accordingly (oozie.service.JPAService.pool.max.active.conn). I currently have it at 50, but do you have a recommendation on the ideal value for that? How do you determine what number is good, i.e. not too small but not too large?

If you want to configure your Oozie server to check for available input dependencies more frequently, you can reduce "oozie.service.coord.input.check.requeue.interval" (default is 60 sec). It will result in more memory usage (Oozie queue) and network usage (requests to the NN), but it might make your actions start sooner.

> What does the timeout attribute do (oozie.service.coord.normal.default.timeout)? It's currently set at 120 (minutes).

This is the timeout for a coordinator action input check (in minutes) for a normal job. The default value is pretty high, so unless you set a smaller non-default value, this is not something to worry about.

> How do the throttle attributes affect the way coordinators launch new workflows? I'm talking about the attributes oozie.service.coord.default.throttle (12) and oozie.service.coord.materialization.throttling.factor (0.05). Are there any other attributes I could or should use that are related to materialization?

oozie.service.coord.default.throttle (12) controls how many actions per coordinator can be in state WAITING concurrently. WAITING is when the input dependency checks occur. The "materialization.throttling.factor" is similar, but it is a function (percentage) of your queue size. This makes more sense in a multi-tenant environment where you don't want your whole Oozie command queue getting filled up by only one user's jobs.

> Also, would you elaborate a little more on how I can control the materialization so that my coordinators won't get stuck like it's happening every once in a while in my PROD environment? How should I set those attributes to cause coordinators to do things like these:
> - Always launch new jobs (workflows) if their nominal time is reached and there's room (max concurrency not reached yet).
> - Launch most current jobs first (execution=LIFO)

Use the coordinator execution-order control (the <execution> element inside the coordinator's <controls> block). Set it to LIFO. Possible values are:
  * FIFO (oldest first) - default
  * LIFO (newest first)
  * LAST_ONLY (discards all older materializations)

> - Don't create a bunch of WAITING jobs. I don't care about WAITING (future) jobs, but just READY ones, i.e. nominal time is reached (or about to be reached).

Jobs will go to WAITING first, have their input dependencies checked, and then become READY. For this case (to avoid stuck jobs, being okay with fewer jobs but getting them READY more quickly), decrease your throttle values and decrease the input check requeue interval (not too much) to have actions addressed faster.
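For reference, those server-level knobs are properties inside the <configuration> element of oozie-site.xml; a minimal sketch is below. The values are only placeholders to show the format, not recommendations, and the requeue interval is assumed to be specified in milliseconds (60000 = the 60-sec default). Changes here need an Oozie server restart to take effect.

  <!-- Re-check input dependencies more often than the 60-sec default
       (value assumed to be in milliseconds). -->
  <property>
    <name>oozie.service.coord.input.check.requeue.interval</name>
    <value>30000</value>
  </property>

  <!-- Allow fewer coordinator actions in WAITING per coordinator (default 12). -->
  <property>
    <name>oozie.service.coord.default.throttle</name>
    <value>6</value>
  </property>

  <!-- Fraction of the Oozie command queue a single job may occupy (default 0.05). -->
  <property>
    <name>oozie.service.coord.materialization.throttling.factor</name>
    <value>0.05</value>
  </property>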
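And the execution order is set per coordinator, in the coordinator app definition. A rough sketch is below, assuming the uri:oozie:coordinator:0.2 schema; the app name, times, frequency, and workflow path are placeholders, and the <controls> block is the relevant part. It takes effect when the coordinator is (re)submitted.

  <coordinator-app name="example-coord" frequency="${coord:hours(1)}"
                   start="2012-09-28T00:00Z" end="2012-12-31T00:00Z"
                   timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <controls>
      <timeout>120</timeout>        <!-- minutes an action may wait for its inputs -->
      <concurrency>2</concurrency>  <!-- max actions running at the same time -->
      <execution>LIFO</execution>   <!-- newest nominal time first; FIFO is the default -->
    </controls>
    <action>
      <workflow>
        <app-path>hdfs:///path/to/workflow/app</app-path>
      </workflow>
    </action>
  </coordinator-app>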
> - Don't leave actions in the PREP state forever. If they can't start for any reason, either retry or get them out of the way (FAIL, KILL, whatever). Whenever an action is left in the PREP state, one of the concurrency slots is blocked, and if several actions get into the same scenario, pretty soon the coordinator will get stuck with no more room to launch new tasks.

Actions will not block concurrency slots in PREP (PREP just means the action id is persisted in the database).

Let me know if the above steps work for you. Happy to help.

> Thank you.
> Eduardo.

________________________________
From: Mona Chitnis <[email protected]>
To: "[email protected]" <[email protected]>; Eduardo Afonso Ferreira <[email protected]>
Sent: Thursday, September 27, 2012 6:25 PM
Subject: Re: Action stuck in PREP state.

Hi Eduardo,

If your PROD environment has jobs that are more database-intensive, can you check/increase your Oozie server settings for the following:
  * oozie.service.JPAService.pool.max.active.conn
  * oozie.service.coord.normal.default.timeout

Other properties to check:
  * oozie.command.default.lock.timeout
  * If the materialization window value is large (you want more coord actions to get materialized simultaneously) but the throttling factor is low, then your actions will stay in PREP.
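For reference, those settings go in oozie-site.xml on the Oozie server; a rough sketch is below. The values are placeholders only, and the lock timeout is assumed to be in milliseconds.

  <!-- Max JDBC connections in Oozie's pool; keep it below what your MySQL server allows. -->
  <property>
    <name>oozie.service.JPAService.pool.max.active.conn</name>
    <value>100</value>
  </property>

  <!-- Minutes a coordinator action waits for its input dependencies before timing out. -->
  <property>
    <name>oozie.service.coord.normal.default.timeout</name>
    <value>120</value>
  </property>

  <!-- Time a queued command waits to acquire its lock (assumed milliseconds). -->
  <property>
    <name>oozie.command.default.lock.timeout</name>
    <value>5000</value>
  </property>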
Your log errors are pointing towards transaction-level problems. Can you elaborate a bit more on the difference between your REF and PROD environments?

--
Mona Chitnis

On 9/27/12 3:05 PM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

Hi there,

I've seen some posts about this problem, but I still could not determine what causes it or what the fix for it is.

I had been running Oozie 2.3.2 from Cloudera's package (2.3.2-cdh3u3) until this morning, on a VM running Ubuntu 10.04.3 LTS. The database is MySQL running on another VM, same Ubuntu version, MySQL server version 5.1.63-0ubuntu0.10.04.1-log (as displayed by mysql when I connect). I have coordinators launching workflows with frequency=3, concurrency=2; 5 to 10 coordinators. My workflows run 2-3 actions each: java, pig, shell (python). Nothing too heavy.

This morning I upgraded Oozie to version 3.2.0, built from the stable branch I downloaded from here (http://incubator.apache.org/oozie/Downloads.html), (06-Jun-2012 18:43). I ran this version for at least one week, maybe two, in a REF environment without any problems, but I'm having issues in PROD. I see connection issues to MySQL, timeouts, and workflow actions getting stuck in the PREP state. Do you guys know what could be causing this problem? Anything I may have missed in the PROD environment?

Related to the problem, I see the oozie.log file displays the following:

2012-09-27 13:39:00,403 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Acquired lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]
2012-09-27 13:39:00,404 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Load state for [0000173-120927111953690-oozie-oozi-W]
2012-09-27 13:39:00,403 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobUpdateJPAExecutor]
2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
2012-09-27 13:39:00,405 WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended with an active transaction, rolling back
2012-09-27 13:39:00,405 DEBUG ActionStartXCommand:545 - USER[aspen] GROUP[-] TOKEN[] APP[ad_counts-wf] JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Released lock for [0000173-120927111953690-oozie-oozi-W] in [action.start]

Actions that complete their processing perform other operations before releasing the lock shown in the last line above. I suspect something is failing before that, but I don't see a log message indicating what happened.

If you have insights on this problem, please help me.

Thank you.
Eduardo.
