RE: Workflow submission time

2016-06-27 Thread ganesh raman
Hi Pierre

Besides TDCH-based use cases, we needed something similar for some of our other
use cases, and we had to resort to rewriting workflows based on the amount of
concurrency we wanted.

I think it makes a lot of sense to make fork/join more configurable in terms of
the degree of action-level parallelism.


Ganesh

-Original Message-
From: "Pierre Villard" 
Sent: 6/27/2016 9:31 PM
To: "user@oozie.apache.org" 
Subject: Re: Workflow submission time

Thanks Peter, it worked like a charm!

Regarding the concurrent actions, in my case it seems to be a bit tricky...

I am calling Sqoop actions that use the Teradata connector. The thing
is that all the Oozie launchers for the Sqoop actions run concurrently, and
there are no free slots left to run the actual Teradata map/reduce
actions... in the end I only have Oozie launchers waiting for Teradata
exports to end... and Teradata exports waiting for slots from the RM.

Maybe I am doing something wrong here, but that is why I feel it
would be nice to have a parameter to limit the number of concurrent actions
from a single fork.

Again, thanks for the property, really helpful!

Pierre




Re: Oozie BoF at Hadoop Summit San Jose.

2016-06-27 Thread Rohini Palaniswamy
Puru,
I am not sure if 1 and 3 make sense given that your talk on Oozie is
also about the same and it will be a repeat of information.

Robert,
It would be good to talk about the Oozie and unmanaged AM branch work.
Would it be possible for you to do a 5-minute talk, or to get someone else
from Cloudera to do that?

Regards,
Rohini

On Thu, Jun 23, 2016 at 3:24 PM, Purshotam Shah <
purus...@yahoo-inc.com.invalid> wrote:

> The two major highlights of the next release are coordinator complex input
> dependencies support and the Spark action.
> If nobody has any suggestions, we are planning to present:
>
> 1. Coordinator complex input dependencies support (Puru)
> 2. Spark action (Satish)
> 3. Y! experience of running Oozie (Puru and Satish)
>
> Thanks,
> On Thursday, June 23, 2016 3:04 PM, Mohammad Islam wrote:
>
> What is in the next release plan?
>
> On Thursday, June 23, 2016 2:53 PM, Purshotam Shah wrote:
>
> Checking again whether anybody wants to present anything or wants us to
> cover any topic.
> Thanks,
>
>
> On Thursday, June 2, 2016 10:21 AM, Purshotam Shah wrote:
>
> Hi All, we are hosting the Oozie BoF at Hadoop Summit San Jose. It will be
> combined with Cloud & Operations. Let us know if anybody wants to present
> anything or wants us to cover any topic.
> Thanks,


Re: Workflow submission time

2016-06-27 Thread Pierre Villard
Thanks Peter, it worked like a charm!

Regarding the concurrent actions, in my case it seems to be a bit tricky...

I am calling Sqoop actions that use the Teradata connector. The thing
is that all the Oozie launchers for the Sqoop actions run concurrently, and
there are no free slots left to run the actual Teradata map/reduce
actions... in the end I only have Oozie launchers waiting for Teradata
exports to end... and Teradata exports waiting for slots from the RM.

Maybe I am doing something wrong here, but that is why I feel it
would be nice to have a parameter to limit the number of concurrent actions
from a single fork.

Again, thanks for the property, really helpful!

Pierre



2016-06-27 17:38 GMT+02:00 Peter Cseh:

> Hi Pierre,
>
> Now that you've mentioned it, I've found that you can disable fork-join
> validation at workflow and application level:
>
> https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>
> "By default, Oozie performs some validation that any forking in a workflow
> is valid and won't lead to any incorrect behavior or instability. However,
> if Oozie is preventing a workflow from being submitted and you are very
> certain that it should work, you can disable forkjoin validation so that
> Oozie will accept the workflow. To disable this validation just for a
> specific workflow, simply set *oozie.wf.validate.ForkJoin* to false in the
> job.properties file. To disable this validation for all workflows, simply
> set *oozie.validate.ForkJoin* to false in the oozie-site.xml file.
> Disabling this validation is determined by the AND of both of these
> properties, so it will be disabled if either or both are set to false and
> only enabled if both are set to true (or not specified)."
>
> You may limit the number of concurrent actions by submitting them to a
> queue in YARN and configuring the scheduler accordingly.
>
> BRs,
> Peter


Re: Workflow submission time

2016-06-27 Thread Peter Cseh
Hi Pierre,

Now that you've mentioned it, I've found that you can disable fork-join
validation at workflow and application level:
https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes

"By default, Oozie performs some validation that any forking in a workflow
is valid and won't lead to any incorrect behavior or instability. However,
if Oozie is preventing a workflow from being submitted and you are very
certain that it should work, you can disable forkjoin validation so that
Oozie will accept the workflow. To disable this validation just for a
specific workflow, simply set *oozie.wf.validate.ForkJoin* to false in the
job.properties file. To disable this validation for all workflows, simply
set *oozie.validate.ForkJoin* to false in the oozie-site.xml file.
Disabling this validation is determined by the AND of both of these
properties, so it will be disabled if either or both are set to false and
only enabled if both are set to true (or not specified)."
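
For illustration, the server-wide setting described in the quoted passage could look like the fragment below in oozie-site.xml. This is only a minimal sketch built from the property name given in the documentation above. The per-workflow equivalent would be a single line, oozie.wf.validate.ForkJoin=false, in job.properties.

```xml
<!-- oozie-site.xml: disables fork/join validation for ALL workflows on this server -->
<configuration>
    <property>
        <name>oozie.validate.ForkJoin</name>
        <value>false</value>
    </property>
</configuration>
```

Because the effective setting is the AND of both properties, setting either one to false is enough. The job.properties route keeps the change scoped to a single workflow.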

You may limit the number of concurrent actions by submitting them to a
queue in YARN and configuring the scheduler accordingly.
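
A rough sketch of that idea for a Sqoop action is shown below. It assumes that properties with the oozie.launcher. prefix are passed to the launcher job on your Oozie version. The queue names (launchers, exports), the node names and the ${sqoopCommand} parameter are hypothetical placeholders. The intent is to send the launchers and the jobs they spawn to different queues, so that a capped launcher queue cannot take every slot.

```xml
<!-- Fragment of workflow.xml; belongs inside a <workflow-app>, after a <fork> -->
<action name="teradata-export-1">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Assumption: queue used by the Oozie launcher job -->
            <property>
                <name>oozie.launcher.mapred.job.queue.name</name>
                <value>launchers</value>
            </property>
            <!-- Assumption: queue used by the map/reduce job that Sqoop starts -->
            <property>
                <name>mapred.job.queue.name</name>
                <value>exports</value>
            </property>
        </configuration>
        <!-- Hypothetical parameter holding the Sqoop command line -->
        <command>${sqoopCommand}</command>
    </sqoop>
    <ok to="join-exports"/>
    <error to="fail"/>
</action>
```

The capacity of the launcher queue would then be capped in the YARN scheduler configuration, which roughly bounds how many actions from the fork are in flight at once.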

BRs,
Peter

On Mon, Jun 27, 2016 at 5:22 PM, Pierre Villard wrote:

> Hi Peter,
>
> Thanks a lot for your answer, useful references to the JIRAs!
> I'll try to have a look at the code and see if this can be improved.
>
> Out of curiosity, what does the 'validation of the XML' cover? I am asking
> because the 'oozie validate' command returns OK very quickly.
>
> Is there a way to "deactivate" this validation part?
>
> In my specific use case, I could use one single fork/join. The thing is
> that if I take that route, I'd like to be able to limit the number of
> actions that run in parallel from the fork. Is that something we can do?
>
> Thanks,
> Pierre.



-- 
Peter Cseh
Software Engineer



Re: Workflow submission time

2016-06-27 Thread Peter Cseh
Hi Pierre,

There was a bugfix around submitting fork jobs which parallelized job
submission:
https://issues.apache.org/jira/browse/OOZIE-2345

But the issue you've reported is known and not resolved yet:
https://issues.apache.org/jira/browse/OOZIE-1978

I could not find a workaround description, but one sub-workflow per fork
may help, as the validation of the XML is the slow part.
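
As a sketch of the sub-workflow idea: each fork/join block would move into its own workflow application, and the parent would only chain sub-workflow actions. The application paths and node names below are hypothetical, and whether this actually shortens submission time is untested here.

```xml
<workflow-app name="parent-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="group-1"/>

    <!-- Each child application holds a single fork/join; the hope is that
         the parent stays trivial to validate -->
    <action name="group-1">
        <sub-workflow>
            <app-path>${nameNode}/apps/demo/group-1</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="group-2"/>
        <error to="fail"/>
    </action>

    <action name="group-2">
        <sub-workflow>
            <app-path>${nameNode}/apps/demo/group-2</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sub-workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```
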
Best regards,
Peter

On Mon, Jun 27, 2016 at 4:22 PM, Pierre Villard wrote:

> Hi guys,
>
> I am trying to submit workflows with around 50 actions. However, depending
> on how the workflow is defined and the number of actions, the time needed
> by Oozie to accept the workflow may change a lot (I am not talking about
> the execution time of actions; I'm really talking about the time
> between the moment I launch the command line 'job -run' and the moment I
> get back the prompt and my job ID).
>
> The submission time also seems to depend exponentially on the number of
> forks in the workflow (5 forks: a few seconds, 6 forks: 1 minute, 7 forks:
> 10 minutes, 8 forks: one hour).
>
> I was expecting to have workflows with a higher number of actions. Is it a
> known issue? Is there some tuning to perform? Are there workarounds? Should
> I use sub-workflows?
>
> Thanks for your help,
> Best regards,
> Pierre
>



-- 
Peter Cseh
Software Engineer



Workflow submission time

2016-06-27 Thread Pierre Villard
Hi guys,

I am trying to submit workflows with around 50 actions. However, depending
on how the workflow is defined and the number of actions, the time needed
by Oozie to accept the workflow may change a lot (I am not talking about
the execution time of actions; I'm really talking about the time
between the moment I launch the command line 'job -run' and the moment I
get back the prompt and my job ID).

The submission time also seems to depend exponentially on the number of
forks in the workflow (5 forks: a few seconds, 6 forks: 1 minute, 7 forks:
10 minutes, 8 forks: one hour).

I was expecting to have workflows with a higher number of actions. Is it a
known issue? Is there some tuning to perform? Are there workarounds? Should
I use sub-workflows?

Thanks for your help,
Best regards,
Pierre