[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Bafna updated OOZIE-1978:
    Priority: Blocker (was: Major)

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Priority: Blocker
> Fix For: 4.3.0
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978-006.patch, OOZIE-1978_wip.001.patch, workflow.xml
>
>
> We've had a few users who have run into problems where submitting a workflow appears to hang (in the case of a subworkflow, it's similar but stuck in PREP). It turns out that if you wait long enough, it will actually go through and the workflow will run normally. The problem is that the forkjoin validation code is taking a really long time.
>
> The attached example has a series of 20 forks where each fork has 6 actions (it's based on an actual workflow, but all of the names were changed and the actions were all replaced by simple shell actions). One of our support guys said it took 1-2 hours, but on my computer it was taking {color:red}*15+ hours*{color} (I had to cancel it).
>
> While this example doesn't have any nested forks, those can also take a long time.
>
> It's easy to verify that it's the forkjoin validation code that's taking so long by looking at a jstack of the Oozie server and seeing deep recursive calls to {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}. I also noticed a lot of time spent in calls to LinkedList.contains.
>
> I think we have 3 options:
> # See if we can make the existing code faster somehow. Perhaps there's a way to parallelize it? Maybe there's some redundant checking that we can identify and skip? Change some data structures? etc.
> # See if we can write a new way to do this validation. I had originally completely rewritten this code a while ago, and we've since made a few fixes to catch edge cases and things. Perhaps it needs another rewrite?
> # Try to identify when it's taking a long time and at least let the user know what's happening or something. Right now, it just appears that the Oozie CLI has hung and the job doesn't show up in the Oozie server. Most users aren't going to wait more than a minute or two.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
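
The first option in the description mentions skipping redundant checks and changing data structures. As a rough illustration of that idea only (this is not Oozie's code, and it does not perform any fork/join validation), the sketch below builds a chain shaped like the attached example, 20 forks of 6 actions each, and walks it with a HashSet of visited nodes so that each node is expanded once and membership checks do not scan a LinkedList. The class name, method names, and node naming scheme are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; none of these names come from Oozie. */
class VisitedSetSketch {

    /** Builds a chain of fork/join pairs; each fork fans out to `width` actions that all lead to one join. */
    static Map<String, List<String>> buildChain(int forks, int width) {
        Map<String, List<String>> graph = new HashMap<>();
        for (int f = 0; f < forks; f++) {
            List<String> actions = new ArrayList<>();
            for (int a = 0; a < width; a++) {
                String action = "action-" + f + "-" + a;
                actions.add(action);
                graph.put(action, Collections.singletonList("join-" + f));
            }
            graph.put("fork-" + f, actions);
            String next = (f + 1 < forks) ? "fork-" + (f + 1) : "end";
            graph.put("join-" + f, Collections.singletonList(next));
        }
        graph.put("end", Collections.<String>emptyList());
        return graph;
    }

    /** Iterative depth-first walk that expands each node at most once; returns how many nodes were expanded. */
    static int countExpanded(Map<String, List<String>> graph, String start) {
        Set<String> visited = new HashSet<>();   // constant-time membership on average
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (!visited.add(node)) {
                continue;                        // already expanded via another branch
            }
            for (String next : graph.getOrDefault(node, Collections.<String>emptyList())) {
                if (!visited.contains(next)) {
                    stack.push(next);
                }
            }
        }
        return visited.size();
    }

    public static void main(String[] args) {
        // 20 forks + 20 joins + 120 actions + 1 end node
        System.out.println(countExpanded(buildChain(20, 6), "fork-0")); // prints 161
    }
}
```

With a visited set the walk touches 161 nodes for this shape. Whether the real validation can be restructured this way depends on what it has to check along each path, which this sketch does not model.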
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-006.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: 4.3.0
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978-006.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Bafna updated OOZIE-1978:
    Fix Version/s: 4.3.0 (was: trunk)

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: 4.3.0
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-002.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-005.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: (was: OOZIE-1978-002.patch)

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978-005.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-002.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-004.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978-004.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-003.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978-003.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-002.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978-002.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated OOZIE-1978:
    Attachment: OOZIE-1978-001.patch

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Peter Bacsko
> Fix For: trunk
>
> Attachments: OOZIE-1978-001.patch, OOZIE-1978_wip.001.patch, workflow.xml
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated OOZIE-1978:
    Attachment: OOZIE-1978_wip.001.patch

That's a fancy ASCII diagram [~pbacsko]. I realized that there's a ton of paths before, but didn't realize there were that many (I never did the math here).

Anyway, I have a work-in-progress patch that I had been meaning to upload for quite some time, but never got around to it. I'll attach it now. Feel free to use it or steal things from it, but if you already have something, that's fine too. It's been so long since I looked at it, and this whole thing gets complicated, so I don't remember how it works or what else I was planning to do, though there are some comments.

> Forkjoin validation code is ridiculously slow in some cases
> ---
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Fix For: trunk
>
> Attachments: OOZIE-1978_wip.001.patch, workflow.xml
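
For readers who, like the comment above, have not done the math: assuming the validator follows every start-to-end path separately (an assumption for illustration, not something confirmed in this thread), a series of 20 forks with 6 actions each offers 6 choices at every fork, so the number of distinct paths is 6^20. The small sketch below just computes that figure; the class and method names are hypothetical.

```java
import java.math.BigInteger;

/** Hypothetical helper; the name does not come from Oozie. */
class ForkPathCount {

    /** Distinct start-to-end paths through `forks` sequential forks with `width` branches each. */
    static BigInteger countPaths(int forks, int width) {
        // One independent choice of branch per fork, so the counts multiply.
        return BigInteger.valueOf(width).pow(forks);
    }

    public static void main(String[] args) {
        System.out.println(countPaths(20, 6)); // prints 3656158440062976
    }
}
```

That is roughly 3.7 quadrillion paths, which would be consistent with a submission that seems to hang for hours if each path is walked on its own.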
[jira] [Updated] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases
[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated OOZIE-1978:
    Attachment: workflow.xml

Forkjoin validation code is ridiculously slow in some cases
---

Key: OOZIE-1978
URL: https://issues.apache.org/jira/browse/OOZIE-1978
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: trunk, 4.0.1
Reporter: Robert Kanter
Fix For: trunk

Attachments: workflow.xml

--
This message was sent by Atlassian JIRA (v6.2#6252)