Re: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug

2015-12-10 Thread Clint Byrum
Excerpts from ELISHA, Moshe (Moshe)'s message of 2015-12-07 08:29:44 -0800:
> Hi all,
> 
> The current bugfix I am working on[1] have unveiled / created a bug.
> Test "WorkflowResumeTest.test_resume_different_task_states" sometimes fails 
> because "task4" is executed twice instead of once (See unit test output and 
> workflow below).
> 
> This happens because task2 on-complete is running task4 as expected but also 
> task3 executes task4 by mistake.
> 
> It is not consistent but it happens quite often - This happens if the unit 
> test resumes the WF and updates action execution of task2 and finishes task2 
> before task3 is finished.
> Scenario:
> 
> 
> 1.   Task2 in method on_action_complete - changes task2 state to RUNNING.
> 
> 2.   Task3 in method on_action_complete - changes task2 state to RUNNING 
> (before task2 calls _on_task_state_change).
> 
> 3.   Task3 in "_on_task_state_change" > "continue_workflow" > 
> "DirectWorkflowController ._find_next_commands" - it finds task2 because 
> task2 is in SUCCESS and processed = False and 
> "_find_next_commands_for_task(task2)" returns task4.
> 
> 4.   Task3 executes command to RunTask task4.
> 
> 5.   Task2 in "_on_task_state_change" > "continue_workflow" > 
> "DirectWorkflowController ._find_next_commands" - it finds task2 because 
> task2 is in SUCCESS and processed = False and 
> "_find_next_commands_for_task(task2)" returns task4.
> 
> 6.   Task2 executes command to RunTask task4.
> 
> 
> [1] - https://review.openstack.org/#/c/253819/
> 
> 
> If I am not mistaken - this issue also exists in the current code and my 
> bugfix only made it much more often. Can you confirm?
> I don't have enough knowledge on how to fix this issue...
> For now - I have modified the test_resume_different_task_states unit test to 
> wait for task3 to be processed before updating the action execution of task2.
> If you agree this bug exist today as well - we can proceed with my bugfix and 
> open a different bug for that issue.
> 

I'd agree that this is likely happening more reliably because READ
COMMITTED will just give you the state that causes the bug more often
than REPEATABLE READ, because now if you happen to have threads running
at the same time when the new vulnerable state is reached, they can both
see the new state and react to it.  Before they only had that problem if
they both started after the enabling state was in the db, thus sharing
the same db snapshot.

What you actually need is to atomically claim the rows. You have to do
like this:

UPDATE whatever_table SET executor = 'me' WHERE executor IS NULL;

And if you get 0 rows updated, that means somebody else claimed it and
you do nothing. Note that you also need some liveness testing in this
system, since if your executor dies, that row will be lost forever. In
Heat they solved it by having a queue for each executor and pinging on
oslo.messaging. Please don't do that.

I suggest instead switching to tooz, and joining the distributed lock
revolution, where zookeeper will give you a nice atomic distributed lock
for this, and detect when to break it because of a dead executor. (or
consul or etcd once our fine community finishes landing those)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug

2015-12-09 Thread KOFFMAN, Noa (Noa)
Hi,

I have reproduced this issue multiple times before your fix was merged.

In order to reproduce I used a workflow with multiple async actions, and 
resumed all of them at the same time.

I just created a ticket in launchpad [1], with the workflow used and the 
mistral engine logs.

[1] - https://bugs.launchpad.net/mistral/+bug/1524477

If anyone could take a look and confirm the bug it would be great.

Thanks
Noa Koffman


From: ELISHA, Moshe (Moshe) [moshe.eli...@alcatel-lucent.com]
Sent: Monday, December 07, 2015 6:29 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [mistral] bugfix for "Fix concurrency issues by using 
READ_COMMITTED" unveils / creates a different bug

Hi all,

The current bugfix I am working on[1] have unveiled / created a bug.
Test “WorkflowResumeTest.test_resume_different_task_states” sometimes fails 
because “task4” is executed twice instead of once (See unit test output and 
workflow below).

This happens because task2 on-complete is running task4 as expected but also 
task3 executes task4 by mistake.

It is not consistent but it happens quite often – This happens if the unit test 
resumes the WF and updates action execution of task2 and finishes task2 before 
task3 is finished.
Scenario:


1.   Task2 in method on_action_complete – changes task2 state to RUNNING.

2.   Task3 in method on_action_complete – changes task2 state to RUNNING 
(before task2 calls _on_task_state_change).

3.   Task3 in “_on_task_state_change” > “continue_workflow” > 
“DirectWorkflowController ._find_next_commands” – it finds task2 because task2 
is in SUCCESS and processed = False and “_find_next_commands_for_task(task2)” 
returns task4.

4.   Task3 executes command to RunTask task4.

5.   Task2 in “_on_task_state_change” > “continue_workflow” > 
“DirectWorkflowController ._find_next_commands” – it finds task2 because task2 
is in SUCCESS and processed = False and “_find_next_commands_for_task(task2)” 
returns task4.

6.   Task2 executes command to RunTask task4.


[1] - https://review.openstack.org/#/c/253819/


If I am not mistaken – this issue also exists in the current code and my bugfix 
only made it much more often. Can you confirm?
I don’t have enough knowledge on how to fix this issue…
For now – I have modified the test_resume_different_task_states unit test to 
wait for task3 to be processed before updating the action execution of task2.
If you agree this bug exist today as well – we can proceed with my bugfix and 
open a different bug for that issue.

Thanks.



[stack@melisha-devstack mistral(keystone_admin)]$ tox -e py27 -- 
WorkflowResumeTest.test_resume_different_task_states
...
==
FAIL: 
mistral.tests.unit.engine.test_workflow_resume.WorkflowResumeTest.test_resume_different_task_states
tags: worker-0
--
pythonlogging:'': {{{WARNING [oslo_db.sqlalchemy.utils] Id not in sort_keys; is 
sort_keys unique?}}}
stderr: {{{
/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/novaclient/v2/client.py:109:
 UserWarning: 'novaclient.v2.client.Client' is not designed to be initialized 
directly. It is inner class of novaclient. Please, use 
'novaclient.client.Client' instead. Related lp bug-report: 1493576
  _LW("'novaclient.v2.client.Client' is not designed to be "
}}}

stdout: {{{
Engine test case exception occurred: 4 != 5
Exception type: 

Printing workflow executions...

wb.wf1 [state=SUCCESS, output={u'__execution': {u'params': {}, u'id': 
u'2807dd99-ca6f-49d7-886d-7d3b79e1c49e', u'spec': {u'type': u'direct', u'name': 
u'wf1', u'tasks': {u'task4': {u'type': u'direct', u'name': u'task4', 
u'version': u'2.0', u'action': u'std.echo output="Task 4"'}, u'task2': 
{u'type': u'direct', u'name': u'task2', u'on-complete': [u'task4'], u'version': 
u'2.0', u'action': u'std.mistral_http url="http://google.com";'}, u'task3': 
{u'type': u'direct', u'name': u'task3', u'version': u'2.0', u'action': 
u'std.echo output="Task 3"'}, u'task1': {u'type': u'direct', u'name': u'task1', 
u'on-complete': [u'task3', u'pause'], u'version': u'2.0', u'action': u'std.echo 
output="Hi!"'}}, u'version': u'2.0'}, u'input': {}

[openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug

2015-12-07 Thread ELISHA, Moshe (Moshe)
Hi all,

The current bugfix I am working on[1] have unveiled / created a bug.
Test "WorkflowResumeTest.test_resume_different_task_states" sometimes fails 
because "task4" is executed twice instead of once (See unit test output and 
workflow below).

This happens because task2 on-complete is running task4 as expected but also 
task3 executes task4 by mistake.

It is not consistent but it happens quite often - This happens if the unit test 
resumes the WF and updates action execution of task2 and finishes task2 before 
task3 is finished.
Scenario:


1.   Task2 in method on_action_complete - changes task2 state to RUNNING.

2.   Task3 in method on_action_complete - changes task2 state to RUNNING 
(before task2 calls _on_task_state_change).

3.   Task3 in "_on_task_state_change" > "continue_workflow" > 
"DirectWorkflowController ._find_next_commands" - it finds task2 because task2 
is in SUCCESS and processed = False and "_find_next_commands_for_task(task2)" 
returns task4.

4.   Task3 executes command to RunTask task4.

5.   Task2 in "_on_task_state_change" > "continue_workflow" > 
"DirectWorkflowController ._find_next_commands" - it finds task2 because task2 
is in SUCCESS and processed = False and "_find_next_commands_for_task(task2)" 
returns task4.

6.   Task2 executes command to RunTask task4.


[1] - https://review.openstack.org/#/c/253819/


If I am not mistaken - this issue also exists in the current code and my bugfix 
only made it much more often. Can you confirm?
I don't have enough knowledge on how to fix this issue...
For now - I have modified the test_resume_different_task_states unit test to 
wait for task3 to be processed before updating the action execution of task2.
If you agree this bug exist today as well - we can proceed with my bugfix and 
open a different bug for that issue.

Thanks.



[stack@melisha-devstack mistral(keystone_admin)]$ tox -e py27 -- 
WorkflowResumeTest.test_resume_different_task_states
...
==
FAIL: 
mistral.tests.unit.engine.test_workflow_resume.WorkflowResumeTest.test_resume_different_task_states
tags: worker-0
--
pythonlogging:'': {{{WARNING [oslo_db.sqlalchemy.utils] Id not in sort_keys; is 
sort_keys unique?}}}
stderr: {{{
/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/novaclient/v2/client.py:109:
 UserWarning: 'novaclient.v2.client.Client' is not designed to be initialized 
directly. It is inner class of novaclient. Please, use 
'novaclient.client.Client' instead. Related lp bug-report: 1493576
  _LW("'novaclient.v2.client.Client' is not designed to be "
}}}

stdout: {{{
Engine test case exception occurred: 4 != 5
Exception type: 

Printing workflow executions...

wb.wf1 [state=SUCCESS, output={u'__execution': {u'params': {}, u'id': 
u'2807dd99-ca6f-49d7-886d-7d3b79e1c49e', u'spec': {u'type': u'direct', u'name': 
u'wf1', u'tasks': {u'task4': {u'type': u'direct', u'name': u'task4', 
u'version': u'2.0', u'action': u'std.echo output="Task 4"'}, u'task2': 
{u'type': u'direct', u'name': u'task2', u'on-complete': [u'task4'], u'version': 
u'2.0', u'action': u'std.mistral_http url="http://google.com";'}, u'task3': 
{u'type': u'direct', u'name': u'task3', u'version': u'2.0', u'action': 
u'std.echo output="Task 3"'}, u'task1': {u'type': u'direct', u'name': u'task1', 
u'on-complete': [u'task3', u'pause'], u'version': u'2.0', u'action': u'std.echo 
output="Hi!"'}}, u'version': u'2.0'}, u'input': {}}, u'task4': u'Task 4', 
u'task3': u'Task 3', u'__tasks': {u'848c6e92-b1b1-4d54-b11d-c93cfb4fc88f': 
u'task2', u'00a546e7-8da9-4603-b6be-54d58b14c625': u'task1'}}]
 task2 [id=848c6e92-b1b1-4d54-b11d-c93cfb4fc88f, state=SUCCESS, 
published={}]
 task1 [id=00a546e7-8da9-4603-b6be-54d58b14c625, state=SUCCESS, 
published={}]
 task3 [id=8ce20324-7fba-4424-bcd2-1e0c9b27fd4a, state=SUCCESS, 
published={}]
 task4 [id=3758de43-9bc3-4ac9-b3f3-29eb543b16ef, state=SUCCESS, 
published={}]
 task4 [id=f12ee464-0ba5-48c7-8423-9f425a00e675, state=SUCCESS, 
published={}]
}}}

Traceback (most recent call last):
  File "mistral/tests/unit/engine/test_workflow_resume.py", line 321, in 
test_resume_different_task_states
self.assertEqual(4, len(wf_ex.task_executions))
  File 
"/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py",
 line 350, in assertEqual
self.assertThat(observed, matcher, message)
  File 
"/opt/stack/mistral/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py",
 line 435, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: 4 != 5
Ran 1 tests in 15.517s (+2.591s)
FAILED (id=301, failures=1 (+1))
error: testr failed (1)
ERROR: InvocationError: '/opt/stack/mistral/.tox/py27/bin/python setup.py testr 
--slowest --testr-args=WorkflowResumeTest.test_resume_different_task_states'
___