davisp opened a new pull request #1596: Fix couch server race condition
URL: https://github.com/apache/couchdb/pull/1596
 
 
   ## Overview
   
   There are two separate bugs here being fixed. The first was in 
`couch_server:terminate/2`. If there was an active open_async process when 
couch_server died it would throw a function clause when it tried to shutdown 
the database that wasn't yet open. We just filter out any `#entry{}` record to 
avoid the issue. This a simple bug but it prevented couch_server from properly 
generating a SASL report that ended up covering up the second more complex race 
condition.
   
   The second bug has to deal with a race between deletions and opens as well 
as the state of couch_server's mailbox. There's a very specific set of 
operations that have to happen in a particular order to trigger this bug:
   
   1. An in-flight open or creation request that will ultimately succeed. The 
`'$gen_call'` message must be in couch_server's message queue before Step 2 
finishes.
   2. A deletion request
   3. A second open or creation request that is enqueued before couch_server 
processes the `'$gen_call'` message from Step 1.
   4. The couch_db_updater pid's `'EXIT'` signal is delayed until after the 
open_result from Step 3 is delivered to couch_server
   
   The underlying issue here is that the deletion request clears the 
`couch_dbs` ets table state without removing the possible corresponding state 
in couch_server's message queue. As things progress couch_server ends up 
mistaking the `open_result` message from Step 1 as corresponding to the 
open_async process spawned in Step 3. Currently this results in the client from 
Step 3 getting an invalid response from couch_server, and then more 
importantly, couch_server dies while attempting to process the real 
`open_result` message because of the state in the `couch_dbs` ets table being 
incorrect (it would throw a function_clause error in a list comprehension 
because `#entry.waiters` was undefined).
   
   The fix was just to ensure that the opener pid in the `#entry{}` record 
matches the pid of the caller. Anything that doesn't match is ignored since the 
deletion already cleaned up the server state. Although we do kill the 
couch_db_updater pid for extra security, Though technically its duplicating 
work that the deletion request already handled (via killing the `open_async` 
pid which is linked to it).
   
   ## Testing recommendations
   
   The second of three commits adds a failing test case that is fixed by the 
third commit.
   
   `make eunit`
   
   ## Checklist
   
   - [x] Code is written and works correctly;
   - [x] Changes are covered by tests;
   - [x] No documentation to change
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to