[sqlalchemy] Recovering from StaleDataErrors

George Reilly Sun, 19 Apr 2015 14:58:48 -0700

I ran into a `StaleDataError` from PostgreSQL the other day when a client
app sent us
two update requests back-to-back without waiting for the first request
to send a response.


Given

    class Mood(Base):
        id = Column(BigInteger, primary_key=True, autoincrement=False,
nullable=False)
        name = Column(VARCHAR(length=255), nullable=False, index=True,
unique=True)
        # etc

    class Avatar(Base):
        id = Column(BigInteger, primary_key=True, autoincrement=False,
nullable=False)
        version = Column(Integer, nullable=False, default=0)
        moods = relationship(Mood, secondary=lambda: avatar_moods_table)
        # etc

    # The association link between avatars and moods.
    avatar_moods_table = Table(
        'avatar_moods',
        Base.metadata,
        Column('avatar_id', BigInteger, ForeignKey(Avatar.id,
ondelete='CASCADE'),
               nullable=False, primary_key=True, index=True),
        Column('mood_id', BigInteger, ForeignKey(Mood.id,
ondelete='CASCADE'),
               nullable=False, primary_key=True),
    )

The state of the Avatar before either client request was that there were
three moods,
`['Grumpy', 'Tired', 'Disgruntled']`. Both client requests' payloads
included
`moods = ['Elated']`, as well as other data that (probably) differed
between the two.

The code to update the moods looks like

    orm_model.moods = [Mood.get_by_name(self.session, mood) for mood in
logic_model.moods]

where `logic_model` is the deserialized request body and `orm_model` is the
SQLAlchemy object
retrieved from the database.


The aggregated centralized log looks like

    first interleaved request, 254c1046-0da701ca, at 19:09:03, which
switches moods from `['Grumpy', 'Tired', 'Disgruntled']` to `['Elated']`
        Apr 15 19:09:03 [3843:MainThread] INFO  [254c1046-0da701ca] REST
request: PUT http://<hostname>/avatar/<avatar_id> received...

    Second interleaved request, 711bc322-df5afba6, at 19:09:04, also with
moods=`['Elated']`
        Apr 15 19:09:04 [4572:MainThread] INFO  [711bc322-df5afba6] REST
request: PUT http://<hostname>/avatar/<avatar_id> received...

    Second request fails
        Apr 15 19:09:05 [4572:MainThread] WARNI [711bc322-df5afba6] REST
request: PUT http://<hostname>/avatar/<avatar_id> failed: 400 Bad Request
(GENERAL): 'DELETE statement on table 'avatar_moods' expected to delete 3
row(s); Only 0 were matched.' (DELETE statement on table 'avatar_moods'
expected to delete 3 row(s); Only 0 were matched.)

    First request succeeds
        Apr 15 19:09:05 [3843:MainThread] INFO  [254c1046-0da701ca] REST
response: [1570ms] PUT http://<hostname>/avatar/<avatar_id> responded: 200
OK


Here's what I think happened:

Request 1 arrived
* The AvatarRestResource started a session and a transaction
* The AvatarLogicResource read the current state of the avatar from
PostgreSQL,
  which included 3 moods (presumably)
* LR checked the Version in the request body (3) against the one in the ORM
model (also version 3).
* The AvatarLogicResource set the moods on the ORM model to 1 mood,
`['Elated']`.
* When the transaction committed, just before returning from the Rest
Resource,
  SQLAlchemy deleted the 3 rows in the avatar_moods association table
  and inserted 1 new row, along with any other updates.

Request 2 arrived, while Request 1 was in progress.
It arrived in a different worker process or even a different server box.
* The AvatarRestResource started another session and transaction.
  ACID transactions provide Isolation, so it should only have seen
  the committed state of the database. Presumably, Request 1 hadn't
completed,
  so none of its changes should have been visible in Request 2
* LR#2 read the ORM model and got 3 moods
* LR#2 also set the moods to `['Elated']`.
* RR#2 committed the transaction and it blew up with the error,
  `DELETE statement on table 'avatar_moods' expected to delete 3 row(s);
Only 0 were matched.`

Based on my reading of some of the below links about MVCC, #1's transaction
completed first.
Transaction #2 in PostgreSQL now had visibility to those updated rows in
the DB,
so it returned "0 rows deleted" to SQLAlchemy.
This confounded SA's expectations and it blew up.
I think.

Postgres is using Multiversion Concurrency Control:
* https://devcenter.heroku.com/articles/postgresql-concurrency
* http://www.slideshare.net/robertsosinski/mvcc-robert-sosinski
* http://www.postgresql.org/docs/9.4/static/mvcc.html
* https://wiki.postgresql.org/wiki/MVCC


Finally, my question: What, if anything, could we have done to mitigate
this on the server?

Looking at dependency.py for 0.9.9, it looks like SQLAlchemy considers this
to be an unrecoverable error.

Note: I think the client app must enforce serialization of requests
if it wants a consistent worldview from its own actions.
We do have a `version` column in the Avatar table, but that's used to
handle concurrent/stale updates from two different devices.
Some requests are long running and a case could be made for accepting more
requests
while the earlier ones are still being handled—assuming no conflict—to
make the client app more responsive.
-- 
/George V. Reilly

PS Congratulations on releasing SQLAlchemy 1.0! I've been a happy user
since 2008.

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

[sqlalchemy] Recovering from StaleDataErrors

Reply via email to