[Wikitech-l] Re: Research on Wikimedia Production Errors

Thiemo Kreuz Thu, 08 Jun 2023 00:31:34 -0700

I'm in no way an expert in this area. But from what I have seen over
the past years I think I can identify two repeating patterns:


1. Minor programming mistakes in unrelated code. This happens often
when we add more strict types to existing code, or make it throw
exceptions when it's called in a way it should never have been called.
E.g. when a method that expects a string is called with null. Tests
can rarely catch such "unthinkable" edge cases beforehand. They bubble
up in production where codebases work together in ways that have never
been part of any automated or manual test setup. Luckily this kind of
error is often easy to fix or safe to ignore.

2. Database hiccups. Errors that appear to be "random" and are really
hard, if not impossible, to reproduce. Sometimes it turns out the
reason is a really, really old database row that was created with very
different constraints in mind. More recent code might have a different
idea how a particular database table works nowadays and fails when
faced with incompatible data. Or we find that the database schema on
certain replication machines is not what it should be. For example,
foreign keys to tables that should have been gone for 18 years, but
somehow still exist. ;-) https://phabricator.wikimedia.org/T299387
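
Both patterns can be illustrated with a minimal sketch. It is written in
Python rather than the PHP used by MediaWiki, and every function name,
column name, and value in it is hypothetical, chosen only for
illustration. The first pair of functions shows how tightening a
signature turns a previously tolerated null argument into an exception.
The last function shows newer code failing on a row that predates its
assumptions.

```python
from typing import Optional


def normalize_title_lenient(title: Optional[str]) -> str:
    # Old behaviour (hypothetical): a None argument is silently treated
    # as an empty title, so a careless caller never notices its mistake.
    return (title or "").strip().replace(" ", "_")


def normalize_title_strict(title: str) -> str:
    # Stricter behaviour (hypothetical): reject calls that should never
    # have happened. Callers in unrelated code that pass None now fail.
    if not isinstance(title, str):
        raise TypeError(f"title must be str, got {type(title).__name__}")
    return title.strip().replace(" ", "_")


def parse_legacy_row(row: dict) -> int:
    # Newer code assumes every row has a "namespace" column (an assumed
    # name). A very old row may lack it, and the failure only shows up
    # when that particular row happens to be read.
    namespace = row.get("namespace")
    if namespace is None:
        raise ValueError(f"row {row.get('id')!r} has no namespace")
    return int(namespace)
```

A test suite that only ever passes well-formed strings and freshly
created rows exercises neither failure path, which is one way such
errors can stay hidden until production traffic reaches them.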

Let's say I'm interested, but have no research at hand. :-)

Best
Thiemo
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
