On 10/21/10 4:04 PM, Aryeh Gregor wrote:
> On Thu, Oct 21, 2010 at 6:31 PM, Neil Kandalgaonkar <[email protected]> wrote:
>> For what it's worth, I'm influenced by my former job at Flickr, where
>> the practice was to deploy several times *per day*, directly from trunk.
>> That may be more extreme than we want, but be aware there are people who
>> are doing it successfully -- it just takes a few extra development
>> practices.
>
> Personally, I think it would be awesome if we could migrate to this
> level of deployment frequency eventually.  I imagine that
> comprehensive automated test suites are a major part of making this
> reliable.

Nope. Automated tests help a lot with this approach, but Flickr doesn't
have much better tests than MediaWiki does.

We *should* have better tests, but a great test suite is not a
prerequisite for doing this.


> To the extent you can share any details about how stuff
> works at Flickr, what long-term changes are necessary for this to be
> practical?

Flickr engineers have already talked a lot about this in public. See 
references below.

The main insight here is that branching is a bad way for a website to
manage change. Unlike shrink-wrapped software, we do not have an install
base out in the world that we patch by shipping CDs. For a website, we
control the entire install base.[1]

What we need are ways of managing change across our server clusters, or 
managing incremental feature and infrastructure upgrades. This leads to 
"branching in code".

Doing things the Flickr way entirely would require:

1 - A "feature flag" system, for "branching in code". The point is to 
start developing a new feature with it being turned off by default for 
most environments and without succumbing to branching and merging 
misery. In other words, day one of a new feature looks like this:

   if ( $wgFeature['MyNewThing'] ) {
     /* ... new code ... */
   } else {
     /* ... old code ... */
   }

Of course if you're fixing bugs there's no need to hide that behind a 
feature flag.


2 - Every developer with commit access is thinking about deployment onto
a cluster of machines all the time. Committing to the repository means
you are asserting that this will work in production. (This is the hard
part for us, I think, but maybe not insurmountable.)

3 - One can deploy with a single button press (and there is a system 
recording what changes were deployed and why, for ops' convenience).

4 - When there's trouble, new deploys can be blocked centrally, and then 
ops can revert to a previous version with a single button press.

5 - Developers are good about "cleaning up" code that was previously 
protected by feature flags once the behaviour is standard. (HINT: this 
is the part Flickr doesn't talk about in public... but as an open source 
project with more visible dirty laundry, perhaps we can do better.)
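
As a rough illustration of item 1, here is a minimal sketch of how a flag
lookup with per-environment overrides might look. It reuses the $wgFeature
array from the example above; $wgFeatureOverrides, isFeatureEnabled() and
renderThing() are hypothetical names invented for this sketch. They are not
existing MediaWiki globals or functions, and this is not Flickr's actual
implementation.

```php
<?php
// Hypothetical sketch: apart from $wgFeature (taken from the example
// above), none of these names exist in MediaWiki.

// Flags default to off, so half-finished code can live in trunk safely.
$wgFeature = array(
    'MyNewThing' => false,
);

// Per-environment overrides, e.g. switch the feature on for a test wiki only.
$wgFeatureOverrides = array(
    'testwiki' => array( 'MyNewThing' => true ),
);

/**
 * Return true if flag $name is enabled for environment $env.
 * An override wins over the default; unknown flags count as off,
 * so a typo in a flag name cannot enable new code by accident.
 */
function isFeatureEnabled( $name, $env, array $defaults, array $overrides ) {
    if ( isset( $overrides[$env][$name] ) ) {
        return (bool)$overrides[$env][$name];
    }
    return isset( $defaults[$name] ) ? (bool)$defaults[$name] : false;
}

// The "day one" branch from the example above, routed through the helper.
function renderThing( $env, array $defaults, array $overrides ) {
    if ( isFeatureEnabled( 'MyNewThing', $env, $defaults, $overrides ) ) {
        return 'new code';
    }
    return 'old code';
}
```

Keeping the lookup in one helper would also make the cleanup in item 5
easier, since every use of a flag can be found by searching for its name.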

This system does result in more "oops" moments. But the point is to make
those easy to recover from, and to have a culture where people aren't
blamed too much for them. The point is not to build a system that tries
to ensure deploy branches can be tested to near perfection. The real
problems are always things that nobody anticipated anyway.
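
To make items 3 and 4 concrete, here is a small in-memory sketch of the
state such a deploy tool would need to track: who deployed what and why, a
central block switch, and a one-step revert. The DeployLog class, its
methods and the revision names are all hypothetical, made up for
illustration; a real tool would also have to push code to the cluster and
persist its log somewhere.

```php
<?php
// Hypothetical sketch: an in-memory stand-in for a deploy tool's state.

class DeployLog {
    /** @var array List of array( 'rev' => ..., 'who' => ..., 'why' => ... ) */
    private $history = array();
    /** @var string|null Reason deploys are blocked, or null if allowed */
    private $blockedReason = null;

    /** Record a deploy, unless deploys are currently blocked. */
    public function deploy( $rev, $who, $why ) {
        if ( $this->blockedReason !== null ) {
            throw new RuntimeException( "Deploys are blocked: {$this->blockedReason}" );
        }
        $this->history[] = array( 'rev' => $rev, 'who' => $who, 'why' => $why );
        return $rev;
    }

    /** Centrally stop new deploys while trouble is investigated. */
    public function block( $reason ) {
        $this->blockedReason = $reason;
    }

    public function unblock() {
        $this->blockedReason = null;
    }

    /** Revision currently live, or null if nothing has been deployed. */
    public function current() {
        $last = end( $this->history );
        return $last === false ? null : $last['rev'];
    }

    /**
     * Go back to the revision that was live before the current one.
     * Allowed even while deploys are blocked, and recorded as a new log
     * entry so the history of what happened is never rewritten.
     */
    public function revert( $who, $why ) {
        $n = count( $this->history );
        if ( $n < 2 ) {
            throw new RuntimeException( 'No previous deploy to revert to' );
        }
        $previous = $this->history[$n - 2]['rev'];
        $this->history[] = array( 'rev' => $previous, 'who' => $who, 'why' => "revert: $why" );
        return $previous;
    }
}
```

The design choice worth noting is that revert() bypasses the block:
stopping new changes and getting back to a known-good state are separate
actions, and ops need the second one most when the first is in effect.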


NOTES

[1] For the purposes of this argument, I am ignoring MediaWiki as a
deliverable and only thinking about the project websites.

REFERENCES

Here's the most concise presentation:
"Always Ship Trunk: Managing Change In Complex Websites" by Paul Hammond
http://www.paulhammond.org/2010/06/trunk/alwaysshiptrunk.pdf

And a longer talk about all this:
"10+ Deploys Per Day: Dev/Ops Cooperation at Flickr" by Paul Hammond and
John Allspaw
http://velocityconference.blip.tv/file/2284377/

Blog post about the feature flag system:
"Flipping out" by Ross Harmes
http://code.flickr.com/blog/2009/12/02/flipping-out/



-- 
Neil Kandalgaonkar ( ) <[email protected]>