Today the Skype service suffered a major outage:
http://blogs.skype.com/en/2010/12/skype_downtime_today.html
>>>
"Skype isn’t a network like a conventional phone or IM network – instead, it
relies on millions of individual connections between computers and phones to
keep things up and running. Some of these computers are what we call
‘supernodes’ – they act a bit like phone directories for Skype. If you want to
talk to someone, and your Skype app can’t find them immediately (for example,
because they’re connecting from a different location or from a different
device) your computer or phone will first try to find a supernode to figure out
how to reach them.
Under normal circumstances, there are a large number of supernodes available.
Unfortunately, today, many of them were taken offline by a problem affecting
some versions of Skype. As Skype relies on being able to maintain contact with
supernodes, it may appear offline for some of you.
What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as
fast as they can, which should gradually return things to normal."
<<<
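The directory role described in that excerpt can be sketched with a toy model. All names here (Supernode, register, lookup) are hypothetical illustrations; Skype's actual protocol is proprietary and far more involved:

```python
# Toy model of the supernode-as-phone-directory idea described above.
# Everything here is a hypothetical sketch, not Skype's real protocol.

class Supernode:
    """Acts like a phone directory: maps user ids to current addresses."""
    def __init__(self):
        self.directory = {}

    def register(self, user, address):
        self.directory[user] = address

    def lookup(self, user):
        return self.directory.get(user)  # None if this supernode doesn't know

def find_peer(user, supernodes):
    """Ask available supernodes in turn until one knows the user."""
    for sn in supernodes:
        addr = sn.lookup(user)
        if addr is not None:
            return addr
    return None  # no reachable supernode knows the user -> appears offline

sn = Supernode()
sn.register("alice", "203.0.113.7:5060")
print(find_peer("alice", [sn]))  # 203.0.113.7:5060
print(find_peer("bob", [sn]))    # None -> "bob" appears offline
```

Note that the client depends on finding *some* live supernode; with none reachable, every lookup fails, which is exactly the "may appear offline" symptom.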
The Skype directory function is a peer-to-peer distributed system. As is common
with P2P architectures, it employs a global protocol so everyone can
communicate. This introduces a global failure domain. Although the Skype
directory service is distributed, it is encapsulated within a single failure
domain in the sense that if there is a weakness in the shared protocol
implementation, a cascading or wide-scale failure can result. Because P2P
architectures are homogeneous by design, there is pressure to deploy a
homogeneous population. This is my understanding, given the available
information, of what happened with Skype today. After a wide-scale failure of
many supernodes in a large homogeneous population, triggered by poison messages
of some type, it appears there was insufficient remaining service; the
remaining supernodes were overwhelmed. Related perhaps is the Amazon S3 outage
from 2008: http://status.aws.amazon.com/s3-20080720.html
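A toy capacity model makes the cascade concrete. All numbers below are invented for illustration, not Skype measurements: each supernode serves a fixed number of clients; a buggy client version kills part of the population, the dead nodes' load shifts to the survivors, and any survivor pushed past capacity fails in turn:

```python
# Toy cascading-failure model of a homogeneous supernode population.
# All numbers are invented for illustration, not Skype measurements.

def cascade(n_nodes, n_killed, total_load, capacity):
    """Kill n_killed nodes, spread the load uniformly over the survivors,
    and let any overloaded survivor fail in turn.
    Returns the surviving supernode count once the cascade settles."""
    alive = n_nodes - n_killed
    while alive > 0 and total_load / alive > capacity:
        alive -= 1  # an overloaded survivor drops out, worsening the rest
    return alive

# 10,000 supernodes, 1,000,000 clients, capacity of 120 clients per node.
print(cascade(10_000, 1_000, 1_000_000, 120))  # 9000 -- survivors absorb it
print(cascade(10_000, 2_000, 1_000_000, 120))  # 0 -- past the cliff, collapse
```

The cliff is the point: because every node has the same capacity, the survivors either absorb the shifted load or the whole population unravels, which matches the "insufficient remaining service" outcome above.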
I claim that a cascading failure of a large homogeneous P2P population is
roughly equivalent to the failure of a master in a master-slave architecture.
The difference is that in a master-slave architecture the master is already
engineered to handle all of the coordination traffic for the slaves. We expect
the master to fail because as engineers we are eternal pessimists, so we
prepare a failover plan. The failover resources are likewise provisioned to
handle all of the slaves. On the one hand this is a pain. On the other hand,
recovery is
easier to reason about and plan for. I have never witnessed a speedy recovery
from a large scale failure of a P2P system. I'm not seeing one here either.
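The failover argument can be put in the same sketchy style. The numbers are hypothetical, and a real failover plan involves far more than raw capacity, but the shape of the reasoning is:

```python
# Sketch of the master-slave failover argument (hypothetical numbers).
# Both the master and its standby are provisioned up front for the FULL
# coordination load of all slaves, so recovery is a single promotion step
# rather than a cascade.

TOTAL_COORDINATION_LOAD = 1_000_000  # traffic from all slaves
NODE_CAPACITY = 1_200_000            # master and standby sized with headroom

def recover(standby_capacity, total_load):
    """Promote the standby after the master fails.
    Succeeds iff the standby was pre-provisioned for the whole load."""
    return "recovered" if standby_capacity >= total_load else "outage"

print(recover(NODE_CAPACITY, TOTAL_COORDINATION_LOAD))  # recovered
```

The capacity check is done at planning time, not during the incident, which is why this kind of recovery is easier to reason about in advance.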
I'm not saying that either the master-slave or the P2P architecture is always
better than the other. Both have their places. One will be more suitable for
some use cases
than the other. Something like Skype is only possible with the P2P model. We
know some systems where a singleton master is a scalability limit. :-)
(Master-slave != singleton-master, strictly speaking.)
However, sometimes we see proponents of fully decentralized systems look at
master-slave architectures and loudly proclaim "SPOF! SPOF!". It is worth
considering that all designs have their own weaknesses.
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back.
- Piet Hein (via Tom White)