Today the Skype service suffered a major outage:
http://blogs.skype.com/en/2010/12/skype_downtime_today.html
>>>
"Skype isn’t a network like a conventional phone or IM network – instead, it
relies on millions of individual connections between computers and phones to
keep things up and running. Some of these computers are what we call
‘supernodes’ – they act a bit like phone directories for Skype. If you want to
talk to someone, and your Skype app can’t find them immediately (for example,
because they’re connecting from a different location or from a different
device) your computer or phone will first try to find a supernode to figure out
how to reach them.
Under normal circumstances, there are a large number of supernodes available.
Unfortunately, today, many of them were taken offline by a problem affecting
some versions of Skype. As Skype relies on being able to maintain contact with
supernodes, it may appear offline for some of you.
What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as
fast as they can, which should gradually return things to normal."
<<<
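The directory role described in that excerpt can be sketched with a toy model. All names here (Supernode, register, lookup) are hypothetical illustrations; Skype's actual protocol is proprietary and far more involved:

```python
# Toy model of the supernode-as-phone-directory idea described above.
# Everything here is a hypothetical sketch, not Skype's real protocol.

class Supernode:
    """Acts like a phone directory: maps user ids to current addresses."""
    def __init__(self):
        self.directory = {}

    def register(self, user, address):
        self.directory[user] = address

    def lookup(self, user):
        return self.directory.get(user)  # None if this supernode doesn't know

def find_peer(user, supernodes):
    """Ask available supernodes in turn until one knows the user."""
    for sn in supernodes:
        addr = sn.lookup(user)
        if addr is not None:
            return addr
    return None  # no reachable supernode knows the user -> appears offline

sn = Supernode()
sn.register("alice", "203.0.113.7:5060")
print(find_peer("alice", [sn]))  # 203.0.113.7:5060
print(find_peer("bob", [sn]))    # None -> "bob" appears offline
```

Note that the client depends on finding *some* live supernode; with none reachable, every lookup fails, which is exactly the "may appear offline" symptom.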
The Skype directory function is a peer-to-peer distributed system. As is common
with P2P architectures, it employs a global protocol so everyone can
communicate. This introduces a global failure domain. Although the Skype
directory service is distributed, it is encapsulated within a single failure
domain in the sense that if there is a weakness in the shared protocol
implementation, a cascading or wide-scale failure can result. Because P2P
architectures are homogeneous by design, there is pressure to deploy a
homogeneous population. This is my understanding, given the available
information, of what happened with Skype today. After a wide-scale failure of
many supernodes in a large homogeneous population, triggered by poison messages
of some type, it appears there was insufficient remaining service; the
remaining supernodes were overwhelmed. Related perhaps is the Amazon S3 outage
from 2008: http://status.aws.amazon.com/s3-20080720.html
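A toy capacity model makes the cascade concrete. All numbers below are invented for illustration, not Skype measurements: each supernode serves a fixed number of clients; a buggy client version kills part of the population, the dead nodes' load shifts to the survivors, and any survivor pushed past capacity fails in turn:

```python
# Toy cascading-failure model of a homogeneous supernode population.
# All numbers are invented for illustration, not Skype measurements.

def cascade(n_nodes, n_killed, total_load, capacity):
    """Kill n_killed nodes, spread the load uniformly over the survivors,
    and let any overloaded survivor fail in turn.
    Returns the surviving supernode count once the cascade settles."""
    alive = n_nodes - n_killed
    while alive > 0 and total_load / alive > capacity:
        alive -= 1  # an overloaded survivor drops out, worsening the rest
    return alive

# 10,000 supernodes, 1,000,000 clients, capacity of 120 clients per node.
print(cascade(10_000, 1_000, 1_000_000, 120))  # 9000 -- survivors absorb it
print(cascade(10_000, 2_000, 1_000_000, 120))  # 0 -- past the cliff, collapse
```

The cliff is the point: because every node has the same capacity, the survivors either absorb the shifted load or the whole population unravels, which matches the "insufficient remaining service" outcome above.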
I claim that a cascading failure of a large homogeneous P2P population is
roughly equivalent to the failure of a master in a master-slave architecture.
The difference is that in a master-slave architecture the master is already
engineered to handle all of the coordination traffic for the slaves. We expect
the master to fail because as engineers we are eternal pessimists, so we
prepare a failover plan. The failover resources are likewise provisioned to
handle all of the slaves. On the one hand this is a pain. On the other hand,
recovery is
easier to reason about and plan for. I have never witnessed a speedy recovery
from a large scale failure of a P2P system. I'm not seeing one here either.
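The failover argument can be put in the same sketchy style. The numbers are hypothetical, and a real failover plan involves far more than raw capacity, but the shape of the reasoning is:

```python
# Sketch of the master-slave failover argument (hypothetical numbers).
# Both the master and its standby are provisioned up front for the FULL
# coordination load of all slaves, so recovery is a single promotion step
# rather than a cascade.

TOTAL_COORDINATION_LOAD = 1_000_000  # traffic from all slaves
NODE_CAPACITY = 1_200_000            # master and standby sized with headroom

def recover(standby_capacity, total_load):
    """Promote the standby after the master fails.
    Succeeds iff the standby was pre-provisioned for the whole load."""
    return "recovered" if standby_capacity >= total_load else "outage"

print(recover(NODE_CAPACITY, TOTAL_COORDINATION_LOAD))  # recovered
```

The capacity check is done at planning time, not during the incident, which is why this kind of recovery is easier to reason about in advance.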
I'm not saying that either the master-slave or the P2P architecture is always
better than the other. Both have their places. One will be more suitable for
some use cases
than the other. Something like Skype is only possible with the P2P model. We
know some systems where a singleton master is a scalability limit. :-)
(Master-slave != singleton-master, strictly speaking.)
However, sometimes we see proponents of fully decentralized systems look at
master-slave architectures and loudly proclaim "SPOF! SPOF!". It is worth
considering that all designs have their own weaknesses.
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back.
- Piet Hein (via Tom White)