Hi, 

A few quick notes on running Shindig in a high-volume production
environment.  As you may know, hi5 launched our OpenSocial platform on
March 31st, and we've been at 100% since last Thursday.  Shindig has been
doing a great job for us.

* On a single server instance we were able to push Shindig into the 500
  req/sec range.  Beyond that, request latency rose above what we found
  acceptable.

* If you're careful about your iframe URLs you can get near 100% cache
  hit ratio.  Our iframes only vary on country/lang/userprefs, and we
  have low usage of userprefs.  Shindig has good support for generating and
  processing URLs with a version param (v=...).
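
As a rough illustration of the point above, here is a minimal sketch of
building an iframe URL that varies only on the inputs that change the
rendered output.  The class is hypothetical (it is not Shindig code), and
the path and parameter names (url, country, lang, up_) are assumptions
made for the example; only the v= param is mentioned above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Shindig: builds an iframe URL whose
// query string depends only on version, country, lang and user prefs.
final class IframeUrlSketch {
    static String build(String base, String gadgetUrl, String version,
                        String country, String lang,
                        Map<String, String> userPrefs) {
        StringBuilder sb = new StringBuilder(base);
        sb.append("?url=").append(enc(gadgetUrl));
        // Bumping the version yields a new URL, so old cache entries are
        // simply never requested again.
        sb.append("&v=").append(enc(version));
        sb.append("&country=").append(enc(country));
        sb.append("&lang=").append(enc(lang));
        // Sort prefs by name so identical prefs always produce the
        // identical URL, and therefore a single cache entry.
        for (Map.Entry<String, String> e : new TreeMap<>(userPrefs).entrySet()) {
            sb.append("&up_").append(enc(e.getKey()))
              .append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Anything per-user that is left out of the URL (a security token, for
example) does not fragment the cache, which is why keeping the parameter
set small matters.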

* Using a CDN or caching reverse proxy in front of your Shindig server
  gives good results, even with all the uncacheable requests for
  /gadgets/socialdata and /gadgets/proxy.  We are seeing
  a 30% hit rate on our cache.  Making sure requests were gzipped both
  outbound and inbound helped too.

  We had an outbound HTTP proxy for Shindig requests to outside hosts for a
  while, but we've removed it for now.  We were only seeing a 2% hit rate
  on requests, and it added a lot of latency, slowing everything down.

  We'll probably add this back at some point, since a proxy of this sort
  can hold onto stale content when the originating site is down.

* We wrote our own implementations of the social data interfaces, the
  content fetch interface, and the gadget signer interface.  This is all
  pre-Guice patch.

  For ContentFetch we retrieve the appXML from a local memcache instance,
  since we grab and parse the appXML in our developer console.

  For GadgetSigner we use our custom RSA key.

  For the SocialData API stuff we integrated the various calls with our
  existing User/Friends/Activity services.  We spent a bit of time
  optimizing the People fetches to only convert users that fall into the
  resultset.  The trick is to convert filters into appropriate backend
  calls to reduce the data set size, and then to intelligently select only
  the actual users to convert based on first/max/sort.

  We also converted a custom hi5 extension to use a new socialdata
  call for photo albums.  We also added our experimental 'presence'
  field to Person to show ONLINE/OFFLINE status.
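
The People-fetch optimization described above can be sketched roughly as
follows: sort the cheap backend rows, slice them by first/max, and only
then run the expensive conversion on the slice.  This is a minimal sketch,
not the hi5 implementation; the class, method and parameter names are all
made up for the example, and it assumes filtering has already been pushed
down into the backend call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: R is a cheap backend row (for example a user id),
// P is the expensive converted object (for example a Person).
final class PeoplePageSketch {
    static <R, P> List<P> page(List<R> rows, Comparator<R> order,
                               int first, int max, Function<R, P> convert) {
        // Sort a copy of the lightweight rows; nothing is converted yet.
        List<R> sorted = new ArrayList<>(rows);
        sorted.sort(order);

        // Clamp the requested window [first, first + max) to the data.
        int from = Math.min(Math.max(first, 0), sorted.size());
        int to = (int) Math.min((long) from + Math.max(max, 0), sorted.size());

        // Convert only the rows that actually fall into the result set.
        List<P> result = new ArrayList<>(to - from);
        for (R row : sorted.subList(from, to)) {
            result.add(convert.apply(row));
        }
        return result;
    }
}
```

With 500 friends and max=20, this runs the conversion 20 times instead of
500.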

* The built-in Fetch memory cache is mostly useless (and should probably
  be removed).  We spent a couple of days chasing red herrings on why apps
  would not get refreshed because of this.

* The patches to add REFRESH_INTERVAL to gadgets.io.makeRequest() helped
  our cacheability a lot.  It appears that more and more people are
  using signed requests, which will really hurt cacheability.  We
  regenerate the security token on every request; that might have to
  change.

* Prefetch of data would go a long way towards making shindig scale better.
  Right now requests for /gadgets/socialdata dominate, and for profile pages
  with multiple apps you'll see the same data fetched multiple times.
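
Short of real prefetch, one way the duplicate fetches on a multi-app page
could be collapsed is a small cache that lives only for a single page
render.  The sketch below is an illustration of that idea under stated
assumptions, not something Shindig provides; the class name and the string
keys in the usage note are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: create one instance per page render and route all
// social data lookups for that render through it.  Not thread-safe.
final class RequestScopedCacheSketch<K, V> {
    private final Map<K, V> seen = new HashMap<>();
    private final Function<K, V> backend;

    RequestScopedCacheSketch(Function<K, V> backend) {
        this.backend = backend;
    }

    V get(K key) {
        // First request for a key hits the backend; repeats within the
        // same render are answered from the map.
        if (seen.containsKey(key)) {
            return seen.get(key);
        }
        V value = backend.apply(key);
        seen.put(key, value);
        return value;
    }
}
```

If three apps on a profile page each ask for a key such as
"owner/friends", the backend sees one call instead of three.  Because the
cache is discarded with the render, it does not add a staleness problem
of its own.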

* The default five second timeout on proxy requests scares me.  If one of
  our major partners starts getting slow we'll stack up a lot of
  threads...
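
We have not described a fix for this, but to make the concern concrete,
here is a minimal sketch of one possible mitigation: cap the number of
in-flight requests per remote host and fail fast once the cap is reached,
so one slow partner cannot tie up every thread.  The class, its names and
the fallback behaviour are assumptions for illustration only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: limits concurrent fetches per remote host.
final class PerHostLimiterSketch {
    private final ConcurrentHashMap<String, Semaphore> permits =
        new ConcurrentHashMap<>();
    private final int maxPerHost;

    PerHostLimiterSketch(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    <T> T call(String host, Supplier<T> fetch, T fallback) {
        Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
        // Do not wait for a permit: if the host is already saturated,
        // return the fallback immediately instead of parking a thread.
        if (!s.tryAcquire()) {
            return fallback;
        }
        try {
            return fetch.get();
        } finally {
            s.release();
        }
    }
}
```

With a five second timeout and a cap of N per host, a stalled partner
holds at most N threads at a time rather than an unbounded number.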

You can see the hi5 modifications to shindig at
http://www.hi5networks.com/platform/browser/shindig/

