On Wed, Apr 9, 2008 at 5:08 AM, Paul Lindner <[EMAIL PROTECTED]> wrote:

> Hi,
>
> So just a few quick notes on running shindig in a high-volume production
> environment.  As you may know, hi5 launched our OpenSocial platform on
> March 31st, and we've been at 100% since last Thursday.  Shindig has been
> doing a great job for us.
>
> * On a single server instance we were able to push shindig into the 500
>  req/sec range.  Beyond that we saw request latency go higher than we
>  found acceptable.


Presumably you're running more than one instance in production, right? Are
you just using some standard load balancer for this?


>
> * If you're careful about your iframe URLs you can get near 100% cache
>  hit ratio.  Our iframes only vary on country/lang/userprefs, and we
>  have low usage of userprefs.  Shindig has good support for generating and
>  processing URLs with a version param (v=...).


This is indeed really useful, but I worry that we'll have to sacrifice
iframe caching in the long run in exchange for preloading app data. Maybe we
can be intelligent about this and have the metadata handlers determine the
optimal trade-off between preloading and cache preservation.
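
Concretely, the kind of cache-friendly URL Paul describes can be generated
deterministically -- here's a rough Java sketch (the parameter names are just
illustrative, not Shindig's actual ones):

    // A rough sketch of a cache-friendly iframe URL that varies only on
    // country/lang/userprefs plus a spec version; parameter names are
    // illustrative, not Shindig's actual ones.
    import java.net.URLEncoder;

    public class IframeUrlSketch {
      static String iframeUrl(String base, String country, String lang,
                              String userPrefs, String specVersion) throws Exception {
        return base
            + "?country=" + URLEncoder.encode(country, "UTF-8")
            + "&lang=" + URLEncoder.encode(lang, "UTF-8")
            + "&up=" + URLEncoder.encode(userPrefs, "UTF-8")
            + "&v=" + URLEncoder.encode(specVersion, "UTF-8");
      }

      public static void main(String[] args) throws Exception {
        // Identical inputs always yield the identical URL, so a CDN or
        // reverse proxy can cache the rendered iframe aggressively.
        System.out.println(iframeUrl("http://example.org/gadgets/ifr",
            "US", "en", "", "abc123"));
      }
    }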


>
> * Using a CDN or caching reverse proxy in front of your shindig server
>  gives good results, even with all the uncachable requests for
>  /gadgets/socialdata and /gadgets/proxy.  We are seeing
>  a 30% hit rate on our cache.  Making sure we gzipped requests
>  outbound and inbound helped too.


Gzipping is critical -- virtually all servlet containers should support it.
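
If your container doesn't enable it natively, a filter is only a few lines.
Here's a rough sketch against the Servlet 2.x API (Content-Length handling
and flushing details are omitted, so treat it as a starting point rather than
a drop-in):

    // A minimal gzip filter sketch for the Servlet 2.x API; a production
    // version would also reset Content-Length and handle flushing properly.
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.zip.GZIPOutputStream;
    import javax.servlet.*;
    import javax.servlet.http.*;

    public class GzipFilter implements Filter {
      public void init(FilterConfig config) {}
      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;
        String accept = request.getHeader("Accept-Encoding");
        if (accept == null || !accept.contains("gzip")) {
          chain.doFilter(request, response);   // client can't handle gzip
          return;
        }
        response.setHeader("Content-Encoding", "gzip");
        final GZIPOutputStream gzip = new GZIPOutputStream(response.getOutputStream());
        final ServletOutputStream out = new ServletOutputStream() {
          public void write(int b) throws IOException { gzip.write(b); }
        };
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper(response) {
          public ServletOutputStream getOutputStream() { return out; }
          public PrintWriter getWriter() { return new PrintWriter(out); }
        };
        chain.doFilter(request, wrapper);
        gzip.finish();                          // write the compressed trailer
      }
    }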

Edge caching is also extremely valuable, especially if you have a user base
that is widely distributed geographically.

> We had an outbound HTTP proxy for shindig requests to outside hosts for a
>  while; however, we've removed it for now.  This is because we were only
>  seeing a 2% hit rate on requests and it added a high amount of latency,
>  slowing everything down.
>
>  We'll probably add this back at some point, since a proxy of this sort
>  can hold onto stale content when the originating site is down.


We decided to just use a shared cache rather than an outbound proxy. The
easiest way to achieve this is to provide a custom RemoteContentFetcher. So
far it hasn't been that useful for proxied requests, but it certainly helps
to have a shared cache (I have a ticket open to provide out-of-the-box
memcached support) for gadget specs and message bundles. This is especially
true as you scale out. Serving stale entries is almost always better than
failing to respond.
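
For what it's worth, the decorator shape looks roughly like this -- note that
the Fetcher interface and the in-memory map below are hypothetical stand-ins
(the real RemoteContentFetcher signatures differ, and the map would be
memcached or some other shared store):

    // A minimal sketch of a caching fetcher decorator. Fetcher is a made-up
    // interface; the real Shindig fetcher API is richer. The ConcurrentMap
    // stands in for a shared cache such as memcached.
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    interface Fetcher {
      String fetch(String url) throws Exception;
    }

    class CachingFetcher implements Fetcher {
      private final Fetcher delegate;
      private final ConcurrentMap<String, String> cache =
          new ConcurrentHashMap<String, String>();

      CachingFetcher(Fetcher delegate) { this.delegate = delegate; }

      public String fetch(String url) throws Exception {
        String stale = cache.get(url);
        try {
          String fresh = delegate.fetch(url);   // a real version would honor TTLs
          cache.put(url, fresh);                // refresh the shared copy
          return fresh;
        } catch (Exception e) {
          if (stale != null) {
            return stale;                       // serve stale rather than fail
          }
          throw e;
        }
      }
    }

The important part is the catch block: when the origin is down you hand back
whatever you fetched last, which is exactly the stale-entry point above.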

> * We wrote our own implementations of the social data interfaces, the
>  content fetch interface, and the gadget signer interface.  This is all
>  pre-Guice patch.


You'll still need the custom gadget signer, sadly. Brian and I have been
trying to come up with an encryption scheme for this that will be suitable
for most deployments.

>  For ContentFetch we retrieve the appXML from a local memcache instance,
>  since we grab and parse the appXML in our developer console.


Are you preserving stale entries as well?  We've found this extremely
valuable.

>  For GadgetSigner we use our custom RSA key.


Any interest in helping to come up with a standard implementation for this,
so that key generation is all that's necessary? Unfortunately the Google
implementation isn't suitable for this because it uses some proprietary
security tools.
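
The mechanics are all in java.security, so a standard implementation really
could boil down to "generate a key pair and wire it in."  A rough sketch of
just the signing part (this is not the GadgetSigner interface, only the
underlying RSA plumbing, with a made-up payload string):

    // Sketch of RSA key generation plus SHA1-with-RSA sign/verify using
    // plain java.security; the payload string is just an example.
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class RsaSigningSketch {
      public static void main(String[] args) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(1024);
        KeyPair keys = gen.generateKeyPair();

        byte[] payload = "o=owner&v=viewer&a=appid".getBytes("UTF-8");

        Signature signer = Signature.getInstance("SHA1withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(payload);
        byte[] sig = signer.sign();

        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(payload);
        System.out.println("verified: " + verifier.verify(sig));
      }
    }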


>
>  For the SocialData API stuff we integrated the various calls with our
>  existing User/Friends/Activity services.  We spent a bit of time
>  optimizing the People fetches to only convert users that fall into the
>  result set.  The trick is to convert filters into appropriate backend
>  calls to reduce the data set size, and then to intelligently select only
>  the actual users to convert based on first/max/sort.
>
>  We also converted a custom hi5 extension to use a new socialdata
>  call for photo albums.  We also added our experimental 'presence'
>  field to Person to show ONLINE/OFFLINE status.
>
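
Nice. For anyone else wiring this up, the windowing step Paul describes is
roughly the following (names are made up for illustration; the real code
obviously pushes as much filtering as possible into the backend query):

    // Sort the filtered ids, then convert only the [first, first+max) window
    // into full Person objects, instead of hydrating the whole result set.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    class PeopleSelector {
      static <T> List<T> window(List<T> filteredIds, Comparator<T> sort,
                                int first, int max) {
        List<T> sorted = new ArrayList<T>(filteredIds);
        Collections.sort(sorted, sort);
        int from = Math.min(first, sorted.size());
        int to = Math.min(first + max, sorted.size());
        return sorted.subList(from, to);
      }
    }
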
> * The built-in Fetch memory cache is mostly useless (and should probably
>  be removed).  We spent a couple of days chasing red herrings on why apps
>  would not get refreshed because of this.


Absolutely useless! See my JIRA ticket :)


>
>
> * The patches to add REFRESH_INTERVAL to gadgets.io.makeRequest() helped
>  our cacheability a lot.  It appears that more and more people are
>  using signed requests, which will really hurt cacheability.  We regenerate
>  the security token on every request; that might have to change...
>
> * Prefetch of data would go a long way towards making shindig scale better.
>  Right now requests for /gadgets/socialdata dominate, and for profile pages
>  with multiple apps you'll see the same data fetched multiple times.


Louis Ryan is working on this. He already committed the initial Preload
stuff... we need Preload signing / OAuth next.

> * The default five-second timeout on proxy requests scares me.  If one of
>  our major partners starts getting slow we'll stack up a lot of
>  threads...


So make it configurable -- a new parameter is easy to add with Guice :)
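
Something along these lines, for example -- the binding name here is made up,
so pick whatever fits your config scheme:

    // Sketch of a Guice-configurable timeout; "shindig.proxy.timeout-ms"
    // is a hypothetical binding name, not an existing Shindig parameter.
    import com.google.inject.AbstractModule;
    import com.google.inject.Guice;
    import com.google.inject.Inject;
    import com.google.inject.Injector;
    import com.google.inject.name.Named;
    import com.google.inject.name.Names;

    public class TimeoutModule extends AbstractModule {
      protected void configure() {
        // Could just as easily read the value from a properties file.
        bindConstant().annotatedWith(Names.named("shindig.proxy.timeout-ms")).to(5000);
      }
    }

    class ProxyFetcher {
      final int timeoutMs;

      @Inject
      ProxyFetcher(@Named("shindig.proxy.timeout-ms") int timeoutMs) {
        this.timeoutMs = timeoutMs;   // used when opening the outbound connection
      }
    }

    class Demo {
      public static void main(String[] args) {
        Injector injector = Guice.createInjector(new TimeoutModule());
        System.out.println(injector.getInstance(ProxyFetcher.class).timeoutMs);
      }
    }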


>
>
> You can see the hi5 modifications to shindig at
> http://www.hi5networks.com/platform/browser/shindig/
>
>
>


-- 
~Kevin
