On Wed, Apr 9, 2008 at 5:08 AM, Paul Lindner <[EMAIL PROTECTED]> wrote:
> Hi,
>
> So just a few quick notes on running shindig in a high-volume production
> environment. As you may know hi5 launched our OpenSocial platform on
> March 31st, and we've been at 100% since last Thursday. Shindig has been
> doing a great job for us.
>
> * On a single server instance we were able to push shindig into the 500
>   req/sec range. Beyond that we saw request latency go higher than we
>   found acceptable.

Presumably you're running more than one instance in production, right? Are
you just using some standard load balancer for this?

> * If you're careful about your iframe URLs you can get near 100% cache
>   hit ratio. Our iframes only vary on country/lang/userprefs, and we
>   have low usage of userprefs. Shindig has good support for generating
>   and processing URLs with a version param (v=...).

This is indeed really useful, but I worry that we'll have to sacrifice
iframe caching in the long run in exchange for preloading app data. Maybe
we can be intelligent about this and have the metadata handlers determine
the optimal trade-off between preloading and cache preservation.

> * Using a CDN or caching reverse proxy in front of your shindig server
>   gives good results, even with all the uncachable requests for
>   /gadgets/socialdata and /gadgets/proxy. We are seeing a 30% hit rate
>   on our cache. Making sure we gzipped requests outbound and inbound
>   helped too.

gzipping is critical -- virtually all servlet containers support it. Edge
caching is also extremely valuable, especially if you have a user base
that is widely distributed geographically.

> * We had an outbound http proxy for shindig requests to outside hosts
>   for a while, however we've removed it for now. This is because we were
>   only seeing a 2% hit rate on requests and it added a high amount of
>   latency, slowing everything down.
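Going back to the iframe URL point for a second: the way to keep those URLs
infinitely cacheable is to derive v= from the spec content itself, so the
URL only changes when the gadget actually changes. A rough sketch of the
idea -- helper names here are made up for illustration, this is not the
actual Shindig code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a stable version param from the gadget spec so iframe
// URLs only change when the spec content changes. Hypothetical helper,
// not the real Shindig implementation.
public class IframeVersion {
  // Hex-encoded SHA-1 of the spec XML; identical specs yield identical URLs,
  // so edge caches can treat the iframe URL as immutable.
  static String versionOf(String specXml) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(specXml.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-1 is present on every JVM
    }
  }

  // Vary the URL only on the inputs that actually affect rendering.
  static String iframeUrl(String specUrl, String country, String lang,
                          String specXml) {
    return "/gadgets/ifr?url=" + specUrl
        + "&country=" + country
        + "&lang=" + lang
        + "&v=" + versionOf(specXml);
  }
}
```

Then when you publish a new spec version the v= changes automatically and
old cached copies just age out.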
> We'll probably add this back at some point, since a proxy of this sort
> can hold onto stale content when the originating site is down.

We decided to just use a shared cache rather than an outbound proxy. The
easiest way to achieve this is to provide a custom RemoteContentFetcher.
So far it hasn't been that useful for proxied requests, but it certainly
helps to have a shared cache for gadget specs and message bundles (I have
a ticket open to provide out-of-the-box memcached support). This is
especially true as you scale out. Serving stale entries is almost always
better than failing to respond.

> * We wrote our own implementations of the social data interfaces, the
>   content fetch interface, and the gadget signer interface. This is all
>   pre-Guice patch.

You'll still need the custom gadget signer, sadly. Brian and I have been
trying to come up with an encryption scheme for this that will be
suitable for most deployments.

> For ContentFetch we retrieve the appXML from a local memcache instance,
> since we grab and parse the appXML in our developer console.

Are you preserving stale entries as well? We've found this extremely
valuable.

> For GadgetSigner we use our custom RSA key.

Any interest in helping to come up with a standard implementation for
this, so that key generation is all that's necessary? Unfortunately the
Google implementation isn't suitable because it uses some proprietary
security tools.

> For the SocialData API stuff we integrated the various calls with our
> existing User/Friends/Activity services. We spent a bit of time
> optimizing the People fetches to only convert users that fall into the
> resultset. The trick is to convert filters into appropriate backend
> calls to reduce the data set size, and then to intelligently select only
> the actual users to convert based on first/max/sort.
>
> We also converted a custom hi5 extension to use a new socialdata call
> for photo albums.
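On the serve-stale point above, the shape of what I mean is roughly this --
simplified stand-in interfaces, not the real RemoteContentFetcher API, and
a plain map where production would use memcached:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a caching fetcher that falls back to stale content when the
// origin fails. Interfaces are simplified stand-ins for illustration.
public class StaleServingFetcher {
  interface Fetcher {
    String fetch(String url); // throws RuntimeException on origin failure
  }

  private final Fetcher origin;
  // Shared-cache stand-in; in production this would be memcached.
  private final Map<String, String> cache = new HashMap<>();

  StaleServingFetcher(Fetcher origin) {
    this.origin = origin;
  }

  String fetch(String url) {
    try {
      String content = origin.fetch(url);
      cache.put(url, content);  // refresh the cached copy on success
      return content;
    } catch (RuntimeException e) {
      String stale = cache.get(url);
      if (stale != null) {
        return stale;  // stale spec beats failing the whole request
      }
      throw e;         // nothing cached; nothing to do but propagate
    }
  }
}
```

With something like this in front of gadget spec and message bundle
fetches, a flaky third-party host degrades to slightly stale content
instead of broken gadgets.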
> We also added our experimental 'presence' field to Person to show
> ONLINE/OFFLINE status.
>
> * The built-in Fetch memory cache is mostly useless (and should probably
>   be removed). We spent a couple of days chasing red herrings on why
>   apps would not get refreshed because of this.

Absolutely useless! See my JIRA ticket :)

> * The patches to add REFRESH_INTERVAL to gadgets.io.makeRequest() helped
>   our cachability a lot. It appears that more and more people are using
>   signed requests, which will really hurt cachability. We regenerate the
>   security token on every request; that might have to change....
>
> * Prefetch of data would go a long way towards making shindig scale
>   better. Right now requests for /gadgets/socialdata dominate, and for
>   profile pages with multiple apps you'll see the same data fetched
>   multiple times.

Louis Ryan is working on this. He already committed the initial Preload
stuff... we need Preload signing / oauth next.

> * The default five second timeout on proxy requests scares me. If one of
>   our major partners starts getting slow we'll stack up a lot of
>   threads...

So make it configurable -- a new parameter is easy to add with Guice :)

> You can see the hi5 modifications to shindig at
> http://www.hi5networks.com/platform/browser/shindig/

--
~Kevin
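P.S. To sketch what I mean about making the timeout configurable: something
along these lines, with the hard-coded five seconds demoted to a default.
The property name is made up for illustration; in Shindig proper you'd wire
the value through a Guice module (bindConstant() in your injection setup)
rather than a system property:

```java
// Sketch: make the proxy fetch timeout configurable instead of hard-coding
// five seconds. The property name is hypothetical; a real deployment would
// bind this value via Guice configuration.
public class ProxyConfig {
  static final int DEFAULT_TIMEOUT_MS = 5000;

  static int fetchTimeoutMs() {
    String configured = System.getProperty("shindig.proxy.fetch.timeout.ms");
    if (configured == null) {
      return DEFAULT_TIMEOUT_MS;
    }
    try {
      return Integer.parseInt(configured);
    } catch (NumberFormatException e) {
      return DEFAULT_TIMEOUT_MS;  // fall back rather than fail startup
    }
  }
}
```

Then anyone nervous about slow partners can just turn the knob down per
deployment instead of patching the source.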

