On 03/07/2011 12:35 PM, Ming Fu wrote:

> My name is Ming Fu. I have worked on Squid on and off since 2000.
> My current interest is to improve the performance of Squid 3. What I
> am thinking of doing is to move non-cacheable processing off the main
> Squid thread. My assumptions are the following:
>
> 1. A significant portion of replies from web servers are not cacheable.

Agreed.

> 2. Offloading non-cacheable processing from the main Squid thread can
> save some CPU load on the main Squid thread. This is similar to what
> already happens for disk write and unlink.

Not exactly. Removing CPU work from the main Squid thread is only helpful if there is a spare CPU core to which that work can be moved (and if moving/synchronization does not cost more than we gain from the added parallelism). Using multiple CPU cores with low synchronization overheads is what SMP Squid already does.

(Also keep in mind that direct disk access blocks the Squid process, often for a long time, wasting available CPU cycles. Non-cacheable processing does not do that. You can consider code execution as "CPU blocking", in which case the above reasoning about multiple cores still applies.)

> Two approaches I can think of:
>
> 1. Move the processing of non-cacheable replies to separate threads;
> these threads do not need to access the cache.
>
> 2. Push the work down to the kernel's socket layer: some kind of
> kernel filter that is able to associate two sockets and copy the
> incoming data from one socket to another. Squid establishes the
> association and provides the information the kernel filter needs to
> tell where a reply ends (chunked encoding or content-length). The
> kernel breaks the association when one reply is processed, and Squid
> regains control of the sockets.
>
> Option 2 could potentially be faster than option 1, but it will
> depend on the OS platform. I come from a BSD background, and I have
> some confidence that this will be possible on FreeBSD.

You are correct: non-cacheable responses currently suffer from some caching-related overheads.
Removing those overheads would help make Squid faster.

I do not think it is a good idea to move processing of non-cacheable responses to a different thread or process, because the problem is _not_ that non-cacheable responses are blocked on cacheable responses (there should be no blocking disk I/O in a performance-sensitive Squid worker, even if it caches). The problem is that non-cacheable responses have to go through some useless (for them) caching code. Removal of that useless processing is the right solution, IMO. Moving transactions to a different process or thread (beyond what SMP Squid already does) would just add overheads.

The primary obstacle to the optimization you want is Squid's assumption that all objects coming from the server side go through Store. This adds a lot of needless processing for non-cacheable responses, including multiple memory copies. IMO, a better design would be to make the server side capable of feeding responses to the client side directly. In other words, separate the "subscribe to receive response" interface from Store and have both Store and the server side implement that interface.

Moreover, we have already done pretty much the same thing for requests: the server-side code can receive requests from multiple sources (client-side, ICAP, eCAP) without really knowing where the request is coming from. I believe the same should be done for handling traffic in the opposite direction. Many pieces of the required interface are already implemented and can be reused. If you want to work on this, let's discuss specifics!

As for TCP splicing, sendfile(), and other low-level optimizations, they can happen on top of the streamlined processing outlined above. As Amos has already noted, those optimizations will need to be mindful of ACLs, adaptation, and other code that wants to retain some control over response handling, but not all environments have such code enabled.
Moreover, we may use the same low-level optimization for to-HTTP, to-ICAP, and from-ICAP traffic streams as well! The key is to have the single "message passing" interface mentioned above, so that you can insert low-level optimizations between any appropriate "sides" without duplicating optimization or side code.

Cheers,

Alex.
