On Tue, 24 Nov 2009 16:13:37 -0700, Alex Rousskov <[email protected]> wrote:
> On 11/20/2009 10:59 PM, Robert Collins wrote:
>> On Tue, 2009-11-17 at 08:45 -0700, Alex Rousskov wrote:
>>>>> Q1. What are the major areas or units of asynchronous code execution?
>>>>> Some of us may prefer large areas such as "http_port acceptor" or
>>>>> "cache" or "server side". Others may root for AsyncJob as the largest
>>>>> asynchronous unit of execution. These two approaches and their
>>>>> implications differ a lot. There may be other designs worth
>>>>> considering.
>
>> I'd like to let people start writing (and perf testing!) patches. To
>> unblock people. I think the primary questions are:
>> - do we permit multiple approaches inside the same code base. E.g.
>> OpenMP in some bits, pthreads / windows threads elsewhere, and 'job
>> queues' or some such abstraction elsewhere ?
>> (I vote yes, but with caution: someone trying something we don't
>> already do should keep it on a branch and really measure it well until
>> its got plenty of buy in).
>
> I vote for multiple approaches at lower levels of the architecture and
> against multiple approaches at highest level of the architecture. My Q1
> was only about the highest levels, BTW.
>
> For example, I do not think it is a good idea to allow a combination of
> OpenMP, ACE, and something else as a top-level design. Understanding,
> supporting, and tuning such a mix would be a nightmare, IMO.
>
> On the other hand, using threads within some disk storage schemes while
> using processes for things like "cache" may make a lot of sense, and we
> already have examples of some of that working.
>
OpenMP seems to get an almost unanimous negative from the people who know it.

> This is why I believe that the decision of processes versus threads *at
> the highest level* of the architecture is so important. Yes, we are,
> can, and will use threads at lower levels. There is no argument there.
> The question is whether we can also use threads to split Squid into
> several instances of "major areas" like client side(s), cache(s), and
> server side(s).
>
> See Henrik's email on why it is difficult to use threads at highest
> levels. I am not convinced yet, but I do see Henrik's point, and I
> consider the dangers he cites critical for the right Q1 answer.
>
>
>> - If we do *not* permit multiple approaches, then what approach do we
>> want for parallelisation. E.g. a number of long lived threads that take
>> on work, or many transient threads as particular bits of the code need
>> threads. I favour the former (long lived 'worker' threads).
>
> For highest-level models, I do not think that "one job per
> thread/process", "one call per thread/process", or any other "one little
> short-lived something per thread/process" is a good idea. I do believe
> we have to parallelize "major areas", and I think we should support
> multiple instances of some of those "areas" (e.g., multiple client
> sides). Each "major area" would be long-lived process/thread, of course.

Agreed, mostly. As Rob points out, the idea is for one smallish pathway of
the code to be run N times, with different state data each time, by a
single thread.

Sachin's initial AcceptFD thread proposal would perhaps be the exemplar for
this type of thread: one thread does the comm layer, from accept() through
to the scheduling call that hands off to handlers outside comm, then goes
back for the next accept(). The only performance issue brought up (by you)
was that this particular case might flood the slower main process if done
first. Not all code can be done this way. The overheads are simply moving
the state data in/out of the thread.
IMO starting/stopping threads too often is a fairly bad idea. Most events
will end up being grouped together into types (perhaps categorized by
component, perhaps by client request, perhaps by pathway) with a small
thread dedicated to handling that type of call.

> Again for higher-level models, I am also skeptical that it is a good
> idea to just split Squid into N mostly non-cooperating nearly identical
> instances. It may be the right first step, but I would like to offer
> more than that in terms of overall performance and tunability.

The answer to that is: of all the SMP models we have theorized, that one is
the only proven model so far. Administrators are already doing it on quad+
core machines, with all the instance management handled manually, and with
a lot of performance success. In last night's discussion on IRC we covered
what issues are outstanding in making this automatic; all are resolvable
except the cache index, which is not easily shareable between instances.

> I hope the above explains why I consider Q1 critical for the meant
> "highest level" scope and why "we already use processes and threads" is
> certainly true but irrelevant within that scope.
>
>
> Thank you,
>
> Alex.

Thank you for clarifying that. I now think we are all more or less headed
in the same direction(s), with three models proposed for the overall
architecture. In the order they were brought up...

(NP: the TODO only applies if we work towards that goal)

MODEL:
 * fully threaded. some helper child processes
PROS:
  smaller memory resource footprint when running.
CONS:
  potentially larger CPU footprint swapping data between threads.
  potential problems making threaded paths too small vs the overheads.
TODO:
  continue polishing the code into distinct calls
  determine thread-safe code
  determine shared data and add appropriate locking
  make the above segments into threads.
  add some way to pass events/calls to existing long-term threads,
    either ... a super-lock as described by Henrik, or ...
    a 2-queue alternative as described by Amos

MODEL:
 * process chunks with sub-threads and sometimes helper child processes
PROS:
  it's known to be very fast, but not amazingly so.
  (ref: postfix) (ref: squid helpers)
CONS:
  current code uses a LOT of data sharing between components, particularly
  of small 1-32 byte chunks of random data (config flags, stats, shared
  cache data snippets).
  identifying distinct chunks is a big, time-consuming issue.
TODO:
  identify the major process chunks and split them out from the main binary
  add efficient ways to pass data cleanly between processes (at capacity).
  copy relevant external shared data into the state objects to pass along
    with the request data
  plus all the same TODO from the fully-threaded model, for the sub-threads
    within each process.

MODEL:
 * separate instances with sub-threads and helper child processes
PROS:
  we can almost do the macro change today. (sub-threads later)
  it can scale the base app speed up by a reasonable percentage
  (ref: apache2)
CONS:
  duplication of data, particularly in the storage, is very wasteful of
  resources.
  NP: apache evades this with effectively read-only disk data; all the
  dynamics are in the instance memory.
TODO:
  the -I option needs porting so the master can open main ports and
    children share the listening.
  finish the logging TCP module ideas (for reliable shared logging).
  some code to make the master process handle multiple children.
  some alterations to safely handle the shared config file settings
    (cache_dir etc).

MODEL:
 * status quo, where we continue to work on all the above TODOs as time
   permits and needs require. wait and see which model gets finished first.
PROS:
  the way forward is already well known.
CONS:
  it is not reaching multi-CPU usage fast enough.

The easiest way forward seems to be toward separate instances, with
finer-grained threading and/or process chunking being done later, after
deeper analysis, for extra gains at each change.
This makes me think that we are not in fact proposing competing models, but
simply looking at different levels of the code. Each approach which has
come up may best be used at a different level: upper (instances), middle
(processes, threads, jobs), and low (signals, events, cbdata, async calls).

It also seems to me that the top-level instances choice is the most easily
reversed if it's found to actually be a bad idea. The major support change
is in the parent main() code setting up for several child instances, with
possibilities there for configuring it on/off, or how many instances.

Amos
