Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Sam Horrocks

 > You know, I had a brief look through some of the SpeedyCGI code yesterday,
 > and I think the MRU process selection might be a bit of a red herring.
 > I think the real reason Speedy won the memory test is the way it spawns
 > processes.

 Please take a look at that code again.  There's no smoke and mirrors,
 no red herrings.  Also, I don't look at the benchmarks as "winning" - I
 am not trying to start a mod_perl vs speedy battle here.  Gunther wanted
 to know if there were "real benchmarks", so I reluctantly put them up.

 Here's how SpeedyCGI works (this is from version 2.02 of the code):

When the frontend starts, it tries to quickly grab a backend from
the front of the be_wait queue, which is a LIFO.  This is in
speedy_frontend.c, get_a_backend() function.

If there aren't any idle be's, it puts itself onto the fe_wait queue.
Same file, get_a_backend_hard().

If this fe (frontend) is at the front of the fe_wait queue, it
"takes charge" and starts looking to see if a backend needs to be
spawned.  This is part of the "frontend_ping()" function.  It will
only spawn a be if no other backends are being spawned, so only
one backend gets spawned at a time.

Every frontend in the queue drops into a sigsuspend and waits for an
alarm signal.  The alarm is set for 1 second.  This is also in
get_a_backend_hard().

When a backend is ready to handle code, it goes and looks at the fe_wait
queue and if there are fe's there, it sends a SIGALRM to the one at
the front, and sets the sent_sig flag for that fe.  This is done in
speedy_group.c, speedy_group_sendsigs().

When a frontend wakes on an alarm (either due to a timeout, or due to
a be waking it up), it looks at its sent_sig flag to see if it can now
grab a be from the queue.  If so it does that.  If not, it runs various
checks then goes back to sleep.

 In most cases, you should get a be from the LIFO right at the beginning,
 in the get_a_backend() function - unless there aren't enough be's running,
 or something is killing them (bad perl code), or you've set the
 MaxBackends option to limit the number of be's.


 > If I understand what's going on in Apache's source, once every second it
 > has a look at the scoreboard and says "less than MinSpareServers are
 > idle, so I'll start more" or "more than MaxSpareServers are idle, so
 > I'll kill one".  It only kills one per second.  It starts by spawning
 > one, but the number spawned goes up exponentially each time it sees
 > there are still not enough idle servers, until it hits 32 per second. 
 > It's easy to see how this could result in spawning too many in response
 > to sudden load, and then taking a long time to clear out the unnecessary
 > ones.
 > 
 > In contrast, Speedy checks on every request to see if there are enough
 > backends running.  If there aren't, it spawns more until there are as
 > many backends as queued requests.
 
 Speedy does not check on every request to see if there are enough
 backends running.  In most cases, the only thing the frontend does is
 grab an idle backend from the lifo.  Only if there are none available
 does it start to worry about how many are running, etc.

 > That means it never overshoots the mark.

 You're correct that speedy does try not to overshoot, but mainly
 because there's no point in overshooting - it just wastes swap space.
 But that's not the heart of the mechanism.  There truly is a LIFO
 involved.  Please read that code again, or run some tests.  Speedy
 could overshoot by far, and the worst that would happen is that you
 would get a lot of idle backends sitting in virtual memory, which the
 kernel would page out, and then at some point they'll time out and die.
 Unless of course the load increases to a point where they're needed,
 in which case they would get used.

 If you have speedy installed, you can manually start backends yourself
 and test.  Just run "speedy_backend script.pl &" to start a backend.
 If you start lots of those on a script that says 'print "$$\n"', then
 run the frontend on the same script, you will still see the same pid
 over and over.  This is the LIFO in action, reusing the same process
 over and over.

 > Going back to your example up above, if Apache actually controlled the
 > number of processes tightly enough to prevent building up idle servers,
 > it wouldn't really matter much how processes were selected.  If after
 > the 1st and 2nd interpreters finish their run they went to the end of
 > the queue instead of the beginning of it, that simply means they will
 > sit idle until called for instead of some other two processes sitting
 > idle until called for.  If the systems were both efficient enough about
 > spawning to only create as many interpreters as needed, none of them
 > would be sitting idle and memory usage would always be as low as
 > possible.
 > 
 > I don't know if I'm explaining this very well, but the gist of my theory
 > is that at any given time both systems will require an equal number of
 > in-use interpreters to do an equal amount of work, and the
 > differentiator between the two is Apache's relatively poor estimate of
 > how many processes should be available at any given time.

RE: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Matt Sergeant

There seems to be a lot of talk here, and analogies, and zero real-world
benchmarking.

Now it seems to me from reading this thread, that speedycgi would be
better where you run 1 script, or only a few scripts, and mod_perl might
win where you have a large application with hundreds of different URLs
with different code being executed on each. That may change with the next
release of speedy, but then lots of things will change with the next major
release of mod_perl too, so it's irrelevant until both are released.

And as well as that, speedy still suffers (IMHO) in that it still follows
the CGI scripting model, whereas mod_perl offers a much more flexible
environment and feature-rich API (the Apache API). What's more, I could
never build something like AxKit in speedycgi without resorting to hacks
like mod_rewrite to hide nasty URLs. At least that's my conclusion from
first appearances.

Either way, both solutions have their merits. Neither is going to totally
replace the other.

What I'd really like to do though is sum up this thread in a short article
for take23. I'll see if I have time on Sunday to do it.

-- 


Director and CTO, AxKit.com Ltd
XML Application Serving: http://axkit.org
XSLT, XPathScript, XSP
Personal Web Site: http://sergeant.org/




RE: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Stephen Anderson



>  > This doesn't affect the argument, because the core of it is that:
>  > 
>  > a) the CPU will not completely process a single task all at once;
>  > instead, it will divide its time _between_ the tasks
>  > b) tasks do not arrive at regular intervals
>  > c) tasks take varying amounts of time to complete
>  > 
[snip]

>  I won't agree with (a) unless you qualify it further - what do you
>  claim is the method or policy for (a)?

I think this has been answered ... basically, resource conflicts (including
I/O), interrupts, long running tasks, higher priority tasks, and, of course,
the process yielding, can all cause the CPU to switch processes (which of
these qualify depends very much on the OS in question).

This is why, despite the efficiency of single-task running, you can usefully
run more than one process on a UNIX system. Otherwise, if you ran a single
Apache process and had no traffic, you couldn't run a shell at the same time
- Apache would consume practically all your CPU in its select() loop 8-)

>  Apache httpd's are scheduled on an LRU basis.  This was discussed early
>  in this thread.  Apache uses a file-lock for its mutex around the
>  accept call, and file-locking is implemented in the kernel using a
>  round-robin (fair) selection in order to prevent starvation.  This
>  results in incoming requests being assigned to httpd's in an LRU
>  fashion.

I'll apologise, and say, yes, of course you're right, but I do have a query:

There are (IIRC) five methods that Apache uses to serialize requests:
fcntl(), flock(), Sys V semaphores, uslock (IRIX only) and Pthreads
(reliably only on Solaris). Do they _all_ result in LRU?

>  Remember that the httpd's in the speedycgi case will have very little
>  un-shared memory, because they don't have perl interpreters in them.
>  So the processes are fairly indistinguishable, and the LRU isn't as 
>  big a penalty in that case.


Yes, _but_: interpreter for interpreter, won't the equivalent speedycgi
setup have roughly as much unshared memory as the mod_perl one? I've had
a lot of (dumb) discussions with people who complain about the size of
Apache+mod_perl without realising that the interpreter code's all shared,
and with pre-loading a lot of the perl code can be too. While I _can_ see
speedycgi having an advantage (because it's got a much better overview of
what's happening, and can intelligently manage the situation), I don't
think it's as large as you're suggesting. I think this needs to be
intensively benchmarked to answer that.

>  other interpreters, and you expand the number of interpreters in use.
>  But still, you'll wind up using the smallest number of interpreters
>  required for the given load and timeslice.  As soon as those 1st and
>  2nd perl interpreters finish their run, they go back at the beginning
>  of the queue, and the 7th/8th or later requests can then use them, etc.
>  Now you have a pool of maybe four interpreters, all being used on an
>  MRU basis.  But it won't expand beyond that set unless your load goes
>  up or your program's CPU time requirements increase beyond another
>  timeslice.  MRU will ensure that whatever the number of interpreters
>  in use, it is the lowest possible, given the load, the CPU-time
>  required by the program and the size of the timeslice.

Yep...no arguments here. SpeedyCGI should result in fewer interpreters.


I will say that there are a lot of convincing reasons to follow the
SpeedyCGI model rather than the mod_perl model, but I've generally
thought that the performance increase that can be obtained is
sufficiently minimal as to not warrant the extra layer... thoughts,
anyone?

Stephen.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Sam Horrocks

 > >  There's only one run queue in the kernel.  The first task ready to
 > >  run is put at the head of that queue, and anything arriving
 > >  afterwards waits.  Only if that first task blocks on a resource or
 > >  takes a very long time, or a higher priority process becomes able
 > >  to run due to an interrupt, is that process taken out of the queue.
 > 
 > Note that any I/O request that isn't completely handled by buffers will
 > trigger the 'blocks on a resource' clause above, which means that
 > jobs doing any real work will complete in an order determined by
 > something other than the cpu and not strictly serialized.  Also, most
 > of my web servers are dual-cpu so even cpu bound processes may
 > complete out of order.

 I think it's much easier to visualize how MRU helps when you look at one
 thing running at a time.  And MRU works best when every process runs
 to completion instead of blocking, etc.  But even if the process gets
 timesliced, blocked, etc, MRU still degrades gracefully.  You'll get
 more processes in use, but still the numbers will remain small.

 > >  > Similarly, because of the non-deterministic nature of computer
 > >  > systems, Apache doesn't service requests on an LRU basis; you're
 > >  > comparing SpeedyCGI against a straw man. Apache's servicing
 > >  > algorithm approaches randomness, so you need to build a comparison
 > >  > between forced-MRU and random choice.
 > >
 > >  Apache httpd's are scheduled on an LRU basis.  This was discussed early
 > >  in this thread.  Apache uses a file-lock for its mutex around the accept
 > >  call, and file-locking is implemented in the kernel using a round-robin
 > >  (fair) selection in order to prevent starvation.  This results in
 > >  incoming requests being assigned to httpd's in an LRU fashion.
 > 
 > But, if you are running a front/back end apache with a small number
 > of spare servers configured on the back end there really won't be
 > any idle perl processes during the busy times you care about.  That
 > is, the  backends will all be running or apache will shut them down
 > and there won't be any difference between MRU and LRU (the
 > difference would be which idle process waits longer - if none are
 > idle there is no difference).

 If you can tune it just right so you never run out of ram, then I think
 you could get the same performance as MRU on something like hello-world.

 > >  Once the httpd's get into the kernel's run queue, they finish in the
 > >  same order they were put there, unless they block on a resource, get
 > >  timesliced or are pre-empted by a higher priority process.
 > 
 > Which means they don't finish in the same order if (a) you have
 > more than one cpu, (b) they do any I/O (including delivering the
 > output back which they all do), or (c) some of them run long enough
 > to consume a timeslice.
 > 
 > >  Try it and see.  I'm sure you'll run more processes with speedycgi, but
 > >  you'll probably run a whole lot fewer perl interpreters and need less ram.
 > 
 > Do you have a benchmark that does some real work (at least a dbm
 > lookup) to compare against a front/back end mod_perl setup?

 No, but if you send me one, I'll run it.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Perrin Harkins

Sam Horrocks wrote:
>  say they take two slices, and interpreters 1 and 2 get pre-empted and
>  go back into the queue.  So then requests 5/6 in the queue have to use
>  other interpreters, and you expand the number of interpreters in use.
>  But still, you'll wind up using the smallest number of interpreters
>  required for the given load and timeslice.  As soon as those 1st and
>  2nd perl interpreters finish their run, they go back at the beginning
>  of the queue, and the 7th/8th or later requests can then use them, etc.
>  Now you have a pool of maybe four interpreters, all being used on an MRU
>  basis.  But it won't expand beyond that set unless your load goes up or
>  your program's CPU time requirements increase beyond another timeslice.
>  MRU will ensure that whatever the number of interpreters in use, it
>  is the lowest possible, given the load, the CPU-time required by the
>  program and the size of the timeslice.

You know, I had a brief look through some of the SpeedyCGI code yesterday,
and I think the MRU process selection might be a bit of a red herring.
I think the real reason Speedy won the memory test is the way it spawns
processes.

If I understand what's going on in Apache's source, once every second it
has a look at the scoreboard and says "less than MinSpareServers are
idle, so I'll start more" or "more than MaxSpareServers are idle, so
I'll kill one".  It only kills one per second.  It starts by spawning
one, but the number spawned goes up exponentially each time it sees
there are still not enough idle servers, until it hits 32 per second. 
It's easy to see how this could result in spawning too many in response
to sudden load, and then taking a long time to clear out the unnecessary
ones.

In contrast, Speedy checks on every request to see if there are enough
backends running.  If there aren't, it spawns more until there are as
many backends as queued requests.  That means it never overshoots the
mark.

Going back to your example up above, if Apache actually controlled the
number of processes tightly enough to prevent building up idle servers,
it wouldn't really matter much how processes were selected.  If after
the 1st and 2nd interpreters finish their run they went to the end of
the queue instead of the beginning of it, that simply means they will
sit idle until called for instead of some other two processes sitting
idle until called for.  If the systems were both efficient enough about
spawning to only create as many interpreters as needed, none of them
would be sitting idle and memory usage would always be as low as
possible.

I don't know if I'm explaining this very well, but the gist of my theory
is that at any given time both systems will require an equal number of
in-use interpreters to do an equal amount of work, and the differentiator
between the two is Apache's relatively poor estimate of how many
processes should be available at any given time.  I think this theory
matches up nicely with the results of Sam's tests: when MaxClients
prevents Apache from spawning too many processes, both systems have
similar performance characteristics.

There are some knobs to twiddle in Apache's source if anyone is
interested in playing with it.  You can change the frequency of the
checks and the maximum number of servers spawned per check.  I don't
have much motivation to do this investigation myself, since I've already
tuned our MaxClients and process size constraints to prevent problems
with our application.

- Perrin



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-18 Thread Les Mikesell


- Original Message -
From: "Sam Horrocks" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "mod_perl list" <[EMAIL PROTECTED]>; "Stephen Anderson"
<[EMAIL PROTECTED]>
Sent: Thursday, January 18, 2001 10:38 PM
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory


>  There's only one run queue in the kernel.  The first task ready to run
>  is put at the head of that queue, and anything arriving afterwards
>  waits.  Only if that first task blocks on a resource or takes a very
>  long time, or a higher priority process becomes able to run due to an
>  interrupt, is that process taken out of the queue.

Note that any I/O request that isn't completely handled by buffers will
trigger the 'blocks on a resource' clause above, which means that
jobs doing any real work will complete in an order determined by
something other than the cpu and not strictly serialized.  Also, most
of my web servers are dual-cpu so even cpu bound processes may
complete out of order.

>  > Similarly, because of the non-deterministic nature of computer systems,
>  > Apache doesn't service requests on an LRU basis; you're comparing
>  > SpeedyCGI against a straw man. Apache's servicing algorithm approaches
>  > randomness, so you need to build a comparison between forced-MRU and
>  > random choice.
>
>  Apache httpd's are scheduled on an LRU basis.  This was discussed early
>  in this thread.  Apache uses a file-lock for its mutex around the accept
>  call, and file-locking is implemented in the kernel using a round-robin
>  (fair) selection in order to prevent starvation.  This results in
>  incoming requests being assigned to httpd's in an LRU fashion.

But, if you are running a front/back end apache with a small number
of spare servers configured on the back end there really won't be
any idle perl processes during the busy times you care about.  That
is, the  backends will all be running or apache will shut them down
and there won't be any difference between MRU and LRU (the
difference would be which idle process waits longer - if none are
idle there is no difference).

>  Once the httpd's get into the kernel's run queue, they finish in the
>  same order they were put there, unless they block on a resource, get
>  timesliced or are pre-empted by a higher priority process.

Which means they don't finish in the same order if (a) you have
more than one cpu, (b) they do any I/O (including delivering the
output back which they all do), or (c) some of them run long enough
to consume a timeslice.

>  Try it and see.  I'm sure you'll run more processes with speedycgi, but
>  you'll probably run a whole lot fewer perl interpreters and need less ram.

Do you have a benchmark that does some real work (at least a dbm
lookup) to compare against a front/back end mod_perl setup?

>  Remember that the httpd's in the speedycgi case will have very little
>  un-shared memory, because they don't have perl interpreters in them.
>  So the processes are fairly indistinguishable, and the LRU isn't as
>  big a penalty in that case.
>
>  This is why the original designers of Apache thought it was safe to
>  create so many httpd's.  If they all have the same (shared) memory,
>  then creating a lot of them does not have much of a penalty.  mod_perl
>  applications throw a big monkey wrench into this design when they add
>  a lot of unshared memory to the httpd's.

This is part of the reason the front/back end  mod_perl configuration
works well, keeping the backend numbers low.  The real win when serving
over the internet, though, is that the perl memory is no longer tied
up while delivering the output back over frequently slow connections.

   Les Mikesell
   [EMAIL PROTECTED]





Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withsc ripts that contain un-shared memory

2001-01-18 Thread Sam Horrocks

 > This doesn't affect the argument, because the core of it is that:
 > 
 > a) the CPU will not completely process a single task all at once; instead,
 > it will divide its time _between_ the tasks
 > b) tasks do not arrive at regular intervals
 > c) tasks take varying amounts of time to complete
 > 
 > Now, if (a) were true but (b) and (c) were not, then, yes, it would have the
 > same effective result as sequential processing. Tasks that arrived first
 > would finish first. In the real world however, (b) and (c) are usually true,
 > and it becomes practically impossible to predict which task handler (in this
 > case, a mod_perl process) will complete first.

 I'll agree with (b) and (c) - I ignored them to keep my analogy as simple
 as possible.  Again, the goal of my analogy was to show that a stream of
 10 concurrent requests can be handled with the same throughput by a lot
 fewer than 10 perl interpreters.  (b) and (c) don't really have an effect
 on that - they don't control the order in which processes arrive and get
 queued up for the CPU.

 I won't agree with (a) unless you qualify it further - what do you claim
 is the method or policy for (a)?

 There's only one run queue in the kernel.  The first task ready to run is
 put at the head of that queue, and anything arriving afterwards waits.
 Only if that first task blocks on a resource or takes a very long time,
 or a higher priority process becomes able to run due to an interrupt, is
 that process taken out of the queue.

 It is inefficient for the unix kernel to be constantly switching
 very quickly from process to process, because it takes time to do
 context switches.  Also, unless the processes share the same memory,
 some amount of the processor cache can get flushed when you switch
 processes because you're changing to a different set of memory pages.
 That's why it's best for overall throughput if the kernel keeps a single
 process running as long as it can.

 > Similarly, because of the non-deterministic nature of computer systems,
 > Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
 > against a straw man. Apache's servicing algorithm approaches randomness, so
 > you need to build a comparison between forced-MRU and random choice.

 Apache httpd's are scheduled on an LRU basis.  This was discussed early
 in this thread.  Apache uses a file-lock for its mutex around the accept
 call, and file-locking is implemented in the kernel using a round-robin
 (fair) selection in order to prevent starvation.  This results in
 incoming requests being assigned to httpd's in an LRU fashion.

 Once the httpd's get into the kernel's run queue, they finish in the
 same order they were put there, unless they block on a resource, get
 timesliced or are pre-empted by a higher priority process.

 > Thinking about it, assuming you are, at some time, servicing requests
 > _below_ system capacity, SpeedyCGI will always win in memory usage, and
 > probably have an edge in handling response time. My concern would be, does
 > it offer _enough_ of an edge? Especially bearing in mind, if I understand,
 > you could end up running anywhere up to 2x as many processes (n Apache
 > handlers + n script handlers)?

 Try it and see.  I'm sure you'll run more processes with speedycgi, but
 you'll probably run a whole lot fewer perl interpreters and need less ram.
 
 Remember that the httpd's in the speedycgi case will have very little
 un-shared memory, because they don't have perl interpreters in them.
 So the processes are fairly indistinguishable, and the LRU isn't as 
 big a penalty in that case.

 This is why the original designers of Apache thought it was safe to
 create so many httpd's.  If they all have the same (shared) memory,
 then creating a lot of them does not have much of a penalty.  mod_perl
 applications throw a big monkey wrench into this design when they add
 a lot of unshared memory to the httpd's.

 > > No, homogeneity (or the lack of it) wouldn't make a difference.
 > > Those 3rd, 5th or 6th processes run only *after* the 1st and 2nd
 > > have finished using the CPU.  And at that point you could re-use
 > > those interpreters that 1 and 2 were using.
 > 
 > This, if you'll excuse me, is quite clearly wrong. See the above argument,
 > and imagine that tasks 1 and 2 happen to take three times as long to
 > complete as task 3, and you should see that they could all end up being
 > in the scheduling queue together. Perhaps you're considering tasks which
 > are too small to take more than 1 or 2 timeslices, in which case, you're
 > much less likely to want to accelerate them.

 So far, to keep things fairly simple, I've assumed each request takes
 less than one timeslice to run.  A timeslice is fairly long on a Linux
 PC (210ms).

 But say they take two slices, and interpreters 1 and 2 get pre-empted and
 go back into the queue.  So then requests 5/6 in the queue have to use
 other interpreters, and you expand the number of interpreters in use.