Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-19 Thread Perrin Harkins

On Fri, 19 Jan 2001, Sam Horrocks wrote:
   You know, I had brief look through some of the SpeedyCGI code yesterday,
   and I think the MRU process selection might be a bit of a red herring. 
   I think the real reason Speedy won the memory test is the way it spawns
   processes.
 
  Please take a look at that code again.  There's no smoke and mirrors,
  no red-herrings.

I didn't mean that MRU isn't really happening, just that it isn't the
reason why Speedy is running fewer interpreters.

  Also, I don't look at the benchmarks as "winning" - I
  am not trying to start a mod_perl vs speedy battle here.

Okay, but let's not be so polite that we fail to acknowledge when someone
is onto a better way of doing things.  Stealing good ideas from other
projects is a time-honored open source tradition.

  Speedy does not check on every request to see if there are enough
  backends running.  In most cases, the only thing the frontend does is
  grab an idle backend from the lifo.  Only if there are none available
  does it start to worry about how many are running, etc.

Sorry, I had a lot of the details wrong about what Speedy is doing.
However, it still sounds like it has a more efficient approach than
Apache in terms of managing process spawning.

  You're correct that speedy does try not to overshoot, but mainly
  because there's no point in overshooting - it just wastes swap space.
  But that's not the heart of the mechanism.  There truly is a LIFO
  involved.  Please read that code again, or run some tests.  Speedy
  could overshoot by far, and the worst that would happen is that you
  would get a lot of idle backends sitting in virtual memory, which the
  kernel would page out, and then at some point they'll time out and die.

When you spawn a new process it starts out in real memory, doesn't
it?  Spawning too many could use up all the physical RAM and send a box
into swap, at least until it managed to page out the idle
processes.  That's what I think happened to mod_perl in this test.

  If you start lots of those on a script that says 'print "$$\n"', then
  run the frontend on the same script, you will still see the same pid
  over and over.  This is the LIFO in action, reusing the same process
  over and over.

Right, but I don't think that explains why fewer processes are running.  
Suppose you start 10 processes, and then send in one request at a time,
and that request takes one time slice to complete.  If MRU works
perfectly, you'll get process 1 over and over again handling the requests.  
LRU will use process 1, then 2, then 3, etc.  But both of them have 9
processes idle and one in use at any given time.  The 9 idle ones should
either be killed off, or ideally never have been spawned in the first
place.  I think Speedy does a better job of preventing unnecessary process
spawning.

One alternative theory is that keeping the same process busy instead of
rotating through all 10 means that the OS can page out the other 9 and
thus use less physical RAM.
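To make the 10-process example concrete, here is a toy illustration (plain
Perl, not either server's actual code) of which worker each policy hands a
request to when requests arrive one at a time:

    my @mru = (1 .. 10);    # idle stack: a freed worker goes back on top
    my @lru = (1 .. 10);    # idle queue: a freed worker goes to the back
    for my $req (1 .. 5) {
        my $m = shift @mru; unshift @mru, $m;   # MRU: worker 1 every time
        my $l = shift @lru; push @lru, $l;      # LRU: worker 1, 2, 3, ...
        print "request $req: MRU used worker $m, LRU used worker $l\n";
    }
    # Either way 9 of the 10 workers sit idle at any instant; the policies only
    # differ in which workers go cold (and so can be paged out or reaped).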

Anyway, I feel like we've been putting you on the spot, and I don't want
you to feel obligated to respond personally to all the messages on this
thread.  I'm only still talking about it because it's interesting and I've
learned a couple of things about Linux and Apache from it.  If I get the
chance this weekend, I'll try some tests of my own.

- Perrin




Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks

I think the major problem is that you're assuming that just because
there are 10 constant concurrent requests, that there have to be 10
perl processes serving those requests at all times in order to get
maximum throughput.  The problem with that assumption is that there
is only one CPU - ten processes cannot all run simultaneously anyways,
so you don't really need ten perl interpreters.

I've been trying to think of better ways to explain this.  I'll try to
explain with an analogy - it's sort-of lame, but maybe it'll give you
a mental picture of what's happening.  To eliminate some confusion,
this analogy doesn't address LRU/MRU, nor waiting on other events like
network or disk i/o.  It only tries to explain why you don't necessarily
need 10 perl-interpreters to handle a stream of 10 concurrent requests
on a single-CPU system.

You own a fast-food restaurant.  The players involved are:

Your customers.  These represent the http requests.

Your cashiers.  These represent the perl interpreters.

Your cook.  You only have one.  This represents your CPU.

The normal flow of events is this:

A cashier gets an order from a customer.  The cashier goes and
waits until the cook is free, and then gives the order to the cook.
The cook then cooks the meal, taking 5-minutes for each meal.
The cashier waits for the meal to be ready, then takes the meal and
gives it to the customer.  The cashier then serves another customer.
The cashier/customer interaction takes a very small amount of time.

The analogy is this:

An http request (customer) arrives.  It is given to a perl
interpreter (cashier).  A perl interpreter must wait for all other
perl interpreters ahead of it to finish using the CPU (the cook).
It can't serve any other requests until it finishes this one.
When its turn arrives, the perl interpreter uses the CPU to process
the perl code.  It then finishes and gives the results over to the
http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store.
At each 5-minute interval thereafter another customer arrives.  So at time
0, there is a pool of 10 customers.  At time +5, another customer arrives.
At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load.  What would
happen is that the 10 cashiers would fairly quickly get all the orders
from the first 10 customers simultaneously, and then start waiting for
the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
first order.  Cashiers 2-10 would wait their turn.  After 5-minutes,
cashier number 1 would receive the meal, deliver it to customer #1, and
then serve the next customer (#11) that just arrived at the 5-minute mark.
Cashier #1 would take customer #11's order, then queue up and wait in
line for the cook - there will be 9 other cashiers already in line, so
the wait will be long.  At the 10-minute mark, cashier #2 would receive
a meal from the cook, deliver it to customer #2, then go on and serve
the next customer (#12) that just arrived.  Cashier #2 would then go and
wait in line for the cook.  This continues on through all the cashiers
in order 1-10, then repeating, 1-10, ad infinitum.

Now even though you have 10 cashiers, most of their time is spent
waiting to put in an order to the cook.  Starting with customer #11,
all customers will wait 50-minutes for their meal.  When customer #11
comes in he/she will immediately get to place an order, but it will take
the cashier 45-minutes to wait for the cook to become free, and another
5-minutes for the meal to be cooked.  Same is true for customer #12,
and all customers from then on.

Now, the question is, could you get the same throughput with fewer
cashiers?  Say you had 2 cashiers instead.  The 10 customers are
there waiting.  The 2 cashiers take orders from customers #1 and #2.
Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
in line for the cook behind cashier #1.  At the 5-minute mark, the first
meal is done.  Cashier #1 delivers the meal to customer #1, then serves
customer #3.  Cashier #1 then goes and stands in line behind cashier #2.
At the 10-minute mark, cashier #2's meal is ready - it's delivered to
customer #2 and then customer #4 is served.  This continues on with the
cashiers trading off between serving customers.

Does the scenario with two cashiers go any more slowly than the one with
10 cashiers?  No.  When the 11th customer arrives at the 5-minute mark,
what he/she sees is that customer #3 is just now putting in an order.
There are 7 other people there waiting to put in orders.  Customer #11 will
wait 40 minutes until he/she puts in an order, then wait another 10 minutes
for the meal to arrive.  Same is true for customer #12, and all others arriving
thereafter.
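Written out as arithmetic (5 minutes per meal in both cases; the figures below
simply restate the two scenarios above):

    my $meal = 5;                                  # minutes per meal
    my $wait_10_cashiers = 9 * $meal + $meal;      # #11 orders at once, but 9 meals are
                                                   # queued ahead of his: 45 + 5 = 50
    my $wait_2_cashiers  = 8 * $meal + 2 * $meal;  # 40 min to reach a cashier, then
                                                   # 10 min for the meals in flight: 50
    print "10 cashiers: $wait_10_cashiers min;  2 cashiers: $wait_2_cashiers min\n";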

The only difference between the two scenarios is the number of cashiers,
and where the waiting is taking place.  In the first scenario, 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-17 Thread Gunther Birznieks

I guess as I get older I start to slip technically. :) This helps me a bit, 
but it doesn't really help me understand the final argument (that MRU is 
still going to help on a fully loaded system).

With some modification, I guess I am thinking that the cook is really the 
OS and the CPU is really the oven. But the hamburgers on an Intel oven have 
to be timesliced instead of left to cook and then after it's done the next 
hamburger is put on.

So if we think of meals as Perl requests, the reality is that not all meals 
take the same amount of time to cook. A quarter pounder surely takes longer 
than your typical paper thin McDonald's Patty.

The fact that a customer requests a meal that takes longer to cook than 
another one is relatively random. In fact in the real world, it is likely 
to be random. This means that it's possible for all 10 meals to be cooking 
but the 3rd meal gets done really fast, so another customer gets time 
sliced to use the oven for their meal -- which might be a long meal.

In your testing, perhaps the problem is that you are benchmarking with a 
homogeneous process. So of course you are seeing this behavior that makes 
it look like serializing 10 connections is just the same wait as time 
slicing them and therefore an MRU algorithm works better (of course it 
works better, because you keep releasing the systems in order)...

But in the world where the 3rd or 5th or 6th process may finish sooner and 
release sooner than others, then an MRU algorithm doesn't matter. And 
actually a process that finishes in 10 seconds shouldn't have to wait until 
a process that takes 30 seconds to complete has finished.

And with all 10 interpreters in use at the same time, serving all requests 
and randomly popping off the queue and starting again, no MRU or LRU 
algorithm will really help. It's all the same.

Anyway, maybe I am still not really getting it. Even with the fast food 
analogy. Maybe it is time to throw in the network time and other variables 
that seemed to make a difference in Perrin's understanding of how you were 
approaching the explanation.

I am now curious -- on a fully loaded system of max 10 processes, did you 
see that SpeedyCGI scaled better than mod_perl on your benchmarks? Or are 
we still just speculating?

At 03:19 AM 1/17/01 -0800, Sam Horrocks wrote:
I think the major problem is that you're assuming that just because
there are 10 constant concurrent requests, that there have to be 10
perl processes serving those requests at all times in order to get
maximum throughput.  The problem with that assumption is that there
is only one CPU - ten processes cannot all run simultaneously anyways,
so you don't really need ten perl interpreters.

I've been trying to think of better ways to explain this.  I'll try to
explain with an analogy - it's sort-of lame, but maybe it'll give you
a mental picture of what's happening.  To eliminate some confusion,
this analogy doesn't address LRU/MRU, nor waiting on other events like
network or disk i/o.  It only tries to explain why you don't necessarily
need 10 perl-interpreters to handle a stream of 10 concurrent requests
on a single-CPU system.

You own a fast-food restaurant.  The players involved are:

 Your customers.  These represent the http requests.

 Your cashiers.  These represent the perl interpreters.

 Your cook.  You only have one.  This represents your CPU.

The normal flow of events is this:

 A cashier gets an order from a customer.  The cashier goes and
 waits until the cook is free, and then gives the order to the cook.
 The cook then cooks the meal, taking 5-minutes for each meal.
 The cashier waits for the meal to be ready, then takes the meal and
 gives it to the customer.  The cashier then serves another customer.
 The cashier/customer interaction takes a very small amount of time.

The analogy is this:

 An http request (customer) arrives.  It is given to a perl
 interpreter (cashier).  A perl interpreter must wait for all other
 perl interpreters ahead of it to finish using the CPU (the cook).
 It can't serve any other requests until it finishes this one.
 When its turn arrives, the perl interpreter uses the CPU to process
 the perl code.  It then finishes and gives the results over to the
 http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store.
At each 5-minute interval thereafter another customer arrives.  So at time
0, there is a pool of 10 customers.  At time +5, another customer arrives.
At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load.  What would
happen is that the 10 cashiers would fairly quickly get all the orders
from the first 10 customers simultaneously, and then start waiting for
 the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
 first order.  Cashiers 2-10 would wait their turn.  After 5-minutes,
cashier 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-17 Thread Buddy Lee Haystack

I have a wide assortment of queries on a site, some of which take several minutes to 
execute, while others execute in less than one second. If I understand this analogy 
correctly, I'd be better off with the current incarnation of mod_perl because there 
would be more cashiers around to serve the "quick cups of coffee" that many customers 
request at my diner.

Is this correct?


Sam Horrocks wrote:
 
 I think the major problem is that you're assuming that just because
 there are 10 constant concurrent requests, that there have to be 10
 perl processes serving those requests at all times in order to get
 maximum throughput.  The problem with that assumption is that there
 is only one CPU - ten processes cannot all run simultaneously anyways,
 so you don't really need ten perl interpreters.
 
 I've been trying to think of better ways to explain this.  I'll try to
 explain with an analogy - it's sort-of lame, but maybe it'll give you
 a mental picture of what's happening.  To eliminate some confusion,
 this analogy doesn't address LRU/MRU, nor waiting on other events like
 network or disk i/o.  It only tries to explain why you don't necessarily
 need 10 perl-interpreters to handle a stream of 10 concurrent requests
 on a single-CPU system.
 
 You own a fast-food restaurant.  The players involved are:
 
 Your customers.  These represent the http requests.
 
 Your cashiers.  These represent the perl interpreters.
 
 Your cook.  You only have one.  This represents your CPU.
 
 The normal flow of events is this:
 
 A cashier gets an order from a customer.  The cashier goes and
 waits until the cook is free, and then gives the order to the cook.
 The cook then cooks the meal, taking 5-minutes for each meal.
 The cashier waits for the meal to be ready, then takes the meal and
 gives it to the customer.  The cashier then serves another customer.
 The cashier/customer interaction takes a very small amount of time.
 
 The analogy is this:
 
 An http request (customer) arrives.  It is given to a perl
 interpreter (cashier).  A perl interpreter must wait for all other
 perl interpreters ahead of it to finish using the CPU (the cook).
 It can't serve any other requests until it finishes this one.
 When its turn arrives, the perl interpreter uses the CPU to process
 the perl code.  It then finishes and gives the results over to the
 http client (the customer).
 
 Now, say in this analogy you begin the day with 10 customers in the store.
 At each 5-minute interval thereafter another customer arrives.  So at time
 0, there is a pool of 10 customers.  At time +5, another customer arrives.
 At time +10, another customer arrives, ad infinitum.
 
 You could hire 10 cashiers in order to handle this load.  What would
 happen is that the 10 cashiers would fairly quickly get all the orders
 from the first 10 customers simultaneously, and then start waiting for
 the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
 first order.  Cashiers 2-10 would wait their turn.  After 5-minutes,
 cashier number 1 would receive the meal, deliver it to customer #1, and
 then serve the next customer (#11) that just arrived at the 5-minute mark.
 Cashier #1 would take customer #11's order, then queue up and wait in
 line for the cook - there will be 9 other cashiers already in line, so
 the wait will be long.  At the 10-minute mark, cashier #2 would receive
 a meal from the cook, deliver it to customer #2, then go on and serve
 the next customer (#12) that just arrived.  Cashier #2 would then go and
 wait in line for the cook.  This continues on through all the cashiers
 in order 1-10, then repeating, 1-10, ad infinitum.
 
 Now even though you have 10 cashiers, most of their time is spent
 waiting to put in an order to the cook.  Starting with customer #11,
 all customers will wait 50-minutes for their meal.  When customer #11
 comes in he/she will immediately get to place an order, but it will take
 the cashier 45-minutes to wait for the cook to become free, and another
 5-minutes for the meal to be cooked.  Same is true for customer #12,
 and all customers from then on.
 
 Now, the question is, could you get the same throughput with fewer
 cashiers?  Say you had 2 cashiers instead.  The 10 customers are
 there waiting.  The 2 cashiers take orders from customers #1 and #2.
 Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
 in line for the cook behind cashier #1.  At the 5-minute mark, the first
 meal is done.  Cashier #1 delivers the meal to customer #1, then serves
 customer #3.  Cashier #1 then goes and stands in line behind cashier #2.
 At the 10-minute mark, cashier #2's meal is ready - it's delivered to
 customer #2 and then customer #4 is served.  This continues on with the
 cashiers trading off between serving customers.
 
 Does the scenario with two cashiers go any more slowly than the one with
 10 cashiers?  No.  When the 11th 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks

  I guess as I get older I start to slip technically. :) This helps me a bit, 
  but it doesn't really help me understand the final argument (that MRU is 
  still going to help on a fully loaded system).
  
  With some modification, I guess I am thinking that the cook is really the 
  OS and the CPU is really the oven. But the hamburgers on an Intel oven have 
  to be timesliced instead of left to cook and then after it's done the next 
  hamburger is put on.
  
  So if we think of meals as Perl requests, the reality is that not all meals 
  take the same amount of time to cook. A quarter pounder surely takes longer 
  than your typical paper thin McDonald's Patty.
  
  The fact that a customer requests a meal that takes longer to cook than 
  another one is relatively random. In fact in the real world, it is likely 
  to be random. This means that it's possible for all 10 meals to be cooking 
  but the 3rd meal gets done really fast, so another customer gets time 
  sliced to use the oven for their meal -- which might be a long meal.

I don't like your mods to the analogy, because they don't model how
a CPU actually works.  Even if the cook == the OS and the oven == the
CPU, the oven *must* work on tasks sequentially.  If you look at the
assembly language for your Intel CPU you won't see anything about it
doing multi-tasking.  It does adds, subtracts, stores, loads, jumps, etc.
It executes code sequentially.  You must model this somewhere in your
analogy if it's going to be accurate.

So I'll modify your analogy to say the oven can only cook one thing at
a time.  Now, what you could do is have the cook take one of the longer
meals (the 10 minute meatloaf) out of the oven in order to cook something
small, then put the meatloaf back later to finish cooking.  But the oven
does *not* cook things in parallel.  Remember that things have
to cook for a very long time before they get timesliced -- 210ms is a
long time for a CPU, and that's the default timeslice on a Linux PC.
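For a sense of just how long 210ms is to a CPU (the clock speed below is an
assumed, era-typical figure, not something from the benchmarks):

    printf "%.0f million cycles in one 210ms timeslice\n", 500e6 * 0.210 / 1e6;  # ~105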

If we say the oven cooks things sequentially, it doesn't really change
the overall results that I had in the previous example.  The cook just
puts things in the oven sequentially, in the order in which they were
received from the cashiers - this represents the run queue in the OS.
But the cashiers still sit there and wait for the meals from the cook,
and the cook just stands there waiting for the oven to cook meals
sequentially.
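In miniature, that run-queue picture is just this (a sketch only; cook() is a
stand-in, and preemption/re-queueing is left out):

    sub cook { print "cooking $_[0]\n" }           # stand-in for running perl code

    my @run_queue = ('meal for #1', 'meal for #2', 'meal for #3');  # order received
    while (my $meal = shift @run_queue) {
        cook($meal);    # one meal in the oven (CPU) at a time
    }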

  In your testing, perhaps the problem is that you are benchmarking with a 
  homogeneous process. So of course you are seeing this behavior that makes 
  it look like serializing 10 connections is just the same wait as time 
  slicing them and therefore an MRU algorithm works better (of course it 
  works better, because you keep releasing the systems in order)...
  
  But in the world where the 3rd or 5th or 6th process may finish sooner and 
  release sooner than others, then an MRU algorithm doesn't matter. And 
  actually a process that finishes in 10 seconds shouldn't have to wait until 
  a process that takes 30 seconds to complete has finished.

No, homogeneity (or the lack of it) wouldn't make a difference.  Those 3rd,
5th or 6th processes run only *after* the 1st and 2nd have finished using
the CPU.  And at that point you could re-use those interpreters that 1 and 2
were using.

  And with all 10 interpreters in use at the same time, serving all requests 
  and randomly popping off the queue and starting again, no MRU or LRU 
  algorithm will really help. It's all the same.

If in both the MRU/LRU case there were exactly 10 interpreters busy at
all times, then you're right it wouldn't matter.  But don't confuse
the issues - 10 concurrent requests do *not* necessarily require 10
concurrent interpreters.  The MRU has an effect on the way a stream of 10
concurrent requests is handled, and MRU results in those same requests
being handled by fewer interpreters.

  Anyway, maybe I am still not really getting it. Even with the fast food 
  analogy. Maybe it is time to throw in the network time and other variables 
  that seemed to make a difference in Perrin's understanding of how you were 
  approaching the explanation.

Please again take a look at the first analogy.  The CPU can't do multi-tasking.
Until that gets straightened out, I don't think adding more to the analogy
will help.

Also, I think the analogy is about to break - that's why I put in extra
disclaimers at the top.  It was only intended to show that 10 concurrent
requests don't necessarily require 10 perl interpreters in order to
achieve maximum throughput.

  I am now curious -- on a fully loaded system of max 10 processes, did you 
  see that SpeedyCGI scaled better than mod_perl on your benchmarks? Or are 
  we still just speculating?

It is actually possible to benchmark.  Given the same concurrent load
and the same number of httpds running, speedycgi will use fewer perl
interpreters than mod_perl.  This will usually result in speedycgi

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks

There is no coffee.  Only meals.  No substitutions. :-)

If we added coffee to the menu it would still have to be prepared by the cook.
Remember that you only have one CPU, and all the perl interpreters large and
small must gain access to that CPU in order to run.

Sam


  I have a wide assortment of queries on a site, some of which take several minutes to 
 execute, while others execute in less than one second. If I understand this analogy 
 correctly, I'd be better off with the current incarnation of mod_perl because there 
 would be more cashiers around to serve the "quick cups of coffee" that many customers 
 request at my diner.
  
  Is this correct?
  
  
  Sam Horrocks wrote:
   
   I think the major problem is that you're assuming that just because
   there are 10 constant concurrent requests, that there have to be 10
   perl processes serving those requests at all times in order to get
   maximum throughput.  The problem with that assumption is that there
   is only one CPU - ten processes cannot all run simultaneously anyways,
   so you don't really need ten perl interpreters.
   
   I've been trying to think of better ways to explain this.  I'll try to
   explain with an analogy - it's sort-of lame, but maybe it'll give you
   a mental picture of what's happening.  To eliminate some confusion,
   this analogy doesn't address LRU/MRU, nor waiting on other events like
   network or disk i/o.  It only tries to explain why you don't necessarily
   need 10 perl-interpreters to handle a stream of 10 concurrent requests
   on a single-CPU system.
   
   You own a fast-food restaurant.  The players involved are:
   
   Your customers.  These represent the http requests.
   
   Your cashiers.  These represent the perl interpreters.
   
   Your cook.  You only have one.  This represents your CPU.
   
   The normal flow of events is this:
   
   A cashier gets an order from a customer.  The cashier goes and
   waits until the cook is free, and then gives the order to the cook.
   The cook then cooks the meal, taking 5-minutes for each meal.
   The cashier waits for the meal to be ready, then takes the meal and
   gives it to the customer.  The cashier then serves another customer.
   The cashier/customer interaction takes a very small amount of time.
   
   The analogy is this:
   
   An http request (customer) arrives.  It is given to a perl
   interpreter (cashier).  A perl interpreter must wait for all other
   perl interpreters ahead of it to finish using the CPU (the cook).
   It can't serve any other requests until it finishes this one.
   When its turn arrives, the perl interpreter uses the CPU to process
   the perl code.  It then finishes and gives the results over to the
   http client (the customer).
   
   Now, say in this analogy you begin the day with 10 customers in the store.
   At each 5-minute interval thereafter another customer arrives.  So at time
   0, there is a pool of 10 customers.  At time +5, another customer arrives.
   At time +10, another customer arrives, ad infinitum.
   
   You could hire 10 cashiers in order to handle this load.  What would
   happen is that the 10 cashiers would fairly quickly get all the orders
   from the first 10 customers simultaneously, and then start waiting for
   the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
   first order.  Cashiers 2-10 would wait their turn.  After 5-minutes,
   cashier number 1 would receive the meal, deliver it to customer #1, and
   then serve the next customer (#11) that just arrived at the 5-minute mark.
   Cashier #1 would take customer #11's order, then queue up and wait in
   line for the cook - there will be 9 other cashiers already in line, so
   the wait will be long.  At the 10-minute mark, cashier #2 would receive
   a meal from the cook, deliver it to customer #2, then go on and serve
   the next customer (#12) that just arrived.  Cashier #2 would then go and
   wait in line for the cook.  This continues on through all the cashiers
   in order 1-10, then repeating, 1-10, ad infinitum.
   
   Now even though you have 10 cashiers, most of their time is spent
   waiting to put in an order to the cook.  Starting with customer #11,
   all customers will wait 50-minutes for their meal.  When customer #11
   comes in he/she will immediately get to place an order, but it will take
   the cashier 45-minutes to wait for the cook to become free, and another
   5-minutes for the meal to be cooked.  Same is true for customer #12,
   and all customers from then on.
   
   Now, the question is, could you get the same throughput with fewer
   cashiers?  Say you had 2 cashiers instead.  The 10 customers are
   there waiting.  The 2 cashiers take orders from customers #1 and #2.
   Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
   in line for the cook behind cashier #1.  At the 5-minute mark, the first
   meal is done. 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-13 Thread Gunther Birznieks

I have just gotten around to reading this thread I've been saving for a 
rainy day. Well, it's not rainy, but I'm finally getting to it. Apologies 
to those who hate when people don't snip their reply mails, but I am 
including it so that the entire context is not lost.

Sam (or others who may understand Sam's explanation),

I am still confused by this explanation of MRU helping when there are 10 
processes serving 10 requests at all times. I understand MRU helping when 
the processes are not at max, but I don't see how it helps when they are at 
max utilization.

It seems to me that if the wait is the same for mod_perl backend processes 
and speedyCGI processes, it doesn't matter if some of the speedycgi 
processes cycle earlier than the mod_perl ones, because all 10 will always 
be used.

I did read and reread (once) the snippets about modeling concurrency and 
the httpd waiting for an accept... But I still don't understand how MRU helps 
when all the processes would be in use anyway. At that point they all have 
an equal chance of being called.

Could you clarify this with a simpler example? Maybe 4 processes and a 
sample timeline of what happens to those when there are enough requests to 
keep all 4 busy all the time for speedyCGI and a mod_perl backend?

At 04:32 AM 1/6/01 -0800, Sam Horrocks wrote:
   Let me just try to explain my reasoning.  I'll define a couple of my
   base assumptions, in case you disagree with them.
  
   - Slices of CPU time doled out by the kernel are very small - so small
   that processes can be considered concurrent, even though technically
   they are handled serially.

  Don't agree.  You're equating the model with the implementation.
  Unix processes model concurrency, but when it comes down to it, if you
  don't have more CPU's than processes, you can only simulate concurrency.

  Each process runs until it either blocks on a resource (timer, network,
  disk, pipe to another process, etc), or a higher priority process
  pre-empts it, or it's taken so much time that the kernel wants to give
  another process a chance to run.

   - A set of requests can be considered "simultaneous" if they all arrive
   and start being handled in a period of time shorter than the time it
   takes to service a request.

  That sounds OK.

   Operating on these two assumptions, I say that 10 simultaneous requests
   will require 10 interpreters to service them.  There's no way to handle
   them with fewer, unless you queue up some of the requests and make them
   wait.

  Right.  And that waiting takes place:

 - In the mutex around the accept call in the httpd

 - In the kernel's run queue when the process is ready to run, but is
   waiting for other processes ahead of it.

  So, since there is only one CPU, then in both cases (mod_perl and
  SpeedyCGI), processes spend time waiting.  But what happens in the
  case of SpeedyCGI is that while some of the httpd's are waiting,
  one of the earlier speedycgi perl interpreters has already finished
  its run through the perl code and has put itself back at the front of
  the speedycgi queue.  And by the time that Nth httpd gets around to
  running, it can re-use that first perl interpreter instead of needing
  yet another process.

  This is why it's important that you don't assume that Unix is truly
  concurrent.

   I also say that if you have a top limit of 10 interpreters on your
   machine because of memory constraints, and you're sending in 10
   simultaneous requests constantly, all interpreters will be used all the
   time.  In that case it makes no difference to the throughput whether you
   use MRU or LRU.

  This is not true for SpeedyCGI, because of the reason I give above.
  10 simultaneous requests will not necessarily require 10 interpreters.

 What you say would be true if you had 10 processors and could get
 true concurrency.  But on single-cpu systems you usually don't need
 10 unix processes to handle 10 requests concurrently, since they get
 serialized by the kernel anyways.
  
   I think the CPU slices are smaller than that.  I don't know much about
   process scheduling, so I could be wrong.  I would agree with you if we
   were talking about requests that were coming in with more time between
   them.  Speedycgi will definitely use fewer interpreters in that case.

  This url:

 http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html

  says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
  There's also lots of good info there on Linux scheduling.

 I found that setting MaxClients to 100 stopped the paging.  At concurrency
 level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
 Even at higher levels (300), they were comparable.
  
   That's what I would expect if both systems have a similar limit of how
   many interpreters they can fit in RAM at once.  Shared memory would help
   here, since it would allow more 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-08 Thread Keith G. Murphy

Les Mikesell wrote:

[cut] 
 
 I don't think I understand what you mean by LRU.   When I view the
 Apache server-status with ExtendedStatus On,  it appears that
 the backend server processes recycle themselves as soon as they
 are free instead of cycling sequentially through all the available
 processes.   Did you mean to imply otherwise or are you talking
 about something else?
 
Be careful here.  Note my message earlier in the thread about the
misleading effect of persistent connections (HTTP 1.1).

Perrin Harkins noted in another thread that it had fooled him as well as
me.

Not saying that's what you're seeing, just take it into account. 
(Quick-and-dirty test: run Netscape as the client browser; do you still
see the same thing?)



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-07 Thread Joshua Chamas

Sam Horrocks wrote:
 
 A few things:
 
 - In your results, could you add the speedycgi version number (2.02),
   and the fact that this is using the mod_speedycgi frontend.

The version numbers are gathered at runtime, so for mod_speedycgi,
this would get picked up if you registered it in the Apache server
header that gets sent out.  I'll list the test as mod_speedycgi.

   The fork/exec frontend will be much slower on hello-world so I don't
   want people to get the wrong idea.  You may want to benchmark
   the fork/exec version as well.
 

If it's slower then what's the point :)  If mod_speedycgi is the faster
way to run it, then that should be good enough, no?  If you would like 
to contribute that test to the suite, please do so.

 - You may be able to eke out a little more performance by setting
  MaxRuns to 0 (infinite).  This is set for mod_speedycgi using the
   SpeedyMaxRuns directive, or on the command-line using "-r0".
   This setting is similar to the MaxRequestsPerChild setting in apache.
 

Will do.

 - My tests show mod_perl/speedy much closer than yours do, even with
   MaxRuns at its default value of 500.  Maybe you're running on
   a different OS than I am - I'm using Redhat 6.2.  I'm also running
   one rev lower of mod_perl in case that matters.
 

I'm running the same thing, RH 6.2, I don't know if the mod_perl rev 
matters, but what often does matter is that I have 2 CPUs in my box, so 
my results often look different from other people's.

--Josh

_
Joshua Chamas   Chamas Enterprises Inc.
NodeWorks  free web link monitoring   Huntington Beach, CA  USA 
http://www.nodeworks.com1-714-625-4051





Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

  Let me just try to explain my reasoning.  I'll define a couple of my
  base assumptions, in case you disagree with them.
  
  - Slices of CPU time doled out by the kernel are very small - so small
  that processes can be considered concurrent, even though technically
  they are handled serially.

 Don't agree.  You're equating the model with the implementation.
 Unix processes model concurrency, but when it comes down to it, if you
 don't have more CPU's than processes, you can only simulate concurrency.

 Each process runs until it either blocks on a resource (timer, network,
 disk, pipe to another process, etc), or a higher priority process
 pre-empts it, or it's taken so much time that the kernel wants to give
 another process a chance to run.

  - A set of requests can be considered "simultaneous" if they all arrive
  and start being handled in a period of time shorter than the time it
  takes to service a request.

 That sounds OK.

  Operating on these two assumptions, I say that 10 simultaneous requests
  will require 10 interpreters to service them.  There's no way to handle
  them with fewer, unless you queue up some of the requests and make them
  wait.

 Right.  And that waiting takes place:

- In the mutex around the accept call in the httpd

- In the kernel's run queue when the process is ready to run, but is
  waiting for other processes ahead of it.

 So, since there is only one CPU, then in both cases (mod_perl and
 SpeedyCGI), processes spend time waiting.  But what happens in the
 case of SpeedyCGI is that while some of the httpd's are waiting,
 one of the earlier speedycgi perl interpreters has already finished
 its run through the perl code and has put itself back at the front of
 the speedycgi queue.  And by the time that Nth httpd gets around to
 running, it can re-use that first perl interpreter instead of needing
 yet another process.

 This is why it's important that you don't assume that Unix is truly
 concurrent.

  I also say that if you have a top limit of 10 interpreters on your
  machine because of memory constraints, and you're sending in 10
  simultaneous requests constantly, all interpreters will be used all the
  time.  In that case it makes no difference to the throughput whether you
  use MRU or LRU.

 This is not true for SpeedyCGI, because of the reason I give above.
 10 simultaneous requests will not necessarily require 10 interpreters.

What you say would be true if you had 10 processors and could get
true concurrency.  But on single-cpu systems you usually don't need
10 unix processes to handle 10 requests concurrently, since they get
serialized by the kernel anyways.
  
  I think the CPU slices are smaller than that.  I don't know much about
  process scheduling, so I could be wrong.  I would agree with you if we
  were talking about requests that were coming in with more time between
  them.  Speedycgi will definitely use fewer interpreters in that case.

 This url:

http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html

 says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
 There's also lots of good info there on Linux scheduling.

I found that setting MaxClients to 100 stopped the paging.  At concurrency
level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
Even at higher levels (300), they were comparable.
  
  That's what I would expect if both systems have a similar limit of how
  many interpreters they can fit in RAM at once.  Shared memory would help
  here, since it would allow more interpreters to run.
  
  By the way, do you limit the number of SpeedyCGI processes as well?  it
  seems like you'd have to, or they'd start swapping too when you throw
  too many requests in.

 SpeedyCGI has an optional limit on the number of processes, but I didn't
 use it in my testing.

But, to show that the underlying problem is still there, I then changed
the hello_world script and doubled the amount of un-shared memory.
And of course the problem then came back for mod_perl, although speedycgi
continued to work fine.  I think this shows that mod_perl is still
using quite a bit more memory than speedycgi to provide the same service.
  
  I'm guessing that what happened was you ran mod_perl into swap again. 
  You need to adjust MaxClients when your process size changes
  significantly.

 Right, but this also points out how difficult it is to get mod_perl
 tuning just right.  My opinion is that the MRU design adapts more
 dynamically to the load.

   I believe that with speedycgi you don't have to lower the MaxClients
   setting, because it's able to handle a larger number of clients, at
   least in this test.

 Maybe what you're seeing is an ability to handle a larger number of
 requests (as opposed to clients) because of the performance benefit I
 mentioned above.
   
I don't follow.
  
  When not all processes are in use, I 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Perrin Harkins

Sam Horrocks wrote:
  Don't agree.  You're equating the model with the implementation.
  Unix processes model concurrency, but when it comes down to it, if you
  don't have more CPU's than processes, you can only simulate concurrency.
[...]
  This url:
 
 http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html
 
  says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
  There's also lots of good info there on Linux scheduling.

Thanks for the info.  This makes much more sense to me now.  It sounds
like using an MRU algorithm for process selection is automatically
finding the sweet spot in terms of how many processes can run within the
space of one request and coming close to the ideal of never having
unused processes in memory.  Now I'm really looking forward to getting
MRU and shared memory in the same package and seeing how high I can
scale my hardware.

- Perrin



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Buddy Lee Haystack

Does this mean that mod_perl's memory hunger will be curbed in the future using some of 
the neat tricks in Speedycgi?


Perrin Harkins wrote:
 
 Sam Horrocks wrote:
   Don't agree.  You're equating the model with the implementation.
   Unix processes model concurrency, but when it comes down to it, if you
   don't have more CPU's than processes, you can only simulate concurrency.
 [...]
   This url:
 
  http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html
 
   says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
   There's also lots of good info there on Linux scheduling.
 
 Thanks for the info.  This makes much more sense to me now.  It sounds
 like using an MRU algorithm for process selection is automatically
 finding the sweet spot in terms of how many processes can run within the
 space of one request and coming close to the ideal of never having
 unused processes in memory.  Now I'm really looking forward to getting
 MRU and shared memory in the same package and seeing how high I can
 scale my hardware.
 
 - Perrin

-- 
www.RentZone.org



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Perrin Harkins

Buddy Lee Haystack wrote:
 
 Does this mean that mod_perl's memory hunger will be curbed in the future using some of 
the neat tricks in Speedycgi?

Yes.  The upcoming mod_perl 2 (running on Apache 2) will use MRU to
select threads.  Doug demoed this at ApacheCon a few months back.

- Perrin



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Les Mikesell


- Original Message -
From: "Sam Horrocks" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: "mod_perl list" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, January 06, 2001 6:32 AM
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts
that contain un-shared memory



  Right, but this also points out how difficult it is to get mod_perl
  tuning just right.  My opinion is that the MRU design adapts more
  dynamically to the load.

How would this compare to apache's process management when
using the front/back end approach?

  I'd agree that the size of one Speedy backend + one httpd would be the
  same or even greater than the size of one mod_perl/httpd when no memory
  is shared.  But because the speedycgi httpds are small (no perl in them)
  and the number of SpeedyCGI perl interpreters is small, the total memory
  required is significantly smaller for the same load.

Likewise, it would be helpful if you would always make the comparison
to the dual httpd setup that is often used for busy sites.   I think it must
really boil down to the efficiency of your IPC vs. access to the full
apache environment.

  Les Mikesell
 [EMAIL PROTECTED]




Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Joshua Chamas

Sam Horrocks wrote:
 
  Don't agree.  You're equating the model with the implementation.
  Unix processes model concurrency, but when it comes down to it, if you
  don't have more CPU's than processes, you can only simulate concurrency.
 

Hey Sam, nice module.  I just installed your SpeedyCGI for a good ol'
HelloWorld benchmark and it was a snap, well done.  I'd like to add to the 
numbers below that a fair benchmark would be between mod_proxy in front 
of a mod_perl server and mod_speedycgi, as it would be a similar memory 
saving model ( this is how we often scale mod_perl )... both models would
end up forwarding back to a smaller set of persistent perl interpreters.

However, I did not do such a benchmark, so SpeedyCGI loses out a
bit for the extra layer it has to do :(   This is based on the 
suite at http://www.chamas.com/bench/hello.tar.gz, but I have not
included the speedy test in that yet.

 -- Josh

Test Name                      Test File  Hits/sec  Total Hits  Total Time  sec/Hits  Bytes/Hit
-----------------------------  ---------  --------  ----------  ----------  --------  ---------
Apache::Registry v2.01 CGI.pm  hello.cgi  451.9     27128 hits  60.03 sec   0.002213  216 bytes
Speedy CGI                     hello.cgi  375.2     22518 hits  60.02 sec   0.002665  216 bytes

Apache Server Header Tokens
---
(Unix)
Apache/1.3.14
OpenSSL/0.9.6
PHP/4.0.3pl1
mod_perl/1.24
mod_ssl/2.7.1
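For reference, the script class being measured above is a plain CGI.pm
hello-world.  A minimal sketch of that sort of script (not the actual
hello.cgi shipped in the hello.tar.gz suite):

    #!/usr/bin/perl
    # Minimal CGI.pm hello-world, of the kind these hits/sec numbers come from.
    use strict;
    use CGI;
    my $q = CGI->new;
    print $q->header('text/plain'), "Hello World\n";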



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

Right, but this also points out how difficult it is to get mod_perl
tuning just right.  My opinion is that the MRU design adapts more
dynamically to the load.
  
  How would this compare to apache's process management when
  using the front/back end approach?

 Same thing applies.  The front/back end approach does not change the
 fundamentals.

I'd agree that the size of one Speedy backend + one httpd would be the
same or even greater than the size of one mod_perl/httpd when no memory
is shared.  But because the speedycgi httpds are small (no perl in them)
and the number of SpeedyCGI perl interpreters is small, the total memory
required is significantly smaller for the same load.
  
  Likewise, it would be helpful if you would always make the comparison
  to the dual httpd setup that is often used for busy sites.   I think it must
  really boil down to the efficiency of your IPC vs. access to the full
  apache environment.

 The reason I don't include that comparison is that it's not fundamental
 to the differences between mod_perl and speedycgi or LRU and MRU that
 I have been trying to point out.  Regardless of whether you add a
 frontend or not, the mod_perl process selection remains LRU and the
 speedycgi process selection remains MRU.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

A few things:

- In your results, could you add the speedycgi version number (2.02),
  and the fact that this is using the mod_speedycgi frontend.
  The fork/exec frontend will be much slower on hello-world so I don't
  want people to get the wrong idea.  You may want to benchmark
  the fork/exec version as well.

- You may be able to eke out a little more performance by setting
  MaxRuns to 0 (infinite).  This is set for mod_speedycgi using the
  SpeedyMaxRuns directive, or on the command-line using "-r0".
  This setting is similar to the MaxRequestsPerChild setting in apache.

- My tests show mod_perl/speedy much closer than yours do, even with
  MaxRuns at its default value of 500.  Maybe you're running on
  a different OS than I am - I'm using Redhat 6.2.  I'm also running
  one rev lower of mod_perl in case that matters.


  Hey Sam, nice module.  I just installed your SpeedyCGI for a good ol'
  HelloWorld benchmark and it was a snap, well done.  I'd like to add to the 
  numbers below that a fair benchmark would be between mod_proxy in front 
  of a mod_perl server and mod_speedycgi, as it would be a similar memory 
  saving model ( this is how we often scale mod_perl )... both models would
  end up forwarding back to a smaller set of persistent perl interpreters.
  
  However, I did not do such a benchmark, so SpeedyCGI loses out a
  bit for the extra layer it has to do :(   This is based on the 
  suite at http://www.chamas.com/bench/hello.tar.gz, but I have not
  included the speedy test in that yet.
  
   -- Josh
  
  Test Name                      Test File  Hits/sec  Total Hits  Total Time  sec/Hits  Bytes/Hit
  -----------------------------  ---------  --------  ----------  ----------  --------  ---------
  Apache::Registry v2.01 CGI.pm  hello.cgi  451.9     27128 hits  60.03 sec   0.002213  216 bytes
  Speedy CGI                     hello.cgi  375.2     22518 hits  60.02 sec   0.002665  216 bytes
  
  Apache Server Header Tokens
  ---
  (Unix)
  Apache/1.3.14
  OpenSSL/0.9.6
  PHP/4.0.3pl1
  mod_perl/1.24
  mod_ssl/2.7.1



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-06 Thread Les Mikesell


- Original Message -
From: "Sam Horrocks" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; "mod_perl list" [EMAIL PROTECTED]
Sent: Saturday, January 06, 2001 4:37 PM
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts
that contain un-shared memory


Right, but this also points out how difficult it is to get mod_perl
 tuning just right.  My opinion is that the MRU design adapts more
 dynamically to the load.
  
   How would this compare to apache's process management when
   using the front/back end approach?

  Same thing applies.  The front/back end approach does not change the
  fundamentals.

It changes them drastically in the world of slow internet connections,
but perhaps not much in artificial benchmarks or LAN use.  I think
you can reduce the problem to:

 How much time do you spend in non-perl apache code vs. how
 much time do you spend in perl code?

and the solution to:

 Only use the memory footprint of perl for the minimal time it is needed.

If your I/O is slow and your program complexity minimal, the bulk of
the wall-clock time is spent in i/o wait by non-perl apache code.  Using
a front-end proxy greatly reduces this time (and correspondingly the
ratio of time spent in non-perl code) for the backend where it matters
because you are tying up a copy of perl in memory. Likewise, increasing
the complexity of the perl code will reduce this ratio, reducing the
potential for saving memory regardless of what you do, so benchmarking
a trivial perl program will likely be misleading.
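To put rough numbers on that ratio (the figures below are invented purely for
illustration, not taken from any of the benchmarks in this thread):

    my $perl_ms   = 50;     # time actually spent running perl code
    my $client_ms = 2000;   # time spent dribbling output to a slow dial-up client
    printf "no front-end proxy: perl-sized process tied up %d ms/request (%.0f%% i/o wait)\n",
        $perl_ms + $client_ms, 100 * $client_ms / ($perl_ms + $client_ms);
    printf "with front-end proxy: perl-sized process tied up ~%d ms/request\n", $perl_ms;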

 I'd agree that the size of one Speedy backend + one httpd would be the
 same or even greater than the size of one mod_perl/httpd when no memory
 is shared.  But because the speedycgi httpds are small (no perl in them)
 and the number of SpeedyCGI perl interpreters is small, the total memory
 required is significantly smaller for the same load.

   Likewise, it would be helpful if you would always make the comparison
   to the dual httpd setup that is often used for busy sites.  I think it must
   really boil down to the efficiency of your IPC vs. access to the full
   apache environment.

  The reason I don't include that comparison is that it's not fundamental
  to the differences between mod_perl and speedycgi or LRU and MRU that
  I have been trying to point out.  Regardless of whether you add a
  frontend or not, the mod_perl process selection remains LRU and the
  speedycgi process selection remains MRU.

I don't think I understand what you mean by LRU.   When I view the
Apache server-status with ExtendedStatus On,  it appears that
the backend server processes recycle themselves as soon as they
are free instead of cycling sequentially through all the available
processes.   Did you mean to imply otherwise or are you talking
about something else?

   Les Mikesell
 [EMAIL PROTECTED]





Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-05 Thread Sam Horrocks

 Are the speedycgi+Apache processes smaller than the mod_perl
 processes?  If not, the maximum number of concurrent requests you can
 handle on a given box is going to be the same.
  
The size of the httpds running mod_speedycgi, plus the size of speedycgi
perl processes is significantly smaller than the total size of the httpd's
running mod_perl.
  
  That would be true if you only ran one mod_perl'd httpd, but can you
  give a better comparison to the usual setup for a busy site where
  you run a non-mod_perl lightweight front end and let mod_rewrite
  decide what is proxied through to the larger mod_perl'd backend,
  letting apache decide how many backends you need to have
  running?

 The fundamental differences would remain the same - even in the mod_perl
 backend, the requests will be spread out over all the httpd's that are
 running, whereas speedycgi would tend to use fewer perl interpreters
 to handle the same load.

 But with this setup, the mod_perl backend could probably be set to run
 fewer httpds because it doesn't have to wait on slow clients.  And the
 fewer httpd's you run with mod_perl the smaller your total memory.

The reason for this is that only a handful of perl processes are required by
speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
in all of the httpds.
  
  I always see at least a 10-1 ratio of front-to-back end httpd's when serving
  over the internet.   One effect that is difficult to benchmark is that clients
  connecting over the internet are often slow and will hold up the process
  that is delivering the data even though the processing has been completed.
  The proxy approach provides some buffering and allows the backend
  to move on more quickly.  Does speedycgi do the same?

 There are plans to make it so that SpeedyCGI does more buffering of
 the output in memory, perhaps eliminating the need for a caching frontend
 webserver.  It works now only for the "speedy" binary (not mod_speedycgi)
 if you set the BufsizGet value high enough.

 Of course you could add a caching webserver in front of the SpeedyCGI server
 just like you do with mod_perl now.  So yes you can do the same with
 speedycgi now.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-04 Thread Sam Horrocks

Sorry for the late reply - I've been out for the holidays.

  By the way, how are you doing it?  Do you use a mutex routine that works
  in LIFO fashion?

 Speedycgi uses separate backend processes that run the perl interpreters.
 The frontend processes (the httpd's that are running mod_speedycgi)
 communicate with the backends, sending over the request and getting the output.

 Speedycgi uses some shared memory (an mmap'ed file in /tmp) to keep track
 of the backends and frontends.  This shared memory contains the queue.
 When backends become free, they add themselves at the front of this queue.
 When the frontends need a backend they pull the first one from the front
 of this list.

  
I am saying that since SpeedyCGI uses MRU to allocate requests to perl
interpreters, it winds up using a lot fewer interpreters to handle the
same number of requests.
  
  What I was saying is that it doesn't make sense for one to need fewer
  interpreters than the other to handle the same concurrency.  If you have
  10 requests at the same time, you need 10 interpreters.  There's no way
  speedycgi can do it with fewer, unless it actually makes some of them
  wait.  That could be happening, due to the fork-on-demand model, although
  your warmup round (priming the pump) should take care of that.

 What you say would be true if you had 10 processors and could get
 true concurrency.  But on single-cpu systems you usually don't need
 10 unix processes to handle 10 requests concurrently, since they get
 serialized by the kernel anyways.  I'll try to show how mod_perl handles
 10 concurrent requests, and compare that to mod_speedycgi so you can
 see the difference.

 For mod_perl, let's assume we have 10 httpd's, h1 through h10,
 when the 10 concurrent requests come in.  h1 has acquired the mutex,
 and h2-h10 are waiting (in order) on the mutex.  Here's how the cpu
 actually runs the processes:

h1 accepts
h1 releases the mutex, making h2 runnable
h1 runs the perl code and produces the results
h1 waits for the mutex

h2 accepts
h2 releases the mutex, making h3 runnable
h2 runs the perl code and produces the results
h2 waits for the mutex

h3 accepts
...

 This is pretty straightforward.  Each of h1-h10 run the perl code
 exactly once.  They may not run exactly in this order since a process
 could get pre-empted, or blocked waiting to send data to the client,
 etc.  But regardless, each of the 10 processes will run the perl code
 exactly once.

 Here's the mod_speedycgi example - it too uses httpd's h1-h10, and they
 all take turns running the mod_speedycgi frontend code.  But the backends,
 where the perl code is, don't have to all be run fairly - they use MRU
 instead.  I'll use b1 and b2 to represent 2 speedycgi backend processes,
 already queued up in that order.

 Here's a possible speedycgi scenario:

h1 accepts
h1 releases the mutex, making h2 runnable
h1 sends a request to b1, making b1 runnable

h2 accepts
h2 releases the mutex, making h3 runnable
h2 sends a request to b2, making b2 runnable

b1 runs the perl code and sends the results to h1, making h1 runnable
b1 adds itself to the front of the queue

h3 accepts
h3 releases the mutex, making h4 runnable
h3 sends a request to b1, making b1 runnable

b2 runs the perl code and sends the results to h2, making h2 runnable
b2 adds itself to the front of the queue

h1 produces the results it got from b1
h1 waits for the mutex

h4 accepts
h4 releases the mutex, making h5 runnable
h4 sends a request to b2, making b2 runnable

b1 runs the perl code and sends the results to h3, making h3 runnable
b1 adds itself to the front of the queue

h2 produces the results it got from b2
h2 waits for the mutex

h5 accepts
h5 releases the mutex, making h6 runnable
h5 sends a request to b1, making b1 runnable

b2 runs the perl code and sends the results to h4, making h4 runnable
b2 adds itself to the front of the queue

 This may be hard to follow, but hopefully you can see that the 10 httpd's
 just take turns using b1 and b2 over and over.  So, the 10 concurrent
 requests end up being handled by just two perl backend processes.  Again,
 this is simplified.  If the perl processes get blocked, or pre-empted,
 you'll end up using more of them.  But generally, the LIFO will cause
 SpeedyCGI to sort-of settle into the smallest number of processes needed for
 the task.

 The difference between the two approaches is that the mod_perl
 implementation forces unix to use 10 separate perl processes, while the
 mod_speedycgi implementation sort-of decides on the fly how many
 different processes are needed.
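
 (One way to see this on paper is to simulate both disciplines.  The
 sketch below is mine, not code from either project - it feeds a stream
 of non-overlapping requests to a FIFO pool and a LIFO pool and counts
 how many distinct workers each one touches:)

     #!/usr/bin/perl -w
     use strict;

     # Toy simulation: each request finishes before the next one arrives.
     # LRU takes the worker that has been idle longest (back of the list),
     # MRU takes the one that finished most recently (front of the list).
     sub distinct_workers_used {
         my ($style, $nworkers, $nrequests) = @_;
         my @idle = (1 .. $nworkers);
         my %used;
         for (1 .. $nrequests) {
             my $w = $style eq 'MRU' ? shift @idle : pop @idle;
             $used{$w} = 1;
             unshift @idle, $w;       # finished worker rejoins the front
         }
         return scalar keys %used;
     }

     for my $style (qw(LRU MRU)) {
         printf "%s touched %d of 10 workers\n",
                $style, distinct_workers_used($style, 10, 100);
     }
     # prints:  LRU touched 10 of 10 workers
     #          MRU touched 1 of 10 workers

 Once requests start overlapping, MRU touches more than one worker, but
 it only grows to the actual concurrency instead of cycling through the
 whole pool.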

Please let me know what you think I should change.  So far my
benchmarks only show one trend, but if you can tell me specifically
what I'm doing wrong (and it's something reasonable), I'll try it.
  
  Try setting MinSpareServers 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-04 Thread Les Mikesell


- Original Message -
From: "Sam Horrocks" [EMAIL PROTECTED]
To: "Perrin Harkins" [EMAIL PROTECTED]
Cc: "Gunther Birznieks" [EMAIL PROTECTED]; "mod_perl list"
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, January 04, 2001 6:56 AM
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts
that contain un-shared memory


  
   Are the speedycgi+Apache processes smaller than the mod_perl
   processes?  If not, the maximum number of concurrent requests you can
   handle on a given box is going to be the same.

  The size of the httpds running mod_speedycgi, plus the size of speedycgi
  perl processes is significantly smaller than the total size of the httpd's
  running mod_perl.

That would be true if you only ran one mod_perl'd httpd, but can you
give a better comparison to the usual setup for a busy site where
you run a non-mod_perl lightweight front end and let mod_rewrite
decide what is proxied through to the larger mod_perl'd backend,
letting apache decide how many backends you need to have
running?

  The reason for this is that only a handful of perl processes are required by
  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
  in all of the httpds.

I always see at least a 10-1 ratio of front-to-back end httpd's when serving
over the internet.   One effect that is difficult to benchmark is that clients
connecting over the internet are often slow and will hold up the process
that is delivering the data even though the processing has been completed.
The proxy approach provides some buffering and allows the backend
to move on more quickly.  Does speedycgi do the same?

  Les Mikesell
[EMAIL PROTECTED]





Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2001-01-04 Thread Perrin Harkins

Hi Sam,

I think we're talking in circles here a bit, and I don't want to
diminish the original point, which I read as "MRU process selection is a
good idea for Perl-based servers."  Your tests showed that this was
true.

Let me just try to explain my reasoning.  I'll define a couple of my
base assumptions, in case you disagree with them.

- Slices of CPU time doled out by the kernel are very small - so small
that processes can be considered concurrent, even though technically
they are handled serially.
- A set of requests can be considered "simultaneous" if they all arrive
and start being handled in a period of time shorter than the time it
takes to service a request.

Operating on these two assumptions, I say that 10 simultaneous requests
will require 10 interpreters to service them.  There's no way to handle
them with fewer, unless you queue up some of the requests and make them
wait.

I also say that if you have a top limit of 10 interpreters on your
machine because of memory constraints, and you're sending in 10
simultaneous requests constantly, all interpreters will be used all the
time.  In that case it makes no difference to the throughput whether you
use MRU or LRU.

  What you say would be true if you had 10 processors and could get
  true concurrency.  But on single-cpu systems you usually don't need
  10 unix processes to handle 10 requests concurrently, since they get
  serialized by the kernel anyways.

I think the CPU slices are smaller than that.  I don't know much about
process scheduling, so I could be wrong.  I would agree with you if we
were talking about requests that were coming in with more time between
them.  Speedycgi will definitely use fewer interpreters in that case.

  I found that setting MaxClients to 100 stopped the paging.  At concurrency
  level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
  Even at higher levels (300), they were comparable.

That's what I would expect if both systems have a similar limit of how
many interpreters they can fit in RAM at once.  Shared memory would help
here, since it would allow more interpreters to run.

By the way, do you limit the number of SpeedyCGI processes as well?  It
seems like you'd have to, or they'd start swapping too when you throw
too many requests in.

  But, to show that the underlying problem is still there, I then changed
  the hello_world script and doubled the amount of un-shared memory.
  And of course the problem then came back for mod_perl, although speedycgi
  continued to work fine.  I think this shows that mod_perl is still
  using quite a bit more memory than speedycgi to provide the same service.

I'm guessing that what happened was you ran mod_perl into swap again. 
You need to adjust MaxClients when your process size changes
significantly.
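
(For anyone tuning this by hand, the usual back-of-the-envelope is just
available RAM divided by per-process unshared size.  All the numbers in
this sketch are made up for illustration:)

    #!/usr/bin/perl -w
    use strict;

    # Hypothetical figures - measure your own box before trusting any of this.
    my $total_ram_mb        = 256;  # physical RAM
    my $reserved_mb         = 64;   # OS, database, frontend httpds, etc.
    my $unshared_mb_per_kid = 6;    # unshared memory per mod_perl child

    my $max_clients = int( ($total_ram_mb - $reserved_mb) / $unshared_mb_per_kid );
    print "MaxClients $max_clients\n";   # prints "MaxClients 32"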

 I believe that with speedycgi you don't have to lower the MaxClients
 setting, because it's able to handle a larger number of clients, at
 least in this test.
  
   Maybe what you're seeing is an ability to handle a larger number of
   requests (as opposed to clients) because of the performance benefit I
   mentioned above.
 
  I don't follow.

When not all processes are in use, I think Speedy would handle requests
more quickly, which would allow it to handle n requests in less time
than mod_perl.  Saying it handles more clients implies that the requests
are simultaneous.  I don't think it can handle more simultaneous
requests.

   Are the speedycgi+Apache processes smaller than the mod_perl
   processes?  If not, the maximum number of concurrent requests you can
   handle on a given box is going to be the same.
 
  The size of the httpds running mod_speedycgi, plus the size of speedycgi
  perl processes is significantly smaller than the total size of the httpd's
  running mod_perl.
 
  The reason for this is that only a handful of perl processes are required by
  speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
  in all of the httpds.

I think this is true at lower levels, but not when the number of
simultaneous requests gets up to the maximum that the box can handle. 
At that point, it's a question of how many interpreters can fit in
memory.  I would expect the size of one Speedy + one httpd to be about
the same as one mod_perl/httpd when no memory is shared.  With sharing,
you'd be able to run more processes.
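
(On Linux you can get a rough idea of how much of a given process is
actually shared by reading /proc/<pid>/statm, whose first fields are
size, resident and shared, in pages.  This is a generic sketch of mine,
not something either package ships:)

    #!/usr/bin/perl -w
    use strict;

    # Report total, resident, shared and unshared memory for one process.
    # Assumes a 4 KB page size, which is typical for x86 Linux.
    my $pid = shift || $$;
    open(STATM, "/proc/$pid/statm") or die "can't read /proc/$pid/statm: $!";
    my ($size, $resident, $shared) = split ' ', scalar <STATM>;
    close STATM;
    my $kb = 4;
    printf "pid %d: %d KB total, %d KB resident, %d KB shared, %d KB unshared\n",
           $pid, $size * $kb, $resident * $kb, $shared * $kb,
           ($resident - $shared) * $kb;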

- Perrin



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Perrin Harkins

Gunther Birznieks wrote:
 Sam just posted this to the speedycgi list just now.
[...]
 The underlying problem in mod_perl is that apache likes to spread out
 web requests to as many httpd's, and therefore as many mod_perl interpreters,
 as possible using an LRU selection process for picking httpd's.

Hmmm... this doesn't sound right.  I've never looked at the code in
Apache that does this selection, but I was under the impression that the
choice of which process would handle each request was an OS dependent
thing, based on some sort of mutex.

Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html

Doesn't that appear to be saying that whichever process gets into the
mutex first will get the new request?  In my experience running
development servers on Linux it always seemed as if the requests
would continue going to the same process until a request came in when
that process was already busy.

As I understand it, the implementation of "wake-one" scheduling in the
2.4 Linux kernel may affect this as well.  It may then be possible to
skip the mutex and use unserialized accept for single socket servers,
which will definitely hand process selection over to the kernel.

 The problem is that at a high concurrency level, mod_perl is using lots
 and lots of different perl-interpreters to handle the requests, each
 with its own un-shared memory.  It's doing this due to its LRU design.
 But with SpeedyCGI's MRU design, only a few speedy_backends are being used
 because as much as possible it tries to use the same interpreter over and
 over and not spread out the requests to lots of different interpreters.
 Mod_perl is using lots of perl-interpreters, while speedycgi is only using
 a few.  mod_perl is requiring that lots of interpreters be in memory in
 order to handle the requests, whereas speedy only requires a small number
 of interpreters to be in memory.

This test - building up unshared memory in each process - is somewhat
suspect since in most setups I've seen, there is a very significant
amount of memory being shared between mod_perl processes.  Regardless,
the explanation here doesn't make sense to me.  If we assume that each
approach is equally fast (as Sam seems to say earlier in his message)
then it should take an equal number of speedycgi and mod_perl processes
to handle the same concurrency.

That leads me to believe that what's really happening here is that
Apache is pre-forking a bit over-zealously in response to a sudden surge
of traffic from ab, and thus has extra unused processes sitting around
waiting, while speedycgi is avoiding this situation by waiting for
someone to try and use the processes before forking them (i.e. no
pre-forking).  The speedycgi way causes a brief delay while new
processes fork, but doesn't waste memory.  Does this sound like a
plausible explanation to folks?

This is probably all a moot point on a server with a properly set
MaxClients and Apache::SizeLimit that will not go into swap.  I would
expect mod_perl to have the advantage when all processes are
fully-utilized because of the shared memory.  It would be cool if
speedycgi could somehow use a parent process model and get the shared
memory benefits too.  Speedy seems like it might be more attractive to
ISPs, and it would be nice to increase interoperability between the two
projects.

- Perrin



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Gunther Birznieks

I think you could actually make speedycgi even better for shared memory 
usage by creating a special directive which would tell speedycgi to 
preload a series of modules, and then to fork that preloaded "master" 
backend process and hand control over to the forked copy whenever it 
needs to launch a new process.

Then speedy would potentially have the best of both worlds.
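
(The general idea - compile everything once in a parent, then fork so the
compiled code starts out shared copy-on-write - can be sketched in plain
Perl.  This only illustrates the concept; SpeedyCGI has no such directive
today, and the preloaded modules below are just placeholders:)

    #!/usr/bin/perl -w
    use strict;

    # "Master" process: pull in the heavy modules exactly once...
    use CGI ();         # placeholders for whatever the scripts actually need
    use POSIX ();

    # ...then fork workers.  Each child starts with the parent's compiled
    # code shared copy-on-write; it only pays for pages it later writes to.
    for (1 .. 3) {
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {
            print "worker $$ ready, modules already compiled\n";
            exit 0;     # a real worker would sit in a request loop here
        }
    }
    wait() for 1 .. 3;  # parent reaps its children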

Sorry I cross posted your thing. But I do think it is a problem of mod_perl 
also, and I am happily using speedycgi in production on at least one 
commercial site where mod_perl could not be installed so easily because of 
infrastructure issues.

I believe your mechanism of round robining among MRU perl interpreters is 
actually also accomplished by ActiveState's PerlEx (based on 
Apache::Registry but using multithreaded IIS and pool of Interpreters). A 
method similar to this will be used in Apache 2.0 when Apache is 
multithreaded and therefore can control within program logic which Perl 
interpreter gets called from a pool of Perl interpreters.

It just isn't so feasible right now in Apache 1.0 to do this. And sometimes 
people forget that mod_perl came about primarily for writing handlers in 
Perl, not as an application environment, although it is very good for the 
latter as well.

I think SpeedyCGI needs more advocacy from the mod_perl group because put 
simply speedycgi is way easier to set up and use than mod_perl and will 
likely get more PHP people using Perl again. If more people rely on Perl 
for their fast websites, then you will get more people looking for more 
power, and by extension more people using mod_perl.

Whoops... here we go with the advocacy thing again.

Later,
Gunther

At 02:50 AM 12/21/2000 -0800, Sam Horrocks wrote:
   Gunther Birznieks wrote:
Sam just posted this to the speedycgi list just now.
   [...]
The underlying problem in mod_perl is that apache likes to spread out
 web requests to as many httpd's, and therefore as many mod_perl interpreters,
 as possible using an LRU selection process for picking httpd's.
  
   Hmmm... this doesn't sound right.  I've never looked at the code in
   Apache that does this selection, but I was under the impression that the
   choice of which process would handle each request was an OS dependent
   thing, based on some sort of mutex.
  
   Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html
  
   Doesn't that appear to be saying that whichever process gets into the
   mutex first will get the new request?

  I would agree that whichever process gets into the mutex first will get
  the new request.  That's exactly the problem I'm describing.  What you
  are describing here is first-in, first-out behaviour which implies LRU
  behaviour.

  Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
  2 finishes and requests the mutex, then 3 finishes and requests the mutex.
  So when the next three requests come in, they are handled in the same order:
  1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.

   In my experience running
   development servers on Linux it always seemed as if the requests
   would continue going to the same process until a request came in when
   that process was already busy.

  No, they don't.  They go round-robin (or LRU as I say it).

  Try this simple test script:

  use CGI;
  my $cgi = CGI->new;
  print $cgi->header();
  print "mypid=$$\n";

  With mod_perl you constantly get different pids.  With mod_speedycgi you
  usually get the same pid.  This is a really good way to see the LRU/MRU
  difference that I'm talking about.

  Here's the problem - the mutex in apache is implemented using a lock
  on a file.  It's left up to the kernel to decide which process to give
  that lock to.

  Now, if you're writing a unix kernel and implementing this file locking code,
  what implementation would you use?  Well, this is a general purpose thing -
  you have 100 or so processes all trying to acquire this file lock.  You could
  give out the lock randomly or in some ordered fashion.  If I were writing
  the kernel I would give it out in a round-robin fashion (or the
  least-recently-used process as I referred to it before).  Why?  Because
  otherwise one of those processes may starve waiting for this lock - it may
  never get the lock unless you do it in a fair (round-robin) manner.

  The kernel doesn't know that all these httpd's are exactly the same.
  The kernel is implementing a general-purpose file-locking scheme and
  it doesn't know whether one process is more important than another.  If
  it's not fair about giving out the lock a very important process might
  starve.

  Take a look at fs/locks.c (I'm looking at linux 2.3.46).  In there is the
  comment:

  /* Insert waiter into blocker's block list.
   * We use a circular list so that processes can be easily woken up in
   * the order they blocked. The documentation 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Keith G. Murphy

Perrin Harkins wrote:

[cut]
 
 Doesn't that appear to be saying that whichever process gets into the
 mutex first will get the new request?  In my experience running
 development servers on Linux it always seemed as if the requests
 would continue going to the same process until a request came in when
 that process was already busy.
 
Is it possible that the persistent connections utilized by HTTP 1.1 just
made it look that way?  That would happen if the clients were MSIE.

Even recent Netscape browsers only use 1.0, IIRC.

(I was recently perplexed by differing performance between MSIE and NS
browsers hitting my system until I realized this.)



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

  Gunther Birznieks wrote:
   Sam just posted this to the speedycgi list just now.
  [...]
   The underlying problem in mod_perl is that apache likes to spread out
   web requests to as many httpd's, and therefore as many mod_perl interpreters,
   as possible using an LRU selection process for picking httpd's.
  
  Hmmm... this doesn't sound right.  I've never looked at the code in
  Apache that does this selection, but I was under the impression that the
  choice of which process would handle each request was an OS dependent
  thing, based on some sort of mutex.
  
  Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html
  
  Doesn't that appear to be saying that whichever process gets into the
  mutex first will get the new request?

 I would agree that whichever process gets into the mutex first will get
 the new request.  That's exactly the problem I'm describing.  What you
 are describing here is first-in, first-out behaviour which implies LRU
 behaviour.

 Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
 2 finishes and requests the mutex, then 3 finishes and requests the mutex.
 So when the next three requests come in, they are handled in the same order:
 1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.

  In my experience running
  development servers on Linux it always seemed as if the requests
  would continue going to the same process until a request came in when
  that process was already busy.

 No, they don't.  They go round-robin (or LRU as I say it).

 Try this simple test script:

 use CGI;
 my $cgi = CGI->new;
 print $cgi->header();
 print "mypid=$$\n";

 With mod_perl you constantly get different pids.  With mod_speedycgi you
 usually get the same pid.  This is a really good way to see the LRU/MRU
 difference that I'm talking about.

 Here's the problem - the mutex in apache is implemented using a lock
 on a file.  It's left up to the kernel to decide which process to give
 that lock to.

 Now, if you're writing a unix kernel and implementing this file locking code,
 what implementation would you use?  Well, this is a general purpose thing -
 you have 100 or so processes all trying to acquire this file lock.  You could
 give out the lock randomly or in some ordered fashion.  If I were writing
 the kernel I would give it out in a round-robin fashion (or the
 least-recently-used process as I referred to it before).  Why?  Because
 otherwise one of those processes may starve waiting for this lock - it may
 never get the lock unless you do it in a fair (round-robin) manner.

 The kernel doesn't know that all these httpd's are exactly the same.
 The kernel is implementing a general-purpose file-locking scheme and
 it doesn't know whether one process is more important than another.  If
 it's not fair about giving out the lock a very important process might
 starve.

 Take a look at fs/locks.c (I'm looking at linux 2.3.46).  In there is the
 comment:

 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
  * it seems like the reasonable thing to do.
  */
 static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter)

  As I understand it, the implementation of "wake-one" scheduling in the
  2.4 Linux kernel may affect this as well.  It may then be possible to
  skip the mutex and use unserialized accept for single socket servers,
  which will definitely hand process selection over to the kernel.

 If the kernel implemented the queueing for multiple accepts using a LIFO
 instead of a FIFO and apache used this method instead of file locks,
 then that would probably solve it.

 Just found this on the net on this subject:
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0455.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0453.html
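
 (For readers who haven't seen it, the file-lock mutex described above
 boils down to something like the sketch below - not Apache's actual
 code; $listen stands for a listening socket that isn't shown:)

     #!/usr/bin/perl -w
     use strict;
     use Fcntl qw(:flock);

     # Each httpd child loops: grab the lock file, accept one connection,
     # release the lock.  Which blocked child gets the lock next is decided
     # entirely by the kernel's file-locking code, as discussed above.
     open(MUTEX, ">/tmp/accept.lock") or die "lock file: $!";
     while (1) {
         flock(MUTEX, LOCK_EX) or die "flock: $!";  # queue up with the other children
         # my $conn = $listen->accept;              # only the lock holder accepts
         flock(MUTEX, LOCK_UN);
         # ... handle the connection, then come back for the mutex ...
         last;   # demo only; a real child never leaves this loop
     }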

   The problem is that at a high concurrency level, mod_perl is using lots
   and lots of different perl-interpreters to handle the requests, each
   with its own un-shared memory.  It's doing this due to its LRU design.
   But with SpeedyCGI's MRU design, only a few speedy_backends are being used
   because as much as possible it tries to use the same interpreter over and
   over and not spread out the requests to lots of different interpreters.
   Mod_perl is using lots of perl-interpreters, while speedycgi is only using
   a few.  mod_perl is requiring that lots of interpreters be in memory in
   order to handle the requests, whereas speedy only requires a small number
   of interpreters to be in memory.
  
  This test - building up unshared memory in each process - is somewhat
  suspect since in most setups I've seen, there is a very significant
  amount of memory being shared between mod_perl processes.

 My message and testing concerns un-shared memory only.  If all of your memory
 is shared, then there shouldn't be a 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

  Folks, your discussion is not short of wrong statements that can be easily
  proved, but I don't find it useful.

 I don't follow.  Are you saying that my conclusions are wrong, but
 you don't want to bother explaining why?
 
 Would you agree with the following statement?

Under apache-1, speedycgi scales better than mod_perl with
scripts that contain un-shared memory 



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Gunther Birznieks

At 09:16 PM 12/21/00 +0100, Stas Bekman wrote:
[much removed]

So the moment mod_perl 2.0 hits the shelves, this possible benefit
of speedycgi over mod_perl becomes irrelevant. I think this more or less
summarizes this thread.
I think you are right about the summarization. However, I also think it's 
unfair for people here to pin too many hopes on mod_perl 2.0.

First Apache 2.0 has to be fully released. It's still in Alpha! Then, 
mod_perl 2.0 has to be released. I haven't seen any realistic timelines 
that indicate to me that these will be released and stable for production 
use in only a few months time. And Apache 2.0 has been worked on for years. 
I first saw a talk on Apache 2.0's architecture at the first ApacheCon 2 
years ago! To be fair, back then they were using Mozilla's NPR which I 
think they learned from, threw away, and rewrote from scratch after all (to 
become APR). But still, the point is that it's been a long time and 
probably will be a while yet.

Who in their right mind would pin their business or production database on 
the hope that mod_perl 2.0 comes out in a few months? I don't think anyone 
would. Sam has a solution that works now, is open source, and provides 
some benefits for certain types of web applications where mod_perl and 
apache are not as efficient.

As people interested in Perl, we should be embracing these alternatives, not 
telling people to wait for new versions of software that may not come out soon.

If there is a problem with mod_perl advocacy, it's that it is precisely too 
mod_perl centric. Mod_perl is a niche crowd which has a high learning 
curve. I think the technology mod_perl offers is great, but as has been 
said before, the problem is that people are leaving Perl for PHP. If 
more people had easier ways to implement their simple apps in Perl that 
were still as fast as PHP, fewer people would go to PHP.

Those Perl people would eventually discover mod_perl's power as they 
require it, and then they would take the step to "upgrade" to the power of 
handlers away from the "missing link".

But without that "missing link" to make things easy for people to move from 
PHP to Perl, then Perl will miss something very crucial to maintaining its 
standing as the "defacto language for Web applications".

3 years ago, I think it would have been accurate to say Perl apps drove 95% of the 
dynamic web. Sadly, I believe (anecdotally) that this is no longer true.

SpeedyCGI is not "THE" missing link, but I see it as a crucial part of this 
link between newbies and mod_perl. This is why I believe that mod_perl and 
its documentation should have a section (even if tiny) on this stuff, so 
that people will know that if they find mod_perl too hard, that there are 
alternatives that are less powerful, yet provide at least enough power to 
beat PHP.

I also see SpeedyCGI as already being more ISP-friendly than mod_perl 
for hosting casual users of Perl. Different apps use a 
different backend engine by default. So the problem with virtual hosts 
screwing each other over by accident is gone for the casual user. There are 
still some needs for improvement (eg memory is likely still an issue with 
different backends)...

Anyway, these are just my feelings. I really shouldn't be spending time on 
posting this as I have some deadlines to meet. But I felt they were still 
important points to make that I think some people may be potentially 
missing here. :)





Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

I've put your suggestion on the todo list.  It certainly wouldn't hurt to
have that feature, though I think memory sharing becomes a much much smaller
issue once you switch to MRU scheduling.

At the moment I think SpeedyCGI has more pressing needs though - for
example multiple scripts in a single interpreter, and an NT port.


  I think you could actually make speedycgi even better for shared memory 
  usage by creating a special directive which would indicate to speedycgi to 
  preload a series of modules. And then to tell speedycgi to do forking of 
  that "master" backend preloaded module process and hand control over to 
  that forked process whenever you need to launch a new process.
  
  Then speedy would potentially have the best of both worlds.
  
  Sorry I cross posted your thing. But I do think it is a problem of mod_perl 
  also, and I am happily using speedycgi in production on at least one 
  commercial site where mod_perl could not be installed so easily because of 
  infrastructure issues.
  
  I believe your mechanism of round robining among MRU perl interpreters is 
  actually also accomplished by ActiveState's PerlEx (based on 
  Apache::Registry but using multithreaded IIS and pool of Interpreters). A 
  method similar to this will be used in Apache 2.0 when Apache is 
  multithreaded and therefore can control within program logic which Perl 
  interpreter gets called from a pool of Perl interpreters.
  
  It just isn't so feasible right now in Apache 1.0 to do this. And sometimes 
  people forget that mod_perl came about primarily for writing handlers in 
  Perl not as an application environment although it is very good for the 
  latter as well.
  
  I think SpeedyCGI needs more advocacy from the mod_perl group because put 
  simply speedycgi is way easier to set up and use than mod_perl and will 
  likely get more PHP people using Perl again. If more people rely on Perl 
  for their fast websites, then you will get more people looking for more 
  power, and by extension more people using mod_perl.
  
  Whoops... here we go with the advocacy thing again.
  
  Later,
  Gunther
  
  At 02:50 AM 12/21/2000 -0800, Sam Horrocks wrote:
 Gunther Birznieks wrote:
  Sam just posted this to the speedycgi list just now.
 [...]
  The underlying problem in mod_perl is that apache likes to spread out
  web requests to as many httpd's, and therefore as many mod_perl interpreters,
  as possible using an LRU selection process for picking httpd's.

 Hmmm... this doesn't sound right.  I've never looked at the code in
 Apache that does this selection, but I was under the impression that the
 choice of which process would handle each request was an OS dependent
 thing, based on some sort of mutex.

 Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html

 Doesn't that appear to be saying that whichever process gets into the
 mutex first will get the new request?
  
I would agree that whichever process gets into the mutex first will get
the new request.  That's exactly the problem I'm describing.  What you
are describing here is first-in, first-out behaviour which implies LRU
behaviour.
  
Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
2 finishes and requests the mutex, then 3 finishes and requests the mutex.
So when the next three requests come in, they are handled in the same order:
1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.
  
 In my experience running
 development servers on Linux it always seemed as if the requests
 would continue going to the same process until a request came in when
 that process was already busy.
  
No, they don't.  They go round-robin (or LRU as I say it).
  
Try this simple test script:
  
use CGI;
my $cgi = CGI->new;
print $cgi->header();
print "mypid=$$\n";
  
With mod_perl you constantly get different pids.  With mod_speedycgi you
usually get the same pid.  This is a really good way to see the LRU/MRU
difference that I'm talking about.
  
Here's the problem - the mutex in apache is implemented using a lock
on a file.  It's left up to the kernel to decide which process to give
that lock to.
  
Now, if you're writing a unix kernel and implementing this file locking code,
what implementation would you use?  Well, this is a general purpose thing -
you have 100 or so processes all trying to acquire this file lock.  You could
give out the lock randomly or in some ordered fashion.  If I were writing
the kernel I would give it out in a round-robin fashion (or the
least-recently-used process as I referred to it before).  Why?  Because
otherwise one of those processes may starve waiting for this lock - it may
never get the lock unless you do it in a fair (round-robin) manner.
  
The kernel doesn't know 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

I really wasn't trying to work backwards from a benchmark.  It was
more of an analysis of the design, and the benchmarks bore it out.
It's sort of like coming up with a theory in science - if you can't get
any experimental data to back up the theory, you're in big trouble.
But if you can at least point out the existence of some experiments
that are consistent with your theory, it means your theory may be true.

The best would be to have other people do the same tests and see if they
see the same trend.  If no-one else sees this trend, then I'd really
have to re-think my analysis.

Another way to look at it - as you say below MRU is going to be in
mod_perl-2.0.  And what is the reason for that?  If there's no performance
difference between LRU and MRU, why would the author bother to switch
to MRU?  So, I'm saying there must be some benchmarks somewhere that
point out this difference - if there weren't any real-world difference,
why bother even implementing MRU?

I claim that my benchmarks point out this difference between MRU over
LRU, and that's why my benchmarks show better performance on speedycgi
than on mod_perl.

Sam

- SpeedyCGI uses MRU, mod_perl-2 will eventually use MRU.  
  On Thu, 21 Dec 2000, Sam Horrocks wrote:
  
 Folks, your discussion is not short of wrong statements that can be easily
 proved, but I don't find it useful.
   
I don't follow.  Are you saying that my conclusions are wrong, but
you don't want to bother explaining why?

Would you agree with the following statement?
   
   Under apache-1, speedycgi scales better than mod_perl with
   scripts that contain un-shared memory 
  
  I don't know. It's easy to give a simple example and claim to be better.
  So far, whoever tried to show by benchmarks that he is better was most often
  proved wrong, since the technologies in question have so many
  features that I believe no benchmark will prove any of them absolutely
  superior or inferior. Therefore I said that trying to claim that your grass
  is greener is doomed to fail if someone has time on his hands to prove you
  wrong. Well, we don't have this time.
  
  Therefore I'm not trying to prove you wrong or right. Gunther's point of
  the original forward was to show things that mod_perl may need to adopt to
  make it better. Doug already explained in his paper that the MRU approach
  has been already implemented in mod_perl-2.0. You could read it in the
  link that I've attached and the quote that I've quoted.
  
  So your conclusions about MRU are correct and we have it implemented
  already (well very soon now :). I apologize if my original reply was
  misleading.
  
  I'm not telling that benchmarks are bad. What I'm telling is that it's
  very hard to benchmark things which are different. You benefit the most
  from the benchmarking when you take the initial code/product, benchmark
  it, then you try to improve the code and benchmark again to see whether it
  gave you any improvement. That's the area where benchmarks rule and
  they are fair, because you test the same thing. Well, you could read more
  of my rambling about benchmarks in the guide.
  
  So if you find some cool features in other technologies that mod_perl
  might adopt and benefit from, don't hesitate to tell the rest of the gang.
  
  
  
  Something that I'd like to comment on:
  
  I find it a bad practice to quote one sentence from a person's post and
  follow up on it. Someone from the list has sent me this email (SB == me):
  
  SB I don't find it useful
  
  and follow up. Why not use a single letter:
  
  SB I
  
  and follow up? It's so much easier to flame on things taken out of their
  context.
  
  It has happened more than once that people did this to each other here on
  the list, and I think I did too. So please be more careful when taking things
  out of context. Thanks a lot, folks!
  
  Cheers...
  
  _
  Stas Bekman  JAm_pH --   Just Another mod_perl Hacker
  http://stason.org/   mod_perl Guide  http://perl.apache.org/guide 
  mailto:[EMAIL PROTECTED]   http://apachetoday.com http://logilune.com/
  http://singlesheaven.com http://perl.apache.org http://perlmonth.com/  
  



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl withscripts that contain un-shared memory

2000-12-21 Thread Ken Williams

[EMAIL PROTECTED] (Perrin Harkins) wrote:
Hi Sam,
[snip]
  I am saying that since SpeedyCGI uses MRU to allocate requests to perl
  interpreters, it winds up using a lot fewer interpreters to handle the
  same number of requests.

What I was saying is that it doesn't make sense for one to need fewer
interpreters than the other to handle the same concurrency.  If you have
10 requests at the same time, you need 10 interpreters.  There's no way
speedycgi can do it with fewer, unless it actually makes some of them
wait.

Well, there is one way, though it's probably not a huge factor.  If
mod_perl indeed manages the child-farming in such a way that too much
memory is used, then each process might slow down as memory becomes
scarce, especially if you start swapping.  Then if each request takes
longer, your child pool is more saturated with requests, and you might
have to fork a few more kids.

So in a sense, I think you're both correct.  If "concurrency" means the
number of requests that can be handled at once, both systems are
necessarily (and trivially) equivalent.  This isn't a very useful
measurement, though; a more useful one is how many children (or perhaps
how much memory) will be necessary to handle a given number of incoming
requests per second, and with this metric the two systems could perform
differently.
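
(That second metric lends itself to a simple back-of-the-envelope
estimate: on average, busy children = request rate x average service
time.  The numbers in this sketch are invented for illustration:)

    #!/usr/bin/perl -w
    use strict;

    # Hypothetical workload - substitute your own measurements.
    my $requests_per_sec = 40;
    my $avg_service_secs = 0.25;  # includes time spent dribbling output to slow clients
    my $mb_per_child     = 6;     # unshared memory per interpreter

    my $busy_children = $requests_per_sec * $avg_service_secs;
    printf "about %.0f children busy on average, roughly %.0f MB unshared\n",
           $busy_children, $busy_children * $mb_per_child;
    # prints: about 10 children busy on average, roughly 60 MB unshared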


  ------
  Ken Williams Last Bastion of Euclidity
  [EMAIL PROTECTED]The Math Forum