Re: [Libevent-users] sensible thread-safe signal handling proposal

2007-11-04 Thread Christopher Layne
On Sun, Nov 04, 2007 at 12:15:56PM -0800, Steven Grimm wrote:
 On Nov 4, 2007, at 8:13 AM, Marc Lehmann wrote:
 This would create additional loops (event_bases). The difference is  
 these cannot handle signals (or child watchers) at all, with the  
 default loop
 being the only one to do signal handling.
 This seems like a totally sane approach to me. Having multiple loops  
 is a big performance win for some applications (e.g., memcached in  
 multithreaded mode), so making the behavior a bit more consistent is a  
 good thing.

It's only a performance win when the cost of context switches and cache
stomping, caused by multiple threads each cycling within their own context,
does not outweigh the latency of a model using fewer threads, or even just
one.

Consider a room with 20 people in it and a single door. The goal is to
hand them a football as a new football is dropped off the assembly
line and have them exit the door. You could throw them all a new football
right as it comes off the line and have them immediately rush for the door -
resulting in a log jam that one has to stop tending the assembly line to
handle. You then head back to the line and begin the patterned task of
throwing footballs to workers as fast as you can - only to have the log jam
repeat itself.

The only way to solve this efficiently is to have fewer people try to exit
the door at once, or add more doors (CPUs).

 Now if only there were a way to wake just one thread up when input  
 arrives on a descriptor being monitored by multiple threads... But  
 that isn't supported by any of the underlying poll mechanisms as far  
 as I can tell.

It isn't typically supported because it's not a particularly useful or
efficient path to head down in the first place.

Thread pools, incredibly useful and pretty much the de facto standard in
threaded code, have their own abstraction limits as well.

Setting up a thread pool, an inherently asynchronous and unordered collection
of contexts, to asynchronously process an ordered stream of data (unless
your protocol has no sequence, which I doubt), presumably in the name of
performance, is a far more complex and troublesome design than it needs to
be. It's anchored somewhat to the "every thread can do anything" school of
thought, which has many hidden costs.

The issue in itself is having multiple threads monitor the *same* fd via any
kind of wait mechanism. It's short-circuiting application layers so that a
thread (*any* thread in that pool) can immediately process new data. I think
it would be much more structured, less complex (i.e. better performance in
the long run anyway), and a cleaner design to have a set number of threads
(or even just one) handle the controller task of tending to new network
events: push them onto a per-connection PDU queue, or pre-process them in
some form or fashion, signal a condition variable, and let the previously
mentioned thread pool handle them in an ordered fashion. Having a group of
threads listen to the same fd throws our football manager out entirely and
turns it into a smash-and-grab for new footballs. There's still the door to
get through.
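
For concreteness, the controller-plus-pool arrangement described above might
look roughly like the following pthreads sketch. It is illustrative only and
makes several assumptions: the names (struct pdu, work_queue, run_pipeline)
are hypothetical, a single shared bounded queue stands in for the
per-connection PDU queues, and the "controller" just generates dummy PDUs
instead of reading from the network.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP   64
#define NUM_WORKERS 4

/* Hypothetical PDU: only a connection id and a sequence number. */
struct pdu { int conn_id; int seq; };

struct work_queue {
    struct pdu items[QUEUE_CAP];
    int head, count;
    bool closed;
    int handled;                 /* PDUs processed so far, guarded by lock */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void queue_init(struct work_queue *q) {
    q->head = q->count = q->handled = 0;
    q->closed = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Controller side: enqueue one PDU, then signal a waiting worker. */
static void queue_push(struct work_queue *q, struct pdu p) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = p;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Controller side: no more PDUs will arrive; release every waiting worker. */
static void queue_close(struct work_queue *q) {
    pthread_mutex_lock(&q->lock);
    q->closed = true;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Worker side: returns false once the queue is closed and drained. */
static bool queue_pop(struct work_queue *q, struct pdu *out) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return true;
}

static void *worker(void *arg) {
    struct work_queue *q = arg;
    struct pdu p;
    while (queue_pop(q, &p)) {
        /* Real PDU handling would go here; the sketch only counts. */
        pthread_mutex_lock(&q->lock);
        q->handled++;
        pthread_mutex_unlock(&q->lock);
    }
    return NULL;
}

/* Stand-in for the controller thread: feeds num_pdus dummy PDUs to the pool
 * and returns how many the workers handled. */
int run_pipeline(int num_pdus) {
    struct work_queue q;
    pthread_t workers[NUM_WORKERS];

    queue_init(&q);
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&workers[i], NULL, worker, &q);
        assert(rc == 0);
    }
    for (int i = 0; i < num_pdus; i++)
        queue_push(&q, (struct pdu){ .conn_id = i % 8, .seq = i / 8 });
    queue_close(&q);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return q.handled;
}
```

Note that with one shared queue, two PDUs from the same connection can still
be picked up by different workers at the same time. Enforcing per-connection
ordering would need one queue per connection, with a connection handed to
only one worker at a time, which this sketch leaves out.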

2007-11-04 Thread Adrian Chadd
On Sun, Nov 04, 2007, Steven Grimm wrote:

 Would this be for listen sockets, or for general read/write IO on an  
 Specifically for a mixed TCP- and UDP-based protocol where any thread  
 is equally able to handle an incoming request on the UDP socket, but  
 TCP sockets are bound to particular threads.

Makes sense. Doesn't the Solaris event ports system handle this? I haven't
checked in depth.

It sounds like something that kqueue could be extended to do relatively

What about multiple threads blocking on the same UDP socket? Do multiple
threads wake up when IO arrives? Or just one?
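
One way to approach that question is to measure it. The sketch below is a
hypothetical experiment, not code from memcached or libevent: several threads
poll() the same datagram socket, one datagram is sent, and each thread that
wakes tries a non-blocking recv(). To keep it runnable without any network
setup, it assumes an AF_UNIX datagram socketpair as a stand-in for the UDP
socket; results on a real UDP socket, or on a different kernel, may differ.
The names (race_stats, datagram_race) are made up for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_THREADS 4

struct race_stats {
    int fd;                 /* shared receiving socket */
    pthread_mutex_t lock;
    int woke;               /* threads whose poll() reported readable */
    int winners;            /* threads that actually received the datagram */
};

static void *poll_and_read(void *arg) {
    struct race_stats *s = arg;
    struct pollfd pfd = { .fd = s->fd, .events = POLLIN };
    char buf[64];

    /* Wait up to one second for the shared socket to become readable. */
    if (poll(&pfd, 1, 1000) <= 0)
        return NULL;        /* timed out or failed: never woken for input */

    /* Non-blocking read: losers of the race get EAGAIN/EWOULDBLOCK. */
    ssize_t n = recv(s->fd, buf, sizeof buf, MSG_DONTWAIT);

    pthread_mutex_lock(&s->lock);
    s->woke++;
    if (n >= 0)
        s->winners++;
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Sends one datagram to a socket polled by NUM_THREADS threads.
 * Returns how many threads received it; *woke_out gets the wake-up count. */
int datagram_race(int *woke_out) {
    int sv[2];
    struct race_stats s;
    pthread_t threads[NUM_THREADS];

    int rc = socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
    assert(rc == 0);
    s.fd = sv[0];
    s.woke = s.winners = 0;
    pthread_mutex_init(&s.lock, NULL);

    for (int i = 0; i < NUM_THREADS; i++) {
        rc = pthread_create(&threads[i], NULL, poll_and_read, &s);
        assert(rc == 0);
    }
    usleep(100 * 1000);     /* crude: give the threads time to block in poll() */

    ssize_t sent = send(sv[1], "x", 1, 0);
    assert(sent == 1);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    close(sv[0]);
    close(sv[1]);

    if (woke_out)
        *woke_out = s.woke;
    return s.winners;
}
```

Only one thread can ever receive the single datagram, so the interesting
number is the wake-up count: how many threads returned from poll() only to
lose the race. That count depends on the platform and on timing, so it is
something to measure rather than assume.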


2007-11-04 Thread William Ahern
On Sun, Nov 04, 2007 at 03:18:42PM -0800, Steven Grimm wrote:
 You've just pretty accurately described my initial implementation of  
 thread support in memcached. It worked, but it was both more CPU- 
 intensive and had higher response latency (yes, I actually measured  
 it) than the model I'm using now. The only practical downside of my  
 current implementation is that when there is only one UDP packet  
 waiting to be processed, some CPU time is wasted on the threads that  
 don't end up winning the race to read it. But those threads were idle  
 at that instant anyway (or they wouldn't have been in a position to  
 wake up) so, according to my benchmarking, there doesn't turn out to  
 be an impact on latency. And though I am wasting CPU cycles, my total  
 CPU consumption still ends up being lower than passing messages around  
 between threads.

Is this on Linux? They addressed the stampeding herd problem years ago. If
you dig deep down in the kernel you'll see their waitq implementation for
non-blocking socket work (and lots of other stuff). Only one thread is ever
woken per event.

2007-11-04 Thread Scott Lamb
Christopher Layne wrote:
 On Sun, Nov 04, 2007 at 04:23:01PM -0800, Scott Lamb wrote:
 It wasn't what I expected; I was fully confident at first that the
 thread-pool, work-queue model would be the way to go, since it's one
 I've implemented in many applications in the past. But the numbers said
 Thanks for the case study. To rephrase (hopefully correctly), you tried
 these two models:

 1) one thread polls and puts events on a queue; a bunch of other threads
 pull from the queue. (resulted in high latency, and I'm not too surprised:
 there's an extra context switch before handling any events.)
 So back to this..
 2) a bunch of threads read and handle events independently. (your
 current model.)
 BTW: How does this model somehow exempt itself from said context switching
 issue of the former?

Hmm, William Ahern says that at least on Linux, they only wake one
thread per event. That would explain it.

 Did you also try the so-called leader/follower model, in which the
 thread which does the polling handles the first event and puts the rest
 on a queue, and another thread takes over polling if otherwise idle while
 the first thread is still working? My impression is that this was a widely
 favored model, though I don't know the details of where each performs best.
 Something about this just seems like smoke and mirrors to me. At the end of
 the day we still only have a finite number of CPU cores available to us and
 any amount of playing with the order of things is not going to extract any
 magical *more* throughput out of a given box. Yes, some of these methods
 influence recv/send buffers and have a cascading effect on overall throughput,
 but efficient code and algorithms are going to make the real difference - not
 goofy thread games.
 (and this is coming from someone who *likes* comp.programming.threads)

Oh, I don't know, there is something to be said for not making a handoff
between threads if you can avoid it. You're not going to get more
throughput than n_cores times what you got with one processor, but I'd
expect avoiding context switches and cache bouncing to help you get
closer to that.
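
To make the "no handoff" point concrete, here is a stripped-down sketch of
the leader/follower idea mentioned earlier in the thread. It is a
simplification under stated assumptions: a pipe carrying ints stands in for
the network, holding a mutex stands in for "being the leader", and the leader
takes exactly one event per turn instead of queueing the rest. All names
(lf_pool, run_leader_follower) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 4

struct lf_pool {
    int event_fd;            /* read end of a pipe; each int is one "event" */
    pthread_mutex_t leader;  /* whoever holds this is the current leader */
    pthread_mutex_t stats;
    long sum;                /* result of "handling" the events */
    int handled;
};

static void *lf_thread(void *arg) {
    struct lf_pool *p = arg;
    for (;;) {
        int ev;

        /* Become leader: only the leader waits on the descriptor. */
        pthread_mutex_lock(&p->leader);
        ssize_t n = read(p->event_fd, &ev, sizeof ev);
        /* Step down before processing so a follower can take over waiting. */
        pthread_mutex_unlock(&p->leader);

        if (n != (ssize_t)sizeof ev)
            return NULL;     /* EOF or error: shut this thread down */

        /* Handle the event in the thread that read it: no handoff. */
        pthread_mutex_lock(&p->stats);
        p->sum += ev;
        p->handled++;
        pthread_mutex_unlock(&p->stats);
    }
}

/* Feeds events 1..num_events through the pool. Returns how many were
 * handled; *sum_out receives the sum of the handled event values. */
int run_leader_follower(int num_events, long *sum_out) {
    int fds[2];
    struct lf_pool p;
    pthread_t threads[NUM_THREADS];

    int rc = pipe(fds);
    assert(rc == 0);
    p.event_fd = fds[0];
    p.sum = 0;
    p.handled = 0;
    pthread_mutex_init(&p.leader, NULL);
    pthread_mutex_init(&p.stats, NULL);

    for (int i = 0; i < NUM_THREADS; i++) {
        rc = pthread_create(&threads[i], NULL, lf_thread, &p);
        assert(rc == 0);
    }
    for (int ev = 1; ev <= num_events; ev++) {
        ssize_t w = write(fds[1], &ev, sizeof ev);
        assert(w == (ssize_t)sizeof ev);
    }
    close(fds[1]);           /* EOF tells every thread to exit */

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    close(fds[0]);

    if (sum_out)
        *sum_out = p.sum;
    return p.handled;
}
```

Since only one thread waits on the descriptor at a time, nobody races for the
same event, and the thread that reads an event is the one that handles it.
Whether that actually beats the other two models is a question for
benchmarks; the sketch only shows the structure.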
