Re: cvs commit: apr-serf/docs notes-filter-chains.txt

Aaron Bannert 29 Aug 2002 16:50:50 -0000

[continuation to address other points]

On Thu, Aug 29, 2002 at 01:18:59AM -0700, Justin Erenkrantz wrote:
[...]
 
> > 7) In order to achieve filter implementations that are agnostic of both
> >   push and pull, the filter graph simply takes the data types and applies
> >   them only to the next appropriate filter. That filter then decides one
> >   of four things: 1) to pass the original data, 2) to consume the data
> >   and produce nothing, 3) to consume the data and produce one or more
> >   new data thingies*, or 4) to fail.
> 
> One question is how do we allow for ordering?  Say in one instance we
> want to have filter I before P and another time we want P before I.
> My guess is that they'd take in the same data and produce the
> same output.  (I believe in your scheme they'd be RESPONSE_BODIES.)


Ah, this is an interesting conceptual point. There are two cases here
in my mind:

1) If the data requires ordering, then it's probably not the same
   data and requires a new datatype. This gives us automatic ordering.
2) If the data truly is the same, then it /shouldn't/ matter in
   what order the filters are placed.

I'm not so sure these axioms hold everywhere, so we'll have to explore
the design space a little more before we're sure.
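
To make case 1 concrete, here is a minimal sketch (in Python, purely
for illustration) of how declared input/output types could yield the
ordering on their own. The Filter record, the order_by_type helper and
the type names RAW_BODY, INCLUDED_BODY and FINAL_BODY are all
hypothetical; nothing here is an existing apr-serf API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    name: str
    consumes: str  # name of the data type this filter accepts
    produces: str  # name of the data type this filter emits


def order_by_type(filters, start_type):
    """Chain filters by following produced types; no explicit order is given."""
    by_input = {}
    for f in filters:
        if f.consumes in by_input:
            # Case 2: two filters take the same type, so the types alone
            # cannot say which one runs first.
            other = by_input[f.consumes].name
            raise ValueError(f"{f.name} and {other} both consume {f.consumes}")
        by_input[f.consumes] = f

    chain, current = [], start_type
    while current in by_input:
        f = by_input.pop(current)  # pop so a type cycle cannot loop forever
        chain.append(f.name)
        current = f.produces
    return chain


# Registered in the "wrong" order on purpose; the types sort it out.
filters = [
    Filter("P", "INCLUDED_BODY", "FINAL_BODY"),
    Filter("I", "RAW_BODY", "INCLUDED_BODY"),
]
print(order_by_type(filters, "RAW_BODY"))
```

Two filters that both consume the same type raise an error in this
sketch, which is exactly the spot where the second axiom would have to
hold (or where we'd need some other ordering rule).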


> >  - The types of filters I could imagine being involved here would be:
> > 
> >   1. SOURCE - consumes nothing, produces TRANSACTION
> 
> I'd imagine that it should have produced REQUEST rather than
> TRANSACTION...

I thought about that, but then I thought that one single graph
will deal with both the request parsing and the response generation
within the same "transaction" (again, for lack of a better term).

This is a totally generic filter system. When we're dealing with
HTTP on the server side, I would imagine we'd always start with
a filter graph that looked like this:

SOURCE --> REQUEST_PARSER --> REQUEST_HANDLER --> RESPONSE_WRITER --> SINK

REQUEST_PARSER consumes TRANSACTION, produces REQUEST
REQUEST_HANDLER consumes REQUEST, produces RESPONSE
RESPONSE_WRITER consumes RESPONSE produces TRANSACTION

Would that be a general enough characterization?
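
As a rough illustration of that graph, the sketch below (Python, for
brevity) runs one transaction through three typed filters. The record
fields, the toy parsing and the run() helper are assumptions made up
for this example; SOURCE and SINK are reduced to the value passed in
and the value returned.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    raw: str  # what is on the wire, simplified to text


@dataclass
class Request:
    method: str
    path: str


@dataclass
class Response:
    status: int
    body: str


def request_parser(t: Transaction) -> Request:
    # Consumes TRANSACTION, produces REQUEST.
    method, path = t.raw.split()[:2]
    return Request(method, path)


def request_handler(r: Request) -> Response:
    # Consumes REQUEST, produces RESPONSE.
    if r.method != "GET":
        return Response(405, "method not allowed")
    return Response(200, f"contents of {r.path}")


def response_writer(r: Response) -> Transaction:
    # Consumes RESPONSE, produces TRANSACTION.
    return Transaction(f"HTTP/1.0 {r.status}\r\n\r\n{r.body}")


# Each entry pairs the type a filter consumes with the filter itself.
GRAPH = [
    (Transaction, request_parser),
    (Request, request_handler),
    (Response, response_writer),
]


def run(graph, data):
    for expected_type, filt in graph:
        if not isinstance(data, expected_type):
            raise TypeError(
                f"{filt.__name__} wants {expected_type.__name__}, "
                f"got {type(data).__name__}"
            )
        data = filt(data)
    return data


out = run(GRAPH, Transaction("GET /index.html HTTP/1.0"))
print(repr(out.raw))
```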


> >   2. SOCKET_READER - consumes TRANSACTION, produces HEADERS
> 
> Eek.  I'd see a lot of stuff in between socket_reader and something
> that produces HEADERS, but yeah, we can do handwaving and say that
> we somehow get to HEADERS...

Definitely a lot of stuff, but I'm trying to keep the examples simple
and abstract enough so we can all stay on the same page. As we
move into more details I'm sure we'll better understand what layers
exist within the current server, and we'll be able to detail where
the filter boundaries exist.

> (Note, I'd prefer splitting into both STATUS/RESOURCE and HEADERS.)

Yes, that is probably better. :)

> >   3. RESPONSE_DISPATCHER - consumes HEADERS, produces a bunch of stuff,
> >                            one of which is a RESPONSE_FILE.
> 
> So, the response_dispatch is essentially equivalent to the handler
> in httpd?

Pretty much. Now a module can implement a filter that fits just about
anywhere in the server, which is a huge boon. All the filters have to
do is make sure their input/output types agree with the filter at
that {location, directory, file, etc...}

> >   4. STATIC_FILE_HANDLER - consumes a bunch of things, one of which is a
> >                            RESPONSE_FILE*, produces RESPONSE_HEADERS and
> >                            a FILE_DESCRIPTOR.
> 
> Here's a question: how would something like mod_include or mod_php
> work?  (You brought up serving pages, so...)  Ideally, they don't
> want to take FILE_DESCRIPTOR as input.  They want RESPONSE_CONTENTS
> as input and spit RESPONSE_CONTENTS back out, right?

Right, or maybe they exist at a higher-resolution level. I'm thinking
some of them might want the entire body (as in the case of a chunked-encoder
or a gzip encoder). Others, like mod_php or mod_include, might only
operate properly when an earlier filter in the chain produces SSI_BLOBs
for the main SSI filter to consume later.

> Hmm.  Within your type system you could have:
> 
>       RESPONSE_CONTENTS
>       /               \
> FILE_DESCRIPTOR    BUFFER_BACKED

Which raises the question of whether we need multiple inheritance. *sigh* :)

> where a response_contents can also contain other response_contents.
> So, there would be a common API that both F_D and B_B implement
> (and defined by R_C).  Then, if a filter is smart enough, it can
> query and say, "I know you are a R_C, but do you really happen to
> be a F_D by any chance?"  Then, it could do specific opts on it.

Well, a filter that just wanted a RESPONSE_CONTENTS type should probably
operate on that thing only through the operations available to
R_Cs. I doubt it will have a reason to look and see what subtype it
is (although it might be necessary in some special cases; I don't think
we should encourage that, though).

> >   5. SOCKET_WRITER - consumes all sorts of byte-oriented data types,
> >                      including RESPONSE_HEADERS and FILE_DESCRIPTORS (using
> >                      sendfile or mmap()+writev() to do the actual response),
> >                      produces a TRANSACTION.
> 
> Again, I think by the time we make it to SOCKET_WRITER, we've
> stripped the headers and made it byte-oriented (so that
> there are only F_D or B_B present).  Some filter
> (say header_out_filter) has stripped all of the RESPONSE_HEADERS out
> and produced B_B data out of it.

The key point is that we don't force a particular filter to accept new
types (forcing that is a failure of the current filter system, IMHO).
Instead, when we need to look at the stream of data in a new way, we
create new filters that consume the old input and produce the new
output, stick our new filters in there, and then at the end convert
from the new back to the old.
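
A small sketch of that convert / operate / convert-back pattern, in
Python for brevity. The three filters and the run_chain helper are
hypothetical names invented for this example: the existing chain only
knows about bytes, the new filter wants lines, and two adapter filters
sit on either side of it.

```python
from typing import List


def bytes_to_lines(body: bytes) -> List[str]:
    # Adapter in: consumes the old type (bytes), produces the new one (lines).
    return body.decode("utf-8").split("\n")


def strip_comments(lines: List[str]) -> List[str]:
    # The new filter; it only ever sees the new type.
    return [ln for ln in lines if not ln.startswith("#")]


def lines_to_bytes(lines: List[str]) -> bytes:
    # Adapter out: converts back so downstream filters are untouched.
    return "\n".join(lines).encode("utf-8")


def run_chain(data, filters):
    for f in filters:
        data = f(data)
    return data


old_input = b"# draft\nhello\n# todo\nworld"
out = run_chain(old_input, [bytes_to_lines, strip_comments, lines_to_bytes])
print(out)
```

Nothing upstream or downstream of the three filters has to learn about
the line type, which is the property I'm after.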

Basically, what I'm saying is that it's all a matter of where we draw
the lines. This is merely one example.

> I think we might generalize it to have a socket_file_writer and
> socket_buffer_writer.  Therefore, as we see a F_D, we call the
> socket_file_writer to write it, and then as we see a B_B, we
> call socket_buffer_writer.

The idea is to hide that kind of stuff inside types like "RESPONSE_BODY"
and then create functions that operate on those RESPONSE_BODYs that
can be called by the SOCKET_WRITER or whatever filter consumes that
type.

Remember, I want to get away from things like "bytes" and go towards
things like "RESPONSE_BODY->write(socketfd)".
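
Something along these lines, sketched in Python rather than C to keep
it short. The class names, the write() signature and the use of an
in-memory buffer in place of a socket are all assumptions for the
example; a real file-backed body would presumably use sendfile or
similar instead of the copy loop shown here.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class ResponseBody(ABC):
    @abstractmethod
    def write(self, out) -> int:
        """Write the whole body to a binary file-like object; return byte count."""


class BufferBacked(ResponseBody):
    def __init__(self, data: bytes):
        self.data = data

    def write(self, out) -> int:
        return out.write(self.data)


class FileBacked(ResponseBody):
    def __init__(self, path: str):
        self.path = path

    def write(self, out) -> int:
        # Plain chunked copy; the point is only that the caller never sees it.
        total = 0
        with open(self.path, "rb") as src:
            while chunk := src.read(8192):
                total += out.write(chunk)
        return total


def socket_writer(bodies, out) -> int:
    # Knows nothing about subtypes; it only calls the common operation.
    return sum(body.write(out) for body in bodies)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"from file")

sink = io.BytesIO()  # stands in for the socket
written = socket_writer([BufferBacked(b"from buffer, "), FileBacked(tmp.name)], sink)
os.unlink(tmp.name)
print(written, sink.getvalue())
```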



> I'm not sure that a content_filter (or something like mod_include)
> would be paired up with something on the opposite side.  Perhaps,
> but I'm not clear on this.

One would parse the body and create tokens or in-between data.
The next would take both the body and those tokens and produce
another body.



> This brings me to my next thought (and perhaps this is really
> what Aaron is proposing).  We don't really have a chain constructed
> whatsoever.  But, we merely start with something (I say data read
> from a socket) and then filters along the way issue tokens for
> other filters to execute.  We finally iterate until we have
> nothing left (at that point, we're done).
> 
> [Bear with me, but I want to do this step-by-step so my point is
>  clear.  This is going to be for a server responding to a request,
>  but I believe a similar model could work for the client in reverse.]
> 
> So, since we're HTTP-based, I start with:
> 
> * "I want HTTP_REQUEST_LINE filter"
> * "data read from a socket"
> 
> The HTTP_REQUEST_LINE is what starts us off.  It'll determine
> what we're dealing with.  Note, SSL would start off with:
> 
> * "I want to decrypt SSL data filter"
> * "I want HTTP_REQUEST_LINE filter"
> * "I want to output SSL data filter"
> * "data read from a socket" (which happens to be encrypted)
> 
> We'd execute the decryption algorithm first and then call the
> HTTP_REQUEST_LINE filter.  (Use of output SSL data filter will
> become obvious later, I hope.)  Okay, back to my example...
> 
> HTTP_REQUEST_LINE reads the initial line and sees that it is
> good and it runs and places new tokens in the chain after
> reading what it needed to from the socket.  Now, we have:
> 
> * "I want HTTP_HEADER filter"
> * "I want HTTP_REQUEST_BODY filter"
> * "I want REQUEST_DISPATCHER filter"
> * "I want http_output filter"
> * REQUEST_LINE
> * "data read from a socket"
> 
> (Note that if we saw a HTTP/0.9 request, we wouldn't insert the
> header filter.)
> 
> So, we see this "I want HTTP_HEADER filter" token and we call the
> parser to parse the http headers and expand them:
> 
> * "I want HTTP_REQUEST_BODY filter"
> * "I want REQUEST_DISPATCHER filter"
> * "I want http_output filter"
> * REQUEST_LINE
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE
> * "data read from a socket"
> 
> Then, we now see that we should call "HTTP_REQUEST_BODY" filter.
> So, we call it.  If there is a body in this request, we then get:
> 
> * "I want REQUEST_DISPATCHER filter"
> * "I want http_output filter"
> * REQUEST_LINE
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE
> * REQUEST_BODY
> 
> (Note that it may not really produce a REQUEST_BODY if there
> isn't a body in the message.  This filter can produce nothing.
> Any data left unread at this state would be 'given' back or ignored
> somehow.  Most likely we never read it from the socket anyway.)
> 
> Note here that if we were to implement Waka, we'd merely replace
> the filters - the REQUEST_LINE, HEADER_LINE objects would be
> identical - the difference is which filter would be called to
> produce these objects (WAKA_REQUEST_LINE, WAKA_HEADER_LINE, etc).
> 
> Then, we see the token for Aaron's RESOURCE_DISPATCHER.  It now
> maps the request to a response and produces:
> 
> * "I want mod_include filter"
> * "I want mod_php filter"
> * "I want mod_deflate filter"
> * "I want http_output filter"
> * RESPONSE_STATUS     (OK)
> * HEADER_LINE         (Server/Apache)
> * HEADER_LINE         (Date/8-29-2002 12:55:35AM)
> * HEADER_LINE         (Last-Modified/8-29-2002 12:01:00AM)
> * RESPONSE_BODY               (fd-backed)
> 
> What happens here is that the resource_dispatcher has identified
> that mod_include, mod_php and mod_deflate should run for this
> response.  They now get executed in order.  mod_include's invocation
> yields:
> 
> * "I want mod_php filter"
> * "I want mod_deflate filter"
> * "I want http_output filter"
> * RESPONSE_STATUS
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE (perhaps different from before)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (fd-backed)
> 
> mod_php yields:
> 
> * "I want mod_deflate filter"
> * "I want http_output filter"
> * RESPONSE_STATUS
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE (perhaps different from before yet again)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (fd-backed)
> 
> Now, we have mod_deflate which sees its token and we yield:
> 
> * "I want http_output filter"
> * RESPONSE_STATUS
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE (perhaps different from before yet again)
> * HEADER_LINE (adds mod_deflate-specific headers)
> * RESPONSE_BODY (buffer-backed)
> 
> Then, we're off to do http_output.  We're now conceptually done
> with the body as we have no filter tokens (realize that a filter
> can add and remove tokens at will.)  http_output then produces
> meaningful representations and produces a bytestream.  We have:
> 
> * "I want socket_writer filter"
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (buffer-backed)
> 
> The socket_writer is now called.  It can then write out each
> component that it wants to the network.  If we still had a
> fd-backed response_body, it'd use the right call.  When
> the socket_writer is complete, we're empty.  Time to start
> again.
> 
> Now, why all of these tokens?  Error handling.  It's a nightmare
> in httpd-2.0 right now.  So, let's consider what would happen if
> mod_php suddenly found an error.  Remember it came in with:
> 
> * "I want mod_php filter"
> * "I want mod_deflate filter"
> * "I want http_output filter"
> * RESPONSE_STATUS
> * HEADER_LINE
> * HEADER_LINE
> * HEADER_LINE (perhaps different from before)
> * RESPONSE_BODY (buffer-backed)
> * RESPONSE_BODY (fd-backed)
> 
> But, let's say something was malformed in your php script.  So, we
> want to produce a 500.  Simple.  Replace the entire contents with:
> 
> * "I want error filter"
> * ERROR (500, 'your PHP skills need work, buster')
> 
> When mod_php completes, we now see a token for the error filter.
> So, we call it and we now get:
> 
> * "I want http_output filter"
> * RESPONSE_STATUS (500, 'your PHP skills need work, buster')
> * HEADER_LINE
> * HEADER_LINE
> * RESPONSE_BODY (potentially custom response body)
> 
> And, then we immediately step into the http_output filter.
> 
> Note what a module doing a 302 redirect would yield:
> 
> * "I want error filter"
> * ERROR (302, 'redirect')
> * HEADER_LINE (Location/http://example.com/newplace/)
> 
> The HEADER_LINE would be preserved...
> 
> Of course this example is really focused on the server, but I
> believe a similar strategy in reverse could apply to the
> client.  

I think we're on the same track now. I'm not seeing any major difference
between what you're saying and what I'm proposing. Correct?
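
To check that we mean the same thing, the token-driven loop in the
quoted text could be sketched roughly as below (Python, for brevity).
This is only a toy under stated assumptions: items are (kind, value)
pairs, a ("WANT", name) pair is a filter token, and the filter names
"dispatcher", "script", "error" and "output" are stand-ins invented for
the example, with "script" playing the role of mod_php.

```python
def make_filters(wire):
    def dispatcher(items):
        path = next(v for k, v in items if k == "REQUEST_LINE")
        return [("WANT", "script"), ("WANT", "output"),
                ("STATUS", 200), ("BODY", f"page for {path}")]

    def script(items):
        body = next(v for k, v in items if k == "BODY")
        if "broken" in body:
            # On error, replace the entire contents, pending tokens included.
            return [("WANT", "error"), ("ERROR", (500, "script failed"))]
        return [(k, v.upper() if k == "BODY" else v) for k, v in items]

    def error(items):
        code, msg = next(v for k, v in items if k == "ERROR")
        return [("WANT", "output"), ("STATUS", code), ("BODY", msg)]

    def output(items):
        status = next(v for k, v in items if k == "STATUS")
        body = next(v for k, v in items if k == "BODY")
        wire.append(f"{status} {body}")
        return []  # nothing left: this transaction is done

    return {"dispatcher": dispatcher, "script": script,
            "error": error, "output": output}


def run(items, filters):
    """Pop the leading token, let that filter rewrite the rest, repeat until empty."""
    trace = []
    while items:
        kind, name = items[0]
        if kind != "WANT":
            raise RuntimeError(f"data left with no filter to consume it: {items}")
        trace.append(name)
        items = filters[name](items[1:])
    return trace


wire = []
filters = make_filters(wire)
ok_trace = run([("WANT", "dispatcher"), ("REQUEST_LINE", "/index.html")], filters)
err_trace = run([("WANT", "dispatcher"), ("REQUEST_LINE", "/broken.php")], filters)
print(ok_trace, err_trace, wire)
```

The error case never touches the tokens the failing filter threw away,
which is what makes the error handling simple in this model.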

-aaron
