Re: cvs commit: apr-serf/docs notes-filter-chains.txt

Justin Erenkrantz 29 Aug 2002 08:18:56 -0000

On Wed, Aug 28, 2002 at 06:26:16PM -0700, Aaron Bannert wrote:
> 1) Let's get away from the notion of Push and Pull, and go to the idea
>   where the app drives the filter chain, and each filter is given
>   control of a piece of data for a short period of time.


Can you give an example of how a client would read and write
data?  (Feel free to ignore the filters part.)

To give you an idea where I'm coming from:

A push-model would be driven by a poll() loop.  As data is available
on the sockets, the socket would be read and the data would be
pushed into the filter chain.

A pull-model, when invoked, would first check to see if there is any
data in the 'spillage' (i.e. left over from last read).  If there is
enough to return, it would do so.  If there isn't, then it would call
poll() and wait for something.  Once data is available, it reads from
the socket and pushes the data into the filter chain.  Then, as data
comes out of the filter chain, it is returned to the client.

> 3) The chain of filters now forms into a graph (with help from this registry
>   or some external mechanism). Since the dataflow is no longer linear through
>   this graph, we probably shouldn't call it a filter, but instead let's call
>   it a "filter graph". For any other graph theory geeks out there, this
>   graph happens to be a directed acyclic graph (DAG).

Yeah, linear filters seem a bit antiquated, but I don't see how one
filter could branch its data to two filters.  So, at its essence, it
seems it'd still have to be linear.  Do you have a use-case where the
data flow isn't linear?

> 4) All graphs have by default two filters, a SOURCE and a SINK. The SOURCE
>   filter produces a single TRANSACTION (for lack of a better term),
>   and a SINK consumes the same.

I like the typing aspect.

> 6) The application then uses the filter graph. If there are any unused inputs
>   or outputs, there is an immediate runtime error. The reason this is deferred
>   until the filter graph is actually used is so that the contents of the
>   graph can be modified at any time. (One can imagine wanting to add or
>   remove certain filters after a particular filter graph has been used, so
>   why limit ourselves here?)

Yeah - makes sense.

> 7) In order to achieve filter implementations that are agnostic of both
>   push and pull, the filter graph simply takes the data types and applies
>   them only to the next appropriate filter. That filter then decides one
>   of three things: 1) to pass the original data, 2) to consume the data
>   and produce nothing, 3) to consume the data and produce one or more
>   new data thingies*, or 4) to fail.

One question is how do we allow for ordering?  Say in one instance we
want to have filter I before P, and another time we want P before I.
My guess is that they'd both take in the same data type and produce
the same output type.  (I believe in your scheme they'd be
RESPONSE_BODIES.)

>  - The types of filters I could imagine being involved here would be:
> 
>   1. SOURCE - consumes nothing, produces TRANSACTION

I'd imagine that it should have produced REQUEST rather than
TRANSACTION...

>   2. SOCKET_READER - consumes TRANSACTION, produces HEADERS

Eek.  I'd see a lot of stuff in between socket_reader and something
that produces HEADERS, but yeah, we can do handwaving and say that
we somehow get to HEADERS...

(Note, I'd prefer splitting into both STATUS/RESOURCE and HEADERS.)

>   3. RESPONSE_DISPATCHER - consumes HEADERS, produces a bunch of stuff,
>                            one of which is a RESPONSE_FILE.

So, the response_dispatcher is essentially equivalent to the handler
in httpd?

>   4. STATIC_FILE_HANDLER - consumes a bunch of things, one of which is a
>                            RESPONSE_FILE*, produces RESPONSE_HEADERS and
>                            a FILE_DESCRIPTOR.

Here's a question: how would something like mod_include or mod_php
work?  (You brought up serving pages, so...)  Ideally, they don't
want to take FILE_DESCRIPTOR as input.  They want RESPONSE_CONTENTS
as input and spit RESPONSE_CONTENTS back out, right?

Hmm.  Within your type system you could have:

        RESPONSE_CONTENTS
        /               \
FILE_DESCRIPTOR    BUFFER_BACKED

where a response_contents can also contain other response_contents.
So, there would be a common API that both F_D and B_B implement
(and defined by R_C).  Then, if a filter is smart enough, it can
query and say, "I know you are a R_C, but do you really happen to
be a F_D by any chance?"  Then, it could do specific opts on it.
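
A rough Python sketch of that hierarchy, just to show the shape of the
query.  The class names mirror R_C, F_D and B_B; read_all() and
pick_write_path() are hypothetical names I'm inventing here, not a
proposed API.

```python
import os
import tempfile


class ResponseContents:
    """Common API (the R_C role) that every body type implements."""

    def read_all(self) -> bytes:
        raise NotImplementedError


class BufferBacked(ResponseContents):
    def __init__(self, data: bytes):
        self.data = data

    def read_all(self) -> bytes:
        return self.data


class FileDescriptor(ResponseContents):
    def __init__(self, fd: int):
        self.fd = fd

    def read_all(self) -> bytes:
        # Generic path: rewind and read the whole file through the fd.
        os.lseek(self.fd, 0, os.SEEK_SET)
        parts = []
        while True:
            chunk = os.read(self.fd, 65536)
            if not chunk:
                break
            parts.append(chunk)
        return b"".join(parts)


def pick_write_path(contents: ResponseContents) -> str:
    # "I know you are a R_C, but do you happen to be a F_D?"
    if isinstance(contents, FileDescriptor):
        return "fd-specific"  # room for fd-only optimizations
    return "generic"          # fall back to the common API


with tempfile.TemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    fd_body = FileDescriptor(f.fileno())
    fd_bytes = fd_body.read_all()
    fd_path = pick_write_path(fd_body)
buf_path = pick_write_path(BufferBacked(b"hi"))
```

A filter like mod_include would only ever call the common API; only a
filter that cares about the optimization needs to ask the second
question.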

>   5. SOCKET_WRITER - consumes all sorts of byte-oriented data types, including
>                      RESPONSE_HEADERS and FILE_DESCRIPTORS (using sendfile
>                      or mmap()+writev() to do the actual response), produces
>                      a TRANSACTION.

Again, I think by the time we make it to SOCKET_WRITER, we've
stripped the headers and made it byte-oriented (so that
there are only F_D or B_B present).  Some filter
(say header_out_filter) has stripped all of the RESPONSE_HEADERS out
and produced B_B data out of it.

I think we might generalize it to have a socket_file_writer and
socket_buffer_writer.  Therefore, as we see a F_D, we call the
socket_file_writer to write it, and then as we see a B_B, we
call socket_buffer_writer.
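
Something like the following Python sketch is what I have in mind for
the two writers.  The function names follow the ones above; write_bodies
and the use of plain bytes / open files to stand for B_B / F_D are
assumptions for illustration only.

```python
import socket
import tempfile


def socket_buffer_writer(sock, data):
    sock.sendall(data)


def socket_file_writer(sock, fileobj):
    # socket.sendfile() uses os.sendfile() where it is available and
    # falls back to ordinary send() calls otherwise.
    sock.sendfile(fileobj)


def write_bodies(sock, bodies):
    # Each body is either bytes (B_B) or an open binary file (F_D).
    for body in bodies:
        if isinstance(body, (bytes, bytearray)):
            socket_buffer_writer(sock, body)
        else:
            socket_file_writer(sock, body)


# A connected socket pair stands in for the client connection.
left, right = socket.socketpair()
with tempfile.TemporaryFile() as f:
    f.write(b"file-part")
    f.seek(0)
    write_bodies(left, [b"head:", f, b":tail"])
left.close()

received = b""
while True:
    chunk = right.recv(4096)
    if not chunk:
        break
    received += chunk
right.close()
```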

>   6. SINK - consumes TRANSACTION, produces nothing
> 
> Now this is a simple linear example, so it doesn't illustrate the
> inherent parallelism in this model, but one could easily imagine at
> this point where new types of handlers fit in. The best way to think
> about this in my mind is by layers (aka resolution):
> 
>  low res:    SOURCE                    -->                        SINK
>                 \                                                 /
>   2x res:      SOCKET_READER           -->              SOCKET_WRITER
>                     \                                     /
>   3x res:          RESPONSE_DISPATCHER --> STATIC_FILE_HANDLER
> 
> New layers are added in sets of at least 2, typically more. This gives
> us multiplicity *and* allows us to reuse high-level concepts at the
> right level in our graph.

I'm not sure that a content_filter (or something like mod_include)
would be paired up with something on the opposite side.  Perhaps,
but I'm not clear on this.

> Type System:

No real concerns here.

This brings me to my next thought (and perhaps this is really
what Aaron is proposing).  We don't really have a chain constructed
up front at all.  Instead, we merely start with something (I say data
read from a socket), and then filters along the way issue tokens for
other filters to execute.  We keep iterating until we have nothing
left (at that point, we're done).

[Bear with me, but I want to do this step-by-step so my point is
 clear.  This is going to be for a server responding to a request,
 but I believe a similar model could work for the client in reverse.]

So, since we're HTTP-based, I start with:

* "I want HTTP_REQUEST_LINE filter"
* "data read from a socket"

The HTTP_REQUEST_LINE is what starts us off.  It'll determine
what we're dealing with.  Note, SSL would start off with:

* "I want to decrypt SSL data filter"
* "I want HTTP_REQUEST_LINE filter"
* "I want to output SSL data filter"
* "data read from a socket" (which happens to be encrypted)

We'd execute the decryption algorithm first and then call the
HTTP_REQUEST_LINE filter.  (Use of output SSL data filter will
become obvious later, I hope.)  Okay, back to my example...

HTTP_REQUEST_LINE reads the initial line from the socket, sees that
it is good, and places new tokens in the chain.  Now, we have:

* "I want HTTP_HEADER filter"
* "I want HTTP_REQUEST_BODY filter"
* "I want REQUEST_DISPATCHER filter"
* "I want http_output filter"
* REQUEST_LINE
* "data read from a socket"

(Note that if we saw a HTTP/0.9 request, we wouldn't insert the
header filter.)

So, we see this "I want HTTP_HEADER filter" token and we call the
parser to parse the http headers and expand them:

* "I want HTTP_REQUEST_BODY filter"
* "I want REQUEST_DISPATCHER filter"
* "I want http_output filter"
* REQUEST_LINE
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE
* "data read from a socket"

Then, we now see that we should call "HTTP_REQUEST_BODY" filter.
So, we call it.  If there is a body in this request, we then get:

* "I want REQUEST_DISPATCHER filter"
* "I want http_output filter"
* REQUEST_LINE
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE
* REQUEST_BODY

(Note that it may not really produce a REQUEST_BODY if there
isn't a body in the message.  This filter can produce nothing.
Any data left unread at this state would be 'given' back or ignored
somehow.  Most likely we never read it from the socket anyway.)
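
Here is a small Python sketch of the request-side steps so far, under
some loud assumptions: the state is a list of pending filter tokens plus
a list of (kind, value) data objects, every filter receives both and
returns both, and the body filter ignores Content-Length.  run(),
FILTERS and the "SOCKET" kind are made-up names.

```python
def run(tokens, data, filters, stop_at=None):
    """Pop the leading token, call its filter, repeat until stop_at/empty."""
    tokens, data = list(tokens), list(data)
    while tokens and tokens[0] != stop_at:
        name = tokens.pop(0)
        tokens, data = filters[name](tokens, data)
    return tokens, data


def http_request_line(tokens, data):
    raw = data.pop()[1]  # the trailing "data read from a socket"
    line, _, rest = raw.partition(b"\r\n")
    wants = ["HTTP_REQUEST_BODY", "REQUEST_DISPATCHER", "http_output"]
    if len(line.split()) == 3:  # HTTP/0.9 has no version, so no headers
        wants.insert(0, "HTTP_HEADER")
    return wants + tokens, data + [("REQUEST_LINE", line), ("SOCKET", rest)]


def http_header(tokens, data):
    raw = data.pop()[1]
    lines = []
    while raw:
        line, _, raw = raw.partition(b"\r\n")
        if not line:  # a blank line ends the header block
            break
        lines.append(("HEADER_LINE", line))
    return tokens, data + lines + [("SOCKET", raw)]


def http_request_body(tokens, data):
    raw = data.pop()[1]
    # Produces nothing when the message carries no body.
    return tokens, data + ([("REQUEST_BODY", raw)] if raw else [])


FILTERS = {
    "HTTP_REQUEST_LINE": http_request_line,
    "HTTP_HEADER": http_header,
    "HTTP_REQUEST_BODY": http_request_body,
}

request = (b"POST /form HTTP/1.1\r\nHost: example.com\r\n"
           b"Content-Length: 5\r\n\r\nhello")
tokens, data = run(["HTTP_REQUEST_LINE"], [("SOCKET", request)],
                   FILTERS, stop_at="REQUEST_DISPATCHER")
kinds = [kind for kind, _ in data]

old_tokens, old_data = run(["HTTP_REQUEST_LINE"], [("SOCKET", b"GET /\r\n")],
                           FILTERS, stop_at="REQUEST_DISPATCHER")
```

The run stops at the dispatcher token, leaving the same shape of list as
the one shown just above.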

Note here that if we were to implement Waka, we'd merely replace
the filters - the REQUEST_LINE, HEADER_LINE objects would be
identical - the difference is which filter would be called to
produce these objects (WAKA_REQUEST_LINE, WAKA_HEADER_LINE, etc).

Then, we see the token for the REQUEST_DISPATCHER (Aaron's
RESPONSE_DISPATCHER).  It now maps the request to a response and
produces:

* "I want mod_include filter"
* "I want mod_php filter"
* "I want mod_deflate filter"
* "I want http_output filter"
* RESPONSE_STATUS       (OK)
* HEADER_LINE           (Server/Apache)
* HEADER_LINE           (Date/8-29-2002 12:55:35AM)
* HEADER_LINE           (Last-Modified/8-29-2002 12:01:00AM)
* RESPONSE_BODY         (fd-backed)

What happens here is that the request dispatcher has identified
that mod_include, mod_php and mod_deflate should run for this
response.  They now get executed in order.  mod_include's invocation
yields:

* "I want mod_php filter"
* "I want mod_deflate filter"
* "I want http_output filter"
* RESPONSE_STATUS
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE (perhaps different from before)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (fd-backed)

mod_php yields:

* "I want mod_deflate filter"
* "I want http_output filter"
* RESPONSE_STATUS
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE (perhaps different from before yet again)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (fd-backed)

Now, mod_deflate sees its token and runs, yielding:

* "I want http_output filter"
* RESPONSE_STATUS
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE (perhaps different from before yet again)
* HEADER_LINE (adds mod_deflate-specific headers)
* RESPONSE_BODY (buffer-backed)

Then, we're off to do http_output.  We're now conceptually done
with the body, as no body filter tokens remain (realize that a filter
can add and remove tokens at will).  http_output then serializes the
status and headers into their wire representation and produces a
bytestream.  We have:

* "I want socket_writer filter"
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (buffer-backed)

The socket_writer is now called.  It can then write out each
component that it wants to the network.  If we still had a
fd-backed response_body, it'd use the right call for it (say,
sendfile).  When the socket_writer is complete, we're empty.  Time
to start again.

Now, why all of these tokens?  Error handling.  It's a nightmare
in httpd-2.0 right now.  So, let's consider what would happen if
mod_php suddenly found an error.  Remember it came in with:

* "I want mod_php filter"
* "I want mod_deflate filter"
* "I want http_output filter"
* RESPONSE_STATUS
* HEADER_LINE
* HEADER_LINE
* HEADER_LINE (perhaps different from before)
* RESPONSE_BODY (buffer-backed)
* RESPONSE_BODY (fd-backed)

But, let's say something was malformed in your php script.  So, we
want to produce a 500.  Simple.  Replace the entire contents with:

* "I want error filter"
* ERROR (500, 'your PHP skills need work, buster')

When mod_php completes, we now see a token for the error filter.
So, we call it and we now get:

* "I want http_output filter"
* RESPONSE_STATUS (500, 'your PHP skills need work, buster')
* HEADER_LINE
* HEADER_LINE
* RESPONSE_BODY (potentially custom response body)

And, then we immediately step into the http_output filter.

Note what a module doing a 302 redirect would yield:

* "I want error filter"
* ERROR (302, 'redirect')
* HEADER_LINE (Location/http://example.com/newplace/)

The HEADER_LINE would be preserved...
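
The error path might look like this in a Python sketch.  Same caveats as
any sketch: run(), FILTERS and the (kind, value) tuples are invented for
illustration, and the "malformed script" test is a toy stand-in for a
real PHP failure.

```python
def run(tokens, data, filters, stop_at=None):
    """Pop the leading token, call its filter, repeat until stop_at/empty."""
    tokens, data = list(tokens), list(data)
    while tokens and tokens[0] != stop_at:
        name = tokens.pop(0)
        tokens, data = filters[name](tokens, data)
    return tokens, data


def mod_php(tokens, data):
    body = b"".join(v for k, v in data if k == "RESPONSE_BODY")
    # Toy check standing in for a real script failure.
    if body.count(b"<?php") != body.count(b"?>"):
        # Replace the entire contents: pending tokens and data are dropped.
        return ["error"], [("ERROR", (500, "malformed script"))]
    return tokens, data


def error_filter(tokens, data):
    code, reason = dict(data)["ERROR"]
    # Header lines that travelled with the ERROR (e.g. Location) survive.
    headers = [item for item in data if item[0] == "HEADER_LINE"]
    body = [] if 300 <= code < 400 else [("RESPONSE_BODY", reason.encode())]
    return (["http_output"] + tokens,
            [("RESPONSE_STATUS", (code, reason))] + headers + body)


FILTERS = {"mod_php": mod_php, "error": error_filter}

failed = run(
    ["mod_php", "mod_deflate", "http_output"],
    [("RESPONSE_STATUS", (200, "OK")),
     ("HEADER_LINE", b"Server: Apache"),
     ("RESPONSE_BODY", b"<?php echo 'hi'")],
    FILTERS, stop_at="http_output")

redirect = run(
    ["error"],
    [("ERROR", (302, "redirect")),
     ("HEADER_LINE", b"Location: http://example.com/newplace/")],
    FILTERS, stop_at="http_output")
```

Note that mod_deflate never runs in the failed case: its token was
thrown away along with everything else when mod_php bailed out.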

Of course this example is really focused on the server, but I
believe a similar strategy in reverse could apply to the
client.  

Thoughts?  -- justin
