> On Nov 13, 2015, at 4:36 AM, Cory Benfield <c...@lukasa.co.uk> wrote:
>
> Folks,
>
> # Problem Statement
>
> Thanks for your feedback on my HTTP/2 questions. I’ve started work
> implementing a spike of a HTTP/2 protocol for twisted.web. I’m aiming to have
> something that works in at least some cases by the end of the day.
Hooray!
> As part of my dive into twisted.web, I noticed something that surprised me:
> it seems to have no support for ‘streaming’ request bodies. By this I mean
> that the Request.requestReceived() method is not actually called until the
> complete request body has been received. This is a somewhat unexpected
> limitation for Twisted: why should I have to wait until the entire body has
> been uploaded to start doing things with it?
This is exactly what <https://twistedmatrix.com/trac/ticket/288> is about (as
you note).
> This problem is thrown into sharp relief with HTTP/2, which essentially
> always chunks the body, even if a content-length is provided. This means that
> it is now very easy to receive data in delimited chunks, which an
> implementation may want to have semantic meaning. However, the request is
> unable to access this data in this way. It also makes it impossible to use a
> HTTP/2 request/response pair as a long-running communication channel, as we
> cannot safely call requestReceived until the response is terminated (which
> also terminates the HTTP/2 stream).
>
> Adi pointed me at a related issue, #6928[0], which itself points at what
> appears to be an issue tracking exactly this request. That issue is issue
> #288[1], which is 12 years old(!). This has clearly been a pain point for
> quite some time.
>
> Issue #6928 has glyph suggesting that we come to the mailing list to discuss
> this, but the last time it was raised no responses were received[2]. I
> believe that with HTTP/2 on the horizon, this issue is more acute than it was
> before, and needs solving if Twisted is going to continue to remain relevant
> for the web. It should also allow people to build more performant web
> applications, as they should be able to handle how the data queues up in
> their apps.
>
> This does not immediately block my HTTP/2 work, so we can take some time and
> get this right.
I'm very glad to hear you say this, because:
(A) we should absolutely make progress on this ticket now that there is some
impetus to do so, but,
(B) we should absolutely NOT let this new API block any purely protocol-level
HTTP2 work that needs to proceed (I say "protocol level" as a contrast to
"feature level" because we could totally implement a thing that speaks HTTP 2
against the current API, although it might be awkward to expose the
advantageous parts of the protocol as API features until we do some cleanup)
> # Proposed Solution
>
> To help us move forward, I’m providing a proposal for how I’d solve this
> problem. This is not necessarily going to be the final approach, but is
> instead a straw-man we can use to form the basis of a discussion about what
> the correct fix should be.
>
> My proposal is to deprecate the current Request/Resource model. It currently
> functions and should continue to function, but as of this point we should
> consider it a bad way to do things, and we should push people to move to a
> fully asynchronous model.
We have considered it a bad way to do things for a long time. There have been
several attempts to rewrite it (Nevow's request model, web2) but none of them
have really been the comprehensive re-design we need.
> We should then move to an API that is much more like the one used by Go:
> specifically, that by default all requests/responses are streamed. Request
> objects (and, logically, any other object that handles requests/responses,
> such as Resource) should be extended to have a chunkReceived method that can
> be overridden by users.
No.
Let me elaborate :-).
First of all, this is already the case (sort of).
twisted.web.server.Request inherits from twisted.web.http.Request; as you can
see in http.Request, there is already a method called handleContentChunk, which
is called by HTTPChannel. By overriding this method, you can already handle
request data streaming in from off the wire. This is one of the reasons that
#288 is so interminable: using Twisted's public API, today, you can totally
write an HTTP-based application that happily streams data from the wire. The
only problem is that this API does not propagate up to Resource objects,
because Resource objects can expect the (still, as of this writing,
undocumented) "content" attribute to have been filled out in getChild. Mostly,
they don't, actually! But it's impossible to tell if they might in the general
case.
You can (and many applications have) just broken the technical compatibility
contract with Resource, and written a subclass of twisted.web.server.Site that
has a custom requestFactory method that returns a 'streamed' resource.
So, if we're already doing this, why "no"?
Superclasses with overridable methods are a terrible mechanism for exposing
extensibility. These are used extensively throughout Twisted, the older the
API the more inheritance it uses. Newer code, you may notice, is generally
written much more in a pattern of delegation to formal interfaces. So we have
tried to learn our lesson here.
> If a user chooses not to override that method, the default implementation
> would continue to do what is done now (save to a buffer). Once the
> request/response is complete (marked by receipt of a zero-length chunk, or a
> frame with END_STREAM set, or when the remaining content-length is 0),
For what it's worth, I hope that all of these will be exposed as the same event
to applications, since the fact that these differ on the wire is entirely an
implementation detail?
> request/responseComplete would be called. For users that did not override
> chunkReceived can now safely access the content buffer: other users can do
> whatever they see fit. We’d also update requestReceived to ensure that it’s
> called when all the *headers* are received, rather than waiting for the body.
Again, this is very similar to what already happens, at the layer of the HTTP
protocol. The question is, how do you indicate that you're delegating to a
Resource object which may expect the .content attribute to already be populated
during .getChild?
> A similar approach should be taken with sending data: we should assume that
> users want to chunk it if they do not provide a content-length. An extreme
> position to take (and I do) is that this should be sufficiently easy that
> most users actually *accidentally* end up chunking their data: that is, we do
> not provide special helpers to set content-length, instead just checking
> whether that’s a header users actually send, and if they don’t we chunk the
> data.
request.write() already basically does this, I think? Here, at least, we have
lots of opportunity to make the implementation do smarter things (better error
checking regarding content-length, for example) without changing the interface
at all.
> This logic would make it much easier to work with HTTP/2 *and* with
> WebSockets, requiring substantially less special-case code to handle the
> WebSocket upgrade (when the headers are complete, we can spot the upgrade
> easily).
>
> What do people think of this approach?
So I think you're roughly on the right track but there are probably some
Twisted-level gaps to fill in.
I've already gestured in the direction of Tubes (as have others) and it's
something to think about. But before we get to that, let's talk about a much
more basic deficiency in the API: although there's an "IRequest", and an
"IResource", there's no such thing as an "IResponse". Instead, "IRequest"
stands in for both the request and the response, because you write directly to
a request (implicitly filling out its response as you do so).
Luckily we have an existing interface that might point the way to a better
solution, both for requests and responses: specifically, the client IResponse:
https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IResponse.html.
This interface is actually pretty close to what we want for a server IResponse
as well. Perhaps even identical. Its static data is all exposed as attributes
which can be relatively simply inspected, and the way it delivers a streaming
response is that it delivers its body to an IProtocol implementation (via
.deliverBody(aProtocol)). This is not quite as graceful as having a
.bodyFount() method that returns an IFount from the tubes package; however, the
tubes package is still not exactly mature software, so we may not want to block
on depending on it. Importantly though, this delivers all the events you need
as a primitive for interfacing with such a high-level interface; it would
definitely be better to add this sort of interface Real Soon Now, because then
the tubes package could simply have a method, responseToFount (which it will
need anyway to work with Agent) that calls deliverBody internally.
This works as a primitive because you have all the hooks you need for
flow-control. This protocol receives, to its 'makeConnection' method, an
ITransport which can provide the IProducer
https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IProducer.html
and IConsumer
https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IConsumer.html
interfaces for flow-control. It receives dataReceived to tell it a chunk has
arrived and connectionLost to tell it the stream has terminated.
Unfortunately the client IRequest
https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IClientRequest.html
isn't quite as useful (although its relative minimalism should be an
inspiration to anyone designing a next-generation IRequest more than the
current IRequest's sprawling kitchen-sink aesthetic). However,
IResponse.deliverBody could be applied to IGoodRequest as well. If we have a
very similar-to-IResponse shaped IRequest object, say with 'method', 'uri' and
'headers', and then a 'deliverBody' that delivers the request body in much the
same way, we could get a gracefully structured streaming request with works
with a lot of existing code within Twisted.
Then the question is: what to do with IResource?
Right now the flow of processing a request is, roughly:
-> wait for full request to arrive
-> have HTTPChannel fill out IRequest object
-> look at request.site.resource for the root
*-> call getChildWithDefault repeatedly, mutating "cursor" state on the
IRequest as you move (specifically: "prepath" and "postpath" attributes)
-> eventually reach the leaf Resource, or one with 'isLeaf' set on it, and
delegate producing the response to that resource
*-> call resource.render(request)
-> examine the return value; if it's bytes, deliver them and close the
connection; NOT_DONE_YET, just leave the connection open,
Instead, I think a good flow would be:
-> receive method/headers from request
-> recurse down from request.site.resource, calling something like nevow's or
web2's locateChild, but not modifying 'request' at each stage; instead, pass a
"cursor" object - perhaps indeed just a twisted.python.url.URL - indicating
where we are in the resource traversal hierarchy. the reason the
request.prepath and request.postpath attributes exist is mainly for Resource
objects to be able to orient themselves within a resource tree and generate
links.
also, it probably bears some explanation; the signature of the current "get the
next resource" call is resource.getChildWithDefault(onePathSegment, request) ->
resource. This is somewhat limiting as it requires you to consume only an
individual path segment at a time, which can be highly awkward for implementing
sites that have a URL structure that is, for example,
/YYYY/MM/DD/HH/MM/index.html.
Instead, locateChild took the entire remaining path, and returned a 2-tuple of
a resource, and the _still_ remaining path. So for the above, you could do:
def locateChild(self, request, path):
y, m, d = path[:3]
return ymdresource(y, m, d), path[3:]
that 2-tuple instructs the traversal machinery, "keep going". One alternative
that we toyed with for this was to make consuming the path destructive, since
that made it a lot easier to tell what resource you were "looking at":
def locateChild(self, request, path):
y, m, d = path.consume(3)
return ymdresource(y, m, d)
Either of these approaches also let you implement 'isLeaf' attribute without
special support from the framework; you simply return leaf(), () or
path.consume(path.length)
-> finally, call .responseForRequest(request) -> IResponse on the final
Resource and deliver the IResponse to the network.
The way compatibility could be achieved here is to write a wrapper that would
implement .responseForRequest to first collect the entire body, then synthesize
a gross old-style-IRequest-like object out of the combination of that body and
the other information about the resource, then call .getChildWithDefault on it
a few times, then call the old-style .render_GET, et. al. The IResponse
returned from this compatibility .responseForRequest would wrap up calls like
request.write and turn them into write() calls.
This is long and increasingly rambly, so I should probably stop now, send it,
and get your feedback. Does the strategy I'm proposing make sense? I'm sure
I'm leaving a ton out so feel free to ask for clarification. Hopefully I
didn't leave too many incomplete sentences in the middle.
-glyph
_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python