Re: [Twisted-Python] Streaming HTTP

Glyph Lefkowitz Wed, 18 Nov 2015 04:19:36 -0800

> On Nov 13, 2015, at 4:36 AM, Cory Benfield <c...@lukasa.co.uk> wrote:
> 
> Folks,
> 
> # Problem Statement
> 
> Thanks for your feedback on my HTTP/2 questions. I’ve started work 
> implementing a spike of a HTTP/2 protocol for twisted.web. I’m aiming to have 
> something that works in at least some cases by the end of the day.


Hooray!

> As part of my dive into twisted.web, I noticed something that surprised me: 
> it seems to have no support for ‘streaming’ request bodies. By this I mean 
> that the Request.requestReceived() method is not actually called until the 
> complete request body has been received. This is a somewhat unexpected 
> limitation for Twisted: why should I have to wait until the entire body has 
> been uploaded to start doing things with it?

This is exactly what <https://twistedmatrix.com/trac/ticket/288> is about (as 
you note).

> This problem is thrown into sharp relief with HTTP/2, which essentially 
> always chunks the body, even if a content-length is provided. This means that 
> it is now very easy to receive data in delimited chunks, which an 
> implementation may want to have semantic meaning. However, the request is 
> unable to access this data in this way. It also makes it impossible to use a 
> HTTP/2 request/response pair as a long-running communication channel, as we 
> cannot safely call requestReceived until the response is terminated (which 
> also terminates the HTTP/2 stream).
> 
> Adi pointed me at a related issue, #6928[0], which itself points at what 
> appears to be an issue tracking exactly this request. That issue is issue 
> #288[1], which is 12 years old(!). This has clearly been a pain point for 
> quite some time.
> 
> Issue #6928 has glyph suggesting that we come to the mailing list to discuss 
> this, but the last time it was raised no responses were received[2]. I 
> believe that with HTTP/2 on the horizon, this issue is more acute than it was 
> before, and needs solving if Twisted is going to continue to remain relevant 
> for the web. It should also allow people to build more performant web 
> applications, as they should be able to handle how the data queues up in 
> their apps.
> 
> This does not immediately block my HTTP/2 work, so we can take some time and 
> get this right.

I'm very glad to hear you say this, because:

(A) we should absolutely make progress on this ticket now that there is some 
impetus to do so, but,
(B) we should absolutely NOT let this new API block any purely protocol-level 
HTTP2 work that needs to proceed (I say "protocol level" as a contrast to 
"feature level" because we could totally implement a thing that speaks HTTP 2 
against the current API, although it might be awkward to expose the 
advantageous parts of the protocol as API features until we do some cleanup)

> # Proposed Solution
> 
> To help us move forward, I’m providing a proposal for how I’d solve this 
> problem. This is not necessarily going to be the final approach, but is 
> instead a straw-man we can use to form the basis of a discussion about what 
> the correct fix should be.
> 
> My proposal is to deprecate the current Request/Resource model. It currently 
> functions and should continue to function, but as of this point we should 
> consider it a bad way to do things, and we should push people to move to a 
> fully asynchronous model.

We have considered it a bad way to do things for a long time.  There have been 
several attempts to rewrite it (Nevow's request model, web2) but none of them 
have really been the comprehensive re-design we need.

> We should then move to an API that is much more like the one used by Go: 
> specifically, that by default all requests/responses are streamed. Request 
> objects (and, logically, any other object that handles requests/responses, 
> such as Resource) should be extended to have a chunkReceived method that can 
> be overridden by users.

No.

Let me elaborate :-).

First of all, this is already the case (sort of).

twisted.web.server.Request inherits from twisted.web.http.Request; as you can 
see in http.Request, there is already a method called handleContentChunk, which 
is called by HTTPChannel. By overriding this method, you can already handle 
request data streaming in from off the wire.  This is one of the reasons that 
#288 is so interminable: using Twisted's public API, today, you can totally 
write an HTTP-based application that happily streams data from the wire.  The 
only problem is that this API does not propagate up to Resource objects, 
because Resource objects can expect the (still, as of this writing, 
undocumented) "content" attribute to have been filled out in getChild.  Mostly, 
they don't, actually!  But it's impossible to tell if they might in the general 
case.

You can (and many applications have) just broken the technical compatibility 
contract with Resource, and written a subclass of twisted.web.server.Site that 
has a custom requestFactory method that returns a 'streamed' resource.

So, if we're already doing this, why "no"?

Superclasses with overridable methods are a terrible mechanism for exposing 
extensibility.  These are used extensively throughout Twisted, the older the 
API the more inheritance it uses.  Newer code, you may notice, is generally 
written much more in a pattern of delegation to formal interfaces.  So we have 
tried to learn our lesson here.

> If a user chooses not to override that method, the default implementation 
> would continue to do what is done now (save to a buffer). Once the 
> request/response is complete (marked by receipt of a zero-length chunk, or a 
> frame with END_STREAM set, or when the remaining content-length is 0),

For what it's worth, I hope that all of these will be exposed as the same event 
to applications, since the fact that these differ on the wire is entirely an 
implementation detail?

> request/responseComplete would be called. For users that did not override 
> chunkReceived can now safely access the content buffer: other users can do 
> whatever they see fit. We’d also update requestReceived to ensure that it’s 
> called when all the *headers* are received, rather than waiting for the body.

Again, this is very similar to what already happens, at the layer of the HTTP 
protocol.  The question is, how do you indicate that you're delegating to a 
Resource object which may expect the .content attribute to already be populated 
during .getChild?

> A similar approach should be taken with sending data: we should assume that 
> users want to chunk it if they do not provide a content-length. An extreme 
> position to take (and I do) is that this should be sufficiently easy that 
> most users actually *accidentally* end up chunking their data: that is, we do 
> not provide special helpers to set content-length, instead just checking 
> whether that’s a header users actually send, and if they don’t we chunk the 
> data.

request.write() already basically does this, I think?  Here, at least, we have 
lots of opportunity to make the implementation do smarter things (better error 
checking regarding content-length, for example) without changing the interface 
at all.

> This logic would make it much easier to work with HTTP/2 *and* with 
> WebSockets, requiring substantially less special-case code to handle the 
> WebSocket upgrade (when the headers are complete, we can spot the upgrade 
> easily).
> 
> What do people think of this approach?

So I think you're roughly on the right track but there are probably some 
Twisted-level gaps to fill in.

I've already gestured in the direction of Tubes (as have others) and it's 
something to think about.  But before we get to that, let's talk about a much 
more basic deficiency in the API: although there's an "IRequest", and an 
"IResource", there's no such thing as an "IResponse".  Instead, "IRequest" 
stands in for both the request and the response, because you write directly to 
a request (implicitly filling out its response as you do so).

Luckily we have an existing interface that might point the way to a better 
solution, both for requests and responses: specifically, the client IResponse: 
https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IResponse.html.

This interface is actually pretty close to what we want for a server IResponse 
as well.  Perhaps even identical.  Its static data is all exposed as attributes 
which can be relatively simply inspected, and the way it delivers a streaming 
response is that it delivers its body to an IProtocol implementation (via 
.deliverBody(aProtocol)).  This is not quite as graceful as having a 
.bodyFount() method that returns an IFount from the tubes package; however, the 
tubes package is still not exactly mature software, so we may not want to block 
on depending on it.  Importantly though, this delivers all the events you need 
as a primitive for interfacing with such a high-level interface; it would 
definitely be better to add this sort of interface Real Soon Now, because then 
the tubes package could simply have a method, responseToFount (which it will 
need anyway to work with Agent) that calls deliverBody internally.

This works as a primitive because you have all the hooks you need for 
flow-control.  This protocol receives, to its 'makeConnection' method, an 
ITransport which can provide the IProducer 
https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IProducer.html
 and IConsumer 
https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IConsumer.html
 interfaces for flow-control.  It receives dataReceived to tell it a chunk has 
arrived and connectionLost to tell it the stream has terminated.

Unfortunately the client IRequest 
https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IClientRequest.html
 isn't quite as useful (although its relative minimalism should be an 
inspiration to anyone designing a next-generation IRequest more than the 
current IRequest's sprawling kitchen-sink aesthetic).  However, 
IResponse.deliverBody could be applied to IGoodRequest as well.  If we have a 
very similar-to-IResponse shaped IRequest object, say with 'method', 'uri' and 
'headers', and then a 'deliverBody' that delivers the request body in much the 
same way, we could get a gracefully structured streaming request with works 
with a lot of existing code within Twisted.

Then the question is: what to do with IResource?

Right now the flow of processing a request is, roughly:

-> wait for full request to arrive
  -> have HTTPChannel fill out IRequest object
-> look at request.site.resource for the root
 *-> call getChildWithDefault repeatedly, mutating "cursor" state on the 
IRequest as you move (specifically: "prepath" and "postpath" attributes)
  -> eventually reach the leaf Resource, or one with 'isLeaf' set on it, and 
delegate producing the response to that resource
*-> call resource.render(request)
-> examine the return value; if it's bytes, deliver them and close the 
connection; NOT_DONE_YET, just leave the connection open, 

Instead, I think a good flow would be:

-> receive method/headers from request
-> recurse down from request.site.resource, calling something like nevow's or 
web2's locateChild, but not modifying 'request' at each stage; instead, pass a 
"cursor" object - perhaps indeed just a twisted.python.url.URL - indicating 
where we are in the resource traversal hierarchy.  the reason the 
request.prepath and request.postpath attributes exist is mainly for Resource 
objects to be able to orient themselves within a resource tree and generate 
links.

also, it probably bears some explanation; the signature of the current "get the 
next resource" call is resource.getChildWithDefault(onePathSegment, request) -> 
resource. This is somewhat limiting as it requires you to consume only an 
individual path segment at a time, which can be highly awkward for implementing 
sites that have a URL structure that is, for example, 
/YYYY/MM/DD/HH/MM/index.html.  

Instead, locateChild took the entire remaining path, and returned a 2-tuple of 
a resource, and the _still_ remaining path.  So for the above, you could do:

def locateChild(self, request, path):
    y, m, d = path[:3]
    return ymdresource(y, m, d), path[3:]

that 2-tuple instructs the traversal machinery, "keep going".  One alternative 
that we toyed with for this was to make consuming the path destructive, since 
that made it a lot easier to tell what resource you were "looking at":

def locateChild(self, request, path):
    y, m, d = path.consume(3)
    return ymdresource(y, m, d)

Either of these approaches also let you implement 'isLeaf' attribute without 
special support from the framework; you simply return leaf(), () or 
path.consume(path.length)

-> finally, call .responseForRequest(request) -> IResponse on the final 
Resource and deliver the IResponse to the network.

The way compatibility could be achieved here is to write a wrapper that would 
implement .responseForRequest to first collect the entire body, then synthesize 
a gross old-style-IRequest-like object out of the combination of that body and 
the other information about the resource, then call .getChildWithDefault on it 
a few times, then call the old-style .render_GET, et. al.  The IResponse 
returned from this compatibility .responseForRequest would wrap up calls like 
request.write and turn them into write() calls.

This is long and increasingly rambly, so I should probably stop now, send it, 
and get your feedback.  Does the strategy I'm proposing make sense?  I'm sure 
I'm leaving a ton out so feel free to ask for clarification.  Hopefully I 
didn't leave too many incomplete sentences in the middle.

-glyph

_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python

Re: [Twisted-Python] Streaming HTTP

Reply via email to