Wow! This is the friendliest way to welcome a new Twisted programmer. Great job Glyph! :)
Regards,
Manish

On Sat, Aug 6, 2016 at 3:51 PM, Glyph Lefkowitz <gl...@twistedmatrix.com> wrote:

> On Aug 6, 2016, at 03:48, Randomcoder <randomcod...@gmail.com> wrote:
>
> > Hello,
> >
> > I've been working on a small Twisted program.
>
> Cool, thanks for using Twisted.
>
> > The program makes HTTP requests to a large number of feeds.
> > Twisted is used to speed up the entire process.
> > After the feeds are fetched, they're parsed. Finally they should be
> > written to a database (to simplify the code, that part is left out).
>
> Thanks for including examples, so we know exactly what you're talking
> about! :)
>
> > Feeds are fetched in parallel using gatherResults, and a batch is
> > built. Then all batches are again gathered into a set of batches,
> > a DeferredList is built out of those. A semaphore controls both the
> > batch-level list of deferreds, and a semaphore controls the entire
> > batch list deferred.
> >
> > Currently, the program works ok on 100-150 feeds, and BATCH_SIZE
> > between 5 and 20.
>
> This all seems pretty reasonable and following best practices and
> such...
>
> > However, I notice the program starts to hang for a long time, when
> > the number of feeds goes over 150-200.
>
> Two key questions: what do you mean by "hang" and what is "a long
> time"? Do you mean it's totally unresponsive, or do you mean it's just
> failing to make progress on downloading more feeds?
>
> > To be more precise, at the end of running the program, messages
> > like these are printed, but the program seems to not be very active:
> >
> > Stopping factory <twisted.web.client._HTTP11ClientFactory instance at
> > 0x7f0b7d5f3908>
> >
> > It seems like this is the cleanup phase.
>
> This just means that it is finished making connections. We have to do
> some clean-up around the usefulness of these log messages, sorry :-\.
>
> > I've read what I could find on the topic. I wasn't able to make
> > progress on it, so I'm posting to the mailing list to ask if someone
> > has encountered this before.
> > Maybe it's a common pitfall or issue that other people have also
> > bumped into.
>
> Right now, my guess is this: some of the sites you're contacting have
> very slow proxies, or for some other reason let you *connect* to them,
> but then hang when sent requests. If you're simultaneously requesting
> stuff from a very large number of different sites, this is sort of
> inevitably bound to happen, either based on network problems, or issues
> with the sites themselves. I suspect you thought that the
> connectTimeout argument to Agent would save you from this, but that
> timeout is just for making the initial underlying TCP connection, not
> receiving a full response. What you actually want to do is cancel the
> Deferred returned by Agent.request.
>
> Luckily, https://treq.readthedocs.io/en/latest/ already implements this
> high-level timeout functionality for you, in the form of the 'timeout='
> argument it accepts. If you give that a try, do you see more
> connections timing out as it runs, rather than "hanging" the process
> for long periods of time?
>
> As long as I'm looking at your code, as a way of thanking you for
> providing such a nice specific runnable example, I have a few other
> random thoughts which may be useful to you:
>
> - I see you're importing psycopg. Do you know about
>   https://txpostgres.readthedocs.io/en/latest/ ? You can talk to
>   postgres asynchronously with Twisted.
> - d.addCallback(lambda out: out).addCallback(lambda resp:
>   client.readBody(resp)) can be much more briefly spelled
>   "d.addCallback(client.readBody)". d.addErrback(lambda err: err) does
>   nothing and can just be removed.
> - BrowserLikePolicyForHTTPS() is the default, so you don't need to pass
>   that.
> - clean_up_and_exit will only be called if batchesDef doesn't fail, and
>   if it does fail, it will swallow the exception message. Rather than
>   manually calling `reactor.stop`, you probably want to use react(),
>   <https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react>.
>   This way your function is an API that anyone who wants to use it can
>   call - it just returns a Deferred when it's done - but your __main__
>   block calls react() which will both start and stop the reactor, as
>   well as reporting errors if there's a problem while still shutting
>   down.
>
> Hope some of that code review is helpful - let us know if the treq
> timeout solves the problem or if the issue is somewhere else!
>
> -glyph
>
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python@twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python