Re: Do I have to use threads?
On Jan 7, 5:38 pm, MRAB pyt...@mrabarnett.plus.com wrote:
> Jorgen Grahn wrote:
>> On Thu, 2010-01-07, Marco Salden wrote:
>>> On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
>>>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>>>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>>>> bye
>>>> Philip
>>> Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.
>> Normally when you do HTTP in parallel over several TCP sockets, it has nothing to do with CPU overhead. You just don't want every GET to be delayed just because the server(s) are lazy responding to the first few; or you might want to read the text of a web page and the CSS before a few huge pictures have been downloaded. His "I have to [do them] simultaneously" makes me want to ask "Why?".
>> If he's expecting *many* pictures, I doubt that the parallel download will buy him much. Reusing the same TCP socket for all of them is more likely to help, especially if the pictures aren't tiny. One long-lived TCP connection is much more efficient than dozens of short-lived ones. Personally, I'd popen() wget and let it do the job for me.
> From my own experience: I wanted to download a number of webpages. I noticed that there was a significant delay before a site would reply, and an especially long delay for one of them, so I used a number of threads, each one reading a URL from a queue, performing the download, and then reading the next URL, until there were none left (actually, until it read the sentinel None, which it put back for the other threads). The result? A shorter total download time, because it could be downloading one webpage while waiting for another to reply. (Of course, I had to make sure that I didn't have too many threads, because that might've put too many demands on the website -- not a nice thing to do!)

A fair few of my scripts require multiple uploads and downloads, and I always use threads for them. I was once using an API which was quite badly designed: I got a list of UserIds from one API call and then had to query another API method to get info on each of the UserIds from the first call. I could have used Twisted, but in the end I just made a simple thread pool (30 threads and an in/out Queue). The result? A *massive* speedup, even with the extra complication of waiting until all the threads are done and then grouping the results together from the output Queue. Since then I always use native threads.

Tom
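For readers who want to try Tom's approach, a thread pool like the one he describes can be sketched in a few lines of Python 2 (the Queue module became queue in Python 3). This is an illustration, not Tom's actual code; get_info() is a hypothetical stand-in for the per-user API call:

    import threading
    from Queue import Queue        # 'queue' in Python 3

    NUM_THREADS = 30               # Tom's figure

    def get_info(user_id):
        # Hypothetical stand-in for the second, per-user API call.
        return {'id': user_id}

    def worker(in_q, out_q):
        while True:
            user_id = in_q.get()
            if user_id is None:    # sentinel: no more work
                break
            out_q.put(get_info(user_id))

    def query_all(user_ids):
        in_q, out_q = Queue(), Queue()
        threads = [threading.Thread(target=worker, args=(in_q, out_q))
                   for _ in xrange(NUM_THREADS)]
        for t in threads:
            t.start()
        for uid in user_ids:
            in_q.put(uid)
        for _ in threads:          # one sentinel per worker
            in_q.put(None)
        for t in threads:          # wait until all the threads are done...
            t.join()
        # ...then group the results together from the output Queue
        return [out_q.get() for _ in user_ids]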
Re: Do I have to use threads?
On Wed, 2010-01-06, Gary Herron wrote:
> aditya shukla wrote:
>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads? Please point me in the right direction.
>> Thanks
>> Aditya
> You've been given some bad advice here.
> First -- threads are lighter-weight than processes, so threads are probably *more* efficient. However, with only five threads/processes, the difference is probably not noticeable. (If the prejudice against threads comes from concerns over the GIL -- that also is a misplaced concern in this instance. Since you only have one network connection, you will receive only one packet at a time, so only one thread will be active at a time. If the extraction process uses a significant enough amount of CPU time

I wonder what that extraction would be, by the way. Unless you ask for compression of the HTTP data, the images come as-is on the TCP stream.

> so that the extractions are all running at the same time *AND* if you are running on a machine with separate CPUs/cores *AND* you would like the extractions to be running truly in parallel on those separate cores, *THEN*, and only then, will processes be more efficient than threads.)

I can't remember what the bad advice was, but here processes versus threads clearly doesn't matter performance-wise. I generally recommend processes, because how they work is well-known, and they're not as vulnerable to weird synchronization bugs as threads.

> Second, running 5 wgets is equivalent to 5 processes, not 5 threads.
> And third -- you don't have to use either threads *or* processes. There is another possibility which is much more light-weight: asynchronous I/O, available through the low-level select module, or more usefully via the higher-level asyncore module.

Yeah, that would be my first choice too for a problem which isn't clearly CPU-bound. Or my second choice -- the first would be calling on a utility like wget(1).

/Jorgen

--
// Jorgen Grahn <grahn@  Oo  o.   .  .
\X/  snipabacken.se>   O  o   .
Re: Do I have to use threads?
Marco Salden wrote:
> On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>> bye
>> Philip
> Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.
> regards,
> Marco

Threads aren't as hard as some people make out, although it does depend on the problem. If your processes are effectively independent, then threads are probably the right solution. You can turn any function into a thread quite easily; I posted a function for this a while back, sketched again below:

http://groups.google.com/group/comp.lang.python/msg/3361a897db3834b4?dmode=source

Also, it's often a good idea to build in a flag that switches your app from multi-threaded to single-threaded, as it's easier to debug the latter.

Roger.
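The linked function is not reproduced in this thread; the sketch below is a guess at the general idea, not Roger's actual code, together with the single-/multi-threaded debug switch he recommends:

    import threading

    def run_async(func, *args, **kwargs):
        # Run func(*args, **kwargs) in a daemon thread; return the thread.
        t = threading.Thread(target=func, args=args, kwargs=kwargs)
        t.daemon = True
        t.start()
        return t

    # A debug flag like the one Roger suggests: run the same jobs
    # single-threaded when THREADED is False, so problems are easier
    # to trace.
    THREADED = True

    def run_jobs(jobs):
        # 'jobs' is a list of zero-argument callables.
        if THREADED:
            for t in [run_async(job) for job in jobs]:
                t.join()
        else:
            for job in jobs:
                job()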
Re: Do I have to use threads?
On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
> bye
> Philip

Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.

regards,
Marco
Re: Do I have to use threads?
On Thu, 2010-01-07, Marco Salden wrote:
> On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>> bye
>> Philip
> Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.

Normally when you do HTTP in parallel over several TCP sockets, it has nothing to do with CPU overhead. You just don't want every GET to be delayed just because the server(s) are lazy responding to the first few; or you might want to read the text of a web page and the CSS before a few huge pictures have been downloaded. His "I have to [do them] simultaneously" makes me want to ask "Why?".

If he's expecting *many* pictures, I doubt that the parallel download will buy him much. Reusing the same TCP socket for all of them is more likely to help, especially if the pictures aren't tiny. One long-lived TCP connection is much more efficient than dozens of short-lived ones. Personally, I'd popen() wget and let it do the job for me.

/Jorgen

--
// Jorgen Grahn <grahn@  Oo  o.   .  .
\X/  snipabacken.se>   O  o   .
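Jorgen's popen()-wget suggestion is a one-liner per URL with the subprocess module. A minimal sketch, assuming wget is on the PATH; the URLs and directory names are placeholders:

    import subprocess

    # Placeholder URL -> directory pairs.
    jobs = [('http://example.com/page1', 'dir1'),
            ('http://example.com/page2', 'dir2')]

    # -r -l1: follow links one level deep; -nd: no subdirectories;
    # -A: keep only image files; -P: save into the given directory.
    procs = [subprocess.Popen(['wget', '-r', '-l1', '-nd',
                               '-A', 'jpg,jpeg,png,gif',
                               '-P', directory, url])
             for url, directory in jobs]

    for p in procs:    # all the downloads now run in parallel
        p.wait()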
Re: Do I have to use threads?
Jorgen Grahn wrote:
> On Thu, 2010-01-07, Marco Salden wrote:
>> On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
>>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>>> bye
>>> Philip
>> Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.
> Normally when you do HTTP in parallel over several TCP sockets, it has nothing to do with CPU overhead. You just don't want every GET to be delayed just because the server(s) are lazy responding to the first few; or you might want to read the text of a web page and the CSS before a few huge pictures have been downloaded. His "I have to [do them] simultaneously" makes me want to ask "Why?".
> If he's expecting *many* pictures, I doubt that the parallel download will buy him much. Reusing the same TCP socket for all of them is more likely to help, especially if the pictures aren't tiny. One long-lived TCP connection is much more efficient than dozens of short-lived ones. Personally, I'd popen() wget and let it do the job for me.

From my own experience: I wanted to download a number of webpages. I noticed that there was a significant delay before a site would reply, and an especially long delay for one of them, so I used a number of threads, each one reading a URL from a queue, performing the download, and then reading the next URL, until there were none left (actually, until it read the sentinel None, which it put back for the other threads). The result? A shorter total download time, because it could be downloading one webpage while waiting for another to reply. (Of course, I had to make sure that I didn't have too many threads, because that might've put too many demands on the website -- not a nice thing to do!)
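A sketch of the scheme MRAB describes, using Python 2's Queue and urllib2 modules (queue and urllib.request in Python 3). The put-back of the None sentinel is the detail that lets a single sentinel stop every worker:

    import threading
    import urllib2             # urllib.request in Python 3
    from Queue import Queue    # 'queue' in Python 3

    def downloader(q, results):
        while True:
            url = q.get()
            if url is None:    # sentinel: no URLs left...
                q.put(None)    # ...put it back for the other threads
                return
            results[url] = urllib2.urlopen(url).read()

    def download_all(urls, num_threads=4):
        q, results = Queue(), {}
        for url in urls:
            q.put(url)
        q.put(None)            # one shared sentinel
        threads = [threading.Thread(target=downloader, args=(q, results))
                   for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results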
Re: Do I have to use threads?
On Jan 7, 2010, at 11:32 AM, Jorgen Grahn wrote:
> On Thu, 2010-01-07, Marco Salden wrote:
>> On Jan 6, 5:36 am, Philip Semanchuk phi...@semanchuk.com wrote:
>>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>>> bye
>>> Philip
>> Yep, the easier and more straightforward the approach, the better: threads are always (programmers'-)error-prone by nature. But my question would be: does it REALLY need to be simultaneous? The CPU/OS only has more overhead doing this in parallel with processes. Measuring sequential processing and then trying to optimize (e.g. for user response or whatever) would be my preferred way to go. Less = more.
> Normally when you do HTTP in parallel over several TCP sockets, it has nothing to do with CPU overhead. You just don't want every GET to be delayed just because the server(s) are lazy responding to the first few; or you might want to read the text of a web page and the CSS before a few huge pictures have been downloaded. His "I have to [do them] simultaneously" makes me want to ask "Why?".

Exactly what I was thinking. He's surely doing something more complicated than his post suggests, and without that detail it's impossible to say whether threads, processes, asynch or voodoo is the best approach.

bye
P
Re: Do I have to use threads?
On Jan 6, 2010, at 12:45 AM, Brian J Mingus wrote:
> On Tue, Jan 5, 2010 at 9:36 PM, Philip Semanchuk phi...@semanchuk.com wrote:
>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>> bye
>> Philip
> Obviously, spawning 5 copies of wget is equivalent to starting 5 threads. The answer is 'yes'.

??? Process != thread
Re: Do I have to use threads?
On Wed, Jan 6, 2010 at 6:24 AM, Philip Semanchuk phi...@semanchuk.com wrote:
> On Jan 6, 2010, at 12:45 AM, Brian J Mingus wrote:
>> On Tue, Jan 5, 2010 at 9:36 PM, Philip Semanchuk phi...@semanchuk.com wrote:
>>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
>>> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
>>> bye
>>> Philip
>> Obviously, spawning 5 copies of wget is equivalent to starting 5 threads. The answer is 'yes'.
> ??? Process != thread

Just like the other nitpicker, it is up to you to explain why the differences, and not the similarities, are relevant to this problem.
Re: Do I have to use threads?
On 04:26 am, adityashukla1...@gmail.com wrote:
> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads? Please point me in the right direction.

See Twisted, http://twistedmatrix.com/ -- in particular, Twisted Web's asynchronous HTTP client:

http://twistedmatrix.com/documents/current/web/howto/client.html
http://twistedmatrix.com/documents/current/api/twisted.web.client.html

Jean-Paul
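A minimal sketch of what this looks like with the getPage helper from the Twisted of that era (later Twisted versions deprecate it in favour of twisted.web.client.Agent); the URLs are placeholders:

    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList
    from twisted.web.client import getPage   # deprecated in later Twisted

    urls = ['http://example.com/a', 'http://example.com/b']  # placeholders

    def report(body, url):
        print '%s: %d bytes' % (url, len(body))

    # getPage() returns a Deferred that fires with the page body; the
    # reactor drives all the downloads concurrently in a single thread.
    dl = DeferredList([getPage(url).addCallback(report, url) for url in urls])
    dl.addCallback(lambda _: reactor.stop())
    reactor.run()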
Do I have to use threads?
Hello people,

I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads? Please point me in the right direction.

Thanks
Aditya
Re: Do I have to use threads?
On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?

No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).

bye
Philip
Re: Do I have to use threads?
On Tue, Jan 5, 2010 at 11:26 PM, aditya shukla adityashukla1...@gmail.com wrote:
> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads? Please point me in the right direction.

Threads in Python are very easy to work with, but they're not very efficient, and in most cases slower than running multiple processes. Look at using multiple processes instead of going with threads; performance will be much better.

> Thanks
> Aditya

--
[ Rodrick R. Brown ]
http://www.rodrickbrown.com
http://www.linkedin.com/in/rodrickbrown
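A minimal multiprocessing sketch along these lines (the module is in the standard library from Python 2.6; the URLs are placeholders). fetch() must be a module-level function so it can be pickled for the worker processes:

    import multiprocessing
    import urllib2             # urllib.request in Python 3

    def fetch(url):
        # Runs in a worker process; returns the page body.
        return url, urllib2.urlopen(url).read()

    if __name__ == '__main__':
        urls = ['http://example.com/%d' % i for i in range(5)]  # placeholders
        pool = multiprocessing.Pool(5)        # one process per URL
        for url, body in pool.map(fetch, urls):
            print '%s: %d bytes' % (url, len(body))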
Re: Do I have to use threads?
On Tue, Jan 5, 2010 at 9:36 PM, Philip Semanchuk phi...@semanchuk.com wrote:
> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads?
> No. You could spawn 5 copies of wget (or curl, or a Python program that you've written). Whether or not that will perform better or be easier to code, debug and maintain depends on the other aspects of your program(s).
> bye
> Philip

Obviously, spawning 5 copies of wget is equivalent to starting 5 threads. The answer is 'yes'.
Re: Do I have to use threads?
Thanks. I will look into multiprocessing.

Aditya
Re: Do I have to use threads?
aditya shukla wrote:
> Hello people, I have 5 directories corresponding to 5 different URLs. I want to download images from those URLs and place them in the respective directories. I have to extract the contents and download them simultaneously. I can extract the contents and do them one by one. My question is: for doing it simultaneously, do I have to use threads? Please point me in the right direction.
> Thanks
> Aditya

You've been given some bad advice here.

First -- threads are lighter-weight than processes, so threads are probably *more* efficient. However, with only five threads/processes, the difference is probably not noticeable. (If the prejudice against threads comes from concerns over the GIL -- that also is a misplaced concern in this instance. Since you only have one network connection, you will receive only one packet at a time, so only one thread will be active at a time. If the extraction process uses a significant enough amount of CPU time so that the extractions are all running at the same time *AND* if you are running on a machine with separate CPUs/cores *AND* you would like the extractions to be running truly in parallel on those separate cores, *THEN*, and only then, will processes be more efficient than threads.)

Second, running 5 wgets is equivalent to 5 processes, not 5 threads.

And third -- you don't have to use either threads *or* processes. There is another possibility which is much more light-weight: asynchronous I/O, available through the low-level select module, or more usefully via the higher-level asyncore module. (Although the learning curve might trip you up, and some people find the programming model for asyncore hard to fathom, I find it more intuitive in this case than threads/processes.) In fact, the asyncore manual page has a ~20-line class which implements a web-page retrieval. You could replace that example's single call to http_client with five calls, one for each of your URLs. Then when you execute the last line (that is, the asyncore.loop() call), the five will be downloading simultaneously. See http://docs.python.org/library/asyncore.html

Gary Herron
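Here is roughly that manual-page example, lightly adapted so that several dispatchers exist before the single asyncore.loop() call (the host names are placeholders); the downloads then interleave in one thread:

    import asyncore
    import socket

    class HTTPClient(asyncore.dispatcher):
        # Essentially the http_client example from the asyncore docs.

        def __init__(self, host, path):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.host = host
            self.buffer = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
            self.received = 0

        def handle_connect(self):
            pass

        def handle_read(self):
            self.received += len(self.recv(8192))

        def handle_close(self):
            self.close()
            print '%s: %d bytes' % (self.host, self.received)

        def writable(self):
            return len(self.buffer) > 0

        def handle_write(self):
            sent = self.send(self.buffer)
            self.buffer = self.buffer[sent:]

    # Placeholder hosts -- create one dispatcher per URL, then loop once.
    for host in ['www.python.org', 'www.example.com']:
        HTTPClient(host, '/')

    asyncore.loop()    # all the downloads proceed concurrently here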