---------------------------------------- Re: Suggestion
Mauro Tortonesi Thu, 13 Jul 2006 05:38:05 -0700

Kumar Varanasi wrote:

> Hello there, I am using WGET in my system to download http files. I see
> that there is no option to download the file faster with multiple
> connections to the server. Are you planning on a multi-threaded version
> of WGET to make downloads much faster?

no, there is no plan to implement parallel download at the moment.
however, please notice that it is highly unlikely that opening more than
one connection with the same server will speed up the download process.
parallel download makes sense only when more than one server is involved.

------------------------------------------------------------

Sorry for the very late reply, but I can't find a more appropriate thread
to post to.

Unfortunately, I had to test and found that opening more than one
connection with the same server can significantly speed up a download.
Unfortunately, because I don't like having to use a more complex and
brittle script and deal with all the download details instead of the
simple wget shell script I was used to.

I have to download 2 files hosted on a California server 8 times a day.
If I download the files from another California server, the download
speed is in the range of Mbytes per second:

$ wget http://fah-web.stanford.edu/daily_team_summary.txt
--18:00:52--  http://fah-web.stanford.edu/daily_team_summary.txt
Resolving fah-web.stanford.edu... 171.65.103.94
Connecting to fah-web.stanford.edu|171.65.103.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 601878 (588K) [text/plain]
Saving to: `daily_team_summary.txt'

100%[=======================================>] 601,878     2.47M/s   in 0.2s

If I download the same file from Brazil, where it will be processed, the
download speed is about 3 to 5 Kbytes/s. My connection is ADSL 1024/512
Kbits/s, and I can download at the full 100 Kbytes/s from any fast server,
except for some servers in the US. This is the traceroute from my city to
that California server:

C:\Documents and Settings\cpn>tracert fah-web.stanford.edu

Tracing route to vspm27.stanford.edu [171.65.103.94] over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  10.1.1.1
  2    23 ms    21 ms    24 ms  BrT-L10-bsace705-vrdef.dsl.brasiltelecom.net.br [200.103.122.254]
  3    20 ms    19 ms    20 ms  BrT-G5-0-750-bsace-core01.brasiltelecom.net.br [201.10.248.113]
  4    20 ms    20 ms    22 ms  BrT-G3-2-bsaco-border.brasiltelecom.net.br [201.10.209.54]
  5   654 ms   652 ms   653 ms  p5-1.core01.mia03.atlas.cogentco.com [154.54.10.13]
  6   653 ms   650 ms   652 ms  p10-0.core01.mia01.atlas.cogentco.com [154.54.2.193]
  7   705 ms   654 ms   654 ms  p5-0.core01.tpa01.atlas.cogentco.com [66.28.4.58]
  8   806 ms   805 ms   774 ms  p5-0.core01.iah01.atlas.cogentco.com [66.28.4.45]
  9   806 ms   849 ms   799 ms  p10-0.core01.sjc01.atlas.cogentco.com [66.28.4.238]
 10   799 ms   807 ms   801 ms  p4-0.core01.sfo01.atlas.cogentco.com [66.28.4.93]
 11   798 ms   807 ms   809 ms  p10-0.core01.sjc04.atlas.cogentco.com [66.28.4.230]
 12   805 ms   807 ms   796 ms  Stanford_University2.demarc.cogentco.com [66.250.7.138]
 13   789 ms   802 ms   810 ms  bbra-rtr.Stanford.EDU [171.64.1.151]
 14   810 ms   802 ms   807 ms  medc-rtr.Stanford.EDU [171.67.1.130]
 15   797 ms   810 ms   799 ms  vspm27.Stanford.EDU [171.65.103.94]

The traffic takes a bad route to the server. I have already complained to
the provider, which is one of the backbones in Brazil, but I doubt they
will do anything about it. Since one of the files is 5.5 Mbytes, its
download makes my processing finish very late. So I had the idea of
testing parallel downloading and found that it can significantly speed up
the download.
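Before getting to the script, a quick back-of-the-envelope check of why a
single connection is so slow on this route: TCP keeps at most one window
of data in flight per round trip, so one connection cannot go faster than
roughly window_size / RTT, no matter how fast both ends are. With the
~800 ms round trips shown above that ceiling is already low, and the 3 to
5 Kbytes/s I actually get corresponds to an effective window of only 3 or
4 Kbytes. Each extra connection gets its own window, which is why
splitting the file into ranges helps. The window sizes in this little
calculation are only guesses for illustration, not measurements:

# One TCP connection moves at most one window per round trip.
# The 0.8 s RTT comes from the traceroute above; the window sizes are
# just example values.
rtt = 0.8
for window_kbytes in (4, 16, 64):
    print 'window %2d Kbytes -> at most %.1f Kbytes/s' % (window_kbytes, window_kbytes / rtt)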
My script is written in Python and uses the pycurl module which, as the
name suggests, is an interface to the libcurl library. I don't want to
make any comparison between wget and curl; it is just that libcurl has
the multi download feature. These are the script outputs when splitting
the file download into 1, 2, 3, 5, 10, 20 and 50 chunks:

Chunks   File size   Total time (s)   Kbytes/s
     1      601878           124.86       4.71
     2      601878            65.40       8.99
     3      601878            54.98      10.69
     5      601878            38.76      15.17
    10      601878            33.69      17.45
    20      601878            18.12      32.45
    50      601878            13.88      42.34

This is the script in case someone cares to inspect it:

######################################
import pycurl, StringIO, sys, re, time

# First request: headers only, to get the file size (Last-Modified is
# captured too, but not used below).
b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://fah-web.stanford.edu/daily_team_summary.txt')
c.setopt(pycurl.HEADER, True)
c.setopt(pycurl.NOBODY, True)
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
b = b.getvalue()
last_modified = re.findall(r'^Last-Modified:\s*(.*)\s*$', b, re.M | re.I)[0]
size = int(re.findall(r'^Content-Length:\s*(.*)\s*$', b, re.M | re.I)[0])

# (chunk_number was presumably varied by hand for the test runs above)
if size > 10000:
    chunk_number = 10
else:
    chunk_number = 1
chunk_size = size / chunk_number

# One easy handle and one buffer per chunk, each requesting its own byte
# range; the last chunk's range is left open-ended.
m = pycurl.CurlMulti()
c = list()
b = list()
for i in range(chunk_number):
    start = chunk_size * i
    if i < chunk_number - 1:
        end = start + chunk_size - 1
    else:
        end = ''
    b.append(StringIO.StringIO())
    c.append(pycurl.Curl())
    c[i].setopt(pycurl.HTTPHEADER, ['Range: bytes=%s-%s' % (start, end)])
    c[i].setopt(pycurl.URL, 'http://fah-web.stanford.edu/daily_team_summary.txt')
    c[i].setopt(pycurl.WRITEFUNCTION, b[i].write)
    m.add_handle(c[i])

# Standard pycurl multi loop: drive all transfers until they are done.
start = time.time()
while True:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break
while num_handles:
    ret = m.select(1.0)
    if ret == -1:
        continue
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
m.close()
total_time = time.time() - start

# Reassemble the chunks in request order.
b = ''.join([x.getvalue() for x in b])

print """\
Chunks: %s
File size: %s
Total time seconds: %.2f
KBytes/s: %.2f
""" % (chunk_number, size, total_time, size / total_time / 1024)
############################################

I can't say parallel download is a good feature to have, since it can be
abused and in most cases gives no benefit. But I can say that in my case
it would be very nice, since the above script still needs some serious
work to be usable and reliable, and communications scripting is not my
main business, so I would rather not have to develop and maintain such a
thing. Instead, I would like a nice and simple wget shell script.

Regards

--
Clodoaldo Pinto Neto
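P.S.: if anyone wants to make the script more robust, the first thing I
would add is a check that the server actually advertises byte-range
support before splitting the request. A minimal sketch, reusing the same
headers-only request as in the script above (a missing Accept-Ranges
header does not strictly prove that ranges are unsupported, so this is
only a conservative check):

import pycurl, StringIO, re

# Fetch only the response headers, exactly like the first request in the
# script above.
b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://fah-web.stanford.edu/daily_team_summary.txt')
c.setopt(pycurl.HEADER, True)
c.setopt(pycurl.NOBODY, True)
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
c.close()
headers = b.getvalue()

# Split the download into chunks only if the server says it accepts
# byte ranges; otherwise fall back to a single connection.
if re.search(r'^Accept-Ranges:\s*bytes', headers, re.M | re.I):
    print 'server advertises byte ranges: safe to download in chunks'
else:
    print 'no Accept-Ranges header: better use a single connection'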
