----------------------------------------
Re: Suggestion

Mauro Tortonesi
Thu, 13 Jul 2006 05:38:05 -0700

Kumar Varanasi wrote:

   Hello there,

   I am using WGET in my system to download http files. I see that there is no
   option to download the file faster with multiple connections to the server.

   Are you planning on a multi-threaded version of WGET to make
downloads much faster?

no, there is no plan to implement parallel download at the moment.

however, please notice that it is highly unlikely that opening more
than one connection with the same server will speed up the download
process. parallel download makes sense only when more than one server
is involved.
------------------------------------------------------------

Sorry for the very late reply, but I can't find a more appropriate
thread to post to.

Unfortunately, I went ahead and tested this and found that opening more
than one connection to the same server can significantly speed up a
download. Unfortunately, because I don't like having to use a more
complex and brittle script and deal with all the download details
myself, instead of the simple wget shell script I was used to.

I have to download 2 files hosted on a California server 8 times a
day. If I download the files on another server in California, the
download speed is in the range of Mbytes per second:

$ wget http://fah-web.stanford.edu/daily_team_summary.txt
--18:00:52--  http://fah-web.stanford.edu/daily_team_summary.txt
Resolving fah-web.stanford.edu... 171.65.103.94
Connecting to fah-web.stanford.edu|171.65.103.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 601878 (588K) [text/plain]
Saving to: `daily_team_summary.txt'

100%[=======================================>] 601,878     2.47M/s   in 0.2s

If I download the same file here in Brazil, where it will be processed,
the download speed is about 3 to 5 Kbytes/s. My connection is ADSL
1024/512 Kbit/s. I can download at the full 100 Kbytes/s from any fast
server, except for some servers in the US.

This is the traceroute from my city to that California server:

C:\Documents and Settings\cpn>tracert fah-web.stanford.edu

Tracing route to vspm27.stanford.edu [171.65.103.94]
over a maximum of 30 hops:

 1    <1 ms    <1 ms    <1 ms  10.1.1.1
 2    23 ms    21 ms    24 ms  BrT-L10-bsace705-vrdef.dsl.brasiltelecom.net.br [200.103.122.254]
 3    20 ms    19 ms    20 ms  BrT-G5-0-750-bsace-core01.brasiltelecom.net.br [201.10.248.113]
 4    20 ms    20 ms    22 ms  BrT-G3-2-bsaco-border.brasiltelecom.net.br [201.10.209.54]
 5   654 ms   652 ms   653 ms  p5-1.core01.mia03.atlas.cogentco.com [154.54.10.13]
 6   653 ms   650 ms   652 ms  p10-0.core01.mia01.atlas.cogentco.com [154.54.2.193]
 7   705 ms   654 ms   654 ms  p5-0.core01.tpa01.atlas.cogentco.com [66.28.4.58]
 8   806 ms   805 ms   774 ms  p5-0.core01.iah01.atlas.cogentco.com [66.28.4.45]
 9   806 ms   849 ms   799 ms  p10-0.core01.sjc01.atlas.cogentco.com [66.28.4.238]
10   799 ms   807 ms   801 ms  p4-0.core01.sfo01.atlas.cogentco.com [66.28.4.93]
11   798 ms   807 ms   809 ms  p10-0.core01.sjc04.atlas.cogentco.com [66.28.4.230]
12   805 ms   807 ms   796 ms  Stanford_University2.demarc.cogentco.com [66.250.7.138]
13   789 ms   802 ms   810 ms  bbra-rtr.Stanford.EDU [171.64.1.151]
14   810 ms   802 ms   807 ms  medc-rtr.Stanford.EDU [171.67.1.130]
15   797 ms   810 ms   799 ms  vspm27.Stanford.EDU [171.65.103.94]



The traffic takes a bad route to the server. I have already complained
to the provider, which runs one of the backbones in Brazil, but I doubt
they will do anything about it.
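
I think that round-trip time is a big part of the problem: TCP keeps
roughly one receive window in flight per round trip, so at ~800 ms even
a clean single connection has a fairly low ceiling, and loss along that
congested route presumably pushes the real rate further down. A rough
back-of-the-envelope in Python (the 64 KB window is an assumption, a
common default without window scaling):

# Rough per-connection ceiling: about one receive window per round trip.
# The 64 KB window is assumed; the RTT comes from the traceroute above.
rtt = 0.8           # seconds
window = 64 * 1024  # bytes, assumed default receive window
print 'about %.0f KBytes/s per connection at best' % (window / rtt / 1024)

That prints about 80 KBytes/s, still well above the 3 to 5 Kbytes/s I
actually get, so each extra connection has room to add its own share.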

As one of the files is 5.5 Mbytes, its download makes my processing
finish very late. So I had the idea of testing parallel downloading and
found that it can significantly speed up the download.

My script is written in Python and uses the pycurl module, which, as
the name suggests, is an interface to the libcurl library. I don't
want to make any comparison between wget and curl; it is just that
libcurl has the multi interface for doing several transfers in parallel.

These are the script outputs when splitting the file download into 1,
2, 3, 5, 10, 20 and 50 chunks:

Chunks: 1
File size: 601878
Total time seconds: 124.86
KBytes/s: 4.71

Chunks: 2
File size: 601878
Total time seconds: 65.40
KBytes/s: 8.99

Chunks: 3
File size: 601878
Total time seconds: 54.98
KBytes/s: 10.69

Chunks: 5
File size: 601878
Total time seconds: 38.76
KBytes/s: 15.17

Chunks: 10
File size: 601878
Total time seconds: 33.69
KBytes/s: 17.45

Chunks: 20
File size: 601878
Total time seconds: 18.12
KBytes/s: 32.45

Chunks: 50
File size: 601878
Total time seconds: 13.88
KBytes/s: 42.34

This is the script in case someone cares to inspect it:
######################################
import pycurl, StringIO, re, time

url = 'http://fah-web.stanford.edu/daily_team_summary.txt'

# HEAD request first, to learn the file size (and Last-Modified) from the headers.
b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HEADER, True)
c.setopt(pycurl.NOBODY, True)
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
c.close()
headers = b.getvalue()
last_modified = re.findall(r'^Last-Modified:\s*(.*)\s*$', headers, re.M | re.I)[0]
size = int(re.findall(r'^Content-Length:\s*(.*)\s*$', headers, re.M | re.I)[0])

# Only bother splitting files bigger than 10000 bytes.
if size > 10000:
    chunk_number = 10
else:
    chunk_number = 1
chunk_size = size / chunk_number

# One easy handle per chunk, each asking for its own byte range and
# writing into its own buffer.
m = pycurl.CurlMulti()
c = list()
b = list()
for i in range(chunk_number):
    start = chunk_size * i
    if i < chunk_number - 1:
        end = start + chunk_size - 1
    else:
        end = ''  # last chunk is open-ended, so rounding never loses the tail
    b.append(StringIO.StringIO())
    c.append(pycurl.Curl())
    c[i].setopt(pycurl.HTTPHEADER, ['Range: bytes=%s-%s' % (start, end)])
    c[i].setopt(pycurl.URL, url)
    c[i].setopt(pycurl.WRITEFUNCTION, b[i].write)
    m.add_handle(c[i])

# Drive the multi interface until all transfers are done.
start = time.time()
while True:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM: break
while num_handles:
    ret = m.select(1.0)
    if ret == -1: continue
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM: break
m.close()

total_time = time.time() - start
# Reassemble the chunks in order.
data = ''.join([x.getvalue() for x in b])
print """\
Chunks: %s
File size: %s
Total time seconds: %.2f
KBytes/s: %.2f
""" % (chunk_number, size, total_time, size / total_time / 1024)
############################################
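
For clarity, here is the range-splitting arithmetic on its own, same as
in the loop above, so it can be checked by hand (the ranges() helper
exists only for this illustration, it is not in the script):

def ranges(size, chunk_number):
    # Fixed-size chunks; the last one is left open-ended so integer
    # rounding never drops the tail of the file.
    chunk_size = size / chunk_number
    result = []
    for i in range(chunk_number):
        start = chunk_size * i
        end = start + chunk_size - 1 if i < chunk_number - 1 else ''
        result.append('bytes=%s-%s' % (start, end))
    return result

if __name__ == '__main__':
    for r in ranges(601878, 5):
        print r

For 5 chunks of the 601878 byte file it prints bytes=0-120374,
bytes=120375-240749, and so on, with the last range left open.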

I can't say parallel download is a good feature to have in general,
since it can be abused and in most cases gives no benefit. But I can
say that in my case it would be very nice, since the above script still
needs some serious work to be usable and reliable (a sketch of the kind
of checks I mean follows below), and communications scripting is not my
main business, so I would rather not have to develop and maintain such
a thing. Instead I would like a nice and simple wget shell script.
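
Just to give an idea of that kind of work: at the very least each chunk
should come back as 206 Partial Content, and the reassembled data should
match Content-Length before the file is trusted. A sketch (check_chunks
is a hypothetical helper, not something in the script above), which
could be called with the c and b lists and size right after the
download loop finishes:

import pycurl

def check_chunks(handles, buffers, expected_size):
    # Every range request should have been answered with 206 Partial Content.
    for i, handle in enumerate(handles):
        code = handle.getinfo(pycurl.RESPONSE_CODE)
        if code != 206:
            raise RuntimeError('chunk %d: unexpected HTTP status %d' % (i, code))
    # The reassembled data must be exactly as long as the server advertised.
    data = ''.join([buf.getvalue() for buf in buffers])
    if len(data) != expected_size:
        raise RuntimeError('got %d bytes, expected %d' % (len(data), expected_size))
    return data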

Regards
--
Clodoaldo Pinto Neto

