Please cc: (copy) your response(s) to my email, [EMAIL PROTECTED], as I am not subscribed to the list. Thank you.
I've repeatedly tried to download the Wikipedia database dumps (one of 8 gigabytes, the other of 57.5 gigabytes) from a server that supports resuming. The downloads consistently break around the 4 gigabyte mark. The ceiling isn't on my end: I'm running on, and downloading to, NTFS. I've also done test downloads from the same server (download.wikimedia.org), which work fine, and repeated tests of my own bandwidth and network (somewhat slower than it should be, with congestion at times and sporadic dropouts, but since wget supports resuming, that shouldn't be an issue). That rules out those factors. Granted, the download might slow down or be broken off, but why can't it resume past the 4 gigabyte mark?

I used a file-splitting program to break the partially downloaded database file into smaller parts of differing sizes, starting from a 6 gigabyte file. (The 6 gigabyte file resulted from a "lucky patch" when the connection stayed up after resuming a 4 gigabyte file, but that isn't acceptable for my purposes.) Here are my results:

- 6 GB split into 2 GB segments: first 2 GB segment resumed successfully.
- 6 GB split into 3 GB segments: first 3 GB segment resumed successfully.
- 6 GB split into 4.5 GB segments (segment 2 partial): will not resume.
- 6 GB split into 4.1 GB segments (segment 2 partial): will not resume.
- 6 GB split into 3.9 GB segments (segment 2 partial): resumed successfully.

Of course, the original 6 gigabyte partial file couldn't be resumed either. As you are aware, NTFS, while certainly not the Rolls-Royce of filesystems, supports files of multiple exabytes, so a 4 gigabyte ceiling would only apply on a FAT32-formatted partition. Such limits are rare in up-to-date operating systems.
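For whatever it's worth as a diagnostic: 4 GiB is exactly the capacity of a 32-bit file-size counter, so the pattern above (segments under roughly 4 GB resume, segments at or over it don't) is what you'd expect if some component in the chain (for instance, a wget build compiled without large-file support, or an intermediary) truncates file sizes or offsets to 32 bits. This is my speculation, not a confirmed diagnosis; a minimal sketch of the wraparound in Python:

```python
# Show what a 32-bit size field reports for files near the 4 GiB boundary.
GIB = 1024 ** 3  # one gibibyte in bytes

def as_uint32(nbytes: int) -> int:
    """Value a 32-bit unsigned size counter would hold for nbytes."""
    return nbytes & 0xFFFFFFFF  # keep only the low 32 bits

print(as_uint32(2 * GIB))         # 2147483648 -- still representable
print(as_uint32(4 * GIB))         # 0 -- wraps around: looks like an empty file
print(as_uint32(int(4.5 * GIB)))  # 536870912 -- a 4.5 GiB file looks like ~0.5 GiB
```

If something like this is happening, a resume at or past 4 GiB would compute a nonsensical (wrapped) local size and fail, which matches the 3.9 GB success / 4.1 GB failure split observed above.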
I've considered whether the data stream is being corrupted, but wget (to my knowledge) doesn't do error checking on the file contents; it just compares the remote and local file sizes and downloads the remainder if the local file is smaller. And even if the file were being corrupted, the file-splitting program (which adds no headers) should have ameliorated the problem by now by excising the corrupt part, unless either:

1. the corruption is happening at the same point each time; or
2. the server, or something interposed between myself and the server, is blocking the download whenever a resume of the database file is detected at or beyond the 4 gigabyte mark.

I've also tried different download managers: TrueDownloader (an open-source download manager), which is rejected by the server; and GetRight, a good commercial program, which is throttled to 19 KB/s, making even the smaller download take well over 120 hours. That is too slow, especially without knowing whether the file is any good to begin with.

Wikipedia doesn't have tech support, and a search that should encompass their forums turns up nothing about this error/problem, but they do suggest using wget for this particular application, so I would infer that the problem is at least related to wget itself. I am using wgetgui (as I mentioned in my previous post to the mailing list) and yes, all the options are checked correctly. I've double-checked, triple-checked, and quadruple-checked everything. And then I checked again.

The database size itself is irrelevant: it could be 100 gigabytes, and that would present no difficulty from the standpoint of bandwidth. However, the reason we have programs such as wget is to deal with redundancy and resumability for large file downloads. I also see that you've been working on large-file issues in wget since 2002, as well as security issues. But the internet has network protocols to deal with this. What is happening?
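To make the size-comparison behaviour described above concrete: a resuming client stats the local partial file and asks the server for only the missing tail via an HTTP Range header. Here is a minimal sketch in Python (the URL in the usage note is a placeholder, and this assumes a server that honors Range requests with a 206 Partial Content reply); note that a client limited to 32-bit offsets cannot even express a starting byte at or beyond 4 GiB in such a request:

```python
import os
import urllib.request

def resume_request(url: str, local_path: str) -> urllib.request.Request:
    """Build a request for only the bytes we don't yet have locally,
    mirroring the size-compare-then-fetch-remainder resume strategy."""
    have = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url)
    if have:
        # "bytes=N-" asks the server to send from byte offset N to the end.
        req.add_header("Range", f"bytes={have}-")
    return req
```

Usage would look like `resume_request("http://example.org/dump.xml.bz2", "dump.xml.bz2")`; if the local file already held 5 GiB, the request would carry `Range: bytes=5368709120-`, an offset no 32-bit counter can represent.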
Why can't I get the data? Have the network transport protocols failed? Has wget failed? The data is supposed to go from point A to point B--what is stopping that? It doesn't make sense. If I'm running up against a wall, I want to see that wall. If something is failing, I want to know what is failing so I can fix it. Do you have an intermediary server that I can FTP off of to get the wikipedia databases? What about CuteFTP?
