Jameson Scanlon wrote:
> I should state the following items of information in response to the
> email correspondence received:
> 
> 1)    Windows version information (I am not providing the full 'winver'
> response obtained, because it's probably not necessary – all I imagine
> you'd need to know is the approximate Windows OS on which I am
> attempting to download the relevant information).
> 
> Microsoft (R) Windows
> Version 5.1 (Build 2600.xpsp_sp3_gdr.080814-1236 : Service Pack 3)
> Copyright (C) 2007 Microsoft Corporation

The information that might actually be relevant is whether the disk 
you're trying to download the dump to is using FAT or NTFS.  FAT32 only 
supports files up to 4 GiB, while NTFS should be able to handle larger 
files.
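
If you're not sure which file system a drive is using, the quickest 
check is the drive's Properties dialog in Explorer.  For a scripted 
check, here's a rough sketch using Python and the Win32 API (just an 
illustration; the drive letter is an example):

    import ctypes

    def filesystem_name(root=u"C:\\"):
        # Ask Windows for the volume information; the last buffer
        # receives the file system name ("NTFS", "FAT32", ...).
        fs_name = ctypes.create_unicode_buffer(256)
        ctypes.windll.kernel32.GetVolumeInformationW(
            ctypes.c_wchar_p(root), None, 0, None, None, None,
            fs_name, len(fs_name))
        return fs_name.value

    print(filesystem_name(u"C:\\"))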

> I should have stated in my original message that sometimes it is
> possible for me to download more than 4 GB, but that (for some reason
> or other) the download cuts out (I don't know why).

Well, if so, that does kind of suggest that it's not the file system 
that's the problem.

> 3)    As a separate point, it occurs to me that one of the reasons why
> the download might cut out is that there is a sequence of servers
> (according to tracert) upon which I rely for the download to proceed.
> I could be wrong, but all it may take is one server (for whatever
> reason) deciding that the download is problematic for the whole file
> download to fail.

The servers listed by tracert are only passing IP data packets between 
your computer and Wikimedia's server.  They don't know or care if you're 
downloading one big file or several small ones, so they shouldn't make 
any difference.

However, if your browser is configured to use a proxy, and the proxy 
can't handle large files properly, that could indeed be a problem.

>       It also seems like a good idea to split large files up using a file
> splitter (whichever one takes your fancy), as large file downloads
> would seem to be problematic for most people who only have access to
> networks with limited connection speeds.
> 
>       It occurs to me that, given the randomness of this problem, this
> response might also be correspondingly random.  Still, how long might
> it take to organise something in the way of (perhaps Unix-script-automated?)
> file splitting for the larger Wikipedia database download files?

It wouldn't be difficult to do at all; the major issue, I'd assume, is 
that we'd have to store all the data twice if we wanted to provide both 
single-file and split versions of the dumps.

(Technically, it should be possible to write a PHP script or something 
to deliver individual chunks from a single large file, but that'd have 
its own complications.)
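
(For what it's worth, the chunk-extraction logic itself is trivial; 
here's a rough sketch in Python rather than PHP, with a made-up file 
name and chunk size just for illustration:

    CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB per chunk

    def read_chunk(path, index):
        # Seek straight to the requested chunk rather than storing
        # pre-split copies of the dump on disk.
        with open(path, "rb") as f:
            f.seek(index * CHUNK_SIZE)
            return f.read(CHUNK_SIZE)

    # e.g. the fourth 100 MiB chunk of the dump:
    data = read_chunk("enwiki-pages-articles.xml.bz2", 3)

The complications would be more on the serving side than in the file 
handling itself.)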

Anyway, if the problem is that the download gets interrupted halfway 
through, what you really want to do is use a download client (such as 
wget -c) that knows how to resume interrupted downloads from where they 
left off.  The latest versions of Firefox apparently have some limited 
support for that, but I'm not sure if there's any way to get Firefox to 
resume a download once it's decided it's failed.
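
For example, something along these lines (substitute the URL of 
whichever dump you're actually after; this one is just illustrative):

    wget -c http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

If the connection drops, rerunning the same command picks up from 
wherever the partial file left off instead of starting over.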

> PS – If it were ever the case that BitTorrent were used for the
> dissemination of large files (there has been some mention of this on
> the Wikipedia database download talk page), I can still imagine that
> there might be problems with trying to propagate the WHOLE of such a
> large file (~14 GB) – though this assertion might run contrary to
> other people's experiences.

Given that people routinely use BitTorrent to download movie files that 
run to several dozen gigabytes, I don't think it should have any problem 
with a mere 14 GiB database dump.

> Anyhow, it occurs to me that, in the interests
> of redundancy, it would be worthwhile to figure out whether there's a
> way of changing the structure of the Wikipedia database download so
> that, even if only the first 1 GB of the database were downloaded, it
> would still be possible to read the information in it (perhaps this is
> already the case – but, from what I gather, once an incomplete
> database dump is downloaded, it is pretty useless, unless someone can
> correct me).

Actually, a truncated database dump should be perfectly usable; it just 
won't have all the data in it.  Indeed, for some purposes, even a piece 
from the middle of the dump file can be used to extract useful data, 
although many standard tools won't be able to decompress and parse it.
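
As a rough illustration (assuming the usual single-stream 
bzip2-compressed XML dump, and using Python just as an example), a 
streaming decompressor will happily hand over everything up to the 
point where a truncated file ends:

    import bz2

    # Decompress as much of a (possibly truncated) dump as exists.
    decomp = bz2.BZ2Decompressor()
    with open("enwiki-pages-articles.xml.bz2", "rb") as dump:
        with open("partial.xml", "wb") as out:
            while True:
                chunk = dump.read(1024 * 1024)
                if not chunk:
                    break  # end of the (possibly truncated) file
                out.write(decomp.decompress(chunk))

The resulting XML simply stops in mid-stream, but everything before the 
cut-off point is readable.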

-- 
Ilmari Karonen
