On Fri, Jan 8, 2010 at 2:37 PM, Gregory Maxwell <[email protected]> wrote:
> Er. I've maintained a non-WMF disaster recovery archive for a long
> time, though it's no longer completely current since the rsync went
> away and web fetching is lossy.
>
> It saved our rear a number of times, saving thousands of images from
> irreparable loss.

While I certainly can't fault your good will, I do find it disturbing
that it was necessary.  Ideally, Wikimedia should have internal
backups of sufficient quality that we don't have to depend on what
third parties happen to have saved for any circumstance short of
meteors falling from the heavens.

> Moreover it allowed things like image hashing before
> we had that in the database, and it would allow perceptual lossy hash
> matching if I ever got around to implementing tools to access the
> output.

If the goal is some version of "do something useful for Wikimedia",
then it actually seems rather bizarre to have the first step be "copy
X TB of gradually changing data to privately owned and managed
servers".  For Wikimedia applications, it would seem much more natural
to make tools and technology available to do such things within
Wikimedia.  That way developers could work on such problems without
having to worry about how much disk space they can personally afford.
Again, there is nothing wrong with you generously doing such things
with your own resources, but ideally running duplicate repositories
for the benefit of Wikimedia should be unnecessary.
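
For what it's worth, the kind of perceptual matching you describe
isn't exotic machinery.  A rough "average hash" sketch in Python
(assuming PIL is available; this is purely illustrative, not a guess
at your actual tooling) would be something like:

    from PIL import Image

    def average_hash(path, size=8):
        # Shrink to an 8x8 grayscale grid and record, per pixel,
        # whether it is brighter than the mean.  Near-duplicate
        # images produce hashes with a small Hamming distance even
        # after lossy re-encoding or rescaling.
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > mean else 0)
        return bits

    def hamming(a, b):
        # Count differing bits; a small distance suggests the same image.
        return bin(a ^ b).count("1")

Running something like that over the repository is exactly the sort of
job that would be easier done on Wikimedia's side of the wire than
against a private mirror.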

> There really are use cases.  Moreover, making complete copies of the
> public data available as dumps to the public is a WMF board supported
> initiative.

I agree with the goal of making WMF content available, but given that
we don't offer any image dump right now, and that a comprehensive dump
as such would be usable by almost no one, I don't think a classic
dump is where we should start.  Even you don't seem to want that.  If
I understand correctly, you'd like to have an easier way to reliably
download individual image files.  You wouldn't actually want to be
presented with some form of monolithic multi-terabyte tarball each
month.

Hence, I would say it makes more sense to discuss ways to make
individual images and user-specified subsets of images more easily
available.  The same gateways that could allow you to keep
synchronized could also help other people to download individual
files.  Other goals could see functions like page export expanded to
include options for downloading all associated image files at the same
time one downloads a set of wikitext.
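
As a straw man for what such a gateway could look like, here is a
rough Python sketch built on the existing api.php interface (the
category name is just a placeholder, and continuation handling is
omitted for brevity):

    import json
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    API = "https://commons.wikimedia.org/w/api.php"

    def image_urls_in_category(category):
        # Yield original-file URLs for every file page in one category.
        query = urlencode({
            "action": "query",
            "generator": "categorymembers",
            "gcmtitle": "Category:" + category,
            "gcmnamespace": 6,   # the File: namespace
            "gcmlimit": 50,
            "prop": "imageinfo",
            "iiprop": "url",
            "format": "json",
        })
        # Identify the client; Wikimedia asks for a descriptive UA.
        req = Request(API + "?" + query,
                      headers={"User-Agent": "image-subset-sketch/0.1"})
        data = json.load(urlopen(req))
        for page in data.get("query", {}).get("pages", {}).values():
            for info in page.get("imageinfo", []):
                yield info["url"]

Nothing there is new infrastructure; the point is that a supported,
documented path for "give me this subset" serves both the mirror
use case and the casual downloader.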

The general point I am trying to make is that if we think about what
people really want, and how the files are likely to be used, then
there may be better delivery approaches than trying to create huge
image dumps.

-Robert Rohde
