The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was already started by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then, but it doesn't make sense: the in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out. So, I am very sorry for having needlessly held things up.

(I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)

Ariel

On Mon, 23-02-2009 at 00:37 +0100, Gerard Meijssen wrote:
> Hoi,
> There have been previous offers for developer time and for hardware...
> Thanks,
>      GerardM
>
> 2009/2/23 Platonides <[email protected]>
>
> > Robert Ullmann wrote:
> > > Hi,
> > >
> > > Maybe I should offer a constructive suggestion?
> >
> > They are better than rants :)
> >
> > > Clearly, trying to do these dumps (particularly "history" dumps) as it
> > > is being done from the servers is proving hard to manage.
> > >
> > > I also realize that you can't just put the set of daily
> > > permanent-media backups on line, as they contain lots of user info,
> > > plus deleted and oversighted revs, etc.
> > >
> > > But would it be possible to put each backup disc (before sending one
> > > of the several copies off to its secure storage) in a machine that
> > > would filter all the content into a public file (or files)? Then
> > > someone else could download each disc (i.e. a 10-15 GB chunk of
> > > updates) and sort it into the useful files for general download?
> >
> > I don't think they move backup copies off to secure storage. They have
> > the db replicated, and the backup discs would be copies of those same
> > dumps. (Some sysadmin to confirm?)
> > > Then someone can produce a current (for example) English 'pedia XML
> > > file, and with more work the cumulative history files (if we want that
> > > as one file).
> > >
> > > There would be delays: each of your permanent-media backup discs has
> > > to be (probably manually, but changers are available) loaded on the
> > > "filter" system, and I don't know how many discs WMF generates per
> > > day. (;-) And then it has to filter all the revision data, etc. But it
> > > still would easily be available for others in 48-72 hours, which beats
> > > the present ~6 weeks when the dumps are working.
> > >
> > > No shortage of people with a box or two and any number of Tbyte hard
> > > drives that might be willing to help, if they can get the raw backups.
> >
> > The problem is that WMF can't provide that raw unfiltered information.
> > Perhaps you could donate a box on the condition that it could only be
> > used for dump processing, but giving out unfiltered data would be too
> > risky.
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
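[Editor's note: the incremental approach Ariel mentions above (a crawler that requests pages changed since the latest xml dump) can be sketched against the public MediaWiki API's `list=recentchanges` module. This is an illustrative assumption, not Ariel's actual implementation; the endpoint URL, timestamp format, and parameter choices below are all hypothetical.]

```python
# Hypothetical sketch of an incremental dump-updater: ask the MediaWiki API
# which pages changed since the last dump's timestamp, so that only those
# pages need to be re-fetched to keep a local XML copy current.
# Endpoint and parameters are assumptions, not WMF's actual setup.
import json
import urllib.parse
import urllib.request


def build_changes_query(api_url, since_timestamp, limit=500):
    """Build a list=recentchanges query URL for changes back to `since_timestamp`."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcend": since_timestamp,   # oldest change to return (API walks newest -> oldest)
        "rcprop": "title|timestamp",
        "rclimit": limit,
        "format": "json",
    })
    return f"{api_url}?{params}"


def changed_titles_since(api_url, since_timestamp):
    """Fetch one page of titles changed since `since_timestamp` (deduplicated)."""
    with urllib.request.urlopen(build_changes_query(api_url, since_timestamp)) as resp:
        data = json.load(resp)
    return sorted({rc["title"] for rc in data["query"]["recentchanges"]})
```

A real crawler would also follow the API's continuation parameter to page through all changes, then fetch each changed title (e.g. via Special:Export) and splice the fresh revisions into the local XML file.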
