Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
Hello people,

I have completed my first set in uploading the OSM/FOSM dataset (350 GB unpacked) to archive.org: http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

We can do something similar with Wikipedia. The bucket size of archive.org is 10 GB, so we need to split up the data in a way that is still useful. I have done this by putting each object on one line; each file contains its full data records plus the parts that belong to the previous and next blocks, so you are able to process the blocks almost standalone.

mike
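A minimal Python sketch of the overlapping-block splitting described above, assuming a one-object-per-line input file. The block size, overlap length and output naming are illustrative assumptions, and this simplified version only carries the tail of the previous block forward rather than overlapping with both neighbours as described.

    #!/usr/bin/env python
    """Split a one-object-per-line dump into roughly 10 GB blocks that share
    an overlap with the preceding block, so each block can be processed
    almost standalone. Block size, overlap and file naming are assumptions."""

    BLOCK_BYTES = 10 * 1024 ** 3   # target block size for archive.org buckets
    OVERLAP_LINES = 1000           # lines carried over from the previous block


    def split_with_overlap(path, prefix):
        block, size, block_no = [], 0, 0
        carry = []                 # tail of the previous block, prepended here
        with open(path, "rb") as src:
            for line in src:
                block.append(line)
                size += len(line)
                if size >= BLOCK_BYTES:
                    write_block(prefix, block_no, carry + block)
                    carry = block[-OVERLAP_LINES:]
                    block, size = [], 0
                    block_no += 1
            if block:
                write_block(prefix, block_no, carry + block)


    def write_block(prefix, block_no, lines):
        with open("%s-%05d.txt" % (prefix, block_no), "wb") as out:
            out.writelines(lines)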
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
There is no such 10 GB limit; see http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (a 238 GB example). ArchiveTeam/WikiTeam is already uploading some dumps to the Internet Archive; if you want to join the effort, use the mailing list https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.

2012/5/18 Mike Dupont <jamesmikedup...@googlemail.com>:
> We can do something similar with Wikipedia. The bucket size of archive.org is 10 GB, so we need to split up the data in a way that is still useful.

--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT http://code.google.com/p/avbot/ | StatMediaWiki http://statmediawiki.forja.rediris.es | WikiEvidens http://code.google.com/p/wikievidens/ | WikiPapers http://wikipapers.referata.com | WikiTeam http://code.google.com/p/wikiteam/
Personal website: https://sites.google.com/site/emijrp/
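For anyone who wants to push a dump chunk to an archive.org item directly, here is a rough Python sketch using the site's S3-like (IAS3) upload API. The item identifier, file name, credentials and metadata values are placeholders, not anything WikiTeam actually uses.

    #!/usr/bin/env python
    """Upload one chunk of a dump to an archive.org item over the IAS3 API.
    Identifier, file name, keys and metadata below are placeholders."""
    import requests

    ACCESS_KEY = "YOUR_IAS3_ACCESS_KEY"    # from archive.org account settings
    SECRET_KEY = "YOUR_IAS3_SECRET_KEY"
    ITEM = "example-wiki-history-dump"     # hypothetical item identifier
    FILENAME = "part-00001.xml.7z"         # hypothetical chunk name

    headers = {
        "authorization": "LOW %s:%s" % (ACCESS_KEY, SECRET_KEY),
        "x-archive-auto-make-bucket": "1",          # create the item if missing
        "x-archive-meta-title": "Example wiki full-history dump",
    }

    with open(FILENAME, "rb") as f:
        response = requests.put(
            "https://s3.us.archive.org/%s/%s" % (ITEM, FILENAME),
            data=f,
            headers=headers,
        )
    response.raise_for_status()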
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
On Thu, May 17, 2012 at 6:06 AM, John <phoenixoverr...@gmail.com> wrote:
> If you're willing to foot the bill for the new hardware I'll gladly prove my point

Given the millions of dollars that Wikipedia has, it should not be a problem to provide such resources for a good cause like that.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
I'd like to point out that the increasingly technical nature of this conversation probably belongs either on wikitech-l or off-list, and that the strident tone of the comments is fast approaching inappropriate.

Alex
Wikimedia-l list administrator

2012/5/17 Anthony <wikim...@inbox.org>:
> On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverr...@gmail.com> wrote:
>> On Thu, May 17, 2012 at 1:52 AM, Anthony <wikim...@inbox.org> wrote:
>>> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverr...@gmail.com> wrote:
>>>> Anthony, the process is linear: you have PHP inserting X number of rows per Y time frame.
>>>
>>> Amazing. I need to switch all my databases to MySQL. It can insert X rows per Y time frame, regardless of whether the database is 20 gigabytes or 20 terabytes in size, regardless of whether the average row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a RAID array or a cluster of servers, etc.
>>
>> When referring to X over Y time, it's an average of, say, 1000 revisions per minute; any X over Y period must be considered with averages in mind, or getting a count wouldn't be possible.
>
> The *average* en.wikipedia revision is more than twice the size of the *average* simple.wikipedia revision. The *average* performance of a 20 gig database is faster than the *average* performance of a 20 terabyte database. The *average* performance of your laptop's thumb drive is different from the *average* performance of a(n array of) drive(s) which can handle 20 terabytes of data.
>
>> If you set up your server/hardware correctly it will compress the text information during insertion into the database.
>
> Is this how you set up your simple.wikipedia test? How long does it take to import the data if you're using the same compression mechanism as WMF (which, you didn't answer, but I assume is concatenation and compression)? How exactly does this work during insertion anyway? Does it intelligently group sets of revisions together to avoid decompressing and recompressing the same revision several times? I suppose it's possible, but that would introduce quite a lot of complication into the import script, slowing things down dramatically. What about the answers to my other questions?
>
>>> If you want to put your money where your mouth is, import en.wikipedia. It'll only take 5 days, right?
>>
>> If I actually had a server or the disc space to do it I would, just to prove your smartass comments as stupid as they actually are. However, given my current resource limitations (fairly crappy internet connection, older laptops, and lack of HDD) I tried to select something that could give reliable benchmarks. If you're willing to foot the bill for the new hardware I'll gladly prove my point.
>
> What you seem to be saying is that you're *not* putting your money where your mouth is.
>
> Anyway, if you want, I'll make a deal with you. A neutral third party rents the hardware at Amazon Web Services (AWS). We import the simple.wikipedia full history (concatenating and compressing during import). We take the ratio of the number of revisions in en.wikipedia to the number in simple.wikipedia. We import the en.wikipedia full history (concatenating and compressing during import). If the ratio of the time it takes to import en.wikipedia vs. simple.wikipedia is greater than or equal to twice the revision ratio, then you reimburse the third party. If the ratio of import times is less than twice the revision ratio (you claim it is linear, therefore it would be the same ratio), then I reimburse the third party.
>
> Either way, we save the new dump, with the processing already done, and send it to archive.org (and to WMF if they're willing to host it). So we actually get a useful result out of this; it's not just for the purpose of settling an argument. Either of us can concede defeat at any point and stop the experiment. At that point, if the neutral third party wishes to pay to continue the job, s/he would be responsible for the additional costs.
>
> Shouldn't be too expensive. If you concede defeat after 5 days, then your CPU-time costs are $54 (assuming an Extra Large High-Memory instance). Including 4 terabytes of EBS (which should be enough if you compress on the fly) for 5 days, it should be less than $100. I'm tempted to do it even if you don't take the bet.
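For what it's worth, the quoted $54 is consistent with roughly $0.45 per instance-hour over 5 days (0.45 × 24 × 5 = 54). The bet itself can be restated as a small Python check; the four measured values below are placeholders to be filled in after the two imports, not real figures.

    # The proposed bet, restated. All four values are placeholders to be
    # measured during the experiment; none of them are real figures.
    simple_revisions = 1          # revision count, simple.wikipedia dump
    en_revisions = 1              # revision count, en.wikipedia dump
    simple_import_hours = 1.0     # wall-clock import time, simple.wikipedia
    en_import_hours = 1.0         # wall-clock import time, en.wikipedia

    revision_ratio = en_revisions / float(simple_revisions)
    time_ratio = en_import_hours / simple_import_hours

    if time_ratio >= 2 * revision_ratio:
        # Import time grew much faster than the revision count.
        print("John reimburses the third party")
    else:
        # Import time stayed close to linear in the revision count.
        print("Anthony reimburses the third party")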
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
On 17/05/12 12:49, Anthony wrote:
> Please have someone at WMF coordinate this so that there aren't multiple requests made. In my opinion, it should preferably be made by a WMF employee. Fill out the form at https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry and tell them you want to create a public data set which is a snapshot of the English Wikipedia. We can coordinate any questions, and any implementation details, on a separate list.

That's a fantastic idea, and it would give the English Wikipedia yet another public replica for very little effort. I would imagine that if they are willing to host enwiki, they may also be willing to host most, or all, of the other projects. It would also mean that running Wikipedia data-munching experiments on EC2 becomes much easier.

Neil
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
> In fact, I think someone at WMF should contact Amazon and see if they'll let us conduct the experiment for free, in exchange for us creating the dump for them to host as a public data set (http://aws.amazon.com/publicdatasets/).

That sounds like an excellent plan. At the same time, it might be useful to get Archive Team involved:

* They have warm bodies (always useful, one can never have enough volunteers ;)
* They have experience with very large datasets.
* They'd be very happy to help (it's their mission).
* Some of them may be able to provide Sufficient Storage(tm) and server capacity. Saves us the Amazon AWS bill.
* We might set a precedent where others might provide their data to AT directly too.

AT's mission dovetails nicely with ours. We provide the sum of all human knowledge to people; AT ensures that the sum of all human knowledge is not subtracted from.

sincerely,
Kim Bruning
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
I'll run a quick benchmark and import the full history of simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.

On Thu, May 17, 2012 at 12:26 AM, John <phoenixoverr...@gmail.com> wrote:
> Toolserver is a clone of the WMF servers minus files; they run a database replication of all wikis. These times are dependent on available hardware and may vary, but should provide a decent estimate.
>
> On Thu, May 17, 2012 at 12:23 AM, Anthony <wikim...@inbox.org> wrote:
>> On Thu, May 17, 2012 at 12:18 AM, John <phoenixoverr...@gmail.com> wrote:
>>> Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for exactly how to import an existing dump. I know the process of re-importing a cluster for the Toolserver normally takes just a few days when they have the needed dumps.
>>
>> Toolserver doesn't have full history, does it?
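For reference, a benchmark along these lines could be as simple as timing MediaWiki's stock importDump.php over a decompressed dump stream. The sketch below is Python driving the standard maintenance scripts; the install path and dump file name are assumptions, and a real run would also need the link-table rebuild passes afterwards.

    #!/usr/bin/env python
    """Time a full-history XML dump import into a local MediaWiki install.
    Uses MediaWiki's stock maintenance scripts; paths are placeholders."""
    import subprocess
    import time

    MEDIAWIKI_DIR = "/var/www/wiki"                        # assumed install path
    DUMP = "simplewiki-latest-pages-meta-history.xml.bz2"  # assumed dump file

    start = time.time()

    # Decompress on the fly and feed the XML stream to importDump.php,
    # which reads from stdin when no file argument is given.
    with subprocess.Popen(["bzcat", DUMP], stdout=subprocess.PIPE) as bzcat:
        subprocess.run(["php", "maintenance/importDump.php"],
                       stdin=bzcat.stdout, cwd=MEDIAWIKI_DIR, check=True)

    # Secondary data (recent changes, site statistics) is rebuilt separately.
    subprocess.run(["php", "maintenance/rebuildrecentchanges.php"],
                   cwd=MEDIAWIKI_DIR, check=True)

    hours = (time.time() - start) / 3600.0
    print("import wall-clock time: %.1f hours" % hours)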
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
On Thu, May 17, 2012 at 12:30 AM, John <phoenixoverr...@gmail.com> wrote:
> I'll run a quick benchmark and import the full history of simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.

Simple.wikipedia is nothing like en.wikipedia. For one thing, there's no need to turn on $wgCompressRevisions with simple.wikipedia. Is $wgCompressRevisions still used? I haven't followed this in quite a while.
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
Well, to be honest, I am still upset about how much data is deleted from Wikipedia because it is "not notable"; there are so many articles that I might be interested in that are lost in the same garbage as spam and other things. We should make non-notable but non-harmful articles available in the backups as well.

mike

On Thu, May 17, 2012 at 2:28 AM, Kim Bruning <k...@bruning.xs4all.nl> wrote:
> On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
>> I know from experience that a wiki can be re-built from any one of the dumps that are provided; (pages-meta-current), for example, contains everything needed to reboot a site except its user database (names/passwords etc.). See http://www.mediawiki.org/wiki/Manual:Moving_a_wiki
>
> Sure. Does this include all images, including Commons images, eventually converted to operate locally?
>
> I'm thinking about full snapshot-and-later-restore, say 25 or 50 years from now, or in an academic setting (or, FSM forbid, in a worst-case scenario, knock on wood). That's what the AT folks are most interested in.
>
> ==Fire Drill==
>
> Has anyone recently set up a full external duplicate of (for instance) en.wp? This includes all images, all discussions, all page history (excepting the user accounts and deleted pages).
>
> This would be a useful and important exercise, possibly to be repeated once per year. I get a sneaky feeling that the first few iterations won't go so well.
>
> I'm sure AT would be glad to help out with the running of these fire drills, as it seems to be in line with their mission.
>
> sincerely,
> Kim Bruning

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
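On the images question: the XML dumps do not carry the file binaries, so a full external duplicate would also need an image pass. Below is a rough Python sketch of one way to do that with the standard MediaWiki API; the endpoint and output directory are examples, and for en.wp most files actually live on Commons, which would need the same treatment.

    #!/usr/bin/env python
    """Download the original file for every local image on a wiki via the
    MediaWiki API (list=allimages). Endpoint and output dir are examples."""
    import os
    import requests

    API = "https://simple.wikipedia.org/w/api.php"   # example endpoint
    OUT = "images"
    os.makedirs(OUT, exist_ok=True)

    params = {"action": "query", "list": "allimages",
              "aiprop": "url", "ailimit": "500", "format": "json"}

    session = requests.Session()
    while True:
        data = session.get(API, params=params).json()
        for img in data["query"]["allimages"]:
            path = os.path.join(OUT, img["name"])
            with open(path, "wb") as out:
                out.write(session.get(img["url"]).content)
        if "continue" not in data:
            break
        params.update(data["continue"])   # carry the continuation token forward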
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverr...@gmail.com> wrote:
> Anthony, the process is linear: you have PHP inserting X number of rows per Y time frame.

Amazing. I need to switch all my databases to MySQL. It can insert X rows per Y time frame, regardless of whether the database is 20 gigabytes or 20 terabytes in size, regardless of whether the average row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a RAID array or a cluster of servers, etc.

> Yes, rebuilding the externallinks, links, and langlinks tables will take some additional time and won't scale.

And this is part of the process too, right?

> However, I have been working with the toolserver since 2007 and I've lost count of the number of times that the TS has needed to re-import a cluster (s1-s7), and even enwiki can be done in a semi-reasonable timeframe.

Re-importing how? From the compressed XML full-history dumps?

> The WMF actually compresses all text blobs, not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is WMF using gzip or object?

> Complete download and decompression of simple only took 20 minutes on my 2-year-old consumer-grade laptop with a standard home cable internet connection; the same download on the toolserver (minus decompression) was 88s. Yeah, importing will take a little longer, but shouldn't be that big of a deal.

For the full-history English Wikipedia it *is* a big deal. If you think it isn't, stop playing with simple.wikipedia, and tell us how long it takes to get a mirror of en.wikipedia up and running. Do you plan to run compressOld.php? Are you going to import everything in plain text first, and *then* start compressing? Seems like an awful lot of wasted hard drive space.

> There will also be some needed cleanup tasks. However, the main issue, archiving and restoring WMF wikis, isn't an issue, and with moderately recent hardware is no big deal. I'm putting my money where my mouth is, and getting actual valid stats and figures. Yes, it may not be exactly a 1:1 ratio when scaling up, but given the basics of how importing a dump functions it should remain close to the same ratio.

If you want to put your money where your mouth is, import en.wikipedia. It'll only take 5 days, right?
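On the gzip-versus-object question, here is a rough Python sketch of how the old_flags values documented on Manual:Text_table are usually interpreted. The 'gzip' branch assumes the raw-deflate (gzdeflate-style) encoding described there; database access, PHP unserialization and external storage are deliberately left out of this sketch.

    import zlib

    def decode_text_row(old_text, old_flags):
        """Decode one row of MediaWiki's text table according to old_flags.
        Sketch only: 'object' (serialized HistoryBlob) and 'external'
        (pointer into external storage) rows are not handled here."""
        flags = old_flags.split(",") if old_flags else []

        if "external" in flags:
            # old_text is a pointer such as "DB://cluster/id", not the text.
            raise NotImplementedError("external storage pointer: %r" % old_text)

        if "object" in flags:
            # Serialized PHP object (e.g. a concatenated-revisions blob);
            # unserializing PHP from Python is beyond this sketch.
            raise NotImplementedError("serialized HistoryBlob object")

        if "gzip" in flags:
            # Assumed raw deflate data (PHP gzdeflate), hence the negative wbits.
            old_text = zlib.decompress(old_text, -zlib.MAX_WBITS)

        if "utf-8" in flags:
            return old_text.decode("utf-8")
        return old_text.decode("latin-1")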