Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On Sunday, December 19, 2010, Martin Møller Skarbiniks Pedersen wrote: Should probably be København and not Křbenhavn Thanks Martin, that's evidence that there are still bugs, and that Python's Universal Encoding Detector is probabilistic! ___ WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On 17 December 2010 21:18, Joseph Reagle joseph.2...@reagle.org wrote: On Thursday, December 16, 2010, Federico Leva (Nemo) wrote: I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/ I fixed some of the encoding issues. The DB dump contained different encodings, so the encoding of each diff in the dump is now guessed independently using Python's CharDet (Universal Encoding Detector) library. So now you can read up on the few accented topics in the early Wikipedia including: Göteborg, Köpenhamn, and Křbenhavn. Should probably be København and not Křbenhavn /Martin
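The "Křbenhavn" result is what you would expect if a single-byte title were decoded under the wrong legacy code page: byte 0xF8 is "ø" in ISO-8859-1 but "ř" in ISO-8859-2. The sketch below shows only that effect, using Python's standard codecs. It assumes the title was stored as ISO-8859-1, which the thread does not confirm, and it does not reproduce what the CharDet library itself does.

```python
# Assumption: the page title was stored as single-byte ISO-8859-1 text.
raw = "København".encode("iso-8859-1")   # the "ø" becomes the single byte 0xF8

# The same bytes decoded under two candidate legacy encodings.
as_latin1 = raw.decode("iso-8859-1")     # Western European reading
as_latin2 = raw.decode("iso-8859-2")     # Central European reading

print(as_latin1)
print(as_latin2)
```

Because both decodings succeed without error, a detector can only choose between them statistically, which fits the description of the guess as probabilistic.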
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
From the interpe...@nupedia.com log posted the other day, written by Larry Sanger: Second, a little bit of history will help to explain this as well. I was more or less offered the job of editing Nupedia when I was, as an ABD philosophy graduate student, soliciting Jimbo's (and other friends') advice on a website I was thinking of starting. It was the first I had heard of Jimbo's idea of an open content encyclopedia, and I was delighted to take the job. Nathan
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
Good news from Wiki-research-l in case you're not subscribed to it... Nemo Original Message Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered Date: Thu, 16 Dec 2010 13:53:14 -0500 From: Joseph Reagle I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/ Original Message Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered Date: Fri, 17 Dec 2010 00:03:00 +1100 From: Tim Starling On 16/12/10 23:10, Joseph Reagle wrote: On Wednesday, December 15, 2010, Tim Starling wrote: There were some changes made to the page text that weren't represented in diff_log, specifically changing certain camel-case links to free links. It appears my problems were related to some CR/LF issues not round-tripping between diff and patch, but I hope to be able to address that. And yes, in addition to some of the CamelCase issues, I expect another problem is that if a page is blanked, "Describe the new page here." will reappear outside of the diff_log. I don't think that will be a problem. But there are other problems that I've encountered. UseMod had a deletion feature. It turns out to be easy enough to skip deleted pages, since they don't have a corresponding entry in rclog. It also had an admin-only rename feature, which optionally fixed links in all pages. This accounts for the free link changes I was seeing earlier. And it had a link replacement feature which could be invoked without a page move. These features were rarely used, due to the arcane interface; usually people just moved pages by copying and pasting. But during the free-link conversion, a lot of pages were renamed using the admin-only feature. All these admin-only features were unlogged, but it turns out to be possible to reconstruct page moves, because when a page was moved, its name was updated in rclog but not in diff_log.
By finding the first diff_log entry with the new name, you can roughly work out when the page moves were done. Anyway, I'm developing a script which will import the dump into a modified MediaWiki instance, the idea being that I can then export XML from it. Once it works, I'll upload the XML to somewhere. I'm not sure when that will be. -- Tim Starling
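The move-reconstruction idea Tim describes can be sketched roughly as below. The record layout is hypothetical: the thread does not describe the real rclog and diff_log formats, so each log is modelled here as a list of (timestamp, page_name) tuples, and the timestamp is assumed to identify an edit uniquely in both logs. The function name find_page_moves and the sample values are illustrative only.

```python
def find_page_moves(rclog, diff_log):
    """Return {(old_name, new_name): approximate_move_time}.

    Both arguments are lists of (timestamp, page_name) tuples. This layout
    is an assumption made for illustration, not the real UseMod format.
    """
    # rclog carries the post-move name, even for edits made before the move.
    rc_name = {ts: name for ts, name in rclog}

    # Earliest diff_log timestamp at which each page name appears.
    first_seen = {}
    for ts, name in sorted(diff_log):
        first_seen.setdefault(name, ts)

    moves = {}
    for ts, old_name in sorted(diff_log):
        new_name = rc_name.get(ts)
        # Deleted pages have no rclog entry, so they are skipped here.
        if new_name is not None and new_name != old_name:
            # The first diff_log entry under the new name approximates the move time.
            moves[(old_name, new_name)] = first_seen.get(new_name)
    return moves


# Made-up sample data: a page edited twice under its old name, then once after a move.
rclog = [(100, "Afghanistan"), (200, "Afghanistan"), (300, "Afghanistan")]
diff_log = [(100, "AfghanistaN"), (200, "AfghanistaN"), (300, "Afghanistan")]
moves = find_page_moves(rclog, diff_log)
print(moves)
```

The estimate is only an upper bound on the move time: the move happened somewhere between the last edit under the old name and the first edit under the new one. If the new name never appears in diff_log, the value is None.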
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On 16/12/2010 20:01, Federico Leva (Nemo) wrote: Good news from Wiki-research-l in case you're not subscribed to it... Nemo Original Message Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered Date: Thu, 16 Dec 2010 13:53:14 -0500 From: Joseph Reagle I have the first 10K edits up reconstructed in their various pages at: http://cyber.law.harvard.edu/~reagle/wp-redux/ Amazingly, AfghanistanTransportations still exists as a redirect. I thought there were too many people with time on their hands persecuting such dinosaur tracks. Of course it is now doomed ... Charles
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On Tue, Dec 14, 2010 at 10:54 AM, Tim Starling tstarl...@wikimedia.org wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! This is exciting, because there is lots of article history in here which was assumed to be lost forever. I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope. The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to Usemod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet. I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001. I've put the two log files up on the web, at: http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z The 7-zip archive is only 8.4MB -- much more manageable than today's backups. rclog contains IP addresses. The Usemod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP address would be kept private. That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files. -- Tim Starling I have to say this is super cool. It's like digging up a time capsule right before the 10th anniversary. One of my favorite early edits: This is the new WikiPedia! 
The idea here is to write a complete encyclopedia from scratch, without peer review process, etc. Some people think that this may be a hopeless endeavor, that the result will necessarily suck. We aren't so sure. So, let's get to work! -Chad
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On 12/14/2010 7:54 AM, Tim Starling wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! I guess producing database dumps was easier in those days. Seriously though, this is absolutely fantastic news! --Michael Snow
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
Can these edits be imported into wikipedia in time for the tenth anniversary? I'm assuming some will relate to pages that have since been moved or deleted so I appreciate this won't be an easy project. WereSpielChequers On 14 December 2010 16:16, Michael Snow wikipe...@frontier.com wrote: On 12/14/2010 7:54 AM, Tim Starling wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! I guess producing database dumps was easier in those days. Seriously though, this is absolutely fantastic news! --Michael Snow
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
Deferring to tech views but I'd have thought almost certainly not. There may well be gaps after August 2001 for one thing; importing earlier records would incorrectly imply a complete history was shown of user and page edits. We probably could make a museum piece of them by creating January2001.wikipedia.org though. FT2 On Tue, Dec 14, 2010 at 5:08 PM, WereSpielChequers werespielchequ...@gmail.com wrote: Can these edits be imported into wikipedia in time for the tenth anniversary? I'm assuming some will relate to pages that have since been moved or deleted so I appreciate this won't be an easy project. WereSpielChequers On 14 December 2010 16:16, Michael Snow wikipe...@frontier.com wrote: On 12/14/2010 7:54 AM, Tim Starling wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! I guess producing database dumps was easier in those days. Seriously though, this is absolutely fantastic news! --Michael Snow
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
Hi; Thanks Tim. Congratulations. Is Wikipedia:UuU[1] now out-of-date? Regards, emijrp [1] http://en.wikipedia.org/wiki/Wikipedia:UuU 2010/12/14 Tim Starling tstarl...@wikimedia.org I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! This is exciting, because there is lots of article history in here which was assumed to be lost forever. I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope. The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to Usemod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet. I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001. I've put the two log files up on the web, at: http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z The 7-zip archive is only 8.4MB -- much more manageable than today's backups. rclog contains IP addresses. The Usemod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP address would be kept private. That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files.
-- Tim Starling
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarl...@wikimedia.org wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! This is exciting, because there is lots of article history in here which was assumed to be lost forever. I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope. The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to Usemod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet. I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001. I've put the two log files up on the web, at: http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z The 7-zip archive is only 8.4MB -- much more manageable than today's backups. rclog contains IP addresses. The Usemod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP address would be kept private. That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files. -- Tim Starling AWESOME. This is so cool. I've copied the research list too, since there's many Wikipedia historians that will be eager to see the older versions. 
I hope we can get them up in a browsable way, like nostalgia.wikipedia.org! -- phoebe
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarl...@wikimedia.org wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! This is exciting, because there is lots of article history in here which was assumed to be lost forever.
Wow, this is really, really amazing! I'm not sure just how you avoided having a heart attack after seeing this:
-- HomePage|979586833
1c1
Describe the new page here.
---
This is the new WikiPedia!
Great work! Rob
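The header of the quoted entry, HomePage|979586833, looks like a page name followed by a Unix timestamp. That reading is an assumption, since the thread does not document the diff_log format, but it is consistent with the January 15, 2001 start date mentioned earlier. A minimal sketch of splitting such a header and converting the number to a UTC date (parse_diff_header is an illustrative name, not part of UseMod):

```python
from datetime import datetime, timezone


def parse_diff_header(header):
    """Split a 'PageName|seconds' header into (page_name, UTC datetime).

    Assumes the part after the last '|' is a Unix timestamp in seconds.
    """
    page, _, stamp = header.rpartition("|")
    return page, datetime.fromtimestamp(int(stamp), tz=timezone.utc)


page, when = parse_diff_header("HomePage|979586833")
print(page, when.isoformat())
```

Under that assumption the entry dates to the evening of 15 January 2001 (UTC).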
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
Would prefer on its own wiki as this is comprehensive up to a given date. Maybe January2001.wikipedia.org -- immediate impact. (DNS software cannot handle 2001.wikipedia.org) FT2 On Tue, Dec 14, 2010 at 6:04 PM, phoebe ayers phoebe.w...@gmail.com wrote: On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarl...@wikimedia.org wrote: I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001! This is exciting, because there is lots of article history in here which was assumed to be lost forever. I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope. The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to Usemod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet. I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001. I've put the two log files up on the web, at: http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z The 7-zip archive is only 8.4MB -- much more manageable than today's backups. rclog contains IP addresses. The Usemod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP address would be kept private. 
That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files. -- Tim Starling AWESOME. This is so cool. I've copied the research list too, since there's many Wikipedia historians that will be eager to see the older versions. I hope we can get them up in a browsable way, like nostalgia.wikipedia.org! -- phoebe
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On 15/12/10 04:17, emijrp wrote: Hi; Thanks Tim. Congratulations. Is Wikipedia:UuU[1] now out-of-date?
Yes, the earliest surviving edit is now "This is the new WikiPedia!", made to HomePage by office.bomis.com, presumably Jimmy. Larry signed a comment a short time later from a different IP address, so it wasn't him. Articles were created in the following order:
* HomePage
* WikiPedia
* PhilosophyAndLogic
* UnitedStates
* PopularMusic
* SportS
* MathematicsAndStatistics
* CountriesOfTheWorld
* AaA
* AfghanistaN
* UuU
* TechnologY
* ComputinG
* ComputerSoftware
* TransporT
* NamingConventions
-- Tim Starling
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
I appreciate the challenge in getting old versions posted again. But I'm also interested in the folks, rather more than in CamelCase and UseMod. As I asked somewhere else recently, where are they now? I don't mean outing people; just what do we really know about the Old Bolsheviks, shot or not? (I was rather saddened, talking of Old Bolsheviks, at Stevertigo's recent ban and departure, not because I agreed with him, but he was apparently editing on 13 June 2002, i.e. a year before me, and despite our conflict on this list offered and played with me a couple of games of online go.) Où sont les Wiks d'antan? Charles
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On Tue, Dec 14, 2010 at 11:02 PM, Tim Starling tstarl...@wikimedia.org wrote: HomePage * WikiPedia * PhilosophyAndLogic * UnitedStates * PopularMusic * SportS * MathematicsAndStatistics * CountriesOfTheWorld * AaA * AfghanistaN * UuU * TechnologY * ComputinG * ComputerSoftware * TransporT * NamingConventions
Nice, I have added this as a userpage http://en.wikipedia.org/wiki/User:Mdupont/FirstPages All of them work except for three, which have been deleted as meaningless with no relevant historical value:
20:12, 18 April 2006 RexNL (talk | contribs) deleted AfghanistaN (content was: '{{db|R3:Redirects as a result of an implausible typo}}#REDIRECT Afghanistan')
09:19, 24 May 2005 Thue (talk | contribs) deleted TechnologY (content was: '#REDIRECT Technology')
04:48, 8 March 2007 Raul654 (talk | contribs) deleted NamingConventions (content was: '#REDIRECT wikipedia:Naming conventions')
They should all be restored under the category Museum of Wikipedia! mike -- James Michael DuPont Member of Free Libre Open Source Software Kosova and Albania flossk.org flossal.org
Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered
On 14 December 2010 22:02, Tim Starling tstarl...@wikimedia.org wrote: him. Articles were created in the following order: * HomePage * WikiPedia * PhilosophyAndLogic It's interesting to note our early priorities! http://grey.colorado.edu/wikipedia_2001/979602227.txt Two months later... http://en.wikipedia.org/w/index.php?title=PhilosophyAndLogic&oldid=272836 ...we'd only done the last one of those. -- - Andrew Gray andrew.g...@dunelm.org.uk