Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-20 Thread Joseph Reagle
On Sunday, December 19, 2010, Martin Møller Skarbiniks Pedersen wrote:
 Should probably be København and not Křbenhavn

Thanks Martin, that's evidence that there are still bugs, and that Python's 
Universal Encoding Detector is probabilistic!
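
For anyone wondering how København can turn into Křbenhavn, here is a
minimal sketch. It assumes the original bytes were ISO-8859-1 and that the
detector guessed ISO-8859-2 for that diff; the thread does not say which
encoding was actually picked. The byte 0xF8 is ø in ISO-8859-1 and ř in
ISO-8859-2, so a near miss between two single-byte encodings is enough.
The function name decode_with is made up for illustration, and only
Python's standard codecs are used, not the detector library itself.

```python
# Hypothetical illustration: decode the same bytes under two candidate encodings.
def decode_with(raw: bytes, encodings=("iso-8859-1", "iso-8859-2")) -> dict:
    """Return {encoding: decoded text} for each candidate encoding."""
    return {enc: raw.decode(enc) for enc in encodings}


# Assumed input: "København" as it would be stored in ISO-8859-1.
raw = "København".encode("iso-8859-1")

for enc, text in decode_with(raw).items():
    print(f"{enc}: {text}")
```

Both decodings succeed without error, which is why a statistical guess is
needed at all and why it can be wrong on short diffs.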

___
WikiEN-l mailing list
WikiEN-l@lists.wikimedia.org
To unsubscribe from this mailing list, visit:
https://lists.wikimedia.org/mailman/listinfo/wikien-l


Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-18 Thread Martin Møller Skarbiniks Pedersen
On 17 December 2010 21:18, Joseph Reagle joseph.2...@reagle.org wrote:
 On Thursday, December 16, 2010, Federico Leva (Nemo) wrote:
 I have the first 10K edits up reconstructed in their various pages at:
    http://cyber.law.harvard.edu/~reagle/wp-redux/

 I fixed some of the encoding issues. The DB dump contained different
 encodings, so the encoding of each diff in the dump is now guessed
 independently using Python's chardet (Universal Encoding Detector) library.

 So now you can read up on the few accented topics in the early Wikipedia 
 including: Göteborg, Köpenhamn, and Křbenhavn.

Should probably be København and not Křbenhavn

/Martin



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-17 Thread Nathan
From the interpe...@nupedia.com log posted the other day, written by
Larry Sanger:

Second, a little bit of history will help to explain this as well.  I was
more or less offered the job of editing Nupedia when I was, as an ABD
philosophy graduate student, soliciting Jimbo's (and other friends')
advice on a website I was thinking of starting.  It was the first I had
heard of Jimbo's idea of an open content encyclopedia, and I was delighted
to take the job.

Nathan



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-16 Thread Federico Leva (Nemo)
Good news from Wiki-research-l in case you're not subscribed to it...

Nemo

 Original Message
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Thu, 16 Dec 2010 13:53:14 -0500
From: Joseph Reagle

I have the first 10K edits up, reconstructed into their various pages, at:
   http://cyber.law.harvard.edu/~reagle/wp-redux/

 Original Message
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Fri, 17 Dec 2010 00:03:00 +1100
From: Tim Starling

On 16/12/10 23:10, Joseph Reagle wrote:
  On Wednesday, December 15, 2010, Tim Starling wrote:
  There were some changes made to the page text that weren't represented
  in diff_log, specifically changing certain camel-case links to free
  links.
  It appears my problems were related to some CR/LF issues not
  round-tripping between diff and patch, but I hope to be able to address
  that. And yes, in addition to some of the CamelCase issues, I expect
  another problem is that if a page is blanked, "Describe the new page
  here." will reappear outside of the diff_log.

I don't think that will be a problem. But there are other problems
that I've encountered.

UseMod had a deletion feature. It turns out to be easy enough to skip
deleted pages, since they don't have a corresponding entry in rclog.

It also had an admin-only rename feature, which optionally fixed links
in all pages. This accounts for the free link changes I was seeing
earlier. And it had a link replacement feature which could be invoked
without a page move. These features were rarely used, due to the
arcane interface; usually people just moved pages by copying and
pasting. But during the free-link conversion, a lot of pages were
renamed using the admin-only feature.

All these admin-only features were unlogged, but it turns out to be
possible to reconstruct page moves, because when a page was moved, its
name was updated in rclog but not in diff_log. By finding the first
diff_log entry with the new name, you can roughly work out when the
page moves were done.

Anyway, I'm developing a script which will import the dump into a
modified MediaWiki instance, the idea being that I can then export XML
from it. Once it works, I'll upload the XML to somewhere. I'm not sure
when that will be.

-- Tim Starling
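
The page-move reconstruction described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
function name estimate_page_moves, the (page_name, timestamp) tuple
format, the sample page names, and the idea of pairing diff_log and rclog
entries by identical timestamp are all hypothetical simplifications, not
the real file formats and not the import script mentioned in the message.

```python
def estimate_page_moves(diff_entries, rc_entries):
    """Estimate page moves from two simplified logs.

    Both arguments are lists of (page_name, timestamp) tuples (assumed format).
    An edit whose name differs between the two logs at the same timestamp is
    treated as belonging to a page that was later renamed. The move time is
    approximated by the first diff_log entry that uses the new name.
    Returns {(old_name, new_name): approx_move_timestamp_or_None}.
    """
    # Name recorded in rclog for each edit (assumes timestamps are unique).
    rc_name_at = {ts: name for name, ts in rc_entries}

    # Earliest diff_log timestamp at which each name appears.
    first_seen = {}
    for name, ts in diff_entries:
        if name not in first_seen or ts < first_seen[name]:
            first_seen[name] = ts

    moves = {}
    for old_name, ts in diff_entries:
        new_name = rc_name_at.get(ts)
        if new_name is not None and new_name != old_name:
            moves[(old_name, new_name)] = first_seen.get(new_name)
    return moves


# Made-up sample data: rclog shows the new name everywhere, diff_log does not.
diff_log = [("OldName", 100), ("OldName", 200), ("NewName", 300)]
rclog = [("NewName", 100), ("NewName", 200), ("NewName", 300)]
print(estimate_page_moves(diff_log, rclog))
```

If the new name never appears in diff_log, the sketch records None for the
move time rather than guessing.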



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-16 Thread Charles Matthews
On 16/12/2010 20:01, Federico Leva (Nemo) wrote:
 Good news from Wiki-research-l in case you're not subscribed to it...

 Nemo

  Original Message
 Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
 Date: Thu, 16 Dec 2010 13:53:14 -0500
 From: Joseph Reagle

 I have the first 10K edits up reconstructed in their various pages at:
 http://cyber.law.harvard.edu/~reagle/wp-redux/

Amazingly, AfghanistanTransportations still exists as a redirect. I 
thought there were too many people with time on their hands persecuting 
such dinosaur tracks. Of course it is now doomed ...

Charles




Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Chad
On Tue, Dec 14, 2010 at 10:54 AM, Tim Starling tstarl...@wikimedia.org wrote:
 I was looking through some old files in our SourceForge project. I
 opened a file called wiki.tar.gz, and inside were three complete
 backups of the text of Wikipedia, from February, March and August 2001!

 This is exciting, because there is lots of article history in here
 which was assumed to be lost forever.

 I've long been interested in Wikipedia's history, and I've tried in
 the past to locate such backups. I asked various people who might have
 had one. I had given up hope.

 The history of particularly old Wikipedia articles, as seen in the
 present Wikipedia database, is incomplete, due to Usemod's policy of
 deleting old revisions of pages after about a month. The script which
 Brion wrote to import the article histories from UseMod to MediaWiki
 only fetched those revisions which hadn't been purged yet.

 I didn't want to believe that those revisions had been lost forever,
 and I even opened the UseMod source code and stared forlornly at the
 unlink() call. What I (and Brion before) missed is that UseMod appends
 a record of every change made to two files, called diff_log and rclog.
 In these two files is a record of every change made to Wikipedia from
 January 15 to August 17, 2001.

 I've put the two log files up on the web, at:

 http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z

 The 7-zip archive is only 8.4MB -- much more manageable than today's
 backups.

 rclog contains IP addresses. The Usemod software made IP addresses of
 logged-in users public, so the people who made these edits had no
 expectation that their IP address would be kept private. That, coupled
 with the passage of time, makes me think that no harm to user privacy
 can come from releasing these files.

 -- Tim Starling


I have to say this is super cool. It's like digging up a time capsule
right before the 10th anniversary. One of my favorite early edits:

This is the new WikiPedia!  The idea here is to write a complete
encyclopedia from scratch, without peer review process, etc.
Some people think that this may be a hopeless endeavor, that
the result will necessarily suck.  We aren't so sure.  So, let's get
to work!

-Chad



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Michael Snow
On 12/14/2010 7:54 AM, Tim Starling wrote:
 I was looking through some old files in our SourceForge project. I
 opened a file called wiki.tar.gz, and inside were three complete
 backups of the text of Wikipedia, from February, March and August 2001!
I guess producing database dumps was easier in those days. Seriously 
though, this is absolutely fantastic news!

--Michael Snow



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread WereSpielChequers
Can these edits be imported into Wikipedia in time for the tenth anniversary?

I'm assuming some will relate to pages that have since been moved or
deleted so I appreciate this won't be an easy project.

WereSpielChequers





Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread FT2
Deferring to tech views but I'd have thought almost certainly not. There
may well be gaps after August 2001 for one thing; importing earlier records
would incorrectly imply a complete history was shown of user and page edits.

We probably could make a museum piece of them by creating 
January2001.wikipedia.org though.

FT2






Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread emijrp
Hi;

Thanks Tim. Congratulations.

Is Wikipedia:UuU[1] now out-of-date?

Regards,
emijrp

[1] http://en.wikipedia.org/wiki/Wikipedia:UuU





Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread phoebe ayers
On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarl...@wikimedia.org wrote:
 I was looking through some old files in our SourceForge project. I
 opened a file called wiki.tar.gz, and inside were three complete
 backups of the text of Wikipedia, from February, March and August 2001!

 This is exciting, because there is lots of article history in here
 which was assumed to be lost forever.

 I've long been interested in Wikipedia's history, and I've tried in
 the past to locate such backups. I asked various people who might have
 had one. I had given up hope.

 The history of particularly old Wikipedia articles, as seen in the
 present Wikipedia database, is incomplete, due to Usemod's policy of
 deleting old revisions of pages after about a month. The script which
 Brion wrote to import the article histories from UseMod to MediaWiki
 only fetched those revisions which hadn't been purged yet.

 I didn't want to believe that those revisions had been lost forever,
 and I even opened the UseMod source code and stared forlornly at the
 unlink() call. What I (and Brion before) missed is that UseMod appends
 a record of every change made to two files, called diff_log and rclog.
 In these two files is a record of every change made to Wikipedia from
 January 15 to August 17, 2001.

 I've put the two log files up on the web, at:

 http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z

 The 7-zip archive is only 8.4MB -- much more manageable than today's
 backups.

 rclog contains IP addresses. The Usemod software made IP addresses of
 logged-in users public, so the people who made these edits had no
 expectation that their IP address would be kept private. That, coupled
 with the passage of time, makes me think that no harm to user privacy
 can come from releasing these files.

 -- Tim Starling

AWESOME. This is so cool. I've copied the research list too, since
there are many Wikipedia historians who will be eager to see the older
versions.

I hope we can get them up in a browsable way, like nostalgia.wikipedia.org!

-- phoebe



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Rob Lanphier
On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarl...@wikimedia.org wrote:
 I was looking through some old files in our SourceForge project. I
 opened a file called wiki.tar.gz, and inside were three complete
 backups of the text of Wikipedia, from February, March and August 2001!

 This is exciting, because there is lots of article history in here
 which was assumed to be lost forever.

Wow, this is really, really amazing!  I'm not sure just how you
avoided having a heart attack after seeing this:
 --
 HomePage|979586833
 1c1
  Describe the new page here.
 ---
  This is the new WikiPedia!

Great work!

Rob
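
The header line in that excerpt ("HomePage|979586833") looks like a page
name followed by a timestamp. A minimal sketch of reading it, assuming the
number is Unix epoch seconds (an assumption, though it is consistent with
the January 15, 2001 start date given in the announcement). The function
name parse_diff_header is made up for illustration.

```python
from datetime import datetime, timezone


def parse_diff_header(line: str):
    """Split a 'PageName|timestamp' header into (name, UTC datetime).

    Assumes the part after the last '|' is a Unix timestamp in seconds.
    """
    name, _, stamp = line.strip().rpartition("|")
    return name, datetime.fromtimestamp(int(stamp), tz=timezone.utc)


name, when = parse_diff_header("HomePage|979586833")
print(name, when.isoformat())
```

Under that assumption the HomePage edit falls on 15 January 2001 (UTC).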



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread FT2
Would prefer on its own wiki as this is comprehensive up to a given date.
Maybe January2001.wikipedia.org -- immediate impact.

(DNS software cannot handle 2001.wikipedia.org)

FT2




Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Tim Starling
On 15/12/10 04:17, emijrp wrote:
 Hi;
 
 Thanks Tim. Congratulations.
 
 Is Wikipedia:UuU[1] now out-of-date?

Yes, the earliest surviving edit is now "This is the new WikiPedia!",
made to HomePage by office.bomis.com, presumably Jimmy. Larry signed a
comment a short time later from a different IP address, so it wasn't
him. Articles were created in the following order:

* HomePage
* WikiPedia
* PhilosophyAndLogic
* UnitedStates
* PopularMusic
* SportS
* MathematicsAndStatistics
* CountriesOfTheWorld
* AaA
* AfghanistaN
* UuU
* TechnologY
* ComputinG
* ComputerSoftware
* TransporT
* NamingConventions

-- Tim Starling




Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Charles Matthews
I appreciate the challenge in getting old versions posted again. But I'm 
also interested in the folks, rather more than in CamelCase and UseMod.

As I asked somewhere else recently, where are they now? I don't mean 
outing people; just what do we really know about the Old Bolsheviks, 
shot or not? (I was rather saddened, talking of Old Bolsheviks, at 
Stevertigo's recent ban and departure, not because I agreed with him, 
but he was apparently editing on 13 June 2002, i.e. a year before me, 
and despite our conflict on this list offered and played with me a 
couple of games of online go.)

Où sont les Wiks d'antan?

Charles



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Mike Dupont
On Tue, Dec 14, 2010 at 11:02 PM, Tim Starling tstarl...@wikimedia.org wrote:
 * HomePage
 * WikiPedia
 * PhilosophyAndLogic
 * UnitedStates
 * PopularMusic
 * SportS
 * MathematicsAndStatistics
 * CountriesOfTheWorld
 * AaA
 * AfghanistaN
 * UuU
 * TechnologY
 * ComputinG
 * ComputerSoftware
 * TransporT
 * NamingConventions

Nice, I have added this as a userpage
http://en.wikipedia.org/wiki/User:Mdupont/FirstPages

All of them work except for three. They have been deleted as meaningless,
with no relevant historical value:
20:12, 18 April 2006 RexNL (talk | contribs) deleted AfghanistaN
(content was: '{{db|R3:Redirects as a result of an implausible
typo}}#REDIRECT Afghanistan')
09:19, 24 May 2005 Thue (talk | contribs) deleted TechnologY
(content was: '#REDIRECT Technology')
04:48, 8 March 2007 Raul654 (talk | contribs) deleted
NamingConventions (content was: '#REDIRECT wikipedia:Naming
conventions')

They should all be restored under the category Museum of Wikipedia!

mike

-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [WikiEN-l] [Foundation-l] Old Wikipedia backups discovered

2010-12-14 Thread Andrew Gray
On 14 December 2010 22:02, Tim Starling tstarl...@wikimedia.org wrote:

 him. Articles were created in the following order:

 * HomePage
 * WikiPedia
 * PhilosophyAndLogic

It's interesting to note our early priorities!

http://grey.colorado.edu/wikipedia_2001/979602227.txt

Two months later...

http://en.wikipedia.org/w/index.php?title=PhilosophyAndLogic&oldid=272836

...we'd only done the last one of those.

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk
