Re: [Wikitech-l] How to mount a local copy of the English Wikipedia for researchers?

2012-06-13 Thread Steve Bennett
Thanks, I'm trying this. It consumes phenomenal amounts of memory,
though - I keep getting a "Killed" message from Ubuntu, even with a
20 GB swap file. Will keep trying with an even bigger one.
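
If the memory blow-up comes from holding too much of the dump in memory
at once (a "Killed" message usually means the kernel's OOM killer
stepped in, and swap will not rescue a process that genuinely needs that
much RAM), a streaming parse keeps usage roughly constant. Here is a
minimal sketch - not mwimport.py's actual approach, just an illustration -
that walks a pages-articles .xml.bz2 with Python's iterparse and discards
each page after handling it; the 0.6 namespace URI and the file name are
assumptions, so adjust them for your dump:

    import bz2
    import xml.etree.ElementTree as ET

    # Namespace of the 0.6 export schema -- check the <mediawiki> root
    # element of your dump and adjust if it declares something else.
    NS = '{http://www.mediawiki.org/xml/export-0.6/}'

    def iterate_pages(path):
        """Yield (title, text) per page without loading the whole dump."""
        with bz2.BZ2File(path) as stream:
            context = ET.iterparse(stream, events=('start', 'end'))
            _, root = next(context)          # the <mediawiki> root element
            for event, elem in context:
                if event == 'end' and elem.tag == NS + 'page':
                    title = elem.findtext(NS + 'title')
                    text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
                    yield title, text
                    root.clear()             # discard processed pages

    if __name__ == '__main__':
        dump = 'enwiki-latest-pages-articles.xml.bz2'
        for count, (title, text) in enumerate(iterate_pages(dump), 1):
            if count % 100000 == 0:
                print('%d pages, last seen: %s' % (count, title))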

I'll also give mwdumper another go.

Steve

On Wed, Jun 13, 2012 at 3:03 PM, Adam Wight s...@ludd.net wrote:
 I ran into this problem recently. A Python script is available at
 https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py
 that will convert .xml.bz2 dumps into flat fast-import files, which can
 be loaded into most databases. Sorry, this tool is still alpha quality.

 Feel free to contact me with problems.

 -Adam Wight



[Wikitech-l] How to mount a local copy of the English Wikipedia for researchers?

2012-06-12 Thread Steve Bennett
Hi all,
  I've been tasked with setting up a local copy of the English
Wikipedia for researchers - sort of like another Toolserver. I'm not
having much luck, and wondered if anyone has done this recently, and
what approach they used? We only really need the current article text
- history and meta pages aren't needed.

Things I have tried:
1) Downloading and mounting the SQL dumps

No good because they don't contain article text

2) Downloading and mounting other SQL research dumps (eg
ftp://ftp.rediris.es/mirror/WKP_research)

No good because they're years out of date

3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.xml files

No good because they're astronomically large when decompressed. I got
about halfway through decompressing them and was already over 7 TB.

Also, WikiXRay appears to be old and out of date (although,
interestingly, its author Felipe Ortega committed to the Gitorious
repository [1] on Monday for the first time in over a year)

4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)

No good because it's old and out of date: it only supports export
format version 0.3, and the current dumps are 0.6 (a quick way to check
which schema version a dump actually declares is sketched below, after
step 5)

5) Using importDump.php on a latest-pages-articles.xml dump [2]

No good because it just spews out 7.6 GB of this output:

PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler out_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
...
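
If the failures in steps 4 and 5 are version-related (the warnings above
come from /usr/share/mediawiki, which suggests a distribution package
that may predate the 0.6 export schema - that is an assumption, not a
diagnosis), it can help to confirm which version the dump actually
declares. A minimal Python sketch that reads only the start of the
compressed file:

    import bz2
    import re

    # Read only the first few KB of the compressed dump and report the
    # schema version declared on the <mediawiki> root element
    # (e.g. version="0.6").
    def dump_schema_version(path):
        with bz2.BZ2File(path) as stream:
            head = stream.read(4096)
        match = re.search(r'version="([^"]+)"',
                          head.decode('utf-8', 'replace'))
        return match.group(1) if match else None

    if __name__ == '__main__':
        print(dump_schema_version('enwiki-latest-pages-articles.xml.bz2'))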


So, any suggestions for approaches that might work? Or suggestions for
fixing the errors in step 5?

Steve


[1] http://gitorious.org/wikixray
[2] 
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2



Re: [Wikitech-l] How to mount a local copy of the English Wikipedia for researchers?

2012-06-12 Thread Lars Aronsson

On 2012-06-12 23:19, Steve Bennett wrote:

   I've been tasked with setting up a local copy of the English
Wikipedia for researchers - sort of like another Toolserver. I'm not
having much luck,


Have your researchers learn Icelandic. Importing the
small Icelandic Wikipedia is fast. They can test their
theories and see if their hypotheses make any sense.
When they've done their research on Icelandic, have
them learn Danish, then Norwegian, Swedish, Dutch,
before going to German and finally English. There's
a fine spiral of language sizes around the North Sea.

It's when they get frustrated waiting 15 minutes for an analysis of
Norwegian that they will find smarter algorithms that let them take on
the larger languages.


--
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se


Re: [Wikitech-l] How to mount a local copy of the English Wikipedia for researchers?

2012-06-12 Thread Jona Christopher Sahnwaldt
mwdumper seems to work for recent dumps:
http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
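
If mwdumper does cope with current dumps, the usual pipeline from the
MWDumper manual streams its SQL output straight into MySQL; a rough
Python-driven equivalent is sketched below. The --format=sql:1.5 option
is taken from that manual, and the jar path, database name and
credentials are placeholders:

    import subprocess

    # Placeholders: adjust the jar path, database name and MySQL
    # credentials.  --format=sql:1.5 produces INSERTs against the
    # MediaWiki 1.5+ schema, per the MWDumper manual.
    dump = 'enwiki-latest-pages-articles.xml.bz2'
    mwdumper = subprocess.Popen(
        ['java', '-jar', 'mwdumper.jar', '--format=sql:1.5', dump],
        stdout=subprocess.PIPE)
    mysql = subprocess.Popen(
        ['mysql', '-u', 'wikiuser', '-p', 'wikidb'],
        stdin=mwdumper.stdout)
    mwdumper.stdout.close()   # mwdumper sees a broken pipe if mysql dies
    mysql.communicate()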



Re: [Wikitech-l] How to mount a local copy of the English Wikipedia for researchers?

2012-06-12 Thread Adam Wight
I ran into this problem recently. A Python script is available at
https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py
that will convert .xml.bz2 dumps into flat fast-import files, which can
be loaded into most databases. Sorry, this tool is still alpha quality.
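
In case it is useful while the tool is alpha: assuming the flat files
are simple tab-separated rows of (page_id, title, text) - a guess, not
something checked against mwimport.py's actual output - loading them
into a database is straightforward. An illustrative sketch against
SQLite, with made-up file and column names:

    import csv
    import sqlite3

    # Illustration only: assumes one tab-separated row per page
    # (page_id, title, text).  Check mwimport.py's real output format
    # before relying on this.
    csv.field_size_limit(10 * 1024 * 1024)   # article text can be large

    conn = sqlite3.connect('enwiki.db')
    conn.execute('CREATE TABLE IF NOT EXISTS page '
                 '(page_id INTEGER, title TEXT, text TEXT)')
    with open('pages.tsv') as f:             # hypothetical file name
        rows = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        conn.executemany('INSERT INTO page VALUES (?, ?, ?)', rows)
    conn.commit()
    conn.close()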

Feel free to contact me with problems.

-Adam Wight


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l