https://bugzilla.wikimedia.org/show_bug.cgi?id=72886

            Bug ID: 72886
           Summary: Recent XML dump files break mwxml2sql
           Product: Datasets
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: blocker
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected]
       Web browser: ---
   Mobile Platform: ---

Dear Sir or Madam,

0) Context

`mwxml2sql' is a utility that converts published XML dump files into SQL
files for the `page', `revision', and `text' tables. These SQL files can
then be imported rapidly into a database.

1) Breaking change in XML dump file schema

XML dump files using schema `export-0.8.xsd' are processed correctly by
`mwxml2sql'.
XML dump files using schema `export-0.9.xsd' cause `mwxml2sql' to fail.

2) Example of error

(shell)$ rsync \
    ftpmirror.your.org::wikimedia-dumps/simplewiki/20141025/simplewiki-20141025-pages-meta-current.xml.bz2 .
(shell)$ rsync \
    ftpmirror.your.org::wikimedia-dumps/simplewiki/20141025/simplewiki-20141025-stub-meta-current.xml.gz .
(shell)$ bzcat simplewiki-20141025-pages-meta-current.xml.bz2 | head -n 1
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">
(shell)$ /usr/bin/mwxml2sql \
    --stubs simplewiki-20141025-stub-meta-current.xml.gz \
    --text simplewiki-20141025-pages-meta-current.xml.bz2 \
    --mysqlfile simplewiki-20141025.gz \
    --mediawiki 1.24 2>&1
WHINE: (none) no end siteinfo tag

WHINE: (none) no end siteinfo tag

3) Recent dumps

Wiki/Date           Schema mwxml2sql
simplewiki/20140220 0.8    OK
simplewiki/20140723 0.8    OK
simplewiki/20140814 0.8    OK
simplewiki/20140903 0.9    fail
simplewiki/20140927 0.9    fail
simplewiki/20141025 0.9    fail
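
The Schema column above can be checked by reading the `version' attribute
of the root <mediawiki> element. The following is only a rough sketch: the
function name `dump_schema_version' is hypothetical (it is not part of
`mwxml2sql'), and it assumes that the root element sits entirely on the
first line of the dump, as in the example in section 2.

```shell
# Hypothetical helper, not part of mwxml2sql.
# Reads a decompressed dump on stdin and prints the schema version
# found in the version="..." attribute of the first line.
# Assumption: the root <mediawiki> element is on line 1.
dump_schema_version() {
    head -n 1 | sed -n 's/.*[[:space:]]version="\([0-9.]*\)".*/\1/p'
}
```

It might be used as `bzcat simplewiki-20141025-pages-meta-current.xml.bz2 |
dump_schema_version'. If the attribute is not on the first line, nothing is
printed.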

Sincerely Yours,
Kent
