https://bugzilla.wikimedia.org/show_bug.cgi?id=66661
Bug ID: 66661
Summary: `mwxml2sql' fails to process `enwikinews-20140605-pages-meta-current.xml.bz2' when it encounters `<ns>90</ns>'
Product: Datasets
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: major
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Web browser: ---
Mobile Platform: ---
0) Summary
I tried to build a mirror of `enwikinews' using `mwxml2sql'. This failed
whenever `mwxml2sql' encountered a page from namespace 90 (Thread).
I tried again using `maintenance/importDump.php'. This worked better. However,
it appears that `importDump.php' ignores namespace 90, because no such pages
are later found in the `enwikinews.page' database table.
1) Dataset
`enwikinews-20140605-pages-meta-current.xml.bz2'
2) Error messages
WHINE: (155323) no end page tag
When I divide the XML data dump into smaller files of, say, 1000 pages each, I
find many more such errors.
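The report does not say how the dump was divided, so the following is only a minimal Python sketch of one way to group pages into batches of 1000. The helper names `iter_pages` and `chunk_pages` are hypothetical, and the code assumes that `<page>` and `</page>` each sit on a line of their own, as in the excerpt quoted in this report.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_pages(lines: Iterable[str]) -> Iterator[str]:
    """Yield each <page>...</page> block as a single string.

    Assumption: the opening and closing page tags are alone on their lines.
    Anything outside a page (siteinfo, the root element) is skipped.
    """
    buf: List[str] = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            inside = True
            buf = []
        if inside:
            buf.append(line)
        if inside and stripped == "</page>":
            inside = False
            yield "".join(buf)


def chunk_pages(lines: Iterable[str], pages_per_chunk: int = 1000) -> Iterator[List[str]]:
    """Group pages into lists of at most pages_per_chunk pages."""
    pages = iter_pages(lines)
    while True:
        chunk = list(islice(pages, pages_per_chunk))
        if not chunk:
            return
        yield chunk
```

The lines could come from `bz2.open(path, "rt", encoding="utf-8")`. Each batch would presumably still need the dump's original `<mediawiki>` header and closing tag wrapped around it before being handed to an import tool.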
3) Pages that cause errors
<page>
<title>Thread:Comments:Chip and PIN 'not fit for purpose', says Cambridge researcher/Those in positions of power shirking responsibility and lying?</title>
<ns>90</ns>
<id>155323</id>
<DiscussionThreading>
<ThreadSubject>Those in positions of power shirking responsibility and lying?</ThreadSubject>
<ThreadPage>Comments:Chip and PIN 'not fit for purpose', says Cambridge researcher</ThreadPage>
<ThreadID>92</ThreadID>
<ThreadAuthor>70.31.58.181</ThreadAuthor>
<ThreadEditStatus>has-reply</ThreadEditStatus>
<ThreadType>normal</ThreadType>
<ThreadSignature>[[Special:Contributions/70.31.58.181|70.31.58.181]] ([[User talk:70.31.58.181|talk]])</ThreadSignature>
</DiscussionThreading>
<revision>
<id>958267</id>
<timestamp>2010-02-15T04:04:56Z</timestamp>
<contributor>
<ip>70.31.58.181</ip>
</contributor>
<comment>New thread: Those in positions of power shirking responsibility and lying?</comment>
<text xml:space="preserve">"All the banks are lying. They are maliciously and wilfully deceiving the customer [...] The system is not fit for purpose."
I'm so surprised that I've apparently transcended a serious remark and instead am being sarcastic. Incidentally, only part of that sentence was sarcastic.</text>
<sha1>rjidk12i4hv2mxia3a8qq620rlc7lok</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
4) Namespace of pages that cause errors
<namespace key="90" case="first-letter">Thread</namespace>
5) Use of `importDump.php'
Apparently `importDump.php' ignores namespace 90.
mysql> select page_id,page_namespace,page_title from enwikinews.page where page_id=155323;
Empty set (0.00 sec)
mysql> select page_id,page_namespace,page_title from enwikinews.page where page_namespace=90;
Empty set (0.00 sec)
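To compare the empty query results against the dump itself, one could count the pages per namespace directly in the XML. The sketch below is not part of the original report. It uses Python's standard `xml.etree.ElementTree.iterparse`, and the function name `count_pages_by_namespace` is hypothetical. It strips any XML namespace prefix from tag names, on the assumption that the dump's root element declares one.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Drop a '{namespace-uri}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_pages_by_namespace(stream) -> Counter:
    """Stream through a dump and count <page> elements by their <ns> value."""
    counts: Counter = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children of <page> are inspected.
        for child in elem:
            if local_name(child.tag) == "ns":
                counts[int(child.text)] += 1
                break
        elem.clear()  # release the finished page to keep memory use low
    return counts


# Possible usage against the compressed dump (path as given in section 1):
#   import bz2
#   with bz2.open("enwikinews-20140605-pages-meta-current.xml.bz2", "rb") as f:
#       print(count_pages_by_namespace(f)[90])
```

A non-zero count for key 90 in the dump, set against the empty result for `page_namespace=90` above, would support the observation that these pages are not being imported.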
Sincerely Yours,
Kent