Hello Everyone,

I've been trying to write a Python script that takes an XML dump and
generates all the HTML, using MediaWiki itself to handle all the
parsing/processing. I've run into a problem, though: all of the parsed
output contains warnings that templates couldn't be found. I'm not sure
what I'm doing wrong.

So I'll explain my steps:

First I execute the SQL script maintenance/tables.sql.

Then I remove some indexes from the tables to speed up insertion.

Finally I go through the XML and, for each page and revision in it, execute
the following insert statements:

'insert into page
    (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
     page_random, page_latest, page_len, page_content_model)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'

'insert into text (old_id, old_text) values (%s, %s)'

'insert into recentchanges
    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor, rc_bot,
     rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
     rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
            %s, %s)'

'insert into revision
    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
     rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
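
To make the step above concrete, here is a rough sketch of how the values
for the page insert could be pulled out of the dump. It is not my exact
script: the helper names (local_name, iter_pages, page_params) are made up
for illustration, page_is_new is simply assumed to be 0, and the title is
passed through exactly as it appears in the dump.

```python
import io
import random
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page>, with its revisions as a list of dicts."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page = {"revisions": []}
        for child in elem:
            name = local_name(child.tag)
            if name == "revision":
                # Nested elements such as <contributor> only keep their
                # (mostly empty) direct text in this simplified sketch.
                rev = {local_name(c.tag): (c.text or "") for c in child}
                page["revisions"].append(rev)
            else:
                page[name] = child.text or ""
        yield page
        elem.clear()  # release the subtree once the caller is done with it


def page_params(page):
    """Build the nine values for the 'insert into page' statement."""
    latest = page["revisions"][-1]
    text = latest.get("text", "")
    return (
        int(page["id"]),                   # page_id
        int(page["ns"]),                   # page_namespace
        page["title"],                     # page_title, as found in the dump
        1 if "redirect" in page else 0,    # page_is_redirect
        0,                                 # page_is_new (assumed)
        random.random(),                   # page_random
        int(latest["id"]),                 # page_latest
        len(text.encode("utf-8")),         # page_len, in bytes
        latest.get("model", "wikitext"),   # page_content_model
    )


SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Template:Date</title>
    <ns>10</ns>
    <id>7</id>
    <revision>
      <id>42</id>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <model>wikitext</model>
      <text>{{{1}}}</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
params = page_params(pages[0])
```

The resulting tuple would then be handed to cursor.execute() together with
the page statement; the other three inserts are built the same way from the
revision dicts.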

All IDs from the XML dump are kept. I noticed that the titles are not
web-friendly. Thinking this was the problem, I ran the
maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
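
For what it's worth, MediaWiki stores page_title without the namespace
prefix and with underscores instead of spaces, so a conversion along the
lines below might be what the titles need before insertion. This is only a
sketch: to_db_key and NS_NAMES are hypothetical names, the namespace map
would really come from the <namespaces> block in the dump's <siteinfo>, and
I have not confirmed that this is the cause of the missing templates.

```python
# Hypothetical namespace map; a real one would be read from the dump's
# <siteinfo><namespaces> section.
NS_NAMES = {0: "", 10: "Template", 14: "Category"}


def to_db_key(title, ns_id, ns_names=NS_NAMES):
    """Convert a dump title into the form stored in page.page_title."""
    prefix = ns_names.get(ns_id, "")
    # Drop the "Template:"-style prefix; the namespace lives in
    # page_namespace, not in page_title.
    if prefix and title.startswith(prefix + ":"):
        title = title[len(prefix) + 1:]
    # Spaces are stored as underscores.
    return title.strip().replace(" ", "_")


template_key = to_db_key("Template:Infobox person", 10)
main_key = to_db_key("Main Page", 0)
```

Capitalisation of the first letter and other normalisation rules are left
out here.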

Having done this, I can now run the following PHP script:
    // Placeholder: substitute a real revision ID from the dump.
    $id = 'some revision id';
    $rev = Revision::newFromId( $id );
    $titleObj = $rev->getTitle();
    $pageObj = WikiPage::factory( $titleObj );

    // Build parser options from a context tied to this title.
    $context = RequestContext::newExtraneousContext( $titleObj );
    $popts = ParserOptions::newFromContext( $context );

    // Get the parser output for the page.
    $pout = $pageObj->getParserOutput( $popts );

    var_dump( $pout );

The mText property of $pout contains the parsed output, but it is full of
stuff like this:

<a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new"
title="Template:Date (page does not exist)">Template:Date</a>


I feel like I'm missing a step here. I tried importing the templatelinks
SQL dump, but that did not fix anything either. The parsed output also does
not include any page header or footer, which would be useful to have.

Any insight or help is much appreciated, thank you.

--alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
