What kind of dump are you working from?

On Mon, Sep 21, 2015 at 2:50 PM, v0id null <[email protected]> wrote:

> Hello Everyone,
>
> I've been trying to write a Python script that takes an XML dump and
> generates all the HTML, using MediaWiki itself to handle the
> parsing/processing, but I've run into a problem: all of the parsed output
> has warnings that templates couldn't be found. I'm not sure what I'm doing
> wrong.
>
> So I'll explain my steps:
>
> First I execute the SQL script maintenance/tables.sql
>
> Then I remove some indexes from the tables to speed up insertion.
>
> Finally I go through the XML and execute the following insert
> statements:
>
> 'insert into page
>     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
>      page_random, page_latest, page_len, page_content_model)
>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
> 'insert into text (old_id, old_text) values (%s, %s)'
>
> 'insert into recentchanges
>     (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>      rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
>      rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>             %s, %s, %s)'
>
> 'insert into revision
>     (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
>      rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
>       values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
> All IDs from the XML dump are kept. I noticed that the titles are not web
> friendly. Thinking this was the problem, I ran the
> maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
>
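On the titles: one thing that may be worth ruling out is how page_title and
page_namespace are filled in. As far as I know, the page table stores the
title without its namespace prefix and with underscores instead of spaces,
and the namespace number goes in page_namespace. So the dump title
"Template:Date" would be stored as page_namespace 10, page_title "Date". Below
is a minimal sketch of that conversion, assuming the dump carries an <ns>
element for each page. The function names and the sample XML are made up for
illustration and are not taken from your script.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace that the export schema puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def to_db_title(full_title, ns):
    """Return a page_title value: no namespace prefix, underscores for spaces."""
    # Only strip a prefix outside the main namespace, so that main-namespace
    # titles containing a colon are left intact.
    if ns != 0 and ":" in full_title:
        full_title = full_title.split(":", 1)[1]
    return full_title.strip().replace(" ", "_")


def iter_page_keys(xml_file):
    """Yield (page_id, page_namespace, page_title) for each <page> element."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        # Direct children only, so "id" is the page id, not a revision id.
        fields = {local(child.tag): child.text for child in elem}
        ns = int(fields["ns"])
        yield int(fields["id"]), ns, to_db_title(fields["title"], ns)
        elem.clear()  # free memory; real dumps are large


# Hypothetical two-page sample, only to exercise the functions above.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Template:Date</title><ns>10</ns><id>42</id></page>
  <page><title>Star Trek: Voyager</title><ns>0</ns><id>43</id></page>
</mediawiki>"""

rows = list(iter_page_keys(io.BytesIO(SAMPLE)))
```

If your inserts currently write "Template:Date" into page_title, or leave
page_namespace at 0 for every page, that alone could explain the red links,
though I have not confirmed that this is what is happening here.
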
> Doing this, I can now run the following PHP script:
>     $id = 'some revision id';
>     $rev = Revision::newFromId( $id );
>     $titleObj = $rev->getTitle();
>     $pageObj = WikiPage::factory( $titleObj );
>
>     $context = RequestContext::newExtraneousContext($titleObj);
>
>     $popts = ParserOptions::newFromContext($context);
>     $pout = $pageObj->getParserOutput($popts);
>
>     var_dump($pout);
>
> The mText property of $pout contains the parsed output, but it is full of
> stuff like this:
>
> <a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new"
> title="Template:Date (page does not exist)">Template:Date</a>
>
>
> I feel like I'm missing a step here. I tried importing the templatelinks
> SQL dump, but that did not fix anything either. The parsed output also
> does not include any header or footer, which would be useful.
>
> Any insight or help is much appreciated, thank you.
>
> --alex
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
